

Contents

Introduction
Stepwise regression
Regression Analysis
Overall quality of the Regression Model
Adequacy of the model
CLRM Assumptions
    Constant term
    Homoscedasticity
    Autocorrelation
    Normality
    Coefficient stability
    Multicollinearity
Regression Analysis - Limited Dependent Variable (LDV) model
Time Series ARIMA Model
The Reliability of the Model
Forecasting
    Stepwise regression model forecast
    Time Series ARIMA model forecast
Conclusion

In the following discussions, this legend will be used to distinguish between our remarks, actions taken in gretl, and gretl output. LEGEND: black: project remarks and explanations; crimson: gretl actions taken; purple: gretl data output.

Introduction
This project is designed to provide students with insight into the practical aspects of regression analysis and the construction of the related models as taught throughout the ECON2900 course. Our group chose Oracle Corporation (ORCL), traded on the Nasdaq. Its weekly historical price data were obtained from Yahoo Finance.1

Oracle Corporation develops, manufactures, markets, hosts, and supports database and middleware software, applications software, and hardware systems. It licenses database and middleware software, including database and database management, application server and cloud application, service-oriented architecture and business process management, business intelligence, identity and access management, data integration, Web experience management, portals, and content management and social network software, as well as development tools and Java, a software development platform. Its applications software comprises enterprise resource planning, customer relationship management, financials, governance, risk and compliance, procurement, supply chain management, enterprise portfolio project and enterprise performance management, business intelligence analytic applications, Web commerce, and industry-specific applications software. The company also provides customers with rights to unspecified software product upgrades and maintenance releases; Internet access to technical content; and Internet and telephone access to technical support personnel. In addition, it offers computer server, storage, and networking products, and hardware-related software such as the Oracle Solaris Operating System; Oracle engineered systems; storage products, which comprise tape, disk, and networking solutions for open systems and mainframe server environments; and hardware systems support solutions, including software updates for the software components, as well as product repair, maintenance, and technical support services. Further, the company provides consulting solutions in business and IT strategy alignment, enterprise architecture planning and design, initial product implementation and integration, and ongoing product enhancement and upgrade services; Oracle managed cloud services; and education services. Oracle Corporation was founded in 1977 and is headquartered in Redwood City, California.

This project builds three models based on the stock-market performance of Oracle Corporation and uses them to forecast the stock price. Weekly prices of the ORCL stock were collected from Yahoo Finance for the period January 1st, 2010 to March 11th, 2013. The estimation sample was set to January 1st, 2010 to December 31st, 2012, leaving the last ten weekly observations (2013/01/07 to 2013/03/11) for forecast evaluation.

1 http://ca.finance.yahoo.com/q?s=orcl&ql=1

Stepwise regression
Upon downloading the data we saved it in an Excel file. The data was then loaded into gretl to begin the stepwise regression analysis. As instructed, the dependent variable is the adjusted closing price P (AdjClose), and the independent variables are its own lags with delays 1, 2, 3, 5, 7, and 10. The lag variables were created in the following manner:

Gretl: highlight P → Add → Lags of selected variables → Number of lags to create: 1 → OK

This created P_1. This was repeated for lags 2, 3, 5, 7 and 10. Using the lagged variables as the independent variables, the following model was estimated:

Gretl: Model → Ordinary Least Squares → P as dependent variable; P_1, P_2, P_3, P_5, P_7 and P_10 as independent variables → OK

The result was:
Model 1: OLS, using observations 2010/03/15-2012/12/31 (T = 147)
Dependent variable: P

            Coefficient    Std. Error    t-ratio    p-value
  const      1.30185       0.821165       1.5854    0.11514
  P_1        1.05082       0.0852309     12.3291   <0.00001 ***
  P_2       -0.0620704     0.123925      -0.5009    0.61725
  P_3       -0.0627223     0.105141      -0.5966    0.55177
  P_5        0.0131631     0.0829574      0.1587    0.87416
  P_7       -0.0199001     0.0730385     -0.2725    0.78567
  P_10       0.0379392     0.0474398      0.7997    0.42522

Mean dependent var   28.81014    S.D. dependent var   3.410737
Sum squared resid    158.6542    S.E. of regression   1.064540
R-squared            0.906588    Adjusted R-squared   0.902585
F(6, 140)            226.4563    P-value(F)           1.79e-69
Log-likelihood      -214.1916    Akaike criterion     442.3832
Schwarz criterion    463.3163    Hannan-Quinn         450.8886
rho                 -0.007490    Durbin-Watson        1.996059

Excluding the constant, p-value was highest for variable 11 (P_5)

Looking at Model 1, P_5 was the variable with the highest p-value, so it was omitted first:

Gretl: model 1 → Tests → Omit variables → P_5 → OK
Model 2: OLS, using observations 2010/03/15-2012/12/31 (T = 147)
Dependent variable: P

            Coefficient    Std. Error    t-ratio    p-value
  const      1.3064        0.817824       1.5974    0.11241
  P_1        1.05006       0.0848016     12.3826   <0.00001 ***
  P_2       -0.0625038     0.123465      -0.5062    0.61348
  P_3       -0.0550059     0.0928953     -0.5921    0.55471
  P_7       -0.0128        0.0575268     -0.2225    0.82424
  P_10       0.0373215     0.0471161      0.7921    0.42962

Mean dependent var   28.81014    S.D. dependent var   3.410737
Sum squared resid    158.6828    S.E. of regression   1.060853
R-squared            0.906571    Adjusted R-squared   0.903258
F(5, 141)            273.6343    P-value(F)           1.04e-70
Log-likelihood      -214.2048    Akaike criterion     440.4097
Schwarz criterion    458.3523    Hannan-Quinn         447.6999
rho                 -0.006657    Durbin-Watson        1.994042

Excluding the constant, p-value was highest for variable 12 (P_7)

Looking at Model 2, P_7 was the variable with the highest p-value, so it was omitted next:

Gretl: model 2 → Tests → Omit variables → P_7 → OK
Model 3: OLS, using observations 2010/03/15-2012/12/31 (T = 147)
Dependent variable: P

            Coefficient    Std. Error    t-ratio    p-value
  const      1.29904       0.814416       1.5951    0.11292
  P_1        1.04971       0.0845024     12.4222   <0.00001 ***
  P_2       -0.0610218     0.122872      -0.4966    0.62022
  P_3       -0.0613445     0.0881232     -0.6961    0.48749
  P_10       0.0299988     0.0336038      0.8927    0.37352

Mean dependent var   28.81014    S.D. dependent var   3.410737
Sum squared resid    158.7385    S.E. of regression   1.057297
R-squared            0.906538    Adjusted R-squared   0.903906
F(4, 142)            344.3353    P-value(F)           5.37e-72
Log-likelihood      -214.2306    Akaike criterion     438.4613
Schwarz criterion    453.4134    Hannan-Quinn         444.5365
rho                 -0.006010    Durbin-Watson        1.992475

Excluding the constant, p-value was highest for variable 9 (P_2)

Looking at Model 3, P_2 was the variable with the highest p-value, so it was omitted next:

Gretl: model 3 → Tests → Omit variables → P_2 → OK
Model 4: OLS, using observations 2010/03/15-2012/12/31 (T = 147)
Dependent variable: P

            Coefficient    Std. Error    t-ratio    p-value
  const      1.30687       0.812116       1.6092    0.10977
  P_1        1.01909       0.0576381     17.6808   <0.00001 ***
  P_3       -0.0921735     0.0623824     -1.4776    0.14173
  P_10       0.0301451     0.0335139      0.8995    0.36991

Mean dependent var   28.81014    S.D. dependent var   3.410737
Sum squared resid    159.0142    S.E. of regression   1.054508
R-squared            0.906376    Adjusted R-squared   0.904412
F(3, 143)            461.4627    P-value(F)           2.60e-73
Log-likelihood      -214.3582    Akaike criterion     436.7164
Schwarz criterion    448.6781    Hannan-Quinn         441.5766
rho                  0.022208    Durbin's h           0.373940

Excluding the constant, p-value was highest for variable 13 (P_10)

Looking at Model 4, P_10 was the variable with the highest p-value, so it was omitted next:

Gretl: model 4 → Tests → Omit variables → P_10 → OK

This leaves P_1 and P_3 as the two remaining variables, and Model 5 is our final model. The result of the final model is as follows:
Model 5: OLS, using observations 2010/01/25-2012/12/31 (T = 154)
Dependent variable: P

            Coefficient    Std. Error    t-ratio    p-value
  const      1.44247       0.703493       2.0504    0.04205  **
  P_1        1.02175       0.0558484     18.2950   <0.00001 ***
  P_3       -0.0701282     0.0560652     -1.2508    0.21293

Mean dependent var   28.56877    S.D. dependent var   3.515110
Sum squared resid    162.2909    S.E. of regression   1.036713
R-squared            0.914153    Adjusted R-squared   0.913016
F(2, 151)            803.9722    P-value(F)           3.13e-81
Log-likelihood      -222.5542    Akaike criterion     451.1084
Schwarz criterion    460.2193    Hannan-Quinn         454.8093
rho                  0.021377    Durbin's h           0.365708
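Note that Model 5 is estimated on T = 154 observations rather than 147: once lags 5, 7 and 10 are dropped, usable observations begin at 2010/01/25 instead of 2010/03/15. The whole stepwise sequence above can also be reproduced as a gretl (hansl) script. This is a minimal sketch, assuming the data file is named orcl.gdt (hypothetical) and the adjusted close series is called P:

    # minimal hansl sketch of the stepwise procedure; the file name is hypothetical
    open orcl.gdt
    smpl 2010/01/04 2012/12/31              # hold back the 2013 weeks for forecasting
    lags 10 ; P                             # creates P_1 ... P_10
    ols P const P_1 P_2 P_3 P_5 P_7 P_10    # Model 1
    omit P_5                                # Model 2: drop the regressor with the highest p-value
    omit P_7                                # Model 3
    omit P_2                                # Model 4
    omit P_10                               # Model 5: final model, P_1 and P_3 remain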

Regression Analysis

The Model 5 regression is presented as:

P = 1.44247 + 1.02175 (P_1) - 0.0701282 (P_3)

The coefficient of determination (R²) stands very strong at 0.914153. It shows that 91.4% of the variation in the dependent variable (P) is explained by the independent variables (the one- and three-week-old prices). This means that only 8.6% of the variation is residual, i.e. not explainable by the historical data. Comparing with the model containing all the lags 1, 2, 3, 5, 7, and 10 (Model 1), we can see that the coefficient of determination has barely changed (0.906588 there versus 0.914153 here). We can attribute this to the fact that all the lags are previous stock prices (historical data). The standard error of the regression, $1.04, is small relative to the average closing price of $28.57 (whose standard deviation is $3.52), and the standard errors of the independent variables are also quite small. The F-statistic is very large at 803.9722, which is a good indication for the model.

Gretl: model 5 → Graphs → Fitted, actual plot → Against time

[Figure: actual and fitted P against time.]
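As a quick illustration of how the fitted equation is used (the input values here are hypothetical), take P_1 = 30 (last week's price) and P_3 = 29 (the price three weeks ago): the fitted value is 1.44247 + 1.02175 × 30 - 0.0701282 × 29 ≈ 1.442 + 30.653 - 2.034 ≈ 30.06.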

Overall quality of the Regression Model


To determine that this regression is a good regression, we test whether the variables are jointly explanatory. The corresponding hypothesis is formulated as:

H0: β1 = β3 = 0
H1: at least one of β1, β3 ≠ 0

We tested the regression from gretl as:

Gretl: Tools → p-value finder → F → dfn = 2, dfd = 151, Value = 803.9722

The result is:
F(2, 151): area to the right of 803.972 = 3.13491e-081 (to the left: 1)

Based on the gretl output the null hypothesis is rejected, since 3.13491e-081 is much smaller than any significance level. We can now determine that there is at least one explanatory variable in the model, which indicates that this model is good.
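The same lookup can be reproduced in a one-line hansl sketch:

    # right-tail p-value of an F(2, 151) variate at the observed statistic
    pvalue F 2 151 803.9722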

Adequacy of the model


One implicit requirement of a Classical Linear Regression Model (CLRM) is that the functional form is linear (or adequate).2 To determine whether the model is adequate we used the Ramsey RESET test. The corresponding hypothesis is formulated as:

H0: adequate functional form
H1: inadequate functional form

The Ramsey RESET test was run from gretl by the following steps:
2 Brooks, Chris, 2008, Introductory Econometrics for Finance, 2nd edition, Cambridge University Press, NY, p. 178

Gretl: model 5 → Tests → Ramsey's RESET → squares and cubes → OK

Running the RESET test from gretl we obtain:
Auxiliary regression for RESET specification test
OLS, using observations 2010/01/25-2012/12/31 (T = 154)
Dependent variable: P

            coefficient    std. error    t-ratio    p-value
  ----------------------------------------------------------
  const      31.2317       40.8262        0.7650    0.4455
  P_1        -3.06124       5.34716      -0.5725    0.5678
  P_3         0.206003      0.368306      0.5593    0.5768
  yhat^2      0.148353      0.187320      0.7920    0.4296
  yhat^3     -0.00181411    0.00221599   -0.8186    0.4143

Test statistic: F = 0.546947,
with p-value = P(F(2,149) > 0.546947) = 0.58 > 5%

Focusing on the p-value, we can conclude that we do not have enough evidence to reject the null hypothesis at any reasonable significance level (since 0.58 is much larger than the typical significance level of 5%). This means that the functional form of the model is adequately linear for a CLRM.
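In script form the test is a single post-estimation command; a sketch, re-estimating the final model first:

    # Ramsey's RESET (squares and cubes by default) on the last estimated model
    ols P const P_1 P_3
    reset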

CLRM Assumptions
The tests performed here verify that the CLRM assumptions hold for our final model (Model 5).

Constant term:

The first CLRM assumption is E(ut) = 0. This assumption states that the mean of the residuals will be zero, provided that there is a constant term in the regression.3 If the estimated regression line were forced through the origin (0,0), the coefficient of determination (R²) could be meaningless.4 Our regression includes a constant term (estimated at 1.44247), therefore the estimated regression line is not being forced through the origin. In addition, the summary statistics of the residuals lead to the same conclusion: the mean of the residuals is -2.30696e-016, which is practically 0.
3 Brooks, 2008, p. 131
4 Brooks, 2008, p. 132

Gretl: model 5 → Save → Residuals


Summary statistics, using the observations 2010/01/04 - 2012/12/31
for the variable uhat5 (154 valid observations)

  Mean            Median       Minimum     Maximum
  -2.30696e-016   0.0515355    -3.02117    3.02048

  Std. Dev.    C.V.             Skewness     Ex. kurtosis
  1.02991      4.46439e+015    -0.0999381    0.463898
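This check can also be scripted; a sketch, assuming Model 5 is the last estimated model:

    # save the residuals of the last model and verify their mean is (near) zero
    ols P const P_1 P_3 --quiet
    series uh = $uhat
    printf "mean of residuals = %g\n", mean(uh)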

Homoscedasticity:
The second assumption is var(ut) = σ² < ∞. This is also known as the assumption of homoscedasticity, meaning that the errors have a constant variance. If the error variance is not constant, the errors are said to be heteroscedastic. To observe whether or not the model is homoscedastic, a residual plot was produced using the following steps:

Gretl: model 5 → Graphs → Residual plot → Against time

[Figure: plot of the residuals against time.]

Based upon the plot of the residuals we can conclude that they look reasonable except for a few large outliers; overall the residuals look more or less like a uniform cloud centered around the 0 line. To determine more rigorously whether the homoscedasticity assumption holds, White's test was run. White's test is a natural choice here because it makes few assumptions about the form of the heteroscedasticity.5 It regresses the squared residuals on the regressors, their squares and their cross-product:6

uhat² = α1 + α2 P_1 + α3 P_3 + α4 P_1² + α5 P_3² + α6 (P_1 × P_3) + v

The corresponding hypothesis is formulated as:

H0: homoscedastic
H1: heteroscedastic

White's test was run from gretl as:

Gretl: model 5 → Tests → Heteroskedasticity → White's test

The result was:

White's test for heteroskedasticity
OLS, using observations 2010/01/25-2012/12/31 (T = 154)
Dependent variable: uhat^2

            coefficient    std. error    t-ratio    p-value
  ----------------------------------------------------------
  const      -6.97056       9.11703      -0.7646    0.4457
  P_1         0.523348      0.917904      0.5702    0.5694
  P_3         0.0416192     0.938757      0.04433   0.9647
  sq_P_1      0.0375564     0.0383008     0.9806    0.3284
  X2_X3      -0.0996116     0.0677765    -1.470     0.1438
  sq_P_3      0.0521412     0.0379513     1.374     0.1716

Unadjusted R-squared = 0.056007

Test statistic: TR^2 = 8.625024,
with p-value = P(Chi-square(5) > 8.625024) = 0.124988 > 5%

Focusing on the p-value, we can conclude that we do not have enough evidence to reject the null hypothesis at any reasonable significance level (since 0.124988 is larger than, say, the typical significance level of α = 0.05). That means that the residuals are homoscedastic. If the residuals were heteroscedastic and this fact were ignored, the estimators would no longer have the minimum variance among the class of unbiased estimators. The consequences of heteroscedasticity would be:7
(i) standard errors would not be correctly estimated, making certain tests inaccurate;
(ii) predictions would be inefficient;
(iii) the coefficient of determination (R²) would not be valid.
A potential solution is to use an alternative method such as Generalized Least Squares, or to transform the variables into logs.

5 Ch04 Regression diagnostics.ppt, slide 9
6 Brooks, 2008, p. 134
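Scripted, White's test is again one post-estimation command; a sketch:

    # White's heteroskedasticity test on the final model
    ols P const P_1 P_3 --quiet
    modtest --white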

Autocorrelation:

The next assumption is cov(ui, uj) = 0 for i ≠ j. This assumes that the errors are not correlated with each other.8 If the residual errors are correlated, the model is said to suffer from autocorrelation. To determine whether the model upholds the assumption of uncorrelated residuals, we used both the Breusch-Godfrey (BG) test and the Durbin-Watson (DW) test. The BG test is a test for autocorrelation up to order r;9 here we test up to order 5, so the auxiliary regression regresses the residuals on the original regressors and the residuals' own lags 1 through 5. The corresponding hypothesis is:

H0: ρ1 = ... = ρ5 = 0 (no autocorrelation)
H1: at least one ρi ≠ 0 (there is some autocorrelation)

We ran the BG test from gretl as:

Gretl: model 5 → Tests → Autocorrelation → lag order for the test: 5 → OK

7 Econ2900 Ch04 Regression diagnostics.ppt, slide 16
8 Brooks, 2008, p. 139
9 Brooks, 2008, p. 148

The result was:
Breusch-Godfrey test for autocorrelation up to order 5
OLS, using observations 2010/01/25-2012/12/31 (T = 154)
Dependent variable: uhat

            coefficient    std. error    t-ratio    p-value
  ----------------------------------------------------------
  const      -1.15825       2.00087      -0.5789    0.5636
  P_1         0.334861      0.607419      0.5513    0.5823
  P_3        -0.295744      0.546386     -0.5413    0.5891
  uhat_1     -0.314629      0.619766     -0.5077    0.6125
  uhat_2     -0.370596      0.626638     -0.5914    0.5552
  uhat_3     -0.0374010     0.125756     -0.2974    0.7666
  uhat_4     -0.0997560     0.100404     -0.9935    0.3221
  uhat_5     -0.00135092    0.0891343    -0.01516   0.9879

Unadjusted R-squared = 0.008030

Test statistic: LMF = 0.236363,
with p-value = P(F(5,146) > 0.236363) = 0.946 > any significance level, say, 5%

Alternative statistic: TR^2 = 1.236562,
with p-value = P(Chi-square(5) > 1.23656) = 0.941

Ljung-Box Q' = 0.824172,
with p-value = P(Chi-square(5) > 0.824172) = 0.975

Using the p-value of the LMF test statistic, we conclude that we do not have enough evidence to reject the null hypothesis (0.946 is larger than any significance level), which indicates that the model upholds the assumption of no autocorrelation. Next, autocorrelation was tested with the Durbin-Watson method. The DW test statistic10 is given by:

DW = Σ_{t=2..T} (uhat_t - uhat_{t-1})² / Σ_{t=1..T} uhat_t²

The Durbin-Watson test is a test for autocorrelation of first order, i.e. ut = ρ u(t-1) + vt. The corresponding hypothesis is formulated as:

H0: ρ = 0 (no autocorrelation)
H1: ρ ≠ 0 (there is first-order autocorrelation)

10 Brooks, 2008, p. 144

We ran the DW test from gretl as:

Gretl: model 5 → Tests → Durbin-Watson p-value

The result was:


Durbin-Watson statistic = 1.92985
p-value = 0.287315 > 5%

Using the p-value we conclude that we do not have enough evidence to reject the null hypothesis, since the p-value of 0.287315 is larger than a significance level of 5%. This means the residuals are not autocorrelated. Another way to use the DW statistic is to compare it with the critical bounds dL and dU: reject for DW < dL, inconclusive between dL and dU, and do not reject for dU < DW < 4 - dU. To find dL and dU the following steps were taken in gretl:

Gretl: Tools → Statistical tables → DW → n = 154, k = 2 → OK

5% critical values for Durbin-Watson statistic, n = 154, k = 2
  dL = 1.7103
  dU = 1.7629

With the DW statistic of 1.92985 and dU = 1.7629, we observe that dU < DW < 4 - dU (1.7629 < 1.92985 < 2.2371). Therefore we do not reject the null hypothesis, i.e. there is no evidence of autocorrelation. All tests for autocorrelation show no evidence of autocorrelation, so it is concluded that there is no autocorrelation in our model.
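Both autocorrelation checks can be scripted; a sketch (the $dwpval accessor, assumed available in reasonably recent gretl versions, gives the DW p-value):

    # Breusch-Godfrey test up to order 5, then the DW statistic and its p-value
    ols P const P_1 P_3 --quiet
    modtest 5 --autocorr
    printf "DW = %g, p-value = %g\n", $dw, $dwpval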

Normality:

The CLRM also assumes normality of the residual errors (ut ~ N(0, σ²)). To determine whether the normality assumption is upheld, a frequency distribution of the residuals was created. The corresponding hypothesis is formulated as:

H0: normal distribution
H1: non-normal distribution

We ran the normality test as:

Gretl: model 5 → Tests → Normality of residual

The result was:
Frequency distribution for uhat5, obs 4-157
number of bins = 13, mean = -2.30696e-016, sd = 1.03671

  interval                midpt          frequency   rel.      cum.
  < -2.7694              -3.0212            1         0.65%     0.65%
  -2.7694 - -2.2660      -2.5177            3         1.95%     2.60%
  -2.2660 - -1.7625      -2.0142            3         1.95%     4.55%
  -1.7625 - -1.2590      -1.5108            8         5.19%     9.74%  *
  -1.2590 - -0.75555     -1.0073           17        11.04%    20.78%  ***
  -0.75555 - -0.25208    -0.50382          26        16.88%    37.66%  ******
  -0.25208 - 0.25139     -0.00034480       36        23.38%    61.04%  ********
  0.25139 - 0.75486       0.50313          29        18.83%    79.87%  ******
  0.75486 - 1.2583        1.0066           14         9.09%    88.96%  ***
  1.2583 - 1.7618         1.5101            9         5.84%    94.81%  **
  1.7618 - 2.2653         2.0135            6         3.90%    98.70%  *
  2.2653 - 2.7687         2.5170            1         0.65%    99.35%
  >= 2.7687               3.0205            1         0.65%   100.00%

Test for null hypothesis of normal distribution:
Chi-square(2) = 2.775 with p-value 0.24971 > any significance level

Using the p-value, we do not have enough evidence to reject the null hypothesis at any reasonable significance level (since 0.24971 is much larger than the usual significance levels). This means the distribution of the residuals is normal.
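In script form; a sketch:

    # chi-square test for normality of the residuals of the last model
    ols P const P_1 P_3 --quiet
    modtest --normality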

Coefficient stability:

In order to determine whether the parameters of the regression are stable, we ran the Chow test with a split in the middle of the observations, accepting the automatic split date calculated by gretl. The test statistic11 for the Chow test is:

F = [(RSS - (RSS1 + RSS2)) / k] / [(RSS1 + RSS2) / (T - 2k)]

where:
RSS = RSS for the whole sample
RSS1 = RSS for sub-sample 1
RSS2 = RSS for sub-sample 2
T = number of observations
2k = number of regressors in the unrestricted regression (since it comes in two parts)
k = number of regressors in (each part of) the unrestricted regression

11 Brooks, 2008, p. 180

The corresponding hypothesis is formulated as:

H0: β1 = β2 (constant coefficients)
H1: β1 ≠ β2 (non-constant coefficients)

The Chow test was run from gretl by:

Gretl: model 5 → Tests → Chow test → 2011/06/20 → OK

The result was:
Augmented regression for Chow test
OLS, using observations 2010/01/25-2012/12/31 (T = 154)
Dependent variable: P

            coefficient    std. error    t-ratio    p-value
  ----------------------------------------------------------
  const      0.902918      0.802869       1.125     0.2626
  P_1        1.08225       0.0929573     11.64      1.20e-022 ***
  P_3       -0.112510      0.0934843     -1.204     0.2307
  splitdum   3.28373       1.90024        1.728     0.0861  *
  sd_P_1    -0.128564      0.117936      -1.090     0.2774
  sd_P_3     0.0186657     0.119806       0.1558    0.8764

Mean dependent var   28.56877    S.D. dependent var   3.515110
Sum squared resid    158.5131    S.E. of regression   1.034908
R-squared            0.916151    Adjusted R-squared   0.913319
F(5, 148)            323.4171    P-value(F)           9.41e-78
Log-likelihood      -220.7407    Akaike criterion     453.4813
Schwarz criterion    471.7030    Hannan-Quinn         460.8829
rho                  0.014868    Durbin-Watson        1.938357

Chow test for structural break at observation 2011/06/20
F(3, 148) = 1.17573 with p-value 0.3211 > any significance level

Using the p-value, we conclude that we do not have enough evidence to reject the null hypothesis at a significance level of, say, 5%, since the p-value 0.3211 > 0.05. This means that the coefficients are constant, i.e. β1 = β2.
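In script form; a sketch:

    # Chow test for a structural break at the mid-sample date
    ols P const P_1 P_3 --quiet
    chow 2011/06/20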

Multicollinearity:

To check whether there is correlation between the explanatory variables, a correlation matrix was used. The corresponding hypothesis is formulated as:

H0: no correlation between the variables
H1: correlation between the variables

We ran the correlation analysis as:

Gretl: View → Correlation Matrix → lag_1, lag_3 → OK

The result was:
Corr(Lag_1, Lag_3) = 0.91879911
Under the null hypothesis of no correlation:
t(162) = 29.6267, with two-tailed p-value 0.0000

Correlation coefficients, using the observations 2010/01/04 - 2013/03/11
(missing values were skipped)
5% critical value (two-tailed) = 0.1519 for n = 167

            Lag_1     Lag_3
  Lag_1    1.0000    0.9188
  Lag_3              1.0000

Therefore, using the p-value of 0.0000, we reject the null hypothesis at any significance level. This concludes that there is correlation between the explanatory variables P_1 (= Lag_1) and P_3 (= Lag_3). We know the variables are correlated because one is the same series as the other, only shifted by two weeks, so they are not orthogonal.12 One consequence of having such similar explanatory variables is that the regression will be very sensitive to small changes in specification. Other consequences include large confidence intervals, which may lead to erroneous conclusions in significance tests.13

12 Brooks, 2008, p. 170
13 Brooks, 2008, p. 172
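The degree of collinearity can also be quantified with variance inflation factors; a sketch:

    # correlation of the two lags, plus variance inflation factors
    corr P_1 P_3
    ols P const P_1 P_3 --quiet
    vif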


Regression Analysis - Limited Dependent Variable (LDV) model


Both the logit and probit model approaches are able to overcome the limitation of the LPM that it can produce estimated probabilities that are negative or greater than one. They do this by using a function that effectively transforms the regression model so that the fitted values are bounded within the (0, 1) interval. Visually, the fitted regression model will appear as an S-shape rather than a straight line, as was the case for the LPM.

The only difference between the two approaches is that under the logit approach, the cumulative logistic function F(z) = 1 / (1 + e^(-z)) is used to transform the model, so that the probabilities are bounded between zero and one, whereas the probit model uses the cumulative normal distribution instead. For the majority of applications, the logit and probit models will give very similar characterisations of the data because the two densities are very similar. For our project we used the logit model, which uses the cumulative logistic distribution, so the fitted probabilities follow the S-shape with 0 and 1 as asymptotes to the function. The logit model is not linear (and cannot be made linear by a transformation) and thus is not estimable using OLS; instead, maximum likelihood is used to estimate the parameters of the model.

In preparing to produce our LDV model, we converted the AdjClose data into up/down values. To do so we defined a new variable (P = AdjClose) and followed the steps below.

1. Rename AdjClose for convenience: Add → Define new variable → P = AdjClose
2. Select the newly created variable P
3. Add → First differences of selected variables (this creates d_P)
4. Add → Define new variable → Mov = (d_P > 0) ? 1 : 0

After that, we select possibly significant lags:

5. Select the variable d_P, since the variable P gives us a non-stationary time series
6. Variable → Correlogram

The result was:
Autocorrelation function for d_P

  LAG      ACF           PACF          Q-stat.   [p-value]
   1     0.0780        0.0780          0.9677   [0.325]
   2     0.0145        0.0085          1.0015   [0.606]
   3    -0.0081       -0.0099          1.0121   [0.798]
   4    -0.0963       -0.0956          2.5159   [0.642]
   5    -0.0288       -0.0141          2.6512   [0.754]
   6    -0.0583       -0.0536          3.2102   [0.782]
   7    -0.0163       -0.0086          3.2540   [0.861]
   8     0.0271        0.0212          3.3767   [0.909]
   9    -0.1695 **    -0.1806 **       8.1957   [0.515]
  10    -0.1179       -0.1084         10.5414   [0.394]
  11    -0.0384       -0.0264         10.7923   [0.461]
  12     0.0347        0.0413         10.9979   [0.529]

The significant lags were identified as lags 1, 4, 6, 9 and 10. Finally, we build the model:

7. Model → Nonlinear models → Logit → Binary
8. Use the variable Mov as the dependent variable and click "lags..." to add the significant lags
9. Check "Lags of dependent variable", select "or specific lags" and type 1,4,6,9,10 → OK
Model 3: Logit, using observations 2010/03/22-2012/12/31 (T = 146)
Dependent variable: Mov
Standard errors based on Hessian

            Coefficient    Std. Error       z         Slope*
  const      1.35744       0.532104       2.5511
  Mov_1     -0.296546      0.351026      -0.8448    -0.0735609
  Mov_4     -0.395082      0.348306      -1.1343    -0.097821
  Mov_6     -0.305114      0.34929       -0.8735    -0.0756642
  Mov_9     -0.662876      0.349021      -1.8992    -0.163041
  Mov_10    -0.585166      0.350272      -1.6706    -0.144248

Mean dependent var   0.534247    S.D. dependent var   0.500543
McFadden R-squared   0.037092    Adjusted R-squared  -0.022398
Log-likelihood      -97.11576    Akaike criterion     206.2315
Schwarz criterion    224.1332    Hannan-Quinn         213.5054

*Evaluated at the mean

Number of cases 'correctly predicted' = 87 (59.6%)
f(beta'x) at mean of independent vars = 0.501
Likelihood ratio test: Chi-square(5) = 7.48198 [0.1872]

            Predicted
              0     1
  Actual 0   32    36
         1   23    55

The coefficient of determination (McFadden R²) for this model is low at 0.037092, which would usually indicate a bad model, but for this type of model (logit) R² can be very misleading. In order to judge the performance of our model, we use the confusion matrix, as it compares correctly versus incorrectly classified observations. To do this, we look at the last portion of our gretl result (the Actual/Predicted table). For our model, 1 means the price of the stock increases and 0 means it decreases. Hence:
In 32 cases the model predicted that the price would decrease and it actually decreased (Actual 0, Predicted 0).
In 23 cases the model predicted that the price would decrease, but the price actually increased (Actual 1, Predicted 0).
In 36 cases the model predicted that the price would increase, but the price actually decreased (Actual 0, Predicted 1).
In 55 cases the model predicted that the price would increase and it actually increased (Actual 1, Predicted 1).
So the model predicted correctly (32 + 55)/146 = 59.6% of the time, as stated in the results above: Number of cases 'correctly predicted' = 87 (59.6%).
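The whole LDV construction can be scripted as well; a sketch, assuming the adjusted close series is called P:

    # build the up/down dummy and fit the logit with the selected lags
    diff P                           # creates d_P
    series Mov = (d_P > 0) ? 1 : 0   # 1 = price rose that week, 0 otherwise
    lags 10 ; Mov                    # creates Mov_1 ... Mov_10
    logit Mov const Mov_1 Mov_4 Mov_6 Mov_9 Mov_10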

Time Series ARIMA Model


An analysis of the stock trend over a period of three years (January 4, 2010 to December 31, 2012) reveals the graph shown below. [Figure: weekly adjusted closing price of ORCL.] In order to run a time series model, the series has to be stationary. In other words, a series is stationary if the distribution of its values remains the same as time progresses, implying that the probability that y falls within a particular interval is the same now as at any time in the past or the future.14 Using the data available, we attempt to fit an Autoregressive Integrated Moving Average (ARIMA) time series model. In the analysis of the data, a correlogram, which is a plot of the correlation statistics, will be used to show the autocorrelation at each time lag and to help determine whether the data support a time series analysis.

Gretl: Variable → Correlogram

14 Brooks, 2008, p. 208

Autocorrelation function for P

  LAG      ACF             PACF           Q-stat.    [p-value]
   1     0.9471 ***      0.9471 ***      152.5071   [0.000]
   2     0.8914 ***     -0.0547          288.4120   [0.000]
   3     0.8355 ***     -0.0303          408.5362   [0.000]
   4     0.7771 ***     -0.0545          513.1006   [0.000]
   5     0.7262 ***      0.0412          604.9763   [0.000]
   6     0.6774 ***     -0.0126          685.4180   [0.000]
   7     0.6289 ***     -0.0275          755.1762   [0.000]
   8     0.5851 ***      0.0131          815.9426   [0.000]
   9     0.5406 ***     -0.0335          868.1518   [0.000]
  10     0.5076 ***      0.0853          914.4651   [0.000]
  11     0.4842 ***      0.0638          956.8838   [0.000]
  12     0.4694 ***      0.0671          997.0047   [0.000]

According to the correlogram, the ACF decays only slowly and all its lags are significant at any significance level, while the Partial Autocorrelation Function (PACF) fluctuates within the significance boundaries, telling us that except for the first lag all other lags are insignificant (at the 10% significance level). The slow decay of the ACF (in a stationary time series it should decay much faster, within the first couple of lags) is a sign that the time series in levels is nonstationary. In addition to the graphical analysis we can use the Augmented Dickey-Fuller (ADF) and KPSS unit-root tests to determine whether P and its first difference d_P are stationary. The ADF test, with its unit-root null hypothesis a = 1, was run as follows:

Gretl: Variable → Unit root test → Augmented Dickey-Fuller test


Augmented Dickey-Fuller test for P
including 9 lags of (1-L)P (max was 12)
sample size 147
unit-root null hypothesis: a = 1

test without constant
model: (1-L)y = (a-1)*y(-1) + ... + e
1st-order autocorrelation coeff. for e: -0.021
lagged differences: F(9, 137) = 0.864 [0.5594]
estimated value of (a - 1): 0.00241071
test statistic: tau_nc(1) = 0.774681
asymptotic p-value 0.8807 > any significance level

test with constant
model: (1-L)y = b0 + (a-1)*y(-1) + ... + e
1st-order autocorrelation coeff. for e: -0.017
lagged differences: F(9, 136) = 0.786 [0.6301]
estimated value of (a - 1): -0.0422346
test statistic: tau_c(1) = -1.47694
asymptotic p-value 0.5456 > any significance level

with constant and trend
model: (1-L)y = b0 + b1*t + (a-1)*y(-1) + ... + e
1st-order autocorrelation coeff. for e: -0.014
lagged differences: F(9, 135) = 0.773 [0.6413]
estimated value of (a - 1): -0.0669891
test statistic: tau_ct(1) = -1.89637
asymptotic p-value 0.6563 > any significance level

Using the p-values, all three versions of the ADF test indicate that we do not have enough evidence to reject the unit-root null hypothesis at any reasonable level of significance. Thus we can conclude that the time series in levels is not stationary. The KPSS test is another stationarity test we can use; its null hypothesis is that the series is stationary. It was run as:

Gretl: Variable → Unit root test → KPSS test
KPSS test for P
T = 157
Lag truncation parameter = 12
Test statistic = 0.592168

                   10%     5%      1%
Critical values:  0.349   0.464   0.737
Interpolated p-value 0.031

Since the test statistic 0.592168 is larger than the corresponding critical values at the 10% and 5% levels of significance, we conclude that we should reject the null hypothesis that the series is stationary; the series in levels is not stationary. Therefore, to stabilize it, we take first differences of P (AdjClose), called d_P, for our time series. The first differences give us a more stable and stationary series, as defined by Brooks, and as shown by the graph below.

Gretl: click on P → Add → First differences of selected variables

The command above gives us the first differences of P, which will be used for our model. In gretl the first differences of P are shown by the graph below: double click on d_P → on the small pop-up window click the graph tab. [Figure: time-series plot of d_P.]

The differenced series is stationary and will be used for our time series analysis. We also run a correlogram of d_P (Gretl: click once on d_P → Variable → Correlogram), which will be used for the ARIMA model identification below. The results of the ADF test for d_P are as follows:
Augmented Dickey-Fuller test for d_P
including 8 lags of (1-L)d_P (max was 12)
sample size 147
unit-root null hypothesis: a = 1

test without constant
model: (1-L)y = (a-1)*y(-1) + ... + e
1st-order autocorrelation coeff. for e: -0.020
lagged differences: F(8, 138) = 0.819 [0.5870]
estimated value of (a - 1): -1.2286
test statistic: tau_nc(1) = -4.84237
asymptotic p-value 1.579e-006 < any significance level

test with constant
model: (1-L)y = b0 + (a-1)*y(-1) + ... + e
1st-order autocorrelation coeff. for e: -0.021
lagged differences: F(8, 137) = 0.861 [0.5510]
estimated value of (a - 1): -1.26649
test statistic: tau_c(1) = -4.92789
asymptotic p-value 2.861e-005 < any significance level

with constant and trend
model: (1-L)y = b0 + b1*t + (a-1)*y(-1) + ... + e
1st-order autocorrelation coeff. for e: -0.021
lagged differences: F(8, 136) = 0.851 [0.5594]
estimated value of (a - 1): -1.26507
test statistic: tau_ct(1) = -4.89539
asymptotic p-value 0.0001 < any significance level

Using the p-values, all three versions of the ADF test indicate that we should reject the unit-root null hypothesis. Thus we can conclude that the first-differenced time series is stationary. The result of the KPSS test for d_P is as follows:
KPSS test for d_P
T = 156
Lag truncation parameter = 12
Test statistic = 0.0773472

                   10%     5%      1%
Critical values:  0.349   0.464   0.737

Since the test statistic 0.0773472 is much smaller than the corresponding critical values at the 10% and 5% levels of significance, we conclude that we should not reject the null hypothesis that the time series is stationary.
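Both unit-root checks can be scripted; a sketch (--test-down chooses the lag length downward from the stated maximum, matching the "max was 12" output above):

    # ADF (all three deterministic specifications) and KPSS, for P and d_P
    adf 12 P --nc --c --ct --test-down
    kpss 12 P
    diff P
    adf 12 d_P --nc --c --ct --test-down
    kpss 12 d_P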
Autocorrelation function for d_P

  LAG      ACF           PACF          Q-stat.   [p-value]
   1     0.0780        0.0780          0.9677   [0.325]
   2     0.0145        0.0085          1.0015   [0.606]
   3    -0.0081       -0.0099          1.0121   [0.798]
   4    -0.0963       -0.0956          2.5159   [0.642]
   5    -0.0288       -0.0141          2.6512   [0.754]
   6    -0.0583       -0.0536          3.2102   [0.782]
   7    -0.0163       -0.0086          3.2540   [0.861]
   8     0.0271        0.0212          3.3767   [0.909]
   9    -0.1695 **    -0.1806 **       8.1957   [0.515]
  10    -0.1179       -0.1084         10.5414   [0.394]
  11    -0.0384       -0.0264         10.7923   [0.461]
  12     0.0347        0.0413         10.9979   [0.529]

According to Brooks, the correlogram shown above displays PACF and ACF with a damped sine wave structure; in other words, the lagged autocorrelations are both positive and negative. Using the correlogram to choose the lags and the order of differencing, we are now ready to run an ARIMA model. From examining the graph and the numbers given by gretl, we concluded that the significant lag is lag 9 for both the ACF (which determines the order of the MA part) and the PACF (which determines the order of the AR part). We therefore build an ARIMA(1,1,1)-type model with first differencing (I = 1), one autoregressive term placed at the specific lag 9 (from the PACF) and one moving average term placed at the specific lag 9 (from the ACF).

Gretl: Model → Time series → ARIMA → P as the dependent variable. Using the correlogram as a guide, check the AR "specific lags" box and type 9; likewise, use the specific lag 9 for the MA part. The order of differencing is set to 1. The result was:
Model 14: ARIMA, using observations 2010/01/11-2012/12/31 (T = 156)
Dependent variable: (1-L) P
Standard errors based on Hessian

            Coefficient    Std. Error      z         p-value
  const      0.0653198     0.065973       0.9901     0.32213
  phi_9      0.0994883     0.359441       0.2768     0.78194
  theta_9   -0.28576       0.343907      -0.8309     0.40602

Mean dependent var   0.068141    S.D. dependent var   1.043372
Mean of innovations -0.001125    S.D. of innovations  1.021499
Log-likelihood     -224.8412     Akaike criterion     457.6825
Schwarz criterion   469.8819     Hannan-Quinn         462.6373

             Real     Imaginary    Modulus    Frequency
  AR
   Root 1   -1.2144     0.4420      1.2923     0.4444
   Root 2   -1.2144    -0.4420      1.2923    -0.4444
   Root 3    1.2923     0.0000      1.2923     0.0000
   Root 4    0.2244    -1.2727      1.2923    -0.2222
   Root 5    0.2244     1.2727      1.2923     0.2222
   Root 6    0.9899     0.8307      1.2923     0.1111
   Root 7    0.9899    -0.8307      1.2923    -0.1111
   Root 8   -0.6461    -1.1192      1.2923    -0.3333
   Root 9   -0.6461     1.1192      1.2923     0.3333
  MA
   Root 1   -1.0800     0.3931      1.1493     0.4444
   Root 2   -1.0800    -0.3931      1.1493    -0.4444
   Root 3    1.1493     0.0000      1.1493     0.0000
   Root 4    0.1996    -1.1319      1.1493    -0.2222
   Root 5    0.1996     1.1319      1.1493     0.2222
   Root 6    0.8804     0.7388      1.1493     0.1111
   Root 7    0.8804    -0.7388      1.1493    -0.1111
   Root 8   -0.5747    -0.9953      1.1493    -0.3333
   Root 9   -0.5747     0.9953      1.1493     0.3333
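The same model in script form; a sketch (the braces place the single AR and MA terms at the specific lag 9):

    # ARIMA with first differencing and AR/MA terms at lag 9 only
    arima {9} 1 {9} ; P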

The Reliability of the Model


Data analysis often requires selecting among several possible models that could fit the data. As Dr. Ivan mentioned, ACF/PACF are, at best, only indicators of significant lags. Since he was skeptical that, for weekly data, the price of nine weeks ago would be the parameter most strongly affecting today's price, we tried other candidates, such as lag 1 (ACF 0.0780, PACF 0.0780), to construct alternative models; in all cases the Akaike criterion of 457.6825 for our model remained the lowest among the alternatives. In choosing the best fit between different models, supposing the errors are distributed normally, we can calculate the Akaike information criterion

AIC = -2 ln L + 2k

(where L is the maximized likelihood and k is the number of estimated parameters) for all the fits and choose the smallest. For our model this gives -2 × (-224.8412) + 2 × 4 ≈ 457.682, matching the gretl output, with k = 4 counting the constant, phi_9, theta_9 and the innovation variance. The smaller the Akaike criterion, the better the fit. Since our model has the lowest Akaike criterion compared to the alternatives we could have chosen, we can rely on it as our best time-series model.
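The AIC comparison across candidate specifications is easy to script via the $aic accessor; a sketch, with lag 1 as the alternative candidate:

    # compare AIC across two candidate ARIMA specifications
    arima {9} 1 {9} ; P
    scalar aic9 = $aic
    arima 1 1 1 ; P        # alternative: AR and MA terms at lag 1
    scalar aic1 = $aic
    printf "AIC, lag-9 model: %g; AIC, lag-1 model: %g\n", aic9, aic1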

Forecasting
Stepwise regression model forecast
Using the last model from the regression analysis:
Model 16: OLS, using observations 2010/01/25-2012/12/31 (T = 154)
Dependent variable: P

            Coefficient    Std. Error    t-ratio    p-value
  const      1.44247       0.703493       2.0504    0.04205  **
  P_1        1.02175       0.0558484     18.2950   <0.00001 ***
  P_3       -0.0701282     0.0560652     -1.2508    0.21293

Mean dependent var   28.56877    S.D. dependent var   3.515110
Sum squared resid    162.2909    S.E. of regression   1.036713
R-squared            0.914153    Adjusted R-squared   0.913016
F(2, 151)            803.9722    P-value(F)           3.13e-81
Log-likelihood      -222.5542    Akaike criterion     451.1084
Schwarz criterion    460.2193    Hannan-Quinn         454.8093
rho                  0.021377    Durbin's h           0.365708

We can now forecast the 10-week test prices, with a forecast range of 2013/01/07 to 2013/03/11, by:

Gretl: model 16 → Analysis → Forecasts → forecast range Start: 2013/01/07, End: 2013/03/11 → OK

With the regression model we get the following result:
For 95% confidence intervals, t(151, 0.025) = 1.976

  Obs          P         prediction   std. error   95% interval
  2013/01/07   34.8600    34.4376      1.03671     (32.3893, 36.4860)
  2013/01/14   35.1100    34.3134      1.48216     (31.3849, 37.2418)
  2013/01/21   35.3800    34.0749      1.83526     (30.4488, 37.7010)
  2013/01/28   36.2100    33.8434      2.10607     (29.6822, 38.0046)
  2013/02/04   34.9000    33.6155      2.32347     (29.0248, 38.2062)
  2013/02/11   34.8100    33.3994      2.50147     (28.4570, 38.3418)
  2013/02/18   34.7500    33.1949      2.64991     (27.9592, 38.4306)
  2013/02/25   34.6300    33.0019      2.77527     (27.5185, 38.4852)
  2013/03/04   35.7100    32.8198      2.88214     (27.1253, 38.5143)
  2013/03/11   36.3000    32.6481      2.97394     (26.7722, 38.5240)

Forecast evaluation statistics

  Mean Error                        1.7311
  Mean Squared Error                3.8515
  Root Mean Squared Error           1.9625
  Mean Absolute Error               1.7311
  Mean Percentage Error             4.8768
  Mean Absolute Percentage Error    4.8768
  Theil's U                         3.0889
  Bias proportion, UM               0.77806
  Regression proportion, UR         0.13937
  Disturbance proportion, UD        0.082575

The mean squared error (MSE) measures how close the forecasts are to the data points, by averaging the squared differences between the estimator and the true value of the quantity being estimated. The smaller the MSE, the closer the forecast is to the data. Taking the best two lags, the regression model has an MSE of 3.8515, which means the forecast line is fairly close to the data points. A drawback of the MSE is that it gives more weight to larger errors, so another useful measure of forecast accuracy is the mean absolute error. The mean absolute error (MAE) also measures how close forecasts are to the eventual outcomes; if this value is 0 (zero), the forecast is perfect. The MSE is a good performance metric for many applications where there is good reason to suppose that the noise process is Gaussian,15 while the MAE is preferable if we don't want the performance metric to be overly sensitive to outliers. The regression model has an MAE of 1.7311, almost half of the MSE value. Looking at both MAE and MSE together gives additional information about the distribution of the errors: if MSE is close to MAE, the model makes many relatively small errors; if MSE is close to the square of MAE, the model makes few but large errors. In our case MSE is 3.8515, MAE is 1.7311 and MAE squared is 2.9967, which suggests we have a couple of large errors that affect the overall result.
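In script form, the forecasts and the evaluation statistics above come from a single fcast call issued after estimation; a sketch (the same call after the ARIMA estimate produces the time-series forecast of the next section):

    # estimate on the restricted sample, then forecast the ten held-back weeks
    smpl 2010/01/04 2012/12/31
    ols P const P_1 P_3
    smpl full
    fcast 2013/01/07 2013/03/11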


15 http://stats.stackexchange.com/questions/22344/which-performance-measure-to-use-when-usingsvmmse-or-mae


Time Series ARIMA model forecast

Using the preferred model from the time series analysis:


Model 14: ARIMA, using observations 2010/01/11-2012/12/31 (T = 156)
Dependent variable: (1-L) P
Standard errors based on Hessian

            Coefficient    Std. Error      z         p-value
  const      0.0653198     0.065973       0.9901     0.32213
  phi_9      0.0994883     0.359441       0.2768     0.78194
  theta_9   -0.28576       0.343907      -0.8309     0.40602

Mean dependent var   0.068141    S.D. dependent var   1.043372
Mean of innovations -0.001125    S.D. of innovations  1.021499
Log-likelihood     -224.8412     Akaike criterion     457.6825
Schwarz criterion   469.8819     Hannan-Quinn         462.6373

             Real     Imaginary    Modulus    Frequency
  AR
   Root 1   -1.2144     0.4420      1.2923     0.4444
   Root 2   -1.2144    -0.4420      1.2923    -0.4444
   Root 3    1.2923     0.0000      1.2923     0.0000
   Root 4    0.2244    -1.2727      1.2923    -0.2222
   Root 5    0.2244     1.2727      1.2923     0.2222
   Root 6    0.9899     0.8307      1.2923     0.1111
   Root 7    0.9899    -0.8307      1.2923    -0.1111
   Root 8   -0.6461    -1.1192      1.2923    -0.3333
   Root 9   -0.6461     1.1192      1.2923     0.3333
  MA
   Root 1   -1.0800     0.3931      1.1493     0.4444
   Root 2   -1.0800    -0.3931      1.1493    -0.4444
   Root 3    1.1493     0.0000      1.1493     0.0000
   Root 4    0.1996    -1.1319      1.1493    -0.2222
   Root 5    0.1996     1.1319      1.1493     0.2222
   Root 6    0.8804     0.7388      1.1493     0.1111
   Root 7    0.8804    -0.7388      1.1493    -0.1111
   Root 8   -0.5747    -0.9953      1.1493    -0.3333
   Root 9   -0.5747     0.9953      1.1493     0.3333

We can now forecast the 10-week test prices, with a forecast range of 2013/01/07 to 2013/03/11, by:

Gretl: model 21 → Analysis → Forecasts → forecast range Start: 2013/01/07, End: 2013/03/11 → OK

With the time series model we get the following result:
For 95% confidence intervals, z(0.025) = 1.96

  Obs          P         prediction   std. error   95% interval
  2013/01/07   34.8600    34.8125      1.02150     (32.8104, 36.8146)
  2013/01/14   35.1100    34.9405      1.44462     (32.1091, 37.7719)
  2013/01/21   35.3800    34.8747      1.76929     (31.4070, 38.3425)
  2013/01/28   36.2100    34.7669      2.04300     (30.7627, 38.7711)
  2013/02/04   34.9000    34.9024      2.28414     (30.4255, 39.3792)
  2013/02/11   34.8100    34.9414      2.50215     (30.0373, 39.8456)
  2013/02/18   34.7500    34.7062      2.70263     (29.4092, 40.0033)
  2013/02/25   34.6300    34.9015      2.88924     (29.2387, 40.5643)
  2013/03/04   35.7100    34.6729      3.06450     (28.6666, 40.6792)
  2013/03/11   36.3000    34.7519      3.17523     (28.5285, 40.9752)

Forecast evaluation statistics

  Mean Error                        0.43891
  Mean Squared Error                0.5934
  Root Mean Squared Error           0.77032
  Mean Absolute Error               0.51996
  Mean Percentage Error             1.2159
  Mean Absolute Percentage Error    1.4496
  Theil's U                         1.2111
  Bias proportion, UM               0.32464
  Regression proportion, UR         0.25206
  Disturbance proportion, UD        0.4233

Conclusion

The best forecasting model, indicated by the smallest mean squared error and mean absolute error, is the time-series model. The mean squared error is 0.5934 for our time-series model, much lower than the 3.8515 of the regression model. In addition, its mean absolute error of 0.51996 is lower than the 1.7311 of the regression analysis model. Thus, we conclude that of the two models forecasting the adjusted closing price of Oracle stock, regression analysis and time-series analysis, the ARIMA model has the best forecasting accuracy.

          ARIMA (1,1,1)    Regression Model
  MSE       0.5934           3.8515
  MAE       0.51996          1.7311
