
Fitting and Predicting a Time Series Model

April 11, 2013

(Midterm 2)

Sueja Goldhahn (23185225)
suejagoldhahn@berkeley.edu
Prof. Guntuboyina, Intro to Time Series Analysis

Summary

As a take-home midterm assignment, the task is to fit the best possible model to a given time series dataset and to predict the outcome for the next year. The dataset was chosen from a group of 5 datasets given for the assignment. Nothing is known about the data other than that it is weekly data from Google Trends, obtained on March 20, 2013. The dataset consists of 429 data points, covering 8 years and 13 weeks from the week of January 4, 2004 to the week of March 24, 2012, as seen in Figure 1. The data for the next year, from March 25, 2012 to March 23, 2013, are to be predicted. After analyzing the data closely, the model chosen as the best fit is a multiplicative seasonal autoregressive integrated moving average model, ARIMA(1,1,1)X(0,1,3) with seasonal period 52. Using this model, the resulting prediction for the next year is given in Figures 12 and 13. This report describes the methods used to choose the best model for the time series dataset and explains why I believe this model is the best fit; it then explains the techniques used to forecast the next year.

Figure 1: Time series data, January 04, 2004 to March 24, 2012

Figure 2: Differenced data

Method Used to Fit the Model

The first things to examine in the data are trend and variability. From the plot it is apparent that there is a roughly quadratic upward trend, along with seasonality. The variability stays roughly constant throughout, indicating that a variance-stabilizing transformation of the data is not necessary. This suggests that the data should be differenced once to remove the trend, and differenced again to remove the seasonality; the second difference is taken at lag 52, the number of weeks in a year. After differencing, the trend and seasonality should be eliminated and the data should resemble white noise. The differenced data are displayed in Figure 2, and the plot shows no obvious remaining structure. To check that these orders of differencing are appropriate, the standard deviation of the differenced data can be used: a correctly differenced series should have a small standard deviation. A table of standard deviations for each combination of differencing orders is displayed in Figure 3, and it confirms that the standard deviation is indeed smallest for one lag-1 difference combined with one lag-52 difference.

Figure 3: Standard Deviation of the Differenced Data

                      Lag 52: Order 0   Lag 52: Order 1   Lag 52: Order 2
Lag 1: Order 0             6.782             3.895             4.051
Lag 1: Order 1             3.556             2.145             3.289
Lag 1: Order 2             6.387             9.936            17.475
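A minimal R sketch of how such a table can be computed, assuming the series is stored in the vector d1 as in the appendix (the loop below is illustrative, not the exact code used for the figure):

# Standard deviation of the series under every combination of
# lag-1 and lag-52 differencing orders (0, 1, 2), as in Figure 3
sd.table = matrix(NA, 3, 3,
                  dimnames = list(paste("Lag 1: order", 0:2),
                                  paste("Lag 52: order", 0:2)))
for (i in 0:2) {
  for (j in 0:2) {
    x = d1
    if (j > 0) x = diff(x, lag = 52, differences = j)  # seasonal differencing
    if (i > 0) x = diff(x, differences = i)            # ordinary differencing
    sd.table[i + 1, j + 1] = sd(x)
  }
}
round(sd.table, 3)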

Once the correct orders of differencing are determined, the next step is to examine the autocorrelation and partial autocorrelation functions of the differenced data to determine the autoregressive and moving average orders. The autocorrelation function is displayed in Figure 4, and the partial autocorrelation function in Figure 5.

Figure 4: Autocorrelation function of the differenced data

Figure 5: Partial autocorrelation function of the differenced data

Examining the autocorrelation function gives several clues about what model to fit. The first thing to note is the large negative autocorrelation at lag 52, which indicates a seasonal MA term. The autocorrelation at lag 104 is also significant, with a value of 0.109 lying outside the confidence bands, and there is some asymmetry around these seasonal lags. Hence, the two seasonal MA orders to consider for the model are SMA(2) and SMA(3).

The next thing to note is that the lag-1 autocorrelation is significant and the autocorrelations cut off after lag 1, which is very characteristic of an MA model. It is also possible that an AR term is mixed into the model, given the way the magnitude of the autocorrelations shrinks as the lag increases. The partial autocorrelation function shows the slow decay characteristic of an MA model. From these observations, a few good candidate models to try, all with seasonal period 52, are:

ARIMA(0,1,1)X(0,1,2)
ARIMA(0,1,1)X(0,1,3)
ARIMA(1,1,1)X(0,1,3)
ARIMA(1,1,2)X(0,1,3)

After fitting each of these models to the data, I chose the two best fits based on their AIC scores (refer to Results of Each Model in the appendix for the full output of each model): ARIMA(1,1,1)X(0,1,3) with AIC = 1504.52, and ARIMA(1,1,2)X(0,1,3) with AIC = 1504.34.
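For concreteness, a multiplicative seasonal model such as ARIMA(1,1,1)X(0,1,3) with period 52 can be written in backshift notation (using the sign convention of R's arima, in which the MA polynomials carry plus signs) as

$$(1 - \phi_1 B)(1 - B)(1 - B^{52})\,X_t = (1 + \theta_1 B)\bigl(1 + \Theta_1 B^{52} + \Theta_2 B^{104} + \Theta_3 B^{156}\bigr)\,W_t,$$

where $W_t$ is white noise; the fitted values of $\phi_1$, $\theta_1$, $\Theta_1$, $\Theta_2$, and $\Theta_3$ are listed under Results of Each Model in the appendix.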

To see how closely the theoretical autocorrelation function of each model matches the sample autocorrelations, I plotted the two together in Figures 6 through 9. This shows how well the models fit the data and whether the estimated phi and theta values are reasonable. For the phi and theta values used, as well as the code, refer to the appendix under Results of Each Model.
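The theoretical curves are obtained by first multiplying the nonseasonal and seasonal MA polynomials into a single MA polynomial, which is then passed to ARMAacf together with the AR coefficient (this is how the vectors ph and th in the appendix are built). For the ARIMA(1,1,1)X(0,1,3) model,

$$(1 + \theta_1 B)\bigl(1 + \Theta_1 B^{52} + \Theta_2 B^{104} + \Theta_3 B^{156}\bigr) = 1 + \theta_1 B + \Theta_1 B^{52} + \theta_1\Theta_1 B^{53} + \Theta_2 B^{104} + \theta_1\Theta_2 B^{105} + \Theta_3 B^{156} + \theta_1\Theta_3 B^{157},$$

so the nonzero MA coefficients sit at lags 1, 52, 53, 104, 105, 156, and 157.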

Figure 6: Theoretical vs. sample autocorrelation, ARIMA(1,1,1)X(0,1,3)

Figure 7: Theoretical vs. sample partial autocorrelation, ARIMA(1,1,1)X(0,1,3)

Figure 8: Theoretical vs. sample autocorrelation, ARIMA(1,1,2)X(0,1,3)

Figure 9: Theoretical vs. sample partial autocorrelation, ARIMA(1,1,2)X(0,1,3)

The blue points are the sample autocorrelations from the data, and the red points are the theoretical autocorrelations. Both models capture the structure of the autocorrelations, which tells me that the chosen models are a good fit. The second model, which has the smaller AIC, appears to match the sample autocorrelations slightly better than the first.

Choosing the Best Model for Predicting

To determine which of the two competing models to use, the models must be tested for how well they predict the data. That is done through cross-validation. I will first explain the methodology used in cross-validating the data, and then show the resulting cross-validation (CV) scores.

Cross-validation is carried out by predicting the later years within the dataset and measuring the squared errors of those predictions. It is desirable to predict as many years within the data as possible, but the number of data points needed to fit the model limits how many years can be predicted. The first year obviously cannot be predicted, as there are no earlier data points to fit from, and a minimum of about 4 years of data is needed to fit the seasonal model. I therefore used data points 1 through 221 to predict points 222 to 273, then points 1 through 273 to predict points 274 to 325, and so on. The accuracy of the predictions is measured by comparing the predicted values to the actual data and summing the squared differences. This is done for each of the 4 predicted years, and the results are averaged to obtain the CV score. The model with the smallest CV score is the best model for the dataset and will be used to predict the next year's data.

The CV scores are shown in Figures 10 and 11, along with plots of the actual data (in black) and the predicted data (in red). Both models predict the data reasonably well. Although the second model has a slightly better AIC score, its CV score is noticeably larger than that of the first model. Therefore, the best model for this dataset is the first model, ARIMA(1,1,1)X(0,1,3), and it will be used to forecast the next year's data in the following section.
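In symbols, the cross-validation score reported in Figures 10 and 11 is

$$\mathrm{CV} = \frac{1}{4}\sum_{i=1}^{4}\ \sum_{t \,\in\, \text{year } i} \bigl(x_t - \hat{x}_t\bigr)^2,$$

where $\hat{x}_t$ is the forecast of $x_t$ (from 1 up to 52 steps ahead) produced by the model fit to all data preceding year $i$.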

Figure 10: ARIMA(1,1,1)X(0,1,3), CV = 479.9672

Figure 11: ARIMA(1,1,2)X(0,1,3), CV = 495.4493

The Forecast

Now that the best model for the dataset has been chosen, the next year is forecasted using that model. The data, including the forecast, are displayed in Figure 12. The forecasted values visually continue the seasonal pattern of the data in a plausible way. The forecast with its 95% confidence interval is exhibited in Figure 13.
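The interval plotted in Figure 13 is the usual approximate 95% interval built from the forecast standard errors returned by predict (the quantities U and L in the appendix code):

$$\hat{x}_{n+h} \pm 2\,\widehat{\operatorname{se}}\bigl(\hat{x}_{n+h}\bigr), \qquad h = 1, \dots, 52.$$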

Figure 12: Data including the forecast, January 04, 2004 to March 23, 2013

Figure 13: Forecast and 95% confidence interval for the next year, March 25, 2012 to March 23, 2013

# APPENDIX
# data: d1

# Plot the data
plot(d1, type = "l", main = "Time Series Data", ylab = "", xlab = "Weeks",
     sub = "January 04, 2004 to March 24, 2012")

# Difference for seasonality and trend
d = diff(d1, lag = 52)
d = diff(d)

# Check to see that there is no trend left in the model
plot(d, type = "l", main = "Differenced Data", ylab = "", xlab = "")

# Check acf and pacf
acf(d, lag.max = 200, main = "Autocorrelation Function")
pacf(d, lag.max = 200, main = "Partial Autocorrelation Function")

# Results of Each Model:

# ARIMA(0,1,1)X(0,1,3)
arima(d1, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 3), period = 52))
# aic = 1508.32
#          ma1     sma1    sma2    sma3
#      -0.6419  -0.3395  0.2599  0.1885
# s.e.  0.0515   0.0610  0.0750  0.0770

# ARIMA(0,1,1)X(0,1,2)
arima(d1, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 2), period = 52))
# aic = 1512.92
#          ma1     sma1    sma2
#      -0.6290  -0.3430  0.2827
# s.e.  0.0513   0.0653  0.0748

# ARIMA(1,1,1)X(0,1,3)
arima(d1, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 3), period = 52))
# aic = 1504.52
#          ar1      ma1     sma1    sma2    sma3
#       0.2149  -0.7890  -0.3429  0.2711  0.1873
# s.e.  0.0815   0.0547   0.0604  0.0766  0.0772

# ARIMA(1,1,2)X(0,1,3)
arima(d1, order = c(1, 1, 2), seasonal = list(order = c(0, 1, 3), period = 52))
# aic = 1504.34
#          ar1      ma1     ma2     sma1    sma2    sma3
#       0.6451  -1.2347  0.3173  -0.3372  0.2759  0.1881
# s.e.  0.1904   0.2061  0.1557   0.0603  0.0758  0.0771

# Theoretical Autocorrelation of ARIMA(1,1,1)X(0,1,3)
# phi and theta values: th is the product of the nonseasonal and seasonal MA
# polynomials, with 50 zeros between each seasonal block of coefficients
ph = 0.2149
th = c(-0.789, rep(0, 50), -0.3429, -0.3429*-0.789, rep(0, 50),
       0.2711, 0.2711*-0.789, rep(0, 50), 0.1873, 0.1873*-0.789)
acf(d, lag.max = 175, main = "Theoretical Autocorrelation of ARIMA(1,1,1)X(0,1,3)", col = "blue")
ACF = ARMAacf(ar = ph, ma = th, lag.max = 175)
points(x = 0:175, y = ACF, col = "red", type = "h")
pacf(d, lag.max = 175, main = "Theoretical Partial Autocorrelation of ARIMA(1,1,1)X(0,1,3)", col = "blue")
PACF = ARMAacf(ar = ph, ma = th, lag.max = 175, pacf = T)
points(x = 1:175, y = PACF, col = "red", type = "h")

# Theoretical Autocorrelation of ARIMA(1,1,2)X(0,1,3)
# phi and theta values
ph = 0.6451
th = c(-1.2347, 0.3173, rep(0, 49), -0.3372, -0.3372*-1.2347, -0.3372*0.3173, rep(0, 49),
       0.2759, 0.2759*-1.2347, 0.2759*0.3173, rep(0, 49),
       0.1881, 0.1881*-1.2347, 0.1881*0.3173)
acf(d, lag.max = 175, main = "Theoretical Autocorrelation of ARIMA(1,1,2)X(0,1,3)", col = "blue")
ACF = ARMAacf(ar = ph, ma = th, lag.max = 175)
points(x = 0:175, y = ACF, col = "red", type = "h")
pacf(d, lag.max = 175, main = "Theoretical Partial Autocorrelation of ARIMA(1,1,2)X(0,1,3)", col = "blue")
PACF = ARMAacf(ar = ph, ma = th, lag.max = 175, pacf = T)
points(x = 1:175, y = PACF, col = "red", type = "h")

# Cross-validation
# Model 1: ARIMA(1,1,1)X(0,1,3)
pred = rep(0, 208)
CV = rep(0, 4)
for (i in 0:3) {
  k = 221 + i * 52
  nd1 = d1[1:k]
  d1fit = arima(nd1, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 3), period = 52))
  d1fc = predict(d1fit, n.ahead = 52)
  pred[(1 + i * 52):((i + 1) * 52)] = as.numeric(d1fc$pred)
  CV[i + 1] = sum((d1[(k + 1):(221 + (i + 1) * 52)] - as.numeric(d1fc$pred))^2)
}
mean(CV)

plot(c(222:429), d1[222:429], type = "l",
     main = "Comparison of Prediction to Actual Data", xlab = "Time", ylab = "")
points(c(222:429), pred, col = "red", type = "l")

# Model 2: ARIMA(1,1,2)X(0,1,3)
pred = rep(0, 208)
CV = rep(0, 4)
for (i in 0:3) {
  k = 221 + i * 52
  nd1 = d1[1:k]
  d1fit = arima(nd1, order = c(1, 1, 2), seasonal = list(order = c(0, 1, 3), period = 52))
  d1fc = predict(d1fit, n.ahead = 52)
  pred[(1 + i * 52):((i + 1) * 52)] = as.numeric(d1fc$pred)
  CV[i + 1] = sum((d1[(k + 1):(221 + (i + 1) * 52)] - as.numeric(d1fc$pred))^2)
}
mean(CV)

plot(c(222:429), d1[222:429], type = "l",
     main = "Comparison of Prediction to Actual Data", xlab = "Time", ylab = "")
points(c(222:429), pred, col = "red", type = "l")

# The Forecast Using Model ARIMA(1,1,1)X(0,1,3)
d1fit = arima(d1, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 3), period = 52))
d1fc = predict(d1fit, n.ahead = 52)
U = d1fc$pred + 2*d1fc$se
L = d1fc$pred - 2*d1fc$se
newx = 1:481
newy = c(d1, d1fc$pred)
plot(newx, newy, type = "l", main = "Data Including the Forecast",
     xlab = "Weeks", ylab = "", sub = "January 04, 2004 to March 23, 2013")
plot(430:481, d1fc$pred, type = "l", ylim = c(60, 110),
     main = "Forecast and 95% Confidence Interval for the Next Year",
     xlab = "Weeks", ylab = "", sub = "March 25, 2012 to March 23, 2013")
points(newx[430:481], U, col = "blue", type = "l")
points(newx[430:481], L, col = "blue", type = "l")
