
Time Series Student Project

December 26, 2011

NEAS Seminar - Katy Siu

Introduction

This project performs a time series analysis of the daily temperature of Allentown, PA. The project attempts to fit an ARIMA model to the data using a correlogram, regression analysis, tests of the residuals, and a visual inspection of predicted values against the actual data. I chose to study the temperatures of Allentown because ski season is around the corner. Though I am not an avid skier myself, my family and friends are, and they always like to predict the weather in the winter time. A few of their favorite ski slopes are very close to Allentown. I chose to analyze the daily high temperature because it is the high temperature that we feel most: high temperatures occur in the hours when we are most active during the day.

Source Data

I use the data set that NEAS provides on its website, specifically the station 360106 data. The data spans 5/1/1948 to 12/31/2005, but not all data points are valid. First, I removed all the invalid data points, those records with -999 degrees. I then noted a sizable gap in the data: a month's worth of records is missing for June 1992. Two other days are also missing (2/9/1954 and 2/10/1954). A two-day gap is immaterial to the analysis, so I used linear interpolation to fill in the temperatures for those two days. Because of the June 1992 gap, I cut the data off at 5/31/1992 and use only the 6/1/1948 to 5/31/1992 data points. This selected period covers 44 years with 16071 data points, which is sufficient for a robust analysis. The selected dataset, after the removal of invalid points, the linear interpolation, and the cut-off at 5/31/1992, can be found on the analysis data tab of the Excel file.
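As a rough sketch of these cleaning steps in pandas (the actual work was done in Excel; the file name and column names here are illustrative assumptions, not taken from the NEAS file):

```python
import pandas as pd

# Load the station data; the file and column names are assumed for illustration.
df = pd.read_csv("station_360106.csv", parse_dates=["date"])

# Drop invalid records, which the source file flags with -999 degrees.
df = df[df["tmax"] != -999]

# Restrict to the selected analysis window, which also excludes the June 1992 gap.
df = df[(df["date"] >= "1948-06-01") & (df["date"] <= "1992-05-31")]

# Reindex to a full daily calendar, then linearly interpolate the short
# two-day gap (2/9/1954 and 2/10/1954).
df = df.set_index("date").asfreq("D")
df["tmax"] = df["tmax"].interpolate(method="linear", limit=2)
```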

Since 44 years is too much to include in one graph, I separated the data into two graphs. Figure 1a and Figure 1b below plot all the data points for the periods 6/1/1948 to 5/31/1970 and 6/1/1970 to 5/31/1992 respectively.

Please note that, to save storage space and computer processing power, I copied and pasted values down the columns of certain formulas (particularly VLOOKUP, SUMPRODUCT, SUMIF, and COUNTIF) in the Excel file. The first row still contains the formula so the grader can follow my calculation; the rows below contain only pasted values.

Figure 1a

Figure 1b

The graphs clearly show seasonality, as common knowledge would lead us to expect: temperatures are high in the summer months and low in the winter months. We also see that the data is quite volatile. Hence, smoothing and de-seasonalizing of the data are required.

Smoothing and De-seasonalizing the Data

To smooth the data, I start with a 3 day centered moving average: the temperature for each calendar day is recalculated as the average, over the 44 years, of the day before, the day itself, and the day after. The smoothed data can be found in column P of the Smoothed and Adjd Data tab of the Excel file. Figure 2a shows the temperature smoothed by the 3 day centered moving average for the period 6/1/1948 to 5/31/1949.

Figure 2a

The graph still shows more volatility than I would like. I then smoothed the data with a 5 day centered moving average (see column Q of the Seasonally Adjd Data tab); Figure 2b graphs the smoothed data for the same period.

Figure 2b

The 5 day centered moving average gives a smoother curve than the 3 day one. I next tried a 7 day centered moving average (column R in the same tab); Figure 2c graphs the smoothed data for the same period.

Figure 2c

As shown in Figure 2c, the data now looks sufficiently smooth, without the undesirable flattening that would warn of over-smoothing. Intuitively, a 7 day centered moving average also seems a sufficient window for this data set. As an example, the smoothed temperature for July 4th is calculated as the average of the 308 data points (7 days x 44 years) of the high temperatures on July 1st through July 7th over the 44 years.

Next, I de-seasonalize the data in two steps. First, I calculate a seasonal index by dividing the smoothed temperature for each calendar day by the average temperature of the complete dataset (see column S for the seasonal index). I then calculate the seasonally adjusted temperature by dividing each raw data point by the seasonal index for its day (see column T for the seasonally adjusted temperatures). As an example, the seasonal index for July 4th is 83.718 / 61.172 = 1.369, where 83.718 is the smoothed temperature for July 4th and 61.172 is the overall average of the dataset. The seasonally adjusted temperature for July 4th, 1948 is then 90 / 1.369 = 65.736, where 90 is the raw high temperature on July 4th, 1948.
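A minimal pandas sketch of the smoothing and de-seasonalizing steps, continuing from the cleaning sketch above. It approximates the workbook's day-by-day calculation, up to leap days and edge effects at the start and end of the series:

```python
# tmax: the cleaned daily highs indexed by date (from the sketch above).
tmax = df["tmax"]

# 7 day centered moving average, then averaged across the 44 years by
# calendar day, so each day of the year gets a single smoothed value
# (e.g. the 308 data points behind July 4th).
ma7 = tmax.rolling(window=7, center=True).mean()
day_of_year = tmax.index.dayofyear
smoothed = ma7.groupby(day_of_year).mean()

# Seasonal index: the smoothed value for each calendar day divided by
# the overall average of the dataset (61.172 in the workbook).
seasonal_index = smoothed / tmax.mean()

# Seasonally adjusted series: each raw value divided by its day's index,
# e.g. 90 / 1.369 = 65.736 for July 4th, 1948.
adjusted = tmax / seasonal_index.reindex(day_of_year).to_numpy()
```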

The following three graphs show the effects of the smoothing and de-seasonalizing of the data for the year 1991.

Figure 3a

Figure 3b

Figure 3c

The raw data in Figure 3a displays significant volatility. Figure 3b shows data that is smoothed enough to work with, but a seasonal pattern remains: temperatures are higher in the middle of the year than at either end. Figure 3c shows that de-seasonalizing removes this pattern and flattens the curve. Figure 4a and Figure 4b plot all the seasonally adjusted data points for the periods 6/1/1948 to 5/31/1970 and 6/1/1970 to 5/31/1992 respectively.

Figure 4a

Figure 4b

The above two graphs show the seasonally adjusted temperatures (column T in the Excel file). Some spikes are noted in the graphs; these correspond to days with an uncharacteristically high or low temperature compared to the smoothed average for that calendar day. Note also that the sine-wave-like pattern seen in Figure 1a and Figure 1b has been removed, and that there is no observable trend in these two graphs, which suggests that the data is stationary. We will next look at the correlogram of the data to further confirm that the data is stationary.

Sample Autocorrelation and Correlogram

Sample autocorrelations are then calculated (columns V through Y in the Excel file). The sample autocorrelations for the first 200 lags are plotted in Figure 5 below:

Figure 5

We see that the correlogram declines fairly rapidly but does not drop to zero immediately, and then hovers around zero as the lag increases. This indicates that (1) the data is stationary, and (2) the data follows an autoregressive process rather than a moving average process (whose correlogram would drop off suddenly) or an ARMA process (whose correlogram would decline less quickly, in a geometric fashion).
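A minimal numpy sketch of the sample autocorrelation calculation, applied to the seasonally adjusted series from the earlier sketch:

```python
import numpy as np

def sample_acf(y, max_lag):
    # r_k = sum_t (y_t - ybar)(y_{t+k} - ybar) / sum_t (y_t - ybar)^2
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    denom = np.sum(dev * dev)
    return np.array([np.sum(dev[:-k] * dev[k:]) / denom
                     for k in range(1, max_lag + 1)])

# First 200 lags of the seasonally adjusted series, as in Figure 5.
acf = sample_acf(adjusted.dropna(), 200)
```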

I then calculate the standard deviation of the white noise error term as 1/sqrt(T). Since there are 16071 data points, the standard deviation is 1/sqrt(16071) = 0.789%. I use a 95% confidence interval, which implies ±1.96 standard deviations. I also choose to double the band, which is more suitable for volatile data like daily temperatures; the bounds are therefore ±2 × 1.96 × 0.789% = ±3.092%. The sample autocorrelations of the first few lags are significantly greater than zero, which shows that the time series is not generated by a white noise process. From the correlogram, we also see that the first three values, 0.61, 0.33, and 0.22, are large, which suggests that an AR(1), AR(2), or AR(3) model may fit the data well.

Model Specification: Regression Analysis

I then perform regression analysis using Excel's regression add-in. I ran three regressions, for the AR(1), AR(2), and AR(3) models respectively (see columns Z to AH in the Excel file for the lagged x values and y values used in the regressions). The results of the three regressions are summarized below.

AR(1)
Regression Statistics
  Multiple R           0.613238661
  R Square             0.376061656
  Adjusted R Square    0.376022825
  Standard Error       8.390556064
  Observations         16070

ANOVA
              df       SS            MS          F          Significance F
  Regression  1        681805.7952   681805.8    9684.545   0
  Residual    16068    1131210.194   70.40143
  Total       16069    1813015.989

              Coefficients   Standard Error   t Stat
  Intercept   23.65620294    0.386863094      61.14877
  x1          0.613240155    0.006231477      98.41008

AR(2)
Regression Statistics
  Multiple R           0.615702161
  R Square             0.379089151
  Adjusted R Square    0.379011856
  Standard Error       8.370600349
  Observations         16069

ANOVA
              df       SS            MS          F          Significance F
  Regression  2        687279.0176   343639.5    4904.445   0
  Residual    16066    1125695.622   70.06695
  Total       16068    1812974.64

              Coefficients   Standard Error   t Stat
  Intercept   25.30656498    0.428518276      59.05598
  x1          0.655996371    0.007870202      83.35191
  x2          -0.0697402     0.007870495      -8.86097

AR(3)
Regression Statistics
  Multiple R           0.617951703
  R Square             0.381864307
  Adjusted R Square    0.381748869
  Standard Error       8.352364649
  Observations         16068

ANOVA
              df       SS            MS          F          Significance F
  Regression  3        692305.5828   230768.5    3307.94    0
  Residual    16064    1120656.691   69.762
  Total       16067    1812962.274

              Coefficients   Standard Error   t Stat
  Intercept   23.62139042    0.471722223      50.07479
  x1          0.660661578    0.007872278      83.92254
  x2          -0.1133908     0.009399489      -12.0635
  x3          0.066543024    0.007872735      8.452338
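These fits can be sketched in numpy by building the lagged columns explicitly and running ordinary least squares, mirroring the lagged-column set-up used with the Excel regression add-in (variable names are illustrative):

```python
import numpy as np

def fit_ar(y, p):
    # Regress y_t on an intercept and (y_{t-1}, ..., y_{t-p}).
    y = np.asarray(y, dtype=float)
    T = len(y)
    X = np.column_stack([np.ones(T - p)] +
                        [y[p - k:T - k] for k in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    resid = y[p:] - X @ beta
    return beta, resid

# AR(1) fit: beta1[0] is the intercept, beta1[1] the lag-1 coefficient.
beta1, resid1 = fit_ar(adjusted.dropna().to_numpy(), 1)
```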

We see that AR(1) has a very high t-statistic for its x1 variable, and that AR(2) and AR(3) have high t-statistics for their x1 variables as well. Note also that introducing x2 in AR(2), and x2 and x3 in AR(3), increases R^2 only by an insignificant amount. AR(1) has the highest F statistic overall, and AR(3)'s F statistic is significantly lower than those of the other two models. I therefore conclude at this point that AR(3) is not the most appropriate model for this data, and further testing of the AR(3) model is not necessary.

Durbin-Watson Test

For the AR(1) and AR(2) models, I test the residuals for serial correlation using the Durbin-Watson test. The null hypothesis is that there is no serial correlation in the residuals: a Durbin-Watson statistic near 2 indicates no correlation, while a statistic substantially below 2 indicates correlation. For AR(1), the Durbin-Watson statistic is 1.914 (cell G23 of the AR(1) tab in the Excel file), which is close to 2 and suggests no serial correlation. I also used Excel's built-in correlation function, which gives 0.04276 (cell C16096 of the AR(1) tab); the autocorrelation calculated by formula is 0.04275 (cell C16097). Neither of these measures is significantly different from zero. For AR(2), the Durbin-Watson statistic is 1.991 (cell G23 of the AR(2) tab), which is also close to 2 and suggests no correlation. We can therefore conclude that the residuals are not serially correlated for either the AR(1) or the AR(2) model.

Box-Pierce Test

Next, I test whether the residuals resemble a white noise process by performing the Box-Pierce test. I use K = 100 lags; the study notes recommend a K of 15 to 40, but since this data set is much larger than the illustrative worksheet, I chose K = 100. For AR(1), the critical value at 90% with 100 - 1 = 99 degrees of freedom is 117.4 (cell L23 of the AR(1) tab). The Q statistic is 0.0216 (cell K23), much less than the critical value, so we fail to reject the null hypothesis that the residuals resemble a white noise process. For AR(2), the critical value at 90% with 100 - 2 = 98 degrees of freedom is 116.3 (cell L23 of the AR(2) tab). The Q statistic is 0.0296 (cell K23), again much less than the critical value, so we again fail to reject the null hypothesis.
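Both residual tests are straightforward to sketch in code. The Box-Pierce statistic below follows the textbook definition Q = T × Σ r_k², so its scale may differ from the workbook's Q cells, but the chi-squared critical values match those quoted above:

```python
import numpy as np
from scipy import stats

def durbin_watson(resid):
    # DW = sum of squared first differences over sum of squared residuals;
    # values near 2 indicate no first-order serial correlation.
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def box_pierce(resid, K, p):
    # Q = T * sum of squared residual autocorrelations for lags 1..K,
    # compared against a chi-squared critical value with K - p degrees
    # of freedom (p = number of AR parameters fitted).
    resid = np.asarray(resid, dtype=float)
    Q = len(resid) * np.sum(sample_acf(resid, K) ** 2)  # reuses the ACF helper
    crit = stats.chi2.ppf(0.90, K - p)                  # 90% critical value
    return Q, crit

dw = durbin_watson(resid1)                 # approx. 1.914 for the AR(1) fit
Q, crit = box_pierce(resid1, K=100, p=1)   # crit is approx. 117.4
```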

Selection

Both the Durbin-Watson and Box-Pierce tests show that AR(1) and AR(2) are acceptable models. I choose AR(1), following the principle of parsimony.

Conclusion

The selected model is the AR(1):

Yt = 23.656 + 0.613 Yt-1

Intuitively this makes sense, since we know that the temperature on a given day depends on the temperature of the previous day. Finally, I plotted a graph for the year 1991 to visually inspect the actual values against those predicted by the model.
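As a sketch, the one-step-ahead predictions behind Figure 6 can be generated directly from the fitted equation, applied to the seasonally adjusted series from the earlier sketches and then multiplied back by the seasonal index to compare with raw temperatures:

```python
# One-step-ahead predictions from the fitted AR(1):
# yhat_t = 23.656 + 0.613 * y_{t-1} on the seasonally adjusted series.
adj = adjusted.dropna()
yhat_adj = 23.656 + 0.613 * adj.shift(1)

# Convert back to the raw temperature scale by multiplying by each
# day's seasonal index (reversing the de-seasonalizing step).
yhat_raw = yhat_adj * seasonal_index.reindex(adj.index.dayofyear).to_numpy()
```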

Figure 6

As shown in Figure 6, the selected model predicts the temperature fairly well. The AR(1) model is simple, yet it is intuitively the model that makes the most sense, at least for this part of the United States, and it appears satisfactory.
