
PROJECT – TIME SERIES FORECASTING

Australian Monthly Gas Production - Report


TABLE OF CONTENTS

1. Project Objective
2. Data Assumptions
3. Steps for ARIMA & Auto ARIMA Analysis
   i. Load the data & visualization
   ii. Preprocessing the data
   iii. Check/make series stationary
   iv. Determine d value
   v. Determine the p and q values
   vi. Fit ARIMA model / calculate MAPE/RMSE
   vii. Compare models using accuracy measures
   viii. Make prediction
   ix. Predict values on validation set
   x. Auto ARIMA model
4. Appendix A – Source Code


I. Project Objective

Forecast the Australian Gas Production over the next 12 periods.

The objective of this report is to analyze Australian gas production (1956-1995) and, after analyzing and modeling the time series data, to forecast gas production over the next 12 periods (1 year).

This exploration report will consist of the following:

• Import the time series dataset into R
• Plot, examine, and prepare the series for modeling
• Understand the components of the time series
• Explore the data graphically
• Extract the seasonal component from the time series
• Test for stationarity and apply appropriate transformations
• Choose the order of an ARIMA model
• Forecast using ARIMA and Auto ARIMA models
• Establish the accuracy of the model

II. Data Assumptions

• The Australian gas production time series data was obtained from the 'forecast' package in R.
• The components of the time series are not known.
• The stationarity of the time series is not known.
• The seasonality of the time series is not known.

III. Steps for ARIMA & Auto ARIMA Analysis
1. Load the data & Visualization

2. Preprocessing the data

3. Check/Make series stationary

Do a formal hypothesis test (Augmented Dickey-Fuller test, adf.test in R, with Ha: the series is stationary). If the series is non-stationary, stationarize it by taking the difference of consecutive terms in the series (diff(dataset) in R).

4. Determine d value

5. Determine the p and q values

Create ACF and PACF plots to explore the autocorrelations and partial autocorrelations of the differenced series; the PACF suggests the AR order (p) and the ACF suggests the MA order (q).
ARIMA(p,d,q) identifies a non-seasonal model which needs to be differenced d times to make it stationary, and which contains p AR terms and q MA terms.
6. Fit ARIMA Model

ARIMA controls – (p,d,q), e.g. (0,1,2). Adjust the values of p, d, q until the residuals are uncorrelated.
Add a seasonal component if required: ARIMA(p,d,q)(P,D,Q)[frequency] (see the sketch at the end of this section).

7. Compare models using accuracy measures

After forecasting, run accuracy measures, then check the status of the residuals with diagnostics and a hypothesis test (histogram, acf, and Box.test with the Ljung-Box method).

8. Make prediction

9. Predict values on validation set

10. Calculate MAPE/RMSE

Auto ARIMA Model

Auto ARIMA involves the same steps as building an ARIMA model, except steps 3 to 5: the differencing order and the p and q values are selected automatically, hence the name Auto ARIMA.
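A minimal end-to-end sketch of steps 6-10 and the Auto ARIMA shortcut, using the forecast package loaded for this project (the orders shown are placeholders, not the final choices):

library(forecast)
fit <- arima(gas, order = c(0, 1, 2))                  # step 6: fit a candidate ARIMA(p,d,q)
accuracy(fit)                                          # steps 7 & 10: ME, RMSE, MAE, MPE, MAPE, MASE on the training set
Box.test(fit$residuals, lag = 30, type = "Ljung-Box")  # residual check; H0: residuals are independent
fct <- forecast(fit, h = 12)                           # steps 8 & 9: forecast the next 12 months
fit.auto <- auto.arima(gas, seasonal = TRUE)           # Auto ARIMA: d, p, q (and seasonal orders) chosen automatically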

1. Load the data & Visualization

setwd("E:/P5")
getwd()
## [1] "E:/P5"
library(tseries)
library(timeSeries)
library(forecast)
library(zoo)
#Loading the data
data(gas, package = "forecast")

#Plot
plot(gas, main = "Plot of Australian Gas Production")

The production of gas in Australia has increased significantly over a long period of time (40 years). There is a significant upward trend, and there seems to be some seasonality, but extremely high variance can also be observed in the plot. Since the timeline spans 40 years, it remains to be seen how significant the early historical data is.

Histogram

hist(gas, col = "blue", main = "Histogram of gas")

A large number of lower values (<10000), i.e. depicting lower gas production, come from the early years, prior to the 1970s. Whether these lower values will aid in accurately forecasting production in 1996 remains to be seen.

summary(gas)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1646 2675 16788 21415 38629 66600
head(gas)
## Jan Feb Mar Apr May Jun
## 1956 1709 1646 1794 1878 2173 2321
tail(gas)
## Mar Apr May Jun Jul Aug
## 1995 46287 49013 56624 61739 66600 60054

The lowest gas production was recorded in February 1956, in the early years of production, whereas the highest monthly production was recorded in July 1995. There is therefore a huge gap in production over the years, with current levels far above the mean (21415) and the median (16788).

monthplot(gas, main = "Monthly plot of Australian Gas production")

The monthplot for the Australian gas production data shows a clear increase in production within each month from 1956-1995. There is a clear upward trend, with the visible fluctuations/variations occurring mainly during the last 5-10 years.

Frequency

A time series with one observation each month has a monthly sampling frequency, or monthly periodicity, and so is called a monthly time series. Data periodicity is described by specifying the periodic time intervals into which the dates of the observations fall. Using the frequency() function we can determine the periodicity of the time series.

> frequency(gas)
[1] 12

2. Preprocessing the data
Visual Analysis

Visual inspection of the plot helps us understand that there is an upward trend with a semi-annual seasonality observed throughout the time series.

The seasonal component at the beginning of the series is smaller than the seasonal component later in the series. To account for this, we log-transform the data as follows:

Log transformation

Plot a graph of the data against time. If it looks like the variation increases with the level of
the series, take logs. Otherwise model the original data.

#Log transformation
loggas <- log(gas)
plot(loggas, main = "Plot of log(gas)")

Compared to the plot of the original time series data, we can observe that once we have done
the Log transformation the variation is less skewed and quite uniform throughout.

Decomposition

Now we have to decide whether an additive or a multiplicative model would describe the data appropriately. Since the size of the seasonal and random fluctuations increases over time, an additive model is NOT appropriate; the data is better described by a multiplicative model. However, because we have already applied a log transformation, we can use an additive decomposition on the transformed series.

At first glance, the time series includes a seasonal (semi-annual) component, a trend (upward) component, and a residual or error component. The extent of each component can be deduced by decomposing the data using the stl() function.

Decomposing also allows us to remove seasonal trends from our data. To illustrate why this might be useful:

loggasdec <- stl(loggas, s.window = "p")
plot(loggasdec)

From the decomposed plot we can observe that there is definitely a trend, as noted in our visual inspection, along with a semi-annual seasonal component and residuals (white noise). The trend component is the most significant.

Seasonal plot

As observed earlier during our visual inspection, there is semi-annual seasonality present in the time series data along with an upward trend. Deseasonalization involves removing the seasonal component from the time series, which helps us understand the effect of the other components.

The deseasonalized plot is then compared to the original plot to better understand the seasonal component's impact.

gas.sa <- seasadj(loggasdec) # assumed definition (forecast::seasadj on the stl fit); not shown in the original extract
plot(gas.sa, type="l", main= "Seasonal Adjusted") # seasonally adjusted series
seasonplot(gas.sa, 12, col=rainbow(12), year.labels=TRUE, main="Seasonal plot: Australian Gas Production") # seasonal frequency set as 12 for monthly data

#Deseasonalize
Deseasonloggas <- (loggasdec$time.series[,2]+loggasdec$time.series[,3])
ts.plot(Deseasonloggas, loggas, col=c("red","blue"), main = "Comparison of loggas and Deseasonalized loggas")

#Plotting actual values with Exponentiation
Deseasongas <- exp(Deseasonloggas) # back-transform the trend+remainder sum; exponentiating the components separately would distort the level
ts.plot(Deseasongas, gas, col=c("red","blue"), main = "Comparison of gas and Deseasonalized gas")

#Plotting seasonality only, took first 12 months data
logseason = loggasdec$time.series[1:12,1]
plot(logseason, type="l")

#Exponentiate to get actual value
Gasseason <- exp(loggasdec$time.series[1:12,1])
plot(Gasseason, type="l")

3. Check/Make series stationary

Stationarity

Fitting an ARIMA model requires the series to be stationary. A series is said to be stationary when its mean, variance, and autocovariance are time invariant.
So, the first thing to do is to determine whether our time series is stationary (i.e., whether the mean is generally constant throughout the time series, as opposed to going up or down over time). First, we'll do this with a visual inspection.

OK, this doesn't look stationary at all, as the mean tends to go up over time. We can do a formal test to determine stationarity (or lack thereof) in a more empirical way.

adf.test

For this, we can use the augmented Dickey-Fuller (ADF) test, which tests the null hypothesis that the series is non-stationary. This is included in the "tseries" package.

Hypothesis
H0 – Non-stationary
Ha – Stationary
If the p-value is above 0.05, we fail to reject the null hypothesis (H0) and conclude that the data is non-stationary; if it is below 0.05, we reject H0 in favour of the alternative (Ha) and conclude that the data is stationary.

Check for stationarity of data
adf.test(gas)
##
## Augmented Dickey-Fuller Test
##
## data: gas
## Dickey-Fuller = -2.7131, Lag order = 7, p-value = 0.2764
## alternative hypothesis: stationary
#Non-Stationary

Our p-value (0.2764) is above 0.05, meaning the data is indeed non-stationary: the p-value is not significant, so we fail to reject the null (H0) hypothesis of non-stationarity. This confirms the results of our visual inspection.

Stationarize – Differencing

Given that we have non-stationary data, we will need to “difference” the data until we obtain a
stationary time series. We can do this with the “diff” function in R.

diff1 <- diff(gas) # difference of consecutive terms
adf.test(diff1)
## Augmented Dickey-Fuller Test
##
## data: diff1
## Dickey-Fuller = -19.321, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

4. Determine d value

After differencing, we ran adf.test again to check for stationarity. The p-value (0.01) is below 0.05, so it is significant: we reject the null (H0) hypothesis in favour of the alternative (Ha), and the data is now indeed stationary.
Given that we had to difference the data once, the d value for our ARIMA model is 1.
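As a cross-check on the visual and ADF-based choice, the forecast package also provides ndiffs() and nsdiffs(), which estimate the required differencing orders directly (a quick sketch; these calls were not part of the original run):

ndiffs(gas)   # estimated number of ordinary differences needed; expected to agree with d = 1
nsdiffs(gas)  # number of seasonal differences suggested for the monthly series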

5. Determine p and q values

#Autocorrelation of lag 50
acf(diff1, lag=50, main= "Auto Correlation(q)")

pacf(diff1, lag=50, main = "Partial Auto correlation (p)")

From the above ACF (q) and PACF (p) plots, we observe that a large amount of correlation exists. Looking at the ACF plot, we can also see a seasonal pattern.

The p and q values read from the PACF and ACF plots would be 2 and 2 respectively.
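Since reading p and q off the plots is somewhat subjective, a small AIC comparison over nearby candidate orders can back up the choice (a hedged sketch; this search was not part of the original analysis):

# Compare AIC across candidate (p, q) orders with d = 1
for (p in 0:2) {
  for (q in 0:2) {
    fit <- try(arima(gas, order = c(p, 1, q)), silent = TRUE)  # some orders may fail to converge
    if (!inherits(fit, "try-error"))
      cat(sprintf("ARIMA(%d,1,%d): AIC = %.1f\n", p, q, AIC(fit)))
  }
}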

6. Fit ARIMA Model / Calculate MAPE/RMSE / Compare models using accuracy measures

#ARIMA (p,d,q)
gas.arima.fit<-arima(gas, c(2,1,2)) #With AR & MA, with differencing
summary(gas.arima.fit)
## Call:
## arima(x = gas, order = c(2, 1, 2))
## Coefficients:
## ar1 ar2 ma1 ma2
## 0.1355 0.0005 0.1261 0.2753
## s.e. 0.4295 0.1481 0.4269 0.0752
## sigma^2 estimated as 6801981: log likelihood = -4410.63, aic = 8831.27
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 70.11654 2605.32 1526.492 0.3193329 6.992583 0.8840275
## ACF1
## Training set 9.96889e-05
hist(gas.arima.fit$residuals, col = "blue")

#Testing the fit with original series
ts.plot(gas, fitted(gas.arima.fit), col=c("blue", "red"))

#Test auto correlation in residuals to check the fit
acf(gas.arima.fit$residuals)

#Portmanteau test : Ljung Box method used :H0 : residuals are independent
Box.test(gas.arima.fit$residuals,lag = 30, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: gas.arima.fit$residuals
## X-squared = 661.61, df = 30, p-value < 2.2e-16

The 1st ARIMA model run above with (p,d,q) = (2,1,2) gave the following results:
• MAPE – 6.992583 (less than 10 is great)
• Histogram – Normally distributed
• Compared to original plot – Good
• Auto correlation in residuals – Correlation exists from lag 4 onward at various lags
• Box-Ljung test – The p-value is significantly less than 0.05, so the residuals are dependent

#Adding seasonal component if required
gas.arima.fit.s <- arima(gas, c(2,1,2), seasonal = list(order=c(1,1,2), period=12))
gas.arima.fit.s
##
## Call:
## arima(x = gas, order = c(2, 1, 2), seasonal = list(order = c(1, 1, 2), period = 12))
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1 sma2
## 0.229 0.2133 -0.7067 -0.134 -0.4313 -0.1708 -0.3013
## s.e. 0.315 0.1176 0.3157 0.242 1.4959 1.4856 0.9190
##
## sigma^2 estimated as 2559509: log likelihood = -4076.14, aic = 8168.27
summary(gas.arima.fit.s)
##
## Call:
## arima(x = gas, order = c(2, 1, 2), seasonal = list(order = c(1, 1, 2), period = 12))
##

## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1 sma2
## 0.229 0.2133 -0.7067 -0.134 -0.4313 -0.1708 -0.3013
## s.e. 0.315 0.1176 0.3157 0.242 1.4959 1.4856 0.9190
##
## sigma^2 estimated as 2559509: log likelihood = -4076.14, aic = 8168.27
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 27.4472 1577.849 894.845 0.2790015 3.9086 0.5182258
## ACF1
## Training set -0.000577524
hist(gas.arima.fit.s$residuals, col = "blue")

ts.plot(gas, fitted(gas.arima.fit.s), col=c("blue", "red"))

acf(gas.arima.fit.s$residuals)

Box.test(gas.arima.fit.s$residuals, lag = 30, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: gas.arima.fit.s$residuals
## X-squared = 74.75, df = 30, p-value = 1.092e-05

The 2nd model run above is a SARIMA model with (p,d,q)(P,D,Q) = (2,1,2)(1,1,2), since a seasonal component exists. It gave the following results:
• MAPE – 3.9086 (less than 10 is great)
• Histogram – Normally distributed
• Compared to original plot – Good
• Auto correlation in residuals – Correlation exists from lag 4 onward, but at fewer lags than the previous model. A better outcome than the previous model.
• Box-Ljung test – The p-value is significantly less than 0.05, so the residuals are dependent. The p-value is slightly better than the previous model's.

#auto-arima
fitauto=auto.arima(gas, seasonal = TRUE, trace = T)
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,1,2)(1,1,1)[12] : 7975.595
## ARIMA(0,1,0)(0,1,0)[12] : 8195.028
## ARIMA(1,1,0)(1,1,0)[12] : 8058.801
## ARIMA(0,1,1)(0,1,1)[12] : 7967.259
## ARIMA(0,1,1)(0,1,0)[12] : 8099.389
## ARIMA(0,1,1)(1,1,1)[12] : 7981.315
## ARIMA(0,1,1)(0,1,2)[12] : 7969.124
## ARIMA(0,1,1)(1,1,0)[12] : 8022.501
## ARIMA(0,1,1)(1,1,2)[12] : 7983.475
## ARIMA(0,1,0)(0,1,1)[12] : 8046.576
## ARIMA(1,1,1)(0,1,1)[12] : 7962.588
## ARIMA(1,1,1)(0,1,0)[12] : 8099.616
## ARIMA(1,1,1)(1,1,1)[12] : 7976.808

## ARIMA(1,1,1)(0,1,2)[12] : 7964.626
## ARIMA(1,1,1)(1,1,0)[12] : 8022.009
## ARIMA(1,1,1)(1,1,2)[12] : 7978.712
## ARIMA(1,1,0)(0,1,1)[12] : 7989.065
## ARIMA(2,1,1)(0,1,1)[12] : 7959.768
## ARIMA(2,1,1)(0,1,0)[12] : 8084.882
## ARIMA(2,1,1)(1,1,1)[12] : 7973.938
## ARIMA(2,1,1)(0,1,2)[12] : 7961.533
## ARIMA(2,1,1)(1,1,0)[12] : 8023.31
## ARIMA(2,1,1)(1,1,2)[12] : Inf
## ARIMA(2,1,0)(0,1,1)[12] : 7984.927
## ARIMA(3,1,1)(0,1,1)[12] : 7962.327
## ARIMA(2,1,2)(0,1,1)[12] : 7961.417
## ARIMA(1,1,2)(0,1,1)[12] : 7960.058
## ARIMA(3,1,0)(0,1,1)[12] : 7977.814
## ARIMA(3,1,2)(0,1,1)[12] : 7964.384
## Now re-fitting the best model(s) without approximations...
## ARIMA(2,1,1)(0,1,1)[12] : 8163.291
##
## Best model: ARIMA(2,1,1)(0,1,1)[12]
summary(fitauto)
## Series: gas
## ARIMA(2,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.3756 0.1457 -0.8620 -0.6216
## s.e. 0.0780 0.0621 0.0571 0.0376
## sigma^2 estimated as 2587081: log likelihood=-4076.58
## AIC=8163.16 AICc=8163.29 BIC=8183.85
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 27.72266 1579.457 893.4504 0.272275 3.900233 0.4789312

## ACF1
## Training set 0.002271673
hist(fitauto$residuals, col="green")

ts.plot(gas,fitted(fitauto), col=c("green","blue"))

acf(fitauto$residuals)

Box.test(fitauto$residuals, lag=30, type = "Ljung-Box")
##
## Box-Ljung test
## data: fitauto$residuals
## X-squared = 77.964, df = 30, p-value = 3.86e-06
checkresiduals(fitauto)

After two unsatisfactory attempts, we ran the auto.arima function to derive the 3rd model. It selected the best model ARIMA(2,1,1)(0,1,1)[12], a SARIMA model, which gave the following results:
• MAPE – 3.900233 (less than 10 is great)
• Histogram – Normally distributed
• Compared to original plot – Good
• Auto correlation in residuals – Very similar to the previous SARIMA model
• Box-Ljung test – The p-value is significantly less than 0.05, so the residuals are dependent. The p-value is still no better than the previous models'.

#Box Cox
gas1=BoxCox(gas,lambda = BoxCox.lambda(gas))

summary(gas1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.21 11.13 14.94 14.25 16.86 18.20
tsdisplay(gas1,lag.max = 150, plot.type = c("histogram"))

fitauto1<-auto.arima(gas1, seasonal = TRUE, trace = T)
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,1,2)(1,1,1)[12] : -633.6623
## ARIMA(0,1,0)(0,1,0)[12] : -399.6777
## ARIMA(1,1,0)(1,1,0)[12] : -536.1276
## ARIMA(0,1,1)(0,1,1)[12] : -657.6893
## ARIMA(0,1,1)(0,1,0)[12] : -471.4437
## ARIMA(0,1,1)(1,1,1)[12] : -643.515
## ARIMA(0,1,1)(0,1,2)[12] : -661.887
## ARIMA(0,1,1)(1,1,2)[12] : -647.0125
## ARIMA(0,1,0)(0,1,2)[12] : Inf
## ARIMA(1,1,1)(0,1,2)[12] : -658.9943
## ARIMA(0,1,2)(0,1,2)[12] : -659.9056
## ARIMA(1,1,0)(0,1,2)[12] : Inf
## ARIMA(1,1,2)(0,1,2)[12] : -656.8534
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(0,1,1)(0,1,2)[12] : Inf
## ARIMA(0,1,2)(0,1,2)[12] : Inf
## ARIMA(1,1,1)(0,1,2)[12] : Inf
## ARIMA(0,1,1)(0,1,1)[12] : -700.5158
##
## Best model: ARIMA(0,1,1)(0,1,1)[12]
summary(fitauto1)
## Series: gas1
## ARIMA(0,1,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## -0.3755 -0.8586
## s.e. 0.0450 0.0450

##
## sigma^2 estimated as 0.01235: log likelihood=353.28
## AIC=-700.57 AICc=-700.52 BIC=-688.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 0.001163337 0.1093533 0.0766936 0.01790397 0.5343536
## MASE ACF1
## Training set 0.3532688 0.002647187
hist(fitauto1$residuals, col="green")

ts.plot(gas1,fitted(fitauto1), col=c("green","blue"))

acf(fitauto1$residuals)

Box.test(fitauto1$residuals, lag=30, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: fitauto1$residuals
## X-squared = 62.654, df = 30, p-value = 0.0004342
checkresiduals(fitauto)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)(0,1,1)[12]
## Q* = 64.344, df = 20, p-value = 1.484e-06
##
## Model df: 4. Total lags used: 24

Since Auto ARIMA also failed to give us a model with independent residuals, we applied a Box-Cox transformation (best model: ARIMA(0,1,1)(0,1,1)[12]), which gave the following results:
• MAPE – 0.5343536 (lowest, though computed on the Box-Cox-transformed scale, so not directly comparable)
• Histogram – Normally distributed
• Compared to original plot – Good
• Auto correlation in residuals – Very similar to the previous Auto ARIMA model
• Box-Ljung test – The p-value is significantly less than 0.05, so the residuals are dependent. The p-value is still no better than the previous models'.
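Note that fitauto1 was fit on the Box-Cox-transformed series gas1, so its forecasts live on the transformed scale. A minimal sketch of back-transforming them with the forecast package's InvBoxCox() (not part of the original run):

lambda <- BoxCox.lambda(gas)          # the same lambda used to build gas1
fct.bc <- forecast(fitauto1, h = 12)  # forecasts on the transformed scale
InvBoxCox(fct.bc$mean, lambda)        # point forecasts back on the original production scale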

#Subset of dataset
gassub<-window(gas,start=c(1990,1))
plot(gassub)

#Check for stationarity of data
adf.test(gassub)
##
## Augmented Dickey-Fuller Test
##
## data: gassub

## Dickey-Fuller = -6.1377, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
#Stationary
#Auto correlation of lag 30
acf((gassub), lag=30)

pacf((gassub), lag=30)

plot(gassub)

#arima (p,d,q)
gas.arima.fit<-arima(gassub, c(0,0,2))

summary(gas.arima.fit)
##
## Call:
## arima(x = gassub, order = c(0, 0, 2))
##
## Coefficients:
## ma1 ma2 intercept
## 0.7535 0.6040 47943.139
## s.e. 0.2255 0.1659 1368.765
##
## sigma^2 estimated as 23411514: log likelihood = -674, aic = 1356.01
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 69.1481 4838.545 3996.105 -1.304102 8.588294 1.015244
## ACF1
## Training set 0.3029303
hist(gas.arima.fit$residuals, col = "blue")

#Testing the fit with original series


ts.plot(gassub, fitted(gas.arima.fit), col=c("blue", "red"))

fitted(gas.arima.fit)
## Jan Feb Mar Apr May Jun Jul
## 1990 45842.57 43030.72 43553.54 46669.04 45350.29 51170.59 56281.25
## 1991 39165.67 38971.49 43302.54 43362.56 44659.88 51285.83 49773.07
## 1992 41886.55 41223.01 44268.62 43040.01 44866.15 51621.74 56061.28
## 1993 44069.48 40954.01 41454.90 38291.14 44231.68 54631.62 53415.67
## 1994 45211.37 42063.41 43585.63 49479.11 47431.35 51641.71 57160.95
## 1995 40847.17 41773.77 48169.73 46341.62 48818.80 55437.92 57405.53
## Aug Sep Oct Nov Dec
## 1990 51617.45 53347.22 46193.00 42863.00 46416.83
## 1991 50628.70 54939.46 46557.13 42842.83 46176.57
## 1992 54135.53 53712.14 52842.70 45157.23 42431.61
## 1993 51915.17 52040.22 49078.11 46779.15 46938.07
## 1994 55647.59 57353.02 53250.14 48189.04 49562.87
## 1995 58677.16
#Test auto correlation in residuals to check the fit
acf(gas.arima.fit$residuals)

#Portmanteau test : Ljung Box method used :H0 : residuals are independent
Box.test(gas.arima.fit$residuals,lag = 30, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: gas.arima.fit$residuals
## X-squared = 213.87, df = 30, p-value < 2.2e-16

We eventually decided to work with a subset of the original time series, using only data from 1990 onward, since we had encountered many dependent residuals and inaccurate models with both ARIMA and Auto ARIMA on the full series. We therefore follow the same steps on the subset to build a more robust model and make a forecast with higher confidence. The model ARIMA(0,0,2) gave the following results:
• MAPE – 8.588294 (less than 10 is great)
• Histogram – Fairly normally distributed
• Compared to original plot – Good
• Auto correlation in residuals – Significant correlations are observed; no improvement over the previous Auto ARIMA model
• Box-Ljung test – The p-value is significantly less than 0.05, so the residuals are dependent. The p-value is still no better than the previous models'.

#Adding seasonal component if required
gas.arima.fit.s <- arima(gassub, c(0,1,1), seasonal = list(order=c(1,1,0), period=12))
gas.arima.fit.s
##
## Call:
## arima(x = gassub, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 0), period = 12))
##
## Coefficients:
## ma1 sar1
## -0.6974 -0.5136
## s.e. 0.1167 0.1208
##
## sigma^2 estimated as 11007538: log likelihood = -526.11, aic = 1058.21
summary(gas.arima.fit.s)
##
## Call:
## arima(x = gassub, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 0), period = 12))
##
## Coefficients:
## ma1 sar1
## -0.6974 -0.5136
## s.e. 0.1167 0.1208
##
## sigma^2 estimated as 11007538: log likelihood = -526.11, aic = 1058.21
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 448.4043 2983.901 1910.476 0.6358374 4.139159 0.4853723
## ACF1
## Training set -0.01701014

hist(gas.arima.fit.s$residuals, col = "blue")

ts.plot(gassub, fitted(gas.arima.fit.s), col=c("blue", "red"))

acf(gas.arima.fit.s$residuals)

Box.test(gas.arima.fit.s$residuals, lag = 30, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: gas.arima.fit.s$residuals
## X-squared = 28.547, df = 30, p-value = 0.5415

Model – Subset of data including a seasonal component, SARIMA(0,1,1)(1,1,0), which gave the following results:
• MAPE – 4.139159 (less than 10 is great and better than the previous result)
• Histogram – Normally distributed
• Compared to original plot – Very good
• Auto correlation in residuals – No correlation observed; best fit so far
• Box-Ljung test – The p-value is well above 0.05, hence the residuals are independent. The p-value is the best so far.

#auto-arima
fitauto=auto.arima(gassub, seasonal = TRUE, trace = T)
##

## ARIMA(2,1,2)(1,1,1)[12] : Inf
## ARIMA(0,1,0)(0,1,0)[12] : 1081.687
## ARIMA(1,1,0)(1,1,0)[12] : 1065.345
## ARIMA(0,1,1)(0,1,1)[12] : Inf
## ARIMA(1,1,0)(0,1,0)[12] : 1074.33
## ARIMA(1,1,0)(1,1,1)[12] : Inf
## ARIMA(1,1,0)(0,1,1)[12] : Inf
## ARIMA(0,1,0)(1,1,0)[12] : 1075.677
## ARIMA(2,1,0)(1,1,0)[12] : 1065.765
## ARIMA(1,1,1)(1,1,0)[12] : 1060.953
## ARIMA(1,1,1)(0,1,0)[12] : 1072.152
## ARIMA(1,1,1)(1,1,1)[12] : Inf
## ARIMA(1,1,1)(0,1,1)[12] : Inf
## ARIMA(0,1,1)(1,1,0)[12] : 1058.684
## ARIMA(0,1,1)(0,1,0)[12] : 1070.051
## ARIMA(0,1,1)(1,1,1)[12] : Inf
## ARIMA(0,1,2)(1,1,0)[12] : 1060.965
## ARIMA(1,1,2)(1,1,0)[12] : 1063.06
##
## Best model: ARIMA(0,1,1)(1,1,0)[12]
summary(fitauto)
## Series: gassub
## ARIMA(0,1,1)(1,1,0)[12]
##
## Coefficients:
## ma1 sar1
## -0.6974 -0.5136
## s.e. 0.1167 0.1208
##
## sigma^2 estimated as 11423571: log likelihood=-526.11
## AIC=1058.21 AICc=1058.68 BIC=1064.24
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE

## Training set 448.4043 2983.901 1910.476 0.6358374 4.139159 0.5551405
## ACF1
## Training set -0.01701014
hist(fitauto$residuals, col="green")

acf(fitauto$residuals)

Box.test(fitauto$residuals, lag=30, type = "Ljung-Box")
## Box-Ljung test
## data: fitauto$residuals
## X-squared = 28.547, df = 30, p-value = 0.5415
checkresiduals(fitauto)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(1,1,0)[12]
## Q* = 11.922, df = 12, p-value = 0.452
##
## Model df: 2. Total lags used: 14
ts.plot(gassub,fitted(fitauto), col=c("green","blue"))

Final Model – We ran auto.arima on the subset of the data to find the best model, which turns out to be identical to the previous SARIMA model, (0,1,1)(1,1,0), and gave the exact same results:
• MAPE – 4.139159 (less than 10 is great and better than the previous result)
• Histogram – Normally distributed
• Compared to original plot – Very good
• Auto correlation in residuals – No correlation observed; best fit so far
• Box-Ljung test – The p-value is well above 0.05, hence the residuals are independent. The p-value is the best so far.

Now we can run a forecast on this model.

7. Make prediction
## Forecast
#After ensuring model is stable and accurate, forecast for next 12 intervals
fct1=forecast(gas.arima.fit.s, h=12)

fct1$mean
## Jan Feb Mar Apr May Jun Jul
## 1995
## 1996 44630.39 44826.00 50464.32 51405.98 59660.56 63580.35 68333.48

## Aug Sep Oct Nov Dec
## 1995 58353.09 54446.75 52111.62 45010.61
## 1996 65892.39
plot(forecast(gas.arima.fit.s, h = 12)) # h belongs inside forecast(); passing it to plot() triggers "not a graphical parameter" warnings
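Besides the point forecasts in fct1$mean, the forecast object also carries the interval forecasts drawn as bands in the plot (a quick look; this output was not shown in the original run):

fct1$lower  # 80% and 95% lower forecast bounds for the 12 months
fct1$upper  # corresponding upper bounds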

summary(gas.arima.fit.s)
##
## Call:
## arima(x = gassub, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 0), period = 12))
##
## Coefficients:
## ma1 sar1

## -0.6974 -0.5136
## s.e. 0.1167 0.1208
##
## sigma^2 estimated as 11007538: log likelihood = -526.11, aic = 1058.21
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 448.4043 2983.901 1910.476 0.6358374 4.139159 0.4853723
## ACF1
## Training set -0.01701014
#######################################################################

8. Predict values on validation set

## 1996 44630.39 44826.00 50464.32 51405.98 59660.56 63580.35 68333.48
## Aug Sep Oct Nov Dec
## 1995 58353.09 54446.75 52111.62 45010.61

So finally we have arrived at the best model. We used a subset of the original data because correlated residuals, white noise, and inaccurate models showed that the 40 years of high-variance historical data were hurting the accuracy of our earlier models. Since we only had to forecast the next 12 months, the data from the last five years was fairly stationary and sufficient to build the model, predict the values, and plot the forecast.
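The values above are the same point forecasts as in the previous section rather than scores on a held-out sample. A minimal sketch of a true hold-out validation on the subset (the split dates are assumed; this was not part of the original run):

train <- window(gas, start = c(1990, 1), end = c(1994, 8))  # training window
valid <- window(gas, start = c(1994, 9))                    # last 12 observations held out
fit.v <- arima(train, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 0), period = 12))
accuracy(forecast(fit.v, h = length(valid)), valid)         # test-set MAPE/RMSE against the hold-out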

IV. Appendix A – Source Code

P5.R

Akshay
2019-08-02

setwd("E:/P5")
getwd()
## [1] "E:/P5"
library(tseries)
## Warning: package 'tseries' was built under R version 3.5.3
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.5.3
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.5.3
library(forecast)
## Warning: package 'forecast' was built under R version 3.5.3
library(zoo)
## Warning: package 'zoo' was built under R version 3.5.3
##
## Attaching package: 'zoo'
## The following object is masked from 'package:timeSeries':
##
## time<-
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
#Loading the data
data(gas, package = "forecast")

#Plot
plot(gas, main = "Plot of Australian Gas Production")

hist(gas, col = "blue", main = "Histogram of gas")

summary(gas)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1646 2675 16788 21415 38629 66600
head(gas)
## Jan Feb Mar Apr May Jun
## 1956 1709 1646 1794 1878 2173 2321
tail(gas)
## Mar Apr May Jun Jul Aug
## 1995 46287 49013 56624 61739 66600 60054
monthplot(gas, main = "Monthly plot of Australian Gas production")

frequency(gas)
## [1] 12
#Log transformation
loggas <- log(gas)
plot(loggas, main = "Plot of log(gas)")

loggasdec <- stl(loggas, s.window= "p")

plot(loggasdec)

loggasdec
## Call:
## stl(x = loggas, s.window = "p")
##
##

#Deseasonalize
Deseasonloggas <- (loggasdec$time.series[,2]+loggasdec$time.series[,3])
ts.plot(Deseasonloggas, loggas, col=c("red","blue"), main = "Comparison of loggas and Deseasonalized loggas")

#Plotting actual values with Exponentiation
Deseasongas <- exp(Deseasonloggas) # back-transform the trend+remainder sum; exponentiating the components separately would distort the level
ts.plot(Deseasongas, gas, col=c("red","blue"), main = "Comparison of gas and Deseasonalized gas")

#Plotting seasonality only, took first 12 months data
logseason=loggasdec$time.series[1:12,1]
plot(logseason,type="l")

#Exponentiate to get actual value
Gasseason<-exp(loggasdec$time.series[1:12,1])
plot(Gasseason, type="l")

##############################################
#Check for stationarity of data
adf.test(gas)
##
## Augmented Dickey-Fuller Test
##
## data: gas
## Dickey-Fuller = -2.7131, Lag order = 7, p-value = 0.2764
## alternative hypothesis: stationary
#Non-Stationary

#Stationarize the series

diff1<-diff(gas)
plot(diff1, main = "Differenced Plot of Gas")

adf.test(diff1)
## Warning in adf.test(diff1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff1
## Dickey-Fuller = -19.321, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
#Autocorrelation of lag 50
acf(diff1, lag=50, main= "Auto Correlation(q)")

pacf(diff1, lag=50, main = "Partial Auto correlation (p)")

#######################################################

#ARIMA (p,d,q)
gas.arima.fit<-arima(gas, c(2,1,2)) #With AR & MA, with differencing
#gas.arima.fit<-arima(gas, c(2,1,1)) #With AR & MA, with differencing

summary(gas.arima.fit)
##
## Call:
## arima(x = gas, order = c(2, 1, 2))
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 0.1355 0.0005 0.1261 0.2753
## s.e. 0.4295 0.1481 0.4269 0.0752
##
## sigma^2 estimated as 6801981: log likelihood = -4410.63, aic = 8831.27
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 70.11654 2605.32 1526.492 0.3193329 6.992583 0.8840275
## ACF1
## Training set 9.96889e-05
hist(gas.arima.fit$residuals, col = "blue")

#Testing the fit with original series

ts.plot(gas, fitted(gas.arima.fit), col=c("blue", "red"))

#Test auto correlation in residuals to check the fit
acf(gas.arima.fit$residuals)

#Portmanteau test : Ljung Box method used :H0 : residuals are independent
Box.test(gas.arima.fit$residuals,lag = 30, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: gas.arima.fit$residuals
## X-squared = 661.61, df = 30, p-value < 2.2e-16
#Adding seasonal component if required
gas.arima.fit.s <- arima(gas, c(2,1,2), seasonal = list(order=c(1,1,2), period=12))
gas.arima.fit.s
##
## Call:

## arima(x = gas, order = c(2, 1, 2), seasonal = list(order = c(1, 1, 2), period = 12))
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1 sma2
## 0.229 0.2133 -0.7067 -0.134 -0.4313 -0.1708 -0.3013
## s.e. 0.315 0.1176 0.3157 0.242 1.4959 1.4856 0.9190
##
## sigma^2 estimated as 2559509: log likelihood = -4076.14, aic = 8168.27
summary(gas.arima.fit.s)
##
## Call:
## arima(x = gas, order = c(2, 1, 2), seasonal = list(order = c(1, 1, 2), period = 12))
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1 sma2
## 0.229 0.2133 -0.7067 -0.134 -0.4313 -0.1708 -0.3013
## s.e. 0.315 0.1176 0.3157 0.242 1.4959 1.4856 0.9190
##
## sigma^2 estimated as 2559509: log likelihood = -4076.14, aic = 8168.27
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 27.4472 1577.849 894.845 0.2790015 3.9086 0.5182258
## ACF1
## Training set -0.000577524
hist(gas.arima.fit.s$residuals, col = "blue")

ts.plot(gas, fitted(gas.arima.fit.s), col=c("blue", "red"))

acf(gas.arima.fit.s$residuals)

Box.test(gas.arima.fit.s$residuals, lag = 30, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: gas.arima.fit.s$residuals
## X-squared = 74.75, df = 30, p-value = 1.092e-05
#plot(forecast(gas.arima.fit.s, h=12))

#gas.arima.fit.s1<-arima(gas, c(0,1,2), seasonal = list(order=c(0,1,2), period=12))
#gas.arima.fit.s1
#summary(gas.arima.fit.s1)
#hist(gas.arima.fit.s1$residuals, col = "blue")
#ts.plot(gas, fitted(gas.arima.fit.s1), col=c("blue", "red"))

#acf(gas.arima.fit.s1$residuals)
#Box.test(gas.arima.fit.s1$residuals, lag = 30, type = "Ljung-Box")
#plot(forecast(gas.arima.fit.s1, h=12))

#auto-arima
fitauto=auto.arima(gas, seasonal = TRUE, trace = T)
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,1,2)(1,1,1)[12] : 7975.595
## ARIMA(0,1,0)(0,1,0)[12] : 8195.028
## ARIMA(1,1,0)(1,1,0)[12] : 8058.801
## ARIMA(0,1,1)(0,1,1)[12] : 7967.259
## ARIMA(0,1,1)(0,1,0)[12] : 8099.389
## ARIMA(0,1,1)(1,1,1)[12] : 7981.315
## ARIMA(0,1,1)(0,1,2)[12] : 7969.124
## ARIMA(0,1,1)(1,1,0)[12] : 8022.501
## ARIMA(0,1,1)(1,1,2)[12] : 7983.475
## ARIMA(0,1,0)(0,1,1)[12] : 8046.576
## ARIMA(1,1,1)(0,1,1)[12] : 7962.588
## ARIMA(1,1,1)(0,1,0)[12] : 8099.616
## ARIMA(1,1,1)(1,1,1)[12] : 7976.808
## ARIMA(1,1,1)(0,1,2)[12] : 7964.626
## ARIMA(1,1,1)(1,1,0)[12] : 8022.009
## ARIMA(1,1,1)(1,1,2)[12] : 7978.712
## ARIMA(1,1,0)(0,1,1)[12] : 7989.065
## ARIMA(2,1,1)(0,1,1)[12] : 7959.768
## ARIMA(2,1,1)(0,1,0)[12] : 8084.882
## ARIMA(2,1,1)(1,1,1)[12] : 7973.938
## ARIMA(2,1,1)(0,1,2)[12] : 7961.533
## ARIMA(2,1,1)(1,1,0)[12] : 8023.31
## ARIMA(2,1,1)(1,1,2)[12] : Inf
## ARIMA(2,1,0)(0,1,1)[12] : 7984.927
## ARIMA(3,1,1)(0,1,1)[12] : 7962.327

## ARIMA(2,1,2)(0,1,1)[12] : 7961.417
## ARIMA(1,1,2)(0,1,1)[12] : 7960.058
## ARIMA(3,1,0)(0,1,1)[12] : 7977.814
## ARIMA(3,1,2)(0,1,1)[12] : 7964.384
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,1,1)(0,1,1)[12] : 8163.291
##
## Best model: ARIMA(2,1,1)(0,1,1)[12]
summary(fitauto)
## Series: gas
## ARIMA(2,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.3756 0.1457 -0.8620 -0.6216
## s.e. 0.0780 0.0621 0.0571 0.0376
##
## sigma^2 estimated as 2587081: log likelihood=-4076.58
## AIC=8163.16 AICc=8163.29 BIC=8183.85
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 27.72266 1579.457 893.4504 0.272275 3.900233 0.4789312
## ACF1
## Training set 0.002271673
hist(fitauto$residuals, col="green")

ts.plot(gas,fitted(fitauto), col=c("green","blue"))

acf(fitauto$residuals)

Box.test(fitauto$residuals, lag=30, type = "Ljung-Box")


##
## Box-Ljung test
##
## data: fitauto$residuals
## X-squared = 77.964, df = 30, p-value = 3.86e-06
checkresiduals(fitauto)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)(0,1,1)[12]
## Q* = 64.344, df = 20, p-value = 1.484e-06
##
## Model df: 4. Total lags used: 24
#plot(forecast(fitauto, h=12))
#forecastfit = forecast(fitauto, h=12)
#automean=forecastfit$mean
#automean
#forecastfit

#Box Cox
gas1=BoxCox(gas,lambda = BoxCox.lambda(gas))

summary(gas1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.21 11.13 14.94 14.25 16.86 18.20
tsdisplay(gas1,lag.max = 150, plot.type = c("histogram"))

fitauto1<-auto.arima(gas1, seasonal = TRUE, trace = T)


##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,1,2)(1,1,1)[12] : -633.6623
## ARIMA(0,1,0)(0,1,0)[12] : -399.6777
## ARIMA(1,1,0)(1,1,0)[12] : -536.1276
## ARIMA(0,1,1)(0,1,1)[12] : -657.6893
## ARIMA(0,1,1)(0,1,0)[12] : -471.4437
## ARIMA(0,1,1)(1,1,1)[12] : -643.515

## ARIMA(0,1,1)(0,1,2)[12] : -661.887
## ARIMA(0,1,1)(1,1,2)[12] : -647.0125
## ARIMA(0,1,0)(0,1,2)[12] : Inf
## ARIMA(1,1,1)(0,1,2)[12] : -658.9943
## ARIMA(0,1,2)(0,1,2)[12] : -659.9056
## ARIMA(1,1,0)(0,1,2)[12] : Inf
## ARIMA(1,1,2)(0,1,2)[12] : -656.8534
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(0,1,1)(0,1,2)[12] : Inf
## ARIMA(0,1,2)(0,1,2)[12] : Inf
## ARIMA(1,1,1)(0,1,2)[12] : Inf
## ARIMA(0,1,1)(0,1,1)[12] : -700.5158
##
## Best model: ARIMA(0,1,1)(0,1,1)[12]
summary(fitauto1)
## Series: gas1
## ARIMA(0,1,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## -0.3755 -0.8586
## s.e. 0.0450 0.0450
##
## sigma^2 estimated as 0.01235: log likelihood=353.28
## AIC=-700.57 AICc=-700.52 BIC=-688.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 0.001163337 0.1093533 0.0766936 0.01790397 0.5343536
## MASE ACF1
## Training set 0.3532688 0.002647187

hist(fitauto1$residuals, col="green")

ts.plot(gas1,fitted(fitauto1), col=c("green","blue"))

acf(fitauto1$residuals)

Box.test(fitauto1$residuals, lag=30, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: fitauto1$residuals
## X-squared = 62.654, df = 30, p-value = 0.0004342
checkresiduals(fitauto)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)(0,1,1)[12]
## Q* = 64.344, df = 20, p-value = 1.484e-06
##
## Model df: 4. Total lags used: 24
##############################################################################

#Subset of dataset
gassub<-window(gas,start=c(1990,1))
plot(gassub)

#Check for stationarity of data
adf.test(gassub)
## Warning in adf.test(gassub): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: gassub
## Dickey-Fuller = -6.1377, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
#Stationary

#Auto correlation of lag 30

acf((gassub), lag=30)

pacf((gassub), lag=30)

plot(gassub)

#arima (p,d,q)
gas.arima.fit<-arima(gassub, c(0,0,2)) #With AR & MA, without differencing
#gas.arima.fit<-arima(gassub, c(2,1,1)) #With AR & MA, with differencing
#gas.arima.fit<-arima(gassub, c(2,1,2)) #With AR & MA, with differencing

summary(gas.arima.fit)
##
## Call:
## arima(x = gassub, order = c(0, 0, 2))
##
## Coefficients:
## ma1 ma2 intercept
## 0.7535 0.6040 47943.139
## s.e. 0.2255 0.1659 1368.765

##
## sigma^2 estimated as 23411514: log likelihood = -674, aic = 1356.01
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 69.1481 4838.545 3996.105 -1.304102 8.588294 1.015244
## ACF1
## Training set 0.3029303
hist(gas.arima.fit$residuals, col = "blue")

#Testing the fit with original series

ts.plot(gassub, fitted(gas.arima.fit), col=c("blue", "red"))

fitted(gas.arima.fit)
## Jan Feb Mar Apr May Jun Jul
## 1990 45842.57 43030.72 43553.54 46669.04 45350.29 51170.59 56281.25
## 1991 39165.67 38971.49 43302.54 43362.56 44659.88 51285.83 49773.07
## 1992 41886.55 41223.01 44268.62 43040.01 44866.15 51621.74 56061.28
## 1993 44069.48 40954.01 41454.90 38291.14 44231.68 54631.62 53415.67
## 1994 45211.37 42063.41 43585.63 49479.11 47431.35 51641.71 57160.95
## 1995 40847.17 41773.77 48169.73 46341.62 48818.80 55437.92 57405.53
## Aug Sep Oct Nov Dec
## 1990 51617.45 53347.22 46193.00 42863.00 46416.83
## 1991 50628.70 54939.46 46557.13 42842.83 46176.57
## 1992 54135.53 53712.14 52842.70 45157.23 42431.61
## 1993 51915.17 52040.22 49078.11 46779.15 46938.07
## 1994 55647.59 57353.02 53250.14 48189.04 49562.87
## 1995 58677.16
#Test auto correlation in residuals to check the fit

acf(gas.arima.fit$residuals)

#Portmanteau test : Ljung Box method used :H0 : residuals are independent
Box.test(gas.arima.fit$residuals,lag = 30, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: gas.arima.fit$residuals
## X-squared = 213.87, df = 30, p-value < 2.2e-16
#Adding seasonal component if required
gas.arima.fit.s <- arima(gassub, c(0,1,1), seasonal = list(order=c(1,1,0), period=12))
gas.arima.fit.s
##
## Call:
## arima(x = gassub, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 0), period = 12))

##
## Coefficients:
## ma1 sar1
## -0.6974 -0.5136
## s.e. 0.1167 0.1208
##
## sigma^2 estimated as 11007538: log likelihood = -526.11, aic = 1058.21
summary(gas.arima.fit.s)
##
## Call:
## arima(x = gassub, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 0), period = 12))
##
## Coefficients:
## ma1 sar1
## -0.6974 -0.5136
## s.e. 0.1167 0.1208
##
## sigma^2 estimated as 11007538: log likelihood = -526.11, aic = 1058.21
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 448.4043 2983.901 1910.476 0.6358374 4.139159 0.4853723
## ACF1
## Training set -0.01701014
hist(gas.arima.fit.s$residuals, col = "blue")

ts.plot(gassub, fitted(gas.arima.fit.s), col=c("blue", "red"))

acf(gas.arima.fit.s$residuals)

Box.test(gas.arima.fit.s$residuals, lag = 30, type = "Ljung-Box")


##
## Box-Ljung test
##
## data: gas.arima.fit.s$residuals
## X-squared = 28.547, df = 30, p-value = 0.5415
#auto-arima
fitauto=auto.arima(gassub, seasonal = TRUE, trace = T)
##
## ARIMA(2,1,2)(1,1,1)[12] : Inf
## ARIMA(0,1,0)(0,1,0)[12] : 1081.687
## ARIMA(1,1,0)(1,1,0)[12] : 1065.345
## ARIMA(0,1,1)(0,1,1)[12] : Inf
## ARIMA(1,1,0)(0,1,0)[12] : 1074.33

## ARIMA(1,1,0)(1,1,1)[12] : Inf
## ARIMA(1,1,0)(0,1,1)[12] : Inf
## ARIMA(0,1,0)(1,1,0)[12] : 1075.677
## ARIMA(2,1,0)(1,1,0)[12] : 1065.765
## ARIMA(1,1,1)(1,1,0)[12] : 1060.953
## ARIMA(1,1,1)(0,1,0)[12] : 1072.152
## ARIMA(1,1,1)(1,1,1)[12] : Inf
## ARIMA(1,1,1)(0,1,1)[12] : Inf
## ARIMA(0,1,1)(1,1,0)[12] : 1058.684
## ARIMA(0,1,1)(0,1,0)[12] : 1070.051
## ARIMA(0,1,1)(1,1,1)[12] : Inf
## ARIMA(0,1,2)(1,1,0)[12] : 1060.965
## ARIMA(1,1,2)(1,1,0)[12] : 1063.06
##
## Best model: ARIMA(0,1,1)(1,1,0)[12]
summary(fitauto)
## Series: gassub
## ARIMA(0,1,1)(1,1,0)[12]
##
## Coefficients:
## ma1 sar1
## -0.6974 -0.5136
## s.e. 0.1167 0.1208
##
## sigma^2 estimated as 11423571: log likelihood=-526.11
## AIC=1058.21 AICc=1058.68 BIC=1064.24
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 448.4043 2983.901 1910.476 0.6358374 4.139159 0.5551405
## ACF1
## Training set -0.01701014
hist(fitauto$residuals, col="green")

acf(fitauto$residuals)

Box.test(fitauto$residuals, lag=30, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: fitauto$residuals
## X-squared = 28.547, df = 30, p-value = 0.5415
checkresiduals(fitauto)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(1,1,0)[12]
## Q* = 11.922, df = 12, p-value = 0.452
##
## Model df: 2. Total lags used: 14
ts.plot(gassub,fitted(fitauto), col=c("green","blue"))

## Forecast
#After ensuring model is stable and accurate, forecast for next 12 intervals
fct1=forecast(gas.arima.fit.s, h=12)

fct1$mean
## Jan Feb Mar Apr May Jun Jul
## 1995
## 1996 44630.39 44826.00 50464.32 51405.98 59660.56 63580.35 68333.48
## Aug Sep Oct Nov Dec
## 1995 58353.09 54446.75 52111.62 45010.61
## 1996 65892.39
plot(forecast(gas.arima.fit.s, h = 12)) # h belongs inside forecast(); passing it to plot() triggers "not a graphical parameter" warnings

summary(gas.arima.fit.s)
## Call:
## arima(x = gassub, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 0), period = 12))
## Coefficients:
## ma1 sar1
## -0.6974 -0.5136
## s.e. 0.1167 0.1208
## sigma^2 estimated as 11007538: log likelihood = -526.11, aic = 1058.21
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 448.4043 2983.901 1910.476 0.6358374 4.139159 0.4853723
## ACF1
## Training set -0.01701014
#######################################################################

