
Multivariate Time Series Analysis and Forecasting

Lecture 4: Topics in Forecasting

Prof. Dr. Kai Carstensen

Christian Albrechts University Kiel

Summer Term 2017

Reading List

F.X. Diebold and R.S. Mariano (1995), Comparing predictive accuracy, Journal of Business and Economic Statistics 13, 253-263.
R.S. Mariano (2007), Testing forecast accuracy, in: M.P. Clements and D.F. Hendry (eds.), A Companion to Economic Forecasting, Blackwell.
F.X. Diebold and J.A. Lopez (1995), Forecast evaluation and combination, Federal Reserve Bank of New York Research Paper No. 9525.
G. Elliott and A. Timmermann (2008), Economic forecasting, Journal of Economic Literature 46, 3-56.
A. Timmermann (2006), Forecast combinations, in: Handbook of Economic Forecasting, Vol. 1, 99-134.
A.J. Patton and A. Timmermann (2010), Generalized forecast errors, a change of measure, and forecast optimality conditions, in: T. Bollerslev, J.R. Russell and M.W. Watson (eds.), Volatility and Time Series Econometrics: Essays in Honor of Robert F. Engle, Oxford University Press.
A.J. Patton and A. Timmermann (2007), Testing forecast optimality under unknown loss, Journal of the American Statistical Association 102(480), 1172-1184.


1. Example: Forecasting US inflation and unemployment


Green Book forecast of US unemployment rate and inflation rate

- Staff forecasts at the Fed.
- Input to each regularly scheduled FOMC meeting.
- Data are released with a 5-year lag.
- Past forecasts are publicly available.
- Quarterly data: 1980Q1-2009Q4.

US unemployment rate (in percent)
1-step Green Book forecast

[Figure: US unemployment rate, actual (blue) and 1-step Green Book forecast (red), 1980-2010]
[Figure: forecast error of the 1-step Green Book forecast, 1980-2010]

US inflation rate (quarter-on-quarter rate in percent, annualized)
1-step Green Book forecast

[Figure: US inflation rate, actual (blue) and 1-step Green Book forecast (red), 1980-2010]
[Figure: forecast error of the 1-step Green Book forecast, 1980-2010]

VAR forecast of US unemployment rate and inflation rate

- 3 variables: unemployment rate, CPI inflation rate, Fed funds rate
- Quarterly data: 1955Q1-2015Q3
- Estimation sample recursively increased from 1955Q1-1984Q4 to 1955Q1-2015Q2
- 1-step forecasts for 1985Q1 to 2015Q3
- p = 4 lags
- Estimation by OLS (a code sketch of the recursive scheme follows below)
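A minimal sketch of this recursive exercise, assuming the three series sit in a pandas DataFrame `data` (the column layout and the `first_end` split index are hypothetical, not from the slides):

```python
# Recursive 1-step VAR(p) forecasts on an expanding estimation window.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

def recursive_var_forecasts(data: pd.DataFrame, first_end: int, p: int = 4):
    """Re-estimate a VAR(p) by OLS on t = 1..end, store the 1-step forecast."""
    forecasts = []
    for end in range(first_end, len(data)):
        fit = VAR(data.iloc[:end]).fit(p)          # expanding estimation sample
        # 1-step forecast from the last p observations of the estimation sample
        forecasts.append(fit.forecast(data.values[end - p:end], steps=1)[0])
    return np.array(forecasts)                     # one row per forecast origin
```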

US unemployment rate (in percent)
1-step forecast from recursively estimated VAR

[Figure: US unemployment rate, actual (blue) and 1-step VAR forecast (red), 1985-2015]
[Figure: forecast error of the 1-step forecast from the VAR with 4 lags, 1985-2015]

US inflation rate (quarter-on-quarter rate in percent)
1-step forecast from recursively estimated VAR

[Figure: US inflation rate, actual (blue) and 1-step VAR forecast (red), 1985-2015]
[Figure: forecast error of the 1-step forecast from the VAR with 4 lags, 1985-2015]


Are these good forecasts?

Which one is better?

Should we use (one of) them in practice?


2. Forecasting and forecast loss

Notation

- $h$: forecast horizon
- $Y_{t+h}$: random variable of the information set $I_{t+h}$ that generates the value to be forecast (realization: $y_{t+h}$)
- $Y_{t+h|t}$: forecast of $Y_{t+h}$ using the information set $I_t$
- The forecast is typically based on population parameters which are unknown to the forecaster and have to be estimated.

Loss function

- The real-valued loss function describes how bad a forecast might be.
- General formulation:
  $$L(Y_{t+h|t}, Y_{t+h}, W_t)$$
  Hence, in general the loss depends on the forecast $Y_{t+h|t}$, the target variable $Y_{t+h}$, and (some) data $W_t$ of the information set $I_t$.
- Many important loss functions can be simplified to
  $$L(Y_{t+h} - Y_{t+h|t}) = L(e_{t+h|t}),$$
  where $e_{t+h|t} = Y_{t+h} - Y_{t+h|t}$ is the forecast error.
- As an important exception, the loss may depend on the state of the economy at the time of the forecast (boom/recession, hyperinflation/deflation). Then the simplification does not hold.

Properties of loss functions $L(Y_{t+h|t}, Y_{t+h}, W_t)$

- Normalization: $L(y_{t+h}, y_{t+h}, w_t) = 0$
- Uniqueness: $L(y_{t+h|t}, y_{t+h}, w_t) > 0$ for all $y_{t+h|t} \neq y_{t+h}$
- Existence of expected loss:
  $$E[L(Y_{t+h|t}, Y_{t+h}, W_t) \mid I_t] = \int_{y_{t+h}} L(Y_{t+h|t}, y_{t+h}, W_t)\, p_Y(y_{t+h} \mid I_t)\, dy_{t+h}$$
  exists, i.e., the integral is finite. Note that $p_Y(y_{t+h} \mid I_t)$ is the conditional density of $Y_{t+h}$ given the information set $I_t$. Existence might be an issue if the conditional distribution is fat-tailed and/or the loss in the tails of the distribution is very large.
- Symmetry: $L(y_{t+h} + c, y_{t+h}, w_t) = L(y_{t+h} - c, y_{t+h}, w_t)$. This might be a strong assumption in many economic applications (government budget forecasts, the ECB's inflation forecast, a firm's demand forecast, a bank's asset price forecast). Nevertheless, it is standard in many empirical studies.

Optimal forecast

- Even when conditioning on the available information $I_t$, forecast loss is a random variable because it depends on $Y_{t+h}$, which is unknown in period $t$.
- Therefore, an optimal forecast $Y^\ast_{t+h|t}$ typically minimizes expected loss (other approaches like minimizing median or maximum loss are possible but not popular).
- To simplify notation, let us write $\hat{Y} \equiv Y_{t+h|t}$ and $Y \equiv Y_{t+h}$ in the following.
- The first derivative of the objective function is
  $$\frac{d}{d\hat{Y}}\, E[L(\hat{Y}, Y, W_t) \mid I_t] = \frac{d}{d\hat{Y}} \int_y L(\hat{Y}, y, W_t)\, p_Y(y \mid I_t)\, dy.$$
- Assuming interchangeability of differentiation and integration, it simplifies to
  $$\int_y \frac{d L(\hat{Y}, y, W_t)}{d\hat{Y}}\, p_Y(y \mid I_t)\, dy = E[L'(\hat{Y}, Y, W_t) \mid I_t].$$
- Then the FOC is simply
  $$E[L'(\hat{Y}, Y, W_t) \mid I_t] = 0.$$
Example: Optimal forecast under MSE loss

- MSE loss is $L(\hat{Y}, Y, W_t) = L(Y - \hat{Y}) = (Y - \hat{Y})^2$.
- First derivative: $L'(\hat{Y}, Y, W_t) = -2(Y - \hat{Y})$.
- This yields the FOC
  $$E[L'(\hat{Y}, Y, W_t) \mid I_t] = E[-2(Y - \hat{Y}) \mid I_t] = 0,$$
  which is solved by
  $$\hat{Y} = E[Y \mid I_t].$$
- Result: under MSE loss the optimal predictor of $Y_{t+h}$ is the conditional expectation (given the information set available to the forecaster).
- This is a general result that does not only apply to VAR models (for which it was shown in lecture 1).

Feasible forecast

- In practice, the conditional expectation depends on unknown parameters $\theta$. Recall the VAR model.
- To make the optimal forecast feasible, we thus have to use an estimator $\hat{\theta}$.
- As shown for the VAR model, estimation increases forecast uncertainty.
- Unfortunately, combining an (in some sense) optimal estimator with an optimal but infeasible predictor does not necessarily lead to a feasible predictor with the smallest possible expected loss.
- Hence, there is ample room for many different estimation-plus-forecasting approaches, which is why this field is very active.


3. Specific loss functions

Mean squared error loss

$$L_{\text{MSE}}(\hat{Y}, Y, W_t) = L_{\text{MSE}}(Y - \hat{Y}) = (Y - \hat{Y})^2$$

- Forecast errors are penalized quadratically.
- Is this too strong? Or too weak?
- As shown, MSE loss leads to the optimal forecast $\hat{Y} = E[Y \mid I_t]$.

Mean absolute error loss

$$L_{\text{MAE}}(\hat{Y}, Y, W_t) = L_{\text{MAE}}(Y - \hat{Y}) = |Y - \hat{Y}|$$

- Large forecast errors are penalized less heavily than under MSE loss.
- Not easily differentiable at $\hat{Y} = Y$ (no interchangeability of differentiation and integration).
- For all continuous distributions $p_Y(y \mid I_t)$, the optimal forecast is the conditional median of $Y$.

Generalized error loss function

$$L_{\text{GEL}}(\hat{Y}, Y, W_t; \alpha, p) = L_{\text{GEL}}(Y - \hat{Y}; \alpha, p) = [\alpha + (1 - 2\alpha)\,\mathbf{1}_{Y < \hat{Y}}]\; |Y - \hat{Y}|^p$$

- This is a very flexible loss function that only depends on the forecast error $e = Y - \hat{Y}$.
- It is symmetric for $\alpha = 0.5$.
- Not easily differentiable at $\hat{Y} = Y$ (no interchangeability of differentiation and integration).
- Nests MAE loss ($\alpha = 0.5$, $p = 1$) and MSE loss ($\alpha = 0.5$, $p = 2$).
- $p = 1$: lin-lin loss (asymmetric piecewise linear loss).
- $p = 2$: quad-quad loss (asymmetric piecewise quadratic loss).

Symmetric loss functions

[Figure: symmetric loss functions ($\alpha = 0.5$): MSE loss, MAE loss, GE loss ($p = 1.5$), GE loss ($p = 3$)]

Asymmetric loss functions

[Figure: asymmetric loss functions ($\alpha = 0.25$): quad-quad loss, lin-lin loss, GE loss ($p = 1.5$), GE loss ($p = 3$)]

Linex loss

$$L_{\text{Linex}}(\hat{Y}, Y, W_t; b) = L_{\text{Linex}}(Y - \hat{Y}; b) = \exp(b(Y - \hat{Y})) - b(Y - \hat{Y}) - 1$$

- Even more asymmetric
- $b < 0$: approximately exponential for $Y < \hat{Y}$, approximately linear for $Y > \hat{Y}$
- $b > 0$: approximately exponential for $Y > \hat{Y}$, approximately linear for $Y < \hat{Y}$
- Easily differentiable
- Does not nest MSE and MAE

Linex loss functions

[Figure: Linex loss functions for b = 1.5, 1.75, and 2]

Direction of change loss

$$L_{\text{DoC}}(\hat{Y}, Y, W_t) = \begin{cases} 0 & \text{if } \operatorname{sign}(\hat{Y}) = \operatorname{sign}(Y) \\ 1 & \text{otherwise} \end{cases}$$

- Important for financial market forecasts
- Example: let $Y$ be the yield change of a financial asset. In some cases, all you need to get right is the direction of the yield change.
- Note: this loss function does not have a unique minimum!

Weighted MSE loss

$$L_{\text{WMSE}}(\hat{Y}, Y, W_t) = W_t\, (Y - \hat{Y})^2$$

- $W_t$ depends on the information set at the forecast origin, $I_t$. It might be a state variable.
- Weighting can be applied not only to MSE loss but to any other appropriate loss function.
- Idea: sometimes interest rests on good prediction in particular states of the economy (or at important events).
- Example: precise forecasts might be particularly important in recessions (and/or booms), much more than in normal times.

- An attractive way to obtain weights is the empirical distribution (or density) function.
- Suppose $\{Y_t\}$ (which is a subset of $W_t$) is strictly stationary and can be interpreted as a state variable (e.g. GDP).
- $W_t = 1 - f(Y_t)/\max f(Y_t)$, where $f(\cdot)$ is the density function of $Y_t$, focuses on both tails of the distribution of $Y_t$.
- $W_t = 1 - F(Y_t)$, where $F(\cdot)$ is the cumulative distribution function of $Y_t$, focuses on the left tail of the distribution of $Y_t$.
- $W_t = F(Y_t)$ focuses on the right tail of the distribution of $Y_t$.
- See van Dijk and Franses (2003) and Carstensen, Wohlrabe, Ziegler (2011). (A code sketch follows below.)
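A minimal sketch of these weighting schemes, assuming `y` is a sample of the stationary state variable; a kernel density estimate stands in for $f(\cdot)$ and ranks for the empirical cdf $F(\cdot)$:

```python
# cdf- and density-based WMSE weights for a state variable y.
import numpy as np
from scipy.stats import gaussian_kde

def recession_weights(y):
    """W_t = 1 - F(y_t): emphasizes the left tail (recessions)."""
    ranks = np.argsort(np.argsort(y)) + 1       # rank of each observation
    return 1.0 - ranks / (len(y) + 1.0)         # empirical cdf evaluated at y_t

def boom_weights(y):
    """W_t = F(y_t): emphasizes the right tail (booms)."""
    return 1.0 - recession_weights(y)

def tail_weights(y):
    """W_t = 1 - f(y_t)/max f(y_t): emphasizes both tails."""
    f = gaussian_kde(y)(y)                      # density estimate at each y_t
    return 1.0 - f / f.max()
```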

Weighted MSE loss
Example: euro area industrial production

[Figure: euro area industrial production, year-on-year growth rate, 1992M2-2009M6]

Weighted MSE loss
Example: euro area industrial production

[Figure: empirical cdf of euro area industrial production]

Weighted MSE loss
Example: euro area industrial production

[Figure: boom and recession weights based on the empirical cdf]

The use of loss functions in practice

- Find (and use) the optimal forecast based on a specific loss function appropriate for the forecasting problem at hand.
- Based on a sample of forecasts of forecaster j, check her forecast efficiency by comparing forecasts and realizations (assuming she used a specific loss function).
- Based on a sample of forecasts of forecaster j, infer her loss function by comparing forecasts and realizations (assuming she used an optimal forecast).
- Compare different forecasts or forecasting models for a given loss function in order to identify the best model.


4. Finding optimal forecasts

How to find an optimal forecast in theory

- State the loss function and derive the FOC.
- Find the distribution of the target variable.
- Solve.
- For some loss functions we can state closed-form solutions for the optimal forecasts without knowing the distribution of the target variable.

Example 1: Forecasting a Bernoulli variable

Suppose we want to forecast an iid Bernoulli random variable $X$ that takes the value 1 with probability $p$ and the value 0 with probability $1 - p$.

The optimal forecast $\hat{x}$ depends on the loss function:

- MSE loss leads to $\hat{x} = E(X) = p$.
- MAE loss leads to $\hat{x} = \operatorname{Med}(X)$, which is 0 if $p < 0.5$ and 1 if $p > 0.5$.
- Linlin loss leads to $\hat{x} = 0$ if $p < 1 - \alpha$ and $\hat{x} = 1$ if $p > 1 - \alpha$.
- Linex loss leads to $\hat{x} = \log(1 + p \exp(a) - p)/a$.

Proofs: shown in class or tutorial. (A numerical check follows below.)
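Not a proof, but a quick numerical check of these closed forms: minimize expected loss over a grid of candidate forecasts (the parameter values below are illustrative):

```python
# Grid-minimize E[L(X - xhat)] for X ~ Bernoulli(p) under four losses.
import numpy as np

p, alpha, a = 0.25, 0.8, 1.5
grid = np.linspace(-0.5, 1.5, 20001)            # candidate forecasts xhat

def expected_loss(loss):
    # E[L(X - xhat)] = p * L(1 - xhat) + (1 - p) * L(-xhat)
    return p * loss(1.0 - grid) + (1.0 - p) * loss(-grid)

cases = {
    "MSE":    (expected_loss(lambda e: e**2), p),
    "MAE":    (expected_loss(np.abs), float(p > 0.5)),
    "Linlin": (expected_loss(lambda e: (alpha + (1 - 2*alpha)*(e < 0)) * np.abs(e)),
               float(p > 1 - alpha)),
    "Linex":  (expected_loss(lambda e: np.exp(a*e) - a*e - 1),
               np.log(1 - p + p*np.exp(a)) / a),
}
for name, (el, closed) in cases.items():
    print(f"{name}: grid argmin {grid[el.argmin()]:.3f}, closed form {closed:.3f}")
```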

Example 1: Forecasting a Bernoulli variable

[Figure: expected loss as a function of the prediction for a Bernoulli random variable ($p = 0.25$) under MSE, MAE, Linex ($a = 1.5$), and Linlin ($\alpha = 0.8$ and $\alpha = 0.1$) loss]

Example 1: Forecasting a Bernoulli variable

[Figure: expected loss as a function of the prediction for a Bernoulli random variable ($p = 0.6$) under MSE, MAE, Linex ($a = 1.5$), and Linlin ($\alpha = 0.8$ and $\alpha = 0.1$) loss]

Optimal forecast under MSE loss

Loss function:
$$L_{\text{MSE}}(Y_{t+h} - Y_{t+h|t}) = (Y_{t+h} - Y_{t+h|t})^2$$

As shown above, the optimal forecast is
$$Y^\ast_{t+h|t} = E[Y_{t+h} \mid I_t],$$
irrespective of the distribution of $Y_{t+h}$.

Optimal forecast under MAE loss

Loss function:
$$L_{\text{MAE}}(Y_{t+h} - Y_{t+h|t}) = |Y_{t+h} - Y_{t+h|t}|$$

The optimal forecast is
$$Y^\ast_{t+h|t} = \operatorname{Med}[Y_{t+h} \mid I_t],$$
irrespective of the distribution of $Y_{t+h}$.

Proof of $Y^\ast_{t+h|t} = \operatorname{Med}[Y_{t+h} \mid I_t]$:

Assumption: the conditional distribution of $Y_{t+h}$ is continuous.

Step 1: rewrite expected loss, abbreviating $E(\cdot \mid I_t)$ by $E_t(\cdot)$, $Y_{t+h}$ by $Y$, and $Y_{t+h|t}$ by $\hat{Y}$:
$$E_t(|Y - \hat{Y}|) = \int_{-\infty}^{\infty} |y - \hat{Y}|\, f_t(y)\, dy = \int_{-\infty}^{\hat{Y}} (\hat{Y} - y) f_t(y)\, dy + \int_{\hat{Y}}^{\infty} (y - \hat{Y}) f_t(y)\, dy,$$
where $f_t(y)$ is the conditional density of $Y_{t+h}$ given $I_t$. For later use, also define $F_t(y)$ as the conditional cdf of $Y_{t+h}$ given $I_t$.

Step 2: use Leibniz's rule for differentiation,
$$\frac{\partial}{\partial x} \int_{a(x)}^{b(x)} g(y, x)\, dy = \int_{a(x)}^{b(x)} \frac{\partial g(y, x)}{\partial x}\, dy + g(b(x), x)\, \frac{\partial b(x)}{\partial x} - g(a(x), x)\, \frac{\partial a(x)}{\partial x}.$$

Here:
$$\frac{\partial}{\partial \hat{Y}} \int_{-\infty}^{\hat{Y}} (\hat{Y} - y) f_t(y)\, dy = \int_{-\infty}^{\hat{Y}} \frac{\partial (\hat{Y} - y)}{\partial \hat{Y}}\, f_t(y)\, dy + \underbrace{(\hat{Y} - \hat{Y}) f_t(\hat{Y})}_{=0} \cdot 1 - 0 = \int_{-\infty}^{\hat{Y}} f_t(y)\, dy = F_t(\hat{Y})$$

and
$$\frac{\partial}{\partial \hat{Y}} \int_{\hat{Y}}^{\infty} (y - \hat{Y}) f_t(y)\, dy = \int_{\hat{Y}}^{\infty} \frac{\partial (y - \hat{Y})}{\partial \hat{Y}}\, f_t(y)\, dy + 0 - 0 = -\int_{\hat{Y}}^{\infty} f_t(y)\, dy = -P_t(Y > \hat{Y}) = -[1 - P_t(Y \le \hat{Y})] = F_t(\hat{Y}) - 1.$$

Putting the results together:
$$\frac{\partial}{\partial \hat{Y}}\, E_t(|Y - \hat{Y}|) = 2 F_t(\hat{Y}) - 1.$$

Step 3: solve the FOC
$$2 F_t(\hat{Y}) - 1 = 0 \iff F_t(\hat{Y}) = \frac{1}{2}.$$

By the definition of the median,
$$\hat{Y} = \operatorname{Med}_t[Y],$$
or, in full notation,
$$Y^\ast_{t+h|t} = \operatorname{Med}[Y_{t+h} \mid I_t].$$

Optimal forecast under Linlin loss

Loss function:
$$L_{\text{LL}}(Y_{t+h} - Y_{t+h|t}) = [\alpha + (1 - 2\alpha)\,\mathbf{1}_{Y_{t+h} < Y_{t+h|t}}]\; |Y_{t+h} - Y_{t+h|t}|, \qquad \alpha \in (0, 1)$$

The optimal forecast is
$$Y^\ast_{t+h|t} = \operatorname{Quantile}_\alpha[Y_{t+h} \mid I_t],$$
irrespective of the distribution of $Y_{t+h}$.

Proof of $Y^\ast_{t+h|t} = \operatorname{Quantile}_\alpha[Y_{t+h} \mid I_t]$:

The proof is similar to the proof shown above for MAE loss. You will arrive at the condition
$$F_t(\hat{Y}) = \alpha,$$
which is solved by the $\alpha$-quantile of the conditional distribution of $Y_{t+h}$.

Optimal forecast under Linex loss

Loss function:
$$L_{\text{Linex}}(Y - \hat{Y}; b) = \exp(b(Y - \hat{Y})) - b(Y - \hat{Y}) - 1$$

The optimal forecast is
$$Y^\ast_{t+h|t} = \frac{1}{b}\, \log E[\exp(b\, Y_{t+h}) \mid I_t],$$
irrespective of the distribution of $Y_{t+h}$.

Under conditional normality, i.e., $Y_{t+h} \mid I_t \sim N(\mu_{t+h|t}, \sigma^2_{t+h|t})$, the optimal forecast is
$$Y^\ast_{t+h|t} = \mu_{t+h|t} + \frac{b}{2}\, \sigma^2_{t+h|t},$$
where $\mu_{t+h|t}$ is the conditional mean and $\sigma^2_{t+h|t}$ is the conditional variance.

Proof of the general result: shown in class.

Proof of the result under normality:

Recall the moment generating function of a conditionally normal random variable:
$$m_t(s) = E[\exp(s\, Y_{t+h}) \mid I_t] = \exp(\mu_{t+h|t}\, s + 0.5\, \sigma^2_{t+h|t}\, s^2).$$

Thus, the general result simplifies as follows:
$$Y^\ast_{t+h|t} = \frac{1}{b} \log E[\exp(b\, Y_{t+h}) \mid I_t] = \frac{1}{b} \log m_t(b) = \frac{1}{b} \log\left[\exp(\mu_{t+h|t}\, b + 0.5\, \sigma^2_{t+h|t}\, b^2)\right] = \mu_{t+h|t} + \frac{b}{2}\, \sigma^2_{t+h|t}.$$

Example 2: Forecasting a normal variable

Suppose we want to forecast a variable $Y_{t+h}$ that is conditionally normal,
$$Y_{t+h} \mid I_t \sim N(\mu_{t+h|t}, \sigma^2_{t+h|t}).$$

The optimal forecast $Y^\ast_{t+h|t}$ depends on the loss function:

- MSE loss leads to $Y^\ast_{t+h|t} = E(Y_{t+h} \mid I_t) = \mu_{t+h|t}$.
- MAE loss leads to $Y^\ast_{t+h|t} = \operatorname{Med}(Y_{t+h} \mid I_t) = \mu_{t+h|t}$.
- Linlin loss leads to $Y^\ast_{t+h|t} = \operatorname{Quantile}_\alpha[Y_{t+h} \mid I_t]$.
- Linex loss leads to $Y^\ast_{t+h|t} = \mu_{t+h|t} + \frac{b}{2}\sigma^2_{t+h|t}$.

Under MSE and MAE loss, we only need to find the conditional mean $\mu_{t+h|t}$. Under asymmetric loss, we need to find the conditional variance $\sigma^2_{t+h|t}$, too.

Example 3: Forecasting German GDP growth

Now assume the 1-year-ahead forecast of German GDP growth follows a conditional normal distribution. A forecasting model (for example, a VAR) yields
$$E(Y_{t+h} \mid I_t) = \mu_{t+h|t} = 2.0 \text{ (percent)}.$$

The conditional variance is assumed to be constant over time. Then a value of $\sigma^2_{t+h|t} = 1$ fits the empirical distribution fairly well.

What should we predict?

Example 3: Forecasting German GDP growth
Loss functions

Let us think about loss functions. For many people, overprediction (= negative surprise) is more costly than underprediction (= positive surprise).

Suppose the German finance minister has a 50 percent harder time in Parliament when MPs find out he overpredicted GDP (and thereby his budget) by 0.5 percentage points than when he underpredicted GDP by 0.5 percentage points.

Let us thus parameterize Linlin and Linex loss such that $L(-0.5) = 1.50\, L(0.5)$.

This requires $\alpha = 0.4$ for Linlin loss and $b = -1.2147$ for Linex loss.

Example 3: Forecasting German GDP growth
Loss functions

[Figure: Linlin loss with $\alpha = 0.4$ (blue) and Linex loss with $b = -1.2147$ (red)]

Linlin loss ($\alpha = 0.4$): $L(-0.5) = 1.50\, L(0.5)$ and $L(-1.0) = 1.50\, L(1.0)$
Linex loss ($b = -1.21$): $L(-0.5) = 1.50\, L(0.5)$ and $L(-1.0) = 2.26\, L(1.0)$

Example 3: Forecasting German GDP growth
Optimal forecasts

The optimal forecasts $Y^\ast_{t+h|t}$ for the different loss functions are:

- MSE loss: $Y^\ast_{t+h|t} = 2.0$.
- MAE loss: $Y^\ast_{t+h|t} = 2.0$.
- Linlin loss ($\alpha = 0.4$): $Y^\ast_{t+h|t} = 2.0 - 0.25 = 1.75$.
- Linex loss ($b = -1.21$): $Y^\ast_{t+h|t} = 2.0 - \frac{1.21}{2} \cdot 1 \approx 1.4$.

The optimal forecast under Linlin loss is calculated as follows:

We know that $Y^\ast_{t+h|t}$ is the $\alpha$-quantile of the conditional distribution. Hence,
$$\alpha = \Pr(Y \le \hat{Y}) = \Pr(Y - \mu_{t+h|t} \le \hat{Y} - \mu_{t+h|t}).$$

Noting that $Y - \mu_{t+h|t}$ follows a standard normal distribution (given $I_t$, with $\sigma^2_{t+h|t} = 1$), this simplifies to
$$\alpha = \Phi(\hat{Y} - \mu_{t+h|t}),$$
where $\Phi(\cdot)$ is the standard normal cdf, and thus
$$\Phi^{-1}(\alpha) = \hat{Y} - \mu_{t+h|t}.$$

Solving for $\hat{Y}$ yields
$$\hat{Y} = \mu_{t+h|t} + \Phi^{-1}(\alpha) = 2.0 + \Phi^{-1}(0.4) = 2.0 - 0.25 = 1.75.$$
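A quick check of these numbers with scipy (illustrative, using $\mu = 2$ and $\sigma^2 = 1$ from the example):

```python
# Optimal Linlin (0.4-quantile) and Linex forecasts under N(2, 1).
from scipy.stats import norm

mu, sigma2, alpha, b = 2.0, 1.0, 0.4, -1.2147
print(mu + norm.ppf(alpha))        # Linlin: 2.0 - 0.2533... ~ 1.75
print(mu + b * sigma2 / 2.0)       # Linex:  2.0 - 0.6074... ~ 1.39
```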


5. Properties of optimal forecasts

MSE loss

Loss: $L_{\text{MSE}}(Y_{t+h} - Y_{t+h|t}) = (Y_{t+h} - Y_{t+h|t})^2$

Optimal forecast: $Y^\ast_{t+h|t} = E[Y_{t+h} \mid I_t]$

Forecast error: $e_{t+h|t} \equiv Y_{t+h} - Y^\ast_{t+h|t}$

Properties:

- Unbiasedness: $E(e_{t+h|t}) = 0$
- Unpredictability: $\operatorname{Cov}(e_{t+h|t}, Z_t) = 0$ for all $Z_t \in I_t$
- Increasing variance: $\operatorname{Var}(e_{t+h|t}) \le \operatorname{Var}(e_{t+h+k|t})$ for all $k \ge 0$
- Note that unpredictability implies that $e_{t+h|t}$ is autocorrelated at most of order $h - 1$. Consequently, for $h = 1$ the optimal forecast error is white noise.

Proof of unbiasedness:

$$E(e_{t+h|t} \mid I_t) = E(Y_{t+h} - Y^\ast_{t+h|t} \mid I_t) = E(Y_{t+h} \mid I_t) - E(Y_{t+h} \mid I_t) = 0$$

By the LIE (law of iterated expectations),
$$E(e_{t+h|t}) = E[E(e_{t+h|t} \mid I_t)] = E[0] = 0.$$

Proof of unpredictability:

Recall that the conditional expectation is zero: $E(e_{t+h|t} \mid I_t) = 0$.

As $Z_t \in I_t$, this implies $E(e_{t+h|t} \mid Z_t) = 0$.

Hence, the covariance is zero: $\operatorname{Cov}(e_{t+h|t}, Z_t) = 0$.

Proof of increasing variance:

By the law of total variance,
$$\operatorname{Var}(e_{t+h|t}) = E[\operatorname{Var}(e_{t+h|t} \mid I_t)] + \operatorname{Var}[\underbrace{E(e_{t+h|t} \mid I_t)}_{=0}] = E[\operatorname{Var}(e_{t+h|t} \mid I_t)].$$

By theorem CV.4 of Wooldridge (2002), since $I_{t-k} \subseteq I_t$,
$$E[\operatorname{Var}(e_{t+h|t} \mid I_t)] \le E[\operatorname{Var}(e_{t+h|t} \mid I_{t-k})].$$

By strict stationarity,
$$E[\operatorname{Var}(e_{t+h|t} \mid I_{t-k})] = E[\operatorname{Var}(e_{t+h+k|t} \mid I_t)].$$

Taken together,
$$\operatorname{Var}(e_{t+h|t}) = E[\operatorname{Var}(e_{t+h|t} \mid I_t)] \le E[\operatorname{Var}(e_{t+h+k|t} \mid I_t)] = \operatorname{Var}(e_{t+h+k|t}).$$

MAE loss

Loss function: $L_{\text{MAE}}(Y_{t+h} - Y_{t+h|t}) = |Y_{t+h} - Y_{t+h|t}|$

Optimal forecast: $Y^\ast_{t+h|t} = \operatorname{Med}[Y_{t+h} \mid I_t]$

Properties:

- Median unbiasedness: $\operatorname{Med}(e_{t+h|t}) = 0$
- Median unpredictability: $E\left[\left(\mathbf{1}_{e_{t+h|t} \le 0} - \frac{1}{2}\right) Z_t\right] = 0$ for all $Z_t \in I_t$
- Increasing imprecision: $E(|e_{t+h|t}|) \le E(|e_{t+h+k|t}|)$ for all $k \ge 0$
- This implies that for asymmetric distributions, the optimal forecast under MAE loss differs from the optimal forecast under MSE loss.

Proof of median unbiasedness:

By the definition of the median,
$$0.5 = P_t(Y_{t+h} \le Y^\ast_{t+h|t}) = P_t(Y_{t+h} - Y^\ast_{t+h|t} \le 0) = P_t(e_{t+h|t} \le 0).$$

Hence, the conditional median is zero:
$$\operatorname{Med}_t(e_{t+h|t}) = 0.$$

As shown in class, the unconditional median is zero as well.

Proof of median unpredictability: shown in class

Proof of increasing imprecision: see Patton and Timmermann (2010)

Lin-lin loss

Loss function: $L_{\text{LL}}(Y_{t+h} - Y_{t+h|t}) = [\alpha + (1 - 2\alpha)\,\mathbf{1}_{Y_{t+h} < Y_{t+h|t}}]\; |Y_{t+h} - Y_{t+h|t}|$, $\alpha \in (0, 1)$

Optimal forecast: $Y^\ast_{t+h|t} = \operatorname{Quantile}_\alpha[Y_{t+h} \mid I_t]$

Properties:

- see below for general loss functions

Linex loss

Loss function: $L_{\text{Linex}}(Y_{t+h} - Y_{t+h|t}) = \exp(b(Y_{t+h} - Y_{t+h|t})) - b(Y_{t+h} - Y_{t+h|t}) - 1$

Optimal forecast: $Y^\ast_{t+h|t} = \frac{1}{b} \log E[\exp(b\, Y_{t+h}) \mid I_t]$

Properties:

- see below for general loss functions

General loss function

General loss: $L(Y_{t+h} - Y_{t+h|t})$

Optimal forecast: under some regularity conditions, the optimal forecast satisfies
$$E\left(\partial L(Y_{t+h} - Y_{t+h|t})/\partial Y_{t+h|t} \mid I_t\right) = 0$$

Generalized forecast error: $\psi_{t+h|t} \equiv \partial L(Y_{t+h} - Y_{t+h|t})/\partial Y_{t+h|t}$

Properties:

- Unbiasedness: $E(\psi_{t+h|t}) = E(\psi_{t+h|t} \mid I_t) = 0$
- Unpredictability: $\operatorname{Cov}(\psi_{t+h|t}, Z_t) = 0$ for all $Z_t \in I_t$
- Increasing loss: $E(L(Y_{t+h} - Y^\ast_{t+h|t})) \le E(L(Y_{t+h} - Y^\ast_{t+h|t-k}))$ for all $k \ge 0$
- Note that unpredictability implies that $\psi_{t+h|t}$ is autocorrelated at most of order $h - 1$. Consequently, for $h = 1$ the optimal generalized forecast error is white noise.

Unknown loss function

- Often the forecast producer (e.g., the Fed) is separate from the forecast consumer (e.g., us).
- In this case, the loss function used to obtain the forecast may be unknown.
- Still, under mild conditions, we can derive some properties an optimal forecast must satisfy. This allows us to test whether the forecast producer did a good job.
- See Patton and Timmermann (2007) for details.
- (Note: Patton and Timmermann (2007) derive more results than the ones we consider in the following.)

Unknown loss function
Assumptions

- Assumption on the data generating process: the target variable is conditionally homoskedastic, i.e.,
  $$Y_{t+h} = \mu_{t+h|t} + \varepsilon_{t+h}, \qquad \varepsilon_{t+h} \mid I_t \sim F_{\varepsilon,h}(0, \sigma^2_{\varepsilon,h}),$$
  where $F_{\varepsilon,h}(0, \sigma^2_{\varepsilon,h})$ is some distribution with mean zero and variance $\sigma^2_{\varepsilon,h}$, which may depend on $h$ but does not depend on $I_t$.
- Assumption on the loss function: the loss function is a function solely of the forecast error, i.e.,
  $$L(Y_{t+h} - Y_{t+h|t}) = L(e_{t+h|t}).$$

Unknown loss function
Properties

1. The optimal forecast takes the form
   $$Y^\ast_{t+h|t} = \mu_{t+h|t} + \alpha_h,$$
   where $\alpha_h$ is a constant that depends on $L$ and $F_{\varepsilon,h}$, but not on $I_t$.

2. The optimal forecast error $e^\ast_{t+h|t}$ is independent of all $Z_t \in I_t$, since
   $$e^\ast_{t+h|t} = Y_{t+h} - Y^\ast_{t+h|t} = -\alpha_h + \varepsilon_{t+h},$$
   where $\varepsilon_{t+h} \mid I_t \sim F_{\varepsilon,h}(0, \sigma^2_{\varepsilon,h})$. Thus,
   $$E(e^\ast_{t+h|t} \mid I_t) = -\alpha_h.$$

Unknown loss function
Properties (continued)

3. The optimal forecast is such that, for all $t$,
   $$F_{t+h|t}(Y^\ast_{t+h|t}) = q_h,$$
   where $q_h \in (0, 1)$ depends only on the forecast horizon and the loss function. If $F_{t+h|t}$ is continuous and strictly increasing, then we obtain
   $$Y^\ast_{t+h|t} = F^{-1}_{t+h|t}(q_h) = \operatorname{Quantile}_{q_h}(Y_{t+h} \mid I_t).$$

4. The variable
   $$I_{t+h|t} = \mathbf{1}_{Y_{t+h} \le Y^\ast_{t+h|t}} = \mathbf{1}_{e^\ast_{t+h|t} \le 0}$$
   is independent of all $Z_t \in I_t$, and thus $\operatorname{Cov}(I_{t+h|t}, Z_t) = 0$.


Unknown loss function


Proofs

See Patton and Timmermann (2007).


6. Evaluating the efficiency of a forecast

Notation

- Forecast horizon: $h$
- Forecast sample: $t + h = T_1, \ldots, T_2$
- Number of h-step forecasts starting from the baseline sample: $T = T_2 - T_1 + 1$

MSE loss
Test of unbiasedness

Hypotheses: $H_0: E(e_{t+h|t}) = 0$ versus $H_1: E(e_{t+h|t}) \neq 0$

The sample equivalent of the expected value is
$$\bar{e}_h = T^{-1} \sum_{t+h=T_1}^{T_2} e_{t+h|t}.$$

Asymptotic distribution (as $T \to \infty$):
$$\sqrt{T}\, \bar{e}_h \xrightarrow{d} N(0, \omega^2),$$
where
$$\omega^2 = \lim_{T \to \infty} \operatorname{Var}(\sqrt{T}\, \bar{e}_h)$$
is the long-run variance of $e_{t+h|t}$.

MSE loss
Test of unbiasedness (continued)

Note that for $h > 1$, the $e_{t+h|t}$'s are autocorrelated. Therefore, to estimate $\omega^2$ you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$ because higher-order autocorrelation is excluded under the null).

Test statistic:
$$t = \frac{\bar{e}_h}{\hat{\omega}/\sqrt{T}}$$

The test statistic is asymptotically $N(0, 1)$. (A code sketch follows below.)
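A minimal sketch of this test, assuming `e` holds the h-step forecast errors; regressing on a constant with HAC standard errors delivers the robust t-statistic:

```python
# t-test of H0: E(e_{t+h|t}) = 0 with a Newey-West variance (lag window h-1).
import numpy as np
import statsmodels.api as sm

def unbiasedness_test(e, h):
    X = np.ones((len(e), 1))                     # regression on a constant only
    res = sm.OLS(e, X).fit(cov_type="HAC", cov_kwds={"maxlags": h - 1})
    return float(res.tvalues[0]), float(res.pvalues[0])  # asymptotically N(0,1)
```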

MSE loss
Test of unpredictability

Null hypothesis: $\operatorname{Cov}(e_{t+h|t}, Z_t) = 0$ for all $Z_t \in I_t$

Implement as a regression on the known forecast error and other variables $Z_{it} \in I_t$:
$$e_{t+h|t} = \beta_0 + \beta_1 e_{t|t-h} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_0 = \beta_1 = \gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+2)$ distributed.

Note that for $h > 1$, the $e_{t+h|t}$'s (and thus the $v_{t+h}$'s) are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

MSE loss
Test of unpredictability: Mincer-Zarnowitz regression

Similar regression approach (pioneered by J. Mincer and V. Zarnowitz, 1969):
$$Y_{t+h} = \beta_0 + \beta_1 Y_{t+h|t} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_0 = 0$ and $\beta_1 = 1$ versus $H_1: \text{not } H_0$

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(2)$ distributed. (A code sketch follows below.)

Note again that for $h > 1$, the $v_{t+h}$'s are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).
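A minimal sketch of the Mincer-Zarnowitz test, assuming arrays `y` (realizations) and `yhat` (forecasts) over the forecast sample:

```python
# Mincer-Zarnowitz regression with HAC variance; Wald test of (b0, b1) = (0, 1).
import numpy as np
import statsmodels.api as sm

def mincer_zarnowitz_test(y, yhat, h):
    X = sm.add_constant(yhat)
    res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": h - 1})
    R, q = np.eye(2), np.array([0.0, 1.0])       # H0: beta0 = 0, beta1 = 1
    return res.wald_test((R, q), use_f=False)    # asymptotically chi^2(2)
```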

MSE loss
Test of unpredictability: augmented Mincer-Zarnowitz regression

Augment the Mincer-Zarnowitz regression by other variables $Z_{it} \in I_t$:
$$Y_{t+h} = \beta_0 + \beta_1 Y_{t+h|t} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_0 = 0$, $\beta_1 = 1$, and $\gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+2)$ distributed.

Note again that for $h > 1$, the $v_{t+h}$'s are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

MSE loss
Test of increasing variance

- When forecasts for multiple horizons (e.g., $h = 1, 2, 3$) are available, we can test the inequality
  $$\operatorname{Var}(e_{t+h|t}) \le \operatorname{Var}(e_{t+h+k|t}).$$
- See Patton and Timmermann (2012) for details.
- Not often found in the literature (note: variances are estimated much more imprecisely than means, hence you need a lot of data for this test to be really informative).

MAE loss
Test of median unbiasedness

Hypotheses: $H_0: E\left(\mathbf{1}_{e_{t+h|t} \le 0} - \frac{1}{2}\right) = 0$ versus $H_1: \text{not } H_0$

Let us denote
$$\lambda_{t,h} = \mathbf{1}_{e_{t+h|t} \le 0} - \frac{1}{2}.$$

The sample equivalent of the expected value is
$$\bar{\lambda}_h = T^{-1} \sum_{t+h=T_1}^{T_2} \lambda_{t,h} = T^{-1} \sum_{t+h=T_1}^{T_2} \left(\mathbf{1}_{e_{t+h|t} \le 0} - \frac{1}{2}\right) = \frac{\#(e_{t+h|t} \le 0)}{T} - \frac{1}{2}.$$

Test statistic:
$$t = \frac{\bar{\lambda}_h}{\hat{\omega}/\sqrt{T}} \xrightarrow{d} N(0, 1),$$
where $\hat{\omega}^2$ is (for $h > 1$) an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

MAE loss
Test of median unpredictability

Null hypothesis: $E\left[\left(\mathbf{1}_{e_{t+h|t} \le 0} - \frac{1}{2}\right) Z_t\right] = E[\lambda_{t,h} Z_t] = 0$ for all $Z_t \in I_t$

Implement as a regression on the known forecast error and other variables $Z_{it} \in I_t$:
$$\lambda_{t,h} = \beta_0 + \beta_1 e_{t|t-h} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_0 = \beta_1 = \gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+2)$ distributed.

Note that for $h > 1$, the $e_{t+h|t}$'s (and thus the $v_{t+h}$'s) are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

General loss function
Test of unbiasedness of the generalized forecast error

Hypotheses: $H_0: E(\psi_{t+h|t}) = 0$ versus $H_1: \text{not } H_0$

The sample equivalent of the expected value is
$$\bar{\psi}_h = T^{-1} \sum_{t+h=T_1}^{T_2} \psi_{t+h|t}.$$

Test statistic:
$$t = \frac{\bar{\psi}_h}{\hat{\omega}/\sqrt{T}} \xrightarrow{d} N(0, 1),$$
where $\hat{\omega}^2$ is (for $h > 1$) an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

General loss function
Test of unpredictability of the generalized forecast error

Null hypothesis: $\operatorname{Cov}(\psi_{t+h|t}, Z_t) = 0$ for all $Z_t \in I_t$

Implement as a regression on the known generalized forecast error and other variables $Z_{it} \in I_t$:
$$\psi_{t+h|t} = \beta_0 + \beta_1 \psi_{t|t-h} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_0 = \beta_1 = \gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+2)$ distributed.

Note that for $h > 1$, the $\psi_{t+h|t}$'s (and thus the $v_{t+h}$'s) are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

Unknown loss function
Unpredictability of the forecast error (version 1)

Null hypothesis: $H_0: E(e_{t+h|t} \mid I_t) = -\alpha_h$, which implies $\operatorname{Cov}(e_{t+h|t}, Z_t) = 0$ for all $Z_t \in I_t$

Implement as a regression on the known forecast error and other variables $Z_{it} \in I_t$:
$$e_{t+h|t} = \beta_0 + \beta_1 e_{t|t-h} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_1 = \gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Note: we do not restrict $\beta_0$ to zero!

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+1)$ distributed.

Note that for $h > 1$, the $e_{t+h|t}$'s (and thus the $v_{t+h}$'s) are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

Unknown loss function
Unpredictability of the forecast error (version 2)

Null hypothesis: $H_0: \operatorname{Cov}(I_{t+h|t}, Z_t) = \operatorname{Cov}(\mathbf{1}_{e_{t+h|t} \le 0}, Z_t) = 0$ for all $Z_t \in I_t$

Implement as a regression on the known dummy and other variables $Z_{it} \in I_t$:
$$I_{t+h|t} = \beta_0 + \beta_1 I_{t|t-h} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_1 = \gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Note: we do not restrict $\beta_0$ to zero!

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+1)$ distributed.

Note that for $h > 1$, the $I_{t+h|t}$'s (and thus the $v_{t+h}$'s) are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).


7. Inferring the loss function

References

- Elliott, Timmermann, Komunjer (2005), Estimation and testing of forecast rationality under flexible loss, Review of Economic Studies 72(4), 1107-1125.
- Elliott, Komunjer, Timmermann (2008), Biases in macroeconomic forecasts: irrationality or asymmetric loss?, Journal of the European Economic Association 6, 122-157.


8. Comparing forecast accuracy

Introduction

- There are two types of forecast comparisons:
  - Ex-post comparison: you observe two forecast series and the realizations.
  - Out-of-sample forecast experiment: you generate "as if" forecasts from different forecast methods or models to evaluate their accuracy.
- A common approach when comparing two observed forecasts is to compare their average losses.
- A popular descriptive tool is Theil's U.
- To obtain a significance statement, one may test the null hypothesis that the two observed forecasts lead to the same loss.
- The appropriate test depends on the setting. The Diebold-Mariano test is used frequently.

Ex-post comparison of two forecast series

- You observe two (or more) h-step forecasts of $Y_{t+h}$ over the sample $t + h = T_1, \ldots, T_2$.
- Example: US 4-quarter-ahead inflation forecasts published by the Fed (Green Book) and by Consensus Economics for 1980Q1-2009Q4.
- You know the realized values of $Y_{t+h}$.
- You ask: which forecast is (systematically) better?
- Dimensions of comparison:
  - Descriptive statistics (such as average losses, Theil's U)
  - Tests of forecast efficiency
  - Tests of the null hypothesis that the accuracy of two different forecasts does not differ systematically

Out-of-sample forecast experiment

- You have a sample $t = 1, \ldots, T_2$ of data $Y_t$ (and typically some covariates $X_t$).
- You ask: which model would have generated the best h-step forecasts of $Y_t$ in the past (and thus perhaps also in the future)?
- Split the sample into a baseline sample $t = 1, \ldots, T_1 - h$ and a forecast sample $t = T_1, \ldots, T_2$. Then proceed as follows:
  - Estimate model 1 from the sample $t = 1, \ldots, T_1 - h$ and compute $\hat{Y}_{1,T_1|T_1-h}$.
  - Estimate model 1 from the sample $t = 1, \ldots, T_1 - h + 1$ and compute $\hat{Y}_{1,T_1+1|T_1-h+1}$.
  - ...
  - Estimate model 1 from the sample $t = 1, \ldots, T_2 - h$ and compute $\hat{Y}_{1,T_2|T_2-h}$.
  - Based on $\hat{Y}_{1,T_1|T_1-h}, \ldots, \hat{Y}_{1,T_2|T_2-h}$, compute the losses $L_{1,T_1}, \ldots, L_{1,T_2}$.
  - Replicate this for all models.
- To compare the results, use descriptive statistics (such as average losses, Theil's U) and tests of the null hypothesis that the forecast accuracy of two models does not differ systematically.

Theil's U

Theil's U is usually defined in terms of the root of the MSE, but in principle it can be applied to any loss function:
$$U = \frac{\text{RMSE}(y_{t+h|t})}{\text{RMSE}(\tilde{y}_{t+h|t})} = \frac{\sqrt{\frac{1}{T} \sum_{t+h=T_1}^{T_2} (y_{t+h} - y_{t+h|t})^2}}{\sqrt{\frac{1}{T} \sum_{t+h=T_1}^{T_2} (y_{t+h} - \tilde{y}_{t+h|t})^2}},$$
where $\tilde{y}_{t+h|t}$ denotes a naive or simple baseline forecast, e.g.,

- $\tilde{y}_{t+h|t} = 0$
- $\tilde{y}_{t+h|t} = y_t$
- $\tilde{y}_{t+h|t}$ from an AR(1) model

(A code sketch follows below.)
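A minimal sketch, assuming arrays `y`, `yhat`, and a naive benchmark `ynaive` over the forecast sample:

```python
# Theil's U: RMSE of the candidate forecast relative to a naive benchmark.
import numpy as np

def theils_u(y, yhat, ynaive):
    rmse = lambda f: np.sqrt(np.mean((y - f) ** 2))
    return rmse(yhat) / rmse(ynaive)   # U < 1: candidate beats the benchmark

# Example with the random-walk benchmark y_{t+1|t} = y_t (1-step case):
# u = theils_u(y[1:], yhat[1:], y[:-1])
```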

Diebold-Mariano test
Setup

- Test of equal accuracy of two competing (non-nested) forecasts $y_{1,t+h|t}$ and $y_{2,t+h|t}$ for the target variable $y_{t+h}$, $t + h = T_1, \ldots, T_2$.
- Denote the forecast errors by $e_{1,t+h|t}$ and $e_{2,t+h|t}$, respectively.
- Accuracy of the forecasts is measured by some loss function $L(\cdot)$. Here, we focus on error-based loss functions $L(e_{t+h|t})$.
- The loss differential is denoted by
  $$d_{h,t} = L(e_{1,t+h|t}) - L(e_{2,t+h|t}).$$
- This leads to the average loss differential
  $$\bar{d}_h = \frac{1}{T} \sum_{t+h=T_1}^{T_2} [L(e_{1,t+h|t}) - L(e_{2,t+h|t})] = \frac{1}{T} \sum_{t+h=T_1}^{T_2} d_{h,t}, \qquad T = T_2 - T_1 + 1.$$

Diebold-Mariano test
Test statistic

- Hypotheses:
  $$H_0: E[d_{h,t}] = 0 \quad \text{vs.} \quad H_1: E[d_{h,t}] \neq 0$$
- The test statistic converges (as $T \to \infty$):
  $$DM = \frac{\bar{d}_h}{\hat{\omega}/\sqrt{T}} \xrightarrow{d} N(0, 1).$$
- This is an asymptotic result. It cannot be expected to approximate the unknown finite-sample distribution well for small forecast samples $T$.
- Note that the $d_{h,t}$'s may be autocorrelated. This holds even under MSE loss and even if $h = 1$, because we do not know whether the forecasts are optimal. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West. Set the lag window at least to $h - 1$.
- Take into account that Newey-West type variance estimators may perform poorly in small samples. Hence, interpret the test results with caution. (A code sketch follows below.)
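A minimal sketch of the DM test, with the Bartlett/Newey-West long-run variance computed inline (loss function and lag window are arguments):

```python
# Diebold-Mariano test of equal forecast accuracy.
import numpy as np
from scipy.stats import norm

def dm_test(e1, e2, lags, loss=lambda e: e ** 2):
    d = loss(np.asarray(e1)) - loss(np.asarray(e2))   # loss differential d_{h,t}
    T, dbar = len(d), d.mean()
    u = d - dbar
    gamma = lambda j: u[j:] @ u[:-j] / T if j else u @ u / T   # autocovariances
    lrv = gamma(0) + 2 * sum((1 - j / (lags + 1)) * gamma(j)
                             for j in range(1, lags + 1))      # Bartlett weights
    dm = dbar / np.sqrt(lrv / T)
    return dm, 2 * (1 - norm.cdf(abs(dm)))            # asymptotically N(0,1)
```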
Sidestep: asymptotic properties of weakly stationary processes
Assumptions

Assume a stochastic process $y_t$ satisfies the following assumptions:
$$E[y_t] = \mu \quad \text{for all } t$$
$$\operatorname{Cov}[y_t, y_{t-j}] = \gamma_j \quad \text{for all } t$$
$$\sum_{j=0}^{\infty} |\gamma_j| < \infty$$

- The first two conditions define a weakly stationary process.
- The third condition (absolute summability of the autocovariances) guarantees that the stationary process is ergodic for the mean; it will be used to prove the following asymptotic results.

Sidestep: asymptotic properties of weakly stationary processes
Convergence of the sample mean and its asymptotic distribution

If the assumptions stated above are satisfied, then
$$\bar{y} = \frac{1}{T} \sum_{t=1}^{T} y_t \xrightarrow{p} \mu$$
and
$$\sqrt{T}(\bar{y} - \mu) \xrightarrow{d} N(0, \omega^2),$$
where the long-run variance is
$$\omega^2 = \lim_{T \to \infty} \operatorname{Var}[\sqrt{T}(\bar{y} - \mu)] = \lim_{T \to \infty} E[T(\bar{y} - \mu)^2] = \sum_{j=-\infty}^{\infty} \gamma_j.$$

- The proofs can be found in Hamilton (1994, p. 186ff.). Also see Econometrics II.

Sidestep: asymptotic properties of weakly stationary processes
Newey-West estimator

- An estimator for the long-run variance $\omega^2$ should be positive.
- Unfortunately, this is not so easy to guarantee.
- For example, the autocovariances $\gamma_j$ of an MA(h) process are zero for $j > h$. But the estimator
  $$\hat{\omega}^2 = \sum_{j=-h}^{h} \hat{\gamma}_j = \hat{\gamma}_0 + 2 \sum_{j=1}^{h} \hat{\gamma}_j,$$
  where the $\hat{\gamma}_j$ are the sample autocovariances, is not necessarily positive.
- Newey and West (1987) showed that the alternative estimator
  $$\hat{\omega}^2 = \sum_{j=-h}^{h} \left(1 - \frac{|j|}{h+1}\right) \hat{\gamma}_j = \hat{\gamma}_0 + 2 \sum_{j=1}^{h} \left(1 - \frac{j}{h+1}\right) \hat{\gamma}_j$$
  is consistent and positive. (A code sketch follows below.)
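A compact implementation sketch of this estimator (Bartlett kernel, lag window h) for a scalar series:

```python
# Newey-West estimate of the long-run variance of a scalar series y.
import numpy as np

def newey_west_lrv(y, h):
    u = np.asarray(y) - np.mean(y)
    T = len(u)
    gamma = lambda j: u[j:] @ u[:-j] / T if j else u @ u / T  # sample autocovariances
    return gamma(0) + 2 * sum((1 - j / (h + 1)) * gamma(j) for j in range(1, h + 1))
```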

Sidestep: asymptotic properties of weakly stationary processes
Kernels

- The term
  $$1 - \frac{|j|}{h+1}$$
  is called a kernel with bandwidth or lag window $h$ (because autocovariances of order larger than $h$ are neglected).
- It leads to a downweighting of distant autocovariances. For example, for a bandwidth of $h = 3$, the weights are
  $$\frac{1}{4},\ \frac{2}{4},\ \frac{3}{4},\ 1,\ \frac{3}{4},\ \frac{2}{4},\ \frac{1}{4}.$$
- Other kernels (Parzen kernel, quadratic kernel) have been proposed; see Hamilton (1994, p. 281ff.) and Andrews (1991, Econometrica).
- Andrews (1991) also suggests a plug-in estimator for the bandwidth.

Modified Diebold-Mariano test

- Harvey, Leybourne and Newbold (1997) propose the modified DM statistic
  $$MDM = T^{-1/2} \left[T + 1 - 2h + T^{-1} h(h-1)\right]^{1/2} DM.$$
- For small values of $T$, the MDM test may have better size properties when critical values from the t-distribution with $T - 1$ degrees of freedom are used. (A small sketch follows below.)
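The correction is a one-liner on top of the DM statistic from the sketch above:

```python
# Harvey-Leybourne-Newbold small-sample correction of the DM statistic.
import numpy as np

def mdm_statistic(dm, T, h):
    # Compare the result with critical values from the t(T-1) distribution.
    return np.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T) * dm
```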


9. Forecast methods in practice

Introduction

- The literature on forecast methods is vast and quickly evolving, particularly with respect to large-data methods.
- We cannot even give an overview here. But a few overview papers:
  - Stock and Watson (2006), Forecasting with many predictors, in: Elliott et al. (eds.), Handbook of Economic Forecasting 1, 516-550.
  - Stock and Watson (2010), Dynamic factor models, in: Clements and Hendry (eds.), Oxford Handbook of Economic Forecasting.
  - Stock and Watson (2012), Generalized shrinkage methods for forecasting using many predictors, JBES 30(4), 481-493.
  - Carriero, Kapetanios, Marcellino (2010), Forecasting large datasets with Bayesian reduced rank multivariate models, Journal of Applied Econometrics 26(5), 735-761.

Autoregressions

- Especially in macro forecasting, simple AR models are popular:
  $$y_t = a_1 y_{t-1} + \cdots + a_p y_{t-p} + u_t.$$
- Since AR models are special cases of the VAR model studied in the first half of the semester, all those results apply.
- In particular, due to estimation uncertainty, small lag orders should be preferred.
- In fact, the AR(1) model is typically difficult to beat.

AR-X models

- Improvements may be possible if the AR model is augmented by early indicators:
  $$y_t = a_1 y_{t-1} + b_1 x_{1,t-1} + \cdots + b_k x_{k,t-1} + u_t.$$
- Only the best indicators should be included, to minimize estimation uncertainty.
- How to compute multi-step forecasts?

VAR models

- Popular benchmark
- However, quickly overparameterized
- Valuable for multi-step forecasts

Multi-step forecasts
Direct forecasts

- Recall: forecasting $Y_{t+h}$ often requires knowledge of the conditional mean of $Y_{t+h}$ given $I_t$.
- To be parsimonious, let us assume that the relevant variables included in $I_t$ are $Y_t$ and $k$ indicator variables $X_t$.
- Now suppose we have a sample $t = 1, \ldots, T$ and want to forecast $Y_{T+h}$.
- Under stationarity and linearity, the sample counterpart to
  $$E[Y_{T+h} \mid I_T] = E[Y_{T+h} \mid Y_T, X_{1,T}, \ldots, X_{k,T}]$$
  is the regression
  $$y_{t+h} = \beta_0 + \beta_1 y_t + \beta_2 x_{1,t} + \cdots + \beta_{k+1} x_{k,t} + v_{t+h}, \qquad t = 1, \ldots, T - h.$$
- Estimating it by OLS allows us to make the prediction
  $$\hat{y}_{T+h} = \hat{\beta}_0 + \hat{\beta}_1 y_T + \hat{\beta}_2 x_{1,T} + \cdots + \hat{\beta}_{k+1} x_{k,T}.$$
- This is called a direct multistep forecast. (A code sketch follows below.)
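A minimal sketch of a direct h-step forecast, assuming `y` (target) and `X` (indicators observed at t) as arrays:

```python
# Direct h-step forecast: regress y_{t+h} on (1, y_t, x_t), apply the fit at t = T.
import numpy as np
import statsmodels.api as sm

def direct_forecast(y, X, h):
    Z = sm.add_constant(np.column_stack([y, X]))   # regressors dated t
    res = sm.OLS(y[h:], Z[:-h]).fit()              # leads of y on lagged regressors
    return float(res.params @ Z[-1])               # forecast of y_{T+h}
```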


Multi-step forecasts
Direct forecasts (continued)

- Estimating by OLS models of the type
  $$y_{t+h} = \beta_0 + \beta_1 y_t + \beta_2 x_{1,t} + \cdots + \beta_{k+1} x_{k,t} + v_{t+h}$$
  is, however, complicated by the fact that the disturbances are typically autocorrelated for $h > 1$, which reduces estimation efficiency.
- To see the point, consider $h = 2$.
- Then $v_{t+2}$ is due to unforeseeable shocks that occur in periods $t+1$ and $t+2$, while $v_{t+3}$ is due to unforeseeable shocks that occur in periods $t+2$ and $t+3$.
- Hence, the overlapping errors $v_{t+2}$ and $v_{t+3}$ are correlated.
- This is no surprise, since optimal forecast errors should have an MA(h-1) structure under MSE loss.

Multi-step forecasts
Iterated forecasts

- Alternatively, one may use iterated multistep forecasts.
- Advantage: fewer problems with autocorrelation.
- The difficulty is that iterated multistep forecasts require (auxiliary) equations for all explanatory variables.
- These are most easily obtained from a VAR model, where each variable has its own equation.
- Potential drawback: relatively large number of parameters.

Forecasting with large data sets

- AR-X and VAR models are only useful when the number of variables is small
- Today: hundreds or even thousands of variables
- Impossible to include them all in one AR-X or VAR model
- What to do?

Forecasting with large data sets

There are different ways to deal with large data sets:

- Information selection
  - Use information criteria, correlations with the target variable, t-values in AR-X models, etc., to select a few promising indicators. Include only them in the forecast model.
- Information pooling
  - Extract common features from the data. Include only them in the forecast model.
  - Popular method: factor models.
- Forecast pooling
  - Use many small models with just a few parameters to generate many forecasts.
  - Compute a (weighted) average of all forecasts. See below why this works.
- And many intermediate forms
  - Particularly, shrinkage methods like bootstrap aggregation (bagging) of variable selection methods, Bayesian estimation, LASSO, elastic net, and boosting.
  - For boosting, see Buchen and Wohlrabe (2011), Forecasting with many predictors: Is boosting a viable alternative?, Economics Letters 113(1), 16-18, and the references therein.


10. Forecast combinations

Background

- In many forecast situations, we are in data-rich environments.
- This means that there are many ($K$) potential predictor variables compared to the number of observations ($T$).
- Often, we even have $K > T$.
- In such cases we have to condense the information.
- One approach is pooling of information: the $K$ correlated predictor variables are transformed into $K^\ast \ll K$ representatives, e.g., by using factor models or variable selection methods (like stepwise regressions).
- A kind of in-between approach is to use estimation methods appropriate for such situations, like shrinkage estimators or Bayesian techniques.
- A second approach is pooling of forecasts: one produces many forecasts using different forecasting models and then combines these forecasts, e.g., by using the average.
- In the following, let us concentrate on forecast combinations.

Why forecast combination might be beneficial

Forecast combinations might be beneficial even with a relatively small number of predictor variables. Here are some reasons why (cf. Timmermann, 2006).

1. Portfolio diversification
   - The information set underlying the individual forecasts might be unobserved.
   - Then it is not feasible to pool the underlying information sets and construct a super model.
   - Instead, one may pool the forecasts.
2. Protection against structural breaks
   - Individual forecasts may be affected very differently by structural breaks.
   - Some models may adapt quickly and will only temporarily be affected by structural breaks, while others have parameters that adjust only very slowly to new post-break data.
   - If the data window since the most recent break is short, the faster-adapting models can be expected to produce the best forecasting performance.
   - On average, combinations of forecasts from models with different degrees of adaptability will outperform forecasts from individual models.

Why forecast combination might be beneficial

3. Protection against misspecification
   - The true DGP is likely to be more complex and of a much higher dimension than assumed by even the most flexible and general forecasting model.
   - Viewing forecasting models as local approximations, it is implausible that the same model dominates all others at all points in time.
   - Rather, the best model may change over time in ways that can be difficult to track on the basis of past forecasting performance.
   - Combining forecasts can be viewed as a way of robustification against misspecification biases and measurement errors.

Linear forecast combinations under MSE loss
Two forecasts: setup

What are the gains of forecast combination?

Let us consider the simplest possible case:

- There are two forecasts $y_{1,t+1|t}$ and $y_{2,t+1|t}$ with forecast errors $e_{1,t+1} = y_{t+1} - y_{1,t+1|t}$ and $e_{2,t+1} = y_{t+1} - y_{2,t+1|t}$, respectively.
- For ease of notation, denote the errors by $e_1$ and $e_2$.
- Assume the forecasts are unbiased, $E[e_1] = 0$ and $E[e_2] = 0$.
- Denote their variances by $\operatorname{Var}[e_1] = \sigma_1^2$ and $\operatorname{Var}[e_2] = \sigma_2^2$.
- Denote their covariance by $\sigma_{12} = \rho_{12}\, \sigma_1 \sigma_2$, where $\rho_{12}$ is their correlation.

Now combine the two forecasts using weights $w$ and $1 - w$:
$$y_{c,t+1|t} = w\, y_{1,t+1|t} + (1 - w)\, y_{2,t+1|t}.$$

Linear forecast combinations under MSE loss
Two forecasts: properties

The forecast error of the combined forecast is
$$e_c := e_{c,t+1} = y_{t+1} - y_{c,t+1|t} = y_{t+1} - w\, y_{1,t+1|t} - (1 - w)\, y_{2,t+1|t} = w (y_{t+1} - y_{1,t+1|t}) + (1 - w)(y_{t+1} - y_{2,t+1|t}) = w\, e_1 + (1 - w)\, e_2.$$

Hence, the properties of the combined forecast error are
$$E[e_c] = w\, E[e_1] + (1 - w)\, E[e_2] = 0,$$
$$\operatorname{Var}[e_c] = w^2 \operatorname{Var}[e_1] + (1 - w)^2 \operatorname{Var}[e_2] + 2 w (1 - w) \operatorname{Cov}[e_1, e_2] = w^2 \sigma_1^2 + (1 - w)^2 \sigma_2^2 + 2 w (1 - w) \sigma_{12} =: \sigma_c^2(w).$$

Linear forecast combinations under MSE loss
Two forecasts: optimal weights

Under MSE loss, the aim is to minimize the expected squared error, which equals the error variance because the combined forecast is unbiased.

Hence, minimize with respect to $w$ the objective function
$$\sigma_c^2(w) = w^2 \sigma_1^2 + (1 - w)^2 \sigma_2^2 + 2 w (1 - w) \sigma_{12}.$$

The first-order condition is
$$\frac{\partial \sigma_c^2(w)}{\partial w} = 2 w \sigma_1^2 - 2 (1 - w) \sigma_2^2 + 2 (1 - 2w) \sigma_{12} \overset{!}{=} 0.$$

Solving for the optimal weights $w^\ast$ yields
$$w^\ast = \frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2 \sigma_{12}}, \qquad 1 - w^\ast = \frac{\sigma_1^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2 \sigma_{12}}.$$

(A code sketch follows below.)
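A minimal sketch: the optimal weight computed from (estimated) error variances and covariance:

```python
# Optimal combination weight on forecast 1 under MSE loss.
import numpy as np

def optimal_weight(sigma1_sq, sigma2_sq, sigma12):
    return (sigma2_sq - sigma12) / (sigma1_sq + sigma2_sq - 2 * sigma12)

# Illustrative values: sigma1^2 = 1, sigma2^2 = 2, rho12 = 0.3
w = optimal_weight(1.0, 2.0, 0.3 * np.sqrt(1.0 * 2.0))
```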

Linear forecast combinations under MSE loss
Two forecasts: MSE of the optimal combination

The MSE of the optimal forecast combination (which is again unbiased) is obtained by substituting $w^\ast$ into the objective function. After some algebra, using $\sigma_{12} = \rho_{12} \sigma_1 \sigma_2$, this yields
$$\sigma_c^2(w^\ast) = \frac{\sigma_1^2 \sigma_2^2 (1 - \rho_{12}^2)}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2}.$$

Is this smaller than the MSE of an individual forecast?

In the following, we show this by comparing $\sigma_c^2(w^\ast)$ with the smallest individual variance $\sigma_1^2$, i.e., we assume that $\sigma_1^2 \le \sigma_2^2$. (This is without loss of generality because we can denote either of the two forecasts as forecast 1.)

Linear forecast combinations under MSE loss
Two forecasts: MSE of the optimal combination versus the best individual forecast

Compute the difference:
$$\sigma_c^2(w^\ast) - \sigma_1^2 = \frac{\sigma_1^2 \sigma_2^2 (1 - \rho_{12}^2)}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2} - \frac{\sigma_1^4 + \sigma_1^2 \sigma_2^2 - 2 \rho_{12} \sigma_1^3 \sigma_2}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2} = \frac{-\rho_{12}^2 \sigma_1^2 \sigma_2^2 - \sigma_1^4 + 2 \rho_{12} \sigma_1^3 \sigma_2}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2} = -\sigma_1^2 \sigma_2^2\, \frac{(\rho_{12} - \sigma_1/\sigma_2)^2}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2}$$

- The leading factor $\sigma_1^2 \sigma_2^2$ is positive.
- The denominator is positive because it equals $\operatorname{Var}[e_1 - e_2] > 0$.
- The squared term $(\rho_{12} - \sigma_1/\sigma_2)^2$ is positive or zero.

$$\Rightarrow \quad \sigma_c^2(w^\ast) - \sigma_1^2 \le 0 \quad \iff \quad \sigma_c^2(w^\ast) \le \sigma_1^2$$

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 115 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: when there is no gain of pooling

The result also shows when combination does not pay off: The difference $\sigma_c^2(w^*) - \sigma_1^2$ is
zero if

$$\rho_{12} = \sigma_1/\sigma_2.$$

In this situation, the optimal weights are

$$
w^* = \frac{\sigma_2^2 - \rho_{12}\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2}
= \frac{1 - \rho_{12}\,\sigma_1/\sigma_2}{\sigma_1^2/\sigma_2^2 + 1 - 2\rho_{12}\,\sigma_1/\sigma_2}
= \frac{1 - \rho_{12}^2}{1 - \rho_{12}^2} = 1,
\qquad
1 - w^* = 0.
$$

This means that $y_{c,t+1|t} = y_{1,t+1|t}$ because there is no gain from pooling.
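As a quick numerical check, reusing the hypothetical `optimal_weight` helper sketched earlier:

```python
# Boundary case rho12 = sigma1/sigma2: all weight goes to forecast 1.
sigma1, sigma2 = 0.8, 1.6
rho12 = sigma1 / sigma2                       # = 0.5
print(optimal_weight(sigma1, sigma2, rho12))  # 1.0, i.e., no gain from pooling
```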

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 116 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: when there is no gain of pooling

[Figure: surface plot of $(\rho_{12} - \sigma_1/\sigma_2)^2$ as a function of $\rho_{12}$ and $\sigma_1/\sigma_2$.]

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 117 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: when there is no gain of pooling

[Figure: surface plot of the relative gain $(\sigma_c^2(w^*) - \sigma_1^2)/\sigma_1^2$ as a function of $\rho_{12}$ and $\sigma_1/\sigma_2$.]

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 118 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: why reality is less friendly

Exactly optimal combinations are only possible if


I the variances and correlations are stable and
I the population values are known to the forecaster.

In reality,
I we face the typical observation that population parameters seem unstable.
I we have to estimate the variances and correlations, which typically induces a lot of
noise.

As a consequence, the theoretical gains from pooling analyzed above are not fully
attainable in practice.

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 119 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: estimation of the weights

To estimate the weights, we can estimate the variances and the covariance from the
sample moments:

$$\hat\sigma_1^2 = T^{-1}\sum_{t=T_0}^{T_1-1} (e_{1,t+1|t} - \bar e_1)^2$$

$$\hat\sigma_2^2 = T^{-1}\sum_{t=T_0}^{T_1-1} (e_{2,t+1|t} - \bar e_2)^2$$

$$\hat\sigma_{12} = T^{-1}\sum_{t=T_0}^{T_1-1} (e_{1,t+1|t} - \bar e_1)(e_{2,t+1|t} - \bar e_2)$$

where $\bar e_1$ and $\bar e_2$ are the respective means in the forecast sample and $T$ is the number of
forecasts. The plug-in weight is then $\hat w = (\hat\sigma_2^2 - \hat\sigma_{12})/(\hat\sigma_1^2 + \hat\sigma_2^2 - 2\hat\sigma_{12})$.

Asymptotically, the estimators converge towards the population values. Hence, the
estimated weight converges towards the optimal weight.
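A minimal Python sketch of this plug-in estimator (numpy; `e1` and `e2` are assumed arrays of past forecast errors, the function name is illustrative):

```python
import numpy as np

def moment_weight(e1, e2):
    """Plug-in weight on forecast 1 from the sample moments of the forecast errors."""
    C = np.cov(e1, e2)                  # 2x2 sample covariance matrix (demeaned)
    s11, s22, s12 = C[0, 0], C[1, 1], C[0, 1]
    return (s22 - s12) / (s11 + s22 - 2 * s12)
```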
Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 120 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: estimation of the weights

As an alternative method, we can use OLS. To this end, write the combination as a
regression equation:

$$y_{t+h} = w\,y_{1,t+h|t} + (1-w)\,y_{2,t+h|t} + e_{c,t+h|t}.$$

Subtracting $y_{2,t+h|t}$ from both sides yields the estimable equation

$$
\begin{aligned}
y_{t+h} - y_{2,t+h|t} &= w\,(y_{1,t+h|t} - y_{2,t+h|t}) + e_{c,t+h|t} \\
e_{2,t+h|t} &= w\,(y_{1,t+h|t} - y_{t+h} + y_{t+h} - y_{2,t+h|t}) + e_{c,t+h|t} \\
e_{2,t+h|t} &= w\,(e_{2,t+h|t} - e_{1,t+h|t}) + e_{c,t+h|t}.
\end{aligned}
$$

(To account for possibly non-zero sample means of the forecast errors, an intercept may
be added.)

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 121 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: estimation of the weights

The OLS estimator is

$$
\hat w = \frac{T^{-1}\sum_{t=T_0}^{T_1-1} e_{2,t+h|t}\,(e_{2,t+h|t} - e_{1,t+h|t})}{T^{-1}\sum_{t=T_0}^{T_1-1} (e_{2,t+h|t} - e_{1,t+h|t})^2}
\;\xrightarrow{\,p\,}\;
\frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}.
$$

Hence, asymptotically the optimal population weight is chosen.
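A minimal numpy sketch of this regression without an intercept (function name illustrative; `e1`, `e2` as before):

```python
import numpy as np

def ols_weight(e1, e2):
    """OLS estimate of w in e2 = w*(e2 - e1) + ec, without an intercept."""
    d = e2 - e1
    return np.dot(e2, d) / np.dot(d, d)
```

Adding a column of ones to the regressor matrix would accommodate the intercept mentioned on the previous slide.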

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 122 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: How precise is the OLS estimator?

For large $T$, the standard error of the regression converges towards $\sigma_c(w^*)$ because $\hat w$
converges to $w^*$ and thus the regression residual converges to $e_{c,t+h|t}$.

The asymptotic variance of the OLS estimator is thus

$$\mathrm{Avar}(\hat w) = \frac{1}{T}\,\frac{\sigma_c^2(w^*)}{T^{-1}\sum_{t=T_0}^{T_1-1} (e_{2,t+h|t} - e_{1,t+h|t})^2}.$$

Since

$$E\left[T^{-1}\sum_{t=T_0}^{T_1-1} (e_{2,t+h|t} - e_{1,t+h|t})^2\right] = \sigma_1^2 + \sigma_2^2 - 2\sigma_{12},$$

the above expression is asymptotically equivalent to

$$
\mathrm{Avar}(\hat w)
= \frac{1}{T}\,\frac{\sigma_c^2(w^*)}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}
= \frac{1}{T}\,\frac{\sigma_1^2\sigma_2^2(1-\rho_{12}^2)}{(\sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2)^2}
= \frac{1}{T}\,\frac{1-\rho_{12}^2}{(\sigma_1/\sigma_2 + \sigma_2/\sigma_1 - 2\rho_{12})^2}.
$$

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 123 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: How precise is the OLS estimator?

Hence, the standard error of the OLS estimator for $w$ is

$$\sqrt{\mathrm{Avar}(\hat w)} = \frac{1}{\sqrt{T}}\,\frac{\sqrt{1-\rho_{12}^2}}{\sigma_1/\sigma_2 + \sigma_2/\sigma_1 - 2\rho_{12}}.$$

Is this large?

Realistic example: $T = 25$, $\sigma_1/\sigma_2 = 0.9$, $\rho_{12} = 0.75$.

optimal weight $w^* = 0.70652$.

standard error $\sqrt{\mathrm{Avar}(\hat w)} \approx 0.25882$.

95%-interval for $w^*$: $[0.70652 \pm 1.96 \cdot 0.25882] = [0.2;\ 1.2]$.

This is huge!

(Note: precision becomes much better if the forecast errors are negatively correlated.)
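A short script (a sketch based on the asymptotic formula above) reproduces these numbers:

```python
import math

def se_weight(T, ratio, rho12):
    """Asymptotic standard error of the OLS weight; ratio = sigma1/sigma2."""
    return math.sqrt(1 - rho12**2) / ((ratio + 1 / ratio - 2 * rho12) * math.sqrt(T))

T, ratio, rho12 = 25, 0.9, 0.75
# w* after dividing numerator and denominator by sigma2^2 (only the ratio matters)
w_star = (1 - rho12 * ratio) / (ratio**2 + 1 - 2 * rho12 * ratio)
se = se_weight(T, ratio, rho12)
print(round(w_star, 5), round(se, 5))          # 0.70652 0.25882
print(w_star - 1.96 * se, w_star + 1.96 * se)  # roughly [0.2, 1.2]
```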

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 124 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: $\sqrt{\mathrm{Avar}(\hat w)}$ as a function of $\rho_{12}$ and $\sigma_1/\sigma_2$

[Figure: surface plot of $\sqrt{\mathrm{Avar}(\hat w)}$ as a function of $\rho_{12}$ and $\sigma_1/\sigma_2$.]

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 125 / 128
Forecast combinations

Methods of linear forecast combination

Many different approaches to combine $K$ different forecasts have been proposed (cf.
Timmermann, 2006):
I Equal weights: $w_i = 1/K$.
I Optimal linear weights: using the weights estimated by OLS (see above).
I Relative performance weights: $w_i = \mathrm{MSE}_i^{-1} \big/ \sum_k \mathrm{MSE}_k^{-1}$.
I Rank-based weights: $w_i = \mathrm{Rank}_i^{-1} \big/ \sum_k \mathrm{Rank}_k^{-1}$.
I Trimming: before weighting, discard the worst 25%, 50% or even 75% of the
forecasts.
I Clustering: (1) cluster forecasts into groups of similar forecasts, (2) compute the
average forecast of each cluster, (3) combine these cluster forecasts into one
forecast using one of the previous weighting schemes.
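A compact Python sketch of several of these schemes (numpy; `mse` is an assumed array of the $K$ forecasts' historical MSEs, function names illustrative):

```python
import numpy as np

def performance_weights(mse):
    """Relative performance weights: inverse MSE, normalized to sum to one."""
    inv = 1.0 / np.asarray(mse, dtype=float)
    return inv / inv.sum()

def rank_weights(mse):
    """Rank-based weights: rank 1 = smallest MSE, weights proportional to 1/rank."""
    ranks = np.argsort(np.argsort(mse)) + 1
    inv = 1.0 / ranks
    return inv / inv.sum()

def trimmed_equal_weights(mse, trim=0.25):
    """Equal weights over the best (1 - trim) share of forecasts; the rest get zero."""
    mse = np.asarray(mse, dtype=float)
    n_keep = max(1, int(round(len(mse) * (1 - trim))))
    keep = np.argsort(mse)[:n_keep]
    w = np.zeros(len(mse))
    w[keep] = 1.0 / n_keep
    return w
```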

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 126 / 128
Forecast combinations

Shrinkage methods for forecast combination

These methods follow the general idea of shrinkage estimation.

Assume you have calculated an unbiased OLS estimator $\hat w$.

A shrinkage estimator can be defined as $\hat w_s = \frac{1}{1+s}\,\hat w$, $s > 0$.

If $E[\hat w] = w^*$, then $E[\hat w_s] = \frac{1}{1+s}\,w^* < w^*$ (for $w^* > 0$), hence it is biased.

However, $\mathrm{Var}(\hat w_s) = \mathrm{Var}\!\left(\frac{1}{1+s}\,\hat w\right) = \frac{1}{(1+s)^2}\,\mathrm{Var}(\hat w) < \mathrm{Var}(\hat w)$.

Hence, the MSE ($= \text{bias}^2 + \text{variance}$) of the shrinkage estimator can, in principle, be
smaller than the MSE of the OLS estimator.

In general, shrinkage pays off if the variance of the OLS estimator is large. As we saw
above, this may be a typical situation in forecast combination exercises.

In the multiple regression case, shrinkage may also be a good idea if the regressors are highly
correlated such that the moment matrix $X'X$ is near singular and its inverse is poorly
determined. Again, this may characterize a set of multiple forecasts.
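A minimal sketch of this shrinkage estimator; the second function is a common practical variant (shrinking toward equal weights) that is not part of the slide but follows the same logic:

```python
def shrink_weight(w_hat, s):
    """Shrinkage estimator from the slide: scale the OLS weight toward zero."""
    return w_hat / (1.0 + s)

def shrink_toward_equal(w_hat, s, K=2):
    """Variant: shrink toward the equal weight 1/K rather than toward zero."""
    return (w_hat + s * (1.0 / K)) / (1.0 + s)
```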

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 127 / 128
Forecast combinations

Forecast combination puzzle


(Stock and Watson, 2004)

Seemingly naive combination methods are often found to be superior in empirical studies
compared to optimal approaches like the one introduced above.
I The most common naive combination method is the average forecast:
$\bar y_{t+h|t} = (1/M)\sum_{m=1}^{M} \hat y_{m,t+h|t}$.
I Reasons might be insufficient sample sizes for the estimation of $w$ and/or
time variation in the parameters. The simple mean avoids estimation uncertainty,
which can be large.
I In general, methods that avoid estimation altogether (such as averaging, or averaging
after trimming) or reduce the estimation variance (shrinkage estimation, Bayesian
estimation) might perform better than optimal linear weighting with estimated
weights; the simulation sketch below illustrates this.
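A small Monte Carlo sketch (purely illustrative, all parameter values assumed) of the mechanism: with a short estimation sample, the simple mean tends to beat the estimated "optimal" weight out of sample:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma1, sigma2, rho = 1.0, 1.1, 0.8
cov = [[sigma1**2, rho * sigma1 * sigma2],
       [rho * sigma1 * sigma2, sigma2**2]]

n_train, n_test, n_rep = 20, 200, 2000
mse_ols = mse_mean = 0.0
for _ in range(n_rep):
    e = rng.multivariate_normal([0.0, 0.0], cov, size=n_train + n_test)
    e1, e2 = e[:, 0], e[:, 1]
    d = e2[:n_train] - e1[:n_train]
    w_hat = np.dot(e2[:n_train], d) / np.dot(d, d)   # noisy estimated weight
    comb_ols = w_hat * e1[n_train:] + (1 - w_hat) * e2[n_train:]
    comb_avg = 0.5 * (e1[n_train:] + e2[n_train:])
    mse_ols += np.mean(comb_ols**2) / n_rep
    mse_mean += np.mean(comb_avg**2) / n_rep

print(mse_ols, mse_mean)  # the simple mean typically wins in this setup
```

Here the error variances are deliberately close and the errors positively correlated, so the true optimal weight is only weakly identified and the estimation noise in $\hat w$ outweighs the small theoretical gain.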

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 128 / 128
