
Multivariate Time Series Analysis and Forecasting

Lecture 4: Topics in Forecasting

Prof. Dr. Kai Carstensen

Christian Albrechts University Kiel

Summer Term 2017

Reading List

F.X. Diebold and R.S. Mariano (1995), Comparing predictive accuracy, Journal of Business and Economic Statistics 13, 253-263.
R.S. Mariano (2007), Testing forecast accuracy, in: M.P. Clements and D.F. Hendry (eds.), A Companion to Economic Forecasting, Blackwell.
F.X. Diebold and J.A. Lopez (1995), Forecast evaluation and combination, Federal Reserve Bank of New York Research Paper No. 9525.
G. Elliott and A. Timmermann (2008), Economic forecasting, Journal of Economic Literature 46, 3-56.
A. Timmermann (2006), Forecast combinations, in: Handbook of Economic Forecasting, Vol. 1, 99-134.
A.J. Patton and A. Timmermann (2010), Generalized forecast errors, a change of measure, and forecast optimality conditions, in: T. Bollerslev, J.R. Russell and M.W. Watson (eds.), Volatility and Time Series Econometrics: Essays in Honor of Robert F. Engle, Oxford University Press.
A.J. Patton and A. Timmermann (2007), Testing forecast optimality under unknown loss, Journal of the American Statistical Association 102(480), 1172-1184.


1. Example: Forecasting US inflation and unemployment


Green Book forecast of US unemployment rate and inflation rate

- Staff forecasts at the Fed.
- Input to each regularly scheduled FOMC meeting.
- Data are released with a 5-year lag.
- Past forecasts are publicly available.
- Quarterly data: 1980Q1-2009Q4.

US unemployment rate (in percent)
1-step Green Book forecast

[Figure: US unemployment rate, actual (blue) and 1-step Green Book forecast (red), 1980-2010]
[Figure: forecast error of the 1-step Green Book forecast, 1980-2010]

US inflation rate (quarter-on-quarter rate in percent, annualized)
1-step Green Book forecast

[Figure: US inflation rate, actual (blue) and 1-step Green Book forecast (red), 1980-2010]
[Figure: forecast error of the 1-step Green Book forecast, 1980-2010]

VAR forecast of US unemployment rate and inflation rate

- 3 variables: unemployment rate, CPI inflation rate, Fed funds rate
- Quarterly data: 1955Q1-2015Q3
- Estimation sample recursively increased from 1955Q1-1984Q4 to 1955Q1-2015Q2
- 1-step forecasts for 1985Q1 to 2015Q3
- p = 4 lags
- Estimation by OLS (a code sketch of the recursive scheme follows below)
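A minimal sketch of this recursive exercise, assuming the three series sit in a pandas DataFrame `data` (the column layout and the `first_end` split index are hypothetical, not from the slides):

```python
# Recursive 1-step VAR(p) forecasts on an expanding estimation window.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

def recursive_var_forecasts(data: pd.DataFrame, first_end: int, p: int = 4):
    """Re-estimate a VAR(p) by OLS on t = 1..end, store the 1-step forecast."""
    forecasts = []
    for end in range(first_end, len(data)):
        fit = VAR(data.iloc[:end]).fit(p)          # expanding estimation sample
        # 1-step forecast from the last p observations of the estimation sample
        forecasts.append(fit.forecast(data.values[end - p:end], steps=1)[0])
    return np.array(forecasts)                     # one row per forecast origin
```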

US unemployment rate (in percent)
1-step forecast from recursively estimated VAR

[Figure: US unemployment rate, actual (blue) and 1-step VAR forecast (red), 1985-2015]
[Figure: forecast error of the 1-step forecast from the VAR with 4 lags, 1985-2015]

US inflation rate (quarter-on-quarter rate in percent)
1-step forecast from recursively estimated VAR

[Figure: US inflation rate, actual (blue) and 1-step VAR forecast (red), 1985-2015]
[Figure: forecast error of the 1-step forecast from the VAR with 4 lags, 1985-2015]


Are these good forecasts?

Which one is better?

Should we use (one of) them in practice?


2. Forecasting and forecast loss

Notation

- $h$: forecast horizon
- $Y_{t+h}$: random variable of the information set $I_{t+h}$ that generates the value to be forecast (realization: $y_{t+h}$)
- $Y_{t+h|t}$: forecast of $Y_{t+h}$ using the information set $I_t$
- The forecast is typically based on population parameters which are unknown to the forecaster and have to be estimated.

Loss function

- The real-valued loss function describes how bad a forecast might be.
- General formulation:
  $$L(Y_{t+h|t}, Y_{t+h}, W_t)$$
  Hence, in general the loss depends on the forecast $Y_{t+h|t}$, the target variable $Y_{t+h}$, and (some) data $W_t$ of the information set $I_t$.
- Many important loss functions can be simplified to
  $$L(Y_{t+h} - Y_{t+h|t}) = L(e_{t+h|t}),$$
  where $e_{t+h|t} = Y_{t+h} - Y_{t+h|t}$ is the forecast error.
- As an important exception, the loss may depend on the state of the economy at the time of the forecast (boom/recession, hyperinflation/deflation). Then the simplification does not hold.

Properties of loss functions $L(Y_{t+h|t}, Y_{t+h}, W_t)$

- Normalization: $L(y_{t+h}, y_{t+h}, w_t) = 0$
- Uniqueness: $L(y_{t+h|t}, y_{t+h}, w_t) > 0$ for all $y_{t+h|t} \neq y_{t+h}$
- Existence of expected loss:
  $$E[L(Y_{t+h|t}, Y_{t+h}, W_t) \mid I_t] = \int_{y_{t+h}} L(Y_{t+h|t}, y_{t+h}, W_t)\, p_Y(y_{t+h} \mid I_t)\, dy_{t+h}$$
  exists, i.e., the integral is finite. Note that $p_Y(y_{t+h} \mid I_t)$ is the conditional density of $Y_{t+h}$ given the information set $I_t$. Existence might be an issue if the conditional distribution is fat-tailed and/or the loss in the tails of the distribution is very large.
- Symmetry: $L(y_{t+h} + c, y_{t+h}, w_t) = L(y_{t+h} - c, y_{t+h}, w_t)$. This might be a strong assumption in many economic applications (government budget forecasts, the ECB's inflation forecast, a firm's demand forecast, a bank's asset price forecast). Nevertheless, it is standard in many empirical studies.

Optimal forecast

- Even when conditioning on the available information $I_t$, forecast loss is a random variable because it depends on $Y_{t+h}$, which is unknown in period $t$.
- Therefore, an optimal forecast $Y^\ast_{t+h|t}$ typically minimizes expected loss (other approaches like minimizing median or maximum loss are possible but not popular).
- To simplify notation, let us write $\hat{Y} \equiv Y_{t+h|t}$ and $Y \equiv Y_{t+h}$ in the following.
- The first derivative of the objective function is
  $$\frac{d}{d\hat{Y}}\, E[L(\hat{Y}, Y, W_t) \mid I_t] = \frac{d}{d\hat{Y}} \int_y L(\hat{Y}, y, W_t)\, p_Y(y \mid I_t)\, dy.$$
- Assuming interchangeability of differentiation and integration, it simplifies to
  $$\int_y \frac{d L(\hat{Y}, y, W_t)}{d\hat{Y}}\, p_Y(y \mid I_t)\, dy = E[L'(\hat{Y}, Y, W_t) \mid I_t].$$
- Then the FOC is simply
  $$E[L'(\hat{Y}, Y, W_t) \mid I_t] = 0.$$
Example: Optimal forecast under MSE loss

- MSE loss is $L(\hat{Y}, Y, W_t) = L(Y - \hat{Y}) = (Y - \hat{Y})^2$.
- First derivative: $L'(\hat{Y}, Y, W_t) = -2(Y - \hat{Y})$.
- This yields the FOC
  $$E[L'(\hat{Y}, Y, W_t) \mid I_t] = E[-2(Y - \hat{Y}) \mid I_t] = 0,$$
  which is solved by
  $$\hat{Y} = E[Y \mid I_t].$$
- Result: under MSE loss the optimal predictor of $Y_{t+h}$ is the conditional expectation (given the information set available to the forecaster).
- This is a general result that does not only apply to VAR models (for which it was shown in lecture 1).

Feasible forecast

- In practice, the conditional expectation depends on unknown parameters $\theta$. Recall the VAR model.
- To make the optimal forecast feasible, we thus have to use an estimator $\hat{\theta}$.
- As shown for the VAR model, estimation increases forecast uncertainty.
- Unfortunately, combining an (in some sense) optimal estimator with an optimal but infeasible predictor does not necessarily lead to a feasible predictor with the smallest possible expected loss.
- Hence, there is ample room for many different estimation-plus-forecasting approaches, which is why this field is very active.


3. Specific loss functions

Mean squared error loss

$$L_{\text{MSE}}(\hat{Y}, Y, W_t) = L_{\text{MSE}}(Y - \hat{Y}) = (Y - \hat{Y})^2$$

- Forecast errors are penalized quadratically.
- Is this too strong? Or too weak?
- As shown, MSE loss leads to the optimal forecast $\hat{Y} = E[Y \mid I_t]$.

Mean absolute error loss

$$L_{\text{MAE}}(\hat{Y}, Y, W_t) = L_{\text{MAE}}(Y - \hat{Y}) = |Y - \hat{Y}|$$

- Large forecast errors are penalized less heavily than under MSE loss.
- Not easily differentiable at $\hat{Y} = Y$ (no interchangeability of differentiation and integration).
- For all continuous distributions $p_Y(y \mid I_t)$, the optimal forecast is the conditional median of $Y$.

Generalized error loss function

$$L_{\text{GEL}}(\hat{Y}, Y, W_t; \alpha, p) = L_{\text{GEL}}(Y - \hat{Y}; \alpha, p) = [\alpha + (1 - 2\alpha)\,\mathbf{1}_{Y < \hat{Y}}]\; |Y - \hat{Y}|^p$$

- This is a very flexible loss function that only depends on the forecast error $e = Y - \hat{Y}$.
- It is symmetric for $\alpha = 0.5$.
- Not easily differentiable at $\hat{Y} = Y$ (no interchangeability of differentiation and integration).
- Nests MAE loss ($\alpha = 0.5$, $p = 1$) and MSE loss ($\alpha = 0.5$, $p = 2$).
- $p = 1$: lin-lin loss (asymmetric piecewise linear loss).
- $p = 2$: quad-quad loss (asymmetric piecewise quadratic loss).

Symmetric loss functions

[Figure: symmetric loss functions ($\alpha = 0.5$): MSE loss, MAE loss, GE loss ($p = 1.5$), GE loss ($p = 3$)]

Asymmetric loss functions

[Figure: asymmetric loss functions ($\alpha = 0.25$): quad-quad loss, lin-lin loss, GE loss ($p = 1.5$), GE loss ($p = 3$)]

Linex loss

$$L_{\text{Linex}}(\hat{Y}, Y, W_t; b) = L_{\text{Linex}}(Y - \hat{Y}; b) = \exp(b(Y - \hat{Y})) - b(Y - \hat{Y}) - 1$$

- Even more asymmetric
- $b < 0$: approximately exponential for $Y < \hat{Y}$, approximately linear for $Y > \hat{Y}$
- $b > 0$: approximately exponential for $Y > \hat{Y}$, approximately linear for $Y < \hat{Y}$
- Easily differentiable
- Does not nest MSE and MAE

Linex loss functions

[Figure: Linex loss functions for b = 1.5, 1.75, and 2]

Direction of change loss

$$L_{\text{DoC}}(\hat{Y}, Y, W_t) = \begin{cases} 0 & \text{if } \operatorname{sign}(\hat{Y}) = \operatorname{sign}(Y) \\ 1 & \text{otherwise} \end{cases}$$

- Important for financial market forecasts
- Example: let $Y$ be the yield change of a financial asset. In some cases, all you need to get right is the direction of the yield change.
- Note: this loss function does not have a unique minimum!

Weighted MSE loss

$$L_{\text{WMSE}}(\hat{Y}, Y, W_t) = W_t\, (Y - \hat{Y})^2$$

- $W_t$ depends on the information set at the forecast origin, $I_t$. It might be a state variable.
- Weighting can be applied not only to MSE loss but to any other appropriate loss function.
- Idea: sometimes interest rests on good prediction in particular states of the economy (or at important events).
- Example: precise forecasts might be particularly important in recessions (and/or booms), much more than in normal times.

- An attractive way to obtain weights is the empirical distribution (or density) function.
- Suppose $\{Y_t\}$ (which is a subset of $W_t$) is strictly stationary and can be interpreted as a state variable (e.g. GDP).
- $W_t = 1 - f(Y_t)/\max f(Y_t)$, where $f(\cdot)$ is the density function of $Y_t$, focuses on both tails of the distribution of $Y_t$.
- $W_t = 1 - F(Y_t)$, where $F(\cdot)$ is the cumulative distribution function of $Y_t$, focuses on the left tail of the distribution of $Y_t$.
- $W_t = F(Y_t)$ focuses on the right tail of the distribution of $Y_t$.
- See van Dijk and Franses (2003) and Carstensen, Wohlrabe, Ziegler (2011). (A code sketch follows below.)
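A minimal sketch of these weighting schemes, assuming `y` is a sample of the stationary state variable; a kernel density estimate stands in for $f(\cdot)$ and ranks for the empirical cdf $F(\cdot)$:

```python
# cdf- and density-based WMSE weights for a state variable y.
import numpy as np
from scipy.stats import gaussian_kde

def recession_weights(y):
    """W_t = 1 - F(y_t): emphasizes the left tail (recessions)."""
    ranks = np.argsort(np.argsort(y)) + 1       # rank of each observation
    return 1.0 - ranks / (len(y) + 1.0)         # empirical cdf evaluated at y_t

def boom_weights(y):
    """W_t = F(y_t): emphasizes the right tail (booms)."""
    return 1.0 - recession_weights(y)

def tail_weights(y):
    """W_t = 1 - f(y_t)/max f(y_t): emphasizes both tails."""
    f = gaussian_kde(y)(y)                      # density estimate at each y_t
    return 1.0 - f / f.max()
```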

Weighted MSE loss
Example: euro area industrial production

[Figure: euro area industrial production, year-on-year growth rate, 1992M2-2009M6]

Weighted MSE loss
Example: euro area industrial production

[Figure: empirical cdf of euro area industrial production]

Weighted MSE loss
Example: euro area industrial production

[Figure: boom and recession weights based on the empirical cdf]

The use of loss functions in practice

- Find (and use) the optimal forecast based on a specific loss function appropriate for the forecasting problem at hand.
- Based on a sample of forecasts of forecaster j, check her forecast efficiency by comparing forecasts and realizations (assuming she used a specific loss function).
- Based on a sample of forecasts of forecaster j, infer her loss function by comparing forecasts and realizations (assuming she used an optimal forecast).
- Compare different forecasts or forecasting models for a given loss function in order to identify the best model.


4. Finding optimal forecasts

How to find an optimal forecast in theory

- State the loss function and derive the FOC.
- Find the distribution of the target variable.
- Solve.
- For some loss functions we can state closed-form solutions for the optimal forecasts without knowing the distribution of the target variable.

Example 1: Forecasting a Bernoulli variable

Suppose we want to forecast an iid Bernoulli random variable $X$ that takes the value 1 with probability $p$ and the value 0 with probability $1 - p$.

The optimal forecast $\hat{x}$ depends on the loss function:

- MSE loss leads to $\hat{x} = E(X) = p$.
- MAE loss leads to $\hat{x} = \operatorname{Med}(X)$, which is 0 if $p < 0.5$ and 1 if $p > 0.5$.
- Linlin loss leads to $\hat{x} = 0$ if $p < 1 - \alpha$ and $\hat{x} = 1$ if $p > 1 - \alpha$.
- Linex loss leads to $\hat{x} = \log(1 + p \exp(a) - p)/a$.

Proofs: shown in class or tutorial. (A numerical check follows below.)
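Not a proof, but a quick numerical check of these closed forms: minimize expected loss over a grid of candidate forecasts (the parameter values below are illustrative):

```python
# Grid-minimize E[L(X - xhat)] for X ~ Bernoulli(p) under four losses.
import numpy as np

p, alpha, a = 0.25, 0.8, 1.5
grid = np.linspace(-0.5, 1.5, 20001)            # candidate forecasts xhat

def expected_loss(loss):
    # E[L(X - xhat)] = p * L(1 - xhat) + (1 - p) * L(-xhat)
    return p * loss(1.0 - grid) + (1.0 - p) * loss(-grid)

cases = {
    "MSE":    (expected_loss(lambda e: e**2), p),
    "MAE":    (expected_loss(np.abs), float(p > 0.5)),
    "Linlin": (expected_loss(lambda e: (alpha + (1 - 2*alpha)*(e < 0)) * np.abs(e)),
               float(p > 1 - alpha)),
    "Linex":  (expected_loss(lambda e: np.exp(a*e) - a*e - 1),
               np.log(1 - p + p*np.exp(a)) / a),
}
for name, (el, closed) in cases.items():
    print(f"{name}: grid argmin {grid[el.argmin()]:.3f}, closed form {closed:.3f}")
```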

Example 1: Forecasting a Bernoulli variable

[Figure: expected loss as a function of the prediction for a Bernoulli random variable ($p = 0.25$) under MSE, MAE, Linex ($a = 1.5$), and Linlin ($\alpha = 0.8$ and $\alpha = 0.1$) loss]

Example 1: Forecasting a Bernoulli variable

[Figure: expected loss as a function of the prediction for a Bernoulli random variable ($p = 0.6$) under MSE, MAE, Linex ($a = 1.5$), and Linlin ($\alpha = 0.8$ and $\alpha = 0.1$) loss]

Optimal forecast under MSE loss

Loss function:
$$L_{\text{MSE}}(Y_{t+h} - Y_{t+h|t}) = (Y_{t+h} - Y_{t+h|t})^2$$

As shown above, the optimal forecast is
$$Y^\ast_{t+h|t} = E[Y_{t+h} \mid I_t],$$
irrespective of the distribution of $Y_{t+h}$.

Optimal forecast under MAE loss

Loss function:
$$L_{\text{MAE}}(Y_{t+h} - Y_{t+h|t}) = |Y_{t+h} - Y_{t+h|t}|$$

The optimal forecast is
$$Y^\ast_{t+h|t} = \operatorname{Med}[Y_{t+h} \mid I_t],$$
irrespective of the distribution of $Y_{t+h}$.

Proof of $Y^\ast_{t+h|t} = \operatorname{Med}[Y_{t+h} \mid I_t]$:

Assumption: the conditional distribution of $Y_{t+h}$ is continuous.

Step 1: rewrite expected loss, abbreviating $E(\cdot \mid I_t)$ by $E_t(\cdot)$, $Y_{t+h}$ by $Y$, and $Y_{t+h|t}$ by $\hat{Y}$:
$$E_t(|Y - \hat{Y}|) = \int_{-\infty}^{\infty} |y - \hat{Y}|\, f_t(y)\, dy = \int_{-\infty}^{\hat{Y}} (\hat{Y} - y) f_t(y)\, dy + \int_{\hat{Y}}^{\infty} (y - \hat{Y}) f_t(y)\, dy,$$
where $f_t(y)$ is the conditional density of $Y_{t+h}$ given $I_t$. For later use, also define $F_t(y)$ as the conditional cdf of $Y_{t+h}$ given $I_t$.

Step 2: use Leibniz's rule for differentiation,
$$\frac{\partial}{\partial x} \int_{a(x)}^{b(x)} g(y, x)\, dy = \int_{a(x)}^{b(x)} \frac{\partial g(y, x)}{\partial x}\, dy + g(b(x), x)\, \frac{\partial b(x)}{\partial x} - g(a(x), x)\, \frac{\partial a(x)}{\partial x}.$$

Here:
$$\frac{\partial}{\partial \hat{Y}} \int_{-\infty}^{\hat{Y}} (\hat{Y} - y) f_t(y)\, dy = \int_{-\infty}^{\hat{Y}} \frac{\partial (\hat{Y} - y)}{\partial \hat{Y}}\, f_t(y)\, dy + \underbrace{(\hat{Y} - \hat{Y}) f_t(\hat{Y})}_{=0} \cdot 1 - 0 = \int_{-\infty}^{\hat{Y}} f_t(y)\, dy = F_t(\hat{Y})$$

and
$$\frac{\partial}{\partial \hat{Y}} \int_{\hat{Y}}^{\infty} (y - \hat{Y}) f_t(y)\, dy = \int_{\hat{Y}}^{\infty} \frac{\partial (y - \hat{Y})}{\partial \hat{Y}}\, f_t(y)\, dy + 0 - 0 = -\int_{\hat{Y}}^{\infty} f_t(y)\, dy = -P_t(Y > \hat{Y}) = -[1 - P_t(Y \le \hat{Y})] = F_t(\hat{Y}) - 1.$$

Putting the results together:
$$\frac{\partial}{\partial \hat{Y}}\, E_t(|Y - \hat{Y}|) = 2 F_t(\hat{Y}) - 1.$$

Step 3: solve the FOC
$$2 F_t(\hat{Y}) - 1 = 0 \iff F_t(\hat{Y}) = \frac{1}{2}.$$

By the definition of the median,
$$\hat{Y} = \operatorname{Med}_t[Y],$$
or, in full notation,
$$Y^\ast_{t+h|t} = \operatorname{Med}[Y_{t+h} \mid I_t].$$

Optimal forecast under Linlin loss

Loss function:
$$L_{\text{LL}}(Y_{t+h} - Y_{t+h|t}) = [\alpha + (1 - 2\alpha)\,\mathbf{1}_{Y_{t+h} < Y_{t+h|t}}]\; |Y_{t+h} - Y_{t+h|t}|, \qquad \alpha \in (0, 1)$$

The optimal forecast is
$$Y^\ast_{t+h|t} = \operatorname{Quantile}_\alpha[Y_{t+h} \mid I_t],$$
irrespective of the distribution of $Y_{t+h}$.

Proof of $Y^\ast_{t+h|t} = \operatorname{Quantile}_\alpha[Y_{t+h} \mid I_t]$:

The proof is similar to the proof shown above for MAE loss. You will arrive at the condition
$$F_t(\hat{Y}) = \alpha,$$
which is solved by the $\alpha$-quantile of the conditional distribution of $Y_{t+h}$.

Optimal forecast under Linex loss

Loss function:
$$L_{\text{Linex}}(Y - \hat{Y}; b) = \exp(b(Y - \hat{Y})) - b(Y - \hat{Y}) - 1$$

The optimal forecast is
$$Y^\ast_{t+h|t} = \frac{1}{b}\, \log E[\exp(b\, Y_{t+h}) \mid I_t],$$
irrespective of the distribution of $Y_{t+h}$.

Under conditional normality, i.e., $Y_{t+h} \mid I_t \sim N(\mu_{t+h|t}, \sigma^2_{t+h|t})$, the optimal forecast is
$$Y^\ast_{t+h|t} = \mu_{t+h|t} + \frac{b}{2}\, \sigma^2_{t+h|t},$$
where $\mu_{t+h|t}$ is the conditional mean and $\sigma^2_{t+h|t}$ is the conditional variance.

Proof of the general result: shown in class.

Proof of the result under normality:

Recall the moment generating function of a conditionally normal random variable:
$$m_t(s) = E[\exp(s\, Y_{t+h}) \mid I_t] = \exp(\mu_{t+h|t}\, s + 0.5\, \sigma^2_{t+h|t}\, s^2).$$

Thus, the general result simplifies as follows:
$$Y^\ast_{t+h|t} = \frac{1}{b} \log E[\exp(b\, Y_{t+h}) \mid I_t] = \frac{1}{b} \log m_t(b) = \frac{1}{b} \log\left[\exp(\mu_{t+h|t}\, b + 0.5\, \sigma^2_{t+h|t}\, b^2)\right] = \mu_{t+h|t} + \frac{b}{2}\, \sigma^2_{t+h|t}.$$

Example 2: Forecasting a normal variable

Suppose we want to forecast a variable $Y_{t+h}$ that is conditionally normal,
$$Y_{t+h} \mid I_t \sim N(\mu_{t+h|t}, \sigma^2_{t+h|t}).$$

The optimal forecast $Y^\ast_{t+h|t}$ depends on the loss function:

- MSE loss leads to $Y^\ast_{t+h|t} = E(Y_{t+h} \mid I_t) = \mu_{t+h|t}$.
- MAE loss leads to $Y^\ast_{t+h|t} = \operatorname{Med}(Y_{t+h} \mid I_t) = \mu_{t+h|t}$.
- Linlin loss leads to $Y^\ast_{t+h|t} = \operatorname{Quantile}_\alpha[Y_{t+h} \mid I_t]$.
- Linex loss leads to $Y^\ast_{t+h|t} = \mu_{t+h|t} + \frac{b}{2}\sigma^2_{t+h|t}$.

Under MSE and MAE loss, we only need to find the conditional mean $\mu_{t+h|t}$. Under asymmetric loss, we need to find the conditional variance $\sigma^2_{t+h|t}$, too.

Example 3: Forecasting German GDP growth

Now assume the 1-year-ahead forecast of German GDP growth follows a conditional normal distribution. A forecasting model (for example, a VAR) yields
$$E(Y_{t+h} \mid I_t) = \mu_{t+h|t} = 2.0 \text{ (percent)}.$$

The conditional variance is assumed to be constant over time. Then a value of $\sigma^2_{t+h|t} = 1$ fits the empirical distribution fairly well.

What should we predict?

Example 3: Forecasting German GDP growth
Loss functions

Let us think about loss functions. For many people, overprediction (= negative surprise) is more costly than underprediction (= positive surprise).

Suppose the German finance minister has a 50 percent harder time in Parliament when MPs find out he overpredicted GDP (and thereby his budget) by 0.5 percentage points than when he underpredicted GDP by 0.5 percentage points.

Let us thus parameterize Linlin and Linex loss such that $L(-0.5) = 1.50\, L(0.5)$.

This requires $\alpha = 0.4$ for Linlin loss and $b = -1.2147$ for Linex loss.

Example 3: Forecasting German GDP growth
Loss functions

[Figure: Linlin loss with $\alpha = 0.4$ (blue) and Linex loss with $b = -1.2147$ (red)]

Linlin loss ($\alpha = 0.4$): $L(-0.5) = 1.50\, L(0.5)$ and $L(-1.0) = 1.50\, L(1.0)$
Linex loss ($b = -1.21$): $L(-0.5) = 1.50\, L(0.5)$ and $L(-1.0) = 2.26\, L(1.0)$

Example 3: Forecasting German GDP growth
Optimal forecasts

The optimal forecasts $Y^\ast_{t+h|t}$ for the different loss functions are:

- MSE loss: $Y^\ast_{t+h|t} = 2.0$.
- MAE loss: $Y^\ast_{t+h|t} = 2.0$.
- Linlin loss ($\alpha = 0.4$): $Y^\ast_{t+h|t} = 2.0 - 0.25 = 1.75$.
- Linex loss ($b = -1.21$): $Y^\ast_{t+h|t} = 2.0 - \frac{1.21}{2} \cdot 1 \approx 1.4$.

The optimal forecast under Linlin loss is calculated as follows:

We know that $Y^\ast_{t+h|t}$ is the $\alpha$-quantile of the conditional distribution. Hence,
$$\alpha = \Pr(Y \le \hat{Y}) = \Pr(Y - \mu_{t+h|t} \le \hat{Y} - \mu_{t+h|t}).$$

Noting that $Y - \mu_{t+h|t}$ follows a standard normal distribution (given $I_t$, with $\sigma^2_{t+h|t} = 1$), this simplifies to
$$\alpha = \Phi(\hat{Y} - \mu_{t+h|t}),$$
where $\Phi(\cdot)$ is the standard normal cdf, and thus
$$\Phi^{-1}(\alpha) = \hat{Y} - \mu_{t+h|t}.$$

Solving for $\hat{Y}$ yields
$$\hat{Y} = \mu_{t+h|t} + \Phi^{-1}(\alpha) = 2.0 + \Phi^{-1}(0.4) = 2.0 - 0.25 = 1.75.$$
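A quick check of these numbers with scipy (illustrative, using $\mu = 2$ and $\sigma^2 = 1$ from the example):

```python
# Optimal Linlin (0.4-quantile) and Linex forecasts under N(2, 1).
from scipy.stats import norm

mu, sigma2, alpha, b = 2.0, 1.0, 0.4, -1.2147
print(mu + norm.ppf(alpha))        # Linlin: 2.0 - 0.2533... ~ 1.75
print(mu + b * sigma2 / 2.0)       # Linex:  2.0 - 0.6074... ~ 1.39
```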


5. Properties of optimal forecasts

MSE loss

Loss: $L_{\text{MSE}}(Y_{t+h} - Y_{t+h|t}) = (Y_{t+h} - Y_{t+h|t})^2$

Optimal forecast: $Y^\ast_{t+h|t} = E[Y_{t+h} \mid I_t]$

Forecast error: $e_{t+h|t} \equiv Y_{t+h} - Y^\ast_{t+h|t}$

Properties:

- Unbiasedness: $E(e_{t+h|t}) = 0$
- Unpredictability: $\operatorname{Cov}(e_{t+h|t}, Z_t) = 0$ for all $Z_t \in I_t$
- Increasing variance: $\operatorname{Var}(e_{t+h|t}) \le \operatorname{Var}(e_{t+h+k|t})$ for all $k \ge 0$
- Note that unpredictability implies that $e_{t+h|t}$ is autocorrelated at most of order $h - 1$. Consequently, for $h = 1$ the optimal forecast error is white noise.

Proof of unbiasedness:

$$E(e_{t+h|t} \mid I_t) = E(Y_{t+h} - Y^\ast_{t+h|t} \mid I_t) = E(Y_{t+h} \mid I_t) - E(Y_{t+h} \mid I_t) = 0$$

By the LIE (law of iterated expectations),
$$E(e_{t+h|t}) = E[E(e_{t+h|t} \mid I_t)] = E[0] = 0.$$

Proof of unpredictability:

Recall that the conditional expectation is zero: $E(e_{t+h|t} \mid I_t) = 0$.

As $Z_t \in I_t$, this implies $E(e_{t+h|t} \mid Z_t) = 0$.

Hence, the covariance is zero: $\operatorname{Cov}(e_{t+h|t}, Z_t) = 0$.

Proof of increasing variance:

By the law of total variance,
$$\operatorname{Var}(e_{t+h|t}) = E[\operatorname{Var}(e_{t+h|t} \mid I_t)] + \operatorname{Var}[\underbrace{E(e_{t+h|t} \mid I_t)}_{=0}] = E[\operatorname{Var}(e_{t+h|t} \mid I_t)].$$

By theorem CV.4 of Wooldridge (2002), since $I_{t-k} \subseteq I_t$,
$$E[\operatorname{Var}(e_{t+h|t} \mid I_t)] \le E[\operatorname{Var}(e_{t+h|t} \mid I_{t-k})].$$

By strict stationarity,
$$E[\operatorname{Var}(e_{t+h|t} \mid I_{t-k})] = E[\operatorname{Var}(e_{t+h+k|t} \mid I_t)].$$

Taken together,
$$\operatorname{Var}(e_{t+h|t}) = E[\operatorname{Var}(e_{t+h|t} \mid I_t)] \le E[\operatorname{Var}(e_{t+h+k|t} \mid I_t)] = \operatorname{Var}(e_{t+h+k|t}).$$

MAE loss

Loss function: $L_{\text{MAE}}(Y_{t+h} - Y_{t+h|t}) = |Y_{t+h} - Y_{t+h|t}|$

Optimal forecast: $Y^\ast_{t+h|t} = \operatorname{Med}[Y_{t+h} \mid I_t]$

Properties:

- Median unbiasedness: $\operatorname{Med}(e_{t+h|t}) = 0$
- Median unpredictability: $E\left[\left(\mathbf{1}_{e_{t+h|t} \le 0} - \frac{1}{2}\right) Z_t\right] = 0$ for all $Z_t \in I_t$
- Increasing imprecision: $E(|e_{t+h|t}|) \le E(|e_{t+h+k|t}|)$ for all $k \ge 0$
- This implies that for asymmetric distributions, the optimal forecast under MAE loss differs from the optimal forecast under MSE loss.

Proof of median unbiasedness:

By the definition of the median,
$$0.5 = P_t(Y_{t+h} \le Y^\ast_{t+h|t}) = P_t(Y_{t+h} - Y^\ast_{t+h|t} \le 0) = P_t(e_{t+h|t} \le 0).$$

Hence, the conditional median is zero:
$$\operatorname{Med}_t(e_{t+h|t}) = 0.$$

As shown in class, the unconditional median is zero as well.

Proof of median unpredictability: shown in class

Proof of increasing imprecision: see Patton and Timmermann (2010)

Lin-lin loss

Loss function: $L_{\text{LL}}(Y_{t+h} - Y_{t+h|t}) = [\alpha + (1 - 2\alpha)\,\mathbf{1}_{Y_{t+h} < Y_{t+h|t}}]\; |Y_{t+h} - Y_{t+h|t}|$, $\alpha \in (0, 1)$

Optimal forecast: $Y^\ast_{t+h|t} = \operatorname{Quantile}_\alpha[Y_{t+h} \mid I_t]$

Properties:

- see below for general loss functions

Linex loss

Loss function: $L_{\text{Linex}}(Y_{t+h} - Y_{t+h|t}) = \exp(b(Y_{t+h} - Y_{t+h|t})) - b(Y_{t+h} - Y_{t+h|t}) - 1$

Optimal forecast: $Y^\ast_{t+h|t} = \frac{1}{b} \log E[\exp(b\, Y_{t+h}) \mid I_t]$

Properties:

- see below for general loss functions

General loss function

General loss: $L(Y_{t+h} - Y_{t+h|t})$

Optimal forecast: under some regularity conditions, the optimal forecast satisfies
$$E\left(\partial L(Y_{t+h} - Y_{t+h|t})/\partial Y_{t+h|t} \mid I_t\right) = 0$$

Generalized forecast error: $\psi_{t+h|t} \equiv \partial L(Y_{t+h} - Y_{t+h|t})/\partial Y_{t+h|t}$

Properties:

- Unbiasedness: $E(\psi_{t+h|t}) = E(\psi_{t+h|t} \mid I_t) = 0$
- Unpredictability: $\operatorname{Cov}(\psi_{t+h|t}, Z_t) = 0$ for all $Z_t \in I_t$
- Increasing loss: $E(L(Y_{t+h} - Y^\ast_{t+h|t})) \le E(L(Y_{t+h} - Y^\ast_{t+h|t-k}))$ for all $k \ge 0$
- Note that unpredictability implies that $\psi_{t+h|t}$ is autocorrelated at most of order $h - 1$. Consequently, for $h = 1$ the optimal generalized forecast error is white noise.

Unknown loss function

- Often the forecast producer (e.g., the Fed) is separate from the forecast consumer (e.g., us).
- In this case, the loss function used to obtain the forecast may be unknown.
- Still, under mild conditions, we can derive some properties an optimal forecast must satisfy. This allows us to test whether the forecast producer did a good job.
- See Patton and Timmermann (2007) for details.
- (Note: Patton and Timmermann (2007) derive more results than the ones we consider in the following.)

Unknown loss function
Assumptions

- Assumption on the data generating process: the target variable is conditionally homoskedastic, i.e.,
  $$Y_{t+h} = \mu_{t+h|t} + \varepsilon_{t+h}, \qquad \varepsilon_{t+h} \mid I_t \sim F_{\varepsilon,h}(0, \sigma^2_{\varepsilon,h}),$$
  where $F_{\varepsilon,h}(0, \sigma^2_{\varepsilon,h})$ is some distribution with mean zero and variance $\sigma^2_{\varepsilon,h}$, which may depend on $h$ but does not depend on $I_t$.
- Assumption on the loss function: the loss function is a function solely of the forecast error, i.e.,
  $$L(Y_{t+h} - Y_{t+h|t}) = L(e_{t+h|t}).$$

Unknown loss function
Properties

1. The optimal forecast takes the form
   $$Y^\ast_{t+h|t} = \mu_{t+h|t} + \alpha_h,$$
   where $\alpha_h$ is a constant that depends on $L$ and $F_{\varepsilon,h}$, but not on $I_t$.

2. The optimal forecast error $e^\ast_{t+h|t}$ is independent of all $Z_t \in I_t$, since
   $$e^\ast_{t+h|t} = Y_{t+h} - Y^\ast_{t+h|t} = -\alpha_h + \varepsilon_{t+h},$$
   where $\varepsilon_{t+h} \mid I_t \sim F_{\varepsilon,h}(0, \sigma^2_{\varepsilon,h})$. Thus,
   $$E(e^\ast_{t+h|t} \mid I_t) = -\alpha_h.$$

Unknown loss function
Properties (continued)

3. The optimal forecast is such that, for all $t$,
   $$F_{t+h|t}(Y^\ast_{t+h|t}) = q_h,$$
   where $q_h \in (0, 1)$ depends only on the forecast horizon and the loss function. If $F_{t+h|t}$ is continuous and strictly increasing, then we obtain
   $$Y^\ast_{t+h|t} = F^{-1}_{t+h|t}(q_h) = \operatorname{Quantile}_{q_h}(Y_{t+h} \mid I_t).$$

4. The variable
   $$I_{t+h|t} = \mathbf{1}_{Y_{t+h} \le Y^\ast_{t+h|t}} = \mathbf{1}_{e^\ast_{t+h|t} \le 0}$$
   is independent of all $Z_t \in I_t$, and thus $\operatorname{Cov}(I_{t+h|t}, Z_t) = 0$.


Unknown loss function


Proofs

See Patton and Timmermann (2007).


6. Evaluating the efficiency of a forecast

Notation

- Forecast horizon: $h$
- Forecast sample: $t + h = T_1, \ldots, T_2$
- Number of h-step forecasts starting from the baseline sample: $T = T_2 - T_1 + 1$

MSE loss
Test of unbiasedness

Hypotheses: $H_0: E(e_{t+h|t}) = 0$ versus $H_1: E(e_{t+h|t}) \neq 0$

The sample equivalent of the expected value is
$$\bar{e}_h = T^{-1} \sum_{t+h=T_1}^{T_2} e_{t+h|t}.$$

Asymptotic distribution (as $T \to \infty$):
$$\sqrt{T}\, \bar{e}_h \xrightarrow{d} N(0, \omega^2),$$
where
$$\omega^2 = \lim_{T \to \infty} \operatorname{Var}(\sqrt{T}\, \bar{e}_h)$$
is the long-run variance of $e_{t+h|t}$.

MSE loss
Test of unbiasedness (continued)

Note that for $h > 1$, the $e_{t+h|t}$'s are autocorrelated. Therefore, to estimate $\omega^2$ you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$ because higher-order autocorrelation is excluded under the null).

Test statistic:
$$t = \frac{\bar{e}_h}{\hat{\omega}/\sqrt{T}}$$

The test statistic is asymptotically $N(0, 1)$. (A code sketch follows below.)
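A minimal sketch of this test, assuming `e` holds the h-step forecast errors; regressing on a constant with HAC standard errors delivers the robust t-statistic:

```python
# t-test of H0: E(e_{t+h|t}) = 0 with a Newey-West variance (lag window h-1).
import numpy as np
import statsmodels.api as sm

def unbiasedness_test(e, h):
    X = np.ones((len(e), 1))                     # regression on a constant only
    res = sm.OLS(e, X).fit(cov_type="HAC", cov_kwds={"maxlags": h - 1})
    return float(res.tvalues[0]), float(res.pvalues[0])  # asymptotically N(0,1)
```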

MSE loss
Test of unpredictability

Null hypothesis: $\operatorname{Cov}(e_{t+h|t}, Z_t) = 0$ for all $Z_t \in I_t$

Implement as a regression on the known forecast error and other variables $Z_{it} \in I_t$:
$$e_{t+h|t} = \beta_0 + \beta_1 e_{t|t-h} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_0 = \beta_1 = \gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+2)$ distributed.

Note that for $h > 1$, the $e_{t+h|t}$'s (and thus the $v_{t+h}$'s) are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

MSE loss
Test of unpredictability: Mincer-Zarnowitz regression

Similar regression approach (pioneered by J. Mincer and V. Zarnowitz, 1969):
$$Y_{t+h} = \beta_0 + \beta_1 Y_{t+h|t} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_0 = 0$ and $\beta_1 = 1$ versus $H_1: \text{not } H_0$

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(2)$ distributed. (A code sketch follows below.)

Note again that for $h > 1$, the $v_{t+h}$'s are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).
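A minimal sketch of the Mincer-Zarnowitz test, assuming arrays `y` (realizations) and `yhat` (forecasts) over the forecast sample:

```python
# Mincer-Zarnowitz regression with HAC variance; Wald test of (b0, b1) = (0, 1).
import numpy as np
import statsmodels.api as sm

def mincer_zarnowitz_test(y, yhat, h):
    X = sm.add_constant(yhat)
    res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": h - 1})
    R, q = np.eye(2), np.array([0.0, 1.0])       # H0: beta0 = 0, beta1 = 1
    return res.wald_test((R, q), use_f=False)    # asymptotically chi^2(2)
```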

MSE loss
Test of unpredictability: augmented Mincer-Zarnowitz regression

Augment the Mincer-Zarnowitz regression by other variables $Z_{it} \in I_t$:
$$Y_{t+h} = \beta_0 + \beta_1 Y_{t+h|t} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_0 = 0$, $\beta_1 = 1$, and $\gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+2)$ distributed.

Note again that for $h > 1$, the $v_{t+h}$'s are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

MSE loss
Test of increasing variance

- When forecasts for multiple horizons (e.g., $h = 1, 2, 3$) are available, we can test the inequality
  $$\operatorname{Var}(e_{t+h|t}) \le \operatorname{Var}(e_{t+h+k|t}).$$
- See Patton and Timmermann (2012) for details.
- Not often found in the literature (note: variances are estimated much more imprecisely than means, hence you need a lot of data for this test to be really informative).

MAE loss
Test of median unbiasedness

Hypotheses: $H_0: E\left(\mathbf{1}_{e_{t+h|t} \le 0} - \frac{1}{2}\right) = 0$ versus $H_1: \text{not } H_0$

Let us denote
$$\lambda_{t,h} = \mathbf{1}_{e_{t+h|t} \le 0} - \frac{1}{2}.$$

The sample equivalent of the expected value is
$$\bar{\lambda}_h = T^{-1} \sum_{t+h=T_1}^{T_2} \lambda_{t,h} = T^{-1} \sum_{t+h=T_1}^{T_2} \left(\mathbf{1}_{e_{t+h|t} \le 0} - \frac{1}{2}\right) = \frac{\#(e_{t+h|t} \le 0)}{T} - \frac{1}{2}.$$

Test statistic:
$$t = \frac{\bar{\lambda}_h}{\hat{\omega}/\sqrt{T}} \xrightarrow{d} N(0, 1),$$
where $\hat{\omega}^2$ is (for $h > 1$) an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

MAE loss
Test of median unpredictability

Null hypothesis: $E\left[\left(\mathbf{1}_{e_{t+h|t} \le 0} - \frac{1}{2}\right) Z_t\right] = E[\lambda_{t,h} Z_t] = 0$ for all $Z_t \in I_t$

Implement as a regression on the known forecast error and other variables $Z_{it} \in I_t$:
$$\lambda_{t,h} = \beta_0 + \beta_1 e_{t|t-h} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_0 = \beta_1 = \gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+2)$ distributed.

Note that for $h > 1$, the $e_{t+h|t}$'s (and thus the $v_{t+h}$'s) are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

General loss function
Test of unbiasedness of the generalized forecast error

Hypotheses: $H_0: E(\psi_{t+h|t}) = 0$ versus $H_1: \text{not } H_0$

The sample equivalent of the expected value is
$$\bar{\psi}_h = T^{-1} \sum_{t+h=T_1}^{T_2} \psi_{t+h|t}.$$

Test statistic:
$$t = \frac{\bar{\psi}_h}{\hat{\omega}/\sqrt{T}} \xrightarrow{d} N(0, 1),$$
where $\hat{\omega}^2$ is (for $h > 1$) an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

General loss function
Test of unpredictability of the generalized forecast error

Null hypothesis: $\operatorname{Cov}(\psi_{t+h|t}, Z_t) = 0$ for all $Z_t \in I_t$

Implement as a regression on the known generalized forecast error and other variables $Z_{it} \in I_t$:
$$\psi_{t+h|t} = \beta_0 + \beta_1 \psi_{t|t-h} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_0 = \beta_1 = \gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+2)$ distributed.

Note that for $h > 1$, the $\psi_{t+h|t}$'s (and thus the $v_{t+h}$'s) are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

Unknown loss function
Unpredictability of the forecast error (version 1)

Null hypothesis: $H_0: E(e_{t+h|t} \mid I_t) = -\alpha_h$, which implies $\operatorname{Cov}(e_{t+h|t}, Z_t) = 0$ for all $Z_t \in I_t$

Implement as a regression on the known forecast error and other variables $Z_{it} \in I_t$:
$$e_{t+h|t} = \beta_0 + \beta_1 e_{t|t-h} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_1 = \gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Note: we do not restrict $\beta_0$ to zero!

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+1)$ distributed.

Note that for $h > 1$, the $e_{t+h|t}$'s (and thus the $v_{t+h}$'s) are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).

Unknown loss function
Unpredictability of the forecast error (version 2)

Null hypothesis: $H_0: \operatorname{Cov}(I_{t+h|t}, Z_t) = \operatorname{Cov}(\mathbf{1}_{e_{t+h|t} \le 0}, Z_t) = 0$ for all $Z_t \in I_t$

Implement as a regression on the known dummy and other variables $Z_{it} \in I_t$:
$$I_{t+h|t} = \beta_0 + \beta_1 I_{t|t-h} + \gamma_1 Z_{1t} + \cdots + \gamma_k Z_{kt} + v_{t+h}, \qquad t + h = T_1, \ldots, T_2.$$

Hypotheses: $H_0: \beta_1 = \gamma_1 = \cdots = \gamma_k = 0$ versus $H_1: \text{not } H_0$

Note: we do not restrict $\beta_0$ to zero!

Test statistic: the usual Wald statistic, which is asymptotically $\chi^2(k+1)$ distributed.

Note that for $h > 1$, the $I_{t+h|t}$'s (and thus the $v_{t+h}$'s) are autocorrelated. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West (where you can set the lag window to $h - 1$).


7. Inferring the loss function

References

- Elliott, Timmermann, Komunjer (2005), Estimation and testing of forecast rationality under flexible loss, Review of Economic Studies 72(4), 1107-1125.
- Elliott, Komunjer, Timmermann (2008), Biases in macroeconomic forecasts: irrationality or asymmetric loss?, Journal of the European Economic Association 6, 122-157.


8. Comparing forecast accuracy

Introduction

- There are two types of forecast comparisons:
  - Ex-post comparison: you observe two forecast series and the realizations.
  - Out-of-sample forecast experiment: you generate "as if" forecasts from different forecast methods or models to evaluate their accuracy.
- A common approach when comparing two observed forecasts is to compare their average losses.
- A popular descriptive tool is Theil's U.
- To obtain a significance statement, one may test the null hypothesis that the two observed forecasts lead to the same loss.
- The appropriate test depends on the setting. The Diebold-Mariano test is used frequently.

Ex-post comparison of two forecast series

- You observe two (or more) h-step forecasts of $Y_{t+h}$ over the sample $t + h = T_1, \ldots, T_2$.
- Example: US 4-quarter-ahead inflation forecasts published by the Fed (Green Book) and by Consensus Economics for 1980Q1-2009Q4.
- You know the realized values of $Y_{t+h}$.
- You ask: which forecast is (systematically) better?
- Dimensions of comparison:
  - Descriptive statistics (such as average losses, Theil's U)
  - Tests of forecast efficiency
  - Tests of the null hypothesis that the accuracy of two different forecasts does not differ systematically

Out-of-sample forecast experiment

- You have a sample $t = 1, \ldots, T_2$ of data $Y_t$ (and typically some covariates $X_t$).
- You ask: which model would have generated the best h-step forecasts of $Y_t$ in the past (and thus perhaps also in the future)?
- Split the sample into a baseline sample $t = 1, \ldots, T_1 - h$ and a forecast sample $t = T_1, \ldots, T_2$. Then proceed as follows:
  - Estimate model 1 from the sample $t = 1, \ldots, T_1 - h$ and compute $\hat{Y}_{1,T_1|T_1-h}$.
  - Estimate model 1 from the sample $t = 1, \ldots, T_1 - h + 1$ and compute $\hat{Y}_{1,T_1+1|T_1-h+1}$.
  - ...
  - Estimate model 1 from the sample $t = 1, \ldots, T_2 - h$ and compute $\hat{Y}_{1,T_2|T_2-h}$.
  - Based on $\hat{Y}_{1,T_1|T_1-h}, \ldots, \hat{Y}_{1,T_2|T_2-h}$, compute the losses $L_{1,T_1}, \ldots, L_{1,T_2}$.
  - Replicate this for all models.
- To compare the results, use descriptive statistics (such as average losses, Theil's U) and tests of the null hypothesis that the forecast accuracy of two models does not differ systematically.

Theil's U

Theil's U is usually defined in terms of the root of the MSE, but in principle it can be applied to any loss function:
$$U = \frac{\text{RMSE}(y_{t+h|t})}{\text{RMSE}(\tilde{y}_{t+h|t})} = \frac{\sqrt{\frac{1}{T} \sum_{t+h=T_1}^{T_2} (y_{t+h} - y_{t+h|t})^2}}{\sqrt{\frac{1}{T} \sum_{t+h=T_1}^{T_2} (y_{t+h} - \tilde{y}_{t+h|t})^2}},$$
where $\tilde{y}_{t+h|t}$ denotes a naive or simple baseline forecast, e.g.,

- $\tilde{y}_{t+h|t} = 0$
- $\tilde{y}_{t+h|t} = y_t$
- $\tilde{y}_{t+h|t}$ from an AR(1) model

(A code sketch follows below.)
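A minimal sketch, assuming arrays `y`, `yhat`, and a naive benchmark `ynaive` over the forecast sample:

```python
# Theil's U: RMSE of the candidate forecast relative to a naive benchmark.
import numpy as np

def theils_u(y, yhat, ynaive):
    rmse = lambda f: np.sqrt(np.mean((y - f) ** 2))
    return rmse(yhat) / rmse(ynaive)   # U < 1: candidate beats the benchmark

# Example with the random-walk benchmark y_{t+1|t} = y_t (1-step case):
# u = theils_u(y[1:], yhat[1:], y[:-1])
```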

Diebold-Mariano test
Setup

- Test of equal accuracy of two competing (non-nested) forecasts $y_{1,t+h|t}$ and $y_{2,t+h|t}$ for the target variable $y_{t+h}$, $t + h = T_1, \ldots, T_2$.
- Denote the forecast errors by $e_{1,t+h|t}$ and $e_{2,t+h|t}$, respectively.
- Accuracy of the forecasts is measured by some loss function $L(\cdot)$. Here, we focus on error-based loss functions $L(e_{t+h|t})$.
- The loss differential is denoted by
  $$d_{h,t} = L(e_{1,t+h|t}) - L(e_{2,t+h|t}).$$
- This leads to the average loss differential
  $$\bar{d}_h = \frac{1}{T} \sum_{t+h=T_1}^{T_2} [L(e_{1,t+h|t}) - L(e_{2,t+h|t})] = \frac{1}{T} \sum_{t+h=T_1}^{T_2} d_{h,t}, \qquad T = T_2 - T_1 + 1.$$

Diebold-Mariano test
Test statistic

- Hypotheses:
  $$H_0: E[d_{h,t}] = 0 \quad \text{vs.} \quad H_1: E[d_{h,t}] \neq 0$$
- The test statistic converges (as $T \to \infty$):
  $$DM = \frac{\bar{d}_h}{\hat{\omega}/\sqrt{T}} \xrightarrow{d} N(0, 1).$$
- This is an asymptotic result. It cannot be expected to approximate the unknown finite-sample distribution well for small forecast samples $T$.
- Note that the $d_{h,t}$'s may be autocorrelated. This holds even under MSE loss and even if $h = 1$, because we do not know whether the forecasts are optimal. Therefore, you have to use an autocorrelation-robust estimator of the variance like Newey-West. Set the lag window at least to $h - 1$.
- Take into account that Newey-West type variance estimators may perform poorly in small samples. Hence, interpret the test results with caution. (A code sketch follows below.)
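A minimal sketch of the DM test, with the Bartlett/Newey-West long-run variance computed inline (loss function and lag window are arguments):

```python
# Diebold-Mariano test of equal forecast accuracy.
import numpy as np
from scipy.stats import norm

def dm_test(e1, e2, lags, loss=lambda e: e ** 2):
    d = loss(np.asarray(e1)) - loss(np.asarray(e2))   # loss differential d_{h,t}
    T, dbar = len(d), d.mean()
    u = d - dbar
    gamma = lambda j: u[j:] @ u[:-j] / T if j else u @ u / T   # autocovariances
    lrv = gamma(0) + 2 * sum((1 - j / (lags + 1)) * gamma(j)
                             for j in range(1, lags + 1))      # Bartlett weights
    dm = dbar / np.sqrt(lrv / T)
    return dm, 2 * (1 - norm.cdf(abs(dm)))            # asymptotically N(0,1)
```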
Sidestep: asymptotic properties of weakly stationary processes
Assumptions

Assume a stochastic process $y_t$ satisfies the following assumptions:
$$E[y_t] = \mu \quad \text{for all } t$$
$$\operatorname{Cov}[y_t, y_{t-j}] = \gamma_j \quad \text{for all } t$$
$$\sum_{j=0}^{\infty} |\gamma_j| < \infty$$

- The first two conditions define a weakly stationary process.
- The third condition (absolute summability of the autocovariances) guarantees that the stationary process is ergodic for the mean; it will be used to prove the following asymptotic results.

Sidestep: asymptotic properties of weakly stationary processes
Convergence of the sample mean and its asymptotic distribution

If the assumptions stated above are satisfied, then
$$\bar{y} = \frac{1}{T} \sum_{t=1}^{T} y_t \xrightarrow{p} \mu$$
and
$$\sqrt{T}(\bar{y} - \mu) \xrightarrow{d} N(0, \omega^2),$$
where the long-run variance is
$$\omega^2 = \lim_{T \to \infty} \operatorname{Var}[\sqrt{T}(\bar{y} - \mu)] = \lim_{T \to \infty} E[T(\bar{y} - \mu)^2] = \sum_{j=-\infty}^{\infty} \gamma_j.$$

- The proofs can be found in Hamilton (1994, p. 186ff.). Also see Econometrics II.

Sidestep: asymptotic properties of weakly stationary processes
Newey-West estimator

- An estimator for the long-run variance $\omega^2$ should be positive.
- Unfortunately, this is not so easy to guarantee.
- For example, the autocovariances $\gamma_j$ of an MA(h) process are zero for $j > h$. But the estimator
  $$\hat{\omega}^2 = \sum_{j=-h}^{h} \hat{\gamma}_j = \hat{\gamma}_0 + 2 \sum_{j=1}^{h} \hat{\gamma}_j,$$
  where the $\hat{\gamma}_j$ are the sample autocovariances, is not necessarily positive.
- Newey and West (1987) showed that the alternative estimator
  $$\hat{\omega}^2 = \sum_{j=-h}^{h} \left(1 - \frac{|j|}{h+1}\right) \hat{\gamma}_j = \hat{\gamma}_0 + 2 \sum_{j=1}^{h} \left(1 - \frac{j}{h+1}\right) \hat{\gamma}_j$$
  is consistent and positive. (A code sketch follows below.)
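A compact implementation sketch of this estimator (Bartlett kernel, lag window h) for a scalar series:

```python
# Newey-West estimate of the long-run variance of a scalar series y.
import numpy as np

def newey_west_lrv(y, h):
    u = np.asarray(y) - np.mean(y)
    T = len(u)
    gamma = lambda j: u[j:] @ u[:-j] / T if j else u @ u / T  # sample autocovariances
    return gamma(0) + 2 * sum((1 - j / (h + 1)) * gamma(j) for j in range(1, h + 1))
```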

Sidestep: asymptotic properties of weakly stationary processes
Kernels

- The term
  $$1 - \frac{|j|}{h+1}$$
  is called a kernel with bandwidth or lag window $h$ (because autocovariances of order larger than $h$ are neglected).
- It leads to a downweighting of distant autocovariances. For example, for a bandwidth of $h = 3$, the weights are
  $$\frac{1}{4},\ \frac{2}{4},\ \frac{3}{4},\ 1,\ \frac{3}{4},\ \frac{2}{4},\ \frac{1}{4}.$$
- Other kernels (Parzen kernel, quadratic kernel) have been proposed; see Hamilton (1994, p. 281ff.) and Andrews (1991, Econometrica).
- Andrews (1991) also suggests a plug-in estimator for the bandwidth.

Modified Diebold-Mariano test

- Harvey, Leybourne and Newbold (1997) propose the modified DM statistic
  $$MDM = T^{-1/2} \left[T + 1 - 2h + T^{-1} h(h-1)\right]^{1/2} DM.$$
- For small values of $T$, the MDM test may have better size properties when critical values from the t-distribution with $T - 1$ degrees of freedom are used. (A small sketch follows below.)
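The correction is a one-liner on top of the DM statistic from the sketch above:

```python
# Harvey-Leybourne-Newbold small-sample correction of the DM statistic.
import numpy as np

def mdm_statistic(dm, T, h):
    # Compare the result with critical values from the t(T-1) distribution.
    return np.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T) * dm
```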


9. Forecast methods in practice

Introduction

- The literature on forecast methods is vast and quickly evolving, particularly with respect to large-data methods.
- We cannot even give an overview here. But a few overview papers:
  - Stock and Watson (2006), Forecasting with many predictors, in: Elliott et al. (eds.), Handbook of Economic Forecasting 1, 516-550.
  - Stock and Watson (2010), Dynamic factor models, in: Clements and Hendry (eds.), Oxford Handbook of Economic Forecasting.
  - Stock and Watson (2012), Generalized shrinkage methods for forecasting using many predictors, JBES 30(4), 481-493.
  - Carriero, Kapetanios, Marcellino (2010), Forecasting large datasets with Bayesian reduced rank multivariate models, Journal of Applied Econometrics 26(5), 735-761.

Autoregressions

- Especially in macro forecasting, simple AR models are popular:
  $$y_t = a_1 y_{t-1} + \cdots + a_p y_{t-p} + u_t.$$
- Since AR models are special cases of the VAR model studied in the first half of the semester, all those results apply.
- In particular, due to estimation uncertainty, small lag orders should be preferred.
- In fact, the AR(1) model is typically difficult to beat.

AR-X models

- Improvements may be possible if the AR model is augmented by early indicators:
  $$y_t = a_1 y_{t-1} + b_1 x_{1,t-1} + \cdots + b_k x_{k,t-1} + u_t.$$
- Only the best indicators should be included, to minimize estimation uncertainty.
- How to compute multi-step forecasts?

VAR models

- Popular benchmark
- However, quickly overparameterized
- Valuable for multi-step forecasts

Multi-step forecasts
Direct forecasts

- Recall: forecasting $Y_{t+h}$ often requires knowledge of the conditional mean of $Y_{t+h}$ given $I_t$.
- To be parsimonious, let us assume that the relevant variables included in $I_t$ are $Y_t$ and $k$ indicator variables $X_t$.
- Now suppose we have a sample $t = 1, \ldots, T$ and want to forecast $Y_{T+h}$.
- Under stationarity and linearity, the sample counterpart to
  $$E[Y_{T+h} \mid I_T] = E[Y_{T+h} \mid Y_T, X_{1,T}, \ldots, X_{k,T}]$$
  is the regression
  $$y_{t+h} = \beta_0 + \beta_1 y_t + \beta_2 x_{1,t} + \cdots + \beta_{k+1} x_{k,t} + v_{t+h}, \qquad t = 1, \ldots, T - h.$$
- Estimating it by OLS allows us to make the prediction
  $$\hat{y}_{T+h} = \hat{\beta}_0 + \hat{\beta}_1 y_T + \hat{\beta}_2 x_{1,T} + \cdots + \hat{\beta}_{k+1} x_{k,T}.$$
- This is called a direct multistep forecast. (A code sketch follows below.)
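A minimal sketch of a direct h-step forecast, assuming `y` (target) and `X` (indicators observed at t) as arrays:

```python
# Direct h-step forecast: regress y_{t+h} on (1, y_t, x_t), apply the fit at t = T.
import numpy as np
import statsmodels.api as sm

def direct_forecast(y, X, h):
    Z = sm.add_constant(np.column_stack([y, X]))   # regressors dated t
    res = sm.OLS(y[h:], Z[:-h]).fit()              # leads of y on lagged regressors
    return float(res.params @ Z[-1])               # forecast of y_{T+h}
```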


Multi-step forecasts
Direct forecasts (continued)

- Estimating by OLS models of the type
  $$y_{t+h} = \beta_0 + \beta_1 y_t + \beta_2 x_{1,t} + \cdots + \beta_{k+1} x_{k,t} + v_{t+h}$$
  is, however, complicated by the fact that the disturbances are typically autocorrelated for $h > 1$, which reduces estimation efficiency.
- To see the point, consider $h = 2$.
- Then $v_{t+2}$ is due to unforeseeable shocks that occur in periods $t+1$ and $t+2$, while $v_{t+3}$ is due to unforeseeable shocks that occur in periods $t+2$ and $t+3$.
- Hence, the overlapping errors $v_{t+2}$ and $v_{t+3}$ are correlated.
- This is no surprise, since optimal forecast errors should have an MA(h-1) structure under MSE loss.

Multi-step forecasts
Iterated forecasts

- Alternatively, one may use iterated multistep forecasts.
- Advantage: fewer problems with autocorrelation.
- The difficulty is that iterated multistep forecasts require (auxiliary) equations for all explanatory variables.
- These are most easily obtained from a VAR model, where each variable has its own equation.
- Potential drawback: relatively large number of parameters.

Forecasting with large data sets

- AR-X and VAR models are only useful when the number of variables is small
- Today: hundreds or even thousands of variables
- Impossible to include them all in one AR-X or VAR model
- What to do?

Forecasting with large data sets

There are different ways to deal with large data sets:

- Information selection
  - Use information criteria, correlations with the target variable, t-values in AR-X models, etc., to select a few promising indicators. Include only them in the forecast model.
- Information pooling
  - Extract common features from the data. Include only them in the forecast model.
  - Popular method: factor models.
- Forecast pooling
  - Use many small models with just a few parameters to generate many forecasts.
  - Compute a (weighted) average of all forecasts. See below why this works.
- And many intermediate forms
  - Particularly, shrinkage methods like bootstrap aggregation (bagging) of variable selection methods, Bayesian estimation, LASSO, elastic net, and boosting.
  - For boosting, see Buchen and Wohlrabe (2011), Forecasting with many predictors: Is boosting a viable alternative?, Economics Letters 113(1), 16-18, and the references therein.


10. Forecast combinations

Background

- In many forecast situations, we are in data-rich environments.
- This means that there are many ($K$) potential predictor variables compared to the number of observations ($T$).
- Often, we even have $K > T$.
- In such cases we have to condense the information.
- One approach is pooling of information: the $K$ correlated predictor variables are transformed into $K^\ast \ll K$ representatives, e.g., by using factor models or variable selection methods (like stepwise regressions).
- A kind of in-between approach is to use estimation methods appropriate for such situations, like shrinkage estimators or Bayesian techniques.
- A second approach is pooling of forecasts: one produces many forecasts using different forecasting models and then combines these forecasts, e.g., by using the average.
- In the following, let us concentrate on forecast combinations.

Why forecast combination might be beneficial

Forecast combinations might be beneficial even with a relatively small number of predictor variables. Here are some reasons why (cf. Timmermann, 2006).

1. Portfolio diversification
   - The information set underlying the individual forecasts might be unobserved.
   - Then it is not feasible to pool the underlying information sets and construct a super model.
   - Instead, one may pool the forecasts.
2. Protection against structural breaks
   - Individual forecasts may be affected very differently by structural breaks.
   - Some models may adapt quickly and will only temporarily be affected by structural breaks, while others have parameters that adjust only very slowly to new post-break data.
   - If the data window since the most recent break is short, the faster-adapting models can be expected to produce the best forecasting performance.
   - On average, combinations of forecasts from models with different degrees of adaptability will outperform forecasts from individual models.

Why forecast combination might be beneficial

3. Protection against misspecification
   - The true DGP is likely to be more complex and of a much higher dimension than assumed by even the most flexible and general forecasting model.
   - Viewing forecasting models as local approximations, it is implausible that the same model dominates all others at all points in time.
   - Rather, the best model may change over time in ways that can be difficult to track on the basis of past forecasting performance.
   - Combining forecasts can be viewed as a way of robustification against misspecification biases and measurement errors.

Linear forecast combinations under MSE loss
Two forecasts: setup

What are the gains of forecast combination?

Let us consider the simplest possible case:

- There are two forecasts $y_{1,t+1|t}$ and $y_{2,t+1|t}$ with forecast errors $e_{1,t+1} = y_{t+1} - y_{1,t+1|t}$ and $e_{2,t+1} = y_{t+1} - y_{2,t+1|t}$, respectively.
- For ease of notation, denote the errors by $e_1$ and $e_2$.
- Assume the forecasts are unbiased, $E[e_1] = 0$ and $E[e_2] = 0$.
- Denote their variances by $\operatorname{Var}[e_1] = \sigma_1^2$ and $\operatorname{Var}[e_2] = \sigma_2^2$.
- Denote their covariance by $\sigma_{12} = \rho_{12}\, \sigma_1 \sigma_2$, where $\rho_{12}$ is their correlation.

Now combine the two forecasts using weights $w$ and $1 - w$:
$$y_{c,t+1|t} = w\, y_{1,t+1|t} + (1 - w)\, y_{2,t+1|t}.$$

Linear forecast combinations under MSE loss
Two forecasts: properties

The forecast error of the combined forecast is
$$e_c := e_{c,t+1} = y_{t+1} - y_{c,t+1|t} = y_{t+1} - w\, y_{1,t+1|t} - (1 - w)\, y_{2,t+1|t} = w (y_{t+1} - y_{1,t+1|t}) + (1 - w)(y_{t+1} - y_{2,t+1|t}) = w\, e_1 + (1 - w)\, e_2.$$

Hence, the properties of the combined forecast error are
$$E[e_c] = w\, E[e_1] + (1 - w)\, E[e_2] = 0,$$
$$\operatorname{Var}[e_c] = w^2 \operatorname{Var}[e_1] + (1 - w)^2 \operatorname{Var}[e_2] + 2 w (1 - w) \operatorname{Cov}[e_1, e_2] = w^2 \sigma_1^2 + (1 - w)^2 \sigma_2^2 + 2 w (1 - w) \sigma_{12} =: \sigma_c^2(w).$$

Linear forecast combinations under MSE loss
Two forecasts: optimal weights

Under MSE loss, the aim is to minimize the expected squared error, which equals the error variance because the combined forecast is unbiased.

Hence, minimize with respect to $w$ the objective function
$$\sigma_c^2(w) = w^2 \sigma_1^2 + (1 - w)^2 \sigma_2^2 + 2 w (1 - w) \sigma_{12}.$$

The first-order condition is
$$\frac{\partial \sigma_c^2(w)}{\partial w} = 2 w \sigma_1^2 - 2 (1 - w) \sigma_2^2 + 2 (1 - 2w) \sigma_{12} \overset{!}{=} 0.$$

Solving for the optimal weights $w^\ast$ yields
$$w^\ast = \frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2 \sigma_{12}}, \qquad 1 - w^\ast = \frac{\sigma_1^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2 \sigma_{12}}.$$

(A code sketch follows below.)
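A minimal sketch: the optimal weight computed from (estimated) error variances and covariance:

```python
# Optimal combination weight on forecast 1 under MSE loss.
import numpy as np

def optimal_weight(sigma1_sq, sigma2_sq, sigma12):
    return (sigma2_sq - sigma12) / (sigma1_sq + sigma2_sq - 2 * sigma12)

# Illustrative values: sigma1^2 = 1, sigma2^2 = 2, rho12 = 0.3
w = optimal_weight(1.0, 2.0, 0.3 * np.sqrt(1.0 * 2.0))
```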

Linear forecast combinations under MSE loss
Two forecasts: MSE of the optimal combination

The MSE of the optimal forecast combination (which is again unbiased) is obtained by substituting $w^\ast$ into the objective function. After some algebra, using $\sigma_{12} = \rho_{12} \sigma_1 \sigma_2$, this yields
$$\sigma_c^2(w^\ast) = \frac{\sigma_1^2 \sigma_2^2 (1 - \rho_{12}^2)}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2}.$$

Is this smaller than the MSE of an individual forecast?

In the following, we show this by comparing $\sigma_c^2(w^\ast)$ with the smallest individual variance $\sigma_1^2$, i.e., we assume that $\sigma_1^2 \le \sigma_2^2$. (This is without loss of generality because we can denote either of the two forecasts as forecast 1.)

Linear forecast combinations under MSE loss
Two forecasts: MSE of the optimal combination versus the best individual forecast

Compute the difference:
$$\sigma_c^2(w^\ast) - \sigma_1^2 = \frac{\sigma_1^2 \sigma_2^2 (1 - \rho_{12}^2)}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2} - \frac{\sigma_1^4 + \sigma_1^2 \sigma_2^2 - 2 \rho_{12} \sigma_1^3 \sigma_2}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2} = \frac{-\rho_{12}^2 \sigma_1^2 \sigma_2^2 - \sigma_1^4 + 2 \rho_{12} \sigma_1^3 \sigma_2}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2} = -\sigma_1^2 \sigma_2^2\, \frac{(\rho_{12} - \sigma_1/\sigma_2)^2}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2}$$

- The leading factor $\sigma_1^2 \sigma_2^2$ is positive.
- The denominator is positive because it equals $\operatorname{Var}[e_1 - e_2] > 0$.
- The squared term $(\rho_{12} - \sigma_1/\sigma_2)^2$ is positive or zero.

$$\Rightarrow \quad \sigma_c^2(w^\ast) - \sigma_1^2 \le 0 \quad \iff \quad \sigma_c^2(w^\ast) \le \sigma_1^2$$

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 115 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: when there is no gain of pooling

The result also shows when combination does not pay off: The difference $\sigma_c^2(w^*) - \sigma_1^2$ is
zero if

$$\rho_{12} = \sigma_1/\sigma_2.$$

In this situation, the optimal weights are

$$
w^* = \frac{\sigma_2^2 - \rho_{12}\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2}
= \frac{1 - \rho_{12}\,\sigma_1/\sigma_2}{\sigma_1^2/\sigma_2^2 + 1 - 2\rho_{12}\,\sigma_1/\sigma_2}
= \frac{1 - \rho_{12}^2}{1 - \rho_{12}^2} = 1,
\qquad
1 - w^* = 0.
$$

This means that $y_{c,t+1|t} = y_{1,t+1|t}$ because there is no gain from pooling.
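As a quick numerical check, reusing the hypothetical `optimal_weight` helper sketched earlier:

```python
# Boundary case rho12 = sigma1/sigma2: all weight goes to forecast 1.
sigma1, sigma2 = 0.8, 1.6
rho12 = sigma1 / sigma2                       # = 0.5
print(optimal_weight(sigma1, sigma2, rho12))  # 1.0, i.e., no gain from pooling
```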

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 116 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: when there is no gain of pooling

[Figure: surface plot of $(\rho_{12} - \sigma_1/\sigma_2)^2$ as a function of $\rho_{12}$ and $\sigma_1/\sigma_2$.]

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 117 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: when there is no gain of pooling

[Figure: surface plot of the relative gain $(\sigma_c^2(w^*) - \sigma_1^2)/\sigma_1^2$ as a function of $\rho_{12}$ and $\sigma_1/\sigma_2$.]

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 118 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: why reality is less friendly

Exactly optimal combinations are only possible if


I the variances and correlations are stable and
I the population values are known to the forecaster.

In reality,
I we face the typical observation that population parameters seem unstable.
I we have to estimate the variances and correlations, which typically induces a lot of
noise.

As a consequence, the theoretical gains from pooling analyzed above are not fully
attainable in practice.

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 119 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: estimation of the weights

To estimate the weights, we can estimate the variances and the covariance from the
sample moments:

$$\hat\sigma_1^2 = T^{-1}\sum_{t=T_0}^{T_1-1} (e_{1,t+1|t} - \bar e_1)^2$$

$$\hat\sigma_2^2 = T^{-1}\sum_{t=T_0}^{T_1-1} (e_{2,t+1|t} - \bar e_2)^2$$

$$\hat\sigma_{12} = T^{-1}\sum_{t=T_0}^{T_1-1} (e_{1,t+1|t} - \bar e_1)(e_{2,t+1|t} - \bar e_2)$$

where $\bar e_1$ and $\bar e_2$ are the respective means in the forecast sample and $T$ is the number of
forecasts. The plug-in weight is then $\hat w = (\hat\sigma_2^2 - \hat\sigma_{12})/(\hat\sigma_1^2 + \hat\sigma_2^2 - 2\hat\sigma_{12})$.

Asymptotically, the estimators converge towards the population values. Hence, the
estimated weight converges towards the optimal weight.
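A minimal Python sketch of this plug-in estimator (numpy; `e1` and `e2` are assumed arrays of past forecast errors, the function name is illustrative):

```python
import numpy as np

def moment_weight(e1, e2):
    """Plug-in weight on forecast 1 from the sample moments of the forecast errors."""
    C = np.cov(e1, e2)                  # 2x2 sample covariance matrix (demeaned)
    s11, s22, s12 = C[0, 0], C[1, 1], C[0, 1]
    return (s22 - s12) / (s11 + s22 - 2 * s12)
```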
Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 120 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: estimation of the weights

As an alternative method, we can use OLS. To this end, write the combination as a
regression equation:

$$y_{t+h} = w\,y_{1,t+h|t} + (1-w)\,y_{2,t+h|t} + e_{c,t+h|t}.$$

Subtracting $y_{2,t+h|t}$ from both sides yields the estimable equation

$$
\begin{aligned}
y_{t+h} - y_{2,t+h|t} &= w\,(y_{1,t+h|t} - y_{2,t+h|t}) + e_{c,t+h|t} \\
e_{2,t+h|t} &= w\,(y_{1,t+h|t} - y_{t+h} + y_{t+h} - y_{2,t+h|t}) + e_{c,t+h|t} \\
e_{2,t+h|t} &= w\,(e_{2,t+h|t} - e_{1,t+h|t}) + e_{c,t+h|t}.
\end{aligned}
$$

(To account for possibly non-zero sample means of the forecast errors, an intercept may
be added.)

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 121 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: estimation of the weights

The OLS estimator is

$$
\hat w = \frac{T^{-1}\sum_{t=T_0}^{T_1-1} e_{2,t+h|t}\,(e_{2,t+h|t} - e_{1,t+h|t})}{T^{-1}\sum_{t=T_0}^{T_1-1} (e_{2,t+h|t} - e_{1,t+h|t})^2}
\;\xrightarrow{\,p\,}\;
\frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}.
$$

Hence, asymptotically the optimal population weight is chosen.
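A minimal numpy sketch of this regression without an intercept (function name illustrative; `e1`, `e2` as before):

```python
import numpy as np

def ols_weight(e1, e2):
    """OLS estimate of w in e2 = w*(e2 - e1) + ec, without an intercept."""
    d = e2 - e1
    return np.dot(e2, d) / np.dot(d, d)
```

Adding a column of ones to the regressor matrix would accommodate the intercept mentioned on the previous slide.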

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 122 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: How precise is the OLS estimator?

For large $T$, the standard error of the regression converges towards $\sigma_c(w^*)$ because $\hat w$
converges to $w^*$ and thus the regression residual converges to $e_{c,t+h|t}$.

The asymptotic variance of the OLS estimator is thus

$$\mathrm{Avar}(\hat w) = \frac{1}{T}\,\frac{\sigma_c^2(w^*)}{T^{-1}\sum_{t=T_0}^{T_1-1} (e_{2,t+h|t} - e_{1,t+h|t})^2}.$$

Since

$$E\left[T^{-1}\sum_{t=T_0}^{T_1-1} (e_{2,t+h|t} - e_{1,t+h|t})^2\right] = \sigma_1^2 + \sigma_2^2 - 2\sigma_{12},$$

the above expression is asymptotically equivalent to

$$
\mathrm{Avar}(\hat w)
= \frac{1}{T}\,\frac{\sigma_c^2(w^*)}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}
= \frac{1}{T}\,\frac{\sigma_1^2\sigma_2^2(1-\rho_{12}^2)}{(\sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2)^2}
= \frac{1}{T}\,\frac{1-\rho_{12}^2}{(\sigma_1/\sigma_2 + \sigma_2/\sigma_1 - 2\rho_{12})^2}.
$$

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 123 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: How precise is the OLS estimator?

Hence, the standard error of the OLS estimator for $w$ is

$$\sqrt{\mathrm{Avar}(\hat w)} = \frac{1}{\sqrt{T}}\,\frac{\sqrt{1-\rho_{12}^2}}{\sigma_1/\sigma_2 + \sigma_2/\sigma_1 - 2\rho_{12}}.$$

Is this large?

Realistic example: $T = 25$, $\sigma_1/\sigma_2 = 0.9$, $\rho_{12} = 0.75$.

optimal weight $w^* = 0.70652$.

standard error $\sqrt{\mathrm{Avar}(\hat w)} \approx 0.25882$.

95%-interval for $w^*$: $[0.70652 \pm 1.96 \cdot 0.25882] = [0.2;\ 1.2]$.

This is huge!

(Note: precision becomes much better if the forecast errors are negatively correlated.)
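A short script (a sketch based on the asymptotic formula above) reproduces these numbers:

```python
import math

def se_weight(T, ratio, rho12):
    """Asymptotic standard error of the OLS weight; ratio = sigma1/sigma2."""
    return math.sqrt(1 - rho12**2) / ((ratio + 1 / ratio - 2 * rho12) * math.sqrt(T))

T, ratio, rho12 = 25, 0.9, 0.75
# w* after dividing numerator and denominator by sigma2^2 (only the ratio matters)
w_star = (1 - rho12 * ratio) / (ratio**2 + 1 - 2 * rho12 * ratio)
se = se_weight(T, ratio, rho12)
print(round(w_star, 5), round(se, 5))          # 0.70652 0.25882
print(w_star - 1.96 * se, w_star + 1.96 * se)  # roughly [0.2, 1.2]
```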

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 124 / 128
Forecast combinations

Linear forecast combinations under MSE loss


Two forecasts: $\sqrt{\mathrm{Avar}(\hat w)}$ as a function of $\rho_{12}$ and $\sigma_1/\sigma_2$

[Figure: surface plot of $\sqrt{\mathrm{Avar}(\hat w)}$ as a function of $\rho_{12}$ and $\sigma_1/\sigma_2$.]

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 125 / 128
Forecast combinations

Methods of linear forecast combination

Many different approaches to combine $K$ different forecasts have been proposed (cf.
Timmermann, 2006):
I Equal weights: $w_i = 1/K$.
I Optimal linear weights: using the weights estimated by OLS (see above).
I Relative performance weights: $w_i = \mathrm{MSE}_i^{-1} \big/ \sum_k \mathrm{MSE}_k^{-1}$.
I Rank-based weights: $w_i = \mathrm{Rank}_i^{-1} \big/ \sum_k \mathrm{Rank}_k^{-1}$.
I Trimming: before weighting, discard the worst 25%, 50% or even 75% of the
forecasts.
I Clustering: (1) cluster forecasts into groups of similar forecasts, (2) compute the
average forecast of each cluster, (3) combine these cluster forecasts into one
forecast using one of the previous weighting schemes.
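A compact Python sketch of several of these schemes (numpy; `mse` is an assumed array of the $K$ forecasts' historical MSEs, function names illustrative):

```python
import numpy as np

def performance_weights(mse):
    """Relative performance weights: inverse MSE, normalized to sum to one."""
    inv = 1.0 / np.asarray(mse, dtype=float)
    return inv / inv.sum()

def rank_weights(mse):
    """Rank-based weights: rank 1 = smallest MSE, weights proportional to 1/rank."""
    ranks = np.argsort(np.argsort(mse)) + 1
    inv = 1.0 / ranks
    return inv / inv.sum()

def trimmed_equal_weights(mse, trim=0.25):
    """Equal weights over the best (1 - trim) share of forecasts; the rest get zero."""
    mse = np.asarray(mse, dtype=float)
    n_keep = max(1, int(round(len(mse) * (1 - trim))))
    keep = np.argsort(mse)[:n_keep]
    w = np.zeros(len(mse))
    w[keep] = 1.0 / n_keep
    return w
```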

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 126 / 128
Forecast combinations

Shrinkage methods for forecast combination

These methods follow the general idea of shrinkage estimation.

Assume you have calculated an unbiased OLS estimator $\hat w$.

A shrinkage estimator can be defined as $\hat w_s = \frac{1}{1+s}\,\hat w$, $s > 0$.

If $E[\hat w] = w^*$, then $E[\hat w_s] = \frac{1}{1+s}\,w^* < w^*$ (for $w^* > 0$), hence it is biased.

However, $\mathrm{Var}(\hat w_s) = \mathrm{Var}\!\left(\frac{1}{1+s}\,\hat w\right) = \frac{1}{(1+s)^2}\,\mathrm{Var}(\hat w) < \mathrm{Var}(\hat w)$.

Hence, the MSE ($= \text{bias}^2 + \text{variance}$) of the shrinkage estimator can, in principle, be
smaller than the MSE of the OLS estimator.

In general, shrinkage pays off if the variance of the OLS estimator is large. As we saw
above, this may be a typical situation in forecast combination exercises.

In the multiple regression case, shrinkage may also be a good idea if the regressors are highly
correlated such that the moment matrix $X'X$ is near singular and its inverse is poorly
determined. Again, this may characterize a set of multiple forecasts.
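A minimal sketch of this shrinkage estimator; the second function is a common practical variant (shrinking toward equal weights) that is not part of the slide but follows the same logic:

```python
def shrink_weight(w_hat, s):
    """Shrinkage estimator from the slide: scale the OLS weight toward zero."""
    return w_hat / (1.0 + s)

def shrink_toward_equal(w_hat, s, K=2):
    """Variant: shrink toward the equal weight 1/K rather than toward zero."""
    return (w_hat + s * (1.0 / K)) / (1.0 + s)
```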

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 127 / 128
Forecast combinations

Forecast combination puzzle


(Stock and Watson, 2004)

Seemingly naive combination methods are often found to be superior in empirical studies
compared to optimal approaches like the one introduced above.
I The most common naive combination method is the average forecast:
$\bar y_{t+h|t} = (1/M)\sum_{m=1}^{M} \hat y_{m,t+h|t}$.
I Reasons might be insufficient sample sizes for the estimation of $w$ and/or
time variation in the parameters. The simple mean avoids estimation uncertainty,
which can be large.
I In general, methods that avoid estimation altogether (such as averaging, or averaging
after trimming) or reduce the estimation variance (shrinkage estimation, Bayesian
estimation) might perform better than optimal linear weighting with estimated
weights; the simulation sketch below illustrates this.
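A small Monte Carlo sketch (purely illustrative, all parameter values assumed) of the mechanism: with a short estimation sample, the simple mean tends to beat the estimated "optimal" weight out of sample:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma1, sigma2, rho = 1.0, 1.1, 0.8
cov = [[sigma1**2, rho * sigma1 * sigma2],
       [rho * sigma1 * sigma2, sigma2**2]]

n_train, n_test, n_rep = 20, 200, 2000
mse_ols = mse_mean = 0.0
for _ in range(n_rep):
    e = rng.multivariate_normal([0.0, 0.0], cov, size=n_train + n_test)
    e1, e2 = e[:, 0], e[:, 1]
    d = e2[:n_train] - e1[:n_train]
    w_hat = np.dot(e2[:n_train], d) / np.dot(d, d)   # noisy estimated weight
    comb_ols = w_hat * e1[n_train:] + (1 - w_hat) * e2[n_train:]
    comb_avg = 0.5 * (e1[n_train:] + e2[n_train:])
    mse_ols += np.mean(comb_ols**2) / n_rep
    mse_mean += np.mean(comb_avg**2) / n_rep

print(mse_ols, mse_mean)  # the simple mean typically wins in this setup
```

Here the error variances are deliberately close and the errors positively correlated, so the true optimal weight is only weakly identified and the estimation noise in $\hat w$ outweighs the small theoretical gain.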

Carstensen (CAU Kiel) Multivariate Time Series Summer Term 2017 128 / 128
