
SGPE Econometrics 1

Lecture Notes 1: Finite Sample OLS

Mark Schaffer
Heriot-Watt University

Autumn 2016


Lecture Outline

1. Motivation
2. Notation, definitions, tools
3. Assumptions of the classical regression model and OLS
4. Finite-sample properties of OLS

Main reading: Hayashi Chapter 1

Greene's Econometric Analysis has a good matrix algebra appendix.

A thorough summary of many matrix algebra facts and results is The Matrix Cookbook by Petersen and Pedersen, available online at
www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pd
or www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.

Finite Sample OLS

Key concepts:

1. Finite-sample vs. large-sample
2. Assumptions of the classical linear regression model
3. Deriving the OLS estimator
4. Finite-sample properties of the OLS estimator
5. Unbiased estimator
6. Efficient estimator
7. Mean squared error
8. Frisch-Waugh-Lovell Theorem

Finite Sample OLS

Key terms:

1. Moment
2. Strict exogeneity
3. I.I.D. (independently and identically distributed)
4. Spherical error variance
5. Conditional homoskedasticity; conditional heteroskedasticity
6. Independence ("no serial correlation")
7. Unbiased estimator
8. Efficient estimator
9. Mean squared error

Motivation

We want to estimate a parameter and perform inference using it. In econometrics, we have two standard settings for characterising estimators and developing results: "finite sample" and "large sample".

Finite sample setting:

In repeated samples of size $n$, what is $E(\hat\theta)$? Is $E(\hat\theta)=\theta$, i.e., is our estimator unbiased?

Our estimator $\hat\theta$ is a statistic - it is a function of the data. In repeated samples of size $n$, what is the distribution of $\hat\theta$, and $Var(\hat\theta)$ in particular?

Our expression for $Var(\hat\theta)$ may be infeasible because it depends on parameters we don't know. How do we obtain an estimate $\widehat{Var}(\hat\theta)$?

In practice, econometric theory often doesn't provide exact answers to these questions. This is where Monte Carlo methods are very helpful: we choose a specific sample size $n$ and values for all the relevant parameters and the computer answers the questions.

Motivation

Large sample setting:

It turns out to be a lot easier to develop theoretical results for $\hat\theta$ in the large sample setting, i.e., as $n \to \infty$. We also say this is the asymptotic setting, and we say our results are asymptotic or asymptotically valid.

What is the probability limit of $\hat\theta$ as $n \to \infty$? Does $\hat\theta$ converge to $\theta$, i.e., is $\mathrm{plim}_{n\to\infty}(\hat\theta)=\theta$? In other words, is our estimator consistent? (Loosely speaking, does any bias in $\hat\theta$ disappear as the sample size gets larger and larger?)

Our estimator $\hat\theta$ is a statistic - it is a function of the data. What is the limiting distribution of $\sqrt{n}(\hat\theta-\theta)$ as $n \to \infty$, i.e., what is $AVar(\hat\theta)$?

Our expression for $AVar(\hat\theta)$ may be infeasible because it depends on parameters we don't know. How do we obtain an estimate $\widehat{AVar}(\hat\theta)$?

Motivation

What makes for a good estimator?

Central tendency: unbiasedness (finite-sample setting) or consistency (large-sample setting). In repeated samples, $\hat\theta$ will be centred around the true $\theta$; as the sample size grows, $\hat\theta$ converges to $\theta$.

Variance: efficiency (finite-sample setting) or asymptotic efficiency (large-sample setting). The variance of $\hat\theta$ is "small".

How to weight these two criteria? What should be our loss function? Might we sometimes prefer to use a biased or inconsistent estimator $\hat\theta$ because it has a smaller variance than some other unbiased or consistent estimator $\tilde\theta$?

Commonly-used loss function: mean squared error.

Motivation

Definition: The mean squared error of $\hat\theta$ is $MSE = E(\hat\theta-\theta)^2$. MSE is a loss function that weights bias and variance equally.

Theorem: $MSE = Var(\hat\theta) + \left[\mathrm{bias}(\hat\theta)\right]^2$

Proof: Let $\mu = E(\hat\theta)$. Then

$$
\begin{aligned}
E(\hat\theta-\theta)^2 &= E\left[(\hat\theta-\mu)+(\mu-\theta)\right]^2 \\
&= E\left[(\hat\theta-\mu)^2 + 2(\mu-\theta)(\hat\theta-\mu) + (\mu-\theta)^2\right] \\
&= E(\hat\theta-\mu)^2 + 2(\mu-\theta)E(\hat\theta-\mu) + (\mu-\theta)^2 \\
&= E(\hat\theta-\mu)^2 + 2(\mu-\theta)\cdot 0 + (\mu-\theta)^2 \\
&= Var(\hat\theta) + \left[\mathrm{bias}(\hat\theta)\right]^2
\end{aligned}
$$

where at various points we have made use of the fact that $\mu$ can be treated as a constant and so $E(\mu) = E(E(\hat\theta)) = \mu$.

Motivation

Common pattern in econometrics:

$\hat\theta$ does not have a finite-sample justification. It's biased, or we just don't know what $E(\hat\theta)$ is, or we can't derive $Var(\hat\theta)$.

$\hat\theta$ does have a large-sample justification. It's consistent, and we can estimate the asymptotic variance.

Because $\hat\theta$ is biased in finite samples, we are worried about how it will perform in practice. No one ever actually has an infinite sample.

Sometimes we can derive expressions for the finite-sample bias or otherwise study finite-sample performance theoretically.

Often we do MC exercises to see how the estimator performs. Increasing the sample size in a series of MCs can indicate how quickly the asymptotic justification kicks in. We can compare how different estimators perform in different settings according to their bias, variance and MSE.

This lecture: finite-sample theory for OLS.
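As a quick illustration of the Monte Carlo idea, here is a minimal Python sketch (not part of the notes; the sample size, parameter values and data-generating process are arbitrary choices) that draws repeated samples, computes OLS each time, and reports bias, variance and MSE.

```python
import numpy as np

# Minimal Monte Carlo sketch: repeated samples of size n from y = X*beta + eps,
# recording the OLS estimate each time, then reporting bias, variance and MSE.
rng = np.random.default_rng(42)
n, n_reps = 50, 5000
beta = np.array([1.0, 2.0])          # true parameters (constant and one slope)

estimates = np.empty((n_reps, 2))
for r in range(n_reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])     # n x K data matrix with a constant
    eps = rng.normal(size=n)                 # spherical, normal errors
    y = X @ beta + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)   # b = (X'X)^{-1} X'y

bias = estimates.mean(axis=0) - beta
var = estimates.var(axis=0)
mse = ((estimates - beta) ** 2).mean(axis=0)
print("bias:", bias)                 # close to zero: OLS is unbiased here
print("variance:", var)
print("MSE:", mse)                   # equals var + bias^2 (the decomposition above)
```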

Notation and definitions (and lots of it)

In general: matrices in upper case, vectors in bold lower case, scalars lower case.

$n$: Total number of observations
$i$: Indexes observations. (We'll switch to $t$ for time series.)
$y_i$: Dependent variable; $i$th observation is a scalar
$x_i$: Independent variables; $i$th observation is a $K \times 1$ data vector
$x_i'$: Independent variables; $i$th observation is a $1 \times K$ data row
$x_{ik}$: Independent variable $k$; $i$th observation is a scalar
$\varepsilon_i$: Error term; $i$th observation is a scalar
$\varepsilon$: Vector of errors, $n \times 1$
$\beta$: $K \times 1$ parameter vector
$\beta_k$: Coefficient on $x_{ik}$
$y$: Data vector of dependent variable, $n \times 1$
$X$: Data matrix of independent vars, $n \times K$
$0$: Scalar zero
$\mathbf{0}$: Vector of zeros

Notation and definitions

Linear model, data row form:

$$y_i = x_i'\beta + \varepsilon_i$$

Linear model, matrix form:

$$y = X\beta + \varepsilon$$

where

$$
y_{n \times 1} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad
X_{n \times K} = \begin{bmatrix} x_1' \\ \vdots \\ x_n' \end{bmatrix}, \quad
\beta_{K \times 1} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_K \end{bmatrix}, \quad
\varepsilon_{n \times 1} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$

Notation and definitions

Curiously, it is traditional to denote the OLS estimator, residuals and error variance by the special symbols $b$, $e$ and $s^2$.

$\beta$: "True" parameter vector
$\hat\beta$: Some estimator for $\beta$
$\tilde\beta$: Some other estimator for $\beta$
$b$: The OLS estimator for $\beta$
$\hat\beta_{OLS}$: The OLS estimator for $\beta$. Same meaning as $b$.

$\varepsilon$: Vector of errors, $n \times 1$
$\hat\varepsilon = y - X\hat\beta$: Residuals defined by estimator $\hat\beta$
$\tilde\varepsilon = y - X\tilde\beta$: Residuals defined by some other estimator $\tilde\beta$
$e = y - Xb$: Residuals defined by the OLS estimator $b$ (no subscript!)
$\hat\varepsilon_{OLS} = y - Xb$: Residuals defined by the OLS estimator $b$. Same as above.

Notation and definitions

Curiously, it is traditional to denote the OLS estimator, residuals and error variance by the special symbols $b$, $e$ and $s^2$.

$\sigma^2$: $Var(\varepsilon_i | X)$
$\hat\sigma^2$: Some estimator for $\sigma^2$
$\tilde\sigma^2$: Some other estimator for $\sigma^2$
$\hat\sigma^2_{OLS}$: The OLS estimator for $\sigma^2$
$s^2$: The OLS estimator for $\sigma^2$. Same meaning as $\hat\sigma^2_{OLS}$.

The OLS estimator

The OLS estimator of $\beta$ in traditional data matrix form:

$$b = (X'X)^{-1}X'y$$

The OLS estimator of $\sigma^2$:

$$s^2 = \frac{e'e}{n-K}$$

Note the $K$ in the definition of $s^2$. Asymptotically this doesn't matter - $K$ is fixed as $n \to \infty$ - but in finite samples it will make a difference (of course).
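A minimal numpy sketch of these two formulas (not part of the notes; the simulated data and parameter values are arbitrary):

```python
import numpy as np

# Sketch: computing b = (X'X)^{-1} X'y and s^2 = e'e/(n-K)
# for a data matrix X (n x K, including a constant column) and y (n x 1).
def ols(X: np.ndarray, y: np.ndarray):
    n, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations X'Xb = X'y
    e = y - X @ b                           # OLS residuals
    s2 = (e @ e) / (n - K)                  # degrees-of-freedom-corrected error variance
    return b, e, s2

# Example with simulated data
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)
b, e, s2 = ols(X, y)
print(b, s2)
```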

Sum of Squared Residuals (SSR)

The sum of squared residuals (SSR) for some estimator $\tilde\beta$ can be written in various ways:

$$SSR(\tilde\beta) = \tilde\varepsilon'\tilde\varepsilon = (y - X\tilde\beta)'(y - X\tilde\beta) = \sum_{i=1}^n \tilde\varepsilon_i^2 = \sum_{i=1}^n (y_i - x_i'\tilde\beta)^2$$

Definition of OLS estimator $b$: it minimizes the sum of squared residuals.

$$b \equiv \arg\min_{\tilde\beta} SSR(\tilde\beta)$$

Deriving the OLS estimator

The OLS estimator has multiple justifications. Here we derive it as a least-squares estimator; later we will see it justified as a Method of Moments (MM) estimator and as a Maximum Likelihood (ML) estimator.

Easiest way: use vectors/matrices. Note: we will use the assumption (more on this below) that $X'X$ is full rank.

First multiply out:

$$
\begin{aligned}
SSR(\tilde\beta) &= \tilde\varepsilon'\tilde\varepsilon \\
&= (y - X\tilde\beta)'(y - X\tilde\beta) \\
&= (y' - \tilde\beta'X')(y - X\tilde\beta) \\
&= y'y - \tilde\beta'X'y - y'X\tilde\beta + \tilde\beta'X'X\tilde\beta \\
&= y'y - 2y'X\tilde\beta + \tilde\beta'X'X\tilde\beta \qquad \text{(since } \tilde\beta'X'y \text{ is a scalar)}
\end{aligned}
\tag{1.2.2}
$$

Deriving the OLS estimator

$$SSR(\tilde\beta) = y'y - 2y'X\tilde\beta + \tilde\beta'X'X\tilde\beta \tag{1.2.2}$$

Next, differentiate $SSR(\tilde\beta)$ with respect to $\tilde\beta$. Since $SSR(\tilde\beta)$ is a scalar and $\tilde\beta$ is a $K \times 1$ vector, the result is a $K \times 1$ vector of first derivatives (the "gradient"). Note that:

1. Since $y'y$ does not depend on $\tilde\beta$, $\dfrac{\partial(y'y)}{\partial\tilde\beta} = 0$.
2. Since $\dfrac{\partial(a'\tilde\beta)}{\partial\tilde\beta} = a$, $\dfrac{\partial(-2y'X\tilde\beta)}{\partial\tilde\beta} = -2X'y$. (Note the transposition.)
3. Since $\dfrac{\partial(\tilde\beta'A\tilde\beta)}{\partial\tilde\beta} = 2A\tilde\beta$ for $A$ symmetric, $\dfrac{\partial(\tilde\beta'X'X\tilde\beta)}{\partial\tilde\beta} = 2X'X\tilde\beta$.

Hence:

$$\frac{\partial SSR(\tilde\beta)}{\partial\tilde\beta} = -2X'y + 2X'X\tilde\beta$$

Deriving the OLS estimator as an LS estimator

$$\frac{\partial SSR(\tilde\beta)}{\partial\tilde\beta} = -2X'y + 2X'X\tilde\beta$$

The first-order conditions for $\min_{\tilde\beta} SSR(\tilde\beta)$ are $\dfrac{\partial SSR(\tilde\beta)}{\partial\tilde\beta} = 0$, so substitute to get $-2X'y + 2X'Xb = 0$ and rearrange:

$$X'Xb = X'y \qquad \text{(the } K \text{ "normal equations")}$$

where we substituted $b$ for $\tilde\beta$ because it solves the FOCs. Since $X'X$ is full rank by assumption, and hence positive definite and nonsingular (= invertible), we can solve the normal equations for $b$ by premultiplying by $(X'X)^{-1}$:

$(X'X)^{-1}X'Xb = (X'X)^{-1}X'y$, and after cancelling terms we have $b$:

$$b = (X'X)^{-1}X'y \tag{1.2.5}$$

Deriving the OLS estimator

One more step is required: we have to check the second order condition to confirm that the SSR is minimized. For a matrix problem, we check that the Hessian (the matrix of second derivatives) is positive definite.

$$\frac{\partial SSR(\tilde\beta)}{\partial\tilde\beta} = -2X'y + 2X'X\tilde\beta \qquad \text{(gradient, vector of first derivatives)}$$

$$\frac{\partial^2 SSR(\tilde\beta)}{\partial\tilde\beta\,\partial\tilde\beta'} = 2X'X \qquad \text{(Hessian, matrix of second derivatives)}$$

And since we've assumed $X'X$ is full rank and hence positive definite, $2X'X$ is also positive definite and the second order conditions for a minimization are satisfied. Done!

More useful OLS expressions

Uncentered and centred $R^2$:

$$R^2_u = 1 - \frac{e'e}{y'y}$$

Recall that $\iota$ is an $n \times 1$ column vector of ones, and $\bar y$ is the sample mean of $y_i$. Denote by $\tilde y$ the "demeaned" or "centered" values of $y$:

$$\tilde y \equiv y - \iota\bar y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \bar y \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} y_1 - \bar y \\ \vdots \\ y_n - \bar y \end{bmatrix}$$

Then the centred $R^2$ is:

$$R^2 = 1 - \frac{e'e}{\tilde y'\tilde y}$$

More useful OLS expressions

The sampling error of $b$ is the difference between the OLS estimate $b$ and the true $\beta$:

$$
\begin{aligned}
b - \beta &= (X'X)^{-1}X'y - \beta \\
&= (X'X)^{-1}X'(X\beta + \varepsilon) - \beta \\
&= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon - \beta \\
&= \beta + (X'X)^{-1}X'\varepsilon - \beta \\
&= (X'X)^{-1}X'\varepsilon
\end{aligned}
$$

NB: proofs of unbiasedness and consistency typically require this kind of substitution and simplification.

Some special matrices

The projection matrix $P_X$ projects some (usually) data matrix or vector $Z$ onto the linear subspace defined by $X$:

$$P_X \equiv X(X'X)^{-1}X'$$

$P_X$ is $n \times n$.

The annihilation matrix $M_X$ gives the difference between $Z$ and the projection of $Z$ onto $X$:

$$M_X \equiv I_n - X(X'X)^{-1}X'$$

$M_X$ is $n \times n$.

Both $P_X$ and $M_X$ are symmetric and idempotent:

$$P_X' = \left(X(X'X)^{-1}X'\right)' = X(X'X)^{-1}X' = P_X$$

$$P_X'P_X = \left(X(X'X)^{-1}X'\right)'\left(X(X'X)^{-1}X'\right) = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = P_X$$

and similarly for $M_X$.

Some special matrices

The projection and annihilation matrices make it easy to write some OLS expressions:

$$\hat y = Xb = X(X'X)^{-1}X'y = P_X y$$

So $P_X y$ gives us the fitted values of $y$ from an OLS regression of $y$ on $X$. Note that $P_X y$ is $n \times 1$.

$$e = y - \hat y = y - Xb = y - X(X'X)^{-1}X'y = \left(I_n - X(X'X)^{-1}X'\right)y = M_X y$$

So $M_X y$ gives us the residuals from an OLS regression of $y$ on $X$. Note that $M_X y$ is also $n \times 1$.
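A short numpy sketch (not part of the notes; simulated data) checking the properties just stated: symmetry, idempotency, and that $P_X y$ and $M_X y$ reproduce the OLS fitted values and residuals.

```python
import numpy as np

# Sketch: P = X(X'X)^{-1}X' and M = I - P are symmetric and idempotent,
# and Py, My give the OLS fitted values and residuals.
rng = np.random.default_rng(1)
n, K = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

assert np.allclose(P, P.T) and np.allclose(P @ P, P)     # symmetric, idempotent
assert np.allclose(M, M.T) and np.allclose(M @ M, M)

b = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(P @ y, X @ b)        # fitted values
assert np.allclose(M @ y, y - X @ b)    # residuals
print("projection and annihilation checks passed")
```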

The Frisch-Waugh-Lovell (FWL) theorem

The Frisch-Waugh-Lovell theorem is extremely handy. It allows us to simplify linear models and theorems by "partialling out" selected regressors - call them $X_1$.

$X_1$ may be uninteresting because they get in the way of the exposition, or because $\beta_1$ are nuisance parameters that can't be estimated consistently, or because a particular $K \times K$ matrix corresponding to the full set of $X$s can't be inverted, but the $K_2 \times K_2$ matrix corresponding to the interesting $X_2$s can be inverted. (We have a particular matrix in mind here...)

The FWL theorem (continued)

Partition $X$ so that $X$ is split between regressors of interest $X_2$ and uninteresting regressors $X_1$. Partition $\beta$ conformably into $\beta_1$ and $\beta_2$.

$$X \equiv [X_1 \;\; X_2] \qquad \beta \equiv \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$$

$$y = X\beta + \varepsilon = [X_1 \;\; X_2]\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \varepsilon$$

$$y = Xb + e = [X_1 \;\; X_2]\begin{bmatrix} b_1 \\ b_2 \end{bmatrix} + e$$

The FWL theorem

Define projection and annihilation matrices using the uninteresting regressors $X_1$:

$$P_{X_1} \equiv X_1(X_1'X_1)^{-1}X_1' \qquad M_{X_1} \equiv I_n - P_{X_1}$$

Regress $y$ and $X_2$ on $X_1$ using OLS and collect the residuals:

$$\tilde X_2 = M_{X_1}X_2 \qquad \tilde y = M_{X_1}y$$

Frisch-Waugh-Lovell Theorem: Regress the $\tilde y$ residuals on the $\tilde X_2$ residuals using OLS and you get the same $b_2$ as when you do OLS using the full set of $X$s:

$$\tilde b_2 = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'\tilde y = b_2$$

The FWL theorem

Frisch-Waugh-Lovell Theorem: Regress the $\tilde y$ residuals on the $\tilde X_2$ residuals using OLS and you get the same $b_2$ as when you do OLS using the full set of $X$s:

$$\tilde b_2 = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'\tilde y = b_2$$

Moreover, the residuals are also the same:

$$\tilde y - \tilde X_2\tilde b_2 = e$$

Question: what if we defined the uninteresting regressors $X_1$ as simply the constant, i.e., $X_1 = \iota$?

Answer: $\tilde y$ would be the same "centred" or "demeaned" $\tilde y$ used in the definition of the centred $R^2$ used above. $\tilde X_2$ would be the regressors (excluding the constant) in mean-deviation form, i.e., each $x_{ik}$ would become $(x_{ik} - \bar x_k)$.
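A numerical check of the theorem, as a sketch (not part of the notes; the split into $X_1$ and $X_2$ and the simulated parameters are arbitrary):

```python
import numpy as np

# Sketch: verifying the FWL theorem numerically.
# Regressing M_{X1}y on M_{X1}X2 reproduces the b2 block from the full regression.
rng = np.random.default_rng(2)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])       # "uninteresting" regressors
X2 = rng.normal(size=(n, 2))                                  # regressors of interest
X = np.column_stack([X1, X2])
y = X @ np.array([1.0, 0.5, 2.0, -1.0]) + rng.normal(size=n)

b_full = np.linalg.solve(X.T @ X, X.T @ y)                    # full OLS; b2 is the last two entries

M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T         # annihilator for X1
y_t, X2_t = M1 @ y, M1 @ X2                                   # partialled-out y and X2
b2_fwl = np.linalg.solve(X2_t.T @ X2_t, X2_t.T @ y_t)

assert np.allclose(b2_fwl, b_full[2:])
print("FWL check passed:", b2_fwl, b_full[2:])
```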

Moments

A "moment" is a measure of the shape of a distribution (the term is borrowed from physics).

If $a_i$ and $c_i$ are scalar random variables, then the following are population moments:

$E(a_i)$: First moment of $a_i$. Same as mean of $a_i$.
$E(a_i^2)$: Second moment of $a_i$.
$E(a_i - E(a_i))^2$: Second central moment. Same as $Var(a_i)$.
$E(a_i c_i)$: Cross moment of $a_i$ and $c_i$.
$E[(a_i - E(a_i))(c_i - E(c_i))]$: Central cross moment. Same as $Cov(a_i, c_i)$.

These are population moments: they are characteristics of the population.

If $E(a_i c_i) = 0$, we say $a_i$ and $c_i$ are orthogonal. We will make extensive use of this term.

Moments

The corresponding sample moments:

$\frac{1}{n}\sum_{i=1}^n a_i$: First sample moment of $a_i$. Same as sample mean.
$\frac{1}{n}\sum_{i=1}^n a_i^2$: Second sample moment of $a_i$.
$\frac{1}{n}\sum_{i=1}^n (a_i - \bar a)^2$: Second central sample moment of $a_i$.
$\frac{1}{n}\sum_{i=1}^n a_i c_i$: Sample cross moment of $a_i$ and $c_i$.
$\frac{1}{n}\sum_{i=1}^n (a_i - \bar a)(c_i - \bar c)$: Sample central cross moment of $a_i$ and $c_i$.

Note that we are dividing by $n$ throughout. The second central sample moment above differs from the sample variance because the latter divides by $n-1$. Same for the sample central cross moment vs. covariance.

Moments

The extension of moments to matrices is straightforward. We illustrate by applying to our data variables. These sample and population moments appear frequently enough to warrant shorthand symbols.

Population moments:

$$\sigma_{xy} \equiv E(x_i y_i) \qquad \Sigma_{xx} \equiv E(x_i x_i')$$

Sample moments:

$$S_{xy} \equiv \frac{1}{n}\sum_{i=1}^n x_i y_i = \frac{1}{n}X'y \qquad S_{xx} \equiv \frac{1}{n}\sum_{i=1}^n x_i x_i' = \frac{1}{n}X'X$$

Hence two ways to write the OLS estimator:

$$b = (X'X)^{-1}X'y = S_{xx}^{-1}S_{xy}$$
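A small sketch (not part of the notes; simulated data) confirming that the moment form of the estimator matches the data-matrix form:

```python
import numpy as np

# Sketch: the two equivalent ways of writing the OLS estimator,
# b = (X'X)^{-1} X'y and b = S_xx^{-1} S_xy.
rng = np.random.default_rng(10)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)

b_matrix = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y
S_xx = (X.T @ X) / n                            # sample second moment of x_i
S_xy = (X.T @ y) / n                            # sample cross moment of x_i and y_i
b_moments = np.linalg.solve(S_xx, S_xy)         # S_xx^{-1} S_xy
assert np.allclose(b_matrix, b_moments)
print(b_matrix)
```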

Definition of i.i.d.

Definition of i.i.d.: independently and identically distributed.

Note: we are assuming that the regressors $X$ are random. Sometimes textbooks assume that $X$ is constant or "fixed in repeated samples". This makes the exposition a bit easier but is highly unrealistic, so we follow Hayashi and assume $X$ is random from the beginning.

The sample $(y, X)$ is an i.i.d. random sample (or just "random sample") if $\{y_i, x_i\}$ is independently and identically distributed across observations $i$.

Note: the "identical" in i.i.d. means that the joint distribution of $\{\varepsilon_i, x_i\}$ does not depend on $i$. We will use this fact later.

Note: Hayashi (p. 12) uses the term "random sample" for an i.i.d. random sample.

Plan for the rest of the lecture

Set out all the assumptions required by the classical regression model.

Derive carefully and discuss key finite-sample properties of the OLS estimator $b$.

Show how to construct a statistic to test a hypothesis about a single coefficient, $H_0: \beta_k = \bar\beta_k$, when we know the true error variance $\sigma^2$. ($z$ statistic, Normally distributed)

Show how to construct a statistic to test linear hypotheses in general, $H_0: R\beta = r$, when we know the true error variance $\sigma^2$. ($W$ statistic, $\chi^2$ distributed)

These test statistics are "infeasible" - they depend on the unknown "nuisance parameter" $\sigma^2$.

But if we replace $\sigma^2$ by its OLS estimator $s^2 = \frac{e'e}{n-K}$, we obtain feasible test statistics. ($t$ statistic/distribution, $F$ statistic/distribution)

Assumptions of the Classical Regression Model

Numbering of assumptions and equations follows Hayashi.

Assumption 1.1: Linearity

$$y = X\beta + \varepsilon \tag{1.1.1}$$

Assumption 1.2: Strict exogeneity

$$E(\varepsilon_i | X) = 0 \qquad i = 1, 2, \ldots, n \tag{1.1.7}$$

Assumption 1.3: No multicollinearity

The rank of the $n \times K$ data matrix $X$ is $K$ with probability 1.

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance

This assumption has two parts:

(a) Conditional homoskedasticity

$$E(\varepsilon_i^2 | X) = \sigma^2 \qquad i = 1, 2, \ldots, n \tag{1.1.12}$$

(b) Independence

$$E(\varepsilon_i \varepsilon_j | X) = 0 \qquad \forall\, i, j,\; i \neq j \tag{1.1.13}$$

Assumptions of the Classical Regression Model: Remarks

Assumption 1.1: Linearity

$$y = X\beta + \varepsilon \tag{1.1.1}$$

"Linear" here means linear in parameters.

It is easy to accommodate nonlinearity in variables. If, say, we want to use log(income) as a regressor, just define a new variable to be log(income) and include it as one of the $X$s. Same applies to functions of more than one variable: if we want the interaction of log(income) and age, define a new variable to be age*log(income) and include it as one of the $X$s.

NB: when defining multiplicative interactions, it is usually a good idea to centre the separate variables around their means before interacting, i.e., define the age-income interaction variable to be $(age - \overline{age}) \cdot (\log(income) - \overline{\log(income)})$.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.1: Linearity (data row form)

$$y_i = x_i'\beta + \varepsilon_i$$

An important feature of this assumption that is not discussed in Hayashi is parameter homogeneity: the parameter vector $\beta$ is fixed and does not vary across observations.

A major area of research in econometrics, especially microeconometrics, involves loosening this assumption.

Terminology involved here: "treatment effects", "causal effects", "parameter heterogeneity", "heterogeneous causal effects", "random coefficients", etc.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.1: Linearity (data row form)

$$y_i = x_i'\beta + \varepsilon_i$$

For example, consider the model where the effect of $\beta$ varies across observations:

$$y_i = x_i'\beta_i + \varepsilon_i$$

Obviously we can't obtain estimates of all the $n$ different values of $\beta_i$. But we can obtain an estimate of the average treatment effect (ATE) $E(\beta_i)$. In fact, OLS yields a consistent estimate of the ATE. See Stock & Watson, pp. 540-2 (3rd edition).

But in this course we work almost entirely with the basic no-heterogeneous-effects setup.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.2: Strict exogeneity

$$E(\varepsilon_i | X) = 0 \qquad i = 1, 2, \ldots, n$$

This is also known as the zero conditional mean assumption, but strict exogeneity is perhaps the best term because it emphasizes the key point (conditioning on the entire $X$).

It is actually two assumptions:

$$E(\varepsilon_i | X) = \mu \qquad i = 1, 2, \ldots, n$$

and

$$\mu = 0$$

But the second is not restrictive if the model has a constant term.

Assumptions of the Classical Regression Model: Remarks

Say

$$E(\varepsilon_i | X) = \mu, \quad i = 1, 2, \ldots, n, \text{ and } \mu \neq 0$$

$$x_{i1} = 1 \quad \forall\, i \qquad \text{(the first column of the data matrix } X \text{ is all ones)}$$

$$y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_K x_{iK} + \varepsilon_i$$

Add and subtract $\mu$:

$$y_i = (\beta_1 + \mu) + \beta_2 x_{i2} + \ldots + \beta_K x_{iK} + (\varepsilon_i - \mu)$$

And the new error term now has a zero conditional mean.

Much more important....

Assumptions of the Classical Regression Model: Remarks

Assumption 1.2: Strict exogeneity

$$E(\varepsilon_i | X) = 0 \qquad i = 1, 2, \ldots, n$$

The key point is that $X$ is all $x_j$, for $j = 1, 2, \ldots, n$. More obvious if we write it like this:

$$E(\varepsilon_i | x_1, x_2, \ldots, x_n) = 0 \qquad i = 1, 2, \ldots, n$$

Implications of strict exogeneity (1):

Unconditional mean of the error is zero:

$$E(\varepsilon_i) = 0 \qquad i = 1, 2, \ldots, n$$

Assumptions of the Classical Regression Model: Remarks

Implications of strict exogeneity (2):

Regressors are orthogonal to all errors (note that $0_K$ is a vector of $K$ zeros, and the subscript on $x_j$ is a $j$):

$$E(x_j \varepsilon_i) = 0_K \qquad \forall\, i, j$$

Intuition: with strict exogeneity we are assuming exogeneity of $x_i$ with respect to all the errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$.

Compare this to the definition of weak exogeneity (Hayashi uses the term "predetermined") we will use when developing large-sample theory:

$$E(x_i \varepsilon_i) = 0_K \qquad i = 1, 2, \ldots, n$$

Intuition: with weak exogeneity we are assuming exogeneity of $x_i$ with respect to just the contemporaneous error $\varepsilon_i$. A much weaker (= believable in a wider range of settings) assumption.

Assumptions of the Classical Regression Model: Remarks

Strict exogeneity in time series models is rarely satisfied.

See Hayashi pp. 9-10. Simple example: the first-order autoregressive or AR(1) model. The only regressor is the lagged dependent variable:

$$y_i = \beta y_{i-1} + \varepsilon_i \tag{1.1.11}$$

Say we assume just weak exogeneity, $E(y_{i-1}\varepsilon_i) = 0$. Then:

$$E(y_i \varepsilon_i) = E[(\beta y_{i-1} + \varepsilon_i)\varepsilon_i] = \beta E(y_{i-1}\varepsilon_i) + E(\varepsilon_i^2) = E(\varepsilon_i^2)$$

So unless the error term is always zero, $E(y_i \varepsilon_i) \neq 0$. But that means we've violated strict exogeneity because $y_i$ is the regressor for observation $i+1$. We can assume weak exogeneity but not strict exogeneity.

Fortunately, weak exogeneity is enough for the OLS estimator to have good large-sample properties even in a time series setting. More on this later.
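A small simulation sketch of the practical consequence (not part of the notes; the AR(1) coefficient, sample size and number of replications are arbitrary choices): because strict exogeneity fails, OLS on the lagged dependent variable is biased in finite samples, even though it is consistent.

```python
import numpy as np

# Sketch: finite-sample bias of OLS in the AR(1) model y_t = beta*y_{t-1} + eps_t.
rng = np.random.default_rng(3)
beta, n, n_reps = 0.9, 25, 20000

est = np.empty(n_reps)
for r in range(n_reps):
    y = np.zeros(n + 1)
    for t in range(1, n + 1):
        y[t] = beta * y[t - 1] + rng.normal()
    y_lag, y_cur = y[:-1], y[1:]
    est[r] = (y_lag @ y_cur) / (y_lag @ y_lag)   # OLS slope (no constant)

print("true beta:", beta, "mean OLS estimate:", est.mean())  # noticeably below 0.9 for small n
```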

NB: Conditional vs. Unconditional Moment Conditions

The weak exogeneity assumption we will use in the large-sample setting is

$$E(x_i \varepsilon_i) = 0_K \qquad i = 1, 2, \ldots, n$$

This is an orthogonality condition: we say $x_i$ is orthogonal to $\varepsilon_i$. It is a kind of moment condition or moment restriction. Specifically, we are saying that the unconditional moment is zero.

Sometimes a stronger assumption is made:

$$E(\varepsilon_i | x_i) = 0 \qquad i = 1, 2, \ldots, n$$

This is an example of a conditional moment restriction (there are other possibilities for conditioning, e.g., $x_j\ \forall\, j \le i$ for time series).

This particular restriction is similar to weak exogeneity in that in both cases $x_i$ is exogenous with respect to just the contemporaneous error $\varepsilon_i$. In what way is it stronger?

NB: Conditional vs. Unconditional Moment Conditions

$$E(x_i \varepsilon_i) = 0_K, \quad i = 1, 2, \ldots, n \qquad \text{Weak exogeneity}$$
$$E(\varepsilon_i | x_i) = 0, \quad i = 1, 2, \ldots, n \qquad \text{Conditional moment restriction}$$

In both cases $x_i$ is exogenous with respect to just the contemporaneous error $\varepsilon_i$. But the conditional moment restriction is stronger because it implies $E(f(x_i)\varepsilon_i) = 0$, i.e., not only is $x_i$ orthogonal to $\varepsilon_i$ (as in weak exogeneity), but any function $f$ of $x_i$ is orthogonal to $\varepsilon_i$.

This can be implied by some models, e.g., Rational Expectations models: $x_i$ is in the information set of agents, and $\varepsilon_i$ is some random shock or surprise they should not be able to predict.

NB: Conditional vs. Unconditional Moment Conditions

$$E(\varepsilon_i | x_i) = 0, \quad i = 1, 2, \ldots, n \qquad \text{Conditional moment restriction}$$

The conditional moment restriction implies $E(f(x_i)\varepsilon_i) = 0$, i.e., not only is $x_i$ orthogonal to $\varepsilon_i$ (as in weak exogeneity), but any function $f$ of $x_i$ is orthogonal to $\varepsilon_i$. We can show this:

$$
\begin{aligned}
E(f(x_i)\varepsilon_i) &= E[E(f(x_i)\varepsilon_i | x_i)] && \text{(a)} \\
&= E[f(x_i)E(\varepsilon_i | x_i)] && \text{(b)} \\
&= E[f(x_i)\cdot 0] && \text{(c)} \\
&= 0 && \text{(d)}
\end{aligned}
$$

Notes:
(a) By the Law of Total Expectations: $E[E(A|B)] = E(A)$.
(b) By the linearity of conditional expectations: since we are conditioning on $x_i$, we can treat any function of $x_i$ as nonrandom and move it out of the inner $E(\cdot)$.
(c) By our conditional moment restriction $E(\varepsilon_i | x_i) = 0$.

NB: Conditional vs. Unconditional Moment Conditions

Summary:

$$E(\varepsilon_i | X) = 0, \quad i = 1, 2, \ldots, n \qquad \text{Strict exogeneity}$$
$$E(x_i \varepsilon_i) = 0_K, \quad i = 1, 2, \ldots, n \qquad \text{Weak exogeneity ("predetermined")}$$
$$E(\varepsilon_i | x_i) = 0, \quad i = 1, 2, \ldots, n \qquad \text{Conditional moment restriction}$$
$$E(\varepsilon_i | x_j) = 0, \quad \forall\, i;\ \forall\, j \qquad \text{Conditional moment restriction}$$

And the last 2 are examples of different conditional moment restrictions. (Q: How are they different? Can you relate this to any economic models?)

Back to...

Assumptions of the Classical Regression Model: Remarks

Assumption 1.2: Strict exogeneity

$$E(\varepsilon_i | x_1, x_2, \ldots, x_n) = 0 \qquad i = 1, 2, \ldots, n \tag{1.1.7}$$

If we combine this with

Optional Extra Hayashi Assumption: I.I.D. random sample

$\{y_i, x_i\}$ is independently and identically distributed across observations $i$

then the strict exogeneity assumption simplifies:

$$E(\varepsilon_i | x_i) = 0 \qquad i = 1, 2, \ldots, n \tag{1.1.16}$$

Don't confuse this result with weak exogeneity! (1.1.16) follows from the i.i.d. assumption for $\{y_i, x_i\}$, which is often very strong. For example, if we are working with time series, the independent variables $x_i$ will usually be serially correlated - GDP last year is correlated with GDP this year. If the data are not i.i.d., then we have to include all the $x_1, x_2, \ldots, x_n$ in the definition of strict exogeneity.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.3: No multicollinearity

The rank of the $n \times K$ data matrix $X$ is $K$ with probability 1.

Another way to state this:

The $n \times K$ data matrix $X$ is full rank with probability 1.

The rank of a matrix is the number of linearly independent columns. A matrix is full rank if the rank equals the number of columns. A matrix is rank deficient or less than full rank if you can express one or more columns as linear combinations of the others.

(We could say "column rank" to distinguish the concept from "row rank", but since (a) $n > K$, (b) a basic matrix theorem says that the column rank always equals the row rank, and so (c) the row rank is also at most $K$, we will just say "rank".)

Assumptions of the Classical Regression Model: Remarks

Assumption 1.3: No multicollinearity

The rank of the $n \times K$ data matrix $X$ is $K$ with probability 1.

Classic example: the dummy variable trap. Say we have a dataset of male and female individuals. Define the male dummy $m$ as the $n \times 1$ column vector of dummies for individuals $i = 1, \ldots, n$. Define the female dummy $f$ similarly. Say we also have a constant term, i.e., a variable which is just an $n \times 1$ column vector of ones:

$$\iota_{n \times 1} \equiv \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \qquad \text{(another notation for this column vector is } \mathbf{1}\text{)}$$

There are no other regressors. Then the matrix of regressors is $X = [\iota \;\; m \;\; f]$. But it's easy to see that $\iota = m + f$. Thus $X$ is not full rank and the assumption fails.
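A tiny numerical sketch of the trap (not part of the notes; the dummy values below are made up):

```python
import numpy as np

# Sketch: the dummy variable trap. With a constant, a male dummy and a female dummy,
# one column is a linear combination of the others, so rank(X) < K and X'X is singular.
m = np.array([1.0, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # male dummy for n = 10 individuals
f = 1.0 - m                                      # female dummy
const = np.ones(10)

X = np.column_stack([const, m, f])
print("K =", X.shape[1], "rank =", np.linalg.matrix_rank(X))   # rank 2 < K = 3
print("det(X'X) =", np.linalg.det(X.T @ X))                    # (numerically) zero
```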

Assumptions of the Classical Regression Model: Remarks

Assumption 1.3: No multicollinearity

The rank of the $n \times K$ data matrix $X$ is $K$ with probability 1.

Why "with probability 1"? $X$ is stochastic. Bad luck might give us a sample of $n$ observations such that this particular data matrix $X$ is not full rank.

In the dummy variable example, we might be using a male dummy as a regressor along with a constant and some other regressors, but happen to draw a random sample of all men. If so, $X = [\iota \;\; m \;\; \ldots]$ will not be full rank (the first two columns are identical).

We say this is so unlikely that it happens with probability zero.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.4: Spherical error variance

(Recall, this has two parts: conditional homoskedasticity and independence.)

(a) Conditional homoskedasticity

$$E(\varepsilon_i^2 | X) = \sigma^2 \qquad i = 1, 2, \ldots, n$$

Hayashi and others call this "homoskedasticity", but as Hayashi makes clear, the "conditional" aspect is very important.

Note that since Assumption 1.2 says that the conditional mean of $\varepsilon_i$ is zero, this assumption is the same thing as:

(a) Conditional homoskedasticity

$$Var(\varepsilon_i | X) = \sigma^2 \qquad i = 1, 2, \ldots, n$$

Assumptions of the Classical Regression Model: Remarks

Assumption 1.4: Spherical error variance

(a) Conditional homoskedasticity

$$E(\varepsilon_i^2 | X) = \sigma^2 \qquad i = 1, 2, \ldots, n$$

If we in addition assume

Optional Extra Hayashi Assumption: I.I.D. random sample

$\{y_i, x_i\}$ is independently and identically distributed across observations $i$

then the conditional homoskedasticity assumption simplifies to

$$E(\varepsilon_i^2 | x_i) = \sigma^2 \qquad i = 1, 2, \ldots, n \tag{1.1.17}$$

Assumptions of the Classical Regression Model: Remarks

Assumption 1.4: Spherical error variance

(a) Conditional homoskedasticity

$$E(\varepsilon_i^2 | X) = \sigma^2 \qquad i = 1, 2, \ldots, n$$

Remark: just the assumption of an i.i.d. sample is not enough to imply conditional homoskedasticity.

Note that the "identical" in i.i.d. implies that the joint distribution of $\{\varepsilon_i, x_i\}$ doesn't depend on $i$. Thus:

An i.i.d. sample implies unconditional homoskedasticity: $E(\varepsilon_i^2)$ is constant across $i$.

An i.i.d. sample implies the functional form of $E(\varepsilon_i^2 | x_i)$ is constant across $i$.

But an i.i.d. sample does not imply that the value of $E(\varepsilon_i^2 | x_i)$ is constant across $i$, because the value of $x_i$ varies across $i$. Thus even if we assume an i.i.d. random sample, we still need Assumption 1.4(a) Conditional homoskedasticity as an additional assumption.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.4: Spherical error variance

(b) Independence

$$E(\varepsilon_i \varepsilon_j | X) = 0 \qquad \forall\, i, j,\; i \neq j \tag{1.1.13}$$

Combined with Assumption 1.2 (strict exogeneity), this becomes

$$Cov(\varepsilon_i, \varepsilon_j | X) = 0 \qquad \forall\, i, j,\; i \neq j$$

Hayashi calls this assumption "no correlation between observations".

For time series data, this means the error term is not serially correlated. This is often a strong assumption. It is also often a strong assumption in other contexts. If we were working with spatial data - individuals or whatever distributed over space - it means the errors of near neighbours are not correlated with each other. If we had a dataset of students, it means that the errors of students in the same school are not correlated.

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance
(a) Conditional homoskedasticity: $E(\varepsilon_i^2 | X) = \sigma^2$, $i = 1, 2, \ldots, n$
(b) Independence: $E(\varepsilon_i \varepsilon_j | X) = 0$, $\forall\, i, j,\; i \neq j$

Parts (a) and (b) can be combined into a single expression involving $E(\varepsilon\varepsilon' | X)$. Recall that the $\varepsilon\varepsilon'$ matrix is $n \times n$ with the $\varepsilon_i^2$s running down the diagonal and $\varepsilon_i\varepsilon_j$ on the off-diagonals:

$$
\varepsilon\varepsilon' = \begin{bmatrix}
\varepsilon_1^2 & \cdots & \varepsilon_1\varepsilon_j & \cdots & \varepsilon_1\varepsilon_n \\
\vdots & \ddots & \vdots & & \vdots \\
\varepsilon_i\varepsilon_1 & \cdots & \varepsilon_i\varepsilon_j & \cdots & \varepsilon_i\varepsilon_n \\
\vdots & & \vdots & \ddots & \vdots \\
\varepsilon_n\varepsilon_1 & \cdots & \varepsilon_n\varepsilon_j & \cdots & \varepsilon_n^2
\end{bmatrix}
$$

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance
(a) Conditional homoskedasticity: $E(\varepsilon_i^2 | X) = \sigma^2$, $i = 1, 2, \ldots, n$
(b) Independence: $E(\varepsilon_i \varepsilon_j | X) = 0$, $\forall\, i, j,\; i \neq j$

But (a) means the diagonal of $E(\varepsilon\varepsilon' | X)$ is $\sigma^2$s, and (b) means all the off-diagonals of $E(\varepsilon\varepsilon' | X)$ are zero. Thus the matrix simplifies hugely:

$$
E(\varepsilon\varepsilon' | X) = E\left(\begin{bmatrix}
\varepsilon_1^2 & \cdots & \varepsilon_1\varepsilon_n \\
\vdots & \ddots & \vdots \\
\varepsilon_n\varepsilon_1 & \cdots & \varepsilon_n^2
\end{bmatrix} \,\middle|\, X\right) = \ldots
$$

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance
(a) Conditional homoskedasticity: $E(\varepsilon_i^2 | X) = \sigma^2$, $i = 1, 2, \ldots, n$
(b) Independence: $E(\varepsilon_i \varepsilon_j | X) = 0$, $\forall\, i, j,\; i \neq j$

Taking expectations element by element, conditional on $X$:

$$
E(\varepsilon\varepsilon' | X) = \begin{bmatrix}
E(\varepsilon_1^2 | X) & \cdots & E(\varepsilon_1\varepsilon_n | X) \\
\vdots & \ddots & \vdots \\
E(\varepsilon_n\varepsilon_1 | X) & \cdots & E(\varepsilon_n^2 | X)
\end{bmatrix}
$$

And since the diagonals are all $\sigma^2$, and the off-diagonals are all 0s...

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance
(a) Conditional homoskedasticity: $E(\varepsilon_i^2 | X) = \sigma^2$, $i = 1, 2, \ldots, n$
(b) Independence: $E(\varepsilon_i \varepsilon_j | X) = 0$, $\forall\, i, j,\; i \neq j$

$$
E(\varepsilon\varepsilon' | X) = \begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix}
$$

Hence

$$E(\varepsilon\varepsilon' | X) = \sigma^2 I_n \tag{1.1.14}$$

...which is very concise.

(Hence the term "spherical" - it's proportional to the identity matrix.)

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance
(a) Conditional homoskedasticity: $E(\varepsilon_i^2 | X) = \sigma^2$, $i = 1, 2, \ldots, n$
(b) Independence: $E(\varepsilon_i \varepsilon_j | X) = 0$, $\forall\, i, j,\; i \neq j$

Assumption 1.4 can now be stated concisely and equivalently in terms of conditional second moments and cross-moments:

$$E(\varepsilon\varepsilon' | X) = \sigma^2 I_n \tag{1.1.14}$$

Or, making use of Assumption 1.2 Strict exogeneity, and using the notation $Var(\varepsilon | X)$ for the entire variance-covariance matrix of $\varepsilon$, we can write Assumption 1.4 in terms of conditional variances and covariances:

$$Var(\varepsilon | X) = \sigma^2 I_n$$

In traditional finite-sample presentations of OLS, the latter is usually used. The first version (eq. 1.1.14) is more convenient for developing large-sample theory.

Finite-sample properties of OLS

Proposition 1.1: Finite-sample distribution of OLS estimator b (H p. 27)

(a) Unbiasedness: $E(b | X) = \beta$. Requires Assumptions 1.1-1.3.

(b) Variance: $Var(b | X) = \sigma^2(X'X)^{-1}$. Requires Assumptions 1.1-1.4.

(c) Efficiency (Gauss-Markov, BLUE): $b$ is efficient in the class of linear unbiased estimators. Requires Assumptions 1.1-1.4.

(d) OLS estimator and residuals are uncorrelated: $Cov(b, e) = 0$. Requires Assumptions 1.1-1.4.

We now show (a) and (b) in detail and sketch (c).

Finite-sample properties of OLS

Proposition 1.1: Finite-sample distribution of OLS estimator b

(a) Unbiasedness: $E(b | X) = \beta$. Requires Assumptions 1.1-1.3.

The proof is short but instructive. It will be useful to compare it to the proof of consistency in the large-sample setting later on.

It is common for proofs of unbiasedness to work with $[E(b | X) - \beta]$ instead of $E(b | X)$. (We will do the same in our proofs of consistency.)

Finite-sample properties of OLS

Proof of (1.1.a) Unbiasedness: $E(b | X) = \beta$

$$
\begin{aligned}
E(b | X) - \beta &= E(b - \beta | X) && \text{(a)} \\
&= E((X'X)^{-1}X'\varepsilon | X) && \text{(b)} \\
&= (X'X)^{-1}X'E(\varepsilon | X) && \text{(c)} \\
&= 0 && \text{(d)}
\end{aligned}
$$

Notes:
(a) Since $\beta$ is a constant.
(b) From the definition of the sampling error of $b$.
(c) Since we are conditioning on $X$, we can treat any function of $X$ as nonrandom and move it out of the $E(\cdot)$.
(d) By Assumption 1.2 Strict exogeneity. This is key. For models where 1.2 fails (such as most time-series models), OLS is biased.

Note we used Assumptions 1.1-1.3 but we did not use Assumption 1.4 (spherical errors: conditional homoskedasticity and independence). Unbiasedness of OLS is robust to violations of this assumption.

Finite-sample properties of OLS

Proof of (1.1.b) Variance: $Var(b | X) = \sigma^2(X'X)^{-1}$
Requires Assumptions 1.1-1.4.

$$
\begin{aligned}
Var(b | X) &= Var(b - \beta | X) && \text{(a)} \\
&= Var((X'X)^{-1}X'\varepsilon | X) && \text{(b)} \\
&= (X'X)^{-1}X'\,Var(\varepsilon | X)\,X(X'X)^{-1} && \text{(c)} \\
&= (X'X)^{-1}X'\,\sigma^2 I_n\,X(X'X)^{-1} && \text{(d)} \\
&= \sigma^2(X'X)^{-1} && \text{(e)}
\end{aligned}
$$

Notes:
(a) Since $\beta$ is a constant.
(b) From the definition of the sampling error of $b$.
(c) Since we are conditioning on $X$, we can treat any function of $X$ as nonrandom and move it out of the $Var(\cdot)$.
(d) By Assumption 1.4.a and 1.4.b (when we combined (a) and (b) into a single statement).
(e) $(X'X)^{-1}$ and $(X'X)$ cancel (after moving the scalar $\sigma^2$ out of the way).
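A Monte Carlo sketch of this result (not part of the notes; the design matrix, $\sigma^2$ and replication count are arbitrary): holding $X$ fixed and redrawing $\varepsilon$, the simulated covariance matrix of $b$ should approximate $\sigma^2(X'X)^{-1}$.

```python
import numpy as np

# Sketch: simulated Var(b|X) versus the theoretical sigma^2 (X'X)^{-1}.
rng = np.random.default_rng(5)
n, sigma2, n_reps = 40, 2.0, 20000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X held fixed across replications
beta = np.array([1.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

draws = np.empty((n_reps, 2))
for r in range(n_reps):
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    draws[r] = XtX_inv @ X.T @ (X @ beta + eps)          # b = (X'X)^{-1} X'y

print("simulated Var(b|X):\n", np.cov(draws.T))
print("theoretical sigma^2 (X'X)^{-1}:\n", sigma2 * XtX_inv)
```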

Finite-sample properties of OLS

Alternate proof of (1.1.b) Variance: $Var(b | X) = \sigma^2(X'X)^{-1}$
Makes use of the fact that $Var(y | X) = Var(\varepsilon | X)$.
Requires Assumptions 1.1-1.4.

$$
\begin{aligned}
Var(b | X) &= Var((X'X)^{-1}X'y | X) && \text{(a)} \\
&= (X'X)^{-1}X'\,Var(y | X)\,X(X'X)^{-1} && \text{(b)} \\
&= (X'X)^{-1}X'\,Var(\varepsilon | X)\,X(X'X)^{-1} && \text{(c)} \\
&= (X'X)^{-1}X'\,\sigma^2 I_n\,X(X'X)^{-1} && \text{(d)} \\
&= \sigma^2(X'X)^{-1} && \text{(e)}
\end{aligned}
$$

Notes:
(a) Substitution.
(b) Since we are conditioning on $X$, we can treat any function of $X$ as nonrandom and move it out of the $Var(\cdot)$.
(c) From $Var(y | X) = Var(\varepsilon | X)$.
(d) By Assumption 1.4.a and 1.4.b.
(e) $(X'X)^{-1}$ and $(X'X)$ cancel (after moving the scalar $\sigma^2$ out of the way).

Finite-sample properties of OLS

Proof (sketch) of (1.1.c) Efficiency (Gauss-Markov, BLUE): $b$ is efficient in the class of linear unbiased estimators.
Requires Assumptions 1.1-1.4.
(See Hayashi pp. 29-30 or some other text for a full proof.)

Stated more carefully: For any other unbiased estimator $\hat\beta$ that is linear in $y$, $Var(\hat\beta | X) \geq Var(b | X)$.

Intuitively, "efficiency" means "precision", i.e., a small variance. If $b$ is "efficient", it means that no other estimator has a smaller variance. But note that $b$ is a vector. We therefore have to use the matrix definition of $\geq$. If $A$ and $B$ are square matrices, then:

$$A \geq B \iff (A - B) \text{ is positive semidefinite.}$$

(A $K \times K$ matrix $C$ is positive semidefinite if $x'Cx \geq 0$ for all vectors $x$.)

Finite-sample properties of OLS

Proof (sketch) of (1.1.c) Efficiency

We need to show that $Var(\hat\beta | X) - Var(b | X) = C$ is a positive semidefinite matrix for any other estimator $\hat\beta$ that is linear in $y$.

The way to start is to obtain an expression for $\hat\beta$ that has $b$ in it. Since $\hat\beta$ is linear in $y$, without loss of generality we pick some matrix $D(X)$ which is some function of $X$ so that we can write $\hat\beta$ as

$$\hat\beta = \left(D + (X'X)^{-1}X'\right)y = Dy + b = DX\beta + D\varepsilon + b$$

Taking conditional expectations of this plus the unbiasedness of $\hat\beta$ implies that $DX = 0$ (see Hayashi). So $\hat\beta = D\varepsilon + b$ and therefore

$$\hat\beta - \beta = D\varepsilon + (b - \beta)$$

Finite-sample properties of OLS

Proof (sketch) of (1.1.c) Efficiency

$$\hat\beta - \beta = D\varepsilon + (b - \beta)$$

The second term on the right we've seen before - it's the sampling error of $b$. Substitute and we get

$$\hat\beta - \beta = D\varepsilon + (X'X)^{-1}X'\varepsilon = \left(D + (X'X)^{-1}X'\right)\varepsilon$$

We are now set up to go, because $Var(\hat\beta - \beta | X) = Var(\hat\beta | X)$. So

$$Var(\hat\beta | X) = Var\left(\left(D + (X'X)^{-1}X'\right)\varepsilon \,\middle|\, X\right) = \ldots\text{(eventually)}\ldots = \sigma^2 DD' + \sigma^2(X'X)^{-1} = \sigma^2 DD' + Var(b | X)$$

And we're done, because $DD'$ is positive semidefinite and

$$Var(\hat\beta | X) - Var(b | X) = \sigma^2 DD'$$

Finite-sample properties of OLS

Proposition 1.2: Unbiasedness of $s^2$

$E(s^2 | X) = \sigma^2$. Requires Assumptions 1.1-1.4.
Proof: See Hayashi pp. 30-31.

Reminder: $s^2 = \frac{e'e}{n-K}$ is feasible - we have everything we need to calculate it.

Estimate of $Var(b | X)$

The variance of the OLS estimator $Var(b | X) = \sigma^2(X'X)^{-1}$ is infeasible - it depends on the unknown "nuisance parameter" $\sigma^2$.

So in practice we use an estimate of the variance of the OLS estimator $Var(b | X)$. The obvious one to use is

$$\widehat{Var}(b | X) = s^2(X'X)^{-1} \tag{1.3.4}$$

which is feasible - we have everything we need to calculate it.
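A short sketch (not part of the notes; simulated data) computing the feasible variance estimate (1.3.4) and the implied standard errors:

```python
import numpy as np

# Sketch: the feasible variance estimate s^2 (X'X)^{-1} and OLS standard errors.
rng = np.random.default_rng(6)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

n_obs, K = X.shape
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n_obs - K)
V_hat = s2 * np.linalg.inv(X.T @ X)        # estimated Var(b|X), eq. (1.3.4)
se = np.sqrt(np.diag(V_hat))               # standard errors
print("b:", b)
print("standard errors:", se)
```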

Finite-sample-based inference using OLS

Are we ready to go with our estimations and formulate and test hypotheses on $b$? Unfortunately not.

To test a hypothesis involving $b$, we need to form a test statistic from $b$ whose distribution under the null is known. We could possibly go down the route of large-sample asymptotic results, but then why not just develop the large-sample asymptotic framework from scratch? (This is what we will do shortly.)

What do we need to develop finite-sample (or "exact") results for the distribution of the OLS estimator $b$?

We have an expression for the sampling error of $b$,

$$b - \beta = (X'X)^{-1}X'\varepsilon$$

Since the sampling error is a function of $(X, \varepsilon)$, we could specify the joint distribution of $(X, \varepsilon)$ and work with that, but that is unattractive - how do we know what the true distribution is?

Finite-sample-based inference using OLS

Assumption 1.5: Normality of $\varepsilon$

The distribution of $\varepsilon$ conditional on $X$ is normal.

This is a strong assumption in econometrics, though it might sometimes be true or approximately true. But it simplifies the finite-sample inference problem hugely. Now we can derive the finite-sample distribution of $(b - \beta)$ and use this to construct hypothesis tests.

Finite-sample-based inference using OLS

Combine Assumptions 1.2, 1.4 and 1.5:

Assumption 1.2: Strict exogeneity: $E(\varepsilon_i | X) = 0$, $i = 1, 2, \ldots, n$
Assumption 1.4: Spherical error variance: $Var(\varepsilon | X) = \sigma^2 I_n$
Assumption 1.5: Normality of $\varepsilon$: The distribution of $\varepsilon$ conditional on $X$ is normal.

Then

$$\varepsilon | X \sim N(0, \sigma^2 I_n) \tag{1.4.1}$$

(Just plug the conditional mean and variance into the definition of a Normal random variable - they define a Normal distribution.)

This means the distribution of $\varepsilon$ conditional on $X$ doesn't depend on the latter; $\varepsilon$ and $X$ are independent. Thus the marginal or unconditional distribution of $\varepsilon$ is simply $\varepsilon \sim N(0, \sigma^2 I_n)$.

Finite-sample-based inference using OLS

The sampling error of $b$ is:

$$(b - \beta) = (X'X)^{-1}X'\varepsilon$$

which is linear in $\varepsilon$ given $X$. Since $\varepsilon$ is normal given $X$, $b - \beta$ is also normal given $X$. We know the conditional mean and variance of $(b - \beta)$ from Proposition 1.1.a and 1.1.b, so we just plug them in:

$$(b - \beta) | X \sim N\left(0, \sigma^2(X'X)^{-1}\right) \tag{1.4.2}$$

And we are almost ready to go.

Testing a single regression coefficient

The finite-sample distribution of the OLS estimator $b$:

$$(b - \beta) | X \sim N\left(0, \sigma^2(X'X)^{-1}\right)$$

Say we want to test a hypothesis about the $k$th coefficient $\beta_k$:

$$H_0: \beta_k = \bar\beta_k$$

where $\bar\beta_k$ is some specific hypothesized value. The distribution of $b$ implies that under the null (i.e., $\bar\beta_k$ is the "true" $\beta_k$),

$$(b_k - \bar\beta_k) | X \sim N\left(0, \sigma^2\left[(X'X)^{-1}\right]_{kk}\right)$$

where $\left[(X'X)^{-1}\right]_{kk}$ is a scalar, the $(k, k)$ element of $(X'X)^{-1}$.

Testing a single regression coefficient

If the null $H_0: \beta_k = \bar\beta_k$ is true, $(b_k - \bar\beta_k) | X \sim N\left(0, \sigma^2\left[(X'X)^{-1}\right]_{kk}\right)$.

Divide both sides by $\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{kk}}$ and we obtain a test statistic for our hypothesis:

$$z_k | X \sim N(0, 1) \qquad \text{where } z_k \equiv \frac{b_k - \bar\beta_k}{\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{kk}}} \tag{1.4.3}$$

This "z statistic" has a Normal distribution. To use it, we would go through the usual steps: (1) choose $\bar\beta_k$ and a significance level $\alpha$ (5% is common); (2) look up the critical values for the Normal distribution and significance level $\alpha$ (for $\alpha = 0.05$ these are $-1.96$ and $1.96$); (3) estimate by OLS and calculate the test statistic $z_k$; (4) compare $z_k$ to the critical values; (5) if $z_k$ is in the tails (e.g. less than $-1.96$ or bigger than $1.96$), reject the null hypothesis (extreme values for $z_k$ suggest $H_0$ is unlikely to be true).

Testing a single regression coefficient

$$z_k | X \sim N(0, 1) \qquad \text{where } z_k \equiv \frac{b_k - \bar\beta_k}{\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{kk}}} \tag{1.4.3}$$

The only problem is ... the test statistic $z_k$ is infeasible because it depends on the unknown nuisance parameter $\sigma^2$.

Exactly the same issue arises if we want to construct tests of linear hypotheses.

Tests of linear hypotheses

Say we want to test hypotheses involving linear combinations of the elements of $\beta$.

For example, say we estimate a log-linear Cobb-Douglas production function with inputs $\ln(K_i)$ and $\ln(L_i)$, and $y_i$ is log output:

$$y_i = \beta_1 + \beta_2\ln(K_i) + \beta_3\ln(L_i) + \varepsilon_i$$

We've seen how to test just the capital elasticity $\beta_2$ or just the labour elasticity $\beta_3$.

But we might want to test whether the two elasticities are jointly zero:

$$H_0: \beta_2 = 0 \text{ and } \beta_3 = 0$$

Or we might want to test whether we have constant returns to scale:

$$H_0: \beta_2 + \beta_3 = 1$$

Tests of linear hypotheses

To test linear hypotheses we write them as a system of linear equations:

$$H_0: R\beta = r \tag{1.4.8}$$

$r$ is a column vector with dimension $\#r$ (number of equations).

$R$ is $\#r \times K$, where $K$ is the number of elements of $\beta$.

Require $\mathrm{rank}(R) = \#r$, i.e., $R$ is full row rank. This means no redundant equations and no inconsistent equations (see Hayashi p. 40).

We can write any set of linear hypotheses this way.

Tests of linear hypotheses

To test linear hypotheses we write them as a system of linear equations:

$$H_0: R\beta = r \tag{1.4.8}$$

In the Cobb-Douglas example:

$$H_0: \beta_2 = 0 \text{ and } \beta_3 = 0$$

$$R = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} \qquad r = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

$$H_0: \beta_2 + \beta_3 = 1 \qquad \text{(CRS)}$$

$$R = \begin{bmatrix} 0 & 1 & 1 \end{bmatrix} \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} \qquad r = \begin{bmatrix} 1 \end{bmatrix}$$

Tests of linear hypotheses

If the null $H_0: R\beta = r$ is true, then

$$W | X \sim \chi^2(\#r) \qquad \text{where } W \equiv (Rb - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(Rb - r)$$

Proof (short version; see Hayashi p. 41): If the null is true, $R\beta = r$. Subtract $Rb$ from both sides and rearrange to get $(Rb - r) = R(b - \beta)$. We know $(b - \beta) | X \sim N\left(0, \sigma^2(X'X)^{-1}\right)$.

Hence conditional on $X$, $(Rb - r)$ is normal with mean $0$ and variance

$$Var(Rb - r | X) = Var\left(R(b - \beta) | X\right) = R\,Var(b - \beta | X)\,R' = \sigma^2 R(X'X)^{-1}R'$$

Hence

$$W \equiv (Rb - r)'\left[Var(Rb - r | X)\right]^{-1}(Rb - r)$$

Fact: if the $m$-dimensional vector $z \sim N(\mu, \Sigma)$ and $\Sigma$ is nonsingular, then $(z - \mu)'\Sigma^{-1}(z - \mu) \sim \chi^2(m)$.

$\sigma^2 R(X'X)^{-1}R'$ is nonsingular. Hence $W | X \sim \chi^2(\#r)$. Done!

Tests of linear hypotheses

If the null $H_0: R\beta = r$ is true, then

$$W | X \sim \chi^2(\#r) \qquad \text{where } W \equiv (Rb - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(Rb - r)$$

This is just a generalization of the $z$ statistic for testing the simple hypothesis $H_0: \beta_k = \bar\beta_k$. (Easy review question: what are $R$ and $r$ in this case?) We would use $W$ in the same way: (1) choose $R$, $r$ and a significance level $\alpha$ (5% is common); (2) look up the critical value for the $\chi^2(\#r)$ distribution and significance level $\alpha$; (3) estimate by OLS and calculate the test statistic $W$; (4) compare $W$ to the critical value; (5) if $W$ is in the tail, reject the null hypothesis (a large value for $W$ suggests $H_0$ is unlikely to be true).

And we have the same problem: the test statistic $W$ is infeasible because it depends on the unknown nuisance parameter $\sigma^2$.

Finite-sample-based inference using OLS

Under $H_0: \beta_k = \bar\beta_k$, $z_k | X \sim N(0, 1)$, where $z_k \equiv \dfrac{b_k - \bar\beta_k}{\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{kk}}}$.

Under $H_0: R\beta = r$, $W | X \sim \chi^2(\#r)$, where $W \equiv (Rb - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(Rb - r)$.

$z_k$ and $W$ are infeasible test statistics because they depend on the unknown nuisance parameter $\sigma^2$.

What if instead of the unknown $\sigma^2$ we use the OLS estimator $s^2 = \frac{e'e}{n-K}$?

Bad news: $z_k$ and $W$ are no longer exactly distributed as Normal and $\chi^2$.

Good news: we have substitute test statistics for $z_k$ and $W$ where we do know their finite-sample distributions.

Testing a single regression coefficient

Under $H_0: \beta_k = \bar\beta_k$, $z_k | X \sim N(0, 1)$, where $z_k \equiv \dfrac{b_k - \bar\beta_k}{\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{kk}}}$.

If we replace $\sigma^2$ with the OLS estimator $s^2 = \frac{e'e}{n-K}$, we obtain a different, feasible test statistic with a known distribution:

Under $H_0: \beta_k = \bar\beta_k$,

$$t_k | X \sim t(n - K) \qquad \text{where } t_k \equiv \frac{b_k - \bar\beta_k}{\sqrt{s^2\left[(X'X)^{-1}\right]_{kk}}}$$

And now we are ready to go. No nuisance parameter problem.
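A minimal sketch of the feasible $t$ test in Python (not part of the notes; simulated data, hypothesized values set to zero for every coefficient):

```python
import numpy as np
from scipy import stats

# Sketch: t statistics for H0: beta_k = 0 using s^2 (X'X)^{-1}, with p-values from t(n-K).
rng = np.random.default_rng(7)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

n_obs, K = X.shape
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n_obs - K)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

t_stats = b / se                                        # hypothesized values are all zero here
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n_obs - K)  # two-sided p-values
print("t statistics:", t_stats)
print("p-values:", p_vals)
```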

Tests of linear hypotheses

Under $H_0: R\beta = r$, $W | X \sim \chi^2(\#r)$, where $W \equiv (Rb - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(Rb - r)$.

If we replace $\sigma^2$ with the OLS estimator $s^2 = \frac{e'e}{n-K}$, we can construct a different, feasible test statistic with a known distribution:

Under $H_0: R\beta = r$,

$$F | X \sim F(\#r, n - K) \qquad \text{where } F \equiv \frac{(Rb - r)'\left[s^2 R(X'X)^{-1}R'\right]^{-1}(Rb - r)}{\#r}$$

Note that we have to divide by the "numerator degrees of freedom" $\#r$. (Move $s^2$ into the denominator of $F$ and the denominator is divided by $n - K$, hence we call it the "denominator degrees of freedom".)

And now we are ready to go. No nuisance parameter problem.
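A sketch of the Wald-form $F$ test (not part of the notes; the simulated Cobb-Douglas-style data and the CRS restriction below are illustrative choices):

```python
import numpy as np
from scipy import stats

# Sketch: the Wald-form F statistic for H0: R beta = r,
# here testing constant returns to scale beta_2 + beta_3 = 1.
rng = np.random.default_rng(8)
n = 200
lnK, lnL = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), lnK, lnL])
y = 1.0 + 0.4 * lnK + 0.6 * lnL + rng.normal(scale=0.5, size=n)   # CRS holds in the DGP

n_obs, K = X.shape
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n_obs - K)

R = np.array([[0.0, 1.0, 1.0]])
r = np.array([1.0])
diff = R @ b - r
V = s2 * R @ np.linalg.inv(X.T @ X) @ R.T
F = (diff @ np.linalg.solve(V, diff)) / R.shape[0]
p_val = stats.f.sf(F, R.shape[0], n_obs - K)
print("F:", F, "p-value:", p_val)       # large p-value: fail to reject CRS
```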

Testing using the Wald Principle vs. LR Principle

$t$ and $F$ (and $z$ and $W$) are Wald test statistics.

Wald Principle: estimate the unrestricted equation, i.e. do not impose the constraints in the null $H_0$. Then calculate the "cost" of imposing the constraints.

LR (Likelihood Ratio) Principle: estimate the unrestricted equation and the restricted equation, and construct a test statistic based on the values of the two objective functions.

Unrestricted: $b \equiv \arg\min_{\tilde\beta} SSR(\tilde\beta)$
Restricted: $b_R \equiv \arg\min_{\tilde\beta} SSR(\tilde\beta)$ s.t. $R\tilde\beta = r$

Values of the two objective functions at their minima: $SSR_U$ and $SSR_R$.

NB: LM (Lagrange Multiplier) Principle: estimate the restricted equation. Then calculate the "reduction in cost" from relaxing the constraints in $H_0$.

Testing using the Wald Principle vs. LR Principle

It turns out that in this particular case the test statistic using the LR principle is exactly the same as the Wald $F$ test statistic if we use a common estimate of the error variance.

$$F = \frac{(SSR_R - SSR_U)/\#r}{SSR_U/(n-K)}$$

Numerator: difference in minimized objective functions. Denominator: estimate of the error variance ($s^2$ from the unrestricted estimator).

Classic example: Chow test for a "structural break", i.e. do two regressions fit better than one? Restricted: $SSR_R$ from fitting a single OLS regression on the entire sample. Unrestricted: $SSR_U = SSR_1 + SSR_2$ from fitting two separate regressions to the two parts of the sample. Error variance: $SSR_U/(n - 2K)$ since we have a total of $2K$ parameters in the two separate regressions. Large $F$ means reject the null, conclude there are two separate regimes, i.e. there is a structural break. See Hayashi p. 175.
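A sketch of the LR-form $F$ statistic (not part of the notes; simulated data, and the restriction here is exclusion of the two slopes rather than a Chow-style break, which would use two subsamples instead):

```python
import numpy as np
from scipy import stats

# Sketch: F = [(SSR_R - SSR_U)/#r] / [SSR_U/(n-K)], testing that the two slope
# coefficients are jointly zero (restricted model = constant only).
def ssr(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(9)
n = 150
X_U = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # unrestricted regressors
y = X_U @ np.array([1.0, 0.3, -0.2]) + rng.normal(size=n)
X_R = X_U[:, :1]                                               # restricted: constant only

SSR_U, SSR_R = ssr(X_U, y), ssr(X_R, y)
n_obs, K = X_U.shape
num_r = X_U.shape[1] - X_R.shape[1]                            # number of restrictions
F = ((SSR_R - SSR_U) / num_r) / (SSR_U / (n_obs - K))
print("F:", F, "p-value:", stats.f.sf(F, num_r, n_obs - K))
```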

Finite-Sample Theory for OLS: Summary and Remarks

To develop the exact finite-sample distribution of the OLS estimator, we needed to make the following assumptions:

Assumption 1.1: Linearity: $y = X\beta + \varepsilon$
Assumption 1.2: Strict exogeneity: $E(\varepsilon_i | X) = 0$, $i = 1, 2, \ldots, n$
Assumption 1.3: No multicollinearity: $X$ is full rank with probability 1.
Assumption 1.4: Spherical error variance: $Var(\varepsilon | X) = \sigma^2 I_n$
Assumption 1.5: Normality of $\varepsilon$

All of these assumptions (with the possible exceptions of 1.3 and 1.1) are unattractive. We don't want our estimates and inferences to depend heavily on assumptions that we don't believe and that are likely to be violated in reality.

Loosening these assumptions and still obtaining finite-sample results is often difficult or impossible. It is much easier to relax these assumptions in a large-sample setting and rely on asymptotic results.

Finite-Sample Theory for OLS: Summary and Remarks

The main exception to this is Assumption 1.4:

Assumption 1.4: Spherical error variance: $E(\varepsilon\varepsilon' | X) = \sigma^2 I_n$
(Reminder: this has two parts, conditional homoskedasticity and independence.)

It is possible to relax this assumption and still obtain finite-sample results. The results and methods are, moreover, generally useful, including in the large-sample setting. This is the method of Generalized Least Squares (GLS), to which we turn in a later lecture.
