
SGPE Econometrics 1

Lecture Notes 1: Finite Sample OLS

Mark Schaffer
Heriot-Watt University

Autumn 2016


Lecture Outline

1. Motivation
2. Notation, definitions, tools
3. Assumptions of the classical regression model and OLS
4. Finite-sample properties of OLS

Main reading: Hayashi Chapter 1

Greene's Econometric Analysis has a good matrix algebra appendix.

A thorough summary of many matrix algebra facts and results is The Matrix Cookbook by Petersen and Pedersen, available online at
www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pd
or www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.

Finite Sample OLS

Key concepts:

1. Finite-sample vs. large-sample
2. Assumptions of the classical linear regression model
3. Deriving the OLS estimator
4. Finite-sample properties of the OLS estimator
5. Unbiased estimator
6. Efficient estimator
7. Mean squared error
8. Frisch-Waugh-Lovell Theorem

Finite Sample OLS

Key terms:

1. Moment
2. Strict exogeneity
3. I.I.D. (independently and identically distributed)
4. Spherical error variance
5. Conditional homoskedasticity; conditional heteroskedasticity
6. Independence ("no serial correlation")
7. Unbiased estimator
8. Efficient estimator
9. Mean squared error

Motivation

We want to estimate a parameter and perform inference using it. In econometrics, we have two standard settings for characterising estimators and developing results: "finite sample" and "large sample".

Finite sample setting:

In repeated samples of size $n$, what is $E(\hat\theta)$? Is $E(\hat\theta)=\theta$, i.e., is our estimator unbiased?

Our estimator $\hat\theta$ is a statistic - it is a function of the data. In repeated samples of size $n$, what is the distribution of $\hat\theta$, and $Var(\hat\theta)$ in particular?

Our expression for $Var(\hat\theta)$ may be infeasible because it depends on parameters we don't know. How do we obtain an estimate $\widehat{Var}(\hat\theta)$?

In practice, econometric theory often doesn't provide exact answers to these questions. This is where Monte Carlo methods are very helpful: we choose a specific sample size $n$ and values for all the relevant parameters and the computer answers the questions.

Motivation

Large sample setting:

It turns out to be a lot easier to develop theoretical results for $\hat\theta$ in the large sample setting, i.e., as $n \to \infty$. We also say this is the asymptotic setting, and we say our results are asymptotic or asymptotically valid.

What is the probability limit of $\hat\theta$ as $n \to \infty$? Does $\hat\theta$ converge to $\theta$, i.e., is $\mathrm{plim}_{n\to\infty}(\hat\theta)=\theta$? In other words, is our estimator consistent? (Loosely speaking, does any bias in $\hat\theta$ disappear as the sample size gets larger and larger?)

Our estimator $\hat\theta$ is a statistic - it is a function of the data. What is the limiting distribution of $\sqrt{n}(\hat\theta-\theta)$ as $n \to \infty$, i.e., what is $AVar(\hat\theta)$?

Our expression for $AVar(\hat\theta)$ may be infeasible because it depends on parameters we don't know. How do we obtain an estimate $\widehat{AVar}(\hat\theta)$?

Motivation

What makes for a good estimator?

Central tendency: unbiasedness (finite-sample setting) or consistency (large-sample setting). In repeated samples, $\hat\theta$ will be centred around the true $\theta$; as the sample size grows, $\hat\theta$ converges to $\theta$.

Variance: efficiency (finite-sample setting) or asymptotic efficiency (large-sample setting). The variance of $\hat\theta$ is "small".

How to weight these two criteria? What should be our loss function? Might we sometimes prefer to use a biased or inconsistent estimator $\hat\theta$ because it has a smaller variance than some other unbiased or consistent estimator $\tilde\theta$?

Commonly-used loss function: mean squared error.

Motivation

Definition: The mean squared error of $\hat\theta$ is $MSE = E(\hat\theta-\theta)^2$. MSE is a loss function that weights bias and variance equally.

Theorem: $MSE = Var(\hat\theta) + \left[\mathrm{bias}(\hat\theta)\right]^2$

Proof: Let $\mu = E(\hat\theta)$. Then

$$
\begin{aligned}
E(\hat\theta-\theta)^2 &= E\left[(\hat\theta-\mu)+(\mu-\theta)\right]^2 \\
&= E\left[(\hat\theta-\mu)^2 + 2(\mu-\theta)(\hat\theta-\mu) + (\mu-\theta)^2\right] \\
&= E(\hat\theta-\mu)^2 + 2(\mu-\theta)E(\hat\theta-\mu) + (\mu-\theta)^2 \\
&= E(\hat\theta-\mu)^2 + 2(\mu-\theta)\cdot 0 + (\mu-\theta)^2 \\
&= Var(\hat\theta) + \left[\mathrm{bias}(\hat\theta)\right]^2
\end{aligned}
$$

where at various points we have made use of the fact that $\mu$ can be treated as a constant and so $E(\mu) = E(E(\hat\theta)) = \mu$.

Motivation

Common pattern in econometrics:

$\hat\theta$ does not have a finite-sample justification. It's biased, or we just don't know what $E(\hat\theta)$ is, or we can't derive $Var(\hat\theta)$.

$\hat\theta$ does have a large-sample justification. It's consistent, and we can estimate the asymptotic variance.

Because $\hat\theta$ is biased in finite samples, we are worried about how it will perform in practice. No one ever actually has an infinite sample.

Sometimes we can derive expressions for the finite-sample bias or otherwise study finite-sample performance theoretically.

Often we do MC exercises to see how the estimator performs. Increasing the sample size in a series of MCs can indicate how quickly the asymptotic justification kicks in. We can compare how different estimators perform in different settings according to their bias, variance and MSE.

This lecture: finite-sample theory for OLS.
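As a quick illustration of the Monte Carlo idea, here is a minimal Python sketch (not part of the notes; the sample size, parameter values and data-generating process are arbitrary choices) that draws repeated samples, computes OLS each time, and reports bias, variance and MSE.

```python
import numpy as np

# Minimal Monte Carlo sketch: repeated samples of size n from y = X*beta + eps,
# recording the OLS estimate each time, then reporting bias, variance and MSE.
rng = np.random.default_rng(42)
n, n_reps = 50, 5000
beta = np.array([1.0, 2.0])          # true parameters (constant and one slope)

estimates = np.empty((n_reps, 2))
for r in range(n_reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])     # n x K data matrix with a constant
    eps = rng.normal(size=n)                 # spherical, normal errors
    y = X @ beta + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)   # b = (X'X)^{-1} X'y

bias = estimates.mean(axis=0) - beta
var = estimates.var(axis=0)
mse = ((estimates - beta) ** 2).mean(axis=0)
print("bias:", bias)                 # close to zero: OLS is unbiased here
print("variance:", var)
print("MSE:", mse)                   # equals var + bias^2 (the decomposition above)
```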

Notation and definitions (and lots of it)

In general: matrices in upper case, vectors in bold lower case, scalars lower case.

$n$: Total number of observations
$i$: Indexes observations. (We'll switch to $t$ for time series.)
$y_i$: Dependent variable; $i$th observation is a scalar
$x_i$: Independent variables; $i$th observation is a $K \times 1$ data vector
$x_i'$: Independent variables; $i$th observation is a $1 \times K$ data row
$x_{ik}$: Independent variable $k$; $i$th observation is a scalar
$\varepsilon_i$: Error term; $i$th observation is a scalar
$\varepsilon$: Vector of errors, $n \times 1$
$\beta$: $K \times 1$ parameter vector
$\beta_k$: Coefficient on $x_{ik}$
$y$: Data vector of dependent variable, $n \times 1$
$X$: Data matrix of independent vars, $n \times K$
$0$: Scalar zero
$\mathbf{0}$: Vector of zeros

Notation and definitions

Linear model, data row form:

$$y_i = x_i'\beta + \varepsilon_i$$

Linear model, matrix form:

$$y = X\beta + \varepsilon$$

where

$$
y_{n \times 1} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad
X_{n \times K} = \begin{bmatrix} x_1' \\ \vdots \\ x_n' \end{bmatrix}, \quad
\beta_{K \times 1} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_K \end{bmatrix}, \quad
\varepsilon_{n \times 1} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$

Notation and definitions

Curiously, it is traditional to denote the OLS estimator, residuals and error variance by the special symbols $b$, $e$ and $s^2$.

$\beta$: "True" parameter vector
$\hat\beta$: Some estimator for $\beta$
$\tilde\beta$: Some other estimator for $\beta$
$b$: The OLS estimator for $\beta$
$\hat\beta_{OLS}$: The OLS estimator for $\beta$. Same meaning as $b$.

$\varepsilon$: Vector of errors, $n \times 1$
$\hat\varepsilon = y - X\hat\beta$: Residuals defined by estimator $\hat\beta$
$\tilde\varepsilon = y - X\tilde\beta$: Residuals defined by some other estimator $\tilde\beta$
$e = y - Xb$: Residuals defined by the OLS estimator $b$ (no subscript!)
$\hat\varepsilon_{OLS} = y - Xb$: Residuals defined by the OLS estimator $b$. Same as above.

Notation and definitions

Curiously, it is traditional to denote the OLS estimator, residuals and error variance by the special symbols $b$, $e$ and $s^2$.

$\sigma^2$: $Var(\varepsilon_i | X)$
$\hat\sigma^2$: Some estimator for $\sigma^2$
$\tilde\sigma^2$: Some other estimator for $\sigma^2$
$\hat\sigma^2_{OLS}$: The OLS estimator for $\sigma^2$
$s^2$: The OLS estimator for $\sigma^2$. Same meaning as $\hat\sigma^2_{OLS}$.

The OLS estimator

The OLS estimator of $\beta$ in traditional data matrix form:

$$b = (X'X)^{-1}X'y$$

The OLS estimator of $\sigma^2$:

$$s^2 = \frac{e'e}{n-K}$$

Note the $K$ in the definition of $s^2$. Asymptotically this doesn't matter - $K$ is fixed as $n \to \infty$ - but in finite samples it will make a difference (of course).
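A minimal numpy sketch of these two formulas (not part of the notes; the simulated data and parameter values are arbitrary):

```python
import numpy as np

# Sketch: computing b = (X'X)^{-1} X'y and s^2 = e'e/(n-K)
# for a data matrix X (n x K, including a constant column) and y (n x 1).
def ols(X: np.ndarray, y: np.ndarray):
    n, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations X'Xb = X'y
    e = y - X @ b                           # OLS residuals
    s2 = (e @ e) / (n - K)                  # degrees-of-freedom-corrected error variance
    return b, e, s2

# Example with simulated data
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)
b, e, s2 = ols(X, y)
print(b, s2)
```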

Sum of Squared Residuals (SSR)

The sum of squared residuals (SSR) for some estimator $\tilde\beta$ can be written in various ways:

$$SSR(\tilde\beta) = \tilde\varepsilon'\tilde\varepsilon = (y - X\tilde\beta)'(y - X\tilde\beta) = \sum_{i=1}^n \tilde\varepsilon_i^2 = \sum_{i=1}^n (y_i - x_i'\tilde\beta)^2$$

Definition of OLS estimator $b$: it minimizes the sum of squared residuals.

$$b \equiv \arg\min_{\tilde\beta} SSR(\tilde\beta)$$

Deriving the OLS estimator

The OLS estimator has multiple justifications. Here we derive it as a least-squares estimator; later we will see it justified as a Method of Moments (MM) estimator and as a Maximum Likelihood (ML) estimator.

Easiest way: use vectors/matrices. Note: we will use the assumption (more on this below) that $X'X$ is full rank.

First multiply out:

$$
\begin{aligned}
SSR(\tilde\beta) &= \tilde\varepsilon'\tilde\varepsilon \\
&= (y - X\tilde\beta)'(y - X\tilde\beta) \\
&= (y' - \tilde\beta'X')(y - X\tilde\beta) \\
&= y'y - \tilde\beta'X'y - y'X\tilde\beta + \tilde\beta'X'X\tilde\beta \\
&= y'y - 2y'X\tilde\beta + \tilde\beta'X'X\tilde\beta \qquad \text{(since } \tilde\beta'X'y \text{ is a scalar)}
\end{aligned}
\tag{1.2.2}
$$

Deriving the OLS estimator

$$SSR(\tilde\beta) = y'y - 2y'X\tilde\beta + \tilde\beta'X'X\tilde\beta \tag{1.2.2}$$

Next, differentiate $SSR(\tilde\beta)$ with respect to $\tilde\beta$. Since $SSR(\tilde\beta)$ is a scalar and $\tilde\beta$ is a $K \times 1$ vector, the result is a $K \times 1$ vector of first derivatives (the "gradient"). Note that:

1. Since $y'y$ does not depend on $\tilde\beta$, $\dfrac{\partial(y'y)}{\partial\tilde\beta} = 0$.
2. Since $\dfrac{\partial(a'\tilde\beta)}{\partial\tilde\beta} = a$, $\dfrac{\partial(-2y'X\tilde\beta)}{\partial\tilde\beta} = -2X'y$. (Note the transposition.)
3. Since $\dfrac{\partial(\tilde\beta'A\tilde\beta)}{\partial\tilde\beta} = 2A\tilde\beta$ for $A$ symmetric, $\dfrac{\partial(\tilde\beta'X'X\tilde\beta)}{\partial\tilde\beta} = 2X'X\tilde\beta$.

Hence:

$$\frac{\partial SSR(\tilde\beta)}{\partial\tilde\beta} = -2X'y + 2X'X\tilde\beta$$

Deriving the OLS estimator as an LS estimator

$$\frac{\partial SSR(\tilde\beta)}{\partial\tilde\beta} = -2X'y + 2X'X\tilde\beta$$

The first-order conditions for $\min_{\tilde\beta} SSR(\tilde\beta)$ are $\dfrac{\partial SSR(\tilde\beta)}{\partial\tilde\beta} = 0$, so substitute to get $-2X'y + 2X'Xb = 0$ and rearrange:

$$X'Xb = X'y \qquad \text{(the } K \text{ "normal equations")}$$

where we substituted $b$ for $\tilde\beta$ because it solves the FOCs. Since $X'X$ is full rank by assumption, and hence positive definite and nonsingular (= invertible), we can solve the normal equations for $b$ by premultiplying by $(X'X)^{-1}$:

$(X'X)^{-1}X'Xb = (X'X)^{-1}X'y$, and after cancelling terms we have $b$:

$$b = (X'X)^{-1}X'y \tag{1.2.5}$$

Deriving the OLS estimator

One more step is required: we have to check the second order condition to confirm that the SSR is minimized. For a matrix problem, we check that the Hessian (the matrix of second derivatives) is positive definite.

$$\frac{\partial SSR(\tilde\beta)}{\partial\tilde\beta} = -2X'y + 2X'X\tilde\beta \qquad \text{(gradient, vector of first derivatives)}$$

$$\frac{\partial^2 SSR(\tilde\beta)}{\partial\tilde\beta\,\partial\tilde\beta'} = 2X'X \qquad \text{(Hessian, matrix of second derivatives)}$$

And since we've assumed $X'X$ is full rank and hence positive definite, $2X'X$ is also positive definite and the second order conditions for a minimization are satisfied. Done!

More useful OLS expressions

Uncentered and centred $R^2$:

$$R^2_u = 1 - \frac{e'e}{y'y}$$

Recall that $\iota$ is an $n \times 1$ column vector of ones, and $\bar y$ is the sample mean of $y_i$. Denote by $\tilde y$ the "demeaned" or "centered" values of $y$:

$$\tilde y \equiv y - \iota\bar y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \bar y \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} y_1 - \bar y \\ \vdots \\ y_n - \bar y \end{bmatrix}$$

Then the centred $R^2$ is:

$$R^2 = 1 - \frac{e'e}{\tilde y'\tilde y}$$

More useful OLS expressions

The sampling error of $b$ is the difference between the OLS estimate $b$ and the true $\beta$:

$$
\begin{aligned}
b - \beta &= (X'X)^{-1}X'y - \beta \\
&= (X'X)^{-1}X'(X\beta + \varepsilon) - \beta \\
&= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon - \beta \\
&= \beta + (X'X)^{-1}X'\varepsilon - \beta \\
&= (X'X)^{-1}X'\varepsilon
\end{aligned}
$$

NB: proofs of unbiasedness and consistency typically require this kind of substitution and simplification.

Some special matrices

The projection matrix $P_X$ projects some (usually) data matrix or vector $Z$ onto the linear subspace defined by $X$:

$$P_X \equiv X(X'X)^{-1}X'$$

$P_X$ is $n \times n$.

The annihilation matrix $M_X$ gives the difference between $Z$ and the projection of $Z$ onto $X$:

$$M_X \equiv I_n - X(X'X)^{-1}X'$$

$M_X$ is $n \times n$.

Both $P_X$ and $M_X$ are symmetric and idempotent:

$$P_X' = \left(X(X'X)^{-1}X'\right)' = X(X'X)^{-1}X' = P_X$$

$$P_X'P_X = \left(X(X'X)^{-1}X'\right)'\left(X(X'X)^{-1}X'\right) = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = P_X$$

and similarly for $M_X$.

Some special matrices

The projection and annihilation matrices make it easy to write some OLS expressions:

$$\hat y = Xb = X(X'X)^{-1}X'y = P_X y$$

So $P_X y$ gives us the fitted values of $y$ from an OLS regression of $y$ on $X$. Note that $P_X y$ is $n \times 1$.

$$e = y - \hat y = y - Xb = y - X(X'X)^{-1}X'y = \left(I_n - X(X'X)^{-1}X'\right)y = M_X y$$

So $M_X y$ gives us the residuals from an OLS regression of $y$ on $X$. Note that $M_X y$ is also $n \times 1$.
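A short numpy sketch (not part of the notes; simulated data) checking the properties just stated: symmetry, idempotency, and that $P_X y$ and $M_X y$ reproduce the OLS fitted values and residuals.

```python
import numpy as np

# Sketch: P = X(X'X)^{-1}X' and M = I - P are symmetric and idempotent,
# and Py, My give the OLS fitted values and residuals.
rng = np.random.default_rng(1)
n, K = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

assert np.allclose(P, P.T) and np.allclose(P @ P, P)     # symmetric, idempotent
assert np.allclose(M, M.T) and np.allclose(M @ M, M)

b = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(P @ y, X @ b)        # fitted values
assert np.allclose(M @ y, y - X @ b)    # residuals
print("projection and annihilation checks passed")
```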

The Frisch-Waugh-Lovell (FWL) theorem

The Frisch-Waugh-Lovell theorem is extremely handy. It allows us to simplify linear models and theorems by "partialling out" selected regressors - call them $X_1$.

$X_1$ may be uninteresting because they get in the way of the exposition, or because $\beta_1$ are nuisance parameters that can't be estimated consistently, or because a particular $K \times K$ matrix corresponding to the full set of $X$s can't be inverted, but the $K_2 \times K_2$ matrix corresponding to the interesting $X_2$s can be inverted. (We have a particular matrix in mind here...)

The FWL theorem (continued)

Partition $X$ so that $X$ is split between regressors of interest $X_2$ and uninteresting regressors $X_1$. Partition $\beta$ conformably into $\beta_1$ and $\beta_2$.

$$X \equiv [X_1 \;\; X_2] \qquad \beta \equiv \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$$

$$y = X\beta + \varepsilon = [X_1 \;\; X_2]\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \varepsilon$$

$$y = Xb + e = [X_1 \;\; X_2]\begin{bmatrix} b_1 \\ b_2 \end{bmatrix} + e$$

The FWL theorem

Define projection and annihilation matrices using the uninteresting regressors $X_1$:

$$P_{X_1} \equiv X_1(X_1'X_1)^{-1}X_1' \qquad M_{X_1} \equiv I_n - P_{X_1}$$

Regress $y$ and $X_2$ on $X_1$ using OLS and collect the residuals:

$$\tilde X_2 = M_{X_1}X_2 \qquad \tilde y = M_{X_1}y$$

Frisch-Waugh-Lovell Theorem: Regress the $\tilde y$ residuals on the $\tilde X_2$ residuals using OLS and you get the same $b_2$ as when you do OLS using the full set of $X$s:

$$\tilde b_2 = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'\tilde y = b_2$$

The FWL theorem

Frisch-Waugh-Lovell Theorem: Regress the $\tilde y$ residuals on the $\tilde X_2$ residuals using OLS and you get the same $b_2$ as when you do OLS using the full set of $X$s:

$$\tilde b_2 = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'\tilde y = b_2$$

Moreover, the residuals are also the same:

$$\tilde y - \tilde X_2\tilde b_2 = e$$

Question: what if we defined the uninteresting regressors $X_1$ as simply the constant, i.e., $X_1 = \iota$?

Answer: $\tilde y$ would be the same "centred" or "demeaned" $\tilde y$ used in the definition of the centred $R^2$ used above. $\tilde X_2$ would be the regressors (excluding the constant) in mean-deviation form, i.e., each $x_{ik}$ would become $(x_{ik} - \bar x_k)$.
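A numerical check of the theorem, as a sketch (not part of the notes; the split into $X_1$ and $X_2$ and the simulated parameters are arbitrary):

```python
import numpy as np

# Sketch: verifying the FWL theorem numerically.
# Regressing M_{X1}y on M_{X1}X2 reproduces the b2 block from the full regression.
rng = np.random.default_rng(2)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])       # "uninteresting" regressors
X2 = rng.normal(size=(n, 2))                                  # regressors of interest
X = np.column_stack([X1, X2])
y = X @ np.array([1.0, 0.5, 2.0, -1.0]) + rng.normal(size=n)

b_full = np.linalg.solve(X.T @ X, X.T @ y)                    # full OLS; b2 is the last two entries

M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T         # annihilator for X1
y_t, X2_t = M1 @ y, M1 @ X2                                   # partialled-out y and X2
b2_fwl = np.linalg.solve(X2_t.T @ X2_t, X2_t.T @ y_t)

assert np.allclose(b2_fwl, b_full[2:])
print("FWL check passed:", b2_fwl, b_full[2:])
```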

Moments

A "moment" is a measure of the shape of a distribution (the term is borrowed from physics).

If $a_i$ and $c_i$ are scalar random variables, then the following are population moments:

$E(a_i)$: First moment of $a_i$. Same as mean of $a_i$.
$E(a_i^2)$: Second moment of $a_i$.
$E(a_i - E(a_i))^2$: Second central moment. Same as $Var(a_i)$.
$E(a_i c_i)$: Cross moment of $a_i$ and $c_i$.
$E[(a_i - E(a_i))(c_i - E(c_i))]$: Central cross moment. Same as $Cov(a_i, c_i)$.

These are population moments: they are characteristics of the population.

If $E(a_i c_i) = 0$, we say $a_i$ and $c_i$ are orthogonal. We will make extensive use of this term.

Moments

The corresponding sample moments:

$\frac{1}{n}\sum_{i=1}^n a_i$: First sample moment of $a_i$. Same as sample mean.
$\frac{1}{n}\sum_{i=1}^n a_i^2$: Second sample moment of $a_i$.
$\frac{1}{n}\sum_{i=1}^n (a_i - \bar a)^2$: Second central sample moment of $a_i$.
$\frac{1}{n}\sum_{i=1}^n a_i c_i$: Sample cross moment of $a_i$ and $c_i$.
$\frac{1}{n}\sum_{i=1}^n (a_i - \bar a)(c_i - \bar c)$: Sample central cross moment of $a_i$ and $c_i$.

Note that we are dividing by $n$ throughout. The second central sample moment above differs from the sample variance because the latter divides by $n-1$. Same for the sample central cross moment vs. covariance.

Moments

The extension of moments to matrices is straightforward. We illustrate by applying to our data variables. These sample and population moments appear frequently enough to warrant shorthand symbols.

Population moments:

$$\sigma_{xy} \equiv E(x_i y_i) \qquad \Sigma_{xx} \equiv E(x_i x_i')$$

Sample moments:

$$S_{xy} \equiv \frac{1}{n}\sum_{i=1}^n x_i y_i = \frac{1}{n}X'y \qquad S_{xx} \equiv \frac{1}{n}\sum_{i=1}^n x_i x_i' = \frac{1}{n}X'X$$

Hence two ways to write the OLS estimator:

$$b = (X'X)^{-1}X'y = S_{xx}^{-1}S_{xy}$$
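A small sketch (not part of the notes; simulated data) confirming that the moment form of the estimator matches the data-matrix form:

```python
import numpy as np

# Sketch: the two equivalent ways of writing the OLS estimator,
# b = (X'X)^{-1} X'y and b = S_xx^{-1} S_xy.
rng = np.random.default_rng(10)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)

b_matrix = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y
S_xx = (X.T @ X) / n                            # sample second moment of x_i
S_xy = (X.T @ y) / n                            # sample cross moment of x_i and y_i
b_moments = np.linalg.solve(S_xx, S_xy)         # S_xx^{-1} S_xy
assert np.allclose(b_matrix, b_moments)
print(b_matrix)
```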

Definition of i.i.d.

Definition of i.i.d.: independently and identically distributed.

Note: we are assuming that the regressors $X$ are random. Sometimes textbooks assume that $X$ is constant or "fixed in repeated samples". This makes the exposition a bit easier but is highly unrealistic, so we follow Hayashi and assume $X$ is random from the beginning.

The sample $(y, X)$ is an i.i.d. random sample (or just "random sample") if $\{y_i, x_i\}$ is independently and identically distributed across observations $i$.

Note: the "identical" in i.i.d. means that the joint distribution of $\{\varepsilon_i, x_i\}$ does not depend on $i$. We will use this fact later.

Note: Hayashi (p. 12) uses the term "random sample" for an i.i.d. random sample.

Plan for the rest of the lecture

Set out all the assumptions required by the classical regression model.

Derive carefully and discuss key finite-sample properties of the OLS estimator $b$.

Show how to construct a statistic to test a hypothesis about a single coefficient, $H_0: \beta_k = \bar\beta_k$, when we know the true error variance $\sigma^2$. ($z$ statistic, Normally distributed)

Show how to construct a statistic to test linear hypotheses in general, $H_0: R\beta = r$, when we know the true error variance $\sigma^2$. ($W$ statistic, $\chi^2$ distributed)

These test statistics are "infeasible" - they depend on the unknown "nuisance parameter" $\sigma^2$.

But if we replace $\sigma^2$ by its OLS estimator $s^2 = \frac{e'e}{n-K}$, we obtain feasible test statistics. ($t$ statistic/distribution, $F$ statistic/distribution)

Assumptions of the Classical Regression Model

Numbering of assumptions and equations follows Hayashi.

Assumption 1.1: Linearity

$$y = X\beta + \varepsilon \tag{1.1.1}$$

Assumption 1.2: Strict exogeneity

$$E(\varepsilon_i | X) = 0 \qquad i = 1, 2, \ldots, n \tag{1.1.7}$$

Assumption 1.3: No multicollinearity

The rank of the $n \times K$ data matrix $X$ is $K$ with probability 1.

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance

This assumption has two parts:

(a) Conditional homoskedasticity

$$E(\varepsilon_i^2 | X) = \sigma^2 \qquad i = 1, 2, \ldots, n \tag{1.1.12}$$

(b) Independence

$$E(\varepsilon_i \varepsilon_j | X) = 0 \qquad \forall\, i, j,\; i \neq j \tag{1.1.13}$$

Assumptions of the Classical Regression Model: Remarks

Assumption 1.1: Linearity

$$y = X\beta + \varepsilon \tag{1.1.1}$$

"Linear" here means linear in parameters.

It is easy to accommodate nonlinearity in variables. If, say, we want to use log(income) as a regressor, just define a new variable to be log(income) and include it as one of the $X$s. Same applies to functions of more than one variable: if we want the interaction of log(income) and age, define a new variable to be age*log(income) and include it as one of the $X$s.

NB: when defining multiplicative interactions, it is usually a good idea to centre the separate variables around their means before interacting, i.e., define the age-income interaction variable to be $(age - \overline{age}) \cdot (\log(income) - \overline{\log(income)})$.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.1: Linearity (data row form)

$$y_i = x_i'\beta + \varepsilon_i$$

An important feature of this assumption that is not discussed in Hayashi is parameter homogeneity: the parameter vector $\beta$ is fixed and does not vary across observations.

A major area of research in econometrics, especially microeconometrics, involves loosening this assumption.

Terminology involved here: "treatment effects", "causal effects", "parameter heterogeneity", "heterogeneous causal effects", "random coefficients", etc.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.1: Linearity (data row form)

$$y_i = x_i'\beta + \varepsilon_i$$

For example, consider the model where the effect of $\beta$ varies across observations:

$$y_i = x_i'\beta_i + \varepsilon_i$$

Obviously we can't obtain estimates of all the $n$ different values of $\beta_i$. But we can obtain an estimate of the average treatment effect (ATE) $E(\beta_i)$. In fact, OLS yields a consistent estimate of the ATE. See Stock & Watson, pp. 540-2 (3rd edition).

But in this course we work almost entirely with the basic no-heterogeneous-effects setup.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.2: Strict exogeneity

$$E(\varepsilon_i | X) = 0 \qquad i = 1, 2, \ldots, n$$

This is also known as the zero conditional mean assumption, but strict exogeneity is perhaps the best term because it emphasizes the key point (conditioning on the entire $X$).

It is actually two assumptions:

$$E(\varepsilon_i | X) = \mu \qquad i = 1, 2, \ldots, n$$

and

$$\mu = 0$$

But the second is not restrictive if the model has a constant term.

Assumptions of the Classical Regression Model: Remarks

Say

$$E(\varepsilon_i | X) = \mu, \quad i = 1, 2, \ldots, n, \text{ and } \mu \neq 0$$

$$x_{i1} = 1 \quad \forall\, i \qquad \text{(the first column of the data matrix } X \text{ is all ones)}$$

$$y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_K x_{iK} + \varepsilon_i$$

Add and subtract $\mu$:

$$y_i = (\beta_1 + \mu) + \beta_2 x_{i2} + \ldots + \beta_K x_{iK} + (\varepsilon_i - \mu)$$

And the new error term now has a zero conditional mean.

Much more important....

Assumptions of the Classical Regression Model: Remarks

Assumption 1.2: Strict exogeneity

$$E(\varepsilon_i | X) = 0 \qquad i = 1, 2, \ldots, n$$

The key point is that $X$ is all $x_j$, for $j = 1, 2, \ldots, n$. More obvious if we write it like this:

$$E(\varepsilon_i | x_1, x_2, \ldots, x_n) = 0 \qquad i = 1, 2, \ldots, n$$

Implications of strict exogeneity (1):

Unconditional mean of the error is zero:

$$E(\varepsilon_i) = 0 \qquad i = 1, 2, \ldots, n$$

Assumptions of the Classical Regression Model: Remarks

Implications of strict exogeneity (2):

Regressors are orthogonal to all errors (note that $0_K$ is a vector of $K$ zeros, and the subscript on $x_j$ is a $j$):

$$E(x_j \varepsilon_i) = 0_K \qquad \forall\, i, j$$

Intuition: with strict exogeneity we are assuming exogeneity of $x_i$ with respect to all the errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$.

Compare this to the definition of weak exogeneity (Hayashi uses the term "predetermined") we will use when developing large-sample theory:

$$E(x_i \varepsilon_i) = 0_K \qquad i = 1, 2, \ldots, n$$

Intuition: with weak exogeneity we are assuming exogeneity of $x_i$ with respect to just the contemporaneous error $\varepsilon_i$. A much weaker (= believable in a wider range of settings) assumption.

Assumptions of the Classical Regression Model: Remarks

Strict exogeneity in time series models is rarely satisfied.

See Hayashi pp. 9-10. Simple example: the first-order autoregressive or AR(1) model. The only regressor is the lagged dependent variable:

$$y_i = \beta y_{i-1} + \varepsilon_i \tag{1.1.11}$$

Say we assume just weak exogeneity, $E(y_{i-1}\varepsilon_i) = 0$. Then:

$$E(y_i \varepsilon_i) = E[(\beta y_{i-1} + \varepsilon_i)\varepsilon_i] = \beta E(y_{i-1}\varepsilon_i) + E(\varepsilon_i^2) = E(\varepsilon_i^2)$$

So unless the error term is always zero, $E(y_i \varepsilon_i) \neq 0$. But that means we've violated strict exogeneity because $y_i$ is the regressor for observation $i+1$. We can assume weak exogeneity but not strict exogeneity.

Fortunately, weak exogeneity is enough for the OLS estimator to have good large-sample properties even in a time series setting. More on this later.
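A small simulation sketch of the practical consequence (not part of the notes; the AR(1) coefficient, sample size and number of replications are arbitrary choices): because strict exogeneity fails, OLS on the lagged dependent variable is biased in finite samples, even though it is consistent.

```python
import numpy as np

# Sketch: finite-sample bias of OLS in the AR(1) model y_t = beta*y_{t-1} + eps_t.
rng = np.random.default_rng(3)
beta, n, n_reps = 0.9, 25, 20000

est = np.empty(n_reps)
for r in range(n_reps):
    y = np.zeros(n + 1)
    for t in range(1, n + 1):
        y[t] = beta * y[t - 1] + rng.normal()
    y_lag, y_cur = y[:-1], y[1:]
    est[r] = (y_lag @ y_cur) / (y_lag @ y_lag)   # OLS slope (no constant)

print("true beta:", beta, "mean OLS estimate:", est.mean())  # noticeably below 0.9 for small n
```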

NB: Conditional vs. Unconditional Moment Conditions

The weak exogeneity assumption we will use in the large-sample setting is

$$E(x_i \varepsilon_i) = 0_K \qquad i = 1, 2, \ldots, n$$

This is an orthogonality condition: we say $x_i$ is orthogonal to $\varepsilon_i$. It is a kind of moment condition or moment restriction. Specifically, we are saying that the unconditional moment is zero.

Sometimes a stronger assumption is made:

$$E(\varepsilon_i | x_i) = 0 \qquad i = 1, 2, \ldots, n$$

This is an example of a conditional moment restriction (there are other possibilities for conditioning, e.g., $x_j\ \forall\, j \le i$ for time series).

This particular restriction is similar to weak exogeneity in that in both cases $x_i$ is exogenous with respect to just the contemporaneous error $\varepsilon_i$. In what way is it stronger?

NB: Conditional vs. Unconditional Moment Conditions

$$E(x_i \varepsilon_i) = 0_K, \quad i = 1, 2, \ldots, n \qquad \text{Weak exogeneity}$$
$$E(\varepsilon_i | x_i) = 0, \quad i = 1, 2, \ldots, n \qquad \text{Conditional moment restriction}$$

In both cases $x_i$ is exogenous with respect to just the contemporaneous error $\varepsilon_i$. But the conditional moment restriction is stronger because it implies $E(f(x_i)\varepsilon_i) = 0$, i.e., not only is $x_i$ orthogonal to $\varepsilon_i$ (as in weak exogeneity), but any function $f$ of $x_i$ is orthogonal to $\varepsilon_i$.

This can be implied by some models, e.g., Rational Expectations models: $x_i$ is in the information set of agents, and $\varepsilon_i$ is some random shock or surprise they should not be able to predict.

NB: Conditional vs. Unconditional Moment Conditions

$$E(\varepsilon_i | x_i) = 0, \quad i = 1, 2, \ldots, n \qquad \text{Conditional moment restriction}$$

The conditional moment restriction implies $E(f(x_i)\varepsilon_i) = 0$, i.e., not only is $x_i$ orthogonal to $\varepsilon_i$ (as in weak exogeneity), but any function $f$ of $x_i$ is orthogonal to $\varepsilon_i$. We can show this:

$$
\begin{aligned}
E(f(x_i)\varepsilon_i) &= E[E(f(x_i)\varepsilon_i | x_i)] && \text{(a)} \\
&= E[f(x_i)E(\varepsilon_i | x_i)] && \text{(b)} \\
&= E[f(x_i)\cdot 0] && \text{(c)} \\
&= 0 && \text{(d)}
\end{aligned}
$$

Notes:
(a) By the Law of Total Expectations: $E[E(A|B)] = E(A)$.
(b) By the linearity of conditional expectations: since we are conditioning on $x_i$, we can treat any function of $x_i$ as nonrandom and move it out of the inner $E(\cdot)$.
(c) By our conditional moment restriction $E(\varepsilon_i | x_i) = 0$.

NB: Conditional vs. Unconditional Moment Conditions

Summary:

$$E(\varepsilon_i | X) = 0, \quad i = 1, 2, \ldots, n \qquad \text{Strict exogeneity}$$
$$E(x_i \varepsilon_i) = 0_K, \quad i = 1, 2, \ldots, n \qquad \text{Weak exogeneity ("predetermined")}$$
$$E(\varepsilon_i | x_i) = 0, \quad i = 1, 2, \ldots, n \qquad \text{Conditional moment restriction}$$
$$E(\varepsilon_i | x_j) = 0, \quad \forall\, i;\ \forall\, j \qquad \text{Conditional moment restriction}$$

And the last 2 are examples of different conditional moment restrictions. (Q: How are they different? Can you relate this to any economic models?)

Back to...

Assumptions of the Classical Regression Model: Remarks

Assumption 1.2: Strict exogeneity

$$E(\varepsilon_i | x_1, x_2, \ldots, x_n) = 0 \qquad i = 1, 2, \ldots, n \tag{1.1.7}$$

If we combine this with

Optional Extra Hayashi Assumption: I.I.D. random sample

$\{y_i, x_i\}$ is independently and identically distributed across observations $i$

then the strict exogeneity assumption simplifies:

$$E(\varepsilon_i | x_i) = 0 \qquad i = 1, 2, \ldots, n \tag{1.1.16}$$

Don't confuse this result with weak exogeneity! (1.1.16) follows from the i.i.d. assumption for $\{y_i, x_i\}$, which is often very strong. For example, if we are working with time series, the independent variables $x_i$ will usually be serially correlated - GDP last year is correlated with GDP this year. If the data are not i.i.d., then we have to include all the $x_1, x_2, \ldots, x_n$ in the definition of strict exogeneity.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.3: No multicollinearity

The rank of the $n \times K$ data matrix $X$ is $K$ with probability 1.

Another way to state this:

The $n \times K$ data matrix $X$ is full rank with probability 1.

The rank of a matrix is the number of linearly independent columns. A matrix is full rank if the rank equals the number of columns. A matrix is rank deficient or less than full rank if you can express one or more columns as linear combinations of the others.

(We could say "column rank" to distinguish the concept from "row rank", but since (a) $n > K$, (b) a basic matrix theorem says that the column rank always equals the row rank, and so (c) the row rank is also at most $K$, we will just say "rank".)

Assumptions of the Classical Regression Model: Remarks

Assumption 1.3: No multicollinearity

The rank of the $n \times K$ data matrix $X$ is $K$ with probability 1.

Classic example: the dummy variable trap. Say we have a dataset of male and female individuals. Define the male dummy $m$ as the $n \times 1$ column vector of dummies for individuals $i = 1, \ldots, n$. Define the female dummy $f$ similarly. Say we also have a constant term, i.e., a variable which is just an $n \times 1$ column vector of ones:

$$\iota_{n \times 1} \equiv \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \qquad \text{(another notation for this column vector is } \mathbf{1}\text{)}$$

There are no other regressors. Then the matrix of regressors is $X = [\iota \;\; m \;\; f]$. But it's easy to see that $\iota = m + f$. Thus $X$ is not full rank and the assumption fails.
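A tiny numerical sketch of the trap (not part of the notes; the dummy values below are made up):

```python
import numpy as np

# Sketch: the dummy variable trap. With a constant, a male dummy and a female dummy,
# one column is a linear combination of the others, so rank(X) < K and X'X is singular.
m = np.array([1.0, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # male dummy for n = 10 individuals
f = 1.0 - m                                      # female dummy
const = np.ones(10)

X = np.column_stack([const, m, f])
print("K =", X.shape[1], "rank =", np.linalg.matrix_rank(X))   # rank 2 < K = 3
print("det(X'X) =", np.linalg.det(X.T @ X))                    # (numerically) zero
```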

Assumptions of the Classical Regression Model: Remarks

Assumption 1.3: No multicollinearity

The rank of the $n \times K$ data matrix $X$ is $K$ with probability 1.

Why "with probability 1"? $X$ is stochastic. Bad luck might give us a sample of $n$ observations such that this particular data matrix $X$ is not full rank.

In the dummy variable example, we might be using a male dummy as a regressor along with a constant and some other regressors, but happen to draw a random sample of all men. If so, $X = [\iota \;\; m \;\; \ldots]$ will not be full rank (the first two columns are identical).

We say this is so unlikely that it happens with probability zero.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.4: Spherical error variance

(Recall, this has two parts: conditional homoskedasticity and independence.)

(a) Conditional homoskedasticity

$$E(\varepsilon_i^2 | X) = \sigma^2 \qquad i = 1, 2, \ldots, n$$

Hayashi and others call this "homoskedasticity", but as Hayashi makes clear, the "conditional" aspect is very important.

Note that since Assumption 1.2 says that the conditional mean of $\varepsilon_i$ is zero, this assumption is the same thing as:

(a) Conditional homoskedasticity

$$Var(\varepsilon_i | X) = \sigma^2 \qquad i = 1, 2, \ldots, n$$

Assumptions of the Classical Regression Model: Remarks

Assumption 1.4: Spherical error variance

(a) Conditional homoskedasticity

$$E(\varepsilon_i^2 | X) = \sigma^2 \qquad i = 1, 2, \ldots, n$$

If we in addition assume

Optional Extra Hayashi Assumption: I.I.D. random sample

$\{y_i, x_i\}$ is independently and identically distributed across observations $i$

then the conditional homoskedasticity assumption simplifies to

$$E(\varepsilon_i^2 | x_i) = \sigma^2 \qquad i = 1, 2, \ldots, n \tag{1.1.17}$$

Assumptions of the Classical Regression Model: Remarks

Assumption 1.4: Spherical error variance

(a) Conditional homoskedasticity

$$E(\varepsilon_i^2 | X) = \sigma^2 \qquad i = 1, 2, \ldots, n$$

Remark: just the assumption of an i.i.d. sample is not enough to imply conditional homoskedasticity.

Note that the "identical" in i.i.d. implies that the joint distribution of $\{\varepsilon_i, x_i\}$ doesn't depend on $i$. Thus:

An i.i.d. sample implies unconditional homoskedasticity: $E(\varepsilon_i^2)$ is constant across $i$.

An i.i.d. sample implies the functional form of $E(\varepsilon_i^2 | x_i)$ is constant across $i$.

But an i.i.d. sample does not imply that the value of $E(\varepsilon_i^2 | x_i)$ is constant across $i$, because the value of $x_i$ varies across $i$. Thus even if we assume an i.i.d. random sample, we still need Assumption 1.4(a) Conditional homoskedasticity as an additional assumption.

Assumptions of the Classical Regression Model: Remarks

Assumption 1.4: Spherical error variance

(b) Independence

$$E(\varepsilon_i \varepsilon_j | X) = 0 \qquad \forall\, i, j,\; i \neq j \tag{1.1.13}$$

Combined with Assumption 1.2 (strict exogeneity), this becomes

$$Cov(\varepsilon_i, \varepsilon_j | X) = 0 \qquad \forall\, i, j,\; i \neq j$$

Hayashi calls this assumption "no correlation between observations".

For time series data, this means the error term is not serially correlated. This is often a strong assumption. It is also often a strong assumption in other contexts. If we were working with spatial data - individuals or whatever distributed over space - it means the errors of near neighbours are not correlated with each other. If we had a dataset of students, it means that the errors of students in the same school are not correlated.

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance
(a) Conditional homoskedasticity: $E(\varepsilon_i^2 | X) = \sigma^2$, $i = 1, 2, \ldots, n$
(b) Independence: $E(\varepsilon_i \varepsilon_j | X) = 0$, $\forall\, i, j,\; i \neq j$

Parts (a) and (b) can be combined into a single expression involving $E(\varepsilon\varepsilon' | X)$. Recall that the $\varepsilon\varepsilon'$ matrix is $n \times n$ with the $\varepsilon_i^2$s running down the diagonal and $\varepsilon_i\varepsilon_j$ on the off-diagonals:

$$
\varepsilon\varepsilon' = \begin{bmatrix}
\varepsilon_1^2 & \cdots & \varepsilon_1\varepsilon_j & \cdots & \varepsilon_1\varepsilon_n \\
\vdots & \ddots & \vdots & & \vdots \\
\varepsilon_i\varepsilon_1 & \cdots & \varepsilon_i\varepsilon_j & \cdots & \varepsilon_i\varepsilon_n \\
\vdots & & \vdots & \ddots & \vdots \\
\varepsilon_n\varepsilon_1 & \cdots & \varepsilon_n\varepsilon_j & \cdots & \varepsilon_n^2
\end{bmatrix}
$$

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance
(a) Conditional homoskedasticity: $E(\varepsilon_i^2 | X) = \sigma^2$, $i = 1, 2, \ldots, n$
(b) Independence: $E(\varepsilon_i \varepsilon_j | X) = 0$, $\forall\, i, j,\; i \neq j$

But (a) means the diagonal of $E(\varepsilon\varepsilon' | X)$ is $\sigma^2$s, and (b) means all the off-diagonals of $E(\varepsilon\varepsilon' | X)$ are zero. Thus the matrix simplifies hugely:

$$
E(\varepsilon\varepsilon' | X) = E\left(\begin{bmatrix}
\varepsilon_1^2 & \cdots & \varepsilon_1\varepsilon_n \\
\vdots & \ddots & \vdots \\
\varepsilon_n\varepsilon_1 & \cdots & \varepsilon_n^2
\end{bmatrix} \,\middle|\, X\right) = \ldots
$$

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance
(a) Conditional homoskedasticity: $E(\varepsilon_i^2 | X) = \sigma^2$, $i = 1, 2, \ldots, n$
(b) Independence: $E(\varepsilon_i \varepsilon_j | X) = 0$, $\forall\, i, j,\; i \neq j$

Taking expectations element by element, conditional on $X$:

$$
E(\varepsilon\varepsilon' | X) = \begin{bmatrix}
E(\varepsilon_1^2 | X) & \cdots & E(\varepsilon_1\varepsilon_n | X) \\
\vdots & \ddots & \vdots \\
E(\varepsilon_n\varepsilon_1 | X) & \cdots & E(\varepsilon_n^2 | X)
\end{bmatrix}
$$

And since the diagonals are all $\sigma^2$, and the off-diagonals are all 0s...

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance
(a) Conditional homoskedasticity: $E(\varepsilon_i^2 | X) = \sigma^2$, $i = 1, 2, \ldots, n$
(b) Independence: $E(\varepsilon_i \varepsilon_j | X) = 0$, $\forall\, i, j,\; i \neq j$

$$
E(\varepsilon\varepsilon' | X) = \begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix}
$$

Hence

$$E(\varepsilon\varepsilon' | X) = \sigma^2 I_n \tag{1.1.14}$$

...which is very concise.

(Hence the term "spherical" - it's proportional to the identity matrix.)

Assumptions of the Classical Regression Model

Assumption 1.4: Spherical error variance
(a) Conditional homoskedasticity: $E(\varepsilon_i^2 | X) = \sigma^2$, $i = 1, 2, \ldots, n$
(b) Independence: $E(\varepsilon_i \varepsilon_j | X) = 0$, $\forall\, i, j,\; i \neq j$

Assumption 1.4 can now be stated concisely and equivalently in terms of conditional second moments and cross-moments:

$$E(\varepsilon\varepsilon' | X) = \sigma^2 I_n \tag{1.1.14}$$

Or, making use of Assumption 1.2 Strict exogeneity, and using the notation $Var(\varepsilon | X)$ for the entire variance-covariance matrix of $\varepsilon$, we can write Assumption 1.4 in terms of conditional variances and covariances:

$$Var(\varepsilon | X) = \sigma^2 I_n$$

In traditional finite-sample presentations of OLS, the latter is usually used. The first version (eq. 1.1.14) is more convenient for developing large-sample theory.

Finite-sample properties of OLS

Proposition 1.1: Finite-sample distribution of OLS estimator b (H p. 27)

(a) Unbiasedness: $E(b | X) = \beta$. Requires Assumptions 1.1-1.3.

(b) Variance: $Var(b | X) = \sigma^2(X'X)^{-1}$. Requires Assumptions 1.1-1.4.

(c) Efficiency (Gauss-Markov, BLUE): $b$ is efficient in the class of linear unbiased estimators. Requires Assumptions 1.1-1.4.

(d) OLS estimator and residuals are uncorrelated: $Cov(b, e) = 0$. Requires Assumptions 1.1-1.4.

We now show (a) and (b) in detail and sketch (c).

Finite-sample properties of OLS

Proposition 1.1: Finite-sample distribution of OLS estimator b

(a) Unbiasedness: $E(b | X) = \beta$. Requires Assumptions 1.1-1.3.

The proof is short but instructive. It will be useful to compare it to the proof of consistency in the large-sample setting later on.

It is common for proofs of unbiasedness to work with $[E(b | X) - \beta]$ instead of $E(b | X)$. (We will do the same in our proofs of consistency.)

Finite-sample properties of OLS

Proof of (1.1.a) Unbiasedness: $E(b | X) = \beta$

$$
\begin{aligned}
E(b | X) - \beta &= E(b - \beta | X) && \text{(a)} \\
&= E((X'X)^{-1}X'\varepsilon | X) && \text{(b)} \\
&= (X'X)^{-1}X'E(\varepsilon | X) && \text{(c)} \\
&= 0 && \text{(d)}
\end{aligned}
$$

Notes:
(a) Since $\beta$ is a constant.
(b) From the definition of the sampling error of $b$.
(c) Since we are conditioning on $X$, we can treat any function of $X$ as nonrandom and move it out of the $E(\cdot)$.
(d) By Assumption 1.2 Strict exogeneity. This is key. For models where 1.2 fails (such as most time-series models), OLS is biased.

Note we used Assumptions 1.1-1.3 but we did not use Assumption 1.4 (spherical errors: conditional homoskedasticity and independence). Unbiasedness of OLS is robust to violations of this assumption.

Finite-sample properties of OLS

Proof of (1.1.b) Variance: $Var(b | X) = \sigma^2(X'X)^{-1}$
Requires Assumptions 1.1-1.4.

$$
\begin{aligned}
Var(b | X) &= Var(b - \beta | X) && \text{(a)} \\
&= Var((X'X)^{-1}X'\varepsilon | X) && \text{(b)} \\
&= (X'X)^{-1}X'\,Var(\varepsilon | X)\,X(X'X)^{-1} && \text{(c)} \\
&= (X'X)^{-1}X'\,\sigma^2 I_n\,X(X'X)^{-1} && \text{(d)} \\
&= \sigma^2(X'X)^{-1} && \text{(e)}
\end{aligned}
$$

Notes:
(a) Since $\beta$ is a constant.
(b) From the definition of the sampling error of $b$.
(c) Since we are conditioning on $X$, we can treat any function of $X$ as nonrandom and move it out of the $Var(\cdot)$.
(d) By Assumption 1.4.a and 1.4.b (when we combined (a) and (b) into a single statement).
(e) $(X'X)^{-1}$ and $(X'X)$ cancel (after moving the scalar $\sigma^2$ out of the way).
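A Monte Carlo sketch of this result (not part of the notes; the design matrix, $\sigma^2$ and replication count are arbitrary): holding $X$ fixed and redrawing $\varepsilon$, the simulated covariance matrix of $b$ should approximate $\sigma^2(X'X)^{-1}$.

```python
import numpy as np

# Sketch: simulated Var(b|X) versus the theoretical sigma^2 (X'X)^{-1}.
rng = np.random.default_rng(5)
n, sigma2, n_reps = 40, 2.0, 20000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X held fixed across replications
beta = np.array([1.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

draws = np.empty((n_reps, 2))
for r in range(n_reps):
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    draws[r] = XtX_inv @ X.T @ (X @ beta + eps)          # b = (X'X)^{-1} X'y

print("simulated Var(b|X):\n", np.cov(draws.T))
print("theoretical sigma^2 (X'X)^{-1}:\n", sigma2 * XtX_inv)
```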

Finite-sample properties of OLS

Alternate proof of (1.1.b) Variance: $Var(b | X) = \sigma^2(X'X)^{-1}$
Makes use of the fact that $Var(y | X) = Var(\varepsilon | X)$.
Requires Assumptions 1.1-1.4.

$$
\begin{aligned}
Var(b | X) &= Var((X'X)^{-1}X'y | X) && \text{(a)} \\
&= (X'X)^{-1}X'\,Var(y | X)\,X(X'X)^{-1} && \text{(b)} \\
&= (X'X)^{-1}X'\,Var(\varepsilon | X)\,X(X'X)^{-1} && \text{(c)} \\
&= (X'X)^{-1}X'\,\sigma^2 I_n\,X(X'X)^{-1} && \text{(d)} \\
&= \sigma^2(X'X)^{-1} && \text{(e)}
\end{aligned}
$$

Notes:
(a) Substitution.
(b) Since we are conditioning on $X$, we can treat any function of $X$ as nonrandom and move it out of the $Var(\cdot)$.
(c) From $Var(y | X) = Var(\varepsilon | X)$.
(d) By Assumption 1.4.a and 1.4.b.
(e) $(X'X)^{-1}$ and $(X'X)$ cancel (after moving the scalar $\sigma^2$ out of the way).

Finite-sample properties of OLS

Proof (sketch) of (1.1.c) Efficiency (Gauss-Markov, BLUE): $b$ is efficient in the class of linear unbiased estimators.
Requires Assumptions 1.1-1.4.
(See Hayashi pp. 29-30 or some other text for a full proof.)

Stated more carefully: For any other unbiased estimator $\hat\beta$ that is linear in $y$, $Var(\hat\beta | X) \geq Var(b | X)$.

Intuitively, "efficiency" means "precision", i.e., a small variance. If $b$ is "efficient", it means that no other estimator has a smaller variance. But note that $b$ is a vector. We therefore have to use the matrix definition of $\geq$. If $A$ and $B$ are square matrices, then:

$$A \geq B \iff (A - B) \text{ is positive semidefinite.}$$

(A $K \times K$ matrix $C$ is positive semidefinite if $x'Cx \geq 0$ for all vectors $x$.)

Finite-sample properties of OLS

Proof (sketch) of (1.1.c) Efficiency

We need to show that $Var(\hat\beta | X) - Var(b | X) = C$ is a positive semidefinite matrix for any other estimator $\hat\beta$ that is linear in $y$.

The way to start is to obtain an expression for $\hat\beta$ that has $b$ in it. Since $\hat\beta$ is linear in $y$, without loss of generality we pick some matrix $D(X)$ which is some function of $X$ so that we can write $\hat\beta$ as

$$\hat\beta = \left(D + (X'X)^{-1}X'\right)y = Dy + b = DX\beta + D\varepsilon + b$$

Taking conditional expectations of this plus the unbiasedness of $\hat\beta$ implies that $DX = 0$ (see Hayashi). So $\hat\beta = D\varepsilon + b$ and therefore

$$\hat\beta - \beta = D\varepsilon + (b - \beta)$$

Finite-sample properties of OLS

Proof (sketch) of (1.1.c) Efficiency

$$\hat\beta - \beta = D\varepsilon + (b - \beta)$$

The second term on the right we've seen before - it's the sampling error of $b$. Substitute and we get

$$\hat\beta - \beta = D\varepsilon + (X'X)^{-1}X'\varepsilon = \left(D + (X'X)^{-1}X'\right)\varepsilon$$

We are now set up to go, because $Var(\hat\beta - \beta | X) = Var(\hat\beta | X)$. So

$$Var(\hat\beta | X) = Var\left(\left(D + (X'X)^{-1}X'\right)\varepsilon \,\middle|\, X\right) = \ldots\text{(eventually)}\ldots = \sigma^2 DD' + \sigma^2(X'X)^{-1} = \sigma^2 DD' + Var(b | X)$$

And we're done, because $DD'$ is positive semidefinite and

$$Var(\hat\beta | X) - Var(b | X) = \sigma^2 DD'$$

Finite-sample properties of OLS

Proposition 1.2: Unbiasedness of $s^2$

$E(s^2 | X) = \sigma^2$. Requires Assumptions 1.1-1.4.
Proof: See Hayashi pp. 30-31.

Reminder: $s^2 = \frac{e'e}{n-K}$ is feasible - we have everything we need to calculate it.

Estimate of $Var(b | X)$

The variance of the OLS estimator $Var(b | X) = \sigma^2(X'X)^{-1}$ is infeasible - it depends on the unknown "nuisance parameter" $\sigma^2$.

So in practice we use an estimate of the variance of the OLS estimator $Var(b | X)$. The obvious one to use is

$$\widehat{Var}(b | X) = s^2(X'X)^{-1} \tag{1.3.4}$$

which is feasible - we have everything we need to calculate it.
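A short sketch (not part of the notes; simulated data) computing the feasible variance estimate (1.3.4) and the implied standard errors:

```python
import numpy as np

# Sketch: the feasible variance estimate s^2 (X'X)^{-1} and OLS standard errors.
rng = np.random.default_rng(6)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

n_obs, K = X.shape
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n_obs - K)
V_hat = s2 * np.linalg.inv(X.T @ X)        # estimated Var(b|X), eq. (1.3.4)
se = np.sqrt(np.diag(V_hat))               # standard errors
print("b:", b)
print("standard errors:", se)
```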

Finite-sample-based inference using OLS

Are we ready to go with our estimations and formulate and test hypotheses on $b$? Unfortunately not.

To test a hypothesis involving $b$, we need to form a test statistic from $b$ whose distribution under the null is known. We could possibly go down the route of large-sample asymptotic results, but then why not just develop the large-sample asymptotic framework from scratch? (This is what we will do shortly.)

What do we need to develop finite-sample (or "exact") results for the distribution of the OLS estimator $b$?

We have an expression for the sampling error of $b$,

$$b - \beta = (X'X)^{-1}X'\varepsilon$$

Since the sampling error is a function of $(X, \varepsilon)$, we could specify the joint distribution of $(X, \varepsilon)$ and work with that, but that is unattractive - how do we know what the true distribution is?

Finite-sample-based inference using OLS

Assumption 1.5: Normality of $\varepsilon$

The distribution of $\varepsilon$ conditional on $X$ is normal.

This is a strong assumption in econometrics, though it might sometimes be true or approximately true. But it simplifies the finite-sample inference problem hugely. Now we can derive the finite-sample distribution of $(b - \beta)$ and use this to construct hypothesis tests.

Finite-sample-based inference using OLS

Combine Assumptions 1.2, 1.4 and 1.5:

Assumption 1.2: Strict exogeneity: $E(\varepsilon_i | X) = 0$, $i = 1, 2, \ldots, n$
Assumption 1.4: Spherical error variance: $Var(\varepsilon | X) = \sigma^2 I_n$
Assumption 1.5: Normality of $\varepsilon$: The distribution of $\varepsilon$ conditional on $X$ is normal.

Then

$$\varepsilon | X \sim N(0, \sigma^2 I_n) \tag{1.4.1}$$

(Just plug the conditional mean and variance into the definition of a Normal random variable - they define a Normal distribution.)

This means the distribution of $\varepsilon$ conditional on $X$ doesn't depend on the latter; $\varepsilon$ and $X$ are independent. Thus the marginal or unconditional distribution of $\varepsilon$ is simply $\varepsilon \sim N(0, \sigma^2 I_n)$.

Finite-sample-based inference using OLS

The sampling error of $b$ is:

$$(b - \beta) = (X'X)^{-1}X'\varepsilon$$

which is linear in $\varepsilon$ given $X$. Since $\varepsilon$ is normal given $X$, $b - \beta$ is also normal given $X$. We know the conditional mean and variance of $(b - \beta)$ from Proposition 1.1.a and 1.1.b, so we just plug them in:

$$(b - \beta) | X \sim N\left(0, \sigma^2(X'X)^{-1}\right) \tag{1.4.2}$$

And we are almost ready to go.

Testing a single regression coefficient

The finite-sample distribution of the OLS estimator $b$:

$$(b - \beta) | X \sim N\left(0, \sigma^2(X'X)^{-1}\right)$$

Say we want to test a hypothesis about the $k$th coefficient $\beta_k$:

$$H_0: \beta_k = \bar\beta_k$$

where $\bar\beta_k$ is some specific hypothesized value. The distribution of $b$ implies that under the null (i.e., $\bar\beta_k$ is the "true" $\beta_k$),

$$(b_k - \bar\beta_k) | X \sim N\left(0, \sigma^2\left[(X'X)^{-1}\right]_{kk}\right)$$

where $\left[(X'X)^{-1}\right]_{kk}$ is a scalar, the $(k, k)$ element of $(X'X)^{-1}$.

Testing a single regression coefficient

If the null $H_0: \beta_k = \bar\beta_k$ is true, $(b_k - \bar\beta_k) | X \sim N\left(0, \sigma^2\left[(X'X)^{-1}\right]_{kk}\right)$.

Divide both sides by $\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{kk}}$ and we obtain a test statistic for our hypothesis:

$$z_k | X \sim N(0, 1) \qquad \text{where } z_k \equiv \frac{b_k - \bar\beta_k}{\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{kk}}} \tag{1.4.3}$$

This "z statistic" has a Normal distribution. To use it, we would go through the usual steps: (1) choose $\bar\beta_k$ and a significance level $\alpha$ (5% is common); (2) look up the critical values for the Normal distribution and significance level $\alpha$ (for $\alpha = 0.05$ these are $-1.96$ and $1.96$); (3) estimate by OLS and calculate the test statistic $z_k$; (4) compare $z_k$ to the critical values; (5) if $z_k$ is in the tails (e.g. less than $-1.96$ or bigger than $1.96$), reject the null hypothesis (extreme values for $z_k$ suggest $H_0$ is unlikely to be true).

Testing a single regression coefficient

$$z_k | X \sim N(0, 1) \qquad \text{where } z_k \equiv \frac{b_k - \bar\beta_k}{\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{kk}}} \tag{1.4.3}$$

The only problem is ... the test statistic $z_k$ is infeasible because it depends on the unknown nuisance parameter $\sigma^2$.

Exactly the same issue arises if we want to construct tests of linear hypotheses.

Tests of linear hypotheses

Say we want to test hypotheses involving linear combinations of the elements of $\beta$.

For example, say we estimate a log-linear Cobb-Douglas production function with inputs $\ln(K_i)$ and $\ln(L_i)$, and $y_i$ is log output:

$$y_i = \beta_1 + \beta_2\ln(K_i) + \beta_3\ln(L_i) + \varepsilon_i$$

We've seen how to test just the capital elasticity $\beta_2$ or just the labour elasticity $\beta_3$.

But we might want to test whether the two elasticities are jointly zero:

$$H_0: \beta_2 = 0 \text{ and } \beta_3 = 0$$

Or we might want to test whether we have constant returns to scale:

$$H_0: \beta_2 + \beta_3 = 1$$

Tests of linear hypotheses

To test linear hypotheses we write them as a system of linear equations:

$$H_0: R\beta = r \tag{1.4.8}$$

$r$ is a column vector with dimension $\#r$ (number of equations).

$R$ is $\#r \times K$, where $K$ is the number of elements of $\beta$.

Require $\mathrm{rank}(R) = \#r$, i.e., $R$ is full row rank. This means no redundant equations and no inconsistent equations (see Hayashi p. 40).

We can write any set of linear hypotheses this way.

Tests of linear hypotheses

To test linear hypotheses we write them as a system of linear equations:

$$H_0: R\beta = r \tag{1.4.8}$$

In the Cobb-Douglas example:

$$H_0: \beta_2 = 0 \text{ and } \beta_3 = 0$$

$$R = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} \qquad r = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

$$H_0: \beta_2 + \beta_3 = 1 \qquad \text{(CRS)}$$

$$R = \begin{bmatrix} 0 & 1 & 1 \end{bmatrix} \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} \qquad r = \begin{bmatrix} 1 \end{bmatrix}$$

Tests of linear hypotheses

If the null $H_0: R\beta = r$ is true, then

$$W | X \sim \chi^2(\#r) \qquad \text{where } W \equiv (Rb - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(Rb - r)$$

Proof (short version; see Hayashi p. 41): If the null is true, $R\beta = r$. Subtract $Rb$ from both sides and rearrange to get $(Rb - r) = R(b - \beta)$. We know $(b - \beta) | X \sim N\left(0, \sigma^2(X'X)^{-1}\right)$.

Hence conditional on $X$, $(Rb - r)$ is normal with mean $0$ and variance

$$Var(Rb - r | X) = Var\left(R(b - \beta) | X\right) = R\,Var(b - \beta | X)\,R' = \sigma^2 R(X'X)^{-1}R'$$

Hence

$$W \equiv (Rb - r)'\left[Var(Rb - r | X)\right]^{-1}(Rb - r)$$

Fact: if the $m$-dimensional vector $z \sim N(\mu, \Sigma)$ and $\Sigma$ is nonsingular, then $(z - \mu)'\Sigma^{-1}(z - \mu) \sim \chi^2(m)$.

$\sigma^2 R(X'X)^{-1}R'$ is nonsingular. Hence $W | X \sim \chi^2(\#r)$. Done!

Tests of linear hypotheses

If the null $H_0: R\beta = r$ is true, then

$$W | X \sim \chi^2(\#r) \qquad \text{where } W \equiv (Rb - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(Rb - r)$$

This is just a generalization of the $z$ statistic for testing the simple hypothesis $H_0: \beta_k = \bar\beta_k$. (Easy review question: what are $R$ and $r$ in this case?) We would use $W$ in the same way: (1) choose $R$, $r$ and a significance level $\alpha$ (5% is common); (2) look up the critical value for the $\chi^2(\#r)$ distribution and significance level $\alpha$; (3) estimate by OLS and calculate the test statistic $W$; (4) compare $W$ to the critical value; (5) if $W$ is in the tail, reject the null hypothesis (a large value for $W$ suggests $H_0$ is unlikely to be true).

And we have the same problem: the test statistic $W$ is infeasible because it depends on the unknown nuisance parameter $\sigma^2$.

Finite-sample-based inference using OLS

Under $H_0: \beta_k = \bar\beta_k$, $z_k | X \sim N(0, 1)$, where $z_k \equiv \dfrac{b_k - \bar\beta_k}{\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{kk}}}$.

Under $H_0: R\beta = r$, $W | X \sim \chi^2(\#r)$, where $W \equiv (Rb - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(Rb - r)$.

$z_k$ and $W$ are infeasible test statistics because they depend on the unknown nuisance parameter $\sigma^2$.

What if instead of the unknown $\sigma^2$ we use the OLS estimator $s^2 = \frac{e'e}{n-K}$?

Bad news: $z_k$ and $W$ are no longer exactly distributed as Normal and $\chi^2$.

Good news: we have substitute test statistics for $z_k$ and $W$ where we do know their finite-sample distributions.

Testing a single regression coefficient

Under $H_0: \beta_k = \bar\beta_k$, $z_k | X \sim N(0, 1)$, where $z_k \equiv \dfrac{b_k - \bar\beta_k}{\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{kk}}}$.

If we replace $\sigma^2$ with the OLS estimator $s^2 = \frac{e'e}{n-K}$, we obtain a different, feasible test statistic with a known distribution:

Under $H_0: \beta_k = \bar\beta_k$,

$$t_k | X \sim t(n - K) \qquad \text{where } t_k \equiv \frac{b_k - \bar\beta_k}{\sqrt{s^2\left[(X'X)^{-1}\right]_{kk}}}$$

And now we are ready to go. No nuisance parameter problem.
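A minimal sketch of the feasible $t$ test in Python (not part of the notes; simulated data, hypothesized values set to zero for every coefficient):

```python
import numpy as np
from scipy import stats

# Sketch: t statistics for H0: beta_k = 0 using s^2 (X'X)^{-1}, with p-values from t(n-K).
rng = np.random.default_rng(7)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

n_obs, K = X.shape
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n_obs - K)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

t_stats = b / se                                        # hypothesized values are all zero here
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n_obs - K)  # two-sided p-values
print("t statistics:", t_stats)
print("p-values:", p_vals)
```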

Tests of linear hypotheses

Under $H_0: R\beta = r$, $W | X \sim \chi^2(\#r)$, where $W \equiv (Rb - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(Rb - r)$.

If we replace $\sigma^2$ with the OLS estimator $s^2 = \frac{e'e}{n-K}$, we can construct a different, feasible test statistic with a known distribution:

Under $H_0: R\beta = r$,

$$F | X \sim F(\#r, n - K) \qquad \text{where } F \equiv \frac{(Rb - r)'\left[s^2 R(X'X)^{-1}R'\right]^{-1}(Rb - r)}{\#r}$$

Note that we have to divide by the "numerator degrees of freedom" $\#r$. (Move $s^2$ into the denominator of $F$ and the denominator is divided by $n - K$, hence we call it the "denominator degrees of freedom".)

And now we are ready to go. No nuisance parameter problem.
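A sketch of the Wald-form $F$ test (not part of the notes; the simulated Cobb-Douglas-style data and the CRS restriction below are illustrative choices):

```python
import numpy as np
from scipy import stats

# Sketch: the Wald-form F statistic for H0: R beta = r,
# here testing constant returns to scale beta_2 + beta_3 = 1.
rng = np.random.default_rng(8)
n = 200
lnK, lnL = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), lnK, lnL])
y = 1.0 + 0.4 * lnK + 0.6 * lnL + rng.normal(scale=0.5, size=n)   # CRS holds in the DGP

n_obs, K = X.shape
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n_obs - K)

R = np.array([[0.0, 1.0, 1.0]])
r = np.array([1.0])
diff = R @ b - r
V = s2 * R @ np.linalg.inv(X.T @ X) @ R.T
F = (diff @ np.linalg.solve(V, diff)) / R.shape[0]
p_val = stats.f.sf(F, R.shape[0], n_obs - K)
print("F:", F, "p-value:", p_val)       # large p-value: fail to reject CRS
```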

Testing using the Wald Principle vs. LR Principle

$t$ and $F$ (and $z$ and $W$) are Wald test statistics.

Wald Principle: estimate the unrestricted equation, i.e. do not impose the constraints in the null $H_0$. Then calculate the "cost" of imposing the constraints.

LR (Likelihood Ratio) Principle: estimate the unrestricted equation and the restricted equation, and construct a test statistic based on the values of the two objective functions.

Unrestricted: $b \equiv \arg\min_{\tilde\beta} SSR(\tilde\beta)$
Restricted: $b_R \equiv \arg\min_{\tilde\beta} SSR(\tilde\beta)$ s.t. $R\tilde\beta = r$

Values of the two objective functions at their minima: $SSR_U$ and $SSR_R$.

NB: LM (Lagrange Multiplier) Principle: estimate the restricted equation. Then calculate the "reduction in cost" from relaxing the constraints in $H_0$.

Testing using the Wald Principle vs. LR Principle

It turns out that in this particular case the test statistic using the LR principle is exactly the same as the Wald $F$ test statistic if we use a common estimate of the error variance.

$$F = \frac{(SSR_R - SSR_U)/\#r}{SSR_U/(n-K)}$$

Numerator: difference in minimized objective functions. Denominator: estimate of the error variance ($s^2$ from the unrestricted estimator).

Classic example: Chow test for a "structural break", i.e. do two regressions fit better than one? Restricted: $SSR_R$ from fitting a single OLS regression on the entire sample. Unrestricted: $SSR_U = SSR_1 + SSR_2$ from fitting two separate regressions to the two parts of the sample. Error variance: $SSR_U/(n - 2K)$ since we have a total of $2K$ parameters in the two separate regressions. Large $F$ means reject the null, conclude there are two separate regimes, i.e. there is a structural break. See Hayashi p. 175.
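A sketch of the LR-form $F$ statistic (not part of the notes; simulated data, and the restriction here is exclusion of the two slopes rather than a Chow-style break, which would use two subsamples instead):

```python
import numpy as np
from scipy import stats

# Sketch: F = [(SSR_R - SSR_U)/#r] / [SSR_U/(n-K)], testing that the two slope
# coefficients are jointly zero (restricted model = constant only).
def ssr(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(9)
n = 150
X_U = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # unrestricted regressors
y = X_U @ np.array([1.0, 0.3, -0.2]) + rng.normal(size=n)
X_R = X_U[:, :1]                                               # restricted: constant only

SSR_U, SSR_R = ssr(X_U, y), ssr(X_R, y)
n_obs, K = X_U.shape
num_r = X_U.shape[1] - X_R.shape[1]                            # number of restrictions
F = ((SSR_R - SSR_U) / num_r) / (SSR_U / (n_obs - K))
print("F:", F, "p-value:", stats.f.sf(F, num_r, n_obs - K))
```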

Finite-Sample Theory for OLS: Summary and Remarks

To develop the exact finite-sample distribution of the OLS estimator, we needed to make the following assumptions:

Assumption 1.1: Linearity: $y = X\beta + \varepsilon$
Assumption 1.2: Strict exogeneity: $E(\varepsilon_i | X) = 0$, $i = 1, 2, \ldots, n$
Assumption 1.3: No multicollinearity: $X$ is full rank with probability 1.
Assumption 1.4: Spherical error variance: $Var(\varepsilon | X) = \sigma^2 I_n$
Assumption 1.5: Normality of $\varepsilon$

All of these assumptions (with the possible exceptions of 1.3 and 1.1) are unattractive. We don't want our estimates and inferences to depend heavily on assumptions that we don't believe and that are likely to be violated in reality.

Loosening these assumptions and still obtaining finite-sample results is often difficult or impossible. It is much easier to relax these assumptions in a large-sample setting and rely on asymptotic results.

Finite-Sample Theory for OLS: Summary and Remarks

The main exception to this is Assumption 1.4:

Assumption 1.4: Spherical error variance: $E(\varepsilon\varepsilon' | X) = \sigma^2 I_n$
(Reminder: this has two parts, conditional homoskedasticity and independence.)

It is possible to relax this assumption and still obtain finite-sample results. The results and methods are, moreover, generally useful, including in the large-sample setting. This is the method of Generalized Least Squares (GLS), to which we turn in a later lecture.
