
INTRODUCTION TO ECONOMETRIC THEORY

CHUNG-MING KUAN
Institute of Economics
Academia Sinica
This version: September 15, 2000
© Chung-Ming Kuan. Address for correspondence: Institute of Economics, Academia Sinica, Taipei
115, Taiwan; e-mail: ckuan@ieas.econ.sinica.edu.tw.
Contents
1 Linear and Matrix Algebra 1
1.1 Basic Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Matrix Determinant and Trace . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Matrix Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Matrix Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Eigenvalue and Eigenvector . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 Symmetric Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8 Orthogonal Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Statistical Concepts 17
2.1 Distribution Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Special Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.2 Criteria for Point Estimators . . . . . . . . . . . . . . . . . . . . 30
2.5.3 Interval Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.2 Construction of Tests . . . . . . . . . . . . . . . . . . . . . . . . 35
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Classical Least Squares Theory 39
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 The Method of Ordinary Least Squares . . . . . . . . . . . . . . . . . . 40
3.2.1 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . 40
3.2.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . 42
3.2.3 Geometric Interpretations . . . . . . . . . . . . . . . . . . . . . . 45
3.2.4 Measures of Goodness of Fit . . . . . . . . . . . . . . . . . . . . 48
3.3 Statistical Properties of the OLS Estimators . . . . . . . . . . . . . . . . 51
3.3.1 Classical Conditions . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.2 Without the Normality Condition . . . . . . . . . . . . . . . . . 52
3.3.3 With the Normality Condition . . . . . . . . . . . . . . . . . . . 56
3.4 Hypotheses Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4.1 Tests for Linear Hypotheses . . . . . . . . . . . . . . . . . . . . . 58
3.4.2 Power of the Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4.3 An Alternative Approach . . . . . . . . . . . . . . . . . . . . . . 63
3.5 Confidence Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.6 Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6.1 Near Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6.2 Digress: Dummy Variables . . . . . . . . . . . . . . . . . . . . . 67
3.7 Limitations of the Classical Conditions . . . . . . . . . . . . . . . . . . . 69
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4 Generalized Least Squares Theory 77
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 The Method of Generalized Least Squares . . . . . . . . . . . . . . . . . 78
4.2.1 When y Does Not Have a Scalar Covariance Matrix . . . . . . . 78
4.2.2 The GLS Estimator . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2.3 Properties of the GLS Estimator . . . . . . . . . . . . . . . . . . 81
4.2.4 FGLS Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 Heteroskedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3.1 Tests for Heteroskedasticity . . . . . . . . . . . . . . . . . . . . . 83
4.3.2 GLS Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4 Serial Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.4.1 A Simple Model of Serial Correlation . . . . . . . . . . . . . . . . 87
4.4.2 An Alternative View . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.3 Tests for AR(1) Disturbances . . . . . . . . . . . . . . . . . . . . 91
4.4.4 FGLS Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5 Linear Probability Model . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6 Seemingly Unrelated Regressions . . . . . . . . . . . . . . . . . . . . . . 96
4.7 Models for Panel Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.7.1 Fixed Effects Model . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.7.2 Random Effects Model . . . . . . . . . . . . . . . . . . . . . . . . 104
4.8 Limitations of the FGLS Method . . . . . . . . . . . . . . . . . . . . . . 106
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5 Probability Theory 109
5.1 Probability Space and Random Variables . . . . . . . . . . . . . . . . . 109
5.1.1 Probability Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.1.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.1.3 Moments and Norms . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2 Conditional Distribution and Moments . . . . . . . . . . . . . . . . . . . 114
5.2.1 Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . 115
5.2.2 Conditional Moments . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3 Modes of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3.1 Almost Sure Convergence . . . . . . . . . . . . . . . . . . . . . . 119
5.3.2 Convergence in Probability . . . . . . . . . . . . . . . . . . . . . 121
5.3.3 Convergence in Distribution . . . . . . . . . . . . . . . . . . . . . 123
5.4 Order Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.5 Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.6 Uniform Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . 132
5.7 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6 Asymptotic Least Squares Theory 141
6.1 When Regressors are Stochastic . . . . . . . . . . . . . . . . . . . . . . . 141
6.2 Asymptotic Properties of the OLS Estimators . . . . . . . . . . . . . . . 143
6.2.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.2.2 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . 151
6.3 Consistent Estimation of Covariance Matrix . . . . . . . . . . . . . . . . 156
6.3.1 When Serial Correlations Are Absent . . . . . . . . . . . . . . . 157
6.3.2 When Serial Correlations Are Present . . . . . . . . . . . . . . . 159
6.4 Large-Sample Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.4.1 Wald Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.2 Lagrange Multiplier Test . . . . . . . . . . . . . . . . . . . . . . 164
6.5 Application: Autoregressive Models . . . . . . . . . . . . . . . . . . . . . 168
6.5.1 Properties of the OLS estimators . . . . . . . . . . . . . . . . . . 168
6.5.2 Difference Equation . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.5.3 Weak Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.6 Limitations of the Linear Specification . . . . . . . . . . . . . . . . . . . 173
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7 Nonlinear Least Squares Theory 177
7.1 Nonlinear Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
7.2 The Method of Nonlinear Least Squares . . . . . . . . . . . . . . . . . . 181
7.2.1 Nonlinear Least Squares Estimator . . . . . . . . . . . . . . . . . 181
7.2.2 Nonlinear Optimization Algorithms . . . . . . . . . . . . . . . . 183
7.3 Asymptotic Properties of the NLS Estimators . . . . . . . . . . . . . . . 188
7.3.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7.3.2 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . 191
7.4 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Chapter 1
Linear and Matrix Algebra
This chapter summarizes some important results of linear and matrix algebra that are
instrumental in analyzing the statistical methods in subsequent chapters. The coverage
of these mathematical topics is rather brief but self-contained. Readers may also con-
sult other linear and matrix algebra textbooks for more detailed discussions; see e.g.,
Anton (1981), Basilevsky (1983), Graybill (1969), and Noble and Daniel (1977).
In this chapter we first introduce basic matrix notations (Section 1.1) and matrix
operations (Section 1.2). We then study the determinant and trace functions (Sec-
tion 1.3), matrix inverse (Section 1.4), and matrix rank (Section 1.5). After introducing
eigenvalue and diagonalization (Section 1.6), we discuss the properties of symmetric
matrix (Section 1.7) and orthogonal projection in a vector space (Section 1.8).
1.1 Basic Notations
A matrix is an array of numbers. In what follows, a matrix is denoted by an upper-case letter in boldface (e.g., $A$), and its $(i,j)$-th element (the element at the $i$-th row and $j$-th column) is denoted by the corresponding lower-case letter with the subscript $ij$ (e.g., $a_{ij}$). Specifically, an $m \times n$ matrix $A$ contains $m$ rows and $n$ columns and can be expressed as
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}.
\]
An $n \times 1$ ($1 \times n$) matrix is an $n$-dimensional column (row) vector. Every vector will be denoted by a lower-case letter in boldface (e.g., $z$), and its $i$-th element is denoted by the corresponding lower-case letter with the subscript $i$ (e.g., $z_i$). A $1 \times 1$ matrix is just a scalar. For a matrix $A$, its $i$-th column is denoted as $a_i$.

A matrix is square if its number of rows equals the number of columns. A matrix is said to be diagonal if its off-diagonal elements (i.e., $a_{ij}$, $i \neq j$) are all zeros and at least one of its diagonal elements is non-zero, i.e., $a_{ii} \neq 0$ for some $i = 1, \ldots, n$. A diagonal matrix whose diagonal elements are all ones is an identity matrix, denoted as $I$; we also write the $n \times n$ identity matrix as $I_n$. A matrix $A$ is said to be lower (upper) triangular if $a_{ij} = 0$ for $i < (>)\, j$. We let $0$ denote the matrix whose elements are all zeros.
For a vector-valued function $f\colon \mathbb{R}^m \to \mathbb{R}^n$, $\nabla_{\theta} f(\theta)$ is the $m \times n$ matrix of the first-order derivatives of $f$ with respect to the elements of $\theta$:
\[
\nabla_{\theta} f(\theta) = \begin{bmatrix}
\dfrac{\partial f_1(\theta)}{\partial \theta_1} & \dfrac{\partial f_2(\theta)}{\partial \theta_1} & \cdots & \dfrac{\partial f_n(\theta)}{\partial \theta_1} \\
\dfrac{\partial f_1(\theta)}{\partial \theta_2} & \dfrac{\partial f_2(\theta)}{\partial \theta_2} & \cdots & \dfrac{\partial f_n(\theta)}{\partial \theta_2} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_1(\theta)}{\partial \theta_m} & \dfrac{\partial f_2(\theta)}{\partial \theta_m} & \cdots & \dfrac{\partial f_n(\theta)}{\partial \theta_m}
\end{bmatrix}.
\]
When $n = 1$, $\nabla_{\theta} f(\theta)$ is the (column) gradient vector of $f(\theta)$. The $m \times m$ Hessian matrix of the second-order derivatives of the real-valued function $f(\theta)$ is
\[
\nabla^2_{\theta} f(\theta) = \nabla_{\theta}\big(\nabla_{\theta} f(\theta)\big) = \begin{bmatrix}
\dfrac{\partial^2 f(\theta)}{\partial \theta_1\,\partial \theta_1} & \dfrac{\partial^2 f(\theta)}{\partial \theta_1\,\partial \theta_2} & \cdots & \dfrac{\partial^2 f(\theta)}{\partial \theta_1\,\partial \theta_m} \\
\dfrac{\partial^2 f(\theta)}{\partial \theta_2\,\partial \theta_1} & \dfrac{\partial^2 f(\theta)}{\partial \theta_2\,\partial \theta_2} & \cdots & \dfrac{\partial^2 f(\theta)}{\partial \theta_2\,\partial \theta_m} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 f(\theta)}{\partial \theta_m\,\partial \theta_1} & \dfrac{\partial^2 f(\theta)}{\partial \theta_m\,\partial \theta_2} & \cdots & \dfrac{\partial^2 f(\theta)}{\partial \theta_m\,\partial \theta_m}
\end{bmatrix}.
\]
1.2 Matrix Operations
Two matrices are said to be of the same size if they have the same number of rows and the same number of columns. Matrix equality is defined for two matrices of the same size. Given two $m \times n$ matrices $A$ and $B$, $A = B$ if $a_{ij} = b_{ij}$ for every $i, j$. The transpose of an $m \times n$ matrix $A$, denoted as $A'$, is the $n \times m$ matrix whose $(i,j)$-th element is the $(j,i)$-th element of $A$. The transpose of a column vector is a row vector; the transpose of a scalar is just the scalar itself. A matrix $A$ is said to be symmetric if $A = A'$, i.e., $a_{ij} = a_{ji}$ for all $i, j$. Clearly, a diagonal matrix is symmetric, but a triangular matrix is not.

Matrix addition is also defined for two matrices of the same size. Given two $m \times n$ matrices $A$ and $B$, their sum, $C = A + B$, is the $m \times n$ matrix with the $(i,j)$-th element $c_{ij} = a_{ij} + b_{ij}$. Note that matrix addition, if defined, is commutative:
\[ A + B = B + A, \]
and associative:
\[ A + (B + C) = (A + B) + C. \]
Also, $A + 0 = A$.

The scalar multiplication of the scalar $c$ and the matrix $A$ is the matrix $cA$ whose $(i,j)$-th element is $c\,a_{ij}$. Clearly, $cA = Ac$, and $-A = (-1)A$. Thus, $A + (-A) = A - A = 0$.
Given two matrices $A$ and $B$, the matrix multiplication $AB$ is defined only when the number of columns of $A$ is the same as the number of rows of $B$. Specifically, when $A$ is $m \times n$ and $B$ is $n \times p$, their product, $C = AB$, is the $m \times p$ matrix whose $(i,j)$-th element is
\[ c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}. \]
Matrix multiplication is not commutative, i.e., $AB \neq BA$; in fact, when $AB$ is defined, $BA$ need not be defined. On the other hand, matrix multiplication is associative:
\[ A(BC) = (AB)C, \]
and distributive with respect to matrix addition:
\[ A(B + C) = AB + AC. \]
It is easy to verify that $(AB)' = B'A'$. For an $m \times n$ matrix $A$, $I_m A = A I_n = A$.
The inner product of two $d$-dimensional vectors $y$ and $z$ is the scalar
\[ y'z = \sum_{i=1}^{d} y_i z_i. \]
If $y$ is $m$-dimensional and $z$ is $n$-dimensional, their outer product is the matrix $yz'$ whose $(i,j)$-th element is $y_i z_j$. In particular,
\[ z'z = \sum_{i=1}^{d} z_i^2, \]
which is non-negative and induces the standard Euclidean norm of $z$ as $\|z\| = (z'z)^{1/2}$. The vector with Euclidean norm zero must be a zero vector; a vector with Euclidean norm one is referred to as a unit vector. For example,
\[ \big(1 \;\; 0 \;\; 0\big), \qquad \Big(0 \;\; \tfrac{1}{2} \;\; \tfrac{\sqrt{3}}{2}\Big), \qquad \Big(\tfrac{1}{\sqrt{2}} \;\; \tfrac{1}{\sqrt{3}} \;\; \tfrac{1}{\sqrt{6}}\Big), \]
are all unit vectors. A vector whose $i$-th element is one and whose remaining elements are all zero is called the $i$-th Cartesian unit vector.
Let $\theta$ denote the angle between $y$ and $z$. By the law of cosines,
\[ \|y - z\|^2 = \|y\|^2 + \|z\|^2 - 2\|y\|\,\|z\|\cos\theta, \]
where the left-hand side is also $\|y\|^2 + \|z\|^2 - 2\,y'z$. Thus, the inner product of $y$ and $z$ can be expressed as
\[ y'z = \|y\|\,\|z\|\cos\theta. \]
When $\theta = \pi/2$, $\cos\theta = 0$ so that $y'z = 0$. In this case, we say that $y$ and $z$ are orthogonal to each other. A square matrix $A$ is said to be orthogonal if $A'A = AA' = I$. Hence, each column (row) vector of an orthogonal matrix is a unit vector and orthogonal to all remaining column (row) vectors. When $y = cz$ for some $c \neq 0$, $\theta = 0$ or $\pi$, and $y$ and $z$ are said to be linearly dependent.

As $-1 \leq \cos\theta \leq 1$, we immediately obtain the so-called Cauchy-Schwarz inequality.

Lemma 1.1 (Cauchy-Schwarz) For two $d$-dimensional vectors $y$ and $z$,
\[ |y'z| \leq \|y\|\,\|z\|, \]
where the equality holds when $y$ and $z$ are linearly dependent.
It follows from the Cauchy-Schwarz inequality that
\[ \|y + z\|^2 = \|y\|^2 + \|z\|^2 + 2\,y'z \leq \|y\|^2 + \|z\|^2 + 2\|y\|\,\|z\| = (\|y\| + \|z\|)^2. \]
This leads to the following triangle inequality.

Lemma 1.2 For two $d$-dimensional vectors $y$ and $z$,
\[ \|y + z\| \leq \|y\| + \|z\|, \]
where the equality holds when $y = cz$ for some $c > 0$.

When $y$ and $z$ are orthogonal,
\[ \|y + z\|^2 = \|y\|^2 + \|z\|^2, \]
which is the celebrated Pythagoras theorem.
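These inequalities are easy to check numerically. A minimal NumPy sketch (the particular vectors below are arbitrary illustrative choices):

```python
import numpy as np

y = np.array([1.0, -2.0, 3.0])
z = np.array([4.0, 0.5, -1.0])

inner = y @ z                                  # inner product y'z
norm_y, norm_z = np.linalg.norm(y), np.linalg.norm(z)

# Cauchy-Schwarz: |y'z| <= ||y|| ||z||
assert abs(inner) <= norm_y * norm_z

# Triangle inequality: ||y + z|| <= ||y|| + ||z||
assert np.linalg.norm(y + z) <= norm_y + norm_z

# Pythagoras theorem for orthogonal vectors
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 2.0])
assert np.isclose(np.linalg.norm(e1 + e2) ** 2,
                  np.linalg.norm(e1) ** 2 + np.linalg.norm(e2) ** 2)
```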
A special type of matrix multiplication, known as the Kronecker product, is defined for matrices without size restrictions. Specifically, the Kronecker product of two matrices $A$ ($m \times n$) and $B$ ($p \times q$) is the $mp \times nq$ matrix:
\[
A \otimes B = \begin{bmatrix}
a_{11}B & a_{12}B & \cdots & a_{1n}B \\
a_{21}B & a_{22}B & \cdots & a_{2n}B \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1}B & a_{m2}B & \cdots & a_{mn}B
\end{bmatrix}.
\]
The Kronecker product is not commutative:
\[ A \otimes B \neq B \otimes A, \]
but it is associative:
\[ (A \otimes B) \otimes C = A \otimes (B \otimes C). \]
It also obeys the distributive law:
\[ A \otimes (B + C) = A \otimes B + A \otimes C. \]
It can be verified that
\[ (A \otimes B)' = A' \otimes B'. \]
Consider now differentiation involving vectors and matrices. Let $a$ and $\theta$ be two $d$-dimensional vectors. We have
\[ \nabla_{\theta}(a'\theta) = a. \]
For a symmetric matrix $A$,
\[ \nabla_{\theta}(\theta' A \theta) = 2A\theta, \qquad \nabla^2_{\theta}(\theta' A \theta) = 2A. \]
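A minimal NumPy sketch of these rules (the random matrices, seed, and finite-difference tolerance are illustrative choices): it checks the Kronecker transpose rule and compares $\nabla_{\theta}(\theta' A \theta) = 2A\theta$ with a numerical gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(4, 2))

# (A ⊗ B)' = A' ⊗ B'
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))

# Gradient of the quadratic form theta'S theta is 2*S*theta for symmetric S
S = rng.normal(size=(3, 3))
S = (S + S.T) / 2                      # symmetrize
theta = rng.normal(size=3)

def quad(t):
    return t @ S @ t

eps = 1e-6
numeric_grad = np.array([
    (quad(theta + eps * np.eye(3)[i]) - quad(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
assert np.allclose(numeric_grad, 2 * S @ theta, atol=1e-5)
```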
1.3 Matrix Determinant and Trace
Given an $n \times n$ matrix $A$, let $A_{ij}$ denote the sub-matrix obtained from $A$ by deleting its $i$-th row and $j$-th column. The determinant of $A$ is
\[ \det(A) = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} \det(A_{ij}), \]
for any $j = 1, \ldots, n$, where $(-1)^{i+j}\det(A_{ij})$ is called the cofactor of $a_{ij}$. This definition is based on the cofactor expansion along the $j$-th column. Equivalently, the determinant can also be defined using the cofactor expansion along the $i$-th row:
\[ \det(A) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} \det(A_{ij}), \]
for any $i = 1, \ldots, n$. The determinant of a scalar is the scalar itself; the determinant of a $2 \times 2$ matrix $A$ is simply $a_{11}a_{22} - a_{12}a_{21}$. A square matrix with non-zero determinant is said to be nonsingular; otherwise, it is singular.
Clearly, $\det(A) = \det(A')$. From the definition of the determinant, it is straightforward to see that for a scalar $c$ and an $n \times n$ matrix $A$,
\[ \det(cA) = c^n \det(A), \]
and that for a square matrix with a column (or row) of zeros, its determinant must be zero. Also, the determinant of a diagonal or triangular matrix is simply the product of all the diagonal elements. It can also be shown that the determinant of the product of two square matrices of the same size is the product of their determinants:
\[ \det(AB) = \det(A)\det(B) = \det(BA). \]
Also, for an $m \times m$ matrix $A$ and a $p \times p$ matrix $B$,
\[ \det(A \otimes B) = \det(A)^{p} \det(B)^{m}. \]
If $A$ is an orthogonal matrix, we know $AA' = I$ so that
\[ \det(I) = \det(AA') = [\det(A)]^2. \]
As the determinant of the identity matrix is one, the determinant of an orthogonal matrix must be either $1$ or $-1$.
The trace of a square matrix is the sum of its diagonal elements, i.e., $\operatorname{trace}(A) = \sum_i a_{ii}$. For example, $\operatorname{trace}(I_n) = n$. Clearly, $\operatorname{trace}(A) = \operatorname{trace}(A')$. The trace function has the linearity property:
\[ \operatorname{trace}(cA + dB) = c\operatorname{trace}(A) + d\operatorname{trace}(B), \]
where $c$ and $d$ are scalars. It can also be shown that
\[ \operatorname{trace}(AB) = \operatorname{trace}(BA), \]
provided that both $AB$ and $BA$ are defined. For two square matrices $A$ and $B$,
\[ \operatorname{trace}(A \otimes B) = \operatorname{trace}(A)\operatorname{trace}(B). \]
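These determinant and trace identities can be verified numerically. A minimal sketch with arbitrary random matrices (the sizes $m = 3$ and $p = 2$ and the seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
A1 = rng.normal(size=(3, 3))
A2 = rng.normal(size=(3, 3))

# det(A1 A2) = det(A1) det(A2)
assert np.isclose(np.linalg.det(A1 @ A2),
                  np.linalg.det(A1) * np.linalg.det(A2))

C = rng.normal(size=(3, 2))
D = rng.normal(size=(2, 3))
# trace(CD) = trace(DC), even though CD and DC have different sizes
assert np.isclose(np.trace(C @ D), np.trace(D @ C))

# Kronecker product: det(A ⊗ B) = det(A)^p det(B)^m, trace(A ⊗ B) = trace(A) trace(B)
A = rng.normal(size=(3, 3))            # m = 3
B = rng.normal(size=(2, 2))            # p = 2
K = np.kron(A, B)
assert np.isclose(np.linalg.det(K),
                  np.linalg.det(A) ** 2 * np.linalg.det(B) ** 3)
assert np.isclose(np.trace(K), np.trace(A) * np.trace(B))
```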
1.4 Matrix Inverse
A nonsingular matrix $A$ possesses a unique inverse $A^{-1}$ in the sense that $AA^{-1} = A^{-1}A = I$. A singular matrix cannot be inverted, however. Thus, saying that a matrix is invertible is equivalent to saying that it is nonsingular.
Given an invertible matrix $A$, its inverse can be calculated as
\[ A^{-1} = \frac{1}{\det(A)}\,F', \]
where $F$ is the matrix of cofactors, i.e., the $(i,j)$-th element of $F$ is the cofactor $(-1)^{i+j}\det(A_{ij})$. The matrix $F'$ is known as the adjoint of $A$. For example, when $A$ is $2 \times 2$,
\[ A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}. \]
Matrix inversion and transposition can be interchanged, i.e., $(A')^{-1} = (A^{-1})'$. For two nonsingular matrices $A$ and $B$ of the same size, we have $AB\,B^{-1}A^{-1} = I$, so that
\[ (AB)^{-1} = B^{-1}A^{-1}. \]
Some special matrices can be easily inverted. For example, for a diagonal matrix $A$, $A^{-1}$ is also diagonal with the diagonal elements $a_{ii}^{-1}$; for an orthogonal matrix $A$, $A^{-1} = A'$.
A formula for computing the inverse of a partitioned matrix is
\[
\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} =
\begin{bmatrix} A^{-1} + A^{-1}BF^{-1}CA^{-1} & -A^{-1}BF^{-1} \\ -F^{-1}CA^{-1} & F^{-1} \end{bmatrix},
\]
where $F = D - CA^{-1}B$, or equivalently,
\[
\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} =
\begin{bmatrix} G^{-1} & -G^{-1}BD^{-1} \\ -D^{-1}CG^{-1} & D^{-1} + D^{-1}CG^{-1}BD^{-1} \end{bmatrix},
\]
where $G = A - BD^{-1}C$, provided that the matrix inverses in the expressions above are well defined. In particular, if this matrix is block diagonal so that the off-diagonal blocks are zero matrices, we have
\[
\begin{bmatrix} A & 0 \\ 0 & D \end{bmatrix}^{-1} =
\begin{bmatrix} A^{-1} & 0 \\ 0 & D^{-1} \end{bmatrix},
\]
provided that $A$ and $D$ are invertible.
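A minimal numerical check of the partitioned-inverse formula (the random blocks below are arbitrary and are shifted by a multiple of the identity only to keep them well conditioned):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3)) + 3 * np.eye(3)
B = rng.normal(size=(3, 2))
C = rng.normal(size=(2, 3))
D = rng.normal(size=(2, 2)) + 3 * np.eye(2)

M = np.block([[A, B], [C, D]])

Ainv = np.linalg.inv(A)
F = D - C @ Ainv @ B                     # Schur complement of A
Finv = np.linalg.inv(F)

Minv_formula = np.block([
    [Ainv + Ainv @ B @ Finv @ C @ Ainv, -Ainv @ B @ Finv],
    [-Finv @ C @ Ainv,                   Finv],
])

assert np.allclose(Minv_formula, np.linalg.inv(M))
```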
1.5 Matrix Rank
The vectors $z_1, \ldots, z_n$ are said to be linearly independent if the only solution to
\[ c_1 z_1 + c_2 z_2 + \cdots + c_n z_n = 0 \]
is the trivial solution $c_1 = \cdots = c_n = 0$; otherwise, they are linearly dependent. When two (three) vectors are linearly dependent, they are on the same line (plane).
The column (row) rank of a matrix $A$ is the maximum number of linearly independent column (row) vectors of $A$. When the column (row) rank equals the number of column (row) vectors, this matrix is said to be of full column (row) rank. The space spanned by the vectors $z_1, \ldots, z_n$ is the collection of all linear combinations of these vectors, denoted as $\operatorname{span}(z_1, \ldots, z_n)$. The space spanned by the column vectors of $A$ is therefore $\operatorname{span}(A)$, which is also known as the column space of $A$. A vector $z$ is in $\operatorname{span}(A)$ if it can be expressed as $Ac$ for some vector $c \neq 0$. Similarly, the space spanned by the row vectors of $A$ is $\operatorname{span}(A')$ and is known as the row space of $A$. The column (row) rank of $A$ is the dimension of the column (row) space of $A$.
Given an $n \times k$ matrix $A$ with $k \leq n$, suppose that $A$ has row rank $r \leq n$ and column rank $c \leq k$. Without loss of generality, assume that the first $r$ row vectors are linearly independent. Hence, each row vector $a_i$ can be expressed as
\[ a_i = q_{i1} a_1 + q_{i2} a_2 + \cdots + q_{ir} a_r, \qquad i = 1, \ldots, n, \]
with the $j$-th element
\[ a_{ij} = q_{i1} a_{1j} + q_{i2} a_{2j} + \cdots + q_{ir} a_{rj}, \qquad i = 1, \ldots, n, \; j = 1, \ldots, k. \]
Fixing $j$, we immediately see that every column vector of $A$ can be written as a linear combination of the vectors $q_1, \ldots, q_r$. As such, the column rank of $A$ must be less than or equal to $r$. Similarly, the column rank of $A'$, which is also the row rank of $A$, must be less than or equal to $c$. This proves the following result.
Lemma 1.3 The column rank and row rank of a matrix are equal.
By Lemma 1.3, we can then define the rank of $A$ as the maximum number of linearly independent column (or row) vectors of $A$, denoted as $\operatorname{rank}(A)$. Clearly, $\operatorname{rank}(A) = \operatorname{rank}(A')$. An $n \times n$ matrix $A$ is said to be of full rank if $\operatorname{rank}(A) = n$.

For an $n \times k$ matrix $A$, its left inverse is a $k \times n$ matrix $A_L^{-1}$ such that $A_L^{-1} A = I_k$. Similarly, a right inverse of $A$ is a $k \times n$ matrix $A_R^{-1}$ such that $A A_R^{-1} = I_n$. The left and right inverses are not unique, however. It can be shown that a matrix possesses a left (right) inverse if and only if it has full column (row) rank. A square matrix with full rank therefore has both inverses, and they coincide with the unique matrix inverse. It follows that a nonsingular (invertible) matrix must be of full rank and vice versa.
It can be shown that for two $n \times k$ matrices $A$ and $B$,
\[ \operatorname{rank}(A + B) \leq \operatorname{rank}(A) + \operatorname{rank}(B). \]
If $A$ is $n \times k$ and $B$ is $k \times m$,
\[ \operatorname{rank}(A) + \operatorname{rank}(B) - k \leq \operatorname{rank}(AB) \leq \min[\operatorname{rank}(A), \operatorname{rank}(B)]. \]
For the Kronecker product, we have
\[ \operatorname{rank}(A \otimes B) = \operatorname{rank}(A)\operatorname{rank}(B). \]
If $A$ is a nonsingular matrix, we have from the inequality above that
\[ \operatorname{rank}(AB) \leq \operatorname{rank}(B) = \operatorname{rank}(A^{-1}AB) \leq \operatorname{rank}(AB); \]
i.e., $\operatorname{rank}(AB) = \operatorname{rank}(B)$. This also implies that for a nonsingular matrix $C$,
\[ \operatorname{rank}(BC) = \operatorname{rank}(C'B') = \operatorname{rank}(B') = \operatorname{rank}(B). \]
Thus, the rank of a matrix is preserved under nonsingular transformations.

Lemma 1.4 Let $A$ ($n \times n$) and $C$ ($k \times k$) be nonsingular matrices. Then for any $n \times k$ matrix $B$,
\[ \operatorname{rank}(B) = \operatorname{rank}(AB) = \operatorname{rank}(BC). \]
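Lemma 1.4 is easy to illustrate numerically. In the minimal sketch below, $B$ is built to have rank 2 by construction, and $A$ and $C$ are arbitrary nonsingular matrices (the sizes and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# B is 5 x 3 with rank 2: its last column is a linear combination of the first two
B = rng.normal(size=(5, 2))
B = np.column_stack([B, B @ np.array([1.0, -2.0])])

A = rng.normal(size=(5, 5)) + 5 * np.eye(5)   # nonsingular n x n
C = rng.normal(size=(3, 3)) + 5 * np.eye(3)   # nonsingular k x k

r = np.linalg.matrix_rank(B)
assert r == 2
assert np.linalg.matrix_rank(A @ B) == r
assert np.linalg.matrix_rank(B @ C) == r
assert np.linalg.matrix_rank(B.T @ B) == r    # rank(B'B) = rank(B), used in Section 1.8
```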
1.6 Eigenvalue and Eigenvector
Given a square matrix $A$, if $Ac = \lambda c$ for some scalar $\lambda$ and non-zero vector $c$, then $c$ is an eigenvector of $A$ corresponding to the eigenvalue $\lambda$. The system $(A - \lambda I)c = 0$ has a non-trivial solution if and only if
\[ \det(A - \lambda I) = 0. \]
This is known as the characteristic equation of A, from which we can solve for the
eigenvalues of A. Hence, eigenvalues (eigenvectors) are also referred to as characteristic
roots (characteristic vectors). Note that the eigenvalues and eigenvectors of a real-valued
matrix need not be real-valued.
When $A$ is $n \times n$, the characteristic equation is an $n$-th order polynomial in $\lambda$ and has at most $n$ distinct solutions. These solutions (eigenvalues) are in general complex-valued. If some eigenvalues take the same value, there may exist several eigenvectors corresponding to the same eigenvalue. Given an eigenvalue $\lambda$, let $c_1, \ldots, c_k$ be associated eigenvectors. Then,
\[ A(a_1 c_1 + a_2 c_2 + \cdots + a_k c_k) = \lambda (a_1 c_1 + a_2 c_2 + \cdots + a_k c_k), \]
so that any linear combination of these eigenvectors is again an eigenvector corresponding to $\lambda$. That is, these eigenvectors are closed under scalar multiplication and vector addition and form the eigenspace corresponding to $\lambda$. As such, for a common eigenvalue, we are mainly concerned with those eigenvectors that are linearly independent.
If $A$ ($n \times n$) possesses $n$ distinct eigenvalues, each eigenvalue must correspond to one eigenvector, unique up to scalar multiplications. It is therefore typical to normalize eigenvectors such that they have Euclidean norm one. It can also be shown that if the eigenvalues of a matrix are all distinct, their associated eigenvectors must be linearly independent. Let $C$ denote the matrix of these eigenvectors and $\Lambda$ the diagonal matrix with diagonal elements being the eigenvalues of $A$. We can write $AC = C\Lambda$. As $C$ is nonsingular, we have
\[ C^{-1}AC = \Lambda, \quad \text{or} \quad A = C\Lambda C^{-1}. \]
In this case, $A$ is said to be similar to $\Lambda$.

When $A$ has $n$ distinct eigenvalues $\lambda_1, \ldots, \lambda_n$, it is readily seen that
\[ \det(A) = \det(C\Lambda C^{-1}) = \det(\Lambda)\det(C)\det(C^{-1}) = \det(\Lambda), \]
and
\[ \operatorname{trace}(A) = \operatorname{trace}(C\Lambda C^{-1}) = \operatorname{trace}(\Lambda C^{-1}C) = \operatorname{trace}(\Lambda). \]
This yields the following result.
Lemma 1.5 Let $A$ be an $n \times n$ matrix with distinct eigenvalues $\lambda_1, \ldots, \lambda_n$. Then
\[ \det(A) = \det(\Lambda) = \prod_{i=1}^{n} \lambda_i, \qquad \operatorname{trace}(A) = \operatorname{trace}(\Lambda) = \sum_{i=1}^{n} \lambda_i. \]
When $A = C\Lambda C^{-1}$, we have $A^{-1} = C\Lambda^{-1}C^{-1}$. This shows that the eigenvectors of $A^{-1}$ are the same as those of $A$, and the corresponding eigenvalues are simply the reciprocals of the eigenvalues of $A$. Similarly,
\[ A^2 = (C\Lambda C^{-1})(C\Lambda C^{-1}) = C\Lambda^2 C^{-1}, \]
so that the eigenvectors of $A^2$ are the same as those of $A$, and the corresponding eigenvalues are the squares of the eigenvalues of $A$. This result generalizes immediately to $A^k$.
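A minimal sketch of the diagonalization $A = C\Lambda C^{-1}$ and Lemma 1.5 using numpy.linalg.eig (the random matrix is an arbitrary illustrative choice; its eigenvalues may be complex):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))

eigvals, C = np.linalg.eig(A)          # columns of C are eigenvectors
Lam = np.diag(eigvals)
Cinv = np.linalg.inv(C)

# A = C Lambda C^{-1}
assert np.allclose(A, C @ Lam @ Cinv)

# det(A) = product of eigenvalues, trace(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)
assert np.isclose(np.trace(A), np.sum(eigvals).real)

# A^2 has the same eigenvectors, with squared eigenvalues
assert np.allclose(A @ A, C @ Lam**2 @ Cinv)
```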
1.7 Symmetric Matrix
More can be said about symmetric matrices. Let $c_1$ and $c_2$ be two eigenvectors of $A$ corresponding to the distinct eigenvalues $\lambda_1$ and $\lambda_2$, respectively. If $A$ is symmetric, then
\[ c_2' A c_1 = \lambda_1\, c_2' c_1 = \lambda_2\, c_2' c_1. \]
As $\lambda_1 \neq \lambda_2$, it must be true that $c_2' c_1 = 0$, so that they are orthogonal. Given linearly independent eigenvectors that correspond to a common eigenvalue, they can also be orthogonalized. Thus, a symmetric matrix is orthogonally diagonalizable, in the sense that
\[ C'AC = \Lambda, \quad \text{or} \quad A = C\Lambda C', \]
where $\Lambda$ is again the diagonal matrix of the eigenvalues of $A$, and $C$ is the orthogonal matrix of associated eigenvectors.
As nonsingular transformations preserve rank (Lemma 1.4), so do orthogonal trans-
formations. We thus have the result below.
Lemma 1.6 For a symmetric matrix $A$, $\operatorname{rank}(A) = \operatorname{rank}(\Lambda)$, the number of non-zero eigenvalues of $A$.
Moreover, when A is diagonalizable, the assertions of Lemma 1.5 remain valid,
whether or not the eigenvalues of A are distinct.
Lemma 1.7 Let $A$ be an $n \times n$ symmetric matrix. Then,
\[ \det(A) = \det(\Lambda) = \prod_{i=1}^{n} \lambda_i, \qquad \operatorname{trace}(A) = \operatorname{trace}(\Lambda) = \sum_{i=1}^{n} \lambda_i. \]
By Lemma 1.7, a symmetric matrix is nonsingular if its eigenvalues are all non-zero.

A symmetric matrix $A$ is said to be positive definite if $b'Ab > 0$ for all vectors $b \neq 0$; $A$ is said to be positive semi-definite if $b'Ab \geq 0$ for all $b \neq 0$. A positive definite matrix thus must be nonsingular, but a positive semi-definite matrix may be singular. Suppose that $A$ is a symmetric matrix orthogonally diagonalized as $C'AC = \Lambda$. If $A$ is also positive semi-definite, then for any $b \neq 0$,
\[ b'\Lambda b = b'(C'AC)b = \tilde{b}' A \tilde{b} \geq 0, \]
where $\tilde{b} = Cb$. This shows that $\Lambda$ is also positive semi-definite, and all the diagonal elements of $\Lambda$ must be non-negative. It can be seen that the converse also holds.
Lemma 1.8 A symmetric matrix is positive definite (positive semi-definite) if, and only if, its eigenvalues are all positive (non-negative).
For a symmetric and positive definite matrix $A$, $A^{-1/2}$ is a matrix such that $A^{-1/2}A^{-1/2} = A^{-1}$. In particular, by orthogonal diagonalization,
\[ A^{-1} = C\Lambda^{-1}C' = (C\Lambda^{-1/2}C')(C\Lambda^{-1/2}C'), \]
so that we may choose $A^{-1/2} = C\Lambda^{-1/2}C'$. The inverse of $A^{-1/2}$ is $A^{1/2} = C\Lambda^{1/2}C'$. It follows that $A^{1/2}A^{1/2} = A$, and $A^{-1/2}AA^{-1/2} = I$. Note that $\Lambda^{-1/2}C'$ is also a legitimate choice of $A^{-1/2}$, yet it is not symmetric.
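The construction $A^{1/2} = C\Lambda^{1/2}C'$ translates directly into code. A minimal sketch for an arbitrary symmetric positive definite matrix, built here as $X'X$ plus a small ridge (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 4))
A = X.T @ X + 0.1 * np.eye(4)          # symmetric and positive definite

eigvals, C = np.linalg.eigh(A)         # eigh: C is orthogonal for symmetric A
A_half = C @ np.diag(np.sqrt(eigvals)) @ C.T          # A^{1/2}  = C Lambda^{1/2}  C'
A_neg_half = C @ np.diag(1 / np.sqrt(eigvals)) @ C.T  # A^{-1/2} = C Lambda^{-1/2} C'

assert np.allclose(A_half @ A_half, A)
assert np.allclose(A_neg_half @ A_neg_half, np.linalg.inv(A))
assert np.allclose(A_neg_half @ A @ A_neg_half, np.eye(4))
```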
Finally, we know that for two positive real numbers $a$ and $b$, $a \geq b$ implies $b^{-1} \geq a^{-1}$. This result can be generalized to compare two positive definite matrices, as stated below without proof.

Lemma 1.9 Given two symmetric and positive definite matrices $A$ and $B$, if $A - B$ is positive semi-definite, then so is $B^{-1} - A^{-1}$.
1.8 Orthogonal Projection
A matrix $A$ is said to be idempotent if $A^2 = A$. Given a vector $y$ in the Euclidean space $V$, a projection of $y$ onto a subspace $S$ of $V$ is a linear transformation of $y$ to $S$. The resulting projected vector can be written as $Py$, where $P$ is the associated transformation matrix. Given the projection $Py$ in $S$, further projection to $S$ should have no effect on $Py$, i.e.,
\[ P(Py) = P^2 y = Py. \]
Thus, a matrix $P$ is said to be a projection matrix if it is idempotent.

A projection of $y$ onto $S$ is orthogonal if the projection $Py$ is orthogonal to the difference between $y$ and $Py$. That is,
\[ (y - Py)'Py = y'(I - P)'Py = 0. \]
As $y$ is arbitrary, the equality above holds if, and only if, $(I - P)'P = 0$. Consequently, $P = P'P$ and $P' = P'P$. This shows that $P$ must be symmetric. Thus, a matrix is an orthogonal projection matrix if, and only if, it is symmetric and idempotent. It can be easily verified that the orthogonal projection $Py$ must be unique.
When $P$ is an orthogonal projection matrix, it is easily seen that $I - P$ is idempotent because
\[ (I - P)^2 = I - 2P + P^2 = I - P. \]
As $I - P$ is also symmetric, it is an orthogonal projection matrix. Since $(I - P)P = 0$, the projections $Py$ and $(I - P)y$ must be orthogonal. This shows that any vector $y$ can be uniquely decomposed into two orthogonal components:
\[ y = Py + (I - P)y. \]
Define the orthogonal complement of a subspace $S \subseteq V$ as
\[ S^{\perp} = \{v \in V : v's = 0 \text{ for all } s \in S\}. \]
If $P$ is the orthogonal projection matrix that projects vectors onto $S \subseteq V$, we have $Ps = s$ for any $s \in S$. It follows that $(I - P)y$ is orthogonal to $s$ and that $(I - P)y$ is the orthogonal projection of $y$ onto $S^{\perp}$.
Intuitively, the orthogonal projection $Py$ can be interpreted as the best approximation of $y$ in $S$, in the sense that $Py$ is the closest to $y$ in terms of the Euclidean norm. To see this, we observe that for any $s \in S$,
\[
\|y - s\|^2 = \|y - Py + Py - s\|^2 = \|y - Py\|^2 + \|Py - s\|^2 + 2(y - Py)'(Py - s) = \|y - Py\|^2 + \|Py - s\|^2,
\]
where the cross-product term vanishes because $y - Py$ is orthogonal to $Py - s \in S$. This establishes the following result.

Lemma 1.10 Let $y$ be a vector in $V$ and $Py$ its orthogonal projection onto $S \subseteq V$. Then,
\[ \|y - Py\| \leq \|y - s\|, \]
for all $s \in S$.
Let $A$ be a symmetric and idempotent matrix and $C$ be the orthogonal matrix that diagonalizes $A$ to $\Lambda$. Then,
\[ \Lambda = C'AC = C'A(CC')AC = \Lambda^2. \]
This is possible only when the eigenvalues of $A$ are zero and one. The result below now follows from Lemma 1.8.

Lemma 1.11 A symmetric and idempotent matrix is positive semi-definite with eigenvalues 0 and 1.

Moreover, $\operatorname{trace}(\Lambda)$ is the number of non-zero eigenvalues of $A$ and hence equals $\operatorname{rank}(\Lambda)$. When $A$ is symmetric, $\operatorname{rank}(A) = \operatorname{rank}(\Lambda)$ by Lemma 1.6, and $\operatorname{trace}(A) = \operatorname{trace}(\Lambda)$ by Lemma 1.7. Combining these results we have:

Lemma 1.12 For a symmetric and idempotent matrix $A$, $\operatorname{rank}(A) = \operatorname{trace}(A)$, the number of non-zero eigenvalues of $A$.
Given an $n \times k$ matrix $A$, it is easy to see that $A'A$ and $AA'$ are symmetric and positive semi-definite. Let $x$ denote a vector orthogonal to the rows of $A'A$, i.e., $A'Ax = 0$. Hence $x'A'Ax = 0$, so that $Ax$ must be a zero vector. That is, $x$ is also orthogonal to the rows of $A$. Conversely, $Ax = 0$ implies $A'Ax = 0$. This shows that the orthogonal complement of the row space of $A$ is the same as the orthogonal complement of the row space of $A'A$. Hence, these two row spaces are also the same. Similarly, the column space of $A$ is the same as the column space of $AA'$. It follows from Lemma 1.3 that
\[ \operatorname{rank}(A) = \operatorname{rank}(A'A) = \operatorname{rank}(AA'). \]
In particular, if $A$ ($n \times k$) is of full column rank $k < n$, then $A'A$ is $k \times k$ and hence of full rank $k$ (nonsingular), but $AA'$ is $n \times n$ and hence singular. The result below is now immediate.

Lemma 1.13 If $A$ is an $n \times k$ matrix with full column rank $k < n$, then $A'A$ is symmetric and positive definite.
Given an $n \times k$ matrix $A$ with full column rank $k < n$, $P = A(A'A)^{-1}A'$ is clearly symmetric and idempotent and hence an orthogonal projection matrix. As
\[ \operatorname{trace}(P) = \operatorname{trace}(A'A(A'A)^{-1}) = \operatorname{trace}(I_k) = k, \]
we have from Lemmas 1.11 and 1.12 that $P$ has exactly $k$ eigenvalues equal to 1 and that $\operatorname{rank}(P) = k$. Similarly, $\operatorname{rank}(I - P) = n - k$. Moreover, any vector $y \in \operatorname{span}(A)$ can be written as $Ab$ for some non-zero vector $b$, and
\[ Py = A(A'A)^{-1}A'(Ab) = Ab = y. \]
This suggests that $P$ must project vectors onto $\operatorname{span}(A)$. On the other hand, when $y \in \operatorname{span}(A)^{\perp}$, $y$ is orthogonal to the column vectors of $A$ so that $A'y = 0$. It follows that $Py = 0$ and $(I - P)y = y$. Thus, $I - P$ must project vectors onto $\operatorname{span}(A)^{\perp}$. These results are summarized below.

Lemma 1.14 Let $A$ be an $n \times k$ matrix with full column rank $k$. Then, $A(A'A)^{-1}A'$ orthogonally projects vectors onto $\operatorname{span}(A)$ and has rank $k$; $I_n - A(A'A)^{-1}A'$ orthogonally projects vectors onto $\operatorname{span}(A)^{\perp}$ and has rank $n - k$.
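Lemma 1.14 underlies the least-squares results of Chapter 3, and it is easy to verify numerically. A minimal sketch (the dimensions $n = 8$, $k = 3$ and the random draws are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 8, 3
A = rng.normal(size=(n, k))                    # full column rank with probability one

P = A @ np.linalg.inv(A.T @ A) @ A.T
M = np.eye(n) - P

assert np.allclose(P, P.T)                     # symmetric
assert np.allclose(P @ P, P)                   # idempotent
assert np.isclose(np.trace(P), k)              # trace(P) = rank(P) = k
assert np.linalg.matrix_rank(M) == n - k

b = rng.normal(size=k)
assert np.allclose(P @ (A @ b), A @ b)         # P leaves vectors in span(A) unchanged

y = rng.normal(size=n)
assert np.isclose((P @ y) @ (M @ y), 0.0)      # Py and (I - P)y are orthogonal
```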
References
Anton, Howard (1981). Elementary Linear Algebra, third edition, New York: Wiley.
Basilevsky, Alexander (1983). Applied Matrix Algebra in the Statistical Sciences, New
York: North-Holland.
Graybill, Franklin A. (1969). Introduction to Matrices with Applications in Statistics,
second edition, Belmont, CA: Wadsworth.
Noble, Ben and James W. Daniel (1977). Applied Linear Algebra, second edition, Englewood Cliffs, NJ: Prentice-Hall.
Chapter 2
Statistical Concepts
In this chapter we summarize some basic probability and statistics results to be used in
subsequent chapters. We focus on finite-sample results for multivariate random vectors and statistics; asymptotic properties require more profound mathematical tools and will not be discussed until Chapter 5. The topics covered in this chapter can be found in most statistics textbooks; in particular, Amemiya (1994) is a useful reference.
2.1 Distribution Functions
Given a random experiment, let $\Omega$ denote the collection of all possible outcomes of this experiment and $\mathrm{IP}$ denote the probability measure assigned to a certain collection of events (subsets of $\Omega$). If $A$ is an event, $\mathrm{IP}(A)$ is such that $0 \leq \mathrm{IP}(A) \leq 1$ and measures the likelihood of $A$. The larger is $\mathrm{IP}(A)$, the more likely is the event $A$ to occur. A $d$-dimensional random vector (an $\mathbb{R}^d$-valued random variable) is a function of the outcomes $\omega \in \Omega$ and takes values in $\mathbb{R}^d$. Formal definitions of probability space and random variables are given in Section 5.1.
The (joint) distribution function of the $\mathbb{R}^d$-valued random variable $z$ is the non-decreasing, right-continuous function $F_z$ such that for $\zeta = (\zeta_1 \;\cdots\; \zeta_d)' \in \mathbb{R}^d$,
\[ F_z(\zeta) = \mathrm{IP}\{\omega \in \Omega : z_1(\omega) \leq \zeta_1, \ldots, z_d(\omega) \leq \zeta_d\}, \]
with
\[ \lim_{\zeta_1, \ldots, \zeta_d \to -\infty} F_z(\zeta) = 0, \qquad \lim_{\zeta_1, \ldots, \zeta_d \to \infty} F_z(\zeta) = 1. \]
Note that the distribution function of $z$ is a standard point function defined on $\mathbb{R}^d$ and provides a convenient way to characterize the randomness of $z$. The (joint) density function of $F_z$, if it exists, is the non-negative function $f_z$ such that
\[ F_z(\zeta) = \int_{-\infty}^{\zeta_d} \cdots \int_{-\infty}^{\zeta_1} f_z(s_1, \ldots, s_d)\, ds_1 \cdots ds_d, \]
where the right-hand side is a Riemann integral. Clearly, the density function $f_z$ must integrate to one on $\mathbb{R}^d$.
The marginal distribution function of the $i$-th component of $z$ is
\[ F_{z_i}(\zeta_i) = \mathrm{IP}\{\omega \in \Omega : z_i(\omega) \leq \zeta_i\} = F_z(\infty, \ldots, \infty, \zeta_i, \infty, \ldots, \infty). \]
Thus, the marginal distribution function of $z_i$ is the joint distribution function without restrictions on the other elements $z_j$, $j \neq i$. The marginal density function of $z_i$ is the non-negative function $f_{z_i}$ such that
\[ F_{z_i}(\zeta_i) = \int_{-\infty}^{\zeta_i} f_{z_i}(s)\, ds. \]
It is readily seen that the marginal density function $f_{z_i}$ can also be obtained from the associated joint density function by integrating out the other elements:
\[ f_{z_i}(s_i) = \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} f_z(s_1, \ldots, s_d)\, ds_1 \cdots ds_{i-1}\, ds_{i+1} \cdots ds_d. \]
If there are two random vectors $z_1$ and $z_2$, they are said to be independent if, and only if, their joint distribution function is the product of the marginal distribution functions:
\[ F_{z_1, z_2}(\zeta_1, \zeta_2) = F_{z_1}(\zeta_1)\, F_{z_2}(\zeta_2); \]
otherwise, they are dependent. If the random vectors possess density functions, they are independent if, and only if, their joint density function is also the product of the marginal density functions. It is intuitively clear that functions of independent random variables remain independent, as stated in the result below.

Lemma 2.1 If $z_1$ and $z_2$ are independent random vectors, then their transformations, $h_1(z_1)$ and $h_2(z_2)$, are also independent random variables.
2.2 Moments
Given the $d$-dimensional random vector $z$ with the distribution function $F_z$, the expectation of the $i$-th element $z_i$ is defined as
\[ \mathrm{IE}(z_i) = \int_{\mathbb{R}^d} \zeta_i \, dF_z(\zeta_1, \ldots, \zeta_d), \]
where the right-hand side is a Stieltjes integral; for more details about different integrals we refer to Rudin (1976). As this integral equals
\[ \int_{\mathbb{R}} \zeta_i \, dF_z(\infty, \ldots, \infty, \zeta_i, \infty, \ldots, \infty) = \int_{\mathbb{R}} \zeta_i \, dF_{z_i}(\zeta_i), \]
the expectation of $z_i$ can be taken with respect to either the joint distribution function $F_z$ or the marginal distribution function $F_{z_i}$.
We say that the random variable $z_i$ has a finite expected value (or the expectation $\mathrm{IE}(z_i)$ exists) if $\mathrm{IE}|z_i| < \infty$. A random variable need not have a finite expected value; if it does, this random variable is said to be integrable. More generally, the expectation of a random vector is defined elementwise. Thus, for a random vector $z$, $\mathrm{IE}(z)$ exists if all $\mathrm{IE}(z_i)$, $i = 1, \ldots, d$, exist, and $z$ is integrable if all $z_i$, $i = 1, \ldots, d$, are integrable.

It is easily seen that the expectation operator does not have any effect on a constant; that is, $\mathrm{IE}(b) = b$ for any constant $b$. For integrable random variables $z_i$ and $z_j$, the expectation operator is monotonic in the sense that
\[ \mathrm{IE}(z_i) \leq \mathrm{IE}(z_j), \]
for any $z_i \leq z_j$ with probability one. Moreover, the expectation operator possesses the linearity property:
\[ \mathrm{IE}(a z_i + b z_j) = a\,\mathrm{IE}(z_i) + b\,\mathrm{IE}(z_j), \]
where $a$ and $b$ are two real numbers. This property immediately generalizes to integrable random vectors.

Lemma 2.2 Let $A$ ($n \times d$) and $B$ ($n \times c$) be two non-stochastic matrices. Then for any integrable random vectors $z$ ($d \times 1$) and $y$ ($c \times 1$),
\[ \mathrm{IE}(Az + By) = A\,\mathrm{IE}(z) + B\,\mathrm{IE}(y). \]
If $b$ is an $n$-dimensional non-stochastic vector, then $\mathrm{IE}(Az + b) = A\,\mathrm{IE}(z) + b$.
More generally, let $y = g(z)$ be a well-defined, vector-valued function of $z$. The expectation of $y$ is
\[ \mathrm{IE}(y) = \mathrm{IE}[g(z)] = \int_{\mathbb{R}^d} g(\zeta)\, dF_z(\zeta). \]
When $g(z) = z_i^k$, $\mathrm{IE}[g(z)] = \mathrm{IE}(z_i^k)$ is known as the $k$-th moment of $z_i$, where $k$ need not be an integer. In particular, $\mathrm{IE}(z_i)$ is the first moment of $z_i$. When a random variable has a finite $k$-th moment, its moments of order less than $k$ are also finite. Thus, if the $k$-th moment does not exist, then the moments of order greater than $k$ also fail to exist. See Section 2.3 for some examples of random variables that possess only low order moments. A random vector is said to have a finite $k$-th moment if its elements all have finite $k$-th moments. A random variable with a finite second moment is said to be square integrable; a random vector is square integrable if its elements are all square integrable.
The $k$-th central moment of $z_i$ is $\mathrm{IE}[z_i - \mathrm{IE}(z_i)]^k$. In particular, the second central moment of the square integrable random variable $z_i$ is
\[ \mathrm{IE}[z_i - \mathrm{IE}(z_i)]^2 = \mathrm{IE}(z_i^2) - [\mathrm{IE}(z_i)]^2, \]
which is a measure of dispersion of the values of $z_i$. The second central moment is also known as the variance, denoted as $\operatorname{var}(\cdot)$. The square root of the variance is the standard deviation. It can be verified that, given the square integrable random variable $z_i$ and real numbers $a$ and $b$,
\[ \operatorname{var}(a z_i + b) = \operatorname{var}(a z_i) = a^2 \operatorname{var}(z_i). \]
This shows that variance is location invariant but not scale invariant.
When $g(z) = z_i z_j$, $\mathrm{IE}[g(z)] = \mathrm{IE}(z_i z_j)$ is the cross moment of $z_i$ and $z_j$. The cross central moment of $z_i$ and $z_j$ is
\[ \mathrm{IE}[(z_i - \mathrm{IE}(z_i))(z_j - \mathrm{IE}(z_j))] = \mathrm{IE}(z_i z_j) - \mathrm{IE}(z_i)\,\mathrm{IE}(z_j), \]
which is a measure of the co-variation between these two random variables. The cross central moment of two random variables is known as their covariance, denoted as $\operatorname{cov}(\cdot, \cdot)$. Clearly, $\operatorname{cov}(z_i, z_j) = \operatorname{cov}(z_j, z_i)$ and $\operatorname{cov}(z_i, z_i) = \operatorname{var}(z_i)$. It can be seen that for real numbers $a, b, c, d$,
\[ \operatorname{cov}(a z_i + b,\, c z_j + d) = \operatorname{cov}(a z_i, c z_j) = ac\operatorname{cov}(z_i, z_j). \]
Thus, covariance is also location invariant but depends on the scale (measurement units) of the random variables.
Observe that for any real numbers $a$ and $b$,
\[ \operatorname{var}(a z_i + b z_j) = a^2 \operatorname{var}(z_i) + b^2 \operatorname{var}(z_j) + 2ab\operatorname{cov}(z_i, z_j), \]
so that
\[ \operatorname{var}(z_i - a z_j) = \operatorname{var}(z_i) + a^2 \operatorname{var}(z_j) - 2a\operatorname{cov}(z_i, z_j), \]
which must be non-negative. Setting $a = \operatorname{cov}(z_i, z_j)/\operatorname{var}(z_j)$, we have
\[ \operatorname{var}(z_i) - \operatorname{cov}(z_i, z_j)^2 / \operatorname{var}(z_j) \geq 0. \]
In particular, when $z_i = a z_j + b$ for some real numbers $a$ and $b$, we have $\operatorname{var}(z_i) = a^2 \operatorname{var}(z_j)$ and $\operatorname{cov}(z_i, z_j) = a\operatorname{var}(z_j)$, so that
\[ \operatorname{var}(z_i) - \operatorname{cov}(z_i, z_j)^2 / \operatorname{var}(z_j) = 0. \]
This yields the Cauchy-Schwarz inequality for square integrable random variables.
Lemma 2.3 (Cauchy-Schwarz) Let $z_i$, $z_j$ be two square integrable random variables. Then,
\[ \operatorname{cov}(z_i, z_j)^2 \leq \operatorname{var}(z_i)\operatorname{var}(z_j), \]
where the equality holds when $z_i = a z_j + b$ for some real numbers $a$ and $b$.

Cf. the Cauchy-Schwarz inequality (Lemma 1.1) in Section 1.2. This also suggests that when two random variables are square integrable, their covariance must be finite.
The correlation coefficient of $z_i$ and $z_j$ is defined as
\[ \operatorname{corr}(z_i, z_j) = \frac{\operatorname{cov}(z_i, z_j)}{\sqrt{\operatorname{var}(z_i)\operatorname{var}(z_j)}}. \]
By Lemma 2.3 we have
\[ -1 \leq \operatorname{corr}(z_i, z_j) \leq 1. \]
If $\operatorname{corr}(z_i, z_j) = 0$, $z_i$ and $z_j$ are said to be uncorrelated. If $\operatorname{corr}(z_i, z_j) > 0$, $z_i$ and $z_j$ are said to be positively correlated; if $\operatorname{corr}(z_i, z_j) < 0$, $z_i$ and $z_j$ are negatively correlated. When $z_i = a z_j + b$, $\operatorname{corr}(z_i, z_j) = 1$ if $a > 0$ and $-1$ if $a < 0$. In both cases, $z_i$ and $z_j$ are perfectly correlated. For two random variables $z_i$ and $z_j$ and real numbers $a, b, c, d$,
\[ \operatorname{corr}(a z_i + b,\, c z_j + d) = \operatorname{corr}(a z_i, c z_j) = \frac{ac}{|a|\,|c|}\operatorname{corr}(z_i, z_j). \]
Thus, the correlation coefficient is not only location invariant but also scale invariant, apart from the sign change.
For a $d$-dimensional, square integrable random vector $z$, its variance-covariance matrix is
\[
\operatorname{var}(z) = \mathrm{IE}[(z - \mathrm{IE}(z))(z - \mathrm{IE}(z))'] =
\begin{bmatrix}
\operatorname{var}(z_1) & \operatorname{cov}(z_1, z_2) & \cdots & \operatorname{cov}(z_1, z_d) \\
\operatorname{cov}(z_2, z_1) & \operatorname{var}(z_2) & \cdots & \operatorname{cov}(z_2, z_d) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{cov}(z_d, z_1) & \operatorname{cov}(z_d, z_2) & \cdots & \operatorname{var}(z_d)
\end{bmatrix}.
\]
As $\operatorname{cov}(z_i, z_j) = \operatorname{cov}(z_j, z_i)$, $\operatorname{var}(z)$ must be symmetric. Moreover, $\operatorname{var}(z)$ is positive semi-definite because it is the expectation of a matrix that is positive semi-definite with probability one.

For two random vectors $y$ ($c \times 1$) and $z$ ($d \times 1$), the $d \times c$ covariance matrix of $z$ and $y$ is
\[ \operatorname{cov}(z, y) = \mathrm{IE}[(z - \mathrm{IE}z)(y - \mathrm{IE}y)'] = \mathrm{IE}(zy') - \mathrm{IE}(z)\,\mathrm{IE}(y'). \]
Two random vectors are uncorrelated if their covariance matrix is a zero matrix. If $y$ and $z$ are independent, their joint distribution function is the product of the individual distribution functions. It follows that the cross moment of $y$ and $z$ is the product of their individual first moments:
\[ \mathrm{IE}(zy') = \mathrm{IE}(z)\,\mathrm{IE}(y'). \]
This shows that independence implies $\operatorname{cov}(z, y) = 0$. Uncorrelated random vectors are not necessarily independent, however.

Based on the properties of variance and covariance for random variables, we have the following result for random vectors.
Lemma 2.4 Let $A$ ($n \times d$), $B$ ($n \times c$), and $C$ ($m \times c$) be non-stochastic matrices and $b$ an $n$-dimensional non-stochastic vector. Then for any square integrable random vectors $z$ ($d \times 1$) and $y$ ($c \times 1$),
\[ \operatorname{var}(Az + By) = A\operatorname{var}(z)A' + B\operatorname{var}(y)B' + 2A\operatorname{cov}(z, y)B', \]
\[ \operatorname{var}(Az + b) = \operatorname{var}(Az) = A\operatorname{var}(z)A'. \]
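Lemma 2.4 can be illustrated by simulation. A minimal sketch that compares the sample covariance matrix of $Az + b$ with $A\operatorname{var}(z)A'$ (the covariance matrix, sample size, and tolerance below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, T = 3, 2, 200_000

Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
A = rng.normal(size=(n, d))
b = np.array([1.0, -2.0])

z = rng.multivariate_normal(mean=np.zeros(d), cov=Sigma, size=T)
w = z @ A.T + b                          # each row is A z_t + b

sample_var = np.cov(w, rowvar=False)
theory_var = A @ Sigma @ A.T

# the two matrices agree up to simulation error
assert np.allclose(sample_var, theory_var, atol=0.1)
```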
Given two square integrable random vectors $z$ and $y$, suppose that $\operatorname{var}(y)$ is positive definite. As the variance-covariance matrix of $(z' \;\; y')'$ must be a positive semi-definite matrix,
\[
\begin{bmatrix} I & -\operatorname{cov}(z, y)\operatorname{var}(y)^{-1} \end{bmatrix}
\begin{bmatrix} \operatorname{var}(z) & \operatorname{cov}(z, y) \\ \operatorname{cov}(y, z) & \operatorname{var}(y) \end{bmatrix}
\begin{bmatrix} I \\ -\operatorname{var}(y)^{-1}\operatorname{cov}(y, z) \end{bmatrix}
= \operatorname{var}(z) - \operatorname{cov}(z, y)\operatorname{var}(y)^{-1}\operatorname{cov}(y, z)
\]
is also a positive semi-definite matrix. This establishes the multivariate version of the Cauchy-Schwarz inequality for square integrable random vectors.

Lemma 2.5 (Cauchy-Schwarz) Let $y$, $z$ be two square integrable random vectors. Then,
\[ \operatorname{var}(z) - \operatorname{cov}(z, y)\operatorname{var}(y)^{-1}\operatorname{cov}(y, z) \]
is a positive semi-definite matrix.
A random vector is said to be degenerate (have a singular distribution) if its variance-covariance matrix is singular. Let $\Sigma$ be the variance-covariance matrix of the $d$-dimensional random vector $z$. If $\Sigma$ is singular, then there exists a non-zero vector $c$ such that $\Sigma c = 0$. For this particular $c$, we have
\[ c'\Sigma c = \mathrm{IE}[c'(z - \mathrm{IE}(z))]^2 = 0. \]
It follows that $c'[z - \mathrm{IE}(z)] = 0$ with probability one; i.e., the elements of $z$ are linearly dependent with probability one. This implies that all the probability mass of $z$ is concentrated in a subspace of dimension less than $d$.
2.3 Special Distributions
In this section we discuss the multivariate normal (Gaussian) distribution and other univariate distributions such as the chi-square, Student's t, and Fisher's F distributions.
A random vector $z$ is said to have a multivariate normal distribution with mean $\mu$ and variance-covariance matrix $\Sigma$, denoted as $z \sim N(\mu, \Sigma)$, if it has the density function
\[ \frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}} \exp\!\Big(-\frac{1}{2}(z - \mu)'\Sigma^{-1}(z - \mu)\Big). \]
For $d = 1$, this is just the density of the univariate normal random variable. Note that the multivariate normal density function is completely characterized by its mean vector and variance-covariance matrix. A normal random variable has moments of all orders; in particular, its even-order central moments are
\[ \mathrm{IE}(z - \mu)^k = (k-1)(k-3)\cdots 3 \cdot 1 \cdot \operatorname{var}(z)^{k/2}, \qquad k \geq 2 \text{ and } k \text{ even}, \]
and its odd-order central moments are all zeros. A normal random variable with mean zero and variance one is usually called the standard normal random variable.
When $\Sigma$ is a diagonal matrix with diagonal elements $\sigma_{ii}$, $i = 1, \ldots, d$, the elements of $z$ are uncorrelated. Note that for normal random variables, uncorrelatedness implies independence. In this case, the density function is simply the product of the marginal density functions of $z_1, \ldots, z_d$:
\[ \frac{1}{(2\pi)^{d/2}\big(\prod_{i=1}^{d}\sigma_{ii}\big)^{1/2}} \exp\!\Big(-\frac{1}{2}\sum_{i=1}^{d}\frac{(z_i - \mu_i)^2}{\sigma_{ii}}\Big). \]
When $\sigma_{ii} = \sigma_o^2$, a constant, this joint density simplifies to
\[ \frac{1}{(2\pi\sigma_o^2)^{d/2}} \exp\!\Big(-\frac{1}{2\sigma_o^2}\sum_{i=1}^{d}(z_i - \mu_i)^2\Big). \]
Although uncorrelated normal random variables are also independent, we stress again that this need not be true for other random variables.
The result below shows that proper linear transformations of normal random vectors remain normally distributed.

Lemma 2.6 Let $z$ be a $d$-dimensional random vector distributed as $N(\mu, \Sigma)$. Also let $A$ be an $n \times d$ non-stochastic matrix with full row rank $n < d$ and $b$ an $n$-dimensional non-stochastic vector. Then,
\[ Az + b \sim N(A\mu + b,\; A\Sigma A'). \]
Lemma 2.6 implies that, when $z \sim N(\mu, \Sigma)$, any sub-vector (element) of $z$ also has a multivariate (univariate) normal distribution; the converse need not be true, however. It is also easily seen that
\[ \Sigma^{-1/2}(z - \mu) \sim N(0, I_d), \]
where $\Sigma^{-1/2}$ is such that $\Sigma^{-1/2}\Sigma\Sigma^{-1/2} = I$, as defined in Section 1.7. Proper standardization of a normal random vector thus yields a normal random vector with independent elements. If $A$ is not of full row rank, $\operatorname{var}(Az) = A\Sigma A'$ does not have full rank, so that $Az$ is degenerate.
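A minimal simulation sketch of this standardization, using the symmetric square root construction from Section 1.7 (the mean vector, covariance matrix, sample size, and tolerances are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
mu = np.array([1.0, -1.0, 2.0])
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])

# symmetric Sigma^{-1/2} via the spectral decomposition
eigvals, C = np.linalg.eigh(Sigma)
Sigma_neg_half = C @ np.diag(1 / np.sqrt(eigvals)) @ C.T

T = 200_000
z = rng.multivariate_normal(mu, Sigma, size=T)
u = (z - mu) @ Sigma_neg_half.T          # each row is Sigma^{-1/2}(z_t - mu)

assert np.allclose(np.cov(u, rowvar=False), np.eye(3), atol=0.02)
assert np.allclose(u.mean(axis=0), np.zeros(3), atol=0.02)
```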
Let $z \sim N(\mu, I_d)$. The sum of squares of the elements of $z$ is the non-central chi-square random variable with $d$ degrees of freedom and the non-centrality parameter $\nu = \mu'\mu$, denoted as
\[ z'z \sim \chi^2(d; \nu). \]
The density function of $\chi^2(d; \nu)$ is
\[ f(x) = \exp\!\Big(-\frac{\nu + x}{2}\Big)\, x^{d/2 - 1}\, \frac{1}{2^{d/2}} \sum_{i=0}^{\infty} \frac{x^i \nu^i}{i!\; 2^{2i}\,\Gamma(i + d/2)}, \qquad x > 0, \]
where $\Gamma$ is the gamma function with
\[ \Gamma(n) = \int_0^{\infty} e^{-x} x^{n-1}\, dx. \]
It can be shown that a $\chi^2(d; \nu)$ random variable has mean $(d + \nu)$ and variance $2d + 4\nu$. When $\mu = 0$, the non-centrality parameter $\nu = 0$, and $\chi^2(d; 0)$ is known as the central chi-square random variable, denoted as $\chi^2(d)$. The density of $\chi^2(d)$ is
\[ f(x) = \exp\!\Big(-\frac{x}{2}\Big)\, x^{d/2 - 1}\, \frac{1}{2^{d/2}\,\Gamma(d/2)}, \qquad x > 0, \]
with mean $d$ and variance $2d$. The result below follows directly from Lemma 2.6.

Lemma 2.7 Let $z$ be a $d$-dimensional random vector distributed as $N(\mu, \Sigma)$. Then,
\[ z'\Sigma^{-1}z \sim \chi^2(d;\; \mu'\Sigma^{-1}\mu); \]
in particular, if $\mu = 0$, $z'\Sigma^{-1}z \sim \chi^2(d)$.
x
2
(n). Then
w
_
x/n
t(n; ),
the non-central t distribution with n degrees of freedom and the non-centrality param-
eter . The density function of t(n; ) is
f(x) =
n
n/2
exp(
2
/2)
(n/2)(1/2)(n +x
2
)
(n+1)/2

i=0

_
n +i + 1
2
_

i
i!
_
2x
2
n +x
2
_
i/2
(sign x)
i
.
c _ Chung-Ming Kuan, 2001
26 CHAPTER 2. STATISTICAL CONCEPTS
When = 0, t(n; ) reduces to the central t distribution, denoted as t(n), which has
the density
f(x) =
((n + 1)/2)
(n/2)(1/2)n
1/2
_
1 +
x
2
n
_
(n+1)/2
.
Note that a t(n) random variable is symmetric about zero, and its k
th
moment exists
only for k < n; when n > 2, its mean is zero and variance is n/(n 2).
As $n$ tends to infinity, it can be seen that
\[ \Big(1 + \frac{x^2}{n}\Big)^{-(n+1)/2} = \bigg[\Big(1 + \frac{x^2}{n}\Big)^{n/x^2}\bigg]^{-x^2/2}\Big(1 + \frac{x^2}{n}\Big)^{-1/2} \longrightarrow \exp(-x^2/2). \]
Also note that $\Gamma(1/2) = \pi^{1/2}$ and that for large $n$,
\[ \frac{\Gamma((n+1)/2)}{\Gamma(n/2)} \approx (n/2)^{1/2}. \]
Thus, when $n$ tends to infinity, the density of $t(n)$ converges to
\[ \frac{1}{\sqrt{2\pi}}\exp(-x^2/2), \]
the density of the standard normal random variable. When $n = 1$, the density of $t(1)$ becomes
\[ f(x) = \frac{1}{\pi(1 + x^2)}. \]
This is also the density of the Cauchy random variable with the location parameter 0. Note that the Cauchy random variable does not even have the first moment.
Let $z_1$ and $z_2$ be two independent random variables such that $z_1 \sim \chi^2(n_1; \nu_1)$ and $z_2 \sim \chi^2(n_2; \nu_2)$. Then,
\[ \frac{z_1/n_1}{z_2/n_2} \sim F(n_1, n_2; \nu_1, \nu_2), \]
the non-central F distribution with the degrees of freedom $n_1$ and $n_2$ and the non-centrality parameters $\nu_1$ and $\nu_2$. The $k$-th moment of $F(n_1, n_2; \nu_1, \nu_2)$ exists when $k < n_2/2$. In many statistical applications we usually encounter $F(n_1, n_2; \nu_1, 0)$. When $n_2 > 2$, the mean of $F(n_1, n_2; \nu_1, 0)$ is
\[ \frac{n_2(n_1 + \nu_1)}{n_1(n_2 - 2)}; \]
when $n_2 > 4$, the variance is
\[ 2\Big(\frac{n_2}{n_1}\Big)^2 \frac{(n_1 + \nu_1)^2 + (n_1 + 2\nu_1)(n_2 - 2)}{(n_2 - 2)^2(n_2 - 4)}. \]
If both $\nu_1$ and $\nu_2$ are zero, we have the central F distribution $F(n_1, n_2)$. When $n_2 > 2$, $F(n_1, n_2)$ has mean $n_2/(n_2 - 2)$; when $n_2 > 4$, it has variance
\[ \frac{2 n_2^2 (n_1 + n_2 - 2)}{n_1(n_2 - 2)^2(n_2 - 4)}. \]
Note that if a random variable is distributed as $t(n)$, its square has the $F(1, n)$ distribution.
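The link between the $t$ and $F$ distributions can also be seen by simulation. A minimal sketch that builds $t(n)$ draws from their definition and compares the squared draws with direct $F(1, n)$ draws (the degrees of freedom, sample size, and tolerances are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(10)
n, T = 6, 500_000

# t(n) draws from the definition: N(0,1) / sqrt(chi2(n)/n)
w = rng.standard_normal(T)
x = rng.chisquare(n, size=T)
t_draws = w / np.sqrt(x / n)

f_draws = rng.f(1, n, size=T)            # direct F(1, n) draws

# t(n)^2 and F(1, n) share the same mean n/(n-2) for n > 2
assert np.isclose((t_draws ** 2).mean(), n / (n - 2), atol=0.05)
assert np.isclose(f_draws.mean(), n / (n - 2), atol=0.05)

# their empirical quantiles should agree closely
qs = [0.25, 0.5, 0.75]
assert np.allclose(np.quantile(t_draws ** 2, qs), np.quantile(f_draws, qs), atol=0.05)
```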
2.4 Likelihood
Suppose that we postulate $p$ as the joint probability function of the discrete random variables $z_1, \ldots, z_T$ with the parameter vector $\theta$. Plugging the observed values $\zeta_1, \ldots, \zeta_T$ into $p$ we then obtain a function of $\theta$:
\[ L(\theta) := p(\zeta_1, \ldots, \zeta_T; \theta). \]
This function represents the probability (likelihood) that those observed values are generated from the postulated probability function $p$; different parameter values of course result in different probability values. Thus, $L(\theta)$ is also known as the likelihood function of $\theta$.
Similarly, let $f$ denote the postulated joint density function of the random vectors $z_1, \ldots, z_T$ with the parameter vector $\theta$. Then given the observed values $\zeta_1, \ldots, \zeta_T$, the likelihood function of $\theta$ is
\[ L(\theta) := f(\zeta_1, \ldots, \zeta_T; \theta). \]
In what follows, we will use $L$ and $f$ interchangeably. Note, however, that a postulated density function need not be the true density function that generates the random variables.
When $f$ is differentiable and non-zero with probability one, the gradient vector of $\log L(\theta)$,
\[ \nabla_{\theta}\log L(\theta) = \frac{1}{L(\theta)}\nabla_{\theta}L(\theta), \]
is known as the score vector, denoted as $s(\zeta_1, \ldots, \zeta_T; \theta)$. We can then write
\[ s(\zeta_1, \ldots, \zeta_T; \theta)\, f(\zeta_1, \ldots, \zeta_T; \theta) = \nabla_{\theta} f(\zeta_1, \ldots, \zeta_T; \theta). \]
For a given $\theta$, the score vector may vary with the observed values $\zeta_1, \ldots, \zeta_T$. Thus, we can also treat the score vector as a random vector and denote it as $s(z_1, \ldots, z_T; \theta)$.
When differentiation and integration can be interchanged,
\[
\begin{aligned}
\int_{\mathbb{R}^d}\!\!\cdots\!\int_{\mathbb{R}^d} s(\zeta_1, \ldots, \zeta_T; \theta)\, f(\zeta_1, \ldots, \zeta_T; \theta)\, d\zeta_1 \cdots d\zeta_T
&= \int_{\mathbb{R}^d}\!\!\cdots\!\int_{\mathbb{R}^d} \nabla_{\theta} f(\zeta_1, \ldots, \zeta_T; \theta)\, d\zeta_1 \cdots d\zeta_T \\
&= \nabla_{\theta}\Big(\int_{\mathbb{R}^d}\!\!\cdots\!\int_{\mathbb{R}^d} f(\zeta_1, \ldots, \zeta_T; \theta)\, d\zeta_1 \cdots d\zeta_T\Big) \\
&= \nabla_{\theta}\, 1 = 0.
\end{aligned}
\]
The left-hand side is in effect the expectation of the score vector with respect to $f$. If there exists $\theta_o$ such that $f(\zeta_1, \ldots, \zeta_T; \theta_o)$ is the true density function, we immediately obtain the following result.
Lemma 2.8 Suppose there exists $\theta_o$ such that $f(\zeta_1, \ldots, \zeta_T; \theta_o)$ is the joint density function of the random vectors $z_1, \ldots, z_T$. Then under regularity conditions,
\[ \mathrm{IE}[s(z_1, \ldots, z_T; \theta_o)] = 0, \]
where $s(z_1, \ldots, z_T; \theta_o)$ is the score evaluated at $\theta_o$, and $\mathrm{IE}$ is taken with respect to the true density function.
Remark: Lemma 2.8 requires the conditions that ensure differentiability of the likelihood function and interchangeability of differentiation and integration. We do not give those conditions explicitly; see e.g., Amemiya (1985) for some sufficient conditions. This comment also applies to Lemma 2.9.
It is easy to see that the Hessian matrix of the log-likelihood function is
\[ \nabla^2_{\theta}\log L(\theta) = \frac{1}{L(\theta)}\nabla^2_{\theta}L(\theta) - \frac{1}{L(\theta)^2}[\nabla_{\theta}L(\theta)][\nabla_{\theta}L(\theta)]', \]
where the second term is just the outer product of the score vector, and
\[
\begin{aligned}
\int_{\mathbb{R}^d}\!\!\cdots\!\int_{\mathbb{R}^d} \frac{1}{L(\theta)}\nabla^2_{\theta}L(\theta)\, f(\zeta_1, \ldots, \zeta_T; \theta)\, d\zeta_1 \cdots d\zeta_T
&= \int_{\mathbb{R}^d}\!\!\cdots\!\int_{\mathbb{R}^d} \nabla^2_{\theta} f(\zeta_1, \ldots, \zeta_T; \theta)\, d\zeta_1 \cdots d\zeta_T \\
&= \nabla^2_{\theta}\Big(\int_{\mathbb{R}^d}\!\!\cdots\!\int_{\mathbb{R}^d} f(\zeta_1, \ldots, \zeta_T; \theta)\, d\zeta_1 \cdots d\zeta_T\Big) \\
&= \nabla^2_{\theta}\, 1 = 0.
\end{aligned}
\]
It follows that
\[
\int_{\mathbb{R}^d}\!\!\cdots\!\int_{\mathbb{R}^d} \nabla^2_{\theta}\log L(\theta)\, f(\zeta_1, \ldots, \zeta_T; \theta)\, d\zeta_1 \cdots d\zeta_T
= -\int_{\mathbb{R}^d}\!\!\cdots\!\int_{\mathbb{R}^d} s(\zeta_1, \ldots, \zeta_T; \theta)\, s(\zeta_1, \ldots, \zeta_T; \theta)'\, f(\zeta_1, \ldots, \zeta_T; \theta)\, d\zeta_1 \cdots d\zeta_T,
\]
where the left-hand side is the expectation of the Hessian matrix and the right-hand side is the negative of the variance-covariance matrix of $s(z_1, \ldots, z_T; \theta)$, both with respect to the postulated density function $f$. If $f(\zeta_1, \ldots, \zeta_T; \theta_o)$ is the true density function, the variance-covariance matrix of $s(z_1, \ldots, z_T; \theta_o)$ is known as the information matrix. Together with Lemma 2.8, we have the so-called information matrix equality.
Lemma 2.9 Suppose there exists $\theta_o$ such that $f(\zeta_1, \ldots, \zeta_T; \theta_o)$ is the joint density function of the random vectors $z_1, \ldots, z_T$. Then under regularity conditions,
\[ \mathrm{IE}[\nabla^2_{\theta}\log L(\theta_o)] + \operatorname{var}(s(z_1, \ldots, z_T; \theta_o)) = 0, \]
where $\nabla^2_{\theta}\log L(\theta_o)$ is the Hessian matrix of $\log L$ evaluated at $\theta_o$, and $\mathrm{IE}$ and $\operatorname{var}$ are taken with respect to the true density function.
Remark: When $f$ is not the true density function, Lemmas 2.8 and 2.9 need not hold. That is, neither $\mathrm{IE}[s(z_1, \ldots, z_T; \theta)]$ nor
\[ \mathrm{IE}[\nabla^2_{\theta}\log L(\theta)] + \operatorname{var}(s(z_1, \ldots, z_T; \theta)) \]
is necessarily zero.
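As a concrete special case, for an i.i.d. $N(\theta, 1)$ sample the log-likelihood has score $s(\theta) = \sum_{t}(\zeta_t - \theta)$ and Hessian $-T$, so Lemma 2.8 and the information matrix equality reduce to $\mathrm{IE}[s(\theta_o)] = 0$ and $\operatorname{var}(s(\theta_o)) = T$. A minimal simulation sketch (the sample size, number of replications, and tolerances are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(11)
theta_o, T, R = 0.7, 50, 20_000     # true mean, sample size, Monte Carlo replications

def score(sample, theta):
    """Score of the N(theta, 1) log-likelihood: sum_t (zeta_t - theta)."""
    return np.sum(sample - theta)

samples = rng.normal(theta_o, 1.0, size=(R, T))
scores = np.array([score(s, theta_o) for s in samples])

# Lemma 2.8: the score evaluated at theta_o has mean zero
assert abs(scores.mean()) < 0.2

# information matrix equality: var(score) = T = -E[Hessian], since the Hessian is -T
assert np.isclose(scores.var(), T, rtol=0.05)
```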
2.5 Estimation
2.5.1 Point Estimation
Let $\theta_o$ denote a parameter vector associated with the joint distribution of the $T$ random vectors $z_1, \ldots, z_T$. A point estimator (or simply an estimator) for $\theta_o$ is a function of these random vectors:
\[ \hat{\theta} = h(z_1, \ldots, z_T), \]
where $h$ is some function. An estimator is clearly a random vector. Once the observed values of $z_1, \ldots, z_T$ are plugged into this function, we obtain a point estimate. That is, a point estimate is just a particular value that an estimator may assume.
A simple principle for constructing estimators of moments is known as analog estimation. This principle suggests estimating population moments by their finite-sample counterparts. For example, given a sample of $T$ random variables $z_1, \ldots, z_T$ with the common $k$-th moment $\mathrm{IE}(z_1^k)$, the analog estimator for $\mathrm{IE}(z_1^k)$ is simply the sample average of the $z_t^k$:
\[ \frac{1}{T}\sum_{t=1}^{T} z_t^k. \]
In particular, the sample mean $\bar{z}$ is the analog estimator for the population mean.
To estimate the parameter vector $\theta_o$, it is also natural to maximize the associated likelihood function $L(\theta)$. The resulting solution is known as the maximum likelihood estimator (MLE) for $\theta_o$, denoted as $\hat{\theta}$ or $\hat{\theta}_T$, where the subscript $T$ indicates that this is an estimator based on a sample of $T$ observations. As the maximum of a function is invariant with respect to monotonic transformations, it is quite common to compute the MLE by maximizing the log-likelihood function $\log L(\theta)$. It follows that the score vector evaluated at $\hat{\theta}$ must be zero; i.e., $s(\zeta_1, \ldots, \zeta_T; \hat{\theta}) = 0$.
2.5.2 Criteria for Point Estimators
Let $\hat{\theta}$ be an estimator for $\theta_o$. The difference $\mathrm{IE}(\hat{\theta}) - \theta_o$ is called the bias of $\hat{\theta}$. An estimator is said to be unbiased if it has zero bias, i.e.,
\[
\mathrm{IE}(\hat{\theta}) = \theta_o;
\]
otherwise, it is biased. Unbiasedness does not ensure that an estimate is close to the true parameter, however. In fact, it is even possible that all possible values of an unbiased estimator deviate from the true parameter by a constant.
Given two unbiased estimators, it is therefore natural to choose the one whose values are more concentrated around the true parameter. For real-valued unbiased estimators, this amounts to selecting an estimator with a smaller variance. If they are vector-valued, we adopt the following efficiency criterion. An unbiased estimator $\hat{\theta}_1$ is said to be better (more efficient) than an unbiased estimator $\hat{\theta}_2$ if
\[
\mathrm{var}(a'\hat{\theta}_2) \ge \mathrm{var}(a'\hat{\theta}_1)
\]
for all non-zero vectors a. This is equivalent to the condition that
\[
a'[\mathrm{var}(\hat{\theta}_2) - \mathrm{var}(\hat{\theta}_1)]a \ge 0,
\]
i.e., $\mathrm{var}(\hat{\theta}_2) - \mathrm{var}(\hat{\theta}_1)$ is a positive semi-definite matrix. Given a class of unbiased estimators, if one of them is better than all other estimators in that class, it is the best (most efficient) within this class.
More generally, we can compare estimators based on mean squared error (MSE):
\[
\mathrm{IE}[(\hat{\theta} - \theta_o)(\hat{\theta} - \theta_o)']
= \mathrm{IE}[(\hat{\theta} - \mathrm{IE}(\hat{\theta}) + \mathrm{IE}(\hat{\theta}) - \theta_o)(\hat{\theta} - \mathrm{IE}(\hat{\theta}) + \mathrm{IE}(\hat{\theta}) - \theta_o)']
= \mathrm{var}(\hat{\theta}) + [\mathrm{IE}(\hat{\theta}) - \theta_o][\mathrm{IE}(\hat{\theta}) - \theta_o]',
\]
where the second term is the outer product of the bias vector. An estimator $\hat{\theta}_1$ (not necessarily unbiased) is said to be better (more efficient) than $\hat{\theta}_2$ if $\mathrm{MSE}(\hat{\theta}_2) - \mathrm{MSE}(\hat{\theta}_1)$ is a positive semi-definite matrix. Clearly, the MSE criterion reduces to the previous variance-based criterion when estimators are unbiased.
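As a purely illustrative aside (not in the original text), the following Python sketch uses simulated normal data to compare the unbiased sample mean with a hypothetical biased "shrinkage" estimator, and checks numerically that the scalar MSE equals variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_o, T, reps = 1.0, 20, 50_000

# Two estimators of theta_o: the sample mean and a biased shrinkage estimator.
draws = rng.normal(theta_o, 1.0, size=(reps, T))
mean_hat = draws.mean(axis=1)
shrunk_hat = 0.8 * mean_hat          # hypothetical biased alternative

for name, est in [("sample mean", mean_hat), ("shrunken mean", shrunk_hat)]:
    bias = est.mean() - theta_o
    var = est.var()
    mse = np.mean((est - theta_o) ** 2)
    # mse should be close to var + bias**2
    print(f"{name}: bias {bias:.4f}, var {var:.4f}, MSE {mse:.4f}, var+bias^2 {var + bias**2:.4f}")
```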
The following result shows that the inverse of the information matrix is a lower bound, also known as the Cramér-Rao lower bound, for the variance-covariance matrix of any unbiased estimator.
Lemma 2.10 (Cramér-Rao) Suppose there exists $\theta_o$ such that $f(\zeta_1, \ldots, \zeta_T; \theta_o)$ is the joint density function of the random vectors $z_1, \ldots, z_T$, and let $\hat{\theta}$ denote an unbiased estimator for $\theta$ based on these random vectors. If $\mathrm{var}(s(z_1, \ldots, z_T; \theta_o))$ is positive definite, then
\[
\mathrm{var}(\hat{\theta}) - \mathrm{var}(s(z_1, \ldots, z_T; \theta_o))^{-1}
\]
is a positive semi-definite matrix.
Proof: We first note that for any unbiased estimator $\hat{\theta}$ for $\theta$,
\[
\int_{\mathbb{R}^d} \cdots \int_{\mathbb{R}^d} (\hat{\theta} - \theta)\, s(\zeta_1, \ldots, \zeta_T; \theta)' \, f(\zeta_1, \ldots, \zeta_T; \theta) \, d\zeta_1 \cdots d\zeta_T
= \int_{\mathbb{R}^d} \cdots \int_{\mathbb{R}^d} \hat{\theta}\, s(\zeta_1, \ldots, \zeta_T; \theta)' \, f(\zeta_1, \ldots, \zeta_T; \theta) \, d\zeta_1 \cdots d\zeta_T
= \nabla_{\theta} \left( \int_{\mathbb{R}^d} \cdots \int_{\mathbb{R}^d} \hat{\theta}\, f(\zeta_1, \ldots, \zeta_T; \theta) \, d\zeta_1 \cdots d\zeta_T \right)
= \nabla_{\theta}\, \theta = I,
\]
where the third equality holds because $\hat{\theta}$ is unbiased for $\theta$ when $f(\zeta_1, \ldots, \zeta_T; \theta)$ is the associated density function. Thus,
\[
\mathrm{cov}(\hat{\theta}, s(z_1, \ldots, z_T; \theta_o)) = I.
\]
The assertion now follows from Lemma 2.5, the multivariate version of the Cauchy-Schwarz inequality.
By Lemma 2.10, an unbiased estimator is the best if its variance-covariance matrix achieves the Cramér-Rao lower bound; this is not a necessary condition, however.
2.5.3 Interval Estimation
While a point estimate is a particular value representing the unknown parameter, interval estimation results in a range of values that may contain the unknown parameter with certain probability.
Suppose that there is an estimate $\hat{\theta}$ for the true parameter $\theta_o$ and a function $q(\hat{\theta}, \theta_o)$ whose distribution is known. Then, given a probability value $\gamma$, we can find suitable values a and b such that
\[
\mathrm{IP}\{a < q(\hat{\theta}, \theta_o) < b\} = \gamma.
\]
Solving the inequality above for $\theta_o$ we may obtain an interval containing $\theta_o$. This leads to the probability statement:
\[
\mathrm{IP}\{\underline{\theta} < \theta_o < \overline{\theta}\} = \gamma,
\]
where $\underline{\theta}$ and $\overline{\theta}$ depend on a, b, and $\hat{\theta}$. We can then conclude that we are $100\gamma$ percent sure that the interval $(\underline{\theta}, \overline{\theta})$ contains $\theta_o$. Here, $\gamma$ is the confidence coefficient, and $(\underline{\theta}, \overline{\theta})$ is the associated confidence interval for $\theta_o$. Given the estimate $\hat{\theta}$, it is easily seen that the larger the value of $\gamma$, the wider is the associated confidence interval.
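A minimal numerical illustration of this construction, assuming Python with NumPy/SciPy and the simple case $q(\hat{\theta}, \theta_o) = \sqrt{T}(\bar{z} - \mu_o) \sim N(0, 1)$ with known unit variance (the data below are simulated and hypothetical, not from the text):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
T, mu_o = 100, 0.3                      # hypothetical true mean
z = rng.normal(mu_o, 1.0, size=T)       # variance assumed known and equal to 1

gamma = 0.95                            # confidence coefficient
c = norm.ppf(0.5 + gamma / 2)           # IP(-c < N(0,1) < c) = gamma
z_bar = z.mean()
lower, upper = z_bar - c / np.sqrt(T), z_bar + c / np.sqrt(T)
print(f"{gamma:.0%} confidence interval for mu_o: ({lower:.3f}, {upper:.3f})")
```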
Let $A_1$ denote the event that a confidence interval contains $\theta_1$ and $A_2$ the event that a confidence interval contains $\theta_2$. The intersection $A = A_1 \cap A_2$ is thus the event that a confidence "box" covers both parameters. When $A_1$ and $A_2$ are independent such that $\mathrm{IP}(A_1) = \mathrm{IP}(A_2) = \gamma$, we have $\mathrm{IP}(A) = \gamma^2$. When these two events are not independent (e.g., the parameter estimators of $\theta_1$ and $\theta_2$ are correlated), it becomes difficult to determine $\mathrm{IP}(A)$. As such, finding a proper confidence "box" based on individual confidence intervals is by no means an easy job. On the other hand, if a function $q(\hat{\theta}_1, \hat{\theta}_2, \theta_1, \theta_2)$ with a known distribution is available, we can, for a given $\gamma$, find the values a and b such that
\[
\mathrm{IP}\{a < q(\hat{\theta}_1, \hat{\theta}_2, \theta_1, \theta_2) < b\} = \gamma.
\]
By solving the inequality above for $\theta_1$ and $\theta_2$ we may obtain a confidence region in which the point $(\theta_1, \theta_2)$ lies with probability $\gamma$.
2.6 Hypothesis Testing
2.6.1 Basic Concepts
Given a sample of data, it is often desirable to check whether certain characteristics of the underlying random mechanism (population) are supported by these data. For this purpose, a hypothesis about these characteristics must be specified, and a test is constructed so as to generate a rule of rejecting or accepting (not rejecting) the postulated hypothesis. The hypothesis being tested is called the null hypothesis, denoted as $H_0$; the other states or values of the characteristics of interest form an alternative hypothesis, denoted as $H_1$. Hypotheses are usually formulated in terms of the parameters of models. For example, one may specify that $H_0\colon \theta_o = a$ for some a and $H_1\colon \theta_o \neq a$. Here, $H_0$ is a simple hypothesis in the sense that the parameter vector being tested takes a single value, but $H_1$ is a composite hypothesis in that the parameter vector may take more than one value. Given a sample of random variables $z_1, \ldots, z_T$, a test statistic is a function of these random variables, denoted as $\mathcal{T}(z_1, \ldots, z_T)$. The critical region C of $\mathcal{T}(z_1, \ldots, z_T)$ is the range of its possible values that lead to rejection of the null hypothesis. In what follows, the set
\[
\Gamma = \{(\zeta_1, \ldots, \zeta_T)\colon \mathcal{T}(\zeta_1, \ldots, \zeta_T) \in C\}
\]
will also be referred to as the critical region of $\mathcal{T}$. The complement of the critical region, $C^c$, is the region containing the values of $\mathcal{T}(z_1, \ldots, z_T)$ that lead to acceptance of the null hypothesis. We can also define
\[
\Gamma^c = \{(\zeta_1, \ldots, \zeta_T)\colon \mathcal{T}(\zeta_1, \ldots, \zeta_T) \in C^c\}
\]
as the acceptance region of $\mathcal{T}$.
A test may yield incorrect inferences. A test is said to commit the type I error if it rejects the null hypothesis when the null hypothesis is in fact true; a test is said to commit the type II error if it accepts the null hypothesis when the alternative hypothesis is true. Suppose that we are interested in testing $H_0\colon \theta_o = a$ against $H_1\colon \theta_o = b$. Let $\mathrm{IP}_0$ be the probability when $\theta_o = a$ and $\mathrm{IP}_1$ the probability when $\theta_o = b$. The probability of the type I error is then
\[
\alpha = \mathrm{IP}_0((z_1, \ldots, z_T) \in \Gamma) = \int_{\Gamma} f_0(\zeta_1, \ldots, \zeta_T; a) \, d\zeta_1 \cdots d\zeta_T,
\]
where $f_0(z_1, \ldots, z_T; a)$ is the joint density with the parameter $\theta_o = a$. The value $\alpha$ is also known as the size or significance level of the test. The probability of the type II error is
\[
\beta = \mathrm{IP}_1((z_1, \ldots, z_T) \in \Gamma^c) = \int_{\Gamma^c} f_1(\zeta_1, \ldots, \zeta_T; b) \, d\zeta_1 \cdots d\zeta_T,
\]
where $f_1(z_1, \ldots, z_T; b)$ is the joint density with the parameter $\theta_o = b$. Clearly, $\alpha$ decreases when the critical region $\Gamma$ is smaller; in the mean time, $\beta$ increases due to a larger $\Gamma^c$. Thus, there is usually a trade-off between these two error probabilities.
Note, however, that the probability of the type II error cannot be defined as above when the alternative hypothesis is composite: $\theta_o \in \Theta_1$, where $\Theta_1$ is a set of parameter values in the parameter space. Consider now the probability $1 - \mathrm{IP}_1(\Gamma^c) = \mathrm{IP}_1(\Gamma)$, which is the probability of rejecting the null hypothesis when $H_1$ is true. Thus, both $\mathrm{IP}_0(\Gamma)$ and $\mathrm{IP}_1(\Gamma)$ are the probabilities of rejecting the null hypothesis under two different parameter values. More generally, define the power function of the test as
\[
\pi(\theta_o) = \mathrm{IP}_{\theta_o}((z_1, \ldots, z_T) \in \Gamma),
\]
where $\theta_o$ varies in the parameter space. In particular, $\pi(a) = \alpha$. For $\theta_o \in \Theta_1$, $\pi(\theta_o)$ describes the ability of a test to correctly detect the falsity of the null hypothesis; these probabilities are also referred to as the powers of the test. The probability of the type II error under the composite alternative hypothesis $\theta_o \in \Theta_1$ can now be defined as
\[
\beta = \max_{\theta_o \in \Theta_1} [1 - \pi(\theta_o)].
\]
2.6.2 Construction of Tests
Given the null hypothesis $\theta_o = a$, the test statistic $\mathcal{T}(z_1, \ldots, z_T)$ is usually based on the comparison of an estimator of $\theta_o$ and the hypothesized value a. This statistic must have a known distribution under the null hypothesis, which will be referred to as the null distribution.
Given the statistic $\mathcal{T}(z_1, \ldots, z_T)$, the probability $\mathrm{IP}_0(\mathcal{T}(z_1, \ldots, z_T) \in C)$ can be determined by the null distribution of $\mathcal{T}$. If this probability is small, the event that $\mathcal{T}(z_1, \ldots, z_T) \in C$ would be considered "unlikely" or "improbable" under the null hypothesis, while the event that $\mathcal{T}(z_1, \ldots, z_T) \in C^c$ would be considered "likely" or "probable". If the former event does occur (i.e., for data $z_1 = \zeta_1, \ldots, z_T = \zeta_T$, $\mathcal{T}(\zeta_1, \ldots, \zeta_T)$ falls in C), it constitutes evidence against the null hypothesis, so that the null hypothesis is rejected; otherwise, we accept (do not reject) the null hypothesis. Therefore, one should specify a small significance level $\alpha$ and determine the associated critical region C by
\[
\alpha = \mathrm{IP}_0\{\mathcal{T}(z_1, \ldots, z_T) \in C\}.
\]
As such, we shall write the critical region for the significance level $\alpha$ as $C_\alpha$. This approach ensures that, even though the decision of rejection might be wrong, the probability of making the type I error is no greater than $\alpha$. A test statistic is said to be significant if it is in the critical region; otherwise, it is insignificant.
Another approach is to reject the null hypothesis if
\[
\mathrm{IP}_0\{v\colon v > \mathcal{T}(\zeta_1, \ldots, \zeta_T)\}
\]
is small. This probability is the tail probability of the null distribution and is also known as the p-value of the statistic $\mathcal{T}$. Although this approach does not require specifying the critical region, it is virtually the same as the previous approach.
The rationale of our test decision is that the null hypothesis is rejected because the test statistic takes an unlikely value. It is then natural to expect that the calculated statistic is relatively more likely under the alternative hypothesis. Given the null hypothesis $\theta_o = a$ and the alternative hypothesis $\theta_o \in \Theta_1$, we would like to have a test such that
\[
\pi(a) \le \pi(\theta_o), \qquad \theta_o \in \Theta_1.
\]
A test is said to be unbiased if its size is no greater than its powers under the alternative hypothesis. Moreover, we would like to have a test that can detect the falsity of the
null hypothesis with probability approaching one when there is sufficient information. That is, for every $\theta_o \in \Theta_1$,
\[
\pi(\theta_o) = \mathrm{IP}_{\theta_o}\{\mathcal{T}(z_1, \ldots, z_T) \in C\} \to 1,
\]
as $T \to \infty$. A test is said to be consistent if its power approaches one when the sample size becomes infinitely large.
Example 2.11 Given a sample of i.i.d. normal random variables $z_1, \ldots, z_T$ with mean $\mu_o$ and variance one, we would like to test the null hypothesis $\mu_o = 0$. A natural estimator for $\mu_o$ is the sample average $\bar{z} = T^{-1}\sum_{t=1}^{T} z_t$. It is well known that
\[
\sqrt{T}(\bar{z} - \mu_o) \sim N(0, 1).
\]
Hence, $\sqrt{T}\bar{z} \sim N(0, 1)$ under the null hypothesis; that is, the null distribution of the statistic $\sqrt{T}\bar{z}$ is the standard normal distribution. Given the significance level $\alpha$, we can determine the critical region $C_\alpha$ using
\[
\alpha = \mathrm{IP}_0(\sqrt{T}\bar{z} \in C_\alpha).
\]
Let $\Phi$ denote the distribution function of the standard normal random variable. For $\alpha = 0.05$, we know
\[
0.05 = \mathrm{IP}_0(\sqrt{T}\bar{z} > 1.645) = 1 - \Phi(1.645).
\]
The critical region is then $(1.645, \infty)$; the null hypothesis is rejected if the calculated statistic falls in this interval. When the null hypothesis is false, the distribution of $\sqrt{T}\bar{z}$ is no longer $N(0, 1)$ but is $N(\sqrt{T}\mu_o, 1)$. Suppose that $\mu_o > 0$. Then,
\[
\mathrm{IP}_1(\sqrt{T}\bar{z} > 1.645) = \mathrm{IP}_1(\sqrt{T}(\bar{z} - \mu_o) > 1.645 - \sqrt{T}\mu_o).
\]
Since $\sqrt{T}(\bar{z} - \mu_o) \sim N(0, 1)$ under the alternative hypothesis, we have the power:
\[
\mathrm{IP}_1(\sqrt{T}\bar{z} > 1.645) = 1 - \Phi(1.645 - \sqrt{T}\mu_o).
\]
Given that $\mu_o > 0$, this probability must be greater than the test size (0.05), so that the test is unbiased. On the other hand, when T increases, $1.645 - \sqrt{T}\mu_o$ becomes even smaller, so that the power improves. When T tends to infinity, the power approaches one, so that $\sqrt{T}\bar{z}$ is a consistent test.
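The power calculation in Example 2.11 can be checked numerically. The sketch below (an illustration added here, not part of the original text; it assumes NumPy/SciPy and simulated data) estimates the rejection frequency of the test that rejects when $\sqrt{T}\bar{z} > 1.645$ and compares it with the exact power $1 - \Phi(1.645 - \sqrt{T}\mu_o)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
reps = 20_000
for T in (25, 100, 400):
    for mu_o in (0.0, 0.1, 0.3):
        # simulate reps samples of size T and compute the test statistic
        z_bar = rng.normal(mu_o, 1.0, size=(reps, T)).mean(axis=1)
        power_mc = np.mean(np.sqrt(T) * z_bar > 1.645)
        power_exact = 1 - norm.cdf(1.645 - np.sqrt(T) * mu_o)
        print(f"T={T:4d}  mu_o={mu_o:.1f}  simulated {power_mc:.3f}  exact {power_exact:.3f}")
```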
References
Amemiya, Takeshi (1985). Advanced Econometrics, Cambridge, MA: Harvard University Press.
Amemiya, Takeshi (1994). Introduction to Statistics and Econometrics, Cambridge, MA: Harvard University Press.
Rudin, Walter (1976). Principles of Mathematical Analysis, third edition, New York, NY: McGraw-Hill.
Chapter 3
Classical Least Squares Theory
3.1 Introduction
Economists have proposed numerous hypotheses and theories in order to describe the
behavior of economic agents and the relationships between economic variables. Al-
though these propositions may be theoretically appealing and logically correct, they
need not be practically relevant unless they are supported by real world data. A theory
with empirical evidence is of course more convincing. Therefore, empirical analysis has
become an indispensable ingredient of contemporary economic research. By economet-
rics we mean the statistical and mathematical methods that can be used to analyze
empirical relationships between economic variables.
A leading approach in econometrics is regression analysis, in which a regression model of a collection of explanatory variables is specified to characterize the behavior of the variable of interest. The simplest and most commonly used specification is the linear model. Once a linear model is specified, it remains to estimate unknown model
parameters, test economic and econometric hypotheses, and draw inferences from these
results. This chapter is concerned with the most important estimation method in lin-
ear regression, the method of ordinary least squares. Readers can also nd related
topics in many econometrics textbooks, e.g., Davidson and MacKinnon (1993), Gold-
berger (1991), Greene (2000), Harvey (1990), Intriligator et al. (1996), Johnston (1984),
Judge et al. (1988), Maddala (1992), Ruud (2000), and Theil (1971), among others.
3.2 The Method of Ordinary Least Squares
Suppose that there is a variable, y, whose behavior over time (or across individual units) is of interest to us. A theory may suggest that the behavior of y can be well characterized by some function f of the variables $x_1, \ldots, x_k$. Then, $f(x_1, \ldots, x_k)$ may be viewed as a systematic component of y provided that no other variables can further account for the residual behavior, $y - f(x_1, \ldots, x_k)$. In the context of linear regression, the function f is specified as a linear function. The method of ordinary least squares (OLS) enables us to determine the linear weights (parameters) of this specification.
3.2.1 Simple Linear Regression
In simple linear regression, only one variable x is designated to describe the behavior of the variable y. The linear specification is
\[
\alpha + \beta x,
\]
where $\alpha$ and $\beta$ are unknown parameters. We can then write
\[
y = \alpha + \beta x + e(\alpha, \beta),
\]
where $e(\alpha, \beta) = y - \alpha - \beta x$ denotes the error resulting from this specification. In what follows, y will be referred to as the dependent variable (regressand), and x will be referred to as an explanatory variable (regressor). Note that the regressor x itself may be a function of some other variables, e.g., $x = z^2$ or $x = \log z$.
Suppose that we have T observations of the variables y and x. Given the linear specification above, our objective is to find suitable $\alpha$ and $\beta$ such that the resulting linear function best fits the data $(y_t, x_t)$, $t = 1, \ldots, T$. Here, the generic subscript t is used for both cross-section and time-series data. The OLS method suggests finding a straight line whose sum of squared errors is as small as possible. This amounts to finding $\alpha$ and $\beta$ that minimize the following OLS criterion function:
\[
Q(\alpha, \beta) := \frac{1}{T}\sum_{t=1}^{T} e_t(\alpha, \beta)^2 = \frac{1}{T}\sum_{t=1}^{T} (y_t - \alpha - \beta x_t)^2.
\]
The solutions can be easily obtained by solving the first order conditions.
The first order conditions of this minimization problem are:
\[
\frac{\partial Q(\alpha, \beta)}{\partial \alpha} = -\frac{2}{T}\sum_{t=1}^{T}(y_t - \alpha - \beta x_t) = 0,
\qquad
\frac{\partial Q(\alpha, \beta)}{\partial \beta} = -\frac{2}{T}\sum_{t=1}^{T}(y_t - \alpha - \beta x_t)x_t = 0.
\]
Solving for $\alpha$ and $\beta$ we have the following solutions:
\[
\hat{\beta}_T = \frac{\sum_{t=1}^{T}(y_t - \bar{y})(x_t - \bar{x})}{\sum_{t=1}^{T}(x_t - \bar{x})^2},
\qquad
\hat{\alpha}_T = \bar{y} - \hat{\beta}_T \bar{x},
\]
where $\bar{y} = \sum_{t=1}^{T} y_t/T$ and $\bar{x} = \sum_{t=1}^{T} x_t/T$. As $\hat{\alpha}_T$ and $\hat{\beta}_T$ are obtained by minimizing the OLS criterion function, they are known as the OLS estimators of $\alpha$ and $\beta$, respectively. The subscript T of $\hat{\alpha}_T$ and $\hat{\beta}_T$ signifies that these solutions are obtained from a sample of T observations. Note that if $x_t$ is a constant c for every t, then $\bar{x} = c$, and hence $\hat{\beta}_T$ cannot be computed.
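The closed-form solutions above translate directly into code. The following Python sketch (illustrative only, not from the original text; it assumes NumPy and uses simulated data) computes $\hat{\alpha}_T$ and $\hat{\beta}_T$ from their formulas.

```python
import numpy as np

def ols_simple(y, x):
    """OLS estimates for the simple regression y_t = alpha + beta * x_t + error."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    beta_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    alpha_hat = y.mean() - beta_hat * x.mean()
    return alpha_hat, beta_hat

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=50)   # hypothetical data
print(ols_simple(y, x))                          # roughly (1.0, 2.0)
```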
The function $\hat{y} = \hat{\alpha}_T + \hat{\beta}_T x$ is the estimated regression line with the intercept $\hat{\alpha}_T$ and slope $\hat{\beta}_T$. We also say that this line is obtained by regressing y on (the constant one and) the regressor x. The regression line so computed gives the "best" fit of the data, in the sense that any other linear function of x would yield a larger sum of squared errors. For a given $x_t$, the OLS fitted value is a point on the regression line:
\[
\hat{y}_t = \hat{\alpha}_T + \hat{\beta}_T x_t.
\]
The difference between $y_t$ and $\hat{y}_t$ is the t-th OLS residual:
\[
\hat{e}_t := y_t - \hat{y}_t,
\]
which corresponds to the error of the specification as
\[
\hat{e}_t = e_t(\hat{\alpha}_T, \hat{\beta}_T).
\]
Note that regressing y on x and regressing x on y lead to different regression lines in general, except when all $(y_t, x_t)$ lie on the same line; see Exercise 3.9.
Remark: Different criterion functions would result in other estimators. For example, the so-called least absolute deviation estimator can be obtained by minimizing the average of the sum of absolute errors:
\[
\frac{1}{T}\sum_{t=1}^{T} |y_t - \alpha - \beta x_t|,
\]
which in turn determines a different regression line. We refer to Manski (1991) for a comprehensive discussion of this topic.
3.2.2 Multiple Linear Regression
More generally, we may specify a linear function with k explanatory variables to describe the behavior of y:
\[
\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,
\]
so that
\[
y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e(\beta_1, \ldots, \beta_k),
\]
where $e(\beta_1, \ldots, \beta_k)$ again denotes the error of this specification. Given a sample of T observations, this specification can also be expressed as
\[
y = X\beta + e(\beta), \tag{3.1}
\]
where $\beta = (\beta_1 \; \beta_2 \; \cdots \; \beta_k)'$ is the vector of unknown parameters, and y and X contain all
the observations of the dependent and explanatory variables, i.e.,
\[
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix},
\qquad
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{T1} & x_{T2} & \cdots & x_{Tk}
\end{bmatrix},
\]
where each column of X contains T observations of an explanatory variable, and $e(\beta)$ is the vector of errors. It is typical to set the first explanatory variable to the constant one, so that the first column of X is the $T \times 1$ vector of ones, $\ell$. For convenience, we also write $e(\beta)$ as e and its element $e_t(\beta)$ as $e_t$.
Our objective now is to find a k-dimensional regression hyperplane that best fits the data (y, X). In light of Section 3.2.1, we would like to minimize, with respect to $\beta$, the average of the sum of squared errors:
\[
Q(\beta) := \frac{1}{T} e(\beta)'e(\beta) = \frac{1}{T}(y - X\beta)'(y - X\beta). \tag{3.2}
\]
This is a well-defined problem provided that the basic identification requirement below holds for the specification (3.1).
[ID-1] The $T \times k$ data matrix X is of full column rank k.
Under [ID-1], the number of regressors, k, must be strictly less than the number of observations, T. This is so because if k > T, the rank of X must be less than or equal to T, and hence X cannot have full column rank. Moreover, [ID-1] requires that the linear specification does not contain any redundant regressor; that is, no column vector of X can be written as a linear combination of the other column vectors. For example, X contains a column of ones and a column of $x_t$ in simple linear regression. These two columns would be linearly dependent if $x_t = c$ for every t. Thus, [ID-1] requires that $x_t$ in simple linear regression is not a constant.
The first order condition of the OLS minimization problem is
\[
\nabla_{\beta} Q(\beta) = \nabla_{\beta}\,(y'y - 2y'X\beta + \beta'X'X\beta)/T = 0.
\]
By the matrix differentiation results in Section 1.2, we have
\[
\nabla_{\beta} Q(\beta) = -2X'(y - X\beta)/T = 0.
\]
Equivalently, we can write
\[
X'X\beta = X'y. \tag{3.3}
\]
These k equations, also known as the normal equations, contain exactly k unknowns. Given [ID-1], X is of full column rank so that X'X is positive definite and hence invertible by Lemma 1.13. It follows that the unique solution to the first order condition is
\[
\hat{\beta}_T = (X'X)^{-1}X'y. \tag{3.4}
\]
Moreover, the second order condition is also satisfied because
\[
\nabla^2_{\beta} Q(\beta) = 2(X'X)/T
\]
is a positive definite matrix under [ID-1]. Thus, $\hat{\beta}_T$ is the unique minimizer of the OLS criterion function and hence is known as the OLS estimator of $\beta$. This result is formally stated below.
Theorem 3.1 Given the specification (3.1), suppose that [ID-1] holds. Then, the OLS estimator $\hat{\beta}_T$ given by (3.4) uniquely minimizes the OLS criterion function (3.2).
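As an illustration of (3.3) and (3.4) (not part of the original text), the sketch below computes $\hat{\beta}_T$ with NumPy on simulated data by solving the normal equations; solving $X'Xb = X'y$ is numerically preferable to forming the inverse explicitly.

```python
import numpy as np

def ols(y, X):
    """OLS estimator: solve the normal equations X'X b = X'y."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(5)
T, k = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])   # first column of ones
beta_o = np.array([1.0, 2.0, -0.5])
y = X @ beta_o + rng.normal(size=T)

beta_hat = ols(y, X)
e_hat = y - X @ beta_hat
print(beta_hat)            # close to beta_o
print(X.T @ e_hat)         # ~ 0: the residuals satisfy the normal equations
```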
If X is not of full column rank, its column vectors are linearly dependent and there-
fore satisfy an exact linear relationship. This is the problem of exact multicollinearity.
In this case, X'X is not invertible, so that there exist infinitely many solutions to the normal equations $X'X\beta = X'y$. As such, the OLS estimator $\hat{\beta}_T$ cannot be uniquely determined. See Exercise 3.4 for a geometric interpretation of this result. Exact multicollinearity usually arises from inappropriate model specifications. For example, including total income, total wage income, and total non-wage income as regressors results in exact multicollinearity because total income is, by definition, the sum of wage and non-wage income. See also Section 3.6.2 for another example. In what follows, the identification requirement for the linear specification (3.1) is always assumed.
Remarks:
1. Theorem 3.1 does not depend on the true relationship between y and X. That is, whether (3.1) is a correct specification is irrelevant to the existence and uniqueness of the OLS estimator.
2. It is easy to verify that the magnitudes of the coefficient estimates $\hat{\beta}_i$, $i = 1, \ldots, k$, are affected by the measurement units of the dependent and explanatory variables; see Exercise 3.7. As such, a larger coefficient estimate does not necessarily imply that the associated regressor is more important in explaining the behavior of y. In fact, the coefficient estimates are not directly comparable in general; cf. Exercise 3.5.
Once the OLS estimator $\hat{\beta}_T$ is obtained, we can plug it into the original linear specification and obtain the vector of OLS fitted values:
\[
\hat{y} = X\hat{\beta}_T.
\]
The vector of OLS residuals is then
\[
\hat{e} = y - \hat{y} = e(\hat{\beta}_T).
\]
From the normal equations (3.3) we can deduce the following algebraic results. First, the OLS residual vector must satisfy the normal equations:
\[
X'(y - X\hat{\beta}_T) = X'\hat{e} = 0,
\]
so that $X'\hat{e} = 0$. When X contains a column of constants (i.e., a column of X is $c\ell$, where $\ell$ is the vector of ones), $X'\hat{e} = 0$ implies
\[
\ell'\hat{e} = \sum_{t=1}^{T} \hat{e}_t = 0.
\]
That is, the sum of OLS residuals must be zero. Second,
\[
\hat{y}'\hat{e} = \hat{\beta}_T' X'\hat{e} = 0.
\]
These results are summarized below.
Theorem 3.2 Given the specification (3.1), suppose that [ID-1] holds. Then, the vector of OLS fitted values $\hat{y}$ and the vector of OLS residuals $\hat{e}$ have the following properties.
(a) $X'\hat{e} = 0$; in particular, if X contains a column of constants, $\sum_{t=1}^{T} \hat{e}_t = 0$.
(b) $\hat{y}'\hat{e} = 0$.
Note that when $\ell'\hat{e} = \ell'(y - \hat{y}) = 0$, we have
\[
\frac{1}{T}\sum_{t=1}^{T} y_t = \frac{1}{T}\sum_{t=1}^{T} \hat{y}_t.
\]
That is, the sample average of the data $y_t$ is the same as the sample average of the fitted values $\hat{y}_t$ when X contains a column of constants.
3.2.3 Geometric Interpretations
The OLS estimation result has nice geometric interpretations. These interpretations have nothing to do with the stochastic properties to be discussed in Section 3.3, and they are valid as long as the OLS estimator exists.
In what follows, we write $P = X(X'X)^{-1}X'$, which is the orthogonal projection matrix that projects vectors onto span(X) by Lemma 1.14. The vector of OLS fitted values can be written as
\[
\hat{y} = X(X'X)^{-1}X'y = Py.
\]
Hence, $\hat{y}$ is the orthogonal projection of y onto span(X). The OLS residual vector is
\[
\hat{e} = y - \hat{y} = (I_T - P)y,
\]
which is the orthogonal projection of y onto $\mathrm{span}(X)^{\perp}$ and hence is orthogonal to $\hat{y}$ and X; cf. Theorem 3.2. Consequently, $\hat{y}$ is the best approximation of y, given the information contained in X, as shown in Lemma 1.10. Figure 3.1 illustrates a simple case where there are only two explanatory variables in the specification.
The following results are useful in many applications.
[Figure 3.1: The orthogonal projection of y onto span(x_1, x_2).]
Theorem 3.3 (Frisch-Waugh-Lovell) Given the specification
\[
y = X_1\beta_1 + X_2\beta_2 + e,
\]
where $X_1$ is of full column rank $k_1$ and $X_2$ is of full column rank $k_2$, let $\hat{\beta}_T = (\hat{\beta}_{1,T}' \; \hat{\beta}_{2,T}')'$ denote the corresponding OLS estimators. Then,
\[
\hat{\beta}_{1,T} = [X_1'(I - P_2)X_1]^{-1} X_1'(I - P_2)y,
\qquad
\hat{\beta}_{2,T} = [X_2'(I - P_1)X_2]^{-1} X_2'(I - P_1)y,
\]
where $P_1 = X_1(X_1'X_1)^{-1}X_1'$ and $P_2 = X_2(X_2'X_2)^{-1}X_2'$.
Proof: These results can be directly verified from (3.4) using the matrix inversion formula in Section 1.4. Alternatively, write
\[
y = X_1\hat{\beta}_{1,T} + X_2\hat{\beta}_{2,T} + (I - P)y,
\]
where $P = X(X'X)^{-1}X'$ with $X = [X_1 \; X_2]$. Pre-multiplying both sides by $X_1'(I - P_2)$, we have
\[
X_1'(I - P_2)y = X_1'(I - P_2)X_1\hat{\beta}_{1,T} + X_1'(I - P_2)X_2\hat{\beta}_{2,T} + X_1'(I - P_2)(I - P)y.
\]
The second term on the right-hand side vanishes because $(I - P_2)X_2 = 0$. For the third term, we know $\mathrm{span}(X_2) \subseteq \mathrm{span}(X)$, so that $\mathrm{span}(X)^{\perp} \subseteq \mathrm{span}(X_2)^{\perp}$. As each
column vector of $I - P$ is in $\mathrm{span}(X)^{\perp}$, $I - P$ is not affected if it is projected onto $\mathrm{span}(X_2)^{\perp}$. That is,
\[
(I - P_2)(I - P) = I - P.
\]
Similarly, $X_1$ is in span(X), and hence $(I - P)X_1 = 0$. It follows that
\[
X_1'(I - P_2)y = X_1'(I - P_2)X_1\hat{\beta}_{1,T},
\]
from which we obtain the expression for $\hat{\beta}_{1,T}$. The proof for $\hat{\beta}_{2,T}$ is similar.
This result shows that $\hat{\beta}_{1,T}$ can be computed from regressing $(I - P_2)y$ on $(I - P_2)X_1$, where $(I - P_2)y$ and $(I - P_2)X_1$ are the residual vectors of the "purging" regressions of y on $X_2$ and of $X_1$ on $X_2$, respectively. Similarly, $\hat{\beta}_{2,T}$ can be obtained by regressing $(I - P_1)y$ on $(I - P_1)X_2$, where $(I - P_1)y$ and $(I - P_1)X_2$ are the residual vectors of the regressions of y on $X_1$ and of $X_2$ on $X_1$, respectively.
From Theorem 3.3 we can deduce the following results. Consider the regression of $(I - P_1)y$ on $(I - P_1)X_2$. By Theorem 3.3 we have
\[
(I - P_1)y = (I - P_1)X_2\hat{\beta}_{2,T} + \text{residual vector}, \tag{3.5}
\]
where the residual vector is
\[
(I - P_1)(I - P)y = (I - P)y.
\]
Thus, the residual vector of (3.5) is identical to the residual vector of regressing y on $X = [X_1 \; X_2]$. Note that $(I - P_1)(I - P) = I - P$ implies $P_1 = P_1 P$. That is, the orthogonal projection of y directly onto $\mathrm{span}(X_1)$ is equivalent to performing iterated projections of y on span(X) and then on $\mathrm{span}(X_1)$. The orthogonal projection part of (3.5) can now be expressed as
\[
(I - P_1)X_2\hat{\beta}_{2,T} = (I - P_1)Py = (P - P_1)y.
\]
These relationships are illustrated in Figure 3.2.
These relationships are illustrated in Figure 3.2.
Similarly, we have
(I P
2
)y = (I P
2
)X
1

1,T
+ residual vector,
where the residual vector is also (I P)y, and the orthogonal projection part of this
regression is (P P
2
)y. See also Davidson and MacKinnon (1993) for more details.
[Figure 3.2: An illustration of the Frisch-Waugh-Lovell Theorem.]
Intuitively, Theorem 3.3 suggests that $\hat{\beta}_{1,T}$ in effect describes how $X_1$ characterizes y, after the effect of $X_2$ is excluded. Thus, $\hat{\beta}_{1,T}$ is different from the OLS estimator obtained from regressing y on $X_1$ alone, because the effect of $X_2$ is not controlled in the latter. These two estimators would be the same if $P_2X_1 = 0$, i.e., $X_1$ is orthogonal to $X_2$. Also, $\hat{\beta}_{2,T}$ describes how $X_2$ characterizes y, after the effect of $X_1$ is excluded, and it is different from the OLS estimator from regressing y on $X_2$, unless $X_1$ and $X_2$ are orthogonal to each other.
As an application, consider the specification with $X = [X_1 \; X_2]$, where $X_1$ contains the constant term and a time trend variable t, and $X_2$ includes the other $k - 2$ explanatory variables. This specification is useful when the variables of interest exhibit trending behavior. Then, the OLS estimators of the coefficients of $X_2$ are the same as those obtained from regressing (detrended) y on detrended $X_2$, where detrended y and $X_2$ are the residuals of regressing y and $X_2$ on $X_1$, respectively; a numerical check of this equivalence is sketched below. See Exercise 3.11 for another application.
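A minimal numerical check of the detrending application of Theorem 3.3 (illustrative Python with NumPy on simulated data; not from the original text): the coefficients on $X_2$ from the full regression coincide with those from regressing detrended y on detrended $X_2$.

```python
import numpy as np

def ols(y, X):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(6)
T = 150
t = np.arange(1.0, T + 1)                        # time trend
X1 = np.column_stack([np.ones(T), t])            # constant and trend
X2 = rng.normal(size=(T, 2))                     # two other regressors
X = np.column_stack([X1, X2])
y = X @ np.array([1.0, 0.05, 2.0, -1.0]) + rng.normal(size=T)

b_full = ols(y, X)[2:]                           # coefficients on X2 from the full regression

P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)       # projection onto span(X1)
y_detr = y - P1 @ y                              # detrended y
X2_detr = X2 - P1 @ X2                           # detrended X2
b_fwl = ols(y_detr, X2_detr)

print(np.allclose(b_full, b_fwl))                # True
```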
3.2.4 Measures of Goodness of Fit
We have learned from previous sections that, when the explanatory variables in a linear specification are given, the OLS method yields the best fit of the data. In practice, one may consider linear specifications with different sets of regressors and try to choose a particular one from them. It is therefore of interest to compare the performance across
different specifications. In this section we discuss how to measure the goodness of fit of a specification. A natural goodness-of-fit measure is of course the sum of squared errors $\hat{e}'\hat{e}$. Unfortunately, this measure is not invariant with respect to the measurement units of the dependent variable and hence is not appropriate for model comparison. Instead, we consider the following relative measures of goodness of fit.
Recall from Theorem 3.2(b) that $\hat{y}'\hat{e} = 0$. Then,
\[
y'y = \hat{y}'\hat{y} + \hat{e}'\hat{e} + 2\hat{y}'\hat{e} = \hat{y}'\hat{y} + \hat{e}'\hat{e}.
\]
This equation can be written in terms of sums of squares:
\[
\underbrace{\sum_{t=1}^{T} y_t^2}_{\text{TSS}}
= \underbrace{\sum_{t=1}^{T} \hat{y}_t^2}_{\text{RSS}}
+ \underbrace{\sum_{t=1}^{T} \hat{e}_t^2}_{\text{ESS}},
\]
where TSS stands for total sum of squares and is a measure of the total squared variation of $y_t$, RSS stands for regression sum of squares and is a measure of the squared variation of the fitted values, and ESS stands for error sum of squares and is a measure of the squared variation of the residuals. The non-centered coefficient of determination (or non-centered $R^2$) is defined as the proportion of TSS that can be explained by the regression hyperplane:
\[
R^2 = \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\text{ESS}}{\text{TSS}}. \tag{3.6}
\]
Clearly, $0 \le R^2 \le 1$, and the larger the $R^2$, the better the model fits the data. In particular, a model has a perfect fit if $R^2 = 1$, and it does not account for any variation of y if $R^2 = 0$. It is also easy to verify that this measure does not depend on the measurement units of the dependent and explanatory variables; see Exercise 3.7.
As $\hat{y}'y = \hat{y}'\hat{y}$, we can also write
\[
R^2 = \frac{\hat{y}'\hat{y}}{y'y} = \frac{(\hat{y}'y)^2}{(y'y)(\hat{y}'\hat{y})}.
\]
It follows from the discussion of inner product and Euclidean norm in Section 1.2 that the right-hand side is just $\cos^2\theta$, where $\theta$ is the angle between y and $\hat{y}$. Thus, $R^2$ can be interpreted as a measure of the linear association between these two vectors. A perfect fit is equivalent to the fact that y and $\hat{y}$ are collinear, so that y must be in span(X). When $R^2 = 0$, y is orthogonal to $\hat{y}$, so that y is in $\mathrm{span}(X)^{\perp}$.
It can be verified that when a constant is added to all observations of the dependent variable, the resulting coefficient of determination also changes. This is clearly a
drawback because a sensible measure of fit should not be affected by the location of the dependent variable. Another drawback of the coefficient of determination is that it is non-decreasing in the number of variables in the specification. That is, adding more variables to a linear specification will not reduce its $R^2$. To see this, consider a specification with $k_1$ regressors and a more complex one containing the same $k_1$ regressors and $k_2$ additional regressors. In this case, the former specification is nested in the latter, in the sense that the former can be obtained from the latter by setting the coefficients of the additional regressors to zero. Since the OLS method searches for the best fit of the data without any constraint, the more complex model cannot have a worse fit than the specifications nested in it. See also Exercise 3.8.
A measure that is invariant with respect to constant addition is the centered coefficient of determination (or centered $R^2$). When a specification contains a constant term,
\[
\underbrace{\sum_{t=1}^{T} (y_t - \bar{y})^2}_{\text{Centered TSS}}
= \underbrace{\sum_{t=1}^{T} (\hat{y}_t - \bar{y})^2}_{\text{Centered RSS}}
+ \underbrace{\sum_{t=1}^{T} \hat{e}_t^2}_{\text{ESS}},
\]
where $\bar{\hat{y}} = \bar{y} = \sum_{t=1}^{T} y_t/T$. Analogous to (3.6), the centered $R^2$ is defined as
\[
\text{Centered } R^2 = \frac{\text{Centered RSS}}{\text{Centered TSS}} = 1 - \frac{\text{ESS}}{\text{Centered TSS}}. \tag{3.7}
\]
The centered $R^2$ also takes on values between 0 and 1 and is non-decreasing in the number of variables in the specification. In contrast with the non-centered $R^2$, this measure excludes the effect of the constant term and hence is invariant with respect to constant addition.
When a specification contains a constant term, we have
\[
\sum_{t=1}^{T} (y_t - \bar{y})(\hat{y}_t - \bar{y})
= \sum_{t=1}^{T} (\hat{y}_t - \bar{y} + \hat{e}_t)(\hat{y}_t - \bar{y})
= \sum_{t=1}^{T} (\hat{y}_t - \bar{y})^2,
\]
because $\sum_{t=1}^{T} \hat{y}_t \hat{e}_t = 0$ and $\sum_{t=1}^{T} \hat{e}_t = 0$ by Theorem 3.2. It follows that
\[
R^2 = \frac{\sum_{t=1}^{T} (\hat{y}_t - \bar{y})^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2}
= \frac{\left[\sum_{t=1}^{T} (y_t - \bar{y})(\hat{y}_t - \bar{y})\right]^2}{\left[\sum_{t=1}^{T} (y_t - \bar{y})^2\right]\left[\sum_{t=1}^{T} (\hat{y}_t - \bar{y})^2\right]}.
\]
That is, the centered $R^2$ is also the squared sample correlation coefficient of $y_t$ and $\hat{y}_t$, also known as the squared multiple correlation coefficient. If a specification does not contain a constant term, the centered $R^2$ may be negative; see Exercise 3.10.
Both the centered and non-centered $R^2$ are non-decreasing in the number of regressors. As such, if one tries to determine a specification based on $R^2$, the specification
with more regressors would be chosen. A modified measure is the adjusted $R^2$, $\bar{R}^2$, which is the centered $R^2$ adjusted for the degrees of freedom:
\[
\bar{R}^2 = 1 - \frac{\hat{e}'\hat{e}/(T - k)}{(y'y - T\bar{y}^2)/(T - 1)}.
\]
This measure can also be expressed in different forms:
\[
\bar{R}^2 = 1 - \frac{T - 1}{T - k}(1 - R^2) = R^2 - \frac{k - 1}{T - k}(1 - R^2).
\]
That is, $\bar{R}^2$ is the centered $R^2$ with a penalty term depending on model complexity and explanatory ability. Observe that when k increases, $(k-1)/(T-k)$ increases but $1 - R^2$ decreases. Whether the penalty term is larger or smaller depends on the trade-off between these two terms. Thus, $\bar{R}^2$ need not be increasing with the number of explanatory variables. Clearly, $\bar{R}^2 < R^2$ except for $k = 1$ or $R^2 = 1$. It can also be verified that $\bar{R}^2 < 0$ when $R^2 < (k-1)/(T-1)$.
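For concreteness, the following Python sketch (an added illustration assuming NumPy; the data are simulated) computes the non-centered $R^2$, the centered $R^2$ of (3.7), and the adjusted $\bar{R}^2$ for an OLS fit.

```python
import numpy as np

def fit_measures(y, X):
    """Non-centered R^2, centered R^2, and adjusted R^2 for the OLS fit of y on X."""
    T, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    ess = e @ e
    r2_noncentered = 1 - ess / (y @ y)
    ctss = np.sum((y - y.mean()) ** 2)
    r2_centered = 1 - ess / ctss
    r2_adjusted = 1 - (ess / (T - k)) / (ctss / (T - 1))
    return r2_noncentered, r2_centered, r2_adjusted

rng = np.random.default_rng(7)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=T)
print(fit_measures(y, X))
```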
Remark: As different dependent variables have different TSS, the associated specifications are therefore not comparable in terms of their $R^2$. For example, the $R^2$ of specifications with y and $\log y$ as dependent variables are not comparable.
3.3 Statistical Properties of the OLS Estimators
Readers should have noticed that the previous results, which are either algebraic or geometric, hold regardless of the random nature of the data. To derive the statistical properties of the OLS estimator, some probabilistic conditions must be imposed.
3.3.1 Classical Conditions
The following conditions on data are usually known as the classical conditions.
[A1] X is non-stochastic.
[A2] y is a random vector such that
(i) $\mathrm{IE}(y) = X\beta_o$ for some $\beta_o$;
(ii) $\mathrm{var}(y) = \sigma_o^2 I_T$ for some $\sigma_o^2 > 0$.
[A3] y is a random vector such that $y \sim N(X\beta_o, \sigma_o^2 I_T)$ for some $\beta_o$ and $\sigma_o^2 > 0$.
Condition [A1] is not crucial, but it is quite convenient for subsequent analysis. Note that IE(y) is the averaging behavior of y and may be interpreted as a systematic component of y. [A2](i) thus ensures that the postulated linear function $X\beta$ is a specification of this systematic component, correct up to the unknown parameters. Condition [A2](ii) requires that the variance-covariance matrix of y depend only on one parameter $\sigma_o^2$; such a matrix is also known as a scalar covariance matrix. Under [A2](ii), $y_t$, $t = 1, \ldots, T$, have the constant variance $\sigma_o^2$ and are pairwise uncorrelated (but not necessarily independent). Although conditions [A2] and [A3] impose the same structures on the mean and variance of y, the latter is much stronger because it also specifies the distribution of y. We have seen in Section 2.3 that uncorrelated normal random variables are also independent. Therefore, $y_t$, $t = 1, \ldots, T$, are i.i.d. (independently and identically distributed) normal random variables under [A3]. The linear specification (3.1) with [A1] and [A2] is known as the classical linear model, and (3.1) with [A1] and [A3] is known as the classical normal linear model. The limitations of these conditions will be discussed in Section 3.7.
In addition to $\hat{\beta}_T$, the new unknown parameter $\mathrm{var}(y_t) = \sigma_o^2$ in [A2](ii) and [A3] should be estimated as well. The OLS estimator for $\sigma_o^2$ is
\[
\hat{\sigma}_T^2 = \frac{\hat{e}'\hat{e}}{T - k} = \frac{1}{T - k}\sum_{t=1}^{T} \hat{e}_t^2, \tag{3.8}
\]
where k is the number of regressors. While $\hat{\beta}_T$ is a linear estimator in the sense that it is a linear transformation of y, $\hat{\sigma}_T^2$ is not. In the sections below we derive the properties of the OLS estimators $\hat{\beta}_T$ and $\hat{\sigma}_T^2$ under these classical conditions.
3.3.2 Without the Normality Condition
Under the imposed classical conditions, the OLS estimators have the following statistical properties.
Theorem 3.4 Consider the linear specification (3.1).
(a) Given [A1] and [A2](i), $\hat{\beta}_T$ is unbiased for $\beta_o$.
(b) Given [A1] and [A2], $\hat{\sigma}_T^2$ is unbiased for $\sigma_o^2$.
(c) Given [A1] and [A2], $\mathrm{var}(\hat{\beta}_T) = \sigma_o^2 (X'X)^{-1}$.
Proof: Given [A1] and [A2](i), $\hat{\beta}_T$ is unbiased because
\[
\mathrm{IE}(\hat{\beta}_T) = (X'X)^{-1}X'\,\mathrm{IE}(y) = (X'X)^{-1}X'X\beta_o = \beta_o.
\]
To prove (b), recall that $(I_T - P)X = 0$, so that the OLS residual vector can be written as
\[
\hat{e} = (I_T - P)y = (I_T - P)(y - X\beta_o).
\]
Then, $\hat{e}'\hat{e} = (y - X\beta_o)'(I_T - P)(y - X\beta_o)$, which is a scalar, and
\[
\mathrm{IE}(\hat{e}'\hat{e}) = \mathrm{IE}[\mathrm{trace}((y - X\beta_o)'(I_T - P)(y - X\beta_o))]
= \mathrm{IE}[\mathrm{trace}((y - X\beta_o)(y - X\beta_o)'(I_T - P))].
\]
By interchanging the trace and expectation operators, we have from [A2](ii) that
\[
\mathrm{IE}(\hat{e}'\hat{e}) = \mathrm{trace}\{\mathrm{IE}[(y - X\beta_o)(y - X\beta_o)'(I_T - P)]\}
= \mathrm{trace}\{\mathrm{IE}[(y - X\beta_o)(y - X\beta_o)'](I_T - P)\}
= \mathrm{trace}\{\sigma_o^2 I_T (I_T - P)\}
= \sigma_o^2\,\mathrm{trace}(I_T - P).
\]
By Lemmas 1.12 and 1.14, $\mathrm{trace}(I_T - P) = \mathrm{rank}(I_T - P) = T - k$. Consequently,
\[
\mathrm{IE}(\hat{e}'\hat{e}) = \sigma_o^2 (T - k),
\]
so that
\[
\mathrm{IE}(\hat{\sigma}_T^2) = \mathrm{IE}(\hat{e}'\hat{e})/(T - k) = \sigma_o^2.
\]
This proves the unbiasedness of $\hat{\sigma}_T^2$. Given that $\hat{\beta}_T$ is a linear transformation of y, we have from Lemma 2.4 that
\[
\mathrm{var}(\hat{\beta}_T) = \mathrm{var}((X'X)^{-1}X'y)
= (X'X)^{-1}X'(\sigma_o^2 I_T)X(X'X)^{-1}
= \sigma_o^2 (X'X)^{-1}.
\]
This establishes (c).
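Theorem 3.4 can also be checked by simulation. The sketch below (an added illustration, assuming NumPy; X is held fixed across replications in the spirit of [A1], and the data are artificial) estimates the mean of $\hat{\beta}_T$ and $\hat{\sigma}_T^2$ and the covariance matrix of $\hat{\beta}_T$ over many replications.

```python
import numpy as np

rng = np.random.default_rng(8)
T, k, sigma_o = 50, 3, 2.0
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])   # fixed design
beta_o = np.array([1.0, -0.5, 0.3])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20_000
betas = np.empty((reps, k))
s2 = np.empty(reps)
for r in range(reps):
    y = X @ beta_o + rng.normal(0.0, sigma_o, size=T)
    b = XtX_inv @ (X.T @ y)
    e = y - X @ b
    betas[r] = b
    s2[r] = (e @ e) / (T - k)

print(betas.mean(axis=0))                 # ~ beta_o          (Theorem 3.4(a))
print(s2.mean(), sigma_o ** 2)            # ~ sigma_o^2       (Theorem 3.4(b))
diff = np.abs(np.cov(betas.T) - sigma_o ** 2 * XtX_inv)
print(diff.max())                         # small             (Theorem 3.4(c))
```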
It can be seen that the unbiasedness of $\hat{\beta}_T$ does not depend on [A2](ii), the variance property of y. It is also clear that when $\hat{\sigma}_T^2$ is unbiased, the estimator
\[
\widehat{\mathrm{var}}(\hat{\beta}_T) = \hat{\sigma}_T^2 (X'X)^{-1}
\]
is also unbiased for $\mathrm{var}(\hat{\beta}_T)$. The result below, known as the Gauss-Markov theorem, indicates that when [A1] and [A2] hold, $\hat{\beta}_T$ is not only unbiased but also the best (most efficient) among all linear unbiased estimators for $\beta_o$.
Theorem 3.5 (Gauss-Markov) Given the linear specification (3.1), suppose that [A1] and [A2] hold. Then the OLS estimator $\hat{\beta}_T$ is the best linear unbiased estimator (BLUE) for $\beta_o$.
Proof: Consider an arbitrary linear estimator $\check{\beta}_T = Ay$, where A is non-stochastic. Writing $A = (X'X)^{-1}X' + C$, we have $\check{\beta}_T = \hat{\beta}_T + Cy$. Then,
\[
\mathrm{var}(\check{\beta}_T) = \mathrm{var}(\hat{\beta}_T) + \mathrm{var}(Cy) + 2\,\mathrm{cov}(\hat{\beta}_T, Cy).
\]
By [A1] and [A2](i),
\[
\mathrm{IE}(\check{\beta}_T) = \beta_o + CX\beta_o.
\]
Since $\beta_o$ is arbitrary, this estimator is unbiased if, and only if, $CX = 0$. This property further implies that
\[
\mathrm{cov}(\hat{\beta}_T, Cy) = \mathrm{IE}[(X'X)^{-1}X'(y - X\beta_o)y'C']
= (X'X)^{-1}X'\,\mathrm{IE}[(y - X\beta_o)y']C'
= (X'X)^{-1}X'(\sigma_o^2 I_T)C'
= 0.
\]
Thus,
\[
\mathrm{var}(\check{\beta}_T) = \mathrm{var}(\hat{\beta}_T) + \mathrm{var}(Cy) = \mathrm{var}(\hat{\beta}_T) + \sigma_o^2 CC',
\]
where $\sigma_o^2 CC'$ is clearly a positive semi-definite matrix. This shows that for any linear unbiased estimator $\check{\beta}_T$, $\mathrm{var}(\check{\beta}_T) - \mathrm{var}(\hat{\beta}_T)$ is positive semi-definite, so that $\hat{\beta}_T$ is more efficient.
Example 3.6 Consider the data [y X], where X is a non-stochastic matrix that can be partitioned as $[X_1 \; X_2]$. Suppose that $\mathrm{IE}(y) = X_1 b_1$ for some $b_1$ and $\mathrm{var}(y) = \sigma_o^2 I_T$ for some $\sigma_o^2 > 0$. Consider first the specification that contains only $X_1$ but not $X_2$:
\[
y = X_1\beta_1 + e.
\]
Let $\hat{b}_{1,T}$ denote the resulting OLS estimator. It is clear that $\hat{b}_{1,T}$ is still a linear estimator and unbiased for $b_1$ by Theorem 3.4(a). Moreover, it is the BLUE for $b_1$ by Theorem 3.5, with the variance-covariance matrix
\[
\mathrm{var}(\hat{b}_{1,T}) = \sigma_o^2 (X_1'X_1)^{-1},
\]
by Theorem 3.4(c).
Consider now the linear specification that involves both $X_1$ and the irrelevant regressors $X_2$:
\[
y = X\beta + e = X_1\beta_1 + X_2\beta_2 + e.
\]
Thus, this specification cannot be a correct specification unless some of the parameters ($\beta_2$) are restricted to zero. Let $\hat{\beta}_T = (\hat{\beta}_{1,T}' \; \hat{\beta}_{2,T}')'$ be the OLS estimator of $\beta$. Using Theorem 3.3, we find
\[
\mathrm{IE}(\hat{\beta}_{1,T}) = \mathrm{IE}\{[X_1'(I_T - P_2)X_1]^{-1}X_1'(I_T - P_2)y\} = b_1,
\qquad
\mathrm{IE}(\hat{\beta}_{2,T}) = \mathrm{IE}\{[X_2'(I_T - P_1)X_2]^{-1}X_2'(I_T - P_1)y\} = 0,
\]
where $P_1 = X_1(X_1'X_1)^{-1}X_1'$ and $P_2 = X_2(X_2'X_2)^{-1}X_2'$. This shows that $\hat{\beta}_T$ is unbiased for $(b_1' \; 0')'$. Also,
\[
\mathrm{var}(\hat{\beta}_{1,T}) = \mathrm{var}([X_1'(I_T - P_2)X_1]^{-1}X_1'(I_T - P_2)y)
= \sigma_o^2 [X_1'(I_T - P_2)X_1]^{-1}.
\]
Given that $P_2$ is a positive semi-definite matrix,
\[
X_1'X_1 - X_1'(I_T - P_2)X_1 = X_1'P_2X_1
\]
must also be positive semi-definite. It follows from Lemma 1.9 that
\[
[X_1'(I_T - P_2)X_1]^{-1} - (X_1'X_1)^{-1}
\]
is a positive semi-definite matrix. This shows that $\hat{b}_{1,T}$ is more efficient than $\hat{\beta}_{1,T}$, as it ought to be. When $X_1'X_2 = 0$, i.e., the columns of $X_1$ are orthogonal to the columns of $X_2$, we immediately have $(I_T - P_2)X_1 = X_1$, so that $\hat{\beta}_{1,T} = \hat{b}_{1,T}$. In this case, estimating a more complex specification does not result in an efficiency loss.
Remark: The Gauss-Markov theorem does not apply to the estimators of the specification $y = X_1\beta_1 + X_2\beta_2 + e$ because, unlike [A2](i), the true parameter vector $\beta_o = (b_1' \; 0')'$ is not arbitrary but involves the restriction that some of its elements must be zero. This example thus shows that when this restriction is not taken into account, the resulting OLS estimator, while being unbiased, is no longer the most efficient.
3.3.3 With the Normality Condition
We have learned that the normality condition [A3] is much stronger than [A2]. With this stronger condition, more can be said about the OLS estimators.
Theorem 3.7 Given the linear specification (3.1), suppose that [A1] and [A3] hold.
(a) $\hat{\beta}_T \sim N(\beta_o, \sigma_o^2(X'X)^{-1})$.
(b) $(T - k)\hat{\sigma}_T^2/\sigma_o^2 \sim \chi^2(T - k)$.
(c) $\hat{\sigma}_T^2$ has mean $\sigma_o^2$ and variance $2\sigma_o^4/(T - k)$.
Proof: As $\hat{\beta}_T$ is a linear transformation of y, it is also normally distributed:
\[
\hat{\beta}_T \sim N(\beta_o, \sigma_o^2(X'X)^{-1})
\]
by Lemma 2.6, where its mean and variance-covariance matrix are as in Theorem 3.4(a) and (c). To prove assertion (b), we again write $\hat{e} = (I_T - P)(y - X\beta_o)$ and deduce
\[
(T - k)\hat{\sigma}_T^2/\sigma_o^2 = \hat{e}'\hat{e}/\sigma_o^2 = y^{*\prime}(I_T - P)y^{*},
\]
where $y^{*} = (y - X\beta_o)/\sigma_o$. Let C be the orthogonal matrix that diagonalizes the symmetric and idempotent matrix $I_T - P$. Then, $C'(I_T - P)C = \Lambda$. Since $\mathrm{rank}(I_T - P) = T - k$, $\Lambda$ contains $T - k$ eigenvalues equal to one and k eigenvalues equal to zero by Lemma 1.11. Without loss of generality we can write
\[
y^{*\prime}(I_T - P)y^{*} = y^{*\prime}C[C'(I_T - P)C]C'y^{*}
= \eta' \begin{bmatrix} I_{T-k} & 0 \\ 0 & 0 \end{bmatrix} \eta,
\]
where $\eta = C'y^{*}$. Again by Lemma 2.6, $y^{*} \sim N(0, I_T)$ under [A3]. Hence, $\eta \sim N(0, I_T)$, so that the $\eta_i$ are independent, standard normal random variables. Consequently,
\[
y^{*\prime}(I_T - P)y^{*} = \sum_{i=1}^{T-k} \eta_i^2 \sim \chi^2(T - k).
\]
This proves (b). Noting that the mean of $\chi^2(T - k)$ is $T - k$ and the variance is $2(T - k)$, assertion (c) is a direct consequence of (b).
Suppose that we believe that [A3] is true and specify the log-likelihood function of y as
\[
\log L(\beta, \sigma^2) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).
\]
The first order conditions of maximizing this log-likelihood are
\[
\nabla_{\beta} \log L(\beta, \sigma^2) = \frac{1}{\sigma^2} X'(y - X\beta) = 0,
\qquad
\nabla_{\sigma^2} \log L(\beta, \sigma^2) = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0,
\]
and their solutions are the MLEs $\tilde{\beta}_T$ and $\tilde{\sigma}_T^2$. The first k equations above are equivalent to the OLS normal equations (3.3). It follows that the OLS estimator $\hat{\beta}_T$ is also the MLE $\tilde{\beta}_T$. Plugging $\tilde{\beta}_T$ into the first order conditions, we can solve for $\sigma^2$ and obtain
\[
\tilde{\sigma}_T^2 = \frac{(y - X\tilde{\beta}_T)'(y - X\tilde{\beta}_T)}{T} = \frac{\hat{e}'\hat{e}}{T}, \tag{3.9}
\]
which is different from the OLS variance estimator (3.8).
The conclusion below is stronger than the Gauss-Markov theorem (Theorem 3.5).
Theorem 3.8 Given the linear specification (3.1), suppose that [A1] and [A3] hold. Then the OLS estimators $\hat{\beta}_T$ and $\hat{\sigma}_T^2$ are the best unbiased estimators for $\beta_o$ and $\sigma_o^2$, respectively.
Proof: The score vector is
\[
s(\beta, \sigma^2) = \begin{bmatrix}
\frac{1}{\sigma^2} X'(y - X\beta) \\[4pt]
-\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta)
\end{bmatrix},
\]
and the Hessian matrix of the log-likelihood function is
\[
\begin{bmatrix}
-\frac{1}{\sigma^2} X'X & -\frac{1}{\sigma^4} X'(y - X\beta) \\[4pt]
-\frac{1}{\sigma^4} (y - X\beta)'X & \frac{T}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)'(y - X\beta)
\end{bmatrix}.
\]
It is easily verified that when [A3] is true, $\mathrm{IE}[s(\beta_o, \sigma_o^2)] = 0$, and the expected value of the Hessian matrix evaluated at $\beta_o$ and $\sigma_o^2$ is
\[
\begin{bmatrix}
-\frac{1}{\sigma_o^2} X'X & 0 \\[4pt]
0 & -\frac{T}{2\sigma_o^4}
\end{bmatrix}.
\]
The information matrix equality (Lemma 2.9) ensures that the negative of this matrix equals the information matrix. The inverse of the information matrix is then
\[
\begin{bmatrix}
\sigma_o^2 (X'X)^{-1} & 0 \\[4pt]
0 & \frac{2\sigma_o^4}{T}
\end{bmatrix},
\]
which is the Cramér-Rao lower bound by Lemma 2.10. Clearly, $\mathrm{var}(\hat{\beta}_T)$ achieves this lower bound, so that $\hat{\beta}_T$ must be the best unbiased estimator for $\beta_o$. Although the variance of $\hat{\sigma}_T^2$ is greater than the lower bound, it can be shown that $\hat{\sigma}_T^2$ is still the best unbiased estimator for $\sigma_o^2$; see Rao (1973) for a proof.
Remark: Compared with the Gauss-Markov theorem, Theorem 3.8 gives a stronger result at the expense of a stronger condition (the normality condition [A3]). The OLS estimators are now the best (most efficient) in a much larger class of estimators, namely, the class of unbiased estimators. Note also that Theorem 3.8 covers $\hat{\sigma}_T^2$, whereas the Gauss-Markov theorem does not.
3.4 Hypotheses Testing
After a specification is estimated, it is often desirable to test various economic and econometric hypotheses. Given the classical conditions [A1] and [A3], we consider the linear hypothesis
\[
R\beta_o = r, \tag{3.10}
\]
where R is a $q \times k$ non-stochastic matrix with rank $q < k$, and r is a vector of pre-specified, hypothetical values.
3.4.1 Tests for Linear Hypotheses
If the null hypothesis (3.10) is true, it is reasonable to expect that $R\hat{\beta}_T$ is "close" to the hypothesized value r; otherwise, they should be quite different. Here, the closeness between $R\hat{\beta}_T$ and r must be justified by the null distribution of the test statistics.
If there is only a single hypothesis, the null hypothesis (3.10) is such that R is a row vector ($q = 1$) and r is a scalar. Note that a single hypothesis may involve two or more parameters. Consider the following statistic:
\[
\frac{R\hat{\beta}_T - r}{\sigma_o [R(X'X)^{-1}R']^{1/2}}.
\]
By Theorem 3.7(a), $\hat{\beta}_T \sim N(\beta_o, \sigma_o^2(X'X)^{-1})$, and hence
\[
R\hat{\beta}_T \sim N(R\beta_o, \sigma_o^2 R(X'X)^{-1}R').
\]
Under the null hypothesis, we have
\[
\frac{R\hat{\beta}_T - r}{\sigma_o [R(X'X)^{-1}R']^{1/2}}
= \frac{R(\hat{\beta}_T - \beta_o)}{\sigma_o [R(X'X)^{-1}R']^{1/2}} \sim N(0, 1). \tag{3.11}
\]
Although the left-hand side has a known distribution, it cannot be used as a test statistic because $\sigma_o$ is unknown. Replacing $\sigma_o$ by its OLS estimator $\hat{\sigma}_T$ yields an operational statistic:
\[
\tau = \frac{R\hat{\beta}_T - r}{\hat{\sigma}_T [R(X'X)^{-1}R']^{1/2}}. \tag{3.12}
\]
The null distribution of $\tau$ is given in the result below.
Theorem 3.9 Given the linear specification (3.1), suppose that [A1] and [A3] hold. Then under the null hypothesis (3.10) with R a $1 \times k$ vector,
\[
\tau \sim t(T - k),
\]
where $\tau$ is given by (3.12).
Proof: We first write the statistic $\tau$ as
\[
\tau = \frac{R\hat{\beta}_T - r}{\sigma_o [R(X'X)^{-1}R']^{1/2}}
\bigg/ \sqrt{\frac{(T - k)\hat{\sigma}_T^2/\sigma_o^2}{T - k}},
\]
where the numerator is distributed as $N(0, 1)$ by (3.11), and $(T - k)\hat{\sigma}_T^2/\sigma_o^2$ is distributed as $\chi^2(T - k)$ by Theorem 3.7(b). Hence, the square of the denominator is a central $\chi^2$ random variable divided by its degrees of freedom $T - k$. The assertion follows if we can show that the numerator and denominator are independent. Note that the random components of the numerator and denominator are, respectively, $\hat{\beta}_T$ and $\hat{e}'\hat{e}$, where $\hat{\beta}_T$ and $\hat{e}$ are two normally distributed random vectors with the covariance matrix
\[
\mathrm{cov}(\hat{e}, \hat{\beta}_T) = \mathrm{IE}[(I_T - P)(y - X\beta_o)y'X(X'X)^{-1}]
= (I_T - P)\,\mathrm{IE}[(y - X\beta_o)y']X(X'X)^{-1}
= \sigma_o^2 (I_T - P)X(X'X)^{-1}
= 0.
\]
Since uncorrelated normal random vectors are also independent, $\hat{\beta}_T$ is independent of $\hat{e}$. By Lemma 2.1, we conclude that $\hat{\beta}_T$ is also independent of $\hat{e}'\hat{e}$.
The statistic $\tau$ is known as the t statistic, and its null distribution is $t(T - k)$ by Theorem 3.9. When the alternative hypothesis is $R\beta_o \neq r$, this is a two-sided test; when the alternative hypothesis is $R\beta_o > r$ (or $R\beta_o < r$), this is a one-sided test. For each test, we first choose a small significance level $\alpha$ and then determine the critical
region $C_\alpha$. For the two-sided t test, we can find the critical values $\pm t_{\alpha/2}(T - k)$ from the t table such that
\[
\alpha = \mathrm{IP}\{\tau < -t_{\alpha/2}(T - k) \text{ or } \tau > t_{\alpha/2}(T - k)\}
= 1 - \mathrm{IP}\{-t_{\alpha/2}(T - k) \le \tau \le t_{\alpha/2}(T - k)\}.
\]
The critical region is then
\[
C_\alpha = (-\infty, -t_{\alpha/2}(T - k)) \cup (t_{\alpha/2}(T - k), \infty),
\]
and $\pm t_{\alpha/2}(T - k)$ are the critical values at the significance level $\alpha$. For the alternative hypothesis $R\beta_o > r$, the critical region is $(t_\alpha(T - k), \infty)$, where $t_\alpha(T - k)$ is the critical value such that
\[
\alpha = \mathrm{IP}\{\tau > t_\alpha(T - k)\}.
\]
Similarly, for the alternative $R\beta_o < r$, the critical region is $(-\infty, -t_\alpha(T - k))$.
The null hypothesis is rejected at the significance level $\alpha$ when $\tau$ falls in the critical region. As $\alpha$ is small, the event $\{\tau \in C_\alpha\}$ is unlikely under the null hypothesis. When $\tau$ does take an extreme value relative to the critical values, it is evidence against the null hypothesis. The decision of rejecting the null hypothesis could be wrong, but the probability of making the type I error will not exceed $\alpha$. When $\tau$ takes a "reasonable" value, in the sense that it falls in the complement of the critical region, the null hypothesis is not rejected.
Example 3.10 To test whether a single coefficient equals zero, i.e., $\beta_i = 0$, we choose R as the transpose of the i-th Cartesian unit vector:
\[
R = [\, 0 \; \cdots \; 0 \; 1 \; 0 \; \cdots \; 0 \,].
\]
Let $m_{ii}$ be the i-th diagonal element of $M^{-1} = (X'X)^{-1}$. Then, $R(X'X)^{-1}R' = m_{ii}$. The t statistic for this hypothesis, also known as the t ratio, is
\[
\tau = \frac{\hat{\beta}_{i,T}}{\hat{\sigma}_T \sqrt{m_{ii}}} \sim t(T - k).
\]
When a t ratio rejects the null hypothesis, it is said that the corresponding estimated coefficient is significantly different from zero; econometric packages usually report t ratios along with the coefficient estimates.
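A compact Python sketch of the t ratio computation in Example 3.10 (illustrative only, not from the original text; it assumes NumPy/SciPy and simulated data). The reported p-values are two-sided.

```python
import numpy as np
from scipy.stats import t as t_dist

def t_ratios(y, X):
    """OLS coefficient estimates, their t ratios, and two-sided p-values."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ y)
    e = y - X @ b
    s2 = (e @ e) / (T - k)
    se = np.sqrt(s2 * np.diag(XtX_inv))          # sigma_hat * sqrt(m_ii)
    tau = b / se
    pval = 2 * t_dist.sf(np.abs(tau), df=T - k)
    return b, tau, pval

rng = np.random.default_rng(9)
T = 80
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=T)
for b_i, tau_i, p_i in zip(*t_ratios(y, X)):
    print(f"estimate {b_i: .3f}   t ratio {tau_i: .2f}   p-value {p_i:.3f}")
```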
Example 3.11 To test the single hypothesis $\beta_i + \beta_j = 0$, we set R as
\[
R = [\, 0 \; \cdots \; 0 \; 1 \; 0 \; \cdots \; 0 \; 1 \; 0 \; \cdots \; 0 \,].
\]
Hence, $R(X'X)^{-1}R' = m_{ii} + 2m_{ij} + m_{jj}$, where $m_{ij}$ is the (i, j)-th element of $M^{-1} = (X'X)^{-1}$. The t statistic is
\[
\tau = \frac{\hat{\beta}_{i,T} + \hat{\beta}_{j,T}}{\hat{\sigma}_T (m_{ii} + 2m_{ij} + m_{jj})^{1/2}} \sim t(T - k).
\]
Several hypotheses can also be tested jointly. Consider the null hypothesis $R\beta_o = r$, where R is now a $q \times k$ matrix ($q \ge 2$) and r is a vector. This hypothesis involves q single hypotheses. Similar to (3.11), we have under the null hypothesis that
\[
[R(X'X)^{-1}R']^{-1/2}(R\hat{\beta}_T - r)/\sigma_o \sim N(0, I_q).
\]
Therefore,
\[
(R\hat{\beta}_T - r)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta}_T - r)/\sigma_o^2 \sim \chi^2(q). \tag{3.13}
\]
Again, we can replace $\sigma_o^2$ by its OLS estimator $\hat{\sigma}_T^2$ to obtain an operational statistic:
\[
\varphi = \frac{(R\hat{\beta}_T - r)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta}_T - r)}{\hat{\sigma}_T^2 \, q}. \tag{3.14}
\]
The next result gives the null distribution of $\varphi$.
Theorem 3.12 Given the linear specification (3.1), suppose that [A1] and [A3] hold. Then under the null hypothesis (3.10) with R a $q \times k$ matrix with rank $q < k$, we have
\[
\varphi \sim F(q, T - k),
\]
where $\varphi$ is given by (3.14).
Proof: Note that
\[
\varphi = \frac{(R\hat{\beta}_T - r)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta}_T - r)/(\sigma_o^2 q)}{\left[(T - k)\hat{\sigma}_T^2/\sigma_o^2\right]/(T - k)}.
\]
In view of (3.13) and the proof of Theorem 3.9, the numerator and denominator are two independent $\chi^2$ random variables, each divided by its degrees of freedom. The assertion follows from the definition of the F random variable.
The statistic $\varphi$ is known as the F statistic. We reject the null hypothesis at the significance level $\alpha$ when $\varphi$ is too large relative to the critical value $F_\alpha(q, T - k)$ from the F table, where $F_\alpha(q, T - k)$ is such that
\[
\alpha = \mathrm{IP}\{\varphi > F_\alpha(q, T - k)\}.
\]
If there is only a single hypothesis, the F statistic is just the square of the corresponding t statistic. When $\varphi$ rejects the null hypothesis, it simply suggests that there is evidence against at least one of the single hypotheses. The inference of a joint test is, however, not necessarily the same as the inferences of the individual tests; see also Section 3.5.
Example 3.13 Joint null hypothesis: $H_0\colon \beta_1 = b_1$ and $\beta_2 = b_2$. The F statistic is
\[
\varphi = \frac{1}{2\hat{\sigma}_T^2}
\begin{bmatrix} \hat{\beta}_{1,T} - b_1 & \hat{\beta}_{2,T} - b_2 \end{bmatrix}
\begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix}^{-1}
\begin{bmatrix} \hat{\beta}_{1,T} - b_1 \\ \hat{\beta}_{2,T} - b_2 \end{bmatrix}
\sim F(2, T - k),
\]
where $m_{ij}$ is as defined in Example 3.11.
Remark: For the null hypothesis that s coefficients are zero, if the corresponding F statistic $\varphi > 1$ ($\varphi < 1$), dropping these s regressors will reduce (increase) $\bar{R}^2$; see Exercise 3.12.
3.4.2 Power of the Tests
Recall that the power of a test is the probability of rejecting the null hypothesis when the null hypothesis is indeed false. In this section, we consider the hypothesis $R\beta_o = r + \delta$, where $\delta$ characterizes the deviation from the null hypothesis, and analyze the power performance of the t and F tests.
Theorem 3.14 Given the linear specification (3.1), suppose that [A1] and [A3] hold. Then under the hypothesis that $R\beta_o = r + \delta$, where R is a $q \times k$ matrix with rank $q < k$, we have
\[
\varphi \sim F(q, T - k; \delta'D^{-1}\delta, 0),
\]
where $\varphi$ is given by (3.14), $D = \sigma_o^2 [R(X'X)^{-1}R']$, and $\delta'D^{-1}\delta$ is the non-centrality parameter of the numerator term.
Proof: When $R\beta_o = r + \delta$,
\[
[R(X'X)^{-1}R']^{-1/2}(R\hat{\beta}_T - r)/\sigma_o
= [R(X'X)^{-1}R']^{-1/2}[R(\hat{\beta}_T - \beta_o) + \delta]/\sigma_o.
\]
Given [A3],
\[
[R(X'X)^{-1}R']^{-1/2}R(\hat{\beta}_T - \beta_o)/\sigma_o \sim N(0, I_q),
\]
and hence
\[
[R(X'X)^{-1}R']^{-1/2}(R\hat{\beta}_T - r)/\sigma_o \sim N(D^{-1/2}\delta, I_q).
\]
It follows from Lemma 2.7 that
\[
(R\hat{\beta}_T - r)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta}_T - r)/\sigma_o^2 \sim \chi^2(q; \delta'D^{-1}\delta),
\]
which is the non-central $\chi^2$ distribution with q degrees of freedom and non-centrality parameter $\delta'D^{-1}\delta$. This is in contrast with (3.13), which has a central $\chi^2$ distribution under the null hypothesis. As $(T - k)\hat{\sigma}_T^2/\sigma_o^2$ is still distributed as $\chi^2(T - k)$ by Theorem 3.7(b), the assertion follows because the numerator and denominator of $\varphi$ are independent.
Clearly, when the null hypothesis is correct, we have $\delta = 0$, so that $\varphi \sim F(q, T - k)$. Theorem 3.14 thus includes Theorem 3.12 as a special case. In particular, for testing a single hypothesis, we have
\[
\tau \sim t(T - k; D^{-1/2}\delta),
\]
which reduces to $t(T - k)$ when $\delta = 0$, as in Theorem 3.9.
Theorem 3.14 implies that when $R\beta_o$ deviates farther from the hypothetical value r, the non-centrality parameter $\delta'D^{-1}\delta$ increases, and so does the power. We illustrate this point using the following two examples, where the powers are computed using the GAUSS program. For the null distribution F(2, 20), the critical value at the 5% level is 3.49. Then for $F(2, 20; \nu_1, 0)$ with the non-centrality parameter $\nu_1 = 1, 3, 5$, the probabilities that $\varphi$ exceeds 3.49 are approximately 12.1%, 28.2%, and 44.3%, respectively. For the null distribution F(5, 60), the critical value at the 5% level is 2.37. Then for $F(5, 60; \nu_1, 0)$ with $\nu_1 = 1, 3, 5$, the probabilities that $\varphi$ exceeds 2.37 are approximately 9.4%, 20.5%, and 33.2%, respectively. In both cases, the power increases with the non-centrality parameter.
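These probabilities were computed with GAUSS in the text; they can be reproduced with any package that provides the non-central F distribution. The sketch below uses SciPy instead (an added illustration, not the original computation).

```python
from scipy.stats import f as f_dist, ncf

# IP(phi > 5% critical value) when phi is non-central F with non-centrality nc.
for q, dof in [(2, 20), (5, 60)]:
    crit = f_dist.ppf(0.95, q, dof)                  # 3.49 and 2.37 in the text
    powers = [ncf.sf(crit, q, dof, nc) for nc in (1, 3, 5)]
    print(f"F({q},{dof}): critical value {crit:.2f}, "
          f"powers {[round(p, 3) for p in powers]}")
```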
3.4.3 An Alternative Approach
Given the specification (3.1), we may take the constraint $R\beta_o = r$ into account and consider the constrained OLS estimation that finds the saddle point of the Lagrangian:
\[
\min_{\beta, \lambda} \; \frac{1}{T}(y - X\beta)'(y - X\beta) + (R\beta - r)'\lambda,
\]
where $\lambda$ is the $q \times 1$ vector of Lagrange multipliers. It is straightforward to show that the solutions are
\[
\ddot{\lambda}_T = 2[R(X'X/T)^{-1}R']^{-1}(R\hat{\beta}_T - r),
\qquad
\ddot{\beta}_T = \hat{\beta}_T - (X'X/T)^{-1}R'\ddot{\lambda}_T/2, \tag{3.15}
\]
which will be referred to as the constrained OLS estimators.
Given $\ddot{\beta}_T$, the vector of constrained OLS residuals is
\[
\ddot{e} = y - X\ddot{\beta}_T = y - X\hat{\beta}_T + X(\hat{\beta}_T - \ddot{\beta}_T) = \hat{e} + X(\hat{\beta}_T - \ddot{\beta}_T).
\]
It follows from (3.15) that
\[
\hat{\beta}_T - \ddot{\beta}_T = (X'X/T)^{-1}R'\ddot{\lambda}_T/2
= (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat{\beta}_T - r).
\]
The inner product of $\ddot{e}$ is then
\[
\ddot{e}'\ddot{e} = \hat{e}'\hat{e} + (\hat{\beta}_T - \ddot{\beta}_T)'X'X(\hat{\beta}_T - \ddot{\beta}_T)
= \hat{e}'\hat{e} + (R\hat{\beta}_T - r)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta}_T - r).
\]
Note that the second term on the right-hand side is nothing but the numerator of the F statistic (3.14). The F statistic can now be written as
\[
\varphi = \frac{\ddot{e}'\ddot{e} - \hat{e}'\hat{e}}{q \, \hat{\sigma}_T^2}
= \frac{(\mathrm{ESS}_c - \mathrm{ESS}_u)/q}{\mathrm{ESS}_u/(T - k)}, \tag{3.16}
\]
where $\mathrm{ESS}_c = \ddot{e}'\ddot{e}$ and $\mathrm{ESS}_u = \hat{e}'\hat{e}$ denote, respectively, the ESS resulting from the constrained and unconstrained estimations. Dividing the numerator and denominator of (3.16) by the centered TSS $(y'y - T\bar{y}^2)$ yields another equivalent expression for $\varphi$:
\[
\varphi = \frac{(R_u^2 - R_c^2)/q}{(1 - R_u^2)/(T - k)}, \tag{3.17}
\]
where $R_c^2$ and $R_u^2$ are, respectively, the centered coefficients of determination of the constrained and unconstrained estimations. As the numerator of (3.17), $R_u^2 - R_c^2$, can be interpreted as the loss of fit due to the imposed constraint, the F test is in effect a loss-of-fit test. The null hypothesis is rejected when the constrained specification fits the data much worse.
Example 3.15 Consider the specication: y
t
=
1
+
2
x
t2
+
3
x
t3
+ e
t
. Given the
hypothesis (constraint)
2
=
3
, the resulting constrained specication is
y
t
=
1
+
2
(x
t2
+x
t3
) +e
t
.
c _ Chung-Ming Kuan, 2001
3.5. CONFIDENCE REGIONS 65
By estimating these two specications separately, we obtain ESS
u
and ESS
c
, from which
the F statistic can be easily computed.
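A hypothetical numerical illustration of Example 3.15 is sketched below (the data and variable names are made up, not from the original text): the unconstrained and constrained specifications are estimated separately by OLS and the F statistic is formed as in (3.16).

import numpy as np

def ess(y, X):
    # Sum of squared OLS residuals from regressing y on X.
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(0)
T = 100
x2, x3 = rng.normal(size=T), rng.normal(size=T)
y = 1.0 + 0.5 * x2 + 0.5 * x3 + rng.normal(size=T)

X_u = np.column_stack([np.ones(T), x2, x3])     # unconstrained: k = 3
X_c = np.column_stack([np.ones(T), x2 + x3])    # constrained: beta_2 = beta_3, so q = 1
ESS_u, ESS_c = ess(y, X_u), ess(y, X_c)
q, k = 1, X_u.shape[1]
F = ((ESS_c - ESS_u) / q) / (ESS_u / (T - k))   # the statistic (3.16)
print(F)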
Example 3.16 Test the null hypothesis that all the coefficients (except the constant term) equal zero. The resulting constrained specification is $y_t = \beta_1 + e_t$, so that $R^2_c = 0$. Then, (3.17) becomes
$\varphi = \frac{R^2_u/(k-1)}{(1 - R^2_u)/(T-k)} \sim F(k-1, T-k),$
which requires only estimation of the unconstrained specification. This test statistic is also routinely reported by most econometric packages and is known as the regression F test.
3.5 Confidence Regions
In addition to point estimators for parameters, we may also be interested in finding confidence intervals for parameters. A confidence interval for $\beta_{i,o}$ with the confidence coefficient $(1-\alpha)$ is the interval $(\underline g_\alpha, \bar g_\alpha)$ that satisfies
$\mathrm{IP}\{\underline g_\alpha \le \beta_{i,o} \le \bar g_\alpha\} = 1 - \alpha.$
That is, we are $(1-\alpha)\times 100$ percent sure that such an interval would include the true parameter $\beta_{i,o}$.
From Theorem 3.9, we know
$\mathrm{IP}\left\{-t_{\alpha/2}(T-k) \le \frac{\hat\beta_{i,T} - \beta_{i,o}}{\hat\sigma_T\sqrt{m_{ii}}} \le t_{\alpha/2}(T-k)\right\} = 1 - \alpha,$
where $m_{ii}$ is the $i$-th diagonal element of $(X'X)^{-1}$, and $t_{\alpha/2}(T-k)$ is the critical value of the (two-sided) t test at the significance level $\alpha$. Equivalently, we have
$\mathrm{IP}\{\hat\beta_{i,T} - t_{\alpha/2}(T-k)\,\hat\sigma_T\sqrt{m_{ii}} \le \beta_{i,o} \le \hat\beta_{i,T} + t_{\alpha/2}(T-k)\,\hat\sigma_T\sqrt{m_{ii}}\} = 1 - \alpha.$
This shows that the confidence interval for $\beta_{i,o}$ can be constructed by setting
$\underline g_\alpha = \hat\beta_{i,T} - t_{\alpha/2}(T-k)\,\hat\sigma_T\sqrt{m_{ii}}, \qquad \bar g_\alpha = \hat\beta_{i,T} + t_{\alpha/2}(T-k)\,\hat\sigma_T\sqrt{m_{ii}}.$
It should be clear that the greater the confidence coefficient (i.e., the smaller $\alpha$), the larger is the magnitude of the critical value $t_{\alpha/2}(T-k)$ and hence the resulting confidence interval.
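The interval above is easily computed once the OLS results are available. The sketch below (illustrative names only, not from the original text) follows the display exactly, assuming the classical conditions hold.

import numpy as np
from scipy.stats import t

def conf_interval(y, X, i, alpha=0.05):
    # (1 - alpha) confidence interval for the i-th coefficient.
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                    # OLS estimates
    e = y - X @ b
    s = np.sqrt(e @ e / (T - k))             # sigma_hat
    se = s * np.sqrt(XtX_inv[i, i])          # sigma_hat * sqrt(m_ii)
    c = t.ppf(1 - alpha / 2, T - k)          # two-sided t critical value
    return b[i] - c * se, b[i] + c * se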
The confidence region for $R\beta_o$ with the confidence coefficient $(1-\alpha)$ satisfies
$\mathrm{IP}\{(\hat\beta_T - \beta_o)'R'[R(X'X)^{-1}R']^{-1}R(\hat\beta_T - \beta_o)/(q\hat\sigma_T^2) \le F_\alpha(q, T-k)\} = 1 - \alpha,$
where $F_\alpha(q, T-k)$ is the critical value of the F test at the significance level $\alpha$.
Example 3.17 The confidence region for $(\beta_{1,o} = b_1, \beta_{2,o} = b_2)$. Suppose $T - k = 30$ and $\alpha = 0.05$; then $F_{0.05}(2, 30) = 3.32$. In view of Example 3.13,
$\mathrm{IP}\left\{\frac{1}{2\hat\sigma_T^2}\begin{bmatrix}\hat\beta_{1,T} - b_1 & \hat\beta_{2,T} - b_2\end{bmatrix}\begin{bmatrix}m_{11} & m_{12}\\ m_{21} & m_{22}\end{bmatrix}^{-1}\begin{bmatrix}\hat\beta_{1,T} - b_1\\ \hat\beta_{2,T} - b_2\end{bmatrix} \le 3.32\right\} = 0.95,$
which results in an ellipse with the center $(\hat\beta_{1,T}, \hat\beta_{2,T})$.
Remark: A point $(\beta_{1,o}, \beta_{2,o})$ may be outside the joint confidence ellipse but inside the confidence box formed by the individual confidence intervals. Hence, each t ratio may show that the corresponding coefficient is insignificantly different from zero, while the F test indicates that both coefficients are not jointly insignificant. It is also possible that $(\beta_1, \beta_2)$ is outside the confidence box but inside the joint confidence ellipse. That is, each t ratio may show that the corresponding coefficient is significantly different from zero, while the F test indicates that both coefficients are jointly insignificant. See also an illustrative example in Goldberger (1991, Chap. 19).
3.6 Multicollinearity
In Section 3.2.2 we have seen that a linear specification suffers from the problem of exact multicollinearity if the basic identifiability requirement (i.e., X is of full column rank) is not satisfied. In this case, the OLS estimator cannot be computed as (3.4). This problem may be avoided by modifying the postulated specifications.
3.6.1 Near Multicollinearity
In practice, it is more common that explanatory variables are related to some extent but do not satisfy an exact linear relationship. This is usually referred to as the problem of near multicollinearity. But as long as there is no exact multicollinearity, parameters can still be estimated by the OLS method, and the resulting estimator remains the BLUE under [A1] and [A2].
Nevertheless, there are still complaints about near multicollinearity in empirical studies. In some applications, parameter estimates are very sensitive to small changes in the data. It is also possible that individual t ratios are all insignificant, but the regression F statistic is highly significant. These symptoms are usually attributed to near multicollinearity. This is not entirely correct, however. Write $X = [x_i \; X_{-i}]$, where $X_{-i}$ is the submatrix of $X$ excluding the $i$-th column $x_i$. By the result of Theorem 3.3, the variance of $\hat\beta_{i,T}$ can be expressed as
$\mathrm{var}(\hat\beta_{i,T}) = \mathrm{var}([x_i'(I - P_{-i})x_i]^{-1}x_i'(I - P_{-i})y) = \sigma_o^2[x_i'(I - P_{-i})x_i]^{-1},$
where $P_{-i} = X_{-i}(X_{-i}'X_{-i})^{-1}X_{-i}'$. It can also be verified that
$\mathrm{var}(\hat\beta_{i,T}) = \frac{\sigma_o^2}{\sum_{t=1}^T(x_{ti} - \bar x_i)^2\,(1 - R^2(i))},$
where $R^2(i)$ is the centered coefficient of determination from the auxiliary regression of $x_i$ on $X_{-i}$. When $x_i$ is closely related to the other explanatory variables, $R^2(i)$ is high, so that $\mathrm{var}(\hat\beta_{i,T})$ would be large. This explains why $\hat\beta_{i,T}$ are sensitive to data changes and why the corresponding t ratios are likely to be insignificant. Near multicollinearity is not a necessary condition for these problems, however. A large $\mathrm{var}(\hat\beta_{i,T})$ may also arise due to small variations of $x_{ti}$ and/or a large $\sigma_o^2$.
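The decomposition above is easy to check numerically. The sketch below is not from the original text; it assumes X contains a constant column, and it computes the auxiliary $R^2(i)$ together with the factor that multiplies $\sigma_o^2$ in $\mathrm{var}(\hat\beta_{i,T})$.

import numpy as np

def auxiliary_r2(X, i):
    # Centered R^2 from regressing column i of X on the remaining columns.
    xi = X[:, i]
    X_rest = np.delete(X, i, axis=1)
    b, *_ = np.linalg.lstsq(X_rest, xi, rcond=None)
    resid = xi - X_rest @ b
    return 1 - resid @ resid / np.sum((xi - xi.mean()) ** 2)

def variance_factor(X, i):
    # var(beta_i) = sigma_o^2 * variance_factor(X, i), as in the display above.
    xi = X[:, i]
    return 1.0 / (np.sum((xi - xi.mean()) ** 2) * (1 - auxiliary_r2(X, i)))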
Even when a large value of $\mathrm{var}(\hat\beta_{i,T})$ does result from a high $R^2(i)$, there is nothing wrong statistically. It is often claimed that severe multicollinearity can make an important variable look insignificant. As Goldberger (1991) correctly pointed out, this statement simply confuses statistical significance with economic importance. These large variances merely reflect the fact that parameters cannot be precisely estimated from the given data set.
Near multicollinearity is in fact a problem related to data and model specification. If it does cause problems in estimation and hypothesis testing, one may try to break the approximate linear relationship by, e.g., adding more observations to the data set (if plausible) or dropping some variables from the current specification. More sophisticated statistical methods, such as the ridge estimator and principal component regressions, may also be used; details of these methods can be found in other econometrics textbooks.
3.6.2 Digress: Dummy Variables
A linear specification may include some qualitative variables to indicate the presence or absence of certain attributes of the dependent variable. These qualitative variables are typically represented by dummy variables which classify data into different categories.
For example, let $y_i$ denote the annual salary of college teacher $i$ and $x_i$ the years of teaching experience. Consider the dummy variable: $D_i = 1$ if $i$ is a male and $D_i = 0$ if $i$ is a female. Then, the specification
$y_i = \alpha_0 + \alpha_1 D_i + \beta x_i + e_i$
yields two regression lines with different intercepts. The male regression line has the intercept $\alpha_0 + \alpha_1$, and the female regression line has the intercept $\alpha_0$. We may test the hypothesis $\alpha_1 = 0$ to see if there is a difference between the starting salaries of male and female teachers.
This specification can be expanded to incorporate an interaction term between $D$ and $x$:
$y_i = \alpha_0 + \alpha_1 D_i + \beta_0 x_i + \beta_1(D_i x_i) + e_i,$
which yields two regression lines with different intercepts and slopes. The slope of the male regression line is now $\beta_0 + \beta_1$, whereas the slope of the female regression line is $\beta_0$. By testing $\beta_1 = 0$, we can check whether teaching experience is treated the same in determining salaries for male and female teachers.
Suppose that we want to know if the education level of the head of household affects the family consumption pattern. We may classify data into three groups: below high school, high school only, and college or higher. Let $D_{1i} = 1$ if $i$ has a high school degree only and $D_{1i} = 0$ otherwise, and $D_{2i} = 1$ if $i$ has a college or higher degree and $D_{2i} = 0$ otherwise. Then, similar to the previous example, the following specification,
$y_i = \alpha_0 + \alpha_1 D_{1i} + \alpha_2 D_{2i} + \beta x_i + e_i,$
yields three regression lines. The below-high-school regression line has the intercept $\alpha_0$, the high-school regression line has the intercept $\alpha_0 + \alpha_1$, and the college regression line has the intercept $\alpha_0 + \alpha_2$. Various interesting hypotheses can be tested based on this specification.
Remark: The preceding examples show that, when a specification contains a constant term, the number of dummy variables is always one less than the number of categories that the dummy variables try to classify. Otherwise, the specification has exact multicollinearity; this is the so-called dummy variable trap.
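The small sketch below (hypothetical data, not from the original text) illustrates the point of the remark: with a constant term and three categories, only two dummy variables are included, so the design matrix retains full column rank and the dummy variable trap is avoided.

import numpy as np

educ = np.array(["below", "high", "college", "high", "below", "college"])
x = np.array([3.0, 5.0, 2.0, 8.0, 1.0, 6.0])       # e.g., years of experience

D1 = (educ == "high").astype(float)                # high school only
D2 = (educ == "college").astype(float)             # college or higher
X = np.column_stack([np.ones_like(x), D1, D2, x])  # "below high school" is the base category
print(np.linalg.matrix_rank(X))                    # 4: full column rank, no dummy variable trap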
3.7. LIMITATIONS OF THE CLASSICAL CONDITIONS 69
3.7 Limitations of the Classical Conditions
The previous estimation and testing results are based on the classical conditions. As
these conditions may be violated in practice, it is important to understand their limi-
tations.
Condition [A1] postulates that explanatory variables are non-stochastic. Although
this condition is quite convenient and facilitates our analysis, it is not practical. When
the dependent variable and regressors are economic variables, it does not make too
much sense to treat only the dependent variable as a random variable. This condition
may also be violated when a lagged dependent variable is included as a regressor, as in
many time-series analysis. Hence, it would be more reasonable to allow regressors to be
random as well.
In [A2](i), the linear specication X is assumed to be correct up to some unknown
parameters. It is possible that the systematic component IE(y) is in fact a non-linear
function of X. If so, the estimated regression hyperplane could be very misleading. For
example, an economic relation may change from one regime to another at some time
point so that IE(y) is better characterized by a piecewise liner function. This is known
as the problem of structural change; see e.g., Exercise 3.14. Even when IE(y) is a linear
function, the specied X may include some irrelevant variables or omit some important
variables. Example 3.6 shows that in the former case, the OLS estimator

T
remains
unbiased but is less ecient. In the latter case, it can be shown that

T
is biased but
with a smaller variance-covariance matrix; see Exercise 3.6.
Condition [A2](ii) may also easily break down in many applications. For example,
when y
t
is the consumption of the t
th
household, it is likely that y
t
has smaller variation
for low-income families than for high-income families. When y
t
denotes the GDP growth
rate of the t
th
year, it is also likely that y
t
are correlated over time. In both cases, the
variance-covariance matrix of y cannot be expressed as
2
o
I
T
. A consequence of the
failure of [A2](ii) is that the OLS estimator for var(

T
),
2
T
(X

X)
1
, is biased, which
in turn renders the tests discussed in Section 3.4 invalid.
Condition [A3] may fail when y
t
have non-normal distributions. Although the BLUE
property of the OLS estimator does not depend on normality, [A3] is crucial for deriving
the distribution results in Section 3.4. When [A3] is not satised, the usual t and F
tests do not have the desired t and F distributions, and their exact distributions are
typically unknown. This causes serious problems for hypothesis testing.
Our discussion thus far suggests that the classical conditions are quite restrictive.
c _ Chung-Ming Kuan, 2001
70 CHAPTER 3. CLASSICAL LEAST SQUARES THEORY
In subsequent chapters, we will try to relax these conditions and discuss more generally
applicable methods. These methods play an important role in contemporary empirical
studies.
Exercises
3.1 Construct a linear regression model for each equation below:
$y = \alpha x^\beta, \qquad y = \alpha e^{\beta x}, \qquad y = \frac{x}{\alpha x - \beta}, \qquad y = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}.$
3.2 Use the general formula (3.4) to find the OLS estimators from the specifications below:
$y_t = \alpha + \beta x_t + e_t, \quad t = 1, \ldots, T,$
$y_t = \alpha + \beta(x_t - \bar x) + e_t, \quad t = 1, \ldots, T,$
$y_t = \beta x_t + e_t, \quad t = 1, \ldots, T.$
Compare the resulting regression lines.
3.3 Given the specification $y_t = \alpha + \beta x_t + e_t$, $t = 1, \ldots, T$, assume that the classical conditions hold. Let $\hat\alpha_T$ and $\hat\beta_T$ be the OLS estimators for $\alpha$ and $\beta$, respectively.
(a) Apply the general formula of Theorem 3.4(c) to show that
$\mathrm{var}(\hat\alpha_T) = \sigma_o^2\,\frac{\sum_{t=1}^T x_t^2}{T\sum_{t=1}^T(x_t - \bar x)^2}, \qquad \mathrm{var}(\hat\beta_T) = \sigma_o^2\,\frac{1}{\sum_{t=1}^T(x_t - \bar x)^2}, \qquad \mathrm{cov}(\hat\alpha_T, \hat\beta_T) = -\sigma_o^2\,\frac{\bar x}{\sum_{t=1}^T(x_t - \bar x)^2}.$
What kind of data can make the variances of the OLS estimators smaller?
(b) Suppose that a prediction $\hat y_{T+1} = \hat\alpha_T + \hat\beta_T x_{T+1}$ is made based on the new observation $x_{T+1}$. Show that
$\mathrm{IE}(\hat y_{T+1} - y_{T+1}) = 0, \qquad \mathrm{var}(\hat y_{T+1} - y_{T+1}) = \sigma_o^2\left[1 + \frac{1}{T} + \frac{(x_{T+1} - \bar x)^2}{\sum_{t=1}^T(x_t - \bar x)^2}\right].$
What kind of $x_{T+1}$ can make the variance of the prediction error smaller?
3.4 Given the specification (3.1), suppose that $X$ is not of full column rank. Does there exist a unique $\hat y \in \mathrm{span}(X)$ that minimizes $(y - \hat y)'(y - \hat y)$? If yes, is there a unique $\hat\beta_T$ such that $\hat y = X\hat\beta_T$? Why or why not?
3.5 Given the estimated model
$\hat y_t = \hat\beta_{1,T} + \hat\beta_{2,T}x_{t2} + \cdots + \hat\beta_{k,T}x_{tk} + \hat e_t,$
consider the standardized regression:
$y^*_t = \hat\beta^*_{2,T}x^*_{t2} + \cdots + \hat\beta^*_{k,T}x^*_{tk} + \hat e^*_t,$
where the $\hat\beta^*_{i,T}$ are known as the beta coefficients, and
$y^*_t = \frac{y_t - \bar y}{s_y}, \qquad x^*_{ti} = \frac{x_{ti} - \bar x_i}{s_{x_i}}, \qquad \hat e^*_t = \frac{\hat e_t}{s_y},$
with $s_y^2 = (T-1)^{-1}\sum_{t=1}^T(y_t - \bar y)^2$ the sample variance of $y_t$ and, for each $i$, $s_{x_i}^2 = (T-1)^{-1}\sum_{t=1}^T(x_{ti} - \bar x_i)^2$ the sample variance of $x_{ti}$. What is the relationship between $\hat\beta^*_{i,T}$ and $\hat\beta_{i,T}$? Give an interpretation of the beta coefficients.
3.6 Given the following specification
$y = X_1\beta_1 + e,$
where $X_1$ ($T \times k_1$) is a non-stochastic matrix, let $\hat b_{1,T}$ denote the resulting OLS estimator. Suppose that $\mathrm{IE}(y) = X_1 b_1 + X_2 b_2$ for some $b_1$ and $b_2$, where $X_2$ ($T \times k_2$) is also a non-stochastic matrix and $b_2 \neq 0$.
(a) Is $\hat b_{1,T}$ unbiased?
(b) Is $\hat\sigma_T^2$ unbiased?
(c) What is $\mathrm{var}(\hat b_{1,T})$?
(d) Let $\hat\beta_T = (\hat\beta_{1,T}' \; \hat\beta_{2,T}')'$ denote the OLS estimator obtained from estimating the specification $y = X_1\beta_1 + X_2\beta_2 + e$. Compare $\mathrm{var}(\hat\beta_{1,T})$ and $\mathrm{var}(\hat b_{1,T})$.
(e) Does your result in (d) change when $X_1'X_2 = 0$?
3.7 Given the specification (3.1), will the changes below affect the resulting OLS estimator $\hat\beta_T$, the t ratios, and $R^2$?
(a) $y^* = 1000\,y$ and $X$ are used as the dependent and explanatory variables.
(b) $y$ and $X^* = 1000\,X$ are used as the dependent and explanatory variables.
(c) $y^*$ and $X^*$ are used as the dependent and explanatory variables.
3.8 Let $R^2_k$ denote the centered $R^2$ obtained from the model with $k$ explanatory variables.
(a) Show that
$R^2_k = \frac{\sum_{i=1}^k\hat\beta_{iT}\sum_{t=1}^T(x_{ti} - \bar x_i)y_t}{\sum_{t=1}^T(y_t - \bar y)^2},$
where $\hat\beta_{iT}$ is the $i$-th element of $\hat\beta_T$, $\bar x_i = \sum_{t=1}^T x_{ti}/T$, and $\bar y = \sum_{t=1}^T y_t/T$.
(b) Show that $R^2_k \ge R^2_{k-1}$.
3.9 Consider the following two regression lines: $\hat y = \hat\alpha + \hat\beta x$ and $\hat x = \hat\gamma + \hat\delta y$. At which point do these two lines intersect? Use the result in Exercise 3.8 to show that these two regression lines coincide if and only if the centered $R^2$'s for both regressions are one.
3.10 Given the specification (3.1), suppose that $X$ does not contain the constant term. Show that the centered $R^2$ need not be bounded between zero and one if it is computed as (3.7).
3.11 Rearrange the matrix $X$ as $[x_i \; X_{-i}]$, where $x_i$ is the $i$-th column of $X$. Let $u_i$ and $v_i$ denote the residual vectors of regressing $y$ on $X_{-i}$ and $x_i$ on $X_{-i}$, respectively. Define the partial correlation coefficient of $y$ and $x_i$ as
$r_i = \frac{u_i'v_i}{(u_i'u_i)^{1/2}(v_i'v_i)^{1/2}}.$
Let $R^2_i$ and $R^2$ be obtained from the regressions of $y$ on $X_{-i}$ and $y$ on $X$, respectively.
(a) Apply the Frisch-Waugh-Lovell Theorem to show
$I - P = (I - P_{-i}) - \frac{(I - P_{-i})x_i x_i'(I - P_{-i})}{x_i'(I - P_{-i})x_i},$
where $P = X(X'X)^{-1}X'$ and $P_{-i} = X_{-i}(X_{-i}'X_{-i})^{-1}X_{-i}'$. Note that this result can also be derived using the matrix inversion formula (e.g., Greene (1993, p. 27)).
(b) Show that $(1 - R^2)/(1 - R^2_i) = 1 - r^2_i$, and use this result to verify
$R^2 - R^2_i = r^2_i(1 - R^2_i).$
What does this result tell you?
(c) Let $\tau_i$ denote the t ratio of $\hat\beta_{iT}$, the $i$-th element of $\hat\beta_T$ obtained from regressing $y$ on $X$. First show that $\tau^2_i = (T-k)r^2_i/(1 - r^2_i)$, and use this result to verify
$r^2_i = \tau^2_i/(\tau^2_i + T - k).$
(d) Combine the results in (b) and (c) to show
$R^2 - R^2_i = \tau^2_i(1 - R^2)/(T - k).$
What does this result tell you?
3.12 Suppose that a linear model with $k$ explanatory variables has been estimated.
(a) Show that $\hat\sigma_T^2 = \text{Centered TSS}\times(1 - \bar R^2)/(T-1)$. What does this result tell you?
(b) Suppose that we want to test the hypothesis that $s$ coefficients are zero. Show that the F statistic can be written as
$\varphi = \frac{(T-k+s)\hat\sigma^2_c - (T-k)\hat\sigma^2_u}{s\,\hat\sigma^2_u},$
where $\hat\sigma^2_c$ and $\hat\sigma^2_u$ are the variance estimates of the constrained and unconstrained models, respectively. Let $a = (T-k)/s$. Show that
$\frac{\hat\sigma^2_c}{\hat\sigma^2_u} = \frac{a + \varphi}{a + 1}.$
(c) Based on the results in (a) and (b), what can you say when $\varphi > 1$ and $\varphi < 1$?
3.13 For the linear specification $y = X\beta + e$, an alternative expression of $k - m$ linear restrictions on $\beta$ is $\beta = S\theta + d$, where $\theta$ is an $m$-dimensional vector of unknown parameters, $S$ is a $k \times m$ matrix of pre-specified constants with full column rank, and $d$ is a vector of pre-specified constants.
(a) By incorporating this restriction into the specification, find the OLS estimator $\hat\theta$ of $\theta$.
(b) The constrained least squares estimator of $\beta$ is $\hat\beta_c = S\hat\theta + d$. Show that
$\hat\beta_c = Q_S\hat\beta + (I - Q_S)d,$
where $Q_S = S(S'X'XS)^{-1}S'X'X$. Is this decomposition orthogonal?
(c) Show that
$X\hat\beta_c = P_{XS}y + (I - P_{XS})Xd,$
where $P_{XS} = XS(S'X'XS)^{-1}S'X'$. Use a graph to illustrate this result.
3.14 (The Chow Test) Consider the model of a one-time structural change at a known change point:
$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 & 0 \\ X_2 & X_2 \end{bmatrix}\begin{bmatrix} \beta_o \\ \delta_o \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \end{bmatrix},$
where $y_1$ and $y_2$ are $T_1 \times 1$ and $T_2 \times 1$, and $X_1$ and $X_2$ are $T_1 \times k$ and $T_2 \times k$, respectively. The null hypothesis is $\delta_o = 0$. How would you test this hypothesis based on the constrained and unconstrained models?
References
Davidson, Russell and James G. MacKinnon (1993). Estimation and Inference in Econometrics, New York, NY: Oxford University Press.
Goldberger, Arthur S. (1991). A Course in Econometrics, Cambridge, MA: Harvard University Press.
Greene, William H. (2000). Econometric Analysis, 4th ed., Upper Saddle River, NJ: Prentice Hall.
Harvey, Andrew C. (1990). The Econometric Analysis of Time Series, Second edition, Cambridge, MA: MIT Press.
Intriligator, Michael D., Ronald G. Bodkin, and Cheng Hsiao (1996). Econometric Models, Techniques, and Applications, Second edition, Upper Saddle River, NJ: Prentice Hall.
Johnston, J. (1984). Econometric Methods, Third edition, New York, NY: McGraw-Hill.
Judge, George G., R. Carter Hill, William E. Griffiths, Helmut Lütkepohl, and Tsoung-Chao Lee (1988). Introduction to the Theory and Practice of Econometrics, Second edition, New York, NY: Wiley.
Maddala, G. S. (1992). Introduction to Econometrics, Second edition, New York, NY: Macmillan.
Manski, Charles F. (1991). "Regression," Journal of Economic Literature, 29, 34-50.
Rao, C. Radhakrishna (1973). Linear Statistical Inference and Its Applications, Second edition, New York, NY: Wiley.
Ruud, Paul A. (2000). An Introduction to Classical Econometric Theory, New York, NY: Oxford University Press.
Theil, Henri (1971). Principles of Econometrics, New York, NY: Wiley.
Chapter 4
Generalized Least Squares Theory
4.1 Introduction
In Section 3.7 we have seen that the classical conditions need not hold in practice. Although these conditions have no effect on the OLS method per se, they do affect the properties of the OLS estimators and the resulting test statistics. In particular, when the elements of y have unequal variances and/or are correlated, there is no guarantee that the OLS estimator is the most efficient within the class of linear unbiased (or the class of unbiased) estimators. Moreover, hypothesis testing based on the standard OLS estimator of the variance-covariance matrix becomes invalid.
In this chapter, the method of generalized least squares (GLS) is introduced to improve upon estimation efficiency. A drawback of the GLS method is that it is difficult to implement. In practice, certain structures (assumptions) must be imposed on var(y) so that a feasible GLS estimator can be computed. This approach results in two further difficulties, however. First, the postulated structures on var(y) need not be correctly specified. Consequently, the resulting feasible GLS estimator may not be as efficient as one would like. Second, the finite-sample properties of feasible GLS estimators are not easy to establish. Exact tests based on the feasible GLS results are thus not readily available. More detailed discussions of the GLS theory can also be found in, e.g., Amemiya (1985) and Greene (2000).
4.2 The Method of Generalized Least Squares
4.2.1 When y Does Not Have a Scalar Covariance Matrix
Given the linear specification (3.1):
$y = X\beta + e,$
suppose that, in addition to the conditions [A1] and [A2](i),
$\mathrm{var}(y) = \Sigma_o,$
where $\Sigma_o$ is a positive definite matrix but cannot be written as $\sigma_o^2 I_T$ for any positive number $\sigma_o^2$. That is, the elements of $y$ may not have a constant variance, nor are they required to be uncorrelated. As [A1] and [A2](i) still hold, the OLS estimator $\hat\beta_T$ remains unbiased by Theorem 3.4(a), and
$\mathrm{var}(\hat\beta_T) = (X'X)^{-1}X'\Sigma_o X(X'X)^{-1},$   (4.1)
in contrast with Theorem 3.4(c). In view of Theorem 3.5, there is no guarantee that the OLS estimator is the BLUE for $\beta_o$. Similarly, when [A3] fails such that
$y \sim N(X\beta_o, \Sigma_o),$
we have
$\hat\beta_T \sim N(\beta_o, (X'X)^{-1}X'\Sigma_o X(X'X)^{-1});$
cf. Theorem 3.7(a). In this case, $\hat\beta_T$ need not be the BUE for $\beta_o$.
Apart from efficiency, a more serious consequence of the failure of [A3] is that the statistical tests based on the standard OLS estimation results become invalid. Recall that the OLS estimator for $\mathrm{var}(\hat\beta_T)$ is
$\widehat{\mathrm{var}}(\hat\beta_T) = \hat\sigma_T^2(X'X)^{-1},$
which is, in general, a biased estimator for (4.1). As the t and F statistics depend on the elements of $\widehat{\mathrm{var}}(\hat\beta_T)$, they no longer have the desired t and F distributions under the null hypothesis. Consequently, the inferences based on these tests become invalid.
4.2.2 The GLS Estimator
The GLS method focuses on the efficiency issue resulting from the failure of the classical condition [A2](ii). Let $G$ be a $T \times T$ non-stochastic matrix. Consider the transformed specification
$Gy = GX\beta + Ge,$
where $Gy$ denotes the transformed dependent variable and $GX$ is the matrix of transformed explanatory variables. It can be seen that $GX$ also has full column rank $k$ provided that $G$ is nonsingular. Thus, the identification requirement for the specification (3.1) carries over under nonsingular transformations. It follows that $\beta$ can still be estimated by the following OLS estimator:
$(X'G'GX)^{-1}X'G'Gy.$   (4.2)
Given that the original variables $y$ and $X$ satisfy [A1] and [A2](i), it is easily seen that the transformed variables $Gy$ and $GX$ also satisfy these two conditions because $GX$ is non-stochastic and $\mathrm{IE}(Gy) = GX\beta_o$. When $\mathrm{var}(y) = \Sigma_o$,
$\mathrm{var}(Gy) = G\Sigma_o G'.$
If $G$ is such that $G\Sigma_o G' = \sigma_o^2 I_T$ for some positive number $\sigma_o^2$, the condition [A2](ii) would also hold. Since the classical conditions are all satisfied, the OLS estimator
$(X'G'GX)^{-1}X'G'Gy$
is still the BLUE for $\beta_o$ by Theorem 3.5. This suggests that, as far as efficiency is concerned, one should estimate $\beta$ from the transformed specification such that the transformation matrix $G$ is nonsingular and $G\Sigma_o G' = \sigma_o^2 I_T$.
To find a desirable transformation matrix $G$, note that $\Sigma_o$ is symmetric and positive definite and that $\Sigma_o$ can be orthogonally diagonalized as $C'\Sigma_o C = \Lambda$, where $C$ is the matrix of eigenvectors corresponding to the matrix of eigenvalues $\Lambda$. For $\Sigma_o^{-1/2} = C\Lambda^{-1/2}C'$ (or $\Sigma_o^{-1/2} = \Lambda^{-1/2}C'$), we have
$\Sigma_o^{-1/2}\Sigma_o\Sigma_o^{-1/2\prime} = I_T.$
This result immediately suggests that the desired matrix $G$ should be proportional to $\Sigma_o^{-1/2}$, i.e., $G = c\,\Sigma_o^{-1/2}$ for some constant $c$. Given this choice of $G$, we have
$\mathrm{var}(Gy) = G\Sigma_o G' = c^2 I_T,$
a scalar covariance matrix, so that [A2](ii) also holds. It follows that the estimator (4.2) with $G = c\,\Sigma_o^{-1/2}$ is the BLUE for $\beta_o$. This estimator is known as the GLS estimator and reads
$\hat\beta_{GLS} = (c^2 X'\Sigma_o^{-1}X)^{-1}(c^2 X'\Sigma_o^{-1}y) = (X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}y.$   (4.3)
It should be clear that the GLS estimator cannot be computed unless $\Sigma_o$ is known. As $\hat\beta_{GLS}$ does not depend on $c$, it suffices to choose the transformation matrix $G = \Sigma_o^{-1/2}$.
For $G = \Sigma_o^{-1/2}$, let $y^* = Gy$, $X^* = GX$, and $e^* = Ge$. The transformed specification is
$y^* = X^*\beta + e^*.$   (4.4)
As the GLS estimator is the OLS estimator for the specification (4.4), it can also be interpreted as a minimizer of the following GLS criterion function:
$Q(\beta; \Sigma_o) = \frac{1}{T}(y^* - X^*\beta)'(y^* - X^*\beta) = \frac{1}{T}(y - X\beta)'\Sigma_o^{-1}(y - X\beta).$   (4.5)
This is a generalized version of the standard OLS criterion function (3.2) because it is the average of a weighted sum of squared errors. Thus, the GLS and OLS estimators are not equivalent in general, except in some exceptional cases; see, e.g., Exercise 4.1.
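For a known $\Sigma_o$, the GLS estimator (4.3) can be computed either directly or by OLS on transformed data. A minimal sketch follows (illustrative only, not from the original text); the Cholesky-based $G$ used here is one valid, non-symmetric choice of a matrix satisfying $G\Sigma_o G' = I_T$.

import numpy as np

def gls(y, X, Sigma):
    # (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y, as in (4.3).
    Si = np.linalg.inv(Sigma)
    return np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)

def gls_via_transformation(y, X, Sigma):
    # Equivalent: OLS on data transformed by a matrix G with G Sigma G' = I.
    L = np.linalg.cholesky(Sigma)        # Sigma = L L'
    G = np.linalg.inv(L)
    ys, Xs = G @ y, G @ X
    b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return b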
Similar to the OLS method, define the vector of GLS fitted values as
$\hat y_{GLS} = X(X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}y.$
The vector of GLS residuals is
$\hat e_{GLS} = y - \hat y_{GLS}.$
As $X(X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}$ is idempotent but not symmetric, $\hat y_{GLS}$ is an oblique (but not orthogonal) projection of $y$ onto $\mathrm{span}(X)$. It can also be verified that the vector of GLS residuals is not orthogonal to $X$ or any linear combination of the column vectors of $X$; i.e.,
$\hat e_{GLS}'X = y'[I_T - \Sigma_o^{-1}X(X'\Sigma_o^{-1}X)^{-1}X']X \neq 0.$
In fact, $\hat e_{GLS}$ is orthogonal to $\mathrm{span}(\Sigma_o^{-1}X)$. It follows that
$\hat e'\hat e \le \hat e_{GLS}'\hat e_{GLS}.$
This shows that the OLS method still yields a better fit of the original data.
Remark: A natural measure of goodness of fit for the specification (3.1) estimated using the GLS method is
$\text{Centered } R^2_{GLS} = 1 - \frac{\hat e_{GLS}'\hat e_{GLS}}{\text{Centered TSS of } y},$
where the denominator is the TSS of the original dependent variable $y$. A major problem of this measure is that it need not be bounded between zero and one; see Exercise 4.2. Thus, $R^2_{GLS}$ is not a proper criterion for model comparison. Using $R^2$ from the transformed specification (4.4) is also inadequate because it can only measure the variation of the transformed dependent variable $y^*$, but not the variation of the original variable $y$.
4.2.3 Properties of the GLS Estimator
We have seen that the GLS estimator is, by construction, the BLUE for $\beta_o$ under [A1] and [A2](i). Its variance-covariance matrix is
$\mathrm{var}(\hat\beta_{GLS}) = \mathrm{var}((X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}y) = (X'\Sigma_o^{-1}X)^{-1}.$   (4.6)
These results are summarized below.
Theorem 4.1 (Aitken) Given the specification (3.1), suppose that [A1] and [A2](i) hold and that $\mathrm{var}(y) = \Sigma_o$ is a positive definite matrix. Then $\hat\beta_{GLS}$ is the BLUE for $\beta_o$ with the variance-covariance matrix $(X'\Sigma_o^{-1}X)^{-1}$.
As the GLS estimator is the BLUE,
$\mathrm{var}(\hat\beta_T) - \mathrm{var}(\hat\beta_{GLS})$
must be a positive semi-definite matrix. This can also be verified directly; see Exercise 4.3.
For convenience, we introduce the following condition.
[A3$'$] $y \sim N(X\beta_o, \Sigma_o)$, where $\Sigma_o$ is a positive definite matrix.
The following result is an immediate consequence of Theorem 3.7(a).
Theorem 4.2 Given the specification (3.1), suppose that [A1] and [A3$'$] hold. Then
$\hat\beta_{GLS} \sim N(\beta_o, (X'\Sigma_o^{-1}X)^{-1}).$
Moreover, if we believe that [A3$'$] is true, the log-likelihood function is
$\log L(\beta; \Sigma_o) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\log(\det(\Sigma_o)) - \frac{1}{2}(y - X\beta)'\Sigma_o^{-1}(y - X\beta).$   (4.7)
The first order conditions of maximizing this log-likelihood function with respect to $\beta$ are
$X'\Sigma_o^{-1}(y - X\beta) = 0,$
so that the MLE is
$\tilde\beta_T = (X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}y.$
Thus, when $\Sigma_o$ is known, the GLS estimator is also the MLE under [A3$'$]. The information matrix is then
$\mathrm{IE}[X'\Sigma_o^{-1}(y - X\beta)(y - X\beta)'\Sigma_o^{-1}X]\big|_{\beta = \beta_o} = X'\Sigma_o^{-1}X.$
The variance-covariance matrix of the GLS estimator thus also achieves the Cramér-Rao lower bound. We have shown:
Theorem 4.3 Given the specification (3.1), suppose that [A1] and [A3$'$] hold. Then $\hat\beta_{GLS}$ is the BUE for $\beta_o$.
Under the null hypothesis $R\beta_o = r$, it is readily seen from Theorem 4.2 that
$(R\hat\beta_{GLS} - r)'[R(X'\Sigma_o^{-1}X)^{-1}R']^{-1}(R\hat\beta_{GLS} - r) \sim \chi^2(q).$
The left-hand side can serve as a test statistic provided that $\Sigma_o$ is known.
4.2.4 FGLS Estimator
In practice, $\Sigma_o$ is typically unknown and must be estimated. Substituting an estimator $\hat\Sigma_T$ for $\Sigma_o$ in (4.3) yields the feasible generalized least squares (FGLS) estimator
$\hat\beta_{FGLS} = (X'\hat\Sigma_T^{-1}X)^{-1}X'\hat\Sigma_T^{-1}y,$
which is readily computed from data. Note, however, that $\Sigma_o$ contains too many ($T(T+1)/2$) parameters. Proper estimation of $\Sigma_o$ would not be possible unless further restrictions on $\Sigma_o$ are imposed. Under different assumptions on $\mathrm{var}(y)$, $\Sigma_o$ has a simpler structure with many fewer (say, $p \ll T$) unknown parameters and may be properly estimated; see Sections 4.3 and 4.4. FGLS estimation crucially depends on these assumptions.
A clear disadvantage of FGLS estimators is that their finite-sample properties are usually unknown. Note that $\hat\Sigma_T$ is, in general, a function of $y$, so that $\hat\beta_{FGLS}$ is a complex function of the elements of $y$. It is therefore difficult, if not impossible, to derive the finite-sample properties, such as the expectation, variance, and distribution, of $\hat\beta_{FGLS}$. Consequently, the efficiency gain of an FGLS estimator is not at all clear, and exact tests are not available. One must rely on the asymptotic properties of $\hat\beta_{FGLS}$ to draw statistical inferences.
4.3 Heteroskedasticity
In this section, we consider a simpler structure of $\Sigma_o$ such that $\Sigma_o$ is diagonal with possibly different diagonal elements:
$\Sigma_o = \mathrm{diag}[\sigma_1^2, \ldots, \sigma_T^2],$   (4.8)
where diag is the operator that puts its arguments on the main diagonal of a matrix. That is, the elements of $y$ are uncorrelated but may have different variances. When $y_t$, $t = 1, \ldots, T$, have a constant variance, they are said to be homoskedastic; otherwise, they are heteroskedastic.
To compute the GLS estimator, the desired transformation matrix is
$\Sigma_o^{-1/2} = \mathrm{diag}[\sigma_1^{-1}, \ldots, \sigma_T^{-1}].$
As $\Sigma_o$ still contains $T$ unknown parameters, an even simpler structure of $\Sigma_o$ is needed to ensure proper FGLS estimation.
4.3.1 Tests for Heteroskedasticity
It is clear that the OLS method would prevail unless there is evidence that $\Sigma_o \neq \sigma_o^2 I_T$. It is therefore natural to first study tests of the null hypothesis of homoskedasticity against some form of heteroskedasticity. Such tests are usually based on some simplified parametric specifications of $\mathrm{var}(y_t)$.
The simplest possible form of heteroskedastic $y_t$ is groupwise heteroskedasticity. Suppose that the data can be classified into two groups: group one contains $T_1$ observations with the constant variance $\sigma_1^2$, and group two contains $T_2$ observations with the constant variance $\sigma_2^2$. This assumption simplifies $\Sigma_o$ in (4.8) to a matrix of only two unknown parameters:
$\Sigma_o = \begin{bmatrix} \sigma_1^2 I_{T_1} & 0 \\ 0 & \sigma_2^2 I_{T_2} \end{bmatrix}.$   (4.9)
The null hypothesis of homoskedasticity is $\sigma_1^2 = \sigma_2^2 = \sigma_o^2$; the alternative hypothesis is, without loss of generality, $\sigma_1^2 > \sigma_2^2$.
Consider now two regressions based on the observations of group one and group two, respectively. Let $\hat\sigma_{T_1}^2$ and $\hat\sigma_{T_2}^2$ denote the resulting OLS variance estimates. Intuitively, whether $\hat\sigma_{T_1}^2$ is close to $\hat\sigma_{T_2}^2$ constitutes evidence for or against the null hypothesis. Under [A1] and [A3$'$] with (4.9),
$(T_1 - k)\hat\sigma_{T_1}^2/\sigma_1^2 \sim \chi^2(T_1 - k), \qquad (T_2 - k)\hat\sigma_{T_2}^2/\sigma_2^2 \sim \chi^2(T_2 - k),$
by Theorem 3.7(b). As the $y_t$ are independent, these two $\chi^2$ random variables are also mutually independent. Note that $\hat\sigma_{T_1}^2$ and $\hat\sigma_{T_2}^2$ must be computed from separate regressions so as to ensure independence. Then under the null hypothesis,
$\varphi := \frac{\hat\sigma_{T_1}^2}{\hat\sigma_{T_2}^2} = \frac{(T_1 - k)\hat\sigma_{T_1}^2/\sigma_o^2}{T_1 - k}\Big/\frac{(T_2 - k)\hat\sigma_{T_2}^2/\sigma_o^2}{T_2 - k} \sim F(T_1 - k, T_2 - k);$
this is the F test for groupwise heteroskedasticity.
More generally, the variances of $y_t$ may be changing with the values of a particular explanatory variable, say $x_j$. That is, for some constant $c > 0$,
$\sigma_t^2 = c\,x_{tj}^2.$
Thus, the larger the magnitude of $x_{tj}$, the greater is $\sigma_t^2$. An interesting feature of this specification is that $\sigma_t^2$ may take distinct values for every $t$, yet $\Sigma_o$ contains only one unknown parameter $c$. The null hypothesis is then $\sigma_t^2 = \sigma_o^2$ for all $t$, and the alternative hypothesis is, without loss of generality,
$\sigma^2_{(1)} \ge \sigma^2_{(2)} \ge \cdots \ge \sigma^2_{(T)},$
where $\sigma^2_{(i)}$ denotes the $i$-th largest variance. The so-called Goldfeld-Quandt test is of the same form as the F test for groupwise heteroskedasticity but with the following data grouping procedure.
(1) Rearrange observations according to the values of some explanatory variable $x_j$ in a descending order.
(2) Divide the rearranged data set into three groups with $T_1$, $T_m$, and $T_2$ observations, respectively.
(3) Drop the $T_m$ observations in the middle group and perform separate OLS regressions using the data in the first and third groups.
(4) The statistic is the ratio of the variance estimates:
$\hat\sigma_{T_1}^2/\hat\sigma_{T_2}^2 \sim F(T_1 - k, T_2 - k).$
If the data are rearranged according to the values of $x_j$ in an ascending order, the resulting statistic should be computed as
$\hat\sigma_{T_2}^2/\hat\sigma_{T_1}^2 \sim F(T_2 - k, T_1 - k).$
In a time-series study, the variances may be decreasing (increasing) over time. In this case, data rearrangement would not be needed. Note that dropping the observations in the middle group enhances the test's ability to discriminate between the variances in the first and third groups. It is usually suggested that no more than one third of the observations should be dropped; it is also typical to set $T_1 \approx T_2$. Clearly, this test would be powerful provided that one can correctly identify the source of heteroskedasticity (i.e., the explanatory variable that determines the variances). On the other hand, finding such an explanatory variable may not be easy.
explanatory variable may not be easy.
An even more general form of heteroskedastic covariance matrix is such that the
diagonal elements

2
t
= h(
0
+z

1
),
where h is some function and z
t
is a p 1 vector of exogenous variables aecting the
variances of y
t
. This assumption simplies
o
to a matrix of p+1 unknown parameters.
Tests against this class of alternatives can be derived under the likelihood framework,
and their distributions can only be analyzed asymptotically. This will not be discussed
until Chapter ??.
c _ Chung-Ming Kuan, 2001
4.3.2 GLS Estimation
If the test for groupwise heteroskedasticity rejects the null hypothesis, one might believe that $\Sigma_o$ is given by (4.9). Accordingly, the specified linear specification may be written as
$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta + \begin{bmatrix} e_1 \\ e_2 \end{bmatrix},$
where $y_1$ is $T_1 \times 1$, $y_2$ is $T_2 \times 1$, $X_1$ is $T_1 \times k$, and $X_2$ is $T_2 \times k$. A transformed specification is
$\begin{bmatrix} y_1/\sigma_1 \\ y_2/\sigma_2 \end{bmatrix} = \begin{bmatrix} X_1/\sigma_1 \\ X_2/\sigma_2 \end{bmatrix}\beta + \begin{bmatrix} e_1/\sigma_1 \\ e_2/\sigma_2 \end{bmatrix},$
where the transformed $y_t$, $t = 1, \ldots, T$, have constant variance one. It follows that the GLS and FGLS estimators are, respectively,
$\hat\beta_{GLS} = \left(\frac{X_1'X_1}{\sigma_1^2} + \frac{X_2'X_2}{\sigma_2^2}\right)^{-1}\left(\frac{X_1'y_1}{\sigma_1^2} + \frac{X_2'y_2}{\sigma_2^2}\right),$
$\hat\beta_{FGLS} = \left(\frac{X_1'X_1}{\hat\sigma_{T_1}^2} + \frac{X_2'X_2}{\hat\sigma_{T_2}^2}\right)^{-1}\left(\frac{X_1'y_1}{\hat\sigma_{T_1}^2} + \frac{X_2'y_2}{\hat\sigma_{T_2}^2}\right),$
where $\hat\sigma_{T_1}^2$ and $\hat\sigma_{T_2}^2$ are, again, the OLS variance estimates obtained from separate regressions using the $T_1$ and $T_2$ observations, respectively. Observe that $\hat\beta_{FGLS}$ is not a linear estimator in $y$, so that its finite-sample properties are not clear.
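A minimal sketch of this two-group FGLS estimator follows (illustrative names, not part of the original text).

import numpy as np

def fgls_two_groups(y1, X1, y2, X2):
    k = X1.shape[1]

    def s2(y, X):
        # OLS variance estimate from a separate regression on one group.
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        e = y - X @ b
        return e @ e / (len(y) - k)

    v1, v2 = s2(y1, X1), s2(y2, X2)
    A = X1.T @ X1 / v1 + X2.T @ X2 / v2
    c = X1.T @ y1 / v1 + X2.T @ y2 / v2
    return np.linalg.solve(A, c)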
If the Goldfeld-Quandt test rejects the null hypothesis, one might believe that $\sigma_t^2 = c\,x_{tj}^2$. A transformed specification is then
$\frac{y_t}{x_{tj}} = \beta_j + \beta_1\frac{1}{x_{tj}} + \cdots + \beta_{j-1}\frac{x_{t,j-1}}{x_{tj}} + \beta_{j+1}\frac{x_{t,j+1}}{x_{tj}} + \cdots + \beta_k\frac{x_{tk}}{x_{tj}} + \frac{e_t}{x_{tj}},$
where $\mathrm{var}(y_t/x_{tj}) = c := \sigma_o^2$. This is a very special case where the GLS estimator is readily computed as the OLS estimator for the transformed specification. Clearly, the validity of the GLS method crucially depends on whether the explanatory variable $x_j$ can be correctly identified.
When $\sigma_t^2 = h(\alpha_0 + z_t'\alpha_1)$, it is typically difficult to implement an FGLS estimator, especially when $h$ is nonlinear. If $h$ is the identity function, one may regress the squared OLS residuals $\hat e_t^2$ on $z_t$ to obtain estimates for $\alpha_0$ and $\alpha_1$. Of course, certain constraints must be imposed to ensure that the fitted values are non-negative. The finite-sample properties of this estimator are difficult to analyze, however.
Remarks:
1. When a test for heteroskedasticity rejects the null hypothesis, there is really no guarantee that the alternative hypothesis (say, groupwise heteroskedasticity) must provide a correct description of $\mathrm{var}(y_t)$.
2. When a form of heteroskedasticity is incorrectly specified, it is likely that the resulting FGLS estimator is less efficient than the OLS estimator.
3. As discussed in Section 4.2.3, the finite-sample properties of FGLS estimators, and hence exact tests, are usually not available. One may appeal to asymptotic theory to construct proper tests.
4.4 Serial Correlation
Another leading example in which $\mathrm{var}(y) \neq \sigma_o^2 I_T$ is when the elements of $y$ are correlated, so that the off-diagonal elements of $\Sigma_o$ are non-zero. This phenomenon is more common in time series data, though it is not necessarily so. When time series data $y_t$ are correlated over time, they are said to exhibit serial correlation. For cross-section data, the correlations of $y_t$ are usually referred to as spatial correlation. We will concentrate on serial correlation.
4.4.1 A Simple Model of Serial Correlation
Consider time series $y_t$, $t = 1, \ldots, T$, with the constant variance $\sigma_o^2$. Then, the correlation coefficient between $y_t$ and $y_{t-i}$ is
$\mathrm{corr}(y_t, y_{t-i}) = \frac{\mathrm{cov}(y_t, y_{t-i})}{\sqrt{\mathrm{var}(y_t)\,\mathrm{var}(y_{t-i})}} = \frac{\mathrm{cov}(y_t, y_{t-i})}{\sigma_o^2}, \qquad i = 0, 1, 2, \ldots, t-1;$
in particular, $\mathrm{corr}(y_t, y_t) = 1$. Such correlations are also known as the autocorrelations of $y_t$. Similarly, $\mathrm{cov}(y_t, y_{t-i})$, $i = 0, 1, 2, \ldots, t-1$, are known as the autocovariances of $y_t$.
A very simple specification of the autocovariances is
$\mathrm{cov}(y_t, y_{t-i}) = \mathrm{cov}(y_t, y_{t+i}) = c^i\,\sigma_o^2,$
where $c$ is a constant, so that $\mathrm{corr}(y_t, y_{t-i}) = c^i$. That is, the autocovariances and autocorrelations depend only on $i$, the time period between two observations, but not on $t$. Moreover, the correlations between two observations decay exponentially fast as $i$ increases. Equivalently, we may write
$\mathrm{cov}(y_t, y_{t-i}) = c\,\mathrm{cov}(y_t, y_{t-i+1}).$
Letting $\mathrm{corr}(y_t, y_{t-i}) = \rho_i$, we have
$\rho_i = c\,\rho_{i-1}.$   (4.10)
From this recursion we immediately see that $c = \rho_1$, which must be bounded between $-1$ and $1$. It follows that $\mathrm{var}(y)$ is
$\Sigma_o = \sigma_o^2\begin{bmatrix} 1 & \rho_1 & \rho_1^2 & \cdots & \rho_1^{T-1} \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_1^{T-2} \\ \rho_1^2 & \rho_1 & 1 & \cdots & \rho_1^{T-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_1^{T-1} & \rho_1^{T-2} & \rho_1^{T-3} & \cdots & 1 \end{bmatrix}.$   (4.11)
To avoid singularity, $\rho_1$ cannot be $\pm 1$.
A novel feature of this specification is that, while permitting non-zero off-diagonal elements of $\Sigma_o$, it involves only two unknown parameters: $\sigma_o^2$ and $\rho_1$. The transformation matrix is then
$\Sigma_o^{-1/2} = \frac{1}{\sigma_o}\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ \frac{-\rho_1}{\sqrt{1-\rho_1^2}} & \frac{1}{\sqrt{1-\rho_1^2}} & 0 & \cdots & 0 & 0 \\ 0 & \frac{-\rho_1}{\sqrt{1-\rho_1^2}} & \frac{1}{\sqrt{1-\rho_1^2}} & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \frac{-\rho_1}{\sqrt{1-\rho_1^2}} & \frac{1}{\sqrt{1-\rho_1^2}} \end{bmatrix}.$
Note that this choice of $\Sigma_o^{-1/2}$ is not symmetric. As any matrix that is a constant proportion of $\Sigma_o^{-1/2}$ can also serve as a transformation matrix for GLS estimation, the so-called Cochrane-Orcutt transformation is based on
$V_o^{-1/2} = \sigma_o\sqrt{1-\rho_1^2}\;\Sigma_o^{-1/2} = \begin{bmatrix} \sqrt{1-\rho_1^2} & 0 & 0 & \cdots & 0 & 0 \\ -\rho_1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -\rho_1 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -\rho_1 & 1 \end{bmatrix},$
which depends only on $\rho_1$.
The data from the Cochrane-Orcutt transformation are
$y_1^* = (1-\rho_1^2)^{1/2}y_1, \qquad x_1^* = (1-\rho_1^2)^{1/2}x_1,$
$y_t^* = y_t - \rho_1 y_{t-1}, \qquad x_t^* = x_t - \rho_1 x_{t-1}, \qquad t = 2, \ldots, T,$
where $x_t$ is the $t$-th column of $X'$. It is then clear that
$\mathrm{var}(y_1^*) = (1-\rho_1^2)\sigma_o^2,$
$\mathrm{var}(y_t^*) = \sigma_o^2 + \rho_1^2\sigma_o^2 - 2\rho_1^2\sigma_o^2 = (1-\rho_1^2)\sigma_o^2, \qquad t = 2, \ldots, T.$
Moreover, for each $i$,
$\mathrm{cov}(y_t^*, y_{t-i}^*) = \mathrm{cov}(y_t, y_{t-i}) - \rho_1\mathrm{cov}(y_{t-1}, y_{t-i}) - \rho_1\mathrm{cov}(y_t, y_{t-i-1}) + \rho_1^2\mathrm{cov}(y_{t-1}, y_{t-i-1}) = 0.$
Hence, the transformed variables $y_t^*$ satisfy the classical conditions, as they ought to. Then, provided that $\rho_1$ is known, regressing $y_t^*$ on $x_t^*$ yields the GLS estimator for $\beta_o$.
4.4.2 An Alternative View
There is an alternative approach to generating the variance-covariance matrix (4.11). Under [A2](i), let
$\epsilon := y - X\beta_o.$
The vector $\epsilon$ is usually referred to as the vector of disturbances. Note that $\epsilon$ is not the same as the residual vector $\hat e$. While the former is not observable because $\beta_o$ is unknown, the latter is obtained from OLS estimation and hence observable. Under [A2], $\mathrm{IE}(\epsilon) = 0$ and
$\mathrm{var}(y) = \mathrm{var}(\epsilon) = \mathrm{IE}(\epsilon\epsilon').$
The variance and covariance structure of $y$ is thus the same as that of $\epsilon$.
A time series is said to be weakly stationary if its mean, variance, and autocovariances are all independent of the time index $t$. Thus, a weakly stationary series cannot exhibit trending behavior and has relatively stable fluctuations. In particular, a time series with zero mean, a constant variance, and zero autocovariances is weakly stationary and also known as a white noise. Let $u_t$ be a white noise with $\mathrm{IE}(u_t) = 0$, $\mathrm{IE}(u_t^2) = \sigma_u^2$, and $\mathrm{IE}(u_t u_\tau) = 0$ for $t \neq \tau$. Now suppose that the elements of $\epsilon$ are generated as a weakly stationary AR(1) process (autoregressive process of order 1):
$\epsilon_t = \psi_1\epsilon_{t-1} + u_t,$   (4.12)
with $\epsilon_0 = 0$. By recursive substitution, (4.12) can be expressed as
$\epsilon_t = \sum_{i=0}^{t-1}\psi_1^i u_{t-i},$   (4.13)
a weighted sum of current and previous random innovations (shocks).
It follows from (4.13) that $\mathrm{IE}(\epsilon_t) = 0$ and $\mathrm{IE}(u_t\epsilon_{t-s}) = 0$ for all $t$ and $s \ge 1$. By weak stationarity, $\mathrm{var}(\epsilon_t)$ is a constant, so that for all $t$,
$\mathrm{var}(\epsilon_t) = \psi_1^2\,\mathrm{var}(\epsilon_{t-1}) + \sigma_u^2 = \sigma_u^2/(1-\psi_1^2).$
Clearly, the right-hand side would not be meaningful unless $|\psi_1| < 1$. The autocovariance of $\epsilon_t$ and $\epsilon_{t-1}$ is, by weak stationarity,
$\mathrm{IE}(\epsilon_t\epsilon_{t-1}) = \psi_1\,\mathrm{IE}(\epsilon_{t-1}^2) = \psi_1\frac{\sigma_u^2}{1-\psi_1^2}.$
This shows that
$\rho_1 = \mathrm{corr}(\epsilon_t, \epsilon_{t-1}) = \mathrm{corr}(y_t, y_{t-1}) = \psi_1.$
Similarly,
$\mathrm{IE}(\epsilon_t\epsilon_{t-2}) = \psi_1\,\mathrm{IE}(\epsilon_{t-1}\epsilon_{t-2}) = \psi_1^2\frac{\sigma_u^2}{1-\psi_1^2},$
so that
$\mathrm{corr}(\epsilon_t, \epsilon_{t-2}) = \psi_1\,\mathrm{corr}(\epsilon_t, \epsilon_{t-1}) = \psi_1^2.$
More generally, we can write, for $i = 1, 2, \ldots$,
$\mathrm{corr}(\epsilon_t, \epsilon_{t-i}) = \psi_1\,\mathrm{corr}(\epsilon_t, \epsilon_{t-i+1}) = \psi_1^i,$
which depends only on $i$, the time difference between the two disturbances, but not on $t$. This is precisely what we postulated in (4.10). The variance-covariance matrix $\Sigma_o$ under this structure is also (4.11), with $\sigma_o^2 = \sigma_u^2/(1-\psi_1^2)$.
The AR(1) structure of disturbances also permits a straightforward extension. Consider disturbances that are generated as an AR($p$) process (autoregressive process of order $p$):
$\epsilon_t = \psi_1\epsilon_{t-1} + \cdots + \psi_p\epsilon_{t-p} + u_t,$   (4.14)
where the coefficients $\psi_1, \ldots, \psi_p$ should also be restricted to ensure weak stationarity; we omit the details. Of course, $\epsilon_t$ may follow different structures and still be serially correlated. For example, $\epsilon_t$ may be generated as an MA(1) process (moving average process of order 1):
$\epsilon_t = u_t + \pi_1 u_{t-1}, \qquad |\pi_1| < 1,$
where $u_t$ is a white noise; see, e.g., Exercise 4.5.
4.4.3 Tests for AR(1) Disturbances
As the AR(1) structure of disturbances is one of the most commonly used specifications of serial correlation, we now consider tests of the null hypothesis of no serial correlation ($\rho_1 = \psi_1 = 0$) against AR(1) disturbances. We discuss only the celebrated Durbin-Watson test and Durbin's h test; the discussion of other large-sample tests will be deferred to Chapter 6.
In view of the AR(1) structure, a natural estimator of $\rho_1$ is the OLS estimator of regressing the OLS residual $\hat e_t$ on its immediate lag $\hat e_{t-1}$:
$\hat\rho_T = \frac{\sum_{t=2}^T\hat e_t\hat e_{t-1}}{\sum_{t=2}^T\hat e_{t-1}^2}.$   (4.15)
The Durbin-Watson statistic is
$d = \frac{\sum_{t=2}^T(\hat e_t - \hat e_{t-1})^2}{\sum_{t=1}^T\hat e_t^2}.$
When the sample size $T$ is large, it can be seen that
$d = 2 - 2\hat\rho_T\frac{\sum_{t=2}^T\hat e_{t-1}^2}{\sum_{t=1}^T\hat e_t^2} - \frac{\hat e_1^2 + \hat e_T^2}{\sum_{t=1}^T\hat e_t^2} \approx 2(1 - \hat\rho_T).$
For $0 < \hat\rho_T \le 1$ ($-1 \le \hat\rho_T < 0$), the Durbin-Watson statistic is such that $0 \le d < 2$ ($2 < d \le 4$), which suggests that there is some positive (negative) serial correlation. Hence, this test essentially checks whether $\hat\rho_T$ is sufficiently close to zero (i.e., whether $d$ is close to 2).
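The statistic $d$ and the estimate (4.15) are straightforward to compute from the OLS residuals. A minimal sketch follows (the residual vector e is an assumed input; the code is not from the original text).

import numpy as np

def durbin_watson(e):
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def rho_hat(e):
    return np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)   # the estimator (4.15)

# For large T, durbin_watson(e) is approximately 2 * (1 - rho_hat(e)).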
A major difficulty of the Durbin-Watson test is that the exact null distribution of $d$ depends on the matrix $X$ and therefore varies with the data. As such, the critical values of $d$ cannot be tabulated. Nevertheless, it has been shown that the null distribution of $d$ lies between the distributions of a lower bound ($d_L$) and an upper bound ($d_U$) in the following sense. Given the significance level $\alpha$, let $d_\alpha$, $d_{L,\alpha}$, and $d_{U,\alpha}$ denote, respectively, the critical values of $d$, $d_L$, and $d_U$ (for example, $\mathrm{IP}\{d < d_\alpha\} = \alpha$). Then for each $\alpha$, $d_{L,\alpha} < d_\alpha < d_{U,\alpha}$. While the distribution of $d$ is data dependent, the distributions of $d_L$ and $d_U$ are independent of $X$. Thus, the critical values $d_{L,\alpha}$ and $d_{U,\alpha}$ can be tabulated. One may rely on these critical values to construct a conservative decision rule.
Specifically, when the alternative hypothesis is $\rho_1 > 0$ ($\rho_1 < 0$), the decision rule of the Durbin-Watson test is:
(1) Reject the null if $d < d_{L,\alpha}$ ($d > 4 - d_{L,\alpha}$).
(2) Do not reject the null if $d > d_{U,\alpha}$ ($d < 4 - d_{U,\alpha}$).
(3) The test is inconclusive if $d_{L,\alpha} < d < d_{U,\alpha}$ ($4 - d_{L,\alpha} > d > 4 - d_{U,\alpha}$).
This is not completely satisfactory because the test may yield no conclusion. Some econometric packages, such as SHAZAM, now compute the exact Durbin-Watson distribution for each regression and report the exact p-values. When such a program is available, this test does not have to rely on the critical values of $d_L$ and $d_U$, and it is always conclusive. Note that the tabulated critical values of the Durbin-Watson statistic are for specifications with a constant term; the critical values for specifications without a constant term can be found in Farebrother (1980).
Another problem with the Durbin-Watson statistic is that its null distribution holds only under the classical conditions [A1] and [A3]. In the time series context, it is quite common to include a lagged dependent variable as a regressor, so that [A1] is violated. A leading example is the specification
$y_t = \beta_1 + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + \gamma y_{t-1} + e_t.$
This model can also be derived from certain behavioral assumptions; see Exercise 4.6. It has been shown that the Durbin-Watson statistic under this specification is biased toward 2. That is, this test would not reject the null hypothesis even when serial correlation is present. On the other hand, Durbin's h test is designed specifically for specifications that contain a lagged dependent variable. Let $\hat\gamma_T$ be the OLS estimate of $\gamma$ and $\widehat{\mathrm{var}}(\hat\gamma_T)$ be the OLS estimate of $\mathrm{var}(\hat\gamma_T)$. The h statistic is
$h = \hat\rho_T\sqrt{\frac{T}{1 - T\,\widehat{\mathrm{var}}(\hat\gamma_T)}},$
and its asymptotic null distribution is $N(0, 1)$. A clear disadvantage of Durbin's h test is that it cannot be calculated when $\widehat{\mathrm{var}}(\hat\gamma_T) \ge 1/T$. This test can also be derived as a Lagrange Multiplier test; see Chapter ??.
If we have quarterly data and want to test for fourth-order serial correlation, the statistic analogous to the Durbin-Watson statistic is
$d_4 = \frac{\sum_{t=5}^T(\hat e_t - \hat e_{t-4})^2}{\sum_{t=1}^T\hat e_t^2};$
see Wallis (1972) for the corresponding critical values.
4.4.4 FGLS Estimation
Recall that $\Sigma_o$ depends on two parameters, $\sigma_o^2$ and $\rho_1$. We may use the generic notation $\Sigma(\sigma^2, \rho)$ to denote this function of $\sigma^2$ and $\rho$; in particular, $\Sigma_o = \Sigma(\sigma_o^2, \rho_1)$. Similarly, we may also write $V(\rho)$, so that $V_o = V(\rho_1)$. The transformed data based on $V(\rho)^{-1/2}$ are
$y_1(\rho) = (1-\rho^2)^{1/2}y_1, \qquad x_1(\rho) = (1-\rho^2)^{1/2}x_1,$
$y_t(\rho) = y_t - \rho y_{t-1}, \qquad x_t(\rho) = x_t - \rho x_{t-1}, \qquad t = 2, \ldots, T.$
Hence, $y_t^* = y_t(\rho_1)$ and $x_t^* = x_t(\rho_1)$.
To obtain an FGLS estimator, we must first estimate $\rho_1$ by some estimator $\hat\rho_T$ and then construct the transformation matrix as $\hat V_T^{-1/2} = V(\hat\rho_T)^{-1/2}$. Here, $\hat\rho_T$ may be computed as in (4.15); other estimators for $\rho_1$ may also be used, e.g., $\check\rho_T = \hat\rho_T(T-k)/(T-1)$. The transformed data are then $y_t(\hat\rho_T)$ and $x_t(\hat\rho_T)$. An FGLS estimator is obtained by regressing $y_t(\hat\rho_T)$ on $x_t(\hat\rho_T)$. Such an estimator is known as the Prais-Winsten estimator, or the Cochrane-Orcutt estimator when the first observation is dropped in the computation.
The following iterative procedure is also commonly employed in practice.
(1) Perform OLS estimation and compute $\hat\rho_T$ as in (4.15) using the OLS residuals $\hat e_t$.
(2) Perform the Cochrane-Orcutt transformation based on $\hat\rho_T$ and compute the resulting FGLS estimate $\hat\beta_{FGLS}$ by regressing $y_t(\hat\rho_T)$ on $x_t(\hat\rho_T)$.
(3) Compute a new $\hat\rho_T$ as in (4.15) with $\hat e_t$ replaced by the FGLS residuals
$\hat e_{t,FGLS} = y_t - x_t'\hat\beta_{FGLS}.$
(4) Repeat steps (2) and (3) until $\hat\rho_T$ converges numerically, i.e., until the values of $\hat\rho_T$ from two consecutive iterations differ by less than a pre-determined convergence criterion.
Note that steps (1) and (2) above already generate an FGLS estimator. More iterations do not improve the asymptotic properties of the resulting estimator but may have a significant effect in finite samples. This procedure can be extended easily to estimate specifications with higher-order AR disturbances.
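The iterative procedure can be sketched as follows (Cochrane-Orcutt form, dropping the first observation; all names are illustrative and the code is not part of the original text).

import numpy as np

def iterative_cochrane_orcutt(y, X, tol=1e-8, max_iter=100):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)      # step (1): OLS on the original data
    rho = 0.0
    for _ in range(max_iter):
        e = y - X @ b                              # residuals from the original data
        rho_new = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)   # the estimate (4.15)
        ys = y[1:] - rho_new * y[:-1]              # step (2): Cochrane-Orcutt transformation
        Xs = X[1:] - rho_new * X[:-1]
        b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)   # FGLS estimate
        converged = abs(rho_new - rho) < tol       # step (4): convergence check
        rho = rho_new
        if converged:
            break
    return b, rho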
Alternatively, the Hildreth-Lu procedure adopts grid search to find the $\rho_1 \in (-1, 1)$ that minimizes the sum of squared errors of the model. This procedure is computationally intensive, and it is difficult to implement when the $\epsilon_t$ have an AR($p$) structure with $p > 2$.
In view of the log-likelihood function (4.7), we must compute $\det(\Sigma_o)$. Clearly,
$\det(\Sigma_o) = \frac{1}{\det(\Sigma_o^{-1})} = \frac{1}{[\det(\Sigma_o^{-1/2})]^2}.$
In terms of the notation of the AR(1) formulation, $\sigma_o^2 = \sigma_u^2/(1-\rho_1^2)$, and
$\Sigma_o^{-1/2} = \frac{1}{\sigma_o\sqrt{1-\rho_1^2}}V_o^{-1/2} = \frac{1}{\sigma_u}V_o^{-1/2}.$
As $\det(V_o^{-1/2}) = (1-\rho_1^2)^{1/2}$, we then have
$\det(\Sigma_o) = (\sigma_u^2)^T(1-\rho_1^2)^{-1}.$
The log-likelihood function for given $\sigma_u^2$ and $\rho_1$ is
$\log L(\beta; \sigma_u^2, \rho_1) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma_u^2) + \frac{1}{2}\log(1-\rho_1^2) - \frac{1}{2\sigma_u^2}(y^* - X^*\beta)'(y^* - X^*\beta).$
Clearly, when $\sigma_u^2$ and $\rho_1$ are known, the MLE of $\beta$ is just the GLS estimator.
If $\sigma_u^2$ and $\rho_1$ are unknown, the log-likelihood function reads:
$\log L(\beta, \sigma^2, \rho) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) + \frac{1}{2}\log(1-\rho^2) - \frac{1}{2\sigma^2}(1-\rho^2)(y_1 - x_1'\beta)^2 - \frac{1}{2\sigma^2}\sum_{t=2}^T[(y_t - x_t'\beta) - \rho(y_{t-1} - x_{t-1}'\beta)]^2,$
which is a nonlinear function of the parameters. Nonlinear optimization methods are therefore needed to compute the MLEs of $\beta$, $\sigma^2$, and $\rho$. For a given $\beta$, estimating $\rho$ by regressing $e_t(\beta) = y_t - x_t'\beta$ on $e_{t-1}(\beta)$ is equivalent to maximizing the last term of the log-likelihood function above. This does not yield an MLE because the other terms involving $\rho$, namely,
$\frac{1}{2}\log(1-\rho^2) - \frac{1}{2\sigma^2}(1-\rho^2)(y_1 - x_1'\beta)^2,$
have been ignored. This shows that the aforementioned iterative procedure does not result in the MLEs.
Remark: Exact tests based on FGLS estimation results are not available because the finite-sample distribution of the FGLS estimator is, again, unknown. Asymptotic theory is needed to construct proper tests.
4.5 Linear Probability Model
In some applications researchers are interested in analyzing why consumers own a house or participate in a particular event. The ownership or the choice of participation is typically represented by a binary variable that takes the values one and zero. If the dependent variable in a linear regression is binary, we will see below that both the OLS and FGLS methods are not appropriate.
Let $x_t$ denote the $t$-th column of $X'$. The $t$-th observation of the linear specification $y = X\beta + e$ can be expressed as
$y_t = x_t'\beta + e_t.$
For the binary dependent variable $y$ whose $t$-th observation is $y_t = 1$ or $0$, we know
$\mathrm{IE}(y_t) = \mathrm{IP}(y_t = 1).$
Thus, $x_t'\beta$ is just a specification of the probability that $y_t = 1$. As such, the linear specification of binary dependent variables is usually referred to as the linear probability model.
When [A1] and [A2](i) hold for a linear probability model,
$\mathrm{IE}(y_t) = \mathrm{IP}(y_t = 1) = x_t'\beta_o,$
and the OLS estimator is unbiased for $\beta_o$. Note, however, that the variance of $y_t$ is
$\mathrm{var}(y_t) = \mathrm{IP}(y_t = 1)[1 - \mathrm{IP}(y_t = 1)].$
Under [A1] and [A2](i),
$\mathrm{var}(y_t) = x_t'\beta_o(1 - x_t'\beta_o),$
which varies with $x_t$. Thus, the linear probability model suffers from the problem of heteroskedasticity, and the OLS estimator is not the BLUE for $\beta_o$. Apart from the efficiency issue, the OLS method is still not appropriate for the linear probability model because the OLS fitted values need not be bounded between zero and one. When $x_t'\hat\beta_T$ is negative or greater than one, it cannot be interpreted as a probability and hence becomes meaningless.
Although the GLS estimator is the BLUE, it is not available because $\Sigma_o$, and hence $\mathrm{var}(y_t)$, is unknown. Nevertheless, if the $y_t$ are uncorrelated so that $\mathrm{var}(y)$ is diagonal, an FGLS estimator may be obtained using the transformation matrix
$\hat\Sigma_T^{-1/2} = \mathrm{diag}\big[[x_1'\hat\beta_T(1 - x_1'\hat\beta_T)]^{-1/2}, [x_2'\hat\beta_T(1 - x_2'\hat\beta_T)]^{-1/2}, \ldots, [x_T'\hat\beta_T(1 - x_T'\hat\beta_T)]^{-1/2}\big],$
where $\hat\beta_T$ is the OLS estimator of $\beta_o$. Such an estimator breaks down when $\hat\Sigma_T^{-1/2}$ is not available (i.e., when some $x_t'\hat\beta_T$ is negative or greater than one). Even when $\hat\Sigma_T^{-1/2}$ can be computed, there is still no guarantee that the FGLS fitted values are bounded between zero and one. This shows that the FGLS method may not always be a solution when the OLS method fails.
This example also illustrates the importance of data characteristics in estimation and modeling. Without taking into account the binary nature of the dependent variable, even the FGLS method may be invalid. More appropriate methods for specifications with binary dependent variables will be discussed in Chapter ??.
4.6 Seemingly Unrelated Regressions
In many econometric practices, it is also important to jointly study the behavior of several dependent variables. For example, the input demands of a firm may be described using a system of linear regression functions in which each regression represents the demand function of a particular input.
Consider the specification of a system of $N$ equations, each with $k_i$ explanatory variables and $T$ observations. Specifically,
$y_i = X_i\beta_i + e_i, \qquad i = 1, 2, \ldots, N,$   (4.16)
where for each $i$, $y_i$ is $T \times 1$, $X_i$ is $T \times k_i$, and $\beta_i$ is $k_i \times 1$. The system (4.16) is also known as a specification of seemingly unrelated regressions (SUR). Stacking the equations of (4.16) yields
$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_N \end{bmatrix}}_{X}\underbrace{\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_N \end{bmatrix}}_{\beta} + \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}}_{e}.$   (4.17)
This is a linear specification (3.1) with $k = \sum_{i=1}^N k_i$ explanatory variables and $TN$ observations. It is not too hard to see that the whole system (4.17) satisfies the identification requirement whenever every specification of (4.16) does.
Suppose that the classical conditions [A1] and [A2] hold for each specified linear regression in the system. Then under [A2](i), there exists $\beta_o = (\beta_{o,1}' \; \ldots \; \beta_{o,N}')'$ such that $\mathrm{IE}(y) = X\beta_o$. The OLS estimator obtained from (4.17) is therefore unbiased. Note, however, that [A2](ii) for each linear regression ensures only that, for each $i$,
$\mathrm{var}(y_i) = \sigma_i^2 I_T;$
there is no restriction on the correlations between $y_i$ and $y_j$. The variance-covariance matrix of $y$ is then
$\mathrm{var}(y) = \Sigma_o = \begin{bmatrix} \sigma_1^2 I_T & \mathrm{cov}(y_1, y_2) & \cdots & \mathrm{cov}(y_1, y_N) \\ \mathrm{cov}(y_2, y_1) & \sigma_2^2 I_T & \cdots & \mathrm{cov}(y_2, y_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(y_N, y_1) & \mathrm{cov}(y_N, y_2) & \cdots & \sigma_N^2 I_T \end{bmatrix}.$   (4.18)
That is, the vector of stacked dependent variables violates [A2](ii), even when each individual dependent variable has a scalar variance-covariance matrix. Consequently, the OLS estimator of the whole system, $\hat\beta_{TN} = (X'X)^{-1}X'y$, is not the BLUE in general. In fact, owing to the block-diagonal structure of $X$, $\hat\beta_{TN}$ simply consists of $N$ equation-by-equation OLS estimators and hence ignores the correlations between equations and the heteroskedasticity across equations.
In practice, it is also typical to postulate that for $i \neq j$,
$\mathrm{cov}(y_i, y_j) = \sigma_{ij}I_T,$
that is, $y_{it}$ and $y_{jt}$ are contemporaneously correlated, but $y_{it}$ and $y_{j\tau}$, $t \neq \tau$, are serially uncorrelated. Under this condition, (4.18) simplifies to $\Sigma_o = S_o \otimes I_T$ with
$S_o = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1N} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{N1} & \sigma_{N2} & \cdots & \sigma_N^2 \end{bmatrix}.$
As $\Sigma_o^{-1} = S_o^{-1} \otimes I_T$, the GLS estimator of (4.17) is
$\hat\beta_{GLS} = [X'(S_o^{-1} \otimes I_T)X]^{-1}X'(S_o^{-1} \otimes I_T)y,$
and its covariance matrix is $[X'(S_o^{-1} \otimes I_T)X]^{-1}$.
It is readily verified that when $\sigma_{ij} = 0$ for all $i \neq j$, $S_o$ becomes a diagonal matrix, and so is $\Sigma_o$. The resulting GLS estimator thus reduces to the OLS estimator. This should not be too surprising because estimating the SUR system would be unnecessary if the dependent variables were in fact uncorrelated. (Note that heteroskedasticity across equations does not affect this result.) If all the equations in the system have the same regressors, i.e., $X_i = X_0$ (say), the GLS estimator is also the same as the OLS estimator; see, e.g., Exercise 4.7. More generally, it can be shown that there would not be much efficiency gain from GLS estimation if $y_i$ and $y_j$ are less correlated and/or $X_i$ and $X_j$ are highly correlated; see, e.g., Goldberger (1991, p. 328) for an illustrative example.
The FGLS estimator can be obtained by replacing $S_o^{-1}$ with $\hat S_{TN}^{-1}$, where $\hat S_{TN}$ is an $N \times N$ matrix computed as
\[
\hat S_{TN} = \frac{1}{T} \begin{bmatrix} \hat e_1' \\ \hat e_2' \\ \vdots \\ \hat e_N' \end{bmatrix} \begin{bmatrix} \hat e_1 & \hat e_2 & \cdots & \hat e_N \end{bmatrix},
\]
where $\hat e_i$ is the OLS residual vector of the $i$th equation. The elements of this matrix are
\[
\hat\sigma_i^2 = \frac{\hat e_i' \hat e_i}{T}, \quad i = 1, \ldots, N, \qquad
\hat\sigma_{ij} = \frac{\hat e_i' \hat e_j}{T}, \quad i \neq j, \; i, j = 1, \ldots, N.
\]
Note that $\hat S_{TN}$ is of an outer product form and hence a positive semi-definite matrix. One may also replace the denominator of $\hat\sigma_i^2$ with $T - k_i$ and the denominator of $\hat\sigma_{ij}$ with $T - \max(k_i, k_j)$. The resulting estimator $\hat S_{TN}$ need not be positive semi-definite, however.
Remark: The estimator $\hat S_{TN}$ mentioned above is valid provided that $\mathrm{var}(y_i) = \sigma_i^2 I_T$ and $\mathrm{cov}(y_i, y_j) = \sigma_{ij} I_T$. If these assumptions do not hold, FGLS estimation would be much more complicated. This may happen when heteroskedasticity and serial correlations are present in each equation, or when $\mathrm{cov}(y_{it}, y_{jt})$ changes over time.
4.7 Models for Panel Data
A data set that contains N cross-section units (individuals, families, firms, or countries),
each with some time-series observations, is known as a panel data set. Well known panel
data sets in the U.S. include the National Longitudinal Survey (NLS) of Labor Market
Experience and the Michigan Panel Study of Income Dynamics (PSID). Building these
data sets is very costly because they are obtained by tracking thousands of individuals
through time. Some panel data may be easier to establish; for example, the GDP data
for all G7 countries over 30 years also form a panel data set. Panel data permit analysis
of topics that could not be studied using only cross-section or time-series data. In this
section, we are mainly concerned with the panel data set that involves a large number
of cross-section units, each with a short time series.
4.7.1 Fixed Effects Model
Given a panel data set, the basic linear specification allowing for individual effects (i.e., effects that change across individual units but remain constant over time) is
\[ y_{it} = x_{it}' \beta_i + e_{it}, \qquad i = 1, \ldots, N, \; t = 1, \ldots, T, \]
where $x_{it}$ is $k \times 1$ and $\beta_i$ depends only on i but not on t. Clearly, there is no time-specific effect in this specification; this may be reasonable when only a short time series is observed for each individual unit.
Analogous to the notations in the SUR system (4.16), we can also write the specification above as
\[ y_i = X_i \beta_i + e_i, \qquad i = 1, 2, \ldots, N, \tag{4.19} \]
where $y_i$ is $T \times 1$, $X_i$ is $T \times k$, and $e_i$ is $T \times 1$. This is again a complex system involving $k \times N$ parameters. Here, the dependent variable y and explanatory variables X are the same across individual units, so that $y_i$ and $X_i$ are simply their observations for each individual i. For example, y may be the family consumption expenditure, and each $y_i$ contains family i's annual consumption expenditures. By contrast, $y_i$ and $X_i$ may be different variables in a SUR system.
When T is small (i.e., observed time series are short), estimating (4.19) is not feasible. A simpler form of (4.19) is such that only the intercept terms change with i and the other parameters remain constant across i:
\[ y_i = \ell_T a_i + Z_i b + e_i, \qquad i = 1, 2, \ldots, N, \tag{4.20} \]
where $\ell_T$ is the T-dimensional vector of ones, $[\ell_T \; Z_i] = X_i$ and $[a_i \; b']' = \beta_i$. Thus, individual effects are completely captured by the intercept terms in (4.20). This simplifies (4.19) from $kN$ to $N + k - 1$ parameters. Note that this specification treats $a_i$ as non-random parameters and is known as the fixed effects model. Stacking the N equations in (4.20) together we obtain
\[
\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{y}
=
\underbrace{\begin{bmatrix} \ell_T & 0 & \cdots & 0 \\ 0 & \ell_T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \ell_T \end{bmatrix}}_{D}
\underbrace{\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{bmatrix}}_{a}
+
\underbrace{\begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_N \end{bmatrix}}_{Z} b
+
\underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}}_{e}. \tag{4.21}
\]
This is just a linear specification (3.1) with $N + k - 1$ explanatory variables and $TN$ observations. Note that each column of D is in effect a dummy variable for the $i$th individual unit. In what follows, an individual unit will be referred to as a group.
Let $z_{it}$ denote the $t$th column of $Z_i'$, where $Z_i'$ is the $i$th block of $Z'$. For $z_{it}$, the $i$th group average over time is
\[ \bar z_i = \frac{1}{T} \sum_{t=1}^T z_{it} = \frac{1}{T} Z_i' \ell_T; \]
for $y_{it}$, the group average over time is
\[ \bar y_i = \frac{1}{T} y_i' \ell_T. \]
The overall sample average of $z_{it}$ (average over time and groups) is
\[ \bar z = \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T z_{it} = \frac{1}{NT} Z' \ell_{NT}, \]
and the overall sample average of $y_{it}$ is
\[ \bar y = \frac{1}{NT} y' \ell_{NT}. \]
Note that
\[ \bar z = \frac{1}{N} \sum_{i=1}^N \bar z_i, \]
so that the overall sample average is simply the sample mean of all group averages. Similarly, $\bar y$ is also the sample average of the $\bar y_i$.
It can be seen that the specification (4.21) satisfies the identification requirement provided that no column of $Z_i$ is a constant (i.e., there is no time-invariant regressor for each group). Once the identification requirement is satisfied, the OLS estimator can be computed. By Theorem 3.3, the OLS estimator for b is
\[ \hat b_{NT} = [Z'(I_{NT} - P_D)Z]^{-1} Z'(I_{NT} - P_D)y, \tag{4.22} \]
where $P_D = D(D'D)^{-1}D'$ is a projection matrix. Thus, $\hat b_{NT}$ can be obtained by regressing $(I_{NT} - P_D)y$ on $(I_{NT} - P_D)Z$.
More specifically, write $D = I_N \otimes \ell_T$, so that
\[ P_D = (I_N \otimes \ell_T)\big[(I_N \otimes \ell_T)'(I_N \otimes \ell_T)\big]^{-1}(I_N \otimes \ell_T)' = I_N \otimes \ell_T \ell_T'/T. \]
It follows that $I_{NT} - P_D = I_N \otimes (I_T - \ell_T \ell_T'/T)$ and that
\[
(I_{NT} - P_D)y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} \ell_T \bar y_1 \\ \ell_T \bar y_2 \\ \vdots \\ \ell_T \bar y_N \end{bmatrix},
\]
where $\bar y_i$ is the group average of the elements in $y_i$. The $t$th observation in the $i$th block of $(I_{NT} - P_D)y$ is then $y_{it} - \bar y_i$, the deviation of $y_{it}$ from its group average. Similarly,
\[
(I_{NT} - P_D)Z = \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_N \end{bmatrix} - \begin{bmatrix} \ell_T \bar z_1' \\ \ell_T \bar z_2' \\ \vdots \\ \ell_T \bar z_N' \end{bmatrix},
\]
with the $t$th observation in the $i$th block being $(z_{it} - \bar z_i)'$, the deviation of $z_{it}$ from its group average. This shows that the OLS estimator (4.22) can be obtained by regressing $y_{it} - \bar y_i$ on $z_{it} - \bar z_i$, $i = 1, \ldots, N$, and $t = 1, \ldots, T$. That is,
\[
\hat b_{NT}
= \Big[\sum_{i=1}^N (Z_i' - \bar z_i \ell_T')(Z_i - \ell_T \bar z_i')\Big]^{-1} \Big[\sum_{i=1}^N (Z_i' - \bar z_i \ell_T')(y_i - \ell_T \bar y_i)\Big]
= \Big[\sum_{i=1}^N \sum_{t=1}^T (z_{it} - \bar z_i)(z_{it} - \bar z_i)'\Big]^{-1} \Big[\sum_{i=1}^N \sum_{t=1}^T (z_{it} - \bar z_i)(y_{it} - \bar y_i)\Big]. \tag{4.23}
\]
This estimator is known as the within-groups estimator because it is based on the observations that are deviations from their own group averages.
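As a quick illustration, the within-groups estimator (4.23) can be computed by simple demeaning. The sketch below (Python/NumPy; the arrays y of shape (N, T) and Z of shape (N, T, k) and the function name are hypothetical) does exactly that, and also recovers the implied group intercepts, which connect to the formula for the individual effects discussed next.

import numpy as np

def within_groups(y, Z):
    """Within-groups (fixed effects) estimator; y is (N, T), Z is (N, T, k)."""
    y_dev = y - y.mean(axis=1, keepdims=True)          # y_it - ybar_i
    Z_dev = Z - Z.mean(axis=1, keepdims=True)          # z_it - zbar_i
    ZZ = np.einsum('ntk,ntl->kl', Z_dev, Z_dev)        # sum (z - zbar)(z - zbar)'
    Zy = np.einsum('ntk,nt->k', Z_dev, y_dev)          # sum (z - zbar)(y - ybar)
    b_hat = np.linalg.solve(ZZ, Zy)
    a_hat = y.mean(axis=1) - Z.mean(axis=1) @ b_hat    # group intercepts ybar_i - zbar_i' b
    return b_hat, a_hat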
Let $\hat a_{NT}$ denote the OLS estimator of the vector a of individual effects. By the facts that
\[ D'y = D'D\, \hat a_{NT} + D'Z\, \hat b_{NT}, \]
and that the OLS residual vector is orthogonal to D, $\hat a_{NT}$ can be computed as
\[ \hat a_{NT} = (D'D)^{-1} D'(y - Z \hat b_{NT}), \tag{4.24} \]
with the $i$th element being
\[ \hat a_{NT,i} = \frac{1}{T}\big(\ell_T' y_i - \ell_T' Z_i \hat b_{NT}\big) = \bar y_i - \bar z_i' \hat b_{NT}. \]
When the classical conditions [A1] and [A2](i) hold for every equation in (4.20), we have
\[ \mathrm{IE}(y_i) = \ell_T a_{i,o} + Z_i b_o, \qquad i = 1, 2, \ldots, N. \]
Then, the OLS estimators $\hat a_{NT}$ and $\hat b_{NT}$ are unbiased for $a_o$ and $b_o$, where the $i$th element of $a_o$ is $a_{i,o}$. Similar to Section 4.6, to ensure the BLUE property of these estimators, it is also required that var(y) be a scalar covariance matrix. This amounts to requiring that $\mathrm{var}(y_i) = \sigma_o^2 I_T$ for all i and that $\mathrm{cov}(y_i, y_j) = 0$ for every $i \neq j$. Under these conditions, the variance-covariance matrix of $\hat b_{NT}$ is
\[ \mathrm{var}(\hat b_{NT}) = \sigma_o^2 \big[Z'(I_{NT} - P_D)Z\big]^{-1} = \sigma_o^2 \Big[\sum_{i=1}^N \sum_{t=1}^T (z_{it} - \bar z_i)(z_{it} - \bar z_i)'\Big]^{-1}. \]
It is also easy to verify that the variance of the $i$th element of $\hat a_{NT}$ is
\[ \mathrm{var}(\hat a_{NT,i}) = \frac{1}{T}\sigma_o^2 + \bar z_i'\,[\mathrm{var}(\hat b_{NT})]\, \bar z_i; \tag{4.25} \]
see Exercise 4.8. The OLS estimator for the regression variance $\sigma_o^2$ is
\[ \hat\sigma_{NT}^2 = \frac{1}{NT - N - k + 1} \sum_{i=1}^N \sum_{t=1}^T (y_{it} - \hat a_{NT,i} - z_{it}' \hat b_{NT})^2. \]
Substituting $\hat\sigma_{NT}^2$ into the formulae of $\mathrm{var}(\hat b_{NT})$ and $\mathrm{var}(\hat a_{NT,i})$ we immediately obtain their OLS estimators. On the other hand, if $\mathrm{var}(y_i) = \sigma_i^2 I_T$, so that the variances of $y_{it}$ are constant within each group but different across groups, we have the problem of heteroskedasticity. If $\mathrm{cov}(y_i, y_j) = \sigma_{ij} I_T$ for some $i \neq j$, we have spatial correlations among groups, even though observations are serially uncorrelated. In both cases, the OLS estimators are no longer the BLUEs, and FGLS estimation is needed.
Observe that when [A1] and [A2](i) hold for every equation in (4.20),
\[ \mathrm{IE}(\bar y_i) = a_{i,o} + \bar z_i' b_o, \qquad i = 1, 2, \ldots, N. \]
One may then expect to estimate the parameters from a specification based on group averages. In particular, the estimator
\[ \hat b_b = \Big[\sum_{i=1}^N (\bar z_i - \bar z)(\bar z_i - \bar z)'\Big]^{-1} \Big[\sum_{i=1}^N (\bar z_i - \bar z)(\bar y_i - \bar y)\Big] \tag{4.26} \]
is the OLS estimator computed from the following specification:
\[ \bar y_i = a + \bar z_i' b + e_i, \qquad i = 1, \ldots, N. \tag{4.27} \]
This is so because the sample means of $\bar y_i$ and $\bar z_i$ are just their respective overall averages: $\bar y$ and $\bar z$. The estimator (4.26) is known as the between-groups estimator because it is based on the deviations of group averages from their overall averages. As shown in Exercise 4.9, the between-groups estimator is biased for $b_o$ when fixed effects are present. This should not be surprising because, while there are $N + k - 1$ parameters in the fixed effects model, the specification (4.27) contains only N observations and only permits estimation of k parameters.
Consider also a specification ignoring individual effects:
\[ y_i = \ell_T a + Z_i b + e_i, \qquad i = 1, \ldots, N. \tag{4.28} \]
The OLS estimator of b is
\[ \hat b_p = \Big[\sum_{i=1}^N \sum_{t=1}^T (z_{it} - \bar z)(z_{it} - \bar z)'\Big]^{-1} \Big[\sum_{i=1}^N \sum_{t=1}^T (z_{it} - \bar z)(y_{it} - \bar y)\Big]. \tag{4.29} \]
When [A1] and [A2](i) hold for every equation in (4.20), one can see that (4.28) is in effect a specification that omits $N - 1$ relevant dummy variables. It follows that $\hat b_p$ is a biased estimator for $b_o$. Alternatively, it can be shown that the estimator (4.29) is a weighted sum of the between- and within-groups estimators and hence is known as the pooled estimator; see Exercise 4.10. The pooled estimator $\hat b_p$ is therefore biased because $\hat b_b$ is. These examples show that neither the between-groups estimator nor the pooled estimator is a proper choice for the fixed effects model.
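For completeness, here is a sketch (Python/NumPy, using the same hypothetical y and Z arrays as before) of the between-groups estimator (4.26) and the pooled estimator (4.29). In a simulation with nonzero fixed effects correlated with the regressors, both typically exhibit the bias discussed above, while the within-groups estimator does not.

import numpy as np

def between_groups(y, Z):
    """Between-groups estimator (4.26): OLS on group averages."""
    yb = y.mean(axis=1)                       # (N,)  group averages of y
    Zb = Z.mean(axis=1)                       # (N,k) group averages of z
    yd, Zd = yb - yb.mean(), Zb - Zb.mean(axis=0)
    return np.linalg.solve(Zd.T @ Zd, Zd.T @ yd)

def pooled(y, Z):
    """Pooled estimator (4.29): ignore individual effects altogether."""
    yd = y - y.mean()
    Zd = Z - Z.mean(axis=(0, 1))
    ZZ = np.einsum('ntk,ntl->kl', Zd, Zd)
    Zy = np.einsum('ntk,nt->k', Zd, yd)
    return np.linalg.solve(ZZ, Zy)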
4.7.2 Random Effects Model
Given the specification (4.20) that allows for individual effects:
\[ y_i = \ell_T a_i + Z_i b + e_i, \qquad i = 1, 2, \ldots, N, \]
we now treat $a_i$ as random variables rather than parameters. Writing $a_i = a + u_i$ with $a = \mathrm{IE}(a_i)$, the specification above can be expressed as
\[ y_i = \ell_T a + Z_i b + \ell_T u_i + e_i, \qquad i = 1, 2, \ldots, N, \tag{4.30} \]
where $\ell_T u_i$ and $e_i$ form the error term. This specification differs from the fixed effects model in that the intercept terms do not vary across i. The presence of $u_i$ also makes (4.30) different from the specification that does not allow for individual effects. Here, group heterogeneity due to individual effects is characterized by the random variable $u_i$ and absorbed into the error term. Thus, (4.30) is known as the random effects model.
As far as the regression coefficients are concerned, (4.30) and (4.28) are virtually the same. As such, the OLS estimator of b is just the pooled estimator $\hat b_p$. The OLS estimator of a is
\[ \hat a_p = \bar y - \bar z' \hat b_p. \]
If the classical conditions [A1] and [A2](i) hold for each equation such that
\[ \mathrm{IE}(y_i) = \ell_T a_o + Z_i b_o, \qquad i = 1, \ldots, N, \]
$\hat b_p$ and $\hat a_p$ are unbiased for $b_o$ and $a_o$. Note, however, that the pooled estimator would be biased if the individual effects were fixed, as shown in the preceding section.
When [A1] and [A2](i) hold, we can write
\[ y_i = \ell_T a_o + Z_i b_o + \ell_T u_i + \epsilon_i, \tag{4.31} \]
so that the error term $\ell_T u_i + \epsilon_i$ contains two components: the random effects $\ell_T u_i$ and the disturbances $\epsilon_i$ which exist even when there is no random effect. Thus,
\[ \mathrm{var}(y_i) = \sigma_u^2 \ell_T \ell_T' + \mathrm{var}(\epsilon_i) + 2\,\mathrm{cov}(\ell_T u_i, \epsilon_i), \]
where $\sigma_u^2$ is $\mathrm{var}(u_i)$. As the first term on the right-hand side above is a full matrix, $\mathrm{var}(y_i)$ is not a scalar covariance matrix in general. It follows that $\hat b_p$ and $\hat a_p$ are not the BLUEs.
To perform FGLS estimation, more conditions on $\mathrm{var}(y_i)$ are needed. If $\mathrm{var}(\epsilon_i) = \sigma_o^2 I_T$ and $\mathrm{IE}(u_i \epsilon_i) = 0$, we obtain a simpler form of $\mathrm{var}(y_i)$:
\[ S_o := \mathrm{var}(y_i) = \sigma_u^2 \ell_T \ell_T' + \sigma_o^2 I_T. \]
Under the additional conditions that $\mathrm{IE}(u_i u_j) = 0$, $\mathrm{IE}(u_i \epsilon_j) = 0$ and $\mathrm{IE}(\epsilon_i \epsilon_j') = 0$ for all $i \neq j$, we have $\mathrm{cov}(y_i, y_j) = 0$. Hence, var(y) simplifies to a block diagonal matrix:
\[ \Sigma_o := \mathrm{var}(y) = I_N \otimes S_o. \]
It can be verified that the desired transformation matrix for GLS estimation is $\Sigma_o^{-1/2} = I_N \otimes S_o^{-1/2}$, where
\[ S_o^{-1/2} = I_T - \frac{c}{T}\, \ell_T \ell_T', \]
and $c = 1 - \big[\sigma_o^2/(T\sigma_u^2 + \sigma_o^2)\big]^{1/2}$. The transformed data are $S_o^{-1/2} y_i$ and $S_o^{-1/2} Z_i$, $i = 1, \ldots, N$, and their $t$th elements are, respectively, $y_{it} - c\bar y_i$ and $z_{it} - c\bar z_i$. If $\sigma_o^2 = 0$, so that the disturbances $\epsilon_i$ are absent, we have $c = 1$, so that
\[ \Sigma_o^{-1/2} = I_N \otimes (I_T - \ell_T \ell_T'/T) = I_{NT} - P_D, \]
as in the fixed effects model. Consequently, the GLS estimator of b is nothing but the within-groups estimator (4.22). It can be shown that the GLS estimator is also a weighted average of the within- and between-groups estimators.
To compute the FGLS estimator, we must estimate $\sigma_u^2$ and $\sigma_o^2$. Pre-multiplying (4.31) by $\ell_T'/T$ yields
\[ \bar y_i = a_o + \bar z_i' b_o + u_i + \bar\epsilon_i. \]
Taking the difference of $y_i$ and $\bar y_i$ we have
\[ y_i - \ell_T \bar y_i = (Z_i - \ell_T \bar z_i')\, b_o + (\epsilon_i - \ell_T \bar\epsilon_i), \]
which does not involve $u_i$. This suggests that, even when random effects are present, we can also estimate $b_o$ from the specification
\[ y_i - \ell_T \bar y_i = (Z_i - \ell_T \bar z_i')\, b + e_i, \qquad i = 1, \ldots, N. \]
It is readily seen that the OLS estimator based on this specification is the within-groups estimator $\hat b_{NT}$. As the $u_i$ have been eliminated, we can estimate $\sigma_o^2$, the variance of $\epsilon_{it}$, by
\[ \hat\sigma_\epsilon^2 = \frac{1}{NT - N - k + 1} \sum_{i=1}^N \sum_{t=1}^T \big[(y_{it} - \bar y_i) - (z_{it} - \bar z_i)' \hat b_{NT}\big]^2, \]
which is also the variance estimator in the fixed effects model.
By (4.31),
\[ \bar y_i = a_o + \bar z_i' b_o + u_i + \bar\epsilon_i. \]
This suggests that $b_o$ can also be estimated from the specification based on group averages:
\[ \bar y_i = a + \bar z_i' b + e_i, \qquad i = 1, \ldots, N. \]
This specification is the same as (4.27), so that the OLS estimator of b is the between-groups estimator $\hat b_b$. The resulting OLS residuals are
\[ \hat e_i = (\bar y_i - \bar y) - (\bar z_i - \bar z)' \hat b_b, \qquad i = 1, \ldots, N. \]
We can estimate
\[ \mathrm{var}(u_i + \bar\epsilon_i) = \sigma_u^2 + \sigma_o^2/T \]
by $\sum_{i=1}^N \hat e_i^2/(N - k)$, from which the estimator for $\sigma_u^2$ can be calculated as
\[ \hat\sigma_u^2 = \frac{1}{N - k} \sum_{i=1}^N \hat e_i^2 - \frac{\hat\sigma_\epsilon^2}{T}. \]
With these two estimators $\hat\sigma_u^2$ and $\hat\sigma_\epsilon^2$ we can construct the transformation matrix for the FGLS estimator, $\hat S^{-1/2}$. It is clear that the FGLS estimator is, again, a very complex function of y.
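The following sketch (Python/NumPy, same hypothetical y and Z arrays as before) puts the pieces together: it estimates $\sigma_\epsilon^2$ from within-group residuals, backs out $\hat\sigma_u^2$ from the between-groups residuals, forms c, and runs OLS on the quasi-demeaned data. The clamping of $\hat\sigma_u^2$ at zero and the handling of the transformed intercept column are practical choices of this sketch, not prescriptions from the text.

import numpy as np

def random_effects_fgls(y, Z):
    """Random-effects FGLS via quasi-demeaning; y is (N, T), Z is (N, T, k)."""
    N, T, k = Z.shape
    # Within-groups estimate and sigma_eps^2.
    yw = y - y.mean(axis=1, keepdims=True)
    Zw = Z - Z.mean(axis=1, keepdims=True)
    b_w = np.linalg.solve(np.einsum('ntk,ntl->kl', Zw, Zw),
                          np.einsum('ntk,nt->k', Zw, yw))
    resid_w = yw - Zw @ b_w
    sig2_eps = (resid_w**2).sum() / (N*T - N - k + 1)
    # Between-groups residuals and sigma_u^2.
    yb, Zb = y.mean(axis=1), Z.mean(axis=1)
    yd, Zd = yb - yb.mean(), Zb - Zb.mean(axis=0)
    b_b = np.linalg.solve(Zd.T @ Zd, Zd.T @ yd)
    e_b = yd - Zd @ b_b
    sig2_u = max((e_b**2).sum() / (N - k) - sig2_eps / T, 0.0)   # clamp at zero
    # Quasi-demeaning with c = 1 - sqrt(sig2_eps / (T*sig2_u + sig2_eps)).
    c = 1.0 - np.sqrt(sig2_eps / (T * sig2_u + sig2_eps))
    y_star = (y - c * y.mean(axis=1, keepdims=True)).ravel()
    Z_star = (Z - c * Z.mean(axis=1, keepdims=True)).reshape(N * T, k)
    Z_star = np.column_stack([np.full(N * T, 1.0 - c), Z_star])  # transformed intercept
    coef = np.linalg.solve(Z_star.T @ Z_star, Z_star.T @ y_star)
    return coef[0], coef[1:]          # (a_hat, b_hat)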
4.8 Limitations of the FGLS Method
In this chapter we relax only the classical condition [A2](ii) while maintaining [A1] and [A2](i). The limitations of [A1] and [A2](i) discussed in Chapter 3.7 therefore still exist. In particular, stochastic regressors and nonlinear specifications are excluded in the present context.

Although the GLS and FGLS methods are designed to improve estimation efficiency when there is a non-scalar covariance matrix $\Sigma_o$, they also create further difficulties. First, the GLS estimator is usually not available, except in some exceptional cases. Second, a convenient FGLS estimator is available only at the expense of more conditions on $\Sigma_o$. If these simplifying conditions are incorrectly imposed, the resulting FGLS estimator may perform poorly. Third, the finite-sample properties of the FGLS estimator are typically unknown. In general, we do not know whether an FGLS estimator is unbiased, nor do we know its efficiency relative to the OLS estimator or its exact distribution. It is therefore difficult to draw statistical inferences from FGLS estimation results.
Exercises

4.1 Given the linear specification $y = X\beta + e$, suppose that the conditions [A1] and [A2](i) hold and that $\mathrm{var}(y) = \Sigma_o$. If the matrix X contains k eigenvectors of $\Sigma_o$ which are normalized to unit length, what are the resulting $\hat\beta_T$ and $\hat\beta_{\mathrm{GLS}}$? Explain your result.

4.2 Show that $R^2_{\mathrm{GLS}}$ need not be bounded between zero and one.

4.3 Given the linear specification $y = X\beta + e$, suppose that the conditions [A1] and [A2](i) hold and that $\mathrm{var}(y) = \Sigma_o$. Show directly that
\[ \mathrm{var}(\hat\beta_T) - \mathrm{var}(\hat\beta_{\mathrm{GLS}}) \]
is a positive semi-definite matrix.

4.4 Suppose that $y = X\beta_o + \epsilon$ and the elements of $\epsilon$ are $\epsilon_t = \psi_1 \epsilon_{t-1} + u_t$, where $\psi_1 = 1$ and $\{u_t\}$ is a white noise with mean zero and variance $\sigma_u^2$. What are the properties of $\epsilon_t$? Is $\epsilon_t$ still weakly stationary?

4.5 Suppose that $y = X\beta_o + \epsilon$ and the elements of $\epsilon$ are $\epsilon_t = u_t + \pi_1 u_{t-1}$, where $|\pi_1| < 1$ and $\{u_t\}$ is a white noise with mean zero and variance $\sigma_u^2$. Calculate the variance, autocovariances, and autocorrelations of $\epsilon_t$ and compare them with those of AR(1) disturbances.

4.6 Let $y_t$ denote investment expenditure that is determined by expected earnings $x_t^*$:
\[ y_t = a_o + b_o x_t^* + u_t. \]
When $x_t^*$ is adjusted adaptively:
\[ x_t^* = x_{t-1}^* + (1 - \lambda_o)(x_t - x_{t-1}^*), \qquad 0 < \lambda_o < 1, \]
show that $y_t$ can be represented by a model with a lagged dependent variable and moving average disturbances.

4.7 Given the SUR specification (4.17), show that the GLS estimator is the same as the OLS estimator when $X_i = X_0$ for all i. Give an intuitive explanation of this result.

4.8 Given the specification (4.20), suppose that [A1] and [A2](i) hold for each group equation, $\mathrm{var}(y_i) = \sigma_o^2 I_T$ and $\mathrm{cov}(y_i, y_j) = 0$ for $i \neq j$. Prove (4.25).

4.9 Given the specification (4.27), suppose that [A1] and [A2](i) hold for each group equation. Find the expected value of the between-groups estimator (4.26).

4.10 Given the specification (4.28), show that the pooled estimator (4.29) is a weighted sum of the within- and between-groups estimators. What is its expected value when [A1] and [A2](i) hold for each group equation?
References

Amemiya, Takeshi (1985). Advanced Econometrics, Cambridge, MA: Harvard University Press.

Farebrother, R. W. (1980). The Durbin-Watson test for serial correlation when there is no intercept in the regression, Econometrica, 48, 1553-1563.

Goldberger, Arthur S. (1991). A Course in Econometrics, Cambridge, MA: Harvard University Press.

Greene, William H. (2000). Econometric Analysis, Fourth ed., New York, NY: Macmillan.

Wallis, K. F. (1972). Testing for fourth order autocorrelation in quarterly regression equations, Econometrica, 40, 617-636.
Chapter 5
Probability Theory
The purpose of this chapter is to summarize some important concepts and results in
probability theory to be used subsequently. We formally define random variables and
moments (unconditional and conditional) under a measure-theoretic framework. Our
emphasis is on important limiting theorems, such as the law of large numbers and central
limit theorem, which play a crucial role in the asymptotic analysis of many econometric
estimators and tests. Davidson (1994) provides a complete and thorough treatment of
the topics in this chapter; see also Bierens (1994), Gallant (1997) and White (1984) for a
concise coverage. Many results here are taken freely from these references. The readers
may also consult other real analysis and probability textbooks for related topics.
5.1 Probability Space and Random Variables
5.1.1 Probability Space
The probability space associated with a random experiment is determined by three components: the outcome space $\Omega$, a collection of events (subsets of $\Omega$) $\mathcal F$, and a probability measure assigned to the elements in $\mathcal F$. Given the subset A of $\Omega$, its complement is
\[ A^c = \{\omega \in \Omega : \omega \notin A\}. \]
In the probability space $(\Omega, \mathcal F, \mathrm{IP})$, $\mathcal F$ is a $\sigma$-algebra ($\sigma$-field) in the sense that it satisfies the following requirements:
1. $\Omega \in \mathcal F$;
2. if $A \in \mathcal F$, then $A^c \in \mathcal F$;
3. if $A_1, A_2, \ldots$ are in $\mathcal F$, then $\bigcup_{n=1}^\infty A_n \in \mathcal F$.
The first and second properties imply that $\Omega^c = \emptyset$ is also in $\mathcal F$. Combining the second and third properties, we have from de Morgan's law that
\[ \Big(\bigcup_{n=1}^\infty A_n\Big)^c = \bigcap_{n=1}^\infty A_n^c \in \mathcal F. \]
A $\sigma$-algebra is thus closed under complementation, countable union and countable intersection.
The probability measure $\mathrm{IP}\colon \mathcal F \to [0, 1]$ is a real-valued set function satisfying the following axioms:
1. $\mathrm{IP}(\Omega) = 1$;
2. $\mathrm{IP}(A) \ge 0$ for all $A \in \mathcal F$;
3. if $A_1, A_2, \ldots \in \mathcal F$ are disjoint, then $\mathrm{IP}(\bigcup_{n=1}^\infty A_n) = \sum_{n=1}^\infty \mathrm{IP}(A_n)$.
From these axioms we easily deduce that $\mathrm{IP}(\emptyset) = 0$, $\mathrm{IP}(A^c) = 1 - \mathrm{IP}(A)$, $\mathrm{IP}(A) \le \mathrm{IP}(B)$ if $A \subseteq B$, and
\[ \mathrm{IP}(A \cup B) = \mathrm{IP}(A) + \mathrm{IP}(B) - \mathrm{IP}(A \cap B). \]
Moreover, if $\{A_n\}$ is an increasing (decreasing) sequence in $\mathcal F$ with the limiting set A, then $\lim_{n\to\infty} \mathrm{IP}(A_n) = \mathrm{IP}(A)$.
Let $\mathcal C$ be a collection of subsets of $\Omega$. The intersection of all the $\sigma$-algebras that contain $\mathcal C$ is the smallest $\sigma$-algebra containing $\mathcal C$; see Exercise 5.1. This $\sigma$-algebra is referred to as the $\sigma$-algebra generated by $\mathcal C$, denoted as $\sigma(\mathcal C)$. When $\Omega = \mathbb R$, the Borel field $\mathcal B$ is the $\sigma$-algebra generated by all open intervals (a, b) in $\mathbb R$. Note that open intervals, closed intervals [a, b], half-open intervals (a, b] or half lines $(-\infty, b]$ can be obtained from each other by taking complement, union and/or intersection. For example,
\[ (a, b] = \bigcap_{n=1}^\infty \Big(a,\, b + \frac{1}{n}\Big), \qquad (a, b) = \bigcup_{n=1}^\infty \Big(a,\, b - \frac{1}{n}\Big]. \]
Thus, the collection of all closed intervals (half-open intervals, half lines) generates the same Borel field. As such, open intervals, closed intervals, half-open intervals and half lines are also known as Borel sets. The Borel field on $\mathbb R^d$, denoted as $\mathcal B^d$, is generated by all open hypercubes:
\[ (a_1, b_1) \times (a_2, b_2) \times \cdots \times (a_d, b_d). \]
Equivalently, $\mathcal B^d$ can be generated by all closed (half-open) hypercubes, or by
\[ (-\infty, b_1] \times (-\infty, b_2] \times \cdots \times (-\infty, b_d]. \]
5.1.2 Random Variables
Let $\mathcal B$ denote the Borel field on $\mathbb R$. A random variable z is a function $z\colon \Omega \to \mathbb R$ such that for every $B \in \mathcal B$, the inverse image of B under z is in $\mathcal F$, i.e.,
\[ z^{-1}(B) = \{\omega : z(\omega) \in B\} \in \mathcal F. \]
We also say that z is an $\mathcal F/\mathcal B$-measurable (or simply $\mathcal F$-measurable) function. An $\mathbb R^d$-valued random variable z is a function $z\colon \Omega \to \mathbb R^d$ that is $\mathcal F/\mathcal B^d$-measurable. Given the random vector z, its inverse images $z^{-1}(B)$ form a $\sigma$-algebra, denoted as $\sigma(z)$. It can be shown that $\sigma(z)$ is the smallest $\sigma$-algebra contained in $\mathcal F$ such that z is measurable. We usually interpret $\sigma(z)$ as the set containing all the information associated with z.
A function $g\colon \mathbb R \to \mathbb R$ is said to be $\mathcal B$-measurable or Borel measurable if
\[ \{\zeta \in \mathbb R : g(\zeta) \le b\} \in \mathcal B \quad \text{for every } b \in \mathbb R. \]
If z is a random variable defined on $(\Omega, \mathcal F, \mathrm{IP})$, then g(z) is also a random variable defined on the same probability space provided that g is Borel measurable. Note that the functions we usually encounter are indeed Borel measurable; non-measurable functions are very exceptional and hence are not of general interest. Similarly, for the d-dimensional random vector z, g(z) is a random variable provided that g is $\mathcal B^d$-measurable.
Recall from Section 2.1 that the joint distribution function of z is the non-decreasing, right-continuous function $F_z$ such that for $\zeta = (\zeta_1 \; \cdots \; \zeta_d)' \in \mathbb R^d$,
\[ F_z(\zeta) = \mathrm{IP}\{\omega : z_1(\omega) \le \zeta_1, \ldots, z_d(\omega) \le \zeta_d\}, \]
with
\[ \lim_{\zeta_1, \ldots, \zeta_d \to -\infty} F_z(\zeta) = 0, \qquad \lim_{\zeta_1, \ldots, \zeta_d \to \infty} F_z(\zeta) = 1. \]
The marginal distribution function of the $i$th component of z is such that
\[ F_{z_i}(\zeta_i) = \mathrm{IP}\{\omega : z_i(\omega) \le \zeta_i\} = F_z(\infty, \ldots, \infty, \zeta_i, \infty, \ldots, \infty). \]
Note that while IP is a set function defined on $\mathcal F$, the distribution function of z is a point function defined on $\mathbb R^d$.
Two random variables y and z are said to be (pairwise) independent if, and only if, for any Borel sets $B_1$ and $B_2$,
\[ \mathrm{IP}(y \in B_1 \text{ and } z \in B_2) = \mathrm{IP}(y \in B_1)\, \mathrm{IP}(z \in B_2). \]
This immediately leads to the standard definition of independence: y and z are independent if, and only if, their joint distribution is the product of their marginal distributions, as in Section 2.1. A sequence of random variables $\{z_i\}$ is said to be totally independent if
\[ \mathrm{IP}\Big(\bigcap_{\text{all } i} \{z_i \in B_i\}\Big) = \prod_{\text{all } i} \mathrm{IP}(z_i \in B_i), \]
for any Borel sets $B_i$. In what follows, a totally independent sequence will be referred to as an independent sequence or a sequence of independent variables for convenience. For an independent sequence, we have the following generalization of Lemma 2.1.

Lemma 5.1 Let $\{z_i\}$ be a sequence of independent random variables and $h_i$, $i = 1, 2, \ldots$, be Borel-measurable functions. Then $\{h_i(z_i)\}$ is also a sequence of independent random variables.
5.1.3 Moments and Norms
The expectation of the $i$th element of z is
\[ \mathrm{IE}(z_i) = \int_\Omega z_i(\omega)\, d\,\mathrm{IP}(\omega), \]
where the right-hand side is a Lebesgue integral. In view of the distribution function defined above, a change of $\omega$ causes the realization of z to change, so that
\[ \mathrm{IE}(z_i) = \int_{\mathbb R^d} \zeta_i\, dF_z(\zeta) = \int_{\mathbb R} \zeta_i\, dF_{z_i}(\zeta_i), \]
where $F_{z_i}$ is the marginal distribution function of the $i$th component of z, as defined in Section 2.2. For the Borel measurable function g of z,
\[ \mathrm{IE}[g(z)] = \int_\Omega g(z(\omega))\, d\,\mathrm{IP}(\omega) = \int_{\mathbb R^d} g(\zeta)\, dF_z(\zeta). \]
Other moments, such as variance and covariance, can also be defined as Lebesgue integrals with respect to the probability measure; see Section 2.2.
A function g is said to be convex on a set S if for any $a \in [0, 1]$ and any x, y in S,
\[ g\big(ax + (1 - a)y\big) \le a\,g(x) + (1 - a)\,g(y); \]
g is concave on S if the inequality above is reversed. For example, $g(x) = x^2$ is convex, and $g(x) = \log x$ for $x > 0$ is concave. The result below is concerned with convex (concave) transformations.

Lemma 5.2 (Jensen) For the Borel measurable function g that is convex on the support of the integrable random variable z, suppose that g(z) is also integrable. Then,
\[ g(\mathrm{IE}(z)) \le \mathrm{IE}[g(z)]; \]
the inequality reverses if g is concave.
For the random variable z with finite $p$th moment, let $\|z\|_p = [\mathrm{IE}|z|^p]^{1/p}$ denote its $L_p$-norm. Also define the inner product of two square integrable random variables $z_i$ and $z_j$ as their cross moment:
\[ \langle z_i, z_j \rangle = \mathrm{IE}(z_i z_j). \]
Then, the $L_2$-norm can be obtained from the inner product as $\|z_i\|_2 = \langle z_i, z_i \rangle^{1/2}$. It is easily seen that for any $c > 0$ and $p > 0$,
\[ c^p\, \mathrm{IP}(|z| \ge c) = c^p \int \mathbf 1_{\{\zeta : |\zeta| \ge c\}}\, dF_z(\zeta) \le \int_{\{\zeta : |\zeta| \ge c\}} |\zeta|^p\, dF_z(\zeta) \le \mathrm{IE}|z|^p, \]
where $\mathbf 1_{\{\zeta : |\zeta| \ge c\}}$ is the indicator function which equals one if $|\zeta| \ge c$ and equals zero otherwise. This establishes the following result.

Lemma 5.3 (Markov) Let z be a random variable with finite $p$th moment. Then,
\[ \mathrm{IP}(|z| \ge c) \le \frac{\mathrm{IE}|z|^p}{c^p}, \]
where c is a positive real number.

For p = 2, Lemma 5.3 is also known as the Chebyshev inequality. If c is small such that $\mathrm{IE}|z|^p/c^p > 1$, Markov's inequality is trivial. When c tends to infinity, the probability that z assumes very extreme values will be vanishing at the rate $c^{-p}$.
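As a quick numerical illustration (a minimal sketch, not from the text; the draws and cutoffs are arbitrary), one can compare the simulated tail probability of a standard normal variable with the Chebyshev bound $\mathrm{IE}|z|^2/c^2 = 1/c^2$, which should always dominate it.

import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)       # IE|z|^2 = 1 for the standard normal
for c in (1.0, 2.0, 3.0):
    tail = np.mean(np.abs(z) >= c)       # simulated IP(|z| >= c)
    bound = 1.0 / c**2                   # Chebyshev bound IE|z|^2 / c^2
    print(f"c={c}: IP(|z|>=c) ~ {tail:.4f} <= bound {bound:.4f}")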
Another useful result in probability theory is stated below without proof.

Lemma 5.4 (Hölder) Let y be a random variable with finite $p$th moment ($p > 1$) and z a random variable with finite $q$th moment ($q = p/(p-1)$). Then,
\[ \mathrm{IE}|yz| \le \|y\|_p\, \|z\|_q. \]

For p = 2, we have $\mathrm{IE}|yz| \le \|y\|_2\, \|z\|_2$. By noting that $|\mathrm{IE}(yz)| \le \mathrm{IE}|yz|$, we immediately have the next result; cf. Lemma 2.3.

Lemma 5.5 (Cauchy-Schwartz) Let y and z be two square integrable random variables. Then,
\[ |\mathrm{IE}(yz)| \le \|y\|_2\, \|z\|_2. \]

Let y = 1 and $x = |z|^p$. Then for $q > p$ and $r = q/p$, Hölder's inequality also ensures that
\[ \mathrm{IE}|z|^p \le \|x\|_r\, \|y\|_{r/(r-1)} = [\mathrm{IE}(|z|^{pr})]^{1/r} = [\mathrm{IE}(|z|^q)]^{p/q}. \]
This shows that when a random variable has finite $q$th moment, it must also have finite $p$th moment for any $p < q$, as stated below.

Lemma 5.6 (Liapunov) Let z be a random variable with finite $q$th moment. Then for $p < q$, $\|z\|_p \le \|z\|_q$.
The inequality below states that the $L_p$-norm of a finite sum is no greater than the sum of the individual $L_p$-norms.

Lemma 5.7 (Minkowski) Let $z_i$, $i = 1, \ldots, n$, be random variables with finite $p$th moment ($p \ge 1$). Then,
\[ \Big\| \sum_{i=1}^n z_i \Big\|_p \le \sum_{i=1}^n \|z_i\|_p. \]
When there are only two random variables in the sum, this is just the triangle inequality for $L_p$-norms; see also Exercise 5.4.
5.2 Conditional Distribution and Moments
Given two events A and B in $\mathcal F$, if it is known that B has occurred, the outcome space is restricted to B, so that the outcomes of A must be in $A \cap B$. The likelihood of A is thus characterized by the conditional probability
\[ \mathrm{IP}(A \mid B) = \mathrm{IP}(A \cap B)/\mathrm{IP}(B), \]
for $\mathrm{IP}(B) \neq 0$. It can be shown that $\mathrm{IP}(\cdot \mid B)$ satisfies the axioms for probability measures; see Exercise 5.5. This concept is readily extended to construct the conditional density function and conditional distribution function.
5.2.1 Conditional Distributions
Let y and z denote two integrable random vectors such that y has the density function $f_y$. For $f_y(\eta) \neq 0$, define the conditional density function of z given $y = \eta$ as
\[ f_{z|y}(\zeta \mid y = \eta) = \frac{f_{z,y}(\zeta, \eta)}{f_y(\eta)}, \]
which is clearly non-negative whenever it is defined. This function also integrates to one on $\mathbb R^d$ because
\[ \int_{\mathbb R^d} f_{z|y}(\zeta \mid y = \eta)\, d\zeta = \frac{1}{f_y(\eta)} \int_{\mathbb R^d} f_{z,y}(\zeta, \eta)\, d\zeta = \frac{1}{f_y(\eta)}\, f_y(\eta) = 1. \]
Thus, $f_{z|y}$ is a legitimate density function. For example, the bivariate density function of two random variables z and y forms a surface on the zy-plane. By fixing $y = \eta$, we obtain a cross section (slice) under this surface. Dividing the joint density by the marginal density $f_y(\eta)$ amounts to adjusting the height of this slice so that the resulting area integrates to one.

Given the conditional density function $f_{z|y}$, we have for $A \in \mathcal B^d$,
\[ \mathrm{IP}(z \in A \mid y = \eta) = \int_A f_{z|y}(\zeta \mid y = \eta)\, d\zeta. \]
Note that this conditional probability is defined even when $\mathrm{IP}(y = \eta)$ may be zero. In particular, when
\[ A = (-\infty, \zeta_1] \times \cdots \times (-\infty, \zeta_d], \]
we obtain the conditional distribution function:
\[ F_{z|y}(\zeta \mid y = \eta) = \mathrm{IP}(z_1 \le \zeta_1, \ldots, z_d \le \zeta_d \mid y = \eta). \]
When z and y are independent, the conditional density (distribution) simply reduces to the unconditional density (distribution).
5.2.2 Conditional Moments
Analogous to unconditional expectation, the conditional expectation of the integrable random variable $z_i$ given the information $y = \eta$ is
\[ \mathrm{IE}(z_i \mid y = \eta) = \int_{\mathbb R} \zeta_i\, dF_{z|y}(\zeta_i \mid y = \eta); \]
$\mathrm{IE}(z \mid y = \eta)$ is defined elementwise. By allowing y to vary across all possible values $\eta$, we obtain the conditional expectation function $\mathrm{IE}(z \mid y)$ whose realization depends on $\eta$, the realization of y. Thus, $\mathrm{IE}(z \mid y)$ is a function of y and hence a random vector.

More generally, we can take a suitable $\sigma$-algebra as a conditioning set and define $\mathrm{IE}(z \mid \mathcal G)$, where $\mathcal G$ is a sub-$\sigma$-algebra of $\mathcal F$. Similar to the discussion above, $\mathrm{IE}(z \mid \mathcal G)$ varies with the occurrence of each $G \in \mathcal G$. Specifically, for the integrable random vector z, $\mathrm{IE}(z \mid \mathcal G)$ is the $\mathcal G$-measurable random variable satisfying
\[ \int_G \mathrm{IE}(z \mid \mathcal G)\, d\,\mathrm{IP} = \int_G z\, d\,\mathrm{IP}, \]
for all $G \in \mathcal G$. By setting $\mathcal G = \sigma(y)$, the $\sigma$-algebra generated by y, we can write
\[ \mathrm{IE}(z \mid y) = \mathrm{IE}[z \mid \sigma(y)], \]
which is interpreted as the expectation of z given all the information associated with y. Note that the unconditional expectation IE(z) can be viewed as the expectation of z conditional on the trivial $\sigma$-algebra $\{\emptyset, \Omega\}$, i.e., the smallest $\sigma$-algebra that contains no extra information from any random vectors.
Similar to unconditional expectations, conditional expectations are monotonic: if $z \ge x$ with probability one, then $\mathrm{IE}(z \mid \mathcal G) \ge \mathrm{IE}(x \mid \mathcal G)$ with probability one; in particular, if $z \ge 0$ with probability one, then $\mathrm{IE}(z \mid \mathcal G) \ge 0$ with probability one. Moreover, if z is independent of y, then $\mathrm{IE}(z \mid y) = \mathrm{IE}(z)$. For example, if z is a constant vector c, which is independent of any random variable, then $\mathrm{IE}(z \mid y) = c$. The linearity result below is analogous to Lemma 2.2 for unconditional expectations.

Lemma 5.8 Let z ($d \times 1$) and y ($c \times 1$) be integrable random vectors and A ($n \times d$) and B ($n \times c$) be non-stochastic matrices. Then with probability one,
\[ \mathrm{IE}(Az + By \mid \mathcal G) = A\,\mathrm{IE}(z \mid \mathcal G) + B\,\mathrm{IE}(y \mid \mathcal G). \]
If b ($n \times 1$) is a non-stochastic vector, $\mathrm{IE}(Az + b \mid \mathcal G) = A\,\mathrm{IE}(z \mid \mathcal G) + b$ with probability one.
From the definition of conditional expectation, we immediately have
\[ \int_\Omega \mathrm{IE}(z \mid \mathcal G)\, d\,\mathrm{IP} = \int_\Omega z\, d\,\mathrm{IP}; \]
that is, $\mathrm{IE}[\mathrm{IE}(z \mid \mathcal G)] = \mathrm{IE}(z)$. This is known as the law of iterated expectations. As IE(z) is also the conditional expectation with respect to the trivial (smallest) $\sigma$-algebra, the equality above suggests that if conditional expectations are taken sequentially with respect to different $\sigma$-algebras, only the one with respect to a smaller $\sigma$-algebra matters. For example, for k random vectors $y_1, \ldots, y_k$,
\[ \mathrm{IE}[\mathrm{IE}(z \mid y_1, \ldots, y_k) \mid y_1, \ldots, y_{k-1}] = \mathrm{IE}(z \mid y_1, \ldots, y_{k-1}). \]
A more general result is given below; see Exercise 5.6.

Lemma 5.9 (Law of Iterated Expectations) Let $\mathcal G$ and $\mathcal H$ be two sub-$\sigma$-algebras of $\mathcal F$ such that $\mathcal G \subseteq \mathcal H$. Then for the integrable random vector z,
\[ \mathrm{IE}[\mathrm{IE}(z \mid \mathcal H) \mid \mathcal G] = \mathrm{IE}[\mathrm{IE}(z \mid \mathcal G) \mid \mathcal H] = \mathrm{IE}(z \mid \mathcal G); \]
in particular, $\mathrm{IE}[\mathrm{IE}(z \mid \mathcal G)] = \mathrm{IE}(z)$.
For a $\mathcal G$-measurable random vector z, the information in $\mathcal G$ does not improve on our understanding of z, so that $\mathrm{IE}(z \mid \mathcal G) = z$ with probability one. That is, z can be treated as known in $\mathrm{IE}(z \mid \mathcal G)$ and taken out from the conditional expectation. Thus,
\[ \mathrm{IE}(zx' \mid \mathcal G) = z\, \mathrm{IE}(x' \mid \mathcal G). \]
In particular, z can be taken out from the conditional expectation when z itself is a conditioning variable. This result is generalized as follows.

Lemma 5.10 Let z be a $\mathcal G$-measurable random vector. Then for any Borel-measurable function g,
\[ \mathrm{IE}[g(z)\,x \mid \mathcal G] = g(z)\, \mathrm{IE}(x \mid \mathcal G), \]
with probability one.
Two square integrable random variables z and y are said to be orthogonal if their inner product $\mathrm{IE}(zy) = 0$. This definition allows us to discuss orthogonal projection in the space of square integrable random variables. Let z be a square integrable random variable and $\tilde z$ a $\mathcal G$-measurable random variable. Then, by Lemma 5.9 (law of iterated expectations) and Lemma 5.10,
\[ \mathrm{IE}\big\{[z - \mathrm{IE}(z \mid \mathcal G)]\,\tilde z\big\} = \mathrm{IE}\big\{\mathrm{IE}\big([z - \mathrm{IE}(z \mid \mathcal G)]\,\tilde z \mid \mathcal G\big)\big\} = \mathrm{IE}\big\{[\mathrm{IE}(z \mid \mathcal G) - \mathrm{IE}(z \mid \mathcal G)]\,\tilde z\big\} = 0. \]
That is, the difference between z and its conditional expectation $\mathrm{IE}(z \mid \mathcal G)$ must be orthogonal to any $\mathcal G$-measurable random variable. It can then be seen that for any square integrable, $\mathcal G$-measurable random variable $\tilde z$,
\[ \mathrm{IE}(z - \tilde z)^2 = \mathrm{IE}[z - \mathrm{IE}(z \mid \mathcal G) + \mathrm{IE}(z \mid \mathcal G) - \tilde z]^2 = \mathrm{IE}[z - \mathrm{IE}(z \mid \mathcal G)]^2 + \mathrm{IE}[\mathrm{IE}(z \mid \mathcal G) - \tilde z]^2 \ge \mathrm{IE}[z - \mathrm{IE}(z \mid \mathcal G)]^2, \]
where in the second equality the cross-product term vanishes because both $\mathrm{IE}(z \mid \mathcal G)$ and $\tilde z$ are $\mathcal G$-measurable and hence orthogonal to $z - \mathrm{IE}(z \mid \mathcal G)$. That is, among all $\mathcal G$-measurable random variables that are also square integrable, $\mathrm{IE}(z \mid \mathcal G)$ is the closest to z in terms of the $L_2$-norm. This shows that $\mathrm{IE}(z \mid \mathcal G)$ is the orthogonal projection of z onto the space of all $\mathcal G$-measurable, square integrable random variables.

Lemma 5.11 Let z be a square integrable random variable. Then
\[ \mathrm{IE}[z - \mathrm{IE}(z \mid \mathcal G)]^2 \le \mathrm{IE}(z - \tilde z)^2, \]
for any $\mathcal G$-measurable random variable $\tilde z$.
In particular, let $\mathcal G = \sigma(y)$, where y is a square integrable random vector. Lemma 5.11 implies that
\[ \mathrm{IE}\big[z - \mathrm{IE}\big(z \mid \sigma(y)\big)\big]^2 \le \mathrm{IE}\big[z - h(y)\big]^2, \]
for any Borel-measurable function h such that h(y) is also square integrable. Thus, $\mathrm{IE}[z \mid \sigma(y)]$ minimizes the $L_2$-norm $\|z - h(y)\|_2$, and its difference from z is orthogonal to any function of y that is also square integrable. We may then say that, given all the information generated from y, $\mathrm{IE}(z \mid \sigma(y))$ is the best approximation of z in terms of the $L_2$-norm (the best $L_2$ predictor).
The conditional variance-covariance matrix of z given y is
\[ \mathrm{var}(z \mid y) = \mathrm{IE}\big\{[z - \mathrm{IE}(z \mid y)][z - \mathrm{IE}(z \mid y)]' \mid y\big\} = \mathrm{IE}(zz' \mid y) - \mathrm{IE}(z \mid y)\,\mathrm{IE}(z \mid y)'. \]
Similar to the unconditional variance-covariance matrix, we have for a non-stochastic matrix A and vector b,
\[ \mathrm{var}(Az + b \mid y) = A\,\mathrm{var}(z \mid y)\,A', \]
which is nonsingular provided that A has full row rank and $\mathrm{var}(z \mid y)$ is positive definite. It can also be shown that
\[ \mathrm{var}(z) = \mathrm{IE}[\mathrm{var}(z \mid y)] + \mathrm{var}\big(\mathrm{IE}(z \mid y)\big); \]
see Exercise 5.7. That is, the variance of z can be expressed as the sum of two components: the mean of its conditional variance and the variance of its conditional mean. This is also known as the analysis-of-variance decomposition.
Example 5.12 Suppose that $(y' \; x')'$ is distributed as a multivariate normal random vector:
\[ \begin{bmatrix} y \\ x \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_y \\ \mu_x \end{bmatrix}, \begin{bmatrix} \Sigma_y & \Sigma_{xy}' \\ \Sigma_{xy} & \Sigma_x \end{bmatrix}\right). \]
It is well known that the conditional distribution of y given x is also normal. Moreover, it can be shown that
\[ \mathrm{IE}(y \mid x) = \mu_y + \Sigma_{xy}' \Sigma_x^{-1} (x - \mu_x), \]
a linear function of x. By the analysis-of-variance decomposition, the conditional variance-covariance matrix of y is
\[ \mathrm{var}(y \mid x) = \mathrm{var}(y) - \mathrm{var}\big(\mathrm{IE}(y \mid x)\big) = \Sigma_y - \Sigma_{xy}' \Sigma_x^{-1} \Sigma_{xy}, \]
which does not depend on x.
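A quick numerical check of the analysis-of-variance decomposition (a sketch with an arbitrary design, not from the text: x standard normal, y = 2x + e with e standard normal independent of x, so that IE(y | x) = 2x and var(y | x) = 1):

import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.standard_normal(n)
y = 2.0 * x + rng.standard_normal(n)
var_y = y.var()
mean_cond_var = 1.0                 # var(y | x) = 1 by construction
var_cond_mean = (2.0 * x).var()     # var(IE(y | x)) = var(2x)
print(var_y, mean_cond_var + var_cond_mean)   # both close to 5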
5.3 Modes of Convergence
Consider now a sequence of random variables $\{z_n(\omega)\}_{n=1,2,\ldots}$ defined on the probability space $(\Omega, \mathcal F, \mathrm{IP})$. For a given $\omega$, $\{z_n(\omega)\}$ is a realization (a sequence of sample values) of the random element with the index n; for a given n, $z_n$ is a random variable which assumes different values depending on $\omega$. In this section we will discuss various modes of convergence for sequences of random variables.
5.3.1 Almost Sure Convergence
We first introduce the concept of almost sure convergence (convergence with probability one). Suppose that $\{z_n\}$ is a sequence of random variables and z is a random variable, all defined on the probability space $(\Omega, \mathcal F, \mathrm{IP})$. The sequence $\{z_n\}$ is said to converge to z almost surely if, and only if,
\[ \mathrm{IP}(\omega : z_n(\omega) \to z(\omega) \text{ as } n \to \infty) = 1, \]
denoted as $z_n \xrightarrow{a.s.} z$ or $z_n \to z$ a.s. Note that for a given $\omega$, the realization $z_n(\omega)$ may or may not converge to $z(\omega)$. Almost sure convergence requires that $z_n(\omega) \to z(\omega)$ for almost all $\omega \in \Omega$, except for those in a set with probability zero. That is, almost all the realizations $z_n(\omega)$ will be eventually close to $z(\omega)$ for all n sufficiently large; the event that $z_n$ will not approach z is improbable. When $z_n$ and z are both $\mathbb R^d$-valued, almost sure convergence is defined elementwise. That is, $z_n \to z$ a.s. if every element of $z_n$ converges almost surely to the corresponding element of z.
The following result shows that continuous transformations preserve almost sure convergence.

Lemma 5.13 Let $g\colon \mathbb R \to \mathbb R$ be a function continuous on $S_g \subseteq \mathbb R$.
[a] If $z_n \xrightarrow{a.s.} z$, where z is a random variable such that $\mathrm{IP}(z \in S_g) = 1$, then $g(z_n) \xrightarrow{a.s.} g(z)$.
[b] If $z_n \xrightarrow{a.s.} c$, where c is a real number at which g is continuous, then $g(z_n) \xrightarrow{a.s.} g(c)$.

Proof: Let $\Omega_0 = \{\omega : z_n(\omega) \to z(\omega)\}$ and $\Omega_1 = \{\omega : z(\omega) \in S_g\}$. Thus, for $\omega \in (\Omega_0 \cap \Omega_1)$, continuity of g ensures that $g(z_n(\omega)) \to g(z(\omega))$. Note that
\[ (\Omega_0 \cap \Omega_1)^c = \Omega_0^c \cup \Omega_1^c, \]
which has probability zero because $\mathrm{IP}(\Omega_0^c) = \mathrm{IP}(\Omega_1^c) = 0$. It follows that $\Omega_0 \cap \Omega_1$ has probability one. This proves that $g(z_n) \to g(z)$ with probability one. The second assertion is just a special case of the first result.
Lemma 5.13 is easily generalized to $\mathbb R^d$-valued random variables. For example, $z_n \xrightarrow{a.s.} z$ implies
\[ z_{1,n} + z_{2,n} \xrightarrow{a.s.} z_1 + z_2, \qquad z_{1,n} z_{2,n} \xrightarrow{a.s.} z_1 z_2, \qquad z_{1,n}^2 + z_{2,n}^2 \xrightarrow{a.s.} z_1^2 + z_2^2, \]
where $z_{1,n}$, $z_{2,n}$ are two elements of $z_n$ and $z_1$, $z_2$ are the corresponding elements of z. Also, provided that $z_2 \neq 0$ with probability one, $z_{1,n}/z_{2,n} \to z_1/z_2$ a.s.
5.3.2 Convergence in Probability
A weaker convergence concept is convergence in probability. A sequence of random variables $\{z_n\}$ is said to converge to z in probability if for every $\epsilon > 0$,
\[ \lim_{n\to\infty} \mathrm{IP}(\omega : |z_n(\omega) - z(\omega)| > \epsilon) = 0, \]
or equivalently,
\[ \lim_{n\to\infty} \mathrm{IP}(\omega : |z_n(\omega) - z(\omega)| \le \epsilon) = 1, \]
denoted as $z_n \xrightarrow{\mathrm{IP}} z$. We also say that z is the probability limit of $z_n$, denoted as $\mathrm{plim}\, z_n = z$. In particular, if the probability limit of $z_n$ is a constant c, all the probability mass of $z_n$ will concentrate around c when n becomes large. For $\mathbb R^d$-valued random variables $z_n$ and z, convergence in probability is also defined elementwise.
In the definition of convergence in probability, the events $\Omega_n(\epsilon) = \{\omega : |z_n(\omega) - z(\omega)| \le \epsilon\}$ vary with n, and convergence refers to the probabilities of such events, $p_n = \mathrm{IP}(\Omega_n(\epsilon))$, rather than to the random variables $z_n$ themselves. By contrast, almost sure convergence is related directly to the behavior of the random variables. For convergence in probability, the event $\Omega_n$ that $z_n$ will be close to z becomes highly likely when n tends to infinity, or its complement (that $z_n$ will deviate from z by a certain distance) becomes highly unlikely when n tends to infinity. Whether $z_n(\omega)$ will converge to $z(\omega)$ is not of any concern in convergence in probability.
More specifically, let $\Omega_0$ denote the set of $\omega$ such that $z_n(\omega)$ converges to $z(\omega)$. For $\omega \in \Omega_0$, there is some m such that $\omega$ is in $\Omega_n(\epsilon)$ for all $n > m$. That is,
\[ \Omega_0 \subseteq \bigcup_{m=1}^\infty \bigcap_{n=m}^\infty \Omega_n(\epsilon) \in \mathcal F. \]
As $\bigcap_{n=m}^\infty \Omega_n(\epsilon)$ is also in $\mathcal F$ and non-decreasing in m, it follows that
\[ \mathrm{IP}(\Omega_0) \le \mathrm{IP}\Big(\bigcup_{m=1}^\infty \bigcap_{n=m}^\infty \Omega_n(\epsilon)\Big) = \lim_{m\to\infty} \mathrm{IP}\Big(\bigcap_{n=m}^\infty \Omega_n(\epsilon)\Big) \le \lim_{m\to\infty} \mathrm{IP}\big(\Omega_m(\epsilon)\big). \]
This inequality proves that almost sure convergence implies convergence in probability, but the converse is not true in general. We state this result below.

Lemma 5.14 If $z_n \xrightarrow{a.s.} z$, then $z_n \xrightarrow{\mathrm{IP}} z$.
The following well-known example shows that when there is convergence in probability, the random variables themselves may not even converge for any $\omega$.

Example 5.15 Let $\Omega = [0, 1]$ and IP be the Lebesgue measure (i.e., $\mathrm{IP}(a, b] = b - a$ for $(a, b] \subseteq [0, 1]$). Consider the sequence $\{I_n\}$ of intervals [0, 1], [0, 1/2), [1/2, 1], [0, 1/3), [1/3, 2/3), [2/3, 1], \ldots, and let $z_n = \mathbf 1_{I_n}$ be the indicator function of $I_n$: $z_n(\omega) = 1$ if $\omega \in I_n$ and $z_n(\omega) = 0$ otherwise. When n tends to infinity, $I_n$ shrinks toward a singleton, which has Lebesgue measure zero. For $0 < \epsilon < 1$, we then have
\[ \mathrm{IP}(|z_n| > \epsilon) = \mathrm{IP}(I_n) \to 0, \]
which shows $z_n \xrightarrow{\mathrm{IP}} 0$. On the other hand, it is easy to see that each $\omega \in [0, 1]$ must be covered by infinitely many intervals. Thus, given any $\omega \in [0, 1]$, $z_n(\omega) = 1$ for infinitely many n, and hence $z_n(\omega)$ does not converge to zero. Note that convergence in probability permits $z_n$ to deviate from the probability limit infinitely often, but almost sure convergence does not, except for $\omega$ in a set of probability zero.
Intuitively, if $\mathrm{var}(z_n)$ vanishes asymptotically, the distribution of $z_n$ would shrink toward its mean $\mathrm{IE}(z_n)$. If, in addition, $\mathrm{IE}(z_n)$ tends to a constant c (or $\mathrm{IE}(z_n) = c$), then $z_n$ ought to be degenerate at c in the limit. These observations suggest the following sufficient conditions for convergence in probability; see Exercises 5.8 and 5.9. In many cases, it is easier to establish convergence in probability by verifying these conditions.

Lemma 5.16 Let $\{z_n\}$ be a sequence of square integrable random variables. If $\mathrm{IE}(z_n) \to c$ and $\mathrm{var}(z_n) \to 0$, then $z_n \xrightarrow{\mathrm{IP}} c$.
Analogous to Lemma 5.13, continuous functions also preserve convergence in probability.

Lemma 5.17 Let $g\colon \mathbb R \to \mathbb R$ be a function continuous on $S_g \subseteq \mathbb R$.
[a] If $z_n \xrightarrow{\mathrm{IP}} z$, where z is a random variable such that $\mathrm{IP}(z \in S_g) = 1$, then $g(z_n) \xrightarrow{\mathrm{IP}} g(z)$.
[b] (Slutsky) If $z_n \xrightarrow{\mathrm{IP}} c$, where c is a real number at which g is continuous, then $g(z_n) \xrightarrow{\mathrm{IP}} g(c)$.

Proof: By the continuity of g, for each $\epsilon > 0$ we can find a $\delta > 0$ such that
\[ \{\omega : |z_n(\omega) - z(\omega)| \le \delta\} \cap \{\omega : z(\omega) \in S_g\} \subseteq \{\omega : |g(z_n(\omega)) - g(z(\omega))| \le \epsilon\}. \]
Taking complements of both sides and noting that the complement of $\{\omega : z(\omega) \in S_g\}$ has probability zero, we have
\[ \mathrm{IP}(|g(z_n) - g(z)| > \epsilon) \le \mathrm{IP}(|z_n - z| > \delta). \]
As $z_n$ converges to z in probability, the right-hand side converges to zero and so does the left-hand side.
Lemma 5.17 is readily generalized to $\mathbb R^d$-valued random variables. For instance, $z_n \xrightarrow{\mathrm{IP}} z$ implies
\[ z_{1,n} + z_{2,n} \xrightarrow{\mathrm{IP}} z_1 + z_2, \qquad z_{1,n} z_{2,n} \xrightarrow{\mathrm{IP}} z_1 z_2, \qquad z_{1,n}^2 + z_{2,n}^2 \xrightarrow{\mathrm{IP}} z_1^2 + z_2^2, \]
where $z_{1,n}$, $z_{2,n}$ are two elements of $z_n$ and $z_1$, $z_2$ are the corresponding elements of z. Also, provided that $z_2 \neq 0$ with probability one, $z_{1,n}/z_{2,n} \xrightarrow{\mathrm{IP}} z_1/z_2$.
5.3.3 Convergence in Distribution
Another convergence mode, known as convergence in distribution or convergence in law, concerns the behavior of the distribution functions of random variables. Let $F_{z_n}$ and $F_z$ be the distribution functions of $z_n$ and z, respectively. A sequence of random variables $\{z_n\}$ is said to converge to z in distribution, denoted as $z_n \xrightarrow{D} z$, if
\[ \lim_{n\to\infty} F_{z_n}(\zeta) = F_z(\zeta), \]
for every continuity point $\zeta$ of $F_z$. That is, regardless of the distributions of $z_n$, convergence in distribution ensures that $F_{z_n}$ will be arbitrarily close to $F_z$ for all n sufficiently large. The distribution $F_z$ is thus known as the limiting distribution of $z_n$. We also say that $z_n$ is asymptotically distributed as $F_z$, denoted as $z_n \stackrel{A}{\sim} F_z$.
For random vectors $z_n$ and z, $z_n \xrightarrow{D} z$ if the joint distributions $F_{z_n}$ converge to $F_z$ at every continuity point of $F_z$. It is, however, more cumbersome to show convergence in distribution for a sequence of random vectors. The so-called Cramér-Wold device allows us to transform this multivariate convergence problem into a univariate one. This result is stated below without proof.

Lemma 5.18 (Cramér-Wold Device) Let $\{z_n\}$ be a sequence of random vectors in $\mathbb R^d$. Then $z_n \xrightarrow{D} z$ if and only if $\alpha' z_n \xrightarrow{D} \alpha' z$ for every $\alpha \in \mathbb R^d$ such that $\alpha'\alpha = 1$.
There is also a uni-directional relationship between convergence in probability and convergence in distribution. To see this, note that for some arbitrary $\epsilon > 0$ and a continuity point $\zeta$ of $F_z$, we have
\[ \mathrm{IP}(z_n \le \zeta) = \mathrm{IP}(z_n \le \zeta,\, |z_n - z| \le \epsilon) + \mathrm{IP}(z_n \le \zeta,\, |z_n - z| > \epsilon) \le \mathrm{IP}(z \le \zeta + \epsilon) + \mathrm{IP}(|z_n - z| > \epsilon). \]
Similarly,
\[ \mathrm{IP}(z \le \zeta - \epsilon) \le \mathrm{IP}(z_n \le \zeta) + \mathrm{IP}(|z_n - z| > \epsilon). \]
If $z_n \xrightarrow{\mathrm{IP}} z$, then by passing to the limit and noting that $\epsilon$ is arbitrary, the inequalities above imply
\[ \lim_{n\to\infty} \mathrm{IP}(z_n \le \zeta) = \mathrm{IP}(z \le \zeta). \]
That is, $F_{z_n}(\zeta) \to F_z(\zeta)$. The converse is not true in general, however.
also converges to c in probability. In this case, these two convergence modes
are equivalent. To be sure, note that a real number c can be viewed as a degenerate
random variable with the distribution function:
F() =
_
0, < c,
1, c,
which is a step function with a jump point at c. When z
n
D
c, all the probability mass
of z
n
will concentrate at c as n becomes large; this is precisely what z
n
IP
c means.
More formally, for any > 0,
IP([z
n
c[ > ) = 1 [F
zn
(c +) F
zn
((c )

)],
where (c )

denotes the point adjacent to and less than c . Now, z


n
D
c implies
that F
zn
(c + ) F
zn
((c )

) converges to one, so that IP([z


n
c[ > ) converges to
zero. We summarizes these results below.
Lemma 5.19 If $z_n \xrightarrow{\mathrm{IP}} z$, then $z_n \xrightarrow{D} z$. For a constant c, $z_n \xrightarrow{\mathrm{IP}} c$ is equivalent to $z_n \xrightarrow{D} c$.
The continuous mapping theorem below asserts that continuous functions preserve convergence in distribution; cf. Lemmas 5.13 and 5.17.

Lemma 5.20 (Continuous Mapping Theorem) Let $g\colon \mathbb R \to \mathbb R$ be a function continuous almost everywhere on $\mathbb R$, except for at most countably many points. If $z_n \xrightarrow{D} z$, then $g(z_n) \xrightarrow{D} g(z)$.

For example, if $z_n$ converges in distribution to the standard normal random variable, the limiting distribution of $z_n^2$ is $\chi^2(1)$. Generalizing this result to $\mathbb R^d$-valued random variables, we can see that when $z_n$ converges in distribution to the d-dimensional standard normal random variable, the limiting distribution of $z_n' z_n$ is $\chi^2(d)$.
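As an illustration (a sketch with arbitrary simulation settings, not from the text), take $z_n = \sqrt{n}(\bar x_n - \mu)/\sigma$ for i.i.d. uniform draws; the central limit theorem (introduced later in this chapter) gives convergence in distribution to the standard normal, and the continuous mapping theorem then implies $z_n^2 \xrightarrow{D} \chi^2(1)$, so the simulated mean of $z_n^2$ should be close to 1 and its 95th percentile close to 3.84.

import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 20_000
x = rng.uniform(0.0, 1.0, size=(reps, n))
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)             # mean and std of U(0,1)
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma   # approximately N(0,1)
z2 = z**2                                        # approximately chi-square(1)
print(np.mean(z2), np.quantile(z2, 0.95))        # close to 1 and about 3.84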
Two sequences of random variables $\{y_n\}$ and $\{z_n\}$ are said to be asymptotically equivalent if their differences $y_n - z_n$ converge to zero in probability. Intuitively, the limiting distributions of two asymptotically equivalent sequences, if they exist, ought to be the same. This is stated in the next result without proof.

Lemma 5.21 Let $\{y_n\}$ and $\{z_n\}$ be two sequences of random vectors such that $y_n - z_n \xrightarrow{\mathrm{IP}} 0$. If $z_n \xrightarrow{D} z$, then $y_n \xrightarrow{D} z$.
The next result is concerned with two sequences of random variables such that one converges in distribution and the other converges in probability.

Lemma 5.22 If $y_n$ converges in probability to a constant c and $z_n$ converges in distribution to z, then $y_n + z_n \xrightarrow{D} c + z$, $y_n z_n \xrightarrow{D} cz$, and $z_n/y_n \xrightarrow{D} z/c$ if $c \neq 0$.
5.4 Order Notations
It is typical to use order notations to describe the behavior of a sequence of numbers, whether it converges or not. Let $\{c_n\}$ denote a sequence of positive real numbers.
1. Given a sequence $\{b_n\}$, we say that $b_n$ is (at most) of order $c_n$, denoted as $b_n = O(c_n)$, if there exists a $\Delta < \infty$ such that $|b_n|/c_n \le \Delta$ for all sufficiently large n. When $c_n$ diverges, $b_n$ cannot diverge faster than $c_n$; when $c_n$ converges to zero, the rate of convergence of $b_n$ is no slower than that of $c_n$. For example, the polynomial $a + bn$ is O(n), and the partial sum of a bounded sequence, $\sum_{i=1}^n b_i$, is O(n). Note that an O(1) sequence is a bounded sequence.
2. Given a sequence $\{b_n\}$, we say that $b_n$ is of smaller order than $c_n$, denoted as $b_n = o(c_n)$, if $b_n/c_n \to 0$. When $c_n$ diverges, $b_n$ must diverge slower than $c_n$; when $c_n$ converges to zero, the rate of convergence of $b_n$ should be faster than that of $c_n$. For example, the polynomial $a + bn$ is $o(n^{1+\delta})$ for any $\delta > 0$, and the partial sum $\sum_{i=1}^n \pi^i$, $|\pi| < 1$, is o(n). Note that an o(1) sequence is a sequence that converges to zero.
If $b_n$ is a vector (matrix), $b_n$ is said to be $O(c_n)$ ($o(c_n)$) if every element of $b_n$ is $O(c_n)$ ($o(c_n)$). It is also easy to verify the following results; see Exercise 5.10.
Lemma 5.23 Let $\{a_n\}$ and $\{b_n\}$ be two non-stochastic sequences.
(a) If $a_n = O(n^r)$ and $b_n = O(n^s)$, then $a_n b_n = O(n^{r+s})$ and $a_n + b_n = O(n^{\max(r,s)})$.
(b) If $a_n = o(n^r)$ and $b_n = o(n^s)$, then $a_n b_n = o(n^{r+s})$ and $a_n + b_n = o(n^{\max(r,s)})$.
(c) If $a_n = O(n^r)$ and $b_n = o(n^s)$, then $a_n b_n = o(n^{r+s})$ and $a_n + b_n = O(n^{\max(r,s)})$.
The order notations can be easily extended to describe the behavior of sequences of random variables. A sequence of random variables $\{z_n\}$ is said to be $O_{a.s.}(c_n)$ (or $O(c_n)$ almost surely) if $z_n/c_n$ is O(1) a.s., and it is said to be $O_{\mathrm{IP}}(c_n)$ (or $O(c_n)$ in probability) if for every $\epsilon > 0$, there is some $\Delta$ such that
\[ \mathrm{IP}(|z_n|/c_n \ge \Delta) \le \epsilon, \]
for all n sufficiently large. Similarly, $z_n$ is $o_{a.s.}(c_n)$ (or $o(c_n)$ almost surely) if $z_n/c_n \xrightarrow{a.s.} 0$, and it is $o_{\mathrm{IP}}(c_n)$ (or $o(c_n)$ in probability) if $z_n/c_n \xrightarrow{\mathrm{IP}} 0$.

If $z_n$ is $O_{a.s.}(1)$ ($o_{a.s.}(1)$), we say that $z_n$ is bounded (vanishing) almost surely; if $z_n$ is $O_{\mathrm{IP}}(1)$ ($o_{\mathrm{IP}}(1)$), $z_n$ is bounded (vanishing) in probability. Note that Lemma 5.23 also holds for stochastic order notations. In particular, if a sequence of random variables is bounded almost surely (in probability) and another sequence of random variables is vanishing almost surely (in probability), the products of their corresponding elements are vanishing almost surely (in probability). That is, if $y_n = O_{a.s.}(1)$ and $z_n = o_{a.s.}(1)$, then $y_n z_n$ is $o_{a.s.}(1)$.
When $z_n \xrightarrow{D} z$, we have known that $z_n$ does not converge in probability to z in general, but more can be said about the behavior of $z_n$. Let $\Delta$ be a continuity point of $F_z$. Then for any $\epsilon > 0$, we can choose a sufficiently large $\Delta$ such that $\mathrm{IP}(|z| > \Delta) < \epsilon/2$. As $z_n \xrightarrow{D} z$, we can also choose n large enough such that
\[ \mathrm{IP}(|z_n| > \Delta) - \mathrm{IP}(|z| > \Delta) < \epsilon/2, \]
which implies $\mathrm{IP}(|z_n| > \Delta) < \epsilon$. This shows that:

Lemma 5.24 Let $\{z_n\}$ be a sequence of random vectors such that $z_n \xrightarrow{D} z$. Then $z_n = O_{\mathrm{IP}}(1)$.
5.5 Law of Large Numbers
The law of large numbers is concerned with the averaging behavior of random variables and is one of the most important results in probability theory. A sequence of integrable random variables $\{z_t\}$ is said to obey a strong law of large numbers (SLLN) if
\[ \frac{1}{T} \sum_{t=1}^T [z_t - \mathrm{IE}(z_t)] \xrightarrow{a.s.} 0, \tag{5.1} \]
and it is said to obey a weak law of large numbers (WLLN) if
\[ \frac{1}{T} \sum_{t=1}^T [z_t - \mathrm{IE}(z_t)] \xrightarrow{\mathrm{IP}} 0. \tag{5.2} \]
Thus, a law of large numbers ensures that the sample average essentially follows its mean behavior; random irregularities (deviations from the mean) are eventually wiped out by averaging. For a sequence of random vectors (matrices), a SLLN (WLLN) is defined elementwise.
According to these definitions, the $z_t$ may or may not be independent random variables, and they are not required to have a constant mean and hence may have non-identical distributions. When $\mathrm{IE}(z_t)$ is a constant $\mu_o$, the SLLN (5.1) and WLLN (5.2) read, respectively,
\[ \frac{1}{T} \sum_{t=1}^T z_t \xrightarrow{a.s.} \mu_o, \qquad \frac{1}{T} \sum_{t=1}^T z_t \xrightarrow{\mathrm{IP}} \mu_o. \]
Below are two well known strong laws.

Lemma 5.25 (Kolmogorov) Let $\{z_t\}$ be a sequence of i.i.d. random variables with mean $\mu_o$. Then
\[ \frac{1}{T} \sum_{t=1}^T z_t \xrightarrow{a.s.} \mu_o. \]

Note that i.i.d. random variables need not obey Kolmogorov's SLLN if they do not have a finite mean, e.g., variables with the Cauchy distribution. Given that almost sure convergence implies convergence in probability, the same condition in Lemma 5.25 ensures that $\{z_t\}$ obeys a WLLN.
Lemma 5.26 (Markov) Let $\{z_t\}$ be a sequence of independent random variables. If for some $\delta > 0$, $\mathrm{IE}|z_t|^{1+\delta}$ are bounded for all t, then
\[ \frac{1}{T} \sum_{t=1}^T [z_t - \mathrm{IE}(z_t)] \xrightarrow{a.s.} 0. \]

From this result we can see that independent random variables may still obey a SLLN even when they do not have a common distribution. Compared to Kolmogorov's SLLN, Lemma 5.26 requires random variables to satisfy a stronger moment condition (their $(1+\delta)$th moment must be bounded). A non-stochastic sequence, which can be viewed as a sequence of independent random variables, obeys a SLLN if it is O(1).
The results above show that a SLLN (WLLN) holds provided that the random variables satisfy certain conditions. The sufficient conditions ensuring a SLLN (WLLN) are usually imposed on the moments and dependence structure of the random variables. Specifically, $\{z_t\}$ would obey a SLLN (WLLN) if the $z_t$ have bounded moments up to some order and are asymptotically independent in a proper sense. In some cases, it suffices to require $\mathrm{corr}(z_t, z_{t-j})$ to converge to zero sufficiently fast as $j \to \infty$, as shown in the example below. Intuitively, random variables without some bounded moment may behave wildly such that their random irregularities cannot be completely averaged out. For random variables with strong correlations, the variation of their partial sums may grow too rapidly and cannot be eliminated by simple averaging. Thus, a sequence of random variables must be well behaved to ensure a SLLN (WLLN).
Example 5.27 Suppose that $\{y_t\}$ is generated as a weakly stationary AR(1) process:
\[ y_t = \psi_o y_{t-1} + \epsilon_t, \qquad t = 1, 2, \ldots, \]
with $y_0 = 0$ and $|\psi_o| < 1$, where $\{\epsilon_t\}$ are i.i.d. random variables with mean zero and variance $\sigma^2$. In view of Section 4.4, we have $\mathrm{IE}(y_t) = 0$, $\mathrm{var}(y_t) = \sigma^2/(1 - \psi_o^2)$, and
\[ \mathrm{cov}(y_t, y_{t-j}) = \psi_o^j\, \frac{\sigma^2}{1 - \psi_o^2}. \]
These results ensure that $\mathrm{IE}(T^{-1}\sum_{t=1}^T y_t) = 0$ and
\[ \mathrm{var}\Big(\sum_{t=1}^T y_t\Big) = \sum_{t=1}^T \mathrm{var}(y_t) + 2\sum_{\tau=1}^{T-1} (T - \tau)\, \mathrm{cov}(y_t, y_{t-\tau}) \le \sum_{t=1}^T \mathrm{var}(y_t) + 2T \sum_{\tau=1}^{T-1} |\mathrm{cov}(y_t, y_{t-\tau})| = O(T). \]
The latter result shows that $\mathrm{var}\big(T^{-1}\sum_{t=1}^T y_t\big) = O(T^{-1})$, which converges to zero as T approaches infinity. It follows from Lemma 5.16 that
\[ \frac{1}{T} \sum_{t=1}^T y_t \xrightarrow{\mathrm{IP}} 0. \]
This shows that $\{y_t\}$ obeys a WLLN. Note that in this case, the $y_t$ have a constant variance and $\mathrm{cov}(y_t, y_{t-j})$ goes to zero exponentially fast as j tends to infinity. These two properties in effect ensure a WLLN. Similarly, it can be shown that
\[ \frac{1}{T} \sum_{t=1}^T y_t^2 \xrightarrow{\mathrm{IP}} \mathrm{IE}(y_t^2) = \mathrm{var}(y_t). \]
That is, $\{y_t^2\}$ also obeys a WLLN. These results are readily generalized to weakly stationary AR(p) processes.
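A small simulation (a sketch, not part of the text; the AR(1) coefficient 0.8 and the sample sizes are arbitrary choices) illustrates the WLLN in Example 5.27: the sample average of a stationary AR(1) series concentrates around its mean zero as T grows.

import numpy as np

rng = np.random.default_rng(42)
psi, sigma = 0.8, 1.0
for T in (100, 10_000, 100_000):
    eps = rng.normal(0.0, sigma, T)
    y = np.empty(T)
    y[0] = eps[0]                        # y_0 = 0, so y_1 = eps_1
    for t in range(1, T):
        y[t] = psi * y[t-1] + eps[t]
    print(f"T={T}: sample average of y_t = {y.mean():+.4f}")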
It is more cumbersome to establish a strong law for weakly stationary processes. The lemma below is convenient in practice; see Davidson (1994, p. 326) for a proof.

Lemma 5.28 Let $y_t = \sum_{j=-\infty}^{\infty} \pi_j u_{t-j}$, where $\{u_t\}$ are i.i.d. random variables with mean zero and variance $\sigma_u^2$. If the $\pi_j$ are absolutely summable, i.e., $\sum_{j=-\infty}^{\infty} |\pi_j| < \infty$, then $\sum_{t=1}^T y_t/T \xrightarrow{a.s.} 0$.
In Example 5.27, $y_t = \sum_{j=0}^{\infty} \psi_o^j \epsilon_{t-j}$ with $|\psi_o| < 1$. It is clear that $\sum_{j=0}^{\infty} |\psi_o^j| < \infty$. Hence, Lemma 5.28 ensures that $\{y_t\}$ obeys a SLLN and the average of $y_t$ converges to its mean (zero) almost surely. If $y_t = z_t - \mu$, then the average of $z_t$ converges to $\mathrm{IE}(z_t) = \mu$ almost surely.

More generally, it is also possible that a sequence of weakly dependent and heterogeneously distributed random variables obeys a SLLN (WLLN). This usually requires stronger conditions on their moments and dependence structure.\footnote{The notions of mixing sequence and mixingale allow the random variables to be dependent and heterogeneously distributed. In their definitions, probabilistic structures are imposed to regulate the dependence among random variables. Such sequences of random variables may obey a SLLN (WLLN) if they are weakly dependent in the sense that the dependence of random variables $z_t$ on their distant past $z_{t-j}$ eventually vanishes at a suitable rate as j tends to infinity.} To avoid technicality, we will not specify the regularity conditions that ensure a general SLLN (WLLN); see White (1984) and Davidson (1994) for such conditions and the resulting strong and weak laws. Instead, we use the following examples to illustrate why a WLLN and hence a SLLN may fail to hold.
Example 5.29 Consider the sequences $\{t\}$ and $\{t^2\}$, $t = 1, 2, \ldots$. It is well known that
\[ \sum_{t=1}^T t = T(T+1)/2, \qquad \sum_{t=1}^T t^2 = T(T+1)(2T+1)/6. \]
Hence, $\sum_{t=1}^T t/T$ and $\sum_{t=1}^T t^2/T$ both diverge. In this example, the elements of these two sequences diverge so that their partial sums grow too rapidly. Thus, these sequences do not obey a SLLN.
Example 5.30  Suppose that ε_t are i.i.d. random variables with mean zero and variance σ². Thus, T^{−1} Σ_{t=1}^T ε_t →^{a.s.} 0 by Kolmogorov's SLLN (Lemma 5.25). As IE|t ε_t|^{1+δ} = O(t^{1+δ}), which grows with t, {t ε_t} does not have bounded (1+δ)th moment and therefore does not obey Markov's SLLN (Lemma 5.26). Moreover, note that

    var(Σ_{t=1}^T t ε_t) = Σ_{t=1}^T t² var(ε_t) = σ² T(T + 1)(2T + 1)/6.

By Exercise 5.11, Σ_{t=1}^T t ε_t = O_IP(T^{3/2}). It follows that T^{−1} Σ_{t=1}^T t ε_t = O_IP(T^{1/2}), which diverges in probability. Thus, {t ε_t} does not obey a WLLN either.
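A quick Monte Carlo check (our own sketch, with standard normal ε_t) makes the divergence rate visible: across replications the spread of T^{−1} Σ ε_t shrinks, while the spread of T^{−1} Σ t ε_t grows like T^{1/2}; dividing the latter by √T stabilizes it near σ/√3 ≈ 0.58, in line with the variance formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
R = 500                       # Monte Carlo replications
for T in (10**2, 10**3, 10**4):
    eps = rng.normal(size=(R, T))
    t = np.arange(1, T + 1)
    avg_eps = eps.mean(axis=1)          # obeys a WLLN: spread shrinks with T
    avg_teps = (t * eps).mean(axis=1)   # O_IP(T^{1/2}): spread grows with T
    print(T, avg_eps.std(), avg_teps.std(), avg_teps.std() / np.sqrt(T))
```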
Example 5.31  Suppose that {y_t} is generated as a random walk:

    y_t = y_{t−1} + ε_t,   t = 1, 2, . . . ,

with y_0 = 0, where ε_t are i.i.d. random variables with mean zero and variance σ². Clearly,

    y_t = Σ_{i=1}^t ε_i,

which has mean zero and unbounded variance tσ². For s < t, write

    y_t = y_s + Σ_{i=s+1}^t ε_i = y_s + v_{t−s},

where v_{t−s} = Σ_{i=s+1}^t ε_i is independent of y_s. We then have

    cov(y_t, y_s) = IE(y_s²) = sσ²,

for t > s. Consequently,

    var(Σ_{t=1}^T y_t) = Σ_{t=1}^T var(y_t) + 2 Σ_{τ=1}^{T−1} Σ_{t=τ+1}^T cov(y_t, y_{t−τ}).

It can be verified that the first term on the right-hand side is

    Σ_{t=1}^T var(y_t) = Σ_{t=1}^T tσ² = O(T²),

and that the second term is

    2 Σ_{τ=1}^{T−1} Σ_{t=τ+1}^T cov(y_t, y_{t−τ}) = 2 Σ_{τ=1}^{T−1} Σ_{t=τ+1}^T (t − τ)σ² = O(T³).

Thus, var(Σ_{t=1}^T y_t) = O(T³), so that Σ_{t=1}^T y_t = O_IP(T^{3/2}) by Exercise 5.11. This shows that

    (1/T) Σ_{t=1}^T y_t = O_IP(T^{1/2}),

which diverges in probability. Note that in this case, y_t have unbounded variances and strong correlations over time. Due to these correlations, the variation of the partial sum of y_t grows much too fast. (Recall that the variance of Σ_{t=1}^T y_t is only O(T) in Example 5.27.) Similarly, we can show that Σ_{t=1}^T y_t² = O_IP(T²); see Exercise 5.12 for a special case. Thus, {y_t²} does not obey a WLLN when y_t follows a random walk. As {y_t} and {y_t²} do not obey a WLLN, they cannot obey a SLLN. The conclusions above will not be altered when ε_t is a white noise or a weakly stationary process.
The example below shows that a sequence of random variables need not obey a WLLN even if its partial sums are O_IP(T).

Example 5.32  Suppose that {y_t} is generated as a random walk:

    y_t = y_{t−1} + ε_t,   t = 1, 2, . . . ,

with y_0 = 0, as in Example 5.31. Then, the sequence {y_{t−1} ε_t} has mean zero and

    var(y_{t−1} ε_t) = IE(y_{t−1}²) IE(ε_t²) = (t − 1)σ⁴.

More interestingly, it can be seen that for s < t,

    cov(y_{t−1} ε_t, y_{s−1} ε_s) = IE(y_{t−1} y_{s−1} ε_s) IE(ε_t) = 0.

We then have

    var(Σ_{t=1}^T y_{t−1} ε_t) = Σ_{t=1}^T var(y_{t−1} ε_t) = Σ_{t=1}^T (t − 1)σ⁴ = O(T²),

and Σ_{t=1}^T y_{t−1} ε_t = O_IP(T). Note, however, that var(T^{−1} Σ_{t=1}^T y_{t−1} ε_t) converges to σ⁴/2, rather than 0. Thus, T^{−1} Σ_{t=1}^T y_{t−1} ε_t cannot behave like a non-stochastic number in the limit. This shows that {y_{t−1} ε_t} does not obey a WLLN, and hence also does not obey a SLLN.
5.6 Uniform Law of Large Numbers

In econometric analysis, it is also common to deal with functions of random variables and model parameters. For example, q(z_t(ω); θ) is a random variable for a given parameter θ, and it is a function of θ for a given ω. When θ is fixed, it is not difficult to impose suitable conditions on q and z_t such that {q(z_t(ω); θ)} obeys a SLLN (WLLN), as discussed in Section 5.5. When θ assumes values in the parameter space Θ, a SLLN (WLLN) that does not depend on θ is then needed.

More specifically, suppose that q(z_t; θ) obeys a SLLN for each θ ∈ Θ:

    Q_T(ω; θ) = (1/T) Σ_{t=1}^T q(z_t(ω); θ) →^{a.s.} Q(θ),

where Q(θ) is a non-stochastic function of θ. As this convergent behavior may depend on θ, the non-convergence set Ω_0^c(θ) = {ω : Q_T(ω; θ) ↛ Q(θ)} varies with θ. When Θ is an interval of ℝ, ∪_{θ∈Θ} Ω_0^c(θ) is an uncountable union of non-convergence sets and hence may not have probability zero, even though each Ω_0^c(θ) does. Thus, the event that Q_T(ω; θ) → Q(θ) for all θ, i.e., ∩_{θ∈Θ} Ω_0(θ), may occur with probability less than one. In fact, the union of all Ω_0^c(θ) may not even be in F (only countable unions of the elements in F are guaranteed to be in F). If so, we cannot conclude anything about stochastic convergence. Worse still is when θ also depends on T, as in the case where θ is replaced by the estimator θ̂_T. There may not exist a finite T* such that Q_T(ω; θ̂_T) are arbitrarily close to Q(θ̂_T) for all T > T*.

These observations suggest that we should study convergence that is uniform on the parameter space Θ. In particular, Q_T(ω; θ) converges to Q(θ) uniformly in θ almost surely (in probability) if the largest possible difference

    sup_{θ∈Θ} |Q_T(ω; θ) − Q(θ)| → 0   a.s. (in probability).

In what follows we always assume that this supremum is a random variable for all T. The example below, similar to Example 2.14 of Davidson (1994), illustrates the difference between uniform and pointwise convergence.

Example 5.33  Let z_t be i.i.d. random variables with zero mean and

    q_T(z_t(ω); θ) = z_t(ω) + { θT,        0 ≤ θ ≤ 1/(2T),
                                1 − θT,    1/(2T) < θ ≤ 1/T,
                                0,         1/T < θ < ∞.

Observe that for θ ≥ 1/T and for θ = 0,

    Q_T(ω; θ) = (1/T) Σ_{t=1}^T q_T(z_t; θ) = (1/T) Σ_{t=1}^T z_t,

which converges to zero almost surely by Kolmogorov's SLLN. Thus, for a given θ, we can always choose T large enough such that Q_T(ω; θ) →^{a.s.} 0, where 0 is the pointwise limit. On the other hand, it can be seen that for Θ = [0, ∞),

    sup_{θ∈Θ} |Q_T(ω; θ)| = |z̄_T + 1/2| →^{a.s.} 1/2,

so that the uniform limit is different from the pointwise limit.
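The gap between the two limits can be checked numerically. In the sketch below (ours; standard normal z_t, which satisfies the zero-mean requirement), Q_T at any fixed θ eventually equals the plain sample mean and drifts to zero, while the supremum over θ, attained at θ = 1/(2T) where the "tent" peaks at 1/2, stays near 1/2.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_fixed = 0.05
for T in (10**2, 10**3, 10**4, 10**5):
    z = rng.normal(size=T)
    zbar = z.mean()
    # tent(theta): theta*T on [0, 1/(2T)], 1 - theta*T on (1/(2T), 1/T], 0 beyond.
    if theta_fixed > 1.0 / T:
        tent = 0.0
    else:
        tent = min(theta_fixed * T, 1.0 - theta_fixed * T)
    Q_fixed = zbar + tent          # pointwise: equals zbar once T > 1/theta_fixed
    Q_sup = zbar + 0.5             # sup over theta, attained at theta = 1/(2T)
    print(T, Q_fixed, Q_sup)
```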
More generally, we consider a triangular array of functions q_{Tt}(z_t; θ), t = 1, 2, . . . , T, where z_t are integrable random vectors and θ is the parameter vector taking values in the parameter space Θ ⊆ ℝ^m. For notational simplicity, we will not explicitly write ω in the functions. We say that {q_{Tt}(z_t; θ)} obeys a strong uniform law of large numbers (SULLN) if

    sup_{θ∈Θ} | (1/T) Σ_{t=1}^T [q_{Tt}(z_t; θ) − IE(q_{Tt}(z_t; θ))] | →^{a.s.} 0,   (5.3)

cf. (5.1). Similarly, {q_{Tt}(z_t; θ)} is said to obey a weak uniform law of large numbers (WULLN) if the convergence condition above holds in probability. If q_{Tt} are ℝ^m-valued functions, the SULLN (WULLN) is defined elementwise.
We have seen that pointwise convergence does not imply uniform convergence. A natural question one would ask is: what additional conditions are needed to guarantee uniform convergence? Now let

    Q_T(θ) = (1/T) Σ_{t=1}^T [q_{Tt}(z_t; θ) − IE(q_{Tt}(z_t; θ))].

Suppose that Q_T satisfies the following Lipschitz-type continuity requirement: for θ, θ† in Θ,

    |Q_T(θ) − Q_T(θ†)| ≤ C_T ‖θ − θ†‖   a.s.,

where ‖·‖ denotes the Euclidean norm and C_T is a random variable bounded almost surely that does not depend on θ. Under this condition, Q_T(θ†) can be made arbitrarily close to Q_T(θ), provided that θ† is sufficiently close to θ. Using the triangle inequality and taking the supremum over θ we have

    sup_{θ∈Θ} |Q_T(θ)| ≤ sup_{θ∈Θ} |Q_T(θ) − Q_T(θ†)| + |Q_T(θ†)|.

Let Δ denote an almost sure bound of C_T. Then given any ε > 0, choosing θ† such that ‖θ − θ†‖ < ε/(2Δ) implies

    sup_{θ∈Θ} |Q_T(θ) − Q_T(θ†)| ≤ C_T ε/(2Δ) ≤ ε/2,

uniformly in T. Moreover, because Q_T(θ) converges to 0 almost surely for each θ in Θ, |Q_T(θ†)| is also less than ε/2 for sufficiently large T. Consequently,

    sup_{θ∈Θ} |Q_T(θ)| ≤ ε,

for all T sufficiently large. As these results hold almost surely, we have a SULLN for Q_T(θ); the conditions ensuring a WULLN are analogous.
Lemma 5.34  Suppose that for each θ ∈ Θ, {q_{Tt}(z_t; θ)} obeys a SLLN (WLLN) and that for θ, θ† ∈ Θ,

    |Q_T(θ) − Q_T(θ†)| ≤ C_T ‖θ − θ†‖   a.s.,

where C_T is a random variable bounded almost surely (in probability) that does not depend on θ. Then, {q_{Tt}(z_t; θ)} obeys a SULLN (WULLN).

Lemma 5.34 is quite convenient for establishing a SULLN (WULLN) because it requires only two conditions. First, the random functions must obey a standard SLLN (WLLN) for each θ in the parameter space. Second, the random functions must satisfy a Lipschitz-type continuity condition. In particular, the continuity condition can be verified by checking whether q_{Tt} is sufficiently smooth in its second argument. Note, however, that C_T being bounded almost surely may imply that the random variables in q_{Tt} are also bounded almost surely. This requirement is much too restrictive in applications. Hence, a SULLN may not be readily obtained from Lemma 5.34. On the other hand, a WULLN is practically more plausible because the requirement that C_T is O_IP(1) is much weaker. For example, the boundedness of IE|C_T| is sufficient for C_T being O_IP(1) by Markov's inequality. For more specific conditions ensuring these requirements we refer to Gallant and White (1988) and Bierens (1994).
5.7 Central Limit Theorem

The central limit theorem ensures that the distributions of suitably normalized averages will be essentially close to the standard normal distribution, regardless of the original distributions of the random variables. This result is very useful and convenient in applications because, as far as approximation is concerned, we only have to consider a single distribution for normalized sample averages.

Given a sequence of square integrable random variables {z_t}, let z̄_T = T^{−1} Σ_{t=1}^T z_t, μ̄_T = T^{−1} Σ_{t=1}^T IE(z_t), and

    σ̄²_T = var( T^{−1/2} Σ_{t=1}^T z_t ).

Then {z_t} is said to obey a central limit theorem (CLT) if σ̄²_T → σ²_o > 0 such that

    (1/(σ_o √T)) Σ_{t=1}^T [z_t − IE(z_t)] = √T (z̄_T − μ̄_T)/σ_o →^{D} N(0, 1).   (5.4)

Note that this definition requires neither IE(z_t) nor var(z_t) to be a constant; also, {z_t} may or may not be a sequence of independent variables. The following are two well-known CLTs.
Lemma 5.35 (Lindeberg-Lévy)  Let {z_t} be a sequence of i.i.d. random variables with mean μ_o and variance σ²_o > 0. Then,

    √T (z̄_T − μ_o)/σ_o →^{D} N(0, 1).

A sequence of i.i.d. random variables need not obey this CLT if they do not have a finite variance, e.g., random variables with a t(2) distribution. Comparing to Lemma 5.25, one can immediately see that the Lindeberg-Lévy CLT requires a stronger condition (i.e., finite variance) than does Kolmogorov's SLLN.
Remark: In this example, z̄_T converges to μ_o in probability, and its variance σ²_o/T vanishes when T tends to infinity. To prevent having a degenerate distribution in the limit, it is then natural to consider the normalized average T^{1/2}(z̄_T − μ_o), which has a constant variance σ²_o for all T. This explains why the normalizing factor T^{1/2} is needed. For a normalizing factor T^a with a < 1/2, the normalized average still converges to zero because its variance vanishes in the limit. For a normalizing factor T^a with a > 1/2, the normalized average diverges. In both cases, the resulting normalized averages cannot have a well-behaved, non-degenerate distribution in the limit. Thus, it is usually said that z̄_T converges to μ_o at the rate T^{−1/2}.
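A small simulation illustrates both points above (a sketch of ours; the exponential draws are chosen deliberately to emphasize that normality of z_t is not required). The standardized averages √T(z̄_T − μ_o)/σ_o have roughly zero mean and unit variance for every T, and their quantiles approach those of N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(3)
R = 20000
mu, sigma = 1.0, 1.0               # mean and std of Exponential(1) draws
for T in (5, 50, 500):
    z = rng.exponential(scale=1.0, size=(R, T))
    stat = np.sqrt(T) * (z.mean(axis=1) - mu) / sigma
    # Mean near 0, variance near 1, 95th percentile approaching 1.645.
    print(T, stat.mean(), stat.var(), np.quantile(stat, 0.95))
```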
Lemma 5.36  Let {z_{Tt}} be a triangular array of independent random variables with mean μ_{Tt} and variance σ²_{Tt} > 0 such that

    σ̄²_T = (1/T) Σ_{t=1}^T σ²_{Tt} → σ²_o > 0.

If for some δ > 0, IE|z_{Tt}|^{2+δ} are bounded for all t, then

    √T (z̄_T − μ̄_T)/σ_o →^{D} N(0, 1).

Lemma 5.36 is a version of Liapunov's CLT. Note that this result requires a stronger condition (the (2+δ)th moment must be bounded) than does Markov's SLLN (Lemma 5.26).

The sufficient conditions that ensure a CLT are similar to, but usually stronger than, those for a WLLN. That is, the sequence of random variables must have bounded moments up to some higher order, and the random variables must be asymptotically independent of those in the distant past (such dependence must vanish sufficiently fast). Moreover, it is also required that every random variable in the sequence is asymptotically negligible, in the sense that no single random variable is influential in affecting the partial sums. Although we will not specify these regularity conditions explicitly, we note that weakly stationary AR and MA processes obey a CLT in general. A sequence of weakly dependent and heterogeneously distributed random variables may also obey a CLT, depending on its moment and dependence structure. The following are examples in which a CLT does not hold.
Example 5.37  Suppose that {ε_t} is a sequence of independent random variables with mean zero, variance σ², and bounded (2+δ)th moment. From Example 5.30, we know that var(Σ_{t=1}^T t ε_t) is O(T³), which implies that T^{−1/2} Σ_{t=1}^T t ε_t still has a diverging variance (of order O(T²)). On the other hand, observe that

    var( T^{−1/2} Σ_{t=1}^T (t/T) ε_t ) = [T(T + 1)(2T + 1)/(6T³)] σ² → σ²/3.

It follows from Lemma 5.36 that

    (√3/(σ T^{1/2})) Σ_{t=1}^T (t/T) ε_t →^{D} N(0, 1).

These results show that {(t/T) ε_t} obeys a CLT, whereas {t ε_t} does not.
Example 5.38  Suppose that {y_t} is generated as a random walk:

    y_t = y_{t−1} + ε_t,   t = 1, 2, . . . ,

with y_0 = 0, where {ε_t} is a sequence of i.i.d. random variables with mean zero and variance σ². From Example 5.31 we have seen that y_t and y_t² have unbounded variances and strong correlations over time. Hence, they do not obey a CLT. Example 5.32 also suggests that {y_{t−1} ε_t} does not obey a CLT.
Given a sequence of square integrable random vectors {z_t} in ℝ^d, let

    z̄_T = (1/T) Σ_{t=1}^T z_t,    μ̄_T = T^{−1} Σ_{t=1}^T IE(z_t),    Σ_T = var( T^{−1/2} Σ_{t=1}^T z_t ).

Then, {z_t} is said to obey a CLT if Σ_T → Σ_o, a positive definite matrix, such that

    Σ_o^{−1/2} (1/√T) Σ_{t=1}^T [z_t − IE(z_t)] = Σ_o^{−1/2} √T (z̄_T − μ̄_T) →^{D} N(0, I_d).

By Lemma 5.18, this amounts to requiring that {λ′ z_t} obeys a CLT, for any λ ∈ ℝ^d such that λ′λ = 1.
Exercises

5.1  Let C be a collection of subsets of Ω. Show that the intersection of all the σ-algebras on Ω that contain C is the smallest σ-algebra containing C.

5.2  Show that any half lines (−∞, b] and [a, ∞) can be generated by open intervals in ℝ. Also show that any open interval (a, b) can be generated by closed intervals in ℝ.

5.3  Let y and z be two independent, integrable random variables. Show that IE(yz) = IE(y) IE(z).

5.4  Let x and y be two random variables with finite pth moment (p > 1). Prove the following triangle inequality:

    ‖x + y‖_p ≤ ‖x‖_p + ‖y‖_p.

Hint: Write IE|x + y|^p = IE(|x + y| |x + y|^{p−1}) and apply Hölder's inequality.

5.5  In the probability space (Ω, F, IP), suppose that we know the event B in F has occurred. Show that the conditional probability IP(·|B) satisfies the axioms for probability measures.

5.6  Prove the first assertion of Lemma 5.9.

5.7  Prove that for the square integrable random vectors z and y,

    var(z) = IE[var(z | y)] + var(IE(z | y)).

5.8  A sequence of square integrable random variables {z_n} is said to converge to a random variable z in L_2 (in quadratic mean) if IE(z_n − z)² → 0. Prove that L_2 convergence implies convergence in probability.
Hint: Apply Chebyshev's inequality.

5.9  Show that a sequence of square integrable random variables {z_n} converges to a constant c in L_2 if and only if IE(z_n) → c and var(z_n) → 0.

5.10  Prove Lemma 5.23.

5.11  Suppose that IE(z_n²) = O(c_n), where {c_n} is a sequence of positive real numbers. Show that z_n = O_IP(c_n^{1/2}).

5.12  Suppose that {y_t} is generated as a Gaussian random walk:

    y_t = y_{t−1} + ε_t,   t = 1, 2, . . . ,

with y_0 = 0, where {ε_t} is a sequence of i.i.d. normal random variables with mean zero and variance σ². Show that Σ_{t=1}^T y_t² is O_IP(T²).
References

Bierens, Herman J. (1994). Topics in Advanced Econometrics, New York, NY: Cambridge University Press.

Davidson, James (1994). Stochastic Limit Theory, New York, NY: Oxford University Press.

Gallant, A. Ronald (1997). An Introduction to Econometric Theory, Princeton, NJ: Princeton University Press.

Gallant, A. Ronald and Halbert White (1988). A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models, Oxford, UK: Basil Blackwell.

White, Halbert (1984). Asymptotic Theory for Econometricians, Orlando, FL: Academic Press.
Chapter 6

Asymptotic Least Squares Theory

We have shown that the OLS estimator and related tests have good finite-sample properties under the classical conditions. These conditions are, however, quite restrictive in practice, as discussed in Section 3.7. It is therefore natural to ask the following questions. First, to what extent may we relax the classical conditions so that the OLS method has broader applicability? Second, what are the properties of the OLS method under more general conditions? The purpose of this chapter is to provide some answers to these questions. In particular, the analysis in this chapter allows the observations of each explanatory variable to be random variables, possibly weakly dependent and heterogeneously distributed. This relaxation permits applications of the OLS method to various data and models, but it also renders the analysis of finite-sample properties difficult. Nonetheless, it is relatively easy to analyze the asymptotic performance of the OLS estimator and construct large-sample tests. As the asymptotic results are valid under more general conditions, the OLS method remains a useful tool in a wide variety of applications.

6.1 When Regressors are Stochastic

Given the linear specification y = Xβ + e, suppose now that X is stochastic. In this case, [A2](i) must also be modified because IE(y) cannot be a random vector Xβ_o. Even when a condition on IE(y) is available, we are still unable to evaluate

    IE(β̂_T) = IE[(X′X)^{−1} X′ y],

because β̂_T now is a complex function of the elements of y and X. Similarly, a condition on var(y) is of little use for calculating var(β̂_T).
To ensure unbiasedness, it is typical to impose the condition IE(y | X) = Xβ_o for some β_o, instead of [A2](i). Under this condition,

    IE(β̂_T) = IE[(X′X)^{−1} X′ IE(y | X)] = β_o,

by Lemma 5.9 (the law of iterated expectations). Note that the condition IE(y | X) = Xβ_o implies

    IE(y) = IE[IE(y | X)] = IE(X) β_o,

again by the law of iterated expectations. Hence, IE(y) can be obtained from IE(y | X) but not conversely. This shows that, when X is allowed to be stochastic, the unbiasedness property of β̂_T would hold under a stronger condition.

Unfortunately, the condition IE(y | X) = Xβ_o may not be realistic in some applications. To see this, let x_t denote the t th column of X′ and write the t th element of IE(y | X) = Xβ_o as

    IE(y_t | x_1, . . . , x_T) = x_t′ β_o,   t = 1, 2, . . . , T.

Consider time series data and the simple specification that x_t contains only one regressor, y_{t−1}:

    y_t = α y_{t−1} + e_t,   t = 1, 2, . . . , T.

In this case, the aforementioned condition reads:

    IE(y_t | y_1, . . . , y_{T−1}) = α_o y_{t−1},

for some α_o. Note that for t = 1, . . . , T−1, IE(y_t | y_1, . . . , y_{T−1}) = y_t by Lemma 5.10. The condition above then requires y_t = α_o y_{t−1} with probability one. If y_t is indeed an AR(1) process, y_t = α_o y_{t−1} + ε_t with ε_t having a continuous distribution, the event that y_t = α_o y_{t−1} (i.e., ε_t = 0) can occur only with probability zero, violating the imposed condition.
Suppose that IE(y | X) = Xβ_o and var(y | X) = σ²_o I_T. It is easy to see that

    var(β̂_T) = IE[(X′X)^{−1} X′ (y − Xβ_o)(y − Xβ_o)′ X (X′X)^{−1}]
              = IE[(X′X)^{−1} X′ var(y | X) X (X′X)^{−1}]
              = σ²_o IE(X′X)^{−1},

which is different from the variance-covariance matrix when X is non-stochastic; cf. Theorem 3.4(c). It is not always reasonable to impose such a condition on var(y | X) either. Consider the previous example in which x_t = y_{t−1}. As IE(y_t | y_1, . . . , y_{T−1}) = y_t, the conditional variance is

    var(y_t | y_1, . . . , y_{T−1}) = IE{ [y_t − IE(y_t | y_1, . . . , y_{T−1})]² | y_1, . . . , y_{T−1} } = 0,

rather than a positive constant σ²_o.

Without the conditions on IE(y | X) and var(y | X), the mean and variance of the OLS estimator remain unknown. Moreover, when X is stochastic, (X′X)^{−1} X′ y need not be normally distributed even when y is. Consequently, the results for hypothesis testing discussed in Section 3.4 are invalid.
6.2 Asymptotic Properties of the OLS Estimators

Suppose that we observe the data (y_t, w_t′)′, where y_t is the variable of interest (the dependent variable) and w_t is an m × 1 vector of exogenous variables. Let W^t denote the collection of random vectors w_1, . . . , w_t and Y^t the collection of y_1, . . . , y_t. The set of Y^{t−1} and W^t generates a σ-algebra which represents the information set up to time t. To account for the behavior of y_t, we choose the vector of explanatory variables x_t from the information set so that x_t includes k elements of Y^{t−1} and W^t. The linear specification y = Xβ + e can be expressed as

    y_t = x_t′ β + e_t,   t = 1, 2, . . . , T,   (6.1)

where x_t is the t th column of X′, i.e., the t th observation of all explanatory variables. Under the present framework, regressors may be lagged dependent variables (taken from Y^{t−1}) and lagged exogenous variables (taken from W^t). Including such variables in the specification is quite helpful in capturing the dynamic behavior of data.
6.2.1 Consistency

Given the specification (6.1), the OLS estimator can be written as

    β̂_T = (X′X)^{−1} X′ y = ( Σ_{t=1}^T x_t x_t′ )^{−1} ( Σ_{t=1}^T x_t y_t ).   (6.2)

The estimator β̂_T is said to be strongly consistent for the parameter vector β* if β̂_T →^{a.s.} β* as T tends to infinity; β̂_T is said to be weakly consistent for β* if β̂_T →^{IP} β*. Strong consistency asserts that β̂_T will eventually be close to β* when enough information (a sufficiently large sample) becomes available. Consistency is in sharp contrast with unbiasedness. While an unbiased estimator of β* is correct on average, there is no guarantee that its values will be close to β*, no matter how large the sample is.
To establish strong (weak) consistency, we impose the following conditions.

[B1]  {(y_t, w_t′)′} is a sequence of random vectors, and x_t is also a random vector containing some elements of Y^{t−1} and W^t.
    (i)  {x_t x_t′} obeys a SLLN (WLLN) such that lim_{T→∞} T^{−1} Σ_{t=1}^T IE(x_t x_t′) exists and is nonsingular.
    (ii) {x_t y_t} obeys a SLLN (WLLN).

[B2]  For some β_o, IE(y_t | Y^{t−1}, W^t) = x_t′ β_o for all t.

One approach in time-series analysis is to analyze the behavior of y_t based solely on its past behavior (lagged values). In this case, x_t contains only the elements of Y^{t−1}, and [B2] is modified as IE(y_t | Y^{t−1}) = x_t′ β_o for all t.
The condition [B1] explicitly allows the explanatory variables x_t to be a random vector which may contain one or more lagged dependent variables y_{t−j} and current and past exogenous variables w_t. [B1] also admits non-stochastic regressors, which can be viewed as independent, degenerate random vectors. Moreover, [B1](i) and (ii) regulate the behaviors of y_t and x_t such that {x_t x_t′} and {x_t y_t} must obey a SLLN (WLLN). On the other hand, the deterministic time trend t and random walks are excluded because they do not obey a SLLN (WLLN); see Examples 5.29 and 5.31.

Analogous to [A2](i), [B2] requires the linear function x_t′ β to be a correct specification of the conditional mean function, up to some unknown parameters. When x_t is non-stochastic, [B2] implies [A2](i) because, by the law of iterated expectations,

    IE(y_t) = IE[IE(y_t | Y^{t−1}, W^t)] = x_t′ β_o.

Recall from Section 5.2 that the conditional mean IE(y_t | Y^{t−1}, W^t) is the orthogonal projection of y_t onto the space of all functions of the elements of Y^{t−1} and W^t, where orthogonality is defined in terms of the cross moment, the inner product in L_2 space. Thus, the conditional mean function is the best approximation (in the mean squared error sense) of y_t based on the information generated by Y^{t−1} and W^t.

As x_t′ β_o is the orthogonal projection of y_t under [B2], it must be true that, for any function g and any vector z_t containing the elements of Y^{t−1} and W^t,

    IE[g(z_t)(y_t − x_t′ β_o)] = 0,

by Lemma 5.11. That is, any function of z_t must be orthogonal to the difference between y_t and its orthogonal projection x_t′ β_o. If this condition does not hold for some g(z_t), it should be clear that x_t′ β_o cannot be IE(y_t | Y^{t−1}, W^t). In particular, if

    IE[x_t (y_t − x_t′ β_o)] ≠ 0,

then x_t′ β_o cannot be the conditional mean.

Unlike [A2](ii), the imposed conditions do not rule out serially correlated y_t, nor do they require the conditional variance var(y_t | Y^{t−1}, W^t) to be a constant. Moreover, {x_t} may also be a sequence of weakly dependent and heterogeneously distributed random variables, as long as it obeys a SLLN (WLLN). To summarize, the conditions here allow data to exhibit various forms of dependence and heterogeneity. By contrast, the classical conditions admit only serially uncorrelated and homoskedastic data.
Given [B1], define the following limits:

    M_xx := lim_{T→∞} (1/T) Σ_{t=1}^T IE(x_t x_t′),    M_xy := lim_{T→∞} (1/T) Σ_{t=1}^T IE(x_t y_t),

which are, respectively, the almost sure (probability) limits of the averages of x_t x_t′ and x_t y_t under a SLLN (WLLN). As matrix inversion is a continuous function and M_xx is invertible by [B1](i), Lemma 5.13 (Lemma 5.17) ensures that

    ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1} → M_xx^{−1}   a.s. (in probability).

It follows from (6.2) that

    β̂_T = ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1} ( (1/T) Σ_{t=1}^T x_t y_t ) → M_xx^{−1} M_xy   a.s. (in probability).

This shows that the OLS estimator of β has a well-defined limit under [B1].

Theorem 6.1  Given the linear specification (6.1), suppose that [B1] holds. Then, β̂_T is strongly (weakly) consistent for M_xx^{−1} M_xy.

Theorem 6.1 holds regardless of [B2]; that is, whether (6.1) is the correct specification or not is irrelevant.
Example 6.2  Given the simple AR(1) specification

    y_t = α y_{t−1} + e_t,

suppose that {y_t²} and {y_t y_{t−1}} obey a SLLN (WLLN). Then, Theorem 6.1 ensures that the OLS estimator

    α̂_T → [ lim_{T→∞} (1/T) Σ_{t=2}^T IE(y_t y_{t−1}) ] / [ lim_{T→∞} (1/T) Σ_{t=2}^T IE(y²_{t−1}) ]   a.s. (in probability).

If IE(y_t) = 0, var(y_t) = γ_0, and cov(y_t, y_{t−1}) = γ_1 for all t, this limit simplifies to γ_1/γ_0.
By the law of iterated expectations, [B2] implies

    IE(x_t y_t) = IE[x_t IE(y_t | Y^{t−1}, W^t)] = IE(x_t x_t′) β_o,

which in turn yields M_xy = M_xx β_o. Thus, [B1] and [B2] together determine the almost sure (probability) limit in Theorem 6.1 as

    M_xx^{−1} M_xy = β_o.

We have established the following result.

Theorem 6.3  Given the linear specification (6.1), suppose that [B1] and [B2] hold. Then, β̂_T is strongly (weakly) consistent for the parameter vector β_o in [B2].
We state consistency in two results so as to clarify the convergence behavior of the OLS estimator. First, as long as the data obey a proper SLLN (WLLN), β̂_T converges to something. Second, the almost sure (probability) limit would be β_o if x_t′ β is the correct specification for the conditional mean. These results are convenient for asserting consistency. Once we believe (or are able to verify) that the data obey a SLLN (WLLN), the conclusion of Theorem 6.1 immediately applies. If, further, we believe (or are able to verify) that the specification is correct for the conditional mean, we can conclude from Theorem 6.3 that the OLS estimator is strongly (weakly) consistent for the parameter of interest.
Example 6.4  Given the simple AR(1) specification

    y_t = α y_{t−1} + e_t,

suppose that

    y_t = α_o y_{t−1} + u_t,

where |α_o| < 1 and {u_t} is a sequence of unobservable, independent random variables with mean zero and variance σ²_u. A process so generated is an AR(1) process. As all the elements of Y^{t−1} are determined by u_s for s ≤ t − 1, it is then clear that these elements and their functions must be independent of u_t, by Lemma 5.1. It follows that α_o y_{t−1} is the conditional mean IE(y_t | Y^{t−1}). Theorem 6.3 now ensures that α̂_T → α_o a.s. (in probability). Note, however, that α_o y_{t−1} need not be the conditional mean if {u_t} is merely a white noise sequence.

Alternatively, we can establish consistency as follows. In view of Section 4.4, y_t is weakly stationary with mean zero, variance σ²_u/(1 − α_o²), and

    cov(y_t, y_{t−j}) = α_o^j σ²_u/(1 − α_o²).

It follows from Example 6.2 that

    α̂_T → cov(y_t, y_{t−1})/var(y_{t−1}) = α_o   a.s. (in probability).

Comparing to Example 6.2, we can see that the more we know about the data, the more precisely we can characterize the limit of the OLS estimator.
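The consistency claim in this example is easy to reproduce numerically. The sketch below is ours (α_o = 0.5, Gaussian u_t, and the sample sizes are illustrative); the OLS estimate of the AR(1) coefficient settles at α_o as T grows.

```python
import numpy as np

def ols_ar1(y):
    """OLS slope in the regression of y_t on y_{t-1} (no intercept)."""
    y_lag, y_cur = y[:-1], y[1:]
    return (y_lag @ y_cur) / (y_lag @ y_lag)

rng = np.random.default_rng(4)
alpha_o = 0.5
for T in (10**2, 10**3, 10**4, 10**5):
    u = rng.normal(size=T)
    y = np.empty(T)
    y[0] = u[0]
    for t in range(1, T):
        y[t] = alpha_o * y[t - 1] + u[t]
    print(T, ols_ar1(y))        # approaches alpha_o = 0.5
```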
The examples below illustrate that when x_t′ β_o is not the desired conditional mean, the OLS estimator still converges but may be inconsistent for β_o.
Example 6.5  Consider the specification

    y_t = x_t′ β + e_t,

where x_t is k_1 × 1. Suppose that

    IE(y_t | Y^{t−1}, W^t) = x_t′ β_o + z_t′ γ_o,

where z_t (k_2 × 1) also contains the elements of Y^{t−1} and W^t and is distinct from x_t. This is an example in which a specification omits relevant variables (z_t in the conditional mean). When [B1] holds,

    β̂_T → M_xx^{−1} M_xy   a.s. (in probability),

by Theorem 6.1. In this case,

    IE(x_t y_t) = IE(x_t x_t′) β_o + IE(x_t z_t′) γ_o,

so that

    IE[x_t (y_t − x_t′ β_o)] = IE(x_t z_t′) γ_o,

which is non-zero in general. This shows that x_t′ β_o cannot be the desired conditional mean, and hence convergence to β_o is not guaranteed. In fact, the almost sure (probability) limit of β̂_T is

    β* = M_xx^{−1} M_xy = β_o + M_xx^{−1} M_xz γ_o,

where M_xz = lim_{T→∞} Σ_{t=1}^T IE(x_t z_t′)/T is assumed to exist. This limit is not β_o in general. If the elements of x_t are orthogonal to those of z_t, i.e., IE(x_t z_t′) = 0, we have M_xz = 0 and β* = β_o. Then β̂_T converges to β_o almost surely (in probability), despite the fact that the linear function is misspecified for the conditional mean function.
Example 6.6  Given the simple AR(1) specification

    y_t = α y_{t−1} + e_t,

suppose that

    y_t = α_o y_{t−1} + u_t + θ_o u_{t−1},

where |α_o| < 1, |θ_o| < 1, and {u_t} is a sequence of unobservable, independent random variables with mean zero and variance σ²_u. A process so generated is known as an ARMA(1,1) process (an autoregressive and moving average process of order (1,1)). It is easily shown that α_o y_{t−1} is not the conditional mean IE(y_t | Y^{t−1}) because

    IE[y_{t−1}(y_t − α_o y_{t−1})] = IE[y_{t−1}(u_t + θ_o u_{t−1})] = θ_o σ²_u,

which is non-zero unless θ_o = 0. As y_t is weakly stationary such that

    IE(y_t y_{t−1}) = α_o IE(y²_{t−1}) + θ_o σ²_u,

where IE(y²_{t−1}) is a constant, we have from Example 6.2 that

    α̂_T → IE(y_t y_{t−1})/IE(y²_{t−1}) = α_o + θ_o σ²_u/IE(y²_{t−1})   a.s. (in probability).

This shows that α̂_T converges but is inconsistent for α_o. This is the case because y_t − α_o y_{t−1} are serially correlated and hence correlated with y_{t−1}. Clearly, α̂_T would be consistent for α_o if θ_o = 0, i.e., if y_t − α_o y_{t−1} are serially uncorrelated (in fact, independent) and hence uncorrelated with y_{t−1}, as shown in Example 6.4.
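The asymptotic bias in this example can be checked by simulation. The sketch below is ours (α_o = θ_o = 0.5 is an arbitrary illustration); the OLS estimate from a long ARMA(1,1) sample sits close to the limit α_o + θ_o σ²_u/IE(y²_{t−1}) stated above, with IE(y²) replaced by its sample analogue, and well above α_o.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha_o, theta_o, sigma_u = 0.5, 0.5, 1.0
T = 200000
u = rng.normal(0.0, sigma_u, size=T)
y = np.empty(T)
y[0] = u[0]
for t in range(1, T):
    y[t] = alpha_o * y[t - 1] + u[t] + theta_o * u[t - 1]

y_lag, y_cur = y[:-1], y[1:]
alpha_hat = (y_lag @ y_cur) / (y_lag @ y_lag)
# The limit in the text: alpha_o + theta_o * sigma_u^2 / IE(y_{t-1}^2),
# with IE(y^2) approximated here by its sample mean over a long path.
implied_limit = alpha_o + theta_o * sigma_u**2 / (y**2).mean()
print(alpha_hat, implied_limit)     # both clearly above alpha_o = 0.5
```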
This conclusion will not be altered if the lagged dependent variable is one of the regressors:

    y_t = α y_{t−1} + x_t′ β + e_t.

Suppose that

    y_t = α_o y_{t−1} + x_t′ β_o + ε_t,

where ε_t are serially correlated. Then the OLS estimators of α and β are inconsistent for α_o and β_o.
In the examples above, y_t can be written as

    y_t = x_t′ β* + ε_t,

where ε_t are disturbances. For Example 6.5, β* = β_o, ε_t = z_t′ γ_o + u_t, and u_t is such that IE(u_t | Y^{t−1}, W^t) = 0; for Example 6.6, β* = α_o, ε_t = u_t + θ_o u_{t−1}, and {u_t} is a white noise sequence. By noting

    β̂_T = β* + ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1} ( (1/T) Σ_{t=1}^T x_t (y_t − x_t′ β*) ),   (6.3)

we can see that β̂_T would converge to β* provided that the second term on the right-hand side vanishes in the limit. When x_t are uncorrelated with the disturbances, i.e., IE[x_t(y_t − x_t′ β*)] = 0 for all t, a SLLN (WLLN) implies

    (1/T) Σ_{t=1}^T x_t (y_t − x_t′ β*) → 0   a.s. (in probability),

    (1/T) Σ_{t=1}^T x_t x_t′ = O(1)   a.s. (in probability).

It follows that the second term on the right-hand side of (6.3) does converge to zero almost surely (in probability). Clearly, [B2] is a sufficient condition for IE[x_t(y_t − x_t′ β*)] = 0 with β* = β_o.

On the other hand, when IE[x_t(y_t − x_t′ β*)] = c ≠ 0,

    (1/T) Σ_{t=1}^T x_t (y_t − x_t′ β*) → c   a.s. (in probability),

so that β̂_T converges to β* + M_xx^{−1} c, which differs from β* by a fixed amount. That is, inconsistency would result if the regressors x_t are correlated with the disturbances ε_t. Such correlations may be due to the correlation between the included and excluded variables (Example 6.5) or the correlation between the lagged dependent variables and serially correlated disturbances (Example 6.6).

While the effect of a SLLN or WLLN (condition [B1]) is important in establishing OLS consistency, it will be shown below that [B1] is not a necessary condition.
Example 6.7  Given the simple linear time trend specification

    y_t = a + b t + e_t,

suppose that [B2] holds: IE(y_t | Y^{t−1}) = a_o + b_o t. We have learned from Example 5.29 that t and t² do not obey a SLLN or a WLLN, so that [B1] is violated. Nevertheless, the OLS estimators of a and b remain consistent. In view of (6.3), the OLS estimator of b is

    b̂_T = [ Σ_{t=1}^T (t − (T+1)/2) y_t ] / [ Σ_{t=1}^T (t − (T+1)/2)² ]
        = b_o + [ Σ_{t=1}^T (t − (T+1)/2) ε_t ] / [ Σ_{t=1}^T (t − (T+1)/2)² ],

where ε_t = y_t − a_o − b_o t. We have seen in Example 5.30 that Σ_{t=1}^T ε_t is O_IP(T^{1/2}) and Σ_{t=1}^T t ε_t is O_IP(T^{3/2}). While the numerator is O_IP(T^{3/2}), the denominator grows even faster:

    Σ_{t=1}^T (t − (T+1)/2)² = Σ_{t=1}^T t² − T(T+1)²/4 = T(T+1)(T−1)/12 = O(T³).

The entire second term thus vanishes in the limit, and b̂_T →^{IP} b_o. Similarly, we can show

    â_T = ȳ − b̂_T (T+1)/2 = a_o + [ (4T+2) Σ_{t=1}^T ε_t − 6 Σ_{t=1}^T t ε_t ] / [T(T−1)].

As the second term above is O_IP(T^{−1/2}), â_T →^{IP} a_o.
Example 6.8  Given a simple AR(1) specification

    y_t = α y_{t−1} + e_t,

suppose that y_t is a random walk:

    y_t = y_{t−1} + ε_t,

where ε_t are i.i.d. random variables. We have seen in Example 5.31 that {y_t} and {y_t²} do not obey a SLLN (WLLN). By (6.3), the OLS estimator of α can be written as

    α̂_T = 1 + [ Σ_{t=1}^T y_{t−1} ε_t ] / [ Σ_{t=1}^T y²_{t−1} ].

From Examples 5.31 and 5.32 we know that the numerator on the right-hand side above is O_IP(T), while the denominator is O_IP(T²). Consequently, α̂_T →^{IP} 1.

When ε_t is a weakly stationary ARMA process and exhibits serial correlations, y_{t−1} is not the conditional mean of y_t because IE(y_{t−1} ε_t) is non-zero. Nevertheless,

    [ Σ_{t=1}^T y_{t−1} ε_t ] / [ Σ_{t=1}^T y²_{t−1} ] = O_IP(T)/O_IP(T²) = O_IP(T^{−1}),   (6.4)

so that α̂_T is still weakly consistent for 1.

Remark: Example 6.8 demonstrates that the OLS estimator may still be consistent even when a lagged dependent variable and serially correlated disturbances are both present. This is because Σ_{t=1}^T y²_{t−1} in (6.4) grows much faster and hence is able to eliminate all the correlations between y_{t−1} and ε_t asymptotically. If Σ_{t=1}^T y²_{t−1} and Σ_{t=1}^T y_{t−1} ε_t in (6.4) grew at the same rate, these correlations would not vanish in the limit and would therefore cause inconsistency, as shown in Example 6.6.
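As a check on the rates in (6.4), the sketch below (ours; Gaussian increments) estimates the AR(1) coefficient from simulated random walks: α̂_T approaches 1 quickly, and T(α̂_T − 1) remains stochastically bounded, consistent with the numerator and denominator orders stated above.

```python
import numpy as np

rng = np.random.default_rng(6)
for T in (10**2, 10**3, 10**4):
    eps = rng.normal(size=T)
    y = eps.cumsum()                       # random walk
    y_lag, y_cur = y[:-1], y[1:]
    alpha_hat = (y_lag @ y_cur) / (y_lag @ y_lag)
    # alpha_hat -> 1, and T * (alpha_hat - 1) stays stochastically bounded.
    print(T, alpha_hat, T * (alpha_hat - 1))
```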
6.2.2 Asymptotic Normality

We say that β̂_T is asymptotically normally distributed (about β*) if

    √T (β̂_T − β*) →^{D} N(0, D_o),

where D_o is a positive-definite matrix. That is, the sequence of properly normalized β̂_T converges in distribution to a multivariate normal random vector. As D_o is the covariance matrix of the limiting normal distribution, it is also known as the asymptotic covariance matrix of √T(β̂_T − β*). Equivalently, we may also express asymptotic normality by

    D_o^{−1/2} √T (β̂_T − β*) →^{D} N(0, I_k).

It should be emphasized that asymptotic normality here refers to √T(β̂_T − β*) rather than β̂_T; the latter has only a degenerate distribution in the limit by strong (weak) consistency.

When √T(β̂_T − β*) has a limiting distribution, it is O_IP(1) by Lemma 5.24. Therefore, β̂_T − β* is necessarily O_IP(T^{−1/2}), so that β̂_T tends to β* at the rate T^{−1/2}. Thus, we know not only consistency but also the rate of convergence to β*. An estimator that converges at this rate is usually referred to as a √T-consistent estimator. Some consistent estimators may converge more quickly. In Example 6.7, the estimator b̂_T of the slope coefficient in the simple time trend specification converges to b_o at the rate T^{−3/2}, whereas the estimator of the intercept is √T-consistent. Also, the OLS estimator for the AR(1) specification is T-consistent when y_t is a random walk but √T-consistent when y_t is a weakly stationary process; see Example 6.8.
To ensure asymptotic normality, we impose an additional condition.

[B3]  For some β*, {x_t(y_t − x_t′ β*)} is a sequence of random vectors with mean zero that obeys a CLT.

If we write

    y_t = x_t′ β* + ε_t,

[B3] requires that IE(x_t ε_t) = 0, i.e., the regressors x_t and disturbances ε_t are uncorrelated. Moreover,

    V_T := var( T^{−1/2} Σ_{t=1}^T x_t ε_t ) → V_o,

a positive-definite matrix, and

    V_o^{−1/2} ( T^{−1/2} Σ_{t=1}^T x_t ε_t ) →^{D} N(0, I_k).
In view of (6.3), the normalized OLS estimator is

    √T (β̂_T − β*) = ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1} ( T^{−1/2} Σ_{t=1}^T x_t ε_t )
                   = ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1} V_o^{1/2} [ V_o^{−1/2} ( T^{−1/2} Σ_{t=1}^T x_t ε_t ) ].   (6.5)

Given the SLLN (WLLN) condition [B1] and the CLT condition [B3], the first term on the right-hand side of (6.5) converges in probability to M_xx^{−1}, and the last term in the square bracket converges in distribution to N(0, I_k). It follows from Lemma 5.22 that (6.5) converges in distribution to

    M_xx^{−1} V_o^{1/2} N(0, I_k) =^d N(0, M_xx^{−1} V_o M_xx^{−1}),

where =^d stands for equality in distribution. We have established the following asymptotic normality result.

Theorem 6.9  Given the linear specification (6.1), suppose that [B1] and [B3] hold. Then,

    √T (β̂_T − β*) →^{D} N(0, D_o),

where D_o = M_xx^{−1} V_o M_xx^{−1}, or equivalently,

    D_o^{−1/2} √T (β̂_T − β*) →^{D} N(0, I_k),

where D_o^{−1/2} = V_o^{−1/2} M_xx.
As long as we believe (or are able to verify) that the data have proper SLLN (WLLN) and CLT effects, we can invoke Theorem 6.9 and assert asymptotic normality of the (properly normalized) OLS estimator. In particular, this result may hold for weakly dependent and heterogeneously distributed data; neither independence nor normality is required. By contrast, the normality property in Theorem 3.7(a) is an exact distribution result for the OLS estimator, but it is valid only when y_t are independent, normal random variables.

When V_o is unknown, let V̂_T denote a symmetric and positive definite matrix that is consistent for V_o. A weakly consistent estimator of D_o is then

    D̂_T = ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1} V̂_T ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1}.

As D̂_T →^{IP} D_o, we must have D̂_T^{−1/2} →^{IP} D_o^{−1/2}. It follows from Theorem 6.9 and Lemma 5.19 that

    D̂_T^{−1/2} √T (β̂_T − β*) →^{D} D_o^{−1/2} N(0, D_o) =^d N(0, I_k).

This shows that Theorem 6.9 remains valid when the asymptotic covariance matrix D_o is replaced by a weakly consistent estimator D̂_T. Note that D̂_T does not have to be a strongly consistent estimator here.
Theorem 6.10  Given the linear specification (6.1), suppose that [B1] and [B3] hold. Then,

    D̂_T^{−1/2} √T (β̂_T − β*) →^{D} N(0, I_k),

where D̂_T = (Σ_{t=1}^T x_t x_t′/T)^{−1} V̂_T (Σ_{t=1}^T x_t x_t′/T)^{−1} and V̂_T →^{IP} V_o.

Remark: Theorems 6.9 and 6.10 show that we may obtain asymptotic normality even when the condition [B2] for correct specification is absent. Without [B2], there is no guarantee that β̂_T would converge to β_o, but it still converges to some limit β* under [B1]. Then the CLT effect of [B3] suffices for asymptotic normality. When the asymptotic covariance matrix V_o is unknown, it is of paramount importance to find a consistent estimator of D_o. Normalizing the OLS estimator with an inconsistent estimator of D_o will, in general, destroy asymptotic normality.
Example 6.11  Given the linear specification

    y_t = x_t′ β + e_t,   t = 1, . . . , T,

suppose that the classical conditions [A1] and [A2] hold. If Σ_{t=1}^T x_t x_t′/T converges to some limit M_xx, then

    V_o = lim_{T→∞} (1/T) Σ_{t=1}^T IE(ε_t² x_t x_t′) = σ²_o M_xx.

By invoking a suitable CLT, it can be shown that the classical conditions are sufficient for [B3]. It follows from Theorem 6.9 that

    √T (β̂_T − β_o) →^{D} N(0, D_o),

with a much simpler asymptotic covariance matrix:

    D_o = M_xx^{−1} V_o M_xx^{−1} = σ²_o M_xx^{−1}.

Comparing to Theorem 3.7(a), y_t here need not be normally distributed, and the asymptotic covariance matrix is determined by the limit of Σ_{t=1}^T (x_t x_t′)/T, which in matrix notation can be written as X′X/T. A natural estimator of D_o is

    D̂_T = σ̂²_T (X′X/T)^{−1},

where σ̂²_T is the OLS variance estimator. Theorem 6.10 then ensures

    (1/σ̂_T)(X′X/T)^{1/2} √T (β̂_T − β_o) = (1/σ̂_T)(X′X)^{1/2} (β̂_T − β_o) →^{D} N(0, I_k).

Taking the inner product of the left-hand side above, we immediately conclude that

    (β̂_T − β_o)′ (X′X)(β̂_T − β_o)/σ̂²_T →^{D} χ²(k),

by Lemma 5.20. Note that the left-hand side is k times the F statistic (with R = I_k) in Section 3.4.1.
The example below shows that, even without the effects of a SLLN (WLLN) and CLT, properly normalized OLS estimators may still have an asymptotic normal distribution.

Example 6.12  The simple linear time trend specification,

    y_t = a + b t + e_t,

is a special case of the regression with non-stochastic regressors x_t = [1 t]′. Let â_T and b̂_T denote the OLS estimators of a and b, respectively. We know that {x_t x_t′} does not obey a SLLN (WLLN) and that {t ε_t} does not obey a CLT. It is, however, easy to see that for x_t = [1 t/T]′,

    (1/T) Σ_{t=1}^T x_t x_t′ → [ 1    1/2
                                 1/2  1/3 ] =: M,

so that {x_t x_t′} obeys a SLLN. Example 5.37 also shows that {(t/T) ε_t} obeys a CLT. These suggest that we may consider the alternative specification:

    y_t = a + b (t/T) + e_t.

The resulting OLS estimators are such that ã_T = â_T and b̃_T = T b̂_T.

Suppose that

    y_t = a_o + b_o t + ε_t,

where ε_t are uncorrelated random variables with IE(ε_t) = 0 and var(ε_t) = σ²_o. In view of the preceding example we can then conclude that

    [ T^{1/2}(â_T − a_o)
      T^{3/2}(b̂_T − b_o) ] = √T [ ã_T − a_o
                                   b̃_T − T b_o ] →^{D} N(0, D_o),

with D_o = σ²_o M^{−1}, where

    M^{−1} = [ 4   −6
               −6  12 ].

Moreover,

    D_o^{−1/2} [ T^{1/2}(â_T − a_o)
                 T^{3/2}(b̂_T − b_o) ] →^{D} N(0, I_2),

where D_o^{−1/2} = σ_o^{−1} M^{1/2} with

    M^{1/2} = [ 1  1/2
                0  1/(2√3) ].

Replacing σ_o by the OLS estimator σ̂_T yields the same limiting result.
6.3 Consistent Estimation of Covariance Matrix

We have seen in the preceding section that a consistent estimator of D_o = M_xx^{−1} V_o M_xx^{−1} is crucial for the asymptotic normality result. The matrix M_xx can be consistently estimated by its sample counterpart Σ_{t=1}^T x_t x_t′/T; it then remains to find a consistent estimator of

    V_o = lim_{T→∞} var( T^{−1/2} Σ_{t=1}^T x_t ε_t ),

where ε_t = y_t − x_t′ β*. This section is concerned with consistent estimation of V_o and D_o.
In its most general form, V_o can be expressed as the sum of variances and autocovariances:

    V_o = lim_{T→∞} var( T^{−1/2} Σ_{t=1}^T x_t ε_t )
        = lim_{T→∞} (1/T) Σ_{t=1}^T IE(ε_t² x_t x_t′)
          + lim_{T→∞} (1/T) Σ_{τ=1}^{T−1} Σ_{t=τ+1}^T [ IE(x_t ε_t ε_{t−τ} x_{t−τ}′) + IE(x_{t−τ} ε_{t−τ} ε_t x_t′) ].   (6.6)

Let x_{ij} denote the j th element of x_i. It can be seen that for t ≠ s,

    IE(x_{t1} ε_t ε_s x_{s2}) ≠ IE(x_{s1} ε_s ε_t x_{t2}),

in general. That is, the covariance matrix IE(x_t ε_t ε_{t−τ} x_{t−τ}′) need not be symmetric. This matrix would be symmetric when, for example, {x_t ε_t} is a multivariate, weakly stationary process such that the autocovariances of its elements, IE(x_{ti} ε_t ε_s x_{sj}), do not depend on t but only on the time difference t − s. When {x_t ε_t} is indeed weakly stationary, V_o simplifies to

    V_o = lim_{T→∞} (1/T) Σ_{t=1}^T IE(ε_t² x_t x_t′) + lim_{T→∞} (2/T) Σ_{τ=1}^{T−1} (T − τ) IE(x_t ε_t ε_{t−τ} x_{t−τ}′).   (6.7)

Clearly, if x_t ε_t are serially uncorrelated, the second terms on the right-hand sides of (6.6) and (6.7) vanish; the remaining part of V_o is relatively easy to estimate. When there are serial correlations, estimating V_o would be more cumbersome because it involves an infinite sum of autocovariances.
6.3.1 When Serial Correlations Are Absent

First observe that [B2] is equivalent to the condition that

    IE(ε_t | Y^{t−1}, W^t) = 0,

where ε_t = y_t − x_t′ β_o. A sequence {ε_t} with the property above is known as a martingale difference sequence with respect to the sequence of σ-algebras generated by (Y^{t−1}, W^t). It is easy to see that if {ε_t} is a martingale difference sequence with respect to {Y^{t−1}, W^t}, its unconditional mean and autocovariances are also zero, yet it may not be a white noise; see Exercise 6.7. Note also that a white noise need not be a martingale difference sequence. For the same reasons, we can verify that

    IE(x_t ε_t) = IE[x_t IE(ε_t | Y^{t−1}, W^t)] = 0,

and, for any t ≠ τ (taking t > τ without loss of generality),

    IE(x_t ε_t ε_τ x_τ′) = IE[x_t IE(ε_t | Y^{t−1}, W^t) ε_τ x_τ′] = 0.

That is, {x_t ε_t} is a sequence of uncorrelated, zero-mean random vectors under [B2]. In this case, the covariance matrices (6.6) and (6.7) reduce to

    V_o = lim_{T→∞} (1/T) Σ_{t=1}^T IE(ε_t² x_t x_t′).   (6.8)

Note that the simpler form of V_o is a consequence of [B2], the correct specification of the conditional mean function. [B2] is not a necessary condition for (6.8), however.
If, in addition to [B2], ε_t are also conditionally homoskedastic:

    IE(ε_t² | Y^{t−1}, W^t) = σ²_o,

then (6.8) can be further simplified to

    V_o = lim_{T→∞} (1/T) Σ_{t=1}^T IE[IE(ε_t² | Y^{t−1}, W^t) x_t x_t′]
        = σ²_o ( lim_{T→∞} (1/T) Σ_{t=1}^T IE(x_t x_t′) ) = σ²_o M_xx.   (6.9)

As M_xx can be consistently estimated by its sample counterpart, it remains to estimate σ²_o. It turns out that the standard OLS variance estimator σ̂²_T = Σ_{t=1}^T ê_t²/(T − k) is consistent for σ²_o, where ê_t are the OLS residuals; see Exercise 6.8. It follows that a consistent estimator of V_o is

    V̂_T = σ̂²_T ( (1/T) Σ_{t=1}^T x_t x_t′ ).

In this case, the asymptotic covariance matrix of √T(β̂_T − β_o) is also of a simpler form:

    D_o = M_xx^{−1} V_o M_xx^{−1} = σ²_o M_xx^{−1},

which can be consistently estimated by

    D̂_T = σ̂²_T ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1}.   (6.10)

This is the same as the estimator obtained in Example 6.11. Note again that, apart from the factor T, D̂_T here is the estimated covariance matrix of β̂_T in the classical least squares theory.
More generally, when ε_t are conditionally heteroskedastic, i.e., IE(ε_t² | Y^{t−1}, W^t) are random variables depending on t, then (6.8) cannot be simplified as before. To estimate (6.8), observe that

    (1/T) Σ_{t=1}^T [ê_t² x_t x_t′ − IE(ε_t² x_t x_t′)]
        = (1/T) Σ_{t=1}^T [ε_t² x_t x_t′ − IE(ε_t² x_t x_t′)]
          − (2/T) Σ_{t=1}^T [ε_t x_t′(β̂_T − β_o) x_t x_t′]
          + (1/T) Σ_{t=1}^T [(β̂_T − β_o)′ x_t x_t′(β̂_T − β_o) x_t x_t′].

The first term on the right-hand side converges to zero in probability if {ε_t² x_t x_t′} obeys a WLLN. By noting that under [B2],

    IE(ε_t x_t′ x_t x_t′) = IE[IE(ε_t | Y^{t−1}, W^t) x_t′ x_t x_t′] = 0,

a suitable WLLN will ensure

    (1/T) Σ_{t=1}^T ε_t x_t′ x_t x_t′ →^{IP} 0.

This, together with the fact that β̂_T − β_o is O_IP(T^{−1/2}), shows that the second term also converges to zero in probability. Similarly, the third term vanishes in the limit by a suitable WLLN. These results together indicate that, as long as the data have proper WLLN effects,

    (1/T) Σ_{t=1}^T [ê_t² x_t x_t′ − IE(ε_t² x_t x_t′)] →^{IP} 0.

A consistent estimator of V_o is therefore

    V̂_T = (1/T) Σ_{t=1}^T ê_t² x_t x_t′.   (6.11)
Thus, V_o can be consistently estimated without modeling the conditional variance IE(ε_t² | Y^{t−1}, W^t). An estimator of this form is known as a heteroskedasticity-consistent covariance matrix estimator, which is consistent when conditional heteroskedasticity is present and of unknown form. Consequently, a consistent estimator of D_o is

    D̂_T = ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1} ( (1/T) Σ_{t=1}^T ê_t² x_t x_t′ ) ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1}.   (6.12)

This estimator was proposed by Eicker (1967) and White (1980) and is known as the Eicker-White covariance matrix estimator. While the estimator (6.10) is inconsistent under conditional heteroskedasticity, the Eicker-White estimator is "robust" in the sense that it remains consistent under both conditional homoskedasticity and heteroskedasticity. Yet the Eicker-White estimator is less efficient than (6.10) when ε_t are in fact conditionally homoskedastic. That is, we obtain a more robust estimator at the expense of a (possible) efficiency loss.
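The estimator (6.12) is straightforward to code. The sketch below is ours (variable names and the simulated heteroskedastic design are illustrative); it returns D̂_T/T, i.e., the Eicker-White estimate of the covariance matrix of β̂_T itself, whose diagonal square roots are the White standard errors used later in Example 6.14.

```python
import numpy as np

def ols_white_cov(X, y):
    """OLS estimates and the Eicker-White covariance estimate of beta_hat:
    (X'X)^{-1} (sum_t e_t^2 x_t x_t') (X'X)^{-1}, i.e., (6.12) divided by T."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    e_hat = y - X @ beta_hat
    meat = (X * e_hat[:, None]**2).T @ X        # sum_t e_t^2 x_t x_t'
    return beta_hat, XtX_inv @ meat @ XtX_inv

# Illustration with conditionally heteroskedastic disturbances.
rng = np.random.default_rng(7)
T = 2000
x = rng.normal(size=T)
X = np.column_stack([np.ones(T), x])
y = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=T)   # error variance depends on x
beta_hat, cov_white = ols_white_cov(X, y)
print(beta_hat, np.sqrt(np.diag(cov_white)))         # White standard errors
```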
6.3.2 When Serial Correlations Are Present

When {x_t ε_t} exhibits serial correlations, it is still possible to estimate (6.6) and (6.7) consistently. Let m(T) denote a function of T that diverges to infinity with T but at a slower rate. Suppose that the autocorrelations between x_t ε_t and x_{t−τ} ε_{t−τ} decay sufficiently fast such that

    (1/T) Σ_{τ=m(T)+1}^{T−1} Σ_{t=τ+1}^T IE(x_t ε_t ε_{t−τ} x_{t−τ}′) → 0.

That is, x_t ε_t and x_{t−τ} ε_{t−τ} are asymptotically uncorrelated in a proper way. Then for large T, V_o can be well approximated by

    V_T* = (1/T) Σ_{t=1}^T var(x_t ε_t)
           + (1/T) Σ_{τ=1}^{m(T)} Σ_{t=τ+1}^T [ IE(x_t ε_t ε_{t−τ} x_{t−τ}′) + IE(x_{t−τ} ε_{t−τ} ε_t x_t′) ].

Estimating V_o now amounts to estimating V_T*.
White (1984) notes that a consistent estimator of V_T* is its sample counterpart:

    V̂_T = (1/T) Σ_{t=1}^T ê_t² x_t x_t′
          + (1/T) Σ_{τ=1}^{m(T)} Σ_{t=τ+1}^T [ x_t ê_t ê_{t−τ} x_{t−τ}′ + x_{t−τ} ê_{t−τ} ê_t x_t′ ].

A major problem with this estimator is that V̂_T need not be positive semi-definite and hence cannot be a well-defined variance-covariance matrix. Newey and West (1987) show that with a suitable weighting function w_{m(T)}(τ), the estimator below is guaranteed to be positive semi-definite while remaining consistent for V_T*:

    V̂_T = (1/T) Σ_{t=1}^T ê_t² x_t x_t′
          + (1/T) Σ_{τ=1}^{T−1} w_{m(T)}(τ) Σ_{t=τ+1}^T [ x_t ê_t ê_{t−τ} x_{t−τ}′ + x_{t−τ} ê_{t−τ} ê_t x_t′ ].   (6.13)

The estimator (6.13) is known as a heteroskedasticity and autocorrelation-consistent (HAC) covariance matrix estimator and is valid when both conditional heteroskedasticity and serial correlations are present but of an unknown form. The resulting consistent estimator of D_o is

    D̂_T = ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1} V̂_T ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1},   (6.14)

where V̂_T is given by (6.13); cf. the Eicker-White estimator (6.12). An estimator of this type is usually referred to as the Newey-West covariance matrix estimator.
In particular, Newey and West (1987) suggested the so-called Bartlett kernel for the weighting function of V̂_T:

    w_{m(T)}(τ) = 1 − τ/m(T),   if 0 ≤ τ/m(T) ≤ 1,
                = 0,            otherwise.

Gallant (1987) chose the Parzen kernel as the weighting function:

    w_{m(T)}(τ) = 1 − 6[τ/m(T)]² + 6[τ/m(T)]³,   if 0 ≤ τ/m(T) ≤ 1/2,
                = 2[1 − τ/m(T)]³,                if 1/2 ≤ τ/m(T) ≤ 1,
                = 0,                             otherwise.

Consider the Bartlett kernel, where w_{m(T)}(τ) = 1 − τ/m(T). For a fixed m(T), it is decreasing in τ; hence a smaller weight is assigned when two random variables are separated by a long time period (i.e., τ is large). On the other hand, for a fixed τ, w_{m(T)}(τ) → 1 as m(T) → ∞, and hence the weighting entails little loss asymptotically. In practice, a finite number m(T) must be chosen to compute V̂_T. It is worth noting that a small m(T) may result in substantial finite-sample bias. For other choices of weighting functions and a method of determining the approximation lag m(T), we refer to Andrews (1991).

Comparing to the Eicker-White estimator, the Newey-West estimator is robust to both conditional heteroskedasticity of ε_t and serial correlations of x_t ε_t, yet the latter is less efficient than the former when x_t ε_t are not serially correlated. Finally, we note that both the Eicker-White estimator (6.12) and the Newey-West estimator (6.14) are non-parametric in nature because they do not rely on any parametric model of conditional heteroskedasticity and serial correlations.
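A direct implementation of (6.13)-(6.14) with the Bartlett kernel is sketched below (our own helper names; m stands for the truncation lag m(T), which the user must choose, and the rule of thumb used in the example is one common choice, not one prescribed by the text). The function returns D̂_T/T, the HAC estimate of the covariance matrix of β̂_T.

```python
import numpy as np

def newey_west_cov(X, e_hat, m):
    """Newey-West (Bartlett-kernel) estimate of D_hat_T / T:
    (X'X)^{-1} [ sum_t e_t^2 x_t x_t'
                 + sum_{tau=1}^{m} (1 - tau/m) *
                   sum_{t>tau} (x_t e_t e_{t-tau} x_{t-tau}' + transpose) ] (X'X)^{-1}."""
    S = (X * e_hat[:, None]**2).T @ X            # tau = 0 term of (6.13) times T
    for tau in range(1, m + 1):
        w = 1.0 - tau / m                        # Bartlett weight from the text
        G = (X[tau:] * e_hat[tau:, None]).T @ (X[:-tau] * e_hat[:-tau, None])
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ S @ XtX_inv

# Usage: serially correlated disturbances call for a HAC covariance matrix.
rng = np.random.default_rng(8)
T = 2000
x = rng.normal(size=T)
u = rng.normal(size=T)
eps = np.empty(T); eps[0] = u[0]
for t in range(1, T):                            # AR(1) disturbances
    eps[t] = 0.6 * eps[t - 1] + u[t]
X = np.column_stack([np.ones(T), x])
y = 1.0 + 2.0 * x + eps
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
m = int(4 * (T / 100) ** (2 / 9))                # a common rule-of-thumb lag
print(np.sqrt(np.diag(newey_west_cov(X, y - X @ beta_hat, m))))
```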
6.4 Large-Sample Tests

After learning the asymptotic properties of the OLS estimator under more general conditions, it is now important to construct suitable tests and derive their limiting distributions. In this section, we will study two large-sample tests of the linear hypothesis

    H_0: Rβ* = r,

where R is a q × k (q < k) nonstochastic matrix with rank q, and r is a pre-specified real vector, as in Section 3.4.
6.4.1 Wald Test

Given that the OLS estimator β̂_T is consistent for some parameter vector β*, one would expect that Rβ̂_T is close to Rβ* when T becomes large. As Rβ* = r under the null hypothesis, whether Rβ̂_T is sufficiently close to r constitutes evidence for or against the null hypothesis. The Wald test is based on the difference between Rβ̂_T and r.

When [B1] and [B3] hold, we have from Theorem 6.9 that

    √T R(β̂_T − β*) →^{D} N(0, R D_o R′),

or equivalently,

    Γ_o^{−1/2} √T R(β̂_T − β*) →^{D} N(0, I_q),

where Γ_o = R D_o R′ = R M_xx^{−1} V_o M_xx^{−1} R′. By Theorem 6.10, asymptotic normality would not be affected if Γ_o is replaced by its consistent estimator, e.g.,

    Γ̂_T = R D̂_T R′ = R ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1} V̂_T ( (1/T) Σ_{t=1}^T x_t x_t′ )^{−1} R′,

where V̂_T is a consistent estimator of V_o. That is,

    Γ̂_T^{−1/2} √T R(β̂_T − β*) →^{D} N(0, I_q).   (6.15)

Under the null hypothesis, Rβ* = r, and the Wald test statistic is the inner product of (6.15):

    W_T = T (Rβ̂_T − r)′ Γ̂_T^{−1} (Rβ̂_T − r).   (6.16)

The result below follows directly from the continuous mapping theorem (Lemma 5.20).

Theorem 6.13  Given the linear specification (6.1), suppose that [B1] and [B3] hold. Then, under the null hypothesis,

    W_T →^{D} χ²(q),

where W_T is given by (6.16) and q is the number of hypotheses.

The Wald test has much wider applicability than the classical tests because it is valid for a wide variety of data which may be non-Gaussian, heteroskedastic, and serially correlated. What matters here is the asymptotic normality result for the OLS estimator and a consistent estimator of V_o. If an inconsistent estimator of V_o is used in the test statistic, both D̂_T and Γ̂_T become inconsistent, and, consequently, the Wald statistic W_T will not have a limiting χ² distribution.
Example 6.14  Test of a subset of coefficients being zero: Given the linear specification

    y_t = x_{1,t}′ b_1 + x_{2,t}′ b_2 + e_t,

where x_{1,t} is (k − s) × 1 and x_{2,t} is s × 1, suppose that the specification is correct for the conditional mean with β_o = [b_{1,o}′ b_{2,o}′]′. If we want to verify whether

    IE(y_t | Y^{t−1}, W^t) = x_{1,t}′ b_{1,o},

then the hypothesis is Rβ_o = 0 with R = [0_{s×(k−s)}  I_s]. The Wald test statistic is

    W_T = T β̂_T′ R′ [ R(X′X/T)^{−1} V̂_T (X′X/T)^{−1} R′ ]^{−1} R β̂_T →^{D} χ²(s).

Note that this is a general expression of the Wald statistic; its exact form depends on V̂_T. When V_o can be consistently estimated by V̂_T = Σ_{t=1}^T ê_t² x_t x_t′/T,

    W_T = β̂_T′ R′ [ R(X′X)^{−1} ( Σ_{t=1}^T ê_t² x_t x_t′ ) (X′X)^{−1} R′ ]^{−1} R β̂_T.

If the null hypothesis is that the i th coefficient is zero, then R is the i th Cartesian unit vector c_i′, so that

    β̂_{i,T} / √(d̂_{ii}) →^{D} N(0, 1),   (6.17)

where d̂_{ii} is the i th diagonal element of D̂_T/T:

    (X′X)^{−1} ( Σ_{t=1}^T ê_t² x_t x_t′ ) (X′X)^{−1}.

Note that √(d̂_{ii}) is usually referred to as White's estimate of the standard error of the i th coefficient. In view of (6.17), we can test the significance of the i th coefficient using the t statistic with the OLS standard error replaced by the Eicker-White estimate of the standard error. We can also base other t tests on the Eicker-White standard errors.

When a consistent estimator of V_o is V̂_T = σ̂²_T (X′X/T), the Wald statistic becomes

    W_T = β̂_T′ R′ [ R(X′X)^{−1} R′ ]^{−1} R β̂_T / σ̂²_T,

which is nothing but s times the standard F statistic. Therefore, for testing the i th coefficient being zero, the standard t ratio will still do. The difference is that, instead of having an exact null distribution, the t ratio now has the asymptotic null distribution N(0, 1).
Remark: The F-test-based version of the Wald test is appropriate only when V̂_T = σ̂²_T (X′X/T) is consistent for V_o. We know that if ε_t are conditionally heteroskedastic and/or x_t ε_t are serially correlated, this estimator is not consistent for V_o. Consequently, the F-test-based version does not have a limiting χ² distribution.
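To make the construction concrete, here is a sketch of the Wald statistic (6.16) with the Eicker-White choice of V̂_T, testing that one coefficient is zero as in Example 6.14. The function names and the simulated design are ours; SciPy is used only to convert the statistic into a p-value from the χ²(q) limit.

```python
import numpy as np
from scipy import stats

def wald_test(X, y, R, r):
    """Wald statistic (6.16) with V_hat_T = sum_t e_t^2 x_t x_t' / T."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    e_hat = y - X @ beta_hat
    cov_beta = XtX_inv @ ((X * e_hat[:, None]**2).T @ X) @ XtX_inv  # D_hat_T / T
    diff = R @ beta_hat - r
    W = diff @ np.linalg.inv(R @ cov_beta @ R.T) @ diff             # equals (6.16)
    q = R.shape[0]
    return W, stats.chi2.sf(W, df=q)

rng = np.random.default_rng(9)
T = 1000
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = 1.0 + 0.5 * X[:, 1] + rng.normal(size=T)       # third regressor irrelevant
R = np.array([[0.0, 0.0, 1.0]])                    # H0: third coefficient is zero
print(wald_test(X, y, R, np.zeros(1)))             # statistic and asymptotic p-value
```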
6.4.2 Lagrange Multiplier Test

From Section 3.4.3 we have seen that, given the constraint Rβ = r, the constrained OLS estimator can be obtained by finding the saddle point of the Lagrangian:

    (1/T) (y − Xβ)′(y − Xβ) + (Rβ − r)′ λ,

where λ is the q × 1 vector of Lagrange multipliers. The underlying idea of the Lagrange Multiplier (LM) test of this constraint is to check whether λ is sufficiently close to zero. Intuitively, λ can be interpreted as the shadow price of this constraint and hence should be small when the constraint is valid (i.e., the null hypothesis is true); otherwise, λ ought to be large. Again, the closeness between λ and zero must be determined by the distribution of the estimator of λ.

It is easy to find the solutions to the Lagrangian above:

    λ̂_T = 2[R(X′X/T)^{−1} R′]^{−1} (Rβ̂_T − r),
    β̃_T = β̂_T − (X′X/T)^{−1} R′ λ̂_T/2.

Here, β̃_T is the constrained OLS estimator of β, and λ̂_T is the basic ingredient of the LM test. Let ε_t = y_t − x_t′ β*, where β* satisfies the constraint Rβ* = r under the null hypothesis, and

    V_o = lim_{T→∞} var( T^{−1/2} Σ_{t=1}^T x_t ε_t ).

Given [B1] and [B3],

    √T λ̂_T = 2[R(X′X/T)^{−1} R′]^{−1} √T(Rβ̂_T − r) →^{D} 2(R M_xx^{−1} R′)^{−1} N(0, R D_o R′),

where D_o = M_xx^{−1} V_o M_xx^{−1}, and the limiting distribution on the right-hand side is N(0, Λ_o) with

    Λ_o = 4(R M_xx^{−1} R′)^{−1} (R D_o R′)(R M_xx^{−1} R′)^{−1}.
We immediately have
$$ \Lambda_o^{-1/2}\,\sqrt{T}\,\ddot{\lambda}_T \xrightarrow{D} N(0, I_q). $$
As before, this result remains valid when $\Lambda_o$ is replaced by a consistent estimator; for example,
$$ \hat{\Lambda}_T = 4\big[R(X'X/T)^{-1}R'\big]^{-1}\big[R(X'X/T)^{-1}\hat{V}_T(X'X/T)^{-1}R'\big]\big[R(X'X/T)^{-1}R'\big]^{-1}, $$
where $\hat{V}_T$ is a consistent estimator of $V_o$ based on the constrained estimation result. Typically, $\hat{V}_T$ involves $x_t$ and the constrained OLS residuals $\ddot{e}_t = y_t - x_t'\ddot{\beta}_T$. Thus,
$$ \hat{\Lambda}_T^{-1/2}\,\sqrt{T}\,\ddot{\lambda}_T \xrightarrow{D} N(0, I_q), \qquad (6.18) $$
and the LM statistic is the inner product of (6.18):
$$ \mathcal{LM}_T = T\,\ddot{\lambda}_T'\,\hat{\Lambda}_T^{-1}\,\ddot{\lambda}_T. \qquad (6.19) $$
The result below is again a consequence of the continuous mapping theorem.
Theorem 6.15 Given the linear specification (6.1), suppose that [B1] and [B3] hold. Then, under the null hypothesis,
$$ \mathcal{LM}_T \xrightarrow{D} \chi^2(q), $$
where $\mathcal{LM}_T$ is given by (6.19).
Let $\ddot{e} = y - X\ddot{\beta}_T$ denote the vector of constrained OLS residuals. By noting that the constrained OLS estimator must satisfy the constraint (i.e., $R\ddot{\beta}_T = r$), we can write
$$ R\hat{\beta}_T - r = R(X'X/T)^{-1}X'(y - X\ddot{\beta}_T)/T = R(X'X/T)^{-1}X'\ddot{e}/T. $$
An alternative expression of $\ddot{\lambda}_T$ is then
$$ \ddot{\lambda}_T = 2\big[R(X'X/T)^{-1}R'\big]^{-1}R(X'X/T)^{-1}X'\ddot{e}/T. $$
It follows that the LM test statistic is algebraically equivalent to
$$ \mathcal{LM}_T = T\,\ddot{e}'X(X'X)^{-1}R'\big[R(X'X/T)^{-1}\hat{V}_T(X'X/T)^{-1}R'\big]^{-1}R(X'X)^{-1}X'\ddot{e}. \qquad (6.20) $$
This expression shows that only constrained estimation is needed to compute the LM
statistic.
A fundamental difference between the LM and Wald tests is that the former involves only constrained estimation, whereas the latter requires unconstrained estimation. As such, the Wald test would be more convenient if constrained specifications are difficult to estimate, such as a linear specification with a nonlinear constraint. The LM test, on the other hand, would be simpler if constrained estimation is easier to compute. Similar to the Wald test, the LM test is also valid for a wide variety of data which may be non-Gaussian, heteroskedastic, and serially correlated. Again, the asymptotic normality result of the OLS estimator and consistent estimation of $V_o$ play a crucial role in Theorem 6.15. If an inconsistent estimator of $V_o$ is used to construct $\hat{\Lambda}_T$, the resulting LM test will not have a limiting $\chi^2$ distribution.
Example 6.16 Test of a subset of coefficients being zero: Given the following specification:
$$ y_t = x_{1,t}'\,b_1 + e_t, $$
suppose that we want to know whether
$$ \mathrm{IE}(y_t \mid \mathcal{Y}^{t-1}, \mathcal{W}^t) = x_{1,t}'\,b_{1,o} + x_{2,t}'\,b_{2,o}, $$
where $x_{1,t}$ is $(k-s)\times 1$ and $x_{2,t}$ is $s\times 1$. The specification would be correct for the conditional mean if $b_{2,o} = 0$. Letting $\beta_o = [b_{1,o}'\;\; b_{2,o}']'$, the null hypothesis is $R\beta_o = 0$ with $R = [\,0_{s\times(k-s)}\;\; I_s\,]$. The specification above is then a constrained version of
$$ y_t = x_{1,t}'\,b_1 + x_{2,t}'\,b_2 + e_t. $$
For this specification, the constrained OLS estimator is $\ddot{\beta}_T = (\ddot{b}_{1,T}'\;\; 0')'$, where
$$ \ddot{b}_{1,T} = \Big(\sum_{t=1}^T x_{1,t}x_{1,t}'\Big)^{-1}\sum_{t=1}^T x_{1,t}y_t = (X_1'X_1)^{-1}X_1'y, $$
which is obtained from the constrained specification. The LM statistic now can be computed as (6.20) with $X = [X_1\;\; X_2]$ and the vector of constrained OLS residuals $\ddot{e} = y - X_1\ddot{b}_{1,T}$.
When $\hat{V}_T = \ddot{\sigma}_T^2\,(X'X/T)$ is consistent for $V_o$ under the null hypothesis, where $\ddot{\sigma}_T^2 = \sum_{t=1}^T \ddot{e}_t^2/(T-k+s)$, we have
$$ \big(R\hat{D}_TR'\big)^{-1} = \frac{1}{\ddot{\sigma}_T^2}\big[R(X'X/T)^{-1}R'\big]^{-1}. $$
It can be verified that, by the Frisch-Waugh-Lovell Theorem,
$$ R(X'X)^{-1}R' = \big[X_2'(I-P_1)X_2\big]^{-1}, $$
$$ R(X'X)^{-1}X' = \big[X_2'(I-P_1)X_2\big]^{-1}X_2'(I-P_1), $$
where $P_1 = X_1(X_1'X_1)^{-1}X_1'$. The LM statistic now simplifies to
$$ \mathcal{LM}_T = \ddot{e}'(I-P_1)X_2\big[X_2'(I-P_1)X_2\big]^{-1}X_2'(I-P_1)\ddot{e}\big/\ddot{\sigma}_T^2 = \ddot{e}'X_2\big[X_2'(I-P_1)X_2\big]^{-1}X_2'\ddot{e}\big/\ddot{\sigma}_T^2, $$
because $X_1'\ddot{e} = 0$ so that $P_1\ddot{e} = 0$. As $\ddot{\sigma}_T^2 = \ddot{e}'\ddot{e}/(T-k+s)$,
$$ \mathcal{LM}_T = \frac{\ddot{e}'X(X'X)^{-1}X'\ddot{e}}{\ddot{e}'\ddot{e}/(T-k+s)} = (T-k+s)\,R^2, $$
where $R^2$ is the (non-centered) coefficient of determination of regressing $\ddot{e}$ on $X$. If the estimator $\ddot{\sigma}_T^2 = \sum_{t=1}^T \ddot{e}_t^2/T$ is used, we simply have $TR^2$ as the test statistic. It must be emphasized that the simpler version of the LM statistic is valid only when $\ddot{\sigma}_T^2(X'X/T)$ is a consistent estimator of $V_o$; otherwise, $TR^2$ does not have a limiting $\chi^2$ distribution. If the LM statistic is based on the heteroskedasticity-consistent covariance matrix estimator
$$ \hat{V}_T = \frac{1}{T}\sum_{t=1}^T \ddot{e}_t^2\,x_tx_t', $$
it cannot be simplified to $TR^2$.
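The $TR^2$ form is easy to compute: estimate the constrained specification, then regress the constrained residuals on the full regressor matrix. A small sketch (not from the text; it assumes conditional homoskedasticity so that $\ddot{\sigma}_T^2(X'X/T)$ is consistent for $V_o$, and its names are illustrative):

```python
import numpy as np

def lm_tr2(y, X1, X2):
    """TR^2 version of the LM test for adding the regressors in X2."""
    T = len(y)
    b1 = np.linalg.lstsq(X1, y, rcond=None)[0]      # constrained OLS
    e = y - X1 @ b1                                  # constrained residuals
    X = np.hstack([X1, X2])                          # full regressor matrix
    e_fit = X @ np.linalg.lstsq(X, e, rcond=None)[0]
    R2 = (e_fit @ e_fit) / (e @ e)                   # non-centered R^2
    return T * R2                                    # approx. chi^2(s) under H0
```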
As the LM test only requires constrained estimation, it is based on the simpler, constrained specification and checks whether the additional $s$ regressors should be included as well. By contrast, the Wald test in Example 6.14 checks whether the unconstrained specification should exclude $s$ redundant regressors. Thus, the LM test permits "testing up" (from a simpler specification), while the Wald test can be employed to "test down" (from a more complex specification).
Remark: It can also be shown that the Wald and LM statistics are asymptotically equivalent under the null hypothesis, i.e.,
$$ \mathcal{W}_T - \mathcal{LM}_T \xrightarrow{\mathrm{IP}} 0; $$
see Exercise 6.9. If $V_o$ is known, these two statistics turn out to be algebraically equivalent. Note, however, that these two tests may result in conflicting statistical inferences in finite samples. For instance, it can be shown that, when there is neither heteroskedasticity nor serial correlation, $\mathcal{W}_T \ge \mathcal{LM}_T$ in numerical values; see, e.g., Godfrey (1988) for more details.
6.5 Application: Autoregressive Models
To analyze time series data, it is quite common to postulate an autoregressive (AR) specification, in the sense that the regressors are nothing but lagged dependent variables. In particular, an AR(p) specification is such that
$$ y_t = \alpha_0 + \alpha_1 y_{t-1} + \cdots + \alpha_p y_{t-p} + e_t, \qquad t = p+1, \ldots, T. $$
The specification in Examples 6.6 and 6.8 is an AR(1) without the constant term. The OLS estimators of $\alpha_0, \ldots, \alpha_p$ are obtained by regressing $y_t$ on $\boldsymbol{y}_{t-1} = [y_{t-1}\;\cdots\;y_{t-p}]'$ for $t = p+1, \ldots, T$. The OLS variance estimator is
$$ \hat{\sigma}_T^2 = \frac{1}{T-2p-1}\sum_{t=p+1}^T \hat{e}_t^2, $$
where $\hat{e}_t$ are the OLS residuals. It is also common to compute the variance estimator as the sum of squared residuals divided by $T$ or $T-p$. The properties of the OLS estimators depend crucially on whether $y_t$ is weakly stationary.
6.5.1 Properties of the OLS estimators
Recall that $y_t$ is weakly stationary if its mean, variance and autocovariances are all independent of $t$. When $y_t$ is weakly stationary with finite fourth moment, $y_t$, $y_t\boldsymbol{y}_{t-1}$ and $\boldsymbol{y}_{t-1}\boldsymbol{y}_{t-1}'$ all obey a WLLN.

Let $\mu_o = \mathrm{IE}(y_t)$ and $\gamma_j = \mathrm{cov}(y_t, y_{t-j})$ for $j = 0, 1, 2, \ldots$. Clearly, $\gamma_0$ is the variance of $y_t$ and $\gamma_{-j} = \gamma_j$. Then,
$$ \Gamma = \mathrm{var}(\boldsymbol{y}_{t-1}) = \begin{bmatrix} \gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{p-1} \\ \gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{p-2} \\ \gamma_2 & \gamma_1 & \gamma_0 & \cdots & \gamma_{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \gamma_{p-1} & \gamma_{p-2} & \gamma_{p-3} & \cdots & \gamma_0 \end{bmatrix}. $$
The WLLN effect ensures:
$$ \frac{1}{T}\sum_{t=p+1}^T y_t \xrightarrow{\mathrm{IP}} \mu_o, $$
$$ \frac{1}{T}\sum_{t=p+1}^T y_t\,\boldsymbol{y}_{t-1} \xrightarrow{\mathrm{IP}} \big[\gamma_1 + \mu_o^2 \;\;\cdots\;\; \gamma_p + \mu_o^2\big]', $$
$$ \frac{1}{T}\sum_{t=p+1}^T \boldsymbol{y}_{t-1}\boldsymbol{y}_{t-1}' \xrightarrow{\mathrm{IP}} \Gamma + \mu_o^2\,\boldsymbol{\ell}\boldsymbol{\ell}', $$
where $\boldsymbol{\ell}$ is the $p\times 1$ vector of ones. It follows that
$$ \hat{\boldsymbol{\alpha}}_T \xrightarrow{\mathrm{IP}} \begin{bmatrix} 1 & \mu_o\boldsymbol{\ell}' \\ \mu_o\boldsymbol{\ell} & \Gamma + \mu_o^2\,\boldsymbol{\ell}\boldsymbol{\ell}' \end{bmatrix}^{-1} \begin{bmatrix} \mu_o \\ \gamma_1 + \mu_o^2 \\ \vdots \\ \gamma_p + \mu_o^2 \end{bmatrix}. $$
In particular, if the specification does not contain the constant term and $\boldsymbol{y}_{t-1}$ contains only $y_{t-1}$, the OLS estimator converges in probability to
$$ (\gamma_1 + \mu_o^2)\big/(\gamma_0 + \mu_o^2). $$
When $\mu_o = 0$, this probability limit simplifies to $\gamma_1/\gamma_0$, which is precisely what we obtained in Example 6.2.
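A quick simulation (my own illustration, not from the text) makes this probability limit concrete: for a mean-zero, weakly stationary AR(1) process, the OLS slope from a regression without a constant should approach $\gamma_1/\gamma_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, sigma, T = 0.6, 1.0, 200_000

y = np.zeros(T)
eps = rng.normal(scale=sigma, size=T)
for t in range(1, T):                     # generate y_t = phi*y_{t-1} + eps_t
    y[t] = phi * y[t - 1] + eps[t]

y_lag, y_cur = y[:-1], y[1:]
ols_slope = (y_lag @ y_cur) / (y_lag @ y_lag)   # regression without constant
print(ols_slope)                                 # close to gamma_1/gamma_0 = phi
```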
If $y_t$ is generated as an AR(p) process:
$$ y_t = c_o + \psi_1 y_{t-1} + \psi_2 y_{t-2} + \cdots + \psi_p y_{t-p} + \epsilon_t, $$
the true parameters $c_o, \psi_1, \ldots, \psi_p$ must satisfy certain constraints so as to ensure weak stationarity (see below). In addition, if
$$ \mathrm{IE}(y_t \mid \mathcal{Y}^{t-1}, \mathcal{W}^t) = c_o + \psi_1 y_{t-1} + \psi_2 y_{t-2} + \cdots + \psi_p y_{t-p}, $$
Theorem 6.3 ensures that the OLS estimators will converge in probability to the true parameters. Note, however, that $\epsilon_t$ may be a white noise but not a martingale difference sequence.

Whether or not the AR(p) specification is a correct specification for the conditional mean function, the resulting OLS estimators, with suitable normalization, are asymptotically normally distributed.
6.5.2 Difference Equation
Suppose that $y_t$ are generated according to the following first-order difference equation:
$$ y_t = \psi_1 y_{t-1} + u_t, \qquad t = 0, 1, 2, \ldots. $$
It is easily verified that, by recursive substitution,
$$ y_{t+j} = \psi_1^{j+1} y_{t-1} + \psi_1^{j}u_t + \psi_1^{j-1}u_{t+1} + \cdots + \psi_1 u_{t+j-1} + u_{t+j}. $$
Define the dynamic multiplier of $u_t$ on $y_{t+j}$ as
$$ \partial y_{t+j}/\partial u_t = \psi_1^{j}, $$
which is the effect of a given change of $u_t$ on $y_{t+j}$. When $|\psi_1| < 1$, the dynamic multiplier approaches zero as $j$ tends to infinity, so that the effect of $u_t$ eventually dies out. As $y_t$ does not depend much on what happens in the distant past, the difference equation is said to be stable. When $|\psi_1| > 1$, the difference equation is explosive in the sense that the effect of $u_t$ on future $y$'s grows exponentially fast. If $\psi_1 = 1$, $u_t$ has a constant effect on future $y$'s.

Consider now a $p$-th order difference equation:
$$ y_t = \psi_1 y_{t-1} + \psi_2 y_{t-2} + \cdots + \psi_p y_{t-p} + u_t, $$
which can be expressed as a first-order vector difference equation:
$$ \boldsymbol{\xi}_t = F\boldsymbol{\xi}_{t-1} + \boldsymbol{v}_t, $$
with
$$ \boldsymbol{\xi}_t = \begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p+1} \end{bmatrix}, \qquad F = \begin{bmatrix} \psi_1 & \psi_2 & \cdots & \psi_{p-1} & \psi_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \qquad \boldsymbol{v}_t = \begin{bmatrix} u_t \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. $$
Recursive substitution yields
$$ \boldsymbol{\xi}_{t+j} = F^{j+1}\boldsymbol{\xi}_{t-1} + F^{j}\boldsymbol{v}_t + F^{j-1}\boldsymbol{v}_{t+1} + \cdots + F\boldsymbol{v}_{t+j-1} + \boldsymbol{v}_{t+j}. $$
The dynamic multiplier of $\boldsymbol{v}_t$ on $\boldsymbol{\xi}_{t+j}$ is
$$ \partial\boldsymbol{\xi}_{t+j}/\partial\boldsymbol{v}_t = F^{j}, $$
and its $(m,n)$-th element is denoted as $f_j(m,n)$. It is straightforward to verify that
$$ y_{t+j} = f_{j+1}(1,1)\,y_{t-1} + \cdots + f_{j+1}(1,p)\,y_{t-p} + \sum_{i=0}^{j} f_i(1,1)\,u_{t+j-i}. $$
The dynamic multiplier of $u_t$ on $y_{t+j}$ is thus
$$ \partial y_{t+j}/\partial u_t = f_j(1,1), $$
the $(1,1)$-th element of $F^{j}$.
Recall that the eigenvalues of $F$ solve the equation $\det(F - \lambda I) = 0$, which is known as the characteristic equation of $F$. This equation is of the following form:
$$ \lambda^{p} - \psi_1\lambda^{p-1} - \cdots - \psi_{p-1}\lambda - \psi_p = 0. $$
When all the eigenvalues of $F$ are distinct, $F$ can be diagonalized by a nonsingular matrix $C$ such that $C^{-1}FC = \Lambda$, where $\Lambda$ is the diagonal matrix with all the eigenvalues of $F$ on its main diagonal. Writing $F = C\Lambda C^{-1}$, we have $F^{j} = C\Lambda^{j}C^{-1}$. When all the eigenvalues of $F$ are less than one in modulus (i.e., for any complex eigenvalue $\lambda = a + bi$, $i = (-1)^{1/2}$, $|\lambda| = (a^2+b^2)^{1/2} < 1$), $\Lambda^{j}$ tends to the zero matrix as $j$ goes to infinity, and so does $F^{j}$. In this case, $f_j(1,1)$ approaches zero, so that the difference equation is stable. Thus, a $p$-th order difference equation is stable provided that the eigenvalues of $F$ (the roots of the characteristic equation) are all less than one in modulus. This is equivalent to saying that these roots must lie inside the unit circle on the complex plane. This condition requires that the coefficients $\psi_i$ satisfy certain constraints.
If a root of the characteristic equation equals one in modulus (i.e., lies on the unit circle), such a root is usually referred to as a unit root. When the characteristic equation has a unit root with all remaining roots less than one in modulus,
$$ \lim_{j\to\infty} F^{j} = C\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}C^{-1}, $$
so that its $(1,1)$-th element $f_j(1,1)$ tends to a constant. If there is at least one eigenvalue greater than one in modulus, this eigenvalue eventually dominates, and $F^{j}$ will explode. The difference equation is then explosive.
Alternatively, setting $\lambda = 1/z$ and multiplying the characteristic equation by $z^{p}$, we obtain
$$ 1 - \psi_1 z - \cdots - \psi_{p-1}z^{p-1} - \psi_p z^{p} = 0. $$
The condition that all the roots of the characteristic equation are inside the unit circle is thus equivalent to requiring all the roots of the polynomial above to be outside the unit circle.
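The companion-matrix formulation makes the stability check easy to program. The following sketch (my own illustration, assuming the coefficient ordering $\psi_1,\ldots,\psi_p$) builds $F$ and checks whether all its eigenvalues lie inside the unit circle:

```python
import numpy as np

def ar_is_stable(psi):
    """Stability of an AR(p) difference equation via its companion matrix F."""
    p = len(psi)
    F = np.zeros((p, p))
    F[0, :] = psi                        # first row: psi_1, ..., psi_p
    F[1:, :-1] = np.eye(p - 1)           # subdiagonal of ones
    eigvals = np.linalg.eigvals(F)
    return np.all(np.abs(eigvals) < 1)   # all roots inside the unit circle

print(ar_is_stable([0.5, 0.3]))   # True: a stable AR(2)
print(ar_is_stable([0.9, 0.3]))   # False: one root exceeds one in modulus
```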
6.5.3 Weak Stationarity
Let $B$ denote the back-shift operator, in the sense that $By_t = y_{t-1}$. We will also write $B^2y_t = B(By_t) = y_{t-2}$, $B^3y_t = B(B^2y_t) = y_{t-3}$, and so on. This operator has no effect on constants, i.e., for any constant $c$, $Bc = c$. Hence, the back-shift operator has the
linear property:
$$ B(cy_t + dz_t) = c(By_t) + d(Bz_t) = cy_{t-1} + dz_{t-1}, $$
for any constants $c$ and $d$.
Consider now the AR(p) process
$$ y_t = c + \psi_1 y_{t-1} + \cdots + \psi_p y_{t-p} + \epsilon_t, $$
which can be written compactly as
$$ \Psi(B)\,y_t = c + \epsilon_t, $$
where $\Psi(B) = 1 - \psi_1 B - \cdots - \psi_p B^{p}$ is a polynomial in $B$. When $\Psi(B) = 1 - \psi_1 B$, it is an AR(1) process. As discussed above, this system is stable if all the roots of $\Psi(z) = 0$ are outside the unit circle.
Consider again an AR(1) process:
$$ (1 - \psi_1 B)\,y_t = c + \epsilon_t. $$
Note that for large $t$,
$$ (1 + \psi_1 B + \psi_1^2 B^2 + \cdots + \psi_1^{t}B^{t})(1 - \psi_1 B)\,y_t = (1 - \psi_1^{t+1}B^{t+1})\,y_t \approx y_t, $$
provided that $B^{t+1}y_t = y_{-1}$ is finite and $|\psi_1| < 1$. This suggests that, when $|\psi_1| < 1$, the inverse of $(1 - \psi_1 B)$ can be defined as
$$ (1 - \psi_1 B)^{-1} = \lim_{t\to\infty}\,(1 + \psi_1 B + \psi_1^2 B^2 + \cdots + \psi_1^{t}B^{t}), $$
so that $(1 - \psi_1 B)^{-1}(1 - \psi_1 B) = I$, the identity operator. It follows that
$$ y_t = (1 - \psi_1 B)^{-1}(c + \epsilon_t) = \frac{c}{1 - \psi_1} + (1 - \psi_1 B)^{-1}\epsilon_t. $$
When $\epsilon_t$ is a white noise with mean zero and variance $\sigma_\epsilon^2$, we find $\mathrm{IE}(y_t) = c/(1-\psi_1)$ and
$$ \gamma_0 = (1 + \psi_1^2 + \psi_1^4 + \cdots)\,\sigma_\epsilon^2 = \sigma_\epsilon^2/(1-\psi_1^2), $$
$$ \gamma_1 = (\psi_1 + \psi_1^3 + \psi_1^5 + \cdots)\,\sigma_\epsilon^2 = \psi_1\big[\sigma_\epsilon^2/(1-\psi_1^2)\big], $$
$$ \gamma_2 = (\psi_1^2 + \psi_1^4 + \psi_1^6 + \cdots)\,\sigma_\epsilon^2 = \psi_1^2\big[\sigma_\epsilon^2/(1-\psi_1^2)\big], $$
and, in general,
$$ \gamma_j = \psi_1^{j}\,\frac{\sigma_\epsilon^2}{1-\psi_1^2}. $$
Thus, $y_t$ has a constant mean, a constant variance, and autocovariances depending on $j$ but not on $t$. This shows that $y_t$ is a weakly stationary process. On the other hand, $y_t$ cannot be weakly stationary when the difference equation is not stable. Note also that the autocovariances can be expressed as
$$ \gamma_j = \psi_1\gamma_{j-1} = \psi_1^{j}\gamma_0, \qquad j = 1, 2, \ldots, $$
and the autocorrelations are $\rho_j = \psi_1^{j} = \psi_1\rho_{j-1}$. That is, both the autocovariances and autocorrelations have the same AR(1) structure. If we view the autocorrelations of a process as its memory, a weakly stationary AR(1) process has exponentially decaying memory and is also said to be of short memory.
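A short numerical check (my own illustration, not from the text) of the formula $\gamma_j = \psi_1^{j}\sigma_\epsilon^2/(1-\psi_1^2)$ and the exponential decay of the memory of an AR(1) process:

```python
import numpy as np

rng = np.random.default_rng(1)
psi, sigma, T = 0.7, 1.0, 500_000

y = np.zeros(T)
eps = rng.normal(scale=sigma, size=T)
for t in range(1, T):
    y[t] = psi * y[t - 1] + eps[t]

for j in range(4):
    sample_gamma = np.var(y) if j == 0 else np.cov(y[j:], y[:T - j])[0, 1]
    theory_gamma = psi**j * sigma**2 / (1 - psi**2)
    print(j, round(sample_gamma, 3), round(theory_gamma, 3))
```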
The previous results are readily generalized. Consider the AR(p) process $\Psi(B)y_t = c + \epsilon_t$, where $\Psi(B)$ is a $p$-th order polynomial in $B$. When all the roots of $\Psi(z)$ are outside the unit circle, $y_t$ is weakly stationary with $\mathrm{IE}(y_t) = c/(1 - \psi_1 - \psi_2 - \cdots - \psi_p)$, autocovariances
$$ \gamma_0 = \psi_1\gamma_1 + \psi_2\gamma_2 + \cdots + \psi_p\gamma_p + \sigma_\epsilon^2, $$
$$ \gamma_j = \psi_1\gamma_{j-1} + \psi_2\gamma_{j-2} + \cdots + \psi_p\gamma_{j-p}, \qquad j = 1, 2, \ldots, $$
and autocorrelations
$$ \rho_j = \psi_1\rho_{j-1} + \psi_2\rho_{j-2} + \cdots + \psi_p\rho_{j-p}, \qquad j = 1, 2, \ldots. $$
The equation for the autocorrelations is also known as the Yule-Walker equation, which has the same AR(p) structure. As the initial value is $\rho_0 = 1$, it is clear that $\rho_j \to 0$ exponentially fast as $j$ tends to infinity. Hence, a weakly stationary AR(p) process is also a short memory process.
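The Yule-Walker recursion is straightforward to iterate once the first few autocorrelations are pinned down. For a stationary AR(2), the $j=1$ equation together with $\rho_{-1} = \rho_1$ gives $\rho_1 = \psi_1/(1-\psi_2)$; the sketch below (my own illustration) then applies the recursion for higher lags:

```python
import numpy as np

def ar2_autocorrelations(psi1, psi2, max_lag=10):
    """Autocorrelations of a stationary AR(2) via the Yule-Walker recursion."""
    rho = np.empty(max_lag + 1)
    rho[0] = 1.0
    rho[1] = psi1 / (1.0 - psi2)          # from the j = 1 Yule-Walker equation
    for j in range(2, max_lag + 1):
        rho[j] = psi1 * rho[j - 1] + psi2 * rho[j - 2]
    return rho

print(ar2_autocorrelations(0.5, 0.3))     # decays toward zero for a stable AR(2)
```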
6.6 Limitations of the Linear Specification
In this chapter the classical conditions are relaxed so as to allow for more general data in linear regressions. Careful readers must have noticed that, aside from the conditions on the stochastic properties of data, there is always a condition of correct specification ([A2](i) in Chapter 3 and [B2] in this chapter). Such a condition may be too strong in practice, as discussed in Section 3.7. In the context of this chapter, we also notice that, while $\mathrm{IE}(y_t \mid \mathcal{Y}^{t-1}, \mathcal{W}^t)$ must be a function of the elements of $\mathcal{Y}^{t-1}$ and $\mathcal{W}^t$, [B2] requires this function to be linear. A sufficient condition for a linear conditional mean function is that all the elements of $\mathcal{Y}^{t-1}$ and $\mathcal{W}^t$ are jointly normally distributed; this condition
is also much too strong in practice, however. If joint normality is unlikely, there is no guarantee that [B2] is true. Hence, the OLS estimator may converge to some parameter vector that does not have any meaningful interpretation. This suggests that we should not confine ourselves to linear specifications and may want to explore nonlinear specifications instead. The least squares theory for nonlinear specifications is the topic to which we now turn.
Exercises
6.1 Suppose that $y_t = x_t'\beta_o + \epsilon_t$ such that $\mathrm{IE}(\epsilon_t) = 0$ for all $t$.
(a) If $\{x_t\}$ and $\{\epsilon_t\}$ are two mutually independent sequences, i.e., $x_t$ and $\epsilon_\tau$ are independent for any $t$ and $\tau$, is $\hat{\beta}_T$ unbiased?
(b) If $\{x_t\}$ and $\{\epsilon_t\}$ are two mutually uncorrelated sequences, i.e., $\mathrm{IE}(x_t\epsilon_\tau) = 0$ for any $t$ and $\tau$, is $\hat{\beta}_T$ unbiased?
6.2 Consider the specification $y_t = x_t'\beta + e_t$, where $x_t$ is $k\times 1$. Suppose that
$$ \mathrm{IE}(y_t \mid \mathcal{Y}^{t-1}, \mathcal{W}^t) = z_t'\gamma_o, $$
where $z_t$ is $m\times 1$. Assuming suitable weak laws for $x_t$ and $z_t$, what is the probability limit of the OLS estimator of $\beta$?
6.3 Consider the specification $y_t = x_t'\beta + z_t'\gamma + e_t$, where $x_t$ is $k_1\times 1$ and $z_t$ is $k_2\times 1$. Suppose that
$$ \mathrm{IE}(y_t \mid \mathcal{Y}^{t-1}, \mathcal{W}^t) = x_t'\beta_o. $$
Assuming suitable weak laws for $x_t$ and $z_t$, what are the probability limits of the OLS estimators of $\beta$ and $\gamma$?
6.4 Consider a linear specification with $x_t = (1\;\; d_t)'$, where $d_t$ is a one-time dummy: $d_t = 1$ if $t = t^*$, a particular date, and $d_t = 0$ otherwise. What is
$$ \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^T \mathrm{IE}(x_tx_t')\,? $$
How can you prove (or disprove) OLS consistency?
6.5 State the conditions under which the OLS estimator of seemingly unrelated regressions is consistent and asymptotically normally distributed.
6.6 For Example 6.12, suppose that $\epsilon_t$ are serially correlated with unequal variances. Give the asymptotic normality result for the OLS estimators $\hat{a}_T$ and $\hat{b}_T$.
6.7 Given $y_t = x_t'\beta_o + \epsilon_t$, if $\epsilon_t$ is a martingale difference sequence with respect to $\mathcal{Y}^{t-1}$ and $\mathcal{W}^t$, prove that $\mathrm{IE}(\epsilon_t) = 0$ and $\mathrm{IE}(\epsilon_t\epsilon_\tau) = 0$ for all $t \neq \tau$. Is $\epsilon_t$ a white noise? Why or why not?
6.8 Given the conditions of Theorem 6.3, let $\epsilon_t = y_t - x_t'\beta_o$ be such that $\mathrm{IE}(\epsilon_t^2\mid x_t) = \sigma_o^2$. Prove that the standard OLS variance estimator $\hat{\sigma}_T^2$ is weakly consistent for $\sigma_o^2$.
6.9 Prove that under the null hypothesis, $\mathcal{W}_T - \mathcal{LM}_T \xrightarrow{\mathrm{IP}} 0$. Also show that when $V_o$ is known, $\mathcal{W}_T = \mathcal{LM}_T$.
References
Andrews, Donald W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation, Econometrica, 59, 817-858.
Davidson, James (1994). Stochastic Limit Theory, New York, NY: Oxford University Press.
Gallant, A. Ronald (1987). Nonlinear Statistical Models, New York, NY: Wiley.
Eicker (1967).
Godfrey, L. G. (1988). Misspecification Tests in Econometrics: The Lagrange Multiplier Principle and Other Approaches, New York, NY: Cambridge University Press.
Newey, Whitney K. and Kenneth West (1987). A simple positive semi-definite heteroskedasticity and autocorrelation consistent covariance matrix, Econometrica, 55, 703-708.
White, Halbert (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity, Econometrica, 48, 817-838.
White, Halbert (1984). Asymptotic Theory for Econometricians, Orlando, FL: Academic Press.
Chapter 7
Nonlinear Least Squares Theory
For real world data, it is hard to believe that linear specifications are universal in characterizing all economic relationships. A straightforward extension of linear specifications is to consider specifications that are nonlinear in parameters. For example, the function $\alpha + \beta x^{\gamma}$ offers more flexibility than the simple linear function $\alpha + \beta x$. Although such an extension is quite natural, it also creates various difficulties. First, deciding on an appropriate nonlinear function is typically difficult. Second, it is usually cumbersome to estimate nonlinear specifications and analyze the properties of the resulting estimators. Last, but not least, estimation results of a nonlinear specification may not be easily interpreted.

Despite these difficulties, more and more empirical evidence shows that many economic relationships are in fact nonlinear. Examples include nonlinear production functions, regime switching in output series, and time series models that can capture asymmetric dynamic patterns. In this chapter, we concentrate on the estimation of, and hypothesis testing for, nonlinear specifications. For more discussion of nonlinear regressions we refer to Gallant (1987), Gallant and White (1988), Davidson and MacKinnon (1993), and Bierens (1994).
7.1 Nonlinear Specifications
We consider the nonlinear specification
$$ y = f(x;\beta) + e(\beta), \qquad (7.1) $$
where $f$ is a given function with $x$ an $\ell\times 1$ vector of explanatory variables and $\beta$ a $k\times 1$ vector of parameters, and $e(\beta)$ denotes the error of the specification. Note that for
a nonlinear specification, the number of explanatory variables $\ell$ need not be the same as the number of parameters $k$. This formulation includes the linear specification as a special case with $f(x;\beta) = x'\beta$ and $\ell = k$. Clearly, nonlinear functions that can be expressed in a linear form should be treated as linear specifications. For example, a specification involving a structural change is nonlinear in parameters:
$$ y_t = \begin{cases} \alpha + \beta x_t + e_t, & t \le t^*, \\ (\alpha+\delta) + \beta x_t + e_t, & t > t^*, \end{cases} $$
but it is equivalent to the linear specification:
$$ y_t = \alpha + \delta D_t + \beta x_t + e_t, $$
where $D_t = 0$ if $t \le t^*$ and $D_t = 1$ if $t > t^*$. Our discussion in this chapter focuses on the specifications that cannot be expressed as linear functions.
There are numerous nonlinear specifications considered in empirical applications. A flexible nonlinear specification is
$$ y_t = \alpha + \beta\,\frac{x_t^{\lambda} - 1}{\lambda} + e_t, $$
where $(x_t^{\lambda}-1)/\lambda$ is the so-called Box-Cox transform of $x_t$, which yields different functions depending on the value of $\lambda$. For example, the Box-Cox transform yields $x_t - 1$ when $\lambda = 1$, $1 - 1/x_t$ when $\lambda = -1$, and a value close to $\ln x_t$ when $\lambda$ approaches zero. This function is thus more flexible than, e.g., the linear specification $\alpha + \beta x$ and the nonlinear specification $\alpha + \beta x^{\gamma}$. Note that the Box-Cox transformation is often applied to positively valued variables.
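The transform and its limiting log case are simple to implement; a small sketch (my own illustration, assuming $x > 0$ as noted above):

```python
import numpy as np

def box_cox(x, lam, tol=1e-8):
    """Box-Cox transform (x^lam - 1)/lam, with the log limit as lam -> 0."""
    x = np.asarray(x, dtype=float)
    if abs(lam) < tol:
        return np.log(x)                 # limiting case
    return (x**lam - 1.0) / lam

x = np.array([0.5, 1.0, 2.0, 4.0])
print(box_cox(x, 1.0))    # x - 1
print(box_cox(x, -1.0))   # 1 - 1/x
print(box_cox(x, 0.0))    # ln x
```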
In the study of firm behavior, the celebrated CES (constant elasticity of substitution) production function suggests characterizing output $y$ by the following nonlinear function:
$$ y = \alpha\big[\delta L^{-\lambda} + (1-\delta)K^{-\lambda}\big]^{-1/\lambda}, $$
where $L$ denotes labor, $K$ denotes capital, and $\alpha$, $\delta$, $\lambda$ are parameters such that $\alpha > 0$, $0 < \delta < 1$ and $\lambda \ge -1$. The elasticity of substitution for a CES production function is
$$ s = \frac{d\,\ln(K/L)}{d\,\ln(MP_L/MP_K)} = \frac{1}{1+\lambda} \ge 0, $$
where $MP$ denotes marginal product. This function includes the linear, Cobb-Douglas, and Leontief production functions as special cases. To estimate the CES production function, the following nonlinear specification is usually considered:
$$ \ln y = \ln\alpha - \frac{1}{\lambda}\ln\big[\delta L^{-\lambda} + (1-\delta)K^{-\lambda}\big] + e; $$
for a different estimation strategy, see Exercise 7.3. On the other hand, the translog (transcendental logarithmic) production function is nonlinear in variables but linear in parameters:
$$ \ln y = \beta_1 + \beta_2\ln L + \beta_3\ln K + \beta_4(\ln L)(\ln K) + \beta_5(\ln L)^2 + \beta_6(\ln K)^2, $$
and hence can be estimated by the OLS method.
In the time series context, a nonlinear AR(p) specification is
$$ y_t = f(y_{t-1}, \ldots, y_{t-p}) + e_t. $$
For example, the exponential autoregressive (EXPAR) specification takes the following form:
$$ y_t = \sum_{j=1}^{p}\big[\alpha_j + \beta_j\exp\big(-y_{t-1}^2\big)\big]\,y_{t-j} + e_t, $$
where in some cases one may replace $y_{t-1}^2$ in the exponential function with $y_{t-j}^2$ for $j = 1, \ldots, p$. This specification was designed to describe physical vibration whose amplitude depends on the magnitude of $y_{t-1}$.
As another example, consider the self-exciting threshold autoregressive (SETAR) specification:
$$ y_t = \begin{cases} a_0 + a_1y_{t-1} + \cdots + a_py_{t-p} + e_t, & \text{if } y_{t-d}\in(-\infty, c], \\ b_0 + b_1y_{t-1} + \cdots + b_py_{t-p} + e_t, & \text{if } y_{t-d}\in(c, \infty), \end{cases} $$
where $d$ is known as the delay parameter, an integer between 1 and $p$, and $c$ is the threshold parameter. Note that the SETAR model is different from the structural change model in that the parameters switch from one regime to another depending on whether a past realization $y_{t-d}$ exceeds the threshold value $c$. This specification can be easily extended to allow for $r$ threshold parameters, so that the specification switches among $r+1$ different dynamic structures.
The SETAR specification above can be written as
$$ y_t = a_0 + \sum_{j=1}^{p} a_jy_{t-j} + \Big(\delta_0 + \sum_{j=1}^{p}\delta_jy_{t-j}\Big)\mathbf{1}_{\{y_{t-d}>c\}} + e_t, $$
where $a_j + \delta_j = b_j$, and $\mathbf{1}$ denotes the indicator function. To avoid abrupt changes of parameters, one may replace the indicator function with a smooth function $h$ so as
to allow for smoother transitions of structures. It is typical to choose the function $h$ as a distribution function, e.g.,
$$ h(y_{t-d}; c, \gamma) = \frac{1}{1 + \exp[-(y_{t-d}-c)/\gamma]}, $$
where $c$ is still the threshold value and $\gamma$ is a scale parameter. This leads to the following smooth threshold autoregressive (STAR) specification:
$$ y_t = a_0 + \sum_{j=1}^{p} a_jy_{t-j} + \Big(\delta_0 + \sum_{j=1}^{p}\delta_jy_{t-j}\Big)h(y_{t-d}; c, \gamma) + e_t. $$
Clearly, this specification behaves similarly to a SETAR specification when $|(y_{t-d}-c)/\gamma|$ is very large. For more nonlinear time series models and their motivations we refer to Tong (1990).
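The role of the scale parameter is easy to visualize numerically. The following sketch (my own illustration; the parameter names are mine) evaluates the logistic transition function and shows that it approaches the SETAR indicator $\mathbf{1}_{\{y_{t-d}>c\}}$ as the scale shrinks:

```python
import numpy as np

def star_transition(y_lag_d, c, gamma):
    """Logistic transition used in the STAR specification."""
    return 1.0 / (1.0 + np.exp(-(y_lag_d - c) / gamma))

y_vals = np.linspace(-2, 2, 5)
print(star_transition(y_vals, c=0.0, gamma=1.0))    # smooth transition
print(star_transition(y_vals, c=0.0, gamma=0.05))   # nearly a step function
```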
Another well-known nonlinear specification is the so-called artificial neural network, which has been widely used in cognitive science, engineering, biology and linguistics. A 3-layer neural network can be expressed as
$$ f(x_1, \ldots, x_p;\beta) = g\Big(\alpha_0 + \sum_{i=1}^{q}\alpha_i\,h\Big(\gamma_{i0} + \sum_{j=1}^{p}\gamma_{ij}x_j\Big)\Big), $$
where $\beta$ is the parameter vector containing all $\alpha$ and $\gamma$, and $g$ and $h$ are some pre-specified functions. In the jargon of the neural network literature, this specification contains $p$ "input units" in the input layer (each corresponding to an explanatory variable $x_j$), $q$ "hidden units" in the hidden (middle) layer with the $i$-th hidden-unit activation $h_i = h(\gamma_{i0} + \sum_{j=1}^{p}\gamma_{ij}x_j)$, and one "output unit" in the output layer with the activation $o = g(\alpha_0 + \sum_{i=1}^{q}\alpha_ih_i)$. The functions $h$ and $g$ are known as "activation functions," and the parameters in these functions are "connection weights." That is, the input values simultaneously activate the $q$ hidden units, and these hidden-unit activations in turn determine the output value. The output value is supposed to capture the behavior of the target (dependent) variable $y$. In the context of nonlinear regression, we can write
$$ y = g\Big(\alpha_0 + \sum_{i=1}^{q}\alpha_i\,h\Big(\gamma_{i0} + \sum_{j=1}^{p}\gamma_{ij}x_j\Big)\Big) + e. $$
For a multivariate target $y$, networks with multiple outputs can be constructed similarly with $g$ being a vector-valued function.

In practice, it is typical to choose $h$ as a sigmoid (S-shaped) function bounded within a certain range. For example, two leading choices of $h$ are the logistic function
$h(x) = 1/(1+e^{-x})$, which is bounded between 0 and 1, and the hyperbolic tangent function
$$ h(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, $$
which is bounded between $-1$ and 1. The function $g$ may be the identity function or the same as $h$. Although the class of neural networks is highly nonlinear in parameters, it possesses two appealing properties. First, a neural network is capable of approximating any Borel-measurable function to any degree of accuracy, provided that the number of hidden units $q$ is sufficiently large. Second, to achieve a given degree of approximation accuracy, neural networks are relatively more parsimonious than, e.g., the polynomial and trigonometric expansions. For more details of artificial neural networks and their relationships to econometrics we refer to Kuan and White (1994).
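The forward pass of such a network is only a few lines of code. A minimal sketch (my own illustration; the argument layout, the logistic choice for $h$ and the identity default for $g$ are assumptions, not from the text):

```python
import numpy as np

def neural_net_3layer(x, alpha0, alpha, gamma0, gamma, g=lambda z: z):
    """Forward pass of the 3-layer network f(x; beta) described above.

    `gamma` is q x p, `gamma0` and `alpha` have length q; h is logistic.
    """
    h = 1.0 / (1.0 + np.exp(-(gamma0 + gamma @ x)))   # q hidden-unit activations
    return g(alpha0 + alpha @ h)                       # output activation

rng = np.random.default_rng(0)
p, q = 3, 4
x = rng.normal(size=p)
print(neural_net_3layer(x, 0.1, rng.normal(size=q),
                        rng.normal(size=q), rng.normal(size=(q, p))))
```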
7.2 The Method of Nonlinear Least Squares
Formally, we consider the nonlinear specification (7.1):
$$ y = f(x;\beta) + e(\beta), $$
where $f: \mathbb{R}^{\ell}\times\Theta_1 \to \mathbb{R}$, $\Theta_1$ denotes the parameter space, a subspace of $\mathbb{R}^{k}$, and $e(\beta)$ is the specification error. Given $T$ observations of $y$ and $x$, let
$$ \boldsymbol{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}, \qquad \boldsymbol{f}(x_1, \ldots, x_T;\beta) = \begin{bmatrix} f(x_1;\beta) \\ f(x_2;\beta) \\ \vdots \\ f(x_T;\beta) \end{bmatrix}. $$
The nonlinear specification (7.1) now can be expressed as
$$ \boldsymbol{y} = \boldsymbol{f}(x_1, \ldots, x_T;\beta) + \boldsymbol{e}(\beta), $$
where $\boldsymbol{e}(\beta)$ is the vector of errors.
7.2.1 Nonlinear Least Squares Estimator
Our objective is to find a $k$-dimensional surface that "best" fits the data $(y_t, x_t)$, $t = 1, \ldots, T$. Analogous to the OLS method, the method of nonlinear least squares (NLS)
suggests minimizing the following NLS criterion function with respect to $\beta$:
$$ Q_T(\beta) = \frac{1}{T}\big[\boldsymbol{y} - \boldsymbol{f}(x_1,\ldots,x_T;\beta)\big]'\big[\boldsymbol{y} - \boldsymbol{f}(x_1,\ldots,x_T;\beta)\big] = \frac{1}{T}\sum_{t=1}^{T}\big[y_t - f(x_t;\beta)\big]^2. \qquad (7.2) $$
Note that $Q_T$ is also a function of the data $y_t$ and $x_t$; we omit these arguments just for convenience.
The first order condition of the NLS minimization problem is a system of $k$ nonlinear equations with $k$ unknowns:
$$ \nabla_{\beta}Q_T(\beta) = -\frac{2}{T}\,\nabla_{\beta}\boldsymbol{f}(x_1,\ldots,x_T;\beta)\,\big[\boldsymbol{y} - \boldsymbol{f}(x_1,\ldots,x_T;\beta)\big] \stackrel{\text{set}}{=} 0, $$
where
$$ \nabla_{\beta}\boldsymbol{f}(x_1,\ldots,x_T;\beta) = \big[\nabla_{\beta}f(x_1;\beta)\;\;\nabla_{\beta}f(x_2;\beta)\;\;\cdots\;\;\nabla_{\beta}f(x_T;\beta)\big] $$
is a $k\times T$ matrix. A solution to this minimization problem is some $\hat{\beta}$ in $\Theta_1$ that solves the first order condition, $\nabla_{\beta}Q_T(\hat{\beta}) = 0$, and satisfies the second order condition: $\nabla_{\beta}^2Q_T(\hat{\beta})$ is positive definite. We thus impose the following identification requirement; cf. [ID-1] for linear specifications.

[ID-2] $f(x;\beta)$ is twice continuously differentiable in the second argument on $\Theta_1$, such that for given data $(y_t, x_t)$, $t = 1, \ldots, T$, $\nabla_{\beta}^2Q_T(\beta)$ is positive definite at some interior point of $\Theta_1$.

While [ID-2] ensures that a minimum of $Q_T(\beta)$ can be found, it does not guarantee the uniqueness of this solution. For a given data set, there may exist multiple solutions to the NLS minimization problem such that each solution is a local minimum of $Q_T(\beta)$. This result is stated below; cf. Theorem 3.1.
Theorem 7.1 Given the specification (7.1), suppose that [ID-2] holds. Then, there exists a solution that minimizes the NLS criterion function (7.2).
Writing $\boldsymbol{f}(x_1,\ldots,x_T;\beta)$ as $\boldsymbol{f}(\beta)$, we have
$$ \nabla_{\beta}^2Q_T(\beta) = -\frac{2}{T}\,\nabla_{\beta}^2\boldsymbol{f}(\beta)\,\big[\boldsymbol{y} - \boldsymbol{f}(\beta)\big] + \frac{2}{T}\,\big[\nabla_{\beta}\boldsymbol{f}(\beta)\big]\big[\nabla_{\beta}\boldsymbol{f}(\beta)\big]'. $$
For linear regressions, $\boldsymbol{f}(\beta) = X\beta$ so that $\nabla_{\beta}\boldsymbol{f}(\beta) = X'$ and $\nabla_{\beta}^2\boldsymbol{f}(\beta) = 0$. It follows that $\nabla_{\beta}^2Q_T(\beta) = 2(X'X)/T$, which is positive definite if, and only if, $X$ has full
column rank. This shows that [ID-2] is, in effect, analogous to [ID-1] for the OLS method. Compared with the OLS method, the NLS minimization problem may not have a closed-form solution because the first order condition is a system of nonlinear functions in general; see also Exercise 7.1.
The minimizer of $Q_T(\beta)$ is known as the NLS estimator and will be denoted as $\hat{\beta}_T$. Let $\hat{\boldsymbol{y}}$ denote the vector of NLS fitted values, with the $t$-th element $\hat{y}_t = f(x_t;\hat{\beta}_T)$, and $\hat{\boldsymbol{e}}$ denote the vector of NLS residuals $\boldsymbol{y} - \hat{\boldsymbol{y}}$, with the $t$-th element $\hat{e}_t = y_t - \hat{y}_t$. Denote the transpose of $\nabla_{\beta}\boldsymbol{f}(\beta)$ as $\Xi(\beta)$. Then by the first order condition,
$$ \Xi(\hat{\beta}_T)'\,\hat{\boldsymbol{e}} = \big[\nabla_{\beta}\boldsymbol{f}(\hat{\beta}_T)\big]\,\hat{\boldsymbol{e}} = 0. $$
That is, the residual vector is orthogonal to every column vector of $\Xi(\hat{\beta}_T)$. Geometrically, $\boldsymbol{f}(\beta)$ defines a surface on $\Theta_1$, and for any $\beta$ in $\Theta_1$, $\Xi(\beta)$ spans a $k$-dimensional linear subspace tangent at the point $\boldsymbol{f}(\beta)$. Thus, $\boldsymbol{y}$ is orthogonally projected onto this surface at $\boldsymbol{f}(\hat{\beta}_T)$ so that the residual vector is orthogonal to the tangent space at that point. In contrast with linear regressions, there may be more than one orthogonal projection and hence multiple solutions to the NLS minimization problem. There is also no guarantee that the sum of NLS residuals is zero; see Exercise 7.2.

Remark: The marginal response to the change of the $i$-th regressor is $\partial f(x_t;\beta)/\partial x_{ti}$. Thus, one should be careful in interpreting the estimation results because a parameter in a nonlinear specification is not necessarily the marginal response to the change of a regressor.
7.2.2 Nonlinear Optimization Algorithms
When a solution to the first order condition of the NLS minimization problem cannot be obtained analytically, the NLS estimates must be computed using numerical methods. To optimize a nonlinear function, an iterative algorithm starts from some initial value of the argument in that function and then repeatedly calculates the next available value according to a particular rule until an optimum is reached approximately. It should be noted that when there are multiple optima, an iterative algorithm may not be able to locate the global optimum. In fact, it is more common that an algorithm gets stuck at a local optimum, except in some special cases, e.g., when optimizing a globally concave (convex) function. In the literature, several new methods, such as the simulated annealing algorithm, have been proposed to find the global solution. These methods have not yet become standard because they are typically difficult to implement and computationally very intensive. We will therefore confine ourselves to those commonly used local
methods.
To minimize $Q_T(\beta)$, a generic algorithm can be expressed as
$$ \beta^{(i+1)} = \beta^{(i)} + s^{(i)}d^{(i)}, $$
so that the $(i+1)$-th iterated value $\beta^{(i+1)}$ is obtained from $\beta^{(i)}$, the value from the previous iteration, by adjusting it by the amount $s^{(i)}d^{(i)}$, where $d^{(i)}$ characterizes the direction of change in the parameter space and $s^{(i)}$ controls the amount of change. Different algorithms result from different choices of $s$ and $d$. As maximizing $Q_T$ is equivalent to minimizing $-Q_T$, the methods discussed here are readily modified to algorithms for maximization problems.
Consider the first-order Taylor expansion of $Q_T(\beta)$ about $\beta^{\dagger}$:
$$ Q_T(\beta) \approx Q_T(\beta^{\dagger}) + \big[\nabla_{\beta}Q_T(\beta^{\dagger})\big]'(\beta - \beta^{\dagger}). $$
Replacing $\beta$ with $\beta^{(i+1)}$ and $\beta^{\dagger}$ with $\beta^{(i)}$, we have
$$ Q_T\big(\beta^{(i+1)}\big) \approx Q_T\big(\beta^{(i)}\big) + \big[\nabla_{\beta}Q_T\big(\beta^{(i)}\big)\big]'\,s^{(i)}d^{(i)}. $$
Note that this approximation is valid when $\beta^{(i+1)}$ is in the neighborhood of $\beta^{(i)}$. Let $g(\beta)$ denote the gradient vector of $Q_T$, $\nabla_{\beta}Q_T(\beta)$, and $g^{(i)}$ denote $g(\beta)$ evaluated at $\beta^{(i)}$. If $d^{(i)} = -g^{(i)}$,
$$ Q_T\big(\beta^{(i+1)}\big) \approx Q_T\big(\beta^{(i)}\big) - s^{(i)}\,g^{(i)\prime}g^{(i)}. $$
As $g^{(i)\prime}g^{(i)}$ is non-negative, we can find a positive and small enough $s^{(i)}$ such that $Q_T$ is decreasing. Clearly, when $\beta^{(i)}$ is already a minimum of $Q_T$, $g^{(i)}$ is zero so that no further adjustment is possible. This suggests the following algorithm:
$$ \beta^{(i+1)} = \beta^{(i)} - s^{(i)}g^{(i)}. $$
Choosing $d^{(i)} = g^{(i)}$ leads to
$$ \beta^{(i+1)} = \beta^{(i)} + s^{(i)}g^{(i)}, $$
which can be used to search for a maximum of $Q_T$.
Given the search direction, one may want to choose $s^{(i)}$ such that the next value of the objective function, $Q_T\big(\beta^{(i+1)}\big)$, is a minimum. This suggests that the first order condition below should hold:
$$ \frac{\partial Q_T\big(\beta^{(i+1)}\big)}{\partial s^{(i)}} = \nabla_{\beta}Q_T\big(\beta^{(i+1)}\big)'\,\frac{\partial\beta^{(i+1)}}{\partial s^{(i)}} = -\,g^{(i+1)\prime}g^{(i)} = 0. $$
Let $H^{(i)}$ denote the Hessian matrix of $Q_T$ evaluated at $\beta^{(i)}$:
$$ H^{(i)} = \nabla_{\beta}^2Q_T(\beta)\big|_{\beta=\beta^{(i)}} = \nabla_{\beta}\,g(\beta)\big|_{\beta=\beta^{(i)}}. $$
Then by Taylor's expansion of $g$, we have
$$ g^{(i+1)} \approx g^{(i)} + H^{(i)}\big(\beta^{(i+1)} - \beta^{(i)}\big) = g^{(i)} - H^{(i)}s^{(i)}g^{(i)}. $$
It follows that
$$ 0 = g^{(i+1)\prime}g^{(i)} \approx g^{(i)\prime}g^{(i)} - s^{(i)}\,g^{(i)\prime}H^{(i)}g^{(i)}, $$
or equivalently,
$$ s^{(i)} = \frac{g^{(i)\prime}g^{(i)}}{g^{(i)\prime}H^{(i)}g^{(i)}}. $$
The step length $s^{(i)}$ is non-negative whenever $H^{(i)}$ is positive definite. The algorithm derived above now reads
$$ \beta^{(i+1)} = \beta^{(i)} - \frac{g^{(i)\prime}g^{(i)}}{g^{(i)\prime}H^{(i)}g^{(i)}}\,g^{(i)}, $$
which is known as the steepest descent algorithm. If $H^{(i)}$ is not positive definite, $s^{(i)}$ may be negative, so that this algorithm may point in a wrong direction.
As the steepest descent algorithm adjusts parameters along the opposite of the gradient direction, it may run into difficulty when, e.g., the nonlinear function being optimized is flat around the optimum. The algorithm may iterate back and forth without much progress in approaching an optimum. An alternative is to consider the second-order Taylor expansion of $Q_T(\beta)$ around some $\beta^{\dagger}$:
$$ Q_T(\beta) \approx Q_T(\beta^{\dagger}) + g^{\dagger\prime}(\beta-\beta^{\dagger}) + \frac{1}{2}(\beta-\beta^{\dagger})'H^{\dagger}(\beta-\beta^{\dagger}), $$
where $g^{\dagger}$ and $H^{\dagger}$ are $g$ and $H$ evaluated at $\beta^{\dagger}$, respectively. From this expansion, the first order condition of $Q_T(\beta)$ may be expressed as
$$ g^{\dagger} + H^{\dagger}(\beta - \beta^{\dagger}) \approx 0, $$
so that $\beta \approx \beta^{\dagger} - (H^{\dagger})^{-1}g^{\dagger}$. This suggests the following algorithm:
$$ \beta^{(i+1)} = \beta^{(i)} - \big(H^{(i)}\big)^{-1}g^{(i)}, $$
where the step length is 1 and the direction vector is $-\big(H^{(i)}\big)^{-1}g^{(i)}$. This is also known as the Newton-Raphson algorithm. This algorithm is more difficult to implement because it involves matrix inversion at each iteration step.

From Taylor's expansion we can also see that
$$ Q_T\big(\beta^{(i+1)}\big) \approx Q_T\big(\beta^{(i)}\big) - \frac{1}{2}\,g^{(i)\prime}\big(H^{(i)}\big)^{-1}g^{(i)}, $$
where the adjustment term on the right-hand side is negative provided that $H^{(i)}$ is positive definite. When this approximation is good, the Newton-Raphson algorithm usually (but not always) results in a decrease in the value of $Q_T$. This algorithm may point in a wrong direction if $H^{(i)}$ is not positive definite; this happens when, e.g., $Q_T$ is concave at $\beta^{(i)}$. When $Q_T$ is (locally) quadratic with the local minimum $\beta^{*}$, the second-order expansion about $\beta^{*}$ is exact, and hence
$$ \beta^{*} = \beta - H(\beta)^{-1}g(\beta). $$
In this case, the Newton-Raphson algorithm can reach the minimum in a single step. Alternatively, we may also add a step length to the Newton-Raphson algorithm:
$$ \beta^{(i+1)} = \beta^{(i)} - s^{(i)}\big(H^{(i)}\big)^{-1}g^{(i)}, $$
where $s^{(i)}$ may be found by minimizing $Q_T\big(\beta^{(i+1)}\big)$. In practice, it is more typical to choose $s^{(i)}$ such that $Q_T\big(\beta^{(i)}\big)$ is decreasing at each iteration.
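The Newton-Raphson iteration is easy to state in code. A generic sketch (my own illustration; the callables `grad` and `hess` and the simple stopping rule are assumptions, and no step-length control or Hessian correction is included):

```python
import numpy as np

def newton_raphson(grad, hess, beta0, max_iter=100, tol=1e-8):
    """Newton-Raphson iterations beta <- beta - H^{-1} g."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        g = grad(beta)
        step = np.linalg.solve(hess(beta), g)   # H^{-1} g without explicit inverse
        beta = beta - step
        if np.linalg.norm(step) < tol:          # stopping rule on the update size
            break
    return beta

# Example: minimize the quadratic (b1 - 1)^2 + 2*(b2 + 3)^2, reached in one step.
grad = lambda b: np.array([2 * (b[0] - 1), 4 * (b[1] + 3)])
hess = lambda b: np.diag([2.0, 4.0])
print(newton_raphson(grad, hess, [10.0, 10.0]))   # -> [1, -3]
```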
An algorithm that avoids computing the second-order derivatives is the so-called Gauss-Newton algorithm. When $Q_T(\beta)$ is the NLS criterion function,
$$ H(\beta) = -\frac{2}{T}\,\nabla_{\beta}^2\boldsymbol{f}(\beta)\,\big[\boldsymbol{y}-\boldsymbol{f}(\beta)\big] + \frac{2}{T}\,\Xi(\beta)'\Xi(\beta), $$
where $\Xi(\beta) = \nabla_{\beta}\boldsymbol{f}(\beta)'$. It is therefore convenient to ignore the first term on the right-hand side and approximate $H(\beta)$ by $2\,\Xi(\beta)'\Xi(\beta)/T$. There are some advantages to this approximation. First, only the first-order derivatives need to be computed. Second, this approximation is guaranteed to be positive definite under [ID-2]. The resulting algorithm is
$$ \beta^{(i+1)} = \beta^{(i)} + \big[\Xi\big(\beta^{(i)}\big)'\Xi\big(\beta^{(i)}\big)\big]^{-1}\Xi\big(\beta^{(i)}\big)'\big[\boldsymbol{y} - \boldsymbol{f}\big(\beta^{(i)}\big)\big]. $$
Observe that the adjustment term can be obtained as the OLS estimator of regressing $\boldsymbol{y} - \boldsymbol{f}\big(\beta^{(i)}\big)$ on $\Xi\big(\beta^{(i)}\big)$; this regression is thus known as the Gauss-Newton regression. The iterated values can be easily computed by performing the Gauss-Newton regression repeatedly. The performance of this algorithm may be quite different from the
Newton-Raphson algorithm because it utilizes only an approximation to the Hessian
matrix.
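A compact sketch of the Gauss-Newton iterations (my own illustration; `f` and `jac` are user-supplied callables for the fitted-value vector and the $T\times k$ matrix $\Xi(\beta)$, the toy exponential specification is assumed, and no step-length control is used):

```python
import numpy as np

def gauss_newton(f, jac, y, beta0, max_iter=200, tol=1e-8):
    """Gauss-Newton iterations for NLS via repeated Gauss-Newton regressions."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        resid = y - f(beta)
        Xi = jac(beta)
        # OLS regression of the residuals on Xi gives the adjustment term.
        step = np.linalg.lstsq(Xi, resid, rcond=None)[0]
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Example: fit y = b0 * exp(b1 * x) + error.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 200)
y = 2.0 * np.exp(1.5 * x) + rng.normal(scale=0.05, size=x.size)
f = lambda b: b[0] * np.exp(b[1] * x)
jac = lambda b: np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])
print(gauss_newton(f, jac, y, [1.0, 1.0]))   # roughly [2.0, 1.5]
```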
To maintain a correct search direction for the steepest descent and Newton-Raphson algorithms, it is important to ensure that $H^{(i)}$ is positive definite at each iteration. A simple approach is to correct $H^{(i)}$, if necessary, by adding an appropriate matrix to it. A popular correction is
$$ H_c^{(i)} = H^{(i)} + c^{(i)}I, $$
where $c^{(i)}$ is a positive number chosen to force $H_c^{(i)}$ to be a positive definite matrix. Let $\tilde{H} = H^{-1}$. One may also compute
$$ \tilde{H}_c^{(i)} = \tilde{H}^{(i)} + cI, $$
because it is the inverse of $H^{(i)}$ that matters in the algorithm. Such a correction is used in, for example, the so-called Marquardt-Levenberg algorithm.

The quasi-Newton method, on the other hand, corrects $\tilde{H}^{(i)}$ iteratively by adding a symmetric correction matrix $C^{(i)}$:
$$ \tilde{H}^{(i+1)} = \tilde{H}^{(i)} + C^{(i)}, $$
with the initial value $\tilde{H}^{(0)} = I$. This method includes the Davidon-Fletcher-Powell (DFP) algorithm and the Broydon-Fletcher-Goldfarb-Shanno (BFGS) algorithm, where the latter is the algorithm used in the GAUSS program. In the DFP algorithm,
$$ C^{(i)} = \frac{\delta^{(i)}\delta^{(i)\prime}}{\delta^{(i)\prime}\gamma^{(i)}} - \frac{\tilde{H}^{(i)}\gamma^{(i)}\gamma^{(i)\prime}\tilde{H}^{(i)}}{\gamma^{(i)\prime}\tilde{H}^{(i)}\gamma^{(i)}}, $$
where $\delta^{(i)} = \beta^{(i+1)} - \beta^{(i)}$ and $\gamma^{(i)} = g^{(i+1)} - g^{(i)}$. The BFGS algorithm contains an additional term in the correction matrix.

To implement an iterative algorithm, one must choose a vector of initial values to start the algorithm and a stopping rule to terminate the iteration procedure. Initial values are usually specified by the researcher or by random number generation; prior information, if available, should also be taken into account. For example, if the parameter is a probability, the algorithm may be initialized by, say, 0.5 or by a number randomly generated from the uniform distribution on [0, 1]. Without prior information, it is also typical to generate initial values from a normal distribution. In practice, one would generate many sets of initial values and then choose the one that leads to a better result
(for example, a better fit of the data). Of course, this search process is computationally
demanding.
When an algorithm results in no further improvement, a stopping rule must be invoked to terminate the iterations. Typically, an algorithm stops when one of the following convergence criteria is met: for a pre-determined, small positive number $c$,

1. $\big\|\beta^{(i+1)} - \beta^{(i)}\big\| < c$, where $\|\cdot\|$ denotes the Euclidean norm;
2. $\big\|g\big(\beta^{(i)}\big)\big\| < c$; or
3. $\big|Q_T\big(\beta^{(i+1)}\big) - Q_T\big(\beta^{(i)}\big)\big| < c$.
For the Gauss-Newton algorithm, one may stop the algorithm when $TR^2$ is "close" to zero, where $R^2$ is the coefficient of determination of the Gauss-Newton regression. As the residual vector must be orthogonal to the tangent space at the optimum, this stopping rule amounts to checking whether the first order condition is satisfied approximately. In some cases, an algorithm may never meet its pre-set convergence criterion and hence keeps on iterating. To circumvent this difficulty, an optimization program usually sets a maximum number of iterations so that the program terminates automatically once the number of iterations reaches this upper bound.
7.3 Asymptotic Properties of the NLS Estimators
7.3.1 Consistency
As the NLS estimator does not have an analytic form in general, a different approach is thus needed to establish NLS consistency. Intuitively, when the NLS objective function $Q_T(\beta)$ is close to $\mathrm{IE}[Q_T(\beta)]$ for all $\beta$, it is reasonable to expect that the minimizer of $Q_T(\beta)$, i.e., the NLS estimator $\hat{\beta}_T$, is also close to a minimum of $\mathrm{IE}[Q_T(\beta)]$. Given that $Q_T$ is nonlinear in $\beta$, a ULLN must be invoked to justify the closeness between $Q_T(\beta)$ and $\mathrm{IE}[Q_T(\beta)]$, as discussed in Section 5.6.
To illustrate how consistency can be obtained, we consider a special case. Suppose that $\mathrm{IE}[Q_T(\beta)]$ is a continuous function on the compact parameter space $\Theta_1$ such that $\beta_o$ is its unique, global minimum. The NLS estimator $\hat{\beta}_T$ is such that
$$ Q_T(\hat{\beta}_T) = \inf_{\beta\in\Theta_1} Q_T(\beta). $$
Suppose also that $Q_T$ has a SULLN effect, i.e., there is a set $\Omega_0$ such that $\mathrm{IP}(\Omega_0) = 1$ and
$$ \sup_{\beta\in\Theta_1}\big|Q_T(\beta) - \mathrm{IE}[Q_T(\beta)]\big| \to 0, $$
for all $\omega\in\Omega_0$. Set
$$ \epsilon = \inf_{\beta\in B^{c}\cap\Theta_1}\big\{\mathrm{IE}[Q_T(\beta)] - \mathrm{IE}[Q_T(\beta_o)]\big\}, $$
where $B$ is an open neighborhood of $\beta_o$. Then for $\omega\in\Omega_0$, we can choose $T$ sufficiently large such that
$$ \big|\mathrm{IE}[Q_T(\hat{\beta}_T)] - Q_T(\hat{\beta}_T)\big| < \frac{\epsilon}{2}, $$
and that
$$ Q_T(\hat{\beta}_T) - \mathrm{IE}[Q_T(\beta_o)] \le Q_T(\beta_o) - \mathrm{IE}[Q_T(\beta_o)] < \frac{\epsilon}{2}, $$
because the NLS estimator $\hat{\beta}_T$ minimizes $Q_T(\beta)$. It follows that for $\omega\in\Omega_0$,
$$ \mathrm{IE}[Q_T(\hat{\beta}_T)] - \mathrm{IE}[Q_T(\beta_o)] \le \big|\mathrm{IE}[Q_T(\hat{\beta}_T)] - Q_T(\hat{\beta}_T)\big| + \big|Q_T(\hat{\beta}_T) - \mathrm{IE}[Q_T(\beta_o)]\big| < \epsilon, $$
for all $T$ sufficiently large. This shows that, compared with all $\beta$ outside the neighborhood $B$ of $\beta_o$, $\hat{\beta}_T$ will eventually render $\mathrm{IE}[Q_T(\beta)]$ closer to $\mathrm{IE}[Q_T(\beta_o)]$ with probability one. Thus, $\hat{\beta}_T$ must be in $B$ for large $T$. As $B$ is arbitrary, $\hat{\beta}_T$ must converge to $\beta_o$ almost surely. Convergence in probability of $\hat{\beta}_T$ to $\beta_o$ can be established using a similar argument; see, e.g., Amemiya (1985) and Exercise 7.4.
The preceding discussion shows that what matters for consistency is the effect of a SULLN (WULLN). Recall from Theorem 5.34 that, to ensure a SULLN (WULLN), $Q_T$ should obey a SLLN (WLLN) for each $\beta\in\Theta_1$ and also satisfy a Lipschitz-type continuity condition:
$$ |Q_T(\beta) - Q_T(\beta^{\dagger})| \le C_T\,\|\beta - \beta^{\dagger}\| \quad \text{a.s.}, $$
with $C_T$ bounded almost surely (in probability). If the parameter space $\Theta_1$ is compact and convex, we have from the mean-value theorem and the Cauchy-Schwartz inequality that
$$ |Q_T(\beta) - Q_T(\beta^{\dagger})| \le \big\|\nabla_{\beta}Q_T(\beta^{\ddagger})\big\|\,\|\beta - \beta^{\dagger}\| \quad \text{a.s.}, $$
where $\beta$ and $\beta^{\dagger}$ are in $\Theta_1$ and $\beta^{\ddagger}$ is the mean value of $\beta$ and $\beta^{\dagger}$, in the sense that $\|\beta^{\ddagger} - \beta^{\dagger}\| < \|\beta - \beta^{\dagger}\|$. Hence, the Lipschitz-type condition would hold by setting
$$ C_T = \sup_{\beta\in\Theta_1}\big\|\nabla_{\beta}Q_T(\beta)\big\|. $$
Observe that in the NLS context,
$$ Q_T(\beta) = \frac{1}{T}\sum_{t=1}^{T}\big[y_t^2 - 2y_tf(x_t;\beta) + f(x_t;\beta)^2\big], $$
and
$$ \nabla_{\beta}Q_T(\beta) = -\frac{2}{T}\sum_{t=1}^{T}\nabla_{\beta}f(x_t;\beta)\,\big[y_t - f(x_t;\beta)\big]. $$
Hence, $\nabla_{\beta}Q_T(\beta)$ cannot be almost surely bounded in general. (It would be bounded if, for example, $y_t$ are bounded random variables and both $f$ and $\nabla_{\beta}f$ are bounded functions.) On the other hand, it is practically more plausible that $\nabla_{\beta}Q_T(\beta)$ is bounded in probability. This is the case when, for example, $\mathrm{IE}|\nabla_{\beta}Q_T(\beta)|$ is bounded uniformly in $\beta$. As such, we shall restrict our discussion below to WULLN and weak consistency of $\hat{\beta}_T$.
To proceed we assume that the identification requirement [ID-2] holds with probability one. The discussion above motivates the additional conditions given below.

[C1] $\{(y_t\;\; w_t')'\}$ is a sequence of random vectors, and $x_t$ is a vector containing some elements of $\mathcal{Y}^{t-1}$ and $\mathcal{W}^t$.
(i) The sequences $\{y_t^2\}$, $\{y_tf(x_t;\beta)\}$ and $\{f(x_t;\beta)^2\}$ all obey a WLLN for each $\beta$ in $\Theta_1$, where $\Theta_1$ is compact and convex.
(ii) $y_t$, $f(x_t;\beta)$ and $\nabla_{\beta}f(x_t;\beta)$ all have bounded second moments uniformly in $\beta$.

[C2] There exists a unique parameter vector $\beta_o$ such that $\mathrm{IE}(y_t\mid\mathcal{Y}^{t-1},\mathcal{W}^t) = f(x_t;\beta_o)$.
Condition [C1] is analogous to [B1] so that stochastic regressors are allowed. [C1](i) requires that each component of $Q_T(\beta)$ obey a standard WLLN. [C1](ii) implies
$$ \mathrm{IE}\big|\nabla_{\beta}Q_T(\beta)\big| \le \frac{2}{T}\sum_{t=1}^{T}\Big[\big\|\nabla_{\beta}f(x_t;\beta)\big\|_2\,\|y_t\|_2 + \big\|\nabla_{\beta}f(x_t;\beta)\big\|_2\,\|f(x_t;\beta)\|_2\Big] \le \Delta, $$
for some $\Delta$ which does not depend on $\beta$. This in turn implies that $\nabla_{\beta}Q_T(\beta)$ is bounded in probability (uniformly in $\beta$) by Markov's inequality. Condition [C2] is analogous to [B2] and requires $f(x_t;\beta)$ to be a correct specification of the conditional mean function. Thus, $\beta_o$ globally minimizes $\mathrm{IE}[Q_T(\beta)]$ because the conditional mean must minimize mean-squared errors.
Theorem 7.2 Given the nonlinear specification (7.1), suppose that [C1] and [C2] hold. Then, $\hat{\beta}_T \xrightarrow{\mathrm{IP}} \beta_o$.
Theorem 7.2 is not completely satisfactory because it is concerned with convergence to the global minimum. As noted in Section 7.2.2, an iterative algorithm is not guaranteed to find a global minimum of the NLS objective function. Hence, it is more reasonable to expect that the NLS estimator only converges to some local minimum of $\mathrm{IE}[Q_T(\beta)]$. A simple proof of such a local consistency result is not yet available. We therefore omit the details and assert only that the NLS estimator converges in probability to a local minimum $\beta^{*}$. Note that $f(x;\beta^{*})$ is, at most, an approximation to the conditional mean function.
7.3.2 Asymptotic Normality
Given that the NLS estimator $\hat{\beta}_T$ is weakly consistent for some $\beta^{*}$, we will sketch a proof that, with more regularity conditions, the suitably normalized NLS estimator is asymptotically distributed as a normal random vector.

First note that, by the mean-value expansion of $\nabla_{\beta}Q_T(\hat{\beta}_T)$ about $\beta^{*}$,
$$ \nabla_{\beta}Q_T(\hat{\beta}_T) = \nabla_{\beta}Q_T(\beta^{*}) + \nabla_{\beta}^2Q_T(\beta_T^{\dagger})\,(\hat{\beta}_T - \beta^{*}), $$
where $\beta_T^{\dagger}$ is a mean value between $\hat{\beta}_T$ and $\beta^{*}$. Clearly, the left-hand side is zero because $\hat{\beta}_T$ is the NLS estimator and hence solves the first order condition. By [ID-2], the Hessian matrix $\nabla_{\beta}^2Q_T(\beta_T^{\dagger})$ is invertible, so that
$$ \sqrt{T}\,(\hat{\beta}_T - \beta^{*}) = -\big[\nabla_{\beta}^2Q_T(\beta_T^{\dagger})\big]^{-1}\sqrt{T}\,\nabla_{\beta}Q_T(\beta^{*}). $$
The asymptotic distribution of $\sqrt{T}(\hat{\beta}_T - \beta^{*})$ is therefore the same as that of the right-hand side.
Let $H_T(\beta) = \mathrm{IE}[\nabla_{\beta}^2Q_T(\beta)]$ and let $\mathrm{vec}$ denote the operator such that $\mathrm{vec}(A)$ is the vector resulting from stacking all the column vectors of $A$. By the triangle inequality,
$$ \big\|\mathrm{vec}\big[\nabla_{\beta}^2Q_T(\beta_T^{\dagger})\big] - \mathrm{vec}\big[H_T(\beta^{*})\big]\big\| \le \big\|\mathrm{vec}\big[\nabla_{\beta}^2Q_T(\beta_T^{\dagger})\big] - \mathrm{vec}\big[H_T(\beta_T^{\dagger})\big]\big\| + \big\|\mathrm{vec}\big[H_T(\beta_T^{\dagger})\big] - \mathrm{vec}\big[H_T(\beta^{*})\big]\big\|. $$
The first term on the right-hand side converges to zero in probability, provided that $\nabla_{\beta}^2Q_T(\beta)$ also obeys a WULLN. As $\beta_T^{\dagger}$ is a mean value between $\hat{\beta}_T$ and $\beta^{*}$, weak consistency of $\hat{\beta}_T$ implies that $\beta_T^{\dagger}$ also converges in probability to $\beta^{*}$. Under [ID-2], $Q_T$ is twice continuously differentiable so that $H_T(\beta)$ is continuous in $\beta$. Thus, $H_T(\beta_T^{\dagger}) - H_T(\beta^{*})$ also converges to zero in probability. Consequently, $\nabla_{\beta}^2Q_T(\beta_T^{\dagger})$ is essentially close to $H_T(\beta^{*})$, in the sense that they differ by an $o_{\mathrm{IP}}(1)$ term.
The result above shows that the normalized NLS estimator, $\sqrt{T}(\hat{\beta}_T - \beta^{*})$, and
$$ -H_T(\beta^{*})^{-1}\sqrt{T}\,\nabla_{\beta}Q_T(\beta^{*}) $$
are asymptotically equivalent and hence must have the same limiting distribution. Observe that $\sqrt{T}\,\nabla_{\beta}Q_T(\beta^{*})$ is a partial sum:
$$ \sqrt{T}\,\nabla_{\beta}Q_T(\beta^{*}) = -\frac{2}{\sqrt{T}}\sum_{t=1}^{T}\nabla_{\beta}f(x_t;\beta^{*})\,\big[y_t - f(x_t;\beta^{*})\big], $$
and hence obeys a CLT under suitable regularity conditions. That is,
$$ \big(V_T^{*}\big)^{-1/2}\sqrt{T}\,\nabla_{\beta}Q_T(\beta^{*}) \xrightarrow{D} N(0, I_k), $$
where
$$ V_T^{*} = \mathrm{var}\Big(\frac{2}{\sqrt{T}}\sum_{t=1}^{T}\nabla_{\beta}f(x_t;\beta^{*})\,\big[y_t - f(x_t;\beta^{*})\big]\Big). $$
Then for $D_T^{*} = H_T(\beta^{*})^{-1}V_T^{*}H_T(\beta^{*})^{-1}$, we immediately obtain the following asymptotic normality result:
$$ \big(D_T^{*}\big)^{-1/2}H_T(\beta^{*})^{-1}\sqrt{T}\,\nabla_{\beta}Q_T(\beta^{*}) \xrightarrow{D} N(0, I_k), $$
which in turn implies
$$ \big(D_T^{*}\big)^{-1/2}\sqrt{T}\,(\hat{\beta}_T - \beta^{*}) \xrightarrow{D} N(0, I_k). $$
As in linear regression, asymptotic normality of the normalized NLS estimator remains valid when $D_T^{*}$ is replaced by its consistent estimator $\hat{D}_T$:
$$ \hat{D}_T^{-1/2}\sqrt{T}\,(\hat{\beta}_T - \beta^{*}) \xrightarrow{D} N(0, I_k). $$
Thus, finding a consistent estimator of $D_T^{*}$ is important in practice.
Consistent estimation of $D_T^{*}$ is completely analogous to that for linear regressions; see the discussion in Section 6.3. First observe that $H_T(\beta^{*})$ is
$$ H_T(\beta^{*}) = \frac{2}{T}\sum_{t=1}^{T}\mathrm{IE}\Big[\big(\nabla_{\beta}f(x_t;\beta^{*})\big)\big(\nabla_{\beta}f(x_t;\beta^{*})\big)'\Big] - \frac{2}{T}\sum_{t=1}^{T}\mathrm{IE}\Big[\nabla_{\beta}^2f(x_t;\beta^{*})\,\big(y_t - f(x_t;\beta^{*})\big)\Big], $$
which can be consistently estimated by its sample counterpart:
$$ \hat{H}_T = \frac{2}{T}\sum_{t=1}^{T}\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)' - \frac{2}{T}\sum_{t=1}^{T}\nabla_{\beta}^2f(x_t;\hat{\beta}_T)\,\hat{e}_t. $$
When $\epsilon_t = y_t - f(x_t;\beta^{*})$ are uncorrelated with $\nabla_{\beta}^2f(x_t;\beta^{*})$, $H_T(\beta^{*})$ depends only on the expectation of the outer product of $\nabla_{\beta}f(x_t;\beta^{*})$, so that $\hat{H}_T$ simplifies to
$$ \hat{H}_T = \frac{2}{T}\sum_{t=1}^{T}\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)'. $$
This estimator is analogous to $\sum_{t=1}^{T}x_tx_t'/T$ for $M_{xx}$ in linear regression.
If $\beta^{*} = \beta_o$, so that $f(x_t;\beta^{*})$ is the conditional mean of $y_t$, we write $V_T^{*}$ as
$$ V_T^{o} = \frac{4}{T}\sum_{t=1}^{T}\mathrm{IE}\Big[\epsilon_t^2\,\big(\nabla_{\beta}f(x_t;\beta_o)\big)\big(\nabla_{\beta}f(x_t;\beta_o)\big)'\Big]. $$
When there is conditional homoskedasticity, $\mathrm{IE}(\epsilon_t^2\mid\mathcal{Y}^{t-1},\mathcal{W}^t) = \sigma_o^2$, $V_T^{o}$ simplifies to
$$ V_T^{o} = \frac{4\sigma_o^2}{T}\sum_{t=1}^{T}\mathrm{IE}\Big[\big(\nabla_{\beta}f(x_t;\beta_o)\big)\big(\nabla_{\beta}f(x_t;\beta_o)\big)'\Big], $$
which can be consistently estimated by
$$ \hat{V}_T = \frac{4\hat{\sigma}_T^2}{T}\sum_{t=1}^{T}\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)', $$
with $\hat{\sigma}_T^2$ a consistent estimator of $\sigma_o^2$. Typically, $\hat{\sigma}_T^2 = \sum_{t=1}^{T}\hat{e}_t^2/T$. In the NLS context, $\sum_{t=1}^{T}\hat{e}_t$ is non-zero in general, so that we may also compute $\hat{\sigma}_T^2$ as
$$ \hat{\sigma}_T^2 = \frac{1}{T}\sum_{t=1}^{T}(\hat{e}_t - \bar{e})^2, $$
where $\bar{e} = \sum_{t=1}^{T}\hat{e}_t/T$. Combining $\hat{V}_T$ and $\hat{H}_T$, we have
$$ \hat{D}_T = \hat{\sigma}_T^2\,\Big[\frac{1}{T}\sum_{t=1}^{T}\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)'\Big]^{-1}. $$
This estimator is analogous to the OLS covariance matrix estimator $\hat{\sigma}_T^2(X'X/T)^{-1}$ for linear regressions.
When there is conditional heteroskedasticity such that $\mathrm{IE}(\epsilon_t^2\mid\mathcal{Y}^{t-1},\mathcal{W}^t)$ is a function of the elements of $\mathcal{Y}^{t-1}$ and $\mathcal{W}^t$, $V_T^{o}$ can be consistently estimated by
$$ \hat{V}_T = \frac{4}{T}\sum_{t=1}^{T}\hat{e}_t^2\,\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)', $$
so that
$$ \hat{D}_T = \hat{H}_T^{-1}\,\hat{V}_T\,\hat{H}_T^{-1} = \Big[\frac{1}{T}\sum_{t=1}^{T}\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)'\Big]^{-1}\Big[\frac{1}{T}\sum_{t=1}^{T}\hat{e}_t^2\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)'\Big]\Big[\frac{1}{T}\sum_{t=1}^{T}\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)\big(\nabla_{\beta}f(x_t;\hat{\beta}_T)\big)'\Big]^{-1}. $$
This is White's heteroskedasticity-consistent covariance matrix estimator for nonlinear regressions.
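Given the $T\times k$ matrix of gradients evaluated at $\hat{\beta}_T$ and the NLS residuals, this sandwich estimator is a few matrix products; a small sketch (my own illustration; the interface and the HC0-style weighting without finite-sample correction are assumptions):

```python
import numpy as np

def nls_white_cov(grad_f, e_hat):
    """Heteroskedasticity-consistent covariance for the NLS estimator.

    `grad_f` is the T x k matrix of gradients of f at beta_hat;
    returns an estimate of the covariance matrix of beta_hat itself.
    """
    T = grad_f.shape[0]
    H_hat = 2.0 / T * grad_f.T @ grad_f                        # outer-product Hessian
    V_hat = 4.0 / T * (grad_f * e_hat[:, None] ** 2).T @ grad_f
    H_inv = np.linalg.inv(H_hat)
    D_hat = H_inv @ V_hat @ H_inv      # asy. cov. of sqrt(T)*(beta_hat - beta*)
    return D_hat / T                   # divide by T for the cov. of beta_hat
```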
As discussed earlier, the probability limit $\beta^{*}$ of the NLS estimator is typically a local minimum of $\mathrm{IE}[Q_T(\beta)]$ and hence not $\beta_o$ in general. In this case, $\epsilon_t$ is not a martingale difference sequence with respect to $\mathcal{Y}^{t-1}$ and $\mathcal{W}^t$, and $V_T^{*}$ must be estimated using a Newey-West type estimator; see Exercise 7.7.
7.4 Hypothesis Testing
For testing linear restrictions on the parameters, we again consider the null hypothesis
$$ H_0: R\beta_o = r, $$
where $R$ is a $q\times k$ matrix and $r$ is a $q\times 1$ vector of pre-specified constants. The Wald test now evaluates the difference between the NLS estimates and the hypothesized values. When the normalized NLS estimator, $T^{1/2}(\hat{\beta}_T - \beta_o)$, has an asymptotic normal distribution with asymptotic covariance matrix $D_T^{o}$, we have under the null hypothesis that
$$ \big(\Gamma_T^{o}\big)^{-1/2}\sqrt{T}\,(R\hat{\beta}_T - r) = \big(\Gamma_T^{o}\big)^{-1/2}\sqrt{T}\,R(\hat{\beta}_T - \beta_o) \xrightarrow{D} N(0, I_q), $$
where $\Gamma_T^{o} = RD_T^{o}R'$. Let $\hat{D}_T$ be a consistent estimator of $D_T^{o}$. Then, $\hat{\Gamma}_T = R\hat{D}_TR'$ is also consistent for $\Gamma_T^{o}$. It follows that the Wald statistic is
$$ \mathcal{W}_T = T\,(R\hat{\beta}_T - r)'\,\hat{\Gamma}_T^{-1}\,(R\hat{\beta}_T - r) \xrightarrow{D} \chi^2(q), $$
which is of the same form as the Wald statistic in linear regressions.
More generally, one may test a nonlinear restriction of the form
$$ H_0: r(\beta_o) = 0, $$
where $r$ now is an $\mathbb{R}^{q}$-valued nonlinear function that is continuously differentiable. It is natural to base the Wald test on $r(\hat{\beta}_T)$. First note that linearization of $r(\hat{\beta}_T)$ about $\beta_o$ yields
$$ r(\hat{\beta}_T) = r(\beta_o) + \big[\nabla_{\beta}r(\beta_o)\big]'(\hat{\beta}_T - \beta_o) + o_{\mathrm{IP}}(1). $$
Letting $R(\beta_o) = [\nabla_{\beta}r(\beta_o)]'$, a $q\times k$ matrix, we have under the null hypothesis that
$$ r(\hat{\beta}_T) = R(\beta_o)(\hat{\beta}_T - \beta_o) + o_{\mathrm{IP}}(1). $$
As before,
$$ \big(\Gamma_T^{o}\big)^{-1/2}\sqrt{T}\,r(\hat{\beta}_T) = \big(\Gamma_T^{o}\big)^{-1/2}\sqrt{T}\,R(\beta_o)(\hat{\beta}_T - \beta_o) \xrightarrow{D} N(0, I_q), $$
where in this case $\Gamma_T^{o} = R(\beta_o)\,D_T^{o}\,R(\beta_o)'$. This immediately suggests the following Wald statistic:
$$ \mathcal{W}_T = T\,r(\hat{\beta}_T)'\,\hat{\Gamma}_T^{-1}\,r(\hat{\beta}_T) \xrightarrow{D} \chi^2(q), $$
where $\hat{\Gamma}_T = R(\hat{\beta}_T)\,\hat{D}_T\,R(\hat{\beta}_T)'$ is consistent for $\Gamma_T^{o}$.
A well-known drawback of the Wald test is that it is not invariant with respect to different forms of a nonlinear hypothesis. For example, consider two equivalent hypotheses: $\beta_1\beta_2 = 1$ and $\beta_1 = 1/\beta_2$. The Wald test for the former is based on $\hat{\beta}_1\hat{\beta}_2 - 1$, whereas the Wald test for the latter is based on $\hat{\beta}_1 - 1/\hat{\beta}_2$. It turns out that these two tests perform very differently; see, e.g., Gregory and Veall (1985) and Phillips and Park (1988). In particular, the Wald test for $\beta_1 = 1/\beta_2$ rejects far too often when the null hypothesis is indeed correct (i.e., the empirical size is much larger than the nominal size). Moreover, these two tests result in conflicting conclusions quite often. Hence, inferences from testing nonlinear hypotheses should be drawn with care.
Exercises
7.1 Suppose that $Q_T(\beta)$ is quadratic in $\beta$:
$$ Q_T(\beta) = a + b'\beta + \beta'C\beta, $$
where $a$ is a scalar, $b$ a vector and $C$ a symmetric, positive definite matrix. Find the first order condition for minimizing $Q_T(\beta)$ and the resulting solution. Is the OLS criterion function (3.2) quadratic in $\beta$?
7.2 Let $\hat{e}_t = y_t - \hat{y}_t$ denote the $t$-th NLS residual. Is $\sum_{t=1}^{T}\hat{e}_t$ zero in general? Why or why not?
7.3 Given the nonlinear specification of the CES production function
$$ \ln y = \ln\alpha - \frac{1}{\lambda}\ln\big[\delta L^{-\lambda} + (1-\delta)K^{-\lambda}\big] + e, $$
find the second order Taylor expansion of $\ln y$ around $\lambda = 0$. How would you estimate this linearized function and how can you calculate the original parameters $\alpha$, $\delta$, and $\lambda$?
7.4 Suppose that $\mathrm{IE}[Q_T(\beta)]$ is a continuous function on the compact parameter space $\Theta_1$ such that $\beta_o$ is its unique, global minimum. Also suppose that the NLS estimator $\hat{\beta}_T$ is such that
$$ \mathrm{IE}[Q_T(\hat{\beta}_T)] = \inf_{\beta\in\Theta_1}\mathrm{IE}[Q_T(\beta)]. $$
Prove that when $Q_T$ has a WULLN effect, $\hat{\beta}_T$ converges in probability to $\beta_o$.
7.5 Apply Theorem 7.2 to discuss the consistency property of the OLS estimator for the linear specification $y_t = x_t'\beta + e_t$.
7.6 Let $\epsilon_t = y_t - f(x_t;\beta_o)$. If $\epsilon_t$ is a martingale difference sequence with respect to $\mathcal{Y}^{t-1}$ and $\mathcal{W}^t$ such that $\mathrm{IE}(\epsilon_t^2\mid\mathcal{Y}^{t-1},\mathcal{W}^t) = \sigma_o^2$, state the conditions under which $\hat{\sigma}_T^2 = \sum_{t=1}^{T}\hat{e}_t^2/T$ is consistent for $\sigma_o^2$.
7.7 Let $\epsilon_t = y_t - f(x_t;\beta^{*})$, where $\beta^{*}$ may not be the same as $\beta_o$. If $\epsilon_t$ is not a martingale difference sequence with respect to $\mathcal{Y}^{t-1}$ and $\mathcal{W}^t$, give consistent estimators for $V_T^{*}$ and $D_T^{*}$.
References
Amemiya, Takeshi (1985). Advanced Econometrics, Cambridge, MA: Harvard University Press.
Davidson, Russell and James G. MacKinnon (1993). Estimation and Inference in Econometrics, New York, NY: Oxford University Press.
Bierens, Herman J. (1994). Topics in Advanced Econometrics, New York, NY: Cambridge University Press.
Gallant, A. Ronald (1987). Nonlinear Statistical Models, New York, NY: John Wiley & Sons.
Gallant, A. Ronald and Halbert White (1988). A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models, Oxford, UK: Basil Blackwell.
Gregory, Allan W. and Michael R. Veall (1985). Formulating Wald tests of nonlinear restrictions, Econometrica, 53, 1465-1468.
Kuan, Chung-Ming and Halbert White (1994). Artificial neural networks: An econometric perspective, Econometric Reviews, 13, 1-91.
Phillips, Peter C. B. and Joon Y. Park (1988). On the formulation of Wald tests of nonlinear restrictions, Econometrica, 56, 1065-1083.