Probability and Statistics Guide

Econometric Theory
Martin Wittenberg
School of Economics and SALDRU
University of Cape Town
2011
Contents

I Probability and Statistics

1 Probability and Distribution Theory 1
1.1 Probability
1.1.1 Some probability theorems
1.1.2 Conditional probability
1.2 Random variables and probability distributions
1.2.1 Exercises
1.3 Expectations of a random variable
1.4 Specific univariate discrete distributions
1.4.1 Bernoulli
1.4.2 Binomial
1.5 Specific univariate continuous probability distributions
1.5.1 Normal
1.5.2 Chi-squared
1.5.3 Student's t
1.5.4 F
1.5.5 Noncentral $\chi^2$, $t$ and $F$ distributions
1.5.6 Gamma
1.5.7 Exponential
1.5.8 Beta
1.5.9 Logistic
1.5.10 Cauchy
1.5.11 Uniform
1.6 Transformations of a random variable
1.6.1 The lognormal distribution
1.7 Appendix: Moment generating function
1.7.1 Exercises

2 Probability and Distribution Theory II
2.1 Joint distributions
2.1.1 Discrete joint distributions
2.1.2 Continuous joint distributions
2.2 Marginal distributions
2.3 Conditional distributions
2.3.1 Discrete conditional distributions
2.3.2 Continuous conditional distributions
2.3.3 Statistical independence of random variables
2.4 Expectations of multivariate distributions
2.4.1 Conditional expectations
2.4.2 Covariance
2.4.3 Expectations of vectors and matrices
2.4.4 Covariance matrix
2.5 Some commonly used multivariate distributions
2.5.1 Uniform
2.5.2 Bivariate normal
2.5.3 Multivariate normal

3 Sampling and Estimation
3.1 Samples, statistics and point estimates
3.2 Maximum Likelihood Estimation
3.2.1 Exercises
3.3 Method of Moments Estimation
3.3.1 Exercises
3.4 Other rules
3.4.1 Rules of thumb
3.4.2 Bayesian estimation
3.4.3 Pretest estimators
3.4.4 Bias adjusted estimators
3.5 Sampling distribution
3.5.1 Exercises
3.6 Finite sample properties of estimators
3.6.1 Bias
3.6.2 Minimum variance
3.6.3 Mean Square Error
3.6.4 Invariance
3.7 Monte Carlo simulations

4 Asymptotic Theory
4.1 Introduction
4.2 Sequences, limits and convergence
4.2.1 The limit of a mathematical sequence
4.2.2 The probability limit of a sequence of random variables
4.2.3 Rules for probability limits
4.2.4 Convergence in distribution
4.2.5 Rates of convergence
4.3 Sampling, consistency and laws of large numbers
4.3.1 Consistency
4.3.2 Consistency of the sample CDF
4.3.3 Consistency of method of moments estimation
4.4 Asymptotic normality and central limit theorems
4.5 Properties of Maximum Likelihood Estimators
4.6 Appendix
4.6.1 Chebyshev's Inequality

5 Statistical Inference
5.1 Hypothesis Testing
5.1.1 Type I and Type II errors
5.1.2 Power of a test
5.2 Types of tests
5.2.1 The Wald Test
5.2.2 The likelihood ratio test
5.2.3 The Lagrange Multiplier test
5.3 Worked example: The Pareto distribution
5.3.1 Wald test
5.3.2 Likelihood ratio test
5.3.3 Lagrange multiplier test
5.4 Worked example: The bivariate normal
5.4.1 Wald Test of a single hypothesis
5.4.2 Wald Test of the joint hypothesis
5.4.3 Likelihood Ratio test
5.5 Appendix: ML estimation of the bivariate normal distribution
5.5.1 Maximum likelihood estimators
5.5.2 Information matrix
5.5.3 Asymptotic covariance matrix
5.5.4 Log-likelihood
5.6 Restricted Maximum Likelihood estimation
5.6.1 Restricted Maximum Likelihood Estimators
5.6.2 Restricted loglikelihood

II Single equation estimation

6 Thinking about social processes econometrically
6.1 Setting up an econometric research problem
6.2 The econometric model
6.2.1 Abstraction and causality
6.2.2 The Rubin causal model
6.2.3 Experimentation
6.3 The process of information recovery
6.3.1 Properties of estimators and of rules of inference
6.4 Examples of econometric research problems
6.4.1 The Keynesian consumption function
6.4.2 Estimating the unemployment rate
6.5 Types of inverse problems
6.6 The classical linear regression model
6.6.1 Matrix representation
6.6.2 Assumptions
6.7 Exercises

7 Least Squares
7.1 Introduction
7.2 The Least Squares criterion
7.2.1 The solution to the OLS problem
7.3 The geometry of Least Squares
7.3.1 Projection: the $P$ matrix and the $M$ matrix
7.3.2 Algebraic properties of the Least Squares Solution
7.4 Partitioned regression
7.4.1 The Frisch-Waugh-Lovell Theorem
7.4.2 Interpretation of the FWL theorem
7.4.3 Alternative proof
7.4.4 Applications of the FWL theorem
7.4.5 Omitted variable bias
7.5 Goodness of Fit
7.6 Exercises
7.7 Appendix: A worked example

8 Properties of the OLS estimators in finite samples
8.1 Introduction
8.2 Motivations for OLS
8.2.1 Method of moments
8.2.2 Minimum Variance Linear Unbiased Estimation
8.2.3 Maximum likelihood estimation
8.3 The mean and covariance matrix of the OLS estimator
8.3.1 Unbiased Estimation
8.3.2 The covariance matrix of $\hat{\beta}$
8.3.3 Estimating $\sigma^2$
8.4 Gauss-Markov Theorem
8.5 Stochastic, but exogenous regressors
8.5.1 Lack of bias
8.5.2 The covariance matrix of $\hat{\beta}$
8.5.3 The estimator of $\sigma^2$
8.5.4 Gauss-Markov Theorem
8.6 The normal linear regression model
8.6.1 Finite sample distribution of the OLS estimators
8.6.2 Maximum likelihood estimation
8.6.3 The information matrix
8.7 Data Issues
8.7.1 Multicollinearity
8.7.2 Influential data points
8.7.3 Missing information
8.8 Appendix
8.8.1 The trace of a matrix
8.8.2 Results on the multivariate normal distribution

9 Asymptotic properties of the OLS estimators
9.1 Introduction
9.2 The sampling process
9.3 Asymptotic properties of $\hat{\beta}$
9.3.1 Consistency of $\hat{\beta}$
9.3.2 Asymptotic normality of $\hat{\beta}$
9.4 Asymptotic properties of $e$, $\hat{\sigma}^2$ and $\widehat{\mathrm{var}}(\hat{\beta})$
9.4.1 Consistency of $e$ as an estimator of $\varepsilon$
9.4.2 Consistency of $\hat{\sigma}^2$ as an estimator of $\sigma^2$
9.4.3 Asymptotic normality of $\hat{\sigma}^2$
9.4.4 Consistency of $\hat{\sigma}^2 (X'X)^{-1}$ as an estimator for $\mathrm{var}(\hat{\beta})$
9.5 Appendix: Alternative proof of consistency of $\hat{\beta}$

10 Inference and prediction in the CLRM
10.1 Introduction
10.2 Wald type tests
10.2.1 A Wald test
10.2.2 F test
10.2.3 t tests
10.3 Likelihood ratio like tests
10.3.1 Asymptotic LR test
10.3.2 Precise results: F test
10.3.3 Restricted least squares: Reparameterising the model
10.4 LM type tests
10.5 Equivalence between tests
10.6 Non-linear transformations of the estimators: the delta method
10.7 Nonlinear relationships
10.8 Prediction
10.9 Exercises

11 Generalised Regression Models
11.1 Estimation with a known general noise covariance matrix
11.1.1 The model
11.2 The impact of ignoring that $\Omega \neq I$
11.2.1 Point estimation of $\beta$
11.2.2 Point estimation of $\sigma^2$
11.2.3 Point estimation of $\mathrm{var}(\hat{\beta})$
11.2.4 Hypothesis testing
11.3 Transforming the data: Generalised Least Squares
11.3.1 Derivation of the GLS estimator
11.3.2 Properties of the GLS estimator
11.3.3 Alternative derivation of the GLS estimator
11.3.4 Estimation of $\sigma^2$
11.4 Exercises

12 Estimation with an unknown general noise covariance matrix $\sigma^2 \Omega$
12.1 Feasible Generalised Least Squares
12.1.1 The problem of estimating $\Omega$
12.1.2 Approach to estimating $\Omega$
12.1.3 Properties of the FGLS estimator
12.2 OLS with robust estimation of the covariance matrix
12.2.1 Heteroscedasticity consistent standard errors
12.2.2 Heteroscedasticity and autocorrelation consistent (HAC) standard errors
12.3 Summing up

13 Heteroscedasticity and Autocorrelation
13.1 Introduction
13.2 Tests for heteroscedasticity
13.2.1 Breusch-Pagan-Godfrey test
13.2.2 White test
13.2.3 Other tests
13.3 Tests for autocorrelation
13.3.1 Breusch-Godfrey test
13.3.2 Durbin-Watson d test
13.4 Pretest estimation
13.5 A warning note
13.6 Exercises

III Estimation with endogenous regressors - IV and GMM

14 Instrumental Variables
14.1 Introduction
14.1.1 The model
14.1.2 Least squares bias and inconsistency
14.1.3 Examples
14.1.4 The problem of nonexperimental data
14.2 The instrumental variables solution
14.2.1 Rationale
14.2.2 Consistency
14.2.3 Asymptotic normality
14.3 The overidentified case
14.3.1 Two stage least squares
14.3.2 Test of the overidentifying restrictions
14.4 IV and Ordinary Least Squares
14.4.1 OLS as a special case of IV estimation
14.4.2 Hausman specification test
14.4.3 Hausman's test by means of an artificial regression
14.5 Problems with IV estimation
14.5.1 Finite sample properties
14.5.2 Weak instruments
14.6 Omitted variables
14.7 Measurement error
14.7.1 Attenuation bias
14.7.2 Errors in variables estimator
14.7.3 Instrumental variables solution
14.8 Exercises

15 Estimation by Generalised Method of Moments (GMM)
15.1 The moments of a Pareto distribution
15.2 Definition and properties of the GMM estimator
15.2.1 Assumptions
15.2.2 Consistency
15.2.3 Asymptotic normality
15.2.4 Estimating the covariance matrices
15.3 Optimal GMM and Estimated Optimal GMM
15.4 Lessons from the Pareto distribution
15.5 GMM estimator of the linear model with exogenous regressors
15.6 GMM estimator of the linear model with endogenous regressors

IV Systems of Equations

16 Estimation of equations by OLS and GLS
16.1 Introduction
16.2 Stacking the equations
16.3 Assumptions
16.4 Estimation by OLS
16.5 Estimation by GLS
16.5.1 Notation
16.5.2 Some caution
16.6 Estimation by FGLS
16.7 Exercises
16.8 Appendix: A worked example
16.9 Appendix: The Kronecker product

17 System estimation by Instrumental Variables and GMM

18 Simultaneous Equation Models

Solutions
Solutions to Chapter 14

Part I
Probability and Statistics
Chapter 1
Probability and Distribution Theory 1

1.1 Probability
The notion of probability is fundamental to everything that we will be doing in this course,
but a proper (axiomatic) treatment of it is beyond the scope of this course. At the core of the
theory of probability is the concept of the sample space $\Omega$, i.e. the set of all possible outcomes
of the random experiment. An event $A$ is said to occur if, and only if, the outcome $\omega$ of the
experiment is such that $\omega \in A$.
Example 1.1 Consider throwing a die. The sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$ and a possible
event is throwing an even number, i.e. $A = \{2, 4, 6\}$. Note that we have to be able to
define all possible outcomes of the experiment. We have excluded certain outcomes, e.g. the die
shattering or standing on edge or disappearing down a drain hole!
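The fundamental defining axioms of probability theory are given by (Mittelhammer, Judge
and Miller 2000, Appendix E1, pp. 4–5):
1. For any event $A \subseteq \Omega$, $P(A) \ge 0$, where $\Omega$ is the sample space.
2. $P(\Omega) = 1$.
3. Let $\{A_i \mid i \in I\}$ be a set of disjoint events contained in $\Omega$, where $I$ is a set of positive
integers (i.e. $I$ is finite or countably infinite); then
$$P\left(\bigcup_{i \in I} A_i\right) = \sum_{i \in I} P(A_i)$$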
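1.1.1 Some probability theorems

In what follows $\bar{A}$ denotes the complement of the event $A$ in the sample space $\Omega$.
$$P(A) = 1 - P(\bar{A}) \tag{1.1}$$
$$P(\emptyset) = 0 \tag{1.2}$$
$$A \subseteq B \Rightarrow P(A) \le P(B) \text{ and } P(B \setminus A) = P(B) - P(A) \tag{1.3}$$
$$P(A) = P(A \cap B) + P(A \cap \bar{B}) \tag{1.4}$$
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{1.5}$$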
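The theorems (1.1) to (1.5) can be checked by brute force on the die example. The sketch below is only an illustration: it assumes a fair die (equally likely outcomes), and the events `B` and `C` as well as the helper names are hypothetical choices that do not appear in the text.

```python
from fractions import Fraction

# Sample space for one throw of a die; equal likelihood is an assumption.
OMEGA = frozenset(range(1, 7))


def prob(event):
    """P(event) under equally likely outcomes; the event must lie in OMEGA."""
    event = frozenset(event)
    if not event <= OMEGA:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(OMEGA))


def complement(event):
    """Complement of an event relative to the sample space."""
    return OMEGA - frozenset(event)


A = frozenset({2, 4, 6})  # throwing an even number (the event of Example 1.1)
B = frozenset({4, 5, 6})  # hypothetical second event: more than three
C = frozenset({2, 4})     # hypothetical subset of A

checks = {
    "(1.1) complement": prob(A) == 1 - prob(complement(A)),
    "(1.2) empty set": prob(frozenset()) == 0,
    "(1.3) subset": prob(C) <= prob(A) and prob(A - C) == prob(A) - prob(C),
    "(1.4) split by B": prob(A) == prob(A & B) + prob(A & complement(B)),
    "(1.5) union": prob(A | B) == prob(A) + prob(B) - prob(A & B),
}

for name, holds in checks.items():
    print(name, holds)
```

Exact fractions are used so that the equalities are tested without rounding error.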
1.1.2 Conditional probability
In certain situations we know that event $B$ has definitely occurred. In this case we want to
recalibrate the probabilities of other events occurring.
Definition 1.2 If $P(B) \neq 0$, then the conditional probability of event $A$ given event $B$ is given
by $P(A|B) = P(A \cap B) / P(B)$.
Note that using this definition it is trivial to show that $P(B|B) = 1$. In other words we are
effectively redefining the sample space to include only events that belong in $B$.
It follows from the definition that
$$\Pr(A \cap B) = \Pr(A|B) \Pr(B)$$
By applying this rule repeatedly we can extend this to any finite number of events:
$$\Pr(A_1 \cap A_2 \cap \dots \cap A_n) = \Pr(A_1 | A_2 \cap \dots \cap A_n) \Pr(A_2 | A_3 \cap \dots \cap A_n) \cdots \Pr(A_{n-1} | A_n) \Pr(A_n)$$
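Theorem 1.3 Total probability
If the events $B_i$ are such that $P(A|B_i)$ is defined for all $i$ and the events are mutually disjoint,
i.e. $B_i \cap B_j = \emptyset$ for $i \neq j$, and $\bigcup_i B_i = \Omega$ (the sample space), then
$$P(A) = \sum_i P(A|B_i) P(B_i)$$
Theorem 1.4 Bayes's Rule
If $P(B) > 0$
$$P(A|B) = \frac{P(B|A) P(A)}{P(B)}$$
Or more generally if the conditions enumerated in the previous theorem hold, then
$$P(B_j|A) = \frac{P(A|B_j) P(B_j)}{\sum_i P(A|B_i) P(B_i)} \quad \text{for all } j$$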
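A small numerical illustration may help to fix Theorems 1.3 and 1.4. Everything in the sketch below is hypothetical: we imagine three machines $B_1$, $B_2$, $B_3$ that between them produce all output (so they partition the sample space), and we let $A$ be the event that an item is defective. The probabilities are made up for the example.

```python
from fractions import Fraction

# Hypothetical prior probabilities P(B_i): share of output from each machine.
prior = {"B1": Fraction(1, 2), "B2": Fraction(3, 10), "B3": Fraction(1, 5)}

# Hypothetical conditional probabilities P(A | B_i): defect rate per machine.
likelihood = {"B1": Fraction(1, 100), "B2": Fraction(2, 100), "B3": Fraction(5, 100)}

# The B_i must be disjoint and exhaust the sample space.
assert sum(prior.values()) == 1

# Theorem 1.3 (total probability): P(A) = sum_i P(A | B_i) P(B_i)
p_A = sum(likelihood[b] * prior[b] for b in prior)

# Theorem 1.4 (Bayes's rule): P(B_j | A) = P(A | B_j) P(B_j) / P(A)
posterior = {b: likelihood[b] * prior[b] / p_A for b in prior}

print("P(A) =", p_A)
for b, p in posterior.items():
    print(f"P({b} | A) = {p}")
```

Note that the posterior probabilities sum to one, as they must, since the denominator in Bayes's rule is exactly the total probability of $A$.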
We say that two events are independent if knowledge that one event occurred does not
change the probability that we would assign to the other event occurring, i.e.
$$\Pr(A|B) = \Pr(A)$$
It then follows immediately that $\Pr(A \cap B) = \Pr(A) \Pr(B)$. This is, in fact, how we will define
the statistical independence of events:
Definition 1.5 $A$ and $B$ are pairwise independent events if, and only if
$$\Pr(A \cap B) = \Pr(A) \Pr(B)$$
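Definition 1.5 is easy to test mechanically. The following sketch again assumes a fair die; the three events are hypothetical choices made so that one pair is independent and the other is not.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # fair die, equally likely outcomes (assumption)


def prob(event):
    """P(event) under equally likely outcomes."""
    return Fraction(len(frozenset(event) & OMEGA), len(OMEGA))


def independent(a, b):
    """Definition 1.5: Pr(A and B) == Pr(A) * Pr(B)."""
    return prob(a & b) == prob(a) * prob(b)


A = frozenset({2, 4, 6})     # even number
B = frozenset({1, 2, 3, 4})  # at most four
C = frozenset({4, 5, 6})     # more than three

print("A, B independent:", independent(A, B))
print("A, C independent:", independent(A, C))

# For the independent pair, conditioning on B leaves Pr(A) unchanged.
print("Pr(A | B) =", prob(A & B) / prob(B), " Pr(A) =", prob(A))
```

Knowing that the throw is at most four tells us nothing about whether it is even, whereas knowing that it is more than three makes an even number more likely.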
1.2 Random variables and probability distributions
A random variable is a mapping from the sample space to the real numbers. In other words
it is the outcome of a random experiment with real number values. We can define how probable
certain of these values are. A useful way of summarising this information is by means of the
concept of a probability distribution.
A random variable is discrete if the set of outcomes is either finite or countable. The random
variable is continuous if the set of outcomes is not countable.
In the case of a discrete random variable, we can enumerate the probabilities associated with
the outcomes. This gives the discrete probability distribution¹.
$$f(x) = \Pr(X = x)$$
This will have the properties
$$0 \le f(x) \le 1$$
$$\sum_x f(x) = 1$$
For the (absolutely) continuous case we can define a probability density function $f(x)$ which
has the properties
$$f(x) \ge 0$$
$$\Pr(a \le X \le b) = \int_a^b f(x)\,dx$$
$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
Note that $\int_a^a f(x)\,dx = 0$, so $\Pr(X = a) = 0$. Nevertheless there are situations where we
want to combine continuous and discrete distributions, i.e. there may be particular points in the
distribution (e.g. where $x = 0$) where there is a spike in the distribution. Such a concentration
of probability at a single point is called a point mass. For such mixed distributions we need to
define separately the density function for continuous points and for the discrete points (for
more details see Mittelhammer et al. 2000, Chapter E1). In this case $\int f(x)\,dx + \sum_i f(x_i) = 1$,
where the sum is taken over all points $x_i$ at which the distribution has a point mass.
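The properties of a continuous density can be checked numerically. The sketch below uses a hypothetical density, $f(x) = 2x$ on $[0, 1]$ and zero elsewhere, which is not taken from the text, together with a simple midpoint rule for the integrals.

```python
def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h


def f(x):
    """Hypothetical density: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def prob(a, b):
    """Pr(a <= X <= b) as the area under the density between a and b."""
    return integrate(f, a, b)


total = integrate(f, 0.0, 1.0)
print("total probability:", total)
print("Pr(0.25 <= X <= 0.5):", prob(0.25, 0.5))
print("Pr(X = 0.3):", prob(0.3, 0.3))
```

The last line illustrates the remark that a single point carries zero probability under a continuous distribution: the interval of integration has zero width.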
All types of distribution can be uniquely described by the cumulative distribution function (cdf).
$$F(x) = \Pr(X \le x)$$
This function must satisfy the following properties:
1. $0 \le F(x) \le 1$
2. If $x > y$, then $F(x) \ge F(y)$
3. $\lim_{x \to +\infty} F(x) = 1$
4. $\lim_{x \to -\infty} F(x) = 0$
5. $F$ must be right continuous, i.e.
$$\lim_{y \downarrow x} F(y) = F(x)$$
¹ In some statistical texts this is referred to as a probability mass function to distinguish it from a probability
density function (pdf) which applies to continuous variables. We will refer to both of them as probability density
functions.
It is easy to see that the cdf of a discrete distribution must have jumps upwards at all the
values where it has a point mass. In fact we must have
$$f(x) = F(x) - F(x^-)$$
where $F(x^-) = \lim_{y \uparrow x} F(y)$ is the left limit at $x$.
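The relationship between the jumps of the cdf and the point masses can be sketched in a few lines. The probability function used here is hypothetical (masses of 1/4, 1/2 and 1/4 at the points 0, 1 and 2) and is chosen only to keep the arithmetic simple.

```python
from fractions import Fraction

# Hypothetical discrete distribution: point masses at 0, 1 and 2.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def cdf(x):
    """F(x) = Pr(X <= x)."""
    return sum((p for v, p in pmf.items() if v <= x), Fraction(0))


def cdf_left_limit(x):
    """F(x-) = Pr(X < x), the left limit of the cdf of a discrete variable."""
    return sum((p for v, p in pmf.items() if v < x), Fraction(0))


def jump(x):
    """Size of the jump of the cdf at x, i.e. F(x) - F(x-)."""
    return cdf(x) - cdf_left_limit(x)


for v in sorted(pmf):
    print(f"x = {v}: F(x) = {cdf(v)}, F(x-) = {cdf_left_limit(v)}, jump = {jump(v)}")
```

At every point mass the jump equals the probability at that point, and at any other point (for instance $x = 1/2$) the jump is zero.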
In the case of continuous random variables (and mixed distributions at points where there is
no jump discontinuity) we will have
$$f(x) = \frac{dF(x)}{dx}$$

1.2.1 Exercises
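1. Consider the following function:
$$f(x) = \begin{cases} \frac{1}{2} & \text{if } x = -2 \\ \frac{1}{4} & \text{if } x = 0 \\ \frac{1}{4} & \text{if } x = 1 \\ 0 & \text{elsewhere} \end{cases}$$
(a) Is $f$ a valid pdf? If not, find an appropriate way to turn it into a proper pdf.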
(b) What is Pr ( 1)?
(c) Sketch the cdf of the distribution.
(d) Verify that $f(1) = F(1) - F(1^-)$.

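2. (Harder) Consider the following function:
$$f(x) = \begin{cases} \frac{1}{4} & \text{if } x = \frac{1}{2} \\ \frac{1}{8} & \text{if } x = \frac{3}{4} \\ \frac{1}{16} & \text{if } x = \frac{7}{8} \\ \vdots & \vdots \\ \left(\frac{1}{2}\right)^{n+1} & \text{if } x = 1 - \left(\frac{1}{2}\right)^n \\ \vdots & \vdots \\ \frac{1}{2} & \text{if } x = 1 \end{cases}$$
(a) Is $f$ a valid pdf? If not, find an appropriate way to turn it into a proper pdf.
(d) Verify that $f(1) = F(1) - F(1^-)$.²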
3. Consider the function
$$f(x) = \begin{cases} 5 & \text{if } |x| \le 0.1 \\ 0 & \text{elsewhere} \end{cases}$$
(a) Is $f$ a valid pdf? If not, find an appropriate way to turn it into a proper pdf. Then
sketch the pdf.
² Does this example suggest a problem with equation (B-7) in Greene (p. 846)?
4. Consider the function
$$f(x) = \begin{cases} 1 - |x| & \text{if } |x| \le 1 \\ 0 & \text{elsewhere} \end{cases}$$
(a) Is $f$ a valid pdf? If not, find an appropriate way to turn it into a proper pdf. Then
sketch the pdf.
(b) What is Pr 12 ?

() =
1 2 if
0
|| 1
elsewhere
where is a constant to be determined.

(a) Find an appropriate way to turn into a proper pdf. Then sketch the pdf.
(b) What is Pr 12 ?
0 if
if
() =
1 if
0
01
1
(a) Is this a valid cdf? If not, find the most appropriate way to turn it into a valid cdf.
(b) Is it the cdf of a discrete, continuous or mixed distribution?
(c) Describe the pdf of the distribution
() =
if
if
if
1
2
0
01
1
(c) Describe the pdf of the distribution.
() =
1
2
0
+
1
if
if
if
0
01
1
(c) Describe the pdf of the distribution.
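1.3 Expectations of a random variable

Definition 1.6 Mean of a Random Variable
The mean or expected value of a random variable $X$ is
$$E[X] = \begin{cases} \sum_x x f(x) & \text{if } X \text{ is discrete} \\ \int x f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$
Usually it is denoted as $\mu$.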
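Let $g(X)$ be a function of $X$. We can define a new random variable $Y = g(X)$, defined such
that $y = g(x)$. We can calculate the expected value of $Y$. It is given by
$$E[g(X)] = \begin{cases} \sum_x g(x) f(x) & \text{if } X \text{ is discrete} \\ \int g(x) f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$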
It is easy to verify that the expectations operator is linear, i.e.
$$E(a + bX) = a + bE(X)$$
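Definition 1.7 Variance of a Random Variable
The variance of a random variable $X$ with mean $\mu$ is
$$Var(X) = E\left[(X - \mu)^2\right] = \begin{cases} \sum_x (x - \mu)^2 f(x) & \text{if } X \text{ is discrete} \\ \int (x - \mu)^2 f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$
It is usually denoted as $\sigma^2$.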
It is again easy to observe that

( + ) = 2 ()
The mean and variance of a distribution are specific examples of moments of the distribution:
Definition 1.8 Moments of a distribution
Let $r$ be a non-negative integer.
1. $r$-th moment about the origin: $\mu'_r = E[X^r]$
2. $r$-th moment about the mean: $\mu_r = E[(X-\mu)^r]$

Note that according to this definition, $\mu = \mu'_1$ and $\mu_2 = \sigma^2$. There is an interesting interrelationship between these types of moments. In the case of the variance it is easy to show
that
\[
\sigma^2 = \mu'_2 - \mu^2
\]
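The definitions above can be checked numerically on a small discrete distribution. The sketch below is only an illustration: the fair six-sided die and the helper names `raw_moment` and `central_moment` are assumptions made for this example, not part of the text. Exact fractions are used so that the identity $\sigma^2 = \mu'_2 - \mu^2$ holds without rounding error.

```python
from fractions import Fraction

# Hypothetical example (not from the text): a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def raw_moment(pmf, r):
    """r-th moment about the origin, E[X^r]."""
    return sum(x**r * p for x, p in pmf.items())


def central_moment(pmf, r):
    """r-th moment about the mean, E[(X - mu)^r]."""
    mu = raw_moment(pmf, 1)
    return sum((x - mu) ** r * p for x, p in pmf.items())


mu = raw_moment(pmf, 1)          # first moment about the origin
sigma2 = central_moment(pmf, 2)  # second moment about the mean

print("mean:", mu)
print("variance:", sigma2)
print("mu'_2 - mu^2:", raw_moment(pmf, 2) - mu**2)
```

Both routes to the variance give the same fraction, which is the content of the identity above.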
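Some other useful statistics:
\[
\text{coefficient of skewness} = \frac{\mu_3}{\mu_2^{3/2}} = \frac{E\left[(X-\mu)^3\right]}{\sigma^3}
\]
\[
\text{coefficient of kurtosis} = \frac{\mu_4}{\mu_2^{2}} = \frac{E\left[(X-\mu)^4\right]}{\sigma^4}
\]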
Note that these are standardised measures, because we have normalised them relative to the
variance of the distribution. This means that they are independent of the particular units in
which the outcomes of the random variable are measured. A symmetric distribution will have a
skewness of zero. A distribution with a longer right tail (skewed to the right) will have a positive
skewness. The kurtosis measures how peaked the distribution is. The normal distribution has
a kurtosis of 3. Any distribution which has a kurtosis higher than 3 is said to be leptokurtic
(thin peaked) while if it has a kurtosis less than 3 it is said to be platykurtic (flat peaked).
It turns out that in certain cases it may be possible to calculate the moments in a different
way by means of the moment generating function. We define this in the appendix to this
chapter.
1.4 Specific univariate discrete distributions

1.4.1 Bernoulli
This is the simplest random variable. It can take on only two values, zero and one. The full
description of the pdf is:
\[
f(1) = \Pr(X = 1) = p
\]
\[
f(0) = \Pr(X = 0) = (1 - p)
\]
where $0 \le p \le 1$. This can be put more elegantly as follows
\[
f(x|p) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\},\ p \in [0, 1]
\]
Applying the definitions it is straightforward to show that the mean of this random variable is
$p$ and its variance is $p(1-p)$.
1.4.2 Binomial
The binomial distribution is used to model the number of successes in $n$ experiments, where
each trial is independent of the next and the outcome of each trial is a Bernoulli random variable
with the same parameter $p$. The pdf of this random variable is given by
\[
f(x|n, p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x \in \{0, 1, \ldots, n\},\ p \in [0, 1]
\]
The mean is $np$ and its variance is $np(1-p)$.
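As a quick numerical check of the binomial pdf and its moments, the following sketch evaluates the pdf over its whole support and compares the resulting mean and variance with $np$ and $np(1-p)$. The parameter values and the function name `binomial_pmf` are assumptions chosen for illustration only.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Binomial pdf: C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Hypothetical parameter values, chosen only for illustration.
n, p = 10, 0.3

probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(probs)
mean = sum(x * q for x, q in enumerate(probs))
var = sum((x - mean) ** 2 * q for x, q in enumerate(probs))

print("sum of probabilities:", total)
print("mean:", mean, "vs n*p =", n * p)
print("variance:", var, "vs n*p*(1-p) =", n * p * (1 - p))
```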
Exercises
1. Assume that the probability of passing a certain econometrics test is 0.8 for every member
of the class. There are six people in the class. What is
(a) The probability that at least one person passes?
(b) The probability that three or more people pass?
1.5 Specific univariate continuous probability distributions

1.5.1 Normal
The pdf of the normal distribution with mean $\mu$ and variance $\sigma^2$ is given by
\[
f(x|\mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in (-\infty, \infty),\ \sigma^2 \in (0, \infty) \tag{1.6}
\]
We write this as $X \sim N(\mu, \sigma^2)$. The pdf of a $N(0, 1)$ random variable is frequently written as
$\phi(x)$ while the associated cdf is often written as $\Phi(x)$. The normal distribution is symmetric
(has a coefficient of skewness of zero) and has a coefficient of kurtosis of 3.
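The normal pdf in equation (1.6) can be evaluated directly and compared with a library implementation. The sketch below uses `statistics.NormalDist` from the Python standard library (which is parameterised by the standard deviation, not the variance); the parameter values and the function name `normal_pdf` are assumptions made for this illustration.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x, mu, sigma2):
    """Normal pdf written out as in equation (1.6)."""
    return exp(-((x - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)


# Hypothetical parameter values, for illustration only.
mu, sigma2 = 0.0, 9.0
dist = NormalDist(mu, sqrt(sigma2))  # NormalDist takes the standard deviation

x = 1.5
print("pdf from formula:", normal_pdf(x, mu, sigma2))
print("pdf from library:", dist.pdf(x))

# Symmetry about the mean: F(mu + a) + F(mu - a) = 1.
a = 2.0
print("F(mu+a) + F(mu-a):", dist.cdf(mu + a) + dist.cdf(mu - a))
```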
Exercises
1. Using a suitable package graph the pdfs of the following distributions: $N(1, 1)$, $N(1, 2)$,
$N(1, 4)$ on the same set of axes.
2. Calculate the value of the cdf at $x = 2.8$ for these three distributions.
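1.5.2 Chi-squared
If $Z \sim N(0, 1)$, then $Z^2 \sim \chi^2(1)$. More generally, the pdf of $X \sim \chi^2(\nu)$ is given by
\[
f(x|\nu) = \frac{x^{\nu/2 - 1} e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)}, \quad x \in (0, \infty),\ \nu \in \{1, 2, 3, \ldots\} \tag{1.7}
\]
The parameter $\nu$ is called the degrees of freedom of the distribution. The function $\Gamma(\alpha)$ is
the gamma function defined as $\Gamma(\alpha) = \int_0^\infty t^{\alpha - 1} e^{-t}\,dt$. It is a generalisation of the factorial
function and satisfies
\[
\Gamma(1) = 1 \quad \text{and} \quad \Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha)
\]
(the second equality can be shown by integrating by parts). For positive integers $n$, it simplifies
to
\[
\Gamma(n) = (n - 1)!
\]
The mean of the chi-square distribution is given by
\[
E(X) = \nu
\]
and its variance by
\[
\mathrm{Var}(X) = 2\nu
\]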
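The properties of the gamma function and the moments of the chi-square distribution can be verified numerically. This is a minimal sketch: the choice $\nu = 6$, the helper names and the simple midpoint integration rule are assumptions made for illustration, and the integrals are truncated at a large upper limit where the density is negligible.

```python
from math import exp, factorial, gamma, isclose


def chi2_pdf(x, nu):
    """Chi-square pdf with nu degrees of freedom, as in equation (1.7)."""
    return x ** (nu / 2 - 1) * exp(-x / 2) / (2 ** (nu / 2) * gamma(nu / 2))


def integrate(fn, a, b, steps=100_000):
    """Midpoint rule on [a, b]; a simple stand-in for a proper quadrature routine."""
    h = (b - a) / steps
    return h * sum(fn(a + (i + 0.5) * h) for i in range(steps))


# Gamma function properties quoted in the text.
print(gamma(5), factorial(4))      # Gamma(n) = (n - 1)!
print(gamma(3.5), 2.5 * gamma(2.5))  # Gamma(a + 1) = a * Gamma(a)

nu = 6        # hypothetical degrees of freedom
upper = 150.0  # the density is negligible beyond this point
total = integrate(lambda x: chi2_pdf(x, nu), 0.0, upper)
mean = integrate(lambda x: x * chi2_pdf(x, nu), 0.0, upper)
var = integrate(lambda x: (x - mean) ** 2 * chi2_pdf(x, nu), 0.0, upper)
print("area:", total, "mean:", mean, "variance:", var)
```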
Exercises
1. Graph the following pdfs: $\chi^2(1)$, $\chi^2(4)$, $\chi^2(10)$
2. Calculate the value of the cdf at $x = 10$ in these three cases.
1.5.3 Student's t
If $Z \sim N(0, 1)$ and $X \sim \chi^2(\nu)$, with $Z$ and $X$ independent of each other, then $T = \frac{Z}{\sqrt{X/\nu}} \sim t(\nu)$.
The pdf of the $t$-distribution is given by
\[
f(t|\nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \quad \nu \in \{1, 2, 3, \ldots\} \tag{1.8}
\]
The parameter $\nu$ is referred to as the degrees of freedom.

The mean of this distribution is
\[
E(T) = 0, \quad \text{if } \nu > 1
\]
and its variance is
\[
\mathrm{Var}(T) = \frac{\nu}{\nu - 2}, \quad \text{if } \nu > 2
\]
The $t$-distribution has a symmetrical shape similar in appearance to the normal distribution,
but it has thicker tails. Indeed we see from the formula above that for sufficiently low degrees
of freedom the variance does not even exist, indicating that there is too much mass in the tails.
As $\nu$ increases, the $t$-distribution approaches the $N(0, 1)$ distribution.
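The claim about thicker tails and convergence to the normal can be illustrated by evaluating the pdf in equation (1.8) at a point in the tail for increasing degrees of freedom. The sketch below is an illustration under stated assumptions: the evaluation point, the list of degrees of freedom and the function name `t_pdf` are chosen here and are not taken from the text.

```python
from math import exp, lgamma, pi, sqrt
from statistics import NormalDist


def t_pdf(t, nu):
    """Student's t pdf as in equation (1.8); lgamma avoids overflow for large nu."""
    log_const = lgamma((nu + 1) / 2) - lgamma(nu / 2)
    return exp(log_const) / sqrt(nu * pi) * (1 + t * t / nu) ** (-(nu + 1) / 2)


point = 3.0                         # a point in the right tail
normal_density = NormalDist().pdf(point)

# Hypothetical degrees of freedom, for illustration only.
degrees = [2, 10, 100, 1000]
gaps = [t_pdf(point, nu) - normal_density for nu in degrees]

for nu, gap in zip(degrees, gaps):
    print(f"nu = {nu:5d}: t density minus normal density at {point} = {gap:.6f}")
```

The gap is positive in every case (the $t$ tail is heavier) and shrinks as the degrees of freedom grow.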
Exercises
1. Graph the following pdfs on the same set of axes: $t(1)$, $t(5)$, $t(25)$, $N(0, 1)$.
2. Calculate the value of the cdf at $t = 3$ in these cases.
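1.5.4 F
If $X_1 \sim \chi^2(\nu_1)$ and $X_2 \sim \chi^2(\nu_2)$ and $X_1$ and $X_2$ are independent of each other, then $F = \frac{X_1/\nu_1}{X_2/\nu_2} \sim F(\nu_1, \nu_2)$. Its pdf is given by
\[
f(x|\nu_1, \nu_2) = \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\left(\frac{\nu_1}{2}\right)\Gamma\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}{2}} x^{\frac{\nu_1}{2} - 1} \left(1 + \frac{\nu_1}{\nu_2}x\right)^{-\frac{\nu_1+\nu_2}{2}}, \quad x \in (0, \infty),\ \nu_1, \nu_2 \in \{1, 2, 3, \ldots\} \tag{1.9}
\]
The mean of an F distribution is
\[
E(F) = \frac{\nu_2}{\nu_2 - 2}, \quad \text{provided } \nu_2 > 2
\]
The variance is
\[
\mathrm{Var}(F) = \frac{2\nu_2^2(\nu_2 + \nu_1 - 2)}{\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}, \quad \text{provided } \nu_2 > 4
\]
In cases where $\nu_2$ is large, $\nu_1 F$ is approximately $\chi^2(\nu_1)$.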
Exercises
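1. Graph the following distributions: $F(5, 10)$, $F(5, 30)$, $F(5, 100)$
2. Calculate the value of the cdf at $x = 2.5$ for these three distributions.
3. Let the random variable $Y = \nu_1 F$, where $F \sim F(\nu_1, \nu_2)$. The pdf of $Y$ will be given by
\[
f(y|\nu_1, \nu_2) = \frac{1}{\nu_1} \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\left(\frac{\nu_1}{2}\right)\Gamma\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}{2}} \left(\frac{y}{\nu_1}\right)^{\frac{\nu_1}{2} - 1} \left(1 + \frac{y}{\nu_2}\right)^{-\frac{\nu_1+\nu_2}{2}}, \quad y \in (0, \infty),\ \nu_1, \nu_2 \in \{1, 2, 3, \ldots\} \tag{1.10}
\]
Graph the distributions of $5F(5, 10)$, $5F(5, 30)$, $5F(5, 100)$ and $\chi^2(5)$ on the same set of
axes.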
1.5.5 Noncentral $\chi^2$, $t$ and $F$ distributions
The pdf of the $\chi^2$ distribution given above was for the central chi-square distribution. Most
hypothesis tests are based on this distribution. If we square a normal variable that has a nonzero mean, we get the noncentral chi-square. Correspondingly there are noncentral versions of
the $t$ and $F$ distributions. These become important particularly if we want to investigate the
distribution of test statistics under the alternative hypothesis.
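1.5.6 Gamma
The gamma distribution has pdf
\[
f(x|\alpha, \beta) = \frac{x^{\alpha - 1} e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}, \quad x \in (0, \infty),\ \alpha > 0,\ \beta > 0
\]
where $\alpha$ and $\beta$ are called the shape and scale parameter
respectively. It has mean $\alpha\beta$ and
variance $\alpha\beta^2$. The chi-square distribution $\chi^2(\nu)$ is Gamma$\left(\frac{\nu}{2}, 2\right)$.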
Exercises
1. Graph Gamma$(1, 2)$, Gamma$(2, 2)$ and Gamma$(5, 2)$ on the same set of axes. What do
you notice?
2. Now graph Gamma$(1, 4)$, Gamma$(2, 4)$ and Gamma$(5, 4)$ together and compare these
graphs to the previous ones.
1.5.7 Exponential
This is a particularly simple distribution which crops up frequently in applied work. Its pdf is
given by
\[
f(x|\theta) = \frac{1}{\theta} e^{-x/\theta}, \quad x \in (0, \infty),\ \theta \in (0, \infty)
\]
It is another special case of the gamma distribution, being Gamma$(1, \theta)$.
Exercises
1. Graph the exponential distributions with $\theta = 0.2$, $\theta = 1$, and $\theta = 5$.
2. Evaluate the cdfs at $x = 2$.
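1.5.8 Beta
Like the gamma distribution, this is a flexible distribution that can capture many particular
shapes. It is defined on a bounded interval only, so is particularly appropriate for contexts where
the range of the random variable is bounded. Its pdf is:
\[
f(x|\alpha, \beta) = \begin{cases}
\dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1} & \text{if } 0 \le x \le 1 \\
0 & \text{otherwise}
\end{cases}
\]
It has mean $\frac{\alpha}{\alpha + \beta}$.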
Exercises
1. Graph the following distributions: Beta$(2, 4)$, Beta$(4, 2)$, Beta$\left(\frac{1}{2}, \frac{1}{2}\right)$, Beta$(1, 2)$.
1.5.9 Logistic
The logistic distribution Logistic$(\mu, s)$ is sometimes used in applications where one requires thicker
tails than the normal distribution has. Otherwise it is a bell-shaped, symmetric curve quite
similar to the normal. Its pdf is given by
\[
f(x|\mu, s) = \frac{e^{-(x-\mu)/s}}{s\left(1 + e^{-(x-\mu)/s}\right)^2}, \quad -\infty < x < \infty,\ s \in (0, \infty)
\]
In this case it is also possible to give a closed form for the cdf:
\[
F(x|\mu, s) = \frac{1}{1 + e^{-(x-\mu)/s}}
\]
Its mean is $\mu$ and its variance is $\frac{(\pi s)^2}{3}$.
Exercises
1. Find $\mu$ and $s$ such that Logistic$(\mu, s)$ has the same mean and variance as $N(0, 1)$. Plot these
two distributions on top of each other. If you can, zoom into the tail, to verify that the
logistic distribution has fatter tails.
1.5.10 Cauchy
If $Z_1 \sim N(0, 1)$ and $Z_2 \sim N(0, 1)$ and these are independent of each other, then $X = \frac{Z_1}{Z_2}$ is
distributed according to the Cauchy distribution. Its pdf is given by
\[
f(x) = \frac{1}{\pi(1 + x^2)} \tag{1.11}
\]
This distribution is interesting because it does not have any moments!
Exercises
1. Verify that equation 1.11 does define a legitimate pdf.
2. Verify that the Cauchy distribution does not have a mean.
3. Graph the Cauchy distribution.
4. What, if any, is the connection between the Cauchy distribution and the $t$ distribution?
1.5.11 Uniform
To leave the simplest distribution to last: the uniform distribution $U(a, b)$ states that every
outcome in the interval $[a, b]$ is equally probable. Its pdf is given by
\[
f(x|a, b) = \frac{1}{b - a}, \quad x \in [a, b],\ a < b
\]
The mean of this distribution is $\frac{a + b}{2}$ and its variance is $\frac{(b - a)^2}{12}$.
Exercises
1. Assume that $X \sim U(0, 1)$. Calculate $\Pr(0.2 \le X \le 0.7)$.
1.6 Transformations of a random variable
It frequently happens that we want to define a new random variable as some function of an
existing random variable, e.g. $Y = e^X$, more generally:
\[
Y = g(X)
\]
Provided that this is a one-to-one transformation (i.e. $g$ is monotonically increasing or decreasing) so that there is an inverse transformation $X = g^{-1}(Y)$ it is straightforward to calculate
probabilities in this new distribution.
\[
\Pr(a \le Y \le b) = \Pr\left(g^{-1}(a) \le X \le g^{-1}(b)\right)
\]
where we have assumed for the moment that $g$ is increasing. If we let the pdf of $Y$ be $f_Y$ and
the pdf of $X$ be $f_X$, it follows that
\[
\int_a^b f_Y(y)\,dy = \int_{g^{-1}(a)}^{g^{-1}(b)} f_X(x)\,dx
\]
If we know what $g$ is, we can simply use the fact that $x = g^{-1}(y)$ and that $dx = (g^{-1})'(y)\,dy$,
and change the variable of integration on the right hand side i.e.
\[
\int_a^b f_Y(y)\,dy = \int_a^b f_X\left(g^{-1}(y)\right) (g^{-1})'(y)\,dy
\]
Since this is true for any values of $a$ and $b$ it is easy to see that we must have $f_Y(y) = f_X\left(g^{-1}(y)\right)(g^{-1})'(y)$.
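The change-of-variable result for an increasing transformation can be checked numerically by differentiating the cdf of the new variable. The sketch below assumes, purely for illustration, that $X \sim N(0, 1)$ and $g(x) = e^x$, so that $g^{-1}(y) = \ln y$; the function names are hypothetical.

```python
from math import log
from statistics import NormalDist

std_normal = NormalDist()  # assumed distribution of X for this illustration


def g_inverse(y):
    """Inverse of the assumed transformation g(x) = exp(x)."""
    return log(y)


def g_inverse_prime(y):
    """Derivative of the inverse transformation."""
    return 1.0 / y


def pdf_y(y):
    """Density of Y = g(X) from the change-of-variable formula."""
    return std_normal.pdf(g_inverse(y)) * g_inverse_prime(y)


def cdf_y(y):
    """Pr(Y <= y) = Pr(X <= g^{-1}(y)) because g is increasing."""
    return std_normal.cdf(g_inverse(y))


def numerical_derivative(fn, y, h=1e-5):
    """Central difference approximation to fn'(y)."""
    return (fn(y + h) - fn(y - h)) / (2 * h)


for y in (0.5, 1.0, 3.0):
    print(y, pdf_y(y), numerical_derivative(cdf_y, y))
```

The two columns agree to several decimal places, as the derivation requires.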
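If $g$ is a decreasing function, then
\[
\Pr(a \le Y \le b) = \Pr\left(g^{-1}(b) \le X \le g^{-1}(a)\right)
\]
since $g^{-1}(b)$ will be less than $g^{-1}(a)$. In this case
\[
\int_a^b f_Y(y)\,dy = \int_{g^{-1}(b)}^{g^{-1}(a)} f_X(x)\,dx
\]
We can again rewrite $f_X(x)\,dx$ as $f_X\left(g^{-1}(y)\right)(g^{-1})'(y)\,dy$. Note that this expression will now be
negative, since $(g^{-1})'(y)$ is negative. Changing variables on the right hand side again, we see that
\[
\int_a^b f_Y(y)\,dy = \int_b^a f_X\left(g^{-1}(y)\right)(g^{-1})'(y)\,dy = -\int_a^b f_X\left(g^{-1}(y)\right)(g^{-1})'(y)\,dy
\]
In this case we must have $f_Y(y) = -f_X\left(g^{-1}(y)\right)(g^{-1})'(y)$. We can combine both cases by writing
\[
f_Y(y) = f_X\left(g^{-1}(y)\right)\left|(g^{-1})'(y)\right|
\]
Exercises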
1. Let $X$ have a distribution with pdf $f_X$. What is the pdf of the new random variable $Y = a + bX$,
where $a$ and $b$ are both non-zero?
2. Let $X$ have exponential distribution with parameter $\theta$. Find the distribution of $Y = X^2$.
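1.6.1 The lognormal distribution
Assume that $Y = e^X$, where $X \sim N(\mu, \sigma^2)$, then we say that $Y$ is log-normal, i.e.
$Y \sim LN(\mu, \sigma^2)$. Applying the formula above, we see that the pdf of this variable is
\[
f(y|\mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right) \left|\frac{1}{y}\right| = \frac{1}{y\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right)
\]
This distribution is very important in practical applications. It is worthwhile to derive its
mean in some detail:
\[
E(Y) = \int_0^\infty y f(y)\,dy = \int_0^\infty \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right) dy
\]
We will substitute $x = \ln y$, i.e. $y = e^x$, so $dy = e^x\,dx$. Consequently
\begin{align*}
E(Y) &= \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) e^x\,dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2 - 2\sigma^2 x}{2\sigma^2}\right) dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu - \sigma^2)^2 - 2\mu\sigma^2 - \sigma^4}{2\sigma^2}\right) dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu - \sigma^2)^2}{2\sigma^2} + \mu + \frac{\sigma^2}{2}\right) dx \\
&= e^{\mu + \frac{\sigma^2}{2}} \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu - \sigma^2)^2}{2\sigma^2}\right) dx
\end{align*}
The term that is being integrated is just the pdf of a variable that is $N(\mu + \sigma^2, \sigma^2)$, so the
integral must evaluate to one. The fundamental result is that
\[
E(Y) = e^{\mu + \frac{\sigma^2}{2}} \tag{1.12}
\]
We note that we cannot simply take the mean of $\ln Y$ and antilog it. We need to add in the
correction $\frac{\sigma^2}{2}$. The reason for this correction is that the lognormal is no longer a symmetric
distribution. The right tail of the distribution is much longer than the left tail and this shifts
the mean up.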
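The result in equation (1.12) can be confirmed by numerical integration of $e^x$ against the normal pdf. This is a sketch under stated assumptions: the parameter values, the midpoint rule and the truncation of the integral to a wide finite interval are all choices made here for illustration.

```python
from math import exp
from statistics import NormalDist

# Hypothetical parameter values, for illustration only.
mu, sigma = 0.5, 0.8
x_dist = NormalDist(mu, sigma)  # distribution of X = ln Y


def integrate(fn, a, b, steps=100_000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(fn(a + (i + 0.5) * h) for i in range(steps))


# E(Y) = E(e^X), integrated over a range wide enough to capture almost all the mass.
numeric_mean = integrate(lambda x: exp(x) * x_dist.pdf(x), mu - 12 * sigma, mu + 12 * sigma)
closed_form = exp(mu + sigma**2 / 2)
naive_antilog = exp(mu)

print("numerical E(Y):      ", numeric_mean)
print("exp(mu + sigma^2/2): ", closed_form)
print("exp(mu) (too small): ", naive_antilog)
```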
Exercises
1. Graph the following lognormal distributions: $LN(0, 1)$, $LN(0, 2)$ and $LN(0, 4)$ on the
same set of axes. What do you observe?
2. What are the means for these distributions? The medians? And the modes? Comment on
what you observe.
3. Let $Y \sim LN(\mu, \sigma^2)$. Define the new random variable $W = \frac{1}{Y}$. Show that $W \sim LN(-\mu, \sigma^2)$
by means of the change of variable technique. Can you show this in any other way?
4. Let 2 . Consider the new random variable defined by = . Derive its

distribution.
1.7 Appendix: Moment generating function
Definition 1.9 Moment generating function of the Random Variable X

The MGF of a random variable $X$ is
\[
M_X(t) = E\left[e^{tX}\right]
\]
provided that this expectation exists everywhere in a neighbourhood around $t = 0$. If it exists,
then
\[
\mu'_r = \left.\frac{d^r M_X(t)}{dt^r}\right|_{t=0}
\]
Moment generating functions are extremely useful (when they exist) because they uniquely
identify a distribution. In a sense they act as fingerprints of that distribution (Mittelhammer
et al. 2000, Chapter E1, p.44). A useful feature in this regard is that if $X$ and $Y$ are independent,
then the MGF of $X + Y$ is $M_X(t)\,M_Y(t)$. If we can identify this MGF, we can deduce the
distribution of the random variable (see the exercises below).
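Distribution              MGF
Bernoulli                 $M(t) = pe^t + (1 - p)$
Binomial                  $M(t) = (1 - p + pe^t)^n$
$N(\mu, \sigma^2)$        $M(t) = \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right)$
$\chi^2(\nu)$             $M(t) = (1 - 2t)^{-\nu/2}$, $t < \frac{1}{2}$
Gamma$(\alpha, \beta)$    $M(t) = (1 - \beta t)^{-\alpha}$, $t < \frac{1}{\beta}$
Logistic$(\mu, s)$        $M(t) = e^{\mu t}\,\pi s t \csc(\pi s t)$
$U(a, b)$                 $M(t) = \frac{e^{bt} - e^{at}}{(b - a)t}$
Student's t               MGF does not exist
F                         MGF does not exist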
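To see how an MGF generates moments, one can differentiate it numerically at $t = 0$. The sketch below uses the MGF of a normal variable, $\exp(\mu t + \sigma^2 t^2/2)$, which is a standard result; the parameter values, step size and function names are assumptions made for this illustration.

```python
from math import exp

# Hypothetical parameter values, for illustration only.
mu, sigma2 = 1.5, 4.0


def mgf_normal(t):
    """MGF of a N(mu, sigma2) random variable."""
    return exp(mu * t + 0.5 * sigma2 * t * t)


def first_derivative_at_zero(fn, h=1e-4):
    """Central difference approximation to fn'(0)."""
    return (fn(h) - fn(-h)) / (2 * h)


def second_derivative_at_zero(fn, h=1e-4):
    """Central difference approximation to fn''(0)."""
    return (fn(h) - 2 * fn(0.0) + fn(-h)) / (h * h)


m1 = first_derivative_at_zero(mgf_normal)   # first moment about the origin
m2 = second_derivative_at_zero(mgf_normal)  # second moment about the origin

print("mu'_1:", m1, "expected", mu)
print("mu'_2:", m2, "expected", mu**2 + sigma2)
print("variance:", m2 - m1**2, "expected", sigma2)
```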
1.7.1 Exercises
1. Use the MGF of a Bernoulli distribution to calculate $\mu'_1$ and $\mu_2$. What is $\mu_0$?

2. Derive the MGF of the binomial distribution. (Hint: use the MGF of the Bernoulli distribution).
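3. Use the MGF of the binomial distribution to calculate $\mu'_2$.
4. Using the formula for the MGF of a $\chi^2(\nu)$ random variable show that $\mu'_1 = \nu$ and
$\mu_2 = 2\nu$.
5. Using the MGF show that if $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$, with $X_1$ and $X_2$ independent of each other, then $X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. Find a simple example where $X_1$ and $X_2$ are not independent of each other and $X_1 + X_2$ is not distributed as
$N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
6. Using the MGF show that if $X_1 \sim \chi^2(\nu_1)$ and $X_2 \sim \chi^2(\nu_2)$, with $X_1$ and $X_2$ independent,
then $X_1 + X_2 \sim \chi^2(\nu_1 + \nu_2)$.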
7. Derive the MGF for the exponential distribution from first principles. Check using the fact
that the exponential distribution is Gamma$(1, \theta)$ that your answer is correct.
8. Use the MGF of the uniform distribution to calculate $\mu'_2$.
Chapter 2
Probability and Distribution Theory II
Sources: Greene (2003), Mittelhammer et al. (2000, Appendix E), Davidson and MacKinnon
(2004) and Sydsaeter, Strom and Berck (1999).
2.1 Joint distributions
In the previous chapter we considered only univariate distributions. In most situations of interest
to the economist, however, the outcome of the random experiment can most usefully be thought
of as a vector. For instance when we collect survey information we tend to collect information
from the same individual on more than one variable. If the outcome of the experiment can be
captured by the variables $X_1, X_2, \ldots, X_n$ then we can think of the outcome in terms of the
random vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)$.
Many of the definitions can be extended very easily to this case.
Definition 2.1 Cumulative distribution function
The cdf $F(\mathbf{x})$ of the random vector $\mathbf{X}$ is defined as
\[
F(\mathbf{x}) = \Pr(\mathbf{X} \le \mathbf{x})
\]
Note that the vector inequality holds only if the inequality holds for every one of the components of the vector, so
\[
\Pr(\mathbf{X} \le \mathbf{x}) = \Pr\left((X_1 \le x_1) \cap (X_2 \le x_2) \cap \cdots \cap (X_n \le x_n)\right)
\]
The properties of the cdf are as before. In particular the function must be non-decreasing
and must be continuous from the right (where this applies to each dimension). Furthermore
there will again be two types of probability distributions that can be defined in terms of the
cumulative distribution functions: discrete and continuous.
2.1.1 Discrete joint distributions
As in the univariate case, a joint distribution of discrete variables will show jumps in the cdf
at the points (which are now vectors) where there is positive probability. The size of the jump
will again be equal to the probability attached to that precise outcome. It turns out, however,
that there will now be many more points at which there are jumps, but where there is zero
probability (see Figure 2.1 and exercise 2 below). This means that it is more difficult to recover
the corresponding probability distribution function.
We can define the joint probability distribution as
\[
f(x_1, x_2, \ldots, x_n) = \Pr(X_1 = x_1 \text{ and } X_2 = x_2 \text{ and } \ldots \text{ and } X_n = x_n)
\]
As before we must have
\[
0 \le f(\mathbf{x}) \le 1
\]
\[
\sum_{\mathbf{x}} f(\mathbf{x}) = 1
\]
We can obviously retrieve the cdf from the joint distribution:
\[
F(\mathbf{x}) = \sum_{\mathbf{u} \le \mathbf{x}} f(\mathbf{u})
\]
Example 2.2 Assume that we have enumerated a population of ten individuals and have ascertained that the following combinations $(x_1, x_2)$ of measurements are possible:
(1, 1), (0, 2), (2, 3), (1, 3), (0, 1), (2, 1), (3, 4), (2, 4), (2, 3), (3, 3)
The outcome of a random draw from this population defines the random vector $\mathbf{X} = (X_1, X_2)$
with the following joint distribution:

Joint probabilities
                 X1 = 0   X1 = 1   X1 = 2   X1 = 3   Marginal X2
X2 = 1            0.1      0.1      0.1      0        0.3
X2 = 2            0.1      0        0        0        0.1
X2 = 3            0        0.1      0.2      0.1      0.4
X2 = 4            0        0        0.1      0.1      0.2
Marginal X1       0.2      0.2      0.4      0.2

Based on these outcomes we can graph the cumulative distribution function as in Figure 2.1.
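The joint probabilities and the marginal totals in Example 2.2 can be reproduced by simply counting the ten enumerated outcomes. The sketch below does this with exact fractions; the variable names are chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction

# The ten (x1, x2) combinations enumerated in Example 2.2.
population = [(1, 1), (0, 2), (2, 3), (1, 3), (0, 1),
              (2, 1), (3, 4), (2, 4), (2, 3), (3, 3)]

# Each individual is equally likely to be drawn.
n = len(population)
joint = {pair: Fraction(count, n) for pair, count in Counter(population).items()}

# Marginal distributions: sum the joint probabilities over the other variable.
marginal_x1 = {}
marginal_x2 = {}
for (x1, x2), prob in joint.items():
    marginal_x1[x1] = marginal_x1.get(x1, 0) + prob
    marginal_x2[x2] = marginal_x2.get(x2, 0) + prob

print("joint:", dict(sorted(joint.items())))
print("marginal of X1:", dict(sorted(marginal_x1.items())))
print("marginal of X2:", dict(sorted(marginal_x2.items())))
```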
Exercises
1. For the example given above find (1 2), (3 0),

2 5
2. How do you explain the jump in Figure 2.1 at the point $(1, 2)$ even though $\Pr(\mathbf{X} = (1, 2)) = 0$?
3. Assume that you are given the following definition of a function:
0 if
( 1) or ( 2)
03 if (1 3) and (2 )
( ) =
07 if ( 3) and (2 5)
1 if
( 3) and ( 5)
Generate a contour plot (bands in which the probability is the same) for this function.
Is this a valid cdf? If yes, derive the corresponding joint distribution. If no, find some way
of turning it into a proper cdf and then provide the joint distribution.
Figure 2.1: The cdf of a joint distribution is a non-decreasing function with jumps at all points
where there is a point mass but some additional jumps as well. (Plot over the $(x, y)$ plane, with regions labelled p = 0.1, 0.2, 0.3, 0.4, 0.7, 0.8 and 1.)
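2.1.2 Continuous joint distributions
In the case of continuous distributions the relationship to the joint density is given by
\[
f(x_1, x_2, \ldots, x_n) = \frac{\partial^n F(x_1, x_2, \ldots, x_n)}{\partial x_1\,\partial x_2 \cdots \partial x_n}
\]
Correspondingly we can define the cdf in terms of the joint pdf as
\[
F(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \cdots \int_{-\infty}^{x_n} f(u_1, u_2, \ldots, u_n)\,du_n \cdots du_2\,du_1
\]
A joint density function must have the following properties:
\[
0 \le f(\mathbf{x})
\]
\[
\int_{\mathbf{x}} f(\mathbf{x})\,d\mathbf{x} = 1
\]
where it is understood that the integral is taken over the entire domain of the random vector.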
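Example 2.3 Assume that the function $f$ is defined as
\[
f(x, y) = \begin{cases}
2 & \text{if } 0 \le x \le y \text{ and } 0 \le y \le 1 \\
0 & \text{elsewhere}
\end{cases}
\]
A three-dimensional plot of this function looks as follows:
[Figure: three-dimensional plot of $f(x, y)$ over the $(x, y)$ plane.]
In this case we have $f(x, y) \ge 0$ and
\[
\int\!\!\int f(x, y)\,dx\,dy = \int_0^1 \left[2x\right]_0^y dy = \int_0^1 2y\,dy = \left[y^2\right]_0^1 = 1
\]
so that $f(x, y)$ is a well-defined joint pdf.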
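2.2 Marginal distributions
Frequently we are interested in the behaviour of one of the components of the random vector
while ignoring the rest. We define the marginal pdf of the random variable $X_i$ as
\[
f_i(x_i) = \begin{cases}
\sum_{\mathbf{x}_{-i}} f(x_i, \mathbf{x}_{-i}) & \text{if } \mathbf{X} \text{ is discrete} \\
\int_{\mathbf{x}_{-i}} f(x_i, \mathbf{x}_{-i})\,d\mathbf{x}_{-i} & \text{if } \mathbf{X} \text{ is continuous}
\end{cases}
\]
where the vector $\mathbf{x}_{-i}$ is the vector of all the other random variables in the random vector $\mathbf{X}$.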
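Example 2.4 The marginal distributions of the discrete distribution considered in Example 2.2
are given in the margins. They are:
\[
f_1(x_1) = \begin{cases}
0.2 & \text{if } x_1 = 0 \\
0.2 & \text{if } x_1 = 1 \\
0.4 & \text{if } x_1 = 2 \\
0.2 & \text{if } x_1 = 3 \\
0 & \text{elsewhere}
\end{cases}
\qquad
f_2(x_2) = \begin{cases}
0.3 & \text{if } x_2 = 1 \\
0.1 & \text{if } x_2 = 2 \\
0.4 & \text{if } x_2 = 3 \\
0.2 & \text{if } x_2 = 4 \\
0 & \text{elsewhere}
\end{cases}
\]
It is easy to see that both of these are valid univariate discrete distributions.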
Example 2.5 The marginal distributions of the continuous distribution considered in Example
2.3 can be worked out as follows:
\[
f_2(y) = \int_0^y 2\,dx = 2y \quad \text{where } 0 \le y \le 1
\]
\[
f_1(x) = \int_x^1 2\,dy = 2 - 2x \quad \text{where } 0 \le x \le 1
\]
Note that in the second case we rewrote the domain of the function as $0 \le x \le 1$ and $x \le y \le 1$,
which is equivalent to the domain definition that we started out with. We needed to do this to
ensure that we had no variable $y$ left in the definition of the marginal distribution.
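The two marginal densities can be checked by integrating the joint pdf numerically over the other variable. The sketch below assumes the joint pdf of Example 2.3 is 2 on the triangle $0 \le x \le y \le 1$; the evaluation point and helper names are chosen for illustration, and the midpoint rule is only approximate because the integrand has a jump.

```python
def joint_pdf(x, y):
    """Joint pdf of Example 2.3: 2 on the triangle 0 <= x <= y <= 1, 0 elsewhere."""
    return 2.0 if 0.0 <= x <= y <= 1.0 else 0.0


def integrate(fn, a, b, steps=10_000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(fn(a + (i + 0.5) * h) for i in range(steps))


def marginal_x(x):
    """f_1(x): integrate the joint pdf over y."""
    return integrate(lambda y: joint_pdf(x, y), 0.0, 1.0)


def marginal_y(y):
    """f_2(y): integrate the joint pdf over x."""
    return integrate(lambda x: joint_pdf(x, y), 0.0, 1.0)


point = 0.3  # hypothetical evaluation point
print("f_1(0.3):", marginal_x(point), "vs 2 - 2x =", 2 - 2 * point)
print("f_2(0.3):", marginal_y(point), "vs 2y =", 2 * point)
```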
2.3 Conditional distributions

2.3.1 Discrete conditional distributions
In the previous chapter we defined the conditional probability $\Pr(A|B)$ as
\[
\Pr(A|B) = \frac{\Pr(A \cap B)}{\Pr(B)}
\]
(defined only if $\Pr(B) \ne 0$).
We can use this notion to define the conditional distribution of a random variable given that
one variable takes on a particular value:
\[
f(x_2|x_1) = \frac{f(x_1, x_2)}{f_1(x_1)} \tag{2.1}
\]
This is again defined only if $f_1(x_1) \ne 0$.

Example 2.6 Consider the joint distribution given in Example 2.2. We can define the conditional distribution $f(x_2|x_1 = 1)$ using the above definition. We have
\begin{align*}
f(1|x_1 = 1) &= \frac{f(1, 1)}{f_1(1)} = \frac{0.1}{0.2} = 0.5 \\
f(2|x_1 = 1) &= \frac{f(1, 2)}{f_1(1)} = \frac{0}{0.2} = 0 \\
f(3|x_1 = 1) &= \frac{f(1, 3)}{f_1(1)} = \frac{0.1}{0.2} = 0.5 \\
f(4|x_1 = 1) &= \frac{f(1, 4)}{f_1(1)} = \frac{0}{0.2} = 0
\end{align*}
In short the conditional pdf is given by $f(1) = 0.5$, $f(3) = 0.5$ and $f(x_2) = 0$ everywhere else.
This function meets all the conditions of a proper pdf.
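The calculation in Example 2.6 amounts to dividing one column of the joint table by its marginal total. A minimal sketch, with the joint probabilities of Example 2.2 typed in as exact fractions and a hypothetical helper name `conditional_x2_given_x1`:

```python
from fractions import Fraction

tenth = Fraction(1, 10)

# Non-zero joint probabilities f(x1, x2) from Example 2.2.
joint = {
    (0, 1): tenth, (0, 2): tenth,
    (1, 1): tenth, (1, 3): tenth,
    (2, 1): tenth, (2, 3): 2 * tenth, (2, 4): tenth,
    (3, 3): tenth, (3, 4): tenth,
}


def conditional_x2_given_x1(joint, x1_value):
    """f(x2 | x1) = f(x1, x2) / f_1(x1), defined only if f_1(x1) is non-zero."""
    marginal = sum(p for (x1, _), p in joint.items() if x1 == x1_value)
    if marginal == 0:
        raise ValueError("conditional distribution undefined: f_1(x1) = 0")
    return {x2: p / marginal for (x1, x2), p in joint.items() if x1 == x1_value}


cond = conditional_x2_given_x1(joint, 1)
print(cond)                 # non-zero conditional probabilities only
print(sum(cond.values()))   # a proper pdf sums to one
```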
Exercises
1. Find the conditional distribution $f(x_1|x_2 = 1)$, using the joint distribution in Example 2.2.
Verify that it is a proper distribution.
2. Find the conditional distribution (| = 2) using the joint distribution in Exercise 3.
2.3.2 Continuous conditional distributions
In the case of continuous random variables the probability that a random variable takes on a
particular value is always zero. Nevertheless we can still define a conditional pdf in exactly the
same way as we have done for the discrete case, i.e.
\[
f(x_2|x_1) = \frac{f(x_1, x_2)}{f_1(x_1)}
\]
In this case we also need to ensure that $f_1(x_1) \ne 0$.

Example 2.7 Consider the joint distribution given in Example 2.3. We have
\[
f(y|x = 0.5) = \frac{f(0.5, y)}{f_1(0.5)}
\]
Now observe that $f(0.5, y) = 2$ if $0.5 \le y \le 1$ and $f(0.5, y) = 0$ elsewhere. Furthermore
$f_1(0.5) = 2 - 2(0.5) = 1$. Putting this together:
\[
f(y|x = 0.5) = \begin{cases}
2 & \text{if } 0.5 \le y \le 1 \\
0 & \text{elsewhere}
\end{cases}
\]
Observe that $f(y|x = 0.5)$ is the pdf of the $U(0.5, 1)$ distribution.
Exercises
1. Find the conditional distribution $f(x|y = 0.5)$ using the joint distribution given in Example
2.3.
2. Consider the following function:
\[
f(x, y) = \begin{cases}
8xy & \text{if } 0 \le x \le y \text{ and } 0 \le y \le 1 \\
0 & \text{elsewhere}
\end{cases}
\]
(a) Plot this function.

(b) Verify that this is a proper joint distribution.
(c) Derive the marginal distributions $f_1$ and $f_2$.
(d) Hence derive the conditional distribution (| = 05)
2.3.3 Statistical independence of random variables
Observe that (as with conditional probability) we can rewrite the definition of a conditional
distribution (equation 2.1) as
\[
f(x_1, x_2) = f(x_2|x_1)\,f_1(x_1)
\]
Note that we can iterate this definition in just the same way as we did in the case of probabilities, i.e.
\[
f(x_1, x_2, \ldots, x_n) = f(x_n|x_1, x_2, \ldots, x_{n-1}) \cdots f(x_2|x_1)\,f_1(x_1)
\]
Intuitively, the joint distribution of the random variables $X_2, \ldots, X_n$ is independent of $X_1$ if
knowledge of $X_1$ does not change our assessment of the probability of particular joint outcomes
of the variables $X_2, \ldots, X_n$, i.e. if $f(x_2, \ldots, x_n|x_1) = f(x_2, \ldots, x_n)$.
Definition 2.8 We will say that the variables $X_1, X_2, \ldots, X_n$ are statistically independent if
the joint distribution $f(x_1, x_2, \ldots, x_n)$ can be written as the product of the marginal distributions,
i.e.
\[
f(x_1, x_2, \ldots, x_n) = f_1(x_1)\,f_2(x_2) \cdots f_n(x_n)
\]
Note that this implies that the probability of the joint event $X_1 \in A_1$ and $X_2 \in A_2$ ... and
$X_n \in A_n$ will be just the product of the probability that $X_1 \in A_1$ and the probability that
$X_2 \in A_2$ ... and that $X_n \in A_n$. In the case of continuous variables we can show this as follows:
\begin{align*}
\Pr(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) &= \int_{A_1}\int_{A_2}\cdots\int_{A_n} f(x_1, x_2, \ldots, x_n)\,dx_n \cdots dx_2\,dx_1 \\
&= \int_{A_1} f_1(x_1)\,dx_1 \int_{A_2} f_2(x_2)\,dx_2 \cdots \int_{A_n} f_n(x_n)\,dx_n \\
&= \Pr(X_1 \in A_1)\,\Pr(X_2 \in A_2) \cdots \Pr(X_n \in A_n)
\end{align*}
Exercises
1. Are the variables $X_1$ and $X_2$ in Example 2.2 statistically independent? Explain.
2. Are the variables $X$ and $Y$ in Example 2.3 statistically independent? Explain.
2.4 Expectations of multivariate distributions
The definition of the expected value of a variable $X_i$ is a simple extension of the univariate case:
\[
E(X_i) = \int_{\mathbf{x}} x_i\,f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n
\]
where the integral is taken over the entire domain of the joint distribution. In the expression
above we can integrate out all the variables except for $x_i$ so it is relatively easy to see that this
will be equivalent to evaluating the expectation on the marginal distribution, i.e.
\[
E(X_i) = \int_{x_i} x_i\,f_i(x_i)\,dx_i
\]
where the integral is taken over the entire domain of the marginal distribution.

The expectation of a function $g$ of a set of variables is given by
\[
E(g(X_1, \ldots, X_n)) = \int_{\mathbf{x}} g(x_1, \ldots, x_n)\,f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n
\]
2.4.1 Conditional expectations
Much of econometrics is concerned with conditional expectations, i.e. we will be concerned
with the means of conditional distributions. These are defined in the usual way, i.e.
\[
E[Y|x] = \int_y y\,f(y|x)\,dy
\]
Note that this will, in general, be a function of $x$.

Some useful results involving conditional expectations
Theorem 2.9 Law of Iterated Expectations
\[
E[Y] = E_x\left[E[Y|x]\right]
\]
Note that the left hand side is the unconditional expectation. $E_x[\cdot]$ indicates that the expectation is taken over the distribution of $X$.
Theorem 2.10 Decomposition of Variance
\[
\mathrm{Var}[Y] = \mathrm{Var}_x\left[E[Y|x]\right] + E_x\left[\mathrm{Var}[Y|x]\right]
\]
where $\mathrm{Var}_x$ again indicates that the variance is calculated over the distribution of $X$.
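Both theorems can be verified exactly on a small discrete distribution. The sketch below uses the joint distribution of Example 2.2, treating $X_1$ as the conditioning variable and $X_2$ as the variable of interest; that assignment of roles and all helper names are choices made here for illustration.

```python
from fractions import Fraction

tenth = Fraction(1, 10)

# Non-zero joint probabilities f(x1, x2) from Example 2.2.
joint = {
    (0, 1): tenth, (0, 2): tenth,
    (1, 1): tenth, (1, 3): tenth,
    (2, 1): tenth, (2, 3): 2 * tenth, (2, 4): tenth,
    (3, 3): tenth, (3, 4): tenth,
}


def expectation(pmf):
    return sum(value * prob for value, prob in pmf.items())


def variance(pmf):
    mean = expectation(pmf)
    return sum((value - mean) ** 2 * prob for value, prob in pmf.items())


marg_x1, marg_x2 = {}, {}
for (x1, x2), prob in joint.items():
    marg_x1[x1] = marg_x1.get(x1, 0) + prob
    marg_x2[x2] = marg_x2.get(x2, 0) + prob


def conditional_x2(x1_value):
    """Conditional distribution of X2 given X1 = x1_value."""
    return {x2: p / marg_x1[x1_value] for (x1, x2), p in joint.items() if x1 == x1_value}


cond_mean = {x1: expectation(conditional_x2(x1)) for x1 in marg_x1}
cond_var = {x1: variance(conditional_x2(x1)) for x1 in marg_x1}

# Law of iterated expectations: average the conditional means over X1.
iterated_mean = sum(cond_mean[x1] * p for x1, p in marg_x1.items())

# Decomposition of variance: variance of the conditional mean plus mean of the conditional variance.
var_of_cond_mean = sum((cond_mean[x1] - iterated_mean) ** 2 * p for x1, p in marg_x1.items())
mean_of_cond_var = sum(cond_var[x1] * p for x1, p in marg_x1.items())

print("E[X2]:", expectation(marg_x2), "via iterated expectations:", iterated_mean)
print("Var[X2]:", variance(marg_x2), "via decomposition:", var_of_cond_mean + mean_of_cond_var)
```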
Exercises
1. Verify the law of iterated expectations on Example 2.3, i.e. derive the conditional mean of
$Y$, i.e. $E[Y|x]$. This should be a function of $x$. Then obtain the expected value of this
function over the distribution of $X$ and compare this to the unconditional expectation of $Y$.
2. Verify that the unconditional variance of $Y$ can be decomposed according to Theorem 2.10
using the joint distribution given in Example 2.3.
2.4.2 Covariance
One of the most commonly used expectations involving two variables is the covariance, defined as
\[
\mathrm{Cov}(X, Y) = E\left[(X - E(X))(Y - E(Y))\right]
\]
This is often written as $\sigma_{xy}$. Note that $\mathrm{Cov}(X, X) = \mathrm{Var}(X) = \sigma_x^2$.

A scalar measure which is also often used is the correlation coefficient defined as
\[
\rho_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}
\]
where $\sigma_x = (\mathrm{Var}(X))^{1/2}$ and $\sigma_y$ is similarly defined.

We can show that in general $|\sigma_{xy}| \le \sigma_x \sigma_y$, which implies that $|\rho_{xy}| \le 1$. This result follows
from the Cauchy-Schwarz inequality which states (in the discrete case) that
\[
\left[\sum_k |a_k b_k|\right]^2 \le \left[\sum_k a_k^2\right]\left[\sum_k b_k^2\right]
\]
This in turn implies for any random variables $U$ and $V$
\[
\left[E(UV)\right]^2 \le E(U^2)\,E(V^2)
\]
If $X$ and $Y$ are statistically independent, so that the joint distribution factors into the
product of the marginal distributions, it follows from the definition that the covariance and the
correlation coefficient are of necessity zero. The converse result does not always hold. It does,
however, hold in the case of the multivariate normal distribution, as we will show below.
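A standard counterexample shows why zero covariance does not imply independence. The distribution used below (a variable taking the values $-1$, $0$, $1$ with equal probability, paired with its own square) is an assumption introduced for illustration and does not come from the text.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Hypothetical example: X is uniform on {-1, 0, 1} and Y = X^2, so Y is a function of X.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

marg_x, marg_y = {}, {}
for (x, y), prob in joint.items():
    marg_x[x] = marg_x.get(x, 0) + prob
    marg_y[y] = marg_y.get(y, 0) + prob

mean_x = sum(x * p for x, p in marg_x.items())
mean_y = sum(y * p for y, p in marg_y.items())
covariance = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())

# Independence requires f(x, y) = f_1(x) f_2(y) at every point of the support grid.
factorises = all(
    joint.get((x, y), 0) == marg_x[x] * marg_y[y] for x in marg_x for y in marg_y
)

print("covariance:", covariance)
print("joint factorises into marginals:", factorises)
```

The covariance is exactly zero even though $Y$ is completely determined by $X$.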
Exercises
1. Calculate the covariance and the correlation coefficient for $X_1$ and $X_2$ in Example 2.2.
2. Calculate the covariance and the correlation coefficient for $X$ and $Y$ in Example 2.3.
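2.4.3 Expectations of vectors and matrices
In the case of multivariate distributions we will be frequently interested in the expectations of a
vector of random variables, in particular the mean vector
\[
\boldsymbol{\mu} = E[\mathbf{x}]
\]
where $\mathbf{x}$ is the column vector $[X_1\ X_2\ \cdots\ X_n]'$.
Definition 2.11 If $\mathbf{A}$ is the random matrix
\[
\mathbf{A} = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & & & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
\]
where each $a_{ij}$ is a random variable, then we define
\[
E(\mathbf{A}) = \begin{bmatrix}
E(a_{11}) & E(a_{12}) & \cdots & E(a_{1n}) \\
E(a_{21}) & E(a_{22}) & \cdots & E(a_{2n}) \\
\vdots & & & \vdots \\
E(a_{m1}) & E(a_{m2}) & \cdots & E(a_{mn})
\end{bmatrix}
\]
1 Hint: let $U = X - E(X)$ and $V = Y - E(Y)$.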
Several useful properties of the expectations operator follow. Since the expectations operator
is linear in random variables, i.e.
\[
E(aX + bY) = aE(X) + bE(Y)
\]
(where $a$ and $b$ are real constants) it immediately follows that this will be true for random
matrices $\mathbf{X}$ and $\mathbf{Y}$ too, i.e.
\[
E(a\mathbf{X} + b\mathbf{Y}) = aE(\mathbf{X}) + bE(\mathbf{Y})
\]
Furthermore it follows that if $\mathbf{A}$ is a fixed matrix, i.e. a matrix of constants, then
\[
E(\mathbf{A}\mathbf{X}) = \mathbf{A}E(\mathbf{X})
\]
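2.4.4 Covariance matrix
We can generalise the notion of covariance for the case of a whole vector of random variables.
Definition 2.12 We define the variance-covariance matrix $\mathrm{Var}(\mathbf{z})$ of the (column) vector of
random variables $\mathbf{z}$ as $E\left[(\mathbf{z} - E[\mathbf{z}])(\mathbf{z} - E[\mathbf{z}])'\right]$.
Remark 2.13 Let $\mathbf{z} = [Z_1\ Z_2\ \cdots\ Z_n]'$ and let $E(\mathbf{z}) = \boldsymbol{\mu} = [\mu_1\ \mu_2\ \cdots\ \mu_n]'$. Hence
\[
\mathbf{z} - E[\mathbf{z}] = \begin{bmatrix} Z_1 - \mu_1 \\ Z_2 - \mu_2 \\ \vdots \\ Z_n - \mu_n \end{bmatrix}
\]
and
\[
(\mathbf{z} - E[\mathbf{z}])(\mathbf{z} - E[\mathbf{z}])' = \begin{bmatrix}
(Z_1 - \mu_1)^2 & (Z_1 - \mu_1)(Z_2 - \mu_2) & \cdots & (Z_1 - \mu_1)(Z_n - \mu_n) \\
(Z_2 - \mu_2)(Z_1 - \mu_1) & (Z_2 - \mu_2)^2 & \cdots & (Z_2 - \mu_2)(Z_n - \mu_n) \\
\vdots & & & \vdots \\
(Z_n - \mu_n)(Z_1 - \mu_1) & (Z_n - \mu_n)(Z_2 - \mu_2) & \cdots & (Z_n - \mu_n)^2
\end{bmatrix}
\]
Taking expectations of this we get
\[
E\left[(\mathbf{z} - E[\mathbf{z}])(\mathbf{z} - E[\mathbf{z}])'\right] = \begin{bmatrix}
\mathrm{Var}(Z_1) & \mathrm{Cov}(Z_1, Z_2) & \cdots & \mathrm{Cov}(Z_1, Z_n) \\
\mathrm{Cov}(Z_1, Z_2) & \mathrm{Var}(Z_2) & \cdots & \mathrm{Cov}(Z_2, Z_n) \\
\vdots & & & \vdots \\
\mathrm{Cov}(Z_1, Z_n) & \mathrm{Cov}(Z_2, Z_n) & \cdots & \mathrm{Var}(Z_n)
\end{bmatrix}
\]
It is now obvious why this matrix should be called the variance-covariance matrix.
Note that a variance-covariance matrix by definition must be symmetric. We show below
that it must also be positive semi-definite. This simply means that if we take any non-zero
column vector of constants $\mathbf{c}$ and form the product $\mathbf{c}'\mathbf{V}\mathbf{c}$ where $\mathbf{V}$ is a covariance matrix then
$\mathbf{c}'\mathbf{V}\mathbf{c} \ge 0$. This is simply the extension of the condition that variances must be non-negative to
the context of more than one variable.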
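Remark 2.14 We will often refer to it simply as the covariance matrix and write $\mathrm{Var}(\mathbf{z})$
simply as $\boldsymbol{\Sigma}_z$ or as $\mathbf{V}(\mathbf{z})$.
Observe that if the random vector $\mathbf{z}$ has the covariance matrix $\mathrm{Var}(\mathbf{z})$ then the random
vector $\mathbf{A}\mathbf{z}$ (where $\mathbf{A}$ is a matrix of constants) has covariance matrix $\mathrm{Var}(\mathbf{A}\mathbf{z}) = \mathbf{A}\boldsymbol{\Sigma}_z\mathbf{A}'$. This
follows simply by expanding out the definitions:
\begin{align*}
E(\mathbf{A}\mathbf{z}) &= \mathbf{A}E(\mathbf{z}) \\
\mathbf{A}\mathbf{z} - E(\mathbf{A}\mathbf{z}) &= \mathbf{A}(\mathbf{z} - E[\mathbf{z}]) \\
\mathrm{cov}(\mathbf{A}\mathbf{z}) &= E\left[(\mathbf{A}\mathbf{z} - E[\mathbf{A}\mathbf{z}])(\mathbf{A}\mathbf{z} - E[\mathbf{A}\mathbf{z}])'\right] \\
&= E\left[\{\mathbf{A}(\mathbf{z} - E[\mathbf{z}])\}\{\mathbf{A}(\mathbf{z} - E[\mathbf{z}])\}'\right] \\
&= \mathbf{A}E\left[(\mathbf{z} - E[\mathbf{z}])(\mathbf{z} - E[\mathbf{z}])'\right]\mathbf{A}' \\
&= \mathbf{A}\boldsymbol{\Sigma}_z\mathbf{A}'
\end{align*}
In the case where $\mathbf{A}$ is the row vector $\mathbf{a}'$, the random vector $\mathbf{a}'\mathbf{z}$ is just a scalar variable.
In this case $\mathbf{a}'E\left[(\mathbf{z} - E[\mathbf{z}])(\mathbf{z} - E[\mathbf{z}])'\right]\mathbf{a}$ is just the variance of this new scalar variable. Since
variances are always non-negative it follows that $\mathbf{a}'\boldsymbol{\Sigma}_z\mathbf{a}$ is non-negative regardless of the choice
of $\mathbf{a}$. This shows that $\boldsymbol{\Sigma}_z$ must be positive semi-definite. Note that if $\boldsymbol{\Sigma}_z$ is positive semi-definite,
then $\mathbf{A}\boldsymbol{\Sigma}_z\mathbf{A}'$ is also positive semi-definite.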
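The identity $\mathrm{Var}(\mathbf{A}\mathbf{z}) = \mathbf{A}\boldsymbol{\Sigma}_z\mathbf{A}'$ can be checked exactly on a small discrete distribution. In the sketch below the four equally likely outcomes for $\mathbf{z}$, the matrix $\mathbf{A}$ and all helper names are assumptions introduced for illustration; plain Python lists stand in for matrices.

```python
from fractions import Fraction

# Hypothetical data: four equally likely outcomes of a two-element random vector z.
points = [(0, 1), (1, 3), (2, 2), (4, 0)]
A = [[1, 2], [3, -1]]  # hypothetical matrix of constants


def covariance_matrix(pts):
    """Population covariance matrix E[(z - E z)(z - E z)'] for equally likely points."""
    n, k = len(pts), len(pts[0])
    mean = [Fraction(sum(p[i] for p in pts), n) for i in range(k)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pts) / n
             for j in range(k)] for i in range(k)]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(M):
    return [list(row) for row in zip(*M)]


# Transform every outcome: w = A z.
transformed = [tuple(sum(A[i][k] * p[k] for k in range(len(p))) for i in range(len(A)))
               for p in points]

sigma_z = covariance_matrix(points)
direct = covariance_matrix(transformed)                # Var(Az) computed from the outcomes
via_formula = matmul(matmul(A, sigma_z), transpose(A))  # A Sigma_z A'

print("Var(Az) directly:   ", direct)
print("A Sigma_z A':       ", via_formula)

# Quadratic form a' Sigma_z a for a hypothetical vector a: the variance of a'z.
a = [2, -5]
quad_form = sum(a[i] * sigma_z[i][j] * a[j] for i in range(2) for j in range(2))
print("a' Sigma_z a:", quad_form)
```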
Exercises
1. Find the covariance matrix of $X_1$ and $X_2$ in Example 2.2 and hence find the covariance
matrix of the new variables $W_1 = X_1 + X_2$ and $W_2 = X_1 - X_2$.
2. Find the covariance matrix of $X$ and $Y$ in Example 2.3 and hence find the covariance
matrix of the new variables $W_1 = X + Y$ and $W_2 = X - Y$.
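2.5 Some commonly used multivariate distributions

2.5.1 Uniform
The multivariate uniform distribution is defined on the rectangle where $x_1 \in [a_1, b_1]$, $x_2 \in [a_2, b_2]$,
. . . , $x_n \in [a_n, b_n]$.
\[
f(x_1, \ldots, x_n) = \begin{cases}
\dfrac{1}{\prod_{i=1}^{n}(b_i - a_i)} & \text{if } (x_1, \ldots, x_n) \in [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n] \\
0 & \text{elsewhere}
\end{cases}
\]
It is easy to see that this expression is just the product of $n$ separate uniform pdfs.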
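2.5.2 Bivariate normal
The bivariate normal distribution is defined by the pdf
\[
f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\left\{-\frac{\varepsilon_x^2 + \varepsilon_y^2 - 2\rho\varepsilon_x\varepsilon_y}{2(1 - \rho^2)}\right\}
\]
\[
\varepsilon_x = \frac{x - \mu_x}{\sigma_x}, \quad \varepsilon_y = \frac{y - \mu_y}{\sigma_y}
\]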
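Marginal distributions
Despite the fact that the distribution looks rather complicated, it is fairly easy to obtain the
marginal and conditional distributions. We can rewrite the fraction in the exponent by completing the
square, i.e.
\[
\frac{\varepsilon_x^2 + \varepsilon_y^2 - 2\rho\varepsilon_x\varepsilon_y}{2(1 - \rho^2)} = \frac{\varepsilon_x^2}{2} + \frac{(\varepsilon_y - \rho\varepsilon_x)^2}{2(1 - \rho^2)}
\]
So consequently
\[
f(x, y) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left(-\frac{\varepsilon_x^2}{2}\right) \frac{1}{\sqrt{2\pi\sigma_y^2(1 - \rho^2)}} \exp\left(-\frac{(\varepsilon_y - \rho\varepsilon_x)^2}{2(1 - \rho^2)}\right)
\]
and integrating over $y$ gives
\[
\int f(x, y)\,dy = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left(-\frac{\varepsilon_x^2}{2}\right) \int \frac{1}{\sqrt{2\pi\sigma_y^2(1 - \rho^2)}} \exp\left(-\frac{(\varepsilon_y - \rho\varepsilon_x)^2}{2(1 - \rho^2)}\right) dy
\]
We can rewrite the term $\frac{(\varepsilon_y - \rho\varepsilon_x)^2}{2(1 - \rho^2)}$ by substituting in for $\varepsilon_x$ and $\varepsilon_y$, i.e.
\[
\frac{(\varepsilon_y - \rho\varepsilon_x)^2}{2(1 - \rho^2)} = \frac{\left(y - \mu_y - \rho\frac{\sigma_y}{\sigma_x}(x - \mu_x)\right)^2}{2\sigma_y^2(1 - \rho^2)}
\]
The expression that is being integrated is the pdf of a normal variable with mean $\mu_y + \rho\frac{\sigma_y}{\sigma_x}(x - \mu_x)$
and variance $\sigma_y^2(1 - \rho^2)$, so the area under the curve must be equal to one. Consequently
\[
\int f(x, y)\,dy = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left(-\frac{\varepsilon_x^2}{2}\right) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2}\right)
\]
so the marginal distribution of $X$ is in fact distributed as $N(\mu_x, \sigma_x^2)$. We could, of course,
complete the square the other way round as well and so show that the marginal distribution of
$Y$ is $N(\mu_y, \sigma_y^2)$. The parameter $\rho$ is the correlation coefficient, i.e. $\rho = \frac{\sigma_{xy}}{\sigma_x\sigma_y}$.

Note that if $\rho = 0$ this distribution reduces to
\begin{align*}
f(x, y) &= \frac{1}{2\pi\sigma_x\sigma_y} \exp\left\{-\frac{\varepsilon_x^2 + \varepsilon_y^2}{2}\right\} \\
&= \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2}\right) \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(y - \mu_y)^2}{2\sigma_y^2}\right)
\end{align*}
Consequently in this case a zero covariance or correlation implies that $X$ and $Y$ are statistically
independent.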
Conditional distributions
We have, in fact, already derived the conditional distributions. We showed above that
\[
f(x, y) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left(-\frac{\varepsilon_x^2}{2}\right) \frac{1}{\sqrt{2\pi\sigma_y^2(1 - \rho^2)}} \exp\left(-\frac{(\varepsilon_y - \rho\varepsilon_x)^2}{2(1 - \rho^2)}\right)
\]
The first term is the marginal distribution of $X$, i.e. it is $f_1(x)$. So by definition we must have
\[
f(y|x) = \frac{1}{\sqrt{2\pi\sigma_y^2(1 - \rho^2)}} \exp\left(-\frac{(\varepsilon_y - \rho\varepsilon_x)^2}{2(1 - \rho^2)}\right)
\]
We noted that this was distributed as $N\left(\mu_y + \rho\frac{\sigma_y}{\sigma_x}(x - \mu_x),\ \sigma_y^2(1 - \rho^2)\right)$. We observe that the
conditional mean of $Y$ changes with $x$. The slope of this relationship is given by $\rho\frac{\sigma_y}{\sigma_x}$ which we
could also write as $\frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}$. Note that changes in $\rho$ will affect both the slope of this relationship
as well as the conditional variance. Observe that if $\rho = 1$ then the conditional variance of $Y$ would
be zero! This would be the case if the probability that $Y = a + bX$ is equal to one.
We say that in this case the random vector $(X, Y)$ has a degenerate distribution. The entire
mass of the distribution is concentrated along the line $y = a + bx$. In essence we
have one random variable and not two: we can solve out for $Y$ in terms of $X$ (or vice versa).
We can graphically show the impact of changes in $\rho$ as in Figure 2.2.
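The factorisation of the bivariate normal pdf into the marginal of $X$ and the conditional of $Y$ given $x$ can be confirmed numerically at any point. In the sketch below the parameter values, the evaluation point and the function name `bivariate_normal_pdf` are assumptions made for illustration.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Bivariate normal pdf written in terms of the standardised deviations."""
    eps_x = (x - mu_x) / sigma_x
    eps_y = (y - mu_y) / sigma_y
    quad = (eps_x**2 + eps_y**2 - 2 * rho * eps_x * eps_y) / (2 * (1 - rho**2))
    return exp(-quad) / (2 * pi * sigma_x * sigma_y * sqrt(1 - rho**2))


# Hypothetical parameter values, for illustration only.
mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, -0.5, 2.0, 1.5, 0.6
x, y = 0.3, 0.8

marginal_x = NormalDist(mu_x, sigma_x)

# Conditional distribution of Y given x: mean shifts with x, variance shrinks with rho.
cond_mean = mu_y + rho * sigma_y / sigma_x * (x - mu_x)
cond_sd = sigma_y * sqrt(1 - rho**2)
conditional_y = NormalDist(cond_mean, cond_sd)

joint_value = bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho)
product_value = marginal_x.pdf(x) * conditional_y.pdf(y)

print("f(x, y):         ", joint_value)
print("f_1(x) * f(y|x): ", product_value)
```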
Exercises
1. Generate contour plots corresponding to the three bivariate normal distributions shown in
Figure 2.2 as well as for the case where $\rho = 0.95$.
2. Take the first of these distributions, i.e. the bivariate normal with $\mu_x = 0$, $\mu_y = 0$,
$\sigma_x = 1$, $\sigma_y = 1$ and $\rho = 0.8$. Plot the cross-sections through the surface at the values
$x \in \{-2, -1, 0, 1, 2\}$.
3. Plot the conditional distributions at the same values and compare these to the graphs
generated above.
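2.5.3 Multivariate normal
The random (column) vector $\mathbf{x} = [X_1\ X_2\ \cdots\ X_n]'$
with mean $\boldsymbol{\mu}$ and (nonsingular) covariance matrix $\boldsymbol{\Sigma}$ is multivariate normal if its pdf is given by:
\[
f(\mathbf{x}) = (2\pi)^{-\frac{n}{2}}\,|\boldsymbol{\Sigma}|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)
\]
We write this as $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
Special case: the bivariate normal
We can check that this definition gives the same formula for the bivariate case.
In this case we have $n = 2$, $\boldsymbol{\mu} = [\mu_x\ \mu_y]'$ and
\[
\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix}
\]
It follows that $|\boldsymbol{\Sigma}| = \sigma_x^2\sigma_y^2(1 - \rho^2)$, so that $|\boldsymbol{\Sigma}|^{-\frac{1}{2}} = \frac{1}{\sigma_x\sigma_y\sqrt{1 - \rho^2}}$. Furthermore
\[
\boldsymbol{\Sigma}^{-1} = \begin{bmatrix}
\dfrac{1}{\sigma_x^2(1 - \rho^2)} & \dfrac{-\rho}{\sigma_x\sigma_y(1 - \rho^2)} \\
\dfrac{-\rho}{\sigma_x\sigma_y(1 - \rho^2)} & \dfrac{1}{\sigma_y^2(1 - \rho^2)}
\end{bmatrix}
\]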
Figure 2.2: Changes in the bivariate normal distribution with $\rho$. In all cases $\mu_x = \mu_y = 0$,
$\sigma_x = \sigma_y = 1$. Top panel: $\rho = 0.8$. Middle panel: $\rho = 0.5$. Bottom panel: $\rho = 0$.

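Consequently the term $(\mathbf{x}-\boldsymbol{\mu})'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})$ is
$$\begin{bmatrix}x-\mu_x & y-\mu_y\end{bmatrix}\begin{bmatrix}\dfrac{1}{\sigma_x^2(1-\rho^2)} & \dfrac{-\rho}{\sigma_x\sigma_y(1-\rho^2)} \\ \dfrac{-\rho}{\sigma_x\sigma_y(1-\rho^2)} & \dfrac{1}{\sigma_y^2(1-\rho^2)}\end{bmatrix}\begin{bmatrix}x-\mu_x \\ y-\mu_y\end{bmatrix}$$
Let $z_x = \dfrac{x-\mu_x}{\sigma_x}$ and $z_y = \dfrac{y-\mu_y}{\sigma_y}$ as before. Then
$$(\mathbf{x}-\boldsymbol{\mu})'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) = \begin{bmatrix}z_x & z_y\end{bmatrix}\begin{bmatrix}\dfrac{1}{1-\rho^2} & \dfrac{-\rho}{1-\rho^2} \\ \dfrac{-\rho}{1-\rho^2} & \dfrac{1}{1-\rho^2}\end{bmatrix}\begin{bmatrix}z_x \\ z_y\end{bmatrix} = \frac{z_x^2 + z_y^2 - 2\rho z_x z_y}{1-\rho^2}$$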
It is now easy to verify that the two expressions are mathematically identical.
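The equivalence can also be checked numerically. The sketch below is only an illustration: it assumes the standard bivariate normal formula written in terms of standardised deviations, and the function names and test point are made up for the example. It evaluates that formula and the general multivariate formula with $n = 2$ at the same point.

```python
import math

def bivariate_pdf(x, y, mu_x, mu_y, s_x, s_y, rho):
    """Bivariate normal density in terms of standardised deviations."""
    zx = (x - mu_x) / s_x
    zy = (y - mu_y) / s_y
    q = (zx**2 + zy**2 - 2 * rho * zx * zy) / (1 - rho**2)
    return math.exp(-q / 2) / (2 * math.pi * s_x * s_y * math.sqrt(1 - rho**2))

def multivariate_pdf_2d(x, y, mu_x, mu_y, s_x, s_y, rho):
    """General formula (2 pi)^(-n/2) |Sigma|^(-1/2) exp(-q/2) with n = 2."""
    cov = rho * s_x * s_y
    det = s_x**2 * s_y**2 - cov**2
    # The inverse of [[a, b], [b, d]] is [[d, -b], [-b, a]] / det.
    inv = [[s_y**2 / det, -cov / det], [-cov / det, s_x**2 / det]]
    dx, dy = x - mu_x, y - mu_y
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return (2 * math.pi) ** (-1) * det ** (-0.5) * math.exp(-q / 2)

# Illustrative evaluation point and parameters (top panel of Figure 2.2).
pdf_a = bivariate_pdf(0.3, -1.2, 0.0, 0.0, 1.0, 1.0, 0.8)
pdf_b = multivariate_pdf_2d(0.3, -1.2, 0.0, 0.0, 1.0, 1.0, 0.8)
```

The two values should agree up to floating point rounding for any admissible parameter values with $|\rho| < 1$.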
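Special case: diagonal covariance matrix
If $\Sigma = \mathrm{diag}\left[\sigma_1^2\ \ \sigma_2^2\ \cdots\ \sigma_n^2\right]$, then it is easy to see that $|\Sigma|^{1/2} = \sigma_1\sigma_2\cdots\sigma_n$ and $\Sigma^{-1} = \mathrm{diag}\left[\frac{1}{\sigma_1^2}\ \ \frac{1}{\sigma_2^2}\ \cdots\ \frac{1}{\sigma_n^2}\right]$. In this case
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\sigma_1\sigma_2\cdots\sigma_n}\exp\left\{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}-\frac{(x_2-\mu_2)^2}{2\sigma_2^2}-\cdots-\frac{(x_n-\mu_n)^2}{2\sigma_n^2}\right\}$$
$$= \prod_{i=1}^{n}\frac{1}{\sigma_i\sqrt{2\pi}}\exp\left\{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right\} = \prod_{i=1}^{n}f_i(x_i)$$
So if every pairwise correlation coefficient is zero then the variables are statistically independent, with each distributed as $N(\mu_i, \sigma_i^2)$.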

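An important result
Theorem 2.15 Let the vector $\mathbf{x} = \begin{bmatrix} X_1 & X_2 & \cdots & X_n\end{bmatrix}'$ have multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. Assume that
$$\mathbf{z} = \mathbf{A}\mathbf{x}$$
where $\mathbf{A}$ is a real $m \times n$ matrix with rank $m$. Then
$$\mathbf{z} \sim N\left(\mathbf{A}\boldsymbol{\mu},\ \mathbf{A}\Sigma\mathbf{A}'\right)$$
We can use this result to show that every one of the variables in $\mathbf{x}$ must itself be normally distributed. For instance, to show that $X_1$ has a normal distribution, just let $\mathbf{A}$ be the row vector $\begin{bmatrix}1 & 0 & \cdots & 0\end{bmatrix}$ and it follows that $X_1 \sim N(\mu_1, \sigma_1^2)$. It is equally easy to show that any two of the variables in $\mathbf{x}$ will be bivariate normal and so on.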
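As a small numerical illustration of Theorem 2.15, the sketch below computes $\mathbf{A}\boldsymbol{\mu}$ and $\mathbf{A}\Sigma\mathbf{A}'$ for two selection matrices. The mean vector, covariance matrix and helper functions are made up for the example; they are not taken from the text.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical parameters of a trivariate normal vector x.
mu = [[1.0], [2.0], [3.0]]
Sigma = [[4.0, 1.0, 0.5],
         [1.0, 9.0, 2.0],
         [0.5, 2.0, 1.0]]

# A picks out X_1: z = A x is then N(mu_1, sigma_1^2).
A = [[1.0, 0.0, 0.0]]
mean_z = matmul(A, mu)
cov_z = matmul(matmul(A, Sigma), transpose(A))

# A2 picks out (X_1, X_2): the result is the bivariate normal for that pair.
A2 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]
mean_z2 = matmul(A2, mu)
cov_z2 = matmul(matmul(A2, Sigma), transpose(A2))
```

The selection matrices simply extract the corresponding elements of $\boldsymbol{\mu}$ and the corresponding block of $\Sigma$, which is the content of the remark about marginal distributions.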
Chapter 3
Sampling and Estimation

3.1 Samples, statistics and point estimates
The basic approach that we will be using can be explained by means of the diagram given in
Figure 3.1. There are several components to this diagram:
1. We begin with the underlying social/economic processes which are happening in the real world. We assume that these can be represented as random variables $Y_1, Y_2, \ldots, Y_j$. Implicitly we assume that the outcomes of these processes are well defined and can be measured. This may not be true of all processes.
2. The processes in the real world, together with the measurement process (e.g. a survey questionnaire administered by a field team which is coded up in a back room) result in the delivery of real data on our desk top. These data (even if they are a macroeconomic time series or a population census) can be thought of as the outcome of a sampling from that social reality. We will call this (after Mittelhammer et al. (2000)) the Data Sampling Process (DSP). Many authors refer to it as the Data Generating Process. I prefer the Mittelhammer et al. (2000) usage, because it emphasises the fact that data are intrinsically incomplete. Crucially we will assume that the DSP can be fully characterised by some joint probability distribution function over the outcomes $\mathbf{y}_i$. In particular we will assume that we can characterise the DSP as belonging to a given family of distributions although we will not know the precise one. This means that we assume that the distribution of the sample observations is given by the joint distribution function $f(\mathbf{y}_1, \ldots, \mathbf{y}_n|\theta)$, where $\theta$ is a parameter (or vector of parameters) which uniquely identifies the DSP. For instance, $\mathbf{y}_1, \ldots, \mathbf{y}_n$ might be multivariate normal, in which case $\theta$ consists of the vector of means and the covariance matrix.
3. Once we have the data in front of us, we can manipulate them. In particular we can calculate various statistics. These are simply functions of the observations $\mathbf{y}_i$. Some examples:
The sample mean $\bar{y} = \frac{1}{n}\sum y_i$
The sample variance $s^2 = \frac{1}{n-1}\sum (y_i-\bar{y})^2$
The sample covariance $s_{xy} = \frac{1}{n-1}\sum (x_i-\bar{x})(y_i-\bar{y})$
The sample maximum or the sample minimum.
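As a concrete illustration, the statistics listed above can be computed as follows. This is a minimal sketch on made-up data, using the $n-1$ divisor for the variance and covariance; the variable names are chosen for the example.

```python
import statistics

# Made-up observations on two variables.
y = [2.0, 4.0, 4.0, 5.0, 7.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(y)

y_bar = statistics.mean(y)            # sample mean
s2 = statistics.variance(y)           # sample variance, divisor n - 1
x_bar = statistics.mean(x)
# Sample covariance, written out explicitly with divisor n - 1.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
y_max, y_min = max(y), min(y)         # sample maximum and minimum
```

Each of these is a function of the observations only, so each is a statistic in the sense used here.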

Figure 3.1: Estimation of population parameters. [The diagram links three panels. The "Real World" contains processes $Y_1, Y_2, \ldots, Y_j$ (e.g. globalisation, income, BMI, financial policy, labour market status, education, emotions). The data sampling process, represented by $f(Y|\theta)$, delivers the Sample $y_1, y_2, \ldots, y_i, \ldots, y_n$. Estimation uses Statistics computed from the sample, such as $\bar{y}$, $s^2$, max, min and median.]

4. We are not, however, interested in generating functions of the data for their own sake. We would like to generate statistics that provide reasonable estimates of the parameter $\theta$. A statistic that is used in such a way is referred to as an estimator of $\theta$. To make this role clear, an estimator will generally be written as $\hat{\theta}$ or $\tilde{\theta}$ or equivalent. The theory of point estimation of parameters is concerned with establishing procedures that use the data as effectively as possible. It should be clear from Figure 3.1, however, that the estimates will depend on the sample and therefore ultimately on the DSP. To the extent to which we misrepresent the DSP, our estimator will not capture anything of interest either. Another way of saying this is that the properties of the estimators (that we will be discussing below) are all conditional on the DSP truly being represented by $f(\mathbf{y}|\theta)$.
A simplification that we will make upfront is that we will generally assume that the DSP is a
process of simple random sampling (SRS), i.e. we assume that each observation is independently
extracted from the same distribution. It is important to realise that in practice most data do
not arrive in this way:
Macroeconomic data generally comes in the form of time series data where consecutive
observations are related in more or less complicated ways
Microeconomic data generally comes in the form of firm or household surveys which are collected in clusters. If the variables of interest are correlated within households it means that these observations are not independent of each other. For instance if height has a genetic component then two observations extracted from the same household will be more alike than two observations extracted from the population at random.
Despite the fact that simple random sampling processes are the exception rather than the
rule, we will build up the theory on the basis of this assumption and then complicate it for other
sampling processes.
Note that just as the individual observations can be thought of as random variables, so
statistics are random variables. (Functions of random variables are themselves random variables.)
We can therefore talk about the distribution of a particular statistic. The distribution depends of
course on the DSP and on the sample size $n$. This implies that in different samples an estimator
will lead to different estimates. Consequently we will be concerned with what sort of rules are
desirable or even optimal. Below we will consider two types of rules that have been frequently
used in practice:
estimation by maximum likelihood
estimation by method of moments
3.2 Maximum Likelihood Estimation
The principle of maximum likelihood is relatively easy to grasp in the context of discrete random variables. The idea is explained diagrammatically in Figure 3.2. In this diagram we are considering an experiment in which a sample of ten observations is extracted from a Bernoulli distribution with parameter $p$. Assuming that we have simple random sampling, the joint pdf of the sample will be given by $f(\mathbf{y}|p) = p^{\sum y_i}(1-p)^{n-\sum y_i}$ since each random variable has pdf $p^{y_i}(1-p)^{1-y_i}$, with $y_i \in \{0, 1\}$. This pdf tells us how probable different kinds of samples will be. In the left panel of Figure 3.2 we have shown several possible outcomes and the associated probabilities if $p = 0.6$.
We assume that our actual sample is given by $\mathbf{y} = (0, 1, 1, 0, 1, 1, 1, 0, 1, 1)$, i.e. there were seven successes and three failures. The question that we now want to answer is: what would be a reasonable estimate for $p$? Given the outcome, we know that the joint density in this case will be given by $p^7(1-p)^3$. We can now consider how likely the actual sample would have been if the DSP had been characterised by some particular value of $p$. For instance, if $p = 0.5$ we could deduce that the probability that we would have observed this particular sample was $0.5^7 \times 0.5^3 = 0.00097656$. On the other hand, if $p$ was really $0.3$, the probability that we would have observed this sample would only have been $0.3^7 \times 0.7^3 = 0.000075014$.
The maximum likelihood criterion stipulates that we use that estimate of $p$ which maximises the probability that we would have observed this particular sample, i.e. in this case
$$\hat{p} = \arg\max_p\ p^7(1-p)^3$$
In the right panel of Figure 3.2 we see that $p = 0.7$ gives a higher probability than any of the other values that we have displayed. Nonetheless we need to consider all possible values.
If we let $L = p^7(1-p)^3$, we find that $\frac{dL}{dp} = 7p^6(1-p)^3 - 3p^7(1-p)^2$. If we set this equal to zero we find that the optimum must satisfy $\hat{p}^6(1-\hat{p})^2\left(7(1-\hat{p}) - 3\hat{p}\right) = 0$, i.e. $\hat{p} = 0.7$ (the other roots, $\hat{p} = 0$ and $\hat{p} = 1$, give $L = 0$ and are therefore minima). We display the procedure graphically in Figure 3.3. Interestingly enough $\hat{p}$ corresponds precisely to the sample proportion of successes. We will see below that this is no accident.
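The search over "all possible values" can be mimicked numerically. The following sketch evaluates the likelihood for seven successes and three failures on a grid of candidate values; the grid spacing of 0.001 is an arbitrary choice for the illustration, and a grid search is only a stand-in for the calculus argument in the text.

```python
def likelihood(p, successes=7, failures=3):
    """Probability of the observed sample for a candidate value p."""
    return p**successes * (1 - p)**failures

# Candidate values 0, 0.001, ..., 1 (grid spacing is an arbitrary choice).
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)

# Likelihood at the three values discussed in the text.
values = {p: likelihood(p) for p in (0.3, 0.5, 0.7)}
```

On this grid the maximiser coincides with the sample proportion of successes, in line with the derivation above.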
In summary, the method of maximum likelihood estimation proceeds as follows:
1. Describe the pdf of each observation $\mathbf{y}_i$, i.e. $f(\mathbf{y}_i|\theta)$
Figure 3.2: The joint density $f(\mathbf{y}|p)$ represents how likely a particular sample is, given the DSP (left panel). In maximum likelihood estimation we ask how likely the given sample would be if the DSP had been represented by some value $p$ (right panel). [Left panel, "The Data Sampling Process", true DSP Bernoulli with $p = 0.6$: possible sample 1, [1,1,1,1,1,1,1,1,1,1], $f(\mathbf{y}|p) = 0.6^{10} = 0.006047$; possible sample 2, [1,1,1,1,1,1,1,1,1,0], $f(\mathbf{y}|p) = 0.6^9 \times 0.4 = 0.004031$; possible sample 3, [1,1,1,1,1,1,1,1,0,1], $f(\mathbf{y}|p) = 0.6^9 \times 0.4 = 0.004031$; possible sample $k$ (the actual sample), [0,1,1,0,1,1,1,0,1,1], $f(\mathbf{y}|p) = 0.6^7 \times 0.4^3 = 0.001792$; possible sample 1024, [0,0,0,0,0,0,0,0,0,0], $f(\mathbf{y}|p) = 0.4^{10} = 0.0001049$. Right panel, "Maximum Likelihood Estimation", actual sample [0,1,1,0,1,1,1,0,1,1] under possible DSPs: $p = 0.3$, $L(p|\mathbf{y}) = 0.3^7 \times 0.7^3 = 0.000075$; $p = 0.5$, $L(p|\mathbf{y}) = 0.5^7 \times 0.5^3 = 0.000977$; $p = 0.6$ (true DSP), $L(p|\mathbf{y}) = 0.6^7 \times 0.4^3 = 0.001792$; $p = 0.7$, $L(p|\mathbf{y}) = 0.7^7 \times 0.3^3 = 0.00222$; $p = 1$, $L(p|\mathbf{y}) = 1^7 \times 0^3 = 0$.]
Figure 3.3: The probability of observing the given sample changes with $p$. It reaches its maximum at $\hat{p} = 0.7$.
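2. Form the joint pdf of the sample $f(\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n|\theta)$. Normally we will assume independent (and identical) distribution, so that $f(\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n|\theta) = \prod_{i=1}^{n} f(\mathbf{y}_i|\theta)$
3. We rewrite this joint density as the likelihood function
$$L(\theta|\mathbf{y}) = f(\mathbf{y}|\theta)$$
Often we will also work with the loglikelihood
$$\ell(\theta|\mathbf{y}) = \ln L(\theta|\mathbf{y})$$
4. We then maximise $L$ (or equivalently maximise $\ln L$). The maximum likelihood estimator (MLE) $\hat{\theta}$ is such that
$$\hat{\theta} = \arg\max_{\theta} L(\theta|\mathbf{y})$$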
It is clear that this is an intuitively attractive way of estimating the population parameter in the case of discrete distributions, where $f(\mathbf{y}|\theta)$ is really a probability. In the case of continuous distributions, the probability of observing any particular sample will always be zero, since the probability of obtaining particular values is always zero. Nevertheless the value $f(\mathbf{y}|\theta)$ still captures how likely certain outcomes are relative to others. One might think of the probability that $\mathbf{y}$ falls in a small neighbourhood of a particular value as approximately $f(\mathbf{y}|\theta)\,\Delta\mathbf{y}$, so that higher values of $f(\mathbf{y}|\theta)$ represent more likely outcomes.
Note that there is no guarantee that the estimation procedure will give us the true value of
the parameter. What we can hope for, however, is that our estimates will be close to the truth,
in a sense which we will try to make more precise later.
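Example 3.1 Estimating $p$ in a Bernoulli distribution
If $Y_i \sim \text{Bernoulli}(p)$, then $f(y_i|p) = p^{y_i}(1-p)^{1-y_i}$. The joint pdf is given by $f(\mathbf{y}|p) = p^{\sum y_i}(1-p)^{n-\sum y_i}$. Consequently $L(p|\mathbf{y}) = p^{\sum y_i}(1-p)^{n-\sum y_i}$ and
$$\ell(p|\mathbf{y}) = \left(\sum y_i\right)\ln p + \left(n - \sum y_i\right)\ln(1-p)$$
so
$$\frac{\partial \ell}{\partial p} = \frac{\sum y_i}{p} - \frac{n - \sum y_i}{1-p} \tag{3.1}$$
Equating this to zero and solving gives
$$\hat{p} = \frac{\sum y_i}{n}$$
which is just the sample proportion.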
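Example 3.2 Estimating $\mu$ and $\sigma^2$ in a normal distribution
We assume that $Y_i \sim N(\mu, \sigma^2)$. Consequently $f(y_i|\mu, \sigma^2) = \left(2\pi\sigma^2\right)^{-1/2}\exp\left\{-\frac{(y_i-\mu)^2}{2\sigma^2}\right\}$. The joint pdf is
$$f(\mathbf{y}|\mu, \sigma^2) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left\{-\frac{\sum (y_i-\mu)^2}{2\sigma^2}\right\}$$
This becomes the likelihood function, i.e. $L(\mu, \sigma^2|\mathbf{y}) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left\{-\frac{\sum (y_i-\mu)^2}{2\sigma^2}\right\}$. Taking logs we get
$$\ell(\mu, \sigma^2|\mathbf{y}) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{\sum (y_i-\mu)^2}{2\sigma^2} \tag{3.2}$$
Differentiating this with respect to $\mu$ and $\sigma^2$ we get:
$$\frac{\partial \ell}{\partial \mu} = \frac{\sum (y_i-\mu)}{\sigma^2} \tag{3.3}$$
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum (y_i-\mu)^2}{2\sigma^4} \tag{3.4}$$
Setting the derivatives equal to zero, we get the likelihood equations:
$$\frac{\sum (y_i-\hat{\mu})}{\hat{\sigma}^2} = 0 \tag{3.5a}$$
$$-\frac{n}{2\hat{\sigma}^2} + \frac{\sum (y_i-\hat{\mu})^2}{2\hat{\sigma}^4} = 0 \tag{3.5b}$$
We have replaced the true parameters $\mu$ and $\sigma^2$ with their estimates in these equations, since there is no guarantee that the gradient will be precisely zero at the true parameter value. Instead these two equations implicitly define the maximum likelihood estimators. We can explicitly solve out for them. From the first equation we find that
$$\hat{\mu} = \frac{\sum y_i}{n} = \bar{y} \tag{3.6}$$
Substituting this into the second equation and solving we get
$$\hat{\sigma}^2 = \frac{\sum (y_i-\hat{\mu})^2}{n} \tag{3.7}$$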
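The closed-form solutions (3.6) and (3.7) can be checked against the likelihood equations numerically. The sketch below uses made-up data and illustrative function names: it computes the two estimators and then evaluates the gradient (3.3) and (3.4) and the loglikelihood (3.2) at those values.

```python
import math

def normal_mle(y):
    """Closed-form ML estimators (3.6) and (3.7)."""
    n = len(y)
    mu_hat = sum(y) / n
    sigma2_hat = sum((v - mu_hat) ** 2 for v in y) / n
    return mu_hat, sigma2_hat

def loglik(mu, sigma2, y):
    """Loglikelihood (3.2)."""
    n = len(y)
    ss = sum((v - mu) ** 2 for v in y)
    return -n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sigma2) - ss / (2 * sigma2)

def score(mu, sigma2, y):
    """Gradient of the loglikelihood, equations (3.3) and (3.4)."""
    n = len(y)
    d_mu = sum(v - mu for v in y) / sigma2
    d_sigma2 = -n / (2 * sigma2) + sum((v - mu) ** 2 for v in y) / (2 * sigma2**2)
    return d_mu, d_sigma2

y = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]      # made-up sample
mu_hat, sigma2_hat = normal_mle(y)
gradient = score(mu_hat, sigma2_hat, y)  # both components should be (numerically) zero
```

Moving either parameter away from its estimate should lower the value returned by `loglik`, which is a quick way to confirm that the stationary point is a maximum for this sample.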
In both these examples the likelihood function was well behaved and we could get the
optimum through differentiating the function and setting the derivative equal to zero. This will
not always be the case (although it will be for most of the applications that we will consider).
One case where it does not hold is in estimating the parameters of a uniform distribution:
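Example 3.3 Estimating the parameters of a uniform distribution $U(a, b)$.
If $Y_i \sim U(a, b)$, then
$$f(y_i|a, b) = \begin{cases}\dfrac{1}{b-a} & \text{if } a \le y_i \le b \\ 0 & \text{if } y_i < a \text{ or } y_i > b\end{cases}$$
The joint pdf is therefore given by
$$f(y_1, y_2, \ldots, y_n|a, b) = \begin{cases}\dfrac{1}{(b-a)^n} & \text{if } a \le y_i \le b \text{, for all } i \in \{1, 2, \ldots, n\} \\ 0 & \text{if } y_i < a \text{ or } y_i > b \text{, for any } i \in \{1, 2, \ldots, n\}\end{cases}$$
The likelihood function $L(a, b|\mathbf{y})$ is therefore
$$L(a, b|\mathbf{y}) = \begin{cases}\dfrac{1}{(b-a)^n} & \text{if } a \le \min\{y_1, y_2, \ldots, y_n\} \text{ and } b \ge \max\{y_1, y_2, \ldots, y_n\} \\ 0 & \text{if } a > \min\{y_1, y_2, \ldots, y_n\} \text{ or } b < \max\{y_1, y_2, \ldots, y_n\}\end{cases}$$
This function has discontinuities at $a = \min\{y_1, y_2, \ldots, y_n\}$ and $b = \max\{y_1, y_2, \ldots, y_n\}$ respectively, so it cannot be differentiated there. Since $\frac{1}{(b-a)^n}$ increases as the interval $[a, b]$ shrinks, the likelihood is largest for the narrowest interval that still contains every observation. It is obvious, therefore, that $L(a, b|\mathbf{y})$ can be maximised by setting
$$\hat{a} = \min\{y_1, y_2, \ldots, y_n\} \text{ and } \hat{b} = \max\{y_1, y_2, \ldots, y_n\}$$
The minimum and maximum sample values are therefore the ML estimators of the range of the uniform distribution.
It is intuitively obvious that both of these must be biased estimators: we must have $\hat{a} \ge a$ and $\hat{b} \le b$ in every sample. It also seems obvious that the larger the sample the smaller this bias is likely to be.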
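A small simulation makes the direction of the bias visible. This is a sketch under assumed values: the true endpoints, the sample size, the number of replications and the seed are all arbitrary choices made for the illustration.

```python
import random

def uniform_mle(sample):
    """ML estimators of the endpoints: the sample minimum and maximum."""
    return min(sample), max(sample)

rng = random.Random(12345)            # fixed seed so the run is reproducible
a, b, n, reps = 2.0, 5.0, 20, 2000    # illustrative true values and design

estimates = [uniform_mle([rng.uniform(a, b) for _ in range(n)]) for _ in range(reps)]
mean_a_hat = sum(e[0] for e in estimates) / reps
mean_b_hat = sum(e[1] for e in estimates) / reps
```

Every simulated minimum lies at or above the true lower endpoint and every maximum at or below the true upper endpoint, so the averages of the estimates sit strictly inside the true interval.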
3.2.1 Exercises
1. Derive the ML estimator of the parameter of the exponential distribution from a sample of $n$ independent draws from that distribution.
2. Consider the pdf of the discrete random variable $X$ given by
$$f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x \in \{0, 1, 2, \ldots\}, \quad \lambda \in (0, \infty)$$
Assume that you have an independent random sample of size $n$ from this distribution. Show that the maximum likelihood estimator of $\lambda$ is given by the sample mean.
3. Let 2 . Assume that you have independent draws from this distribution.
Estimate and 2 by means of maximum likelihood.
3.3 Method of Moments Estimation
The principle of method of moments estimation is extremely simple to understand. It states that we should equate the sample moments to the population moments and solve out for the underlying parameters¹.
Example 3.4 Estimating the parameters of a $U(a, b)$ distribution
We know that $\mu = \frac{a+b}{2}$ and $\sigma^2 = \frac{(b-a)^2}{12}$. If we equate $\mu = \bar{y}$ and $\sigma^2 = s^2$, we get the two equations in two unknowns:
$$\bar{y} = \frac{a+b}{2}$$
$$s^2 = \frac{(b-a)^2}{12}$$
Our solution to these equations will define the method of moments estimators $\hat{a}$ and $\hat{b}$. We have $a + b = 2\bar{y}$ and $b - a = s\sqrt{12}$. Consequently $\hat{b} = \bar{y} + s\sqrt{3}$ and $\hat{a} = \bar{y} - s\sqrt{3}$.
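The two moment equations can be solved mechanically. In the sketch below the data are made up, and the sample standard deviation uses the $n-1$ divisor; that is one possible choice of sample moment, not something the text prescribes.

```python
import math
import statistics

def uniform_mom(sample):
    """Method of moments estimators for the endpoints of a uniform distribution."""
    y_bar = statistics.mean(sample)
    s = statistics.stdev(sample)      # divisor n - 1 (an assumption of this sketch)
    return y_bar - math.sqrt(3) * s, y_bar + math.sqrt(3) * s

sample = [2.1, 4.7, 3.3, 2.9, 4.2, 3.8]   # made-up observations
a_mom, b_mom = uniform_mom(sample)
```

By construction the fitted distribution reproduces the sample mean and variance exactly. Unlike the ML estimators, nothing forces the fitted interval to contain every observation.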
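Example 3.5 Estimating the parameters of a gamma$(\alpha, \beta)$ distribution
The pdf of the gamma distribution is given by $f(y) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}y^{\alpha-1}e^{-y/\beta}$. Consequently if we have a sample $(y_1, y_2, \ldots, y_n)$ then the likelihood function will be given by
$$L(\alpha, \beta|y_1, \ldots, y_n) = \frac{1}{\Gamma(\alpha)^n\beta^{n\alpha}}\left(y_1 y_2 \cdots y_n\right)^{\alpha-1}\exp\left\{-\frac{\sum y_i}{\beta}\right\}$$
It is difficult to maximise this expression with respect to $\alpha$ and $\beta$, because of the gamma function in the denominator. Certainly there is no convenient analytical expression for the solution. If we use the method of moments instead, we know that $\mu = \alpha\beta$ and $\sigma^2 = \alpha\beta^2$. Equating the sample moments to the population moments we get the two equations
$$\bar{y} = \alpha\beta$$
$$s^2 = \alpha\beta^2$$
The estimators are
$$\hat{\alpha} = \frac{\bar{y}^2}{s^2}, \qquad \hat{\beta} = \frac{s^2}{\bar{y}}$$
¹ My presentation here is not entirely rigorous - I mix up centred and uncentred moments. Provided the corresponding sample moments are consistent estimators, the results will hold.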
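The gamma estimators can be tried out on simulated data. The sketch below assumes the shape and scale parameterisation with mean $\alpha\beta$ and variance $\alpha\beta^2$; Python's `random.gammavariate(alpha, beta)` uses that same parameterisation. The true values, sample size and seed are arbitrary choices for the illustration.

```python
import random
import statistics

def gamma_mom(sample):
    """Method of moments estimators (alpha_hat, beta_hat) for shape and scale."""
    y_bar = statistics.mean(sample)
    s2 = statistics.variance(sample)
    return y_bar**2 / s2, s2 / y_bar

rng = random.Random(2011)             # fixed seed for reproducibility
alpha, beta = 3.0, 2.0                # illustrative true shape and scale
sample = [rng.gammavariate(alpha, beta) for _ in range(20000)]
alpha_hat, beta_hat = gamma_mom(sample)
```

No numerical optimisation is needed: two sample moments are computed and the two equations are inverted, which is the attraction of the method in this case.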
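Example 3.6 Estimating the parameters of a $LN(\mu, \sigma^2)$ distribution
If $Y \sim LN(\mu, \sigma^2)$, then the most obvious way of estimating the parameters $\mu$ and $\sigma^2$ would be via the variable $X = \ln Y$, since $X \sim N(\mu, \sigma^2)$, so we know what the appropriate MLE would be. If, for any reason, we cannot get access to the individual level data, but we have access to the moments of the distribution of $Y$, we could still estimate the parameters $\mu$ and $\sigma^2$ by the method of moments. We have
$$E(Y) = \exp\left(\mu + \frac{\sigma^2}{2}\right)$$
$$E\left(Y^2\right) = \exp\left(2\mu + 2\sigma^2\right)$$
Equating these to the sample moments we get
$$\bar{y} = \exp\left(\mu + \frac{\sigma^2}{2}\right)$$
$$s^2 + \bar{y}^2 = \exp\left(2\mu + 2\sigma^2\right)$$
So
$$\hat{\sigma}^2 = \ln\left(s^2 + \bar{y}^2\right) - \ln\bar{y}^2$$
$$\hat{\mu} = \ln\bar{y}^2 - \frac{\ln\left(s^2 + \bar{y}^2\right)}{2}$$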
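A quick way to check these formulas is a round trip: start from known parameter values, compute the implied population mean and variance of $Y$, and feed them to the estimator. The function name and parameter values below are illustrative only.

```python
import math

def lognormal_mom(y_bar, s2):
    """Invert the two moment equations of the lognormal for (mu, sigma^2)."""
    m2 = s2 + y_bar**2                          # uncentred second moment
    sigma2_hat = math.log(m2) - 2 * math.log(y_bar)
    mu_hat = 2 * math.log(y_bar) - 0.5 * math.log(m2)
    return mu_hat, sigma2_hat

mu, sigma2 = 1.0, 0.25                          # illustrative true parameters
mean_y = math.exp(mu + sigma2 / 2)              # E(Y)
second_moment = math.exp(2 * mu + 2 * sigma2)   # E(Y^2)
var_y = second_moment - mean_y**2

mu_back, sigma2_back = lognormal_mom(mean_y, var_y)
```

With exact population moments as inputs the original parameters are recovered up to rounding. With sample moments the same function returns the method of moments estimates.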
The basic idea should be very clear by now. One thing which may not be clear is what to
do if we have more than the required moments. For instance the normal distribution is fully
identified by just two moments. Empirically, however, there would be no problem in calculating
the third or even the fourth sample moments. At the population level this additional information
would be redundant - any two equations would give the same parameter values. In a random
sample this will, however, not be the case. Different subsets of the equations would give different
results. The simplest solution might be to simply throw away the extra information and just use
the first two moments to estimate the two parameters. On the other hand, this seems a waste
of good information. The question of how to deal with the extra information is tackled by the
Generalised Method of Moments (GMM).
3.3.1 Exercises
1. Define the triangular distribution $T(a)$ as
$$f(x|a) = \begin{cases}\dfrac{a - |x|}{a^2} & \text{if } |x| \le a \\ 0 & \text{elsewhere}\end{cases}$$
where $a > 0$ is some positive constant.
(a) Verify that $f$ is a proper pdf.
(b) Find the cdf of this distribution.
(c) Find the mean and variance of this distribution.
(d) What would be the appropriate method of moments estimator of $a$?
3.4 Other rules
It is important to understand that ML estimation and MoM estimation are not the only approaches to estimation.
3.4.1 Rules of thumb
There are a number of approaches which are based on intuition or heuristic rules. When business
economists predict inflation or growth they frequently seem to do so on the basis of instinct.
When chartists extrapolate trends they do so by analogy with previous patterns in the data.
Even in more scientific parts of economics, analysts frequently have strong intuitions about
what sort of results one should expect. For instance when Card and Krueger suggested that
employment did not decline with increases in the minimum wage there were many economists
who simply did not believe their results.
3.4.2 Bayesian estimation
In Bayesian estimation these prior beliefs are explicitly modelled. The analyst specifies how
likely different parameters might be in the form of a prior distribution. This distribution is then
updated in the light of the empirical evidence (by Bayes's law) to give the posterior distribution.
This does not yield a point estimate, but a range of estimates. This range, however, builds in other
information, unlike traditional confidence intervals. We will not discuss Bayesian approaches in
this course.
3.4.3
Pretest estimators
Many empirical research projects follow a particular strategy:

estimate a simple model
test to see if various assumptions are tenable
if the specification tests fail, re-estimate the model
It is important to understand that this entire package can itself be understood as a rule
for arriving at estimates. The theory of pretest estimation is again beyond the scope of this
course. Nevertheless it is important to be aware that this procedure may in itself introduce
systematic biases in the kinds of results that are achieved. One particular version of this bias is
publication bias - if the analyst does not find the results significant at the conventional levels
(5%) this non-result may not be deemed publishable. A scan of the empirical literature is
therefore likely to pick up only the positive results.
3.4.4 Bias adjusted estimators
We will see below that in a number of contexts analysts make adjustments to a ML or MoM
estimator in order to remove a particular source of bias.
3.5 Sampling distribution
We have seen that there may be more than one way of estimating a set of parameters. This raises
the question as to how we might decide between different estimators. In order to assess this we
will generally be concerned in the first instance with the sampling distribution of the estimates,
i.e. how the estimator would behave if we had the luxury of repeating the experiment very many
times. We will also in due course consider the asymptotic properties of the estimator, i.e. how
the estimator would behave if we had the luxury of enlarging our sample indefinitely.
Example 3.7 Sampling distribution of $\hat{p}$, the sample proportion from $n$ independent draws from a Bernoulli$(p)$ distribution
We showed above that $\hat{p} = \frac{\sum y_i}{n}$ is the MLE of $p$. If the distribution on each draw really is Bernoulli, then we know that $X = \sum Y_i$ is distributed as binomial with parameters $n$ and $p$, i.e.
$$f(x|n, p) = \binom{n}{x}p^x(1-p)^{n-x}, \quad x \in \{0, 1, \ldots, n\}, \quad p \in [0, 1]$$
We can use the change of variable technique to get the distribution of $\hat{p}$. We know that $\hat{p} = \frac{x}{n}$ so $x = n\hat{p}$ and the sampling distribution of this estimator (statistic) is given by:
$$f(\hat{p}|n, p) = \binom{n}{n\hat{p}}p^{n\hat{p}}(1-p)^{n-n\hat{p}}, \quad \hat{p} \in \left\{0, \frac{1}{n}, \frac{2}{n}, \ldots, 1\right\}, \quad p \in [0, 1]$$
(Note that we do not need to put in a term for the Jacobian of the transformation, since this is a discrete pdf and not a continuous one.)
We observe that $\hat{p}$ can only take on $(n + 1)$ discrete values, so this is a discrete pdf. In Figure 3.4 we graph some examples of what the true sampling distribution would look like.
We observe that in each case the distribution of the estimator is centred on the true population parameter. This need not be the case, in general.
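Because this sampling distribution is discrete, it can be tabulated exactly. The sketch below builds the distribution of the sample proportion for one of the cases in Figure 3.4 and computes its mean; the function name is chosen for the illustration.

```python
import math

def sampling_distribution(n, p):
    """Exact pmf of the sample proportion x/n when x is binomial(n, p)."""
    return {x / n: math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

dist = sampling_distribution(10, 0.8)           # the case p = 0.8, n = 10
total = sum(dist.values())                      # probabilities should sum to one
mean = sum(value * prob for value, prob in dist.items())
```

The mean of the tabulated distribution equals the true parameter up to rounding, which is the sense in which the distribution is "centred" on it.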
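Example 3.8 Distribution of the sample mean from $n$ independent draws from a $N(\mu, \sigma^2)$ distribution
We can write $\bar{y} = \frac{1}{n}\mathbf{i}'\mathbf{y}$ where $\mathbf{i}' = \begin{bmatrix}1 & 1 & \cdots & 1\end{bmatrix}$, $\mathbf{y} = \begin{bmatrix}y_1 & y_2 & \cdots & y_n\end{bmatrix}'$. Since each $y_i$ is independently distributed as $N(\mu, \sigma^2)$, their joint distribution is multivariate normal with mean $\boldsymbol{\mu} = \begin{bmatrix}\mu & \mu & \cdots & \mu\end{bmatrix}' = \mu\mathbf{i}$ and diagonal covariance matrix $\Sigma = \sigma^2\mathbf{I}$. By Theorem 2.15 of Chapter 2 we have $\frac{1}{n}\mathbf{i}'\mathbf{y} \sim N\left(\frac{1}{n}\mathbf{i}'\mu\mathbf{i},\ \frac{1}{n}\mathbf{i}'\left(\sigma^2\mathbf{I}\right)\mathbf{i}\frac{1}{n}\right)$, i.e.
$$\bar{y} \sim N\left(\mu, \frac{1}{n}\sigma^2\right)$$
Note that the variance of the sample mean is smaller than the variance of the original distribution by a factor of $n$.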
Figure 3.4: Sampling distribution of $\hat{p}$ calculated on a sample of $n$ independent draws from a Bernoulli$(p)$ distribution. [Four panels plotting probability against the sample proportion: $p = 0.5$, $n = 10$; $p = 0.8$, $n = 10$; $p = 0.5$, $n = 100$; $p = 0.8$, $n = 100$.]

Figure 3.5: Distribution of $\hat{\sigma}^2$ from a $N(0, 2)$ distribution, with $n = 10$, $n = 25$ and $n = 100$.
Example 3.9 Distribution of the MLE estimator $\hat{\sigma}^2$ from $n$ independent draws from a $N(\mu, \sigma^2)$ distribution
It is possible to show that
$$\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-1)$$
The sampling distribution of $\hat{\sigma}^2$ can therefore be derived by change of variable techniques from the $\chi^2$ distribution. In Figure 3.5 we graph some examples. Note that the mode of the sample estimates is below the true value. Since the mean of a $\chi^2(n-1)$ variable is $n-1$ it follows that $E\left(\hat{\sigma}^2\right) = \frac{n-1}{n}\sigma^2$, so the mean of the estimator also undershoots the true parameter value.
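The undershoot can be illustrated by simulation. The sketch below uses the setting of Figure 3.5 (mean zero, variance 2) with $n = 10$; the number of replications and the seed are arbitrary choices made for this example.

```python
import math
import random

def sigma2_mle(sample):
    """ML estimator of the variance: divisor n rather than n - 1."""
    n = len(sample)
    m = sum(sample) / n
    return sum((v - m) ** 2 for v in sample) / n

rng = random.Random(42)                 # fixed seed for reproducibility
sigma2, n, reps = 2.0, 10, 20000        # illustrative design
sigma = math.sqrt(sigma2)

draws = [sigma2_mle([rng.gauss(0.0, sigma) for _ in range(n)]) for _ in range(reps)]
average = sum(draws) / reps
expected = (n - 1) / n * sigma2         # theoretical mean of the estimator
```

The simulated average should lie close to $\frac{n-1}{n}\sigma^2$ rather than to $\sigma^2$ itself.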
Example 3.10 Distribution of the sample minimum $\hat{a}$ from $n$ independent draws from a $U(a, b)$ distribution
We have
$$\Pr\left(\min\{Y_1, Y_2, \ldots, Y_n\} > y\right) = \Pr(Y_1 > y)\Pr(Y_2 > y)\cdots\Pr(Y_n > y)$$
$$= (1 - F(y))(1 - F(y))\cdots(1 - F(y))$$
$$= (1 - F(y))^n$$
where $F$ is the cdf of the population from which the sample is drawn, i.e. the cdf of the minimum is
$$1 - (1 - F(y))^n$$
So the pdf of the minimum is
$$g(y) = n(1 - F(y))^{n-1}f(y)$$
where $f$ is the pdf of the original population. Note that this holds true in general!
In the case of the uniform distribution $U(a, b)$ we have
$$g(y) = n\left(1 - \frac{y-a}{b-a}\right)^{n-1}\frac{1}{b-a}, \text{ if } a \le y \le b$$
$$= \frac{n(b-y)^{n-1}}{(b-a)^n}, \text{ if } a \le y \le b$$
Figure 3.6 gives some examples of this sampling distribution from a $U(0, 1)$ distribution. We observe that as $n$ increases, the distribution becomes increasingly concentrated around zero.
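The formula for the cdf of the minimum can be compared with simulated minima. This is a sketch for the $U(0, 1)$ case; the cutoff, sample size, number of replications and seed are arbitrary choices for the illustration.

```python
import random

def min_cdf_uniform(y, n, a=0.0, b=1.0):
    """Cdf of the sample minimum: 1 - (1 - F(y))^n with F the uniform cdf."""
    F = (y - a) / (b - a)
    return 1 - (1 - F) ** n

def min_pdf_uniform(y, n, a=0.0, b=1.0):
    """Pdf of the sample minimum for a <= y <= b."""
    return n * (b - y) ** (n - 1) / (b - a) ** n

rng = random.Random(7)                  # fixed seed for reproducibility
n, reps, cutoff = 10, 20000, 0.1
minima = [min(rng.random() for _ in range(n)) for _ in range(reps)]

empirical = sum(m <= cutoff for m in minima) / reps   # share of minima below the cutoff
theoretical = min_cdf_uniform(cutoff, n)
```

The empirical share and the theoretical probability should agree to within simulation error, and the density at the lower endpoint equals $n$ in the $U(0, 1)$ case, which is why the curves in Figure 3.6 start higher for larger samples.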
Figure 3.6: Distribution of the sample minimum from a $U(0, 1)$ distribution, with sample sizes $n = 10$, $n = 25$ and $n = 100$ respectively.
3.5.1 Exercises
1. Derive the sampling distribution of the sample maximum $\hat{b}$ in $n$ independent draws from a $U(a, b)$ distribution.
2. Derive the sampling distribution of the ML estimator of the exponential distribution. You may want to first show that $\sum y_i$ follows a gamma distribution by using the MGF of the exponential distribution. Then plot this distribution for a parameter value of 1 and $n = 10$, $n = 25$ and $n = 100$.
3. Derive the sampling distribution of the ML estimator $\hat{\mu}$ of the $N(\mu, \sigma^2)$ distribution.
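3.6 Finite sample properties of estimators
3.6.1 Bias
Definition 3.11 Bias
The bias of an estimator $\hat{\theta}$ is
$$Bias\left(\hat{\theta}\right) = E\left(\hat{\theta}\right) - \theta$$
Definition 3.12 Unbiased estimator
An estimator $\hat{\theta}$ of a parameter (vector) $\theta$ is unbiased if the mean of its sampling distribution is $\theta$, i.e.
$$E\left(\hat{\theta}\right) = \theta$$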
Example 3.13 The MLE estimators of the $N(\mu, \sigma^2)$ distribution
We have shown in example 3.8 that $\hat{\mu} = \bar{y} \sim N\left(\mu, \frac{1}{n}\sigma^2\right)$. It follows immediately that $E(\hat{\mu}) = \mu$. By contrast we have seen (in example 3.9) that $E\left(\hat{\sigma}^2\right) = \frac{n-1}{n}\sigma^2$, so the MLE estimator of the variance is biased, with bias $-\frac{1}{n}\sigma^2$. This is why most analysts prefer to use the bias adjusted estimator $s^2 = \frac{n}{n-1}\hat{\sigma}^2$. Note that the size of this bias becomes small as $n \to \infty$.
Example 3.14 The MLE estimator of $a$ in a $U(a, b)$ distribution
We showed that the MLE is the sample minimum and that this has the distribution given in Example 3.10. We can calculate the expected value of this estimator:
$$E(\hat{a}) = \int_a^b y\,\frac{n(b-y)^{n-1}}{(b-a)^n}\,dy = b - \frac{n(b-a)}{1+n} = \frac{na + b}{1+n} = a + \frac{b-a}{1+n}$$
Consequently the bias is $\frac{b-a}{1+n}$. We observe that this bias vanishes as $n \to \infty$.
Exercise
1. Discuss the bias of the sample proportion $\hat{p}$ from $n$ independent draws from a Bernoulli distribution.
2. Discuss the bias of the ML estimator of $b$ in a $U(a, b)$ distribution.
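3.6.2 Minimum variance
Definition 3.15 Efficient unbiased estimator
An unbiased estimator $\hat{\theta}_1$ is more efficient than another unbiased estimator $\hat{\theta}_2$ if the sampling variance of $\hat{\theta}_1$ is less than that of $\hat{\theta}_2$, i.e.
$$Var\left(\hat{\theta}_1\right) < Var\left(\hat{\theta}_2\right)$$
or, in the multivariate context:
$$Var\left(\hat{\theta}_2\right) - Var\left(\hat{\theta}_1\right) \text{ is nonnegative definite}$$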
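Cramér-Rao Lower Bound
Theorem 3.16 Cramér-Rao Lower Bound
Assuming that the density of $y$ satisfies certain regularity conditions, the variance of an unbiased estimator of a parameter $\theta$ will always be at least as large as
$$[I(\theta)]^{-1} = \left(-E\left[\frac{\partial^2 \ln L(\theta)}{\partial\theta^2}\right]\right)^{-1} = \left(E\left[\left(\frac{\partial \ln L(\theta)}{\partial\theta}\right)^2\right]\right)^{-1}$$
The multivariate analogue of this is that the difference between the covariance matrix of any unbiased estimator and the inverse of the information matrix
$$[\mathbf{I}(\theta)]^{-1} = \left(-E\left[\frac{\partial^2 \ln L(\theta)}{\partial\theta\,\partial\theta'}\right]\right)^{-1} = \left(E\left[\left(\frac{\partial \ln L(\theta)}{\partial\theta}\right)\left(\frac{\partial \ln L(\theta)}{\partial\theta'}\right)\right]\right)^{-1}$$
will be a nonnegative definite matrix (Greene 2003, p. 889-90).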
This theorem is important because it states that there is a point beyond which unbiased
estimators cannot get more precise. There will always be some intrinsic sampling variance in
any such estimator. On the positive side, if we can establish that an unbiased estimator has the variance
described in this theorem, then we can be sure that it must be an efficient estimator.
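Example 3.17 Obtaining the Information Matrix $\mathbf{I}(\theta)$ of the ML estimators of the parameters of the $N(\mu, \sigma^2)$
The information matrix is the negative of the expected Hessian matrix (matrix of second derivatives) of the loglikelihood with respect to the parameters. In Example 3.2 we derived the gradient of the loglikelihood function (equations 3.3 and 3.4). Differentiating these again we get:
$$\frac{\partial^2 \ell}{\partial\mu^2} = -\frac{n}{\sigma^2}, \qquad \frac{\partial^2 \ell}{\partial\mu\,\partial\sigma^2} = -\frac{\sum (y_i-\mu)}{\sigma^4}, \qquad \frac{\partial^2 \ell}{\partial\left(\sigma^2\right)^2} = \frac{n}{2\sigma^4} - \frac{\sum (y_i-\mu)^2}{\sigma^6}$$
Now $\mathbf{I}(\theta) = -E\left[\frac{\partial^2 \ln L(\theta)}{\partial\theta\,\partial\theta'}\right]$, where in this case $\theta = \begin{bmatrix}\mu & \sigma^2\end{bmatrix}'$. We get
$$\mathbf{I}(\theta) = E\begin{bmatrix}\dfrac{n}{\sigma^2} & \dfrac{\sum (y_i-\mu)}{\sigma^4} \\ \dfrac{\sum (y_i-\mu)}{\sigma^4} & -\dfrac{n}{2\sigma^4} + \dfrac{\sum (y_i-\mu)^2}{\sigma^6}\end{bmatrix}$$
Observing that the only random variables in any of these terms are the $y_i$, and each $y_i \sim N(\mu, \sigma^2)$, it is evident that $E\left(\sum (y_i-\mu)\right) = 0$ and $E\left[\sum (y_i-\mu)^2\right] = n\sigma^2$. Consequently
$$\mathbf{I}(\theta) = \begin{bmatrix}\dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4}\end{bmatrix}$$
Consequently the Cramér-Rao lower bound is given by
$$\mathbf{I}(\theta)^{-1} = \begin{bmatrix}\dfrac{\sigma^2}{n} & 0 \\ 0 & \dfrac{2\sigma^4}{n}\end{bmatrix}$$
Now we have shown that $\hat{\mu}$ is unbiased and that $Var(\hat{\mu}) = \frac{\sigma^2}{n}$ (in Example 3.8). Since this estimator reaches the CRLB, it follows by Theorem 3.16 that this estimator is efficient - i.e. there can be no unbiased estimator with a smaller variance.
Theorem 3.16 does not apply to the MLE $\hat{\sigma}^2$ since it is not an unbiased estimator! We can calculate the variance of the bias-adjusted estimator $s^2$ since $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1)$. We know that the variance of this random variable is $2(n-1)$, so the variance of $s^2$ is $\left(\frac{\sigma^2}{n-1}\right)^2 2(n-1) = \frac{2\sigma^4}{n-1}$.
It follows that $s^2$ does not reach the Cramér-Rao lower bound. Nevertheless it can be shown that there is no unbiased estimator that has a lower variance than $s^2$.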
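The gap between the variance of the bias-adjusted estimator and the lower bound can be seen in a simulation. The sketch below assumes standard normal data with $n = 10$; the number of replications and the seed are arbitrary choices made for this illustration.

```python
import random

def sample_variance(sample):
    """Bias-adjusted variance estimator s^2 (divisor n - 1)."""
    n = len(sample)
    m = sum(sample) / n
    return sum((v - m) ** 2 for v in sample) / (n - 1)

rng = random.Random(3)                  # fixed seed for reproducibility
sigma2, n, reps = 1.0, 10, 20000        # illustrative design

s2_draws = [sample_variance([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(reps)]
mean_s2 = sum(s2_draws) / reps
var_s2 = sum((d - mean_s2) ** 2 for d in s2_draws) / reps   # simulated variance of s^2

crlb = 2 * sigma2**2 / n                # Cramer-Rao lower bound for the variance parameter
theory = 2 * sigma2**2 / (n - 1)        # variance of s^2 derived in the text
```

The simulated variance should sit near $\frac{2\sigma^4}{n-1}$, slightly above the bound $\frac{2\sigma^4}{n}$. The gap shrinks as $n$ grows.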
Exercises
1. Find the information for the MLE $\hat{p}$ derived in Example 3.1. Hence discuss the efficiency of this ML estimator.
2. Find the information for the ML estimator of the exponential distribution. Hence discuss the efficiency of this ML estimator.
3. Use the fact that $\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-1)$ to calculate the variance of the ML estimator $\hat{\sigma}^2$. Use this to reflect on the applicability of Theorem 3.16.
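3.6.3 Mean Square Error
Definition 3.18 Mean-Squared Error
The mean-squared error of an estimator $\hat{\theta}$ is
$$MSE\left(\hat{\theta}\right) = E\left[\left(\hat{\theta}-\theta\right)^2\right] = Var\left(\hat{\theta}\right) + \left[Bias\left(\hat{\theta}\right)\right]^2 \text{, if } \theta \text{ is a scalar}$$
$$MSE\left(\hat{\theta}\right) = E\left[\left(\hat{\theta}-\theta\right)\left(\hat{\theta}-\theta\right)'\right] = Var\left(\hat{\theta}\right) + Bias\left(\hat{\theta}\right)Bias\left(\hat{\theta}\right)' \text{ if } \theta \text{ is a vector}$$
Exercises
1. Compare the MSE of the ML estimator $\hat{\sigma}^2$ and the sample variance $s^2$ in estimating the variance in a $N(\mu, \sigma^2)$ distribution.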
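The scalar decomposition in Definition 3.18 follows from adding and subtracting the mean of the estimator. The steps below are a standard derivation, written with the shorthand $E\hat{\theta}$ for $E(\hat{\theta})$:

```latex
\begin{aligned}
E\left[(\hat{\theta}-\theta)^2\right]
  &= E\left[\left(\hat{\theta}-E\hat{\theta}+E\hat{\theta}-\theta\right)^2\right] \\
  &= E\left[(\hat{\theta}-E\hat{\theta})^2\right]
     + 2\,(E\hat{\theta}-\theta)\,E\left[\hat{\theta}-E\hat{\theta}\right]
     + (E\hat{\theta}-\theta)^2 \\
  &= Var(\hat{\theta}) + \left[Bias(\hat{\theta})\right]^2
\end{aligned}
```

The cross term vanishes because $E\left[\hat{\theta}-E\hat{\theta}\right] = 0$ and $E\hat{\theta}-\theta$ is a constant.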
3.6.4 Invariance
Another highly desirable property of an estimator is that it yields the same estimate regardless of how the problem is parameterised. For instance, in Example 3.2 we derived the ML estimators on the assumption that we were estimating $\mu$ and $\sigma^2$. It would have been somewhat disturbing if the estimator had changed if we had wanted to estimate $\mu$ and $\sigma$ instead. One of the properties of ML estimation is that it is invariant in this sort of way. If we initially set the problem up in terms of the parameter vector $\theta$ and get the maximum likelihood estimator $\hat{\theta}$, then if we re-map the problem in terms of the parameters
$$\gamma = c(\theta)$$
where the function $c(\cdot)$ is a mapping from the old parameters to the new ones, then
$$\hat{\gamma} = c\left(\hat{\theta}\right)$$
i.e. applying the mapping to the ML estimators gives the ML estimators of the new parameters. In short to obtain the ML estimator of $\sigma$ we can just take the square root of the ML estimator of $\sigma^2$.
This result is of some practical importance, because there may be ways of parameterising a problem so that it is easier to obtain estimates. Then one can apply the appropriate transformations to get estimates for the particular problem one started off with.
One setting in which regression packages (like Stata) do this routinely is when the parameters need to obey certain constraints. For instance a variance has to be positive. One could set the maximisation problem up as a constrained maximisation problem with a non-negativity constraint. In practice it is frequently easier to reparameterise the problem. For instance, one can set the parameter that has to be positive, say $\sigma$, to be equal to a positive function of another parameter, say $\exp(\delta)$, and then optimise with respect to $\delta$. When the estimates have been obtained one transforms them back into the form that one was interested in.
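The reparameterisation trick can be sketched for the normal case. Below, the standard deviation is written as $\exp(\delta)$, the loglikelihood is maximised over the unconstrained $\delta$, and the result is transformed back. The data, the search interval and the simple ternary search are choices made for this illustration; they are not a description of how any particular package does it.

```python
import math

def loglik_delta(delta, y, mu):
    """Normal loglikelihood (constants dropped) with sigma = exp(delta)."""
    n = len(y)
    ss = sum((v - mu) ** 2 for v in y)
    return -n * delta - ss / (2 * math.exp(2 * delta))

def maximise(f, lo, hi, iters=200):
    """Ternary search for the maximum of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

y = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]              # made-up sample
mu_hat = sum(y) / len(y)

# Unconstrained search over delta; sigma = exp(delta) is positive by construction.
delta_hat = maximise(lambda d: loglik_delta(d, y, mu_hat), -5.0, 5.0)
sigma_hat = math.exp(delta_hat)

# Direct ML estimator of the variance, for comparison.
sigma2_hat = sum((v - mu_hat) ** 2 for v in y) / len(y)
```

Transforming the unconstrained estimate back gives the square root of the ML estimator of the variance, as the invariance property predicts.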
3.7 Monte Carlo simulations
In many situations it is very difficult to derive the precise sampling distribution of a particular estimator. One of the tools that has become available to the econometrician is the ability to simulate the sampling distribution. In Monte Carlo studies one specifies the distribution and then simulates the process of extracting samples and calculating statistics on them. With modern computing power it is possible to do this thousands of times. The distributions obtained in this way should approximate the real sampling distributions very closely. The theory underpinning this statement will be explored in the next chapter.
There are several issues that have to be confronted in any Monte Carlo simulation:
How can one ensure that the results reported are reproducible, given that they are supposed to be the outcome of a set of random experiments?
The fact of the matter is that the random numbers generated by the computer are, in fact, not really random. They come out of a deterministic process, even though they behave like random numbers. Provided that one specifies where one starts the process off (the random number seed) and which package one is using, the results are completely deterministic.
How many samples are sufficient?
That depends a bit on the process being simulated, but around 10 000 replications should generally be good enough to persuade most sceptics.
How large a sample size should I pick?
One hopes that the qualitative results are not too dependent on the precise sample size.
For many empirical problems it is sensible to pick sample sizes similar to those encountered
in actual research. As we will see in the next chapter large samples tend to be much
better behaved than small ones, so it is probably a good idea to pick intermediate sample
sizes.
What part of the parameter space should be explored?
Again one hopes that the results are not dependent on the precise parameters. That said,
it is probably sensible to pick several combinations of parameters to explore the sensitivity
of the results to these.
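The points above translate into a very small simulation skeleton. The sketch below is illustrative only: the function names, the statistic (the sample minimum of uniform draws), the sample size and the seed are all choices made for the example.

```python
import random

def monte_carlo(statistic, draw_sample, reps, seed):
    """Simulate the sampling distribution of a statistic with a fixed seed."""
    rng = random.Random(seed)           # the seed pins down the whole sequence of draws
    return [statistic(draw_sample(rng)) for _ in range(reps)]

def draw_uniform_sample(rng, n=25):
    """One sample of n independent draws from U(0, 1)."""
    return [rng.random() for _ in range(n)]

# Two runs with the same seed and 10 000 replications each.
run_1 = monte_carlo(min, draw_uniform_sample, reps=10_000, seed=2011)
run_2 = monte_carlo(min, draw_uniform_sample, reps=10_000, seed=2011)
```

Both runs produce identical results because they start the generator from the same seed. Changing `n` or the distribution inside `draw_uniform_sample` is how one would explore different sample sizes and parts of the parameter space.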
There are also practical issues. The most immediate one is how to simulate a draw from an
arbitrary distribution! Most random number generators spit out numbers between zero and one
that are uniformly distributed, i.e. any fraction (up to about eight decimal places) is equally
likely. The trick then is to convert these random numbers into a draw from the appropriate
distribution:
In the case of the Bernoulli distribution with parameter $p$ we set our random variable equal
to one whenever our random number is less than (or equal to) $p$ and we set it equal to zero
if it is greater than $p$.
Other discrete distributions can be handled analogously.
In the case of absolutely continuous distributions we make use of the fact that the cumulative distribution function $F(x)$ is a monotonically increasing function that gives values
between zero and one. We can therefore use the inverse function $F^{-1}$ to convert random
numbers between zero and one into values of the random variable $X$. It is useful to see
why this works:
Take the percentiles of the uniform distribution, i.e. $\{0.01, 0.02, 0.03, \ldots, 0.99, 1\}$.
Now take the inverse function $F^{-1}$ and assume that these points map to $x_1, x_2, \ldots, x_{100}$.
We know that $\Pr(X \le x_i) = F(x_i)$. However $F(x_i) = \frac{i}{100}$, i.e. $x_i$ is the $i$-th
percentile of the distribution of $X$.
The random number generator spits out numbers so that the probability of getting a
number between 0.1 and 0.7 is 60%. Correspondingly, the probability of drawing a value
of the random variable between $x_{10}$ and $x_{70}$ is also exactly 60%. The draws therefore
happen in each part of the distribution precisely according to the cumulative distribution
$F$.
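The two conversions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exponential distribution is chosen here because its inverse CDF has a closed form, and the function names `draw_bernoulli` and `draw_exponential`, the seed and the parameter values are all hypothetical choices, not part of the original notes.

```python
import math
import random


def draw_bernoulli(p, rng):
    """Return 1 if the uniform draw is at most p, and 0 otherwise."""
    return 1 if rng.random() <= p else 0


def draw_exponential(rate, rng):
    """Inverse-CDF draw: F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -ln(1 - u) / rate."""
    u = rng.random()  # uniform on [0, 1), so 1 - u is never zero
    return -math.log(1.0 - u) / rate


rng = random.Random(12345)  # fixing the seed makes the run reproducible
n = 100_000

bern_mean = sum(draw_bernoulli(0.3, rng) for _ in range(n)) / n
expo_draws = [draw_exponential(2.0, rng) for _ in range(n)]
expo_mean = sum(expo_draws) / n

# The share of draws below the theoretical median ln(2)/rate should be near one half.
share_below_median = sum(x <= math.log(2.0) / 2.0 for x in expo_draws) / n

print(bern_mean, expo_mean, share_below_median)
```

With a large number of draws the Bernoulli mean should be close to the chosen parameter and the exponential mean close to one over the rate, which is an informal check that the draws follow the intended distributions.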
Example 3.19 The sampling distributions of different estimators of the parameter $a$
from a $U(a, b)$ distribution
In Example 3.10 we derived the theoretical distribution of the MLE of $a$. It is much more
intractable to derive the distribution of the MoM estimator. As Example 3.14 showed, however,
we might be interested also in adjusting the MLE for bias. In Figure 3.7 we show the sampling
distributions for three estimators:
The MLE estimator: $\hat{a}_{MLE} = \min\{x_1, \ldots, x_n\}$
The MoM estimator: $\hat{a}_{MoM} = \bar{x} - \sqrt{3}\,\hat{\sigma}$
The bias-adjusted MLE estimator: $\hat{a}_{adj} = \hat{a}_{MLE} - \frac{\hat{b}_{MLE} - \hat{a}_{MLE}}{n - 1}$, where $\hat{b}_{MLE}$ is the sample
maximum.
The sample statistics from these 10 000 replications are as follows:

Variable            Obs     Mean       Std. Dev.   Min        Max
MLE                 10000   2.059143   .057856     2.000005   2.633999
MoM                 10000   2.004044   .1573365    1.478669   2.753263
Bias adjusted MLE   10000   2.000326   .0590134    1.938989   2.587626
We observe that the simulated sampling distribution of the MLE matches the theoretically
determined one quite well. The graphs and the summary statistics confirm that the MLE is
biased. The theoretical bias is $\frac{b - a}{n + 1}$, which in this case is $\frac{3}{51} = 0.05882$, which accords well with
the bias derived from the simulations of 0.059143. We can estimate the Mean Square Error:
MSE of MLE: $(0.059143)^2 + (0.057856)^2 = 0.0068452$
MoM: $(0.004044)^2 + (0.1573365)^2 = 0.024771$
Bias adjusted: $(0.000326)^2 + (0.0590134)^2 = 0.003483$

In this case the ranking seems clear. The lowest MSE is that of the bias adjusted MLE
estimator, while the worst estimator is the MoM one. The MLE, despite its noticeable bias tends
to achieve estimates much closer to the true parameter value than the MoM estimator.
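A Monte Carlo experiment of this kind can be sketched as follows. This is not the code behind the reported results; it is a minimal illustration that assumes the MoM estimator takes the form "sample mean minus root three times the sample standard deviation" (obtained by matching the mean and variance of a uniform distribution) and that the bias adjustment subtracts the sample range divided by n - 1. The function names, the seed and the use of the population form of the standard deviation are likewise assumptions.

```python
import math
import random


def simulate(a=2.0, b=5.0, n=50, reps=10_000, seed=2011):
    """Draw `reps` samples of size n from U(a, b) and estimate the lower bound a."""
    rng = random.Random(seed)
    results = {"MLE": [], "MoM": [], "Adjusted MLE": []}
    for _ in range(reps):
        x = [rng.uniform(a, b) for _ in range(n)]
        lo, hi = min(x), max(x)
        mean = sum(x) / n
        sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / n)
        results["MLE"].append(lo)
        # Assumed MoM form: mean = (a+b)/2 and variance = (b-a)^2/12 give a = mean - sqrt(3)*sd.
        results["MoM"].append(mean - math.sqrt(3.0) * sd)
        # Assumed adjustment: E[max - min]/(n-1) equals the bias (b-a)/(n+1) of the minimum.
        results["Adjusted MLE"].append(lo - (hi - lo) / (n - 1))
    return results


def mse(draws, true_value):
    """Squared bias plus variance of the simulated estimates."""
    mean = sum(draws) / len(draws)
    variance = sum((d - mean) ** 2 for d in draws) / len(draws)
    return (mean - true_value) ** 2 + variance


results = simulate()
for name, draws in results.items():
    mean = sum(draws) / len(draws)
    print(f"{name:13s} mean={mean:.4f} mse={mse(draws, 2.0):.5f}")
```

Because the seed and the random number generator differ from those used for the table above, the figures will not match digit for digit, but the ranking of the three estimators by MSE should be the same.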
Figure 3.7: Sampling distributions of three estimators (MoM, MLE and adjusted MLE) of the lower bound $a$ in the $U(a, b)$ distribution. Monte
Carlo simulation with 10 000 replications of samples of size 50. Parameter settings: $a = 2$, $b = 5$.
Chapter 4
Asymptotic Theory
(Compare with Wooldridge (2002, Chapter 3, Sections 3.1-3.4).)
4.1 Introduction
The purpose of asymptotic theory is to investigate the properties of random variables as the
sample size tends to infinity. It turns out that there are a number of very powerful results which
describe these properties. They fall into broadly two classes: laws of large numbers and
central limit theorems.
4.2 Sequences, limits and convergence

4.2.1 The limit of a mathematical sequence
The crucial concept that we will be concerned with is that of the limit of a sequence of numbers
or random variables. It describes what happens to that sequence as the number of terms in it gets
large. Before we define the limit, however, we need to define the sequence itself. Fundamentally
a sequence is defined by a rule. Examples of sequences are:
$$\{n\}_{n=1}^{\infty}, \qquad \left\{\frac{1}{n}\right\}_{n=1}^{\infty}, \qquad \left\{n\left(c^{1/n} - 1\right)\right\}_{n=1}^{\infty}$$
The definition of the limit of a real-valued mathematical sequence is:

Definition 4.1 The sequence $\{a_n\}$ of real numbers has real limit $a$ if for any given positive
number $\varepsilon$ it is possible to find an integer $N$ such that if $n > N$, $|a_n - a| < \varepsilon$. In mathematical
notation
$$\lim_{n \to \infty} a_n = a \quad \text{if } \forall \varepsilon > 0,\ \exists N \text{ s.t. } n > N \Rightarrow |a_n - a| < \varepsilon$$
We say that $\{a_n\}$ converges to $a$.


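Example 4.2 The sequence $\left\{\frac{1}{n}\right\}$ converges to zero. We can prove this quite easily. Assume that
$\varepsilon > 0$ is given. We now pick $N > \frac{1}{\varepsilon}$. This will always be possible, because the natural
numbers are not bounded. For all $n > N$ we have that $n > \frac{1}{\varepsilon}$. Consequently $\frac{1}{n} < \varepsilon$, i.e. $\left|\frac{1}{n} - 0\right| < \varepsilon$.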
It turns out that limits have attractive properties.
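Proposition 4.3 Properties of limits

If $\lim a_n = a$ and $|c| < \infty$, then $\lim c\,a_n = c\,a$
If $\lim a_n = a$ and $\lim b_n = b$, then $\lim (a_n + b_n) = a + b$
If $\lim a_n = a$ and $\lim b_n = b$, then $\lim a_n b_n = ab$
If $\lim a_n = a$ and $\lim b_n = b$, then $\lim \frac{a_n}{b_n} = \frac{a}{b}$, provided that $b \neq 0$
In the last case we also have to be careful that the sequence $\left\{\frac{a_n}{b_n}\right\}$ is defined, i.e. that $b_n$ is not
equal to zero.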
In some cases these rules do not help us to evaluate the limit. A very useful result in such
cases is given by
Proposition 4.4 L'Hôpital's rule
If the functions $f$ and $g$ are both differentiable in an interval around $a$, except possibly at $a$,
and $f(x)$ and $g(x)$ both tend to zero as $x$ tends to $a$, then if $g'(x) \neq 0$ for all $x \neq a$ in this interval
$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}$$
The same rule can be applied if $f(x)$ and $g(x)$ both tend to $\pm\infty$, i.e. the formula can be
used to evaluate limits of the form $\frac{0}{0}$ or $\frac{\pm\infty}{\pm\infty}$. If necessary, the rule can be reapplied.
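Example 4.5 The sequence $\left\{n\left(c^{1/n} - 1\right)\right\}$, with $c > 0$, converges to $\ln(c)$. To show this, we note that
$$\lim_{n \to \infty} n\left(c^{1/n} - 1\right) = \lim_{n \to \infty} \frac{c^{1/n} - 1}{\frac{1}{n}}$$
This is of the form $\frac{0}{0}$. Applying L'Hôpital's rule we get
$$\lim_{n \to \infty} \frac{c^{1/n} - 1}{\frac{1}{n}} = \lim_{n \to \infty} \frac{c^{1/n} \ln c \left(-\frac{1}{n^2}\right)}{-\frac{1}{n^2}} = \lim_{n \to \infty} c^{1/n} \ln c = \ln c$$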
If we have a sequence of vectors we can extend the definition given in Definition 4.1:
Definition 4.6 The sequence $\{\mathbf{a}_n\}$ of real vectors has limit $\mathbf{a}$ if for any given positive number
$\varepsilon$ it is possible to find an integer $N$ such that if $n > N$, $\|\mathbf{a}_n - \mathbf{a}\| < \varepsilon$. We say that $\{\mathbf{a}_n\}$ converges
to $\mathbf{a}$.
Note that the norm $\|\cdot\|$ is just the normal definition of the length of a vector. It has the
property that $\|\mathbf{x}\| > 0$ for every non-zero vector $\mathbf{x}$.
4.2.2 The probability limit of a sequence of random variables
We want to investigate the behaviour of a sequence of random variables. The limit of such a
sequence has to be defined somewhat differently, because the terms are no longer numbers,
but the outcomes of a random variable. There are different ways of defining convergence for
random variables. The simplest of these is convergence in probability.
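Definition 4.7 The sequence $\{\mathbf{a}_n\}$ of real- or vector-valued random variables tends in probability
to the limiting random variable $\mathbf{a}$ if for all $\varepsilon > 0$
$$\lim_{n \to \infty} \Pr\left(\|\mathbf{a}_n - \mathbf{a}\| > \varepsilon\right) = 0 \tag{4.1}$$
We write this as
$$\operatorname{plim} \mathbf{a}_n = \mathbf{a} \quad \text{or} \quad \mathbf{a}_n \xrightarrow{p} \mathbf{a}$$
Note that the limit in front of the probability (in equation 4.1) is the ordinary mathematical
limit. We could rewrite this condition equivalently to say that
$$\operatorname{plim} \mathbf{a}_n = \mathbf{a} \quad \text{if } \forall \varepsilon > 0 \text{ and } \forall \delta > 0,\ \exists N \text{ s.t. } n > N \Rightarrow \Pr\left(\|\mathbf{a}_n - \mathbf{a}\| > \varepsilon\right) < \delta \tag{4.2}$$
Note also that this definition makes sense if $\mathbf{a}$ is just a constant (which is a degenerate random
variable).
Intuitively the sequence of random variables $\mathbf{a}_n$ converges to $\mathbf{a}$, if in a large enough sample
it is highly improbable to find $\mathbf{a}_n$ far away from $\mathbf{a}$.
Later we will want to consider the probability limit of a sequence $\{\mathbf{A}_n\}$ of random matrices.
We can adapt this definition, provided that we find some way of defining the norm of a matrix.
This is, in fact, possible (see footnote 1). Note that convergence of the random vector $\mathbf{a}_n$ or the random matrix
$\mathbf{A}_n$ will always imply convergence of the elements of the vector or matrix.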
Example 4.8 Consider the case of tossing a fair coin. Let $x_i$ be the (Bernoulli) random variable
equal to one if the outcome of toss $i$ is heads and zero if it is tails. This means that $\{x_i\}$ is a sequence of
random variables. We can define a new sequence $\{b_n\}$ as follows:
$$b_n = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{4.3}$$
This is just the proportion of heads in a sample of $n$ tosses. We have
$$E(b_n) = \frac{1}{2}$$
It is straightforward to show that $Var(b_n) = \frac{p(1-p)}{n} = \frac{1}{4n}$. This indicates that $\lim_{n \to \infty} Var(b_n) = 0$,
which suggests that the sequence will converge to a nonstochastic constant.
By Chebyshev's inequality (see Theorem 4.37) we have
$$\Pr\left(\left|b_n - \frac{1}{2}\right| > \varepsilon\right) \le \frac{1}{4n\varepsilon^2}$$
So the condition above will be satisfied provided that we pick $N$ large enough that $\frac{1}{4N\varepsilon^2} < \delta$, i.e.
we pick $N$ larger than $\frac{1}{4\delta\varepsilon^2}$.
For instance: if $\varepsilon = 0.001$ and $\delta = 0.01$, we could set $N$ at 25000001 and we would be guaranteed that
for every sample where $n \ge N$ we would have $\Pr\left(\left|b_n - \frac{1}{2}\right| > 0.001\right) < 0.01$, i.e. the probability that the sample proportion of heads deviates from
the true value of 0.5 by more than one in a thousand is less than 1%.

Footnote 1: One such definition is given by Wooldridge (2002, p.37)
$$\|\mathbf{A}\| = \left[\operatorname{tr}\left(\mathbf{A}'\mathbf{A}\right)\right]^{1/2}$$
where $\operatorname{tr}(\mathbf{C})$ is the trace of the square matrix $\mathbf{C}$. We will work more with the trace of a matrix in due course.
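The sample size implied by Chebyshev's inequality can be computed directly, and the bound can be compared with simulated coin tosses. The sketch below is an illustration only: the helper names `min_sample_size` and `exceed_frequency` are hypothetical, and the simulation uses a much looser tolerance (0.05 rather than 0.001) purely to keep the run time short. Each sample of n fair tosses is generated as n random bits, and the heads are the one-bits.

```python
import math
import random


def min_sample_size(eps, delta, variance=0.25):
    """Chebyshev threshold: any n above variance / (delta * eps^2) satisfies the bound."""
    return variance / (delta * eps ** 2)


def exceed_frequency(n, eps, reps, rng):
    """Share of simulated samples whose proportion of heads misses 0.5 by more than eps."""
    exceed = 0
    for _ in range(reps):
        heads = bin(rng.getrandbits(n)).count("1")  # n fair tosses as n random bits
        if abs(heads / n - 0.5) > eps:
            exceed += 1
    return exceed / reps


bound = min_sample_size(0.001, 0.01)  # roughly 25 million, as in the example

rng = random.Random(42)
n_small = math.floor(min_sample_size(0.05, 0.05)) + 1
freq = exceed_frequency(n_small, 0.05, 2000, rng)

print(bound, n_small, freq)
```

Chebyshev's inequality holds for any distribution with a finite variance, so it is typically very conservative: the simulated frequency of large deviations is usually far below the guaranteed level.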
This example exemplifies a more general case. In fact any sequence $\{x_n\}$ which has a finite
mean $\mu_n$ and finite variance $\sigma_n^2$, where the mean tends to a constant $c$ and the variance tends to zero, will converge in probability to
$c$:
Theorem 4.9 Convergence in mean square
If $\{x_n\}$ is a sequence of random variables, such that the ordinary limits of $\mu_n$ and $\sigma_n^2$ are
$c$ and 0 respectively, then $x_n$ converges in probability to $c$, i.e.
$$\operatorname{plim} x_n = c$$
In this case we say that $x_n$ converges in mean square to $c$.

It is clear that mean square convergence is stronger than convergence in probability. This is an
example of a law of large numbers. In this case it is a weak law of large numbers. There
are strong laws based on a stricter form of convergence than convergence in probability. This
form of convergence is termed almost sure convergence:
Definition 4.10 The sequence $\{\mathbf{a}_n\}$ of real- or vector-valued random variables $\mathbf{a}_n$ is said to
converge almost surely (a.s.) to a limiting random variable $\mathbf{a}$ if
$$\Pr\left(\lim_{n \to \infty} \mathbf{a}_n = \mathbf{a}\right) = 1$$
We write
$$\mathbf{a}_n \xrightarrow{a.s.} \mathbf{a}$$
Note that the limit and the probability are interchanged from the previous definition. We
will not attempt to explain the subtleties of the difference between the two forms of convergence,
except to note that almost sure convergence implies convergence in probability, but not the other
way round.
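4.2.3 Rules for probability limits
Theorem 4.11 Slutsky Theorem

For a continuous function $g(x_n)$ that is not a function of $n$,
$$\operatorname{plim} g(x_n) = g(\operatorname{plim} x_n)$$
Theorem 4.12 Rules for probability limits
If $\{x_n\}$ and $\{y_n\}$ are sequences of random variables with $\operatorname{plim} x_n = c$ and $\operatorname{plim} y_n = d$, then
$$\operatorname{plim} (x_n + y_n) = c + d$$
$$\operatorname{plim} (x_n y_n) = cd$$
$$\operatorname{plim} \frac{x_n}{y_n} = \frac{c}{d}, \text{ if } d \neq 0$$
If $\{\mathbf{A}_n\}$ is a sequence of matrices whose elements are random variables and if $\operatorname{plim} \mathbf{A}_n = \boldsymbol{\Omega}$,
then
$$\operatorname{plim} \mathbf{A}_n^{-1} = \boldsymbol{\Omega}^{-1}$$
provided that $\boldsymbol{\Omega}$ is nonsingular. If $\{\mathbf{A}_n\}$ and $\{\mathbf{B}_n\}$ are sequences of random matrices with
$\operatorname{plim} \mathbf{A}_n = \mathbf{A}$ and $\operatorname{plim} \mathbf{B}_n = \mathbf{B}$, then
$$\operatorname{plim} \mathbf{A}_n \mathbf{B}_n = \mathbf{A}\mathbf{B}$$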
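4.2.4 Convergence in distribution
Another form of convergence is given by convergence in distribution:

Definition 4.13 The sequence $\{\mathbf{a}_n\}$ of real- or vector-valued random variables $\mathbf{a}_n$ is said to
converge in distribution to a limiting random variable $\mathbf{a}$ if
$$\lim_{n \to \infty} \Pr(\mathbf{a}_n \le \mathbf{b}) = \Pr(\mathbf{a} \le \mathbf{b})$$
for all real numbers or vectors $\mathbf{b}$ such that the limiting distribution function $F_{\mathbf{a}}(\mathbf{x})$ is continuous
in $\mathbf{x}$ at $\mathbf{b}$. One writes
$$\mathbf{a}_n \xrightarrow{d} \mathbf{a}$$
An equivalent way of writing the condition is as
$$\lim_{n \to \infty} F_{\mathbf{a}_n}(\mathbf{b}) = F_{\mathbf{a}}(\mathbf{b}) \text{ at all points } \mathbf{x} \text{ where } F_{\mathbf{a}}(\mathbf{x}) \text{ is continuous}$$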
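Example 4.14 Consider the sequence of random variables $\{x_n\}$ where $x_n \sim N\left(0, \frac{\sigma^2}{n}\right)$. It is
intuitively obvious that as $n \to \infty$ this variable collapses to zero. We can show that, indeed, it
converges in distribution. We have
$$\Pr(x_n \le x) = \Pr\left(\frac{\sqrt{n}\,x_n}{\sigma} \le \frac{\sqrt{n}\,x}{\sigma}\right) = \Pr\left(z \le \frac{\sqrt{n}\,x}{\sigma}\right)$$
where $z$ is a $N(0, 1)$ random variable. For a given value of $x$ (and fixed $\sigma$), we have that
$$\lim_{n \to \infty} \Pr\left(z \le \frac{\sqrt{n}\,x}{\sigma}\right) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{2} & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$
The distribution of the degenerate random variable 0, which is defined as $\Pr(0 = 0) = 1$, is
given by
$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0 \end{cases}$$
So we note that $\lim_{n \to \infty} F_n(x) = F(x)$ except at $x = 0$, which is the point at which $F(x)$ is
discontinuous.
We observe that in this case a sequence of continuous distributions converges to a discrete
distribution, and in particular a degenerate distribution.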
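Theorem 4.15 Rules for limiting distributions
If $x_n \xrightarrow{d} x$ and $\operatorname{plim} y_n = c$, then
$$x_n + y_n \xrightarrow{d} x + c$$
$$x_n y_n \xrightarrow{d} cx$$
$$\frac{x_n}{y_n} \xrightarrow{d} \frac{x}{c}, \text{ if } c \neq 0$$
If $x_n \xrightarrow{d} x$ and $g(x_n)$ is a continuous function, then
$$g(x_n) \xrightarrow{d} g(x)$$
If $y_n$ has a limiting distribution and $\operatorname{plim} (x_n - y_n) = 0$, then $x_n$ has the same limiting
distribution as $y_n$.
Theorem 4.16 Convergence in distribution via MGF convergence
(Mittelhammer et al. 2000, Appendix E1 p.65) Let the random variable $Y_n$ in the sequence
$\{Y_n\}$ have MGF $M_{Y_n}(t)$ and $Y$ have MGF $M_Y(t)$.
If $\lim_{n \to \infty} M_{Y_n}(t) = M_Y(t)$ then $Y_n \xrightarrow{d} Y$.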
4.2.5 Rates of convergence
(This discussion is based on Davidson and MacKinnon 1993, pp.108-113) One very useful device
in assessing the asymptotic behaviour of a sequence of random variables is given by the $O$, $o$
notation (big-O, little-o). Here $O$ and $o$ stand for order, so they are also referred to as order
symbols. When we say a quantity is $O(x)$ we mean roughly that it is of the same order as $x$,
while if we say it is $o(x)$, we mean that it is of lower order.
Definition 4.17 If $f(n)$ and $g(n)$ are two real-valued functions of the positive integer variable
$n$, then the notation
$$f(n) = o(g(n))$$
means that
$$\lim_{n \to \infty} \frac{f(n)}{g(n)} = 0$$
We might say that $f(n)$ is of smaller order than $g(n)$ as $n$ tends to infinity. Note that $f(n)$
does not itself need to have a limit - it is only the comparison which matters. Most often we
consider functions $g(n)$ that are powers of $n$, e.g. $n^2$, $n^{-1}$, or $n^0$. In the latter case we would say
that $f(n)$ is $o(1)$, since $n^0 = 1$. If a sequence is $o(1)$ we know that $\lim f(n) = 0$.
Definition 4.18 If $f(n)$ and $g(n)$ are two real-valued functions of the positive integer variable
$n$, then the notation
$$f(n) = O(g(n))$$
means that there exists a constant $K > 0$, independent of $n$, and a positive integer $N$ such that
$$\left|\frac{f(n)}{g(n)}\right| < K$$
for all $n > N$.
Normally this notation is used to express the same-order relation, i.e. to tell us the greatest
rate at which $f(n)$ changes with $n$. Note, however, that in terms of the definition the ratio could
be zero, so the expression "of the same order" can be misleading.
Definition 4.19 If $f(n)$ and $g(n)$ are two real-valued functions of the positive integer variable
$n$, then they are asymptotically equal if
$$\lim_{n \to \infty} \frac{f(n)}{g(n)} = 1$$
We write this as $f(n) \stackrel{a}{=} g(n)$.

The relations defined thus far are for nonstochastic sequences of real numbers. We can define
stochastic order relations analogously:
Definition 4.20 If $\{a_n\}$ is a sequence of random variables and $g(n)$ is a real-valued function
of the positive integer argument $n$, then the notation $a_n = o_p(g(n))$ means that
$$\operatorname{plim} \frac{a_n}{g(n)} = 0$$
Similarly, the notation $a_n = O_p(g(n))$ means that for all $\varepsilon > 0$ there is a constant $K$ and
a positive integer $N$ such that
$$\Pr\left(\left|\frac{a_n}{g(n)}\right| > K\right) < \varepsilon \quad \text{for all } n > N$$
If $\{b_n\}$ is another sequence of random variables, the notation $a_n \stackrel{a}{=} b_n$ means that
$$\operatorname{plim} \frac{a_n}{b_n} = 1$$
Since there should not be any confusion between the mathematical and the stochastic order
symbols, we will drop the $p$ subscript.
Proposition 4.21 Rules for operations with order symbols
Rules for addition and subtraction:
$O(n^p) \pm O(n^q) = O\left(n^{\max(p,q)}\right)$
$o(n^p) \pm o(n^q) = o\left(n^{\max(p,q)}\right)$
$O(n^p) \pm o(n^q) = O(n^p)$ if $p \ge q$
$O(n^p) \pm o(n^q) = o(n^q)$ if $p < q$
Rules for multiplication (and hence division):
$O(n^p)\,O(n^q) = O\left(n^{p+q}\right)$
$o(n^p)\,o(n^q) = o\left(n^{p+q}\right)$
$O(n^p)\,o(n^q) = o\left(n^{p+q}\right)$
In many cases we will be considering sums of $n$ terms. If these terms are all $O(1)$, then the
sum is $O(n)$ unless the terms all have zero means and a central limit theorem can be applied
(see below). In that case the order of the sum is $O\left(n^{1/2}\right)$.
Example 4.22 The variable $b_n$ defined in equation 4.3 is such that $\operatorname{plim} b_n = \frac{1}{2}$. Consequently
$b_n = O(1)$. Consider now
$$a_n = b_n - \frac{1}{2}$$
We have $\operatorname{plim} a_n = 0$, hence $a_n = o(1)$. If we define the new sequence
$$c_n = n^{1/2} a_n$$
we have $E(c_n) = 0$ and $Var(c_n) = n\,Var(a_n) = \frac{1}{4}$. Consequently $n^{1/2} a_n$ is $O(1)$ (see footnote 2), which implies that
$a_n$ is $O\left(n^{-1/2}\right)$.
This example is interesting because it shows that the centered random variable $a_n$ converges
more rapidly than the uncentered variable $b_n$. The former is $O\left(n^{-1/2}\right)$ while the latter is
$O(1)$. We observe also that the variable $c_n$ must converge to a nondegenerate distribution, since
$Var(c_n) = \frac{1}{4}$, which does not converge to zero as $n \to \infty$. In fact we can show that it converges
to a normal random variable with mean zero and variance $\frac{1}{4}$, i.e. $c_n \xrightarrow{d} c$ where $c \sim N\left(0, \frac{1}{4}\right)$.
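The orders of magnitude in this example can be checked by simulation. The sketch below is illustrative only (the function name `centred_proportions`, the seed and the sample sizes are arbitrary choices): it simulates the centred proportion of heads for several sample sizes and reports its variance before and after scaling by n. The unscaled variance should shrink like 1/(4n), while the scaled variance should stay close to 1/4.

```python
import random


def centred_proportions(n, reps, rng):
    """Simulate `reps` values of (proportion of heads in n fair tosses) minus 1/2."""
    return [bin(rng.getrandbits(n)).count("1") / n - 0.5 for _ in range(reps)]


rng = random.Random(7)
scaled_variances = {}
for n in (100, 1_600, 25_600):
    a = centred_proportions(n, 4000, rng)
    # The mean of the centred proportion is zero, so the variance is the mean square.
    var_a = sum(v * v for v in a) / len(a)
    scaled_variances[n] = n * var_a  # variance of sqrt(n) times the centred proportion
    print(n, var_a, scaled_variances[n])
```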
4.3 Sampling, consistency and laws of large numbers
We are interested in applying the concepts developed above to samples generated by some DSP.
One of the tricky points to consider in this context is how we understand the concept of "enlarging the sample". In the case of simple random sampling from a given distribution this is very
straightforward: we simply run the DSP on and on and on.
In practice there may be some tricky issues here. For instance if we are sampling from a
finite population (like people living in South Africa) then there comes a point where we cannot
enlarge the sample any more. In the case of cross-country analyses these constraints bind much
earlier. Once you have every country in your data set, you are done! Every sample of size $N$
(where $N$ is the total population size) will be identical.
In order to get around this, statisticians think of the finite population itself as a realisation
of a DSP which could theoretically have led to different people, GDP outcomes and so on. This
"superpopulation" approach allows us to think about getting more draws from the social process
even if in practice we could never do so.
4.3.1 Consistency
The first asymptotic property that we want to define is that of consistency. An estimator $\hat{\theta}$ of
a vector of parameters $\theta$ is said to be consistent if it converges to its true value as the sample
size tends to infinity. This statement is not all that precise. In particular we haven't defined how
an estimator can be said to converge.
Let $\hat{\theta}_n$ be the estimator that results from a sample of size $n$. Then we define the estimator
$\hat{\theta}$ as the sequence
$$\hat{\theta} = \left\{\hat{\theta}_n\right\}_{n=m}^{\infty}$$
where we start the sequence at the minimum sample size $m$ at which the statistic $\hat{\theta}_n$ can be
computed. In the case of the linear regression model with $k$ parameters, this would require
$m \ge k$.

Footnote 2: Here we use Chebyshev's inequality again to show that $\Pr(|c_n| > K) \le \frac{1}{4K^2}$.

Any element of $\hat{\theta}$ is a random variable. If it is to converge to a true value, we need to specify
what kind of convergence is involved. Most of the time we will only consider convergence in
probability, i.e. we obtain weak consistency:
Definition 4.23 An estimator $\hat{\theta}$ of a parameter $\theta$ is consistent if, and only if,
$$\operatorname{plim} \hat{\theta} = \theta$$
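Theorem 4.24 Consistency of the sample mean

The mean $\bar{x}_n$ of a random sample from any population with finite mean $\mu$ and finite variance $\sigma^2$
is a consistent estimator of $\mu$.
Proof. We have $E(\bar{x}_n) = \mu$ and $Var(\bar{x}_n) = \frac{\sigma^2}{n}$.
Consequently $\bar{x}_n$ converges in mean square to $\mu$, so that $\operatorname{plim} \bar{x}_n = \mu$ by Theorem 4.9.
Corollary 4.25 In random sampling for any function $g(x)$, if $E[g(x)]$ and $Var[g(x)]$ are finite
constants, then
$$\operatorname{plim} \frac{1}{n} \sum_{i=1}^{n} g(x_i) = E[g(x)]$$
Theorem 4.26 Khinchine's Weak Law of Large Numbers

If $x_i$, $i = 1, \ldots, n$ is a random (iid) sample from a distribution with a finite mean $E(x_i) = \mu$,
then
$$\operatorname{plim} \bar{x}_n = \mu$$
Theorem 4.27 Chebyshev's Weak Law of Large Numbers

If $x_i$, $i = 1, \ldots, n$ is a sample of observations such that $E(x_i) = \mu_i < \infty$ and $Var(x_i) = \sigma_i^2 < \infty$,
such that $\frac{\bar{\sigma}_n^2}{n} = \frac{1}{n^2} \sum_i \sigma_i^2 \to 0$ as $n \to \infty$, then $\operatorname{plim} (\bar{x}_n - \bar{\mu}_n) = 0$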
4.3.2 Consistency of the sample CDF
One of the implications of Theorem 4.24 is that the sample cumulative distribution function
from a random sample will be a consistent estimator of the cdf of the distribution.
We define the sample cdf $\hat{F}_n(x)$ as the proportion of the sample that is smaller than or equal
to $x$, i.e.
Definition 4.28 Sample cumulative distribution function
The sample cumulative distribution function $\hat{F}_n(x)$ is defined as
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1(x_i \le x)$$
where $1(\cdot)$ is the indicator function which takes on the value of 1 if the condition is true and
zero otherwise.
Now define the random variable $z_i$ as $z_i = 1(x_i \le x)$, so $z_i$ is a Bernoulli random variable
with parameter $p = \Pr(x_i \le x)$. It follows that $E(z_i) = \Pr(x_i \le x) = F(x)$ (by definition of the
cumulative distribution function). Now note that $\hat{F}_n(x)$ is just the sample mean of the sample
outcomes of the Bernoulli random variable $z_i$, i.e. $\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} z_i$. The variance of a Bernoulli
random variable is finite so by Theorem 4.24
$$\operatorname{plim} \hat{F}_n(x) = F(x)$$
Since $x$ is just an arbitrary point it is clear that the sample cumulative distribution function is a
consistent estimator of the population cumulative distribution function.
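The consistency of the sample cdf at a single point can be illustrated numerically. The sketch below assumes, purely for illustration, a standard exponential population (whose cdf is one minus the exponential of minus x) and evaluates the sample cdf at the arbitrary point 1 for increasing sample sizes; the function name `sample_cdf` and the seed are hypothetical.

```python
import math
import random


def sample_cdf(sample, x):
    """Proportion of the sample that is smaller than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)


rng = random.Random(2011)
x0 = 1.0
true_cdf = 1.0 - math.exp(-x0)  # cdf of the standard exponential at x0

errors = {}
for n in (100, 10_000, 200_000):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    errors[n] = abs(sample_cdf(sample, x0) - true_cdf)
    print(n, round(errors[n], 4))
```

The absolute error need not fall monotonically in any single run, but it should be small for the largest sample.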
4.3.3 Consistency of method of moments estimation
Another implication of Theorem 4.24 and its corollary is that in many situations method of
moments estimation will yield consistent estimators. We will not prove this in general, but
sketch out the intuition in the case of a one parameter distribution. Suppose that $x$ is a random
variable with pdf $f(x|\theta)$. Suppose also that it has finite mean $E(x)$ and finite variance. Now
$E(x)$ will typically depend on the parameter $\theta$. Let us assume that we can write
$$E(x) = g(\theta)$$
where $g$ is some continuous monotonic function, so that it has a continuous inverse $g^{-1}$. We can
then solve out for $\theta$ as
$$\theta = g^{-1}(E[x])$$
Our method of moments estimator would be given by
$$\hat{\theta} = g^{-1}(\bar{x})$$
where we have used the sample mean $\bar{x}$ in place of the population mean $E[x]$. By Theorem 4.24
we know that $\operatorname{plim} \bar{x} = E[x]$. It now follows by Slutsky's Theorem (Theorem 4.11) that
$$\operatorname{plim} \hat{\theta} = g^{-1}(\operatorname{plim} \bar{x}) = g^{-1}(E[x]) = \theta$$
This establishes the consistency of the method of moments estimator.
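As a concrete instance of this argument, consider (as an assumption made only for this illustration) an exponential distribution with rate parameter lambda. Its mean is 1/lambda, a continuous monotonic function of lambda with a continuous inverse on the positive reals, so the method of moments estimator is one over the sample mean. The names `mom_rate` and `true_rate` in the sketch are hypothetical.

```python
import random


def mom_rate(sample):
    """Method of moments estimate of the rate: invert E(x) = 1 / rate at the sample mean."""
    return 1.0 / (sum(sample) / len(sample))


rng = random.Random(99)
true_rate = 2.0

estimates = {}
for n in (50, 5_000, 200_000):
    sample = [rng.expovariate(true_rate) for _ in range(n)]
    estimates[n] = mom_rate(sample)
    print(n, round(estimates[n], 4))
```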
4.4 Asymptotic normality and central limit theorems
If an estimator is consistent its distribution collapses to a spike as $n \to \infty$. This does not make
it suitable for statistical inference. Nevertheless we noted in Example 4.22 above that, although
the random variable $a_n$ collapsed to a spike, the variable $n^{1/2} a_n$ did not. In fact $Var\left(n^{1/2} a_n\right) = \frac{1}{4}$,
so that the distribution of this variable is nondegenerate.
Theorem 4.29 Simple Central Limit Theorem (Lyapunov)
Let $\{x_i\}$ be a sequence of independent, centered random variables with variances $\sigma_i^2$ such that $\underline{\sigma}^2 \le \sigma_i^2 \le \bar{\sigma}^2$,
where the lower and upper bounds are finite positive constants, and absolute third moments $\mu_3'$
such that $\mu_3' \le \bar{\mu}_3$ for a finite constant $\bar{\mu}_3$. Further let
$$\sigma_0^2 \equiv \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \sigma_i^2$$
exist. Then the sequence
$$n^{-1/2} \sum_{i=1}^{n} x_i$$
tends in distribution to a limit characterised by the normal distribution with mean zero and
variance $\sigma_0^2$. (Davidson and MacKinnon 1993, p.126)
We can apply this theorem directly to the case of the Bernoulli random variable considered
in Examples 4.8 and 4.22. Define the centered version of the random Bernoulli variable as
$$z_i = x_i - \frac{1}{2}$$
Note that $z_i$ is either $-\frac{1}{2}$ or $\frac{1}{2}$. We have $Var(z_i) = \frac{1}{4}$. Consequently $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \sigma_i^2 =
\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{4} = \frac{1}{4}$. By the central limit theorem the sequence $n^{-1/2} \sum_{i=1}^{n} z_i$ will converge
to a $N\left(0, \frac{1}{4}\right)$ random variable. Note that $n^{-1/2} \sum_{i=1}^{n} z_i = n^{1/2} \frac{1}{n} \sum_{i=1}^{n} z_i = n^{1/2} a_n$, which
confirms our previous observation that this variable had a nondegenerate and normal distribution.
The implication of the central limit theorem is quite astonishing - provided that the conditions of
the theorem hold, it does not matter what the nature of the original distribution is, the outcome is always a normal distribution!
Example 4.30 The limiting distribution when $x_i$ is normal
Let $x_i \sim N(0, 1)$. Then $x_i$ meets all the requirements of the theorem. Furthermore in this
case we know that $\sum_{i=1}^{n} x_i \sim N(0, n)$, so it is easy to see that $n^{-1/2} \sum_{i=1}^{n} x_i \sim N(0, 1)$.
The remarkable thing about the central limit theorem is that when we sum up the individual
variables, the original properties of the distribution somehow get lost. Davidson and MacKinnon
(1993, pp.126-7) sketch out why this should be the case. Observe that in the particular case
where each of the variables $x_i$ has identical distribution with mean zero and variance $\sigma^2$, the
variable $w_n = n^{-1/2} \sum_{i=1}^{n} x_i$ is such that $E(w_n) = 0$ and $Var(w_n) = \sigma^2$. Now consider a higher
moment:
$$E\left(w_n^4\right) = E\left[\left(n^{-1/2} \sum_{i=1}^{n} x_i\right)^4\right] = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{l=1}^{n} E(x_i x_j x_k x_l)$$
If any one of the indices is different from all the others, say $i \neq j$, $i \neq k$, $i \neq l$, then $E(x_i x_j x_k x_l) =
E(x_i)\,E(x_j x_k x_l) = 0$, by the independence of the variables. The only non-zero expectations will
involve terms where $i = j = k = l$ or where the variables fall into pairs, e.g. $i = j$ and $k = l$.
The former terms are of the type $E\left(x_i^4\right)$, i.e. involve the fourth moments of the variables $x_i$.
There are, however, only $n$ of these and with the factor of $\frac{1}{n^2}$ they contribute to $E\left(w_n^4\right)$ only to order
$n^{-1}$. The terms of the second type are of the sort $E\left(x_i^2 x_k^2\right) = \left(\sigma^2\right)^2$. There are $3n(n - 1)$
such terms, which is $O\left(n^2\right)$, so these terms contribute to order of unity. Thus to leading order,
the fourth moment of $w_n$ depends only on $\sigma^2$, it does not depend on the fourth moment of the
random variables $x_i$.
It is instructive to also consider an odd moment higher than two:
$$E\left(w_n^3\right) = E\left[\left(n^{-1/2} \sum_{i=1}^{n} x_i\right)^3\right] = \frac{1}{n^{3/2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} E(x_i x_j x_k)$$
The only nonzero terms in this case are terms of the type $E\left(x_i^3\right)$. There are only $n$ of these
and since we have assumed that the third moment is finite, we have $E\left(w_n^3\right) = n^{-1/2} E\left(x_i^3\right)$, which is
$O\left(n^{-1/2}\right)$ and converges to zero. Similar arguments will show that the odd moments will all
vanish asymptotically, i.e. the limiting distribution is symmetric, while the higher order even
moments only depend on $\sigma^2$, i.e. the higher order moments of the random variables $x_i$ do not
influence the asymptotic distribution.
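The moment argument can be checked by simulation with a deliberately skewed distribution. The sketch below assumes, for illustration only, that the underlying variables are standard exponential draws minus one (mean zero, variance one, third moment 2, fourth moment 9). Under that assumption the argument above implies a third moment of the scaled sum equal to 2 divided by the square root of n and a fourth moment equal to 3 + 6/n, so the skewness should fade and the fourth moment should approach the normal value of 3. The function name and the seed are hypothetical.

```python
import math
import random


def scaled_sum_moments(n, reps, rng):
    """Simulated third and fourth moments of n^(-1/2) times a sum of n centred exponentials."""
    third = 0.0
    fourth = 0.0
    for _ in range(reps):
        w = sum(rng.expovariate(1.0) - 1.0 for _ in range(n)) / math.sqrt(n)
        third += w ** 3
        fourth += w ** 4
    return third / reps, fourth / reps


rng = random.Random(4)
moments = {n: scaled_sum_moments(n, 20_000, rng) for n in (4, 64)}
for n, (m3, m4) in moments.items():
    print(n, round(m3, 3), round(m4, 3))
```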
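Theorem 4.31 Lindeberg-Levy Central Limit Theorem
If $x_1, \ldots, x_n$ are a random sample from a probability distribution with finite mean $\mu$ and finite
variance $\sigma^2$ and $\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$, then
$$\sqrt{n}\,(\bar{x}_n - \mu) \xrightarrow{d} N\left(0, \sigma^2\right)$$
Theorem 4.32 Lindeberg-Feller Central Limit Theorem

Suppose that $\{x_i\}$ is a sequence of independent random variables with finite means $\mu_i$ and
finite positive variances $\sigma_i^2$. Let
$$\bar{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} \mu_i \quad \text{and} \quad \bar{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} \sigma_i^2$$
If no single term dominates this average variance, i.e. if $\lim_{n \to \infty} \frac{\max(\sigma_i)}{n \bar{\sigma}_n} = 0$, and if the average
variance converges to a finite constant, $\bar{\sigma}^2 = \lim_{n \to \infty} \bar{\sigma}_n^2$, then
$$\sqrt{n}\,(\bar{x}_n - \bar{\mu}_n) \xrightarrow{d} N\left(0, \bar{\sigma}^2\right)$$
Theorem 4.33 Multivariate Lindeberg-Levy Central Limit Theorem

If $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are a random sample from a multivariate distribution with finite mean vector $\boldsymbol{\mu}$
and finite positive definite covariance matrix $\mathbf{Q}$, then
$$\sqrt{n}\,(\bar{\mathbf{x}}_n - \boldsymbol{\mu}) \xrightarrow{d} N(\mathbf{0}, \mathbf{Q})$$
Theorem 4.34 Multivariate Lindeberg-Feller Central Limit Theorem
Suppose that $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are a sample of random vectors such that $E(\mathbf{x}_i) = \boldsymbol{\mu}_i$, $Var(\mathbf{x}_i) = \mathbf{Q}_i$
and all mixed third moments of the multivariate distribution are finite. Let
$$\bar{\boldsymbol{\mu}}_n = \frac{1}{n} \sum_{i} \boldsymbol{\mu}_i, \qquad \bar{\mathbf{Q}}_n = \frac{1}{n} \sum_{i} \mathbf{Q}_i$$
We assume that
$$\lim_{n \to \infty} \bar{\mathbf{Q}}_n = \mathbf{Q}$$
where $\mathbf{Q}$ is a finite, positive definite matrix and that for every $i$,
$$\lim_{n \to \infty} \left(n \bar{\mathbf{Q}}_n\right)^{-1} \mathbf{Q}_i = \mathbf{0}$$
In this case
$$\sqrt{n}\,(\bar{\mathbf{x}}_n - \bar{\boldsymbol{\mu}}_n) \xrightarrow{d} N(\mathbf{0}, \mathbf{Q})$$
Definition 4.35 If $\sqrt{n}\,(\hat{\theta} - \theta) \xrightarrow{d} N(\mathbf{0}, \mathbf{V})$, then we say
$$\hat{\theta} \stackrel{a}{\sim} N\left(\theta, \frac{\mathbf{V}}{n}\right)$$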
4.5 Properties of Maximum Likelihood Estimators
It turns out that given suitable regularity conditions, maximum likelihood estimators have a lot
of very attractive asymptotic properties:
Theorem 4.36 Properties of a MLE estimator
Under regularity conditions, the maximum likelihood estimator (MLE) $\hat{\theta}$ has the following asymptotic properties:
1. Consistency: $\operatorname{plim} \hat{\theta} = \theta_0$
2. Asymptotic normality: $\hat{\theta} \stackrel{a}{\sim} N\left[\theta_0, \{I(\theta_0)\}^{-1}\right]$
where
$$I(\theta_0) = -E_0\left[\frac{\partial^2 \ln L(\theta_0)}{\partial \theta_0\, \partial \theta_0'}\right]$$
3. Asymptotic efficiency: $\hat{\theta}$ asymptotically achieves the Cramér-Rao Lower Bound for
consistent estimators.
4. Invariance: The maximum likelihood estimator of $\gamma_0 = c(\theta_0)$ is $c(\hat{\theta})$ if $c(\theta_0)$ is a
continuous and continuously differentiable function.
(Greene 2003, Theorem 17.1)
Note that we have not defined what these regularity conditions are. They rule out certain
MLEs, such as the MLE of the bounds of a uniform distribution, since the likelihood function is
not differentiable at the maximum. Under better behaved conditions, these properties between
them guarantee that maximum likelihood estimation has very desirable characteristics. Note that
the second result is very powerful. It allows us to gain estimates of the covariance matrix of the
estimators, if we know something about the shape of the second derivatives of the log-likelihood.
We can estimate $I(\theta_0)$ by evaluating $-\left[\frac{\partial^2 \ln L(\theta)}{\partial \theta\, \partial \theta'}\right]$ at $\hat{\theta}$. Alternatively, we can make use of the
fact that
$$I(\theta_0) = E_0\left[\left(\frac{\partial \ln L(\theta_0)}{\partial \theta_0}\right)\left(\frac{\partial \ln L(\theta_0)}{\partial \theta_0'}\right)\right]$$
so that we just need to have an estimate of the gradient vector $\frac{\partial \ln L(\theta)}{\partial \theta}$ evaluated at $\hat{\theta}$.
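The two routes to estimating the information can be compared in a one parameter case. The sketch below assumes, for illustration only, an exponential model with rate lambda, for which the log-likelihood is n ln(lambda) minus lambda times the sum of the observations, the score contribution of observation i is 1/lambda minus x_i, and the second derivative is minus n over lambda squared. The variable names and the seed are hypothetical, and the outer product form is computed as the sum of squared per-observation scores at the MLE.

```python
import random

rng = random.Random(17)
true_rate = 2.0
n = 50_000
x = [rng.expovariate(true_rate) for _ in range(n)]

# MLE of the rate: the score n/rate - sum(x) is zero at rate = 1 / sample mean.
rate_hat = 1.0 / (sum(x) / n)

# Route 1: minus the second derivative of the log-likelihood, evaluated at the MLE.
info_hessian = n / rate_hat ** 2

# Route 2: sum of squared per-observation scores (outer product of gradients) at the MLE.
info_opg = sum((1.0 / rate_hat - xi) ** 2 for xi in x)

# Each route gives an estimate of the asymptotic variance of the MLE.
var_hessian = 1.0 / info_hessian
var_opg = 1.0 / info_opg

print(rate_hat, var_hessian, var_opg)
```

In large samples the two variance estimates should be close to each other, although they need not coincide in any finite sample.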
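4.6 Appendix
4.6.1 Chebyshev's Inequality
Theorem 4.37 If random variable $X$ has zero mean and a finite variance $V$, then
$$\Pr(|X| \ge \varepsilon) \le \frac{V}{\varepsilon^2}$$
for any positive number $\varepsilon$.

Proof. (Davidson and MacKinnon 1993, pp.799-800) We have $Var(X) = V = E\left(X^2\right) = \int_{-\infty}^{\infty} x^2\, dF(x)$.
We split this integral up:
$$V = \int_{-\varepsilon}^{\varepsilon} x^2\, dF(x) + \int_{-\infty}^{-\varepsilon} x^2\, dF(x) + \int_{\varepsilon}^{\infty} x^2\, dF(x)$$
Each of these terms is nonnegative. Considering the last two terms, we note that in both cases
$x^2 \ge \varepsilon^2$ over the entire domain of integration, so
$$\int_{-\infty}^{-\varepsilon} x^2\, dF(x) + \int_{\varepsilon}^{\infty} x^2\, dF(x) \ge \varepsilon^2 \left(\Pr(|X| \ge \varepsilon)\right)$$
Consequently
$$V \ge \varepsilon^2 \left(\Pr(|X| \ge \varepsilon)\right)$$
which proves the result.
Alternative proof. (Greene 2003, p.898) The alternative proof proceeds first by proving
Markov's inequality, i.e.
$$\Pr(Y \ge a) \le \frac{E(Y)}{a}$$
if $Y$ is a nonnegative random variable and $a$ is a positive constant. The proof follows from the
fact that
$$E(Y) = \Pr(Y < a)\,E(Y \mid Y < a) + \Pr(Y \ge a)\,E(Y \mid Y \ge a)$$
The first term on the right hand side is nonnegative, and $E(Y \mid Y \ge a) \ge a$ so
$$E(Y) \ge \Pr(Y \ge a)\, a$$
from which the inequality follows. Substituting in $Y = X^2$ and $a = \varepsilon^2$ we get Chebyshev's
inequality.
Corollary 4.38 If random variable $X$ has mean $\mu$ and a finite variance $\sigma^2$, then
$$\Pr(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$$
for any positive number $\varepsilon$.
Proof. Apply Chebyshev's inequality to the new random variable $Z$ defined as $Z = X - \mu$.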
Chapter 5
Statistical Inference
In this chapter we consider the question of how we can use the sample information to answer
questions about what the state of the world (as represented by the DSP) might actually be.
5.1 Hypothesis Testing
The basic mechanics of hypothesis testing should be familiar by now. The general principle is
that we formulate a null hypothesis $H_0$ about the parameter vector $\theta$ as well as an alternative
hypothesis $H_1$. On the assumption that $H_0$ is true we can derive the sampling distribution (or
asymptotic distribution) of a given estimator $\hat{\theta}$. Typically we will use this estimator as the basis
for constructing a test statistic $\tau$. This test statistic is a scalar, so it lends itself to making
simple decisions of the type "accept" or "reject". Under the assumption that $H_0$ is true, these
test statistics will have their own sampling distribution or asymptotic distribution. In order to
adjudicate between $H_0$ and $H_1$ we form a decision rule. This will take the form of specifying
a rejection region $R$. The complement of this will be the acceptance region, i.e. if our test
statistic falls into the acceptance region we accept $H_0$. If it falls into the rejection region, we
reject $H_0$ in favour of $H_1$. In essence we calculate the probability of observing the test statistic (or
an outcome more extreme from the point of view of the comparison between $H_0$ and $H_1$), given
the hypothesis $H_0$. In other words, we assume that $H_0$ is true, for the purposes of calculating
the distribution of our test statistics.
5.1.1 Type I and Type II errors
We can summarise the possible outcomes of the test in the form of the following table:

                                State of the world (DSP)
                                $H_0$ is true    $H_1$ is true
Test decision   Accept $H_0$    correct          Type II error
                Reject $H_0$    Type I error     correct

5.1.2 Power of a test
We define the power function of a test as the probability of rejecting $H_0$:
$$\pi(\theta) = \Pr(\tau \in R \mid \theta)$$
Note that this is a function of the true parameter vector $\theta$.
If the power function is evaluated at a $\theta$ that is contained in $\Theta_0$ (the set of parameter values
consistent with $H_0$), then the power of the test
is equal to the probability of making a Type I error. If the power function is evaluated at a $\theta$ at
which $H_1$ is true, then the power of the test is equal to $1 - \Pr(\text{Type II error})$.
We will say that the test of the hypothesis $H_0$ versus $H_1$ is of size $\alpha$ if
$$\sup_{\theta \in \Theta_0} \pi(\theta) = \alpha$$
We will say that the test is conducted at the significance level $\alpha$ if $\sup_{\theta \in \Theta_0} \pi(\theta) \le \alpha$.
Example 5.1 Consider the case of sampling from a distribution that is known as ( 4), i.e.
we know the distribution is normal and it has a variance of 4. We want to set up the test
0 : = 0
1 : 6= 0
Assume initially that we have a sample of size 4 from this distribution. We know that
( 1). We will use the test statistic = 1 together with the rejection region =
{| 196} {| 196} to implement the test. Note that under 0 the test statistic is distributed as (0 1), so the Pr ( | = 0) = 005. We can graph the power function for this
case:
1
0.8
0.6
0.4
0.2
-4
-2
Power function for the test with sample size 4.

Note that at the value = 0 the power is, indeed, 005. We see that near = 0 the power is quite
low (i.e. the test has diculty avoiding a type II error) while at higher values of the power is
high. In general we would like low power when 0 is true and high power when 1 is true.
Note that the power
is a function of the sample size too. If we have a random sample of size
16, then 14 . In this case our test statistic would be = 1 together with the rejection
2
region = {| 196} {| 196}. In this case the power function would be:
5.1. HYPOTHESIS TESTING
71
1
0.8
0.6
0.4
0.2
-4
-2
Power function of test with = 16

In this case our test is more powerful, i.e. better able to discriminate between 0 and 1 near
= 0.
Observe that this test is completelyequivalent
to the
test with the test statistic = and with

the rejection region = | 196
| 196
.
2
2
Finally, if we have a sample of size 100, then we can define our test statistic as $T = \bar{X} / \frac{1}{5}$
together with the rejection region $C = \{t \mid t < -1.96\} \cup \{t \mid t > 1.96\}$ which now gives the power
function
[Figure: Power function with $n = 100$.]
The three tests considered above obviously form part of a sequence of tests where the test
statistic $T_n$ is given by
$$T_n = \frac{\bar{X}_n}{\sqrt{4/n}}$$
and the rejection region is given by $C_n = \{t \mid t < -1.96\} \cup \{t \mid t > 1.96\}$. This sequence of tests
fixes the probability of making a type I error at 0.05, i.e. they are all of the same size.
A consistent sequence of tests (of size $\alpha$) is such that if $H_1$ is true then $\Pr(T_n \in C_n) \to 1$ as
$n \to \infty$. In other words in large samples the probability of making a type II error goes to zero.
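The power functions in Example 5.1 can be computed directly, since under a true mean $\mu$ the statistic $\bar{X}/\sqrt{4/n}$ is normal with mean $\mu\sqrt{n}/2$ and unit variance. The following is a minimal sketch under those assumptions; the function name `power` and the grid of $\mu$ values are illustrative choices, not part of the original notes.

```python
from math import sqrt
from statistics import NormalDist


def power(mu, n, sigma=2.0, crit=1.96):
    """Power of the two-sided test of H0: mu = 0 when X ~ N(mu, sigma^2)."""
    std_normal = NormalDist()
    # Under a true mean mu, T = xbar / sqrt(sigma^2 / n) is N(shift, 1).
    shift = mu * sqrt(n) / sigma
    # Pr(T < -crit) + Pr(T > crit)
    return std_normal.cdf(-crit - shift) + 1.0 - std_normal.cdf(crit - shift)


for n in (4, 16, 100):
    row = ", ".join(f"{power(mu, n):.3f}" for mu in (0.0, 0.5, 1.0, 2.0))
    print(f"n = {n:3d}: power at mu = 0, 0.5, 1, 2 -> {row}")
```

At $\mu = 0$ every sample size returns the size of the test (about 0.05), while away from zero the power rises with $n$, which is the consistency property described above.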
5.2 Types of tests
In general we will consider tests of the form $H_0 : g(\theta) = 0$, where $g$ is some linear or non-linear
function of $\theta$. These functions can be regarded as restrictions imposed on the parameter space.
These tests can be constructed on the basis of three principles:
1. The Wald principle states that we should estimate the unrestricted model and obtain
our estimate $\hat\theta$ accordingly. We should then investigate how close $g(\hat\theta)$ is to zero. If it
is, then we would accept $H_0$. Otherwise we would reject it. Typical examples of Wald-like
tests are t-tests and F-tests run on unrestricted regressions.
2. The likelihood ratio principle states that we should compare the fit of the estimates $\hat\theta$
of the unrestricted model with those of the restricted model $\hat\theta_R$. This comparison is done on
the basis of something like the loglikelihoods of the two estimated models, or the respective
residual sum of squares. Certain F-tests can be cast along these lines. The important point
is that we have to estimate both the restricted and the unrestricted model.
3. The Lagrange Multiplier principle states that we should estimate the restricted model
and obtain the estimates $\hat\theta_R$. We should then check the gradient of the likelihood (or loglikelihood)
function at $\hat\theta_R$. If this gradient is close to zero, i.e. the function is reasonably flat there,
we come to the conclusion that the restriction is an acceptable one. In this case we only estimate
the restricted model.
Many specification tests are based on the LM principle. For instance, we generally assume
that the regression model is homoscedastic. We can think of this as a restriction within
a broader model in which we allow for heteroscedasticity. Our estimates, however, are
obtained only for the restricted model.
Corresponding to these three principles are three specific types of tests. The principles,
however, are more general than the tests.
5.2.1 The Wald Test
If $x \sim N(\mu, \Sigma)$, then $(x - \mu)' \Sigma^{-1} (x - \mu) \sim \chi^2(k)$ where $k$ is the dimension of the vector
$x$. This suggests a straightforward way of testing the hypothesis $H_0 : \mu = c$, where $c$ is some
specified vector of constants. Similarly, if $\hat\theta \sim N(\theta, \Sigma)$ we have a ready made test of the
hypothesis $H_0 : \theta = c$.
Consider now a linear function $R\theta$ where $R$ is some matrix of constants. We can show that
$$R\hat\theta \sim N(R\theta, R \Sigma R')$$
Our test-statistic of the hypothesis $H_0 : R\theta = c$ is
$$W = (R\hat\theta - c)' (R \Sigma R')^{-1} (R\hat\theta - c) \tag{5.1}$$
This will be distributed as $\chi^2(J)$ where $J$ is the rank (number of rows) of $R$. Equivalently, it is
the number of restrictions imposed on the parameter vector $\theta$.
Since we know that ML estimators are asymptotically normally distributed (see Theorem
4.36) the Wald test can be used easily on parameter vectors estimated by means of maximum
likelihood.
Note that a Wald test on a single coefficient takes on a particularly simple form. $R$ will be
just the row vector $(0\ 0\ \cdots\ 0\ 1\ 0\ \cdots)$ with zeros everywhere except for a 1 in the $j$-th position.
$R \Sigma R'$ in this case will simply extract the $j$-th element on the diagonal, which is $\sigma^2_{\hat\theta_j}$,
the variance of $\hat\theta_j$. Consequently
$$W = (\hat\theta_j - c) \left(\sigma^2_{\hat\theta_j}\right)^{-1} (\hat\theta_j - c) = \left( \frac{\hat\theta_j - c}{\sigma_{\hat\theta_j}} \right)^2$$
Under the assumption that $\hat\theta_j$ is normally distributed with mean $c$ (i.e. under $H_0$), this is just the
square of a standardised $N(0, 1)$ variable. This will be $\chi^2(1)$. So our Wald test in this case is
equivalent to doing a normal test of the hypothesis $H_0 : \theta_j = c$.
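The equivalence between the single-coefficient Wald test and the usual normal test can be illustrated with a few lines of code. This is only a sketch: the function name `wald_single` and the numbers passed to it are hypothetical, and the variance of the estimator is assumed known.

```python
from math import sqrt
from statistics import NormalDist


def wald_single(estimate, variance, c=0.0):
    """Wald statistic for H0: theta_j = c, plus the matching z-statistic and p-value."""
    z = (estimate - c) / sqrt(variance)  # standardised estimate
    # Scalar version of W = (R theta - c)' (R Sigma R')^{-1} (R theta - c)
    w = (estimate - c) * (1.0 / variance) * (estimate - c)
    # With one restriction, Pr(chi2(1) > w) equals the two-sided normal p-value.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return w, z, p_value


# Illustrative numbers only: an estimate of 0.8 with variance 0.16.
w, z, p = wald_single(estimate=0.8, variance=0.16, c=0.0)
print(f"W = {w:.3f}, z = {z:.3f}, z^2 = {z * z:.3f}, p-value = {p:.4f}")
```

The Wald statistic is the square of the z-statistic, so both tests reject in exactly the same samples.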
5.2.2 The likelihood ratio test
In our discussion of the principle of maximum likelihood, we argued that the likelihood $L(\theta \mid y)$
represented how likely the given sample values $y$ were, if the true parameter vector was $\theta$.
Consider now the case where we restrict the parameter space that we can consider. Beforehand
we were free to consider any $\theta \in \Theta$. Now we will do our optimisation over the restricted parameter
space $\Theta_R$. The likelihood value that we manage to obtain on the unrestricted parameter space
is $L(\hat\theta \mid y)$. We denote this as $\hat{L}_U$. The likelihood value on the restricted parameter space, i.e.
$L(\hat\theta_R \mid y)$, we denote as $\hat{L}_R$. The ratio of these is a measure of how reasonable the restriction is.
We have
$$0 \leq \frac{\hat{L}_R}{\hat{L}_U} \leq 1$$
If we get values close to one we would accept the validity of the restrictions, while values close
to zero should lead to rejection of the null hypothesis.
The actual LR test is based on the statistic
$$LR = -2 \ln \left( \frac{\hat{L}_R}{\hat{L}_U} \right) = 2 \left[ \ln \hat{L}_U - \ln \hat{L}_R \right] \tag{5.2}$$
This is distributed asymptotically as $\chi^2(J)$ where $J$ is the number of restrictions imposed on the
parameter space $\Theta$, i.e. it is equal to the number of dimensions (parameters) lost in going from
$\Theta$ to $\Theta_R$.
Note that the test is valid only if $\Theta_R$ is really a subspace of $\Theta$.
5.2.3 The Lagrange Multiplier test
If we maximise the log-likelihood subject to the restriction $R\theta = c$, this is equivalent to solving
the problem
$$\max_{\theta} \ln L(\theta), \text{ subject to } R\theta = c$$
One standard way of solving this is to set up the Lagrangean
$$\mathcal{L} = \ln L(\theta) - \lambda' (R\theta - c)$$
and to simultaneously solve the first order conditions
$$\frac{\partial \ln L}{\partial \hat\theta_R} - R' \hat\lambda = 0 \tag{5.3}$$
$$R \hat\theta_R = c$$
If the constraints are not binding the Lagrange multipliers would be zero. Another way of
reading the implication is that in this case $\partial \ln L / \partial \hat\theta_R = 0$, i.e. the restricted estimates $\hat\theta_R$ would
maximise the unrestricted log-likelihood. A test based on $\hat\lambda$ would therefore seem appropriate.
The LM test statistic is
$$LM = \hat\lambda' R \, I(\theta_0)^{-1} R' \hat\lambda$$
which is distributed as $\chi^2(J)$, where $J$ is the number of restrictions. In this form we would need to
have estimated the Lagrange multipliers. We can derive a more tractable version of the test, by using
equation 5.3, i.e. the test statistic becomes
$$LM = \left( \frac{\partial \ln L}{\partial \hat\theta_R} \right)' I(\theta_0)^{-1} \left( \frac{\partial \ln L}{\partial \hat\theta_R} \right) \tag{5.4}$$
Note that the expression $\partial \ln L / \partial \hat\theta_R$ should more correctly be written as
$\left. \partial \ln L / \partial \theta \right|_{\hat\theta_R}$, i.e. it is the gradient
of the unrestricted log-likelihood evaluated at the restricted maximum likelihood estimates.
In this form it is known as the score test, because the vector of the first derivatives of the
log-likelihood is known as the score vector. Since the score vector needs to be estimated, the
most efficient way of estimating the information matrix (given the null hypothesis) is to use the
fact that the information matrix is equal to the expected outer product of the gradient vector.
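5.3 Worked example: The Pareto distribution
The Pareto distribution has pdf
$$f(x \mid \theta) = \frac{\theta x_0^{\theta}}{x^{\theta + 1}}, \quad x \geq x_0$$
The joint pdf of the sample is
$$f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} \frac{\theta x_0^{\theta}}{x_i^{\theta + 1}} = \theta^n x_0^{n\theta} (x_1 x_2 \cdots x_n)^{-(\theta + 1)}$$
Consequently
$$\ell(\theta \mid x_1, \ldots, x_n) = n \ln \theta + n \theta \ln x_0 - (\theta + 1) \sum_{i=1}^{n} \ln x_i$$
We differentiate this with respect to $\theta$
$$\frac{\partial \ln L}{\partial \theta} = \frac{n}{\theta} + n \ln x_0 - \sum_{i=1}^{n} \ln x_i$$
And our MLE is the value that sets this gradient equal to zero, i.e.
$$\hat\theta = \frac{n}{\sum_{i=1}^{n} \ln x_i - n \ln x_0} = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} \ln x_i - \ln x_0} \tag{5.5}$$
Obtaining the information $I(\theta)$ is very straightforward in this case, since
$$\frac{\partial^2 \ln L}{\partial \theta^2} = -\frac{n}{\theta^2}$$
so that $I(\theta) = n / \theta^2$. Consequently the asymptotic variance of $\hat\theta$ will be given by
$$\sigma^2_{\hat\theta} = \frac{\theta^2}{n}$$
We wish to test the null hypothesis
$$H_0 : \theta = 2$$
against
$$H_1 : \theta \neq 2$$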
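5.3.1 Wald test
Since $\hat\theta$ is a maximum likelihood estimator we know that $\hat\theta \overset{a}{\sim} N(\theta, \theta^2 / n)$. Given the null hypothesis
$\theta = 2$ we presume that $\hat\theta \sim N(2, 4/n)$. In the empirical work we get an estimate of $\hat\theta = 3.1518495$,
with $n = 2582$. Consequently our Wald statistic is
$$W = (\hat\theta - 2)' \left( \sigma^2_{\hat\theta} \right)^{-1} (\hat\theta - 2)$$
$$= (3.1518495 - 2) \left( \frac{4}{2582} \right)^{-1} (3.1518495 - 2)$$
$$= 856.42$$
This is distributed as $\chi^2(1)$. We can safely reject the null hypothesis.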
5.3.2 Likelihood ratio test
We can use the empirical estimates to obtain the value of the unrestricted log-likelihood. We
note that $\sum_{i=1}^{n} \ln x_i = 8.834467 \times 2582 = 22811$. Consequently
$$\ell(\hat\theta) = n \ln \hat\theta + n \hat\theta \ln x_0 - (\hat\theta + 1) \sum_{i=1}^{n} \ln x_i$$
$$= 2582 \ln(3.1518495) + 2582 \times 3.1518495 \times \ln(5000) - (3.1518495 + 1) \times 22811$$
$$= -22430.0$$
The restricted log-likelihood is just as easy:
$$\ell(2) = n \ln 2 + 2 n \ln x_0 - (2 + 1) \sum_{i=1}^{n} \ln x_i$$
$$= 2582 \ln(2) + 2582 \times 2 \times \ln(5000) - (2 + 1) \times 22811$$
$$= -22661$$
Consequently the likelihood ratio statistic will be given by
$$LR = 2 \left[ \ell(\hat\theta) - \ell(2) \right] = 2 \left( -22430.0 - (-22661) \right) = 462.0$$
This is also distributed as $\chi^2(1)$. Again we reject the null.
5.3.3 Lagrange multiplier test
Finally we calculate the score version of the LM test. We substitute the restricted value of $\theta$ into
the gradient (the expression that is set to zero to obtain equation 5.5), to get
$$\left. \frac{\partial \ln L}{\partial \theta} \right|_{\theta = 2} = \frac{n}{2} + n \ln x_0 - \sum_{i=1}^{n} \ln x_i = \frac{2582}{2} + 2582 \ln(5000) - 22811 = 471.39$$
Consequently
$$LM = 471.39 \times \frac{4}{2582} \times 471.39 = 344.24$$
This is distributed yet again as $\chi^2(1)$ and we reject once more.
Note that the three tests came up with different test statistics (from the same distribution)
with $W > LR > LM$.
In this case the LM test is the most conservative, while the Wald test will be most likely to
reject. Of course asymptotically they are all equivalent, but this is not true in the finite samples to
which these tests are applied.
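The three statistics of this worked example can be recomputed from the summary figures quoted in the text ($n = 2582$, mean of $\ln x_i$ equal to 8.834467, lower bound 5000). The sketch below assumes the Pareto log-likelihood $n \ln\theta + n\theta\ln x_0 - (\theta+1)\sum\ln x_i$; the function name `pareto_tests` and its argument names are hypothetical. Because the text rounds $\sum \ln x_i$ to 22811, the LR and LM values produced here differ slightly from those reported above.

```python
from math import log


def pareto_tests(n, mean_log_x, x0, theta0):
    """Wald, LR and LM (score) statistics for H0: theta = theta0 in a Pareto sample."""
    sum_log_x = n * mean_log_x

    def loglik(theta):
        return n * log(theta) + n * theta * log(x0) - (theta + 1.0) * sum_log_x

    def score(theta):
        return n / theta + n * log(x0) - sum_log_x

    theta_hat = n / (sum_log_x - n * log(x0))  # unrestricted MLE
    var0 = theta0 ** 2 / n                     # asymptotic variance under H0
    wald = (theta_hat - theta0) ** 2 / var0
    lr = 2.0 * (loglik(theta_hat) - loglik(theta0))
    lm = score(theta0) ** 2 * var0             # score * I(theta0)^{-1} * score
    return theta_hat, wald, lr, lm


theta_hat, wald, lr, lm = pareto_tests(n=2582, mean_log_x=8.834467, x0=5000.0, theta0=2.0)
print(f"theta_hat = {theta_hat:.4f}, W = {wald:.1f}, LR = {lr:.1f}, LM = {lm:.1f}")
```

The ordering $W > LR > LM$ is preserved, and only the Wald and LM statistics avoid evaluating the log-likelihood at both the restricted and unrestricted estimates.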
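5.4 Worked example: The bivariate normal
In the appendix to this chapter we show that the MLE of the parameters of the bivariate normal
are given by
$$\hat\mu_x = \frac{\sum x_i}{n}, \qquad \hat\mu_y = \frac{\sum y_i}{n}$$
$$\hat\sigma_x^2 = \frac{\sum (x_i - \hat\mu_x)^2}{n}, \qquad \hat\sigma_y^2 = \frac{\sum (y_i - \hat\mu_y)^2}{n}$$
$$\hat\rho = \frac{\sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{n \hat\sigma_x \hat\sigma_y}$$
which (except for the divisors) is what we might have expected. We also show in the appendix
that the asymptotic covariance matrix of $\hat\theta = (\hat\mu_x \ \hat\mu_y \ \hat\sigma_x^2 \ \hat\sigma_y^2 \ \hat\rho)'$ is given by
$$\Sigma = \frac{1}{n} \begin{bmatrix}
\sigma_x^2 & \rho \sigma_x \sigma_y & 0 & 0 & 0 \\
\rho \sigma_x \sigma_y & \sigma_y^2 & 0 & 0 & 0 \\
0 & 0 & 2 \sigma_x^4 & 2 \rho^2 \sigma_x^2 \sigma_y^2 & \rho (1 - \rho^2) \sigma_x^2 \\
0 & 0 & 2 \rho^2 \sigma_x^2 \sigma_y^2 & 2 \sigma_y^4 & \rho (1 - \rho^2) \sigma_y^2 \\
0 & 0 & \rho (1 - \rho^2) \sigma_x^2 & \rho (1 - \rho^2) \sigma_y^2 & (1 - \rho^2)^2
\end{bmatrix}$$
We wish to test the hypothesis that the random variables $X$ and $Y$ come from the same underlying
distribution, i.e. that $\mu_x = \mu_y$ and $\sigma_x^2 = \sigma_y^2$. We can formulate the null hypothesis as:
$$H_0 : \mu_x = \mu_y, \quad \sigma_x^2 = \sigma_y^2$$
against the alternative
$$H_1 : \mu_x \neq \mu_y \text{ or } \sigma_x^2 \neq \sigma_y^2$$
We can write the null hypothesis in matrix notation as $R\theta = c$, i.e.
$$\begin{bmatrix} 1 & -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \end{bmatrix}
\begin{bmatrix} \mu_x \\ \mu_y \\ \sigma_x^2 \\ \sigma_y^2 \\ \rho \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$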
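5.4.1 Wald Test of a single hypothesis
In order to work up to the joint test it is useful first to consider what the test of the single
hypothesis
$$H_0 : \mu_x = \mu_y$$
would look like. The Wald statistic is given by
$$W = (R\hat\theta - c)' (R \Sigma R')^{-1} (R\hat\theta - c)$$
In this particular case $c = 0$ and
$$R\hat\theta = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \end{bmatrix} \hat\theta = \hat\mu_x - \hat\mu_y$$
This is just a scalar. Furthermore, with $\Sigma$ the asymptotic covariance matrix of $\hat\theta$,
$$R \Sigma R' = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \end{bmatrix} \Sigma \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \end{bmatrix}'
= \frac{\sigma_x^2 - 2 \rho \sigma_x \sigma_y + \sigma_y^2}{n}$$
You may notice that this is just the variance of $\hat\mu_x - \hat\mu_y$. Indeed, for any two random variables
$X$ and $Y$ we have $Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y)$. This is how it should be, since
$R \Sigma R'$ will give the covariance matrix of $R\hat\theta$. Since $R\hat\theta$ is a scalar in this case, it reduces to
the variance of $R\hat\theta$.
So the Wald statistic is
$$W = (\hat\mu_x - \hat\mu_y)' \left( \frac{\sigma_x^2 - 2 \rho \sigma_x \sigma_y + \sigma_y^2}{n} \right)^{-1} (\hat\mu_x - \hat\mu_y)
= \frac{n (\hat\mu_x - \hat\mu_y)^2}{\sigma_x^2 - 2 \rho \sigma_x \sigma_y + \sigma_y^2}$$
This is distributed as a $\chi^2(1)$ variable. Since in general we do not know $\sigma_x^2$, $\sigma_y^2$ and $\rho$ we
will want to substitute in our maximum likelihood estimates. In this case the test still holds
asymptotically. We can, in fact, convert it into a precise test (or, by taking the square root,
a t-test), but we will not show that here.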
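5.4.2 Wald Test of the joint hypothesis
The test of the joint hypothesis is a little bit more complicated, but not much so. In this case
we have:
$$R\hat\theta = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \end{bmatrix}
\begin{bmatrix} \hat\mu_x \\ \hat\mu_y \\ \hat\sigma_x^2 \\ \hat\sigma_y^2 \\ \hat\rho \end{bmatrix}
= \begin{bmatrix} \hat\mu_x - \hat\mu_y \\ \hat\sigma_x^2 - \hat\sigma_y^2 \end{bmatrix}$$
Furthermore, with $\Sigma$ the asymptotic covariance matrix of $\hat\theta$,
$$R \Sigma R' = \frac{1}{n} \begin{bmatrix}
\sigma_x^2 + \sigma_y^2 - 2 \rho \sigma_x \sigma_y & 0 \\
0 & 2 \sigma_x^4 + 2 \sigma_y^4 - 4 \rho^2 \sigma_x^2 \sigma_y^2
\end{bmatrix}$$
So the Wald statistic is
$$W = \begin{bmatrix} \hat\mu_x - \hat\mu_y \\ \hat\sigma_x^2 - \hat\sigma_y^2 \end{bmatrix}'
\left( \frac{1}{n} \begin{bmatrix}
\sigma_x^2 + \sigma_y^2 - 2 \rho \sigma_x \sigma_y & 0 \\
0 & 2 \sigma_x^4 + 2 \sigma_y^4 - 4 \rho^2 \sigma_x^2 \sigma_y^2
\end{bmatrix} \right)^{-1}
\begin{bmatrix} \hat\mu_x - \hat\mu_y \\ \hat\sigma_x^2 - \hat\sigma_y^2 \end{bmatrix}$$
$$= \frac{n (\hat\mu_x - \hat\mu_y)^2}{\sigma_x^2 + \sigma_y^2 - 2 \rho \sigma_x \sigma_y}
+ \frac{n (\hat\sigma_x^2 - \hat\sigma_y^2)^2}{2 \sigma_x^4 + 2 \sigma_y^4 - 4 \rho^2 \sigma_x^2 \sigma_y^2}$$
This will be distributed as $\chi^2(2)$.
5.4.3 Likelihood Ratio test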
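We show in the appendix to this chapter that the unrestricted loglikelihood is given by
$$\ell(\hat\mu_x, \hat\mu_y, \hat\sigma_x^2, \hat\sigma_y^2, \hat\rho \mid x, y) = -n \ln(2\pi) - \frac{n}{2} \ln \hat\sigma_x^2 - \frac{n}{2} \ln \hat\sigma_y^2 - \frac{n}{2} \ln(1 - \hat\rho^2) - n$$
We also show that the restricted maximum likelihood estimators are given by
$$\hat\mu_R = \frac{1}{2} (\hat\mu_x + \hat\mu_y)$$
$$\hat\sigma_R^2 = \frac{\sum (x_i - \hat\mu_R)^2 + \sum (y_i - \hat\mu_R)^2}{2n}$$
$$\hat\rho_R = \frac{\sum (x_i - \hat\mu_R)(y_i - \hat\mu_R)}{n \hat\sigma_R^2}$$
The restricted loglikelihood will evaluate to
$$\ell(\hat\mu_R, \hat\sigma_R^2, \hat\rho_R \mid x, y) = -n \ln(2\pi) - n \ln \hat\sigma_R^2 - \frac{n}{2} \ln(1 - \hat\rho_R^2) - n$$
Consequently the LR test statistic will be
$$LR = 2 \left[ \ell(\hat\mu_x, \hat\mu_y, \hat\sigma_x^2, \hat\sigma_y^2, \hat\rho \mid x, y) - \ell(\hat\mu_R, \hat\sigma_R^2, \hat\rho_R \mid x, y) \right]$$
$$= 2 \left[ -n \ln(2\pi) - \frac{n}{2} \ln \hat\sigma_x^2 - \frac{n}{2} \ln \hat\sigma_y^2 - \frac{n}{2} \ln(1 - \hat\rho^2) - n \right]
- 2 \left[ -n \ln(2\pi) - n \ln \hat\sigma_R^2 - \frac{n}{2} \ln(1 - \hat\rho_R^2) - n \right]$$
$$= n \left[ 2 \ln \hat\sigma_R^2 - \ln \hat\sigma_x^2 - \ln \hat\sigma_y^2 + \ln(1 - \hat\rho_R^2) - \ln(1 - \hat\rho^2) \right]$$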
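The joint Wald and LR statistics for the bivariate normal example can be sketched in a few lines. The code below assumes the closed-form MLEs (divisor $n$), plugs them into the asymptotic covariance matrix for the Wald test, and uses the pooled (restricted) estimates for the LR test. The function name `bivariate_tests` and the small data sets are hypothetical illustrations.

```python
from math import log


def bivariate_tests(x, y):
    """Joint Wald and LR statistics for H0: mu_x = mu_y and sigma_x^2 = sigma_y^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) / n   # MLE of sigma_x^2
    syy = sum((yi - my) ** 2 for yi in y) / n   # MLE of sigma_y^2
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n  # rho * sigma_x * sigma_y
    rho = sxy / (sxx * syy) ** 0.5

    # R Sigma R' is diagonal, so the Wald statistic is a sum of two squared terms.
    wald = (n * (mx - my) ** 2 / (sxx + syy - 2.0 * sxy)
            + n * (sxx - syy) ** 2 / (2.0 * sxx ** 2 + 2.0 * syy ** 2 - 4.0 * sxy ** 2))

    # Restricted MLEs: pool the observations on x and y.
    m_r = 0.5 * (mx + my)
    s_r = (sum((xi - m_r) ** 2 for xi in x) + sum((yi - m_r) ** 2 for yi in y)) / (2 * n)
    rho_r = sum((xi - m_r) * (yi - m_r) for xi, yi in zip(x, y)) / (n * s_r)

    lr = n * (2.0 * log(s_r) - log(sxx) - log(syy)
              + log(1.0 - rho_r ** 2) - log(1.0 - rho ** 2))
    return wald, lr


# Same sample mean and variance in x and y: both statistics are zero.
print(bivariate_tests([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]))
# Different means and variances: both statistics are positive.
print(bivariate_tests([1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 5.0, 4.0, 8.0, 10.0]))
```

Both statistics would be compared with a $\chi^2(2)$ critical value; with samples as tiny as these the asymptotic approximation is of course not to be trusted.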
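5.5 Appendix: ML estimation of the bivariate normal distribution
In this appendix we will derive the maximum likelihood estimators of the bivariate normal
distribution. We will also derive the information matrix and hence the asymptotic covariance
matrix. To begin with we need to start with the joint density of one observation $(x_i, y_i)$, which
we can write as:
$$f(x_i, y_i) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp \left\{ -\frac{(x_i - \mu_x)^2}{2 (1 - \rho^2) \sigma_x^2} - \frac{(y_i - \mu_y)^2}{2 (1 - \rho^2) \sigma_y^2} + \frac{\rho (x_i - \mu_x)(y_i - \mu_y)}{(1 - \rho^2) \sigma_x \sigma_y} \right\}$$
This means that the joint density of the sample $(x_1, y_1)$, $(x_2, y_2)$, ..., $(x_n, y_n)$ will be
$$f(x, y \mid \mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho) = (2\pi)^{-n} \left( \sigma_x^2 \sigma_y^2 (1 - \rho^2) \right)^{-n/2}
\exp \left\{ -\frac{\sum (x_i - \mu_x)^2}{2 (1 - \rho^2) \sigma_x^2} - \frac{\sum (y_i - \mu_y)^2}{2 (1 - \rho^2) \sigma_y^2} + \frac{\rho \sum (x_i - \mu_x)(y_i - \mu_y)}{(1 - \rho^2) \sigma_x \sigma_y} \right\}$$
This of course gives the likelihood from which we can derive the log likelihood:
$$L(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho \mid x, y) = (2\pi)^{-n} \left( \sigma_x^2 \sigma_y^2 (1 - \rho^2) \right)^{-n/2}
\exp \left\{ -\frac{\sum (x_i - \mu_x)^2}{2 (1 - \rho^2) \sigma_x^2} - \frac{\sum (y_i - \mu_y)^2}{2 (1 - \rho^2) \sigma_y^2} + \frac{\rho \sum (x_i - \mu_x)(y_i - \mu_y)}{(1 - \rho^2) \sigma_x \sigma_y} \right\}$$
$$\ell(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho \mid x, y) = -n \ln(2\pi) - \frac{n}{2} \ln \sigma_x^2 - \frac{n}{2} \ln \sigma_y^2 - \frac{n}{2} \ln(1 - \rho^2)
- \frac{\sum (x_i - \mu_x)^2}{2 (1 - \rho^2) \sigma_x^2} - \frac{\sum (y_i - \mu_y)^2}{2 (1 - \rho^2) \sigma_y^2} + \frac{\rho \sum (x_i - \mu_x)(y_i - \mu_y)}{(1 - \rho^2) \sigma_x \sigma_y} \tag{5.6}$$
Differentiating the log-likelihood we get the gradient:
$$\frac{\partial \ell}{\partial \mu_x} = \frac{\sum (x_i - \mu_x)}{(1 - \rho^2) \sigma_x^2} - \frac{\rho \sum (y_i - \mu_y)}{(1 - \rho^2) \sigma_x \sigma_y}$$
$$\frac{\partial \ell}{\partial \mu_y} = \frac{\sum (y_i - \mu_y)}{(1 - \rho^2) \sigma_y^2} - \frac{\rho \sum (x_i - \mu_x)}{(1 - \rho^2) \sigma_x \sigma_y}$$
$$\frac{\partial \ell}{\partial \sigma_x^2} = -\frac{n}{2 \sigma_x^2} + \frac{\sum (x_i - \mu_x)^2}{2 (1 - \rho^2) \sigma_x^4} - \frac{\rho \sum (x_i - \mu_x)(y_i - \mu_y)}{2 (1 - \rho^2) \sigma_x^3 \sigma_y}$$
$$\frac{\partial \ell}{\partial \sigma_y^2} = -\frac{n}{2 \sigma_y^2} + \frac{\sum (y_i - \mu_y)^2}{2 (1 - \rho^2) \sigma_y^4} - \frac{\rho \sum (x_i - \mu_x)(y_i - \mu_y)}{2 (1 - \rho^2) \sigma_x \sigma_y^3}$$
$$\frac{\partial \ell}{\partial \rho} = \frac{n \rho}{1 - \rho^2} - \frac{\rho}{(1 - \rho^2)^2} \left[ \frac{\sum (x_i - \mu_x)^2}{\sigma_x^2} + \frac{\sum (y_i - \mu_y)^2}{\sigma_y^2} \right] + \frac{(1 + \rho^2) \sum (x_i - \mu_x)(y_i - \mu_y)}{(1 - \rho^2)^2 \sigma_x \sigma_y}$$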
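Setting the gradient equal to the zero vector, we get five equations in the five unknowns:
$$\frac{\sum (x_i - \hat\mu_x)}{(1 - \hat\rho^2) \hat\sigma_x^2} - \frac{\hat\rho \sum (y_i - \hat\mu_y)}{(1 - \hat\rho^2) \hat\sigma_x \hat\sigma_y} = 0 \tag{5.7}$$
$$\frac{\sum (y_i - \hat\mu_y)}{(1 - \hat\rho^2) \hat\sigma_y^2} - \frac{\hat\rho \sum (x_i - \hat\mu_x)}{(1 - \hat\rho^2) \hat\sigma_x \hat\sigma_y} = 0 \tag{5.8}$$
$$-\frac{n}{2 \hat\sigma_x^2} + \frac{\sum (x_i - \hat\mu_x)^2}{2 (1 - \hat\rho^2) \hat\sigma_x^4} - \frac{\hat\rho \sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{2 (1 - \hat\rho^2) \hat\sigma_x^3 \hat\sigma_y} = 0 \tag{5.9}$$
$$-\frac{n}{2 \hat\sigma_y^2} + \frac{\sum (y_i - \hat\mu_y)^2}{2 (1 - \hat\rho^2) \hat\sigma_y^4} - \frac{\hat\rho \sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{2 (1 - \hat\rho^2) \hat\sigma_x \hat\sigma_y^3} = 0 \tag{5.10}$$
$$\frac{n \hat\rho}{1 - \hat\rho^2} - \frac{\hat\rho}{(1 - \hat\rho^2)^2} \left[ \frac{\sum (x_i - \hat\mu_x)^2}{\hat\sigma_x^2} + \frac{\sum (y_i - \hat\mu_y)^2}{\hat\sigma_y^2} \right] + \frac{(1 + \hat\rho^2) \sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{(1 - \hat\rho^2)^2 \hat\sigma_x \hat\sigma_y} = 0 \tag{5.11}$$
5.5.1 Maximum likelihood estimators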
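The solution to this will be the MLE. From equation 5.7 we get
$$\sum (x_i - \hat\mu_x) = \hat\rho \frac{\hat\sigma_x}{\hat\sigma_y} \sum (y_i - \hat\mu_y)$$
and from equation 5.8
$$\sum (y_i - \hat\mu_y) = \hat\rho \frac{\hat\sigma_y}{\hat\sigma_x} \sum (x_i - \hat\mu_x)$$
from which it follows that we must have
$$\sum (x_i - \hat\mu_x) = \hat\rho^2 \sum (x_i - \hat\mu_x)$$
$$(1 - \hat\rho^2) \sum (x_i - \hat\mu_x) = 0$$
We require $1 - \hat\rho^2 \neq 0$ (otherwise the likelihood is not defined), hence $\sum (x_i - \hat\mu_x) = 0$, i.e.
$$\hat\mu_x = \frac{\sum x_i}{n} \tag{5.12}$$
and hence
$$\hat\mu_y = \frac{\sum y_i}{n} \tag{5.13}$$
Equations 5.9 and 5.10 imply that
$$\frac{n}{2} + \frac{\hat\rho \sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{2 (1 - \hat\rho^2) \hat\sigma_x \hat\sigma_y} = \frac{\sum (x_i - \hat\mu_x)^2}{2 (1 - \hat\rho^2) \hat\sigma_x^2} \tag{5.14}$$
$$\frac{n}{2} + \frac{\hat\rho \sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{2 (1 - \hat\rho^2) \hat\sigma_x \hat\sigma_y} = \frac{\sum (y_i - \hat\mu_y)^2}{2 (1 - \hat\rho^2) \hat\sigma_y^2}$$
From which it follows that
$$\frac{\sum (x_i - \hat\mu_x)^2}{\hat\sigma_x^2} = \frac{\sum (y_i - \hat\mu_y)^2}{\hat\sigma_y^2} \tag{5.15}$$
Furthermore equation 5.14 can be rewritten as
$$\frac{\hat\rho \sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{\hat\sigma_x \hat\sigma_y} = -n (1 - \hat\rho^2) + \frac{\sum (x_i - \hat\mu_x)^2}{\hat\sigma_x^2}$$
Substituting both these into equation 5.11 we get
$$\frac{n \hat\rho}{1 - \hat\rho^2} - \frac{2 \hat\rho}{(1 - \hat\rho^2)^2} \frac{\sum (x_i - \hat\mu_x)^2}{\hat\sigma_x^2} + \frac{1 + \hat\rho^2}{\hat\rho (1 - \hat\rho^2)^2} \left[ -n (1 - \hat\rho^2) + \frac{\sum (x_i - \hat\mu_x)^2}{\hat\sigma_x^2} \right] = 0$$
Consequently
$$n \hat\rho^2 (1 - \hat\rho^2) - 2 \hat\rho^2 \frac{\sum (x_i - \hat\mu_x)^2}{\hat\sigma_x^2} + (1 + \hat\rho^2) \frac{\sum (x_i - \hat\mu_x)^2}{\hat\sigma_x^2} - n (1 + \hat\rho^2)(1 - \hat\rho^2) = 0$$
$$(1 - \hat\rho^2) \frac{\sum (x_i - \hat\mu_x)^2}{\hat\sigma_x^2} = n (1 - \hat\rho^2)$$
$$\hat\sigma_x^2 = \frac{\sum (x_i - \hat\mu_x)^2}{n} \tag{5.16}$$
Substituting this into 5.15 we get
$$\hat\sigma_y^2 = \frac{\sum (y_i - \hat\mu_y)^2}{n} \tag{5.17}$$
and equation 5.14 simplifies to
$$\frac{n}{2} + \frac{\hat\rho \sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{2 (1 - \hat\rho^2) \hat\sigma_x \hat\sigma_y} = \frac{n}{2 (1 - \hat\rho^2)}$$
$$n (1 - \hat\rho^2) + \frac{\hat\rho \sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{\hat\sigma_x \hat\sigma_y} = n$$
$$n \hat\rho^2 = \frac{\hat\rho \sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{\hat\sigma_x \hat\sigma_y}$$
$$\hat\rho = \frac{\sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{n \hat\sigma_x \hat\sigma_y} \tag{5.18}$$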
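5.5.2 Information matrix
To get the information matrix, we need to get the matrix of second derivatives.
$$\frac{\partial^2 \ell}{\partial \mu_x^2} = -\frac{n}{(1 - \rho^2) \sigma_x^2}$$
$$\frac{\partial^2 \ell}{\partial \mu_y^2} = -\frac{n}{(1 - \rho^2) \sigma_y^2}$$
$$\frac{\partial^2 \ell}{\partial (\sigma_x^2)^2} = \frac{n}{2 \sigma_x^4} - \frac{\sum (x_i - \mu_x)^2}{(1 - \rho^2) \sigma_x^6} + \frac{3 \rho \sum (x_i - \mu_x)(y_i - \mu_y)}{4 (1 - \rho^2) \sigma_x^5 \sigma_y}$$
$$\frac{\partial^2 \ell}{\partial (\sigma_y^2)^2} = \frac{n}{2 \sigma_y^4} - \frac{\sum (y_i - \mu_y)^2}{(1 - \rho^2) \sigma_y^6} + \frac{3 \rho \sum (x_i - \mu_x)(y_i - \mu_y)}{4 (1 - \rho^2) \sigma_x \sigma_y^5}$$
$$\frac{\partial^2 \ell}{\partial \rho^2} = \frac{n (1 + \rho^2)}{(1 - \rho^2)^2} - \frac{(1 + 3 \rho^2) \sum (x_i - \mu_x)^2}{(1 - \rho^2)^3 \sigma_x^2} - \frac{(1 + 3 \rho^2) \sum (y_i - \mu_y)^2}{(1 - \rho^2)^3 \sigma_y^2} + \frac{2 \rho (3 + \rho^2) \sum (x_i - \mu_x)(y_i - \mu_y)}{(1 - \rho^2)^3 \sigma_x \sigma_y}$$
The cross-partial derivatives are:
$$\frac{\partial^2 \ell}{\partial \mu_x \partial \mu_y} = \frac{n \rho}{(1 - \rho^2) \sigma_x \sigma_y}$$
$$\frac{\partial^2 \ell}{\partial \mu_x \partial \sigma_x^2} = -\frac{\sum (x_i - \mu_x)}{(1 - \rho^2) \sigma_x^4} + \frac{\rho \sum (y_i - \mu_y)}{2 (1 - \rho^2) \sigma_x^3 \sigma_y}$$
$$\frac{\partial^2 \ell}{\partial \mu_x \partial \sigma_y^2} = \frac{\rho \sum (y_i - \mu_y)}{2 (1 - \rho^2) \sigma_x \sigma_y^3}$$
$$\frac{\partial^2 \ell}{\partial \mu_x \partial \rho} = \frac{2 \rho \sum (x_i - \mu_x)}{(1 - \rho^2)^2 \sigma_x^2} - \frac{(1 + \rho^2) \sum (y_i - \mu_y)}{(1 - \rho^2)^2 \sigma_x \sigma_y}$$
$$\frac{\partial^2 \ell}{\partial \mu_y \partial \sigma_x^2} = \frac{\rho \sum (x_i - \mu_x)}{2 (1 - \rho^2) \sigma_x^3 \sigma_y}$$
$$\frac{\partial^2 \ell}{\partial \mu_y \partial \sigma_y^2} = -\frac{\sum (y_i - \mu_y)}{(1 - \rho^2) \sigma_y^4} + \frac{\rho \sum (x_i - \mu_x)}{2 (1 - \rho^2) \sigma_x \sigma_y^3}$$
$$\frac{\partial^2 \ell}{\partial \mu_y \partial \rho} = \frac{2 \rho \sum (y_i - \mu_y)}{(1 - \rho^2)^2 \sigma_y^2} - \frac{(1 + \rho^2) \sum (x_i - \mu_x)}{(1 - \rho^2)^2 \sigma_x \sigma_y}$$
$$\frac{\partial^2 \ell}{\partial \sigma_x^2 \partial \sigma_y^2} = \frac{\rho \sum (x_i - \mu_x)(y_i - \mu_y)}{4 (1 - \rho^2) \sigma_x^3 \sigma_y^3}$$
$$\frac{\partial^2 \ell}{\partial \sigma_x^2 \partial \rho} = \frac{\rho \sum (x_i - \mu_x)^2}{(1 - \rho^2)^2 \sigma_x^4} - \frac{(1 + \rho^2) \sum (x_i - \mu_x)(y_i - \mu_y)}{2 (1 - \rho^2)^2 \sigma_x^3 \sigma_y}$$
$$\frac{\partial^2 \ell}{\partial \sigma_y^2 \partial \rho} = \frac{\rho \sum (y_i - \mu_y)^2}{(1 - \rho^2)^2 \sigma_y^4} - \frac{(1 + \rho^2) \sum (x_i - \mu_x)(y_i - \mu_y)}{2 (1 - \rho^2)^2 \sigma_x \sigma_y^3}$$
When we take expectations of these terms we note that $E\left[\sum (x_i - \mu_x)\right] = E\left[\sum (y_i - \mu_y)\right] = 0$,
$E\left[\sum (x_i - \mu_x)^2\right] = n \sigma_x^2$, $E\left[\sum (y_i - \mu_y)^2\right] = n \sigma_y^2$ and
$E\left[\sum (x_i - \mu_x)(y_i - \mu_y)\right] = n \rho \sigma_x \sigma_y$.
Consequently
$$I(\theta) = n \begin{bmatrix}
\frac{1}{(1 - \rho^2) \sigma_x^2} & -\frac{\rho}{(1 - \rho^2) \sigma_x \sigma_y} & 0 & 0 & 0 \\
-\frac{\rho}{(1 - \rho^2) \sigma_x \sigma_y} & \frac{1}{(1 - \rho^2) \sigma_y^2} & 0 & 0 & 0 \\
0 & 0 & \frac{2 - \rho^2}{4 (1 - \rho^2) \sigma_x^4} & -\frac{\rho^2}{4 (1 - \rho^2) \sigma_x^2 \sigma_y^2} & -\frac{\rho}{2 (1 - \rho^2) \sigma_x^2} \\
0 & 0 & -\frac{\rho^2}{4 (1 - \rho^2) \sigma_x^2 \sigma_y^2} & \frac{2 - \rho^2}{4 (1 - \rho^2) \sigma_y^4} & -\frac{\rho}{2 (1 - \rho^2) \sigma_y^2} \\
0 & 0 & -\frac{\rho}{2 (1 - \rho^2) \sigma_x^2} & -\frac{\rho}{2 (1 - \rho^2) \sigma_y^2} & \frac{1 + \rho^2}{(1 - \rho^2)^2}
\end{bmatrix}$$
where $\theta = (\mu_x \ \mu_y \ \sigma_x^2 \ \sigma_y^2 \ \rho)'$.
5.5.3 Asymptotic covariance matrix
Inverting this matrix we get the asymptotic covariance matrix of $\hat\theta$
$$I(\theta)^{-1} = \frac{1}{n} \begin{bmatrix}
\sigma_x^2 & \rho \sigma_x \sigma_y & 0 & 0 & 0 \\
\rho \sigma_x \sigma_y & \sigma_y^2 & 0 & 0 & 0 \\
0 & 0 & 2 \sigma_x^4 & 2 \rho^2 \sigma_x^2 \sigma_y^2 & \rho (1 - \rho^2) \sigma_x^2 \\
0 & 0 & 2 \rho^2 \sigma_x^2 \sigma_y^2 & 2 \sigma_y^4 & \rho (1 - \rho^2) \sigma_y^2 \\
0 & 0 & \rho (1 - \rho^2) \sigma_x^2 & \rho (1 - \rho^2) \sigma_y^2 & (1 - \rho^2)^2
\end{bmatrix}$$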
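The matrix inversion can be checked numerically by multiplying the information matrix by the claimed asymptotic covariance matrix and confirming that the product is the identity. The sketch below assumes the parameter ordering (mean of x, mean of y, variance of x, variance of y, correlation) and the standard bivariate normal entries written out in the code; the function names and parameter values are hypothetical.

```python
def information_matrix(n, sx2, sy2, rho):
    """Information matrix for (mu_x, mu_y, sigma_x^2, sigma_y^2, rho)."""
    q = 1.0 - rho ** 2
    sxsy = (sx2 * sy2) ** 0.5  # sigma_x * sigma_y
    return [
        [n / (q * sx2), -n * rho / (q * sxsy), 0.0, 0.0, 0.0],
        [-n * rho / (q * sxsy), n / (q * sy2), 0.0, 0.0, 0.0],
        [0.0, 0.0, n * (2 - rho ** 2) / (4 * q * sx2 ** 2),
         -n * rho ** 2 / (4 * q * sx2 * sy2), -n * rho / (2 * q * sx2)],
        [0.0, 0.0, -n * rho ** 2 / (4 * q * sx2 * sy2),
         n * (2 - rho ** 2) / (4 * q * sy2 ** 2), -n * rho / (2 * q * sy2)],
        [0.0, 0.0, -n * rho / (2 * q * sx2), -n * rho / (2 * q * sy2),
         n * (1 + rho ** 2) / q ** 2],
    ]


def asymptotic_covariance(n, sx2, sy2, rho):
    """Claimed inverse of the information matrix."""
    q = 1.0 - rho ** 2
    sxsy = (sx2 * sy2) ** 0.5
    return [
        [sx2 / n, rho * sxsy / n, 0.0, 0.0, 0.0],
        [rho * sxsy / n, sy2 / n, 0.0, 0.0, 0.0],
        [0.0, 0.0, 2 * sx2 ** 2 / n, 2 * rho ** 2 * sx2 * sy2 / n, rho * q * sx2 / n],
        [0.0, 0.0, 2 * rho ** 2 * sx2 * sy2 / n, 2 * sy2 ** 2 / n, rho * q * sy2 / n],
        [0.0, 0.0, rho * q * sx2 / n, rho * q * sy2 / n, q ** 2 / n],
    ]


def matmul(a, b):
    """Plain-Python matrix product of two square matrices."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


product = matmul(information_matrix(50, 4.0, 2.25, 0.3),
                 asymptotic_covariance(50, 4.0, 2.25, 0.3))
for row in product:
    print(" ".join(f"{value:6.3f}" for value in row))
```

A product equal to the identity (up to rounding) for several parameter values is a quick sanity check on both matrices, though it is not a substitute for the algebra.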
It is important to understand what this matrix is saying. For instance it shows that asymptotically
$Var(\hat\mu_x) = \sigma_x^2 / n$. This will, of course, hold also in a small sample. This quantity is the
variance of the sampling distribution of $\hat\mu_x$. It captures how variable the estimates would be if
we re-ran the DSP many times. In practice since we don't know $\sigma_x^2$ we will need to estimate it
from the data. Using either the MLE $\hat\sigma_x^2$ or the bias adjusted $s_x^2$ we can use our data to give us
an estimate of the true $Var(\hat\mu_x)$. To capture the fact that it is an estimate we will write it as
$\widehat{Var}(\hat\mu_x)$.
One interesting fact about $I(\theta)^{-1}$ is that it is block-diagonal in nature. In particular we see
that the estimators of $\mu_x$ and $\mu_y$ are uncorrelated with the estimators of $\sigma_x^2$, $\sigma_y^2$ and $\rho$. Given the
fact that these estimators are multivariate normal (asymptotically) this shows that they are at
least asymptotically independent of each other. In fact it can be shown that they are independent
even in small samples. This is very convenient. It means that if we are testing hypotheses only
on the means, we need to consider only their covariance matrix i.e.
$$\frac{1}{n} \begin{bmatrix} \sigma_x^2 & \rho \sigma_x \sigma_y \\ \rho \sigma_x \sigma_y & \sigma_y^2 \end{bmatrix}$$
Looking at this, we note that $Cov(\hat\mu_x, \hat\mu_y) = \rho \sigma_x \sigma_y / n$. This indicates that if the two variables
are positively correlated, then the sample estimates $\hat\mu_x$ and $\hat\mu_y$ will also be positively correlated.
This makes sense. If $\hat\mu_x$ overshoots the true mean in a particular sample then, given that
the $x$ values are positively correlated with the $y$ values, we would expect $\hat\mu_y$ to overshoot its mean
too.
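We can show that the formula $Cov(\hat\mu_x, \hat\mu_y) = \rho \sigma_x \sigma_y / n$ holds in small samples too. We have
$$Cov(\hat\mu_x, \hat\mu_y) = E\left[ (\hat\mu_x - \mu_x)(\hat\mu_y - \mu_y) \right]$$
Here we are making use of the fact that $E(\hat\mu_x) = \frac{1}{n} \sum E(x_i) = \mu_x$ and $E(\hat\mu_y) = \mu_y$. Now
$$(\hat\mu_x - \mu_x)(\hat\mu_y - \mu_y) = \left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x) \right] \left[ \frac{1}{n} \sum_{j=1}^{n} (y_j - \mu_y) \right]
= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \mu_x)(y_j - \mu_y)$$
so that
$$E\left[ (\hat\mu_x - \mu_x)(\hat\mu_y - \mu_y) \right] = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} E\left[ (x_i - \mu_x)(y_j - \mu_y) \right]$$
Since the observations $(x_i, y_i)$ and $(x_j, y_j)$ are independent of each other (by the assumption of
simple random sampling) $E\left[ (x_i - \mu_x)(y_j - \mu_y) \right] = 0$ whenever $i \neq j$. There are exactly $n$ pairs
$(x_i, y_i)$ and in these cases $E\left[ (x_i - \mu_x)(y_i - \mu_y) \right] = Cov(x_i, y_i) = \rho \sigma_x \sigma_y$. Consequently
$$E\left[ (\hat\mu_x - \mu_x)(\hat\mu_y - \mu_y) \right] = \frac{n \rho \sigma_x \sigma_y}{n^2} = \frac{\rho \sigma_x \sigma_y}{n}$$
The same logic establishes that $Var(\hat\mu_x) = \sigma_x^2 / n$. Consequently the finite sample covariance
matrix of the MLE $\hat\mu_x$ and $\hat\mu_y$ reaches the Cramér-Rao lower bound. Consequently the MLE is
fully efficient (within the class of unbiased estimators).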
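The small-sample covariance of the two sample means can also be checked by simulation, in the spirit of re-running the DSP many times. The following is a rough Monte Carlo sketch; the function name, the parameter values, the number of replications and the seed are all illustrative assumptions.

```python
import random


def simulate_mean_covariance(n, reps, mu_x, mu_y, sigma_x, sigma_y, rho, seed=0):
    """Monte Carlo estimate of Cov(mean of x, mean of y) for bivariate normal samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sum_x = sum_y = 0.0
        for _ in range(n):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            x = mu_x + sigma_x * z1
            # y shares the shock z1 with x, which induces correlation rho.
            y = mu_y + sigma_y * (rho * z1 + (1.0 - rho ** 2) ** 0.5 * z2)
            sum_x += x
            sum_y += y
        # Product of deviations of the two sample means from the true means.
        total += (sum_x / n - mu_x) * (sum_y / n - mu_y)
    return total / reps


n, sigma_x, sigma_y, rho = 10, 2.0, 1.0, 0.6
estimate = simulate_mean_covariance(n, 20000, 1.0, -1.0, sigma_x, sigma_y, rho)
print(f"simulated: {estimate:.4f}, theoretical: {rho * sigma_x * sigma_y / n:.4f}")
```

With only ten observations per sample the simulated covariance should already be close to $\rho \sigma_x \sigma_y / n$, consistent with the result holding exactly in finite samples.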
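5.5.4 Log-likelihood
We can derive an expression for the log-likelihood evaluated at the MLE:
$$\ell(\hat\mu_x, \hat\mu_y, \hat\sigma_x^2, \hat\sigma_y^2, \hat\rho \mid x, y) = -n \ln(2\pi) - \frac{n}{2} \ln \hat\sigma_x^2 - \frac{n}{2} \ln \hat\sigma_y^2 - \frac{n}{2} \ln(1 - \hat\rho^2)
- \frac{\sum (x_i - \hat\mu_x)^2}{2 (1 - \hat\rho^2) \hat\sigma_x^2} - \frac{\sum (y_i - \hat\mu_y)^2}{2 (1 - \hat\rho^2) \hat\sigma_y^2} + \frac{\hat\rho \sum (x_i - \hat\mu_x)(y_i - \hat\mu_y)}{(1 - \hat\rho^2) \hat\sigma_x \hat\sigma_y}$$
Substituting in equations 5.16, 5.17 and 5.18 we get
$$\ell(\hat\mu_x, \hat\mu_y, \hat\sigma_x^2, \hat\sigma_y^2, \hat\rho \mid x, y) = -n \ln(2\pi) - \frac{n}{2} \ln \hat\sigma_x^2 - \frac{n}{2} \ln \hat\sigma_y^2 - \frac{n}{2} \ln(1 - \hat\rho^2)
- \frac{n}{2 (1 - \hat\rho^2)} - \frac{n}{2 (1 - \hat\rho^2)} + \frac{n \hat\rho^2}{1 - \hat\rho^2}$$
$$= -n \ln(2\pi) - \frac{n}{2} \ln \hat\sigma_x^2 - \frac{n}{2} \ln \hat\sigma_y^2 - \frac{n}{2} \ln(1 - \hat\rho^2) - n \tag{5.19}$$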
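5.6 Restricted Maximum Likelihood estimation
We now impose the restrictions
$$\mu_x = \mu_y = \mu$$
$$\sigma_x^2 = \sigma_y^2 = \sigma^2$$
With these restrictions the likelihood function becomes:
$$L(\mu, \sigma^2, \rho \mid x, y) = (2\pi)^{-n} \left( \sigma^4 (1 - \rho^2) \right)^{-n/2}
\exp \left\{ -\frac{\sum (x_i - \mu)^2}{2 (1 - \rho^2) \sigma^2} - \frac{\sum (y_i - \mu)^2}{2 (1 - \rho^2) \sigma^2} + \frac{\rho \sum (x_i - \mu)(y_i - \mu)}{(1 - \rho^2) \sigma^2} \right\}$$
$$\ell(\mu, \sigma^2, \rho \mid x, y) = -n \ln(2\pi) - n \ln \sigma^2 - \frac{n}{2} \ln(1 - \rho^2)
- \frac{\sum (x_i - \mu)^2 + \sum (y_i - \mu)^2 - 2 \rho \sum (x_i - \mu)(y_i - \mu)}{2 (1 - \rho^2) \sigma^2}$$
In order to maximise this we first get the derivatives of the log-likelihood function:
$$\frac{\partial \ell}{\partial \mu} = \frac{\sum (x_i - \mu) + \sum (y_i - \mu) - \rho \sum (x_i - \mu) - \rho \sum (y_i - \mu)}{(1 - \rho^2) \sigma^2}$$
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{\sigma^2} + \frac{\sum (x_i - \mu)^2 + \sum (y_i - \mu)^2 - 2 \rho \sum (x_i - \mu)(y_i - \mu)}{2 (1 - \rho^2) \sigma^4}$$
$$\frac{\partial \ell}{\partial \rho} = \frac{n \rho}{1 - \rho^2} - \frac{\rho \left[ \sum (x_i - \mu)^2 + \sum (y_i - \mu)^2 \right]}{(1 - \rho^2)^2 \sigma^2} + \frac{(1 + \rho^2) \sum (x_i - \mu)(y_i - \mu)}{(1 - \rho^2)^2 \sigma^2}$$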
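5.6.1 Restricted Maximum Likelihood Estimators
The restricted maximum likelihood estimators will satisfy the equations
$$\frac{\sum (x_i - \hat\mu) + \sum (y_i - \hat\mu) - \hat\rho \sum (x_i - \hat\mu) - \hat\rho \sum (y_i - \hat\mu)}{(1 - \hat\rho^2) \hat\sigma^2} = 0 \tag{5.20}$$
$$-\frac{n}{\hat\sigma^2} + \frac{\sum (x_i - \hat\mu)^2 + \sum (y_i - \hat\mu)^2 - 2 \hat\rho \sum (x_i - \hat\mu)(y_i - \hat\mu)}{2 (1 - \hat\rho^2) \hat\sigma^4} = 0 \tag{5.21}$$
$$\frac{n \hat\rho}{1 - \hat\rho^2} - \frac{\hat\rho \left[ \sum (x_i - \hat\mu)^2 + \sum (y_i - \hat\mu)^2 \right]}{(1 - \hat\rho^2)^2 \hat\sigma^2} + \frac{(1 + \hat\rho^2) \sum (x_i - \hat\mu)(y_i - \hat\mu)}{(1 - \hat\rho^2)^2 \hat\sigma^2} = 0 \tag{5.22}$$
To distinguish these from the unrestricted MLE, we should really subscript these with R, to
make it clear that the restricted estimates will, in general, be different. This clutters up the
notation, so we use the subscripts only when reporting the final results.
The first equation can be rewritten as
$$\frac{(1 - \hat\rho) \left( \sum (x_i - \hat\mu) + \sum (y_i - \hat\mu) \right)}{(1 - \hat\rho^2) \hat\sigma^2} = 0$$
This will hold only if
$$\sum (x_i - \hat\mu) + \sum (y_i - \hat\mu) = 0$$
i.e.
$$\hat\mu_R = \frac{\sum x_i + \sum y_i}{2n} = \frac{1}{2} (\hat\mu_x + \hat\mu_y) \tag{5.23}$$
So the restricted estimate of the mean will be the average of the unrestricted estimates, which
is equivalent to calculating the mean over the pooled sample.
Equation 5.21 can be rewritten as
$$\frac{\sum (x_i - \hat\mu)^2 + \sum (y_i - \hat\mu)^2}{(1 - \hat\rho^2) \hat\sigma^2} = 2n + \frac{2 \hat\rho \sum (x_i - \hat\mu)(y_i - \hat\mu)}{(1 - \hat\rho^2) \hat\sigma^2} \tag{5.24}$$
Substituting this into equation 5.22 we get
$$\frac{n \hat\rho}{1 - \hat\rho^2} - \frac{\hat\rho}{1 - \hat\rho^2} \left[ 2n + \frac{2 \hat\rho \sum (x_i - \hat\mu)(y_i - \hat\mu)}{(1 - \hat\rho^2) \hat\sigma^2} \right] + \frac{(1 + \hat\rho^2) \sum (x_i - \hat\mu)(y_i - \hat\mu)}{(1 - \hat\rho^2)^2 \hat\sigma^2} = 0$$
$$n \hat\rho (1 - \hat\rho^2) \hat\sigma^2 - 2 n \hat\rho (1 - \hat\rho^2) \hat\sigma^2 - 2 \hat\rho^2 \sum (x_i - \hat\mu)(y_i - \hat\mu) + (1 + \hat\rho^2) \sum (x_i - \hat\mu)(y_i - \hat\mu) = 0$$
Consequently
$$-n \hat\rho (1 - \hat\rho^2) \hat\sigma^2 + (1 - \hat\rho^2) \sum (x_i - \hat\mu)(y_i - \hat\mu) = 0$$
$$(1 - \hat\rho^2) \sum (x_i - \hat\mu)(y_i - \hat\mu) = n \hat\rho (1 - \hat\rho^2) \hat\sigma^2$$
$$\hat\rho_R = \frac{\sum (x_i - \hat\mu_R)(y_i - \hat\mu_R)}{n \hat\sigma_R^2} \tag{5.25}$$
Substituting this expression for $\hat\rho$ into equation 5.24 we get
$$\frac{\sum (x_i - \hat\mu)^2 + \sum (y_i - \hat\mu)^2}{(1 - \hat\rho^2) \hat\sigma^2} = 2n + \frac{2 n \hat\rho^2}{1 - \hat\rho^2}$$
$$\sum (x_i - \hat\mu)^2 + \sum (y_i - \hat\mu)^2 = 2 n \hat\sigma^2 (1 - \hat\rho^2) + 2 n \hat\rho^2 \hat\sigma^2 = 2 n \hat\sigma^2$$
$$\hat\sigma_R^2 = \frac{\sum (x_i - \hat\mu_R)^2 + \sum (y_i - \hat\mu_R)^2}{2n} \tag{5.26}$$
These results are intuitively obvious: if $X$ and $Y$ are drawn from the same distribution, then
it would be most efficient to estimate the mean and the variance by pooling the observations on
$X$ and $Y$. The correlation coefficient then looks at the deviations from the pooled mean, normalised
against the pooled standard deviations.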
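5.6.2 Restricted loglikelihood
As before we can substitute the maximum likelihood estimates into the restricted loglikelihood
to evaluate what this maximum value actually is:
$$\ell(\hat\mu_R, \hat\sigma_R^2, \hat\rho_R \mid x, y) = -n \ln(2\pi) - n \ln \hat\sigma_R^2 - \frac{n}{2} \ln(1 - \hat\rho_R^2)
- \frac{\sum (x_i - \hat\mu_R)^2 + \sum (y_i - \hat\mu_R)^2 - 2 \hat\rho_R \sum (x_i - \hat\mu_R)(y_i - \hat\mu_R)}{2 (1 - \hat\rho_R^2) \hat\sigma_R^2}$$
Substituting in equations 5.25 and 5.26 we get
$$\ell(\hat\mu_R, \hat\sigma_R^2, \hat\rho_R \mid x, y) = -n \ln(2\pi) - n \ln \hat\sigma_R^2 - \frac{n}{2} \ln(1 - \hat\rho_R^2)
- \frac{2 n \hat\sigma_R^2 - 2 n \hat\rho_R^2 \hat\sigma_R^2}{2 (1 - \hat\rho_R^2) \hat\sigma_R^2}$$
$$= -n \ln(2\pi) - n \ln \hat\sigma_R^2 - \frac{n}{2} \ln(1 - \hat\rho_R^2) - n \tag{5.27}$$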
Part II
Single equation estimation
Chapter 6
Thinking about social processes econometrically
6.1 Setting up an econometric research problem
Why study econometrics? Most of the time applied econometricians think it is quite obvious
what they do and how they should do it. Underpinning this is an implicit model of how the
world works. Sometimes it is quite useful to make this methodology explicit. The purpose of
this chapter is to provide you with some tools which may come in useful if you come up against
non-standard problems: situations in which it may no longer be obvious what you should do or
how you should do it. In fact an understanding of the methodology is useful even as the broad
backdrop to the most well-known of models, the classical linear regression model. We will spend
some time setting up this model against the backdrop of the broader framework within which it
fits.
Our point of departure is to marry a typology derived from Mittelhammer et al. (2000,
Chapter 1) and a framework given by Angrist and Pischke. The former suggest that the process
of econometric research may be crudely categorised into three parts:
1. A process of abstraction or model-building. In this process we want to capture the essential
relationships and characteristics of the real world in simplified form. The resultant
mathematical/econometric model can be used to make deductions about what we should
or should not be able to observe in the world. At its core an econometric model can be
thought of as depicting a Data Sampling Process (DSP; other authors may call this the
Data Generating Process, DGP). This characterises both what we know about the world and
how the information available to the analyst is ultimately derived from it.
2. A process of information recovery which can take both the form of estimation and inference.
The purpose of this step is to use the available information to extract more information
about the DSP, i.e. the world.
3. A final step is to reflect on the meaning of this additional information. The fundamental
problem (too often forgotten) is that the process of estimation and inference is conditional
on the econometric model. It is therefore always advisable to reflect on how plausible the
model is, given the results obtained. This process of analysis is, of course, often the prelude
to further rounds of model-building and information recovery. In some contexts this step
will also be associated with a discussion of the policy implications of the results.
Angrist and Pischke (2009, Chapter 1), by contrast, suggest that the key questions which can
be used to characterise most econometric research are:
1. What is the causal relationship of interest? This focus on causality is not self-evident, but
much of the time economists are interested in the determinants of social and economic
processes. If we understand what drives the observed outcomes, we will be more confident
about policy interventions that seek to modify these outcomes.
2. What ideal experiment could reveal the causal relationship? For the moment it is sufficient
to note that thinking about the possibility of an experiment forces the analyst to be clear
about what could, in principle, be manipulated. Questions which could not in principle
ever be settled by an experiment are fundamentally unidentified questions (FUQs). If
you have one of these, you are FUQed, and you will need to change your research question.
3. In the absence of an ideal experiment, what identification strategy will reveal it? In order
to think about this we will need to know a lot more about how the process works in the
real world, i.e. outside experimental control.
4. What mode of statistical inference is appropriate?
6.2 The econometric model
Following Mittelhammer et al. (2000), we can categorise the components of an econometric model
as follows:
1. At its core there is an economic model. This sets up the core concepts and relationships
through which the phenomenon of interest will be examined. It will define, for instance,
what can (or should) be measured, e.g. aggregate income, prices, opportunity costs and so
on. It may describe how these theoretical variables are related, e.g. = ( ).
2. The sampling model describes how these theoretical constructs get turned into real data
available to the analyst. For instance the analyst may never be able to observe the opportunity
cost of someone's time. All that she may be able to observe is whether someone is
working or not. In this case the sampling model will describe how the economic variable
(opportunity cost) is converted into observable data (work/not work). Most of the time
the sampling model will have to go beyond this and describe how information that could
be observed gets converted into data that is actually available to the analyst. In the case of
microeconomic data, this may involve discussion of sampling design (who was sampled?),
measurement (do the answers to the questions actually tell us whether someone worked
or not?) as well as issues of data capture. In the case of macroeconomic data it may
involve questions of aggregation and data manipulation (such as detrending). Once the
sampling model is specified we know which empirical variables are available to the analyst,
$Y = \{Y_1, Y_2, \ldots, Y_n\}$.
3. The probability model provides yet more structure to the data. It will specify that these
empirical data are the outcome of some random variable which has a certain distribution.
The observed outcome $Y$ therefore has a joint distribution which can be characterised
in terms of some distribution function $F$. Frequently this distribution will be deemed to
come from a certain family of distributions (such as the multivariate normal), which can
be indexed by a set of parameters $\theta$. If we knew $\theta$, we could then completely characterise
the Data Sampling Process (DSP). The point of the probability model is that it specifies
how likely certain outcomes are, compared to others.
Once the DSP has been fully specified, it should, in principle be possible to simulate the
process of data generation. The analyst could play God and recreate many possible outcomes
of the underlying economic process. Such simulations may be useful in answering questions about
the characteristics of this process.
6.2.1 Abstraction and causality
As noted above, the process of abstraction involves isolating the processes of interest from
the surrounding jumble of cross-cutting events and processes. To make this more concrete,
consider the relationship between the log of wages received and taking an advanced econometrics
course. We might have considered many facts about individuals other than their educational
trajectories, for instance their astrological star signs, their pain thresholds or their blood-type.
Which factors we choose to focus on will be guided by economic theory. We are interested mainly
in relationships that are not purely coincidental, but reflect stable underlying social processes.
Causal processes are the best examples of such stable relationships. Many economic models are
underpinned by causal stories. For instance human capital theory maintains that education
causes people to become more productive and hence earn higher wages. A human capital account
would therefore posit a link between taking econometrics courses and earning higher wages. It
would rule out links between astrological star signs and wages, except of a completely accidental
nature.
An alternative link is provided by signalling theory. This suggests that employers find it
difficult to measure the ability of candidates accurately. Consequently high ability candidates need
to acquire a signal (such as an econometrics qualification) which low ability candidates find
hard or impossible to obtain. Gaining such a qualification therefore causes a change in the
employer's belief about the applicant's ability and hence the wage which will be paid. In this case
the causal link is indirect and dependent on employers and candidates both understanding that
econometrics is difficult and hence a good signal of ability. There is nothing intrinsic about
econometrics which gives it that function. Studying Latin could do just as well. As such the link
between the econometrics course and higher wages is actually contingent on the prevailing norms
and beliefs. It could shift over time or could function differently in different labour markets
(locations).
Even human capital theory would allow for changes in the relationship between wages and
taking econometrics courses. If there was a glut of econometrics graduates, this would depress
their earnings, even if the causal relationship between econometrics and higher productivity
remains unchanged. This points to an interesting relationship between the economic notion
of equilibrium and causality. Implicit in any equilibrium account are causal stories about the
impact of supply and demand: increases in supply cause drops in prices as long as demand
remains unchanged. The causal relationship between acquiring econometric skills and higher
wages is implicitly predicated on everything else staying the same.
Hence causes will invariably lead to particular effects ceteris paribus. The origin of this
notion of causality can be traced back at least to Hume, who defined causality in terms of the
constant conjunction of events. If we say "A causes B" we mean that whenever we observe A,
we also observe B. Of course this is not strictly speaking true: it is not the case that whenever
we switch on a light that we get illuminated: the light could have blown, there could be an
electricity outage or the circuit could have shorted. This means that the constant conjunction
has to be defined more carefully: in terms of equipment in working order, not subject to external
interruptions etc. Indeed controlling the interference of external forces is one of the key issues
for scientific laboratory experiments. Constant conjunctions occur only in closed systems,
i.e. systems isolated from their context; from the jumble of cross-cutting events and processes!
This suggests that sensible abstractions are those which pinpoint relationships which might
be isolated under laboratory conditions. Of course we do not expect the mechanisms that we
manage to isolate experimentally to stop working the moment that we leave the laboratory.
Science enables us to make sense of everyday processes ranging from medicine to mechanical
engineering, precisely because the same causal mechanisms operate, even if they are sometimes
confounded by other processes.
6.2.2 The Rubin causal model
Where laboratory experiments are designed to isolate causal mechanisms in the physical sciences,
it is much harder to achieve such experimental closure in the social sciences or indeed even in the
complex interactions inside biological systems. In these contexts we may not see the constant
conjunctions posited by the causal story. Instead we may need to look for statistical regularities
rather than deterministic patterns. An influential model for thinking about causality in these
contexts is provided by Rubin (Holland 1986). To make things more definite let us assume
that we are considering a treatment, such as administering a drug (non-recreational) or an
econometrics course (definitely non-recreational). The variable $D_i$ captures whether individual $i$
receives the treatment ($D_i = 1$) or the control ($D_i = 0$). The outcome of interest (or response
variable) is $Y_i$, which might be "recovers from illness" ($Y_i = 1$ or $Y_i = 0$) in the former example or
log of wages in the latter. The key idea in the Rubin framework is that for each individual we can
think of two possible outcomes: $Y_{1i}$ and $Y_{0i}$, i.e. the outcome if individual $i$ is treated ($D_i = 1$)
or not ($D_i = 0$). Only one of these potential outcomes can ever be observed. Indeed Holland
(1986) makes the point that even if we could repeat an experiment on the same individual (e.g.
re-run a lab experiment) we can still not observe what would have happened at that time if we
had applied the cause differently. The causal effect for individual $i$ of the treatment is defined
as
$$\tau_i = Y_{1i} - Y_{0i}$$
Because biological and social systems are such that we cannot control for all sources of heterogeneity
(different individuals have slightly different genetics and so might respond to drugs
slightly differently) we cannot assume in general that $\tau_i$ is a constant across all individuals. What
we might measure, though, is the average treatment effect (ATE), defined as
$$\tau = E(Y_{1i} - Y_{0i}) \tag{6.1}$$
Note that this is not the observed difference in outcomes between those who are treated and
those who are not. This naive difference (the prima facie causal effect, PFCE, in the terminology of
Holland (1986)) is
$$PFCE = E(Y_{1i} \mid D_i = 1) - E(Y_{0i} \mid D_i = 0)$$
The two will be equal if $E(Y_{1i}) = E(Y_{1i} \mid D_i = 1)$ and $E(Y_{0i}) = E(Y_{0i} \mid D_i = 0)$. This means that
the potential outcomes are unrelated to the treatment status. In many cases this is likely to be
violated: people who are smart enough to take econometrics courses would probably have earned
more than people who do not, even if they had not carried on with their education. We can
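show the bias implicit in the PFCE by the following decomposition:
$$\begin{aligned}
PFCE &= E(Y_{1i} \mid D_i = 1) - E(Y_{0i} \mid D_i = 1) + E(Y_{0i} \mid D_i = 1) - E(Y_{0i} \mid D_i = 0) \\
     &= \underbrace{E(Y_{1i} - Y_{0i} \mid D_i = 1)}_{ATT} + \underbrace{E(Y_{0i} \mid D_i = 1) - E(Y_{0i} \mid D_i = 0)}_{\text{bias}}
\end{aligned}$$
where ATT is the average treatment effect on the treated.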
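The decomposition can be checked numerically. The following is a minimal sketch on a simulated population in which both potential outcomes are known (something that is never true of real data). The data-generating process, the variable names and the parameter values are all assumptions made for illustration: an unobserved "ability" raises the untreated outcome and also makes treatment more likely.

```python
import random

random.seed(0)

# Hypothetical population: "ability" raises the untreated outcome and
# also raises the chance of being treated (self-selection).
n = 20000
y0, y1, d = [], [], []
for _ in range(n):
    ability = random.gauss(0.0, 1.0)
    y0_i = 10.0 + 2.0 * ability + random.gauss(0.0, 1.0)  # outcome if untreated
    y1_i = y0_i + 1.0 + 0.5 * ability                     # outcome if treated
    d_i = 1 if ability + random.gauss(0.0, 1.0) > 0 else 0
    y0.append(y0_i)
    y1.append(y1_i)
    d.append(d_i)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

treated = [i for i in range(n) if d[i] == 1]
control = [i for i in range(n) if d[i] == 0]

ate = mean(y1[i] - y0[i] for i in range(n))          # average treatment effect
att = mean(y1[i] - y0[i] for i in treated)           # effect on the treated
pfce = mean(y1[i] for i in treated) - mean(y0[i] for i in control)
selection_bias = mean(y0[i] for i in treated) - mean(y0[i] for i in control)

print(f"ATE = {ate:.3f}, ATT = {att:.3f}")
print(f"PFCE = {pfce:.3f}, ATT + bias = {att + selection_bias:.3f}")
```

In this simulated example the naive difference equals the ATT plus the bias term exactly (up to rounding), and it overstates the ATE because the treated would have done better even without treatment.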

Attributes vs treatment variables (the important thing about treatments is that they could have
been different).
6.2.3 Experimentation
6.3 The process of information recovery
In practice, of course, the econometrician does not fully know the DSP. Depending on how much
information the analyst has up front, the econometric specification will be more or less complete.
A general specification can be written as follows (Mittelhammer et al. 2000, p.9):
$$\mathbf{Y} = \eta(\mathbf{X}, \boldsymbol{\theta}, \boldsymbol{\varepsilon}) \tag{6.2}$$
where
• $\mathbf{Y}$ is the set of random variables characterising the outcomes on the dependent variables.
• $\mathbf{X}$ is a set of random variables characterising the additional observable information.
• $\boldsymbol{\varepsilon}$ is a set of unobserved random variables.
• $\boldsymbol{\theta}$ is a vector of parameters characterising the joint distribution of the $\mathbf{Y}$ variables.
• $\eta$ is the function that relates the dependent variable to the explanatory variables and the
unobservables.
Any specification will capture how much the analyst is willing to assume about the underlying
DSP. For instance it is possible to make stronger or weaker assumptions about $\boldsymbol{\varepsilon}$ and $\eta$ (the
functional form). In many of our applications we will subdivide $\boldsymbol{\theta}$ into parameters which affect
the mean value of $\mathbf{Y}$ and parameters which affect its variance.
Once the analyst has indicated how much she is willing to assume, the problem becomes how
to retrieve information about the unobservables $\boldsymbol{\theta}$ and $\boldsymbol{\varepsilon}$ from the observed data $(\mathbf{y}, \mathbf{x})$. Note
that we have written this in lower case to indicate that these are outcomes and not the random
variables themselves!
In short, the problem is how to go from $(\mathbf{y}, \mathbf{x})$ to $(\boldsymbol{\theta}, \boldsymbol{\varepsilon})$. Mittelhammer et al. (2000, p.9) call
this the inverse problem of econometrics.
There are different ways in which we may go about trying to get information about $(\boldsymbol{\theta}, \boldsymbol{\varepsilon})$:

1. The simplest is point estimation. In this we try to get estimated vectors $\hat{\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\varepsilon}}$.
2. A little bit more complicated is interval estimation. Here we try to find ranges within
which the unobservables are likely to lie, with say a 95% confidence.
3. Related to interval estimation is the process of inference. Here we try to ask questions
along the lines of: Is it plausible that the DSP could be characterised by $\boldsymbol{\theta}$? Typically we
will separate the possible DSPs into two groups (based on $H_0$ and $H_1$) and decide whether
the DSP we are considering belongs to group 1 or 2. For instance we may ask the question
whether a particular production function is constant returns to scale, or not.
6.3.1 Properties of estimators and of rules of inference
A key issue in theoretical econometric research is to develop appropriate processes of estimation
and inference. Unsurprisingly, it will turn out that the properties of different estimators (or rules
for arriving at estimates) will depend on the nature of the DSP. Some of the properties that we
will consider in this course are:
• Bias: An unbiased estimator is one that on average gives the true parameter value, i.e.
$E(\hat{\theta}) = \theta$. In words this says that if the DSP is really characterised by the parameter $\theta$,
then on average our estimator $\hat{\theta}$ will give us $\theta$.
• Consistency: A consistent estimator is one that in large samples will give estimates close
to the true value, i.e. $\operatorname{plim} \hat{\theta} = \theta$.
• Efficiency: An efficient estimator $\hat{\theta}$ within a particular class of estimators will have a
smaller variance than any other estimator $\tilde{\theta}$ within that class, i.e. $Var(\hat{\theta}) \le Var(\tilde{\theta})$.
• Asymptotic normality: Many of the estimators that we will consider will have the property
that in large samples, the distribution of the estimator tends towards the normal distribution.
We will be concerned to characterise the mean and the covariance matrix of these
estimators.
Similarly the properties of the rules of inference will depend on the nature of the DSP. We
will have much less to say on this subject, although more advanced texts will consider properties
such as the power of different tests.
An important point to bear in mind is that all our analyses will be done within the standard
framework, i.e. the properties should be interpreted either
• in a repeated sampling sense, i.e. how the estimator would behave if we had the opportunity
to re-run the analysis very many times; or
• in an asymptotic sense, i.e. how the estimator would behave if we had an infinite sized
sample.
In an empirical problem we, of course, will never have infinite sized samples. Furthermore we
will hardly ever have the luxury to repeat the experiment. The fact that the estimator performs
well on average, does not guarantee that we will get estimates that are at all close to the true
values in any particular analysis! Hence even after the analysis has been performed there is still
some judgement involved as to whether we choose to believe our estimates or not. Because of this
difficulty, Bayesian analysts argue that our a priori judgements about the nature of the DSP
should be explicitly incorporated into the process of estimation and inference. In this course we
will not introduce Bayesian approaches.
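The repeated sampling and large sample interpretations can be illustrated by simulation. The sketch below is not from the text: it assumes a hypothetical DSP in which observations are drawn from a normal distribution with variance 4, and compares two estimators of that variance, one dividing the sum of squared deviations by $n$ and one dividing by $n - 1$.

```python
import random

random.seed(1)
TRUE_VAR = 4.0  # variance of the assumed DSP: N(0, 2^2)

def var_hat(sample, unbiased):
    """Variance estimate dividing by n - 1 (unbiased) or by n (biased)."""
    m = sum(sample) / len(sample)
    ss = sum((v - m) ** 2 for v in sample)
    return ss / (len(sample) - 1 if unbiased else len(sample))

def average_estimate(n, reps, unbiased):
    """Average of the estimator over many re-runs of the 'experiment'."""
    total = 0.0
    for _ in range(reps):
        sample = [random.gauss(0.0, 2.0) for _ in range(n)]
        total += var_hat(sample, unbiased)
    return total / reps

# Repeated sampling with small samples: the divide-by-n rule is biased.
small_biased = average_estimate(n=5, reps=20000, unbiased=False)
small_unbiased = average_estimate(n=5, reps=20000, unbiased=True)
# Large samples: the same biased rule is nevertheless consistent.
large_biased = average_estimate(n=500, reps=200, unbiased=False)

print(f"n=5,   divide by n:     {small_biased:.2f}")
print(f"n=5,   divide by n - 1: {small_unbiased:.2f}")
print(f"n=500, divide by n:     {large_biased:.2f}")
```

With five observations the divide-by-$n$ rule averages about $\frac{4}{5} \times 4 = 3.2$ rather than 4, while with 500 observations it is very close to the true value. Any single small-sample estimate can still be far from 4, which is the point made in the paragraph above.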
6.4 Examples of econometric research problems
6.4.1 The Keynesian consumption function
One of the favourite examples of an econometric model discussed in textbooks (see for instance
Greene 2003, Gujarati 2003) is the Keynesian consumption function. Keynes describes this as
follows:
We will therefore define what we shall call the propensity to consume as the
functional relationship $\chi$ between $Y_w$, a given level of income in terms of wage-units,
and $C_w$ the expenditure on consumption out of that level of income, so that
$C_w = \chi(Y_w)$ or $C = W \cdot \chi(Y_w)$.
The amount that the community spends on consumption obviously depends (i) partly
on the amount of its income, (ii) partly on the other objective attendant circumstances, and (iii) partly on the subjective needs and the psychological propensities
and habits of the individuals composing it and the principles on which the income is
divided between them (which may suffer modification as output is increased). ...
Granted, then, that the propensity to consume is a fairly stable function so that,
as a rule, the amount of aggregate consumption mainly depends on the amount of
aggregate income (both measured in terms of wage units), changes in the propensity
itself being treated as secondary influences, what is the normal shape of this function?
The fundamental psychological law, upon which we are entitled to depend with
great confidence both a priori from our knowledge of human nature and from the
detailed facts of experience, is that men are disposed, as a rule and on the average,
to increase their consumption as their income increases, but not by as much as the
increase in their income. That is to say, if $C_w$ is the amount of consumption and
$Y_w$ is income (both measured in wage-units) $\Delta C_w$ has the same sign as $\Delta Y_w$ but is
smaller in amount, i.e. $dC_w / dY_w$ is positive and less than unity. ...
But, apart from short-period changes in the level of income, it is also obvious that
a higher absolute level of income will tend, as a rule, to widen the gap between income
and consumption. For the satisfaction of the immediate primary needs of a man and
his family is usually a stronger motive than the motives towards accumulation, which
only acquire effective sway when a margin of comfort has been attained. These
reasons will lead, as a rule, to a greater proportion of income being saved as real
income increases. (Keynes 1936, Chapter 8, pp.90-91,96,97)
In terms of our earlier discussion, Keynes provides an economic model, but not a sampling
model or a probability model. Let us consider what might be involved in specifying these.
Note that Keynes's description is about the behaviour of individuals or households. The data
on which this relationship is characteristically estimated are, however, aggregate macroeconomic
data. Our sampling model would therefore need to describe how the household level processes
get reflected in the national accounts, leading to the data that we might extract (e.g. from the
South African Reserve Bank Bulletin). If this is not discussed the analyst is implicitly assuming
that the macroeconomic data transparently reflect what is happening in the real world.
The textbook way in which a probability model is grafted on to the sampling and economic
model is to specify the econometric model as
$$C_t = \alpha + \beta Y_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2) \tag{6.3}$$
Comparing this to the general specification in equation 6.2 we see several things:
• The dependent variable $\mathbf{Y}$ is a particular macroeconomic consumption series
• The explanatory variable $\mathbf{X}$ is a particular macroeconomic GDP series
• The functional form $\eta$ is linear
• The pdf of $\boldsymbol{\varepsilon}$ is multivariate normal
• The vector $\boldsymbol{\theta} = (\alpha, \beta, \sigma^2)'$.
Note that as it stands this model cannot be literally true. If $\varepsilon_t$ really is a normal variable, it
can theoretically assume arbitrarily large positive and negative values. This means that $C_t$ could
be negative. Admittedly the probability of this might be vanishingly small, but it is not zero. In
short, the real world DSP cannot be a member of the family of DSPs represented by equation
6.3.
One of the questions arising in econometrics is how robust processes of estimation and inference are to misspecifications of this sort. Unsurprisingly, it depends on the nature of the
misspecification. Many of the techniques we will discuss are reasonably robust to small departures from their underlying assumptions. We will also see, however, that in certain cases our
processes of inference can become very misleading.
6.4.2 Estimating the unemployment rate
A very different problem is provided by the problem of estimating the unemployment rate. This
seems a different problem because it looks like a pure measurement issue, i.e. we are not looking
at the relationship between two or more variables. Nevertheless all of the same issues crop up.
In order to even measure unemployment we need to have a clear concept of what it is. This
requires us to depart from some economic model of the process. This turns out to be more tricky
than it might seem at first. In terms of standard neoclassical labour economics one wants to
distinguish between voluntary unemployment (when the wage one can command is below one's
reservation wage) and involuntary unemployment. The economically interesting measurement
is that of involuntary unemployment. The simplest economic model of unemployment will therefore
state that attached to each individual $i$ within the economy there is a vector $(w_i, r_i, e_i)$
which contains the information on that individual's attainable wage $w_i$, reservation wage $r_i$ and
employment status $e_i$, which might be coded as follows:
$$e_i = \begin{cases}
3 & \text{if } w_i \ge r_i \text{ and there are jobs available (the person is employed)} \\
2 & \text{if } w_i \ge r_i \text{ and there are no jobs available (involuntarily unemployed)} \\
1 & \text{if } w_i < r_i \text{ (voluntarily unemployed)}
\end{cases}$$
The sampling model in this case will describe how the theoretical vector $(w_i, r_i, e_i)$ gets
converted into an actual observation on individual $i$. In many surveys questions are not asked
about reservation wages. This information may therefore simply not be available. Even when
questions are asked, they may only be asked of people who are unemployed. More problematic
still, the analyst cannot observe $w_i$ for someone who is unemployed. It is not clear that asking the
unemployed to tell us how much they think they could command in the market place would give
us any approximation to $w_i$ either. Another problem is that we do not have direct observations
on $e_i$. Instead we usually have a battery of questions along the lines of "Did you do any work
during the past week?", "How many hours did you spend on casual work last week?", "If you
were offered employment tomorrow, would you be willing to take it?" and so on. It is not
clear whether the persons answering the questions understand these in the way that the analyst
intended. For instance a number of people who undertake casual work may not regard this
as proper work. Similarly the question about willingness to work does not specify under what
conditions. Different analysts will take responses to these questions and on the basis of these
code someone as being unemployed, employed or out of the labour force. The measured variable
$e_i^*$ might be coded as follows
$$e_i^* = \begin{cases}
3 & \text{if the person is employed} \\
2 & \text{if the person is unemployed} \\
1 & \text{if the person is not economically active}
\end{cases}$$
In short, the typical data available to the analyst will be the vector $(w_i, \cdot, e_i^*)$ if the person
is employed, and $(\cdot, \cdot, e_i^*)$ if the person is unemployed or not economically active. The dot here
indicates that the information is missing. From this the analyst might create the dummy variable
$$u_i = \begin{cases}
1 & \text{if the person is unemployed} \\
0 & \text{if the person is employed}
\end{cases}$$
If the analyst does not have a complete census available, the sampling model would also need to
describe the process by which the information was obtained. Typically this would be by means
of a cross-sectional survey.
It may look as though we are now done, and that with the information available it is clear
how the process of estimation will proceed. The unemployment rate will be estimated simply
as the number of unemployed people in the sample divided by the sum of those employed and
unemployed, i.e.
$$\hat{u} = \frac{1}{n} \sum_{i=1}^{n} u_i$$
Implicit in this is a particular view of the probability model. The assumption is that observations
are essentially independent of each other. If this is not the case (and frequently in cross-sectional
surveys it is not) then the appropriate method of estimation needs to take that into account
also (see Deaton 1997, Chapter 1). The implicit econometric model might be
$$u_i = \varepsilon_i, \qquad \varepsilon_i \sim B(p)$$
where $B(p)$ is the Bernoulli distribution with parameter $p$. It would be instructive to compare
this to the general specification given in equation 6.2.
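As a small worked illustration of the estimator just described, the sketch below uses a made-up list of coded survey responses (the data are hypothetical; the 3/2/1 coding follows the text). The standard error formula is only appropriate under the assumption of independent Bernoulli draws, which is exactly the assumption that clustered survey designs tend to violate.

```python
# Hypothetical coded responses: 3 = employed, 2 = unemployed,
# 1 = not economically active.
status = [3, 3, 2, 1, 3, 2, 3, 1, 3, 3, 2, 3]

# Only the employed and unemployed enter the unemployment rate.
labour_force = [s for s in status if s in (2, 3)]
u = [1 if s == 2 else 0 for s in labour_force]  # unemployment dummy

n = len(u)
rate = sum(u) / n

# Standard error assuming independent Bernoulli(p) observations.
std_error = (rate * (1 - rate) / n) ** 0.5

print(f"labour force size: {n}")
print(f"estimated unemployment rate: {rate:.2f} (s.e. {std_error:.3f})")
```

The two people who are not economically active are dropped before the rate is computed, so the denominator is the labour force and not the whole sample.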
In short even what looks like a simple measurement issue involves, at least implicitly, a view of
the underlying DSP. Analysts can disagree (sometimes vehemently) on the appropriate methods
of measuring such phenomena because they have different views about the appropriate economic
model that should inform the measurement, the data sampling process or the appropriate
probability model.
6.5 Types of inverse problems
In the previous section we already saw two very different econometric problems. In the one
case the dependent variable was continuous, in the other it was discrete. In the first there were
covariates, in the second we did not use any. The distribution of the errors was assumed to
be normal in the one and Bernoulli in the other. Mittelhammer et al. (2000, p.27) provide a
typology of different types of probability models which is given in Table 6.1.
In this course we will not be examining all the possibilities contained in this table. Nevertheless
it is important to know that econometric techniques exist for all sorts of conditions.
Identifying which circumstances apply to your problem and picking the right tools for the job
is absolutely vital if you want to perform high quality research.
Table 6.1: Classes of econometric models
General probability model: $\mathbf{Y} = \eta(\mathbf{X}, \boldsymbol{\theta}, \boldsymbol{\varepsilon})$

Dependent variable $\mathbf{Y}$
    RV type: discrete, continuous, mixed
    Range: unlimited, limited
    Dimension: univariate, multivariate
Function $\eta(\mathbf{X}, \boldsymbol{\theta}, \boldsymbol{\varepsilon})$
    Functional form in $\mathbf{X}$: linear, transformable to linear, nonlinear
    Functional form in $\boldsymbol{\theta}$: linear, transformable to linear, nonlinear
    Functional form in $\boldsymbol{\varepsilon}$: additive, nonadditive
Explanatory variables $\mathbf{X}$
    RV type: fixed; random (independent of $\boldsymbol{\varepsilon}$, uncorrelated with $\boldsymbol{\varepsilon}$, dependent)
    Genesis: endogenous, exogenous
Parameters $\boldsymbol{\theta}$
    RV type: fixed, random
    Dimension: finite, unspecified
Noise $\boldsymbol{\varepsilon}$
    RV type: iid, independent but nonidentical, dependent
    Moments: $E[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}$; $Cov(\boldsymbol{\varepsilon} \mid \mathbf{X})$ specified as a function of $\mathbf{X}$ and further parameters
PDF of $\boldsymbol{\varepsilon}$ given $\mathbf{x}$
    PDF family: normal, nonnormal, unspecified
Parameter space
    Prior info: unconstrained, equality constrained, inequality constrained, stochastic prior info

From Mittelhammer et al. (2000, p.27)

In applied work it is important to beware of two kinds of errors:
• Using too simple a technique in circumstances where it really doesn't apply. Many analysts
thoughtlessly impose Ordinary Least Squares on any empirical problem that they find. As
we will see later in this course the circumstances under which OLS provides reasonable
answers are circumscribed in various ways.
• Using too fancy a technique where something simpler would work as well or better. In
each of us there is the temptation to show off how smart we are. One of the ways in which
we do this is by applying new-fangled techniques. However "new and sophisticated" is
not always better. One of the advantages of tried and tested routines (like OLS) is that
their limitations and behaviour are well understood. If you do apply something fancy, it is
generally advisable to also run the simpler routines, simply as a robustness check.
6.6 The classical linear regression model
The baseline standard of econometric analysis is the classical linear regression model. At its
simplest, the model assumes that each observation in the sample is generated by a process that
can be represented as
$$y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ik}\beta_k + \varepsilon_i \tag{6.4}$$
where the $x$ variables are assumed to be independent of the error terms, the $\beta$s are fixed and
each $\varepsilon_i$ is distributed independently and identically with a mean of 0 and variance $\sigma^2$ (i.e. there
is no heteroscedasticity and no autocorrelation). Note that in this form we have already made
very particular choices in each of the dimensions represented in Table 6.1. If in addition we
make the assumption that the errors are normally distributed, then the model is known as
the classical normal linear regression model. We can make all of this more precise.
6.6.1 Matrix representation
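Since equation 6.4 is true of every observation $i$, we can stack the observations and get the
equivalent expression
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{pmatrix} \beta_1
+ \begin{pmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{pmatrix} \beta_2
+ \cdots
+ \begin{pmatrix} x_{1k} \\ x_{2k} \\ \vdots \\ x_{nk} \end{pmatrix} \beta_k
+ \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
which we can write in vector form as
$$\mathbf{y} = \mathbf{x}_1\beta_1 + \mathbf{x}_2\beta_2 + \cdots + \mathbf{x}_k\beta_k + \boldsymbol{\varepsilon}$$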
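We can write this more compactly still in matrix form as
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}
+ \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
or in short
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{6.5}$$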
This is the fundamental equation of the linear regression model: $\mathbf{y}$ is the $(n \times 1)$ column vector
of the observations on the dependent variable, $\mathbf{X}$ is an $(n \times k)$ matrix in which each column
represents the observations on one of the explanatory variables, $\boldsymbol{\beta}$ is a $(k \times 1)$ vector of
parameters and $\boldsymbol{\varepsilon}$ is the $(n \times 1)$ vector of stochastic error terms (or disturbance terms). Typically
the first column of $\mathbf{X}$ will be a column of 1s, so that $\beta_1$ is the intercept in the model.
Using the notation introduced in Section 2.4.3, the assumptions about the mean of the error
term can be represented as follows:
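$$E[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \begin{pmatrix} E[\varepsilon_1 \mid \mathbf{X}] \\ E[\varepsilon_2 \mid \mathbf{X}] \\ \vdots \\ E[\varepsilon_n \mid \mathbf{X}] \end{pmatrix} = \mathbf{0}$$
It follows from Theorem 2.9 in Chapter 2 (Law of Iterated Expectations) that $E[\boldsymbol{\varepsilon}] = \mathbf{0}$.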
The assumptions about the variance of the errors are captured in the following:
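$$\begin{aligned}
Var(\boldsymbol{\varepsilon} \mid \mathbf{X}) = E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}]
&= \begin{pmatrix}
E[\varepsilon_1^2 \mid \mathbf{X}] & E[\varepsilon_1\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_1\varepsilon_n \mid \mathbf{X}] \\
E[\varepsilon_2\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_2^2 \mid \mathbf{X}] & \cdots & E[\varepsilon_2\varepsilon_n \mid \mathbf{X}] \\
\vdots & \vdots & \ddots & \vdots \\
E[\varepsilon_n\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_n\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_n^2 \mid \mathbf{X}]
\end{pmatrix} \\
&= \begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{pmatrix}
= \sigma^2 \mathbf{I}
\end{aligned}$$
6.6.2 Assumptions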
In summary, under the assumptions of the Classical Linear Regression Model the DSP can
be described as follows:
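$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad \text{(Assumption 1)}$$
$$E[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0} \qquad \text{(Assumption 2)}$$
$$Var[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \sigma^2 \mathbf{I} \qquad \text{(Assumption 3)}$$
Together with one of the following assumptions about $\mathbf{X}$:
$\mathbf{X}$ is fixed (nonstochastic) (Assumption 4a); or
$\mathbf{X}$ is independent of $\boldsymbol{\varepsilon}$ (Assumption 4b)
The Classical Normal Linear Regression Model adds the assumption:
$$\boldsymbol{\varepsilon} \mid \mathbf{X} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}) \qquad \text{(Assumption 5)}$$
Greene (2003, p.10) adds to these the further assumption that
$$\operatorname{rank}(\mathbf{X}) = k \tag{6.6}$$
This is an important identification condition. It doesn't describe the DSP, but stipulates
under what conditions we can estimate the parameter vector $\boldsymbol{\beta}$. If the condition is violated, we
cannot solve the inverse problem.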
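To make the assumptions concrete, here is a minimal sketch that generates one data set from a DSP satisfying them. The sample size, parameter values and the uniform distribution of the regressor are assumptions chosen purely for illustration.

```python
import random

random.seed(2)

n = 1000
beta = [1.0, 0.5]   # hypothetical intercept and slope
sigma = 2.0         # hypothetical error standard deviation

# Assumption 4b: X is generated independently of the errors.
# The first column is a column of 1s, so beta[0] is the intercept.
X = [[1.0, random.uniform(0.0, 10.0)] for _ in range(n)]

# Assumptions 2, 3 and 5: iid normal errors with mean 0, variance sigma^2.
eps = [random.gauss(0.0, sigma) for _ in range(n)]

# Assumption 1: y is linear in X and beta, with an additive error.
y = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) + e
     for row, e in zip(X, eps)]

# Identification condition: with two columns, X has rank 2 exactly
# when the determinant of X'X is nonzero.
xtx = [[sum(r[a] * r[b] for r in X) for b in range(2)] for a in range(2)]
det_xtx = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]

mean_eps = sum(eps) / n
var_eps = sum(e * e for e in eps) / n
print(f"sample mean of errors: {mean_eps:.3f}")
print(f"sample variance of errors: {var_eps:.3f} (sigma^2 = {sigma ** 2})")
print(f"det(X'X) = {det_xtx:.1f}")
```

In real data the errors are of course never observed; the simulation only shows what a sample drawn from such a DSP looks like.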
We will briefly discuss these assumptions further.
Assumption 1: Linearity in $\mathbf{X}$ and $\boldsymbol{\beta}$ and additivity in $\boldsymbol{\varepsilon}$
The model assumes both linearity in $\mathbf{X}$ and linearity in $\boldsymbol{\beta}$. Linearity in $\mathbf{X}$ is not as restrictive as
it may look at first. Many nonlinear relationships in variables can be accommodated by this
model:
• polynomial regression: If the true relationship is a polynomial in $x$, e.g.
$$y = \beta_1 + \beta_2 x + \beta_3 x^2 + \beta_4 x^3 + \varepsilon$$
we can transform this into a linear regression simply by renaming variables, e.g. $x = x_2$,
$x^2 = x_3$, $x^3 = x_4$. The fact that there is a perfect nonlinear relationship between $x_2$, $x_3$
and $x_4$ does not invalidate any of the assumptions of the classical linear regression model.
• loglinear model. The multiplicative model
$$y = A x^{\beta_2} \varepsilon$$
can be transformed into a linear regression by taking logarithms:
$$\ln y = \ln A + \beta_2 \ln x + \ln \varepsilon$$
Provided that the transformed error term meets the assumptions, this becomes a linear
regression model in the transformed variables.
• semilog model. The model
$$\ln y_i = \mathbf{x}_i'\boldsymbol{\beta} + \delta t_i + \varepsilon_i$$
also meets the assumptions of the classical linear regression model, when suitably interpreted.
(Note $\mathbf{x}_i$ is the column vector corresponding to observation $i$, i.e. $\mathbf{x}_i'$ is row $i$ of the
matrix $\mathbf{X}$.)
As we noted in relation to the loglinear model, nonlinearities in $\boldsymbol{\beta}$ can also be accommodated,
if we can reparameterise the model appropriately. In that case we do so by setting $\ln A = \beta_1$.
Finally the multiplicative error in that specification becomes an additive error in the logarithmic
version.
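The renaming and logarithmic transformations can be tried out on noise-free data, where a least squares fit should recover the coefficients exactly. The sketch below is illustrative only: the small `ols` helper, the coefficient values and the data points are assumptions, and the helper simply solves the least squares equations that are derived in the next chapter.

```python
import math

def ols(X, y):
    """Least squares coefficients, solving X'X b = X'y by Gauss-Jordan."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

x = [0.5 * i for i in range(1, 9)]  # 0.5, 1.0, ..., 4.0

# Polynomial regression: rename x, x^2, x^3 as separate regressors.
true_poly = [1.0, 2.0, -0.5, 0.1]
y_poly = [sum(b * xi ** p for p, b in enumerate(true_poly)) for xi in x]
X_poly = [[1.0, xi, xi ** 2, xi ** 3] for xi in x]
b_poly = ols(X_poly, y_poly)

# Loglinear model: y = A * x^b2 becomes ln y = ln A + b2 * ln x.
A, b2 = 2.0, 0.5
y_mult = [A * xi ** b2 for xi in x]
X_log = [[1.0, math.log(xi)] for xi in x]
b_log = ols(X_log, [math.log(yi) for yi in y_mult])

print("polynomial coefficients:", [round(b, 6) for b in b_poly])
print("A =", round(math.exp(b_log[0]), 6), " b2 =", round(b_log[1], 6))
```

Note that the intercept of the logarithmic regression estimates $\ln A$, so $A$ itself is recovered by exponentiating it.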
Assumption 2: Regression
The assumption that the error terms have a conditional mean of zero implies that
$$E[\mathbf{y} \mid \mathbf{X}] = \mathbf{X}\boldsymbol{\beta}$$
Any function of the form $E[\mathbf{y} \mid \mathbf{X}] = g(\mathbf{X})$ is called a regression function, i.e. a regression
function describes how the conditional mean of y (the dependent variable) changes with X (the
explanatory variables).
Assumption 3: Spherical disturbances
This assumption states that the error process operates essentially constantly from observation
to observation. Furthermore errors that happen on one observation have no influence on what
happens to the next observation. This independence property will frequently be violated in
practice. Particularly in macroeconomic data, processes from one period spill over to the next.
Even in microeconomic interactions individuals can influence each other. For instance if people
imitate the behaviour of their neighbours, this will induce (for example) correlations in their
consumption patterns which may go beyond the observables (i.e. $\mathbf{X}$) that we can control for in
the standard regressions.
Assumption 4: Exogeneity of X
Version (a) of this assumption (fixed regressors) is unlikely to be ever met in economic research.
Indeed econometricians have thought long and hard about how to analyse data that are essentially
non-experimental. We do not have the luxury of being able to predetermine the levels of our
explanatory variables and to measure the outcomes. In practice the crucial assumption will
therefore be version (b), i.e. that the regressors are generated independently of $\boldsymbol{\varepsilon}$.
Assumption 5: Normality of $\boldsymbol{\varepsilon}$
As noted above, this assumption is not a core assumption of the classical linear regression model.
It is, however, very useful for providing various optimality results and allowing us to deduce the
sampling properties of our estimators in small samples.
The identification condition: Full rank of X
It is important to understand what this condition says. Essentially it requires two things:
• None of the columns of the $\mathbf{X}$ matrix should be able to be written as a linear combination
of the other columns. We can see what can go wrong if this condition is violated. If, for
instance, we had $\mathbf{x}_4 = \mathbf{x}_3 + \mathbf{x}_2$ and our DSP was given by
DSP1: $\mathbf{y} = \mathbf{x}_1\beta_1 + \mathbf{x}_2\beta_2 + \mathbf{x}_3\beta_3 + \mathbf{x}_4\beta_4 + \boldsymbol{\varepsilon}$
This DSP would generate precisely the same data as the different DSP
DSP2: $\mathbf{y} = \mathbf{x}_1\beta_1 + \mathbf{x}_2(\beta_2 - 2) + \mathbf{x}_3(\beta_3 - 2) + \mathbf{x}_4(\beta_4 + 2) + \boldsymbol{\varepsilon}$
In fact every possible data set generated by DSP1 would also be generated by DSP2. The
observed data could therefore never adjudicate which of these DSPs was really generating
the observations. In short the data could never identify the underlying process.
• The number of observations $n$ must be at least as large as the number of variables $k$. Again,
if we had too few observations, we could never hope to identify the unknowns.
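The equivalence of DSP1 and DSP2 is easy to verify numerically. The following sketch uses made-up regressor values and coefficients (all hypothetical) with $\mathbf{x}_4 = \mathbf{x}_3 + \mathbf{x}_2$, and compares the systematic part $\mathbf{X}\boldsymbol{\beta}$ under the two parameter vectors.

```python
# Hypothetical regressors; x4 is an exact linear combination of x2 and x3.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [2.0, 0.0, 1.0, 5.0, 3.0]
X = [[1.0, a, b, a + b] for a, b in zip(x2, x3)]

beta_dsp1 = [1.0, 2.0, 3.0, 4.0]
# DSP2 subtracts 2 from beta_2 and beta_3 and adds 2 to beta_4.
beta_dsp2 = [beta_dsp1[0], beta_dsp1[1] - 2, beta_dsp1[2] - 2, beta_dsp1[3] + 2]

def systematic_part(X, beta):
    """Compute X beta row by row."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]

fitted_dsp1 = systematic_part(X, beta_dsp1)
fitted_dsp2 = systematic_part(X, beta_dsp2)

print(fitted_dsp1)
print(fitted_dsp2)
print("identical:", fitted_dsp1 == fitted_dsp2)
```

Since the two parameter vectors give the same $\mathbf{X}\boldsymbol{\beta}$ for every observation, adding the same errors produces the same $\mathbf{y}$, and no amount of data can distinguish them.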
6.7 Exercises
1. Consider the following model of a DSP
( | ) =
1
2
y = x +
if | |
if | |
( |x) = 0 if 6=
where x takes on only positive values.
(a) What is (|x)?
(b) What is |x ?
(c) Does this DSP satisfy the assumptions of the classical linear regression model? Explain.
2. Consider the following model of a DSP
( |x) =
1 | | if | | 1
0
if | | 1
( |x) = 0 if 6=
and x (5 )
(a) What is (|x)?
(b) What is ()?
(c) What is 12 ?
(d) What is ?
(e) Does this DSP satisfy the assumptions of the classical linear regression model? Explain.
(f) (More difficult) What is the conditional distribution of given ?
Chapter 7
Least Squares
7.1 Introduction
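The Classical Linear Regression Model represents the DSP as
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
If we pick any estimate (however arbitrary) of $\boldsymbol{\beta}$ and denote it by $\hat{\boldsymbol{\beta}}_1$, then we can get a set of
fitted values corresponding to these estimates:
$$\hat{\mathbf{y}}_1 = \mathbf{X}\hat{\boldsymbol{\beta}}_1$$
Corresponding to these fitted values, we will get a vector of residuals
$$\mathbf{e}_1 = \mathbf{y} - \hat{\mathbf{y}}_1$$
If we pick another estimate, say $\hat{\boldsymbol{\beta}}_2$, we would get a different set of fitted values $\hat{\mathbf{y}}_2 = \mathbf{X}\hat{\boldsymbol{\beta}}_2$
and residuals $\mathbf{e}_2 = \mathbf{y} - \hat{\mathbf{y}}_2$. In fact it is clear that there can be infinitely many fitted values and
residuals. The problem is how to pick sensible estimates.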
In this chapter we will present Least Squares from a purely mathematical perspective, i.e.
we will not yet be concerned with the statistical properties of this estimator. Nevertheless it is
essential to have a good grasp of the numerical properties of Least Squares, since it will help us
to understand why OLS is a very special method of estimation.
7.2 The Least Squares criterion
The Ordinary Least Squares Criterion stipulates that we pick that estimate which minimises
the Residual Sum of Squares, i.e.
$$\hat{\boldsymbol{\beta}} = \arg\min_{\mathbf{b}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i \mathbf{b})^2$$
where $\mathbf{x}_i$ is the $i$-th row of the matrix $\mathbf{X}$.
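We will write the residual sum of squares (RSS) as $S$. We have
$$S = \sum_{i=1}^{n} e_i^2 = e_1^2 + e_2^2 + \cdots + e_n^2
= \begin{pmatrix} e_1 & e_2 & \cdots & e_n \end{pmatrix}
\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}
= \mathbf{e}'\mathbf{e}$$
So the problem of OLS estimation is to minimise
$$S = \mathbf{e}'\mathbf{e}$$
where
$$\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$$
7.2.1 The solution to the OLS problem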
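Substituting this in:
$$\begin{aligned}
\mathbf{e}'\mathbf{e} &= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\
&= (\mathbf{y}' - \hat{\boldsymbol{\beta}}'\mathbf{X}')(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\
&= \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} + \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}
\end{aligned}$$
We can simplify this if we note that each of these terms is a scalar (a $1 \times 1$ matrix). The
transpose of a scalar is of course just that number again. Now note that $(\mathbf{y}'\mathbf{X}\hat{\boldsymbol{\beta}})' = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$. The
two middle terms are therefore equal to each other and so
$$S = \mathbf{y}'\mathbf{y} - 2\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} + \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} \tag{7.1}$$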
In short, our problem is to pick the vector $\hat{\boldsymbol{\beta}}$ so as to minimise $S$. The usual way of finding
the minimum or maximum of any expression is to differentiate it. Of course here we don't have
a single variable: we have a whole vector. Nevertheless differentiation is still the appropriate
technique. In this case we need to differentiate with respect to each of the elements of the vector
and set the derivatives equal to zero:
$$\frac{\partial S}{\partial \hat{\beta}_1} = 0, \qquad \frac{\partial S}{\partial \hat{\beta}_2} = 0, \qquad \ldots, \qquad \frac{\partial S}{\partial \hat{\beta}_k} = 0$$
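We need to simultaneously solve these equations. We can write this system of equations in vector
form:
$$\frac{\partial S}{\partial \hat{\boldsymbol{\beta}}} = \mathbf{0}$$
where we let $\partial S / \partial \hat{\boldsymbol{\beta}}$ be the vector $\left( \partial S / \partial \hat{\beta}_1, \; \partial S / \partial \hat{\beta}_2, \; \ldots, \; \partial S / \partial \hat{\beta}_k \right)'$.
It helps to be able to do the differentiation directly on the vector expression (equation 7.1).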
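A short diversion on matrix differentiation
To differentiate this equation we make use of the following rules:
1. If $S = a$, where $a$ is a constant, then
$$\frac{\partial S}{\partial \hat{\boldsymbol{\beta}}} = \mathbf{0}$$
i.e. vector differentiation of a constant gives the zero vector.
2. If $S = \hat{\boldsymbol{\beta}}'\mathbf{c}$, where $\mathbf{c}$ is a vector of constants, then
$$\frac{\partial S}{\partial \hat{\boldsymbol{\beta}}} = \mathbf{c}$$
3. If $S = \hat{\boldsymbol{\beta}}'\mathbf{A}\hat{\boldsymbol{\beta}}$, where $\mathbf{A}$ is a symmetric matrix of constants, then
$$\frac{\partial S}{\partial \hat{\boldsymbol{\beta}}} = 2\mathbf{A}\hat{\boldsymbol{\beta}}$$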
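Rules 2 and 3 can be checked against numerical derivatives. This is a minimal sketch with an arbitrary symmetric matrix, constant vector and evaluation point (all hypothetical), using central finite differences to approximate each partial derivative.

```python
def numeric_gradient(f, b, h=1e-6):
    """Approximate the gradient of f at b by central differences."""
    grad = []
    for i in range(len(b)):
        up = b[:]
        up[i] += h
        down = b[:]
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

A = [[2.0, 1.0], [1.0, 3.0]]  # symmetric matrix of constants
c = [4.0, -1.0]               # vector of constants
b = [0.5, -2.0]               # point at which we differentiate

def linear_form(v):
    return sum(v_i * c_i for v_i, c_i in zip(v, c))                 # v'c

def quadratic_form(v):
    return sum(v[i] * A[i][j] * v[j] for i in range(2) for j in range(2))  # v'Av

rule2 = c                                                           # gradient is c
rule3 = [2 * sum(A[i][j] * b[j] for j in range(2)) for i in range(2)]  # 2Ab

print("rule 2:", rule2, "numeric:", numeric_gradient(linear_form, b))
print("rule 3:", rule3, "numeric:", numeric_gradient(quadratic_form, b))
```

The numerical gradients agree with the analytical rules up to rounding error. Note that rule 3 relies on the symmetry of the matrix; for a general square matrix the gradient would be $(\mathbf{A} + \mathbf{A}')\hat{\boldsymbol{\beta}}$.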
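Using these rules to differentiate equation 7.1 we get
$$\frac{\partial S}{\partial \hat{\boldsymbol{\beta}}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}$$
Setting the derivative equal to zero:
$$-2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}$$
$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y} \tag{7.2}$$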
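These are the normal equations. Provided that $\mathbf{X}'\mathbf{X}$ has rank $k$ (which it will do, by the
identification condition that we imposed), we can solve out for $\hat{\boldsymbol{\beta}}$:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \tag{7.3}$$
This is the fundamental equation of OLS estimation.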

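We just need to satisfy ourselves that the solution does, indeed, define the unique minimum
(and not some other stationary point). The Hessian is given by
$$\frac{\partial^2 S}{\partial \hat{\boldsymbol{\beta}} \, \partial \hat{\boldsymbol{\beta}}'} = 2\mathbf{X}'\mathbf{X}$$
This is a positive definite matrix: because $\mathbf{X}$ has full column rank, $\mathbf{z}'\mathbf{X}'\mathbf{X}\mathbf{z} = \|\mathbf{X}\mathbf{z}\|^2 > 0$
for every $\mathbf{z} \neq \mathbf{0}$.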
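The normal equations can be solved directly for a small data set. The sketch below uses five made-up observations on one regressor plus an intercept (the data and the `solve` helper are assumptions for illustration). It solves $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ by elimination and then checks that nearby coefficient vectors give a larger residual sum of squares.

```python
def solve(a, b):
    """Solve the square system a z = b by Gauss-Jordan elimination."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

# Hypothetical data: a column of 1s and one explanatory variable.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
k = len(X[0])

xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * y_i for row, y_i in zip(X, y)) for i in range(k)]
b_hat = solve(xtx, xty)  # solves the normal equations X'X b = X'y

def rss(b):
    """Residual sum of squares e'e for a candidate coefficient vector b."""
    return sum((y_i - sum(x_ij * b_j for x_ij, b_j in zip(row, b))) ** 2
               for row, y_i in zip(X, y))

print("b_hat =", [round(v, 4) for v in b_hat])
print("RSS at b_hat:", round(rss(b_hat), 4))
print("RSS nearby:  ", round(rss([b_hat[0] + 0.1, b_hat[1]]), 4),
      round(rss([b_hat[0], b_hat[1] - 0.1]), 4))
```

For larger problems one would not normally form and invert $\mathbf{X}'\mathbf{X}$ by hand like this; the sketch is only meant to mirror the algebra of equations 7.2 and 7.3.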

7.3 The geometry of Least Squares
At this stage the obvious question seems to be, why should we want to square the residuals?
Why don't we minimise the sum of the absolute deviations instead? In fact, there is an estimator
(the Least Absolute Deviations or LAD estimator) that does precisely that.
In order to get some sense of the rationale for squaring, let us consider the simplest possible
regression problem, given by the model
$$\mathbf{y} = \mathbf{x}\beta + \boldsymbol{\varepsilon}$$
where there is precisely one explanatory variable. Let us consider the particularly simple case in
which we have only two observations, i.e. our model is
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \beta + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$
We can plot these points in the ordinary Cartesian plane where the axes correspond to the two
observations. Geometrically this is shown in Figure 7.1. We have plotted the points $(y_1, y_2)$ and
$(x_1, x_2)$. In this context it is useful to identify these vectors not only with a particular point in
the two-dimensional space, but with the directed line segment from the origin to that point. These are
indicated in the figure by the darker arrows.
By choosing different values for $\hat{\beta}$, we trace out the line through the point $\mathbf{x}$. In the diagram
we have indicated two possible fitted values, i.e. $\mathbf{x}b_1$ and $\mathbf{x}b_2$. The residual vectors $\mathbf{e}_1$ and
$\mathbf{e}_2$ corresponding to these fitted values are simply the vectors starting at the points $\mathbf{x}b_1$ and $\mathbf{x}b_2$
respectively and going to $\mathbf{y}$. This has to be the case, since by definition
$$\mathbf{y} = \mathbf{x}\hat{\beta} + \mathbf{e}$$
for any choice of $\hat{\beta}$.
Minimising the Residual Sum of Squares in this particular context means minimising $e_1^2 + e_2^2$.
This is just the square of the length of the vector $\mathbf{e}$. Minimising the RSS therefore amounts to
picking a residual vector $\mathbf{e}$ that is as short as possible! The fitted values $\hat{\mathbf{y}}$ represent that point
on the line through $\mathbf{x}$ that is closest to $\mathbf{y}$.
This insight generalises to the case where $\mathbf{y}$ and $\mathbf{x}$ are arbitrary vectors in $n$-dimensional space.
The residual vector $\mathbf{e} = (e_1, e_2, \ldots, e_n)$ has length $\|\mathbf{e}\| = \sqrt{e_1^2 + e_2^2 + \cdots + e_n^2}$, so minimising the
RSS is equivalent to minimising $\|\mathbf{e}\|^2$. This, of course, is equivalent to simply minimising the
length of $\mathbf{e}$.
Mathematically it is therefore obvious why one might want to minimise the residual sum of
squares. The reason why we square (and don't take absolute values) is that the usual distance
measures in $n$-dimensional space all involve squares, through Pythagoras's theorem.
Going back to Figure 7.1 it is clear that the residual vector $\mathbf{e}$ of minimum length has
to be given by the perpendicular dropped from $\mathbf{y}$ onto the line that passes through the point $\mathbf{x}$.
In other words, the vector $\mathbf{e}$ has to be at right angles to the line through $\mathbf{x}$. This implies that
the inner product (dot product) of the vectors $\mathbf{e}$ and $\mathbf{x}$ has to be zero, i.e.
$$\begin{aligned}
\mathbf{x}'\mathbf{e} &= 0 \\
\mathbf{x}'(\mathbf{y} - \mathbf{x}\hat{\beta}) &= 0 \\
\mathbf{x}'\mathbf{y} &= \mathbf{x}'\mathbf{x}\hat{\beta}
\end{aligned}$$
We have therefore derived the normal equations through a geometric argument! Note that we
can re-run this derivation in reverse order to show that the OLS solution has to be such that the
residual vector is at right angles to the explanatory variable.
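The geometric argument can be checked numerically. The sketch below uses a made-up pair of two-observation vectors (the numbers are illustrative and not taken from Figure 7.1), solves the single normal equation, and confirms that the resulting residual vector is orthogonal to x and shorter than the residual vectors of a few other points on the line through x.

```python
import numpy as np

# Hypothetical two-observation data (illustrative values, not from the text).
x = np.array([2.0, 1.0])
y = np.array([1.0, 3.0])

# Normal equation with one regressor: x'y = x'x * beta_hat
beta_hat = (x @ y) / (x @ x)
e = y - x * beta_hat                      # OLS residual vector

# Residual lengths for some other points x*b on the line through x
other_lengths = [np.linalg.norm(y - x * b) for b in (0.0, 0.5, 1.5, 2.0)]

print("beta_hat:", beta_hat)
print("x'e:", x @ e)                      # zero: e is at right angles to x
print("length of e:", np.linalg.norm(e))
print("shortest alternative:", min(other_lengths))
```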
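[Figure 7.1 (axes: Observation 1, Observation 2) shows the points $(x_1, x_2)$, $(y_1, y_2)$ and $\hat{\mathbf{y}} = (\hat{y}_1, \hat{y}_2)$, the candidate fitted values $\mathbf{x}b_1$ and $\mathbf{x}b_2$ with their residual vectors $\mathbf{e}_1$ and $\mathbf{e}_2$, and the OLS residual vector $\mathbf{e} = (e_1, e_2)$.]
Figure 7.1: The OLS residual vector $\mathbf{e}$ is at right angles to the $\mathbf{x}$ vector and any vector lying on
the line through $\mathbf{x}$.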
It turns out that this argument generalises to the case where there is more than one explanatory variable (for an extended geometric treatment of Least Squares see Davidson and
MacKinnon 1993). Provided that the vectors of explanatory variables $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ are all linearly independent of each other (which by our identification assumption will be the case), we can trace out
a $k$-dimensional subspace of $\mathbb{R}^n$ by considering all possible linear combinations of these variables. Different fitted values will be given by different choices of $\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$. An arbitrary
point in this space will be given uniquely by the expression
$$\mathbf{x}_1\hat{\beta}_1 + \mathbf{x}_2\hat{\beta}_2 + \cdots + \mathbf{x}_k\hat{\beta}_k$$
or $\mathbf{X}\hat{\beta}$.
The residual vector $\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\beta}$ is the vector from this space to the point $\mathbf{y}$. Again the problem
is to minimise the length of this vector, i.e. to find the point inside the space spanned by $\mathbf{x}_1, \mathbf{x}_2,
\ldots, \mathbf{x}_k$ that is as close as possible to $\mathbf{y}$. Again the solution is to drop the perpendicular from $\mathbf{y}$
to this space, i.e. to make $\mathbf{e}$ orthogonal to $\mathbf{X}$, which in this case implies that $\mathbf{X}'\mathbf{e} = \mathbf{0}$, i.e.
$$\begin{aligned}
\mathbf{X}'(\mathbf{y} - \mathbf{X}\hat{\beta}) &= \mathbf{0} \\
\mathbf{X}'\mathbf{y} &= \mathbf{X}'\mathbf{X}\hat{\beta}
\end{aligned}$$
7.3.1 Projection: the P matrix and the M matrix
The process of dropping the perpendicular from $\mathbf{y}$ into the space spanned by $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ is
an example of a mathematical operation called projection. It is a mapping from any arbitrary
vector $\mathbf{y}$ in $\mathbb{R}^n$ onto its fitted values $\hat{\mathbf{y}}$ which reside within the $k$-dimensional sub-space generated
by the columns of $\mathbf{X}$. The fitted values, of course, are given by $\hat{\mathbf{y}} = \mathbf{X}\hat{\beta}$. Substituting in from
equation 7.3 we get
$$\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
This makes explicit how the fitted values are generated from the $\mathbf{y}$ vector.
The matrix $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is called a projection matrix since it accomplishes the projection
from $\mathbf{y}$ into the space spanned by $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$. It is sufficiently important that it is frequently
given its own name:
$$\mathbf{P}_X = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \tag{7.4}$$
If there is no ambiguity about the regressors it is referred to simply as $\mathbf{P}$. This is sometimes also
called the hat matrix because it puts a hat on the original $\mathbf{y}$ values, i.e.
$$\hat{\mathbf{y}} = \mathbf{P}_X\mathbf{y} \tag{7.5}$$
Note that $\mathbf{P}_X\mathbf{P}_X = \mathbf{P}_X$, i.e. the matrix is idempotent. This makes a lot of sense. If we start
with a point that has already been projected into the space and project it again, it will simply
stay where it is. This is obvious if we look at the example given in Figure 7.1. If we try to drop
the perpendicular from the point $\hat{\mathbf{y}}$ onto the line through $\mathbf{x}$, the point will simply stay where
it is. More generally if we regress our fitted values on the $\mathbf{X}$ matrix, we will just get the fitted
values back. Note also that the matrix is symmetric.
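A small numerical sketch of these properties follows. The data are randomly generated purely for illustration (the seed, dimensions and variable names are assumptions, not part of the text); the code builds the projection matrix of equation 7.4 and checks idempotency, symmetry, and that projecting the fitted values a second time leaves them unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, illustrative data only
n = 6
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one regressor
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T      # projection ("hat") matrix, equation 7.4
y_hat = P @ y                             # fitted values, equation 7.5

is_idempotent = np.allclose(P @ P, P)
is_symmetric = np.allclose(P, P.T)
reprojection_unchanged = np.allclose(P @ y_hat, y_hat)
print(is_idempotent, is_symmetric, reprojection_unchanged)
```

Forming the explicit inverse is fine for a demonstration of this size; for real estimation work a least squares solver is the more usual and numerically safer choice.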
Having shown how to obtain the fitted values from $\mathbf{y}$, it is equally possible to show how we
get the residuals. The OLS residuals are, of course, given by
$$\begin{aligned}
\mathbf{e} &= \mathbf{y} - \hat{\mathbf{y}} \\
&= \mathbf{y} - \mathbf{P}_X\mathbf{y} \\
&= (\mathbf{I} - \mathbf{P}_X)\mathbf{y}
\end{aligned}$$
The matrix $\mathbf{I} - \mathbf{P}_X$, or $\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, is sufficiently important that it merits its own notation:
$$\mathbf{M}_X = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \tag{7.6}$$
This matrix is called the residual maker by Greene (2003) since it shows how the residuals are
created from the original $\mathbf{y}$ vector:
$$\mathbf{e} = \mathbf{M}_X\mathbf{y} \tag{7.7}$$
Again if there is no chance of confusion, we will drop the subscript and talk about the $\mathbf{M}$ matrix.
$\mathbf{M}_X$ is also idempotent, i.e. $\mathbf{M}_X\mathbf{M}_X = \mathbf{M}_X$, and symmetric. In fact $\mathbf{M}_X$ is also a projection
matrix. In this case the projection is onto the space that is orthogonal (i.e. at right angles) to
the space spanned by $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ (for more information see Davidson and MacKinnon 1993).
In Figure 7.1 this space is again a line - from the origin through the point $(e_1, e_2)$. In the more
general case, this space will have dimension $n - k$.
In short there are two operations associated with ordinary least squares:
• The projection $\mathbf{P}$ which creates fitted values $\hat{\mathbf{y}}$ from $\mathbf{y}$
• The projection $\mathbf{M}$ which creates residuals $\mathbf{e}$ from $\mathbf{y}$
Between them these two completely (and uniquely) decompose the $\mathbf{y}$ vector as $\mathbf{y} = \hat{\mathbf{y}} + \mathbf{e}$.
This decomposition is such that these vectors will be at right angles to each other. Indeed there
are two key relationships between the $\mathbf{P}$ and the $\mathbf{M}$ matrix:
• $\mathbf{P} + \mathbf{M} = \mathbf{I}$
• $\mathbf{P}$ and $\mathbf{M}$ annihilate each other, i.e. $\mathbf{PM} = \mathbf{0} = \mathbf{MP}$.
The annihilation property has an interesting interpretation. Read as $\mathbf{MP} = \mathbf{0}$, it says that the residuals
that one would get from regressing the fitted values $\hat{\mathbf{y}}$ on $\mathbf{X}$ will be zero, i.e. the fitted
values can be perfectly explained by the $\mathbf{X}$ variables. Read as $\mathbf{PM} = \mathbf{0}$, it says that the fitted values
one would get by regressing the residuals on $\mathbf{X}$ will also be zero, i.e. the $\mathbf{X}$ variables cannot
explain anything additional about the residuals.
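It is useful to note that $\mathbf{M}_X$ will annihilate the $\mathbf{X}$ matrix also:
$$\begin{aligned}
\mathbf{M}_X\mathbf{X} &= \left[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]\mathbf{X} \\
&= \mathbf{X} - \mathbf{X} \\
&= \mathbf{0}
\end{aligned}$$
We can understand what this means more intuitively if we write the matrix $\mathbf{X}$ in terms of its
column vectors as $\mathbf{X} = [\mathbf{x}_1\ \mathbf{x}_2\ \cdots\ \mathbf{x}_k]$. The condition above simply means that $\mathbf{M}_X\mathbf{x}_j = \mathbf{0}$, for
every $j$. This means that the residuals that we get from regressing $\mathbf{x}_j$ on the $\mathbf{X}$ variables are zero.
This is as it should be, since we can obviously retrieve any combination of the $\mathbf{x}$ variables from
those variables themselves!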
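One implication of this is that
$$\begin{aligned}
\mathbf{e} &= \mathbf{M}_X\mathbf{y} \\
&= \mathbf{M}_X(\mathbf{X}\beta + \epsilon) \\
&= \mathbf{M}_X\mathbf{X}\beta + \mathbf{M}_X\epsilon \\
&= \mathbf{M}_X\epsilon
\end{aligned} \tag{7.8}$$
So there is an immediate connection between the true errors and the residuals through the
$\mathbf{M}_X$ matrix.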
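The relationships between P and M are easy to confirm on simulated data. In the sketch below the "true" coefficients and errors are invented for the purpose (in practice the errors are unobservable), which lets us check equation 7.8 directly alongside the other identities of this section.

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed, illustrative data only
n, k = 8, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])         # hypothetical true coefficients
eps = rng.normal(size=n)                  # hypothetical true errors
y = X @ beta + eps

P = X @ np.linalg.inv(X.T @ X) @ X.T      # equation 7.4
M = np.eye(n) - P                         # residual maker, equation 7.6
e = M @ y                                 # residuals, equation 7.7

checks = {
    "M is idempotent": np.allclose(M @ M, M),
    "M is symmetric": np.allclose(M, M.T),
    "P + M = I": np.allclose(P + M, np.eye(n)),
    "PM = 0": np.allclose(P @ M, 0.0),
    "MX = 0": np.allclose(M @ X, 0.0),
    "e = M eps (eq. 7.8)": np.allclose(e, M @ eps),
    "trace(M) = n - k": int(round(np.trace(M))) == n - k,
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

The trace check reflects the statement that M projects onto a space of dimension n - k: for a projection matrix the trace equals the dimension of the space projected onto.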
7.3.2 Algebraic properties of the Least Squares Solution
We can summarise the numerical properties of the least squares estimator as follows:
1. The OLS estimator is a linear function of the dependent variable
From equation 7.3 it is clear that $\hat{\beta}$ is a function of the sample values $\mathbf{y}$. More particularly
if we define the matrix $\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, then $\hat{\beta} = \mathbf{A}\mathbf{y}$. But this means that $\hat{\beta}$ is a linear
function of the $\mathbf{y}$ vector.
2. The fitted values are a linear function of the dependent variable
This follows immediately from equation 7.5.
3. The residuals are uncorrelated with the explanatory variables
We have shown this in the context of our geometric interpretation of least squares.
4. The residuals are uncorrelated with the fitted values
This follows from the previous point, since the fitted values are just linear combinations of
the $\mathbf{x}$ variables. We can show it formally as follows
$$\begin{aligned}
\mathbf{e}'\hat{\mathbf{y}} &= (\mathbf{M}_X\mathbf{y})'\mathbf{P}_X\mathbf{y} \\
&= \mathbf{y}'\mathbf{M}_X\mathbf{P}_X\mathbf{y} \\
&= 0
\end{aligned}$$
5. The residuals are a linear function of the errors
This follows from equation 7.8.
6. The average of the residuals is zero if there is an intercept in the model
We showed above that
$$\mathbf{e}'\mathbf{X} = \mathbf{0}$$
If there is an intercept in the model, then the first column of $\mathbf{X}$ will be a column of
ones. Let $\iota$ be the column vector of ones, then writing $\mathbf{X}$ in terms of its column vectors
$\mathbf{X} = [\iota\ \mathbf{x}_2\ \cdots\ \mathbf{x}_k]$ we have
$$\mathbf{e}'[\iota\ \mathbf{x}_2\ \cdots\ \mathbf{x}_k] = [\mathbf{e}'\iota\ \ \mathbf{e}'\mathbf{x}_2\ \cdots\ \mathbf{e}'\mathbf{x}_k]$$
The first entry of this $1 \times k$ row vector must be zero, i.e.
$$\mathbf{e}'\iota = 0$$
This however implies that $\sum_{i=1}^{n} e_i = 0$. It follows immediately that the sample average of
the residuals is zero.
7. The average of the fitted values is $\bar{y}$ if there is an intercept in the model
This follows from the previous result. We have, by definition
$$\mathbf{y} = \hat{\mathbf{y}} + \mathbf{e}$$
Consequently, since $\mathbf{e}'\iota = 0$,
$$\mathbf{y}'\iota = \hat{\mathbf{y}}'\iota$$
i.e. $\bar{y} = \bar{\hat{y}}$.
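8. The OLS estimators are invariant to linear transformation of the data
This is an interesting property that we have not, as yet, discussed. Consider the situation
in Figure 7.1 where the $\mathbf{x}$ vector is rescaled. One example of this would be where we
change the units in which we measure $\mathbf{x}$ (e.g. measure them in terms of millions rather
than thousands). A uniform rescaling like this will change the point $(x_1, x_2)$ but will not
change the line through $\mathbf{x}$ since the rescaled point will still lie on the same line! It is clear
that the fitted value $(\hat{y}_1, \hat{y}_2)$ in the diagram will not change.
It turns out that this property generalises: if we transform the $\mathbf{X}$ matrix with any linear
transformation that can be undone, then the fitted values will not be affected. We can
show this formally as follows:
Let us assume that we transform the $\mathbf{X}$ matrix linearly to the matrix $\mathbf{Z}$ where
$$\mathbf{Z} = \mathbf{X}\Gamma$$
where $\Gamma$ is a non-singular matrix of constants. Note that nonsingularity of $\Gamma$ implies
that we haven't thrown away information, i.e. we can always recover the $\mathbf{X}$ variables from
the transformed variables, since $\mathbf{X} = \mathbf{Z}\Gamma^{-1}$. This means
$$\begin{aligned}
\mathbf{y} &= \mathbf{X}\beta + \epsilon \\
&= \mathbf{Z}\Gamma^{-1}\beta + \epsilon \\
&= \mathbf{Z}(\Gamma^{-1}\beta) + \epsilon \\
&= \mathbf{Z}\gamma + \epsilon
\end{aligned}$$
The parameter vector of the new model is therefore given by $\gamma = \Gamma^{-1}\beta$. We will show
that the OLS estimator of $\gamma$ on the transformed data will satisfy $\hat{\gamma} = \Gamma^{-1}\hat{\beta}$.
We have
$$\begin{aligned}
\hat{\gamma} &= (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} \\
&= (\Gamma'\mathbf{X}'\mathbf{X}\Gamma)^{-1}\Gamma'\mathbf{X}'\mathbf{y} \\
&= \Gamma^{-1}(\mathbf{X}'\mathbf{X})^{-1}(\Gamma')^{-1}\Gamma'\mathbf{X}'\mathbf{y} \\
&= \Gamma^{-1}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \\
&= \Gamma^{-1}\hat{\beta}
\end{aligned}$$
Furthermore consider the fate of the fitted values. Prior to the transformation they were
given by
$$\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
After the transformation they are given by:
$$\begin{aligned}
\hat{\mathbf{y}} &= \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} \\
&= \mathbf{X}\Gamma(\Gamma'\mathbf{X}'\mathbf{X}\Gamma)^{-1}\Gamma'\mathbf{X}'\mathbf{y} \\
&= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
\end{aligned}$$
By comparison, it is easy to show that a rescaling of the $\mathbf{y}$ vector will rescale the fitted
values. But again the rescaling will happen in such a way that the underlying interpretation
of the relationship does not change, i.e. if we rescale $\mathbf{y}$, so that $\mathbf{y}^{*} = \lambda\mathbf{y}$, where $\lambda \neq 0$ is
some real scalar, then
$$\mathbf{y}^{*} = \mathbf{X}\beta^{*} + \epsilon^{*}$$
and our estimator of $\beta^{*} = \lambda\beta$ will satisfy $\hat{\beta}^{*} = \lambda\hat{\beta}$, since
$$\begin{aligned}
\hat{\beta}^{*} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}^{*} \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\lambda\mathbf{y} \\
&= \lambda\hat{\beta}
\end{aligned}$$
Consequently $\hat{\mathbf{y}}^{*} = \lambda\hat{\mathbf{y}}$.
7.4 Partitioned regression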
In many cases we are interested in analysing the role of subsets of variables. In particular,
suppose that the regression involves two sets of variables $\mathbf{X}_1$ and $\mathbf{X}_2$ so that
$$\mathbf{y} = \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 + \epsilon$$
We will be interested in investigating the properties of the OLS estimates of $\beta_1$ and $\beta_2$.
7.4.1 The Frisch-Waugh-Lovell Theorem
One extremely important result is contained in the Frisch-Waugh-Lovell Theorem. An easy proof
is provided by the results on projections that we derived above (for the full details see Davidson
and MacKinnon 1993, p.19).
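Theorem 7.1 If we partition the $\mathbf{X}$ matrix so that $\mathbf{X} = [\mathbf{X}_1\ \ \mathbf{X}_2]$ and partition the $\beta$ vector
correspondingly, i.e.
$$\mathbf{y} = \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 + \epsilon$$
then the OLS estimate $\hat{\beta}_2$ is such that
$$\begin{aligned}
\hat{\beta}_2 &= (\mathbf{X}_2'\mathbf{M}_1'\mathbf{M}_1\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{M}_1'\mathbf{M}_1\mathbf{y} \\
&= (\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{M}_1\mathbf{y}
\end{aligned}$$
where $\mathbf{M}_1$ is $\mathbf{M}_{X_1}$, so $\mathbf{M}_1\mathbf{y}$ is the vector of residuals from the regression of $\mathbf{y}$ on $\mathbf{X}_1$ and $\mathbf{M}_1\mathbf{X}_2$
is the matrix of the residuals when regressing each of the column vectors in $\mathbf{X}_2$ on $\mathbf{X}_1$.
This says that we can get the OLS coefficient $\hat{\beta}_2$ in the multiple regression which includes
$\mathbf{X}_1$, by regressing the residuals $\mathbf{M}_1\mathbf{y}$ on the residuals $\mathbf{M}_1\mathbf{X}_2$.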
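Proof. Let us write
$$\mathbf{y} = \mathbf{X}_1\hat{\beta}_1 + \mathbf{X}_2\hat{\beta}_2 + \mathbf{e}$$
then by the argument above $\mathbf{e}$ will be orthogonal to both $\mathbf{X}_1$ and $\mathbf{X}_2$. It follows that $\mathbf{M}_1\mathbf{e} = \mathbf{e}$.
Multiplying through by $\mathbf{M}_1$, and noting that $\mathbf{M}_1\mathbf{X}_1 = \mathbf{0}$, we get
$$\mathbf{M}_1\mathbf{y} = \mathbf{M}_1\mathbf{X}_2\hat{\beta}_2 + \mathbf{e}$$
Multiplying this through by $\mathbf{X}_2'$ we get
$$\mathbf{X}_2'\mathbf{M}_1\mathbf{y} = \mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\hat{\beta}_2$$
since $\mathbf{X}_2'\mathbf{e} = \mathbf{0}$. Solving out for $\hat{\beta}_2$ we get the result. This means that $\hat{\beta}_2$ is the vector of
coefficients that succeeds in minimising the distance between the residual vector $\mathbf{M}_1\mathbf{y}$ and the
space spanned by the columns of $\mathbf{M}_1\mathbf{X}_2$, i.e. it is the vector of OLS coefficients in the regression
of $\mathbf{M}_1\mathbf{y}$ on $\mathbf{M}_1\mathbf{X}_2$.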
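The theorem can be illustrated numerically. The sketch below simulates data with two blocks of regressors (all names, coefficients and the seed are illustrative assumptions), estimates the full multiple regression, and then repeats the estimation of the second block of coefficients by regressing the residuals of y on the residuals of the second block.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)            # arbitrary seed, illustrative data only
n = 50
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2)) + 0.5 * X1[:, [1]]      # X2 is correlated with X1
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([-1.0, 0.5]) + rng.normal(size=n)

# Coefficients on X2 from the full multiple regression
b2_full = ols(np.column_stack([X1, X2]), y)[2:]

# FWL route: purge X1 from both y and X2, then regress residuals on residuals
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
b2_fwl = ols(M1 @ X2, M1 @ y)

print(b2_full)
print(b2_fwl)
```

The two printed vectors agree to numerical precision. Note that regressing the unpurged y on the purged regressors gives the same coefficients, because M1 is symmetric and idempotent.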
7.4.2 Interpretation of the FWL theorem
In essence the theorem says that we can think of a multiple regression coefficient as giving us the
impact of a variable after we have fully taken the impacts of all the other variables into account.
More specifically, it states that if we have more than one explanatory variable, we can get the
multiple regression coefficient on any variable (or group of variables) by the simple expedient of
regressing that variable(s) on all the other explanatory variables and obtaining the residuals $\mathbf{e}_2$.
(This notation is a bit awkward, because $\mathbf{e}_2$ may be a matrix rather than a vector.) Similarly we
regress $\mathbf{y}$ on those other variables and obtain the residuals $\mathbf{e}_1$. The coefficient in the regression of
$\mathbf{e}_1$ on $\mathbf{e}_2$ will be numerically equal to the coefficient in the multiple regression. Figure 7.2 gives
a pictorial representation.
The overall variation of y is represented by the circle (areas 1,2,4,5). After taking the impact
of x1 fully into consideration we are left with the residual variation of the shaded areas 1 and
2. This is the variation of the residuals e1 around their mean. After fully accounting for the
impact of x1 the variable x2 will only contribute the additional information about y given by
area 2. This is the part in the variation of the residuals e1 explained by the residuals e2 - and
it coincides precisely with the impact of x2 on y in the multiple regression, with x2 included as
an additional variable.
In other words, if our regression is
$$\mathbf{y} = \beta_1\mathbf{x}_1 + \beta_2\mathbf{x}_2 + \epsilon$$
then $\beta_2$ is the impact of $\mathbf{x}_2$ on $\mathbf{y}$, once we have purged all of the effects that depend on $\mathbf{x}_1$, i.e.
the direct effect of $\mathbf{x}_1$ on $\mathbf{y}$ and any indirect effects which may work through the impact that $\mathbf{x}_1$
has on $\mathbf{x}_2$.
Of course the result cuts in the opposite direction too - the coefficient on $\mathbf{x}_1$ in the multiple
regression is also the coefficient in the relationship between the residuals of $\mathbf{y}$, after accounting
for $\mathbf{x}_2$, and the residuals of $\mathbf{x}_1$, after controlling for $\mathbf{x}_2$.
This picture may also make it clear why, if $\mathbf{x}_1$ and $\mathbf{x}_2$ are highly correlated (so that the area
labelled 2 in Figure 7.2 is very small), it will be extremely hard to estimate the differential impact
of $\mathbf{x}_2$ on $\mathbf{y}$ with any degree of accuracy. There will simply be too little information on which to
base our estimates. This is referred to as the problem of collinearity which we will discuss later
in this course.
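[Figure 7.2 is a diagram of three overlapping circles labelled "variation in y", "variation in x1" and "variation in x2", with the regions numbered 1 to 6.]
Figure 7.2: The shaded areas labelled 1 and 2 represent the residual variation in $\mathbf{y}$ after $\mathbf{x}_1$
has been taken into account. It is a pictorial representation of $\mathbf{e}_1$. The areas labelled 2 and
3 represent that portion of the variation in $\mathbf{x}_2$ which remains after controlling for $\mathbf{x}_1$, which
represents $\mathbf{e}_2$. The overlap area 2 is the variation in $\mathbf{y}$ explained by $\mathbf{x}_2$ holding $\mathbf{x}_1$ constant. The
Frisch-Waugh-Lovell theorem says that the OLS coefficient $\hat{\beta}_2$ in the multiple regression of $\mathbf{y}$ on
$\mathbf{x}_1$ and $\mathbf{x}_2$ is identical to the coefficient obtained in the regression of $\mathbf{e}_1$ on $\mathbf{e}_2$.
7.4.3 Alternative proof
We have given a proof which uses the properties of the projection matrices $\mathbf{P}$ and $\mathbf{M}$. Greene
(2003, p.26) provides an alternative proof involving a consideration of the normal equations.
Using the partitioned form of the matrix $\mathbf{X}$, these can be written as
$$\begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix}
\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} =
\begin{bmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{bmatrix} \tag{7.9}$$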
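We can solve out for $\hat{\beta}_1$ from the first set of equations:
$$\hat{\beta}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'(\mathbf{y} - \mathbf{X}_2\hat{\beta}_2) \tag{7.10}$$
Substituting this into equation 7.9 and then considering the second set of equations, we will get
$$\mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} - \mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\hat{\beta}_2 + \mathbf{X}_2'\mathbf{X}_2\hat{\beta}_2 = \mathbf{X}_2'\mathbf{y}$$
Rearranging and solving for $\hat{\beta}_2$ we get
$$\begin{aligned}
\hat{\beta}_2 &= \left[\mathbf{X}_2'\left(\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\right)\mathbf{X}_2\right]^{-1}\left[\mathbf{X}_2'\left(\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\right)\mathbf{y}\right] \\
&= [\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2]^{-1}[\mathbf{X}_2'\mathbf{M}_1\mathbf{y}]
\end{aligned}$$
Since $\mathbf{M}_1$ is idempotent this is identical to $[\mathbf{X}_2'\mathbf{M}_1\mathbf{M}_1\mathbf{X}_2]^{-1}[\mathbf{X}_2'\mathbf{M}_1\mathbf{M}_1\mathbf{y}]$. Noting that $\mathbf{M}_1\mathbf{y} = \mathbf{e}_1$
(in the notation of the previous subsection) and $\mathbf{M}_1\mathbf{X}_2 = \mathbf{e}_2$, and that $\mathbf{M}_1$ is symmetric, we get
$$\hat{\beta}_2 = [\mathbf{e}_2'\mathbf{e}_2]^{-1}[\mathbf{e}_2'\mathbf{e}_1]$$
which is what the theorem claims.
7.4.4 Applications of the FWL theorem
The deviations form of OLS
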
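One interesting result that follows immediately from the FWL theorem is that OLS can equivalently be run in deviations form provided that the model contains an intercept. Assume that
the model is
$$\mathbf{y} = \iota\beta_1 + \mathbf{X}_2\beta_2 + \epsilon$$
This means that the first column of the $\mathbf{X}$ matrix is just a column of ones, which we denote by
$\iota$. We find that $\mathbf{e}_1 = \mathbf{M}_{\iota}\mathbf{y}$ is given by
$$\begin{aligned}
\mathbf{M}_{\iota}\mathbf{y} &= \left[\mathbf{I} - \iota(\iota'\iota)^{-1}\iota'\right]\mathbf{y} \\
&= \mathbf{y} - \iota(\iota'\iota)^{-1}\iota'\mathbf{y}
\end{aligned}$$
Now $(\iota'\iota)^{-1} = \frac{1}{n}$ and $\iota'\mathbf{y} = \sum_{i=1}^{n} y_i$. The second term on the right hand side is therefore just $\iota\bar{y}$.
Consequently the vector $\mathbf{M}_{\iota}\mathbf{y}$ is just the vector of deviations of $\mathbf{y}$ values from their mean, i.e.
$$\mathbf{e}_1 = \begin{pmatrix} y_1 - \bar{y} \\ y_2 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{pmatrix}$$
It is clear that there is nothing special about $\mathbf{y}$ here. Any vector when premultiplied by $\mathbf{M}_{\iota}$ will
have its mean removed. The matrix $\mathbf{M}_{\iota}\mathbf{X}_2$ will therefore contain in its columns the deviations
of each of the $\mathbf{x}$ variables from their mean. In short the slope coefficients can all be estimated
from the deviations form of the regression model. The remaining coefficient, i.e. the intercept, can
be retrieved from equation 7.10. Substituting the slope estimates into this equation we find that
$$\hat{\beta}_1 = \bar{y} - \bar{\mathbf{x}}_2'\hat{\beta}_2$$
where $\bar{\mathbf{x}}_2$ is the vector of sample means of the columns of $\mathbf{X}_2$.
This provides an additional neat numerical result: if there is an intercept in the regression
model, then the fitted OLS regression line will go through the point of means $(\bar{y}, \bar{x}_2, \bar{x}_3, \ldots, \bar{x}_k)$.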
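The deviations form is straightforward to verify numerically. The following sketch uses simulated data (the coefficients, seed and variable names are assumptions made for illustration): it constructs the centering matrix, confirms that it removes means, recovers the slopes from the demeaned regression, and then retrieves the intercept from the point of means.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary seed, illustrative data only
n = 40
X2 = rng.normal(loc=5.0, size=(n, 2))
y = 3.0 + X2 @ np.array([1.5, -2.0]) + rng.normal(size=n)

iota = np.ones(n)
b_full = np.linalg.lstsq(np.column_stack([iota, X2]), y, rcond=None)[0]

# M_iota = I - iota (iota'iota)^{-1} iota' removes the mean of any vector
M_iota = np.eye(n) - np.outer(iota, iota) / n
y_dev = M_iota @ y
X2_dev = M_iota @ X2

slopes = np.linalg.lstsq(X2_dev, y_dev, rcond=None)[0]
intercept = y.mean() - X2.mean(axis=0) @ slopes      # fitted line passes through the means

print(b_full)
print(intercept, slopes)
```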
Detrending or seasonally adjusting a data series

The original discovery of the FWL theorem came about in the context where Frisch and Waugh
were dealing with data that contained time trends. They discovered that they would get the
same results whether they first detrended all the variables (by regressing them on a time trend) or
simply included the time trend as an additional regressor.
Lovell rediscovered this result in the context of trying to account for seasonality. The regression results are identical whether you first seasonally adjust all series (by regressing them on
seasonal dummies) or put the dummies into the multiple regression.
Eliminating an observation by means of a dummy variable
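Consider the effects of including the very particular dummy variable $\mathbf{i}_1 = (1, 0, 0, \ldots, 0)'$, i.e. a
variable which is 1 in the first observation and otherwise 0. Our regression model is
$$\mathbf{y} = \mathbf{i}_1\beta_1 + \mathbf{X}_2\beta_2 + \epsilon$$
In this case $\mathbf{M}_1\mathbf{y} = \mathbf{y} - \mathbf{i}_1(\mathbf{i}_1'\mathbf{i}_1)^{-1}\mathbf{i}_1'\mathbf{y}$.
It is straightforward to see that $\mathbf{i}_1(\mathbf{i}_1'\mathbf{i}_1)^{-1}\mathbf{i}_1'\mathbf{y} = (y_1, 0, 0, \ldots, 0)'$.
Consequently
$$\mathbf{M}_1\mathbf{y} = \begin{pmatrix} 0 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 0 \\ \mathbf{y}_{(1)} \end{pmatrix}$$
where $\mathbf{y}_{(1)}$ is the $\mathbf{y}$ vector without its first element. Similarly
$$\mathbf{M}_1\mathbf{X}_2 = \begin{pmatrix} 0 & 0 & \cdots \\ x_{22} & x_{23} & \cdots \\ x_{32} & x_{33} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} = \begin{pmatrix} \mathbf{0}' \\ \mathbf{X}_{2(1)} \end{pmatrix}$$
where $\mathbf{X}_{2(1)}$ is the $\mathbf{X}_2$ matrix without its first row.
By the FWL theorem, the OLS estimate of $\beta_2$ will be given by the regression of $\mathbf{M}_1\mathbf{y}$ on
$\mathbf{M}_1\mathbf{X}_2$, i.e.
$$\begin{aligned}
\hat{\beta}_2 &= (\mathbf{X}_2'\mathbf{M}_1'\mathbf{M}_1\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{M}_1'\mathbf{M}_1\mathbf{y} \\
&= \left(\begin{pmatrix} \mathbf{0} & \mathbf{X}_{2(1)}' \end{pmatrix}\begin{pmatrix} \mathbf{0}' \\ \mathbf{X}_{2(1)} \end{pmatrix}\right)^{-1}\begin{pmatrix} \mathbf{0} & \mathbf{X}_{2(1)}' \end{pmatrix}\begin{pmatrix} 0 \\ \mathbf{y}_{(1)} \end{pmatrix} \\
&= (\mathbf{X}_{2(1)}'\mathbf{X}_{2(1)})^{-1}\mathbf{X}_{2(1)}'\mathbf{y}_{(1)}
\end{aligned}$$
This means that the regression estimate of $\beta_2$ is determined as though the first observation did
not exist. Instead the first observation only determines $\hat{\beta}_1$ through equation 7.10. In fact it is
easy to see that in this case $\hat{\beta}_1$ will be set so that the first observation fits perfectly!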
One way of thinking about this result is that by allowing the first observation to have its
own coefficient, we are in effect allowing it to have an arbitrarily large residual. Note that
the argument is perfectly general - it applies to any dummy variable $\mathbf{i}_j$ which has value 1 for
observation $j$ and zeros otherwise.
This trick of eliminating an observation by including a specific dummy for that observation
is sometimes used in time series analyses, if it is thought that one observation is atypical (e.g.
if it was the year of some major upheaval).
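A quick simulation shows the effect. In the sketch below (hypothetical data, with the first observation deliberately made atypical) the coefficients on the ordinary regressors are the same whether we add a dummy for observation 1 or simply delete that observation, and the first observation is fitted exactly once the dummy is included.

```python
import numpy as np

rng = np.random.default_rng(4)            # arbitrary seed, illustrative data only
n = 20
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X2 @ np.array([2.0, 1.0]) + rng.normal(size=n)
y[0] += 10.0                              # make the first observation atypical

i1 = np.zeros(n)
i1[0] = 1.0                               # dummy: 1 for observation 1, 0 otherwise

# Regression including the dummy for the first observation
b_dummy = np.linalg.lstsq(np.column_stack([i1, X2]), y, rcond=None)[0]
# Regression on the sample with the first observation dropped
b_drop = np.linalg.lstsq(X2[1:], y[1:], rcond=None)[0]

fitted_first = b_dummy[0] + X2[0] @ b_dummy[1:]
print(b_dummy[1:], b_drop)
print(y[0] - fitted_first)                # residual of observation 1 in the dummy regression
```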
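7.4.5 Omitted variable bias
Before leaving the topic of partitioned regression it is useful to note what happens to the OLS
estimates if we don't estimate the full model, but only estimate some of the coefficients. In
particular, let us assume that the DSP is given by
$$\mathbf{y} = \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 + \epsilon$$
and we estimate instead
$$\mathbf{y} = \mathbf{X}_1\delta + \mathbf{v}$$
The OLS estimate of this misspecified regression will have the property that
$$\hat{\delta} = \hat{\beta}_1 + \mathbf{A}\hat{\beta}_2 \tag{7.11}$$
where $\hat{\beta}_1$ and $\hat{\beta}_2$ are the OLS coefficients that we would have obtained in the multiple regression
and $\mathbf{A} = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2$ is the matrix of coefficients obtained by regressing each of the columns in $\mathbf{X}_2$ on $\mathbf{X}_1$.
Proof.
$$\mathbf{y} = \mathbf{X}_1\hat{\beta}_1 + \mathbf{X}_2\hat{\beta}_2 + \mathbf{e}$$
hence, premultiplying by $(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$ and using $\mathbf{X}_1'\mathbf{e} = \mathbf{0}$,
$$\begin{aligned}
(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} &= \hat{\beta}_1 + (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\hat{\beta}_2 \\
\hat{\delta} &= \hat{\beta}_1 + \mathbf{A}\hat{\beta}_2
\end{aligned}$$
Equation 7.11 highlights the simple fact that unless $\hat{\beta}_2 = \mathbf{0}$, or $\mathbf{X}_2$ is orthogonal to $\mathbf{X}_1$, the
OLS estimates of the coefficients of $\mathbf{X}_1$ in the restricted regression will be different from those
in the multiple regression.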
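Equation 7.11 is an exact numerical identity, which the following sketch confirms on simulated data. The variable names (`d_short` for the coefficients of the misspecified regression, `A` for the auxiliary regression coefficients) and all numbers are illustrative assumptions.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(5)            # arbitrary seed, illustrative data only
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.8 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)   # correlated with X1
y = X1 @ np.array([1.0, 2.0]) + 1.5 * X2[:, 0] + rng.normal(size=n)

b = ols(np.column_stack([X1, X2]), y)     # full multiple regression
b1, b2 = b[:2], b[2:]

A = ols(X1, X2)                           # columns of X2 regressed on X1
d_short = ols(X1, y)                      # misspecified regression omitting X2

print(d_short)
print(b1 + A @ b2)                        # identical, as equation 7.11 states
```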
7.5 Goodness of Fit
We have noted above that the fitted values are orthogonal to the residuals. This allows us to
decompose the sum of the squares of the $\mathbf{y}$ values into two components: the Residual Sum of
Squares and the Regression Sum of Squares:
$$\begin{aligned}
\mathbf{y}'\mathbf{y} &= (\hat{\mathbf{y}} + \mathbf{e})'(\hat{\mathbf{y}} + \mathbf{e}) \\
\mathbf{y}'\mathbf{y} &= \hat{\mathbf{y}}'\hat{\mathbf{y}} + \mathbf{e}'\mathbf{e}
\end{aligned} \tag{7.12}$$
This particular decomposition is not used that often, because for many data series, the biggest
contribution to the sum of squares on the left hand side is the mean of the $\mathbf{y}$ values (think of
economic series like GDP!).
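A better measure of how much the explanatory variables have contributed to understanding the
behaviour of $\mathbf{y}$ is to exclude the intercept from consideration. So if our fitted model is
$$\mathbf{y} = \iota\hat{\beta}_1 + \mathbf{X}_2\hat{\beta}_2 + \mathbf{e}$$
we can write the model in deviations form (see section 7.4.4) as
$$\mathbf{M}_{\iota}\mathbf{y} = \mathbf{M}_{\iota}\mathbf{X}_2\hat{\beta}_2 + \mathbf{M}_{\iota}\mathbf{e}$$
We have noted above (it is implied by section 7.3.2) that $\mathbf{M}_{\iota}\mathbf{e} = \mathbf{e}$, so we will write this as
$$\begin{aligned}
\mathbf{y}^{*} &= \mathbf{X}_2^{*}\hat{\beta}_2 + \mathbf{e} \\
&= \hat{\mathbf{y}}^{*} + \mathbf{e}
\end{aligned}$$
where the superscript $*$ indicates that we have centered the variables. The fitted values $\hat{\mathbf{y}}^{*}$ will
still be orthogonal to the residual vector $\mathbf{e}$ (since they derive from the multiple regression of $\mathbf{y}^{*}$
on $\mathbf{X}_2^{*}$). Consequently we can write the decomposition in the form
$$\mathbf{y}^{*\prime}\mathbf{y}^{*} = \hat{\mathbf{y}}^{*\prime}\hat{\mathbf{y}}^{*} + \mathbf{e}'\mathbf{e} \tag{7.13}$$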
The left hand side is the sum of squares of the deviations of the $\mathbf{y}$ values from their mean. This
is sometimes referred to as the variation in $\mathbf{y}$. The first term on the right hand side is the
explained sum of squares, sometimes also called the regression sum of squares or model
sum of squares. The final term is the residual sum of squares, sometimes also called the
error sum of squares.
Regrettably nomenclature in this area is not uniform. What makes this particularly unfortunate is that sometimes the abbreviations have diametrically opposite meanings, i.e. the ESS
and RSS could refer to error sum of squares and regression sum of squares or explained
sum of squares and residual sum of squares! If you need to use one of these terms it is always
advisable to specify first what you intend it to refer to.
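The decomposition in equation 7.13 is the basis for defining the coefficient of determination or $R^2$:
$$R^2 = \frac{\hat{\mathbf{y}}^{*\prime}\hat{\mathbf{y}}^{*}}{\mathbf{y}^{*\prime}\mathbf{y}^{*}} = 1 - \frac{\mathbf{e}'\mathbf{e}}{\mathbf{y}^{*\prime}\mathbf{y}^{*}} \tag{7.14}$$
This is sometimes also called the centered $R^2$. The uncentred version would be based on the
decomposition given in equation 7.12.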
The $R^2$ ranges from zero (when the model explains nothing about $\mathbf{y}$) to one, when it fits
perfectly. Consequently the $R^2$ is frequently used to assess how well a regression seems to fit.
There are several problems with this particular measure:
• Firstly, it is always possible to improve the fit of the regression by including more variables. Indeed, it is always possible to get a perfectly fitting regression if one were to use $n$
regressors!
In order to get around this problem several other measures have been suggested, such as
the adjusted $R^2$. Greene (2003) has a discussion of some of the options.
• Secondly, the size of the $R^2$ depends on the nature of the $\mathbf{y}$ variable. If the $\mathbf{y}$ variable
is transformed (e.g. by taking logarithms), the total variation in $\mathbf{y}$ changes and with it
the $R^2$. The $R^2$ cannot really be used to compare models that have different dependent
variables.
• Thirdly, there are some domains in research in which it is almost impossible to reduce the
intrinsic noise that is coming from $\epsilon$. A regression with an $R^2$ of 0.5 does not necessarily
fit badly if it is estimating certain kinds of labour market outcomes. In fact, regressions
with too high an $R^2$ often need to be treated with extreme caution.
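As an illustration of equations 7.12 to 7.14, the sketch below computes the centered and uncentred versions of the measure for the hypothetical data set of Table 7.1, regressing y on an intercept and x. The variable names are our own; only the data come from the text.

```python
import numpy as np

# Data from Table 7.1 (hypothetical data set)
x = np.array([7, 8, 19, 12, 16, 2, 16, 5, 9, 3, 5, 1, 3, 6, 13], dtype=float)
y = np.array([12.4, 13.2, 20.2, 14, 19.2, 7, 20, 8.4, 13.6,
              6.8, 11.8, 5.6, 7.4, 9.4, 17.4])

X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ b
e = y - y_hat

y_c = y - y.mean()                        # centered y
y_hat_c = y_hat - y_hat.mean()            # centered fitted values

r2_explained = (y_hat_c @ y_hat_c) / (y_c @ y_c)     # first form in equation 7.14
r2_residual = 1.0 - (e @ e) / (y_c @ y_c)            # second form in equation 7.14
r2_uncentred = 1.0 - (e @ e) / (y @ y)               # based on equation 7.12

# Roughly 0.949 for both centered forms and 0.993 for the uncentred one
print(r2_explained, r2_residual, r2_uncentred)
```

The uncentred figure is noticeably higher because the mean of y accounts for most of the raw sum of squares, which is the point made about equation 7.12.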
7.6 Exercises
1. Consider the formulae given in the appendix, equations 7.15 and 7.16.
(a) Verify that these expressions do, indeed, represent the OLS estimators.
(b) Prove that these values uniquely minimise the sum of squares.
2. Consider the data given in the appendix, table 7.1. Rewrite the information on the explanatory variable(s) in standard matrix form as the $\mathbf{X}$ matrix. Calculate $\mathbf{X}'\mathbf{X}$ and $(\mathbf{X}'\mathbf{X})^{-1}$
and $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. Verify that this provides the same set of estimates as supplied in the
appendix.
3. Regress the residuals obtained from this expression on $\mathbf{X}$. Verify that the OLS coefficients
are all zero.
4. (Greene 2003, p.39, Exercise 2) Suppose that $\mathbf{b}$ is the least squares coefficient vector in the
regression of $\mathbf{y}$ on $\mathbf{X}$ and that $\mathbf{c}$ is any other $k \times 1$ vector. Prove that the difference in the
two sums of squared residuals is
$$(\mathbf{y} - \mathbf{X}\mathbf{c})'(\mathbf{y} - \mathbf{X}\mathbf{c}) - (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b}) = (\mathbf{c} - \mathbf{b})'\mathbf{X}'\mathbf{X}(\mathbf{c} - \mathbf{b})$$
Prove that this difference is positive.

Observation      x       y
     1           7     12.4
     2           8     13.2
     3          19     20.2
     4          12     14
     5          16     19.2
     6           2      7
     7          16     20
     8           5      8.4
     9           9     13.6
    10           3      6.8
    11           5     11.8
    12           1      5.6
    13           3      7.4
    14           6      9.4
    15          13     17.4
Table 7.1: Hypothetical data set
7.7 Appendix: A worked example
The principle of ordinary least squares is fairly easy to explain. At one level one can view OLS
as simply a method for trying to fit a straight line to a set of points. To make these points more
concrete, consider the hypothetical data set contained in Table 7.1.
Figure 7.3 presents a scatterplot of the variable $y$ against the variable $x$. As the plotted points
show, there seems to be a linear relationship between these variables. On the diagram we have
arbitrarily drawn a line through these points, with the equation of the line given by $\hat{y} = 8 + 0.6x$.
Given any such line, we can use it to predict the value of $y$ that we would expect, given
a particular value of $x$. Such predictions are called fitted values and they are indicated by the
hat over the variable, i.e. $\hat{y}_i$ is the predicted value for $y_i$ corresponding to the equation given
above. In the case indicated on the diagram $i = 8$ and so $x_8 = 5$ and $y_8 = 8.4$. According to the
equation we therefore have $\hat{y}_8 = 11$.
The difference between the actual value and the fitted value is known as the residual and is
indicated in the diagram above by $e_i$, i.e. $e_i = y_i - \hat{y}_i$.
In the specific case above, we have $e_8 = 8.4 - 11 = -2.6$.
Of course with a different line we would get very different fitted values and residuals. In
Figure 7.4 we have used the equation $\hat{y} = 4 + x$ and as is immediately obvious, both the fitted
value and the residual (or error) have changed. We now have $\hat{y}_8 = 9$ and $e_8 = -0.6$. The absolute
value of the error in this case is much smaller. From this perspective we might be tempted to
conclude that the second line is better than the first one. Note, however, that for observation
$i = 11$ we have $x_{11} = 5$ and $y_{11} = 11.8$. The errors for the two cases are therefore given by
$e_{11} = 11.8 - 11 = 0.8$ for the first line and $e_{11} = 11.8 - 9 = 2.8$ for the second line.
In general we will not be concerned with trying to fit the line to a particular point on the
scatter diagram, but we would like to find that line that in some sense minimises the aggregate
error over all the observations.
The OLS criterion is a rule which specifies how we should measure the aggregate error. It
stipulates that the best fitting line is the one that minimises the Residual Sum of Squares or
RSS, where this is defined as
$$RSS = \sum_{i=1}^{n} e_i^2$$
[Figure 7.3 is a scatterplot of y against x with the line y_i = 8 + 0.6 x_i; the fitted value and the residual e_i are marked for one observation.]
Figure 7.3: Given a particular line we can use it to define a predicted (or fitted) value for $y$. The
difference between the actual and the fitted value is known as the residual.
[Figure 7.4 is the same scatterplot with the line y_i = 4 + x_i.]
Figure 7.4: With a different line, we get different fitted values and different residuals.

                        Line 1                      Line 2
   x      y       ŷ        e       e²         ŷ       e       e²
   7    12.4    12.2      0.2     0.04        11      1.4     1.96
   8    13.2    12.8      0.4     0.16        12      1.2     1.44
  19    20.2    19.4      0.8     0.64        23     -2.8     7.84
  12    14      15.2     -1.2     1.44        16     -2       4
  16    19.2    17.6      1.6     2.56        20     -0.8     0.64
   2     7       9.2     -2.2     4.84         6      1       1
  16    20      17.6      2.4     5.76        20      0       0
   5     8.4    11       -2.6     6.76         9     -0.6     0.36
   9    13.6    13.4      0.2     0.04        13      0.6     0.36
   3     6.8     9.8     -3       9            7     -0.2     0.04
   5    11.8    11        0.8     0.64         9      2.8     7.84
   1     5.6     8.6     -3       9            5      0.6     0.36
   3     7.4     9.8     -2.4     5.76         7      0.4     0.16
   6     9.4    11.6     -2.2     4.84        10     -0.6     0.36
  13    17.4    15.8      1.6     2.56        17      0.4     0.16
                         RSS =   54.04               RSS =   26.52
Table 7.2: Different lines give different Residual Sums of Squares
Note that this criterion does not in itself tell us how to find that line. It only gives us the
criterion according to which we can decide which one of many different possible lines should
count as the best fitting one.
As an example of the application of the OLS criterion, let us investigate which of the two lines
that we considered above has the smallest RSS. In table 7.2 we have summarised the original
pairs of observations, together with the fitted values corresponding to the two lines, the residuals
and the square of the residuals. When the squares of the residuals are added up, we get RSS
values of 54.04 and 26.52 respectively. According to the OLS criterion, therefore, the second line
gives a better fit than the first line.
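The comparison of the two lines can be reproduced in a few lines of code. The sketch below takes the data from Table 7.1 and evaluates the RSS of any candidate line; the helper name `rss` is our own.

```python
import numpy as np

# Data from Table 7.1 (hypothetical data set)
x = np.array([7, 8, 19, 12, 16, 2, 16, 5, 9, 3, 5, 1, 3, 6, 13], dtype=float)
y = np.array([12.4, 13.2, 20.2, 14, 19.2, 7, 20, 8.4, 13.6,
              6.8, 11.8, 5.6, 7.4, 9.4, 17.4])

def rss(intercept, slope):
    """Residual sum of squares of the line y = intercept + slope * x."""
    e = y - (intercept + slope * x)
    return float(e @ e)

rss_line1 = rss(8.0, 0.6)                 # line 1: y = 8 + 0.6x
rss_line2 = rss(4.0, 1.0)                 # line 2: y = 4 + x
print(round(rss_line1, 2), round(rss_line2, 2))   # 54.04 26.52
```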
The problem with proceeding in this way is that there are infinitely many lines that we might
try to fit to the data. Fortunately for the linear case (and, indeed, for the polynomial case more
generally) it is possible to derive a formula for the equation of the line that is guaranteed to
produce a lower RSS than any other line.
The OLS formulae for the optimal values of $\alpha$ and $\beta$ in the equation $y = \alpha + \beta x$ are given by
$$\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{7.15}$$
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} \tag{7.16}$$
where $\bar{x}$ is the sample mean of the $x$ values and $\bar{y}$ is the sample mean of the $y$ values. Applied
to the data above, we can calculate the best fitting line according to the OLS formula as in
table 7.3 below. The middle panel provides the necessary calculations which lead to the OLS
estimates of $\hat{\beta} = 0.863$ and $\hat{\alpha} = 5.231$, as given in the bottom-most panel. The right-most panel
then calculates the fitted values and residuals based on the line $\hat{y} = 5.231 + 0.863x$. Note that
the RSS associated with this line is, indeed, lower than the RSS of the other two lines considered
above.

   x      y      x - x̄     y - ȳ   (x - x̄)(y - ȳ)  (x - x̄)²      ŷ        e       e²
   7    12.4    -1.333    -0.027       0.036         1.778     11.275    1.125   1.265
   8    13.2    -0.333     0.773      -0.258         0.111     12.139    1.061   1.126
  19    20.2    10.667     7.773      82.916       113.778     21.637   -1.437   2.066
  12    14       3.667     1.573       5.769        13.444     15.593   -1.593   2.537
  16    19.2     7.667     6.773      51.929        58.778     19.047    0.153   0.023
   2     7      -6.333    -5.427      34.369        40.111      6.958    0.042   0.002
  16    20       7.667     7.573      58.062        58.778     19.047    0.953   0.909
   5     8.4    -3.333    -4.027      13.422        11.111      9.548   -1.148   1.319
   9    13.6     0.667     1.173       0.782         0.444     13.002    0.598   0.357
   3     6.8    -5.333    -5.627      30.009        28.444      7.821   -1.021   1.043
   5    11.8    -3.333    -0.627       2.089        11.111      9.548    2.252   5.070
   1     5.6    -7.333    -6.827      50.062        53.778      6.094   -0.494   0.244
   3     7.4    -5.333    -5.027      26.809        28.444      7.821   -0.421   0.178
   6     9.4    -2.333    -3.027       7.062         5.444     10.412   -1.012   1.024
  13    17.4     4.667     4.973      23.209        21.778     16.456    0.944   0.891
 x̄ = 8.333   ȳ = 12.427      Sums:   386.267       447.333              RSS =   18.053
 β̂ = 386.267 / 447.333 = 0.863489
 α̂ = 12.427 - (0.863)(8.333) = 5.236 with the rounded slope; carrying the unrounded slope 0.863489 gives the value 5.231 used in the text
Table 7.3: Calculating the OLS regression coefficients
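The calculations in Table 7.3 can be reproduced directly from equations 7.15 and 7.16. The sketch below does so for the Table 7.1 data and cross-checks the answer against the matrix formula of the previous sections; the variable names are our own.

```python
import numpy as np

# Data from Table 7.1 (hypothetical data set)
x = np.array([7, 8, 19, 12, 16, 2, 16, 5, 9, 3, 5, 1, 3, 6, 13], dtype=float)
y = np.array([12.4, 13.2, 20.2, 14, 19.2, 7, 20, 8.4, 13.6,
              6.8, 11.8, 5.6, 7.4, 9.4, 17.4])

x_bar, y_bar = x.mean(), y.mean()
s_xy = np.sum((x - x_bar) * (y - y_bar))  # numerator of equation 7.15
s_xx = np.sum((x - x_bar) ** 2)           # denominator of equation 7.15

beta_hat = s_xy / s_xx                    # equation 7.15
alpha_hat = y_bar - beta_hat * x_bar      # equation 7.16

e = y - (alpha_hat + beta_hat * x)
rss_ols = float(e @ e)

# Cross-check with the matrix solution (X'X)^{-1} X'y
X = np.column_stack([np.ones_like(x), x])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)

print(round(beta_hat, 3), round(alpha_hat, 3), round(rss_ols, 3))
```

The RSS of roughly 18.05 is below the values of 54.04 and 26.52 obtained for the two arbitrary lines, as the text notes.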
The case above is the simplest one, where there is one independent variable and one
dependent variable and we fit a straight line. It is possible, however, to provide formulae for
the solutions to the least squares problem for whole classes of more complex functions. What
is required, however, is that the parameters of the function (in the case above $\alpha$ and $\beta$) enter it
linearly. Examples of functions that are linear in the parameters are:
• polynomials: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$
• hyperplanes: $y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k$ (where $x_2, \ldots, x_k$ are different variables)
Even functions that on the surface appear to be nonlinear can often be written in a form
where they become linear in the parameters:
• hyperbola: $y = \beta_1 + \beta_2\frac{1}{x}$, i.e. $y = \beta_1 + \beta_2 x^{*}$ where $x^{*} = \frac{1}{x}$
• exponential: $y = \alpha x^{\beta}$, i.e. $\log y = \log\alpha + \beta\log x$. This is linear if we let $y^{*} = \log y$, $\beta_1 = \log\alpha$
and $x^{*} = \log x$
The equation of the hyperplane is the most general formulation of the linear model. It can,
for example, encompass the polynomial model provided that we let $x_2 = x$, $x_3 = x^2$, \ldots, $x_k = x^{k-1}$.
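To see the linearisation at work, the sketch below generates noise-free data from the last functional form in the list, with made-up parameter values, and recovers the parameters by running OLS on the logged variables. Because there is no error term here the recovery is exact up to rounding; with real data the fit would of course only be approximate.

```python
import numpy as np

# Hypothetical parameters for y = alpha * x**beta (illustrative values)
alpha, beta = 2.0, 0.7
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = alpha * x ** beta                     # noise-free data

# Transformed model: log y = log(alpha) + beta * log x, linear in (log alpha, beta)
X_star = np.column_stack([np.ones_like(x), np.log(x)])
y_star = np.log(y)
b1_hat, beta_hat = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
alpha_hat = np.exp(b1_hat)                # undo the log on the intercept

print(alpha_hat, beta_hat)
```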
Chapter 8
Properties of the OLS estimators in finite samples
8.1 Introduction
In the last chapter we saw one motivation for OLS - the procedure amounts to minimising the
length of the residual vector, i.e. it makes the fitted values $\hat{\mathbf{y}}$ as close to the actual values
$\mathbf{y}$ as possible. This is, however, a purely geometric consideration. In this chapter we will be
considering the statistical motivations behind OLS. In order to do this we need to specify the
type of DSPs that we assume have generated the data at our disposal. For the moment we will
make the following assumptions (compare with table 6.1):
1. Y: We assume that Y is a univariate variable with continuous, unlimited range.
2. : The function is linear in X and , additive in
3. X: The X variables are nonstochastic. In section 8.5 we allow the regressors to be stochastic
as long as they are exogenous.
4. $\beta$: The parameters $\beta$ are fixed.
5. $\varepsilon$: The disturbances are independent and identically distributed, with $E(\varepsilon|X) = 0$,
$\mathrm{Var}(\varepsilon|X) = \sigma^2 I$
6. (e|X ): The distribution of the error terms is left unspecified. In section 8.6 below we
will restrict the analysis to normally distributed errors.
7. : The parameter space is unrestricted.
The model can therefore be written as
$$Y = X\beta + \varepsilon, \qquad E(\varepsilon|X) = 0, \quad \mathrm{Var}(\varepsilon|X) = \sigma^2 I \tag{8.1}$$
Note that we will need to assume that the model is identified, i.e. X has full column rank.

8.2 Motivations for OLS
In this section we will consider some of the statistical reasons that make OLS attractive. These
reasons fall into two types:
• Reasons why minimising the residual sum of squares may seem like the logical thing to
do. Many of these turn out to be variants of letting the sample mimic the population
relationships.
• Reasons why minimising the residual sum of squares is the optimal thing to do, particularly
when compared to other types of estimators. Prominent among these reasons is that OLS
will turn out to be efficient in certain categories of estimators. Establishing those optimality
properties will take up the bulk of the chapter.
8.2.1 Method of moments
The model specifies that the X variables must be uncorrelated with the vector of disturbances
$\varepsilon$. The condition $E(\varepsilon|X) = 0$ implies the population moment condition
$$E(X'\varepsilon) = 0$$
Now we saw that one of the fundamental characteristics of OLS estimation is that
$$X'e = 0$$
We can write this equivalently in the form
$$\frac{1}{n} X'e = 0$$
We can think of this as a sample analogue of the population moment equation. It says that
the average sample cross-product between each regressor and the residuals must be zero. As we saw,
the sample equation $X'e = 0$ leads directly to the normal equations, once we substitute in
$e = y - X\hat{\beta}$. As is shown in the
section of the course dealing with GMM estimation, such equations generally lead to consistent
estimators.
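The sample moment condition can be checked numerically. The sketch below uses simulated data (the design, the coefficient values and the variable names are illustrative assumptions) and confirms that the OLS residuals are orthogonal to every column of X up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Solve the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

moment = X.T @ e / n      # sample analogue of E(X' eps) = 0
print(moment)
```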
8.2.2 Minimum Variance Linear Unbiased Estimation
We will show in section 8.4 that the least squares estimator is the minimum variance linear
unbiased estimator.

8.2.3 Maximum likelihood estimation
With the addition of the normality assumption, we will show in Section 8.6 that the least squares
estimator is also the maximum likelihood estimator. Furthermore it will be the minimum variance
unbiased estimator.
8.3 The mean and covariance matrix of the OLS estimator

8.3.1 Unbiased Estimation
It is easy to show that the least squares estimator is unbiased:
$$\begin{aligned}
\hat{\beta} &= (X'X)^{-1}X'y \\
&= (X'X)^{-1}X'(X\beta + \varepsilon) \\
&= \beta + (X'X)^{-1}X'\varepsilon
\end{aligned} \tag{8.2}$$
Consequently
$$E[\hat{\beta}|X] = \beta + (X'X)^{-1}X'E[\varepsilon|X] = \beta$$
And
$$E[\hat{\beta}] = E_X\left[E(\hat{\beta}|X)\right] = E_X[\beta] = \beta$$

8.3.2 The Covariance matrix of $\hat{\beta}$
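We can rewrite equation 8.2 in the form
$$\hat{\beta} - \beta = (X'X)^{-1}X'\varepsilon$$
It follows that the covariance matrix of $\hat{\beta}$ (conditional on X) is given by
$$\begin{aligned}
\mathrm{Var}(\hat{\beta}|X) &= E\left[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'|X\right] \\
&= E\left\{\left[(X'X)^{-1}X'\varepsilon\right]\left[(X'X)^{-1}X'\varepsilon\right]'|X\right\} \\
&= E\left\{(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X\right\} \\
&= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)X(X'X)^{-1} \\
&= (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} \\
&= \sigma^2 (X'X)^{-1}
\end{aligned}$$
Note that we have used the assumption of homoscedasticity and zero autocorrelation to derive
this result.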
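A small Monte Carlo experiment can make these two results concrete. The sketch below holds X fixed, redraws the disturbances many times, and compares the average and the sample covariance of the OLS estimates with the true parameter vector and with σ²(X'X)⁻¹. The design, the parameter values and the number of replications are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 40, 1.5
beta = np.array([1.0, -2.0])
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])  # held fixed across replications
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20000
eps = rng.normal(0.0, sigma, size=(reps, n))
Y = X @ beta + eps                   # each row is one simulated sample
B = Y @ X @ XtX_inv                  # each row is the OLS estimate for one sample

mean_beta_hat = B.mean(axis=0)       # should be close to beta
cov_beta_hat = np.cov(B, rowvar=False)
theory_cov = sigma**2 * XtX_inv      # sigma^2 (X'X)^{-1}
print(mean_beta_hat)
print(cov_beta_hat)
print(theory_cov)
```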
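8.3.3 Estimating $\sigma^2$
To estimate this covariance matrix, we require an estimator of $\sigma^2$. Since $\sigma^2$ is the common
variance of the $\varepsilon_i$, it would seem sensible to base the estimator on its sample analogue:
the variance of the $e_i$. We could simply take the sample variance of the $e_i$, i.e.
$\frac{1}{n}\sum_i e_i^2$, but this would
give us a biased estimator, as we can show:
$$E\left(\sum_i e_i^2\right) = E(e'e) = E(\varepsilon' M_X \varepsilon)$$
where I have substituted in a result from the last chapter ($e = M_X\varepsilon$, with $M_X = I - X(X'X)^{-1}X'$) and made use of the fact that $M_X$ is
symmetric and idempotent.
This does not seem to have got us much further. At this point we use a trick involving the
trace of a square matrix (a discussion of the trace is in the appendix). In particular we use
the fact that the trace of a scalar is just that scalar, as well as the properties $\mathrm{tr}(AB) = \mathrm{tr}(BA)$
(where $AB$ and $BA$ are both square) and $E(\mathrm{tr}(A)) = \mathrm{tr}(E(A))$.
Applying these we have
$$\begin{aligned}
E(\varepsilon' M_X \varepsilon) &= E(\mathrm{tr}(\varepsilon' M_X \varepsilon)) \\
&= E(\mathrm{tr}(\varepsilon\varepsilon' M_X)) \\
&= \mathrm{tr}(E(\varepsilon\varepsilon' M_X)) \\
&= \mathrm{tr}(\sigma^2 I M_X) \\
&= \sigma^2\, \mathrm{tr}(M_X) \\
&= \sigma^2\, \mathrm{tr}\left[I_n - X(X'X)^{-1}X'\right] \\
&= \sigma^2 \left\{\mathrm{tr}(I_n) - \mathrm{tr}\left(X(X'X)^{-1}X'\right)\right\} \\
&= \sigma^2 \left\{n - \mathrm{tr}\left(X'X(X'X)^{-1}\right)\right\} \\
&= \sigma^2 \left\{n - \mathrm{tr}(I_k)\right\} \\
&= \sigma^2 (n - k)
\end{aligned}$$
In short we have shown that
$$E(e'e) = (n - k)\sigma^2$$
An unbiased estimator of $\sigma^2$ is therefore given by
$$\hat{\sigma}^2 = \frac{e'e}{n - k} \tag{8.3}$$

8.4 Gauss-Markov Theorem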
The Gauss-Markov theorem is of fundamental importance in econometrics. It states that under
the assumptions of the Classical Linear Regression Model, OLS is optimal. In particular it shows
that the OLS estimator $\hat{\beta}$ is the best linear unbiased estimator (BLUE) of $\beta$, i.e. the covariance
matrix of any other linear unbiased estimator will differ from that of $\hat{\beta}$ by a positive semidefinite
matrix.
To prove this, let $\tilde{\beta} = Cy$ be another linear unbiased estimator. We will consider the difference
$$\begin{aligned}
\tilde{\beta} - \hat{\beta} &= Cy - (X'X)^{-1}X'y \\
&= \left[C - (X'X)^{-1}X'\right]y \\
&= Dy
\end{aligned}$$
where $D = C - (X'X)^{-1}X'$. This means that
$$\tilde{\beta} = \hat{\beta} + Dy = (X'X)^{-1}X'y + Dy \tag{8.4}$$
We assume that the true model is given by a special case of the DSP given in equation 8.1, with
$\beta = \beta_0$ and $\sigma^2 = \sigma_0^2$. Substituting in $y = X\beta_0 + \varepsilon$ we get
$$\tilde{\beta} = \beta_0 + DX\beta_0 + (X'X)^{-1}X'\varepsilon + D\varepsilon$$
It is clear that $\tilde{\beta}$ can be unbiased only if $DX\beta_0 = 0$. This can be guaranteed for all values of
$\beta_0$ only if $DX = 0$. It follows that $Dy = D\varepsilon$ has mean zero and that $\hat{\beta}$ and $Dy$ have zero covariance:
$$\begin{aligned}
E\left[(\hat{\beta} - \beta_0)(Dy)'\right] &= E\left[(X'X)^{-1}X'\varepsilon\varepsilon'D'\right] \\
&= \sigma_0^2 (X'X)^{-1}X'D' \\
&= 0
\end{aligned}$$
Note this result again uses the assumption of homoscedasticity and zero autocorrelation.
Equation 8.4 therefore says that the unbiased linear estimator $\tilde{\beta}$ is the sum of the OLS
estimator plus a random component $Dy$ with which it is uncorrelated (Davidson and MacKinnon
1993, p.159). It turns out that this is true more generally:
    Asymptotically, an inefficient estimator is always equal to an efficient estimator
    plus an independent random noise term. (Davidson and MacKinnon 1993, p.159)
It follows that
$$E\left[(\tilde{\beta} - \beta_0)(\tilde{\beta} - \beta_0)'\right] = \sigma_0^2 (X'X)^{-1} + \sigma_0^2 DD'$$
So the difference between the covariance matrices of $\tilde{\beta}$ and $\hat{\beta}$ is $\sigma_0^2 DD'$, which is a positive
semidefinite matrix.
Note: In order to derive the Gauss-Markov theorem we only used the properties $E(\varepsilon|X) = 0$
and $\mathrm{Var}(\varepsilon|X) = \sigma^2 I$. We did not use the assumption that the errors were independently and
identically distributed, nor did we need to make any particular distributional assumptions about
$\varepsilon$.
Corollary 8.1 If $\hat{\beta}$ is the vector of OLS estimates and $\tilde{\beta}$ is any other linear unbiased estimator
of $\beta$, then $\mathrm{Var}(a'\hat{\beta}) \leq \mathrm{Var}(a'\tilde{\beta})$ for any linear combination $a$ of the estimates.
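The construction used in the proof can be reproduced numerically. The sketch below builds an alternative linear unbiased estimator Cy by adding to the OLS weights a matrix D with DX = 0, and checks that the difference between the two covariance matrices equals σ²DD' and has no negative eigenvalues. The way D is generated (an arbitrary matrix `G` multiplied by the residual maker) is an assumption made for illustration; any D with DX = 0 would do.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 30, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
XtX_inv = np.linalg.inv(X.T @ X)
M = np.eye(n) - X @ XtX_inv @ X.T      # residual maker, satisfies M X = 0

G = 0.1 * rng.normal(size=(k, n))      # arbitrary (hypothetical) weights
D = G @ M                              # guarantees D X = 0
C = XtX_inv @ X.T + D                  # weights of the alternative estimator C y

cov_ols = sigma2 * XtX_inv             # Var(beta_hat | X)
cov_alt = sigma2 * (C @ C.T)           # Var(C y | X) = sigma^2 C C'
diff = cov_alt - cov_ols               # should equal sigma^2 D D'

eigvals = np.linalg.eigvalsh((diff + diff.T) / 2)
print(eigvals)
```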
8.5 Stochastic, but exogenous regressors
In the discussion thus far we have made the assumption that the X matrix was nonstochastic,
i.e. we would be able to fix it in repeated sampling. In practice this is almost never the case. We
therefore want to consider how the results above change if we allow X to be stochastic, but
independent of $\varepsilon$. It turns out that most of the results go through, provided that we condition on
X first. In fact in many of the derivations given above we have already done this conditioning,
which would be unnecessary if X was nonstochastic.
8.5.1 Lack of bias
This proof goes through precisely as before.

8.5.2 The covariance matrix of $\hat{\beta}$
We have shown that
$$\mathrm{Var}(\hat{\beta}|X) = \sigma^2 (X'X)^{-1}$$
If X was nonstochastic, this would be the unconditional covariance matrix too. If X is stochastic
we can make use of the variance decomposition formula (Theorem 2.10):
$$\mathrm{Var}(\hat{\beta}) = \mathrm{Var}_X\left[E(\hat{\beta}|X)\right] + E_X\left[\mathrm{Var}(\hat{\beta}|X)\right]$$
The first term on the right hand side is zero, since $E(\hat{\beta}|X) = \beta$ for all values of X. Consequently
$$\mathrm{Var}(\hat{\beta}) = E_X\left[\sigma^2 (X'X)^{-1}\right] = \sigma^2 E_X\left[(X'X)^{-1}\right]$$

8.5.3 The estimator of $\sigma^2$
Our previous derivation will go through, provided that we condition on X initially. Since
$$E(e'e|X) = (n - k)\sigma^2$$
does not depend on X, the unconditional expectation must be $(n - k)\sigma^2$ also.
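The claim that E(e'e) = (n − k)σ² continues to hold when X is redrawn in every sample can be checked by simulation. The sketch below is only an illustration: the regressors are drawn from a standard normal distribution independently of the errors, and the sample size, parameter values and number of replications are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma = 25, 3, 2.0
beta = np.array([1.0, 0.5, -0.5])

reps = 5000
rss = np.empty(reps)
for r in range(reps):
    # X is stochastic: a fresh design matrix is drawn in every replication
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    y = X @ beta + rng.normal(0.0, sigma, size=n)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    rss[r] = e @ e

mean_rss = rss.mean()
expected = (n - k) * sigma**2          # (n - k) sigma^2
print(mean_rss, expected)
```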
8.5.4 Gauss-Markov Theorem
As Greene (2003) notes, the Gauss-Markov theorem also goes through in this case. For any given
X we will have that the OLS estimator will be the BLUE of $\beta$. It follows that the OLS estimator
must be more efficient than any other linear unbiased estimator for all X.
8.6 The normal linear regression model
We will now explicitly consider the case where $\varepsilon$ is distributed normally, i.e. we now make the
assumption that
$$\varepsilon|X \sim N(0, \sigma^2 I) \tag{8.5}$$
It follows from this that
$$y|X \sim N(X\beta, \sigma^2 I) \tag{8.6}$$
Here we have used the fundamental result from section 8.8.2 that a linear function of a vector of
normal random variables is itself normal. Because the random variables $y_i$ are jointly normal, the fact that they are uncorrelated
with each other implies that they are statistically independent of each other, and each is
distributed normally, i.e.
$$y_i|x_i \sim N(x_i\beta, \sigma^2) \tag{8.7}$$
where $x_i$ is the $i$-th row of the matrix X, i.e. it is the row vector of explanatory variables
corresponding to observation $i$.
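8.6.1 Finite sample distribution of the OLS estimators

The distribution of $\hat{\beta}$
Since $\hat{\beta} = (X'X)^{-1}X'y$, we can use the conditional distribution of y given above to derive the
distribution of $\hat{\beta}$ (again using section 8.8.2):
$$\hat{\beta}|X \sim N\left(\beta, \sigma^2 (X'X)^{-1}\right)$$

The distribution of $\hat{\sigma}^2$
Furthermore we observe that
$$\frac{1}{\sigma}\varepsilon \sim N(0, I_n)$$
Applying a result on quadratic forms listed in section 8.8.2 it follows that
$$\frac{1}{\sigma^2}\varepsilon' M_X \varepsilon \sim \chi^2(n - k)$$
since $\mathrm{rank}(M_X) = n - k$ (for an idempotent matrix the rank equals the trace, and we showed that
$\mathrm{tr}(M_X) = n - k$ in section 8.3.3 above). Consequently
$$\frac{1}{\sigma^2} e'e \sim \chi^2(n - k), \qquad \text{i.e.} \qquad \frac{(n - k)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n - k)$$
Since the variance of a $\chi^2(n - k)$ distribution is $2(n - k)$, it follows from this that
$$\frac{(n - k)^2}{\sigma^4}\mathrm{Var}(\hat{\sigma}^2) = 2(n - k), \qquad \text{i.e.} \qquad \mathrm{Var}(\hat{\sigma}^2) = \frac{2\sigma^4}{n - k}$$

The independence of $\hat{\beta}$ and $\hat{\sigma}^2$
Given the numerical properties of OLS, it follows fairly readily that $\hat{\beta}$ and the residual vector $e$
are uncorrelated. We have
$$\begin{aligned}
E\left[(\hat{\beta} - \beta)e'\right] &= E\left[(X'X)^{-1}X'\varepsilon\varepsilon' M_X\right] \\
&= \sigma^2 (X'X)^{-1}X' I M_X \\
&= 0
\end{aligned}$$
since $X'M_X = 0$. Now $\hat{\beta}$ is normal and $e$ is normal [1] so it follows that $\hat{\beta}$ and $e$ must be statistically
independent of each other. So it stands to reason that $\hat{\beta}$ and $\hat{\sigma}^2$ should be independent of each
other.
We can show this more precisely by using a result on quadratic forms, given in section 8.8.2.
We will show that $\frac{1}{\sigma}(\hat{\beta} - \beta)$ is independent of $\frac{1}{\sigma^2}e'e$, from which the result will follow. We
have $\frac{1}{\sigma}(\hat{\beta} - \beta) = (X'X)^{-1}X'\frac{\varepsilon}{\sigma}$ and $\frac{1}{\sigma^2}e'e = \frac{\varepsilon'}{\sigma} M_X \frac{\varepsilon}{\sigma}$. Here $\frac{\varepsilon}{\sigma} \sim N(0, I_n)$, and $(X'X)^{-1}X'M_X = 0$
(as above). The result of section 8.8.2 then guarantees that $\frac{1}{\sigma}(\hat{\beta} - \beta)$ is independent of $\frac{1}{\sigma^2}e'e$.

[1] One slight complication is that the distribution of $e$ is singular multivariate normal, i.e. there are only $n - k$
proper random variables in the $e$ vector. The remaining $k$ are perfect linear combinations of the others.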
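These finite sample results lend themselves to a simulation check. The following sketch (with an arbitrary fixed design and parameter values that are assumptions for illustration) draws normal errors repeatedly and verifies that e'e/σ² has mean close to n − k and variance close to 2(n − k), and that the slope estimate and the variance estimate are essentially uncorrelated across samples.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, sigma = 20, 2, 1.0
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([0.5, 0.1])
XtX_inv = np.linalg.inv(X.T @ X)
M = np.eye(n) - X @ XtX_inv @ X.T                 # residual maker M_X

reps = 50000
eps = rng.normal(0.0, sigma, size=(reps, n))
B = (X @ beta + eps) @ X @ XtX_inv                # OLS estimates, one row per sample
rss = np.einsum("ri,ij,rj->r", eps, M, eps)       # e'e = eps' M_X eps for each sample
q = rss / sigma**2                                # should behave like chi2(n - k)

q_mean, q_var = q.mean(), q.var()
corr_slope_s2 = np.corrcoef(B[:, 1], rss / (n - k))[0, 1]
print(q_mean, q_var, corr_slope_s2)
```

A zero correlation is of course weaker than independence; the simulation only illustrates the result, it does not prove it.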
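8.6.2 Maximum likelihood estimation
With the addition of the normality assumption, we have fully identified a family of DSPs, and
consequently we can estimate the parameters by means of maximum likelihood. The pdf of the
random variable $y_i$ defined in equation 8.7 is given by
$$f(y_i|x_i; \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i - x_i\beta)^2}{2\sigma^2}\right\}$$
Since the $y_i$ are independent of each other, the joint pdf is [2]
$$\begin{aligned}
f(y|X; \beta, \sigma^2) &= \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left\{-\frac{\sum_i (y_i - x_i\beta)^2}{2\sigma^2}\right\} \\
&= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left\{-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right\}
\end{aligned}$$
The likelihood function is
$$L(\beta, \sigma^2|y, X) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left\{-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right\}$$
and the log-likelihood is
$$\ell(\beta, \sigma^2|y, X) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}$$
The maximum likelihood estimators $\hat{\beta}_{ML}$ and $\hat{\sigma}^2_{ML}$ are those values that maximise this
log-likelihood. You may recognise the term $(y - X\beta)'(y - X\beta)$. This is the residual sum of squares.
Since $\beta$ does not feature anywhere else in $\ell$, picking $\hat{\beta}_{ML}$ to maximise $\ell$ is equivalent to picking
$\hat{\beta}_{ML}$ to maximise $-(y - X\beta)'(y - X\beta)$ or minimising the residual sum of squares! It turns out
that $\hat{\beta}_{ML}$ must be equal to the least squares estimator $\hat{\beta}$! We can show this more formally:
$$\frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2}\left(X'y - X'X\beta\right) \tag{8.8}$$
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{(y - X\beta)'(y - X\beta)}{2\sigma^4} \tag{8.9}$$
By setting these derivatives equal to zero, we get the likelihood equations
$$\frac{1}{\hat{\sigma}^2_{ML}}\left(X'y - X'X\hat{\beta}_{ML}\right) = 0$$
$$-\frac{n}{2\hat{\sigma}^2_{ML}} + \frac{(y - X\hat{\beta}_{ML})'(y - X\hat{\beta}_{ML})}{2\hat{\sigma}^4_{ML}} = 0$$

[2] We could have derived this more simply by using the fact that the joint pdf of the multivariate normal is
given by
$$(2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}(y - \mu)'\Sigma^{-1}(y - \mu)\right\}$$
where $\Sigma$ is the covariance matrix of y and $|\Sigma|$ its determinant. Substituting in $\Sigma = \sigma^2 I$ gives the same result.
It is perhaps more instructive to derive the joint pdf from the individual pdfs, since this corresponds to how we
have derived ML estimators previously.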
From the first set of equations we retrieve the normal equations $X'y = X'X\hat{\beta}_{ML}$, which confirms
that the MLE $\hat{\beta}_{ML}$ is equal to the OLS estimator $\hat{\beta}$:
$$\hat{\beta}_{ML} = (X'X)^{-1}X'y \tag{8.10}$$
Relabelling the term $(y - X\hat{\beta}_{ML})'(y - X\hat{\beta}_{ML})$ as $e'e$ and solving the second equation we get
$$\hat{\sigma}^2_{ML} = \frac{e'e}{n} \tag{8.11}$$
We note that this is not equal to the estimator $\hat{\sigma}^2 = e'e/(n - k)$ that we advocated using in section 8.3.3.
In particular it is a biased estimator. However it is a simple fraction of $\hat{\sigma}^2$, namely $\hat{\sigma}^2_{ML} = \frac{n - k}{n}\hat{\sigma}^2$, and the difference
between them shrinks to zero as the sample size increases.
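The relationship between the ML and OLS estimates can be illustrated with a few lines of code. The sketch below evaluates the normal log-likelihood at the OLS coefficients together with e'e/n and at a handful of nearby parameter values; none of the nearby points should give a higher value. The function name `loglik`, the simulated data and the perturbations are illustrative assumptions.

```python
import numpy as np

def loglik(beta, sigma2, y, X):
    """Log-likelihood of the normal linear regression model."""
    n = len(y)
    r = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - (r @ r) / (2 * sigma2)

rng = np.random.default_rng(6)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_ols
s2_unbiased = e @ e / (n - k)          # estimator from section 8.3.3
s2_ml = e @ e / n                      # ML estimator, equation 8.11

ll_max = loglik(beta_ols, s2_ml, y, X)
# Perturb the coefficients and the variance; the log-likelihood should fall
ll_near = [loglik(beta_ols + d, s2_ml * f, y, X)
           for d in (np.array([0.05, 0.0]), np.array([0.0, -0.05]))
           for f in (0.9, 1.0, 1.1)]
print(ll_max, max(ll_near))
```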
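8.6.3 The information matrix
The second derivatives of the log-likelihood are given by
$$\frac{\partial^2 \ell}{\partial \beta \partial \beta'} = -\frac{X'X}{\sigma^2}$$
$$\frac{\partial^2 \ell}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{(y - X\beta)'(y - X\beta)}{\sigma^6}$$
$$\frac{\partial^2 \ell}{\partial \beta \partial \sigma^2} = \frac{-X'y + X'X\beta}{\sigma^4}$$
Taking expectations of this, we note that $E\left[(y - X\beta)'(y - X\beta)\right] = E\left[\sum_i \varepsilon_i^2\right] = n\sigma^2$ and $E[X'y] =
E[X'(X\beta + \varepsilon)] = X'X\beta$. Consequently the information matrix will be given by
$$I(\beta, \sigma^2) = \begin{bmatrix} \frac{1}{\sigma^2}X'X & 0 \\ 0' & \frac{n}{2\sigma^4} \end{bmatrix}$$
(Note that one zero vector is a $1 \times k$ row vector while the other is a $k \times 1$ column vector.) The
Cramér-Rao Lower Bound (CRLB) is therefore given by
$$I(\beta, \sigma^2)^{-1} = \begin{bmatrix} \sigma^2 (X'X)^{-1} & 0 \\ 0' & \frac{2\sigma^4}{n} \end{bmatrix}$$
It follows that $\hat{\beta}_{ML}$ (and hence the least squares estimator) is the minimum variance unbiased
estimator since we have already shown that $\mathrm{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$. The unbiased estimator
$\hat{\sigma}^2$ has a variance which exceeds the CRLB. It turns out, however, that there is no unbiased
estimator which has a lower variance, i.e. $\hat{\sigma}^2$ is also the minimum variance unbiased estimator
(Mittelhammer et al. 2000, p.44). In short, with the assumption of normality the least squares
estimators are efficient.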
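The size of the gap between the variance of the unbiased estimator and the bound can be made explicit by combining the variance derived in section 8.6.1 with the lower right element of the inverse information matrix. The following lines are a worked restatement of those two results, not an additional assumption:

```latex
\mathrm{Var}(\hat{\sigma}^2) = \frac{2\sigma^4}{n-k}
\;>\; \frac{2\sigma^4}{n} = \text{CRLB},
\qquad
\frac{\mathrm{Var}(\hat{\sigma}^2)}{\text{CRLB}} = \frac{n}{n-k} \to 1
\quad \text{as } n \to \infty .
```

So the excess over the bound is a finite sample phenomenon that disappears as the sample grows relative to the number of regressors.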
8.7 Data Issues
The DSP describes how the data arrive on the analyst's desktop. It is clear that the matrix of
explanatory variables X plays a substantial role in this process and is therefore an important
factor in how effectively we can solve the inverse problem. We have seen one example of this
already: The identification condition
$$\mathrm{rank}(X) = k$$
determines whether or not we are able to get unique estimates. There are two ways in which
this particular condition might fail:
• The problem may be rooted in the DSP itself, e.g. it may be the case that whatever sample
we may get, it will always be the case that $x_2 = 2x_1$. In this case the problem is intrinsic
to the model and we will never be able to estimate all the structural parameters of the
DSP.
• The problem may be just a sample problem, i.e. we may just have the rotten luck that in
our particular sample we have $x_2 = 2x_1$. Alternatively we may simply not have enough
observations to estimate the structural parameters. Theoretically there is no problem
with estimating the model, but practically there will be, since we are hardly ever in the
situation of having the luxury of re-generating the sample or extending the data run.
Since all our estimation procedures are only as good as the data on which they are based (i.e.
all our estimates are conditional on X), it is worthwhile to spend some time to look at a few
common data problems:
8.7.1 Multicollinearity
It is clear that if $\mathrm{rank}(X) < k$ we cannot estimate the model, i.e. $(X'X)$ does not have an inverse.
A more common problem, however, is that our explanatory variables are highly correlated, but
not perfectly so. In this case $X'X$ is almost singular. It is clear why this might cause a problem:
in a sense we are multiplying through by the inverse of something that is almost zero. This will
lead to very unstable and imprecise estimates. In fact we can show (Greene 2003, p.57) that
$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(1 - R_j^2)\sum_i (x_{ij} - \bar{x}_j)^2} \tag{8.12}$$
where $R_j^2$ is the $R^2$ obtained in the regression of $x_j$ on all other variables (including the constant).
This formula shows that a lack of precision in the estimate $\hat{\beta}_j$ can be due to three sources:
• the intrinsic noise in the data sampling process. The more noise, i.e. the higher $\sigma^2$, the
less precisely we will be able to estimate the parameter.
• the variation in $x_j$. The more variation we have at our disposal, the more accurately we
are able to measure how changes in $x_j$ affect y.
• the correlation between $x_j$ and the other explanatory variables. The higher the correlation,
the less accurately we are able to isolate the independent effect of $x_j$. Intuitively, the
regression estimates try to purge the effects of all the other variables first (this is the FWL
theorem). If $x_j$ is highly correlated with the other variables, there is very little variation
left on which to assess the separate impact of $x_j$ on y.
There are various diagnostics that are available for assessing whether multicollinearity is
likely to pose a problem. One of these is the variance inflation factor
$$VIF_j = \frac{1}{1 - R_j^2}$$
This shows how much the estimated variance has been affected by the correlation between $x_j$
and the other variables.

Figure 8.1: Plot of exchange rate against PPP (vertical axis: actual dollar exchange rate; horizontal
axis: purchasing power parity). The point at $x = 13478$ has high leverage. It pulls the regression line towards
itself. The top line is the regression line with the observation included, while the lower one is
the regression line without it.
Various fixes for multicollinearity have been suggested in the literature (for discussions see
Greene 2003, Gujarati 2003). At the end of the day many of these may create new problems
or try to force relationships on to the data that the data simply do not want to accept. As
Gujarati (2003, Chapter 10) points out, multicollinearity can perhaps most usefully be thought
of as being akin to micronumerosity, i.e. the problem of having too few observations. In a sense
one is asking questions of the data that the data are not equipped to answer.
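As a rough illustration of the diagnostic 1/(1 − R²) described above, the following sketch computes it for each non-constant regressor by running the auxiliary regression of that regressor on all the others. The helper name `vif` and the simulated regressors are assumptions made for the example; `x2` is constructed to be nearly collinear with `x1`.

```python
import numpy as np

def vif(X):
    """1 / (1 - R_j^2) for each non-constant column of X (column 0 is the constant)."""
    out = []
    for j in range(1, X.shape[1]):
        others = np.delete(X, j, axis=1)
        fitted = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - fitted
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - resid @ resid / tss          # R^2 of the auxiliary regression
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # unrelated regressor
X = np.column_stack([np.ones(n), x1, x2, x3])
vifs = vif(X)
print(vifs)
```

In this design the first two factors should be large while the third stays close to one, which mirrors the point that the problem is specific to the regressors that are highly correlated with the rest.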
8.7.2 Influential data points
Another problem is exemplified by Figure 8.1. In this particular example we have regressed the
exchange rate on a measure of purchasing power parity (the Big Mac Index). The data are for
1994. It is clear that the observation at $x = 13478$ (Poland) has a disproportionate influence
on the regression line. The lower line in the diagram is the regression line that would have been
obtained if that observation had been deleted from the data set.
We can think about this somewhat more rigorously by noting that each element of the $\hat{\beta}$
vector is a linear combination of the elements of the vector y. An influential observation $y_t$
is such that $y_t$ has a disproportionate influence on one or more elements of $\hat{\beta}$. The fundamental
result (proved in Davidson and MacKinnon 1993, pp.32-39) is that
$$\hat{\beta} - \hat{\beta}^{(t)} = \frac{1}{1 - h_t}(X'X)^{-1}X_t' e_t \tag{8.13}$$
where $\hat{\beta}^{(t)}$ is the vector of parameter estimates without observation $t$, $X_t$ is the $t$-th row of the
data matrix and $e_t$ is the $t$-th residual when the model is run on all observations. A particularly
important quantity in this formula is $h_t$. It is defined as
$$h_t = X_t (X'X)^{-1} X_t'$$
i.e. it is the $t$-th diagonal element of the projection matrix $P_X$. It is intuitively clear why the
projection (or hat) matrix $P_X$ may be important, since it shows the impact of $y$ on $\hat{y}$. The
diagonal element $h_t$ measures the impact that $y_t$ has on its own fitted value $\hat{y}_t$.
Looking at equation 8.13, it is clear that the impact of observation $t$ on the estimates $\hat{\beta}$ will
be great if:
• $h_t$ is large or
• $e_t$ is large
Interestingly enough $h_t$ only depends on the X matrix, i.e. it is a feature of the structure of
the explanatory variables. We can show that
$$0 \leq h_t \leq 1$$
If $h_t$ is close to one, then the observation is said to have high leverage. Indeed any point which
has a value of $h_t$ greater than $k/n$ has potentially more influence than the others. In the example
given in Figure 8.1 the leverage associated with the point at $x = 13478$ is 0.967, i.e. it is very high. Points
with high leverage potentially have a great influence on the regression estimates. Whether this
potential is translated into actual exercise of influence depends on $e_t$. If the value for Poland
had been $y = 13316$, it would have been right on the lower line and so the regression line would
not have changed with the deletion of the observation. In short if $e_t = 0$ in equation 8.13 the
point may have high leverage, but will not actually affect the estimates.
It is therefore desirable to investigate not only the leverage of the observations, but also the
associated residuals. We note that $e = M_X\varepsilon$, i.e.
$$E(ee') = \sigma^2 M_X = \sigma^2 (I - P_X)$$
The diagonal elements of this will give the respective variances of the associated residuals, i.e.
$\mathrm{Var}(e_t) = \sigma^2 (1 - h_t)$. Note that although we have assumed that the error process is homoscedastic,
this is not true of the residuals - the residuals associated with points of high leverage will be
smaller on average than the residuals elsewhere. We can standardise the residuals by dividing
through by an estimate of the standard error
$$\tilde{e}_t = \frac{e_t}{\hat{\sigma}\sqrt{1 - h_t}}$$
Standardised residuals that are large (say bigger than two) are an indication of points that will
have significant impacts on the regression estimates.
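We can see the combined effect of $h_t$ and $e_t$ by observing that the predicted value for $y_t$ based
on the estimates $\hat{\beta}^{(t)}$ will be given by
$$\begin{aligned}
\hat{y}_t^{(t)} &= X_t \hat{\beta}^{(t)} \\
&= X_t \hat{\beta} - \frac{1}{1 - h_t} X_t (X'X)^{-1} X_t' e_t \\
&= \hat{y}_t - \frac{h_t}{1 - h_t} e_t
\end{aligned}$$
So the quantity $\frac{h_t}{1 - h_t} e_t$ measures the impact on its own fitted value of the deletion of observation
$t$.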
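The deletion formula in equation 8.13 can be verified directly by comparing it with a brute force re-estimation that drops the observation. The sketch below does this on simulated data in which the last observation is deliberately placed far from the others; the data and the variable names are illustrative assumptions and have nothing to do with the exchange rate example.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 15
x = rng.normal(size=n)
x[-1] = 8.0                                    # one point far from the rest: high leverage
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)    # diagonal of the hat matrix

t = n - 1                                      # index of the high-leverage observation
# Equation 8.13 rearranged: beta^(t) = beta_hat - (X'X)^{-1} X_t' e_t / (1 - h_t)
beta_del_formula = beta_hat - XtX_inv @ X[t] * e[t] / (1.0 - h[t])
# Brute force: drop row t and re-estimate
beta_del_direct, *_ = np.linalg.lstsq(np.delete(X, t, axis=0), np.delete(y, t), rcond=None)
print(h[t], beta_del_formula, beta_del_direct)
```

The leverages also sum to the number of regressors, which is why the average leverage k/n is a natural benchmark for deciding which points deserve a closer look.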
The key question for empirical work is what to do about influential data points. There are
at least three ways to think about it:
• It is possible that the influential point simply represents bad data - the recorded data
may be in error or we may have mixed in observations that really belong to a different
regime or DSP. In the example cited above, it is possible that the exchange rates of
countries undergoing drastic political change (such as Poland) may work to a logic that
is not described adequately by purchasing power parity. The appropriate response in this
case would be to delete the observation.
• It is possible that the observation represents disproportionately informative data. What
gives the observation on Poland such a high leverage, is that it is far from the other values.
In short it gives us information about what happens outside the normal range of $x$. It
therefore helps us to fix the regression line much more accurately than would otherwise be
the case. In this case the last thing that we should do is to delete the observation.
• It is possible that the observation represents neither completely bad data nor hundred
percent good data. Instead we may have misspecified the model. In the PPP example,
it is plausible that the error process is heteroscedastic - at high levels of $x$ the exchange
rate perhaps becomes more volatile, i.e. PPP still has some influence, but other random
factors become more important. In this case the appropriate response would be to either
respecify the model, reweight the data or perhaps transform the data.
It is important to understand this last point properly: the influence of an observation is
always with reference to the particular model that we are trying to estimate.
8.7.3 Missing information
The problems of multicollinearity and influential data points are both a result of the fact that
we hardly ever have the luxury of controlling the variables in our studies, i.e. we hardly ever
have experimental data. We are therefore constrained by what the DSP happens to throw up for
us - and this may involve both highly correlated variables as well as too few observations with
high values. For instance, we generally have too few really rich individuals in our household
data sets. Correspondingly the few high income observations will have a high leverage on any
estimates where income features as an explanatory variable.
What can exacerbate this problem is nonresponse by sampled individuals. If the pattern of
nonresponse is random (e.g. if high income individuals have the same propensity to refuse as low
income individuals) this will not materially affect any estimates. If, however, it turns out that
there are systematic patterns then our analyses may be subject to sample selection bias. A
discussion of this problem is given in Wooldridge (2002, Chapter 17).
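8.8 Appendix

8.8.1 The trace of a matrix
Definition 8.2 The trace of a square matrix $A$ is defined as the sum of its main diagonal
elements, i.e. if
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}$$
then
$$\mathrm{tr}(A) = a_{11} + a_{22} + \dots + a_{nn}$$
Remark 8.3 It follows immediately from the definition that
$$\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$$
whenever the matrix addition makes sense.
Remark 8.4 It also follows that
$$\mathrm{tr}(cA) = c\,\mathrm{tr}(A)$$
where $c$ is a scalar.
Proposition 8.5 If the matrix $A$ is $n \times k$ and the matrix $B$ is $k \times n$, so that both the matrix
product $AB$ and the product $BA$ are defined and both product matrices are square (one is
$n \times n$ and the other $k \times k$), then
$$\mathrm{tr}(AB) = \mathrm{tr}(BA)$$
Proof. We will show this first for the case where $k = 1$, so that $A$ is $n \times 1$ and $B$ is $1 \times n$, i.e.
$$A = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix}$$
Multiplying out we see that
$$AB = \begin{bmatrix} a_1 b_1 & a_1 b_2 & \cdots & a_1 b_n \\ a_2 b_1 & a_2 b_2 & \cdots & a_2 b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_n b_1 & a_n b_2 & \cdots & a_n b_n \end{bmatrix}$$
so that
$$\mathrm{tr}(AB) = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$$
But
$$BA = \left[b_1 a_1 + b_2 a_2 + \dots + b_n a_n\right]$$
which is a $1 \times 1$ matrix and so trivially
$$\mathrm{tr}(BA) = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$$
In this case we find that the proposition holds.
In the more general case we write the matrix $A$ as a column of row vectors and $B$ as a row
of column vectors, i.e.
$$A = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix}$$
where $a_i$ is a $1 \times k$ vector and $b_i$ is a $k \times 1$ vector. Once again we get
$$AB = \begin{bmatrix} a_1 b_1 & a_1 b_2 & \cdots & a_1 b_n \\ a_2 b_1 & a_2 b_2 & \cdots & a_2 b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_n b_1 & a_n b_2 & \cdots & a_n b_n \end{bmatrix}$$
where each term $a_i b_j$ is a $1 \times 1$ scalar. Consequently we again have
$$\mathrm{tr}(AB) = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$$
In this case we also have
$$BA = b_1 a_1 + b_2 a_2 + \dots + b_n a_n$$
This, however, is now a sum of $n$ matrices, each of which is a $k \times k$ matrix. We have
$$\begin{aligned}
\mathrm{tr}(BA) &= \mathrm{tr}(b_1 a_1 + b_2 a_2 + \dots + b_n a_n) \\
&= \mathrm{tr}(b_1 a_1) + \mathrm{tr}(b_2 a_2) + \dots + \mathrm{tr}(b_n a_n) \\
&= \mathrm{tr}(a_1 b_1) + \mathrm{tr}(a_2 b_2) + \dots + \mathrm{tr}(a_n b_n) \\
&= a_1 b_1 + a_2 b_2 + \dots + a_n b_n
\end{aligned}$$
Here we have used first the fact that $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$. Then we have used the fact
that $a_i$ and $b_i$ are vectors and we have already shown that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ in this special case.
Finally we use the fact that the $a_i b_i$ are all scalars and the trace of a scalar is just the scalar itself.
Remark 8.6 If $A$ is a random matrix, then
$$E(\mathrm{tr}(A)) = \mathrm{tr}(E(A))$$
Proof. This follows by substituting in the definition of the trace of a matrix and using the linearity of the expectation.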
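The trace results can be spot-checked numerically. The short sketch below uses arbitrary random matrices (an assumption made purely for illustration) to confirm that tr(AB) = tr(BA) even though the two products have different dimensions, and that the trace of the residual maker built from an n × k matrix equals n − k, as used in section 8.3.3.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 6, 3
A = rng.normal(size=(n, k))
B = rng.normal(size=(k, n))

tr_AB = np.trace(A @ B)      # trace of an n x n matrix
tr_BA = np.trace(B @ A)      # trace of a k x k matrix

# Residual maker built from A: its trace should equal n - k
M = np.eye(n) - A @ np.linalg.inv(A.T @ A) @ A.T
tr_M = np.trace(M)
print(tr_AB, tr_BA, tr_M)
```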
8.8.2 Results on the multivariate normal distribution
These are taken from Greene (2003, Appendix B.11, pp.871).

Linear functions of a normal vector
If $x \sim N(\mu, \Sigma)$ then $Ax + b \sim N(A\mu + b, A\Sigma A')$

Standardising a normal vector
If $x \sim N(\mu, \Sigma)$ then $\Sigma^{-\frac{1}{2}}(x - \mu) \sim N(0, I)$
Note that $\Sigma^{-\frac{1}{2}}$ will exist, provided that $\Sigma$ is positive definite (i.e. provided that $\Sigma$ is of full
rank).

Distribution of an idempotent quadratic form in a standard normal vector
If $x \sim N(0, I)$ and A is symmetric and idempotent, then $x'Ax \sim \chi^2(J)$, where $J = \mathrm{rank}(A)$

Independence of idempotent quadratic forms
If $x \sim N(0, I)$ and $x'Ax$ and $x'Bx$ are two idempotent quadratic forms in x, then $x'Ax$ and
$x'Bx$ are independent if $AB = 0$

Independence of a linear and a quadratic form
A linear function $Lx$ and a symmetric idempotent quadratic form $x'Ax$ in a standard normal
vector are statistically independent if $LA = 0$.
Chapter 9
Asymptotic properties of the OLS estimators

(Compare with Wooldridge (2002, pp.49-55))

9.1 Introduction
In this chapter we will consider the asymptotic properties of the Least Squares estimator in the
context of the classical linear regression model. The reasons for doing this include:
• We can say something about the accuracy of our estimation procedures in large samples.
The appropriate concept is that of consistency.
• We can say something about the distribution of our estimators in large samples. The
justification for using the normal distribution for inference will turn on this.
• The approach that we develop in this chapter has more general application. We will make
use of similar forms of argument in contexts outside the linear regression model. Frequently
it will be impossible to derive finite sample properties of the estimators while the asymptotic
properties might be relatively straightforward to derive.
In order to contextualise the discussion, we need to remind ourselves of what we are assuming
about the DSP:
3. X: The X variables are exogenous.
$\mathrm{Var}(\varepsilon|X) = \sigma^2 I$
6. (e|X ): The distribution of the error terms is left unspecified.
The model can therefore be written as
$$Y = X\beta + \varepsilon, \qquad E(\varepsilon|X) = 0, \quad \mathrm{Var}(\varepsilon|X) = \sigma^2 I \tag{9.1}$$
Note that we will need to assume that the model is identified, i.e. X has full column rank.
9.2 The sampling process
Note that when we are talking about the consistency of the OLS estimator we are really talking
about the behaviour of the series of estimators
$$\left\{\hat{\beta}_n\right\} = \left\{(X_n'X_n)^{-1}X_n'y_n\right\}$$
where $X_n$ and $y_n$ are the data matrices and dependent variables from a sample of size $n$. A key
question is how different (or otherwise) the additional rows of the data matrix are when compared
to the previous ones. The upwardly trending data characteristic of many macroeconomic time
series require some care in this regard, because it is clear that additional observations are generally
not from precisely the same distribution as earlier ones. Indeed in the previous chapter on
asymptotic theory we ruled out processes such as
=
If we are dealing with cross-sectional data we can simply make the assumption that each
observation is a separate draw from the same underlying distribution. In this case, the only
assumption that we need to prove consistency is that
$$\mathrm{rank}\left[E(x'x)\right] = k \tag{9.2}$$
where x is the row vector of explanatory variables, i.e. each row of the X matrix can be thought
of as a separate draw from the same multivariate distribution as x. We will make the weaker
assumption that
$$\mathrm{plim}\, \frac{1}{n}X'X = A, \text{ where } A \text{ is a positive definite matrix} \tag{9.3}$$
Note that Assumption 9.2 implies assumption 9.3, but not necessarily vice versa. It is possible
to prove consistency with yet weaker conditions on the data sampling process (Mittelhammer
et al. 2000, pp.44):
• $(X'X)^{-1}$ must exist for all $n$ (so that the OLS estimator exists)
• $\mathrm{tr}(X'X) = \sum_{i=1}^{n}\sum_{j=1}^{k} x_{ij}^2 \to \infty$ as $n \to \infty$ and the ratio of the largest to the smallest
eigenvalues of $X'X$ is upper bounded.
The latter condition rules out that one (or more) of the eigenvalues of $X'X$ goes to zero.
It would do this if one of the columns in the X matrix became more and more like a linear
combination of some of the other columns or if one of the columns became increasingly like
a column of zeros (e.g. if $x_{ij} = \frac{1}{i}$). We give a proof of the consistency of OLS based on these
assumptions in the appendix. Note that these conditions are fairly unrestrictive.
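9.3 Asymptotic properties of $\hat{\beta}$

9.3.1 Consistency of $\hat{\beta}$
Theorem 9.1 Assume that the DSP meets the conditions listed in Section 9.1 as well as the
assumption that $\frac{1}{n}X'X \to_p A$ where A is positive definite, then $\hat{\beta} \to_p \beta$.
By definition:
$$\begin{aligned}
\hat{\beta} &= (X'X)^{-1}X'y \\
&= \beta + (X'X)^{-1}X'\varepsilon \\
&= \beta + \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{n}X'\varepsilon
\end{aligned}$$
If $\mathrm{plim}\, \frac{1}{n}X'X = A$, a positive definite matrix, then
$$\mathrm{plim}\, \hat{\beta} = \beta + A^{-1}\, \mathrm{plim}\, \frac{1}{n}X'\varepsilon$$
(using Slutsky's theorem). To show that $\frac{1}{n}X'\varepsilon$ converges to zero we note that the typical term
in this vector is given by
$$\frac{1}{n}\sum_{i=1}^{n} x_{ij}\varepsilon_i \tag{9.4}$$
where $x_{ij}$ is the $i$-th observation on variable $x_j$ (i.e. the $j$-th column of X). Now $E(x_{ij}\varepsilon_i) = 0$
(by our assumptions about $\varepsilon$ and X) and $\mathrm{Var}\left(\frac{1}{n}\sum_i x_{ij}\varepsilon_i\right) = \frac{\sigma^2}{n^2}\sum_i x_{ij}^2$ (since all terms involving
$\varepsilon_i\varepsilon_l$, $i \neq l$, have zero expectation), so if the $x_{ij}^2$ remain finite (if they don't the matrix $\frac{1}{n}X'X$ is unlikely to
converge), the sum $\sum_i x_{ij}^2$ is $O(n)$ and the variance will converge to zero. Consequently
$\frac{1}{n}\sum_i x_{ij}\varepsilon_i$ converges to zero in mean square and hence in probability. It
follows that $\mathrm{plim}\, \frac{1}{n}X'\varepsilon = 0$.
We give an alternative proof using the weaker assumptions in Section 9.1 in the appendix.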
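Consistency can be illustrated by estimating the same model on samples of increasing size. In the sketch below the errors are drawn from a Student t distribution with 5 degrees of freedom, to emphasise that normality is not required, only a finite variance. The design, the parameter values and the helper name `ols_error` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(10)
beta = np.array([1.0, 2.0])

def ols_error(n):
    """Largest absolute estimation error of OLS in one simulated sample of size n."""
    X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, size=n)])
    eps = rng.standard_t(df=5, size=n)         # non-normal errors with finite variance
    y = X @ beta + eps
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return np.max(np.abs(b - beta))

sizes = [100, 10_000, 1_000_000]
errors = [ols_error(n) for n in sizes]
for n, err in zip(sizes, errors):
    print(n, err)
```

Each call uses a single random sample, so the errors need not fall monotonically from one sample size to the next, but they should become small once n is large.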
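9.3.2 Asymptotic normality of $\hat{\beta}$
Theorem 9.2 (Mittelhammer et al. 2000, p.96) Assume that the DSP meets the conditions
listed in Section 9.1 and assume further that $\frac{1}{n}X'X \to A$ where A is a finite positive definite
symmetric matrix, the elements in X are bounded in absolute value and $E\left[|\varepsilon_i|^{2+\delta}\right] \leq m$ for some
finite constants $\delta > 0$ and $m$, then
$$n^{\frac{1}{2}}(\hat{\beta} - \beta) \to_d N\left(0, \sigma^2 A^{-1}\right)$$
In order to prove this we write
$$\begin{aligned}
n^{\frac{1}{2}}(\hat{\beta} - \beta) &= n^{\frac{1}{2}}(X'X)^{-1}X'\varepsilon \\
&= \left(\frac{1}{n}X'X\right)^{-1} n^{-\frac{1}{2}}X'\varepsilon
\end{aligned}$$
Now as before $\left(\frac{1}{n}X'X\right)^{-1} \to_p A^{-1}$. We need to use a central limit theorem on $n^{-\frac{1}{2}}X'\varepsilon$ to show
that $n^{-\frac{1}{2}}X'\varepsilon \to_d N\left(0, \sigma^2 A\right)$. The additional conditions in the theorem allow us to invoke a
central limit theorem on the sequence of terms $n^{\frac{1}{2}}\left(\frac{1}{n}\sum_{i=1}^{n} x_{ij}\varepsilon_i\right)$, where the term in brackets is defined as in equation
9.4. Note that
$\mathrm{var}\left(n^{-\frac{1}{2}}X'\varepsilon\right) = n^{-1}X'E(\varepsilon\varepsilon')X = \sigma^2 n^{-1}(X'X) \to \sigma^2 A$ and $\mathrm{var}\left(n^{\frac{1}{2}}(\hat{\beta} - \beta)\right) \to
A^{-1}\sigma^2 A A^{-1} = \sigma^2 A^{-1}$.
We can rewrite the result of the theorem as
$$\hat{\beta} \overset{a}{\sim} N\left(\beta, \frac{\sigma^2}{n}A^{-1}\right)$$
The matrix $A^{-1}$ is somewhat awkward in here, but we can replace it with something more
tractable by noting that if $X_n \to_p x$ and $Y_n \to_d Y$, then $X_nY_n \to_d xY$. Consider the sequence
of random variables $A^{-\frac{1}{2}}\left(\frac{1}{n}X'X\right)^{\frac{1}{2}}$. Since A and $\frac{1}{n}X'X$ are symmetric positive definite, these
matrices are well defined. We have $\left(\frac{1}{n}X'X\right)^{\frac{1}{2}} \to_p A^{\frac{1}{2}}$, so $A^{-\frac{1}{2}}\left(\frac{1}{n}X'X\right)^{\frac{1}{2}} \to_p I$. Furthermore
$n^{\frac{1}{2}}(\hat{\beta} - \beta) \to_d N\left(0, \sigma^2 A^{-1}\right)$. It follows now that
$$A^{-\frac{1}{2}}\left(\frac{1}{n}X'X\right)^{\frac{1}{2}} n^{\frac{1}{2}}(\hat{\beta} - \beta) \to_d N\left(0, \sigma^2 A^{-1}\right)$$
and it follows that
$$\begin{aligned}
\hat{\beta} &\overset{a}{\sim} N\left(\beta, \sigma^2 (X'X)^{-\frac{1}{2}} A^{\frac{1}{2}} A^{-1} A^{\frac{1}{2}} (X'X)^{-\frac{1}{2}}\right) \\
&= N\left(\beta, \sigma^2 (X'X)^{-1}\right)
\end{aligned}$$
This holds by one of the properties of normal variables. If $X \sim N(\mu, V)$, then $BX + c \sim N(B\mu + c, BVB')$.
In this case use $B = (X'X)^{-\frac{1}{2}} A^{\frac{1}{2}}$ and $c = \beta$.

9.4 Asymptotic properties of $e$, $\hat{\sigma}^2$ and $\widehat{\mathrm{Var}}(\hat{\beta})$

9.4.1 Consistency of $e$ as an estimator of $\varepsilon$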
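Theorem 9.3 (Mittelhammer et al. 2000, p.98) Under the conditions of the DSP set out in Section 9.1 and on the additional assumption that $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \to \mathbf{0}$ as $n \to \infty$, it follows that $e_i \xrightarrow{p} \varepsilon_i$ for all $i$ and consequently $e_i \xrightarrow{d} \varepsilon_i$ for all $i$.
Proof. Observe that $\mathbf{e} - \boldsymbol{\varepsilon} = -\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}$. Let $\mathbf{u}_i$ be the $(n \times 1)$ vector of zeros except for a value of 1 in the $i$-th position. It follows that
\[ e_i - \varepsilon_i = -\mathbf{u}_i'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon} \]
Consequently $E(e_i - \varepsilon_i) = 0$ and $\operatorname{var}(e_i - \varepsilon_i) = \sigma^2\mathbf{u}_i'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}_i$. If $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \to \mathbf{0}$ it follows that $\operatorname{var}(e_i - \varepsilon_i) \to 0$. Hence $e_i - \varepsilon_i \xrightarrow{p} 0$ for all $i$. The result follows.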

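Comment: We would generally expect $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ to converge to zero. This is relatively easy to see if we write this as
\[
\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' =
\begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{bmatrix}
(\mathbf{X}'\mathbf{X})^{-1}
\begin{bmatrix} \mathbf{x}_1' & \mathbf{x}_2' & \cdots & \mathbf{x}_n' \end{bmatrix}
=
\begin{bmatrix}
\mathbf{x}_1(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_1' & \mathbf{x}_1(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_2' & \cdots & \mathbf{x}_1(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_n' \\
\mathbf{x}_2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_1' & \mathbf{x}_2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_2' & \cdots & \mathbf{x}_2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_n' \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{x}_n(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_1' & \mathbf{x}_n(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_2' & \cdots & \mathbf{x}_n(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_n'
\end{bmatrix}
\]
As $n \to \infty$ we see that this matrix gets larger and $(\mathbf{X}'\mathbf{X})^{-1} \to \mathbf{0}$, but the elements of each row $\mathbf{x}_i$ should remain bounded (and there will only be $k$ of these), so that the product term $\mathbf{x}_i(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_j'$ will tend to zero.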
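The comment above can be illustrated numerically. This is a small sketch under assumed inputs (a constant plus the bounded regressor $\sin(i)$, chosen only for illustration): it computes the diagonal elements $\mathbf{x}_i(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i'$ for growing $n$ and reports the largest one. Because the matrix is positive semi-definite, the off-diagonal elements are bounded in absolute value by the largest diagonal element.

```python
import numpy as np

def max_leverage(n):
    """Largest diagonal element and trace of X (X'X)^{-1} X' for a bounded design."""
    x = np.sin(np.arange(1, n + 1))            # regressor bounded in [-1, 1]
    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    # h_ii = x_i (X'X)^{-1} x_i', computed without forming the n x n matrix
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    return h.max(), h.sum()

results = {n: max_leverage(n) for n in (50, 500, 5000)}
for n, (h_max, trace) in results.items():
    print(n, round(h_max, 5), round(trace, 5))
```

The trace stays equal to $k = 2$ whatever the sample size, while the individual elements shrink roughly like $1/n$.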
The implication of this theorem is that the asymptotic distribution of the residuals will be
the same as the distribution of the stochastic errors, i.e. it will be normal only if the errors are
normally distributed.
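9.4.2 Consistency of $\hat{\sigma}^2$ as an estimator of $\sigma^2$
We will write
\[ \hat{\sigma}^2 = \frac{\mathbf{e}'\mathbf{e}}{n-k} = \frac{\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}}{n-k} - \frac{\boldsymbol{\varepsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}}{n-k} \]
Now we use Markov's Inequality (see the Appendix to Chapter 4). For a non-negative random variable $Z$ and any $c > 0$ this Inequality states that $\Pr(Z \ge c) \le \frac{E(Z)}{c}$. We will turn this around as $\Pr(Z < c) \ge 1 - \frac{E(Z)}{c}$. Consequently
\[ \Pr\left(\frac{\boldsymbol{\varepsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}}{n-k} < c\right) \ge 1 - \frac{1}{c}E\left[\frac{\boldsymbol{\varepsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}}{n-k}\right] \]
But
\[ E\left[\frac{\boldsymbol{\varepsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}}{n-k}\right] = \sigma^2\frac{k}{n-k}. \]
Taking limits
\[ \lim_{n\to\infty}\Pr\left(\frac{\boldsymbol{\varepsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}}{n-k} < c\right) \ge 1 - \lim_{n\to\infty}\frac{\sigma^2 k}{c\,(n-k)} \]
Consequently for any $c > 0$ we have
\[ \lim_{n\to\infty}\Pr\left(\frac{\boldsymbol{\varepsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}}{n-k} < c\right) = 1 \]
But this proves that $\operatorname{plim} \frac{\boldsymbol{\varepsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}}{n-k} = 0$.
The term $\frac{\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}}{n-k} = \frac{\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}}{n}\cdot\frac{n}{n-k}$ converges to $\sigma^2$, since $\frac{\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}}{n}$ is the mean of $n$ i.i.d. random variables $\varepsilon_i^2$ having expected value $E(\varepsilon_i^2) = \sigma^2$ and $\frac{n}{n-k} \to 1$ as $n \to \infty$.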
9.4.3 Asymptotic normality of $\hat{\sigma}^2$
This can be shown if the fourth order moments of the $\varepsilon_i$ are finite. It will have asymptotic variance $\mu_4' - \sigma^4$ where $\mu_4'$ is the fourth moment of the $\varepsilon_i$s.
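9.4.4 Consistency of $\hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}$ as an estimator for $\operatorname{var}(\hat{\boldsymbol{\beta}})$
Note that the estimator is unbiased, since
\[ E\left[\hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}\right] = E\left[\hat{\sigma}^2\right](\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \]
It is consistent, since $\hat{\sigma}^2 - \sigma^2 \xrightarrow{p} 0$, and
\[ \widehat{\operatorname{var}}(\hat{\boldsymbol{\beta}}) - \operatorname{var}(\hat{\boldsymbol{\beta}}) = \left(\hat{\sigma}^2 - \sigma^2\right)(\mathbf{X}'\mathbf{X})^{-1} \]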
This will converge to zero, provided that $(\mathbf{X}'\mathbf{X})^{-1}$ remains bounded as $n \to \infty$.
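9.5 Appendix: Alternative proof of consistency of $\hat{\boldsymbol{\beta}}$
Theorem 9.4 (Mittelhammer et al. 2000, p.45) Assume that the DSP meets the conditions listed in Section 9.1 and assume further that $\operatorname{tr}(\mathbf{X}'\mathbf{X}) = \sum_{i=1}^{n}\sum_{j=1}^{k} x_{ij}^2 \to \infty$ and that the ratio of the largest to the smallest eigenvalues of $\mathbf{X}'\mathbf{X}$ is upper bounded, then $\hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta}$.
Proof. The trace of a square matrix is equal to the sum of the eigenvalues of the matrix. Given that $\operatorname{tr}(\mathbf{X}'\mathbf{X}) \to \infty$ it follows that $\sum_{j=1}^{k}\lambda_j(\mathbf{X}'\mathbf{X}) \to \infty$. We have assumed that the ratio of the largest to the smallest eigenvalue is upper bounded, i.e. $\lambda_{\max}/\lambda_{\min} \le c$ for some real number $c$. Consequently the sum of the eigenvalues can increase without bound only if the smallest eigenvalue increases without bound, i.e. $\lambda_{\min}(\mathbf{X}'\mathbf{X}) \to \infty$ as $n \to \infty$. The eigenvalues of $(\mathbf{X}'\mathbf{X})^{-1}$ are the reciprocals of the eigenvalues of $\mathbf{X}'\mathbf{X}$, so it follows that the largest eigenvalue of $(\mathbf{X}'\mathbf{X})^{-1} \to 0$ as $n \to \infty$. Consequently all the eigenvalues of $(\mathbf{X}'\mathbf{X})^{-1}$ converge to zero and hence $\sum_{j=1}^{k}\lambda_j\left((\mathbf{X}'\mathbf{X})^{-1}\right) \to 0$. This means that $\operatorname{tr}\left((\mathbf{X}'\mathbf{X})^{-1}\right) \to 0$ as $n \to \infty$.
But we have shown that $\sum_{j=1}^{k}\operatorname{var}(\hat{\beta}_j) = \sigma^2\operatorname{tr}\left((\mathbf{X}'\mathbf{X})^{-1}\right)$. Consequently the variances of each element of the $\hat{\boldsymbol{\beta}}$ vector go to zero. Furthermore we know that $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$. It follows that each $\hat{\beta}_j \xrightarrow{p} \beta_j$ and hence $\hat{\boldsymbol{\beta}}$ is consistent for $\boldsymbol{\beta}$.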
Chapter 10
Inference and prediction in the CLRM
10.1 Introduction
In this chapter we will consider in more detail how to test assumptions about the DSP based
on the least squares estimates, i.e. we continue to make the following assumptions:
3. X: The X variables are exogenous.
$\operatorname{var}(\boldsymbol{\varepsilon}|\mathbf{X}) = \sigma^2\mathbf{I}$
6. (e|X ): The distribution of the error terms is left unspecified. We will also consider the
special case where we know that the error terms are normally distributed.
7. : The parameter space is unrestricted. Below we will consider the specific case where we
impose a set of linear restrictions on the parameter space.
As noted in Chapter 5, there are broadly three approaches to testing:
• We can base the test on the unrestricted model and investigate how different the unrestricted estimates are from the values given by the null hypothesis.
• We can base the test on how much the fit of the regression changes from the unrestricted to the restricted model.
• We can estimate the restricted model and investigate whether the restrictions appear to be binding, i.e. whether we would get very different estimates if we relaxed the restrictions.
In all cases we need to make some distributional assumptions about the estimator. We saw in the last chapter that under fairly broad conditions the OLS estimators will be asymptotically normally distributed, provided that the assumptions of the classical linear regression model hold. If we assume that the error term is normally distributed, we can give precise results even in small samples.
In all cases we will be concerned with testing a set of linear restrictions stated in the null hypothesis
\[ H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{c} \]
against the alternative
\[ H_1: \mathbf{R}\boldsymbol{\beta} \neq \mathbf{c} \]
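10.2 Wald type tests
10.2.1 A Wald test
Under the stated assumptions we have noted that the Wald statistic
\[ W = \left(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{c}\right)'\left[\mathbf{R}\operatorname{var}(\hat{\boldsymbol{\beta}})\mathbf{R}'\right]^{-1}\left(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{c}\right) \]
will be distributed as $\chi^2(r)$, where $r$ is the number of restrictions (the number of rows of $\mathbf{R}$). We know that $\operatorname{var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, so our test statistic becomes
\[ W = \frac{1}{\sigma^2}\left(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{c}\right)'\left[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\right]^{-1}\left(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{c}\right) \tag{10.1} \]
The only unknown quantity in this expression is $\sigma^2$. We know that $\hat{\sigma}^2 = \frac{\mathbf{e}'\mathbf{e}}{n-k}$ is a consistent estimator of $\sigma^2$, so in large samples we could base our Wald statistic on
\[ \hat{W} = \frac{1}{\hat{\sigma}^2}\left(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{c}\right)'\left[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\right]^{-1}\left(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{c}\right) \tag{10.2} \]
10.2.2 F test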
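We can do better than this if we know that the errors are normally distributed. In this case we know that
\[ \frac{(n-k)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-k) \tag{10.3} \]
Furthermore we showed that $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are statistically independent of each other. So the Wald statistic given in equation 10.1 will be independent of $\frac{(n-k)\hat{\sigma}^2}{\sigma^2}$. So we can form an F statistic by dividing each chi-square variable by its degrees of freedom, i.e.
\[ F = \frac{W / r}{\left[\frac{(n-k)\hat{\sigma}^2}{\sigma^2}\right] / (n-k)} \]
This will be distributed as an $F(r, n-k)$ variable. Now this simplifies to
\[ F = \frac{1}{r\hat{\sigma}^2}\left(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{c}\right)'\left[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\right]^{-1}\left(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{c}\right) \tag{10.4} \]
While this statistic is based on the normality assumption, we know that $rF \xrightarrow{d} \chi^2(r)$ as $n \to \infty$, so tests using the F statistic are asymptotically valid more generally, in the sense that they will give similar results to tests based on formula 10.2. Furthermore, they probably will perform better in smaller samples.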
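The statistics in formulas 10.2 and 10.4 can be computed directly from the OLS output. The following is a minimal sketch on simulated data; the data, the restriction matrix `R` and the vector `c` are illustrative assumptions, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (purely illustrative): n observations, k = 3 regressors.
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.5, -0.5])
y = X @ beta + rng.normal(size=n)

# OLS estimates and the usual variance estimate e'e / (n - k).
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - k)

# Two linear restrictions R beta = c: beta_2 = 0.5 and beta_2 + beta_3 = 0.
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
c = np.array([0.5, 0.0])
r = R.shape[0]

d = R @ b - c
middle = np.linalg.inv(R @ XtX_inv @ R.T)
wald = d @ middle @ d / s2      # formula 10.2, compare with chi2(r)
F = wald / r                    # formula 10.4, compare with F(r, n - k)

print(f"Wald = {wald:.4f}, F = {F:.4f}")
```

The only difference between the two statistics is the division by $r$ and the reference distribution used for the critical value.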
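10.2.3 t tests
In the particular case where our test involves only one restriction, the test statistic can equivalently be formulated as a t-test. In these cases the $\mathbf{R}$ matrix is a row vector and the matrix $\mathbf{R}\hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ is a $1 \times 1$ matrix, i.e. a scalar. In fact this scalar is just the estimated variance of the linear combination $\mathbf{R}\hat{\boldsymbol{\beta}}$. We can therefore rewrite the $F(1, n-k)$ statistic given in formula 10.4 equivalently as
\[ F = \frac{\left(\mathbf{R}\hat{\boldsymbol{\beta}} - c\right)^2}{\widehat{\operatorname{var}}\left(\mathbf{R}\hat{\boldsymbol{\beta}}\right)} = \left(\frac{\mathbf{R}\hat{\boldsymbol{\beta}} - c}{\widehat{\operatorname{se}}\left(\mathbf{R}\hat{\boldsymbol{\beta}}\right)}\right)^2 = t^2 \]
where
\[ t = \frac{\mathbf{R}\hat{\boldsymbol{\beta}} - c}{\widehat{\operatorname{se}}\left(\mathbf{R}\hat{\boldsymbol{\beta}}\right)} \]
is the t-statistic associated with the test
\[ H_0: \mathbf{R}\boldsymbol{\beta} = c \]
against
\[ H_1: \mathbf{R}\boldsymbol{\beta} \neq c \]
Since the square of a $t$ variable with $n-k$ degrees of freedom has exactly the distribution of an $F(1, n-k)$ variable, these two (two-sided) tests are statistically and numerically equivalent.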
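The numerical equivalence $F = t^2$ for a single restriction is easy to confirm. The sketch below uses simulated data and a hypothetical restriction (the last coefficient equals zero); none of the numbers come from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - k)

# Single restriction: the coefficient on the last regressor equals zero.
R = np.array([[0.0, 0.0, 1.0]])
c = np.array([0.0])

# R s2 (X'X)^{-1} R' is 1 x 1, i.e. the estimated variance of R b.
est_var = (R @ (s2 * XtX_inv) @ R.T).item()
t_stat = ((R @ b - c) / np.sqrt(est_var)).item()

# F statistic from formula 10.4 with r = 1.
diff = R @ b - c
F_stat = diff @ np.linalg.inv(R @ XtX_inv @ R.T) @ diff / s2

print(f"t = {t_stat:.4f}, t^2 = {t_stat**2:.4f}, F = {F_stat:.4f}")
```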
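10.3 Likelihood ratio like tests
10.3.1 Asymptotic LR test
In order to implement this, we will initially consider the case where we assume normality of the errors. Assume also that we know $\sigma^2$ but need to estimate $\boldsymbol{\beta}$. Under the assumption of normality, the maximum likelihood estimator will be (as before) the least squares estimator $\hat{\boldsymbol{\beta}}$. Furthermore the log likelihood in this case will be given by
\[ \ln L_U = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\right)'\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\right)}{2\sigma^2} \]
Let $\hat{\boldsymbol{\beta}}_*$ be the estimator obtained under the restriction $\mathbf{R}\boldsymbol{\beta} = \mathbf{c}$. The restricted log-likelihood will be
\[ \ln L_R = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)'\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)}{2\sigma^2} \]
So the likelihood ratio statistic will be given by $2\left(\ln L_U - \ln L_R\right)$, i.e.
\[ LR = \frac{\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)'\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right) - \left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\right)'\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\right)}{\sigma^2} = \frac{\mathbf{e}_*'\mathbf{e}_* - \mathbf{e}'\mathbf{e}}{\sigma^2} \tag{10.5} \]
where $\mathbf{e}_*$ denotes the residuals from the restricted model. This is asymptotically distributed as $\chi^2(r)$. In fact we can show that under the assumption of normality, it will be precisely distributed as $\chi^2(r)$. We could operationalise this as a test statistic by substituting in a consistent estimator of $\sigma^2$.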
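10.3.2 Precise results: F test
We will generate the precise distribution of the LR statistic above, for the special case where the restrictions are null restrictions, i.e. where $r$ of the parameters have been set equal to zero (Davidson and MacKinnon 1993, pp.82-87). In section 10.3.3 we show that this is not, in fact, a restrictive assumption. In this special case our unrestricted model can be written as
\[ \mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \tag{10.6} \]
whereas the restricted model, under the hypothesis that $\boldsymbol{\beta}_2 = \mathbf{0}$, is given by
\[ \mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \boldsymbol{\varepsilon} \tag{10.7} \]
Estimating the restricted model by OLS we get
\[ \mathbf{e}_*'\mathbf{e}_* = \mathbf{y}'\mathbf{M}_1\mathbf{y} \tag{10.8} \]
By the FWL Theorem, we know that the residuals that we get from estimating model 10.6 are identical to the residuals that we would get if we first created the residuals $\mathbf{e}_1 = \mathbf{M}_1\mathbf{y}$ and the residuals $\mathbf{e}_2 = \mathbf{M}_1\mathbf{X}_2$ and then regressed $\mathbf{e}_1$ on $\mathbf{e}_2$. The latter regression can be written as
\[ \mathbf{M}_1\mathbf{y} = \mathbf{M}_1\mathbf{X}_2\boldsymbol{\beta}_2 + \mathbf{M}_1\boldsymbol{\varepsilon} \tag{10.9} \]
The projection matrix $\mathbf{P}_{\mathbf{M}_1\mathbf{X}_2}$ from this regression is given by
\[ \mathbf{P}_{\mathbf{M}_1\mathbf{X}_2} = \mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1 \]
and the residual making matrix $\mathbf{M}_{\mathbf{M}_1\mathbf{X}_2}$ is given by
\[ \mathbf{M}_{\mathbf{M}_1\mathbf{X}_2} = \mathbf{I} - \mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1 \]
The sum of squared residuals from regression 10.9 is given by
\[ \mathbf{e}'\mathbf{e} = \left(\mathbf{M}_1\mathbf{y}\right)'\mathbf{M}_{\mathbf{M}_1\mathbf{X}_2}\left(\mathbf{M}_1\mathbf{y}\right) = \mathbf{y}'\mathbf{M}_1\mathbf{y} - \mathbf{y}'\mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1\mathbf{y} \tag{10.10} \]
Substituting in equations 10.8 and 10.10 we see that
\[ \mathbf{e}_*'\mathbf{e}_* - \mathbf{e}'\mathbf{e} = \mathbf{y}'\mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1\mathbf{y} = \boldsymbol{\varepsilon}'\mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1\boldsymbol{\varepsilon} \tag{10.11} \]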
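where the last step is valid provided that the null hypothesis is true.
This expression is valid whether or not normality holds. Under the assumption of normal errors, the random vector
\[ \mathbf{v} = \mathbf{X}_2'\mathbf{M}_1\boldsymbol{\varepsilon} \]
is normally distributed with a mean of zero and covariance matrix
\[ \boldsymbol{\Sigma}_v = E(\mathbf{v}\mathbf{v}') = E\left[\mathbf{X}_2'\mathbf{M}_1\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\mathbf{M}_1\mathbf{X}_2\right] = \sigma^2\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2 \]
The right hand side of equation 10.11 is therefore of the form
\[ \sigma^2\mathbf{v}'\boldsymbol{\Sigma}_v^{-1}\mathbf{v} \]
So
\[ \frac{\mathbf{e}_*'\mathbf{e}_* - \mathbf{e}'\mathbf{e}}{\sigma^2} = \mathbf{v}'\boldsymbol{\Sigma}_v^{-1}\mathbf{v} \]
By one of the previous results we can conclude that
\[ \frac{\mathbf{e}_*'\mathbf{e}_* - \mathbf{e}'\mathbf{e}}{\sigma^2} \sim \chi^2(r) \]
where $r$ is the number of elements in $\mathbf{v}$.
We can turn this into an F statistic by using a consistent estimator of $\sigma^2$. As above (equation 10.3) we will use the fact that
\[ \frac{\mathbf{e}'\mathbf{e}}{\sigma^2} = \frac{1}{\sigma^2}\boldsymbol{\varepsilon}'\mathbf{M}_X\boldsymbol{\varepsilon} \]
is distributed as $\chi^2(n-k)$ so
\[ \frac{\left(\mathbf{e}_*'\mathbf{e}_* - \mathbf{e}'\mathbf{e}\right) / \left(\sigma^2 r\right)}{\mathbf{e}'\mathbf{e} / \left(\sigma^2(n-k)\right)} \]
has an $F(r, n-k)$ distribution provided that the chi-squared variables in the numerator and denominator are independent of each other. By a result on quadratic forms (given in the appendix to Chapter 8) they will be, provided that the product of $\mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1$ and $\mathbf{M}_X$ is $\mathbf{0}$. Now $\mathbf{M}_1\mathbf{M}_X = \mathbf{M}_X$, since the $\mathbf{X}_1$ variables are just a subset of the $\mathbf{X}$ variables, but $\mathbf{X}_2'\mathbf{M}_X = \mathbf{0}$ since $\mathbf{X}_2$ is also a subset of $\mathbf{X}$. Consequently $\mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1\mathbf{M}_X = \mathbf{0}$.
In short we find that
\[ \frac{\left(\mathbf{e}_*'\mathbf{e}_* - \mathbf{e}'\mathbf{e}\right) / r}{\mathbf{e}'\mathbf{e} / (n-k)} \sim F(r, n-k) \tag{10.12} \]
This result depends on the normality of the error terms. Nevertheless from equation 10.11 it looks as though the result should hold up asymptotically, provided that we can apply a central limit theorem to $n^{-1/2}\mathbf{X}_2'\mathbf{M}_1\boldsymbol{\varepsilon}$. Under the null hypothesis $E\left(n^{-1/2}\mathbf{X}_2'\mathbf{M}_1\boldsymbol{\varepsilon}\right) = \mathbf{0}$ and $\operatorname{var}\left(n^{-1/2}\mathbf{X}_2'\mathbf{M}_1\boldsymbol{\varepsilon}\right) = \frac{\sigma^2}{n}\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2$. Provided this is bounded above we can apply the Lindeberg-Feller central limit theorem to establish asymptotic normality.
It is interesting to note that the term $\mathbf{y}'\mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1\mathbf{y}$ (see equation 10.11) can be written as $\left\|\mathbf{P}_{\mathbf{M}_1\mathbf{X}_2}\mathbf{y}\right\|^2$, i.e. it is an explained sum of squares, $\hat{\mathbf{y}}'\hat{\mathbf{y}}$, from the regression of $\mathbf{e}_1$ on $\mathbf{e}_2$. It is the additional sum of squares that can be ascribed to $\mathbf{X}_2$ once all the effects of $\mathbf{X}_1$ have been stripped out. In short our F statistic can also be written as
\[ F = \frac{\left\|\mathbf{P}_{\mathbf{M}_1\mathbf{X}_2}\mathbf{y}\right\|^2}{r\hat{\sigma}^2} \tag{10.13} \]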
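The two ways of writing the F statistic (equations 10.12 and 10.13) can be compared numerically. This is a sketch on simulated data; the helper `resid_maker`, the dimensions and the parameter values are illustrative assumptions.

```python
import numpy as np

def resid_maker(A):
    """Residual-making matrix M_A = I - A (A'A)^{-1} A'."""
    return np.eye(A.shape[0]) - A @ np.linalg.inv(A.T @ A) @ A.T

rng = np.random.default_rng(3)
n, r = 50, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, r))
X = np.column_stack([X1, X2])
k = X.shape[1]
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)   # beta_2 = 0 holds

M1 = resid_maker(X1)
MX = resid_maker(X)

ssr_restricted = y @ M1 @ y        # equation 10.8
ssr_unrestricted = y @ MX @ y      # residual sum of squares of model 10.6

# Projection onto M1 X2, i.e. the FWL regression 10.9.
Z = M1 @ X2
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
explained = y @ P @ y              # ||P y||^2, since P is symmetric idempotent

sigma2_hat = ssr_unrestricted / (n - k)
F_ssr = ((ssr_restricted - ssr_unrestricted) / r) / (ssr_unrestricted / (n - k))
F_proj = explained / (r * sigma2_hat)

print(f"F (10.12) = {F_ssr:.4f}, F (10.13) = {F_proj:.4f}")
```

Forming the $n \times n$ matrices explicitly is only sensible for small illustrative samples like this one.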
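10.3.3 Restricted least squares: Reparameterising the model
Above we made the claim that all linear restrictions could be subsumed by the case of zero restrictions, provided that we reparameterise the model appropriately. The argument is very simple (Davidson and MacKinnon 1993, pp.16-19). The matrix $\mathbf{R}$ is $r \times k$ and of rank $r$. By a suitable reordering of the variables we can ensure that $\mathbf{R} = [\mathbf{R}_1\ \mathbf{R}_2]$ where $\mathbf{R}_1$ is an $r \times r$ matrix of full rank. Our null hypothesis
\[ \mathbf{R}\boldsymbol{\beta} = \mathbf{c} \]
can therefore be reformulated as
\[ [\mathbf{R}_1\ \mathbf{R}_2]\begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} = \mathbf{c} \]
\[ \mathbf{R}_1\boldsymbol{\beta}_1 + \mathbf{R}_2\boldsymbol{\beta}_2 = \mathbf{c} \]
i.e.
\[ \boldsymbol{\beta}_1 = \mathbf{R}_1^{-1}\mathbf{c} - \mathbf{R}_1^{-1}\mathbf{R}_2\boldsymbol{\beta}_2 \]
Our unrestricted model is
\[ \mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \tag{10.14} \]
Imposing the restriction we get:
\[ \mathbf{y} = \mathbf{X}_1\mathbf{R}_1^{-1}\mathbf{c} - \mathbf{X}_1\mathbf{R}_1^{-1}\mathbf{R}_2\boldsymbol{\beta}_2 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \]
\[ \mathbf{y} - \mathbf{X}_1\mathbf{R}_1^{-1}\mathbf{c} = \left(\mathbf{X}_2 - \mathbf{X}_1\mathbf{R}_1^{-1}\mathbf{R}_2\right)\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \]
Let $\mathbf{y}^* = \mathbf{y} - \mathbf{X}_1\mathbf{R}_1^{-1}\mathbf{c}$ and $\mathbf{Z}_2 = \mathbf{X}_2 - \mathbf{X}_1\mathbf{R}_1^{-1}\mathbf{R}_2$, then we could estimate the restricted model as:
\[ \mathbf{y}^* = \mathbf{Z}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \tag{10.15} \]
The simplest way to estimate a version of the unrestricted model is to run it as
\[ \mathbf{y}^* = \mathbf{X}_1\boldsymbol{\gamma}_1 + \mathbf{Z}_2\boldsymbol{\gamma}_2 + \boldsymbol{\varepsilon} \]
\[ \mathbf{y} = \mathbf{X}_1\left(\boldsymbol{\gamma}_1 + \mathbf{R}_1^{-1}\mathbf{c}\right) + \mathbf{Z}_2\boldsymbol{\gamma}_2 + \boldsymbol{\varepsilon} \tag{10.16} \]
The residuals from the last model will be equal to the residuals from the original model. Furthermore $\boldsymbol{\gamma}_2 = \boldsymbol{\beta}_2$ and the parameter $\boldsymbol{\gamma}_1$ will be equal to zero, if the null hypothesis is true. Consequently the model given in equation 10.16 has also precisely the same residual sum of squares and is in a form where the restriction can be tested by means of a zero restriction on $\boldsymbol{\gamma}_1$.
To show all this, in essence we just apply the fact that a linear transformation of the data transforms the estimates appropriately (see Chapter 7). In this case the transformation matrix is given by
\[ \mathbf{A} = \begin{bmatrix} \mathbf{I} & -\mathbf{R}_1^{-1}\mathbf{R}_2 \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \]
i.e. $\mathbf{Z} = \mathbf{X}\mathbf{A}$. The parameters of the transformed model are given by $\mathbf{A}^{-1}\boldsymbol{\beta}$. But
\[ \mathbf{A}^{-1} = \begin{bmatrix} \mathbf{I} & \mathbf{R}_1^{-1}\mathbf{R}_2 \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \]
which gives $\boldsymbol{\gamma}_1 + \mathbf{R}_1^{-1}\mathbf{c} = \boldsymbol{\beta}_1 + \mathbf{R}_1^{-1}\mathbf{R}_2\boldsymbol{\beta}_2$ and $\boldsymbol{\gamma}_2 = \boldsymbol{\beta}_2$. Now $\mathbf{R}_1\boldsymbol{\gamma}_1 + \mathbf{c} = \mathbf{R}_1\boldsymbol{\beta}_1 + \mathbf{R}_2\boldsymbol{\beta}_2$. Under the null hypothesis this is $\mathbf{c}$, which is possible only if $\mathbf{R}_1\boldsymbol{\gamma}_1 = \mathbf{0}$, i.e. $\boldsymbol{\gamma}_1 = \mathbf{0}$.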
The fact that the residual sum of squares is identical, whether we base our estimates on the original model (10.14) or the reparameterised one (10.16) proves that our discussion in the previous section carries through. Even in cases where linear restrictions are not zero restrictions, the formula given in equation 10.12 will still apply. Note this is not true of some other forms of the F test, based on the $R^2$ from the restricted and the unrestricted regressions. This is a good reason for not using those forms of the test.
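The reparameterisation can be tried out on a small example. The sketch below assumes a hypothetical model with three coefficients and the single restriction $\beta_1 + \beta_2 = 1$, so that $\mathbf{R}_1 = [1]$ and $\mathbf{R}_2 = [1\ \ 0]$; the data are simulated and the helper `ols_ssr` is not from the notes.

```python
import numpy as np

def ols_ssr(A, v):
    """Return OLS coefficients and the residual sum of squares."""
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    resid = v - A @ coef
    return coef, resid @ resid

rng = np.random.default_rng(4)
n = 80
X1 = rng.normal(size=(n, 1))                              # n x r block
X2 = np.column_stack([rng.normal(size=n), np.ones(n)])    # remaining k - r columns
X = np.column_stack([X1, X2])
y = X @ np.array([0.6, 0.4, 2.0]) + rng.normal(size=n)

# Restriction R beta = c with R = [R1 R2]: beta_1 + beta_2 = 1.
R1 = np.array([[1.0]])
R2 = np.array([[1.0, 0.0]])
c = np.array([1.0])
R1_inv = np.linalg.inv(R1)

y_star = y - X1 @ R1_inv @ c          # y* = y - X1 R1^{-1} c
Z2 = X2 - X1 @ R1_inv @ R2            # Z2 = X2 - X1 R1^{-1} R2

_, ssr_restricted = ols_ssr(Z2, y_star)                           # model 10.15
gamma, ssr_reparam = ols_ssr(np.column_stack([X1, Z2]), y_star)   # y* on X1 and Z2
beta_hat, ssr_original = ols_ssr(X, y)                            # model 10.14

print(ssr_original, ssr_reparam, ssr_restricted)
print(gamma, beta_hat)
```

The reparameterised and original regressions share the same residual sum of squares, and the restriction becomes a zero restriction on the first element of `gamma`.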
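10.4 LM type tests
Lagrange multiplier (or score) tests are based on estimating the restricted model. In other words the DSP is now characterised by a restriction on the parameter space given by $\mathbf{R}\boldsymbol{\beta} = \mathbf{c}$.
The problem therefore is how to minimise the residual sum of squares, subject to the restrictions $\mathbf{R}\boldsymbol{\beta} = \mathbf{c}$. We can set up the Lagrangian
\[ \mathcal{L} = \frac{1}{2}\left(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\right)'\left(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\right) + \left(\mathbf{R}\boldsymbol{\beta} - \mathbf{c}\right)'\boldsymbol{\lambda} \]
where we have multiplied by $\frac{1}{2}$ in order to simplify the algebra later on. The first order conditions are
\[ -\mathbf{X}'\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right) + \mathbf{R}'\hat{\boldsymbol{\lambda}} = \mathbf{0} \tag{10.17} \]
\[ \mathbf{R}\hat{\boldsymbol{\beta}}_* - \mathbf{c} = \mathbf{0} \tag{10.18} \]
From equation 10.17 we see that
\[ \mathbf{R}'\hat{\boldsymbol{\lambda}} = \mathbf{X}'\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right) = \mathbf{X}'\mathbf{e}_* \]
It is clear that if the restriction is valid, the term on the right hand side (scaled by $1/n$) should asymptotically converge to zero. It is also plausible that we should be able to apply some central limit theorem to this vector to show that it is asymptotically normal with covariance matrix $\sigma^2\mathbf{X}'\mathbf{X}$. We should therefore be able to base a test of the hypothesis that $\boldsymbol{\lambda}$ is zero on the statistic
\[ LM = \frac{1}{\tilde{\sigma}^2}\hat{\boldsymbol{\lambda}}'\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\hat{\boldsymbol{\lambda}} \]
where $\tilde{\sigma}^2 = \frac{\mathbf{e}_*'\mathbf{e}_*}{n}$ is the estimate of the common variance given the restricted model. Equivalently we can use the fact that $\mathbf{R}'\hat{\boldsymbol{\lambda}} = \mathbf{X}'\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)$ to write the statistic as
\[ LM = \frac{1}{\tilde{\sigma}^2}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right) \tag{10.19} \]
This is the score form of the LM statistic. This is the more common form in which this test is implemented, since we will generally not be estimating the restricted model by means of Lagrange multipliers. Instead we will often estimate it by imposing the constraints in the way that we sketched out in section 10.3.3. Note that the score form is equivalent to
\[ LM = \frac{1}{\tilde{\sigma}^2}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)'\mathbf{P}_X\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right) \]
where $\mathbf{P}_X$ is the projection matrix $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, i.e. $\frac{1}{\tilde{\sigma}^2}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)'\mathbf{P}_X\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)$ is the explained sum of squares from the artificial linear regression
\[ \frac{1}{\tilde{\sigma}}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right) = \mathbf{X}\mathbf{b} + \mathbf{u} \tag{10.20} \]
The residuals from the restricted regression are standardised (by dividing through by the estimated standard error of the restricted regression) and then regressed on the full set of explanatory variables (footnote 1). If the restriction is valid the explained sum of squares should be small. As Davidson and MacKinnon (1993) note, LM tests can almost always be calculated by means of artificial regressions.
The argument that we have produced here did not formally invoke the gradient of the log-likelihood, although that is how we defined LM tests in Chapter 5. In the case of the normal linear regression model the two will coincide if we fix $\sigma^2$ initially. The gradient vector will then just be given by $\frac{1}{\sigma^2}\mathbf{X}'\mathbf{e}_*$ and the information matrix by $\frac{1}{\sigma^2}\mathbf{X}'\mathbf{X}$. The statistic will be identical. Given normality the score form of the LM statistic will be distributed as $\chi^2$ even in small samples.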
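The score form of the LM statistic and its artificial-regression version can be compared on simulated data. This sketch assumes zero restrictions on a block of coefficients and uses the restricted variance estimate with divisor $n$; the data and dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2))
X = np.column_stack([X1, X2])
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)

# Restricted model: the coefficients on X2 are zero, so regress y on X1 only.
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
e_r = y - X1 @ b1                     # restricted residuals
sigma2_tilde = e_r @ e_r / n          # restricted variance estimate

# Score form: e*' X (X'X)^{-1} X' e* / sigma2_tilde.
score = X.T @ e_r
LM_score = score @ np.linalg.inv(X.T @ X) @ score / sigma2_tilde

# Artificial regression of the restricted residuals on the full X.
g, *_ = np.linalg.lstsq(X, e_r, rcond=None)
fitted = X @ g
R2_uncentred = (fitted @ fitted) / (e_r @ e_r)
LM_nR2 = n * R2_uncentred

print(f"LM (score form) = {LM_score:.4f}, n R^2 = {LM_nR2:.4f}")
```

Because the restricted regression here contains a constant, the restricted residuals have mean zero and the uncentred and centred $R^2$ coincide.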
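10.5 Equivalence between tests
In Chapter 5 we noted that the Wald, LR and LM tests are asymptotically equivalent. We will show this for the current model in the context of the hypothesis $\boldsymbol{\beta}_2 = \mathbf{0}$ in the model 10.6 (Davidson and MacKinnon 1993, pp.93-4). We saw in section 10.3.2 equation 10.13 that the LR F statistic is equivalent to
\[ \frac{\left\|\mathbf{P}_{\mathbf{M}_1\mathbf{X}_2}\mathbf{y}\right\|^2}{r\hat{\sigma}^2} \]
If we test the hypothesis by means of a Wald test, the test statistic will be based on $\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{c}$. For the particular hypothesis that we are considering this turns out to be particularly simple, i.e. $\hat{\boldsymbol{\beta}}_2$. By the FWL theorem we know that this is
\[ \hat{\boldsymbol{\beta}}_2 = \left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1\mathbf{y} \tag{10.21} \]
Furthermore the $\mathbf{R}$ matrix is equal to $[\mathbf{0}\ \ \mathbf{I}_r]$. Thus the term $\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ which goes into the Wald statistic and gives the covariance matrix of the test statistic is just the lower right $r \times r$ block of the $(\mathbf{X}'\mathbf{X})^{-1}$ matrix. We can easily find this (e.g. directly by looking at formula 10.21). It is given by $\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}$. The Wald statistic is
\[
\begin{aligned}
W &= \frac{1}{\hat{\sigma}^2}\hat{\boldsymbol{\beta}}_2'\left[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\right]^{-1}\hat{\boldsymbol{\beta}}_2 \\
&= \frac{1}{\hat{\sigma}^2}\mathbf{y}'\mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\left[\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\right]^{-1}\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1\mathbf{y} \\
&= \frac{1}{\hat{\sigma}^2}\mathbf{y}'\mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1\mathbf{y} \\
&= \frac{\left\|\mathbf{P}_{\mathbf{M}_1\mathbf{X}_2}\mathbf{y}\right\|^2}{\hat{\sigma}^2}
\end{aligned}
\]
So if we use the Wald form of the test (equation 10.2) then the only difference between the LR formulation and the Wald is that the former uses an F test whereas the latter does so by means of a $\chi^2$ test. In the F form of the Wald type test the two statistics are precisely equivalent.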
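Footnote 1: Alternatively, we might note that
\[ \frac{1}{\tilde{\sigma}^2}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)'\mathbf{P}_X\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right) = n\frac{\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)'\mathbf{P}_X\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_*\right)}{\mathbf{e}_*'\mathbf{e}_*} = nR^2 \]
where $R^2$ is the usual $R^2$ in the artificial regression of $\mathbf{e}_*$ on $\mathbf{X}$ (Wooldridge 2002, p.58) provided that the restricted regression includes a constant. If it does not, then we would use the uncentred $R^2$.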
As far as the LM statistic is concerned, we observed that the LM statistic is the explained sum of squares from the artificial regression 10.20 which in this case is
\[ \frac{1}{\tilde{\sigma}}\mathbf{M}_1\mathbf{y} = \mathbf{X}_1\mathbf{b}_1 + \mathbf{X}_2\mathbf{b}_2 + \mathbf{u} \]
Since $\mathbf{M}_1\mathbf{X}_1 = \mathbf{0}$, the explained sum of squares from this regression must be precisely equal to the explained sum of squares from the regression
\[ \frac{1}{\tilde{\sigma}}\mathbf{M}_1\mathbf{y} = \mathbf{M}_1\mathbf{X}_2\mathbf{b}_2 + \mathbf{u} \]
This is because the residuals are identical, i.e. the residual sum of squares is identical, and the total sum of squares is identical. The explained sum of squares from the latter regression is:
\[ \frac{1}{\tilde{\sigma}^2}\mathbf{y}'\mathbf{M}_1\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{M}_1\mathbf{y} = \frac{1}{\tilde{\sigma}^2}\left\|\mathbf{P}_{\mathbf{M}_1\mathbf{X}_2}\mathbf{y}\right\|^2 \]
So the only difference between the LM statistic and the other two statistics is that the former uses $\tilde{\sigma}^2$, i.e. the estimate of the variance is based on the restricted regression, whereas in both the other two cases it is based on the unrestricted regression. Of course if the null hypothesis is true then in large samples these should give very similar quantities.
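The common numerator of the three statistics can be made explicit in code. The sketch below uses simulated data and the zero restriction on a block of coefficients; it assumes the unrestricted variance estimate with divisor $n - k$ and the restricted one with divisor $n$. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, r = 120, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, r))
X = np.column_stack([X1, X2])
k = X.shape[1]
y = X1 @ np.array([1.0, -0.5]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b                                   # unrestricted residuals
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
e_r = y - X1 @ b1                               # restricted residuals

Q = e_r @ e_r - e @ e                           # common numerator ||P y||^2
sigma2_hat = e @ e / (n - k)                    # unrestricted variance estimate
sigma2_tilde = e_r @ e_r / n                    # restricted variance estimate

wald = Q / sigma2_hat                           # chi-squared form
F = wald / r                                    # F form (Wald and LR type)
LM = Q / sigma2_tilde                           # LM statistic

# Direct Wald computation from the estimate of beta_2.
b2 = b[-r:]
V22 = np.linalg.inv(X.T @ X)[-r:, -r:]          # lower right block of (X'X)^{-1}
wald_direct = b2 @ np.linalg.inv(V22) @ b2 / sigma2_hat

print(f"Wald = {wald:.4f}, F = {F:.4f}, LM = {LM:.4f}")
```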
10.6 Non-linear transformations of the estimators: the delta method
Thus far we have looked at tests of linear restrictions on the parameters. We may, however, wish to investigate also nonlinear functions of the estimators. This turns out to be fairly straightforward. In this case we will write our set of hypotheses in the form
\[ \mathbf{R}(\boldsymbol{\beta}) = \mathbf{c} \]
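where $\mathbf{R}$ is a set of functions. An example might be
\[ \begin{bmatrix} \beta_1\beta_2 \\ \dfrac{\beta_1}{\beta_2 + \beta_3} \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \]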
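The $\mathbf{R}$ transformation defines a new set of random variables which we might call $\boldsymbol{\theta}$, i.e. $\boldsymbol{\theta} = \mathbf{R}(\boldsymbol{\beta})$. The obvious way of estimating $\boldsymbol{\theta}$ given our data is as $\hat{\boldsymbol{\theta}} = \mathbf{R}(\hat{\boldsymbol{\beta}})$. In the particular case where $\hat{\boldsymbol{\beta}}$ is the ML estimator of $\boldsymbol{\beta}$, $\hat{\boldsymbol{\theta}}$ would be the MLE of $\boldsymbol{\theta}$.
We can show more generally that $\hat{\boldsymbol{\theta}}$ will be a consistent estimator, by taking a first order Taylor expansion of the nonlinear function $\mathbf{R}$ around the true parameter value $\boldsymbol{\theta} = \mathbf{R}(\boldsymbol{\beta})$, i.e.
\[ \mathbf{R}(\hat{\boldsymbol{\beta}}) = \mathbf{R}(\boldsymbol{\beta}) + \frac{\partial\mathbf{R}(\boldsymbol{\beta}^*)}{\partial\boldsymbol{\beta}'}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \tag{10.22} \]
where $\boldsymbol{\beta}^*$ is a convex combination of $\hat{\boldsymbol{\beta}}$ and $\boldsymbol{\beta}$ (perhaps different for each row of the vector equation). This is just a straightforward application of the multivariate version of the mean value theorem (Simon and Blume 1994, p.826). As $n \to \infty$, by the consistency of $\hat{\boldsymbol{\beta}}$ we have $\hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta}$ and hence $\boldsymbol{\beta}^* \xrightarrow{p} \boldsymbol{\beta}$. Provided that the Jacobian of the transformation $\frac{\partial\mathbf{R}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}'}$ is well behaved near $\boldsymbol{\beta}$, it is obvious that $\mathbf{R}(\hat{\boldsymbol{\beta}}) \xrightarrow{p} \mathbf{R}(\boldsymbol{\beta})$.
Writing the above expansion as the linear approximation
\[ \mathbf{R}(\hat{\boldsymbol{\beta}}) \approx \mathbf{R}(\boldsymbol{\beta}) + \frac{\partial\mathbf{R}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}'}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \tag{10.23} \]
we get
\[ E(\hat{\boldsymbol{\theta}}) \approx \boldsymbol{\theta} \quad\text{and}\quad \operatorname{var}(\hat{\boldsymbol{\theta}}) \approx \frac{\partial\mathbf{R}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}'}\operatorname{var}(\hat{\boldsymbol{\beta}})\left(\frac{\partial\mathbf{R}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}'}\right)' \]
An approximate Wald test of the hypothesis
\[ \boldsymbol{\theta} = \mathbf{c} \]
is given by the Wald statistic
\[ \left(\hat{\boldsymbol{\theta}} - \mathbf{c}\right)'\left[\widehat{\operatorname{var}}(\hat{\boldsymbol{\theta}})\right]^{-1}\left(\hat{\boldsymbol{\theta}} - \mathbf{c}\right) \]
So we can test the nonlinear restriction
\[ \mathbf{R}(\boldsymbol{\beta}) = \mathbf{c} \]
with the Wald statistic
\[ \left(\mathbf{R}(\hat{\boldsymbol{\beta}}) - \mathbf{c}\right)'\left[\frac{\partial\mathbf{R}(\hat{\boldsymbol{\beta}})}{\partial\boldsymbol{\beta}'}\widehat{\operatorname{var}}(\hat{\boldsymbol{\beta}})\left(\frac{\partial\mathbf{R}(\hat{\boldsymbol{\beta}})}{\partial\boldsymbol{\beta}'}\right)'\right]^{-1}\left(\mathbf{R}(\hat{\boldsymbol{\beta}}) - \mathbf{c}\right) \]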
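For the example given above, we have
\[ \frac{\partial\mathbf{R}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}'} = \begin{bmatrix} \beta_2 & \beta_1 & 0 \\ \dfrac{1}{\beta_2 + \beta_3} & -\beta_1\left(\dfrac{1}{\beta_2 + \beta_3}\right)^2 & -\beta_1\left(\dfrac{1}{\beta_2 + \beta_3}\right)^2 \end{bmatrix} \]
10.7 Nonlinear relationships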
We have already seen that many nonlinear relationships can be turned into linear ones. The Cobb-Douglas production function
\[ Y = A K^{\beta_2} L^{\beta_3} e^{\varepsilon} \]
becomes linear when we take logs
\[ \ln Y = \beta_1 + \beta_2\ln K + \beta_3\ln L + \varepsilon \]
Note that in this transformation we have lost the original parameter $A$. The transformed model is linear in the parameters $\beta_1$, $\beta_2$, $\beta_3$ where $\beta_1 = \ln A$.
Definition 10.1 In the classical linear regression model, if the $k$ parameters $\beta_1, \beta_2, \ldots, \beta_k$ can be written as $k$ one-to-one, possibly nonlinear functions of a set of $k$ underlying parameters $\theta_1, \theta_2, \ldots, \theta_k$, then the model is intrinsically linear in $\boldsymbol{\theta}$.
The important point here is that the functions have to be one-to-one. In this case we can retrieve the original parameters $\theta_1, \ldots, \theta_k$ after estimating the regression coefficients through the appropriate inverse transformation. Since our parameter estimates $\hat{\theta}_1, \ldots, \hat{\theta}_k$ will be (possibly) nonlinear functions of our estimates $\hat{\beta}_1, \ldots, \hat{\beta}_k$ we may need to use the delta method to test hypotheses on the original parameter values.
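As a concrete illustration of this last step, the sketch below recovers the scale parameter of a Cobb-Douglas function from the log-linear regression and attaches a delta-method standard error to it. Everything here is assumed for illustration: the data are simulated, and the names `A_true`, `lnK`, `lnL` and the function `theta` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
lnK = rng.normal(3.0, 0.5, size=n)
lnL = rng.normal(2.0, 0.5, size=n)
A_true, b2_true, b3_true = 2.0, 0.3, 0.6
lnY = np.log(A_true) + b2_true * lnK + b3_true * lnL + rng.normal(0.0, 0.1, size=n)

# OLS on the log-linear model and the usual covariance estimate.
X = np.column_stack([np.ones(n), lnK, lnL])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ lnY
e = lnY - X @ b
V = (e @ e / (n - X.shape[1])) * XtX_inv

def theta(beta):
    """Map the regression coefficients back to (A, beta_2, beta_3)."""
    return np.array([np.exp(beta[0]), beta[1], beta[2]])

# Analytic Jacobian of theta with respect to beta', evaluated at the estimates.
J = np.diag([np.exp(b[0]), 1.0, 1.0])
V_theta = J @ V @ J.T                 # delta-method covariance matrix

A_hat = theta(b)[0]
se_A = np.sqrt(V_theta[0, 0])
print(f"A_hat = {A_hat:.4f}, delta-method s.e. = {se_A:.4f}")
```

In this simple case the delta-method standard error reduces to $e^{\hat{\beta}_1}$ times the standard error of $\hat{\beta}_1$, because the transformation involves only one coefficient.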
Other examples of intrinsically linear regressions are polynomial regression models and the semi-log model (such as the Mincerian wage regression above). Not all models of the form
\[ y = \beta_1(\boldsymbol{\theta})x_1 + \cdots + \beta_k(\boldsymbol{\theta})x_k + \varepsilon \]
are intrinsically linear. The relationship
\[ y = \alpha + \beta x_1 + \gamma x_2 + \beta\gamma x_3 + \varepsilon \]
is not intrinsically linear, since the relationship between the parameter vector $(\alpha, \beta, \gamma)'$ and the regression coefficients $(\beta_1, \beta_2, \beta_3, \beta_4)'$ is not one-to-one.
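10.8 Prediction
There are two kinds of predictions that we might make: we might wish to estimate $E(y_0|\mathbf{x}_0)$ or we might wish to predict the observation $y_0$ itself. By the Gauss-Markov theorem
\[ \hat{y}_0 = \mathbf{x}_0\hat{\boldsymbol{\beta}} \]
where $\mathbf{x}_0$ is a row vector of observations, is the minimum variance linear unbiased estimator of $E(y_0|\mathbf{x}_0)$. We have
\[ E(\hat{y}_0) = \mathbf{x}_0E(\hat{\boldsymbol{\beta}}) = \mathbf{x}_0\boldsymbol{\beta} \]
This means that
\[ \hat{y}_0 = \mathbf{x}_0\boldsymbol{\beta} + \mathbf{x}_0\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \]
The covariance matrix of this estimator follows easily
\[ \operatorname{var}(\hat{y}_0) = \sigma^2\mathbf{x}_0(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0' \]
When viewed as a prediction of an individual observation, the prediction error is
\[ e_0 = y_0 - \hat{y}_0 = \mathbf{x}_0\boldsymbol{\beta} + \varepsilon_0 - \mathbf{x}_0\hat{\boldsymbol{\beta}} = \varepsilon_0 - \mathbf{x}_0\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \]
and its variance is given by
\[ \operatorname{var}(e_0) = \sigma^2 + \sigma^2\mathbf{x}_0(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0' \]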
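The two variances can be computed side by side. The sketch below uses simulated data and replaces the unknown $\sigma^2$ with the estimate $\mathbf{e}'\mathbf{e}/(n-k)$, which is an assumption of the example rather than part of the formulas above; the function name `prediction_variances` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 30
x = rng.uniform(0.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - X.shape[1])          # estimate used in place of sigma^2

def prediction_variances(x0):
    """Estimated variances of the fitted mean and of the forecast error at x0."""
    x0 = np.asarray(x0, dtype=float)
    quad = x0 @ XtX_inv @ x0           # x0 (X'X)^{-1} x0'
    var_mean = s2 * quad               # variance of x0 b as an estimate of E(y0 | x0)
    var_forecast = s2 + var_mean       # variance of the error y0 - x0 b
    return var_mean, var_forecast

# Predict at the sample mean of x, where x0 (X'X)^{-1} x0' equals 1/n.
x0 = np.array([1.0, x.mean()])
y0_hat = x0 @ b
var_mean, var_forecast = prediction_variances(x0)
print(f"y0_hat = {y0_hat:.4f}, var(mean) = {var_mean:.4f}, var(forecast error) = {var_forecast:.4f}")
```

The forecast-error variance never falls below the error variance itself, however large the sample, whereas the variance of the estimated mean shrinks with $n$.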
10.9 Exercises
1. You are given the following structural model of income determination:
\[ \log w = \beta_1 + \beta_2\,\text{highed} + \beta_3\,\text{years} + \beta_4\,\text{years}^2 + \beta_5\,\text{African} + \varepsilon \]
where $w$ is the wage, highed is the number of years of education completed, years is the number of years experience and African is a dummy variable equal to one if the person is a black South African. The stochastic error term is assumed to be homoscedastic and normally distributed.
You are given the following regression estimates of this model:
      Source |       SS       df       MS              Number of obs =    6072
-------------+------------------------------           F(  4,  6067) = 1300.33
       Model |  3652.7455     4  913.186376            Prob > F      =  0.0000
    Residual | 4260.70023  6067  .702274638            R-squared     =  0.4616
-------------+------------------------------           Adj R-squared =  0.4612
       Total | 7913.44573  6071  1.30348307            Root MSE      =  .83802

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      highed |   .1435481   .0030851    46.53   0.000     .1375003    .1495959
       years |   .0545642   .0033909    16.09   0.000     .0479169    .0612115
      years2 |  -.0007096   .0000626   -11.33   0.000    -.0008324   -.0005868
     African |  -.8052136   .0244181   -32.98   0.000    -.8530819   -.7573454
       _cons |   5.504426   .0580638    94.80   0.000       5.3906    5.618252
------------------------------------------------------------------------------
(a) Interpret the coefficients on highed and African.
(b) You would like to predict the turning point of the relationship between experience
and (expected log) wages. Generate a consistent estimate of this turning point.
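(c) Assume that the covariance matrix of the estimators (ordered as $\hat{\beta}_2$, $\hat{\beta}_3$, $\hat{\beta}_4$, $\hat{\beta}_5$, $\hat{\beta}_1$) is given by:
\[
\begin{bmatrix}
9.5178\times10^{-6} & 4.1845\times10^{-6} & 3.8625\times10^{-8} & 2.26\times10^{-5} & 8.9566\times10^{-6} \\
4.1845\times10^{-6} & 1.1498\times10^{-5} & 1.2736\times10^{-7} & 4.1399\times10^{-6} & 9.8444\times10^{-6} \\
3.8625\times10^{-8} & 1.2736\times10^{-7} & 3.9188\times10^{-9} & 1.2229\times10^{-7} & 3.6348\times10^{-7} \\
2.26\times10^{-5} & 4.1399\times10^{-6} & 1.2229\times10^{-7} & 5.9624\times10^{-4} & 1.4178\times10^{-4} \\
8.9566\times10^{-6} & 9.8444\times10^{-6} & 3.6348\times10^{-7} & 1.4178\times10^{-4} & 3.3714\times10^{-3}
\end{bmatrix}
\]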
Now generate standard errors for the turning point by means of the delta method.
(d) Test for the joint significance of the years and years$^2$ variables.
(e) The coefficient of the dummy variable might be interpreted as an indication of discrimination. A researcher suggests the following alternative estimators:
• The coefficient $\gamma_2$ in the regression
\[ \mathbf{e}_{\log w|Z} = \gamma_1 + \gamma_2\mathbf{e}_{African|Z} + \mathbf{u}_1 \]
• The coefficient $\delta_2$ in the regression
\[ \log \mathbf{w} = \delta_1 + \delta_2\mathbf{e}_{African|Z} + \mathbf{u}_2 \]
where:
• $\mathbf{e}_{\log w|Z}$ is the vector of residuals in the regression of log wages on highed, years, years$^2$ and a constant.
• $\mathbf{e}_{African|Z}$ is the vector of residuals in the regression of the dummy variable African on highed, years, years$^2$ and a constant.
and the $\mathbf{u}_1$ and $\mathbf{u}_2$ terms are errors. Discuss the relationship between the original estimator and these alternatives.
2. You are trying to estimate a demand function for chocolate using annual time series data on the period 1960 to 1994 (inclusive). You have information on the following variables:
$\ln C$     The natural log of chocolate consumption (in R100 million)
$\ln M$     The natural log of disposable income (in R100 million)
$\ln p_1$   The natural log of the price of chocolate
$\ln p_2$   The natural log of the price of sweets
You have also created the following variables:
lnmp2  $= \ln(M/p_2)$ where $M$ is disposable income and $p_2$ the price of sweets
lnp1p2 $= \ln(p_1/p_2)$ where $p_1$ is the price of chocolate and $p_2$ the price of sweets
You have run several regressions, together with some diagnostics. The (edited) Stata output
is as follows:
Regression A
. regress lncons lnm lp1 lp2

  Source |       SS       df       MS                  Number of obs =      35
---------+------------------------------               F(  3,    31) =
   Model | 6.93935951                                  Prob > F      =
Residual |                                             R-squared     =
---------+------------------------------               Adj R-squared =
   Total | 7.02610983    34  .206650289                Root MSE      =

------------------------------------------------------------------------------
  lncons |      Coef.   Std. Err.       t     P>|t|
---------+--------------------------------------------------------------------
     lnm |   1.273924   .0744083
     lp1 |   .6665821   .1901708
     lp2 |  -1.614593   .2645366
   _cons |   .4053666   .6017865
------------------------------------------------------------------------------

. vif

Variable |      VIF       1/VIF
---------+----------------------
     lp1 |     47.35    0.021118
     lp2 |     45.21    0.022118
     lnm |     14.66    0.068208
---------+----------------------
Mean VIF |     35.74
Regression B
. regress lncons lnmp2 lnp1p2

  Source |       SS       df       MS                  Number of obs =      35
---------+------------------------------               F(  2,    32) =  925.47
   Model | 6.90670345     2  3.45335172                Prob > F      =  0.0000
Residual | .119406383    32  .003731449                R-squared     =  0.9830
---------+------------------------------               Adj R-squared =  0.9819
   Total | 7.02610983    34  .206650289                Root MSE      =  .06109

------------------------------------------------------------------------------
  lncons |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
   lnmp2 |   1.397402   .0751029    18.606   0.000      1.244422    1.550381
  lnp1p2 |   1.044132   .1787048     5.843   0.000      .6801216    1.408141
   _cons |   2.030167   .4257199     4.769   0.000      1.163004     2.89733
------------------------------------------------------------------------------

. vif

Variable |      VIF       1/VIF
---------+----------------------
   lnmp2 |      3.29    0.303518
  lnp1p2 |      3.29    0.303518
---------+----------------------
Mean VIF |      3.29
(a) Interpret the coefficients in Regression A.
(b) Which of the coefficients in Regression A are statistically significant?
(c) Test the significance of the regression as a whole.
(d) Test the following hypotheses, using Regression A (at the 5% level)
i. chocolate is a luxury good
ii. sweets are a substitute good for chocolate
(e) Test the following hypothesis (in the main regression):
\[ H_0: \beta_2 + \beta_3 + \beta_4 = 0 \]
where model A is written as
\[ \ln C = \beta_1 + \beta_2\ln M + \beta_3\ln p_1 + \beta_4\ln p_2 + \varepsilon \]
(f) Do you detect evidence of multicollinearity in these data? If yes, what might be the cause and what corrective measures might you take?
(g) Comment on all the results. How might you improve this research?
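3. You are given the model
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]
where
\[ f(\varepsilon_i|\mathbf{X}) = \begin{cases} \dfrac{1}{2a} & \text{if } |\varepsilon_i| \le a \\ 0 & \text{elsewhere} \end{cases} \]
and where $a$ is some positive constant. You are also given the following matrices: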
22191 00186
1
( 0 ) =
00186 00024
1864
0 =
19396
and told that = 15 and e0 e =18053
(a) Do the assumptions of the classical linear regression model hold in this case? Explain.
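(b) What is $E(\boldsymbol{\varepsilon}|\mathbf{X})$?
(c) What is $\operatorname{var}(\boldsymbol{\varepsilon}|\mathbf{X})$?
(d) Estimate $\boldsymbol{\beta}$ by OLS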
(e) Estimate
(f) Test the joint hypothesis $\beta_1 = 0$ and $\beta_2 = 0$ by means of the appropriate test.
4. You are given the following description of a DSP:
\[ y_i = \beta_1 + \beta_2 x_i + \varepsilon_i \]
with each independently and identically distributed as ( ) for some positive real
number . You have a sample of size 1000 from this distribution. You have the following information on the sample: $\sum x_i = 2000$, $\sum x_i^2 = 9000$, $\sum y_i = 5000$, $\sum x_iy_i = 8000$
(a) What is $E(\varepsilon_i|x_i)$?
(b) What is $\operatorname{var}(\varepsilon_i|x_i)$?
(c) What would the $(\mathbf{X}'\mathbf{X})$ matrix look like in this instance?
(d) Estimate $\beta_1$ and $\beta_2$ by OLS.
(e) What would be the (true) covariance matrix of the OLS estimators?
(f) Comment briefly on the properties of the OLS estimators.
For the remainder of this question assume that $\mathbf{e}'\mathbf{e} = 300\,000$ where $\mathbf{e}$ is the vector of OLS residuals.
(g) What would be your estimate of ?
(h) Test the statistical significance of each regression coefficient. Are these tests valid?
(i) Test the joint hypothesis $H_0: \beta_1 = 0$, $\beta_2 = 1$
i. By means of a Wald $\chi^2$ test. Is this test valid?
ii. By means of an $F$ test. Is this test valid?
(j) Assume that $y$ is the exchange rate and $x$ is purchasing power parity. Does this formulation of the DSP make economic sense?
(k) Interpret the OLS coefficients in the light of this information.
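5. You think that the process by which wages are set is given by
\[ \log w = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{exper} + \beta_3\,\text{exper}^2 + \varepsilon \]
where $w$ is the wage rate, educ is the highest level of education obtained and exper is potential experience (in years).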
You have the following Stata output:
. reg logw educ exper exper2

      Source |       SS       df       MS              Number of obs =   22486
-------------+------------------------------           F(  3, 22482) = 4401.49
       Model | 12296.5615     3  4098.85383            Prob > F      =  0.0000
    Residual | 20936.2047 22482  .931242981            R-squared     =  0.3700
-------------+------------------------------           Adj R-squared =  0.3699
       Total | 33232.7662 22485  1.47799716            Root MSE      =  .96501

------------------------------------------------------------------------------
        logw |      Coef.   Std. Err.       t    P>|t|
-------------+----------------------------------------------------------------
        educ |   .2002941
       exper |   .0480962
      exper2 |  -.0004388
       _cons |   4.622752
------------------------------------------------------------------------------

. matrix list e(V)

symmetric e(V)[4,4]
              educ       exper      exper2       _cons
  educ   3.309e-06
 exper   4.131e-07   2.918e-06
exper2   3.425e-09  -4.787e-08   8.913e-10
 _cons  -.0000414  -.00003776   4.509e-07   .00097137
(a) Interpret the regression coefficients
(b) Which of these coefficients are significant at the five percent level?
(c) Estimate the turning point in the relationship between experience and (expected) log wages.
(d) Obtain an estimate of the standard error of the estimator used in (c) by means of the delta method.
(e) Construct a 95% confidence interval for the turning point calculated in (c).
(f) Test the joint hypothesis that the coefficients of experience and experience$^2$ are both zero, by means of a Wald Test.
(g) You know that your regression has an omitted variable. You expect ability to have an independent effect on (log) wages. You suspect that the true coefficient of ability (as measured by IQ) in a multiple regression (controlling for education, experience, experience$^2$ and a constant) is 0.05. Additionally you assume that the true relationship between ability and all the other variables is given by:
() = 90 + 2
(note: this is a statistical relationship, not a causal one.) Discuss how this omitted variable affects the estimates obtained from the empirical regression.
(h) Assume that you have a proxy variable for ability in the form of an IQ test conducted
when the individual was still at school. Under which circumstances would this proxy
variable deal with the omitted variable bias?
Chapter 11
Generalised Regression Models

Based largely on Mittelhammer et al. (2000, Chapter 14 and 15).
11.1 Estimation with a known general noise covariance matrix $\sigma^2\Psi$
Thus far we have assumed that the error process was i.i.d. In this chapter we will relax this
assumption. We will first consider the case where we know the structure of the covariance
matrix of the error terms. Although this is hardly ever the case it will enable us to discuss the
implications for OLS estimation and establish a set of useful benchmarks. In the next chapter
we will turn to the case where the covariance matrix is unknown.
11.1.1 The model
The assumptions that we will make about the DSP are:

2. $g(X, \beta)$: The function is linear or nonlinear in $X$ and $\beta$ and additive in $\varepsilon$, with $E(y|X) = X\beta$ or $g(X, \beta)$ and $\mathrm{var}(y|X) = \sigma^2\Psi$.
3. $X$: The $X$ variables are fixed or exogenous.
5. $\varepsilon$: The disturbances are not i.i.d.; $E(\varepsilon|X) = 0$ and $\mathrm{var}(\varepsilon|X) = \sigma^2\Psi$, where $\Psi$ is a known positive definite symmetric matrix.
6. (e|X ): The distribution of the error terms is left unspecified or is normal.
11.2 The impact of ignoring that $\Psi \neq I$
What happens if we ignore the fact that the covariance matrix of the errors is no longer $\sigma^2 I$ and we continue to use the OLS estimators
$$\hat{\beta} = (X'X)^{-1}X'y \quad\text{and}\quad \hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{n-k}\,?$$
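11.2.1 Point estimation of $\beta$
We will show that under fairly general conditions the OLS slope estimators $\hat{\beta}$ will continue to be:

Unbiased
This follows since
$$\hat{\beta} = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon \tag{11.1}$$
Consequently
$$E(\hat{\beta}) = \beta \tag{11.2}$$
It follows from equations 11.1 and 11.2 that
$$\mathrm{var}(\hat{\beta}) = E\left[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'\right] = E\left[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\right] = \sigma^2 (X'X)^{-1}X'\Psi X(X'X)^{-1}$$
Note that only if $\Psi = I$ does this reduce to $\sigma^2(X'X)^{-1}$.
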
Consistent
Since $\hat{\beta}$ is unbiased, consistency will follow if $\mathrm{var}(\hat{\beta}) \to 0$ as $n \to \infty$ (since then we can invoke mean square convergence). It turns out that
$$\mathrm{var}(\hat{\beta}) \leq \sigma^2 \lambda_{\max} (X'X)^{-1}$$
(in the positive semi-definite sense), where $\lambda_{\max}$ is the largest eigenvalue of $\Psi$. So if $\lambda_{\max}$ remains finite (bounded) as $n \to \infty$ the OLS slope estimators will be consistent under precisely the same conditions that we used to establish consistency when we had $\Psi = I$, i.e. if $(X'X)^{-1} \to 0$.
In the case where we have only heteroscedasticity so that $\Psi$ is diagonal, it is easy to see that the condition that $\lambda_{\max}$ remains finite is the condition that none of the error variances should become arbitrarily large as $n \to \infty$.

Asymptotically normal (or normal in small samples if the error distribution is normal)
In order to establish asymptotic normality we need to be able to invoke a central limit theorem on $n^{-1/2}X'\varepsilon$. It will generally be easier to establish normality by writing this as
$$n^{-1/2}X'\varepsilon = n^{-1/2}X'\Psi^{1/2}\Psi^{-1/2}\varepsilon = n^{-1/2}W'\varepsilon^*$$
where $W = \Psi^{1/2}X$ and $\varepsilon^* = \Psi^{-1/2}\varepsilon$ is a vector of transformed error terms. This transformation will turn out to be important later on so it is important to note that it will always exist. This follows from the spectral decomposition theorem for symmetric matrices. We can always write
$$\Psi = T\Lambda T' \tag{11.3}$$
where $T$ is an orthogonal matrix (see footnote 1) of eigenvectors and $\Lambda$ is the corresponding diagonal matrix of eigenvalues. Since $\Psi$ is positive definite it has an inverse which will be given by $\Psi^{-1} = T\Lambda^{-1}T'$. We can now define $\Psi^{1/2} = T\Lambda^{1/2}T'$ and $\Psi^{-1/2} = T\Lambda^{-1/2}T'$. The matrix $\Lambda^{1/2}$ is just $\mathrm{diag}(\lambda_1^{1/2}, \ldots, \lambda_n^{1/2})$, i.e. where we take the square root of each element on the diagonal. Since $\Psi$ is positive definite every eigenvalue is strictly positive and hence has a square root.
Now observe that
$$E(\varepsilon^*) = \Psi^{-1/2}E(\varepsilon) = 0$$
and
$$\mathrm{var}(\varepsilon^*) = E(\varepsilon^*\varepsilon^{*\prime}) = \Psi^{-1/2}\,\sigma^2\Psi\,\Psi^{-1/2} = \sigma^2 I$$
It follows that $E(n^{-1/2}W'\varepsilon^*) = 0$ and $\mathrm{var}(n^{-1/2}W'\varepsilon^*) = \frac{\sigma^2}{n}W'W = \frac{\sigma^2}{n}X'\Psi X$. If $n^{-1}X'\Psi X \to h$ where $h$ is a symmetric positive definite matrix, then we can use the same reasoning as in Chapter 9 to show that $n^{-1/2}W'\varepsilon^* \xrightarrow{d} N(0, \sigma^2 h)$. It follows that
$$\sqrt{n}\,(\hat{\beta} - \beta) \xrightarrow{d} N\left(0, \sigma^2 A^{-1}hA^{-1}\right)$$
where we are assuming, as in Chapter 9, that $n^{-1}X'X \to A$. We can use the same procedure to get a tractable version of this, i.e.
$$\hat{\beta} \overset{a}{\sim} N\left(\beta, \sigma^2 (X'X)^{-1}X'\Psi X(X'X)^{-1}\right) \tag{11.4}$$
This will be the exact distribution if the errors are normally distributed.
However $\hat{\beta}$ will no longer be efficient. We will show this in due course below.
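The covariance formula in equation 11.4 can be checked numerically. The sketch below is only an illustration under assumed values: the design matrix, the coefficient vector `beta` and the diagonal `Psi` are all hypothetical, chosen so that the error variance rises with the regressor. It compares the naive matrix $\sigma^2(X'X)^{-1}$, the sandwich form and the empirical covariance of OLS estimates across simulated samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 5000
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, n)])
beta = np.array([1.0, 0.5])       # hypothetical true coefficients
sigma2 = 1.0

# Hypothetical known Psi: diagonal, with variances rising in the regressor,
# scaled so that the average diagonal element is one.
psi_diag = 0.2 + X[:, 1] ** 2 / 10.0
psi_diag = psi_diag / psi_diag.mean()
Psi = np.diag(psi_diag)

XtX_inv = np.linalg.inv(X.T @ X)
naive = sigma2 * XtX_inv                                 # correct only if Psi = I
sandwich = sigma2 * XtX_inv @ X.T @ Psi @ X @ XtX_inv    # covariance in (11.4)

# Monte Carlo: draw errors with covariance sigma2 * Psi and re-estimate by OLS.
eps = rng.normal(size=(reps, n)) * np.sqrt(sigma2 * psi_diag)
Y = X @ beta + eps                                       # each row is one sample
draws = Y @ X @ XtX_inv                                  # each row is one beta-hat
empirical = np.cov(draws, rowvar=False)

print("naive    :", np.diag(naive))
print("sandwich :", np.diag(sandwich))
print("empirical:", np.diag(empirical))
```

In this design the empirical variances should track the sandwich matrix rather than the naive one, which is the point made in sections 11.2.3 and 11.2.4.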
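11.2.2 Point estimation of $\sigma^2$
While the OLS slope estimates retain many of their desirable properties, the standard estimator of $\sigma^2$ is biased and inconsistent. We have that
$$e'e = \varepsilon'M\varepsilon$$
where $M = I - X(X'X)^{-1}X'$. Using the same trick as before we have
$$E(e'e) = E\left(\mathrm{tr}(\varepsilon'M\varepsilon)\right) = E\left(\mathrm{tr}(M\varepsilon\varepsilon')\right) = \mathrm{tr}\left(M E(\varepsilon\varepsilon')\right) = \sigma^2\,\mathrm{tr}(M\Psi)$$
But in general $\mathrm{tr}(M\Psi) \neq \mathrm{tr}(M) = n - k$. So $E(\hat{\sigma}^2) \neq \sigma^2$. Furthermore there is no reason to believe that this bias would disappear as $n \to \infty$.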
Footnote 1: This means that $T' = T^{-1}$, i.e. $TT' = I$.
11.2.3 Point estimation of $\mathrm{var}(\hat{\beta})$
It should be fairly obvious that $\hat{\sigma}^2(X'X)^{-1}$ will be a biased and inconsistent estimate of $\sigma^2(X'X)^{-1}X'\Psi X(X'X)^{-1}$.
11.2.4 Hypothesis testing
It should also be fairly clear that all hypothesis tests based on the biased and inconsistent estimator of $\mathrm{var}(\hat{\beta})$ will produce biased and inconsistent results. This includes the standard $t$ tests and $F$ tests discussed in Chapter 10.
11.3 Transforming the data: Generalised Least Squares
11.3.1 Derivation of the GLS estimator
We observed above that it was possible to transform the error term $\varepsilon$ to get $\varepsilon^*$ which had a much better behaved covariance matrix. We will use the same transformation to transform the data into a form in which they obey the assumptions of the classical linear regression model. The existing model is (in linear form)
$$y = X\beta + \varepsilon$$
Multiplying both sides by $\Psi^{-1/2}$ we get
$$\Psi^{-1/2}y = \Psi^{-1/2}X\beta + \Psi^{-1/2}\varepsilon$$
$$y^* = X^*\beta + \varepsilon^* \tag{11.5}$$
We have already observed that $E(\varepsilon^*|X) = 0$, so certainly $E(\varepsilon^*|X^*) = 0$. Furthermore $E(\varepsilon^*\varepsilon^{*\prime}|X) = \sigma^2 I$. Since we assume for the moment that $\Psi$ is known we can generate the transformed variables $y^*$ and $X^*$.
Since the assumptions of the classical linear regression model all apply to the model in equation 11.5, estimation of that model by OLS will yield the minimum variance linear unbiased estimator (by the Gauss-Markov theorem). Note that the parameter $\beta$ has not been affected by this data transformation - it is still the same parameter that is being estimated. This procedure of using OLS on transformed data is referred to as generalised least squares.
The OLS estimator of $\beta$ in equation 11.5 will be given by the standard formula:
$$\hat{\beta}_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'\Psi^{-1}X)^{-1}X'\Psi^{-1}y \tag{11.6}$$
11.3.2 Properties of the GLS estimator
We can now use the full armoury of results from the previous chapters applied to model 11.5 to
establish the unbiasedness, consistency and asymptotic normality of the GLS estimator. It is,
however, useful to derive some of them directly from equation 11.6. Note in particular that
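$$\hat{\beta}_{GLS} = \beta + (X'\Psi^{-1}X)^{-1}X'\Psi^{-1}\varepsilon$$
so that
$$E(\hat{\beta}_{GLS}) = \beta$$
It follows that
$$\mathrm{var}(\hat{\beta}_{GLS}) = \sigma^2 (X'\Psi^{-1}X)^{-1}$$
We therefore have
$$\mathrm{var}(\hat{\beta}) - \mathrm{var}(\hat{\beta}_{GLS}) = \sigma^2 (X'X)^{-1}X'\Psi X(X'X)^{-1} - \sigma^2 (X'\Psi^{-1}X)^{-1} = \sigma^2 C'\Psi C$$
where $C' = (X'X)^{-1}X' - (X'\Psi^{-1}X)^{-1}X'\Psi^{-1}$. To see this we just multiply the expression out:
$$\begin{aligned}
C'\Psi C &= \left[(X'X)^{-1}X' - (X'\Psi^{-1}X)^{-1}X'\Psi^{-1}\right]\Psi\left[X(X'X)^{-1} - \Psi^{-1}X(X'\Psi^{-1}X)^{-1}\right] \\
&= \left[(X'X)^{-1}X' - (X'\Psi^{-1}X)^{-1}X'\Psi^{-1}\right]\left[\Psi X(X'X)^{-1} - X(X'\Psi^{-1}X)^{-1}\right] \\
&= (X'X)^{-1}X'\Psi X(X'X)^{-1} - (X'\Psi^{-1}X)^{-1} - (X'\Psi^{-1}X)^{-1} + (X'\Psi^{-1}X)^{-1} \\
&= (X'X)^{-1}X'\Psi X(X'X)^{-1} - (X'\Psi^{-1}X)^{-1}
\end{aligned}$$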
The matrix $C'\Psi C$ is at least positive semi-definite since $\Psi$ is positive definite. This follows since if we take any vector $x$ and let $z = Cx$, then $z'\Psi z \geq 0$, i.e. $x'C'\Psi Cx \geq 0$. It will be strictly positive whenever $z$ is not the zero vector. Note that if $\Psi = I$ then $C = 0$, so we cannot guarantee that if $x$ is non-zero that $z = Cx$ will be non-zero. Nevertheless we are assured that $C'\Psi C$ is positive semi-definite, which is enough to prove that the GLS estimator is at least as efficient as the OLS estimator, a claim that we made in section 11.2.1.
The proof that we have just given works directly off the two covariance matrices. We can, of course, also appeal to the Gauss-Markov theorem on the transformed model given in equation 11.5. We know that $\hat{\beta}_{GLS}$ will be more efficient than any other linear unbiased estimator of $\beta$ in that model. We know that the OLS estimator $\hat{\beta}$ is an unbiased estimator of $\beta$, so we only need to show that it is a linear estimator, i.e. it is linear in the dependent variable $y^*$. Now
$$\hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'\Psi^{1/2}\Psi^{-1/2}y = (X'X)^{-1}X'\Psi^{1/2}y^* = Ay^*$$
where $A$ is just a matrix of constants, i.e. $\hat{\beta}$ is linear in $y^*$, i.e. $\hat{\beta}_{GLS}$ is more efficient than the OLS estimator.
11.3.3 Alternative derivation of the GLS estimator
Again consider the model
$$y = X\beta + \varepsilon$$
Let $P = \Lambda^{-1/2}T'$ where $T$ and $\Lambda$ are defined as in equation 11.3. Note that $P\Psi P' = \Lambda^{-1/2}T'T\Lambda T'T\Lambda^{-1/2} = I$. Then it is easy to show that the transformed model
$$Py = PX\beta + P\varepsilon$$
also obeys the assumptions of the classical linear regression model. Estimating this model by OLS yields the same GLS estimator, since $P'P = T\Lambda^{-1}T' = \Psi^{-1}$.
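The equivalence of the two transformations with the closed form in equation 11.6 is an algebraic identity, so it can be verified on arbitrary numbers. The following sketch assumes a hypothetical known $\Psi$ (an AR(1)-type correlation matrix with $\rho = 0.6$) and random data; the helper `ols` is introduced here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)            # any y will do for an algebraic identity

# Hypothetical known Psi: correlation matrix with (i, j) element rho^|i-j|
rho = 0.6
idx = np.arange(n)
Psi = rho ** np.abs(idx[:, None] - idx[None, :])

# Spectral decomposition Psi = T Lambda T' (equation 11.3)
lam, T = np.linalg.eigh(Psi)
Psi_inv_half = T @ np.diag(lam ** -0.5) @ T.T   # symmetric Psi^(-1/2)
P = np.diag(lam ** -0.5) @ T.T                  # P = Lambda^(-1/2) T'

def ols(Xm, ym):
    """OLS coefficients via the normal equations."""
    return np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)

b_sym = ols(Psi_inv_half @ X, Psi_inv_half @ y)  # OLS on equation 11.5
b_P = ols(P @ X, P @ y)                          # OLS on Py = PX beta + P eps
Psi_inv = np.linalg.inv(Psi)
b_closed = np.linalg.solve(X.T @ Psi_inv @ X, X.T @ Psi_inv @ y)  # equation 11.6

print(b_sym, b_P, b_closed)
```

All three vectors should agree up to floating point error, because both transformations satisfy $P'P = \Psi^{-1}$.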
Example 11.1 Estimation with AR(1) noise

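A model with AR(1) noise can be written as
$$y_t = x_t\beta + \varepsilon_t$$
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$
where $E(u|X) = 0$, $\mathrm{var}(u) = \sigma_u^2 I$ and $|\rho| < 1$. The covariance matrix $E(\varepsilon\varepsilon')$ in this case can be written as
$$E(\varepsilon\varepsilon') = \sigma^2\Psi = \frac{\sigma_u^2}{1-\rho^2}
\begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\
\rho & 1 & \rho & \cdots & \rho^{n-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{n-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1
\end{pmatrix}$$
One can show that
$$\Psi^{-1} =
\begin{pmatrix}
\frac{1}{1-\rho^2} & \frac{-\rho}{1-\rho^2} & 0 & \cdots & 0 & 0 \\
\frac{-\rho}{1-\rho^2} & \frac{\rho^2+1}{1-\rho^2} & \frac{-\rho}{1-\rho^2} & \cdots & 0 & 0 \\
0 & \frac{-\rho}{1-\rho^2} & \frac{\rho^2+1}{1-\rho^2} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \frac{\rho^2+1}{1-\rho^2} & \frac{-\rho}{1-\rho^2} \\
0 & 0 & 0 & \cdots & \frac{-\rho}{1-\rho^2} & \frac{1}{1-\rho^2}
\end{pmatrix}$$
It is easy to verify that the matrix
$$P =
\begin{pmatrix}
\sqrt{1-\rho^2} & 0 & 0 & \cdots & 0 & 0 \\
-\rho & 1 & 0 & \cdots & 0 & 0 \\
0 & -\rho & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -\rho & 1
\end{pmatrix}$$
is such that $P\Psi P' = (1-\rho^2)I$. It follows that the transformed model $Py = PX\beta + P\varepsilon$ can be consistently estimated by OLS. If you write this model out in full it becomes:
$$\begin{aligned}
\sqrt{1-\rho^2}\,y_1 &= \sqrt{1-\rho^2}\,x_1\beta + \sqrt{1-\rho^2}\,\varepsilon_1 \\
y_2 - \rho y_1 &= (x_2 - \rho x_1)\beta + \varepsilon_2 - \rho\varepsilon_1 \\
&\;\;\vdots \\
y_n - \rho y_{n-1} &= (x_n - \rho x_{n-1})\beta + \varepsilon_n - \rho\varepsilon_{n-1}
\end{aligned}$$
The first observation is transformed by multiplying through by $\sqrt{1-\rho^2}$. This is known as the Prais-Winsten transformation. This transformation ensures that the variance of the first error term is $(1-\rho^2)\sigma^2$, which is identical to the variance of the other transformed error terms. The other observations are transformed by taking the generalised differences. Note that in each case $\varepsilon_t - \rho\varepsilon_{t-1} = u_t$, so the error process has been transformed to be i.i.d.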
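The claim about the Prais-Winsten matrix can be verified directly for a small case. The sketch below builds $P$ and the AR(1) correlation matrix for assumed values $n = 6$ and $\rho = 0.7$; the function name `prais_winsten_matrix` is hypothetical.

```python
import numpy as np

def prais_winsten_matrix(n, rho):
    """P with sqrt(1 - rho^2) in the top-left corner, ones on the rest of
    the diagonal and -rho on the first sub-diagonal."""
    P = np.eye(n)
    P[0, 0] = np.sqrt(1.0 - rho ** 2)
    P[np.arange(1, n), np.arange(0, n - 1)] = -rho
    return P

n, rho = 6, 0.7                                    # assumed illustrative values
idx = np.arange(n)
Psi = rho ** np.abs(idx[:, None] - idx[None, :])   # (i, j) element rho^|i-j|
P = prais_winsten_matrix(n, rho)

check = P @ Psi @ P.T                              # should be (1 - rho^2) I
Psi_inv_from_P = P.T @ P / (1.0 - rho ** 2)        # should be the tridiagonal inverse

print(np.round(check, 10))
```

Because $P\Psi P'$ is a scalar multiple of the identity, OLS on the transformed data reproduces the GLS estimator; the scalar only rescales the error variance.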
Example 11.2 Estimation with heteroscedasticity
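The case of heteroscedasticity is fairly simple. The noise covariance matrix is diagonal, i.e.
$$E(\varepsilon\varepsilon') =
\begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_n^2
\end{pmatrix}$$
so writing
$$P =
\begin{pmatrix}
\frac{1}{\sigma_1} & 0 & \cdots & 0 \\
0 & \frac{1}{\sigma_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{\sigma_n}
\end{pmatrix}$$
we can immediately verify that $P\,E(\varepsilon\varepsilon')\,P' = I$. The transformed model $Py = PX\beta + P\varepsilon$ can be written out as
$$\begin{aligned}
\frac{y_1}{\sigma_1} &= \frac{1}{\sigma_1}x_1\beta + \frac{\varepsilon_1}{\sigma_1} \\
\frac{y_2}{\sigma_2} &= \frac{1}{\sigma_2}x_2\beta + \frac{\varepsilon_2}{\sigma_2} \\
&\;\;\vdots \\
\frac{y_n}{\sigma_n} &= \frac{1}{\sigma_n}x_n\beta + \frac{\varepsilon_n}{\sigma_n}
\end{aligned}$$
This model is also referred to as weighted least squares, because it amounts to a reweighting of different observations.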
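11.3.4 Estimation of $\sigma^2$
If we estimate the transformed linear model (equation 11.5) by OLS, then the residual sum of squares from that estimation can be used to estimate $\sigma^2$. We have
$$\begin{aligned}
e^{*\prime}e^* &= (y^* - X^*\hat{\beta}_{GLS})'(y^* - X^*\hat{\beta}_{GLS}) \\
&= (\Psi^{-1/2}y - \Psi^{-1/2}X\hat{\beta}_{GLS})'(\Psi^{-1/2}y - \Psi^{-1/2}X\hat{\beta}_{GLS}) \\
&= (y - X\hat{\beta}_{GLS})'\Psi^{-1}(y - X\hat{\beta}_{GLS})
\end{aligned}$$
Our estimator of $\sigma^2$ will therefore be
$$\hat{\sigma}^2_{GLS} = \frac{(y - X\hat{\beta}_{GLS})'\Psi^{-1}(y - X\hat{\beta}_{GLS})}{n-k}$$
We can show that this will be an unbiased and consistent estimator.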
11.4 Exercises
1. You are given the DSP
$$y = X\beta + \varepsilon, \qquad E(\varepsilon|X) = 0, \qquad E(\varepsilon\varepsilon'|X) = \sigma^2\Psi$$
You are also given the following matrices:
$$(X'X)^{-1} = \begin{pmatrix} 022191 & 00186 \\ 00186 & 00024 \end{pmatrix}, \qquad X'y = \begin{pmatrix} 1864 \\ 19396 \end{pmatrix}$$
$$(X'WX)^{-1} = \begin{pmatrix} 008484 & 00084 \\ 00084 & 000123 \end{pmatrix}, \qquad X'Wy = \begin{pmatrix} 4024 \\ 34762 \end{pmatrix}$$
$$X'W^{-1}X = \begin{pmatrix} 975 & 95 \\ 95 & 12415 \end{pmatrix}$$
where $W = \Psi^{-1}$ and $W = \mathrm{diag}(1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4)$. You are told that $e'e = 18053$, where $e$ is the vector of OLS residuals, and that 327895 is a consistent estimate of $\sigma^2$ as defined in the DSP.
(a) What is $n$? (Assume it is 20 if you can't tell).
(b) Estimate $\beta$ by OLS.
(c) If you mistakenly assume that $E(\varepsilon\varepsilon'|X) = \sigma^2 I$ what would be your estimate of the variance-covariance matrix of the OLS estimators $\hat{\beta}$?
(d) Test the null hypothesis $\beta_2 = 1$ on this assumption.
(e) Now find the true variance-covariance matrix of the OLS estimators $\hat{\beta}$.
(f) Test the null hypothesis $\beta_2 = 1$ on this assumption.
(g) Estimate $\beta$ by GLS.
(h) Estimate the variance-covariance matrix of the GLS estimators $\hat{\beta}_{GLS}$.
(i) Test the null hypothesis $\beta_2 = 1$ using the GLS estimators.
(j) Test the joint hypothesis $\beta_1 = 0$ and $\beta_2 = 1$.
Chapter 12
Estimation with an unknown general noise covariance matrix $\sigma^2\Psi$
In the last chapter we saw that:
• Ignoring the violation of the assumptions of the Classical Linear Regression Model can lead to misleading results if the aim of the exercise is more than point estimation of the slope coefficients.
• If we know the structure of the covariance matrix $\sigma^2\Psi$, then there will always be a transformation of the data that gets rid of the problem. In these circumstances the GLS estimator will be optimal.
Most of the time, of course, we will not know the structure of this matrix. Under these circumstances two approaches can be adopted:
• We can try to estimate $\Psi$.
• We can estimate a more appropriate covariance matrix for the OLS estimators and continue to use OLS.
12.1 Feasible Generalised Least Squares
In general we will not know $\sigma^2\Psi$. One approach would be to estimate $\Psi$ and then replace the unknown $\Psi$ with a consistent estimate $\hat{\Psi}$. This estimated GLS estimator (EGLS) or feasible GLS estimator (FGLS) is given by
$$\hat{\beta}_{FGLS} = (X'\hat{\Psi}^{-1}X)^{-1}X'\hat{\Psi}^{-1}y \tag{12.1}$$
Note that if we let $\Phi = \sigma^2\Psi$ then an equivalent expression is given by
$$\hat{\beta}_{FGLS} = (X'\hat{\Phi}^{-1}X)^{-1}X'\hat{\Phi}^{-1}y \tag{12.2}$$
In fact any scalar multiple of $\hat{\Psi}$ will also work, because the scalar cancels between the two factors.
12.1.1 The problem of estimating $\Psi$
The problem of estimating $\Psi$ or $\Phi$ is non-trivial once one realises that there are $n(n+1)/2$ distinct elements to be estimated, many more than the $n$ observations available. In practice we need to impose some restrictions on the shape of the covariance matrix. This is generally done by imposing some parametric structure on the matrix, i.e. we assume that $\Psi = \Psi(\theta)$ where $\theta$ is a vector of parameters of dimension considerably smaller than $n$.
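Example 12.1 If we think there is autocorrelation in the data, we might assume that it takes the form of an AR(1) process, i.e. that
$$\Psi =
\begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\
\rho & 1 & \rho & \cdots & \rho^{n-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{n-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1
\end{pmatrix}$$
Note that $\Psi = \Psi(\rho)$, i.e. we need to estimate only one parameter in order to fully characterise the matrix.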
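Example 12.2 In many cross-sectional data sets there is correlation within households, neighbourhoods, schools etc. One simple way of capturing this is with the hierarchical effects model
$$y_{ij} = x_{ij}\beta + u_j + \varepsilon_{ij}$$
In this model the subscript $i$ refers to the individual and the subscript $j$ to the household. The random variable $u_j$ is assumed to be common within the household while the error term $\varepsilon_{ij}$ is assumed to be uncorrelated between individuals. In short we would add the assumptions:
$$E(u_j) = 0, \qquad E(\varepsilon_{ij}) = 0$$
$$E(u_j^2) = \sigma_j^2, \qquad E(\varepsilon_{ij}^2) = \sigma^2$$
$$E(u_j u_k) = 0 \text{ if } j \neq k, \qquad E(\varepsilon_{ij}\varepsilon_{kl}) = 0 \text{ if } i \neq k$$
The error term in the regression is $v_{ij} = u_j + \varepsilon_{ij}$, so the covariance matrix of the error terms is block diagonal once we order the observations so that individuals within the same group are next to each other. A typical matrix might look like:
$$\sigma^2\Psi =
\begin{pmatrix}
\sigma_1^2+\sigma^2 & \sigma_1^2 & \sigma_1^2 & 0 & 0 & 0 & \cdots & 0 \\
\sigma_1^2 & \sigma_1^2+\sigma^2 & \sigma_1^2 & 0 & 0 & 0 & \cdots & 0 \\
\sigma_1^2 & \sigma_1^2 & \sigma_1^2+\sigma^2 & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & \sigma_2^2+\sigma^2 & \sigma_2^2 & 0 & \cdots & 0 \\
0 & 0 & 0 & \sigma_2^2 & \sigma_2^2+\sigma^2 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & 0 & \sigma_3^2+\sigma^2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & \sigma_J^2+\sigma^2
\end{pmatrix}$$
Provided that the number of groups $J$ is less than the number of observations $n$, it should be possible, in principle, to estimate the within group covariances $\sigma_j^2$.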
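Example 12.3 Even if we assume that the matrix $\Psi$ is diagonal there would still be $n$ separate variances to be estimated. In this case we might suppose that the variances might be constant within groups (e.g. households) so that the matrix might be:
$$\sigma^2\Psi =
\begin{pmatrix}
\sigma_1^2 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 \\
0 & \sigma_1^2 & 0 & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & \sigma_1^2 & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & \sigma_2^2 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & \sigma_2^2 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & 0 & \sigma_3^2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & \sigma_J^2
\end{pmatrix}$$
Alternatively we might suspect that the heteroscedasticity is driven by some explanatory variables, so that
$$E(\varepsilon_i^2|x_i) = x_i a$$
The parameter vector $a$ could, in principle, be estimated.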
12.1.2 Approach to estimating $\Psi$
Since the true error terms are unobserved it is unclear how we might estimate their covariances or correlations. Note, however, that the OLS estimators $\hat{\beta}$ remain consistent so that the OLS residuals $e$ also remain consistent estimates of the true error vector $\varepsilon$.
The general approach is therefore:
• Estimate the original model using OLS and obtain the vector of OLS residuals $e$.
• Estimate the parameters $\theta$ using the residuals $e$ as estimates of $\varepsilon$.
• Form the $\hat{\Psi}$ matrix, letting $\hat{\Psi} = \Psi(\hat{\theta})$.
• Reestimate the model using the FGLS estimator.

In some approaches the residuals from the FGLS estimator are used in turn to reestimate the $\theta$ parameters, leading to an iterated version of the FGLS estimator.
Example 12.4 The Cochrane-Orcutt procedure
In the last chapter we saw that the AR(1) error process could be corrected through a process of obtaining the generalised differences. This is the basis for the Cochrane-Orcutt procedure. The method is simple:
1. Estimate the model using OLS and obtain the OLS residuals $e$.
2. Run the OLS regression
$$e_t = \rho e_{t-1} + u_t$$
and estimate $\rho$.
3. Use $\hat{\rho}$ to transform the observations, i.e. create
$$y_t^* = y_t - \hat{\rho}y_{t-1}, \text{ if } t > 1$$
$$x_t^* = x_t - \hat{\rho}x_{t-1}, \text{ if } t > 1$$
(If desired) transform the first observation as
$$y_1^* = \sqrt{1-\hat{\rho}^2}\,y_1$$
$$x_1^* = \sqrt{1-\hat{\rho}^2}\,x_1$$
4. Estimate the transformed model
$$y_t^* = x_t^*\beta + u_t$$
by OLS.
5. If desired, iterate until convergence is achieved.
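The five steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the convergence tolerance and the simulated data (true $\rho = 0.7$, intercept 1, slope 2) are all assumptions made for the example.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def cochrane_orcutt(X, y, prais_winsten=True, tol=1e-8, max_iter=100):
    beta = ols(X, y)                                     # step 1
    rho = 0.0
    for _ in range(max_iter):
        e = y - X @ beta                                 # residuals of the original model
        rho_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])   # step 2: e_t on e_{t-1}
        ys = y[1:] - rho_new * y[:-1]                    # step 3: generalised differences
        Xs = X[1:] - rho_new * X[:-1]
        if prais_winsten:                                # optional first observation
            w = np.sqrt(1.0 - rho_new ** 2)
            ys = np.concatenate([[w * y[0]], ys])
            Xs = np.vstack([w * X[0], Xs])
        beta = ols(Xs, ys)                               # step 4
        converged = abs(rho_new - rho) < tol             # step 5
        rho = rho_new
        if converged:
            break
    return beta, rho

# Simulated data with AR(1) errors (all values hypothetical)
rng = np.random.default_rng(2)
n, rho_true = 2000, 0.7
x = rng.normal(size=n)
u = rng.normal(size=n)
eps = np.empty(n)
eps[0] = u[0] / np.sqrt(1.0 - rho_true ** 2)
for t in range(1, n):
    eps[t] = rho_true * eps[t - 1] + u[t]
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + eps

beta_hat, rho_hat = cochrane_orcutt(X, y)
print(beta_hat, rho_hat)
```

Note that the column of ones is transformed along with the other regressors, so the coefficient on the transformed constant is still the original intercept.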
Remark 12.5 Please note, however, that the state of the art in time series econometrics has progressed far beyond this particular technique. If you have serious autocorrelation problems it should be seen as an indicator that you probably have nonstationary data and in that case different approaches to estimation are probably called for.
12.1.3 Properties of the FGLS estimator
The FGLS estimator will be consistent, asymptotically efficient and asymptotically normal provided that the matrix $\Psi$ can, in fact, be characterised as $\Psi = \Psi(\theta)$.
OLS with robust estimation of the covariance matrix
b
We observed in the last chapter that the OLS estimators
remain unbiased and consistent.
Their covariance matrix, however, is given by
2 (X0 X)
X0 X (X0 X)
An alternative approach would be to obtain consistent estimates of this covariance matrix and
use that for purposes of inference instead.
12.2.1 Heteroscedasticity consistent standard errors
One particular application of this is in the context of heteroscedasticity. In this case the matrix $X'\sigma^2\Psi X$ can be written as $\sum_i \sigma_i^2 x_i'x_i$ where $x_i$ is the i-th row of the $X$ matrix. It is possible to show that
$$\frac{1}{n}\sum_i e_i^2 x_i'x_i$$
is a consistent estimator of
$$\operatorname{plim} \frac{1}{n}X'\sigma^2\Psi X$$
provided that the latter exists. In this case the OLS covariance matrix can be estimated as
$$(X'X)^{-1}\left(\sum_i e_i^2 x_i'x_i\right)(X'X)^{-1} \tag{12.3}$$
Note that we are using the consistency of the squared OLS residuals $e_i^2$ for $\varepsilon_i^2$ to construct this estimate.
The justification for this estimate of $\mathrm{var}(\hat{\beta})$ is asymptotic. The finite sample properties of the estimator can be somewhat improved by adjusting the errors upward. This is sometimes called a degrees of freedom correction. A typical one is to scale the estimated covariance matrix by $\frac{n}{n-k}$, i.e. to multiply the standard errors by $\sqrt{\frac{n}{n-k}}$. This mirrors the divisor $n-k$ that we would have used if we were estimating $\sigma^2$.
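Equation 12.3 translates almost line by line into matrix code. The sketch below is an illustration under assumed data (a single regressor whose error standard deviation is proportional to $x$); the function name `white_cov` and its `dof_correction` flag are hypothetical, and the flag applies the $n/(n-k)$ scaling described above.

```python
import numpy as np

def white_cov(X, y, dof_correction=True):
    """OLS coefficients and the heteroscedasticity consistent covariance
    matrix of equation 12.3, optionally scaled by n / (n - k)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    meat = (X * e[:, None] ** 2).T @ X       # sum_i e_i^2 x_i' x_i
    V = XtX_inv @ meat @ XtX_inv
    if dof_correction:
        V = V * n / (n - k)
    return beta, V

# Hypothetical data: error standard deviation proportional to x
rng = np.random.default_rng(3)
n = 500
x = rng.uniform(1.0, 5.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n) * x

beta_hat, V0 = white_cov(X, y, dof_correction=False)
_, V1 = white_cov(X, y, dof_correction=True)
robust_se = np.sqrt(np.diag(V1))
print(beta_hat, robust_se)
```

The product `(X * e[:, None] ** 2).T @ X` avoids forming the $n \times n$ diagonal matrix of squared residuals, which matters once $n$ is large.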
12.2.2 Heteroscedasticity and autocorrelation consistent (HAC) standard errors
If there is autocorrelation in the data as well as heteroscedasticity we can again use the relationships of the OLS residuals in the sample to approximate the asymptotic correlations given in $\sigma^2\Psi$. One problem we immediately encounter, however, is that certain correlations (e.g. the correlation between $\varepsilon_1$ and $\varepsilon_n$) will only ever be estimated by one observation and so $e_1e_n$ cannot possibly yield a consistent estimator of the population covariance $E(\varepsilon_1\varepsilon_n)$. In order to get decent asymptotic behaviour one therefore needs to impose a maximum lag length $L$ above which we assume that there is no or negligible correlation. We could then estimate $X'\sigma^2\Psi X$ as
$$\widehat{X'\sigma^2\Psi X} = \sum_{t=1}^{n} e_t^2 x_t'x_t + \sum_{l=1}^{L}\sum_{t=l+1}^{n} e_te_{t-l}\left(x_t'x_{t-l} + x_{t-l}'x_t\right)$$
We can then allow $L$ to increase as $n$ increases (but at a slower rate!). It turns out that these empirical estimates need not be positive definite, however. This is a major drawback. Newey and West have shown that one can get a positive semi-definite set of estimates if we down-weight the estimates coming from correlations over longer periods. The Newey-West estimator is given by
$$\widehat{X'\sigma^2\Psi X} = \frac{n}{n-k}\left[\sum_{t=1}^{n} e_t^2 x_t'x_t + \sum_{l=1}^{L}\left(1 - \frac{l}{L+1}\right)\sum_{t=l+1}^{n} e_te_{t-l}\left(x_t'x_{t-l} + x_{t-l}'x_t\right)\right]$$
where we have also introduced the finite sample correction $\frac{n}{n-k}$.
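The weighted sum in the Newey-West estimator can be sketched as follows. The function names are hypothetical, the finite sample factor is optional, and the residuals used at the end are random stand-ins rather than output from a real regression; the block is meant only to show how the lag terms and weights $1 - l/(L+1)$ are assembled.

```python
import numpy as np

def newey_west_meat(X, e, L, dof_correction=False):
    """Estimate of X' sigma^2 Psi X with weights 1 - l / (L + 1)."""
    n, k = X.shape
    Xe = X * e[:, None]                  # row t holds e_t * x_t
    S = Xe.T @ Xe                        # sum_t e_t^2 x_t' x_t
    for l in range(1, L + 1):
        w = 1.0 - l / (L + 1.0)
        G = Xe[l:].T @ Xe[:-l]           # sum_t e_t e_{t-l} x_t' x_{t-l}
        S = S + w * (G + G.T)
    if dof_correction:
        S = S * n / (n - k)
    return S

def hac_cov(X, y, L):
    """OLS coefficients with a Newey-West covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    return beta, XtX_inv @ newey_west_meat(X, e, L) @ XtX_inv

rng = np.random.default_rng(4)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
e = rng.normal(size=n)                   # stand-in for OLS residuals
S0 = newey_west_meat(X, e, L=0)          # reduces to the White "meat"
S4 = newey_west_meat(X, e, L=4)
```

With $L = 0$ the loop is empty and the function returns the heteroscedasticity consistent middle matrix of equation 12.3.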
12.3 Summing up
Feasible Generalised Least Squares will be asymptotically efficient provided that we have parameterised the structure of the covariance matrix correctly. If there is doubt about this, the FGLS estimates could be more or less efficient than OLS. In this context using OLS with robust standard errors has much to recommend it.
Chapter 13
Heteroscedasticity and Autocorrelation
13.1 Introduction
In this chapter we consider how we might diagnose the presence of heteroscedasticity or autocorrelation. The fundamental approach in both cases involves analysis of the residuals from an OLS regression. Indeed visual inspection of these residuals can frequently be very instructive. The formal tests tend to be of the LM type in which the homoscedastic/zero autocorrelation model is the restricted model. The test statistics can generally be obtained by an auxiliary regression of the sort discussed previously, i.e. we will regress the OLS residuals on a broader set of explanatory variables and then use a Chi-square test with $nR^2$ as our test statistic, where $R^2$ is the $R^2$ of the auxiliary regression.
13.2 Tests for heteroscedasticity
13.2.1 Breusch-Pagan-Godfrey test
Note that the original BPG test is more involved than the procedure outlined below (see, for instance, Gujarati (2003, p.411)). This discussion follows Mittelhammer et al. (2000, pp.536-539).
The null and alternative hypotheses underlying the BPG test are given by
$$H_0: \sigma_i^2 = \sigma^2 \text{ for all } i \quad\text{versus}\quad H_1: \sigma_i^2 = \alpha_0 + z_i\alpha$$
where $z_i$ is a $1 \times m$ row vector of variables that are thought to explain the level of the variance for observation $i$ (note that these may include a set of dummies) while $\alpha$ is an $m \times 1$ parameter vector. Note that the hypothesis of homoscedasticity is now equivalent to the hypothesis
$$H_0: \alpha = 0$$
The auxiliary regression in this case will be given by
$$e_i^2 = \alpha_0 + z_i\alpha + v_i \tag{13.1}$$
and the $nR^2$ test statistic will be distributed asymptotically as $\chi^2(m)$.
Why does this work?

We can represent the alternative hypothesis in quasi-regression form as
$$\varepsilon_i^2 = \alpha_0 + z_i\alpha + \left(\varepsilon_i^2 - \sigma_i^2\right) = \alpha_0 + z_i\alpha + v_i \tag{13.2}$$
We have
$$E(v_i) = 0, \qquad E(\varepsilon_i^2) = \alpha_0 + z_i\alpha$$
If we could observe the $\varepsilon_i^2$ terms we could estimate this model by OLS and test the hypothesis $\alpha = 0$ by any of the tests discussed in a previous chapter. In particular, we could test it by the LM approach. In this case the restricted model would be given by the intercept only regression:
$$\varepsilon_i^2 = \alpha_0 + v_i \tag{13.3}$$
The residuals from this model would be regressed on the full model, which includes the intercept and the additional explanatory variables. $nR^2$ from this regression will then be a valid test of the restriction $\alpha = 0$. Since the residuals from the model 13.3 are identical to the $\varepsilon_i^2$ value minus a constant, the $nR^2$ in the regression 13.2 will be identical to this statistic.
Of course the $\varepsilon_i^2$ terms are not observed. Under our assumptions, however, the OLS residual vector $e \xrightarrow{p} \varepsilon$. Consequently $nR^2$ from the auxiliary regression 13.1 will be an asymptotically valid test of the restriction $\alpha = 0$.
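The $nR^2$ version of the test needs only two least squares fits. The sketch below assumes simulated data in which the error standard deviation is either constant or equal to $0.5 + z$; the function names and all numerical values are hypothetical. With a single $z$ variable the statistic would be compared with the $\chi^2(1)$ five percent critical value of about 3.84.

```python
import numpy as np

def ols_residuals(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def nr2_statistic(e2, Z):
    """n * R^2 from regressing squared residuals e2 on a constant and Z."""
    n = len(e2)
    Zc = np.column_stack([np.ones(n), Z])
    coef, *_ = np.linalg.lstsq(Zc, e2, rcond=None)
    resid = e2 - Zc @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((e2 - e2.mean()) ** 2)
    return n * r2

rng = np.random.default_rng(6)
n = 1000
z = rng.uniform(0.0, 3.0, n)
X = np.column_stack([np.ones(n), z])
beta = np.array([1.0, 2.0])                           # hypothetical coefficients
y_hom = X @ beta + rng.normal(size=n)                 # constant variance
y_het = X @ beta + rng.normal(size=n) * (0.5 + z)     # variance rises with z

stat_hom = nr2_statistic(ols_residuals(X, y_hom) ** 2, z)
stat_het = nr2_statistic(ols_residuals(X, y_het) ** 2, z)
print(stat_hom, stat_het)
```

The same helper covers the White test and the fitted-value variant discussed below: only the columns passed as `Z` change.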
13.2.2 White test
A similar logic underlies White's general test for heteroscedasticity. In this case the hypotheses are:
$$H_0: \sigma_i^2 = \sigma^2 \text{ for all } i \quad\text{versus}\quad H_1: \sigma_i^2 = x_iAx_i'$$
where the row vector $x_i$ is the vector of explanatory variables used in the regression and $A$ is some symmetric matrix. The auxiliary regression in this case is
$$e_i^2 = x_iAx_i' + v_i$$
If $x_i$ does not include an intercept term, then a constant is added into this regression. Note that the auxiliary regression involves all squares and cross product terms. Some of these variables may need to be dropped (e.g. the square of a dummy variable is perfectly collinear with the variable itself).
13.2.3 Other tests
There are a number of other tests available. For instance we could assume that $\sigma_i^2 = \alpha_0 + \alpha_1E(y_i)$, in which case our auxiliary regression would involve regressing the square of the residuals on a constant and the fitted values $\hat{y}_i$. $nR^2$ from the auxiliary regression would in this case be distributed as $\chi^2(1)$.
Another test that is sometimes encountered is the Goldfeld-Quandt test (discussed in Gujarati (2003, p.408)). We could split the sample up into two groups and estimate the regression separately on the subsamples, i.e. our model is
$$y_1 = X_1\beta + \varepsilon_1$$
$$y_2 = X_2\beta + \varepsilon_2$$
If the errors are normally distributed, then our OLS estimates of the error variances on each subsample should be distributed as chi-square, with
$$\frac{(n_1-k)\hat{\sigma}_1^2}{\sigma_1^2} \sim \chi^2(n_1-k) \quad\text{and}\quad \frac{(n_2-k)\hat{\sigma}_2^2}{\sigma_2^2} \sim \chi^2(n_2-k)$$
where the subscripts indicate which sample they are taken from. The ratio of these two, divided by their respective degrees of freedom, should be distributed as an $F$ statistic, since these $\chi^2$ statistics will obviously be independent of each other, i.e.
$$\frac{\hat{\sigma}_1^2/\sigma_1^2}{\hat{\sigma}_2^2/\sigma_2^2} \sim F(n_1-k, n_2-k)$$
Under the assumption that $\sigma_1^2 = \sigma_2^2$ we get that
$$\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} \sim F(n_1-k, n_2-k)$$
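13.3 Tests for autocorrelation
13.3.1 Breusch-Godfrey test
The augmented model is specified in this case as
$$y_t = x_t\beta + \sum_{j=1}^{p}\rho_j\varepsilon_{t-j} + u_t$$
The test of the hypothesis $\rho_1 = \rho_2 = \cdots = \rho_p = 0$ can be implemented as an LM type test using the auxiliary regression on the residuals
$$e_t = x_t\gamma + \sum_{j=1}^{p}\rho_j e_t(j) + v_t$$
where the variable $e_t(j)$ is defined as
$$e_t(j) = \begin{cases} e_{t-j} & \text{if } t > j \\ 0 & \text{otherwise} \end{cases}$$
The test statistic is $nR^2$ which is asymptotically distributed as $\chi^2(p)$.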
13.3.2 Durbin-Watson d test
The most famous test for first order autocorrelation is the Durbin-Watson d test. The test statistic is given by
$$d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2} = 2(1 - r) - \frac{e_1^2 + e_n^2}{\sum_{t=1}^{n}e_t^2}$$
where $r = \sum_{t=2}^{n}e_te_{t-1}\big/\sum_{t=1}^{n}e_t^2$ is the sample autocorrelation coefficient. The critical values for this test are somewhat awkward, because they divide into regions in which the null hypothesis is rejected, where it is accepted and a region where the test is inconclusive! The procedure is adequately described in the undergraduate econometrics text books (Gujarati 2003, pp.467-471).
Mittelhammer et al. (2000, p.550) observe that an asymptotically equivalent test can be derived through the inverse auxiliary regression model
$$e_t(1) = \rho e_t + v_t$$
and testing the null hypothesis that $\rho = 0$ (this could be done by means of a t-test). They suggest that unlike with the DW test, this test would be valid even if the error process is not normal.
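Both the $d$ statistic and the inverse auxiliary regression are simple to compute from a residual vector. In the sketch below the "residuals" are a simulated AR(1) series with an assumed coefficient of 0.6, standing in for residuals from an actual regression, and the function names are hypothetical. The final lines check the algebraic link between $d$ and the sample autocorrelation $r$.

```python
import numpy as np

def durbin_watson(e):
    """Sum of squared first differences over the sum of squares."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def inverse_auxiliary_regression(e):
    """Regress e_{t-1} on e_t without a constant; return rho-hat and its
    conventional t-statistic for the hypothesis rho = 0."""
    lagged, current = e[:-1], e[1:]
    rho = (current @ lagged) / (current @ current)
    resid = lagged - rho * current
    s2 = (resid @ resid) / (len(lagged) - 1)
    se = np.sqrt(s2 / (current @ current))
    return rho, rho / se

# Stand-in residuals: a simulated AR(1) series (values hypothetical)
rng = np.random.default_rng(7)
n = 400
u = rng.normal(size=n)
e = np.empty(n)
e[0] = u[0]
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + u[t]

d = durbin_watson(e)
r = (e[1:] @ e[:-1]) / (e @ e)
d_from_r = 2.0 * (1.0 - r) - (e[0] ** 2 + e[-1] ** 2) / (e @ e)
rho_hat, t_stat = inverse_auxiliary_regression(e)
print(d, d_from_r, rho_hat, t_stat)
```

Positive autocorrelation pushes $d$ below 2, and the end-point correction term becomes negligible as $n$ grows, which is why $d \approx 2(1 - r)$ is the usual rule of thumb.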
13.4 Pretest estimation
Some caution is appropriate if these diagnostic tests are used for model selection. In that case the process of estimation-testing-reestimation can be thought of as a different algorithm for arriving at estimates. For instance our pretest estimator might look as follows:
$$\hat{\beta}_{PT} = \begin{cases} \hat{\beta}_{OLS} & \text{if specification test passes} \\ \hat{\beta}_{FGLS} & \text{otherwise} \end{cases}$$
A typical example of this is where the analyst switches to using the Cochrane-Orcutt procedure once the DW test has failed.
One should note that the properties of the pretest estimator cannot be determined from the properties of the OLS and the FGLS estimators taken separately. When we analysed those properties we assumed that the analyst estimated the model once and once only. If the analyst engages in a serial specification search, the properties of the resulting estimator are likely to be very different from the theoretical properties that we outlined. Indeed there are some Monte Carlo results which suggest that the pretest estimator will fare badly, particularly in the cases where its application is most likely to bind!
13.5 A warning note
Misspecification of the regression can frequently result in OLS residuals that look heteroscedastic
or autocorrelated. In particular omission of a relevant variable or the choice of an inappropriate
functional form can lead to such problems. Failure of a specification test may therefore be
grounds for rethinking your specification as much as for worrying about the error process.
13.6 Exercises
1. You are trying to estimate a PPP type of relationship on time series data. In particular your theoretical model can be represented as
$$\ln exrate_t = \beta_1 + \beta_2\ln sappi_t + \beta_3\ln usppi_t + \varepsilon_t$$
where $exrate_t$ is the exchange rate (in this case South African cents per dollar), $sappi_t$ is the domestic price level (in this case given by the South African Producer Price Index) and $usppi_t$ is the foreign price level (given by US producer prices). You have estimated this on quarterly data from the first quarter of 1970 to the third quarter of 1997. Your empirical results are given in the (slightly edited) Stata output given below. You may find it useful to know that D is a dummy variable equal to one for the period from June 1984 to April 1994, i.e. the period of peak political conflict in South Africa. The operator L. in front of any variable is the lag operator, i.e. it refers to the previous period, e.g. L.error would be equivalent to $error_{t-1}$. The Prais-Winsten regression is equivalent to a Cochrane-Orcutt regression with the Prais-Winsten correction.
. regress lnexrate lnsappi lnusppi

                                                       Number of obs =     109
------------------------------------------------------------------------------
    lnexrate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lnsappi |   1.057781    .035494    29.80   0.000      .987411    1.128152
     lnusppi |  -1.134649   .0880065   -12.89   0.000    -1.309131   -.9601679
       _cons |   6.398329   .2706282    23.64   0.000     5.861782    6.934876
------------------------------------------------------------------------------

. matrix list e(V)

symmetric e(V)[3,3]
             lnsappi     lnusppi       _cons
lnsappi    .00125982
lnusppi    -.0029445   .00774515
  _cons    .00849819  -.02355327   .07323965

. predict error, residuals

. regress L.error error, noconstant

                                                       Number of obs =     108
------------------------------------------------------------------------------
     L.error |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       error |   .8888503   .0442049    20.11   0.000     .8012192    .9764814
------------------------------------------------------------------------------

. regress error lnsappi lnusppi D

      Source |       SS       df       MS              Number of obs =     109
-------------+------------------------------           F(  3,   105) =    5.77
       Model |  .214795764     3  .071598588           Prob > F      =  0.0011
    Residual |  1.30392951   105  .012418376           R-squared     =  0.1414
-------------+------------------------------           Adj R-squared =  0.1169
       Total |  1.51872528   108  .014062271           Root MSE      =  .11144

------------------------------------------------------------------------------
       error |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lnsappi |  -.0387676   .0343342    -1.13   0.261     -.106846    .0293108
     lnusppi |    .016065   .0820243     0.20   0.845    -.1465741    .1787042
           D |   .1131396   .0272041     4.16   0.000     .0591989    .1670802
       _cons |   .0157327    .251981     0.06   0.950    -.4838991    .5153645
------------------------------------------------------------------------------

. di e(N)*e(r2)
15.416046

. prais lnexrate lnsappi lnusppi

Prais-Winsten AR(1) regression -- iterated estimates

      Source |       SS       df       MS              Number of obs =     109
-------------+------------------------------           F(  3,   105) =  267.37
       Model |  2.42077756     3  .806925853           Prob > F      =  0.0000
    Residual |  .316889523   105  .003017995           R-squared     =  0.8842
-------------+------------------------------           Adj R-squared =  0.8809
       Total |  2.73766708   108  .025348769           Root MSE      =  .05494

------------------------------------------------------------------------------
    lnexrate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lnsappi |   1.000519   .1079209     9.27   0.000     .7865319    1.214506
     lnusppi |  -1.005464   .2548018    -3.95   0.000    -1.510689   -.5002393
           D |    .003786   .0385567     0.10   0.922    -.0726648    .0802368
       _cons |   6.039999   .7748212     7.80   0.000     4.503672    7.576326
-------------+----------------------------------------------------------------
         rho |    .892923
------------------------------------------------------------------------------

. matrix list e(V)

symmetric e(V)[4,4]
             lnsappi     lnusppi           D       _cons
lnsappi    .01164692
lnusppi   -.02526347   .06492398
      D   -.00022855  -.00017145   .00148662
  _cons    .06969469  -.19399087   .00099607   .60034786

(a) What might be your theoretical expectations about the coefficients $\beta_1$, $\beta_2$ and $\beta_3$?
(b) Interpret the regression output of the first regression.
(c) Test the following hypotheses on the first regression by the appropriate F or t-test:
i. $H_0: \beta_2 = -\beta_3$
ii. $H_0: \beta_2 = \beta_3 = 0$
(d) Test to see if there is first-order autocorrelation.
(e) Use an LM type of test to see if the coefficient on the dummy variable in the first model is zero.
(f) Compare the Prais-Winsten regression to the OLS regression.
Part III
Estimation with endogenous regressors - IV and GMM
Chapter 14
Instrumental Variables
Comment: include testing with IV (incl heteroscedasticity), natural experiments (some more stuff on weak instruments?)
14.1 Introduction
In this chapter we will be considering violations of the assumption that the X variables are
exogenous. In practice this is an enormously important topic and one that is still very much the
subject of ongoing theoretical work.
14.1.1 The model
In this case we are making the following assumptions about the DSP:
3. X: The X variables are endogenous/correlated with $\varepsilon$.
5. $\varepsilon$: The disturbances are independent and identically distributed with $\mathrm{Var}(\varepsilon \mid X) = \sigma^2 I$.
6. $f(\varepsilon \mid X)$: The distribution of the error terms is left unspecified.
14.1.2 Least squares bias and inconsistency
The OLS estimator is given (as before) by
$$\hat\beta = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon$$
Consequently
$$E[\hat\beta \mid X] = \beta + (X'X)^{-1}X'E[\varepsilon \mid X]$$
and hence $E[\hat\beta] = E_X\{E[\hat\beta \mid X]\} \neq \beta$, except in very particular circumstances.
Furthermore if $E(x'\varepsilon) = \gamma$ (where x is a row of X) then we can apply a law of large numbers to show that for well-behaved cases
$$\mathrm{plim}\,\frac{1}{n}X'\varepsilon = \gamma$$
so that
$$\mathrm{plim}\,\hat\beta = \beta + Q^{-1}\gamma$$
where $Q = \mathrm{plim}\,\frac{1}{n}X'X$, and OLS is not only biased, but inconsistent.
14.1.3 Examples
Omitted variables
One of the cases that we have already looked at is omitted variable bias. In some cases there is a
straightforward solution - include the omitted variable! In many contexts, however, the variable
may not have been measured on the data set, or it may even be unmeasurable. One example
that has been extensively analysed in the labour economics literature is that of the relationship
between schooling and wages. Consider the DSP given by
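$$\ln(\text{wage}_i) = \beta_2 s_i + \beta_3 a_i + u_i$$
where $s_i$ is schooling (in years) and $a_i$ is innate ability and everything is expressed in deviations from the respective means (see footnote 1). We assume that $\mathrm{Cov}(s_i, a_i) > 0$, i.e. someone of higher ability will find it easier to attain more schooling, everything else held constant. Generally, however, it is very difficult to measure innate ability, so we will estimate the model
$$\ln(\text{wage}_i) = \beta_2 s_i + v_i$$
where $v_i = \beta_3 a_i + u_i$. We note that $\mathrm{plim}\,\frac{1}{n}s'v = \beta_3\sigma_{sa}$. We have already seen that OLS will lead to biased results in this case:
$$\hat\beta_2 = (s's)^{-1}s'(\beta_2 s + \beta_3 a + u) = \beta_2 + \beta_3\hat\gamma + (s's)^{-1}s'u$$
$$E(\hat\beta_2) = \beta_2 + \beta_3 E(\hat\gamma)$$
where $\hat\gamma$ is the coefficient that we would have obtained in a regression of ability on schooling. It is straightforward to see that the OLS estimator will be inconsistent, with
$$\mathrm{plim}\,\hat\beta_2 = \beta_2 + \beta_3\frac{\sigma_{sa}}{\sigma_s^2}$$
where $\sigma_s^2 = \mathrm{Var}(s_i) = \mathrm{plim}\,\frac{1}{n}s's$ and $\sigma_{sa} = \mathrm{Cov}(s_i, a_i)$.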
In this particular context we would expect $\beta_3 > 0$ and $\sigma_{sa} > 0$, so the estimated returns to schooling from the typical regression will be overestimated, since part of the measured effect of schooling is due to the fact that schooling is correlated with ability, but we have not been able to control for ability.
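The size of this bias can be checked with a small simulation. The sketch below is only an illustration under assumed values: the parameter values, the normal distributions and the helper names `ols_slope` and `simulate` are not from the text. It regresses simulated log wages on schooling alone and compares the slope with the probability limit $\beta_2 + \beta_3\sigma_{sa}/\sigma_s^2$.

```python
import random

def ols_slope(x, y):
    """OLS slope of y on x without a constant (variables have mean zero)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def simulate(n=20000, beta2=0.10, beta3=0.05, seed=1):
    """Draw schooling s, ability a and log wages y with Cov(s, a) > 0."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(n)]            # ability (unobserved)
    s = [0.8 * ai + rng.gauss(0, 1) for ai in a]       # schooling rises with ability
    y = [beta2 * si + beta3 * ai + rng.gauss(0, 0.5)   # log wage
         for si, ai in zip(s, a)]
    return s, a, y

s, a, y = simulate()
b_short = ols_slope(s, y)          # regression that omits ability
sigma_sa = 0.8                     # Cov(s, a) implied by the assumed design
sigma_ss = 0.8 ** 2 + 1.0          # Var(s) implied by the assumed design
plim_short = 0.10 + 0.05 * sigma_sa / sigma_ss
print(round(b_short, 4), round(plim_short, 4))
```

In this design the short regression settles near the probability limit rather than near the true return of 0.10, which is the overestimation described above.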
Footnote 1: This is justified in terms of the FWL theorem. Note that the deviations model does not include the constant $\beta_1$.
Systems of equations
Another context in which the explanatory variables might become correlated with the error term
is if the relationship that we are trying to estimate is in fact part of a system of equations. One
of the simplest text book examples of this is given by the macroeconomic consumption function
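$$C_t = \alpha + \beta Y_t + \varepsilon_t \qquad (14.1)$$
which is simultaneously subject to the national accounting identity:
$$Y_t = C_t + I_t + G_t \qquad (14.2)$$
Substituting equation 14.1 into this, we find
$$Y_t = \frac{\alpha}{1-\beta} + \frac{1}{1-\beta}I_t + \frac{1}{1-\beta}G_t + \frac{1}{1-\beta}\varepsilon_t$$
From this last equation it is immediately obvious that $E(Y_t\varepsilon_t) \neq 0$. Consequently (as before) OLS estimates of equation 14.1 will produce biased and inconsistent coefficients.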
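The size of the simultaneity bias can be worked out explicitly. The derivation below is a sketch under stated assumptions: the consumption function is written as $C_t = \alpha + \beta Y_t + \varepsilon_t$ with identity $Y_t = C_t + I_t + G_t$, exogenous spending $Z_t = I_t + G_t$ is taken to be uncorrelated with $\varepsilon_t$, and $\sigma_\varepsilon^2$ and $\sigma_Z^2$ denote the two variances. These symbols are introduced here for illustration only.

```latex
\begin{aligned}
Y_t &= \frac{\alpha + Z_t + \varepsilon_t}{1-\beta}, \\
\operatorname{Cov}(Y_t,\varepsilon_t) &= \frac{\sigma_\varepsilon^2}{1-\beta}, \qquad
\operatorname{Var}(Y_t) = \frac{\sigma_Z^2+\sigma_\varepsilon^2}{(1-\beta)^2}, \\
\operatorname{plim}\hat\beta_{OLS} &= \beta + \frac{\operatorname{Cov}(Y_t,\varepsilon_t)}{\operatorname{Var}(Y_t)}
 = \beta + (1-\beta)\,\frac{\sigma_\varepsilon^2}{\sigma_Z^2+\sigma_\varepsilon^2}.
\end{aligned}
```

With $0 < \beta < 1$ the bias is positive, so under these assumptions OLS overstates the marginal propensity to consume. The bias shrinks as exogenous spending becomes more variable relative to the disturbance.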
Measurement error
This is an interesting topic which we will explore in more detail below. For the moment let us
assume that the DSP is given by
$$y = x^*\beta + \varepsilon$$
But we do not measure $x^*$ accurately. Instead, we measure
$$x = x^* + u$$
where u is a random error, which we assume to be uncorrelated with $x^*$. The model that we are able to estimate is given by
$$y = (x - u)\beta + \varepsilon = x\beta + v$$
where $v = \varepsilon - u\beta$. It is immediately clear that $E(x'v) \neq 0$, since the measurement error term u is a component of both x as well as of v.
14.1.4 The problem of nonexperimental data
The fundamental problem in all cases is that we generally do not have the luxury of controlling
the level of explanatory variables. Unlike the settings on a machine in a laboratory which can
be preset to specified levels, we cannot control the level of schooling that our research subjects
have; we cannot easily loosen or tighten their budget constraints or force them to reveal their
private information truthfully. As such the issue that we are addressing here goes to the heart
of the estimation problems facing applied economists.
14.2 The instrumental variables solution
In all these cases the theoretical solution is given by the instrumental variables estimator. The simple IV estimator is based on the assumption that there is a matrix of instruments W such that:
$$\mathrm{plim}\,\frac{1}{n}W'\varepsilon = \lim \frac{1}{n}E(W'\varepsilon) = 0 \qquad \text{(IV Assumption 1)}$$
$$\mathrm{plim}\,\frac{1}{n}W'W = \lim \frac{1}{n}E(W'W) = Q_{WW} \text{ and } Q_{WW} \text{ is positive definite} \qquad \text{(IV Assumption 2)}$$
$$\mathrm{plim}\,\frac{1}{n}W'X = Q_{WX} \text{ and has full rank} \qquad \text{(IV Assumption 3)}$$
It is defined as
$$\hat\beta_{IV} = (W'X)^{-1}W'y \qquad (14.3)$$
14.2.1 Rationale
We will show in a moment that this estimator is consistent. Before we do so, it is useful to consider how we might arrive at even thinking about an estimator of this sort. The first point to note is that the fundamental assumption is the first one given above, viz.
$$E(W'\varepsilon) = 0 \qquad (14.4)$$
i.e. we need a set of variables that are uncorrelated with (orthogonal to) the error term in the regression equation. If we can find such variables, they are instrumental in solving our estimation problem. Equation 14.4 is a population moment condition, so we might think of applying a method of moments logic to the estimation. In this case we would get
$$\frac{1}{n}W'(y - X\hat\beta_{IV}) = 0 \qquad (14.5)$$
so that we would get a set of equations akin to the normal equations:
$$W'y = W'X\hat\beta_{IV}$$
These could hold trivially if $W'y = W'X = 0$. Provided, however, that $W'X$ has full rank (which we have assumed at least asymptotically) we can solve out for $\hat\beta_{IV}$ which will give us the equation of the instrumental variables estimator given above. In short we require the set of instruments to be uncorrelated with the errors but sufficiently correlated with the explanatory variables.
Intuitively we can think of the situation as sketched out in Figure 14.1. We have $E(x'\varepsilon) \neq 0$ so that as $\varepsilon$ changes, so does x. This means that it is difficult for OLS to decompose the observed changes in y into changes which occur due to changes in x and changes in $\varepsilon$. The instrumental variable w in essence acts like a seismometer: it moves when x moves, but it does not change when $\varepsilon$ changes. By observing w we can decompose the changes in x into real changes (independent of $\varepsilon$) and changes which are correlated with $\varepsilon$. It is intuitively obvious that we should be able to retrieve an estimate of $\beta$ by observing the relationship between w and x and the induced relationship between w and y.
If we assume that
$$y = x\beta + \varepsilon, \quad x = w\pi + v$$
it follows that
$$y = w\pi\beta + (v\beta + \varepsilon) \qquad (14.6)$$
[Figure 14.1: A schematic representation of the model $y = x\beta + \varepsilon$, $x = w\pi + v$, $E(w'\varepsilon) = 0$, $E(\varepsilon'x) \neq 0$.]
Let $\gamma = \pi\beta$. We can estimate the relationship between y and w through OLS and get an estimate of $\gamma$. Similarly we can estimate the relationship between x and w and get an estimate of $\pi$. It is now easy to get an estimate of $\beta$ as
$$\hat\beta = \frac{\hat\gamma}{\hat\pi}$$
This should give consistent estimates, since $\hat\gamma$ and $\hat\pi$ give consistent estimates of $\gamma$ and $\pi$ respectively. This is effectively what the instrumental variables estimator does. In this particular case
$$\hat\gamma = (w'w)^{-1}w'y, \quad \hat\pi = (w'w)^{-1}w'x$$
so
$$\hat\beta = (w'x)^{-1}w'y$$
because the $(w'w)$ terms divide out. In order for this estimation strategy to work, we require in particular the assumption that w is uncorrelated with $\varepsilon$ (because otherwise we cannot estimate the relationship between w and y consistently). It is also evident that we potentially run into trouble if our estimate of $\pi$ is very small. Unfortunately even if w and x are strongly correlated within the population, it is always possible to get a pathological sample in which $\hat\pi$ happens to be too small. This means that instrumental variables estimation should be employed with suitable caution! We will discuss this in more detail below.
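The ratio argument can be illustrated numerically. The following is a minimal sketch on simulated data: the design (one instrument, no constant, $\beta = 2$, $\pi = 1$, normal errors) and the helper `dot` are assumptions made for illustration. It computes the two reduced-form slopes, their ratio, and the direct formula $(w'x)^{-1}w'y$, alongside the inconsistent OLS slope.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

rng = random.Random(3)
n, beta, pi = 20000, 2.0, 1.0
w = [rng.gauss(0, 1) for _ in range(n)]            # instrument
eps = [rng.gauss(0, 1) for _ in range(n)]          # structural error
v = [0.8 * e + rng.gauss(0, 0.6) for e in eps]     # first-stage error, correlated with eps
x = [pi * wi + vi for wi, vi in zip(w, v)]         # endogenous regressor
y = [beta * xi + ei for xi, ei in zip(x, eps)]

gamma_hat = dot(w, y) / dot(w, w)   # OLS of y on w
pi_hat = dot(w, x) / dot(w, w)      # OLS of x on w
b_ratio = gamma_hat / pi_hat        # ratio of the two slopes
b_iv = dot(w, y) / dot(w, x)        # (w'x)^{-1} w'y
b_ols = dot(x, y) / dot(x, x)       # inconsistent when x is correlated with eps
print(b_ratio, b_iv, b_ols)
```

The ratio and the direct formula agree up to rounding, since the $(w'w)$ terms cancel. In this assumed design OLS converges to 2.4 rather than 2.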
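14.2.2 Consistency
Consistency is relatively easy to establish. We have
$$\hat\beta_{IV} = (W'X)^{-1}W'y = \beta + (W'X)^{-1}W'\varepsilon$$
so
$$\mathrm{plim}\,\hat\beta_{IV} = \beta + \left(\mathrm{plim}\,\frac{1}{n}W'X\right)^{-1}\mathrm{plim}\,\frac{1}{n}W'\varepsilon = \beta + Q_{WX}^{-1}\cdot 0 = \beta$$
Note that in general the IV estimator will not be unbiased, since the random variables $(W'X)^{-1}W'$ and $\varepsilon$ are not independent of each other, so even if we could condition on W, we will not be able to break the expression $E[(W'X)^{-1}W'\varepsilon \mid W]$ up, i.e.
$$E[(W'X)^{-1}W'\varepsilon \mid W] \neq E[(W'X)^{-1}W' \mid W]\,E[\varepsilon \mid W]$$
It follows that $\frac{1}{n}(y - X\hat\beta_{IV})'(y - X\hat\beta_{IV}) = \frac{1}{n}e'e$ is a consistent estimator of $\sigma^2$, since
$$\frac{1}{n}e'e = \frac{1}{n}\left(X\beta + \varepsilon - X\beta - X(W'X)^{-1}W'\varepsilon\right)'\left(X\beta + \varepsilon - X\beta - X(W'X)^{-1}W'\varepsilon\right)$$
$$= \frac{1}{n}\left(\varepsilon - X(W'X)^{-1}W'\varepsilon\right)'\left(\varepsilon - X(W'X)^{-1}W'\varepsilon\right)$$
$$= \frac{1}{n}\varepsilon'\varepsilon - \frac{2}{n}\varepsilon'X(W'X)^{-1}W'\varepsilon + \frac{1}{n}\varepsilon'W(X'W)^{-1}(X'X)(W'X)^{-1}W'\varepsilon$$
so
$$\mathrm{plim}\,\frac{1}{n}e'e = \mathrm{plim}\left[\frac{1}{n}\varepsilon'\varepsilon - 2\left(\frac{1}{n}\varepsilon'X\right)\left(\frac{1}{n}W'X\right)^{-1}\left(\frac{1}{n}W'\varepsilon\right) + \left(\frac{1}{n}\varepsilon'W\right)\left(\frac{1}{n}X'W\right)^{-1}\left(\frac{1}{n}X'X\right)\left(\frac{1}{n}W'X\right)^{-1}\left(\frac{1}{n}W'\varepsilon\right)\right] = \sigma^2$$
because $\mathrm{plim}\,\frac{1}{n}W'\varepsilon = 0$ and the remaining factors have finite probability limits.
14.2.3 Asymptotic normality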
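We observe that
$$\sqrt{n}\,(\hat\beta_{IV} - \beta) = \left(\frac{1}{n}W'X\right)^{-1}\frac{1}{\sqrt{n}}W'\varepsilon$$
So provided that we can apply a central limit theorem to $\frac{1}{\sqrt{n}}W'\varepsilon$ (which we should be able to), we will have
$$\sqrt{n}\,(\hat\beta_{IV} - \beta) \xrightarrow{d} N\left(0, \sigma^2 Q_{WX}^{-1}Q_{WW}Q_{XW}^{-1}\right)$$
i.e.
$$\hat\beta_{IV} \overset{a}{\sim} N\left(\beta, \frac{\sigma^2}{n}Q_{WX}^{-1}Q_{WW}Q_{XW}^{-1}\right)$$
where $Q_{XW} = Q_{WX}'$.
14.3 The overidentified case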
In the discussion thus far we have assumed that we have exactly as many instruments as we have explanatory variables. If we have more instruments we potentially have different ways of estimating our coefficients. Which of these would be best?
One possibility is suggested by Figure 14.1. Another way of understanding what is happening in that case is to write x as
$$x = E(x \mid w) + v$$
so that we are in effect breaking x up into two components: one that reflects the correlation with w and one that is independent of it. The sample analogue of this is
$$x = \hat x + \hat v$$
where $\hat v$ is the set of residuals from the regression of x on w. The fitted values and these residuals are guaranteed to be uncorrelated with each other. If we write our model now in the form
$$y = (\hat x + \hat v)\beta + \varepsilon = \hat x\beta + u \qquad (14.7)$$
Note that $\hat x$ will be uncorrelated with u, since it is uncorrelated with $\hat v$ (by construction), and $\hat x$ is just a linear function of w and w is uncorrelated with $\varepsilon$ (by assumption). So we can estimate equation 14.7 consistently by OLS. It is useful to write this out in terms of the underlying instrumental variable, i.e.
$$y = w\hat\pi\beta + u$$
Looking at this relationship it is again clear that we can construct our estimate as $\hat\beta = \hat\gamma/\hat\pi$. We can see that the IV estimator in some sense uses the fitted values $\hat x$ as proxies for the stochastic x variable.
14.3.1 Two stage least squares
This case suggests a simple way in which we can generalise our approach to the situation in which we have more than one instrument for x. Suppose that we have two instruments $w_1$ and $w_2$. We could use either one to obtain consistent estimates, but this is not the best use of the information available. Indeed, if $w_1$ is correlated with x and $w_2$ is also correlated with x and both are uncorrelated with $\varepsilon$, then a linear combination $w_1\pi_1 + w_2\pi_2$ will in general also be correlated with x and uncorrelated with $\varepsilon$.
Our approach in this case is to regress x on the instruments $w_1$ and $w_2$ and then use the fitted values $\hat x = w_1\hat\pi_1 + w_2\hat\pi_2$ instead of x in the regression. Note that as in equation 14.7 the fitted values will be uncorrelated with both $\hat v$ and $\varepsilon$ and so the coefficient $\beta$ can be consistently estimated by OLS. This procedure is referred to as two stage least squares, because in the first stage we create the fitted values $\hat x$ and in the second stage we run the regression as $y = \hat x\beta + u$. Observe that the first stage creates the optimal linear combination $w_1\hat\pi_1 + w_2\hat\pi_2$; optimal in the sense that the correlation between the linear combination and x will be maximised.
The estimate obtained by the 2SLS procedure is
$$\hat\beta_{2SLS} = (\hat x'\hat x)^{-1}\hat x'y$$
But $\hat x = P_W x$, where $P_W = W(W'W)^{-1}W'$, so
$$\hat\beta_{2SLS} = (x'P_W x)^{-1}x'P_W y$$
This result generalises if there is more than one endogenous variable. Indeed if the DSP is given by
$$y = X\beta + \varepsilon$$
we can think of the matrix of instruments W for X, where some of the columns of W may simply be identical to some of the columns of X (if those particular variables are uncorrelated with $\varepsilon$). The generalised IV estimator or two stage least squares estimator is defined as:
$$\hat\beta_{IV} = (X'P_W X)^{-1}X'P_W y = \left[X'W(W'W)^{-1}W'X\right]^{-1}X'W(W'W)^{-1}W'y$$
Note that this estimator is defined only if the W matrix has at least the same rank as the X matrix. In the latter form it is clear that there is no need to do the estimation in two stages. Indeed in general we would not want to do the estimation in two stages, since the estimate of the error variance $\sigma^2$ will be incorrect: instead of being based on $e'e/n$ where $e = y - X\hat\beta_{IV}$, it would be based on $e_2'e_2/n$ where $e_2 = y - \hat X\hat\beta_{IV}$ and $\hat X = P_W X$.
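Both points, that the two stages reproduce the closed form and that the second-stage residuals give the wrong error variance, can be seen in a small example. The sketch below assumes one endogenous regressor, two instruments and no constant; the data-generating values and the helpers `dot` and `fitted_on_two` are hypothetical and only meant to illustrate the algebra.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def fitted_on_two(t, w1, w2):
    """Fitted values from OLS of t on w1 and w2 (no constant), i.e. P_W t."""
    a11, a12, a22 = dot(w1, w1), dot(w1, w2), dot(w2, w2)
    c1, c2 = dot(w1, t), dot(w2, t)
    det = a11 * a22 - a12 * a12
    p1 = (c1 * a22 - c2 * a12) / det
    p2 = (a11 * c2 - a12 * c1) / det
    return [p1 * u + p2 * z for u, z in zip(w1, w2)]

rng = random.Random(4)
n, beta = 20000, 2.0
w1 = [rng.gauss(0, 1) for _ in range(n)]
w2 = [rng.gauss(0, 1) for _ in range(n)]
eps = [rng.gauss(0, 1) for _ in range(n)]              # Var(eps) = 1
v = [0.8 * e + rng.gauss(0, 0.6) for e in eps]         # makes x endogenous
x = [a + b + c for a, b, c in zip(w1, w2, v)]
y = [beta * xi + ei for xi, ei in zip(x, eps)]

x_hat = fitted_on_two(x, w1, w2)                       # first stage
b_two_stage = dot(x_hat, y) / dot(x_hat, x_hat)        # second-stage OLS on x_hat
b_one_step = dot(x_hat, y) / dot(x_hat, x)             # (x'P_W x)^{-1} x'P_W y

e = [yi - b_one_step * xi for yi, xi in zip(y, x)]     # correct IV residuals
e2 = [yi - b_one_step * xh for yi, xh in zip(y, x_hat)]  # second-stage residuals
s2_iv = dot(e, e) / n
s2_second_stage = dot(e2, e2) / n
print(b_two_stage, b_one_step, s2_iv, s2_second_stage)
```

The two coefficient estimates coincide, but only the residuals computed with the original x recover the error variance of one used in the simulation.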
It is fairly easy to check consistency of the generalised IV estimator.
$$\hat\beta_{IV} = (X'P_W X)^{-1}X'P_W y = \beta + (X'P_W X)^{-1}X'P_W\varepsilon$$
If W is uncorrelated with $\varepsilon$, then asymptotically $\frac{1}{n}W'\varepsilon \to 0$. Consistency will follow provided that $\frac{1}{n}X'P_W X$ converges to some positive definite matrix. It will do so provided IV assumptions 2 and 3 hold.
The consistency of $\hat\sigma^2 = e'e/n$ follows just as before. By a similar argument we can also show that
$$\sqrt{n}\,(\hat\beta_{IV} - \beta) \xrightarrow{d} N\left(0, \sigma^2\,\mathrm{plim}\left(\frac{1}{n}X'P_W X\right)^{-1}\right)$$
i.e.
$$\hat\beta_{IV} \overset{a}{\sim} N\left(\beta, \sigma^2(X'P_W X)^{-1}\right)$$
We will estimate the covariance matrix as $\frac{e'e}{n}(X'P_W X)^{-1}$. Note that in this case there is no compelling case for dividing by $n - k$, since it is not the case that $E(e'e) = (n - k)\sigma^2$. Indeed there is no intrinsic reason why $e'e/n$ should underestimate $\sigma^2$, since the instrumental variables procedure is not based on minimising $e'e$. Of course asymptotically it makes little difference whether one divides by n or $n - k$.
14.3.2 Test of the overidentifying restrictions
One of the key conditions in order for instrumental variables estimation to be valid is that
$$\mathrm{plim}\,\frac{1}{n}W'\varepsilon = 0$$
We used this population moment condition to derive the IV estimator in equation 14.5. Note, however, that the sample moment conditions
$$\frac{1}{n}W'y = \frac{1}{n}W'X\hat\beta_{IV}$$
have a unique solution only if there are exactly k equations. If there are more columns in the W matrix than in the X matrix we have more equations than unknowns. This is why we refer to this situation as the overidentified situation. In general there will be no value of $\hat\beta_{IV}$ that will solve out all of these equations. In essence we could take any k equations and get a different set of sample estimates. We would assume, however, that if the population condition is really true that these estimates should all be approximately equal. We can test for the validity of these overidentifying restrictions by means of a simple test (Davidson and MacKinnon 1993, pp. 232-237). The procedure is as follows: we regress the IV residuals e on the set of instruments W. From this regression we calculate n times the uncentered $R^2$. This will be distributed approximately as $\chi^2(l - k)$ where l is the rank of the W matrix and k is the rank of the X matrix. Note that we can obviously not use this test if $l = k$.
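The procedure can be sketched in a few lines. The example below is an illustration under assumed conditions, not a general implementation: it uses one regressor, two instruments and no constant, so the statistic has one degree of freedom. The helper names and the simulated design (including the `direct_effect` parameter, which makes the second instrument invalid when it is nonzero) are hypothetical.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def fitted_on_two(t, w1, w2):
    """Fitted values from OLS of t on w1 and w2 (no constant)."""
    a11, a12, a22 = dot(w1, w1), dot(w1, w2), dot(w2, w2)
    c1, c2 = dot(w1, t), dot(w2, t)
    det = a11 * a22 - a12 * a12
    p1 = (c1 * a22 - c2 * a12) / det
    p2 = (a11 * c2 - a12 * c1) / det
    return [p1 * u + p2 * z for u, z in zip(w1, w2)]

def overid_stat(y, x, w1, w2):
    """n times the uncentered R-squared from regressing IV residuals on W."""
    x_hat = fitted_on_two(x, w1, w2)
    b_iv = dot(x_hat, y) / dot(x_hat, x)            # generalised IV estimate
    e = [yi - b_iv * xi for yi, xi in zip(y, x)]    # IV residuals
    e_hat = fitted_on_two(e, w1, w2)
    return len(y) * dot(e_hat, e_hat) / dot(e, e)

def draw(n, direct_effect, seed):
    """Simulate data; a nonzero direct_effect puts w2 into the error term."""
    rng = random.Random(seed)
    w1 = [rng.gauss(0, 1) for _ in range(n)]
    w2 = [rng.gauss(0, 1) for _ in range(n)]
    eps = [rng.gauss(0, 1) for _ in range(n)]
    v = [0.8 * e + rng.gauss(0, 0.6) for e in eps]
    x = [a + b + c for a, b, c in zip(w1, w2, v)]
    y = [2.0 * xi + direct_effect * b + e for xi, b, e in zip(x, w2, eps)]
    return y, x, w1, w2

stat_valid = overid_stat(*draw(5000, 0.0, 5))      # both instruments valid
stat_invalid = overid_stat(*draw(5000, 0.5, 5))    # w2 belongs in the equation
print(stat_valid, stat_invalid)
```

Each statistic would be compared with a $\chi^2(1)$ critical value. When the second instrument has a direct effect on y the statistic becomes very large, so the overidentifying restriction is rejected.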
14.4 IV and Ordinary Least Squares
14.4.1 OLS as a special case of IV estimation
If we let W = X we see that OLS is just a special case of IV estimation with the X variables as instruments for themselves!
There is an additional relationship between them. We note that the IV estimator is a linear estimator which will be unbiased in the special case where X is independent of $\varepsilon$. So if the assumptions of the Gauss-Markov theorem hold, we come to the conclusion that the IV estimator would be a linear unbiased estimator and hence by the Gauss-Markov theorem less efficient than the OLS estimator. In short, if X is uncorrelated with the errors we would prefer to run OLS rather than IV estimation with some instruments $W \neq X$.
14.4.2 Hausman specification test
This suggests that it is in general an interesting question to see if X is uncorrelated with the error vector $\varepsilon$. In particular we wish to test the hypothesis
$$H_0: y = X\beta + \varepsilon, \quad \varepsilon \sim (0, \sigma^2 I), \quad E(X'\varepsilon) = 0$$
$$H_1: y = X\beta + \varepsilon, \quad \varepsilon \sim (0, \sigma^2 I), \quad E(W'\varepsilon) = 0$$
Note that our test supposes that $\hat\beta_{IV}$ is definitely consistent, but inefficient under $H_0$. By contrast $\hat\beta_{OLS}$ is efficient and consistent under $H_0$, but inconsistent under $H_1$. Tests of inefficient but consistent estimators against possibly efficient estimators can be carried out by means of a Hausman test (this discussion is based on Davidson and MacKinnon 2004, pp. 341-2). The intuition for these tests follows from the fact that the inefficient estimator (written here as $\hat\beta_I$) can be written asymptotically as the sum of the efficient estimator ($\hat\beta_E$) plus an independent noise variable, i.e.
$$n^{1/2}(\hat\beta_I - \beta_0) = n^{1/2}(\hat\beta_E - \beta_0) + v$$
Since the two terms on the right hand side are independent of each other, we have asymptotically
$$\mathrm{var}\left[n^{1/2}(\hat\beta_I - \beta_0)\right] = \mathrm{var}\left[n^{1/2}(\hat\beta_E - \beta_0)\right] + \mathrm{var}(v)$$
Furthermore
$$v = n^{1/2}(\hat\beta_I - \hat\beta_E)$$
and hence has zero mean under $H_0$. A suitable test statistic is therefore given by
$$(\hat\beta_I - \hat\beta_E)'\left[\hat V(\hat\beta_I) - \hat V(\hat\beta_E)\right]^{-1}(\hat\beta_I - \hat\beta_E)$$
In the context of IV our test statistic would be
$$(\hat\beta_{IV} - \hat\beta_{OLS})'\left[\hat V(\hat\beta_{IV}) - \hat V(\hat\beta_{OLS})\right]^{-1}(\hat\beta_{IV} - \hat\beta_{OLS})$$
This should be distributed as $\chi^2$ with k degrees of freedom. However, if there are some explanatory variables that are instruments for themselves then the matrix $\hat V(\hat\beta_{IV}) - \hat V(\hat\beta_{OLS})$ may have rank less than k (in general it should be at least of the order $k_2$ where $k_2$ is the number of endogenous variables). Furthermore there is no guarantee that in finite samples the matrix $\hat V(\hat\beta_{IV}) - \hat V(\hat\beta_{OLS})$ is of full rank or that it is positive definite. As Davidson and MacKinnon (2004, pp. 341-2) note, one can base a valid test on a subvector. In this context one might want to create the test statistic from only the coefficients on the endogenous elements of X. The $\chi^2$ would have $k_2$ degrees of freedom. Indeed this would be the preferable version of the test.
14.4.3 Hausman's test by means of an artificial regression
It turns out that one can implement a test of the hypothesis given above by means of an artificial regression. The idea is straightforward. Assume that the model is given by
$$y = X_1\beta_1 + Z_2\beta_2 + \varepsilon$$
Here we assume that the $k_2$ variables $Z_2$ are endogenous and that we have a matrix $W = [X_1 \; W_2]$ of instruments, where the number of elements in $W_2$ is $l_2 \geq k_2$. We now run the auxiliary regression
$$y = X_1\beta_1 + Z_2\beta_2 + M_W Z_2\delta + \nu \qquad (14.8)$$
Note that $M_W Z_2$ is the set of residuals from the first stage regression of $Z_2$ on all the instruments W. A test of the hypothesis that $\delta = 0$ amounts to a test of the hypothesis that the OLS and IV coefficients are equal. One useful feature of the auxiliary regression 14.8 is that the OLS estimates of $\beta_1$ and $\beta_2$ are numerically identical to the IV estimates! This is fairly easy to show (see Exercises).
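The numerical identity can be checked directly in the simplest case. The sketch below assumes a single endogenous regressor, a single instrument and no constant; the simulated design and the helpers `dot` and `ols_two` are hypothetical. It adds the first-stage residual to the regression and compares the coefficient on x with the IV estimate.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def ols_two(y, x1, x2):
    """OLS of y on x1 and x2 (no constant); returns the two coefficients."""
    a11, a12, a22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    c1, c2 = dot(x1, y), dot(x2, y)
    det = a11 * a22 - a12 * a12
    return (c1 * a22 - c2 * a12) / det, (a11 * c2 - a12 * c1) / det

rng = random.Random(6)
n, beta = 20000, 2.0
w = [rng.gauss(0, 1) for _ in range(n)]             # instrument
eps = [rng.gauss(0, 1) for _ in range(n)]
v = [0.8 * e + rng.gauss(0, 0.6) for e in eps]      # Cov(v, eps) = 0.8, Var(v) = 1
x = [wi + vi for wi, vi in zip(w, v)]               # endogenous regressor
y = [beta * xi + ei for xi, ei in zip(x, eps)]

pi_hat = dot(w, x) / dot(w, w)
v_hat = [xi - pi_hat * wi for xi, wi in zip(x, w)]  # M_W x, first-stage residuals
b_aux, delta_hat = ols_two(y, x, v_hat)             # artificial regression
b_iv = dot(w, y) / dot(w, x)                        # simple IV estimate
print(b_aux, b_iv, delta_hat)
```

The coefficient on x in the artificial regression matches the IV estimate up to rounding. In this assumed design the coefficient on the residual is close to 0.8 rather than zero, which is what the test of $\delta = 0$ would pick up.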
14.5 Problems with IV estimation
14.5.1 Finite sample properties
The logic underpinning instrumental variable estimation is asymptotic. It turns out, however, that the finite sample properties of IV are problematic. In general the sampling distributions are very difficult to derive, but in some typical cases the sampling distribution of $\hat\beta_{IV}$ will have $l - k$ moments, where l is the number of instruments and k is the number of regressors. In particular if $l = k$ the sampling distribution of $\hat\beta_{IV}$ will have no mean! This means that the tails of the distribution are very fat, i.e. extreme outcomes will occur fairly often. Even if we have an extra instrument the distribution will have no variance, which again points to the possibility of extreme outcomes.
It may seem strange that it is possible for the IV estimator to be asymptotically well-behaved, while so badly behaved in small samples. As Davidson and MacKinnon (2004, p.327) note, it is not the case that if a sequence of random variables converges to a limiting distribution that the sequence of moments will converge. In this case the limiting distribution has all moments, whereas in the exactly identified case, none of the random variables in the sequence $\{\hat\beta_{IV}\}$ (indexed by the sample size n) has any moments at all!
Convergence in distribution, of course, implies that the CDFs converge to the CDF of the limiting distribution. To that extent the asymptotic distribution can yield valid p-values and confidence intervals.
Where $\hat\beta_{IV}$ has a mean, the estimator is typically biased. Indeed this bias will tend to increase with the degree of overidentification. This arises from the fact that the better the first stage regression fits, i.e. the closer the fitted values are to the endogenous variables themselves, the closer the IV results will be to the OLS ones.
14.5.2 Weak instruments
Asymptotically it does not matter how weak the correlation between the instrument and the endogenous variable is: any correlation can be good enough to identify the structural relationship. In practice, however, weak instruments can create many problems. In the first place with weak instruments there can be substantial departures from the asymptotic distributions even with hundreds of thousands of observations. This means that standard inference procedures can be very unreliable.
Secondly weak instruments will lead to large standard errors, so that even correctly estimated coefficients may turn out to be non-significant. It is quite easy to see this in relation to the standard formula for the variance of an OLS estimate, given in equation 8.12, i.e.
$$\mathrm{Var}(\hat\beta_k) = \frac{\sigma^2}{(1 - R_k^2)\sum_i (x_{ik} - \bar x_k)^2}$$
Consider the case where the variable $x_k$ is the only endogenous variable and where the instruments not included in the structural equation have weak explanatory power for $x_k$ (after controlling for $x_1, \ldots, x_{k-1}$). We know that the 2SLS estimates use $\hat x_k$ but this will now be highly correlated with the other explanatory variables, so $R_k^2$ will be close to one. Perforce the IV estimates will have higher standard errors.
A key issue that arises in this context is how to detect weak instruments and what to do about them. A basic precondition is that the instruments that are not in the main regression should be jointly significant in the first stage regression, i.e. they should have explanatory power in addition to variables that are included in the regression. As Stock, Wright and Yogo (2002, p.522) point out, however, in many cases the F statistic will have to be large, typically above 10, for inference to be reliable. It is these days regarded as unacceptable to publish IV and 2SLS results without reporting these diagnostic statistics.
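The first-stage diagnostic is simple to compute by hand in the one-instrument case. The sketch below is only an illustration: it assumes a single excluded instrument with no other regressors and no constant, so the F statistic is the squared t-ratio on the instrument. The function names and the simulated first-stage coefficients are hypothetical.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def first_stage_f(x, w):
    """F statistic for the single instrument w in the regression of x on w."""
    n = len(x)
    pi_hat = dot(w, x) / dot(w, w)
    resid = [xi - pi_hat * wi for xi, wi in zip(x, w)]
    s2 = dot(resid, resid) / (n - 1)          # one estimated coefficient
    return pi_hat ** 2 * dot(w, w) / s2       # squared t-ratio

def draw_first_stage(pi, n=500, seed=7):
    """Simulate x = pi * w + noise for a chosen first-stage coefficient pi."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(n)]
    x = [pi * wi + rng.gauss(0, 1) for wi in w]
    return x, w

f_strong = first_stage_f(*draw_first_stage(pi=1.0))    # strong instrument
f_weak = first_stage_f(*draw_first_stage(pi=0.02))     # nearly irrelevant instrument
print(f_strong, f_weak)
```

With a first-stage coefficient of one the statistic is far above the rule-of-thumb value of 10. With a coefficient of 0.02 it is far smaller, which is the situation the diagnostic is meant to flag.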
Nevertheless even where the instruments are weak, there are now techniques available for
providing more reliable inference. An accessible discussion is provided by Murray (2006).
14.6 Omitted variables
Above we showed that the omission of a relevant variable can lead to regressors that are correlated with the error term. Let us write the model in the form
$$y = x_1\beta_1 + \ldots + x_k\beta_k + z\gamma + \varepsilon \qquad (14.9)$$
where z is a variable that is not measured (or measurable). If z is correlated with the x variables, then we have the standard case of omitted variable bias, with
$$\mathrm{plim}\,\hat\beta_j = \beta_j + \gamma\delta_j$$
where $\delta_j$ is the coefficient on $x_j$ in the (population) projection of z on $x_1, \ldots, x_k$.
Two possible instrumentation strategies present themselves:
1. If we can find instruments w that are correlated with any $x_j$ variable where $\delta_j \neq 0$ but that are not themselves correlated with z (or $\varepsilon$), we could lump the term $z\gamma$ in with the error and estimate the model
$$y = x_1\beta_1 + \ldots + x_k\beta_k + u$$
by instrumental variables, using w as instruments for $x_j$.
2. Assume that we have an indicator variable for z, say $z_1$ where
$$z_1 = \theta_1 + \theta_2 z + \nu_1 \qquad (14.10)$$
and we assume that $\theta_2 \neq 0$, $\mathrm{Cov}(z, \nu_1) = 0$, $\mathrm{Cov}(x_j, \nu_1) = 0$. We can rewrite this as
$$z = -\frac{\theta_1}{\theta_2} + \frac{1}{\theta_2}z_1 - \frac{1}{\theta_2}\nu_1$$
Substituting this into the main regression we get the model
$$y = \beta_0 + x_1\beta_1 + \ldots + x_k\beta_k + z_1\gamma_1 + u \qquad (14.11)$$
where $\gamma_1 = \gamma/\theta_2$ and $z_1$ is now correlated with u because $u = -\frac{\gamma}{\theta_2}\nu_1 + \varepsilon$. Assume that we have a second indicator variable for z say $z_2$, where
$$z_2 = \rho_1 + \rho_2 z + \nu_2$$
and $\rho_2 \neq 0$, $\mathrm{Cov}(z, \nu_2) = 0$, $\mathrm{Cov}(x_j, \nu_2) = 0$ and $\mathrm{Cov}(\nu_1, \nu_2) = 0$. With these assumptions $z_2$ is a valid instrument for $z_1$ and the variables $x_j$ are valid instruments for themselves in the modified regression 14.11. The parameters of that regression can therefore be estimated consistently.
As Wooldridge (2002, pp. 63-67) discusses, under certain conditions the omitted variable problem can also be addressed by means of proxy variables, using OLS. The key difference between a proxy variable and an indicator variable is that we assume that we can write
$$z = \lambda_1 + z_1\lambda_2 + r$$
(compare to regression 14.10) where we now assume that $\mathrm{Cov}(z_1, r) = 0$ and $\mathrm{Cov}(x_j, r) = 0$. With these assumptions the main regression can be written in the form 14.11, but with $u = \gamma r + \varepsilon$. With the assumptions that we made this regression can be consistently estimated by OLS.
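14.7 Measurement error
The indicator variable model given in equation 14.10 can be thought of as an errors in variables model if its intercept is zero and its slope coefficient is one, so that $z_1 = z + \nu_1$. If $E(\nu_1 \mid z) = 0$, then we have the case of classical measurement error.
14.7.1 Attenuation bias
Consider the model
$$y = \beta_1 + z\beta_2 + \varepsilon$$
where z is the correctly measured variable, but all that we have available is the mismeasured variable $z_1$
$$z_1 = z + \nu_1$$
The estimated equation will therefore be given by
$$y = \beta_1 + z_1\beta_2 + (\varepsilon - \nu_1\beta_2)$$
The OLS estimator under these circumstances is given by
$$\hat\beta_2 = \frac{\sum z_{1i}y_i}{\sum z_{1i}^2}$$
where both $z_1$ and y are written as deviations from their respective means. Consequently
$$\mathrm{plim}\,\hat\beta_2 = \mathrm{plim}\,\frac{\frac{1}{n}\sum z_{1i}(z_{1i}\beta_2 + \varepsilon_i - \nu_{1i}\beta_2)}{\frac{1}{n}\sum z_{1i}^2}$$
$$= \beta_2 + \frac{\mathrm{plim}\,\frac{1}{n}\sum z_{1i}\varepsilon_i}{\mathrm{plim}\,\frac{1}{n}\sum z_{1i}^2} - \beta_2\frac{\mathrm{plim}\,\frac{1}{n}\sum z_{1i}\nu_{1i}}{\mathrm{plim}\,\frac{1}{n}\sum z_{1i}^2}$$
$$= \beta_2 + 0 - \beta_2\frac{\mathrm{Var}(\nu_1)}{\mathrm{Var}(z_1)}$$
$$= \beta_2\left(1 - \frac{\mathrm{Var}(\nu_1)}{\mathrm{Var}(z_1)}\right)$$
$$= \beta_2\frac{\mathrm{Var}(z)}{\mathrm{Var}(z) + \mathrm{Var}(\nu_1)} \qquad (14.12)$$
This formula suggests that in the case of classical measurement error the OLS coefficient estimate will be attenuated, i.e. biased towards zero.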
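Formula 14.12 is easy to verify by simulation. The sketch below assumes $\beta_2 = 1$ and equal variances for the true regressor and the measurement error, so that the attenuation factor is one half; these values and the helper `ols_slope` are illustrative assumptions.

```python
import random

def ols_slope(x, y):
    """OLS slope of y on x, with both variables taken in deviations from means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(2)
n, beta1, beta2 = 20000, 2.0, 1.0
z = [rng.gauss(0, 1) for _ in range(n)]              # true regressor, Var(z) = 1
z1 = [zi + rng.gauss(0, 1) for zi in z]              # mismeasured, Var(nu1) = 1
y = [beta1 + beta2 * zi + rng.gauss(0, 1) for zi in z]

b_true = ols_slope(z, y)       # uses the correctly measured variable
b_noisy = ols_slope(z1, y)     # uses the mismeasured variable
attenuation = 1.0 / (1.0 + 1.0)                      # Var(z) / (Var(z) + Var(nu1))
print(b_true, b_noisy, beta2 * attenuation)
```

The slope on the mismeasured variable settles near 0.5, half the true coefficient, as the formula predicts for this design.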
We can invoke the Frisch-Waugh-Lovell theorem to work out what would happen to the bias if we add some correctly measured covariates, i.e. if our model is now given by equation 14.9. In this case the OLS coefficient $\hat\beta_2$ is identical to the coefficient in the regression
$$e_y = e_2\beta_2 + u$$
where $e_y$ and $e_2$ are the residuals obtained by regressing y and $z_1$ respectively on the covariates. If the covariates are uncorrelated with the measurement error, then the regression error u is unaffected. It is easy to show now that
$$\mathrm{plim}\,\hat\beta_2 = \beta_2\frac{\mathrm{Var}(r)}{\mathrm{Var}(r) + \mathrm{Var}(\nu_1)}$$
where r is the residual that we would obtain if we projected the correctly measured variable z on the covariates. Since $\mathrm{Var}(r) \leq \mathrm{Var}(z)$, it is easy to see that the addition of correctly measured covariates can increase the problem of attenuation. The more collinear the other explanatory variables are with z, the worse the problem is likely to be.
14.7.2 Errors in variables estimator
The attenuation bias formula 14.12 can be used to correct the OLS estimates, provided that we have a consistent estimator either of $\mathrm{Var}(z)$ or $\mathrm{Var}(\nu_1)$. This may be available from other sources. For instance if we have access to administrative records, we may know precisely what $\mathrm{Var}(z)$ is in the population, even though we do not have z measured accurately in our sample. Sometimes detailed validation studies on subsamples can provide estimates of the variance of the measurement error, i.e. $\mathrm{Var}(\nu_1)$, in which case $\mathrm{Var}(z) = \mathrm{Var}(z_1) - \mathrm{Var}(\nu_1)$. In these circumstances the errors in variables estimator is given by
$$\hat\beta_{EV} = \frac{\mathrm{Var}(z_1)}{\mathrm{Var}(z)}\hat\beta_{OLS}$$
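The correction amounts to a single rescaling. The helper below is a hypothetical sketch (the function name, its arguments and the numbers in the usage lines are illustrative, not from the text); it accepts either the variance of the true variable or the variance of the measurement error.

```python
def errors_in_variables(b_ols, var_z1, var_nu=None, var_z=None):
    """Rescale an attenuated OLS slope by Var(z1) / Var(z).

    Supply either var_z (variance of the true variable) or var_nu
    (variance of the measurement error), from which Var(z) is inferred.
    """
    if var_z is None:
        if var_nu is None:
            raise ValueError("supply either var_nu or var_z")
        var_z = var_z1 - var_nu            # Var(z1) = Var(z) + Var(nu1)
    if var_z <= 0:
        raise ValueError("implied Var(z) must be positive")
    return b_ols * var_z1 / var_z

# Illustrative numbers only: OLS slope 0.5, Var(z1) = 4, Var(nu1) = 1
b_ev = errors_in_variables(0.5, var_z1=4.0, var_nu=1.0)
b_ev_alt = errors_in_variables(0.5, var_z1=4.0, var_z=2.0)
print(b_ev, b_ev_alt)
```

In practice the variances would themselves be estimates, so the corrected coefficient inherits their sampling error.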
14.7.3 Instrumental variables solution
If we have another indicator variable for z we can use the second indicator as an instrument for
z1 , provided that the error component in z2 is uncorrelated with the measurement error in z1 .
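A short simulation illustrates this strategy. The sketch assumes two noisy measures of the same variable with independent errors, a true coefficient of one and no constant; all names and values are assumptions made for the illustration.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

rng = random.Random(8)
n, beta2 = 20000, 1.0
z = [rng.gauss(0, 1) for _ in range(n)]            # true variable (unobserved)
z1 = [zi + rng.gauss(0, 1) for zi in z]            # first indicator
z2 = [zi + rng.gauss(0, 1) for zi in z]            # second indicator, independent error
y = [beta2 * zi + rng.gauss(0, 1) for zi in z]

b_ols = dot(z1, y) / dot(z1, z1)    # attenuated towards zero
b_iv = dot(z2, y) / dot(z2, z1)     # z2 used as an instrument for z1
print(b_ols, b_iv)
```

The IV estimate is close to the true coefficient because the error in the second indicator is unrelated to the error in the first, which is exactly the proviso stated above.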
14.8 Exercises
1. Show that the OLS coefficients $\hat\beta_1$ and $\hat\beta_2$ in the artificial regression 14.8 are identical to the IV coefficients
$$\hat\beta_{IV} = (X'P_W X)^{-1}X'P_W y$$
where $X = [X_1 \; Z_2]$.
Hint: Show that the IV coefficients are identical to the coefficients in the OLS regression
$$y = X_1\beta_1 + P_W Z_2\beta_2 + \nu$$
Show that these in turn give identical coefficients $\hat\beta_1$ and $\hat\beta_2$ in the artificial regression
$$y = X_1\beta_1 + P_W Z_2\beta_2 + M_W Z_2\delta + \nu$$
and that these in turn give identical coefficients for $\hat\beta_1$ and $\hat\beta_2$ in the artificial regression 14.8. Can you explain why $\delta$ in that regression should be zero?
2. Show that the IV residuals $e = y - X\hat\beta_{IV}$ have a sample mean of zero, provided that the intercept features in the list of instruments. Show that this implies that the usual (centred) $R^2$ can be used in the overidentification test.
3. Acemoglu, Johnson and Robinson (2001) have suggested that malaria deaths in the seventeen- and eighteen-hundreds, i.e. at the beginning of the process of colonisation, might provide a useful instrument for the quality of governance institutions in a cross-sectional regression.
(a) Sketch out the argument for why malaria deaths may be a good instrument. (Read the article!)
(b) What might be some of the problems with this instrument?
4. You are given the regression model
$$y_i = \beta_1 + \beta_2 x_i^* + \varepsilon_i$$
where y is the vector of the log of wages and $x^*$ is the vector of the (true) level of schooling. We assume that this model obeys the standard assumptions of the Classical Linear Regression Model. Unfortunately schooling is measured badly in your data set. Indeed you have reason to believe that measured schooling x is given by
$$x = x^* + u$$
where $\mathrm{Cov}(x^*, u) = 0$ and $\mathrm{Cov}(u, \varepsilon) = 0$. On your data set you observe that $\mathrm{Var}(x) = 96$. You also have a study available which suggests that $\mathrm{Var}(u) = 15$. On top of this you have data available for a subset of your observations on the schooling of a sibling. This variable z is also badly measured, i.e.
$$z = z^* + v$$
where $\mathrm{Cov}(z^*, v) = 0$ and $\mathrm{Cov}(\varepsilon, v) = 0$.
(a) Derive an expression for the asymptotic value of the OLS estimator.
(b) What would be the appropriate estimator of $\beta_2$?
(c) Under what circumstances could you use z as an instrument for x? Explain.
5. You are given the following model:
$$\ln w_i = \beta_1 + \beta_2 s_i + \beta_3 x_i + \varepsilon_{1i}$$
$$s_i = \alpha_1 + \alpha_2 p_i + \varepsilon_{2i}$$
where $w_i$ is the wage of individual i, $s_i$ is schooling, $x_i$ is experience and $p_i$ is the schooling of the parents of individual i.
You estimate the relationships empirically. The Stata output is as follows:
. reg logpay highed exper _I*

      Source |       SS       df       MS              Number of obs =     816
-------------+------------------------------           F(  7,   808) =   59.25
       Model |  244.761626     7  34.9659466           Prob > F      =  0.0000
    Residual |  476.807323   808  .590108072           R-squared     =  0.3392
-------------+------------------------------           Adj R-squared =  0.3335
       Total |  721.568948   815  .885360673           Root MSE      =  .76818

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      highed |   .1333759   .0091304    14.61   0.000     .1154538     .151298
       exper |   .0312511   .0051414     6.08   0.000      .021159    .0413433
   _Imetro_2 |   .1612915   .0754299     2.14   0.033     .0132297    .3093532
   _Imetro_3 |   .4565405   .0724531     6.30   0.000      .314322     .598759
    _Irace_2 |   .0398263   .0883555     0.45   0.652    -.1336071    .2132597
    _Irace_3 |   .2365166   .1069858     2.21   0.027     .0265137    .4465195
    _Irace_4 |    .477008   .1072434     4.45   0.000     .2664995    .6875166
       _cons |   4.850398   .1244376    38.98   0.000     4.606139    5.094657
------------------------------------------------------------------------------
. reg highed parent_ed exper _I* if logpay~=.

      Source |       SS       df       MS              Number of obs =     816
-------------+------------------------------           F(  7,   808) =   80.15
       Model |   4605.8463     7  657.978043           Prob > F      =  0.0000
    Residual |  6633.29953   808  8.20952912           R-squared     =  0.4098
-------------+------------------------------           Adj R-squared =  0.4047
       Total |  11239.1458   815   13.790363           Root MSE      =  2.8652

------------------------------------------------------------------------------
      highed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   parent_ed |   .2288393   .0310682     7.37   0.000     .1678555    .2898232
       exper |  -.2888812   .0162631   -17.76   0.000     -.320804   -.2569584
   _Imetro_2 |    .551927   .2826728     1.95   0.051    -.0029326    1.106787
   _Imetro_3 |   .4677897   .2761861     1.69   0.091    -.0743371    1.009917
    _Irace_2 |  -.6611566    .331181    -2.00   0.046    -1.311233   -.0110801
    _Irace_3 |   .1372697   .4034849     0.34   0.734    -.6547326    .9292719
    _Irace_4 |  -.1738147   .4272961    -0.41   0.684    -1.012556    .6649266
       _cons |   10.73762   .2738222    39.21   0.000     10.20014    11.27511
------------------------------------------------------------------------------
. predict u_ed, res
(1507 missing values generated)
. reg logpay highed exper _I* u_ed

      Source |       SS       df       MS              Number of obs =     816
-------------+------------------------------           F(  8,   807) =   55.94
       Model |  257.400261     8  32.1750326           Prob > F      =  0.0000
    Residual |  464.168688   807  .575178052           R-squared     =  0.3567
-------------+------------------------------           Adj R-squared =  0.3503
       Total |  721.568948   815  .885360673           Root MSE      =   .7584

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      highed |   .2964423   .0359358     8.25   0.000     .2259036    .3669809
       exper |   .0816785   .0118951     6.87   0.000     .0583296    .1050274
   _Imetro_2 |   .0236001   .0800533     0.29   0.768    -.1335372    .1807374
   _Imetro_3 |   .3015268   .0788048     3.83   0.000     .1468403    .4562134
    _Irace_2 |   .1050021   .0883318     1.19   0.235     -.068385    .2783893
    _Irace_3 |   .1383059   .1076816     1.28   0.199    -.0730632    .3496751
    _Irace_4 |   .3206584   .1110075     2.89   0.004      .102761    .5385558
        u_ed |  -.1740156   .0371227    -4.69   0.000     -.246884   -.1011472
       _cons |   2.923025   .4291273     6.81   0.000     2.080688    3.765362
------------------------------------------------------------------------------
The variables are defined as follows:

logpay: log of income received
highed: years of education attained

exper: length of experience
The regression is estimated over individuals for whom the parents' education could be determined.
(a) Given the regression output, what would be the estimate of the returns to education
if you were to estimate the first equation by instrumental variables, using parents'
education as an instrument for own education?
(b) Perform a Hausman test for the difference between the OLS and IV estimates. How
might you explain the results?
(c) Do the results suggest that you might have the problem of weak instruments?
(d) Interpret both the OLS and the IV estimates of the first equation.
(e) Discuss the empirical results in relation to the following two possible reasons for the
use of instrumental variables:
• omitted variable bias in the main regression
• measurement error in the schooling variable.
(f) What assumptions would you need to make for the OLS estimates to be valid? And
what assumptions are required in order for the IV estimates to be valid? Do you think
that any of these assumptions hold in this case?
Chapter 15
Estimation by Generalised
Method of Moments (GMM)
This chapter introduces a powerful generalisation of the Method of Moments considered earlier,
applicable if we have more equations than unknown parameters. The techniques introduced in
this chapter can be applied to individual equations as well as to systems of equations. In this
chapter we will outline the basis of the approach and show some applications. In the next chapter
we will apply it specifically to the analysis of systems of equations.
15.1 The moments of a Pareto distribution
In earlier work we considered estimating the parameters of a Pareto distribution by the method
of moments. We know that the pdf is given by $f(x\mid\theta) = \dfrac{\theta x_0^{\theta}}{x^{\theta+1}}$ and
$E(x) = \dfrac{\theta x_0}{\theta-1}$, provided that $\theta > 1$. We can rewrite this as
$E\left(x - \dfrac{\theta x_0}{\theta-1}\right) = 0$ and by definition the MoM estimator will satisfy
$$\frac{1}{n}\sum_i\left(x_i - \frac{\hat\theta_1 x_0}{\hat\theta_1-1}\right) = 0 \tag{15.1}$$
We noted in passing that there are potentially other moment conditions that we might have used.
In this context, for example, we know also that
$$E\left(x^2\right) = \frac{\theta x_0^2}{\theta-2}$$
provided that $\theta > 2$. We could therefore potentially get a different MoM estimator defined by
$$\frac{1}{n}\sum_i\left(x_i^2 - \frac{\hat\theta_2 x_0^2}{\hat\theta_2-2}\right) = 0 \tag{15.2}$$
Except by pure coincidence these two estimators will produce different estimates on the same
data set. Indeed if we write the two equations as one system
$$\frac{1}{n}\sum_i\left(x_i - \frac{\hat\theta x_0}{\hat\theta-1}\right) = 0 \tag{15.3a}$$
$$\frac{1}{n}\sum_i\left(x_i^2 - \frac{\hat\theta x_0^2}{\hat\theta-2}\right) = 0 \tag{15.3b}$$
there will, in general, be no solution for $\hat\theta$.
Instead of trying to do the impossible and solve both equations 15.3a and 15.3b we can think
of any candidate solution $\hat\theta$ for these moment equations as defining an error vector:
$$\frac{1}{n}\sum_i\left(x_i - \frac{\hat\theta x_0}{\hat\theta-1}\right) = g_1$$
$$\frac{1}{n}\sum_i\left(x_i^2 - \frac{\hat\theta x_0^2}{\hat\theta-2}\right) = g_2$$
We can now reframe the problem as one of picking an estimate $\hat\theta$ so as to minimise the error
vector $g = \begin{pmatrix} g_1 & g_2\end{pmatrix}'$. One natural way of doing this is to minimise the quadratic loss
$$Q(\hat\theta) = g'g = g_1^2 + g_2^2$$
After some thought, it is clear that this is unlikely to be the best way of combining the information in the two conditions, since it assumes that the information in the two conditions is of a
comparable quality. Instead we should use a weighted quadratic loss function
$$Q(\theta, W) = g'Wg \tag{15.5}$$
The GMM estimator $\hat\theta(W)$ is therefore a function of the particular weighting matrix $W$
selected.
$$\hat\theta(W) = \arg\min_{\theta} Q(\theta, W) \tag{15.6}$$
The first order condition for a solution to this is given by
$$\left(\frac{\partial g}{\partial\theta'}\right)' W g = 0 \tag{15.7}$$
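Remark 15.1 The matrix $\dfrac{\partial g}{\partial\theta'}$ is the same matrix of derivatives encountered when we considered the delta method. We can check that the formula gives the right results in the simple $2\times 2$
case where $g = \begin{pmatrix} g_1 & g_2\end{pmatrix}'$ and $W = \begin{pmatrix} w_{11} & w_{12}\\ w_{12} & w_{22}\end{pmatrix}$ (where we have stipulated that $w_{21} = w_{12}$ to
ensure symmetry). In this case
$$Q = g_1^2 w_{11} + 2 g_1 g_2 w_{12} + g_2^2 w_{22}$$
$$\frac{\partial Q}{\partial\theta} = 2 g_1\frac{\partial g_1}{\partial\theta} w_{11} + 2\frac{\partial g_1}{\partial\theta} g_2 w_{12} + 2 g_1\frac{\partial g_2}{\partial\theta} w_{12} + 2 g_2\frac{\partial g_2}{\partial\theta} w_{22}$$
$$= 2\begin{pmatrix}\dfrac{\partial g_1}{\partial\theta} & \dfrac{\partial g_2}{\partial\theta}\end{pmatrix}\begin{pmatrix} w_{11} & w_{12}\\ w_{12} & w_{22}\end{pmatrix}\begin{pmatrix} g_1\\ g_2\end{pmatrix} = 2\left(\frac{\partial g}{\partial\theta'}\right)' W g$$
Obviously the constant 2 does not affect the solution.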

Example 15.2 Consider the following sample extracted from a Pareto distribution with $\theta = 3$
and $x_0 = 5000$

    6481.484   5455.215   5251.304   5960.138   15734.03
    10680.74   5734.346   11212.84   6850.222   11302.23
    5283.836   5472.997   6202.225   8839.387   7536.145
     5029.76    7809.99   5538.132   6486.269   5302.856

We have
$$\hat\theta_1 = \frac{\bar x}{\bar x - x_0} = \frac{7408.207}{7408.207 - 5000} = 3.0762$$
$$\hat\theta_2 = \frac{2\,\overline{x^2}}{\overline{x^2} - x_0^2} = \frac{2\times 62410417}{62410417 - 25000000} = 3.3365$$
So the two estimators come up with different estimates.

If we set $\hat\theta = 3.0762$, then $g_1 = 0$ and $g_2 = 62410417 - \dfrac{3.0762\times 5000^2}{3.0762 - 2} = -9.0493\times 10^6$.
Alternatively if $\hat\theta = 3.3365$, then $g_1 = 7408.207 - \dfrac{3.3365\times 5000}{3.3365 - 1} = 268.25$, while $g_2 = 0$. We
observe that the two cases are not symmetrical: the error $g_2$ is much larger (in absolute value) when $g_1 = 0$ than
the other way around. The reason for this is that in the case of the second moment equation any
deviations are effectively squared, so magnifying the impact.
With the weighting matrix $W = I_2 = \begin{pmatrix} 1 & 0\\ 0 & 1\end{pmatrix}$,
$$Q(\theta, W) = g_1^2 + g_2^2 = \left(\frac{1}{n}\sum_i x_i - \frac{\theta x_0}{\theta-1}\right)^2 + \left(\frac{1}{n}\sum_i x_i^2 - \frac{\theta x_0^2}{\theta-2}\right)^2$$
Consequently the first order equation for $\hat\theta(I_2)$ can be written as
$$\left(\frac{1}{n}\sum_i x_i - \frac{\hat\theta x_0}{\hat\theta-1}\right)\frac{x_0}{(\hat\theta-1)^2} + 2\left(\frac{1}{n}\sum_i x_i^2 - \frac{\hat\theta x_0^2}{\hat\theta-2}\right)\frac{x_0^2}{(\hat\theta-2)^2} = 0$$
This does not have a neat closed form solution, but a numerical solution for the data given above
is
$$\hat\theta(I_2) = 3.3362$$
We observe that this is very close to $\hat\theta_2$, which we would expect to be less efficient than
$\hat\theta_1$, since second moments can be estimated less precisely than first moments. The reason why
$\hat\theta(I_2)$ is so heavily influenced by the second moment condition is obviously related to the point
made earlier: that deviations in the second equation are penalised more heavily. Since we are
weighting the two equations equally ($W = I_2$), we are placing much more emphasis on meeting
the second moment condition. This is obviously not ideal.
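The calculations in this example can be retraced with a short script. The following is only a sketch in plain Python: the helper `golden_min` (a golden-section search) and the search interval are illustrative choices, not something prescribed by the text. It computes the two method of moments estimates and then minimises the identity-weighted loss numerically; on this sample the minimiser should land practically on top of the second-moment estimate.

```python
# Sample from Example 15.2; x0 = 5000 is treated as known.
x0 = 5000.0
sample = [
    6481.484, 5455.215, 5251.304, 5960.138, 15734.03,
    10680.74, 5734.346, 11212.84, 6850.222, 11302.23,
    5283.836, 5472.997, 6202.225, 8839.387, 7536.145,
    5029.76, 7809.99, 5538.132, 6486.269, 5302.856,
]
n = len(sample)
m1 = sum(sample) / n                   # sample first moment
m2 = sum(v * v for v in sample) / n    # sample second moment

# Closed-form method of moments estimators
theta1 = m1 / (m1 - x0)
theta2 = 2 * m2 / (m2 - x0 ** 2)


def g(theta):
    """Error vector (g1, g2) of the two moment conditions at theta."""
    return (m1 - theta * x0 / (theta - 1),
            m2 - theta * x0 ** 2 / (theta - 2))


def q_identity(theta):
    """Quadratic loss g'g, i.e. GMM with W = I."""
    g1, g2 = g(theta)
    return g1 * g1 + g2 * g2


def golden_min(f, lo, hi, tol=1e-10):
    """Golden-section search for the minimum of a unimodal f on [lo, hi]."""
    ratio = (5 ** 0.5 - 1) / 2
    c = hi - ratio * (hi - lo)
    d = lo + ratio * (hi - lo)
    while hi - lo > tol:
        if f(c) < f(d):
            hi = d
        else:
            lo = c
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
    return (lo + hi) / 2


# The interval [2.5, 5] is an assumption; it must stay above theta = 2.
theta_gmm = golden_min(q_identity, 2.5, 5.0)
print(theta1, theta2, theta_gmm)
```

Because $g_2$ is several orders of magnitude larger than $g_1$, the loss is dominated by the second moment condition, which is the point made in the text.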
15.2 Definition and properties of the GMM estimator
More generally, suppose that there is a DSP identified by the $k$ dimensional parameter vector
$\theta$ and that there is an $r$ dimensional vector-valued function $h$ such that
$$E_{\theta_0}\left[h(w_i, \theta_0)\right] = 0 \tag{15.8}$$
where $\theta_0$ is the true value of the parameter vector, $w_i$ contains all the data relevant to the
$i$-th observation (e.g. a dependent variable y, explanatory variables x and instruments z), and $r \ge k$.
Let us further assume that
$$h(w_i, \theta_1) = h(w_i, \theta_2) \text{ if, and only if, } \theta_1 = \theta_2 \tag{15.9}$$
Then the GMM estimator $\hat\theta(W)$ is given by
$$\hat\theta(W) = \arg\min_{\theta}\, g'Wg, \text{ where } g = \frac{1}{n}\sum_i h(w_i, \theta)$$
The first order condition is given by
$$\left(\left.\frac{\partial g}{\partial\theta'}\right|_{\hat\theta}\right)' W g(\hat\theta) = 0 \tag{15.10}$$
where the matrix $\left.\dfrac{\partial g}{\partial\theta'}\right|_{\hat\theta}$ is of rank $k$ and the matrix $W$ is of rank $r$.
Remark 15.3 If $r = k$, then $\left.\dfrac{\partial g}{\partial\theta'}\right|_{\hat\theta}$ is a $k\times k$ matrix with full rank, i.e. it has an inverse. In
that case the first order conditions simplify to
$$g(\hat\theta) = 0$$
regardless of which full-rank weighting matrix $W$ is used. The standard method of moments
estimators are examples of this case.
15.2.1 Assumptions
Besides the assumptions given in equations 15.8 and 15.9 (an identification assumption), we
need to make a number of additional assumptions. Let us write the objective function for a
sample of size $n$ as
$$Q_n(\theta, W_n) = g_n' W_n g_n \tag{15.11}$$
1. We will assume that the matrix $G_0$ exists, where
$$G_0 = \operatorname{plim}\left.\frac{\partial g_n}{\partial\theta'}\right|_{\theta_0} \tag{15.12}$$
and $G_0$ is finite with rank $k$.
2. We assume $W_n \overset{p}{\to} W_0$ where $W_0$ is finite symmetric positive definite.
Note that this includes the case where each $W_n$ is identical (e.g. $I_r$).
3. We assume that
$$\sqrt{n}\, g_n(\theta_0) \overset{d}{\to} N(0, S_0) \tag{15.13}$$
where
$$S_0 = \operatorname{plim}\frac{1}{n}\sum_i \left. h_i h_i'\right|_{\theta_0} \tag{15.14}$$
Here we have assumed independent draws between observations and $h_i$ is abbreviated for
$h(w_i, \theta)$. Cameron and Trivedi (2005, p.174) provide the appropriate formula if observations are not independent. Note that for the case of independent observations drawn from
the same distribution this takes the easy to remember form
$$S_0 = E\left[hh'\right] \tag{15.15}$$
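15.2.2 Consistency
We will sketch out why this estimator is likely to be consistent (here we follow Cameron and
Trivedi (2005, pp.182-183)). Note that
$$\left.\frac{\partial Q_n}{\partial\theta}\right|_{\theta_0} = 2\left(\left.\frac{\partial g_n}{\partial\theta'}\right|_{\theta_0}\right)' W_n\, g_n(\theta_0)$$
Now $\left.\dfrac{\partial g_n}{\partial\theta'}\right|_{\theta_0} \overset{p}{\to} G_0$, $W_n \overset{p}{\to} W_0$ and $g_n(\theta_0) \overset{p}{\to} 0$, hence
$\left.\dfrac{\partial Q_n}{\partial\theta}\right|_{\theta_0} \overset{p}{\to} 0$. This means that the
parameter vector $\theta_0$ asymptotically satisfies the first order conditions for a minimum. Note
further that because $G_0$ and $W_0$ are of full rank, this is not trivially the case. It is necessary
that $g_n(\theta_0) \overset{p}{\to} 0$. By the identification condition (equation 15.9) there is no other parameter
vector $\theta_1$ for which $E\left[h(w_i, \theta_1)\right] = 0$.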
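15.2.3 Asymptotic normality
Again we follow Cameron and Trivedi (2005, pp.182-183). In order to establish asymptotic
normality we take a Taylor expansion of $g_n(\hat\theta)$ around $\theta_0$, i.e.
$$g_n(\hat\theta) = g_n(\theta_0) + \left.\frac{\partial g_n}{\partial\theta'}\right|_{\theta^+}\left(\hat\theta - \theta_0\right)$$
where $\theta^+$ lies between $\theta_0$ and $\hat\theta$. Substituting this into the first order conditions (equation 15.10)
and multiplying by $\sqrt{n}$ we get
$$\left(\left.\frac{\partial g_n}{\partial\theta'}\right|_{\hat\theta}\right)' W_n\left[\sqrt{n}\, g_n(\hat\theta)\right] = 0$$
$$\left(\left.\frac{\partial g_n}{\partial\theta'}\right|_{\hat\theta}\right)' W_n\left[\sqrt{n}\, g_n(\theta_0) + \left.\frac{\partial g_n}{\partial\theta'}\right|_{\theta^+}\sqrt{n}\left(\hat\theta - \theta_0\right)\right] = 0$$
We can solve this for $\sqrt{n}\left(\hat\theta - \theta_0\right)$, i.e.
$$\sqrt{n}\left(\hat\theta - \theta_0\right) = -\left[\left(\left.\frac{\partial g_n}{\partial\theta'}\right|_{\hat\theta}\right)' W_n \left.\frac{\partial g_n}{\partial\theta'}\right|_{\theta^+}\right]^{-1}\left(\left.\frac{\partial g_n}{\partial\theta'}\right|_{\hat\theta}\right)' W_n \sqrt{n}\, g_n(\theta_0)$$
By our assumptions $\sqrt{n}\, g_n(\theta_0) \overset{d}{\to} N(0, S_0)$ and since $\hat\theta \overset{p}{\to} \theta_0$, we must have $\theta^+ \overset{p}{\to} \theta_0$. Consequently the first square bracket has probability limit $\left[G_0' W_0 G_0\right]^{-1}$ and
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \overset{d}{\to} -\left[G_0' W_0 G_0\right]^{-1} G_0' W_0\, N(0, S_0)$$
i.e.
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \overset{d}{\to} N\left(0, \left[G_0' W_0 G_0\right]^{-1} G_0' W_0 S_0 W_0 G_0 \left[G_0' W_0 G_0\right]^{-1}\right) \tag{15.16}$$
Remark 15.4 Note that if $r = k$, so that $G_0$ and $W_0$ are both square this simplifies. In particular $\left[G_0' W_0 G_0\right]^{-1} = G_0^{-1} W_0^{-1} \left(G_0'\right)^{-1}$, so that
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \overset{d}{\to} N\left(0, G_0^{-1} S_0 \left(G_0'\right)^{-1}\right)$$
This will be the asymptotic distribution of the method of moments estimators.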
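15.2.4 Estimating the covariance matrices
We can do inference on these estimators, provided that we can obtain estimates of $G_0$, $W_0$ and
$S_0$. For $W_0$ we simply use the sample weighting matrix $W_n$. For $G_0$ we use
$$\hat G = \left.\frac{\partial g_n}{\partial\theta'}\right|_{\hat\theta} \tag{15.17}$$
and for $S_0$
$$\hat S = \frac{1}{n}\sum_i \left. h_i h_i'\right|_{\hat\theta} \tag{15.18}$$
Consequently
$$\hat V\left(\hat\theta\right) = \frac{1}{n}\left[\hat G' W_n \hat G\right]^{-1}\hat G' W_n \hat S W_n \hat G\left[\hat G' W_n \hat G\right]^{-1} \tag{15.19}$$
15.3 Optimal GMM and Estimated Optimal GMM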
We return to the point made earlier that the GMM estimator is a function of the weighting
matrix $W$. Considering the variance of the GMM estimator, we can show that the variance is
minimised if we pick
$$W = S_0^{-1} \tag{15.20}$$
In that case the asymptotic distribution of this optimal GMM estimator (OGMM) simplifies to
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \overset{d}{\to} N\left(0, \left[G_0' S_0^{-1} G_0\right]^{-1}\right) \tag{15.21}$$
In practice knowledge of $S_0$ will require knowledge of $\theta_0$, which we are trying to estimate. Consequently this estimator is generally not feasible. Instead we can use a two-step procedure to
get estimates of $S_0$. In the first step use any GMM estimator (e.g. with $W = I_r$). By the
argument above this should lead to a consistent set of parameter estimates $\hat\theta$. Use these
estimates to estimate $\hat S$ (by equation 15.18). Then set
$$W = \hat S^{-1}$$
and reestimate, i.e.
$$\hat\theta_{EOGMM} = \arg\min_{\theta}\, g_n' \hat S^{-1} g_n$$
This estimated optimal GMM estimator has the same asymptotic distribution given in equation 15.21. For purposes of inference we estimate the covariance matrix as
$$\hat V\left(\hat\theta_{EOGMM}\right) = \frac{1}{n}\left[\hat G' \hat S^{-1} \hat G\right]^{-1}$$
where $\hat G$ and $\hat S$ are estimated at $\hat\theta_{EOGMM}$.
Note that the relationship between EOGMM and OGMM is similar to that between GLS and
FGLS. As in the case of FGLS we use a consistent (but inefficient) estimator to get estimates of
the covariance matrix and then use that estimated covariance matrix to approximate the efficient
estimator.
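The two-step procedure can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the simulated Pareto sample (drawn with `random.paretovariate`), the seed, the sample size, and the crude grid search in `minimise`, which merely stands in for a proper numerical optimiser. The two moment conditions are those of section 15.1.

```python
import random

random.seed(2011)                       # arbitrary seed, for reproducibility only
X0, THETA_TRUE, N = 5000.0, 3.0, 5000   # hypothetical simulation settings
x = [X0 * random.paretovariate(THETA_TRUE) for _ in range(N)]
m1 = sum(x) / N
m2 = sum(v * v for v in x) / N


def mu(theta):
    """Model-implied first and second moments of the Pareto distribution."""
    return theta * X0 / (theta - 1), theta * X0 ** 2 / (theta - 2)


def g(theta):
    """Sample moment vector g_n(theta)."""
    a, b = mu(theta)
    return m1 - a, m2 - b


def objective(theta, w):
    """Quadratic form g' W g for a symmetric 2x2 weighting matrix w."""
    g1, g2 = g(theta)
    return w[0][0] * g1 * g1 + 2 * w[0][1] * g1 * g2 + w[1][1] * g2 * g2


def minimise(w, lo=2.2, hi=6.0, steps=3800):
    """Coarse grid search, then a finer grid around the best coarse point."""
    width = (hi - lo) / steps
    coarse = [lo + i * width for i in range(steps + 1)]
    best = min(coarse, key=lambda t: objective(t, w))
    fine = [best - width + i * width / 500 for i in range(1001)]
    return min(fine, key=lambda t: objective(t, w))


# Step 1: any consistent GMM estimator, here with W = I
theta_step1 = minimise([[1.0, 0.0], [0.0, 1.0]])

# Estimate S as the average of h_i h_i' evaluated at the step-1 estimate
a, b = mu(theta_step1)
s11 = sum((v - a) ** 2 for v in x) / N
s12 = sum((v - a) * (v * v - b) for v in x) / N
s22 = sum((v * v - b) ** 2 for v in x) / N
det = s11 * s22 - s12 * s12
w_hat = [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]   # inverse of S-hat

# Step 2: re-estimate with W = S-hat^{-1}
theta_step2 = minimise(w_hat)
print(theta_step1, theta_step2)
```

The second step only changes the weighting matrix; the moment conditions and the data are the same in both steps.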
15.4 Lessons from the Pareto distribution
Let us return to the case of the Pareto distribution considered earlier and see these estimators
in action. Figures 15.1 and 15.2 show our results. The summary statistics for this Monte Carlo
simulation are given in the following table:

                                  n=2000                 n=100000
                             Mean       s.d.         Mean       s.d.
    $\hat\theta_1$         3.004076   .0756765     3.000169   .0107534
    $\hat\theta_2$         3.036968   .1456797     3.004737   .0519052
    $\hat\theta(I_2)$      3.036968   .1456797     3.004737   .0519052
    $\hat\theta_{EOGMM}$   3.045679   .0958747     2.998078   .0099747
    $\hat\theta_{OGMM}$    3.003296   .072418      3.000135   .0105005
    $\hat\theta_{MLE}$     3.003228   .066415      3.000174   .0092302
    replications           2000                    1294
Several features deserve comment:
1. All the estimators look approximately unbiased. To be more precise, a 95% confidence
interval for $E(\hat\theta)$ will be given by the Monte Carlo sample average (e.g. 3.004076) $\pm$ twice the
standard error. The standard error will be the standard deviation divided by the square root of the number of replications
(e.g. $3.004076 \pm 2\times\dfrac{.0756765}{\sqrt{2000}}$, i.e. 3.0006916 to 3.0074604, which shows some bias, but not
much).
2. All the estimators appear consistent: the standard deviation of their sampling distributions
decreases with sample size.
3. The $\hat\theta(I_2)$ estimator yields estimates that are indistinguishable from $\hat\theta_2$. The reason
for this has already been noted.
4. While the standard deviation of $\hat\theta(I_2)$ and $\hat\theta_2$ decreases with sample size it appears
to do so at a rate considerably slower than $\sqrt{n}$. Given a fifty-fold increase in sample size
we would have expected the standard deviation to be approximately one seventh (roughly $1/\sqrt{50}$)
in the larger sample. This is, however, definitely not the case. The other estimators, by
contrast all appear to be roughly $\sqrt{n}$ consistent. The reason why $\hat\theta_2$ is not $\sqrt{n}$-consistent,
is that although $E[h_2] = 0$ where $h_2 = x^2 - \dfrac{\theta x_0^2}{\theta-2}$, $E\left[h_2^2\right] = \infty$, since it involves the fourth
moments of a Pareto distribution with $\theta = 3$. Consequently the standard central limit
theorems do not apply. In effect the optimal weighting of the second moment condition
$g_2$ is at zero!
5. Comparing figures 15.1 and 15.2 it is clear that the optimality of EOGMM is, indeed, a
large sample result. The sample has to be large enough that $\hat S$ can be reliably estimated.
6. In all cases maximum likelihood outperforms these GMM estimators. The optimality of
GMM therefore has to be understood within the context of the given moment conditions.

[Figure 15.1: The performance of different GMM estimators with a small sample and problematic
distribution. Plot title: Estimators of parameter of Pareto Distribution; horizontal axis: theta; series: MOM1 (first moment), MOM2 (second moment), GMM (weighted with identity matrix), OGMM, EOGMM, MLE; n=2000, Replications=2 000, true theta=3.]

[Figure 15.2: GMM estimators with a bigger sample size. Plot title: Estimators of parameter of Pareto Distribution; horizontal axis: theta; series: MOM1 (first moment), MOM2 (second moment), GMM (weighted with identity matrix), OGMM, EOGMM, MLE; n=100000, Replications=2 000, true theta=3.]
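A scaled-down version of such a Monte Carlo experiment, covering only the two method of moments estimators, might look as follows. This is a sketch under stated assumptions: the seed and the number of replications (200 rather than 2000, to keep the run short) are arbitrary, and `mom_estimates` is a helper name introduced here. The confidence interval follows the rule in point 1: sample average plus or minus twice the standard deviation divided by the square root of the number of replications.

```python
import random
import statistics

random.seed(12345)            # arbitrary seed
X0, THETA = 5000.0, 3.0       # true scale and shape, as in the text
N, REPS = 2000, 200           # REPS is reduced for speed (an assumption)


def mom_estimates(sample):
    """First-moment and second-moment MoM estimates of theta."""
    n = len(sample)
    m1 = sum(sample) / n
    m2 = sum(v * v for v in sample) / n
    return m1 / (m1 - X0), 2 * m2 / (m2 - X0 ** 2)


draws1, draws2 = [], []
for _ in range(REPS):
    # random.paretovariate draws from a Pareto with minimum 1; rescale by X0
    sample = [X0 * random.paretovariate(THETA) for _ in range(N)]
    t1, t2 = mom_estimates(sample)
    draws1.append(t1)
    draws2.append(t2)

for name, draws in (("MOM1", draws1), ("MOM2", draws2)):
    mean = statistics.mean(draws)
    sd = statistics.stdev(draws)
    half_width = 2 * sd / REPS ** 0.5   # twice the Monte Carlo standard error
    print(f"{name}: mean={mean:.4f} sd={sd:.4f} "
          f"95% CI=({mean - half_width:.4f}, {mean + half_width:.4f})")
```

Rerunning with a larger `N` gives a rough impression of how quickly each standard deviation shrinks.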
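15.5 GMM estimator of the linear model with exogenous regressors
A GMM approach to the linear model would start with the underlying moment condition, which
in this case can be written
$$E\left[x_i'\left(y_i - x_i\beta\right)\right] = 0$$
where $x_i$ is a row vector from the $X$ matrix, $\beta$ the vector of unknown parameters and $y_i = x_i\beta + \varepsilon_i$
(by assumption). In this case therefore
$$h_i = x_i'\left(y_i - x_i\beta\right)$$
$$g = \frac{1}{n}\sum_i x_i'\left(y_i - x_i\beta\right) = \frac{1}{n}X'(y - X\beta)$$
The matrix $\dfrac{\partial g}{\partial\beta'}$ is given by $-\dfrac{1}{n}X'X$, so the first order conditions for the GMM estimator become (after multiplying through by $-1$)
$$\left(\frac{1}{n}X'X\right) W\, \frac{1}{n}X'\left(y - X\hat\beta\right) = 0$$
As we noted above, in the case where we have $k$ equations in $k$ unknowns (as here), this simplifies
to the simple method of moments condition
$$\frac{1}{n}X'\left(y - X\hat\beta\right) = 0$$
$$\hat\beta = (X'X)^{-1}X'y$$
This is the solution regardless of the weighting matrix $W$.

We know that the asymptotic distribution of $\hat\beta$ in this case is given by
$$\sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0, G_0^{-1} S_0 \left(G_0'\right)^{-1}\right)$$
i.e.
$$\hat\beta \overset{a}{\sim} N\left(\beta_0, \frac{1}{n} G_0^{-1} S_0 \left(G_0'\right)^{-1}\right)$$
In this case $G_0$ is (up to sign) the probability limit of $\dfrac{1}{n}X'X$, which we would approximate with $\dfrac{1}{n}X'X$.
Furthermore $S_0 = \operatorname{plim}\dfrac{1}{n}\sum_i \left. h_i h_i'\right|_{\beta_0}$, which will be approximated by
$$\hat S = \frac{1}{n}\sum_i \left. h_i h_i'\right|_{\hat\beta} = \frac{1}{n}\sum_i\left[x_i'\left(y_i - x_i\hat\beta\right)\right]\left[\left(y_i - x_i\hat\beta\right)' x_i\right] = \frac{1}{n}\sum_i x_i' e_i^2 x_i$$
where $e_i = y_i - x_i\hat\beta$ is the $i$-th residual.
Consequently the covariance matrix of $\hat\beta$ simplifies to
$$\hat V\left(\hat\beta\right) = \frac{1}{n}\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}\sum_i e_i^2 x_i'x_i\right)\left(\frac{1}{n}X'X\right)^{-1} = (X'X)^{-1}\left(\sum_i e_i^2 x_i'x_i\right)(X'X)^{-1}$$
This, of course, is identical to the formula for the robust covariance matrix of the OLS estimator
given in equation 12.3.
Note that the GLS estimator cannot be derived from the moment condition above. We need
a different set of moment conditions, those emanating from the transformed model (i.e. equation
11.5). The moment condition is
$$E\left[x_i^{*\prime}\left(y_i^* - x_i^*\beta\right)\right] = 0$$
where the star denotes the transformed variables. This again leads to $k$ equations in $k$ unknowns
$$\frac{1}{n}X'\Omega^{-1}(y - X\beta) = 0$$
and hence a definite solution.

The point is that the same linear model
$$y = X\beta + \varepsilon$$
leads to two different estimators, viz. OLS and GLS, depending on how the moment conditions are written.
In general, the efficiency of the GMM estimator will therefore depend not only
on the choice of the weighting matrix $W$, but crucially also on how the moment conditions are
written.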
15.6 GMM estimator of the linear model with endogenous regressors
Consider now the case where we have an endogenous $X$ matrix and assume that we have a
matrix of valid instruments $Z$. The moment conditions in this case can be written as
$$E\left[z_i'\left(y_i - x_i\beta\right)\right] = 0$$
where $z_i$ is the $i$-th row of $Z$. In this case therefore
$$h_i = z_i'\left(y_i - x_i\beta\right)$$
$$g = \frac{1}{n}\sum_i z_i'\left(y_i - x_i\beta\right) = \frac{1}{n}Z'(y - X\beta)$$
The matrix $\dfrac{\partial g}{\partial\beta'}$ is given by $-\dfrac{1}{n}Z'X$, so the first order conditions for the GMM estimator become (after multiplying through by $-1$)
$$\left(\frac{1}{n}Z'X\right)' W\, \frac{1}{n}Z'\left(y - X\hat\beta\right) = 0$$
This leads to the equation
$$X'ZWZ'y = X'ZWZ'X\hat\beta$$
This has a solution provided that $X'Z$ has rank $k$ and $W$ has rank $r$. The solution is
$$\hat\beta(W) = \left(X'ZWZ'X\right)^{-1}X'ZWZ'y$$
We could set $W = I$ and use this as the initial estimate for the EOGMM estimator. In this
case, however, we can do somewhat better.
Consider the case of independent draws from the same distribution. In this case $S_0 = E\left[hh'\right]$.
We have
$$S_0 = E\left\{\left[z'(y - x\beta)\right]\left[(y - x\beta)' z\right]\right\} = E\left[z'\varepsilon^2 z\right] = \sigma^2 E\left[z'z\right]$$
since the errors will be homoscedastic in this case. The sample estimate of $E\left[z'z\right]$ will be $\dfrac{1}{n}Z'Z$.
So our OGMM estimator in this case will be given by setting
$$W = \left(\sigma^2\frac{1}{n}Z'Z\right)^{-1}$$
This means that
$$\hat\beta = \left[X'Z(Z'Z)^{-1}Z'X\right]^{-1}X'Z(Z'Z)^{-1}Z'y$$
This, however, is the 2SLS estimator! This shows that the 2SLS estimator is equivalent to the
optimal GMM estimator provided that the assumption of homoscedasticity and zero autocorrelation holds. If the errors are independent but heteroscedastic, then we need to estimate $S$. In
this case
$$\hat S = \frac{1}{n}\sum_i \left. h_i h_i'\right|_{\hat\beta} = \frac{1}{n}\sum_i\left[z_i'\left(y_i - x_i\hat\beta\right)\right]\left[\left(y_i - x_i\hat\beta\right)' z_i\right]$$
To implement this we obviously need an initial GMM estimate of $\beta$. Any GMM estimator would
do, but it is customary to use the 2SLS estimator for this first stage. This means that
$$\hat S = \frac{1}{n}\sum_i e_i^2 z_i'z_i$$
where $e_i$ is the $i$-th residual calculated with the IV estimator.

The EOGMM estimator of the instrumental variables regression is therefore
$$\hat\beta_{EOGMM} = \left[X'Z\hat S^{-1}Z'X\right]^{-1}X'Z\hat S^{-1}Z'y$$
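The equivalence between GMM with weighting matrix proportional to $(Z'Z)^{-1}$ and two stage least squares can be checked numerically. The sketch below uses a small made-up data set (one regressor, two instruments, no intercept; none of these numbers come from the text) and exact rational arithmetic, so that the two routes can be compared without rounding error.

```python
from fractions import Fraction as F

# Hypothetical toy data: one endogenous regressor x, two instruments z1 and z2.
y = [F(v) for v in (2, 3, 5, 4, 7, 8)]
x = [F(v) for v in (1, 2, 3, 3, 5, 6)]
z1 = [F(v) for v in (1, 1, 2, 3, 4, 5)]
z2 = [F(v) for v in (0, 1, 1, 0, 2, 1)]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


# Z'Z (2x2, symmetric) and its inverse
a, b, d = dot(z1, z1), dot(z1, z2), dot(z2, z2)
det = a * d - b * b
zz_inv = [[d / det, -b / det], [-b / det, a / det]]

zx = [dot(z1, x), dot(z2, x)]   # Z'x
zy = [dot(z1, y), dot(z2, y)]   # Z'y


def quad(u, m, v):
    """u' m v for 2-vectors u, v and a 2x2 matrix m."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


# GMM route: beta = [x'Z (Z'Z)^{-1} Z'x]^{-1} x'Z (Z'Z)^{-1} Z'y
beta_gmm = quad(zx, zz_inv, zy) / quad(zx, zz_inv, zx)

# Two-stage route: regress x on Z, then regress y on the fitted values
pi = [sum(zz_inv[i][j] * zx[j] for j in range(2)) for i in range(2)]
x_hat = [pi[0] * u + pi[1] * v for u, v in zip(z1, z2)]
beta_2sls = dot(x_hat, y) / dot(x_hat, x_hat)

print(beta_gmm, beta_2sls)
```

The scalar $\sigma^2$ and the factor $1/n$ in $W$ cancel in the formula for $\hat\beta$, which is why they do not appear in the code.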
Part IV
Systems of Equations
Chapter 16
Estimation of equations by OLS and GLS
16.1 Introduction
In this chapter we will begin our analysis of situations in which we estimate multiple relationships
simultaneously. The typical model can be written in the form
$$\begin{aligned} y_{i1} &= x_{i1}\beta_1 + \varepsilon_{i1}\\ y_{i2} &= x_{i2}\beta_2 + \varepsilon_{i2}\\ &\ \ \vdots\\ y_{iG} &= x_{iG}\beta_G + \varepsilon_{iG}\end{aligned}$$
We assume that there are $G$ equations and we have $N$ observations (subscripted $i$) for each.
The variables appearing on the right hand side can be the same, but need not be. In general
we assume that the row vector $x_{ig}$ has dimension $1\times k_g$. We typically will need to assume that
$var(\varepsilon_{ig}) \neq var(\varepsilon_{ih})$ if $g \neq h$. Furthermore there may be cross-equation correlation in the errors.
Example 16.1 A good example is if we estimate a system of demand equations. In this case
the demands are the dependent variables and the explanatory variables are own prices, cross
prices, income and a set of variables that are likely to shift tastes (e.g. age) or that might impact
on the efficiency with which resources can be spent (e.g. household size). In this case it would
stand to reason that whatever was left out of the regressions (and thus goes into the error term)
would be correlated across equations.
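16.2 Stacking the equations
Zellner had the fundamental insight that we could simply stack these equations to produce
one mega-regression. There are two ways of stacking (by equation or by individual) and it is
worthwhile noting that different authors adopt different approaches. The standard approach is
to stack by equation. Wooldridge makes the cogent case that for asymptotic analysis it makes
more sense to stack it by individual. If we want to analyse what happens as $N \to \infty$, we just
need to add blocks of rows to each data matrix. Let us define
$$\underset{G\times 1}{y_i} = \begin{pmatrix} y_{i1}\\ y_{i2}\\ \vdots\\ y_{iG}\end{pmatrix},\quad \underset{G\times K}{X_i} = \begin{pmatrix} x_{i1} & 0 & \cdots & 0\\ 0 & x_{i2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & x_{iG}\end{pmatrix},\quad \underset{K\times 1}{\beta} = \begin{pmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_G\end{pmatrix},\quad \underset{G\times 1}{\varepsilon_i} = \begin{pmatrix}\varepsilon_{i1}\\ \varepsilon_{i2}\\ \vdots\\ \varepsilon_{iG}\end{pmatrix} \tag{16.1}$$
The subscript $i$ notation is supposed to remind us that we are stacking data on the same individuals, i.e. we are stacking on the first index. Stacking the data vectors and matrices we
get
$$\underset{NG\times 1}{y} = \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N\end{pmatrix},\quad \underset{NG\times K}{X} = \begin{pmatrix} X_1\\ X_2\\ \vdots\\ X_N\end{pmatrix},\quad \underset{NG\times 1}{\varepsilon} = \begin{pmatrix}\varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N\end{pmatrix} \tag{16.2}$$
where we have let $K = \sum_g k_g$.
We can now rewrite the $G$ equations as one big equation (the dimension of each matrix is
given for clarity):
$$\underset{NG\times 1}{y} = \underset{NG\times K}{X}\ \underset{K\times 1}{\beta} + \underset{NG\times 1}{\varepsilon} \tag{16.3}$$
In this form it looks just like a single-equation regression model. The key difference is that it is
bound to be heteroscedastic and autocorrelated, for reasons that we alluded to earlier.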
Assumptions
We will make the following assumptions about the stacked model:

1. y is univariate, continuous and of unlimited range, i.e. each is continuous and of
unlimited range
2. (X ) is linear in X, and additive in . X has full column rank.
3. X is fixed or stochastic
4. is fixed
5. (X0 ) = 0. This implies that (X0 ) = 0.
6. var ( |X ) = and 0 |X = 0 if 6= . This implies that
0 0
0 0
(0 |X) = = . . .
. . ...
.. ..
0
The last assumption will automatically hold if observations are drawn independently of each
other from the same distribution. We will make the additional assumption
(X0 X ) = Q
exists and has rank . This assumption is also easy to justify if we are drawing X from the
same distribution for each observation .
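16.4 Estimation by OLS
It should be easy to see that with the given assumptions we have a regression model that is
heteroscedastic and autocorrelated, unless $\Sigma = \sigma^2 I_G$. Nevertheless OLS estimation of this system
should provide consistent estimates.
We can develop a method of moments logic for OLS estimation of the system from the moment
condition
$$E\left(X_i'\varepsilon_i\right) = 0$$
$$E\left[X_i'\left(y_i - X_i\beta\right)\right] = 0$$
$$E\left(X_i'y_i\right) = E\left(X_i'X_i\right)\beta$$
We observe that the last equation identifies $\beta$ provided that $E\left(X_i'X_i\right)$ has an inverse, which it
does by the additional assumption that we made above.
Our sample analogue of the moment equations will be given by
$$\frac{1}{N}\sum_{i=1}^{N}X_i'y_i = \left(\frac{1}{N}\sum_{i=1}^{N}X_i'X_i\right)\hat\beta$$
$$\hat\beta = \left(\frac{1}{N}\sum_{i=1}^{N}X_i'X_i\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^{N}X_i'y_i\right) \tag{16.4}$$
But this is identical to the stacked OLS equation
$$\hat\beta = (X'X)^{-1}X'y$$
Looking at equation 16.4 it is easy to see that this should lead to a consistent estimator of
$\beta$. As $N \to \infty$, we would expect (by normal weak law of large numbers arguments) that
$\dfrac{1}{N}\sum_{i=1}^{N}X_i'y_i \overset{p}{\to} E\left(X_i'y_i\right)$ and $\dfrac{1}{N}\sum_{i=1}^{N}X_i'X_i \overset{p}{\to} E\left(X_i'X_i\right)$.
Observe that (as in the case of standard OLS), we can write
$$\hat\beta = \beta + (X'X)^{-1}X'\varepsilon$$
which in this case can be written as
$$\hat\beta = \beta + \left(\frac{1}{N}\sum_{i=1}^{N}X_i'X_i\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^{N}X_i'\varepsilon_i\right)$$
$$\sqrt{N}\left(\hat\beta - \beta\right) = \left(\frac{1}{N}\sum_{i=1}^{N}X_i'X_i\right)^{-1}\left(\frac{1}{\sqrt{N}}\sum_{i=1}^{N}X_i'\varepsilon_i\right)$$
Given our assumptions it stands to reason that under fairly broad conditions we could apply a
Central Limit Theorem to $\dfrac{1}{\sqrt{N}}\sum_{i=1}^{N}X_i'\varepsilon_i$, so that the OLS estimators turn out to be normal.
Of course the covariance matrix in this case will be
$$var\left(\hat\beta\right) = (X'X)^{-1}X'\Omega X(X'X)^{-1}$$
We can consistently estimate this (as before) by
$$\widehat{var}\left(\hat\beta\right) = (X'X)^{-1}\left(\sum_{i=1}^{N}X_i'e_ie_i'X_i\right)(X'X)^{-1}$$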
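16.5 Estimation by GLS
It should be evident that if we knew the structure of the $\Sigma$ matrix we would be able to estimate
the relationship more efficiently. The logic (as in the case of single equations) is that we can
transform the data to make the model conform to the assumptions of the classical linear regression
model. If we have
$$y_i = X_i\beta + \varepsilon_i$$
with $var(\varepsilon_i|X_i) = \Sigma$, then we can transform the data so that
$$\Sigma^{-\frac12}y_i = \Sigma^{-\frac12}X_i\beta + \Sigma^{-\frac12}\varepsilon_i$$
and, of course, $var\left(\Sigma^{-\frac12}\varepsilon_i|X_i\right) = I_G$. This leads to the System GLS estimator
$$\hat\beta_{GLS} = \left(\frac{1}{N}\sum_{i=1}^{N}X_i'\Sigma^{-1}X_i\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^{N}X_i'\Sigma^{-1}y_i\right) = \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}y$$
where
$$\Omega^{-1} = \begin{pmatrix}\Sigma^{-1} & 0 & \cdots & 0\\ 0 & \Sigma^{-1} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \Sigma^{-1}\end{pmatrix}$$
16.5.1 Notation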
It becomes quite tedious to write out these stacked matrices the long way. A convenient
notation for these cases is provided by the Kronecker product which is discussed further in
the appendix to this chapter. By definition the Kronecker product of two matrices $A$ ($m\times n$) and
$B$ is given by:
$$A\otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B\\ a_{21}B & a_{22}B & \cdots & a_{2n}B\\ \vdots & \vdots & \ddots & \vdots\\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B\end{pmatrix}$$
In this case we could have just written
$$\Omega^{-1} = I_N\otimes\Sigma^{-1}$$
The mathematical properties of the Kronecker product allow one to derive many useful results
about these stacked matrices quickly and easily.
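As a concrete illustration of the definition, the following plain-Python sketch builds a Kronecker product from nested lists. The function name `kron` and the example matrices are choices made here for illustration; the $2\times 2$ matrix is simply an example of a per-individual covariance matrix.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows = []
    for a_row in a:            # each row of A ...
        for b_row in b:        # ... is expanded into one row per row of B
            rows.append([a_ij * b_kl for a_ij in a_row for b_kl in b_row])
    return rows


identity_2 = [[1, 0], [0, 1]]
sigma = [[1, 1], [1, 4]]       # illustrative 2x2 covariance matrix

# I_2 (x) Sigma is block diagonal with Sigma repeated on the diagonal
omega = kron(identity_2, sigma)
for row in omega:
    print(row)
```

Note that the order of the factors matters: `kron(sigma, identity_2)` interleaves the elements instead of producing diagonal blocks.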
16.5.2 Some caution
In order for the GLS estimator to be consistent we need a stronger condition than $E\left(X_i'\varepsilon_i\right) = 0$,
since we now need the transformed X variables to be orthogonal to the transformed errors. The
problem is that (in general) the transformed variables $\Sigma^{-\frac12}X_i$ will be some linear combination
of the explanatory variables from different equations (for the same individual). It is therefore
now necessary that the explanatory variables $x_{ig}$ from any equation be uncorrelated with the
error term $\varepsilon_{ih}$ even when $g \neq h$. If the error terms are uncorrelated with each other (i.e. the
system is purely homoscedastic) then this doesn't apply. But to show consistency in general we
now need the condition
$$E\left(X_i\otimes\varepsilon_i\right) = 0$$
which is stronger.
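16.6 Estimation by FGLS
In general we will not know $\Sigma$. In this case, however, it may be possible to estimate it. Note
that
$$\Sigma = \begin{pmatrix} var(\varepsilon_{i1}) & cov(\varepsilon_{i1}, \varepsilon_{i2}) & \cdots & cov(\varepsilon_{i1}, \varepsilon_{iG})\\ cov(\varepsilon_{i1}, \varepsilon_{i2}) & var(\varepsilon_{i2}) & \cdots & cov(\varepsilon_{i2}, \varepsilon_{iG})\\ \vdots & \vdots & \ddots & \vdots\\ cov(\varepsilon_{i1}, \varepsilon_{iG}) & cov(\varepsilon_{i2}, \varepsilon_{iG}) & \cdots & var(\varepsilon_{iG})\end{pmatrix}$$
If it is reasonable to assume that the variance of the error term on the first equation, i.e. $var(\varepsilon_{i1})$
is homoscedastic, so
$$var(\varepsilon_{i1}) = \sigma_1^2, \text{ for all } i$$
then we have $N$ equations (i.e. individuals) over which we can estimate this variance. Similarly
we will have $N$ equations over which to estimate any of the other variances and the covariances.
Consistent estimators of these are
$$\hat\sigma_1^2 = \frac{1}{N}\sum_{i=1}^{N}e_{i1}^2$$
where $e_{i1}$ is the residual from the Systems OLS estimator. Note that we are exploiting the fact
here that Systems OLS is consistent, so the OLS residuals are consistent for the true errors.
Similarly
$$\hat\sigma_{12} = \widehat{cov}(\varepsilon_{i1}, \varepsilon_{i2}) = \frac{1}{N}\sum_{i=1}^{N}e_{i1}e_{i2}$$
We can summarise this as:
$$\hat\Sigma = \frac{1}{N}\sum_{i=1}^{N}e_ie_i'$$
where $e_i$ is the stacked vector of OLS residuals for the $i$-th individual.
Our Systems FGLS estimator is therefore:
$$\hat\beta_{FGLS} = \left(\frac{1}{N}\sum_{i=1}^{N}X_i'\hat\Sigma^{-1}X_i\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^{N}X_i'\hat\Sigma^{-1}y_i\right) = \left(X'\hat\Omega^{-1}X\right)^{-1}X'\hat\Omega^{-1}y$$
16.7 Exercises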
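1. You are given the following theoretical model
$$\begin{aligned} y_{i1} &= \beta_1 + \beta_2 x_{i1} + \varepsilon_{i1}\\ y_{i2} &= \beta_3 x_{i2} + \varepsilon_{i2}\end{aligned}$$
where $(\varepsilon_{i1}, \varepsilon_{i2}) \sim N\left(0, 0, \sigma_1^2, \sigma_2^2, \rho\right)$. (Note: there is no intercept in the second equation!)

You are told that $\sigma_1^2 = 4$, $\sigma_2^2 = 1$ and $\rho = \frac12$. Assume that $E\left(x_1'\varepsilon_1\right) = 0$ and $E\left(x_2'\varepsilon_2\right) = 0$.
Note that
$$\Sigma = \begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{12} & \sigma_2^2\end{pmatrix} = \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}$$
You have the following empirical information on this model:

    individual     1    2    3
    $y_1$          1    1    2
    $y_2$          2    3    5
    $x_1$          0    1    2
    $x_2$          2    4    4

(a) Rewrite the theoretical model in stacked matrix form, paying attention also to the
assumptions being made about the error term.
(b) Rewrite the empirical information in stacked matrix form.
(c) Let $\Omega = E\left(\varepsilon\varepsilon'\right)$, where $\varepsilon$ is the error vector of the stacked model. Calculate $\Omega^{-1}$.
(d) Assume that you want to estimate this model by GLS. Write down the appropriate formula with the appropriate empirical information. You do not need to simplify/calculate the final solution.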
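16.8 Appendix: A worked example
You are given the following theoretical model:
$$\begin{aligned} y_{i1} &= \beta_1 + \beta_2 x_{i1} + \varepsilon_{i1}\\ y_{i2} &= \beta_3 + \beta_4 x_{i2} + \varepsilon_{i2}\end{aligned}$$
where $(\varepsilon_{i1}, \varepsilon_{i2}) \sim N\left(0, 0, \sigma_1^2, \sigma_2^2, \rho\right)$. You are told that $\sigma_1^2 = 1$, $\sigma_2^2 = 4$ and $\rho = 0.5$. Assume that
$E\left(x_1'\varepsilon_1\right) = 0$ and $E\left(x_2'\varepsilon_2\right) = 0$.
1. You have the following empirical information on this model:

    individual     1    2    3
    $y_1$          1    1    2
    $y_2$          2    5    7
    $x_1$          0    3    4
    $x_2$          2    4    4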
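2. Rewrite the theoretical model in stacked matrix form.

(a) Answer:
$$\begin{pmatrix} y_{11}\\ y_{12}\\ y_{21}\\ y_{22}\\ y_{31}\\ y_{32}\end{pmatrix} = \begin{pmatrix} 1 & x_{11} & 0 & 0\\ 0 & 0 & 1 & x_{12}\\ 1 & x_{21} & 0 & 0\\ 0 & 0 & 1 & x_{22}\\ 1 & x_{31} & 0 & 0\\ 0 & 0 & 1 & x_{32}\end{pmatrix}\begin{pmatrix}\beta_1\\ \beta_2\\ \beta_3\\ \beta_4\end{pmatrix} + \begin{pmatrix}\varepsilon_{11}\\ \varepsilon_{12}\\ \varepsilon_{21}\\ \varepsilon_{22}\\ \varepsilon_{31}\\ \varepsilon_{32}\end{pmatrix}$$
Observe that the first column in the X matrices is for the intercept in equation 1, the
second column is $x_1$, the third column is the intercept in equation 2 and the fourth
column is $x_2$.
We should also specify the assumptions about the errors in matrix form. We have
$E\left(X'\varepsilon\right) = 0$ and
$$E\left(\varepsilon\varepsilon'\right) = E\begin{pmatrix}\varepsilon_{11}^2 & \varepsilon_{11}\varepsilon_{12} & \varepsilon_{11}\varepsilon_{21} & \varepsilon_{11}\varepsilon_{22} & \varepsilon_{11}\varepsilon_{31} & \varepsilon_{11}\varepsilon_{32}\\ \varepsilon_{12}\varepsilon_{11} & \varepsilon_{12}^2 & \varepsilon_{12}\varepsilon_{21} & \varepsilon_{12}\varepsilon_{22} & \varepsilon_{12}\varepsilon_{31} & \varepsilon_{12}\varepsilon_{32}\\ \varepsilon_{21}\varepsilon_{11} & \varepsilon_{21}\varepsilon_{12} & \varepsilon_{21}^2 & \varepsilon_{21}\varepsilon_{22} & \varepsilon_{21}\varepsilon_{31} & \varepsilon_{21}\varepsilon_{32}\\ \varepsilon_{22}\varepsilon_{11} & \varepsilon_{22}\varepsilon_{12} & \varepsilon_{22}\varepsilon_{21} & \varepsilon_{22}^2 & \varepsilon_{22}\varepsilon_{31} & \varepsilon_{22}\varepsilon_{32}\\ \varepsilon_{31}\varepsilon_{11} & \varepsilon_{31}\varepsilon_{12} & \varepsilon_{31}\varepsilon_{21} & \varepsilon_{31}\varepsilon_{22} & \varepsilon_{31}^2 & \varepsilon_{31}\varepsilon_{32}\\ \varepsilon_{32}\varepsilon_{11} & \varepsilon_{32}\varepsilon_{12} & \varepsilon_{32}\varepsilon_{21} & \varepsilon_{32}\varepsilon_{22} & \varepsilon_{32}\varepsilon_{31} & \varepsilon_{32}^2\end{pmatrix}$$
$$= \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 & 0 & 0 & 0 & 0\\ \rho\sigma_1\sigma_2 & \sigma_2^2 & 0 & 0 & 0 & 0\\ 0 & 0 & \sigma_1^2 & \rho\sigma_1\sigma_2 & 0 & 0\\ 0 & 0 & \rho\sigma_1\sigma_2 & \sigma_2^2 & 0 & 0\\ 0 & 0 & 0 & 0 & \sigma_1^2 & \rho\sigma_1\sigma_2\\ 0 & 0 & 0 & 0 & \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 0\\ 1 & 4 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 1 & 0 & 0\\ 0 & 0 & 1 & 4 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 0 & 1 & 4\end{pmatrix}$$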
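3. Rewrite the empirical information in stacked matrix form.
Answer:
(a) We first create the stacked vectors for each individual
$$y_1 = \begin{pmatrix} 1\\ 2\end{pmatrix},\ X_1 = \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 2\end{pmatrix},\quad y_2 = \begin{pmatrix} 1\\ 5\end{pmatrix},\ X_2 = \begin{pmatrix} 1 & 3 & 0 & 0\\ 0 & 0 & 1 & 4\end{pmatrix},\quad y_3 = \begin{pmatrix} 2\\ 7\end{pmatrix},\ X_3 = \begin{pmatrix} 1 & 4 & 0 & 0\\ 0 & 0 & 1 & 4\end{pmatrix}$$
We could skip this step and go straight on to:
$$y = \begin{pmatrix} 1\\ 2\\ 1\\ 5\\ 2\\ 7\end{pmatrix},\quad X = \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 2\\ 1 & 3 & 0 & 0\\ 0 & 0 & 1 & 4\\ 1 & 4 & 0 & 0\\ 0 & 0 & 1 & 4\end{pmatrix}$$
4. Estimate this model by systems OLS
Answer:
(a) The formula for the OLS estimator is the same as before:
$$\hat\beta = (X'X)^{-1}X'y$$
With the data above
$$X'X = \begin{pmatrix} 3 & 7 & 0 & 0\\ 7 & 25 & 0 & 0\\ 0 & 0 & 3 & 10\\ 0 & 0 & 10 & 36\end{pmatrix},\quad X'y = \begin{pmatrix} 4\\ 11\\ 14\\ 52\end{pmatrix}$$
so that
$$\hat\beta = \begin{pmatrix}\begin{pmatrix} 3 & 7\\ 7 & 25\end{pmatrix}^{-1} & 0_{2\times 2}\\ 0_{2\times 2} & \begin{pmatrix} 3 & 10\\ 10 & 36\end{pmatrix}^{-1}\end{pmatrix}\begin{pmatrix} 4\\ 11\\ 14\\ 52\end{pmatrix} = \begin{pmatrix}\frac{25}{26} & -\frac{7}{26} & 0 & 0\\ -\frac{7}{26} & \frac{3}{26} & 0 & 0\\ 0 & 0 & \frac{36}{8} & -\frac{10}{8}\\ 0 & 0 & -\frac{10}{8} & \frac{3}{8}\end{pmatrix}\begin{pmatrix} 4\\ 11\\ 14\\ 52\end{pmatrix} = \begin{pmatrix}\frac{23}{26}\\ \frac{5}{26}\\ -2\\ 2\end{pmatrix} \simeq \begin{pmatrix} 0.88462\\ 0.19231\\ -2\\ 2\end{pmatrix}$$
So our fitted equations are:
$$\begin{aligned}\hat y_1 &= \frac{23}{26} + \frac{5}{26}x_1\\ \hat y_2 &= -2 + 2x_2\end{aligned}$$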
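The systems OLS calculation can be reproduced with exact fractions. The sketch below uses only the Python standard library; the Gauss-Jordan helper `solve` is written here for illustration and is not part of the text.

```python
from fractions import Fraction as F

# Stacked data from the worked example (rows alternate equation 1, equation 2)
X = [
    [1, 0, 0, 0],
    [0, 0, 1, 2],
    [1, 3, 0, 0],
    [0, 0, 1, 4],
    [1, 4, 0, 0],
    [0, 0, 1, 4],
]
y = [1, 2, 1, 5, 2, 7]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination (a square and nonsingular)."""
    n = len(a)
    m = [[F(v) for v in row] + [F(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        piv = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[piv] = m[piv], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [row[n] for row in m]


k = len(X[0])
# Cross-product matrices X'X and X'y
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]

beta_ols = solve(xtx, xty)     # solves the normal equations X'X b = X'y
print(beta_ols)
```

Because $X'X$ is block diagonal, the same coefficients would be obtained by running OLS on each equation separately.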
5. What are the statistical properties of the OLS estimators?

Answer
We know that the OLS estimators are unbiased and consistent, but not efficient. The
reason for this is that the covariance matrix of the errors is not $\sigma^2 I_6$. We also know that
the OLS estimators will be normally distributed (because the errors are normal). The usual
OLS estimates of the covariance matrix of $\hat\beta$ will be biased and inconsistent.
6. Impose the restriction $\beta_1 = \beta_3$ and $\beta_2 = \beta_4$. Reestimate the model by OLS with these
restrictions.
Answer:

(a) Let us label the columns of the X matrix $\iota_1^*$, $x_1^*$, $\iota_2^*$ and $x_2^*$. The star notation is to
emphasise that these variables are for the stacked model, i.e. padded with the zeros.
We can write our structural model in vector form as
$$y = \beta_1\iota_1^* + \beta_2 x_1^* + \beta_3\iota_2^* + \beta_4 x_2^* + \varepsilon$$
Substituting in the restrictions, this model becomes
$$\begin{aligned} y &= \beta_1\iota_1^* + \beta_2 x_1^* + \beta_1\iota_2^* + \beta_2 x_2^* + \varepsilon\\ &= \beta_1\left(\iota_1^* + \iota_2^*\right) + \beta_2\left(x_1^* + x_2^*\right) + \varepsilon\end{aligned}$$
Observe that $\iota_1^* + \iota_2^*$ is now just a column of ones. $x_1^* + x_2^*$ is a column with all the
x values, i.e. our transformed X matrix $\tilde X$ is:
$$\tilde X = \begin{pmatrix} 1 & 0\\ 1 & 2\\ 1 & 3\\ 1 & 4\\ 1 & 4\\ 1 & 4\end{pmatrix}$$
The OLS estimator of this model is
$$\hat\beta_R = \left(\tilde X'\tilde X\right)^{-1}\tilde X'y = \begin{pmatrix} 6 & 17\\ 17 & 61\end{pmatrix}^{-1}\begin{pmatrix} 18\\ 63\end{pmatrix} = \begin{pmatrix}\frac{61}{77} & -\frac{17}{77}\\ -\frac{17}{77} & \frac{6}{77}\end{pmatrix}\begin{pmatrix} 18\\ 63\end{pmatrix} = \begin{pmatrix}\frac{27}{77}\\ \frac{72}{77}\end{pmatrix} \simeq \begin{pmatrix} 0.35065\\ 0.93506\end{pmatrix}$$
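7. Test the restriction by means of the appropriate Wald test.

Answer:
The important point about this question is that we know that the errors are heteroscedastic
with some autocorrelation. Consequently we need to use the correct form of the covariance
matrix for the OLS estimators, i.e.
$$V\left(\hat\beta\right) = (X'X)^{-1}X'\sigma^2\Omega X(X'X)^{-1}$$
We know what the structure of the $\sigma^2\Omega$ matrix is (it is $I_3\otimes\begin{pmatrix} 1 & 1\\ 1 & 4\end{pmatrix}$), so
$$X'\sigma^2\Omega X = \begin{pmatrix} 3 & 7 & 3 & 10\\ 7 & 25 & 7 & 28\\ 3 & 7 & 12 & 40\\ 10 & 28 & 40 & 144\end{pmatrix}$$
and
$$V\left(\hat\beta\right) = \begin{pmatrix}\frac{25}{26} & -\frac{7}{26} & 0 & 0\\ -\frac{7}{26} & \frac{3}{26} & 0 & 0\\ 0 & 0 & \frac{36}{8} & -\frac{10}{8}\\ 0 & 0 & -\frac{10}{8} & \frac{3}{8}\end{pmatrix}\begin{pmatrix} 3 & 7 & 3 & 10\\ 7 & 25 & 7 & 28\\ 3 & 7 & 12 & 40\\ 10 & 28 & 40 & 144\end{pmatrix}\begin{pmatrix}\frac{25}{26} & -\frac{7}{26} & 0 & 0\\ -\frac{7}{26} & \frac{3}{26} & 0 & 0\\ 0 & 0 & \frac{36}{8} & -\frac{10}{8}\\ 0 & 0 & -\frac{10}{8} & \frac{3}{8}\end{pmatrix}$$
$$= \begin{pmatrix}\frac{25}{26} & -\frac{7}{26} & \frac{99}{52} & -\frac{49}{104}\\ -\frac{7}{26} & \frac{3}{26} & -\frac{35}{52} & \frac{21}{104}\\ \frac{99}{52} & -\frac{35}{52} & 18 & -5\\ -\frac{49}{104} & \frac{21}{104} & -5 & \frac{3}{2}\end{pmatrix}$$
The null hypothesis
$$H_0: \beta_1 = \beta_3,\quad \beta_2 = \beta_4$$
can be written in matrix form as:
$$H_0: \begin{pmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 0 & -1\end{pmatrix}\begin{pmatrix}\beta_1\\ \beta_2\\ \beta_3\\ \beta_4\end{pmatrix} = \begin{pmatrix} 0\\ 0\end{pmatrix}$$
The Wald statistic for this test will be given by
$$\left(R\hat\beta\right)'\left(RVR'\right)^{-1}R\hat\beta$$
Calculating the covariance matrix $RVR'$ first:
$$RVR' = \begin{pmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 0 & -1\end{pmatrix}\begin{pmatrix}\frac{25}{26} & -\frac{7}{26} & \frac{99}{52} & -\frac{49}{104}\\ -\frac{7}{26} & \frac{3}{26} & -\frac{35}{52} & \frac{21}{104}\\ \frac{99}{52} & -\frac{35}{52} & 18 & -5\\ -\frac{49}{104} & \frac{21}{104} & -5 & \frac{3}{2}\end{pmatrix}\begin{pmatrix} 1 & 0\\ 0 & 1\\ -1 & 0\\ 0 & -1\end{pmatrix} = \begin{pmatrix}\frac{197}{13} & -\frac{33}{8}\\ -\frac{33}{8} & \frac{63}{52}\end{pmatrix}$$
Finally
$$R\hat\beta = \begin{pmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 0 & -1\end{pmatrix}\begin{pmatrix}\frac{23}{26}\\ \frac{5}{26}\\ -2\\ 2\end{pmatrix} = \begin{pmatrix}\frac{75}{26}\\ -\frac{47}{26}\end{pmatrix}$$
Consequently the Wald statistic is
$$\begin{pmatrix}\frac{75}{26} & -\frac{47}{26}\end{pmatrix}\begin{pmatrix}\frac{197}{13} & -\frac{33}{8}\\ -\frac{33}{8} & \frac{63}{52}\end{pmatrix}^{-1}\begin{pmatrix}\frac{75}{26}\\ -\frac{47}{26}\end{pmatrix} \simeq \begin{pmatrix} 2.8846 & -1.8077\end{pmatrix}\begin{pmatrix} 0.90155 & 3.0696\\ 3.0696 & 11.277\end{pmatrix}\begin{pmatrix} 2.8846\\ -1.8077\end{pmatrix} = 12.34$$
This is distributed as $\chi^2(2)$. The critical value at the 5% level is 5.991 and at the 1%
level is 9.210. We reject the null hypothesis, i.e. the two sets of regression coefficients are
different.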
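The Wald calculation involves several matrix products that are easy to get wrong by hand, so a mechanical check may be useful. The following is a sketch in plain Python with exact fractions; the helpers `matmul`, `transpose` and `inverse` are written here for illustration only.

```python
from fractions import Fraction as F


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(r) for r in zip(*a)]


def inverse(a):
    """Matrix inverse by Gauss-Jordan elimination on [a | I]."""
    n = len(a)
    m = [[F(v) for v in row] + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        piv = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[piv] = m[piv], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [row[n:] for row in m]


X = [[1, 0, 0, 0], [0, 0, 1, 2], [1, 3, 0, 0],
     [0, 0, 1, 4], [1, 4, 0, 0], [0, 0, 1, 4]]
y = [[1], [2], [1], [5], [2], [7]]
sigma = [[1, 1], [1, 4]]

# Error covariance of the stacked model: block diagonal with sigma repeated
omega = [[0] * 6 for _ in range(6)]
for i in range(3):
    for r in range(2):
        for c in range(2):
            omega[2 * i + r][2 * i + c] = sigma[r][c]

Xt = transpose(X)
xtx_inv = inverse(matmul(Xt, X))
beta = matmul(xtx_inv, matmul(Xt, y))                      # OLS coefficients
V = matmul(xtx_inv, matmul(matmul(Xt, matmul(omega, X)), xtx_inv))

R = [[1, 0, -1, 0], [0, 1, 0, -1]]
r_beta = matmul(R, beta)
rvr = matmul(R, matmul(V, transpose(R)))
wald = matmul(transpose(r_beta), matmul(inverse(rvr), r_beta))[0][0]
print(float(wald))
```

The statistic is then compared with the $\chi^2(2)$ critical values quoted in the text.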
8. Reestimate the original model (the one without restriction) by GLS.
Answer
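(a) In order to do this we need to invert $\Omega = E(\varepsilon\varepsilon')$. Since the matrix is block diagonal
we only need to invert the two-by-two matrix $\Sigma = E(\varepsilon_i\varepsilon_i')$ where
$$\Sigma = \begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix}$$
i.e.
$$\Sigma^{-1} = \begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix}^{-1} = \begin{bmatrix} \frac{4}{3} & -\frac{1}{3} \\ -\frac{1}{3} & \frac{1}{3} \end{bmatrix}$$
Consequently
$$\Omega^{-1} = \begin{bmatrix} \frac{4}{3} & -\frac{1}{3} & 0 & 0 & 0 & 0 \\ -\frac{1}{3} & \frac{1}{3} & 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{4}{3} & -\frac{1}{3} & 0 & 0 \\ 0 & 0 & -\frac{1}{3} & \frac{1}{3} & 0 & 0 \\ 0 & 0 & 0 & 0 & \frac{4}{3} & -\frac{1}{3} \\ 0 & 0 & 0 & 0 & -\frac{1}{3} & \frac{1}{3} \end{bmatrix}$$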
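The GLS estimator is given by
$$\hat{\beta}_{GLS} = \left( X'\Omega^{-1}X \right)^{-1} X'\Omega^{-1}y$$
with
$$X' = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 3 & 0 & 4 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 2 & 0 & 4 & 0 & 4 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 & 2 & 1 & 5 & 2 & 7 \end{bmatrix}'$$
so that
$$\hat{\beta}_{GLS} = \begin{bmatrix} 4 & \frac{28}{3} & -1 & -\frac{10}{3} \\ \frac{28}{3} & \frac{100}{3} & -\frac{7}{3} & -\frac{28}{3} \\ -1 & -\frac{7}{3} & 1 & \frac{10}{3} \\ -\frac{10}{3} & -\frac{28}{3} & \frac{10}{3} & 12 \end{bmatrix}^{-1} \begin{bmatrix} \frac{2}{3} \\ \frac{1}{3} \\ \frac{10}{3} \\ \frac{38}{3} \end{bmatrix}$$
$$= \begin{bmatrix} 0.94969 & -0.26415 & 1.8742 & -0.46226 \\ -0.26415 & 0.11321 & -0.66038 & 0.19811 \\ 1.8742 & -0.66038 & 17.686 & -4.9057 \\ -0.46226 & 0.19811 & -4.9057 & 1.4717 \end{bmatrix} \begin{bmatrix} 0.66667 \\ 0.33333 \\ 3.3333 \\ 12.667 \end{bmatrix} = \begin{bmatrix} 0.9369 \\ 0.16985 \\ -2.1584 \\ 2.0477 \end{bmatrix}$$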
We observe that we get slightly different point estimates from OLS.
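The GLS step can be reproduced in the same way. This is a sketch under the assumption that the data are those of the worked example ($X$, $y$ and $\Sigma$ as displayed in the answers to questions 7 and 8); the helper functions are illustrative and not part of the original notes. Because the arithmetic is exact, the result can differ from the figures in the text in the third or fourth decimal place, since the text works with rounded intermediate values.

```python
from fractions import Fraction as F

def T(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def mul(A, B):
    """Matrix product A B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inv(A):
    """Inverse by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [[F(v) for v in row] + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # first usable pivot row
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [v - M[r][c] * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

X = [[1, 0, 0, 0], [0, 0, 1, 2], [1, 3, 0, 0],
     [0, 0, 1, 4], [1, 4, 0, 0], [0, 0, 1, 4]]
y = [[1], [2], [1], [5], [2], [7]]
Sigma = [[1, 1], [1, 4]]
# Omega is block diagonal with one copy of Sigma per observation.
Omega = [[Sigma[i % 2][j % 2] if i // 2 == j // 2 else 0 for j in range(6)]
         for i in range(6)]

Omega_inv = inv(Omega)
XtOX = mul(T(X), mul(Omega_inv, X))        # X' Omega^{-1} X
XtOy = mul(T(X), mul(Omega_inv, y))        # X' Omega^{-1} y
b_gls = mul(inv(XtOX), XtOy)

print([str(v[0]) for v in XtOy])
print([round(float(v[0]), 4) for v in b_gls])
```

Inverting the full 6 x 6 matrix is only practical because the example is tiny; the answer's shortcut of inverting the 2 x 2 block $\Sigma$ is the approach that scales.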
9. Assume now that you don't know the exact distribution of the error terms. You do know,
however, that $E(\varepsilon_i\varepsilon_i') = \Sigma$. What would be the most appropriate estimator in general?
Apply it in this instance.
Answer:
(a) The FGLS estimator would be better than OLS in general. Indeed the FGLS estimator
would be asymptotically as efficient as the GLS estimator. Unfortunately in this case
the sample size is tiny, so asymptotic arguments are dubious. Nevertheless we will go
through the FGLS routine to show how it would work.
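The FGLS estimator begins with the OLS residuals. We have calculated the OLS
coefficients earlier. The residuals are given by
$$e = y - X\hat{\beta} = \begin{bmatrix} 1 \\ 2 \\ 1 \\ 5 \\ 2 \\ 7 \end{bmatrix} - \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 2 \\ 1 & 3 & 0 & 0 \\ 0 & 0 & 1 & 4 \\ 1 & 4 & 0 & 0 \\ 0 & 0 & 1 & 4 \end{bmatrix} \begin{bmatrix} \frac{23}{26} \\ \frac{5}{26} \\ -2 \\ 2 \end{bmatrix} = \begin{bmatrix} \frac{3}{26} \\ 0 \\ -\frac{6}{13} \\ -1 \\ \frac{9}{26} \\ 1 \end{bmatrix}$$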
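From these residuals we estimate the elements of the covariance matrix:
$$\hat{\sigma}_1^2 = \frac{1}{3}\sum e_{1i}^2 = \frac{1}{3}\left[ \left(\frac{3}{26}\right)^2 + \left(-\frac{6}{13}\right)^2 + \left(\frac{9}{26}\right)^2 \right] = \frac{3}{26}$$
$$\hat{\sigma}_{12} = \frac{1}{3}\sum e_{1i}e_{2i} = \frac{1}{3}\left[ \frac{3}{26}\cdot 0 + \left(-\frac{6}{13}\right)(-1) + \frac{9}{26}\cdot 1 \right] = \frac{7}{26}$$
$$\hat{\sigma}_2^2 = \frac{1}{3}\sum e_{2i}^2 = \frac{1}{3}\left[ 0 + (-1)^2 + 1^2 \right] = \frac{2}{3}$$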
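Consequently
$$\hat{\Sigma} = \begin{bmatrix} \frac{3}{26} & \frac{7}{26} \\ \frac{7}{26} & \frac{2}{3} \end{bmatrix}, \qquad \hat{\Sigma}^{-1} = \begin{bmatrix} \frac{3}{26} & \frac{7}{26} \\ \frac{7}{26} & \frac{2}{3} \end{bmatrix}^{-1} = \begin{bmatrix} 150.22 & -60.667 \\ -60.667 & 26.0 \end{bmatrix}$$
$$\hat{\Omega}^{-1} = \begin{bmatrix} 150.22 & -60.667 & 0 & 0 & 0 & 0 \\ -60.667 & 26.0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 150.22 & -60.667 & 0 & 0 \\ 0 & 0 & -60.667 & 26.0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 150.22 & -60.667 \\ 0 & 0 & 0 & 0 & -60.667 & 26.0 \end{bmatrix}$$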
Consequently
$$\hat{\beta}_{FGLS} = \left( X'\hat{\Omega}^{-1}X \right)^{-1} X'\hat{\Omega}^{-1}y$$
$$= \begin{bmatrix} 7.5806 \times 10^{-2} & -1.5997 \times 10^{-2} & 0.30753 & -6.5323 \times 10^{-2} \\ -1.5997 \times 10^{-2} & 6.856 \times 10^{-3} & -9.3319 \times 10^{-2} & 2.7996 \times 10^{-2} \\ 0.30753 & -9.3319 \times 10^{-2} & 1.6528 & -0.42913 \\ -6.5323 \times 10^{-2} & 2.7996 \times 10^{-2} & -0.42913 & 0.12874 \end{bmatrix} \begin{bmatrix} -248.46 \\ -956.26 \\ 121.33 \\ 502.66 \end{bmatrix} = \begin{bmatrix} 0.93989 \\ 0.16857 \\ -2.3439 \\ 2.1048 \end{bmatrix}$$
where $X$ and $y$ are the same data matrices as before and $\hat{\Omega}^{-1}$ is the block-diagonal matrix built from $\hat{\Sigma}^{-1}$.
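The whole FGLS routine (OLS residuals, estimated $\hat{\Sigma}$, then GLS with $\hat{\Omega}$) fits in a few lines. The sketch below assumes the same six-observation data set as in the worked example and uses illustrative helper functions that are not part of the original notes. It works in exact rational arithmetic, so the coefficients agree with the figures in the text only to about two or three decimal places, because the text rounds $\hat{\Sigma}^{-1}$ before inverting.

```python
from fractions import Fraction as F

def T(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def mul(A, B):
    """Matrix product A B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inv(A):
    """Inverse by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [[F(v) for v in row] + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # first usable pivot row
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [v - M[r][c] * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

X = [[1, 0, 0, 0], [0, 0, 1, 2], [1, 3, 0, 0],
     [0, 0, 1, 4], [1, 4, 0, 0], [0, 0, 1, 4]]
y = [[1], [2], [1], [5], [2], [7]]

# Step 1: OLS residuals, split by equation (rows alternate equation 1, equation 2).
b_ols = mul(inv(mul(T(X), X)), mul(T(X), y))
e = [yi[0] - fi[0] for yi, fi in zip(y, mul(X, b_ols))]
e1, e2 = e[0::2], e[1::2]
n = len(e1)

# Step 2: estimate the 2 x 2 contemporaneous covariance matrix.
s11 = sum(a * a for a in e1) / n
s12 = sum(a * b for a, b in zip(e1, e2)) / n
s22 = sum(b * b for b in e2) / n
Sigma_hat = [[s11, s12], [s12, s22]]

# Step 3: GLS with the estimated block-diagonal covariance matrix.
Omega_hat = [[Sigma_hat[i % 2][j % 2] if i // 2 == j // 2 else F(0)
              for j in range(6)] for i in range(6)]
Omega_hat_inv = inv(Omega_hat)
b_fgls = mul(inv(mul(T(X), mul(Omega_hat_inv, X))),
             mul(T(X), mul(Omega_hat_inv, y)))

print([[str(v) for v in row] for row in Sigma_hat])
print([round(float(v[0]), 4) for v in b_fgls])
```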
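16.9 Appendix: The Kronecker product
Definition 16.2 For two matrices A and B the Kronecker product is defined as:
$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{bmatrix}$$
Example 16.3 Let $A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}$ and $B = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$, then
$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & a_{13}B \\ a_{21}B & a_{22}B & a_{23}B \end{bmatrix} = \begin{bmatrix} a_{11}b_1 & a_{12}b_1 & a_{13}b_1 \\ a_{11}b_2 & a_{12}b_2 & a_{13}b_2 \\ a_{11}b_3 & a_{12}b_3 & a_{13}b_3 \\ a_{21}b_1 & a_{22}b_1 & a_{23}b_1 \\ a_{21}b_2 & a_{22}b_2 & a_{23}b_2 \\ a_{21}b_3 & a_{22}b_3 & a_{23}b_3 \end{bmatrix}$$
Proposition 16.4
$$A \otimes B \otimes C = (A \otimes B) \otimes C = A \otimes (B \otimes C)$$
Proposition 16.5 If A and B are both $m \times n$ and C and D are both $p \times q$ matrices, then
$$(A + B) \otimes (C + D) = A \otimes C + A \otimes D + B \otimes C + B \otimes D$$
Proposition 16.6 If the products AC and BD are defined, then
$$(A \otimes B)(C \otimes D) = AC \otimes BD$$
Remark 16.7 It follows that if B is a column vector, then
$$(A \otimes B)\,C = AC \otimes B$$
since the product AC will then be defined and $C = C \otimes I_1$ (the $1 \times 1$ identity matrix). Similarly
if A is a column vector, then the product BC will be defined, so that
$$(A \otimes B)\,C = A \otimes BC$$
using the fact that $C = I_1 \otimes C$.
Proposition 16.8 Assume that A and B are square nonsingular matrices, then
$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$$
Proposition 16.9
$$(A \otimes B)' = A' \otimes B'$$
Proposition 16.10 Assume that A and B are square matrices of order $m$ and $n$ respectively, then
$$|A \otimes B| = |A|^n\,|B|^m$$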
Proposition 16.11
$$\operatorname{rank}(A \otimes B) = \operatorname{rank}(A)\,\operatorname{rank}(B)$$
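Several of these properties are easy to spot-check numerically. The sketch below is illustrative only: the matrices `A`, `B`, `C`, `D` are arbitrary small examples chosen here, not taken from the notes, and `kron`, `mul`, `T` and `det` are hypothetical helper names. It checks the mixed-product rule, the transpose rule, the determinant rule, Remark 16.7, and the block-diagonal inverse used in the GLS example ($\Omega = I_3 \otimes \Sigma$).

```python
from fractions import Fraction as F

def kron(A, B):
    """Kronecker product: every element a_ij of A is replaced by the block a_ij * B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def T(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def mul(A, B):
    """Matrix product A B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def det(M):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([r[:j] + r[j + 1:] for r in M[1:]])
               for j in range(len(M)))

A = [[1, 2], [3, 4]]                     # 2 x 2
B = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]    # 3 x 3
C = [[0, 1], [2, -1]]                    # 2 x 2
D = [[1], [2], [3]]                      # 3 x 1 column vector

# Proposition 16.6: (A kron B)(C kron D) = AC kron BD
assert mul(kron(A, B), kron(C, D)) == kron(mul(A, C), mul(B, D))
# Proposition 16.9: (A kron B)' = A' kron B'
assert T(kron(A, B)) == kron(T(A), T(B))
# Proposition 16.10 with m = 2, n = 3: |A kron B| = |A|^3 |B|^2
assert det(kron(A, B)) == det(A) ** 3 * det(B) ** 2
# Remark 16.7 with a column vector in second position: (A kron D) C = AC kron D
assert mul(kron(A, D), C) == kron(mul(A, C), D)

# Proposition 16.8 applied to the GLS example: (I_3 kron Sigma)^{-1} = I_3 kron Sigma^{-1}
I2 = [[1, 0], [0, 1]]
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Sigma = [[1, 1], [1, 4]]
Sigma_inv = [[F(4, 3), F(-1, 3)], [F(-1, 3), F(1, 3)]]
assert mul(kron(I3, Sigma), kron(I3, Sigma_inv)) == kron(I3, I2)

print(kron(A, D))   # same layout as Example 16.3, with numbers in place of symbols
```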
Chapter 17
System estimation by
Instrumental Variables and GMM
Chapter 18
Simultaneous Equation Models
Part V
Solutions
Solutions to Chapter 14
1. We know that the IV estimator can be written either as
$$\hat{\beta}_{IV} = (X'P_W X)^{-1} X'P_W y$$
or (in the 2SLS form) as
$$\hat{\beta}_{IV} = (\hat{X}'\hat{X})^{-1} \hat{X}'y$$
where $\hat{X}$ are the fitted values from the first stage. These two are equivalent, since
$$\hat{X} = P_W X$$
and
$$(\hat{X}'\hat{X})^{-1}\hat{X}'y = (X'P_W P_W X)^{-1} X'P_W y = (X'P_W X)^{-1} X'P_W y$$
So we could estimate $\beta$ by using OLS and the fitted values, i.e. writing the model as
$$y = \hat{X}\beta + \varepsilon$$
But
$$\hat{X} = P_W X = P_W \begin{bmatrix} X_1 & Z_2 \end{bmatrix} = \begin{bmatrix} P_W X_1 & P_W Z_2 \end{bmatrix} = \begin{bmatrix} X_1 & P_W Z_2 \end{bmatrix}$$
The last step follows since $X_1$ is among the instruments, so the fitted values are equal to
the values themselves. Consequently the coefficients in the OLS regression
$$y = X_1\beta_1 + P_W Z_2\beta_2 + \varepsilon \qquad (1)$$
will be identical to the IV coefficients.

We know that the residuals $M_W Z_2$ are orthogonal to all the instruments in W, i.e.
$M_W X_1 = M_W P_W = 0$. Consequently the variables $M_W Z_2$ are orthogonal to the variables $X_1$ and $P_W Z_2$. So the OLS estimates of $\beta_1$ and $\beta_2$ will be identical in regression 1
above and
$$y = X_1\beta_1 + P_W Z_2\beta_2 + M_W Z_2\gamma + \varepsilon \qquad (2)$$
Furthermore
$$P_W Z_2 = Z_2 - M_W Z_2$$
(by definition of the $P_W$ and $M_W$ matrices). A linear transformation of equation 2 is
therefore given by
$$y = X_1\beta_1 + (Z_2 - M_W Z_2)\beta_2 + M_W Z_2\gamma + \varepsilon$$
$$= X_1\beta_1 + Z_2\beta_2 + M_W Z_2(\gamma - \beta_2) + \varepsilon$$
$$= X_1\beta_1 + Z_2\beta_2 + M_W Z_2\delta + \varepsilon$$
The last equation is, of course, the artificial regression, equation 14.8. Estimating this by
OLS will give the same coefficients (appropriately transformed) as regression 2, i.e. $\hat{\beta}_1$ and
$\hat{\beta}_2$ will be equal to the IV coefficients.
Now observe that our structural model is
$$y = X_1\beta_1 + Z_2\beta_2 + \varepsilon$$
Writing
$$Z_2 = P_W Z_2 + M_W Z_2$$
this model becomes
$$y = X_1\beta_1 + P_W Z_2\beta_2 + M_W Z_2\beta_2 + \varepsilon$$
If $\varepsilon$ is uncorrelated with $Z_2$ it will certainly be uncorrelated with $P_W Z_2$ and $M_W Z_2$. Estimating this last equation by OLS will therefore give us unbiased and consistent coefficients.
We would therefore expect the OLS coefficient on the variables $M_W Z_2$ in regression 2 to
be $\beta_2$. This implies that
$$\delta = \gamma - \beta_2 = 0$$
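The three algebraically equivalent routes to the IV coefficients can be confirmed on a toy data set. Everything in the sketch below is an assumption made for illustration: the six observations, the variable names (`x1`, `w2`, `w3`, `z2`) and the helper functions are invented here and do not come from the notes. Exact rational arithmetic is used so that the equalities hold exactly rather than up to rounding.

```python
from fractions import Fraction as F

def T(A):
    return [list(col) for col in zip(*A)]

def mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inv(A):
    """Inverse by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [[F(v) for v in row] + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [v - M[r][c] * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

def hcat(*blocks):
    """Place matrices side by side."""
    return [sum((list(b[i]) for b in blocks), []) for i in range(len(blocks[0]))]

def ols(X, y):
    return mul(inv(mul(T(X), X)), mul(T(X), y))

def col(values):
    return [[v] for v in values]

# Hypothetical data: X1 = [constant, x1] is exogenous, z2 is endogenous,
# and w2, w3 are two outside instruments (so the model is overidentified).
const = col([1] * 6)
x1 = col([1, 2, 3, 4, 5, 6])
w2 = col([1, 0, 0, 1, 0, 1])
w3 = col([0, 1, 0, 0, 1, 1])
z2 = col([2, 1, 4, 3, 6, 8])
y = col([3, 1, 4, 1, 5, 9])

X1 = hcat(const, x1)
X = hcat(X1, z2)
W = hcat(X1, w2, w3)
P = mul(W, mul(inv(mul(T(W), W)), T(W)))          # projection matrix P_W
Pz = mul(P, z2)                                    # first-stage fitted values
Mz = [[a[0] - b[0]] for a, b in zip(z2, Pz)]       # first-stage residuals M_W z2

b_iv = mul(inv(mul(T(X), mul(P, X))), mul(T(X), mul(P, y)))   # (X'P_W X)^{-1} X'P_W y
b_2sls = ols(hcat(X1, Pz), y)                                  # regression (1)
b_art = ols(hcat(X, Mz), y)                                    # artificial regression

assert b_iv == b_2sls == b_art[:3]
print([str(v[0]) for v in b_iv], "delta =", str(b_art[3][0]))
```

The last coefficient of the artificial regression is the $\delta$ whose significance the Hausman test examines.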
2. Show that the IV residuals $e = y - X\hat{\beta}_{IV}$ have a sample mean of zero, provided that the
intercept features in the list of instruments. Show that this implies that the usual (centred)
$R^2$ can be used in the overidentification test.
This is, in fact, a difficult question! It is fairly easy to show that the IV residuals should
have a mean of zero asymptotically. This follows from the fact that
$$E(W'\varepsilon) = 0$$
and we are therefore sure that
$$\operatorname{plim} \frac{1}{n} W'e = 0$$
The first row of $W'$ is a row of ones, so the first element of the vector $\frac{1}{n}W'e$ is just $\frac{1}{n}\iota'e$ (where $\iota$ is the vector of ones),
and it follows that
$$\operatorname{plim} \frac{1}{n}\iota'e = 0$$
which just states that the sample mean of the residuals converges to zero.
It is also easy to show that the sample mean of the IV residuals in the exactly identified
case will be zero. In that case we can derive the IV estimator from the sample moment
condition
$$W'(y - X\hat{\beta}) = 0, \quad \text{i.e.} \quad W'e = 0$$
This will be a set of $k$ equations in $k$ unknowns and hence will have a unique solution.
Since the first row of $W'$ will be a row of ones, the first equation will be
$$\iota'e = 0$$
from which it follows that the IV residuals have a mean of zero.
In the general overidentified case the sample moment condition does not have a unique
solution. Indeed the $l$ equations in $k$ unknowns will give an inconsistent set of equations,
unless the row rank of the $W'$ matrix is $k$ and not $l$. We cannot be sure therefore that
at the IV solution the first equation will be exactly satisfied.
In order to show this more generally we need to adopt a slightly different tack. We want
to show that
$$\frac{1}{n}\iota'e = 0, \quad \text{i.e.} \quad \frac{1}{n}\iota'(y - X\hat{\beta}) = 0, \quad \text{i.e.} \quad \frac{1}{n}\iota'y = \frac{1}{n}\iota'X\hat{\beta}$$
We therefore have to show that the sample mean of the $y$ values is equal to the sample
mean of the fitted values. Assume that the X matrix is partitioned (as before) as
$$X = \begin{bmatrix} X_1 & Z_2 \end{bmatrix}$$
and the matrix of instruments as
$$W = \begin{bmatrix} X_1 & W_2 \end{bmatrix}$$
Let
$$\hat{y}_i = X_{1i}\hat{\beta}_1 + Z_{2i}\hat{\beta}_2 = X_{1i}\hat{\beta}_1 + \hat{Z}_{2i}\hat{\beta}_2 + (Z_{2i} - \hat{Z}_{2i})\hat{\beta}_2$$
Note that this will be different (in general) from
$$\hat{y}_{2i} = X_{1i}\hat{\beta}_1 + \hat{Z}_{2i}\hat{\beta}_2$$
although the coefficient vectors $\hat{\beta}_1$ and $\hat{\beta}_2$ will be identical. Note that $X_{1i}$ and $Z_{2i}$ are row
vectors, so this is a vector equation. Now
$$\frac{1}{n}\sum \hat{y}_i = \frac{1}{n}\sum X_{1i}\hat{\beta}_1 + \frac{1}{n}\sum \hat{Z}_{2i}\hat{\beta}_2 + \frac{1}{n}\sum (Z_{2i} - \hat{Z}_{2i})\hat{\beta}_2$$
Note that the fitted values $\hat{Z}_2$ are obtained from a first stage OLS regression which contains
an intercept (since there is an intercept among the instruments). Consequently the sample
mean of the $Z_2$ values will be identical to the sample mean of the $\hat{Z}_2$ values. Consequently
$\frac{1}{n}\sum (Z_{2i} - \hat{Z}_{2i})\hat{\beta}_2 = 0$. It follows that
$$\frac{1}{n}\sum \hat{y}_i = \frac{1}{n}\sum X_{1i}\hat{\beta}_1 + \frac{1}{n}\sum \hat{Z}_{2i}\hat{\beta}_2$$
i.e.
$$\bar{\hat{y}} = \bar{X}_1\hat{\beta}_1 + \bar{\hat{Z}}_2\hat{\beta}_2 = \bar{\hat{y}}_2$$
So although it is not the case that $\hat{y}$ will be equal to $\hat{y}_2$, their means will be equal.
Now if there is an intercept in the second stage regression$^1$, then the mean of the fitted values from that
regression will be equal to the mean of the $y$ values, i.e.
$$\bar{\hat{y}}_2 = \bar{y}$$
This however will establish that
$$\bar{y} = \bar{\hat{y}}$$
from which it will in turn follow that the IV residuals will have a sample mean of zero.
If the sample mean of the IV residuals is zero, then
$$e'e = \tilde{e}'\tilde{e}$$
where $\tilde{e}$ is the vector of centred residuals (i.e. with their sample mean subtracted). The
(uncentred) $R^2$ of the auxiliary regression is equal to $\frac{\hat{e}'\hat{e}}{e'e}$ where $\hat{e}$ is the set of fitted values from the
auxiliary regression. Since these will also have a mean of zero, they will also be equal to
the centred values. Consequently the uncentred and centred $R^2$ will be equal.
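The claim can be illustrated on a small overidentified example. The data below are invented for this sketch (they are not from the notes), as are the helper functions; the constant appears both in $X_1$ and in the instrument list. With exact rational arithmetic the residual mean is exactly zero, and the centred and uncentred $R^2$ of the auxiliary regression of $e$ on $W$ coincide.

```python
from fractions import Fraction as F

def T(A):
    return [list(col) for col in zip(*A)]

def mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inv(A):
    """Inverse by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [[F(v) for v in row] + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [v - M[r][c] * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

def hcat(*blocks):
    """Place matrices side by side."""
    return [sum((list(b[i]) for b in blocks), []) for i in range(len(blocks[0]))]

def col(values):
    return [[v] for v in values]

# Hypothetical data: three regressors (constant, x1, z2) and four instruments.
n = 6
X1 = hcat(col([1] * n), col([1, 2, 3, 4, 5, 6]))
z2 = col([2, 1, 4, 3, 6, 8])
y = col([3, 1, 4, 1, 5, 9])
X = hcat(X1, z2)
W = hcat(X1, col([1, 0, 0, 1, 0, 1]), col([0, 1, 0, 0, 1, 1]))

P = mul(W, mul(inv(mul(T(W), W)), T(W)))                        # P_W
b_iv = mul(inv(mul(T(X), mul(P, X))), mul(T(X), mul(P, y)))
e = [yi[0] - fi[0] for yi, fi in zip(y, mul(X, b_iv))]           # IV residuals
mean_e = sum(e) / n

# Auxiliary regression of e on W: fitted values are P_W e.
e_hat = [row[0] for row in mul(P, col(e))]
mean_hat = sum(e_hat) / n
r2_uncentred = sum(v * v for v in e_hat) / sum(v * v for v in e)
r2_centred = (sum((v - mean_hat) ** 2 for v in e_hat)
              / sum((v - mean_e) ** 2 for v in e))

assert mean_e == 0
assert r2_centred == r2_uncentred
print("mean residual:", mean_e, " R^2:", float(r2_uncentred))
```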
$^1$ This suggests that the intercept should be in the $X_1$ set of variables.

3. Acemoglu et al. (2001) have suggested that malaria deaths in the seventeen- and eighteen-hundreds, i.e. at the beginning of the process of colonisation, might provide a useful
instrument for the quality of governance institutions in a cross-sectional regression.
(a) Sketch out the argument for why malaria deaths may be a good instrument. (Read
the article!)
The key argument is summarised on the second page of the article (p.1370). It is
that settler mortality affected the extent of European settlement. This in turn differentiated colonies which became "settler colonies" from those that had largely an
"extractive" function. These early institutions shaped the evolution of the society
and the current institutions. Current institutions in turn affect current economic
performance.
(b) What might be some of the problems with this instrument?
The fact that malaria deaths occurred before the current institutions does not guarantee that malaria deaths are uncorrelated with the error term in the regression.
Malaria (and yellow fever) deaths in an earlier century may be correlated with some
other feature of the country that might be persistent and affect growth. This is why
the authors spend so much effort at dealing with other potential channels through
which early deaths might be correlated with current growth. You may want to note
how many different channels they consider and the variety of evidence that they bring
to bear.
One potential channel that they do not consider is the development of trade and
endogenous industries. Foreign companies may be deterred from investing in local
capacity (other than extractive capacity) if doing so requires sending out skilled people
who might die in the process. Note that the "malaria in 1994" variable does not
adequately control for this since it is again the mortality of the expatriates that
is at issue. Particularly if building up domestic industry occurs incrementally over a
long period of time the low presence of expatriates may have a very similar impact to
the one that Acemoglu et al. focus on, except through a different channel.
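4. You are given the regression model
$$y_i = \beta_1 + \beta_2 x_i^* + \varepsilon_i$$
where y is the vector of the log of wages and $x^*$ is the vector of the (true) level of schooling. We assume that this model obeys the standard assumptions of the Classical Linear
Regression Model. Unfortunately schooling is measured badly in your data set. Indeed you
have reason to believe that measured schooling x is given by
$$x = x^* + u$$
where $\operatorname{Cov}(x_i^*, u_i) = 0$ and $\operatorname{Cov}(\varepsilon_i, u_i) = 0$. On your data set you observe that $\operatorname{Var}(x) = 9.6$.
You also have a study available which suggests that $\operatorname{Var}(u) = 1.5$. On top of this you have
data available for a subset of your observations on the schooling of a sibling. This variable
z is also badly measured, i.e.
$$z = z^* + v$$
where $\operatorname{Cov}(z_i^*, v_i) = 0$ and $\operatorname{Cov}(\varepsilon_i, v_i) = 0$.
(a) Derive an expression for the asymptotic value of the OLS estimator.
$$\hat{\beta}_2 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
$$\operatorname{plim} \hat{\beta}_2 = \frac{\operatorname{Cov}(x_i, y_i)}{\operatorname{Var}(x_i)} = \frac{\operatorname{Cov}\left(x_i,\ \beta_1 + \beta_2 (x_i - u_i) + \varepsilon_i\right)}{\operatorname{Var}(x_i)} = \beta_2\,\frac{\operatorname{Var}(x_i) - \operatorname{Var}(u_i)}{\operatorname{Var}(x_i)} = \beta_2 \left( 1 - \frac{\operatorname{Var}(u_i)}{\operatorname{Var}(x_i)} \right)$$
(b) What would be the appropriate estimator of $\beta_2$?
Since we have estimates of $\operatorname{Var}(x_i)$ and $\operatorname{Var}(u_i)$ the most appropriate estimator would
be the errors in variables estimator, i.e.
$$\hat{\beta}_2^{EIV} = \hat{\beta}_2\,\frac{\operatorname{Var}(x_i)}{\operatorname{Var}(x_i) - \operatorname{Var}(u_i)} = \hat{\beta}_2\,\frac{9.6}{9.6 - 1.5} = \hat{\beta}_2\,\frac{9.6}{8.1}$$
This amounts to scaling up the OLS estimates by 18.52%.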
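The correction factor is simple arithmetic, sketched below. The two variances are read as 9.6 and 1.5 (an assumption about where the decimal points fall; only their ratio matters for the result), and the OLS slope of 0.10 passed to `eiv_slope` is a purely hypothetical value used to show how the correction would be applied.

```python
from fractions import Fraction

var_x = Fraction("9.6")   # variance of measured schooling in the data set
var_u = Fraction("1.5")   # measurement-error variance from the outside study

# plim(b_ols) / beta_2 = (Var(x) - Var(u)) / Var(x): the attenuation factor.
attenuation = (var_x - var_u) / var_x
correction = 1 / attenuation          # errors-in-variables scaling of the OLS slope

def eiv_slope(b_ols):
    """Errors-in-variables estimate implied by an OLS slope."""
    return b_ols * float(correction)

print(correction, f"{(float(correction) - 1) * 100:.2f}% increase")
print(eiv_slope(0.10))                # hypothetical OLS slope of 0.10
```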
(c) Under what circumstances could you use z as an instrument for x? Explain.
We require z to be correlated with x but not correlated with the error in the regression.
The regression model is
$$y_i = \beta_1 + \beta_2 x_i - \beta_2 u_i + \varepsilon_i$$
so the regression error consists both of the measurement error $u_i$ and the term $\varepsilon_i$. We
therefore require $z_i$ to be correlated with $x_i$ but uncorrelated with $u_i$ and $\varepsilon_i$. This
requires the true variable $z^*$ to be uncorrelated with either of these error terms. Since
we hypothesised that $x^*$ was uncorrelated with $\varepsilon$ and u this is plausible. However we
also require the measurement error v to be uncorrelated with u. This is much less
plausible.
There is an additional, more subtle point. Since we only have education on a sibling for
a subset of our observations, we must be sure that there is no correlation between the
process of having a sibling (as measured in the data set) and the error term in the main
regression. If, for instance, people that have more siblings develop important skills that
lead to higher wages, then estimating the main regression only over individuals with
siblings will lead to biased coefficients for reasons not at all related to the measurement
error.
5. You are given the following model:

$$\ln w_i = \beta_1 + \beta_2 s_i + \beta_3 x_i + \varepsilon_{1i}$$
$$s_i = \alpha_1 + \alpha_2 p_i + \varepsilon_{2i}$$
where $w_i$ is the wage of individual $i$, $s_i$ is schooling, $x_i$ is experience and $p_i$ is the schooling
of the parents of individual $i$.
You estimate the relationships empirically. The Stata output is as follows:
. reg logpay highed exper _I*

      Source |       SS       df       MS              Number of obs =     816
-------------+------------------------------           F(  7,   808) =   59.25
       Model |  244.761626     7  34.9659466           Prob > F      =  0.0000
    Residual |  476.807323   808  .590108072           R-squared     =  0.3392
-------------+------------------------------           Adj R-squared =  0.3335
       Total |  721.568948   815  .885360673           Root MSE      =  .76818

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      highed |   .1333759   .0091304    14.61   0.000     .1154538     .151298
       exper |   .0312511   .0051414     6.08   0.000      .021159    .0413433
   _Imetro_2 |   .1612915   .0754299     2.14   0.033     .0132297    .3093532
   _Imetro_3 |   .4565405   .0724531     6.30   0.000      .314322     .598759
    _Irace_2 |   .0398263   .0883555     0.45   0.652    -.1336071    .2132597
    _Irace_3 |   .2365166   .1069858     2.21   0.027     .0265137    .4465195
    _Irace_4 |    .477008   .1072434     4.45   0.000     .2664995    .6875166
       _cons |   4.850398   .1244376    38.98   0.000     4.606139    5.094657
------------------------------------------------------------------------------

. reg highed parent_ed exper _I* if logpay~=.

      Source |       SS       df       MS              Number of obs =     816
-------------+------------------------------           F(  7,   808) =   80.15
       Model |   4605.8463     7  657.978043           Prob > F      =  0.0000
    Residual |  6633.29953   808  8.20952912           R-squared     =  0.4098
-------------+------------------------------           Adj R-squared =  0.4047
       Total |  11239.1458   815   13.790363           Root MSE      =  2.8652

------------------------------------------------------------------------------
      highed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   parent_ed |   .2288393   .0310682     7.37   0.000     .1678555    .2898232
       exper |  -.2888812   .0162631   -17.76   0.000     -.320804   -.2569584
   _Imetro_2 |    .551927   .2826728     1.95   0.051    -.0029326    1.106787
   _Imetro_3 |   .4677897   .2761861     1.69   0.091    -.0743371    1.009917
    _Irace_2 |  -.6611566    .331181    -2.00   0.046    -1.311233   -.0110801
    _Irace_3 |   .1372697   .4034849     0.34   0.734    -.6547326    .9292719
    _Irace_4 |  -.1738147   .4272961    -0.41   0.684    -1.012556    .6649266
       _cons |   10.73762   .2738222    39.21   0.000     10.20014    11.27511
------------------------------------------------------------------------------

. predict u_ed, res
(1507 missing values generated)

. reg logpay highed exper _I* u_ed

      Source |       SS       df       MS              Number of obs =     816
-------------+------------------------------           F(  8,   807) =   55.94
       Model |  257.400261     8  32.1750326           Prob > F      =  0.0000
    Residual |  464.168688   807  .575178052           R-squared     =  0.3567
-------------+------------------------------           Adj R-squared =  0.3503
       Total |  721.568948   815  .885360673           Root MSE      =   .7584

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      highed |   .2964423   .0359358     8.25   0.000     .2259036    .3669809
       exper |   .0816785   .0118951     6.87   0.000     .0583296    .1050274
   _Imetro_2 |   .0236001   .0800533     0.29   0.768    -.1335372    .1807374
   _Imetro_3 |   .3015268   .0788048     3.83   0.000     .1468403    .4562134
    _Irace_2 |   .1050021   .0883318     1.19   0.235     -.068385    .2783893
    _Irace_3 |   .1383059   .1076816     1.28   0.199    -.0730632    .3496751
    _Irace_4 |   .3206584   .1110075     2.89   0.004      .102761    .5385558
        u_ed |  -.1740156   .0371227    -4.69   0.000     -.246884   -.1011472
       _cons |   2.923025   .4291273     6.81   0.000     2.080688    3.765362
------------------------------------------------------------------------------
The variables are defined as follows:

logpay: log of income received
highed: years of education attained
exper: length of experience

The regression is estimated over individuals where the parents' education could be determined.
(a) Given the regression output, what would be the estimate of the returns to education
if you were to estimate the first equation by instrumental variables, using parents'
education as an instrument for own education?
We can retrieve the IV coefficients from the auxiliary regression used to perform the
Hausman test. All those coefficients are identical to the IV coefficients. In this case we
need to look at the coefficient on the highed variable. It looks as though the returns
to education are 0.2964423.
(b) Perform a Hausman test for the difference between the OLS and IV estimates. How
might you explain the results?
We test the significance of the residuals term in the auxiliary regression. We see
(from the regression output) that the p-value on u_ed is less than 0.001, i.e. we
reject the hypothesis that the OLS and the IV coefficients are identical. We conclude
that the education variable and the errors in the wage regression must be correlated.
This could be due to measurement error or the omission of a common variable in the
equations determining how much schooling someone gets and how much they earn.
(c) Do the results suggest that you might have the problem of weak instruments?
We look at the first stage regression in which we regress the education variable on all
the instruments. We see that parents' education is highly significant with a t-statistic
in excess of seven. This translates into an F-statistic of over 49. Consequently we do
not have the problem of weak instruments.
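With a single excluded instrument the first-stage F-statistic is the square of the t-statistic on that instrument, so the figure can be recovered directly from the parent_ed row of the first-stage output. The threshold of 10 used below is the conventional rule of thumb for weak instruments; it is an assumption brought in for this sketch and is not stated in the answer.

```python
# Coefficient and standard error of parent_ed in the first-stage regression.
coef, se = 0.2288393, 0.0310682

t_stat = coef / se
f_stat = t_stat ** 2          # one excluded instrument, so F = t^2

RULE_OF_THUMB = 10            # assumed conventional cut-off, not taken from the text

print(f"t = {t_stat:.2f}, F = {f_stat:.1f}, weak instrument: {f_stat < RULE_OF_THUMB}")
```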
(d) Interpret both the OLS and the IV estimates of the first equation.
The OLS regression suggests that the returns to education are 0.1333759, i.e. each
additional year of schooling raises earnings by about 13%. The experience variable
indicates that each additional year of experience raises earnings by about 3%. We note
that race and location dummies are significant and have the structure that we would
expect: metropolitan areas pay better than urban areas which in turn pay better than
the rural areas. Whites earn more than Indians who earn more than Coloureds who
earn more than Africans.
The IV regression suggests that the returns to education are 0.2964423, i.e. each
additional year of schooling raises earnings by about 30% (to be precise, it raises it
by $\left(e^{0.2964423} - 1\right) \times 100 = 34.506\%$). Each year of experience raises earnings by
about 8%. The race and location dummies still have the same interpretation and they
exhibit the same order, except that now the effects are considerably weaker.
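The exact percentage effect implied by a coefficient in a log-wage regression is $100\,(e^{b} - 1)$. A short sketch, using the highed coefficients from the OLS and auxiliary regressions above (the function name `pct_effect` is introduced here for illustration):

```python
import math

def pct_effect(b):
    """Exact percentage change in the wage for a one-unit change in the regressor."""
    return (math.exp(b) - 1) * 100

ols_pct = pct_effect(0.1333759)   # OLS return to a year of schooling
iv_pct = pct_effect(0.2964423)    # IV return to a year of schooling

print(f"OLS: {ols_pct:.2f}%  IV: {iv_pct:.2f}%")
```

The gap between the coefficient read as a percentage and the exact figure grows with the size of the coefficient, which is why it matters more for the IV estimate than for the OLS one.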
(e) Discuss the empirical results in relation to the following two possible reasons for the
use of instrumental variables:
• omitted variable bias in the main regression
• measurement error in the schooling variable.
The omitted variable bias formula suggests that the biased OLS results would be the
sum of the true education coefficient plus the coefficient on the omitted variable
(say $\gamma$) multiplied by the regression coefficient (say $\delta$) of that variable on education,
i.e.
$$0.1333759 = 0.2964423 + \gamma\delta$$
For this to make any sense we require $\gamma\delta < 0$, i.e. an omitted variable that is negatively
correlated with either earnings or education (but not both). We might have suspected the omission of
an ability variable: higher ability individuals are likely to earn more (at any level of
schooling), but they are also more likely to get additional schooling. It is quite clear,
however, that these coefficients cannot arise from the omission of an ability variable.
In that case our OLS results would have overestimated the true returns to schooling.
Measurement error would, of course, lead to an underestimate (as shown by the relationship between the IV and the OLS coefficients). The attenuation bias formula,
however, is
$$\operatorname{plim} \hat{\beta} = \beta \left( 1 - \frac{\operatorname{Var}(u)}{\operatorname{Var}(x^*) + \operatorname{Var}(u)} \right)$$
To get attenuation in excess of 50% we would need to assume that the error process is
on a par with the true signal, i.e. about half of the observed variation in education
levels is spurious. This just does not seem plausible.
In short neither of these two reasons coheres very well with the empirical results.
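The size of the measurement error that would be needed can be backed out from the two estimates. The sketch below assumes, purely for the sake of the argument, that the IV coefficient is the true return and that classical measurement error is the only source of the gap; under those assumptions the ratio of the OLS to the IV coefficient equals the share of the observed variance in schooling that is genuine signal.

```python
b_ols, b_iv = 0.1333759, 0.2964423   # highed coefficients from the output above

# Under classical measurement error: plim(b_ols) = beta * Var(x*) / (Var(x*) + Var(u)).
signal_share = b_ols / b_iv
noise_share = 1 - signal_share        # implied Var(u) / (Var(x*) + Var(u))

print(f"implied noise share of observed schooling variance: {noise_share:.1%}")
```

A noise share of this size would mean that more than half of the observed variation in years of education is measurement error, which is the implausibility the answer points to.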
(f) What assumptions would you need to make for the OLS estimates to be valid? And
what assumptions are required in order for the IV estimates to be valid? Do you think
that any of these assumptions hold in this case?
The OLS estimates would be valid if the regressors are independent of the error term.
In particular we would need to assume that the process that determines education is
independent of the wage received, i.e. $\varepsilon_1$ and $\varepsilon_2$ are uncorrelated.
Instrumental variables estimation is valid only under the following conditions:
i. The instrument must be correlated with the endogenous variable
ii. It must not be correlated with the error term in the primary regression
It is clear from the first stage regression that the instruments are highly significant.
It is not clear, however, whether parents' education is a valid instrument. One particularly troubling factor in this case (not made explicit in the question!) is that we
have data on parents' education only for individuals who are still living with their
parents. These individuals, however, are more likely to be low earners. Sample selection therefore induces a relationship between parents' education and the wage. This
will contaminate the IV results!
Bibliography
Acemoglu, D., Johnson, S. and Robinson, J. A.: 2001, The colonial origins of comparative development: an empirical investigation, American Economic Review 91(5), 1369-1401.
Angrist, J. D. and Pischke, J.-S.: 2009, Mostly Harmless Econometrics: An Empiricist's Companion, Princeton University Press, Princeton, NJ.
Cameron, A. C. and Trivedi, P. K.: 2005, Microeconometrics: Methods and Applications, Cambridge University Press, New York.
Davidson, R. and MacKinnon, J. G.: 1993, Estimation and Inference in Econometrics, Oxford University Press, New York.
Davidson, R. and MacKinnon, J. G.: 2004, Econometric Theory and Methods, Oxford University Press, New York.
Deaton, A.: 1997, The Analysis of Household Surveys: A Microeconometric Approach to Development Policy, Johns Hopkins University Press, Baltimore.
Greene, W. H.: 2003, Econometric Analysis, 5 edn, Prentice-Hall.
Gujarati, D.: 2003, Basic Econometrics, 4 edn, McGraw-Hill, Boston.
Holland, P. W.: 1986, Statistics and causal inference, Journal of the American Statistical Association 81(396), 945-960.
Keynes, J. M.: 1936, The General Theory of Employment Interest and Money, Macmillan, London.
Mittelhammer, R. C., Judge, G. G. and Miller, D. J.: 2000, Econometric Foundations, CUP, Cambridge.
Murray, M. P.: 2006, Avoiding invalid instruments and coping with weak instruments, Journal of Economic Perspectives 20(4), 111-132.
Simon, C. P. and Blume, L.: 1994, Mathematics for Economists, Norton, New York.
Stock, J. H., Wright, J. H. and Yogo, M.: 2002, A survey of weak instruments and weak identification in generalized method of moments, Journal of Business and Economic Statistics 20(4), 518-529.
Sydsaeter, K., Strom, A. and Berck, P.: 1999, Economists' Mathematical Manual, 3 edn, Springer, Berlin.
Wooldridge, J. M.: 2002, Econometric Analysis of Cross Section and Panel Data, MIT Press, Cambridge, Mass.
