1/2/3-1
Brief Overview of the Course
1/2/3-2
This course is about using data to measure causal effects.
Ideally, we would like an experiment
o what would be an experiment to estimate the effect of
class size on standardized test scores?
But almost always we only have observational
(nonexperimental) data.
o returns to education
o cigarette prices
o monetary policy
Most of the course deals with difficulties that arise when using
observational data to estimate causal effects
o confounding effects (omitted factors)
o simultaneous causality
o “correlation does not imply causation”
1/2/3-3
In this course you will:
1/2/3-4
Review of Probability and Statistics
(SW Chapters 2, 3)
1/2/3-5
The California Test Score Data Set
Variables:
5th grade test scores (Stanford-9 achievement test,
combined math and reading), district average
Student-teacher ratio (STR) = no. of students in the
district divided by no. of full-time equivalent teachers
1/2/3-6
Initial look at the data:
(You should already know how to interpret this table)
1/2/3-7
Do districts with smaller classes have higher test scores?
Scatterplot of test score v. student-teacher ratio
1/2/3-10
1. Estimation
Ȳ_small – Ȳ_large = (1/n_small) Σᵢ Yᵢ – (1/n_large) Σᵢ Yᵢ
= 657.4 – 650.0
= 7.4
1/2/3-11
2. Hypothesis testing
Difference-of-means t-statistic (remember this?):
t = (Ȳ_s – Ȳ_l) / √(s_s²/n_s + s_l²/n_l) = (Ȳ_s – Ȳ_l) / SE(Ȳ_s – Ȳ_l)
1/2/3-12
Compute the difference-of-means t-statistic:
Size    Ȳ       s_Y    n
small   657.4   19.4   238
large   650.0   17.9   182
|t| > 1.96, so reject (at the 5% significance level) the null
hypothesis that the two means are the same.
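The arithmetic behind this rejection can be verified directly from the summary statistics in the table (a quick check in Python; the course's own examples use STATA):

```python
import math

# Summary statistics from the table: small vs. large classes
y_small, s_small, n_small = 657.4, 19.4, 238
y_large, s_large, n_large = 650.0, 17.9, 182

# Difference-of-means t-statistic: t = (Ys - Yl) / sqrt(ss^2/ns + sl^2/nl)
se = math.sqrt(s_small**2 / n_small + s_large**2 / n_large)
t = (y_small - y_large) / se

print(round(se, 2))   # SE of the difference: 1.83
print(round(t, 2))    # t-statistic: 4.05
print(abs(t) > 1.96)  # True: reject H0 at the 5% level
```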
1/2/3-13
3. Confidence interval
(Ȳ_s – Ȳ_l) ± 1.96·SE(Ȳ_s – Ȳ_l)
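This interval can be computed from the same summary statistics as the t-statistic (Python, for illustration):

```python
import math

# 95% CI for the difference in mean test scores (small - large classes):
# (Ys - Yl) +/- 1.96 * SE(Ys - Yl)
diff = 657.4 - 650.0
se = math.sqrt(19.4**2 / 238 + 17.9**2 / 182)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(round(lo, 2), round(hi, 2))  # 3.82 10.98 -- the interval excludes 0
```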
1/2/3-14
What comes next…
1/2/3-15
Review of Statistical Theory
Population
The group or collection of all possible entities of interest
(school districts)
We will think of populations as infinitely large (∞ is an
approximation to "very big")
Random variable Y
Numerical summary of a random outcome (district
average test score, district STR)
1/2/3-17
Population distribution of Y
1/2/3-18
(b) Moments of a population distribution: mean, variance,
standard deviation, covariance, correlation
1/2/3-19
Moments, ctd.
skewness = E[(Y – μ_Y)³] / σ_Y³
= measure of asymmetry of a distribution
skewness = 0: distribution is symmetric
skewness > (<) 0: distribution has long right (left) tail
kurtosis = E[(Y – μ_Y)⁴] / σ_Y⁴
= measure of mass in tails
= measure of probability of large values
kurtosis = 3: normal distribution
kurtosis > 3: heavy tails ("leptokurtic")
1/2/3-20
1/2/3-21
2 random variables: joint distributions and covariance
1/2/3-22
cov(X,X) = E[(X – μ_X)(X – μ_X)] = E[(X – μ_X)²] = σ_X²
1/2/3-23
The covariance between Test Score and STR is negative:
so is the correlation…
1/2/3-24
The correlation coefficient is defined in terms of the
covariance:
corr(X,Z) = cov(X,Z) / √(var(X)·var(Z)) = σ_XZ/(σ_X·σ_Z) = r_XZ
–1 ≤ corr(X,Z) ≤ 1
corr(X,Z) = 1 means perfect positive linear association
corr(X,Z) = –1 means perfect negative linear association
corr(X,Z) = 0 means no linear association
1/2/3-25
The correlation coefficient measures linear association
1/2/3-26
(c) Conditional distributions and conditional means
Conditional distributions
The distribution of Y, given value(s) of some other
random variable, X
Ex: the distribution of test scores, given that STR < 20
Conditional expectations and conditional moments
conditional mean = mean of conditional distribution
= E(Y|X = x) (important concept and notation)
conditional variance = variance of conditional distribution
Example: E(Test scores|STR < 20) = the mean of test
scores among districts with small class sizes
The difference in means is the difference between the means
of two conditional distributions:
1/2/3-27
Conditional mean, ctd.
1/2/3-28
(d) Distribution of a sample of data drawn randomly
from a population: Y₁,…, Yₙ
1/2/3-29
Distribution of Y₁,…, Yₙ under simple random
sampling
Because individuals #1 and #2 are selected at random, the
value of Y₁ has no information content for Y₂. Thus:
o Y₁ and Y₂ are independently distributed
o Y₁ and Y₂ come from the same distribution, that
is, Y₁, Y₂ are identically distributed
o That is, under simple random sampling, Y₁ and Y₂
are independently and identically distributed (i.i.d.).
o More generally, under simple random sampling,
{Yᵢ}, i = 1,…, n, are i.i.d.
1/2/3-30
This framework allows rigorous statistical inferences about
moments of population distributions using a sample of data
from that population …
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals
Estimation
Ȳ is the natural estimator of the mean. But:
(a) What are the properties of Ȳ?
(b) Why should we use Ȳ rather than some other estimator?
Y₁ (the first observation)
maybe unequal weights – not simple average
1/2/3-31
median(Y₁,…, Yₙ)
The starting point is the sampling distribution of Ȳ…
1/2/3-32
(a) The sampling distribution of Ȳ
Ȳ is a random variable, and its properties are determined by
the sampling distribution of Ȳ
The individuals in the sample are drawn at random.
Thus the values of (Y₁,…, Yₙ) are random
Thus functions of (Y₁,…, Yₙ), such as Ȳ, are random:
had a different sample been drawn, they would have taken
on a different value
The distribution of Ȳ over different possible samples of
size n is called the sampling distribution of Ȳ.
The mean and variance of Ȳ are the mean and variance of
its sampling distribution, E(Ȳ) and var(Ȳ).
The concept of the sampling distribution underpins all of
econometrics.
1/2/3-33
The sampling distribution of Ȳ, ctd.
Example: Suppose Y takes on 0 or 1 (a Bernoulli random
variable) with the probability distribution
Pr[Y = 0] = .22, Pr[Y = 1] = .78
Then
E(Y) = p·1 + (1 – p)·0 = p = .78
σ_Y² = E[Y – E(Y)]² = p(1 – p) [remember this?]
= .78×(1 – .78) = 0.1716
The sampling distribution of Ȳ depends on n.
Consider n = 2. The sampling distribution of Ȳ is
Pr(Ȳ = 0) = .22² = .0484
Pr(Ȳ = ½) = 2×.22×.78 = .3432
Pr(Ȳ = 1) = .78² = .6084
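These three probabilities can be reproduced by enumerating all possible samples of size 2 (a short Python sketch, not part of the course's STATA material):

```python
from itertools import product

# Enumerate the sampling distribution of the sample mean for n = 2
# i.i.d. Bernoulli(p = 0.78) draws
p = 0.78
dist = {}
for y1, y2 in product([0, 1], repeat=2):
    prob = (p if y1 else 1 - p) * (p if y2 else 1 - p)
    ybar = (y1 + y2) / 2
    dist[ybar] = dist.get(ybar, 0.0) + prob

# {0.0: 0.0484, 0.5: 0.3432, 1.0: 0.6084}
print({k: round(v, 4) for k, v in sorted(dist.items())})
```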
1/2/3-34
The sampling distribution of Ȳ when Y is Bernoulli (p = .78):
1/2/3-35
Things we want to know about the sampling distribution:
1/2/3-36
The mean and variance of the sampling distribution of Ȳ
General case – that is, for Yᵢ i.i.d. from any distribution, not
just Bernoulli:
mean: E(Ȳ) = E((1/n) Σᵢ Yᵢ) = (1/n) Σᵢ E(Yᵢ) = (1/n) Σᵢ μ_Y = μ_Y
1/2/3-38
Mean and variance of sampling distribution of Ȳ, ctd.
E(Ȳ) = μ_Y
var(Ȳ) = σ_Y²/n
Implications:
1. Ȳ is an unbiased estimator of μ_Y (that is, E(Ȳ) = μ_Y)
2. var(Ȳ) is inversely proportional to n
the spread of the sampling distribution is
proportional to 1/√n
Thus the sampling uncertainty associated with Ȳ is
proportional to 1/√n (larger samples, less
uncertainty, but square-root law)
1/2/3-39
The sampling distribution of Ȳ when n is large
1/2/3-40
The Law of Large Numbers:
An estimator is consistent if the probability that it falls
within an interval of the true population value tends to one
as the sample size increases.
If (Y₁,…,Yₙ) are i.i.d. and σ_Y² < ∞, then Ȳ is a consistent
estimator of μ_Y, that is,
Pr[|Ȳ – μ_Y| < ε] → 1 as n → ∞,
which can be written Ȳ →p μ_Y
("Ȳ →p μ_Y" means "Ȳ converges in probability to μ_Y").
(The math: as n → ∞, var(Ȳ) = σ_Y²/n → 0, which implies that
Pr[|Ȳ – μ_Y| < ε] → 1.)
1/2/3-41
The Central Limit Theorem (CLT):
If (Y₁,…,Yₙ) are i.i.d. and 0 < σ_Y² < ∞, then when n is
large the distribution of Ȳ is well approximated by a
normal distribution.
Ȳ is approximately distributed N(μ_Y, σ_Y²/n) ("normal
distribution with mean μ_Y and variance σ_Y²/n")
√n·(Ȳ – μ_Y)/σ_Y is approximately distributed N(0,1)
(standard normal)
That is, "standardized" Ȳ = (Ȳ – E(Ȳ))/√var(Ȳ) = (Ȳ – μ_Y)/(σ_Y/√n) is
approximately distributed as N(0,1)
The larger is n, the better is the approximation.
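A small Monte Carlo illustrates the CLT claim (a Python sketch; the sample size and replication count below are arbitrary choices, not from the slides):

```python
import random
import statistics

# Standardized sample means of Bernoulli(0.78) draws should be roughly N(0,1)
random.seed(0)
p, n, reps = 0.78, 100, 2000
mu, sigma = p, (p * (1 - p)) ** 0.5

z = []
for _ in range(reps):
    ybar = sum(random.random() < p for _ in range(n)) / n
    z.append((ybar - mu) / (sigma / n ** 0.5))

# The standardized means should have mean near 0 and std. dev. near 1
print(round(statistics.mean(z), 2), round(statistics.stdev(z), 2))
```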
1/2/3-42
Sampling distribution of Ȳ when Y is Bernoulli, p = 0.78:
1/2/3-43
Same example: sampling distribution of (Ȳ – E(Ȳ))/√var(Ȳ):
1/2/3-44
Summary: The Sampling Distribution of Ȳ
For Y₁,…,Yₙ i.i.d. with 0 < σ_Y² < ∞,
The exact (finite sample) sampling distribution of Ȳ has
mean μ_Y ("Ȳ is an unbiased estimator of μ_Y") and variance
σ_Y²/n
Other than its mean and variance, the exact distribution of
Ȳ is complicated and depends on the distribution of Y (the
population distribution)
When n is large, the sampling distribution simplifies:
o Ȳ →p μ_Y (Law of large numbers)
o (Ȳ – E(Ȳ))/√var(Ȳ) is approximately N(0,1) (CLT)
1/2/3-45
(b) Why Use Ȳ To Estimate μ_Y?
Ȳ is unbiased: E(Ȳ) = μ_Y
Ȳ is consistent: Ȳ →p μ_Y
Ȳ is the "least squares" estimator of μ_Y; Ȳ solves
min_m Σᵢ (Yᵢ – m)²
(Setting the derivative to zero: d/dm Σᵢ (Yᵢ – m)² = –2 Σᵢ (Yᵢ – m) = 0,
whose solution is m = Ȳ.)
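The least-squares property of the sample mean is easy to check numerically (illustrative numbers; any sample works):

```python
# The sample mean minimizes the sum of squared deviations sum((Yi - m)^2)
y = [657.4, 650.0, 662.1, 648.3, 655.0]
ybar = sum(y) / len(y)

def ssq(m):
    return sum((yi - m) ** 2 for yi in y)

# Sum of squares at the mean is strictly smaller than at nearby candidates
candidates = [ybar - 1.0, ybar - 0.1, ybar + 0.1, ybar + 1.0]
print(all(ssq(ybar) < ssq(m) for m in candidates))  # True
```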
1/2/3-47
1. The probability framework for statistical inference
2. Estimation
3. Hypothesis Testing
4. Confidence intervals
Hypothesis Testing
The hypothesis testing problem (for the mean): make a
provisional decision, based on the evidence at hand, whether
a null hypothesis is true, or instead that some alternative
hypothesis is true. That is, test
H₀: E(Y) = μ_Y,0 vs. H₁: E(Y) > μ_Y,0 (1-sided, >)
H₀: E(Y) = μ_Y,0 vs. H₁: E(Y) < μ_Y,0 (1-sided, <)
H₀: E(Y) = μ_Y,0 vs. H₁: E(Y) ≠ μ_Y,0 (2-sided)
1/2/3-48
Some terminology for testing statistical hypotheses:
1/2/3-52
Estimator of the variance of Y:
s_Y² = (1/(n – 1)) Σᵢ (Yᵢ – Ȳ)² = "sample variance of Y"
Fact:
If (Y₁,…,Yₙ) are i.i.d. and E(Y⁴) < ∞, then s_Y² →p σ_Y²
1/2/3-53
Computing the p-value with σ_Y² estimated:
1/2/3-56
Comments on this recipe and the Student t-distribution
1/2/3-59
1/2/3-60
Comments on Student t distribution, ctd.
4. You might not know this. Consider the t-statistic testing
the hypothesis that two means (groups s, l) are equal:
t = (Ȳ_s – Ȳ_l) / √(s_s²/n_s + s_l²/n_l) = (Ȳ_s – Ȳ_l) / SE(Ȳ_s – Ȳ_l)
1/2/3-62
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence intervals
Confidence Intervals
A 95% confidence interval for μ_Y is an interval that contains
the true value of μ_Y in 95% of repeated samples.
{μ_Y: |(Ȳ – μ_Y)/(s_Y/√n)| ≤ 1.96}
= {μ_Y: –1.96 ≤ (Ȳ – μ_Y)/(s_Y/√n) ≤ 1.96}
= {μ_Y: –1.96·s_Y/√n ≤ Ȳ – μ_Y ≤ 1.96·s_Y/√n}
= {μ_Y ∈ (Ȳ – 1.96·s_Y/√n, Ȳ + 1.96·s_Y/√n)}
This confidence interval relies on the large-n results that Ȳ is
approximately normally distributed and s_Y² →p σ_Y².
1/2/3-64
Summary:
From the two assumptions of:
(1) simple random sampling of a population, that is,
{Yi, i =1,…,n} are i.i.d.
(2) 0 < E(Y⁴) < ∞
we developed, for large samples (large n):
Theory of estimation (sampling distribution of Y )
Theory of hypothesis testing (large-n distribution of t-
statistic and computation of the p-value)
Theory of confidence intervals (constructed by inverting
test statistic)
Are assumptions (1) & (2) plausible in practice? Yes
1/2/3-65
Let’s go back to the original policy question:
What is the effect on test scores of reducing STR by one
student/class?
Have we answered this question?
1/2/3-66
Introduction to Linear Regression
(SW Chapter 4)
1/2/3-67
What do data say about class sizes and test scores?
Variables:
5th grade test scores (Stanford-9 achievement test,
combined math and reading), district average
Student-teacher ratio (STR) = no. of students in the
district divided by no. of full-time equivalent teachers
1/2/3-68
An initial look at the California test score data:
1/2/3-69
Do districts with smaller classes (lower STR) have higher test
scores?
1/2/3-70
The class size/test score policy question:
What is the effect on test scores of reducing STR by
one student/class?
Object of policy interest: ΔTest score / ΔSTR
This is the slope of the line relating test score and STR
1/2/3-71
This suggests that we want to draw a line through the
Test Score v. STR scatterplot – but how?
1/2/3-72
Some Notation and Terminology
(Sections 4.1 and 4.2)
min_{b0,b1} Σᵢ [Yᵢ – (b₀ + b₁Xᵢ)]²
1/2/3-74
The OLS estimator solves: min_{b0,b1} Σᵢ [Yᵢ – (b₀ + b₁Xᵢ)]²
1/2/3-75
Why use OLS, rather than some other estimator?
OLS is a generalization of the sample average: if the
"line" is just an intercept (no X), then the OLS
estimator is just the sample average of Y₁,…,Yₙ (Ȳ).
Like Ȳ, the OLS estimator has some desirable
properties: under certain assumptions, it is unbiased
(that is, E(β̂₁) = β₁), and it has a tighter sampling
distribution than some other candidate estimators of
β₁ (more on this later)
Importantly, this is what everyone uses – the common
“language” of linear regression.
1/2/3-76
1/2/3-77
Application to the California Test Score – Class Size data
-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------
predicted TestScore = 698.9 – 2.28·STR
Population
population of interest (ex: all possible school districts)
Random variables: Y, X
Ex: (Test Score, STR)
1/2/3-83
The Population Linear Regression Model (Section 4.3)
1/2/3-88
Example: Assumption #1 and the class size example
Test Scoreᵢ = β₀ + β₁·STRᵢ + uᵢ, uᵢ = other factors
“Other factors:”
parental involvement
outside learning opportunities (extra math class,..)
home environment conducive to reading
family income is a useful proxy for many such factors
1/2/3-90
Least squares assumption #3:
E(X⁴) < ∞ and E(u⁴) < ∞
1/2/3-91
1. The probability framework for linear regression
2. Estimation: the Sampling Distribution of β̂₁
(Section 4.4)
3. Hypothesis Testing
4. Confidence intervals
1/2/3-92
The sampling distribution of β̂₁: some algebra:
Yᵢ = β₀ + β₁Xᵢ + uᵢ
Ȳ = β₀ + β₁X̄ + ū
so Yᵢ – Ȳ = β₁(Xᵢ – X̄) + (uᵢ – ū)
Thus,
β̂₁ = Σᵢ (Xᵢ – X̄)(Yᵢ – Ȳ) / Σᵢ (Xᵢ – X̄)²
= Σᵢ (Xᵢ – X̄)[β₁(Xᵢ – X̄) + (uᵢ – ū)] / Σᵢ (Xᵢ – X̄)²
1/2/3-93
β̂₁ = Σᵢ (Xᵢ – X̄)[β₁(Xᵢ – X̄) + (uᵢ – ū)] / Σᵢ (Xᵢ – X̄)²
= β₁·[Σᵢ (Xᵢ – X̄)(Xᵢ – X̄) / Σᵢ (Xᵢ – X̄)²] + Σᵢ (Xᵢ – X̄)(uᵢ – ū) / Σᵢ (Xᵢ – X̄)²
so
β̂₁ – β₁ = Σᵢ (Xᵢ – X̄)(uᵢ – ū) / Σᵢ (Xᵢ – X̄)²
1/2/3-94
We can simplify this formula by noting that:
Σᵢ (Xᵢ – X̄)(uᵢ – ū) = Σᵢ (Xᵢ – X̄)uᵢ – [Σᵢ (Xᵢ – X̄)]·ū = Σᵢ (Xᵢ – X̄)uᵢ.
Thus
β̂₁ – β₁ = [(1/n) Σᵢ vᵢ] / [((n – 1)/n)·s_X²], where vᵢ = (Xᵢ – X̄)uᵢ
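The identity can be confirmed on made-up data (all values below are illustrative, not from the California data set):

```python
# Generate Y from known beta0, beta1 and fixed "errors", then check that
# b1 - beta1 equals sum((Xi - Xbar) * ui) / sum((Xi - Xbar)^2)
x = [18.0, 20.0, 22.0, 19.0, 25.0, 17.0]
u = [1.5, -0.8, 0.3, 2.1, -1.9, -1.2]
beta0, beta1 = 700.0, -2.0
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

xbar = sum(x) / len(x)
ybar = sum(y) / len(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

rhs = sum((xi - xbar) * ui for xi, ui in zip(x, u)) / sxx
print(abs((b1 - beta1) - rhs) < 1e-9)  # True
```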
1/2/3-95
β̂₁ – β₁ = [(1/n) Σᵢ vᵢ] / [((n – 1)/n)·s_X²], where vᵢ = (Xᵢ – X̄)uᵢ
1/2/3-96
Now E(vᵢ/s_X²) = E[(Xᵢ – X̄)uᵢ/s_X²] = 0, so
E(β̂₁ – β₁) = [n/(n – 1)]·E[(1/n) Σᵢ vᵢ/s_X²] = 0,
so
E(β̂₁) = β₁
1/2/3-97
Calculation of the variance of β̂₁:
β̂₁ – β₁ = [(1/n) Σᵢ vᵢ] / [((n – 1)/n)·s_X²]
For large n, (n – 1)/n ≈ 1 and s_X² ≈ σ_X², so
var(β̂₁) ≈ var(v) / [n·(σ_X²)²]
1/2/3-98
The exact sampling distribution is complicated, but when
the sample size is large we get some simple (and good)
approximations:
(1) Because var(β̂₁) ∝ 1/n and E(β̂₁) = β₁, β̂₁ →p β₁
1/2/3-99
β̂₁ – β₁ = [(1/n) Σᵢ vᵢ] / [((n – 1)/n)·s_X²]
When n is large:
vᵢ = (Xᵢ – X̄)uᵢ ≈ (Xᵢ – μ_X)uᵢ, which is i.i.d. (why?) and
β̂₁ – β₁ ≈ [(1/n) Σᵢ vᵢ] / σ_X²,
which is approximately distributed N(0, σ_v²/(n·(σ_X²)²)).
That is, β̂₁ is approximately distributed N(β₁, var[(Xᵢ – μ_X)uᵢ]/(n·σ_X⁴))
1/2/3-101
Recall the summary of the sampling distribution of Ȳ:
For (Y₁,…,Yₙ) i.i.d. with 0 < σ_Y² < ∞,
The exact (finite sample) sampling distribution of Ȳ
has mean μ_Y ("Ȳ is an unbiased estimator of μ_Y") and
variance σ_Y²/n
Other than its mean and variance, the exact
distribution of Ȳ is complicated and depends on the
distribution of Y
Ȳ →p μ_Y (law of large numbers)
(Ȳ – E(Ȳ))/√var(Ȳ) is approximately distributed N(0,1) (CLT)
1/2/3-102
Parallel conclusions hold for the OLS estimator β̂₁:
Recall the t-statistic for testing the mean,
t = (Ȳ – μ_Y,0)/(s_Y/√n);
reject the null hypothesis if |t| > 1.96.
Applying the same idea to β₁ requires an estimator of the
variance of β̂₁. Recall that
var(β̂₁) = var[(Xᵢ – μ_X)uᵢ] / [n·(σ_X²)²] = σ_v²/(n·σ_X⁴)
The estimator of σ²_β̂₁ replaces the unknown population
quantities by estimators:
σ̂²_β̂₁ = (1/n) × (estimator of σ_v²)/(estimator of σ_X²)²
= (1/n) × [ (1/(n – 2)) Σᵢ (Xᵢ – X̄)²·ûᵢ² ] / [ (1/n) Σᵢ (Xᵢ – X̄)² ]²
1/2/3-110
σ̂²_β̂₁ = (1/n) × [ (1/(n – 2)) Σᵢ (Xᵢ – X̄)²·ûᵢ² ] / [ (1/n) Σᵢ (Xᵢ – X̄)² ]²
1/2/3-111
Return to calculation of the t-statistic:
1/2/3-112
Example: Test Scores and STR, California data
Estimated regression line: predicted TestScore = 698.9 – 2.28·STR
(10.4) (0.52)
1/2/3-117
OLS regression: STATA output
1/2/3-119
Yᵢ = β₀ + β₁Xᵢ + uᵢ, where X is binary (Xᵢ = 0 or 1):
When Xᵢ = 0: Yᵢ = β₀ + uᵢ
When Xᵢ = 1: Yᵢ = β₀ + β₁ + uᵢ
thus:
When Xᵢ = 0, the mean of Yᵢ is β₀
When Xᵢ = 1, the mean of Yᵢ is β₀ + β₁
that is:
E(Yᵢ|Xᵢ = 0) = β₀
E(Yᵢ|Xᵢ = 1) = β₀ + β₁
so:
β₁ = E(Yᵢ|Xᵢ = 1) – E(Yᵢ|Xᵢ = 0)
= population difference in group means
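This equivalence can be verified on a toy sample (made-up numbers; Python for illustration):

```python
# OLS slope on a binary regressor equals the difference in group means
x = [0, 0, 0, 1, 1, 1, 1]
y = [648.0, 652.0, 650.0, 656.0, 659.0, 657.0, 658.0]

xbar = sum(x) / len(x)
ybar = sum(y) / len(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)

mean0 = sum(yi for xi, yi in zip(x, y) if xi == 0) / x.count(0)
mean1 = sum(yi for xi, yi in zip(x, y) if xi == 1) / x.count(1)
print(round(b1, 6), round(mean1 - mean0, 6))  # both 7.5
```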
1/2/3-120
Example: TestScore and STR, California data
Let
Dᵢ = 1 if STRᵢ < 20, Dᵢ = 0 if STRᵢ ≥ 20
predicted TestScore = 650.0 + 7.4·D
(1.3) (1.8)
Yᵢ = β₀ + β₁Xᵢ + uᵢ
1/2/3-124
The R²
Write Yᵢ as the sum of the OLS prediction + OLS
residual:
Yᵢ = Ŷᵢ + ûᵢ
R² = ESS/TSS,
where ESS = Σᵢ (Ŷᵢ – Ȳ)² and TSS = Σᵢ (Yᵢ – Ȳ)².
1/2/3-125
R² = ESS/TSS, where ESS = Σᵢ (Ŷᵢ – Ȳ)² and TSS = Σᵢ (Yᵢ – Ȳ)²
The R²:
R² = 0 means ESS = 0, so X explains none of the
variation of Y
R² = 1 means ESS = TSS, so Y = Ŷ so X explains all of
the variation of Y
0 ≤ R² ≤ 1
For regression with a single regressor (the case here),
R² is the square of the correlation coefficient between
X and Y
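The last bullet can be checked directly (illustrative data, not the California numbers; Python sketch):

```python
import math

# For one regressor, R^2 = ESS/TSS equals the squared correlation of X and Y
x = [18.0, 20.0, 22.0, 19.0, 25.0, 17.0]
y = [663.0, 660.0, 654.0, 661.5, 649.0, 666.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)

b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

ess = sum((yh - ybar) ** 2 for yh in yhat)  # mean of yhat equals ybar
r2 = ess / syy
corr = sxy / math.sqrt(sxx * syy)
print(abs(r2 - corr ** 2) < 1e-9)  # True
```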
1/2/3-126
The Standard Error of the Regression (SER)
SER = √[ (1/(n – 2)) Σᵢ (ûᵢ – û̄)² ] = √[ (1/(n – 2)) Σᵢ ûᵢ² ]
(the second equality holds because the OLS residuals have
mean zero: û̄ = (1/n) Σᵢ ûᵢ = 0).
1/2/3-127
SER = √[ (1/(n – 2)) Σᵢ ûᵢ² ]
The SER:
has the units of u, which are the units of Y
measures the spread of the distribution of u
measures the average "size" of the OLS residual (the
average "mistake" made by the OLS regression line)
The root mean squared error (RMSE) is closely
related to the SER:
RMSE = √[ (1/n) Σᵢ ûᵢ² ]
[Figure: scatterplot of average hourly earnings vs. years of education, with the OLS regression line]
1/2/3-134
Is heteroskedasticity present in the class size data?
1/2/3-135
So far we have (without saying so) allowed u to be
heteroskedastic – the variance estimator
σ̂²_β̂₁ = (1/n) × [ (1/(n – 2)) Σᵢ (Xᵢ – X̄)²·ûᵢ² ] / [ (1/n) Σᵢ (Xᵢ – X̄)² ]²
is valid whether or not u is heteroskedastic (it is the
heteroskedasticity-robust variance estimator).
1/2/3-141
Summary and Assessment (Section 4.10)
The initial policy question:
Suppose new teachers are hired so the student-
teacher ratio falls by one student per class. What
is the effect of this policy intervention (this
“treatment”) on test scores?
Does our regression analysis give a convincing answer?
Not really – districts with low STR tend to be ones
with lots of other resources and higher income
families, which provide kids with more learning
opportunities outside school… this suggests that
corr(uᵢ, STRᵢ) < 0, so E(uᵢ|Xᵢ) ≠ 0.
1/2/3-142
Digression on Causality
1/2/3-146
Omitted Variable Bias
(SW Section 5.1)
1/2/3-147
In the test score example:
1. English language ability (whether the student has
English as a second language) plausibly affects
standardized test scores: Z is a determinant of Y.
2. Immigrant communities tend to be less affluent and
thus have smaller school budgets – and higher STR:
Z is correlated with X.
1/2/3-148
A formula for omitted variable bias: recall the equation,
β̂₁ – β₁ = [(1/n) Σᵢ (Xᵢ – X̄)uᵢ] / [((n – 1)/n)·s_X²] = [(1/n) Σᵢ vᵢ] / [((n – 1)/n)·s_X²]
1/2/3-149
Then
β̂₁ – β₁ = [(1/n) Σᵢ (Xᵢ – X̄)uᵢ] / [((n – 1)/n)·s_X²]
so
E(β̂₁) – β₁ = E[ Σᵢ (Xᵢ – X̄)uᵢ / Σᵢ (Xᵢ – X̄)² ] ≈ σ_Xu/σ_X²
and, writing σ_Xu = ρ_Xu·σ_X·σ_u,
β̂₁ →p β₁ + ρ_Xu·(σ_u/σ_X), where ρ_Xu = corr(X,u)
1/2/3-150
Omitted variable bias formula: β̂₁ →p β₁ + ρ_Xu·(σ_u/σ_X).
If an omitted factor Z is both:
(1) a determinant of Y (that is, it is contained in u); and
(2) correlated with X,
then ρ_Xu ≠ 0 and the OLS estimator β̂₁ is biased.
The math makes precise the idea that districts with few
ESL students (1) do better on standardized tests and (2)
have smaller classes (bigger budgets), so ignoring the
ESL factor results in overstating the class size effect.
Is this actually going on in the CA data?
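The probability-limit formula can be illustrated by simulation (all parameter values below are made up; a Python sketch, not the course's STATA):

```python
import random

# If corr(X, u) = rho, then b1 -> beta1 + rho * (sigma_u / sigma_x):
# with beta1 = 2, rho = 0.5, sigma_u = sigma_x = 1, the plim is 2.5
random.seed(1)
n = 100_000
beta0, beta1 = 10.0, 2.0
rho, sigma_x, sigma_u = 0.5, 1.0, 1.0

x, u = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x.append(sigma_x * z1)
    u.append(sigma_u * (rho * z1 + (1 - rho**2) ** 0.5 * z2))
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

xbar = sum(x) / n
ybar = sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
print(round(b1, 1))  # close to 2.5, not the true beta1 = 2.0
```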
1/2/3-151
Districts with fewer English Learners have higher test scores
Districts with lower percent EL (PctEL) have smaller classes
Among districts with comparable PctEL, the effect of class
size is small (recall overall “test score gap” = 7.4)
1/2/3-152
Three ways to overcome omitted variable bias
1. Run a randomized controlled experiment in which
treatment (STR) is randomly assigned: then PctEL is
still a determinant of TestScore, but PctEL is
uncorrelated with STR. (But this is unrealistic in
practice.)
2. Adopt the “cross tabulation” approach, with finer
gradations of STR and PctEL (But soon we will run
out of data, and what about other determinants like
family income and parental education?)
3. Use a method in which the omitted variable (PctEL) is
no longer omitted: include PctEL as an additional
regressor in a multiple regression.
1/2/3-153
The Population Multiple Regression Model
(SW Section 5.2)
Y = β₀ + β₁X₁ + β₂X₂
1/2/3-155
Consider changing X₁ by ΔX₁ while holding X₂ constant:
Before: Y = β₀ + β₁X₁ + β₂X₂
After: Y + ΔY = β₀ + β₁(X₁ + ΔX₁) + β₂X₂
Difference: ΔY = β₁ΔX₁
That is,
β₁ = ΔY/ΔX₁, holding X₂ constant
also,
β₂ = ΔY/ΔX₂, holding X₁ constant
and
β₀ = predicted value of Y when X₁ = X₂ = 0.
1/2/3-156
The OLS Estimator in Multiple Regression
(SW Section 5.3)
min_{b0,b1,b2} Σᵢ [Yᵢ – (b₀ + b₁X₁ᵢ + b₂X₂ᵢ)]²
For comparison, the single-regressor estimate was:
predicted TestScore = 698.9 – 2.28·STR
------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -1.101296 .4328472 -2.54 0.011 -1.95213 -.2504616
pctel | -.6497768 .0310318 -20.94 0.000 -.710775 -.5887786
_cons | 686.0322 8.728224 78.60 0.000 668.8754 703.189
------------------------------------------------------------------------------
1/2/3-160
Assumption #1: the conditional mean of u given the
included X’s is zero.
1/2/3-161
Assumption #2: (X1i,…,Xki,Yi), i =1,…,n, are i.i.d.
This is satisfied automatically if the data are collected
by simple random sampling.
1/2/3-162
Assumption #4: There is no perfect multicollinearity
Perfect multicollinearity is when one of the regressors is
an exact linear function of the other regressors.
1/2/3-167
Tests of Joint Hypotheses
(SW Section 5.7)
H₀: β₁ = 0 and β₂ = 0
vs. H₁: either β₁ ≠ 0 or β₂ ≠ 0 or both
1/2/3-168
TestScoreᵢ = β₀ + β₁STRᵢ + β₂Expnᵢ + β₃PctELᵢ + uᵢ
H₀: β₁ = 0 and β₂ = 0
vs. H₁: either β₁ ≠ 0 or β₂ ≠ 0 or both
Two Solutions:
Use a different critical value in this procedure – not
1.96 (this is the "Bonferroni method" – see App. 5.3)
Use a different test statistic that tests both β₁ and β₂ at
once: the F-statistic.
1/2/3-172
The F-statistic
The F-statistic tests all parts of a joint hypothesis at once.
F = (1/2)·(t₁² + t₂² – 2·ρ̂_t1,t2·t₁·t₂) / (1 – ρ̂²_t1,t2)
where ρ̂_t1,t2 estimates the correlation between t₁ and t₂.
1/2/3-174
Large-sample distribution of the F-statistic
Consider the special case that t₁ and t₂ are independent, so
ρ̂_t1,t2 →p 0; in large samples the formula becomes
F = (1/2)·(t₁² + t₂² – 2·ρ̂_t1,t2·t₁·t₂) / (1 – ρ̂²_t1,t2) ≈ (1/2)·(t₁² + t₂²)
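A quick numeric sanity check of this special case (the t-values below are made up):

```python
# With rho = 0 the general F formula collapses to (t1^2 + t2^2) / 2
t1, t2 = 2.2, 1.5
rho = 0.0

f_general = 0.5 * (t1**2 + t2**2 - 2 * rho * t1 * t2) / (1 - rho**2)
f_special = 0.5 * (t1**2 + t2**2)
print(round(f_general, 3), f_general == f_special)  # 3.545 True
```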
Implementation in STATA
Use the “test” command after the regression
1/2/3-177
F-test example, California class size data:
reg testscr str expn_stu pctel, r;
------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -.2863992 .4820728 -0.59 0.553 -1.234001 .661203
expn_stu | .0038679 .0015807 2.45 0.015 .0007607 .0069751
pctel | -.6560227 .0317844 -20.64 0.000 -.7185008 -.5935446
_cons | 649.5779 15.45834 42.02 0.000 619.1917 679.9641
------------------------------------------------------------------------------
1/2/3-181
By how much must the R2 increase for the coefficients on
Expn and PctEL to be judged statistically significant?
F = [ (R²_unrestricted – R²_restricted) / q ] / [ (1 – R²_unrestricted) / (n – k_unrestricted – 1) ]
where:
R²_restricted = the R² for the restricted regression
R²_unrestricted = the R² for the unrestricted regression
q = the number of restrictions under the null
k_unrestricted = the number of regressors in the
unrestricted regression.
1/2/3-182
Example:
Restricted regression:
predicted TestScore = 644.7 – 0.671·PctEL, R²_restricted = 0.4149
(1.0) (0.032)
Unrestricted regression:
predicted TestScore = 649.6 – 0.29·STR + 3.87·Expn – 0.656·PctEL
(15.5) (0.48) (1.59) (0.032)
R²_unrestricted = 0.4366, k_unrestricted = 3, q = 2
so:
F = [ (R²_unrestricted – R²_restricted)/q ] / [ (1 – R²_unrestricted)/(n – k_unrestricted – 1) ]
= [ (.4366 – .4149)/2 ] / [ (1 – .4366)/(420 – 3 – 1) ] = 8.01
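The same F computation in a few lines (numbers taken from the slide; Python for illustration):

```python
# F-statistic from restricted/unrestricted R^2 values
r2_unrestricted, r2_restricted = 0.4366, 0.4149
q, n, k_unrestricted = 2, 420, 3

f = ((r2_unrestricted - r2_restricted) / q) / \
    ((1 - r2_unrestricted) / (n - k_unrestricted - 1))
print(round(f, 2))  # 8.01
```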
1/2/3-183
The homoskedasticity-only F-statistic
F = [ (R²_unrestricted – R²_restricted)/q ] / [ (1 – R²_unrestricted)/(n – k_unrestricted – 1) ]
If:
1. u₁,…,uₙ are normally distributed; and
2. Xᵢ is distributed independently of uᵢ (so in
particular uᵢ is homoskedastic)
then this F-statistic has an exact F_q,n–k–1 sampling distribution.
1/2/3-185
The F_q,n–k–1 distribution:
The F distribution is tabulated many places
When n gets large the F_q,n–k–1 distribution asymptotes
to the χ²_q/q distribution:
F_q,∞ is another name for χ²_q/q
1/2/3-188
Summary: the homoskedasticity-only ("rule of
thumb") F-statistic and the F distribution
These are justified only under very strong conditions
– stronger than are realistic in practice.
Yet, they are widely used.
You should use the heteroskedasticity-robust F-
statistic, with χ²_q/q (that is, F_q,∞) critical values.
For n ≥ 100, the F-distribution essentially is the χ²_q/q
distribution.
For small n, the F distribution isn't necessarily a
"better" approximation to the sampling distribution of
the F-statistic – only if the strong conditions are true.
1/2/3-189
Summary: testing joint hypotheses
The “common-sense” approach of rejecting if either
of the t-statistics exceeds 1.96 rejects more than 5% of
the time under the null (the size exceeds the desired
significance level)
The heteroskedasticity-robust F-statistic is built in to
STATA (“test” command); this tests all q restrictions
at once.
For n large, F is distributed as χ²_q/q (= F_q,∞)
The homoskedasticity-only F-statistic is important
historically (and thus in practice), and is intuitively
appealing, but invalid when there is heteroskedasticity
1/2/3-190
Testing Single Restrictions on Multiple Coefficients
(SW Section 5.8)
1/2/3-191
Two methods for testing single restrictions on multiple
coefficients:
1/2/3-192
Method 1: Rearrange ("transform") the regression
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + uᵢ
H₀: β₁ = β₂ vs. H₁: β₁ ≠ β₂
1/2/3-196
The coverage rate of a confidence set is the probability
that the confidence set contains the true parameter values
1/2/3-197
Coverage rate of the "common sense" confidence set:
Pr[(β₁, β₂) ∈ { β̂₁ ± 1.96·SE(β̂₁), β̂₂ ± 1.96·SE(β̂₂) }]
1/2/3-200
The confidence set based on the F-statistic is an ellipse:
{β₁, β₂: F = (1/2)·(t₁² + t₂² – 2·ρ̂_t1,t2·t₁·t₂)/(1 – ρ̂²_t1,t2) ≤ 3.00}
Now
F = [1/(2(1 – ρ̂²_t1,t2))] × [t₁² + t₂² – 2·ρ̂_t1,t2·t₁·t₂]
= [1/(2(1 – ρ̂²_t1,t2))] × [ ((β̂₁ – β₁,₀)/SE(β̂₁))² + ((β̂₂ – β₂,₀)/SE(β̂₂))²
– 2·ρ̂_t1,t2·(β̂₁ – β₁,₀)(β̂₂ – β₂,₀)/(SE(β̂₁)·SE(β̂₂)) ]
This is a quadratic form in β₁,₀ and β₂,₀ – thus the
boundary of the set F = 3.00 is an ellipse.
1/2/3-201
Confidence set based on inverting the F-statistic
1/2/3-202
The R², SER, and R̄² for Multiple Regression
(SW Section 5.10)
SER = √[ (1/(n – k – 1)) Σᵢ ûᵢ² ]
1/2/3-203
The R² is the fraction of the variance explained:
R² = ESS/TSS = 1 – SSR/TSS,
where ESS = Σᵢ (Ŷᵢ – Ȳ)², SSR = Σᵢ ûᵢ², and TSS = Σᵢ (Yᵢ – Ȳ)²
– just as for regression with one regressor.
R̄² = 1 – [(n – 1)/(n – k – 1)]·(SSR/TSS), so R̄² < R²
1/2/3-204
How to interpret the R² and R̄²?
A high R² (or R̄²) means that the regressors explain
the variation in Y.
A high R² (or R̄²) does not mean that you have
eliminated omitted variable bias.
A high R² (or R̄²) does not mean that you have an
unbiased estimator of a causal effect (β₁).
A high R² (or R̄²) does not mean that the included
variables are statistically significant – this must be
determined using hypothesis tests.
1/2/3-205
Example: A Closer Look at the Test Score Data
(SW Section 5.11, 5.12)
1/2/3-206
Variables we would like to see in the California data set:
School characteristics:
student-teacher ratio
teacher quality
computers (non-teaching resources) per student
measures of curriculum design…
Student characteristics:
English proficiency
availability of extracurricular enrichment
home learning environment
parent’s education level…
1/2/3-207
1/2/3-208
Variables actually in the California class size data set:
student-teacher ratio (STR)
percent English learners in the district (PctEL)
percent eligible for subsidized/free lunch
percent on public income assistance
average district income
1/2/3-209
A look at more of the California data
1/2/3-210
Digression: presentation of regression results in a table
Listing regressions in “equation” form can be
cumbersome with many regressors and many regressions
Tables of regression results can present the key
information compactly
Information to include:
variables in the regression (dependent and
independent)
estimated coefficients
standard errors
results of F-tests of pertinent joint hypotheses
some measure of fit
number of observations
1/2/3-211
1/2/3-212
Summary: Multiple Regression
1/2/3-213
Nonlinear Regression Functions
(SW Ch. 6)
1/2/3-214
The TestScore – STR relation looks approximately
linear…
1/2/3-215
But the TestScore – average district income relation
looks like it is nonlinear.
1/2/3-216
If a relation between Y and X is nonlinear:
The effect on Y of a change in X depends on the value
of X – that is, the marginal effect of X is not constant
A linear regression is mis-specified – the functional
form is wrong
The estimator of the effect on Y of X is biased – it
needn’t even be right on average.
The solution to this is to estimate a regression
function that is nonlinear in X
1/2/3-217
The General Nonlinear Population Regression Function
Assumptions
1. E(ui| X1i,X2i,…,Xki) = 0 (same); implies that f is the
conditional expectation of Y given the X’s.
2. (X1i,…,Xki,Yi) are i.i.d. (same).
3. “enough” moments exist (same idea; the precise
statement depends on specific f).
4. No perfect multicollinearity (same idea; the precise
statement depends on the specific f).
1/2/3-218
1/2/3-219
Nonlinear Functions of a Single Independent Variable
(SW Section 6.2)
1/2/3-220
1. Polynomials in X
Approximate the population regression function by a
polynomial:
Yᵢ = β₀ + β₁Xᵢ + β₂Xᵢ² + … + β_r Xᵢʳ + uᵢ
1/2/3-221
Example: the TestScore – Income relation
Incomeᵢ = average district income in the iᵗʰ district
(thousand dollars per capita)
Quadratic specification:
TestScoreᵢ = β₀ + β₁Incomeᵢ + β₂(Incomeᵢ)² + uᵢ
Cubic specification:
TestScoreᵢ = β₀ + β₁Incomeᵢ + β₂(Incomeᵢ)² + β₃(Incomeᵢ)³ + uᵢ
------------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avginc | 3.850995 .2680941 14.36 0.000 3.32401 4.377979
avginc2 | -.0423085 .0047803 -8.85 0.000 -.051705 -.0329119
_cons | 607.3017 2.901754 209.29 0.000 601.5978 613.0056
------------------------------------------------------------------------------
1/2/3-224
Interpreting the estimated regression function:
(a) Compute “effects” for different values of X
1/2/3-225
predicted TestScore = 607.3 + 3.85·Incomeᵢ – 0.0423·(Incomeᵢ)²
test avginc2 avginc3; Execute the test command after running the regression
( 1) avginc2 = 0.0
( 2) avginc3 = 0.0
F( 2, 416) = 37.69
Prob > F = 0.0000
Here's why: ln(x + Δx) – ln(x) = ln(1 + Δx/x) ≈ Δx/x
(calculus: d ln(x)/dx = 1/x)
Numerically:
ln(1.01) = .00995 ≈ .01; ln(1.10) = .0953 ≈ .10 (sort of)
1/2/3-230
Three cases:
1/2/3-231
I. Linear-log population regression function
Yᵢ = β₀ + β₁ln(Xᵢ) + uᵢ
now ln(X + ΔX) – ln(X) ≈ ΔX/X,
so ΔY ≈ β₁·(ΔX/X)
or β₁ ≈ ΔY/(ΔX/X) (small ΔX)
1/2/3-232
Linear-log case, continued
Yᵢ = β₀ + β₁ln(Xᵢ) + uᵢ
Now 100·(ΔX/X) = percentage change in X, so a 1%
increase in X (multiplying X by 1.01) is associated with
a .01β₁ change in Y.
1/2/3-233
Example: TestScore vs. ln(Income)
First defining the new regressor, ln(Income)
The model is now linear in ln(Income), so the linear-log
model can be estimated by OLS:
predicted TestScore = 557.8 + 36.42·ln(Incomeᵢ)
(3.8) (1.40)
1/2/3-235
II. Log-linear population regression function
ln(Yᵢ) = β₀ + β₁Xᵢ + uᵢ
so ΔY/Y ≈ β₁·ΔX
or β₁ ≈ (ΔY/Y)/ΔX (small ΔX)
1/2/3-236
Log-linear case, continued
ln(Yᵢ) = β₀ + β₁Xᵢ + uᵢ
for small ΔX, β₁ ≈ (ΔY/Y)/ΔX
Now 100·(ΔY/Y) = percentage change in Y, so a change
in X by one unit (ΔX = 1) is associated with a 100·β₁%
change in Y (Y increases by a factor of 1 + β₁).
Note: What are the units of uᵢ and the SER?
o fractional (proportional) deviations
o for example, SER = .2 means…
1/2/3-237
III. Log-log population regression function
ln(Yi) = β0 + β1ln(Xi) + ui
so ΔY/Y ≈ β1(ΔX/X)
or β1 ≈ (ΔY/Y) / (ΔX/X) (small ΔX)
1/2/3-238
Log-log case, continued
ln(Yi) = β0 + β1ln(Xi) + ui
ln(TestScore)-hat = 6.336 + 0.0554ln(Incomei)
                   (0.006)  (0.0021)
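In the log-log fit, 0.0554 is an elasticity: a 1% increase in income is associated with a 0.0554% increase in test scores. A check of that interpretation (a sketch, not part of the slides; the income level of 20 is arbitrary):

```python
import math

b0, b1 = 6.336, 0.0554                     # estimated log-log coefficients

def predicted_score(income):
    return math.exp(b0 + b1 * math.log(income))

y0, y1 = predicted_score(20), predicted_score(20 * 1.01)
pct_change = 100 * (y1 / y0 - 1)           # percent change in predicted score
print(pct_change)                          # about 0.055 percent, independent of the level
assert abs(pct_change - b1) < 0.001        # a 1% income rise -> roughly beta1 percent
```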
1/2/3-240
Neither specification seems to fit as well as the cubic or linear-log
1/2/3-241
Summary: Logarithmic transformations
(b) Interactions between a binary and a continuous variable
Yi = β0 + β1Di + β2Xi + ui
Di is binary, X is continuous
As specified above, the effect on Y of X (holding constant D) = β2, which does not depend on D
To allow the effect of X to depend on D, include the “interaction term” Di×Xi as a regressor:
Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui
1/2/3-247
Interpreting the coefficients
Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui
When HiEL = 0:
TestScore-hat = 682.2 – 0.97STR
When HiEL = 1,
TestScore-hat = 682.2 – 0.97STR + 5.6 – 1.28STR
             = 687.8 – 2.25STR
Two regression lines: one for each HiEL group.
Class size reduction is estimated to have a larger effect
when the percent of English learners is large.
1/2/3-249
Example, ctd.
TestScore-hat = 682.2 – 0.97STR + 5.6HiEL – 1.28(STR×HiEL)
               (11.9)   (0.59)    (19.5)     (0.97)
When Di = 0:  Yi = β0 + β2Xi + ui
When Di = 1:  Yi = β0 + β1 + β2Xi + β3Xi + ui
             = (β0+β1) + (β2+β3)Xi + ui
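The two lines implied by the estimated interaction model can be read off the coefficients by differencing. A sketch using the estimates from this example (not part of the slides):

```python
# Estimates: TestScore-hat = 682.2 - 0.97*STR + 5.6*HiEL - 1.28*(STR*HiEL)
b0, b_str, b_hiel, b_inter = 682.2, -0.97, 5.6, -1.28

def predicted(str_, hiel):
    return b0 + b_str * str_ + b_hiel * hiel + b_inter * str_ * hiel

# Slope and intercept for each HiEL group:
slope_0 = predicted(21, 0) - predicted(20, 0)   # -0.97
slope_1 = predicted(21, 1) - predicted(20, 1)   # -0.97 - 1.28 = -2.25
intercept_1 = predicted(0, 1)                   # 682.2 + 5.6 = 687.8
print(slope_0, slope_1, intercept_1)
assert abs(slope_0 - (-0.97)) < 1e-9
assert abs(slope_1 - (-2.25)) < 1e-9
assert abs(intercept_1 - 687.8) < 1e-6
```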
1/2/3-252
1/2/3-253
(c) Interactions between two continuous variables
Yi = β0 + β1X1i + β2X2i + ui
1/2/3-254
Coefficients in continuous-continuous interactions
Yi = β0 + β1X1i + β2X2i + β3(X1i×X2i) + ui
General rule: compare the various cases
Y = β0 + β1X1 + β2X2 + β3(X1X2) (b)
Now change X1:
Y + ΔY = β0 + β1(X1+ΔX1) + β2X2 + β3[(X1+ΔX1)X2] (a)
Subtract (a) – (b):
ΔY = β1ΔX1 + β3X2ΔX1, or ΔY/ΔX1 = β1 + β3X2
The effect of X1 depends on X2 (what we wanted)
β3 = increment to the effect of X1 from a unit change in X2
1/2/3-255
Example: TestScore, STR, PctEL
1/2/3-256
Example, ctd: hypothesis tests
TestScore-hat = 686.3 – 1.12STR – 0.67PctEL + .0012(STR×PctEL)
               (11.8)   (0.59)    (0.37)      (0.019)
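With a continuous interaction, the estimated per-unit effect of STR depends on PctEL: ΔY/ΔSTR = –1.12 + .0012PctEL. A sketch with these estimates (not part of the slides):

```python
# Estimates: TestScore-hat = 686.3 - 1.12*STR - 0.67*PctEL + .0012*(STR*PctEL)
b_str, b_inter = -1.12, 0.0012

def effect_of_str(pct_el):
    # per-unit effect of STR, holding PctEL fixed: beta1 + beta3*PctEL
    return b_str + b_inter * pct_el

print(effect_of_str(0))    # -1.12  at PctEL = 0
print(effect_of_str(20))   # -1.096 at PctEL = 20 (a tiny difference here)
assert abs(effect_of_str(20) - (-1.096)) < 1e-9
```

The tiny interaction coefficient (relative to its SE of .019) is why the hypothesis tests on the next slides matter.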
1/2/3-260
The TestScore – Income relation
1/2/3-262
Question #1:
Investigate by considering a polynomial in STR
1/2/3-263
Interpreting the regression function via plots
(preceding regression is labeled (5) in this figure)
1/2/3-264
Are the higher order terms in STR statistically
significant?
– .411LunchPCT + 12.12ln(Income)
(.029) (1.80)
1/2/3-270
Tests of joint hypotheses:
1/2/3-271
Summary: Nonlinear Regression Functions
1/2/3-273
It can handle nonlinear relations (effects that vary
with the X’s)
Still, OLS might yield a biased estimator of the true
causal effect.
A Framework for Assessing Statistical Studies
1/2/3-275
Threats to External Validity
1/2/3-280
In general, measurement error in a regressor results in
“errors-in-variables” bias.
Illustration: suppose
Yi = β0 + β1Xi + ui
Let
Xi = unmeasured true value of X
X̃i = imprecisely measured version of X
1/2/3-281
Then
Yi = β0 + β1Xi + ui
   = β0 + β1X̃i + [β1(Xi – X̃i) + ui]
or
Yi = β0 + β1X̃i + ũi, where ũi = β1(Xi – X̃i) + ui
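With classical measurement error (noise uncorrelated with the true X), OLS on the mismeasured regressor is attenuated toward zero by the factor var(X)/(var(X)+var(w)). A simulation sketch (hypothetical DGP, not from the slides):

```python
import random

random.seed(0)
n, beta1 = 20000, 1.0
x_true = [random.gauss(0, 1) for _ in range(n)]         # unmeasured true X, var = 1
x_meas = [x + random.gauss(0, 1) for x in x_true]       # measured X = X + noise, var(w) = 1
y = [beta1 * x + random.gauss(0, 0.5) for x in x_true]  # Y depends on the TRUE X

def ols_slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Regressing Y on mismeasured X attenuates the slope by roughly
# var(X) / (var(X) + var(w)) = 1 / (1 + 1) = 0.5
slope = ols_slope(x_meas, y)
print(slope)
assert 0.4 < slope < 0.6
```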
1/2/3-283
Potential solutions to errors-in-variables bias
1/2/3-284
4. Sample selection bias
1/2/3-285
Example #1: Mutual funds
Do actively managed mutual funds outperform “hold-
the-market” funds?
Empirical strategy:
o Sampling scheme: simple random sampling of
mutual funds available to the public on a given
date.
o Data: returns for the preceding 10 years.
o Estimator: average ten-year return of the sample
mutual funds, minus ten-year return on S&P500
o Is there sample selection bias?
1/2/3-286
Sample selection bias induces correlation between a
regressor and the error term.
returni = β0 + β1managed_fundi + ui
1/2/3-287
Example #2: returns to education
What is the return to an additional year of education?
Empirical strategy:
o Sampling scheme: simple random sampling of
workers
o Data: earnings and years of education
o Estimator: regress ln(earnings) on years_education
o Ignore issues of omitted variable bias and
measurement error – is there sample selection
bias?
1/2/3-288
Potential solutions to sample selection bias
1/2/3-290
Simultaneous causality bias in equations
External validity
o Compare results for California and Massachusetts
o Think hard…
Internal validity
o Go through the list of five potential threats to
internal validity and think hard…
1/2/3-293
Check of external validity
compare the California study to one using
Massachusetts data
1/2/3-294
The Massachusetts data: summary statistics
1/2/3-295
1/2/3-296
1/2/3-297
Logarithmic v. cubic function for STR?
Evidence of nonlinearity in TestScore-STR relation?
Is there a significant HiELSTR interaction?
1/2/3-298
Predicted effects for a class size reduction of 2
Linear specification for Mass:
1/2/3-301
Comparison of estimated class size effects: CA vs. MA
1/2/3-302
Summary: Comparison of California and
Massachusetts Regression Analyses
1/2/3-305
2. Wrong functional form
We have tried quite a few different functional forms,
in both the California and Mass. data
Nonlinear effects are modest
Plausibly, this is not a major threat at this point.
3. Errors-in-variables bias
STR is a district-wide measure
Presumably there is some measurement error –
students who take the test might not have experienced
the measured STR for the district
Ideally we would like data on individual students, by
grade level.
1/2/3-306
4. Selection
Sample is all elementary public school districts (in
California; in Mass.)
no reason that selection should be a problem.
5. Simultaneous Causality
School funding equalization based on test scores
could cause simultaneous causality.
This was not in place in California or Mass. during
these samples, so simultaneous causality bias is
arguably not important.
1/2/3-307
Summary
Some jargon…
Another term for panel data is longitudinal data
balanced panel: no missing observations
unbalanced panel: some entities (states) are not
observed for some time periods (years)
1/2/3-311
Why are panel data useful?
1/2/3-312
Example of a panel data set:
Traffic deaths and alcohol taxes
1/2/3-313
Traffic death data for 1982
1/2/3-316
These omitted factors could cause omitted variable bias.
1/2/3-319
The key idea:
Any change in the fatality rate from 1982 to 1988
cannot be caused by Zi, because Zi (by assumption)
does not change between 1982 and 1988.
1982 data:
FatalityRate-hat = 2.01 + 0.15BeerTax   (n = 48)
                   (.15)   (.13)
1988 data:
FatalityRate-hat = 1.86 + 0.44BeerTax   (n = 48)
                   (.11)   (.13)
1/2/3-322
1/2/3-323
Fixed Effects Regression
(SW Section 8.3)
1/2/3-325
For TX:
YTX,t = β0 + β1XTX,t + β2ZTX + uTX,t
      = (β0 + β2ZTX) + β1XTX,t + uTX,t
or
YTX,t = αTX + β1XTX,t + uTX,t, where αTX = β0 + β2ZTX
1/2/3-326
The regression lines for each state in a picture
[Figure: three parallel regression lines, Y = αCA + β1X, Y = αTX + β1X, and Y = αMA + β1X – the same slope β1 in every state, with different intercepts αCA, αTX, αMA]
1/2/3-332
Entity-demeaned OLS regression, ctd.
Yit – (1/T)Σt Yit = β1[Xit – (1/T)Σt Xit] + [uit – (1/T)Σt uit]
or
Ỹit = β1X̃it + ũit
where Ỹit = Yit – (1/T)Σt Yit and X̃it = Xit – (1/T)Σt Xit
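Entity demeaning removes αi exactly, so OLS on the demeaned data recovers β1. A tiny noise-free sketch (hypothetical numbers, not from the slides):

```python
# Panel: 2 entities, 3 periods; Y = alpha_i + 2*X with alpha_A = 10, alpha_B = -5
data = {  # entity -> list of (X, Y) pairs over time
    "A": [(1, 12), (2, 14), (3, 16)],
    "B": [(1, -3), (4, 3), (7, 9)],
}

x_dm, y_dm = [], []
for obs in data.values():
    xbar = sum(x for x, _ in obs) / len(obs)    # entity mean of X
    ybar = sum(y for _, y in obs) / len(obs)    # entity mean of Y
    for x, y in obs:                            # deviations from entity means
        x_dm.append(x - xbar)
        y_dm.append(y - ybar)

# OLS through the origin on the demeaned data: beta1 = sum(x*y) / sum(x^2)
beta1_hat = sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)
print(beta1_hat)    # exactly 2 here, since the example has no noise
assert abs(beta1_hat - 2.0) < 1e-12
```

The different intercepts (10 and –5) never enter the demeaned data, which is the point of the transformation.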
1/2/3-334
Example: Traffic deaths and beer taxes in STATA
. areg vfrall beertax, absorb(state) r;
------------------------------------------------------------------------------
| Robust
vfrall | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
beertax | -.6558736 .2032797 -3.23 0.001 -1.055982 -.2557655
_cons | 2.377075 .1051515 22.61 0.000 2.170109 2.584041
-------------+----------------------------------------------------------------
state | absorbed (48 categories)
FR1988 – FR1982 = –.072 – 1.04(BeerTax1988 – BeerTax1982)
                  (.065)   (.36)
1/2/3-336
Regression with Time Fixed Effects
(SW Section 8.4)
1/2/3-337
Time fixed effects only
Yit = β0 + β1Xit + β3St + uit
Similarly,
Yi,1983 = λ1983 + β1Xi,1983 + ui,1983, where λ1983 = β0 + β3S1983
etc.
1/2/3-338
Two formulations for time fixed effects
1/2/3-339
Time fixed effects: estimation methods
1/2/3-341
State and time effects: estimation methods
For a single X:
Yit = β1Xit + αi + uit, i = 1,…,n, t = 1,…,T
1. E(uit|Xi1,…,XiT,αi) = 0.
2. (Xi1,…,XiT,Yi1,…,YiT), i = 1,…,n, are i.i.d. draws from their joint distribution.
3. (Xit, uit) have finite fourth moments.
4. There is no perfect multicollinearity (multiple X’s).
5. corr(uit,uis|Xit,Xis,αi) = 0 for t ≠ s.
Assumptions 3 & 4 are identical to before; 1 and 2 differ; 5 is new.
1/2/3-344
Assumption #1: E(uit|Xi1,…,XiT,αi) = 0
uit has mean zero, given the state fixed effect and the
entire history of the X’s for that state
This is an extension of the previous multiple
regression Assumption #1
This means there are no omitted lagged effects (any
lagged effects of X must enter explicitly)
Also, there is no feedback from u to future X:
o Whether a state has a particularly high fatality rate
this year doesn’t subsequently affect whether it
increases the beer tax.
o We’ll return to this when we take up time series
data.
1/2/3-345
Assumption #2: (Xi1,…,XiT,Yi1,…,YiT), i =1,…,n, are
i.i.d. draws from their joint distribution.
This is an extension of Assumption #2 for multiple
regression with cross-section data
This is satisfied if entities (states, individuals) are
randomly sampled from their population by simple
random sampling, then data for those entities are
collected over time.
This does not require observations to be i.i.d. over
time for the same entity – that would be unrealistic
(whether a state has a mandatory DWI sentencing law
this year is strongly related to whether it will have that
law next year).
1/2/3-346
Assumption #5: corr(uit,uis|Xit,Xis,αi) = 0 for t ≠ s
This is new.
This says that (given X), the error terms are
uncorrelated over time within a state.
For example, uCA,1982 and uCA,1983 are uncorrelated
Is this plausible? What enters the error term?
o Especially snowy winter
o Opening major new divided highway
o Fluctuations in traffic density from local economic
conditions
Assumption #5 requires these omitted factors entering
uit to be uncorrelated over time, within a state.
1/2/3-347
What if Assumption #5 fails: corr(uit,uis|Xit,Xis,αi) ≠ 0?
A useful analogy is heteroskedasticity.
OLS panel data estimators of β1 are unbiased,
consistent
The OLS standard errors will be wrong – usually the
OLS standard errors understate the true uncertainty
Intuition: if uit is correlated over time, you don’t have
as much information (as much random variation) as you
would were uit uncorrelated.
This problem is solved by using “heteroskedasticity and
autocorrelation-consistent standard errors” – we return
to this when we focus on time series regression
1/2/3-348
Application: Drunk Driving Laws and Traffic Deaths
(SW Section 8.5)
Some facts
Approx. 40,000 traffic fatalities annually in the U.S.
1/3 of traffic fatalities involve a drinking driver
25% of drivers on the road between 1am and 3am
have been drinking (estimate)
A drunk driver is 13 times as likely to cause a fatal
crash as a non-drinking driver (estimate)
1/2/3-349
Drunk driving laws and traffic deaths, ctd.
1/2/3-350
The drunk driving panel data set
n = 48 U.S. states, T = 7 years (1982,…,1988) (balanced)
Variables
Traffic fatality rate (deaths per 10,000 residents)
Tax on a case of beer (Beertax)
Minimum legal drinking age
Minimum sentencing laws for first DWI violation:
o Mandatory Jail
o Mandatory Community Service
o otherwise, sentence will just be a monetary fine
Vehicle miles per driver (US DOT)
State economic data (real per capita income, etc.)
1/2/3-351
Why might panel data help?
Potential OV bias from variables that vary across states
but are constant over time:
o culture of drinking and driving
o quality of roads
o vintage of autos on the road
use state fixed effects
Potential OV bias from variables that vary over time
but are constant across states:
o improvements in auto safety over time
o changing national attitudes towards drunk driving
use time fixed effects
1/2/3-352
1/2/3-353
1/2/3-354
Empirical Analysis: Main Results
1/2/3-357
Fixed effects estimation can be done three ways:
1. “Changes” method when T = 2
2. “n-1 binary regressors” method when n is small
3. “Entity-demeaned” regression
Similar methods apply to regression with time fixed
effects and to both time and state fixed effects
Statistical inference: like multiple regression.
Limitations/challenges
Need variation in X over time within states
Time lag effects can be important
Standard errors might be too low (errors might be
correlated over time)
1/2/3-358
Regression with a Binary Dependent Variable
(SW Ch. 9)
1/2/3-359
Example: Mortgage denial and race
The Boston Fed HMDA data set
Individual applications for single-family mortgages
made in 1990 in the greater Boston area
2380 observations, collected under Home Mortgage
Disclosure Act (HMDA)
Variables
Dependent variable:
o Is the mortgage denied or accepted?
Independent variables:
o income, wealth, employment status
o other loan, property characteristics
o race of applicant
1/2/3-360
The Linear Probability Model
(SW Section 9.1)
Yi = β0 + β1Xi + ui
But:
What does β1 mean when Y is binary? Is β1 = ΔY/ΔX?
What does the line β0 + β1X mean when Y is binary?
What does the predicted value Ŷ mean when Y is binary? For example, what does Ŷ = 0.26 mean?
1/2/3-361
The linear probability model, ctd.
Yi = β0 + β1Xi + ui
When Y is binary,
E(Y) = 1Pr(Y=1) + 0Pr(Y=0) = Pr(Y=1)
so
E(Y|X) = Pr(Y=1|X)
1/2/3-362
The linear probability model, ctd.
When Y is binary, the linear regression model
Yi = β0 + β1Xi + ui
is called the linear probability model.
1/2/3-364
Linear probability model: HMDA data
Instead, we want:
0 ≤ Pr(Y = 1|X) ≤ 1 for all X
Pr(Y = 1|X) to be increasing in X (for β1 > 0)
This requires a nonlinear functional form for the
probability. How about an “S-curve”…
1/2/3-368
The probit model satisfies these conditions:
0 ≤ Pr(Y = 1|X) ≤ 1 for all X
Pr(Y = 1|X) to be increasing in X (for β1 > 0)
1/2/3-369
Probit regression models the probability that Y=1 using
the cumulative standard normal distribution function,
evaluated at z = β0 + β1X:
Pr(Y = 1|X) = Φ(β0 + β1X)
Φ is the cumulative normal distribution function.
z = β0 + β1X is the “z-value” or “z-index” of the probit model.
------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 2.967908 .4653114 6.38 0.000 2.055914 3.879901
_cons | -2.194159 .1649721 -13.30 0.000 -2.517499 -1.87082
------------------------------------------------------------------------------
Pr-hat(deny = 1 | P/I ratio = .3) = Φ(–2.19 + 2.97×.3) = Φ(–1.30) = .097
Effect of change in P/I ratio from .3 to .4:
Pr-hat(deny = 1 | P/I ratio = .4) = Φ(–2.19 + 2.97×.4) = .159
Predicted probability of denial rises from .097 to .159
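Φ is available via the error function, so these predicted probabilities can be reproduced directly (a sketch using the rounded coefficients above, not part of the slides):

```python
import math

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

b0, b1 = -2.19, 2.97          # rounded probit estimates from the slide
p_30 = phi(b0 + b1 * 0.3)     # P/I ratio = .3
p_40 = phi(b0 + b1 * 0.4)     # P/I ratio = .4
print(round(p_30, 3), round(p_40, 3))
assert abs(p_30 - 0.097) < 0.002
assert abs(p_40 - 0.159) < 0.002
```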
1/2/3-374
Probit regression with multiple regressors
1/2/3-375
STATA Example: HMDA data
. probit deny p_irat black, r;
------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 2.741637 .4441633 6.17 0.000 1.871092 3.612181
black | .7081579 .0831877 8.51 0.000 .545113 .8712028
_cons | -2.258738 .1588168 -14.22 0.000 -2.570013 -1.947463
------------------------------------------------------------------------------
1/2/3-376
STATA Example: predicted probit probabilities
. probit deny p_irat black, r;
------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 2.741637 .4441633 6.17 0.000 1.871092 3.612181
black | .7081579 .0831877 8.51 0.000 .545113 .8712028
_cons | -2.258738 .1588168 -14.22 0.000 -2.570013 -1.947463
------------------------------------------------------------------------------
. sca z1 = _b[_cons]+_b[p_irat]*.3+_b[black]*0;
NOTE
_b[_cons] is the estimated intercept (-2.258738)
_b[p_irat] is the coefficient on p_irat (2.741637)
sca creates a new scalar which is the result of a calculation
display prints the indicated information to the screen
1/2/3-377
STATA Example: HMDA data, ctd.
Pr-hat(deny = 1 | P/I, black)
= Φ(–2.26 + 2.74×P/I ratio + .71×black)
   (.16)   (.44)              (.08)
Is the coefficient on black statistically significant?
Estimated effect of race for P/I ratio = .3:
Pr-hat(deny = 1 | .3, 1) = Φ(–2.26 + 2.74×.3 + .71×1) = .233
F(β0 + β1X) = 1 / [1 + e^–(β0 + β1X)]
1/2/3-379
Logistic regression, ctd.
where F(β0 + β1X) = 1 / [1 + e^–(β0 + β1X)].
------------------------------------------------------------------------------
| Robust
deny | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p_irat | 5.370362 .9633435 5.57 0.000 3.482244 7.258481
black | 1.272782 .1460986 8.71 0.000 .9864339 1.55913
_cons | -4.125558 .345825 -11.93 0.000 -4.803362 -3.447753
------------------------------------------------------------------------------
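The logistic CDF has a closed form, so predicted probabilities follow directly from the output above (a sketch using the rounded coefficients; the P/I ratio of .3 is chosen for comparability with the probit example):

```python
import math

def logistic(z):
    # logistic CDF: F(z) = 1 / (1 + e^(-z))
    return 1 / (1 + math.exp(-z))

b0, b_pi, b_black = -4.13, 5.37, 1.27   # rounded logit estimates from the table
p_black = logistic(b0 + b_pi * 0.3 + b_black * 1)   # P/I = .3, black = 1
p_white = logistic(b0 + b_pi * 0.3 + b_black * 0)   # P/I = .3, black = 0
print(round(p_black, 3), round(p_white, 3))
assert 0.21 < p_black < 0.24
assert 0.05 < p_white < 0.09
```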
1/2/3-382
Estimation and Inference in Probit (and Logit)
Models (SW Section 9.3)
Probit model:
Pr(Y = 1|X) = Φ(β0 + β1X)
One approach is nonlinear least squares:
min over b0, b1 of Σi=1..n [Yi – Φ(b0 + b1Xi)]²
Probit (and logit) are instead estimated by maximum likelihood. Consider first the “no-X” case:
Y = 1 with probability p; Y = 0 with probability 1 – p (Bernoulli distribution)
1/2/3-387
Joint density of (Y1,…,Yn):
Because the Yi are independent, the joint density is the product of the individual densities:
f(y1,…,yn) = p^(Σi yi) × (1 – p)^(n – Σi yi)
1/2/3-388
The likelihood is the joint density, treated as a function of
the unknown parameters, which here is p:
f(p; Y1,…,Yn) = p^(Σi Yi) × (1 – p)^(n – Σi Yi)
Maximize the (log) likelihood by setting its derivative to zero:
d ln f(p;Y1,…,Yn)/dp = (1/p)Σi Yi – [1/(1 – p)](n – Σi Yi) = 0
1/2/3-389
Solving for p yields the MLE; that is, p̂MLE satisfies
(1/p̂MLE)Σi Yi – [1/(1 – p̂MLE)](n – Σi Yi) = 0
or
(1/p̂MLE)Σi Yi = [1/(1 – p̂MLE)](n – Σi Yi)
or
Ȳ/p̂MLE = (1 – Ȳ)/(1 – p̂MLE)
or
p̂MLE = Ȳ = fraction of 1’s
1/2/3-390
The MLE in the “no-X” case (Bernoulli distribution):
p̂MLE = Ȳ = fraction of 1’s
For Yi i.i.d. Bernoulli, the MLE is the “natural” estimator of p, the fraction of 1’s, which is Ȳ
We already know the essentials of inference:
o In large samples, the sampling distribution of p̂MLE = Ȳ is normally distributed
o Thus inference is “as usual:” hypothesis testing via the t-statistic, confidence interval as Ȳ ± 1.96SE
STATA note: to emphasize the requirement of large n, the printout calls the t-statistic the z-statistic and reports the chi-squared statistic (= qF) instead of the F-statistic.
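That p̂MLE = Ȳ maximizes the Bernoulli log likelihood can be verified numerically (a sketch with hypothetical data, not from the slides):

```python
import math

ys = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]       # hypothetical binary sample
p_hat = sum(ys) / len(ys)                  # fraction of 1's = 0.7

def loglik(p):
    # Bernoulli log likelihood: sum of Yi*ln(p) + (1-Yi)*ln(1-p)
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)

# The log likelihood is maximized at p = Ybar:
grid = [i / 100 for i in range(1, 100)]
p_best = max(grid, key=loglik)
print(p_hat, p_best)
assert abs(p_best - p_hat) < 0.011
```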
1/2/3-391
The probit likelihood with one X
The derivation starts with the density of Y1, given X1:
Pr(Y1 = 1|X1) = Φ(β0 + β1X1)
Pr(Y1 = 0|X1) = 1 – Φ(β0 + β1X1)
so
Pr(Y1 = y1|X1) = Φ(β0 + β1X1)^y1 × [1 – Φ(β0 + β1X1)]^(1–y1)
1/2/3-392
The probit likelihood function:
f(β0,β1; Y1,…,Yn|X1,…,Xn)
= {Φ(β0 + β1X1)^Y1 [1 – Φ(β0 + β1X1)]^(1–Y1)} × … × {Φ(β0 + β1Xn)^Yn [1 – Φ(β0 + β1Xn)]^(1–Yn)}
1/2/3-394
Measures of fit
The R² and R̄² don’t make sense here (why?). So, two other specialized measures are used:
1. Write the likelihood and the log likelihood:
f(p;Y1) = p^Y1 (1 – p)^(1–Y1)  (likelihood)
ℓ(p) = ln[f(p;Y1) × … × f(p;Yn)] = Σi=1..n ln f(p;Yi)  (log likelihood)
1/2/3-397
2. Set the derivative of (p) to zero to define the MLE:
∂ℓ(p)/∂p |p=p̂MLE = Σi=1..n ∂ln f(p;Yi)/∂p |p=p̂MLE = 0
3. Expand ℓ(p) around ptrue:
0 = ∂ℓ(p)/∂p|p̂MLE ≈ ∂ℓ(p)/∂p|ptrue + [∂²ℓ(p)/∂p²|ptrue](p̂MLE – ptrue)
1/2/3-398
4. Solve this linear approximation for (p̂MLE – ptrue):
∂ℓ(p)/∂p|ptrue + [∂²ℓ(p)/∂p²|ptrue](p̂MLE – ptrue) ≈ 0
so
[∂²ℓ(p)/∂p²|ptrue](p̂MLE – ptrue) ≈ –∂ℓ(p)/∂p|ptrue
or
(p̂MLE – ptrue) ≈ –[∂²ℓ(p)/∂p²|ptrue]⁻¹ [∂ℓ(p)/∂p|ptrue]
1/2/3-399
5. Substitute things in and apply the LLN and CLT.
ℓ(p) = Σi=1..n ln f(p;Yi)
∂ℓ(p)/∂p|ptrue = Σi=1..n ∂ln f(p;Yi)/∂p|ptrue
∂²ℓ(p)/∂p²|ptrue = Σi=1..n ∂²ln f(p;Yi)/∂p²|ptrue
so
(p̂MLE – ptrue) ≈ –[∂²ℓ(p)/∂p²|ptrue]⁻¹ [∂ℓ(p)/∂p|ptrue]
= –[Σi=1..n ∂²ln f(p;Yi)/∂p²|ptrue]⁻¹ [Σi=1..n ∂ln f(p;Yi)/∂p|ptrue]
1/2/3-400
Multiply through by √n:
√n(p̂MLE – ptrue) ≈ –[(1/n)Σi ∂²ln f(p;Yi)/∂p²|ptrue]⁻¹ [(1/√n)Σi ∂ln f(p;Yi)/∂p|ptrue]
Because Yi is i.i.d., the ith terms in the summands are also i.i.d. Thus, if these terms have enough (2) moments, then under general conditions (not just the Bernoulli likelihood):
(1/n)Σi ∂²ln f(p;Yi)/∂p²|ptrue →p –a (a constant) (WLLN)
(1/√n)Σi ∂ln f(p;Yi)/∂p|ptrue →d N(0, σ²ln f) (CLT) (Why?)
1/2/3-401
Putting this together,
√n(p̂MLE – ptrue) ≈ –[(1/n)Σi ∂²ln f(p;Yi)/∂p²|ptrue]⁻¹ [(1/√n)Σi ∂ln f(p;Yi)/∂p|ptrue]
where
(1/n)Σi ∂²ln f(p;Yi)/∂p²|ptrue →p –a (WLLN)
(1/√n)Σi ∂ln f(p;Yi)/∂p|ptrue →d N(0, σ²ln f) (CLT)
so
√n(p̂MLE – ptrue) →d N(0, σ²ln f /a²) (large-n normal)
1/2/3-402
Work out the details for probit/no X (Bernoulli) case:
Recall:
f(p;Yi) = p^Yi (1 – p)^(1–Yi)
so
ln f(p;Yi) = Yi ln p + (1 – Yi)ln(1 – p)
and
∂ln f(p;Yi)/∂p = Yi/p – (1 – Yi)/(1 – p) = (Yi – p)/[p(1 – p)]
and
∂²ln f(p;Yi)/∂p² = –Yi/p² – (1 – Yi)/(1 – p)²
1/2/3-403
Denominator term first:
∂²ln f(p;Yi)/∂p² = –Yi/p² – (1 – Yi)/(1 – p)²
so
(1/n)Σi ∂²ln f(p;Yi)/∂p²|ptrue = –(1/n)Σi [Yi/p² + (1 – Yi)/(1 – p)²]
= –[Ȳ/p² + (1 – Ȳ)/(1 – p)²]
→p –[p/p² + (1 – p)/(1 – p)²] (LLN)
= –[1/p + 1/(1 – p)] = –1/[p(1 – p)]
1/2/3-404
Next the numerator:
∂ln f(p;Yi)/∂p = (Yi – p)/[p(1 – p)]
so
(1/√n)Σi ∂ln f(p;Yi)/∂p|ptrue = [1/(p(1 – p))] × (1/√n)Σi (Yi – p)
→d N(0, σ²Y /[p(1 – p)]²)
1/2/3-405
Put these pieces together:
√n(p̂MLE – ptrue) ≈ –[(1/n)Σi ∂²ln f(p;Yi)/∂p²|ptrue]⁻¹ [(1/√n)Σi ∂ln f(p;Yi)/∂p|ptrue]
where
(1/n)Σi ∂²ln f(p;Yi)/∂p²|ptrue →p –1/[p(1 – p)]
(1/√n)Σi ∂ln f(p;Yi)/∂p|ptrue →d N(0, σ²Y /[p(1 – p)]²)
Thus
√n(p̂MLE – ptrue) →d N(0, σ²Y)
1/2/3-406
Summary: probit MLE, no-X case
√n(p̂MLE – ptrue) →d N(0, σ²Y)
1/2/3-410
The HMDA Data Set
1/2/3-411
The loan officer’s decision
1/2/3-412
Regression specifications
Pr(deny=1|black, other X’s) = …
linear probability model
probit
1/2/3-419
Remaining threats to internal, external validity
Internal validity
1. omitted variable bias
what else is learned in the in-person interviews?
2. functional form misspecification (no…)
3. measurement error (originally, yes; now, no…)
4. selection
random sample of loan applications
define population to be loan applicants
5. simultaneous causality (no)
External validity
This is for Boston in 1990-91. What about today?
1/2/3-420
Summary
(SW Section 9.5)
If Yi is binary, then E(Y| X) = Pr(Y=1|X)
Three models:
o linear probability model (linear multiple regression)
o probit (cumulative standard normal distribution)
o logit (cumulative standard logistic distribution)
LPM, probit, logit all produce predicted probabilities
Effect of X is change in conditional probability that
Y=1. For logit and probit, this depends on the initial X
Probit and logit are estimated via maximum likelihood
o Coefficients are normally distributed for large n
o Large-n hypothesis testing, conf. intervals is as usual
1/2/3-421
Instrumental Variables Regression
(SW Ch. 10)
Yi = β0 + β1Xi + ui
1/2/3-424
Two conditions for a valid instrument
Yi = β0 + β1Xi + ui
1/2/3-425
The IV Estimator, one X and one Z
Explanation #1: Two Stage Least Squares (TSLS)
As it sounds, TSLS has two stages – two regressions:
(1) First isolates the part of X that is uncorrelated with u:
regress X on Z using OLS
Xi = π0 + π1Zi + vi (1)
(2) Then replaces Xi by its predicted value X̂i and estimates by OLS:
Yi = β0 + β1X̂i + ui (2)
1/2/3-427
Two Stage Least Squares, ctd.
Stage 1:
Regress Xi on Zi, obtain the predicted values Xˆ i
Stage 2:
Regress Yi on Xˆ i ; the coefficient on Xˆ i is the TSLS
estimator, β̂1TSLS.
1/2/3-428
The IV Estimator, one X and one Z, ctd.
Explanation #2: (only) a little algebra
Yi = β0 + β1Xi + ui
Thus,
cov(Yi, Zi) = cov(β0 + β1Xi + ui, Zi)
= cov(β0, Zi) + cov(β1Xi, Zi) + cov(ui, Zi)
= 0 + cov(β1Xi, Zi) + 0
= β1cov(Xi, Zi)
so
β1 = cov(Yi, Zi) / cov(Xi, Zi)
1/2/3-429
The IV Estimator, one X and one Z, ctd
β1 = cov(Yi, Zi) / cov(Xi, Zi)
The IV estimator replaces these population covariances with the sample covariances:
β̂1TSLS = sYZ / sXZ
where sYZ and sXZ are the sample covariances.
The sample covariances are consistent: sYZ →p cov(Y,Z) and sXZ →p cov(X,Z). Thus,
β̂1TSLS = sYZ / sXZ →p cov(Y,Z) / cov(X,Z) = β1
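This consistency can be illustrated with a simulated endogenous regressor: the sample-covariance ratio sYZ/sXZ recovers β1 even when OLS does not (a sketch with a hypothetical DGP, not from the slides):

```python
import random

random.seed(1)
n, beta0, beta1 = 20000, 1.0, 2.0
z = [random.gauss(0, 1) for _ in range(n)]    # instrument: exogenous and relevant
u = [random.gauss(0, 1) for _ in range(n)]
# X is correlated with u, so OLS is inconsistent:
x = [zi + 0.8 * ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

beta1_ols = cov(x, y) / cov(x, x)    # biased: corr(X, u) != 0
beta1_iv = cov(y, z) / cov(x, z)     # sample-covariance ratio = TSLS with one X, one Z
print(beta1_ols, beta1_iv)
assert beta1_ols > 2.1               # OLS biased upward in this DGP
assert abs(beta1_iv - beta1) < 0.1   # IV close to the true beta1 = 2
```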
1/2/3-433
This interaction of demand and supply produces…
1/2/3-434
What would you get if only supply shifted?
1/2/3-436
TSLS in the supply-demand example, ctd.
ln(Qi^butter) = β0 + β1ln(Pi^butter) + ui
Stage 1: regress ln(Pi^butter) on rain, get the predicted values ln(Pi^butter)-hat
  ln(Pi^butter)-hat isolates changes in log price that arise from supply (part of supply, at least)
Stage 2: regress ln(Qi^butter) on ln(Pi^butter)-hat
The regression counterpart of using shifts in the
supply curve to trace out the demand curve.
1/2/3-437
Example #2: Test scores and class size
1/2/3-438
Example #2: Test scores and class size, ctd.
Here is a (hypothetical) instrument:
some districts, randomly hit by an earthquake, “double
up” classrooms:
Zi = Quakei = 1 if hit by quake, = 0 otherwise
Do the two conditions for a valid instrument hold?
The earthquake makes it as if the districts were in a
random assignment experiment. Thus the variation in
STR arising from the earthquake is exogenous.
The first stage of TSLS regresses STR against Quake,
thereby isolating the part of STR that is exogenous (the
part that is “as if” randomly assigned)
We’ll go through other examples later…
1/2/3-439
Inference using TSLS
In large samples, the sampling distribution of the TSLS
estimator is normal
Inference (hypothesis tests, confidence intervals)
proceeds in the usual way, e.g. ±1.96SE
The idea behind the large-sample normal distribution of
the TSLS estimator is that – like all the other estimators
we have considered – it involves an average of mean
zero i.i.d. random variables, to which we can apply the
CLT.
Here is a sketch of the math (see SW App. 10.3 for the
details)...
1/2/3-440
β̂1TSLS = sYZ / sXZ = [ (1/(n–1)) Σi (Yi – Ȳ)(Zi – Z̄) ] / [ (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) ]
Substituting Yi = β0 + β1Xi + ui, the numerator is
(1/(n–1)) Σi (Yi – Ȳ)(Zi – Z̄) = β1 (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) + (1/(n–1)) Σi (ui – ū)(Zi – Z̄).
1/2/3-441
Thus
β̂1TSLS = [ β1 (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) + (1/(n–1)) Σi (ui – ū)(Zi – Z̄) ] / [ (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) ]
= β1 + [ (1/(n–1)) Σi (ui – ū)(Zi – Z̄) ] / [ (1/(n–1)) Σi (Xi – X̄)(Zi – Z̄) ]
so
√n(β̂1TSLS – β1) ≈ [ (1/√n) Σi (ui – ū)(Zi – Z̄) ] / [ (1/n) Σi (Xi – X̄)(Zi – Z̄) ]
1/2/3-443
√n(β̂1TSLS – β1) ≈ [ (1/√n) Σi (ui – ū)(Zi – Z̄) ] / [ (1/n) Σi (Xi – X̄)(Zi – Z̄) ]
(1/n) Σi (Xi – X̄)(Zi – Z̄) →p cov(X,Z)
(1/√n) Σi (ui – ū)(Zi – Z̄) is distributed N(0, var[(Z – μZ)u])
So finally:
β̂1TSLS is approx. distributed N(β1, σ²β̂1,TSLS),
where σ²β̂1,TSLS = (1/n) × var[(Zi – μZ)ui] / [cov(Zi, Xi)]².
1/2/3-445
Inference using TSLS, ctd.
β̂1TSLS is approx. distributed N(β1, σ²β̂1,TSLS),
1/2/3-447
Figure 4, p. 296, from Appendix B (1928):
1/2/3-448
Who wrote Appendix B of Philip Wright (1928)?
1/2/3-449
Philip Wright (1861-1934) Sewall Wright (1889-1988)
obscure economist and poet famous genetic statistician
MA Harvard, Econ, 1887 ScD Harvard, Biology, 1915
Lecturer, Harvard, 1913-1917 Prof., U. Chicago, 1930-1954
1/2/3-450
Example: Demand for Cigarettes
1/2/3-451
Example: Cigarette demand, ctd.
ln(Qicigarettes ) = 0 + 1ln( Pi cigarettes ) + ui
Panel data:
Annual cigarette consumption and average prices paid
(including tax)
48 continental US states, 1985-1995
Proposed instrumental variable:
Zi = general sales tax per pack in the state = SalesTaxi
Is this a valid instrument?
(1) Relevant? corr(SalesTaxi, ln(Pi^cigarettes)) ≠ 0?
(2) Exogenous? corr(SalesTaxi, ui) = 0?
1/2/3-452
For now, use data for 1995 only.
1/2/3-453
STATA Example: Cigarette demand, First stage
Instrument = Z = rtaxso = general sales tax (real $/pack)
X Z
. reg lravgprs rtaxso if year==1995, r;
------------------------------------------------------------------------------
| Robust
lravgprs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rtaxso | .0307289 .0048354 6.35 0.000 .0209956 .0404621
_cons | 4.616546 .0289177 159.64 0.000 4.558338 4.674755
------------------------------------------------------------------------------
X-hat
. predict lravphat; Now we have the predicted values from the 1st stage
1/2/3-454
Second stage
Y X-hat
. reg lpackpc lravphat if year==1995, r;
------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravphat | -1.083586 .3336949 -3.25 0.002 -1.755279 -.4118932
_cons | 9.719875 1.597119 6.09 0.000 6.505042 12.93471
------------------------------------------------------------------------------
1/2/3-455
Combined into a single command:
Y X Z
. ivreg lpackpc (lravgprs = rtaxso) if year==1995, r;
------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravgprs | -1.083587 .3189183 -3.40 0.001 -1.725536 -.4416373
_cons | 9.719876 1.528322 6.36 0.000 6.643525 12.79623
------------------------------------------------------------------------------
Instrumented: lravgprs This is the endogenous regressor
Instruments:   rtaxso                  This is the instrumental variable
------------------------------------------------------------------------------
OK, the change in the SEs was small this time...but not always!
ln(Qi^cigarettes)-hat = 9.72 – 1.08 ln(Pi^cigarettes), n = 48
                        (1.53)  (0.32)
1/2/3-456
Summary of IV Regression with a Single X and Z
1/2/3-457
The General IV Regression Model
(SW Section 10.2)
1/2/3-460
The general IV regression model, ctd.
Yi = β0 + β1X1i + … + βkXki + βk+1W1i + … + βk+rWri + ui
1/2/3-462
Identification, ctd.
The coefficients β1,…,βk are said to be:
exactly identified if m = k.
There are just enough instruments to estimate β1,…,βk.
overidentified if m > k.
There are more than enough instruments to estimate β1,…,βk. If so, you can test whether the instruments are valid (a test of the “overidentifying restrictions”) – we’ll return to this later
underidentified if m < k.
There are too few instruments to estimate β1,…,βk. If so, you need to get more instruments!
1/2/3-463
General IV regression: TSLS, 1 endogenous regressor
Yi = β0 + β1X1i + β2W1i + … + β1+rWri + ui
Instruments: Z1i,…,Zmi
First stage
o Regress X1 on all the exogenous regressors: regress
X1 on W1,…,Wr,Z1,…,Zm by OLS
o Compute predicted values Xˆ 1i , i = 1,…,n
Second stage
o Regress Y on X̂ 1,W1,…,Wr by OLS
o The coefficients from this second stage regression
are the TSLS estimators, but SEs are wrong
To get correct SEs, do this in a single step
1/2/3-464
Example: Demand for cigarettes
------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravgprs | -1.143375 .3723025 -3.07 0.004 -1.893231 -.3935191
lperinc | .214515 .3117467 0.69 0.495 -.413375 .842405
_cons | 9.430658 1.259392 7.49 0.000 6.894112 11.9672
------------------------------------------------------------------------------
Instrumented: lravgprs
Instruments: lperinc rtaxso STATA lists ALL the exogenous regressors
as instruments – slightly different
terminology than we have been using
------------------------------------------------------------------------------
------------------------------------------------------------------------------
| Robust
lpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lravgprs | -1.277424 .2496099 -5.12 0.000 -1.780164 -.7746837
lperinc | .2804045 .2538894 1.10 0.275 -.230955 .7917641
_cons | 9.894955 .9592169 10.32 0.000 7.962993 11.82692
------------------------------------------------------------------------------
Instrumented: lravgprs
Instruments: lperinc rtaxso rtax STATA lists ALL the exogenous regressors
as “instruments” – slightly different
terminology than we have been using
------------------------------------------------------------------------------
1/2/3-467
TSLS estimates, Z = sales tax (m = 1)
ln(Qi^cigarettes)-hat = 9.43 – 1.14 ln(Pi^cigarettes) + 0.21ln(Incomei)
                        (1.26)  (0.37)                  (0.31)
Instruments: Z1i,…,Zm
Now there are k first stage regressions:
o Regress X1 on W1,…, Wr, Z1,…, Zm by OLS
o Compute predicted values Xˆ 1i , i = 1,…,n
o Regress X2 on W1,…, Wr, Z1,…, Zm by OLS
o Compute predicted values Xˆ 2i , i = 1,…,n
o Repeat for all X’s, obtaining Xˆ 1i , Xˆ 2i ,…, Xˆ ki
1/2/3-469
TSLS with multiple endogenous regressors, ctd.
Second stage
o Regress Y on X̂1i, X̂2i,…, X̂ki, W1,…,Wr by OLS
o The coefficients from this second-stage regression
are the TSLS estimators, but the SEs are wrong
To get correct SEs, do this in a single step
What would happen in the second stage regression if
the coefficients were underidentified (that is, if
#instruments < #endogenous variables); for example, if
k = 2, m = 1?
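The mechanics of the two-stage procedure are easiest to see in the simplest case: one endogenous X, one instrument Z, and no W’s. Below is a minimal Python sketch on simulated data (the data-generating coefficients are invented for illustration). With a single instrument, the manually computed second-stage slope coincides exactly with the closed-form IV estimator s_YZ/s_XZ, while OLS is biased because corr(X,u) ≠ 0:

```python
import random

def mean(a): return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

random.seed(0)
n = 20000
z = [random.gauss(0, 1) for _ in range(n)]            # instrument
u = [random.gauss(0, 1) for _ in range(n)]            # error term
# X is endogenous: it depends on u, so corr(X, u) != 0
x = [0.8 * zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [2.0 - 1.0 * xi + ui for xi, ui in zip(x, u)]     # true beta1 = -1

# First stage: regress X on Z by OLS, compute predicted values Xhat
pi1 = cov(x, z) / cov(z, z)
pi0 = mean(x) - pi1 * mean(z)
xhat = [pi0 + pi1 * zi for zi in z]

# Second stage: regress Y on Xhat -> the TSLS estimate of beta1
b_tsls = cov(y, xhat) / cov(xhat, xhat)

# With one instrument this equals the closed form s_YZ / s_XZ exactly
b_iv = cov(y, z) / cov(x, z)

# OLS is biased here because corr(X, u) != 0
b_ols = cov(y, x) / cov(x, x)
print(b_tsls, b_iv, b_ols)
```

As the slide warns, doing the two stages by hand gives the right coefficients but the wrong SEs; in practice use a single-step command (e.g. ivreg).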
1/2/3-470
Sampling distribution of the TSLS estimator in the
general IV regression model
1/2/3-471
A “valid” set of instruments in the general case
2. Instrument exogeneity
All the instruments are uncorrelated with the error
term: corr(Z1i,ui) = 0,…, corr(Zmi,ui) = 0
1/2/3-472
“Valid” instruments in the general case, ctd.
1/2/3-473
The IV Regression Assumptions
Yi = β0 + β1X1i + … + βkXki + βk+1W1i + … + βk+rWri + ui
1. E(ui|W1i,…,Wri) = 0
2. (Yi,X1i,…,Xki,W1i,…,Wri,Z1i,…,Zmi) are i.i.d.
3. The X’s, W’s, Z’s, and Y have nonzero, finite 4th
moments
4. The W’s are not perfectly multicollinear
5. The instruments (Z1i,…,Zmi) satisfy the conditions for
a valid set of instruments.
All this hinges on having valid instruments…
1/2/3-475
Checking Instrument Validity
(SW Section 10.3)
The IV estimator is β̂1^TSLS = s_YZ/s_XZ
If cov(X,Z) is zero or small, then s_XZ will be small:
with weak instruments, the denominator is nearly zero.
If so, the sampling distribution of β̂1^TSLS (and its t-
statistic) is not well approximated by its large-n normal
approximation…
1/2/3-478
An example: the distribution of the TSLS t-statistic
with weak instruments
If cov(X,Z) is small, small changes in s_XZ (from one
sample to the next) can induce big changes in β̂1^TSLS.
Suppose in one sample you calculate s_XZ = .00001!
Thus the large-n normal approximation is a poor
approximation to the sampling distribution of β̂1^TSLS.
A better approximation is that β̂1^TSLS is distributed as the
ratio of two correlated normal random variables (see
SW App. 10.4)
If instruments are weak, the usual methods of inference
are unreliable – potentially very unreliable.
1/2/3-480
Measuring the strength of instruments in practice:
The first-stage F-statistic
1/2/3-481
Checking for weak instruments with a single X
Compute the first-stage F-statistic.
Rule-of-thumb: If the first stage F-statistic is less
than 10, then the set of instruments is weak.
If so, the TSLS estimator will be biased, and statistical
inferences (standard errors, hypothesis tests, confidence
intervals) can be misleading.
Note that simply rejecting the null hypothesis that
the coefficients on the Z’s are zero isn’t enough – you
actually need substantial predictive content for the
normal approximation to be a good one.
There are more sophisticated things to do than just
compare F to 10 but they are beyond this course.
1/2/3-482
What to do if you have weak instruments?
1/2/3-488
Panel data set
Annual cigarette consumption, average prices paid by
end consumer (including tax), personal income
48 continental US states, 1985-1995
Estimation strategy
Having panel data allows us to control for unobserved
state-level characteristics that enter the demand for
cigarettes, as long as they don’t vary over time
But we still need to use IV estimation methods to
handle the simultaneous causality bias that arises from
the interaction of supply and demand.
1/2/3-489
Fixed-effects model of cigarette demand
1/2/3-490
Panel data IV regression: two approaches
(a) The “n-1 binary indicators” method
(b) The “changes” method (when T=2)
1/2/3-494
Use TSLS to estimate the demand elasticity by using
the “10-year changes” specification
(Y = dlpackpc, W = dlperinc, X = dlavgprs, Z = drtaxso)
. ivreg dlpackpc dlperinc (dlavgprs = drtaxso) , r;
------------------------------------------------------------------------------
| Robust
dlpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dlavgprs | -.9380143 .2075022 -4.52 0.000 -1.355945 -.5200834
dlperinc | .5259693 .3394942 1.55 0.128 -.1578071 1.209746
_cons | .2085492 .1302294 1.60 0.116 -.0537463 .4708446
------------------------------------------------------------------------------
Instrumented: dlavgprs
Instruments: dlperinc drtaxso
------------------------------------------------------------------------------
NOTE:
- All the variables – Y, X, W, and Z’s – are in 10-year changes
- Estimated elasticity = –.94 (SE = .21) – surprisingly elastic!
- Income elasticity small, not statistically different from zero
- Must check whether the instrument is relevant…
1/2/3-495
Check instrument relevance: compute first-stage F
. reg dlavgprs drtaxso dlperinc , r;
------------------------------------------------------------------------------
| Robust
dlavgprs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drtaxso | .0254611 .0043876 5.80 0.000 .016624 .0342982
dlperinc | -.2241037 .2188815 -1.02 0.311 -.6649536 .2167463
_cons | .5321948 .0295315 18.02 0.000 .4727153 .5916742
------------------------------------------------------------------------------
------------------------------------------------------------------------------
| Robust
dlpackpc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dlavgprs | -1.202403 .1969433 -6.11 0.000 -1.599068 -.8057392
dlperinc | .4620299 .3093405 1.49 0.142 -.1610138 1.085074
_cons | .3665388 .1219126 3.01 0.004 .1209942 .6120834
------------------------------------------------------------------------------
Instrumented: dlavgprs
Instruments: dlperinc drtaxso drtax
------------------------------------------------------------------------------
------------------------------------------------------------------------------
e | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drtaxso | .0127669 .0061587 2.07 0.044 .000355 .0251789
drtax | -.0038077 .0021179 -1.80 0.079 -.008076 .0004607
dlperinc | -.0934062 .2978459 -0.31 0.755 -.6936752 .5068627
_cons | .002939 .0446131 0.07 0.948 -.0869728 .0928509
------------------------------------------------------------------------------
. test drtaxso drtax;
1/2/3-499
Check instrument relevance: compute first-stage F
(X = dlavgprs, Z1 = drtaxso, Z2 = drtax, W = dlperinc)
. reg dlavgprs drtaxso drtax dlperinc , r;
------------------------------------------------------------------------------
| Robust
dlavgprs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drtaxso | .013457 .0031405 4.28 0.000 .0071277 .0197863
drtax | .0075734 .0008859 8.55 0.000 .0057879 .0093588
dlperinc | -.0289943 .1242309 -0.23 0.817 -.2793654 .2213767
_cons | .4919733 .0183233 26.85 0.000 .4550451 .5289015
------------------------------------------------------------------------------
( 1) drtaxso = 0
( 2) drtax = 0
1/2/3-500
Tabular summary of these results:
1/2/3-501
How should we interpret the J-test rejection?
J-test rejects the null hypothesis that both the
instruments are exogenous
This means that either rtaxso is endogenous, or rtax is
endogenous, or both
The J-test doesn’t tell us which!! You must think!
Why might rtax (cig-only tax) be endogenous?
o Political forces: a history of smoking or lots of
smokers ⇒ political pressure for low cigarette taxes
o If so, the cig-only tax is endogenous
This reasoning doesn’t apply to the general sales tax
⇒ use just one instrument, the general sales tax
1/2/3-502
The Demand for Cigarettes:
Summary of Empirical Results
1/2/3-504
Remaining threats to internal validity, ctd.
Remaining simultaneous causality bias?
o Not if the general sales tax is a valid instrument:
relevance?
exogeneity?
Errors-in-variables bias? Interesting question: are we
accurately measuring the price actually paid? What
about cross-border sales?
Selection bias? (no, we have all the states)
1/2/3-507
SurvivalDaysi = β0 + β1CardCathi + ui
1/2/3-508
Z = differential distance to CC hospital
o Relevant? If a CC hospital is far away, the patient
won’t be taken there and won’t get CC
o Exogenous? If distance to the CC hospital doesn’t
affect survival, other than through its effect on
CardCathi, then corr(distance,ui) = 0, so distance is
exogenous
o If patients’ locations are random, then differential
distance is “as if” randomly assigned.
o The 1st stage is a linear probability model: distance
affects the probability of receiving treatment
Results (McClellan, McNeil, and Newhouse, JAMA, 1994):
o OLS estimates a significant and large effect of CC
o TSLS estimates a small, often insignificant effect
1/2/3-509
Summary: IV Regression
(SW Section 10.6)
A valid instrument lets us isolate a part of X that is
uncorrelated with u, and that part can be used to
estimate the effect of a change in X on Y
IV regression hinges on having valid instruments:
(1) Relevance: check via first-stage F
(2) Exogeneity: Test overidentifying restrictions
via the J-statistic
A valid instrument isolates variation in X that is “as if”
randomly assigned.
The critical requirement of at least m valid instruments
cannot be tested – you must use your head.
1/2/3-510
Experiments and Quasi-Experiments
(SW Chapter 11)
1/2/3-512
Different types of experiments: three examples
1/2/3-514
Idealized Experiments and Causal Effects
(SW Section 11.1)
An ideal randomized controlled experiment randomly
assigns subjects to treatment and control groups.
More generally, the treatment level X is randomly
assigned:
Yi = β0 + β1Xi + ui
1/2/3-516
Potential Problems with Experiments in Practice
(SW Section 11.2)
1/2/3-517
Threats to internal validity, ctd.
1/2/3-518
Threats to internal validity, ctd.
3. Experimental effects
experimenter bias (conscious or subconscious):
treatment X is associated with “extra effort” or
“extra care,” so corr(X,u) ≠ 0
subject behavior might be affected by being in an
experiment, so corr(X,u) ≠ 0 (Hawthorne effect)
1/2/3-520
Threats to External Validity
1. Nonrepresentative sample
2. Nonrepresentative “treatment” (that is, program or
policy)
3. General equilibrium effects (the effect of a program can
depend on its scale; admissions counseling example)
4. Treatment v. eligibility effects (which is it you want
to measure: the effect on those who take the program, or
the effect on those who are eligible?)
1/2/3-521
Regression Estimators of Causal Effects Using
Experimental Data
(SW Section 11.3)
Focus on the case that X is binary (treatment/control).
Often you observe subject characteristics, W1i,…,Wri.
Extensions of the differences estimator:
o can improve efficiency (reduce standard errors)
o can eliminate bias that arises when:
treatment and control groups differ
there is “conditional randomization”
there is partial compliance
These extensions involve methods we have already
seen – multiple regression, panel data, IV regression
1/2/3-522
Estimators of the Treatment Effect 1 using
Experimental Data (X = 1 if treated, 0 if control)
1/2/3-525
β̂1^diffs-in-diffs = (Ȳtreat,after – Ȳtreat,before) – (Ȳcontrol,after – Ȳcontrol,before)
1/2/3-526
The differences-in-differences estimator, ctd.
ΔYi = β0 + β1Xi + ui
where
ΔYi = Yi,after – Yi,before
Xi = 1 if treated, = 0 otherwise
1/2/3-527
The differences-in-differences estimator, ctd.
Yit = β0 + β1Dit + β2Git + β3Xit + uit
where
t = 1 (before experiment), 2 (after experiment)
Dit = 0 for t = 1, = 1 for t = 2
Git = 0 for control group, = 1 for treatment group
Xit = 1 if treated, = 0 otherwise
= Dit×Git = interaction effect of being in the treatment
group in the second period
β̂3 is the diffs-in-diffs estimator
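A quick simulated check of the diffs-in-diffs idea (all numbers invented for illustration): the treated and control groups start at different levels and share a common time trend, and the double difference of group means recovers the treatment effect, here set to 5 by construction:

```python
import random

random.seed(2)

def mean(a): return sum(a) / len(a)

# Simulated two-period data: treatment raises Y by 5 in period 2
# for the treated group; both groups share a common trend of +2.
def draw(n, base, trend, effect):
    before = [base + random.gauss(0, 1) for _ in range(n)]
    after = [base + trend + effect + random.gauss(0, 1) for _ in range(n)]
    return before, after

treat_b, treat_a = draw(400, base=10, trend=2, effect=5)   # treated group
ctrl_b, ctrl_a = draw(400, base=12, trend=2, effect=0)     # control group

# Diffs-in-diffs: (treated change) minus (control change)
did = (mean(treat_a) - mean(treat_b)) - (mean(ctrl_a) - mean(ctrl_b))
print(did)
```

Estimating the regression version with the Dit×Git interaction on the same data would give the identical number as the interaction coefficient, since with two binary regressors and their interaction the regression is saturated.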
1/2/3-528
Including additional subject characteristics (W’s)
1/2/3-530
Estimation when there is partial compliance
Consider diffs-in-diffs estimator, X = actual treatment
ΔYi = β0 + β1Xi + ui
Suppose there is partial compliance: some of the
treated don’t take the drug; some of the controls go to
job training anyway
Then X is correlated with u, and OLS is biased
Suppose initial assignment, Z, is random
Then (1) corr(Z,X) ≠ 0 and (2) corr(Z,u) = 0
Thus β1 can be estimated by TSLS, with instrumental
variable Z = initial assignment
This can be extended to W’s (included exog. variables)
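A Python sketch of this logic (the entire data-generating process is invented for illustration): assignment Z is random, but actual take-up X also depends on an unobserved “ability” term that enters the outcome, so OLS on actual treatment is biased while the Wald/TSLS estimator using Z recovers the true effect, set here to 4:

```python
import random

random.seed(3)

def mean(a): return sum(a) / len(a)

n = 40000
z = [random.random() < 0.5 for _ in range(n)]      # random initial assignment
abil = [random.gauss(0, 1) for _ in range(n)]      # unobserved "ability"
# Partial compliance: take-up depends on assignment AND on ability,
# so actual treatment X is correlated with the error term.
x = [1 if ((zi and ai > -1) or (not zi and ai > 1)) else 0
     for zi, ai in zip(z, abil)]
y = [1 + 4 * xi + ai + random.gauss(0, 1)          # true effect = 4
     for xi, ai in zip(x, abil)]

def sel(v, cond):
    return [vi for vi, c in zip(v, cond) if c]

# OLS on actual treatment (biased: compares self-selected groups)
b_ols = mean(sel(y, x)) - mean(sel(y, [1 - xi for xi in x]))

# TSLS with Z = initial assignment (the Wald estimator)
b_iv = (mean(sel(y, z)) - mean(sel(y, [not zi for zi in z]))) / \
       (mean(sel(x, z)) - mean(sel(x, [not zi for zi in z])))
print(b_ols, b_iv)
```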
1/2/3-531
Experimental Estimates of the Effect of Class-Size
Reduction: The Tennessee Class Size Experiment
(SW Section 11.4)
Project STAR (Student-Teacher Achievement Ratio)
4-year study, $12 million
Upon entering the school system, a student was
randomly assigned to one of three groups:
o regular class (22 – 25 students)
o regular class + aide
o small class (13 – 17 students)
regular class students re-randomized after first year to
regular or regular+aide
Y = Stanford Achievement Test scores
1/2/3-532
Deviations from experimental design
Partial compliance:
o 10% of students switched treatment groups because
of “incompatibility” and “behavior problems” – how
much of this was because of parental pressure?
o Newcomers: incomplete receipt of treatment for
those who move into district after grade 1
Attrition
o students move out of district
o students leave for private/religious schools
1/2/3-533
Regression analysis
1/2/3-534
Differences estimates (no W’s)
1/2/3-535
1/2/3-536
How big are these estimated effects?
Put on same basis by dividing by std. dev. of Y
Units are now standard deviations of test scores
1/2/3-537
How do these estimates compare to those from the
California, Mass. observational studies? (Ch. 4 – 7)
1/2/3-538
Summary: The Tennessee Class Size Experiment
Main findings:
The effects are small quantitatively (same size as
gender difference)
Effect is sustained but not cumulative or increasing
biggest effect at the youngest grades
1/2/3-539
What is the Difference Between a Control Variable
and the Variable of Interest?
(SW App. 11.3)
1/2/3-540
1/2/3-541
Example: “free lunch eligible,” ctd.
Coefficient on “free lunch eligible” is large, negative,
statistically significant
Policy interpretation: Making students ineligible for a
free school lunch will improve their test scores.
Why (precisely) can we interpret the coefficient on
SmallClass as an unbiased estimate of a causal effect,
but not the coefficient on “free lunch eligible”?
This is not an isolated example!
o Other “control variables” we have used: gender,
race, district income, state fixed effects, time fixed
effects, city (or state) population,…
What is a “control variable” anyway?
1/2/3-542
Simplest case: one X, one control variable W
Yi = β0 + β1Xi + β2Wi + ui
For example,
W = free lunch eligible (binary)
X = small class/large class (binary)
Suppose random assignment of X depends on W
o for example, 60% of free-lunch eligibles get a small
class, and 40% of ineligibles get a small class
o note: this wasn’t the actual STAR randomization
procedure – this is a hypothetical example
Further suppose W is correlated with u
1/2/3-543
Yi = β0 + β1Xi + β2Wi + ui
Suppose:
The control variable W is correlated with u
Given W = 0 (ineligible), X is randomly assigned
Given W = 1 (eligible), X is randomly assigned.
Then:
Given the value of W, X is randomly assigned;
That is, controlling for W, X is randomly assigned;
Thus, controlling for W, X is uncorrelated with u
Moreover, E(u|X,W) doesn’t depend on X
That is, we have conditional mean independence:
E(u|X,W) = E(u|W)
1/2/3-544
Implications of conditional mean independence
Yi = β0 + β1Xi + β2Wi + ui
1/2/3-545
Implications of conditional mean independence:
The conditional mean of Y given X and W is
(writing E(ui|Wi) = γ0 + γ1Wi):
E(Yi|Xi,Wi) = (β0+γ0) + β1Xi + (β2+γ1)Wi
The effect of a change in X under conditional mean
independence is the desired causal effect:
E(Yi|Xi = x+Δx,Wi) – E(Yi|Xi = x,Wi) = β1Δx
or
β1 = [E(Yi|Xi = x+Δx,Wi) – E(Yi|Xi = x,Wi)]/Δx
If X is binary (treatment/control), this becomes:
β1 = E(Yi|Xi = 1,Wi) – E(Yi|Xi = 0,Wi)
which is the desired treatment effect.
1/2/3-546
Implications of conditional mean independence, ctd.
Yi = β0 + β1Xi + β2Wi + ui
Then:
The OLS estimator β̂1 is unbiased.
β̂2 is not consistent and not meaningful.
The usual inference methods (standard errors,
hypothesis tests, etc.) apply to β̂1.
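A simulated illustration of this point (hypothetical numbers throughout): the probability of treatment depends on W, and W is correlated with u, yet within each value of W the treatment-control comparison recovers the causal effect of X (set to 3 here), while the raw gap associated with W has no causal interpretation:

```python
import random

random.seed(4)

def mean(a): return sum(a) / len(a)

n = 40000
w = [random.random() < 0.5 for _ in range(n)]          # e.g. free-lunch eligible
# Conditional randomization: P(small class) depends on W
x = [random.random() < (0.6 if wi else 0.4) for wi in w]
# W is correlated with the error (eligible students differ in omitted ways)
u = [random.gauss(-2 if wi else 0, 1) for wi in w]
y = [5 + 3 * (1 if xi else 0) + ui for xi, ui in zip(x, u)]  # causal effect of X = 3

def cell(yv, xi_val, wi_val):
    return mean([yi for yi, xi, wi in zip(yv, x, w)
                 if xi == xi_val and wi == wi_val])

# Given W, X is randomly assigned, so the within-W difference is causal:
effect_w0 = cell(y, True, False) - cell(y, False, False)
effect_w1 = cell(y, True, True) - cell(y, False, True)
# The raw W "effect" mixes the coefficient on W (here 0) with corr(W, u):
w_gap = cell(y, False, True) - cell(y, False, False)
print(effect_w0, effect_w1, w_gap)
```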
1/2/3-547
So, what is a control variable?
Two cases:
(a) Treatment (X) is “as if” randomly assigned (OLS)
(b) A variable (Z) that influences treatment (X) is
“as if” randomly assigned (IV)
1/2/3-552
Two types of quasi-experiments
1/2/3-564
OLS with Heterogeneous Causal Effects
1/2/3-565
The math: suppose X is binary and E(ui|Xi) = 0.
Then
β̂1 = Ȳtreated – Ȳcontrol
For the treated:
E(Yi|Xi=1) = β0 + E(β1iXi|Xi=1) + E(ui|Xi=1)
= β0 + E(β1i|Xi=1)
For the controls:
E(Yi|Xi=0) = β0 + E(β1iXi|Xi=0) + E(ui|Xi=0)
= β0
Thus:
β̂1 →p E(Yi|Xi=1) – E(Yi|Xi=0) = E(β1i|Xi=1)
= average effect of the treatment on the treated
1/2/3-566
OLS with heterogeneous treatment effects: general X
with E(ui|Xi) = 0

    β̂1 = s_XY/s_X² →p σ_XY/σ_X² = cov(β0 + β1iXi + ui, Xi)/var(Xi)
    = [cov(β0,Xi) + cov(β1iXi,Xi) + cov(ui,Xi)]/var(Xi)
    = cov(β1iXi,Xi)/var(Xi)      (because cov(ui,Xi) = 0)

If X is binary, this simplifies to the “effect of
treatment on the treated”
Without heterogeneity, β1i = β1 and β̂1 →p β1
In general, the treatment effects of individuals with
large values of X are given the most weight
1/2/3-567
(b) Now make a stronger assumption: that X is randomly
assigned (experiment or quasi-experiment). Then
what does OLS actually estimate?
If Xi is randomly assigned, it is distributed
independently of β1i, so there is no difference
between the population of controls and the
population in the treatment group
Thus the effect of treatment on the treated = the
average treatment effect in the population.
1/2/3-568
The math:

    β̂1 →p cov(β1iXi, Xi)/var(Xi) = E[cov(β1iXi, Xi | β1i)]/var(Xi)
    = E[β1i cov(Xi, Xi)]/var(Xi) = E[β1i var(Xi)]/var(Xi)
    = E(β1i)
Summary
If Xi and β1i are independent (Xi is randomly
assigned), OLS estimates the average treatment effect.
If Xi is not randomly assigned but E(ui|Xi) = 0, OLS
estimates the effect of treatment on the treated.
Without heterogeneity, the effect of treatment on the
treated and the average treatment effect are the same.
1/2/3-569
IV Regression with Heterogeneous Causal Effects
1/2/3-570
IV with heterogeneous causal effects, ctd.
Intuition:
Suppose the π1i’s were known. If for some people π1i =
0, then their predicted value of Xi wouldn’t depend
on Z, so the IV estimator would ignore them.
The IV estimator puts most of the weight on
individuals for whom Z has a large influence on X.
TSLS measures the treatment effect for those whose
probability of treatment is most influenced by Z.
1/2/3-571
The math…
Yi = β0 + β1iXi + ui (equation of interest)
Xi = π0 + π1iZi + vi (first stage of TSLS)

    β̂1^TSLS →p E(π1iβ1i)/E(π1i)

TSLS estimates the average causal effect (that is,
β̂1^TSLS →p E(β1i)) if:
o β1i and π1i are independent
o β1i = β1 (no heterogeneity in the equation of interest)
o π1i = π1 (no heterogeneity in the first stage equation)
But in general β̂1^TSLS does not estimate E(β1i)!
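A simulation consistent with this result (all distributions invented for illustration): when β1i and π1i are positively correlated, TSLS converges to the weighted average E(π1iβ1i)/E(π1i), which here exceeds the average causal effect E(β1i) = 2:

```python
import random

random.seed(5)

def mean(a): return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

n = 100000
z = [random.gauss(0, 1) for _ in range(n)]
# Heterogeneous first-stage and causal coefficients, positively correlated:
pi1 = [random.uniform(0.1, 1.9) for _ in range(n)]     # E(pi1) = 1
b1 = [2 + (p - 1) for p in pi1]                        # E(b1) = 2, corr(b1, pi1) > 0
x = [p * zi + random.gauss(0, 1) for p, zi in zip(pi1, z)]
y = [b * xi + random.gauss(0, 1) for b, xi in zip(b1, x)]

tsls = cov(y, z) / cov(x, z)                           # IV with a single Z
weighted = mean([p * b for p, b in zip(pi1, b1)]) / mean(pi1)  # E(pi1*b1)/E(pi1)
print(tsls, weighted, mean(b1))
```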
1/2/3-574
Example: Cardiac catheterization
Yi = survival time (days) for AMI patients
Xi = received cardiac catheterization (or not)
Zi = differential distance to CC hospital
Equation of interest:
SurvivalDaysi = β0 + β1iCardCathi + ui
First stage (linear probability model):
CardCathi = π0 + π1iDistancei + vi
1/2/3-577
OLS with Heterogeneous Causal Effects

X is:    Relation between Xi and ui:    Then OLS estimates:
binary   E(ui|Xi) = 0                   effect of treatment on the
                                        treated: E(β1i|Xi=1)
         X randomly assigned (so        average causal effect E(β1i)
         Xi and ui are independent)
general  E(ui|Xi) = 0                   weighted average of β1i,
                                        placing most weight on
                                        those with large |Xi – X̄|
         X randomly assigned            average causal effect E(β1i)

Without heterogeneity, β1i = β1 and β̂1 →p β1 in all these
cases.
1/2/3-578
TSLS with Heterogeneous Causal Effects
TSLS estimates the causal effect for those individuals
for whom Z is most influential (those with large π1i).
What TSLS estimates depends on the choice of Z!!
In CC example, these were the individuals for whom
the decision to drive to a CC lab was heavily
influenced by the extra distance (those patients for
whom the EMT was otherwise “on the fence”)
Thus TSLS also estimates a causal effect: the average
effect of treatment on those most influenced by the
instrument
o In general, this is neither the average causal effect
nor the effect of treatment on the treated
1/2/3-579
Summary: Experiments and Quasi-Experiments
(SW Section 11.8)
Experiments:
Average causal effects are defined as expected values
of ideal randomized controlled experiments
Actual experiments have threats to internal validity
These threats to internal validity can be addressed (in
part) by:
o panel methods (differences-in-differences)
o multiple regression
o IV (using initial assignment as an instrument)
1/2/3-580
Summary, ctd.
Quasi-experiments:
Quasi-experiments have an “as-if” randomly assigned
source of variation.
This as-if random variation can generate:
o Xi which satisfies E(ui|Xi) = 0 (so estimation
proceeds using OLS); or
o instrumental variable(s) which satisfy E(ui|Zi) = 0
(so estimation proceeds using TSLS)
Quasi-experiments also have threats to internal validity
1/2/3-581
Summary, ctd.
1/2/3-583
Introduction to Time Series Regression and
Forecasting
(SW Chapter 12)
1/2/3-585
Example #2: US rate of unemployment
1/2/3-586
Why use time series data?
To develop forecasting models
o What will the rate of inflation be next year?
To estimate dynamic causal effects
o If the Fed increases the Federal Funds rate now,
what will be the effect on the rates of inflation and
unemployment in 3 months? in 12 months?
o What is the effect over time on cigarette
consumption of a hike in the cigarette tax?
Plus, sometimes you don’t have any choice…
o Rates of inflation and unemployment in the US can
be observed only over time.
1/2/3-587
Time series data raises new technical issues
Time lags
Correlation over time (serial correlation or
autocorrelation)
Forecasting models that have no causal interpretation
(specialized tools for forecasting):
o autoregressive (AR) models
o autoregressive distributed lag (ADL) models
Conditions under which dynamic effects can be
estimated, and how to estimate them
Calculation of standard errors when the errors are
serially correlated
1/2/3-588
Using Regression Models for Forecasting
(SW Section 12.1)
1/2/3-591
Example: Quarterly rate of inflation at an annual rate
CPI in the first quarter of 1999 (1999:I) = 164.87
CPI in the second quarter of 1999 (1999:II) = 166.03
Percentage change in CPI, 1999:I to 1999:II
= 100×(166.03 – 164.87)/164.87 = 100×(1.16/164.87) = 0.703%
Percentage change in CPI, 1999:I to 1999:II, at an
annual rate = 4×0.703 = 2.81% (percent per year)
Like interest rates, inflation rates are (as a matter of
convention) reported at an annual rate.
Using the logarithmic approximation to percent changes
yields 4×100×[log(166.03) – log(164.87)] = 2.80%
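The arithmetic on this slide can be checked directly:

```python
import math

cpi_q1, cpi_q2 = 164.87, 166.03                  # CPI, 1999:I and 1999:II
pct = 100 * (cpi_q2 - cpi_q1) / cpi_q1           # quarterly percent change
annual = 4 * pct                                 # at an annual rate
# Logarithmic approximation to the annualized percent change
annual_log = 4 * 100 * (math.log(cpi_q2) - math.log(cpi_q1))
print(pct, annual, annual_log)
```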
1/2/3-592
Example: US CPI inflation – its first lag and its change
CPI = Consumer price index (Bureau of Labor Statistics)
1/2/3-593
Autocorrelation
The j-th sample autocorrelation is

    ρ̂j = côv(Yt, Yt–j) / vâr(Yt)

where

    côv(Yt, Yt–j) = [1/(T – j – 1)] Σ_{t=j+1}^{T} (Yt – Ȳ_{j+1,T})(Yt–j – Ȳ_{1,T–j})
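A sketch of this sample autocorrelation in Python (the covariance uses separate means for the two overlapping subsamples, as in the formula above), applied to a simulated AR(1) with coefficient 0.85, so ρ̂1 should come out near 0.85:

```python
import random

random.seed(6)

def mean(a): return sum(a) / len(a)

def autocorr(y, j):
    """j-th sample autocorrelation, as defined on the slide."""
    T = len(y)
    m_late = mean(y[j:])       # mean of Y_{j+1},...,Y_T
    m_early = mean(y[:T - j])  # mean of Y_1,...,Y_{T-j}
    m_all = mean(y)
    c = sum((y[t] - m_late) * (y[t - j] - m_early)
            for t in range(j, T)) / (T - j - 1)
    v = sum((yt - m_all) ** 2 for yt in y) / (T - 1)
    return c / v

# Simulate an AR(1): Y_t = 0.85*Y_{t-1} + u_t
y = [0.0]
for _ in range(20000):
    y.append(0.85 * y[-1] + random.gauss(0, 1))
print(autocorr(y, 1))
```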
1/2/3-597
The inflation rate is highly serially correlated (ρ̂1 = .85)
Last quarter’s inflation rate contains much information
about this quarter’s inflation rate
The plot is dominated by multiyear swings
But there are still surprise movements!
1/2/3-598
More examples of time series & transformations
1/2/3-599
More examples of time series & transformations, ctd.
1/2/3-600
Stationarity: a key idea for external validity of time
series regression
Stationarity says that the past is like the present and
the future, at least in a probabilistic sense.
1/2/3-601
Autoregressions
(SW Section 12.3)
Yt = β0 + β1Yt–1 + ut
1/2/3-603
Example: AR(1) model of the change in inflation
Estimated using data from 1962:I – 1999:IV:

    ΔÎnft = 0.02 – 0.211ΔInft–1,  R̄² = 0.04
           (0.14)  (0.106)
First, let STATA know you are using time series data:
. generate time=q(1959q1)+_n-1;   /* _n is the observation no. */
So this command creates a new variable time that has a
special quarterly date format
1/2/3-605
Example: AR(1) model of inflation – STATA, ctd.
. gen lcpi = log(cpi);   /* variable cpi is already in memory */
1/2/3-606
Example: AR(1) model of inflation – STATA, ctd
Syntax: L.dinf is the first lag of dinf
------------------------------------------------------------------------------
| Robust
dinf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dinf |
L1 | -.2109525 .1059828 -1.99 0.048 -.4203645 -.0015404
_cons | .0188171 .1350643 0.14 0.889 -.2480572 .2856914
------------------------------------------------------------------------------
if tin(1962q1,1999q4)
STATA time series syntax for using only observations between 1962q1 and
1999q4 (inclusive).
This requires defining the time scale first, as we did above
1/2/3-607
Forecasts and forecast errors
A note on terminology:
A predicted value refers to the value of Y predicted
(using a regression) for an observation in the sample
used to estimate the regression – this is the usual
definition
A forecast refers to the value of Y forecasted for an
observation not in the sample used to estimate the
regression.
Predicted values are “in sample”
Forecasts are forecasts of the future – which cannot
have been used to estimate the regression.
1/2/3-608
Forecasts: notation
Y_{t|t–1} = forecast of Yt based on Yt–1,Yt–2,…, using the
population (true unknown) coefficients
Ŷ_{t|t–1} = forecast of Yt based on Yt–1,Yt–2,…, using the
estimated coefficients, which were estimated using
data through period t–1.
For an AR(1),
Y_{t|t–1} = β0 + β1Yt–1
Ŷ_{t|t–1} = β̂0 + β̂1Yt–1, where β̂0 and β̂1 were estimated
using data through period t–1.
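A minimal sketch of this AR(1) forecasting recipe on simulated data (the coefficients 1.0 and 0.5 are invented for illustration): estimate β̂0 and β̂1 by OLS of Yt on Yt–1, then form the one-step-ahead forecast Ŷ_{T+1|T} = β̂0 + β̂1·Y_T:

```python
import random

random.seed(7)

def mean(a): return sum(a) / len(a)

# Simulate an AR(1): Y_t = 1.0 + 0.5*Y_{t-1} + u_t
y = [2.0]
for _ in range(5000):
    y.append(1.0 + 0.5 * y[-1] + random.gauss(0, 1))

# Estimate beta0, beta1 by OLS of Y_t on Y_{t-1}
ylag, ycur = y[:-1], y[1:]
ml, mc = mean(ylag), mean(ycur)
b1 = sum((a - ml) * (b - mc) for a, b in zip(ylag, ycur)) / \
     sum((a - ml) ** 2 for a in ylag)
b0 = mc - b1 * ml

# One-step-ahead forecast: Yhat_{T+1|T} = b0hat + b1hat * Y_T
forecast = b0 + b1 * y[-1]
print(b0, b1, forecast)
```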
1/2/3-609
Forecast errors
1/2/3-610
The root mean squared forecast error (RMSFE)
1/2/3-611
Example: forecasting inflation using an AR(1)
ΔÎnf_{2000:I|1999:IV} = –0.1, so
Înf_{2000:I|1999:IV} = Inf_{1999:IV} + ΔÎnf_{2000:I|1999:IV}
= 3.2 – 0.1 = 3.1
1/2/3-612
The pth-order autoregressive model (AR(p))
1/2/3-613
Example: AR(4) model of inflation

    ΔÎnft = .02 – .21ΔInft–1 – .32ΔInft–2 + .19ΔInft–3 – .04ΔInft–4,  R̄² = 0.21
           (.12)  (.10)        (.09)        (.09)        (.10)
1/2/3-614
Example: AR(4) model of inflation – STATA
. reg dinf L(1/4).dinf if tin(1962q1,1999q4), r;
------------------------------------------------------------------------------
| Robust
dinf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dinf |
L1 | -.2078575 .09923 -2.09 0.038 -.4039592 -.0117558
L2 | -.3161319 .0869203 -3.64 0.000 -.4879068 -.144357
L3 | .1939669 .0847119 2.29 0.023 .0265565 .3613774
L4 | -.0356774 .0994384 -0.36 0.720 -.2321909 .1608361
_cons | .0237543 .1239214 0.19 0.848 -.2211434 .268652
------------------------------------------------------------------------------
NOTES
1/2/3-615
Example: AR(4) model of inflation – STATA, ctd.
. test L2.dinf L3.dinf L4.dinf; L2.dinf is the second lag of dinf, etc.
( 1) L2.dinf = 0.0
( 2) L3.dinf = 0.0
( 3) L4.dinf = 0.0
F( 3, 147) = 6.43
Prob > F = 0.0004
1/2/3-616
Digression: we used ΔInf, not Inf, in the AR’s. Why?
The AR(1) model of ΔInft is:
    ΔInft = β0 + β1ΔInft–1 + ut
or
    Inft – Inft–1 = β0 + β1(Inft–1 – Inft–2) + ut
or
    Inft = Inft–1 + β0 + β1Inft–1 – β1Inft–2 + ut
so
    Inft = β0 + (1+β1)Inft–1 – β1Inft–2 + ut
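The algebra above can be checked numerically: simulate a series whose change follows an AR(1) with coefficient β1 = –0.2, fit an AR(2) in levels, and the level coefficients should come out near (1+β1, –β1) = (0.8, 0.2). A Python sketch (simulated data; 2×2 normal equations solved by hand):

```python
import random

random.seed(8)

def mean(a): return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

# Simulate: the CHANGE in Y follows an AR(1) with coefficient -0.2
beta1 = -0.2
y = [0.0, 0.0]
for _ in range(50000):
    dy = beta1 * (y[-1] - y[-2]) + random.gauss(0, 1)
    y.append(y[-1] + dy)

# Fit AR(2) in LEVELS by OLS: Y_t on Y_{t-1}, Y_{t-2}
yt, y1, y2 = y[2:], y[1:-1], y[:-2]
a11, a12, a22 = cov(y1, y1), cov(y1, y2), cov(y2, y2)
c1, c2 = cov(yt, y1), cov(yt, y2)
det = a11 * a22 - a12 * a12
d1 = (a22 * c1 - a12 * c2) / det   # coefficient on Y_{t-1}
d2 = (a11 * c2 - a12 * c1) / det   # coefficient on Y_{t-2}
print(d1, d2)   # the algebra says these should be near (0.8, 0.2)
```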
Example: ADL(4,4) model of inflation (coefficients from the
STATA output below):
    ΔÎnft = 1.32 – .36ΔInft–1 – .34ΔInft–2 + .07ΔInft–3 – .03ΔInft–4
           (.47)   (.09)        (.10)        (.08)        (.09)
            – 2.68Unemt–1 + 3.43Unemt–2 – 1.04Unemt–3 + .07Unemt–4
             (.47)          (.89)         (.89)         (.44)
1/2/3-622
Example: dinf and unem – STATA
------------------------------------------------------------------------------
| Robust
dinf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dinf |
L1 | -.3629871 .0926338 -3.92 0.000 -.5460956 -.1798786
L2 | -.3432017 .100821 -3.40 0.001 -.5424937 -.1439096
L3 | .0724654 .0848729 0.85 0.395 -.0953022 .240233
L4 | -.0346026 .0868321 -0.40 0.691 -.2062428 .1370377
unem |
L1 | -2.683394 .4723554 -5.68 0.000 -3.617095 -1.749692
L2 | 3.432282 .889191 3.86 0.000 1.674625 5.189939
L3 | -1.039755 .8901759 -1.17 0.245 -2.799358 .719849
L4 | .0720316 .4420668 0.16 0.871 -.8017984 .9458615
_cons | 1.317834 .4704011 2.80 0.006 .3879961 2.247672
------------------------------------------------------------------------------
1/2/3-623
Example: ADL(4,4) model of inflation – STATA, ctd.
( 1) L2.dinf = 0.0
( 2) L3.dinf = 0.0
( 3) L4.dinf = 0.0
( 1) L.unem = 0.0
( 2) L2.unem = 0.0
( 3) L3.unem = 0.0
( 4) L4.unem = 0.0
The null hypothesis that the coefficients on the lags of the unemployment
rate are all zero is rejected at the 1% significance level using the F-
statistic
1/2/3-624
The test of the joint hypothesis that none of the X’s is a
useful predictor, above and beyond lagged values of Y, is
called a Granger causality test
For example:
The effect of an increase in cigarette taxes on cigarette
consumption this year, next year, in 5 years;
The effect of a change in the Fed Funds rate on
inflation, this month, in 6 months, and 1 year;
The effect of a freeze in Florida on the price of orange
juice concentrate in 1 month, 2 months, 3 months…
1/2/3-627
The Orange Juice Data
(SW Section 13.1)
Data
Monthly, Jan. 1950 – Dec. 2000 (T = 612)
Price = price of frozen OJ (a sub-component of the
producer price index; US Bureau of Labor Statistics)
%ChgP = percentage change in price at an annual rate,
so %ChgPt = 1200Δln(Pricet)
FDD = number of freezing degree-days during the
month, recorded in Orlando FL
o Example: If November has 2 days with low temp <
32°, one at 30° and one at 25°, then FDDNov = 2 + 7 = 9
1/2/3-628
1/2/3-629
Initial OJ regression
%ĈhgPt = –.40 + .47FDDt
        (.22)  (.13)
1/2/3-630
Dynamic Causal Effects
(SW Section 13.2)
1/2/3-632
An alternative thought experiment:
Randomly give the same subject different treatments
(FDDt) at different times
Measure the outcome variable (%ChgPt)
The “population” of subjects consists of the same
subject (OJ market) but at different dates
If the “different subjects” are drawn from the same
distribution – that is, if Yt, Xt are stationary – then the
dynamic causal effect can be estimated by an OLS
regression of Yt on current and lagged values of Xt.
This estimator (regression of Yt on Xt and lags of Xt) is
called the distributed lag estimator.
1/2/3-633
Dynamic causal effects and the distributed lag model
The distributed lag model is:
    Yt = β0 + β1Xt + β2Xt–1 + … + βr+1Xt–r + ut
1/2/3-639
Heteroskedasticity and Autocorrelation-Consistent
(HAC) Standard Errors
(SW Section 13.4)
1/2/3-640
The math…
Yt = β0 + β1Xt + ut

    β̂1 = [ (1/T)Σ_{t=1}^{T}(Xt – X̄)(Yt – Ȳ) ] / [ (1/T)Σ_{t=1}^{T}(Xt – X̄)² ]

so…
1/2/3-641
    β̂1 – β1 = [ (1/T)Σ_{t=1}^{T}(Xt – X̄)ut ] / [ (1/T)Σ_{t=1}^{T}(Xt – X̄)² ]   (this is SW App. 4.3)

so, letting vt = (Xt – X̄)ut,

    β̂1 – β1 ≅ [ (1/T)Σ_{t=1}^{T}vt ] / σ²_X    in large samples

and

    var(β̂1) = var( (1/T)Σ_{t=1}^{T}vt ) / (σ²_X)²   (still SW App. 4.3)

Consider T = 2:

    var( (1/T)Σ_{t=1}^{T}vt ) = var[½(v1+v2)]
    = ¼[var(v1) + var(v2) + 2cov(v1,v2)]
1/2/3-643
so

    var( (1/2)Σ_{t=1}^{2}vt ) = ¼[var(v1) + var(v2) + 2cov(v1,v2)]
    = ½σ²_v + ½ρ1σ²_v        (ρ1 = corr(v1,v2))
    = ½σ²_v × f2, where f2 = 1 + ρ1

In general,

    var( (1/T)Σ_{t=1}^{T}vt ) = (σ²_v/T) × fT

so

    var(β̂1) = (1/T) × [σ²_v/(σ²_X)²] × fT

where

    fT = 1 + 2 Σ_{j=1}^{T–1} ((T–j)/T) ρj

The OLS SEs are off by the factor fT (which can be big!)
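The correction factor fT is easy to compute once you have the autocorrelations ρj of vt. A small helper (hypothetical, for illustration; autocorrelations beyond those supplied are taken to be zero):

```python
def f_T(rhos, T):
    """Correction factor f_T = 1 + 2*sum_{j=1}^{T-1} ((T-j)/T) * rho_j.
    rhos[j-1] holds rho_j; autocorrelations beyond len(rhos) are treated as 0."""
    return 1 + 2 * sum(((T - j) / T) * rhos[j - 1]
                       for j in range(1, min(T, len(rhos) + 1)))

# With no serial correlation in v_t, OLS SEs are fine: f_T = 1
print(f_T([0.0], 100))
# With rho_1 = 0.5 (higher autocorrelations zero), the OLS variance
# is off by nearly a factor of 2
print(f_T([0.5], 100))
# T = 2 reproduces the slide's f_2 = 1 + rho_1
print(f_T([0.5], 2))
```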
1/2/3-645
HAC Standard Errors
------------------------------------------------------------------------------
| Robust
dlpoj | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
l1fdd | .1529217 .0767206 1.99 0.047 .0022532 .3035903
_cons | -.2097734 .2071122 -1.01 0.312 -.6165128 .196966
------------------------------------------------------------------------------
------------------------------------------------------------------------------
| Newey-West
dlpoj | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
l1fdd | .1529217 .0781195 1.96 0.051 -.000494 .3063375
_cons | -.2097734 .2402217 -0.87 0.383 -.6815353 .2619885
------------------------------------------------------------------------------
OK, in this case the difference is small, but not always so!
1/2/3-649
Example: OJ and HAC estimators in STATA, ctd.
. global lfdd6 "fdd l1fdd l2fdd l3fdd l4fdd l5fdd l6fdd";
------------------------------------------------------------------------------
| Newey-West
dlpoj | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fdd | .4693121 .1359686 3.45 0.001 .2022834 .7363407
l1fdd | .1430512 .0837047 1.71 0.088 -.0213364 .3074388
l2fdd | .0564234 .0561724 1.00 0.316 -.0538936 .1667404
l3fdd | .0722595 .0468776 1.54 0.124 -.0198033 .1643223
l4fdd | .0343244 .0295141 1.16 0.245 -.0236383 .0922871
l5fdd | .0468222 .0308791 1.52 0.130 -.0138212 .1074657
l6fdd | .0481115 .0446404 1.08 0.282 -.0395577 .1357807
_cons | -.6505183 .2336986 -2.78 0.006 -1.109479 -.1915578
------------------------------------------------------------------------------
1/2/3-650
Do I need to use HAC SEs when I estimate an AR or an ADL model?
NO.
The problem to which HAC SEs are the solution arises when ut is serially correlated.
If ut is serially uncorrelated, then OLS SEs are fine.
In AR and ADL models, the errors are serially uncorrelated if you have included enough lags of Y.
o If you include enough lags of Y, then the error term cannot be predicted using past Y, or equivalently using past u – so u is serially uncorrelated.
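The point above can be checked numerically. A minimal Python sketch (NumPy assumed; the AR(1) process and its coefficient are illustrative): fitting a correctly specified AR(1) by OLS leaves residuals with essentially no serial correlation, so ordinary OLS SEs suffice.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000
y = np.zeros(T)
for t in range(1, T):                 # true AR(1) process (assumed for the demo)
    y[t] = 0.6 * y[t - 1] + rng.standard_normal()

# OLS of y_t on (1, y_{t-1}): the correctly specified AR(1)
X = np.column_stack([np.ones(T - 1), y[:-1]])
beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
u = y[1:] - X @ beta                  # residuals
r1 = np.corrcoef(u[1:], u[:-1])[0, 1] # first autocorrelation of residuals
print(round(r1, 3))                   # essentially zero
```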
1/2/3-651
Estimation of Dynamic Causal Effects with Strictly
Exogenous Regressors
(SW Section 13.5)
X is strictly exogenous if $E(u_t \mid \ldots, X_{t+1}, X_t, X_{t-1}, \ldots) = 0$
If X is strictly exogenous, there are more efficient
ways to estimate dynamic causal effects than by a
distributed lag regression.
o Generalized Least Squares (GLS)
o Autoregressive Distributed Lag (ADL)
But strict exogeneity is a very strong condition, and it is rarely plausible in practice.
So we won’t cover GLS or ADL estimation of
dynamic causal effects (Section 13.5 is optional)
1/2/3-652
Analysis of the OJ Price Data
(SW Section 13.6)
What r (the number of lags in the distributed lag regression) to use?
How about 18? (Goldilocks method)
What m (Newey-West truncation parameter) to use?
$m = 0.75\,T^{1/3} = 0.75 \times 612^{1/3} = 6.4 \approx 7$
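The rule of thumb can be checked directly (Python sketch; T = 612 is as implied by the calculation in the text):

```python
# Newey-West truncation parameter via the rule of thumb in the text:
# m = 0.75 * T**(1/3), rounded up to the next integer.
import math

T = 612                     # sample size implied by the text's calculation
m = 0.75 * T ** (1 / 3)
print(round(m, 1), "->", math.ceil(m))  # 6.4 -> 7
```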
1/2/3-653
These dynamic multipliers were estimated using a
distributed lag model. Should we attempt to obtain more
efficient estimates using GLS or an ADL model?
1/2/3-658
When Can You Estimate Dynamic Causal Effects?
That is, When is Exogeneity Plausible?
(SW Section 13.7)
Examples:
1. Y = OJ prices, X = FDD in Orlando
2. Y = Australian exports, X = US GDP (effect of US
income on demand for Australian exports)
1/2/3-659
Examples, ctd.
1/2/3-660
Exogeneity, ctd.
1/2/3-661
Estimation of Dynamic Causal Effects: Summary
(SW Section 13.8)
1/2/3-662