
Econ 140 (Spring 2018) – Section 4 & 5


GSI: Caroline, Chris, Jimmy, Kaushiki, Leah

Objective: We construct a simple linear model to capture a (causal) relationship between two variables X and Y. For example, we may want to know how much life expectancy increases when income increases by 1%. We also discuss how to conduct statistical inference, such as building confidence intervals and testing hypotheses. For example, we may want to test whether more income increases life expectancy or not. As you will see, how to conduct inference depends on an assumption about the variance structure.

1 Ordinary Least Squares (OLS)


Recall that last week we introduced the simple linear regression model (SLR) with parameters $\beta_0$ and $\beta_1$ as shown below:

$$y_i = \beta_0 + \beta_1 x_i + u_i \qquad (1)$$

The equation above is the population regression function. We estimate the population regression function by collecting data and calculating the sample regression:

$$y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat u_i$$

These estimators from our sample data, $\hat\beta_0$ and $\hat\beta_1$, are our "guess" of the true population parameters. Ordinary least squares (OLS) is a method for estimating the unknown parameters in a linear regression model.

Using our sample data we find the estimators

$$\hat\beta_1 = \frac{\widehat{\mathrm{Cov}}(x_i, y_i)}{\widehat{\mathrm{Var}}(x_i)}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X$$
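To make these formulas concrete, here is a minimal Stata sketch (Stata is used for the worked example in Section 4 below) that computes both estimators "by hand" from the sample covariance and variance and checks them against the built-in regress command. The variable names y and x are placeholders for whatever dataset you have loaded.

    * A sketch, assuming variables named y and x are already in memory
    quietly correlate y x, covariance
    scalar b1 = r(cov_12) / r(Var_2)    // sample Cov(x,y) / sample Var(x)
    quietly summarize y
    scalar ybar = r(mean)
    quietly summarize x
    scalar b0 = ybar - b1 * r(mean)     // intercept: ybar - b1 * xbar
    display "by hand: b1 = " b1 ",  b0 = " b0
    regress y x                         // coefficients should match the scalars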

We thank previous GSIs for the great section material they built over time. These section notes are heavily based
on their previous work.

In order for our estimators $\hat\beta_0$ and $\hat\beta_1$ to be unbiased and efficient (relative to any other linear unbiased estimator $\tilde\beta_1$), we need to make some assumptions about the population that the sample data are drawn from. Unbiased means $E[\hat\beta_1] = \beta_1$; efficient means $\mathrm{Var}(\hat\beta_1) \le \mathrm{Var}(\tilde\beta_1)$.

1.1 OLS Assumptions:


1. The mean of the conditional distribution of the errors $u_i$ given $x_i$ is zero: $E[u_i | x_i] = 0$.

2. Random sampling: $(y_i, x_i)_{i=1}^n$ are independently and identically distributed (i.i.d.).

3. Large outliers are unlikely: $0 < E[x_i^4] < \infty$ and $0 < E[y_i^4] < \infty$ (this is necessary for asymptotic normality only; you need it in order to use the Central Limit Theorem).

Other assumption: $\mathrm{Var}(u_i | x_1, \dots, x_n) = E[u_i^2 | x_1, \dots, x_n] = \sigma_u^2$, or homoskedasticity (this assumption is needed only to prove efficiency).

1.2 Measures of Fit


Before we have a look at the measures of fit, let's define a few relevant terms:

• Total sum of squares: $TSS = \sum_i (y_i - \bar y)^2$, the sum of squares of the deviations of the actual values of $y$ from the sample average.

• Explained sum of squares: $ESS = \sum_i (\hat y_i - \bar y)^2$, the sum of squares of the deviations of the predicted values of $y$ from the sample average.

• Sum of squared residuals: $SSR = \sum_i (y_i - \hat y_i)^2 = \sum_i \hat u_i^2$, the sum of squares of the deviations of the actual values of $y$ from the predicted values.

Notice that:

$$\begin{aligned} TSS &= \sum_i (y_i - \bar y)^2 \\ &= \sum_i (y_i - \hat y_i + \hat y_i - \bar y)^2 \\ &= \sum_i (y_i - \hat y_i)^2 + \sum_i (\hat y_i - \bar y)^2 + 2 \sum_i (y_i - \hat y_i)(\hat y_i - \bar y) \\ &= SSR + ESS + 2 \sum_i \hat u_i \hat y_i \\ &= SSR + ESS \end{aligned}$$

using the facts that $\frac{1}{n} \sum_i \hat u_i = 0$, $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$, and $\sum_i \hat u_i x_i = 0$.
How well does our best-fit line predict the variation in the data?

1. R-squared: $R^2$ is the ratio of the variation predicted by the model (i.e., by $X$) to the variation in the outcome variable ($Y$). It is a measure of the goodness of fit.

$$R^2 = \frac{\sum_i (\hat y_i - \bar y)^2}{\sum_i (y_i - \bar y)^2} = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$$

2. Standard Error of the Regression: $\sigma_u^2$ is the variance of the error term $u$, the unexplained part of the dependent variable $Y$. We estimate $\sigma_u$ with the standard error of the regression (SER), using the following formula:

$$SER = s_{\hat u} = \sqrt{s_{\hat u}^2} = \sqrt{\frac{\sum_i \hat u_i^2}{N - 2}} = \sqrt{\frac{SSR}{N - 2}}$$

2 Homoskedasticity vs Heteroskedasticity

1. Homoskedasticity: $\mathrm{Var}(u_i | X_i) = \sigma_u^2$ for all $i$.

So, the variance of the conditional distribution of $u_i$ given the $X_i$'s is a constant for all $i = 1, \dots, n$.

2. Heteroskedasticity: The variance of the conditional distribution of $u_i$ given the $X_i$'s is NOT constant. For example, the errors are heteroskedastic if $\sigma_i^2 = \sigma_u^2 X_i^2$: the variance of $u_i$ changes from observation to observation because it is scaled by $X_i^2$. Here is a visual example.

Figure 1: The blue line is the population regression line, $Y = 1 + 0.5X$. The left panel shows simulated data under the assumption of homoskedasticity, $\sigma_i^2 = 1$. The right panel shows simulated data under the assumption of heteroskedasticity of the form $\sigma_i^2 = X_i^2$. The sample size is 200 in each panel.
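You can generate data like those in Figure 1 yourself. A simulation sketch in Stata; the specific design (X uniform on [0, 5], normal errors) is an assumption chosen for illustration:

    * Sketch: simulate data around Y = 1 + 0.5X with the two error structures
    clear
    set seed 140
    set obs 200
    generate double x     = runiform(0, 5)
    generate double y_hom = 1 + 0.5*x + rnormal(0, 1)     // Var(u|x) = 1
    generate double y_het = 1 + 0.5*x + rnormal(0, 1)*x   // Var(u|x) = x^2
    * right panel of Figure 1; plot y_hom instead of y_het for the left panel
    twoway (scatter y_het x) (function y = 1 + 0.5*x, range(0 5))

In the heteroskedastic plot the scatter fans out as x grows, exactly because the error standard deviation is proportional to x.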

2.1 Variance Estimation - Homoskedasticity case


The variances of our estimators $\hat\beta_0$ and $\hat\beta_1$ are going to depend on the variance of our error terms. In the case of homoskedasticity, the asymptotic variances simplify to the following.¹

$$\sigma^2_{\hat\beta_0} = \frac{1}{n} \frac{E(X_i^2)}{\sigma_X^2} \sigma_u^2, \qquad \sigma^2_{\hat\beta_1} = \frac{1}{n} \frac{\sigma_u^2}{\sigma_X^2}$$

¹ Derivations are in Appendix 5.1.

We use $\hat\sigma_{\hat u}^2 = \frac{\sum \hat u_i^2}{n}$, $\frac{\sum X_i^2}{n}$, and $\frac{\sum (X_i - \bar X)^2}{n}$ as estimators for $\sigma_u^2$, $E(X_i^2)$, and $\sigma_X^2$, respectively. Therefore, our homoskedasticity-only estimators of the variances are

$$\tilde\sigma^2_{\hat\beta_0} = \frac{\frac{1}{n} \sum X_i^2}{\sum (X_i - \bar X)^2} \hat\sigma_{\hat u}^2, \qquad \tilde\sigma^2_{\hat\beta_1} = \frac{\hat\sigma_{\hat u}^2}{\sum (X_i - \bar X)^2}.$$

Note that the above formulas consist only of observables, so we can calculate their values from the sample. Using the estimates, we can conduct statistical inference in the usual way based on the following:

$$\frac{\hat\beta_0 - \beta_0}{\sqrt{\tilde\sigma^2_{\hat\beta_0}}} \xrightarrow{d} N(0, 1), \qquad \frac{\hat\beta_1 - \beta_1}{\sqrt{\tilde\sigma^2_{\hat\beta_1}}} \xrightarrow{d} N(0, 1).$$

The square roots of $\tilde\sigma^2_{\hat\beta_0}$ and $\tilde\sigma^2_{\hat\beta_1}$ are called the homoskedasticity-only standard errors.
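A sketch of the homoskedasticity-only standard error of $\hat\beta_1$ computed from stored regression results (placeholder names y and x). Note that Stata's default regress output uses the degrees-of-freedom-corrected estimator $\sum \hat u_i^2/(n-2)$ rather than the $1/n$ version above; the two differ only by the factor $\sqrt{n/(n-2)}$:

    * Sketch: homoskedasticity-only SE of b1 by hand
    quietly regress y x
    predict double uh, residuals
    generate double uh2 = uh^2
    quietly summarize uh2
    scalar ssr = r(sum)
    quietly summarize x
    scalar ssx = (r(N) - 1) * r(Var)     // sum of (x - xbar)^2
    display "by hand: SE(b1) = " sqrt((ssr / (e(N) - 2)) / ssx)
    display "Stata default:   " _se[x]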

2.2 Variance Estimation - Heteroskedasticity case


In this case, the variance formulas are more complicated.

$$\hat\beta_0 \xrightarrow{d} N(\beta_0, \sigma^2_{\hat\beta_0}), \qquad \sigma^2_{\hat\beta_0} = \frac{1}{n} \frac{\mathrm{Var}(H_i u_i)}{[E(H_i^2)]^2}, \qquad H_i = 1 - \frac{\mu_X}{E(X_i^2)} X_i, \qquad \mu_X = E(X_i).$$

$$\hat\beta_1 \xrightarrow{d} N(\beta_1, \sigma^2_{\hat\beta_1}), \qquad \sigma^2_{\hat\beta_1} = \frac{1}{n} \frac{\mathrm{Var}[(X_i - \mu_X) u_i]}{[\mathrm{Var}(X_i)]^2}.$$
The above variances involve population quantities that we do not know, so we use the following formulas to estimate them (you can find the derivation in the Appendix of Chapter 5 of the book):

$$\hat\sigma^2_{\hat\beta_0} = \frac{1}{n} \cdot \frac{\frac{1}{n-2} \sum \hat H_i^2 \hat u_i^2}{\left(\frac{1}{n} \sum \hat H_i^2\right)^2}, \qquad \hat\sigma^2_{\hat\beta_1} = \frac{1}{n} \cdot \frac{\frac{1}{n-2} \sum (X_i - \bar X)^2 \hat u_i^2}{\left(\frac{1}{n} \sum (X_i - \bar X)^2\right)^2}, \qquad \text{where } \hat H_i = 1 - \left(\frac{\bar X}{\frac{1}{n} \sum X_i^2}\right) X_i \ \ \forall i.$$

Note that the above formulas consist only of observables, so we can conduct statistical inference as above, using heteroskedasticity-robust standard errors (the square roots of these variance estimates) rather than homoskedasticity-only standard errors.

2.3 Which one to use?


• At a general level, economic theory rarely gives any reason to believe that the errors are homoskedastic. It is therefore prudent to assume that the errors might be heteroskedastic unless you have compelling reasons to believe otherwise.

• Also, the homoskedasticity-only method is valid only when the errors are homoskedastic. However, homoskedasticity is a special case of heteroskedasticity, so heteroskedasticity-robust standard errors work even when the errors are homoskedastic; in that case we merely lose some power. The sketch below illustrates the difference in practice.
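In Stata the choice is a one-option change. Here is a sketch on simulated data with the heteroskedastic design of Figure 1 (the data-generating process is an assumption for illustration); the homoskedasticity-only standard errors are generally misleading here, while the robust ones remain valid:

    * Sketch: homoskedasticity-only vs. heteroskedasticity-robust SEs
    clear
    set seed 140
    set obs 200
    generate double x = runiform(0, 5)
    generate double y = 1 + 0.5*x + rnormal(0, 1)*x   // Var(u|x) = x^2
    regress y x                    // homoskedasticity-only SEs (invalid here)
    regress y x, vce(robust)       // heteroskedasticity-robust SEs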

Now that we have distinguished between homoskedasticity and heteroskedasticity, let's outline some properties of the OLS estimators $\hat\beta_0$ and $\hat\beta_1$.

3 Properties of OLS estimator


1. The OLS estimator is unbiased: $E(\hat\beta_0) = \beta_0$, $E(\hat\beta_1) = \beta_1$. On average, we estimate the true parameters correctly.

2. The OLS estimator is consistent: $\hat\beta_0 \xrightarrow{p} \beta_0$, $\hat\beta_1 \xrightarrow{p} \beta_1$. When we have a large sample, our estimates are very likely to be close to the true parameters.

3. The OLS estimator is asymptotically normal:

• Heteroskedastic case:

$$\hat\beta_0 \xrightarrow{d} N(\beta_0, \sigma^2_{\hat\beta_0}), \qquad \sigma^2_{\hat\beta_0} = \frac{1}{n} \frac{\mathrm{Var}(H_i u_i)}{[E(H_i^2)]^2}, \qquad H_i = 1 - \frac{\mu_X}{E(X_i^2)} X_i, \qquad \mu_X = E(X_i).$$

$$\hat\beta_1 \xrightarrow{d} N(\beta_1, \sigma^2_{\hat\beta_1}), \qquad \sigma^2_{\hat\beta_1} = \frac{1}{n} \frac{\mathrm{Var}[(X_i - \mu_X) u_i]}{[\mathrm{Var}(X_i)]^2}.$$

• Homoskedastic case:

$$\hat\beta_0 \xrightarrow{d} N(\beta_0, \sigma^2_{\hat\beta_0}), \qquad \sigma^2_{\hat\beta_0} = \frac{1}{n} \frac{E(X_i^2)}{\sigma_X^2} \sigma_u^2,$$

$$\hat\beta_1 \xrightarrow{d} N(\beta_1, \sigma^2_{\hat\beta_1}), \qquad \sigma^2_{\hat\beta_1} = \frac{1}{n} \frac{\sigma_u^2}{\sigma_X^2}.$$

Once we have a (consistent) estimator of the asymptotic variance, we can construct a confidence interval and test a hypothesis on the $\beta$'s. Which variance estimator to use depends on the assumption about $\mathrm{Var}(u_i | X_i) = \sigma_i^2$.

Remark: The asymptotic normality of our OLS estimators $\hat\beta_0$ and $\hat\beta_1$ is due to two (very powerful) theorems: the LLN and the CLT.

For $\hat\beta_0$ and $\hat\beta_1$ to be exactly normally distributed in finite samples, the errors $u_i$ would have to be normally distributed. However, this normality assumption is highly unlikely to hold in many economic applications. In such cases, we can still arrive at property 3 without assuming normality of $u_i$ by using two very powerful theorems, the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). In plain words, the OLS estimator is approximately normally distributed when we have a large enough sample, even if we don't assume the errors are normally distributed.

• Law of Large Numbers (LLN): If $X_i$, $i = 1, \dots, N$, are i.i.d. with $E(X_i) = \mu_X$, then $\bar X \xrightarrow{p} \mu_X$ (in words, the sample average converges in probability to the population mean).
Informally, the LLN states that, under general conditions, $\bar X$ will be near $\mu_X$ with very high probability when $N$ is large.

• Central Limit Theorem (CLT): Suppose that $X_i$, $i = 1, \dots, N$, are i.i.d. with $E(X_i) = \mu_X$ and $\mathrm{Var}(X_i) = \sigma_X^2$, where $0 < \sigma_X^2 < \infty$. As $N \to \infty$, the distribution of $\frac{\bar X - \mu_X}{\sigma_{\bar X}}$, where $\sigma_{\bar X}^2 = \frac{\sigma_X^2}{N}$, becomes arbitrarily well approximated by the standard normal distribution. That is, the distribution of $\frac{\bar X - \mu_X}{\sigma_{\bar X}}$ is very close to the standard normal distribution as $N$ increases, denoted by $\frac{\bar X - \mu_X}{\sigma_{\bar X}} \stackrel{A}{\sim} N(0, 1)$ as $N \to \infty$. "A" comes from "approximately."

Informally, the CLT states that, when the sample size is large, the sampling distribution of the standardized sample average $\frac{\bar X - \mu_X}{\sigma_{\bar X}}$ is approximately normal. This result is extremely powerful because we don't need to assume anything about the distribution of $X$ (in particular, notice that we are not requiring $X$ to be normally distributed!). A simulation sketch follows below.
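A quick simulation illustrates the CLT: draw many samples from a decidedly non-normal distribution and look at the distribution of the standardized sample mean. A Stata sketch; the chi-squared(1) design (mean 1, variance 2) is an assumption chosen for illustration:

    * Sketch: standardized sample mean of a skewed variable is roughly N(0,1)
    capture program drop onedraw
    program define onedraw, rclass
        clear
        set obs 100
        generate double x = rchi2(1)       // heavily skewed, not normal
        summarize x
        return scalar z = (r(mean) - 1) / sqrt(2 / r(N))
    end
    set seed 140
    simulate z = r(z), reps(2000) nodots: onedraw
    summarize z              // mean close to 0, std. dev. close to 1
    histogram z, normal      // histogram close to the N(0,1) overlay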

So far, the OLS properties (unbiasedness, consistency, and asymptotic normality) hold even if the assumption of homoskedasticity is violated, in other words, even if our errors are heteroskedastic. So what does assuming homoskedasticity get us? Read on.

3.1 Optimality of OLS Estimator: Gauss-Markov Theorem


The Gauss-Markov theorem states that when the first three OLS assumptions hold and the errors are homoskedastic, the OLS estimator $\hat\beta_1$ is the best linear (conditionally) unbiased estimator (remember BLUE!).
Note that the OLS estimator is not linear in the $x_i$'s; it is linear in the $y_i$'s once we condition on the $x_i$'s, which is why the "L" part holds conditionally.
Thus, to answer the question from the previous section: we need the homoskedasticity assumption; without it, the optimality of the OLS estimator does not hold.

4 OLS Regression Example Using Stata
We will use a dataset on course evaluations, course characteristics, and professor characteristics for 463 courses in the academic years 2000-2002 at the University of Texas at Austin. These data were provided by Professor Daniel Hamermesh and were used in his paper with Amy Parker, published in 2005. The variable "course_eval" is the teaching evaluation score, on a scale of 1 (very unsatisfactory) to 5 (excellent). The variable "female" is equal to 1 if the instructor is a female and 0 if the instructor is a male. Let's just regress the course evaluation on gender and a constant. The regression equation would be:

$$course\_eval_i = \beta_0 + \beta_1 female_i + u_i$$

We just need to type in Stata "reg course_eval female" (Stata always adds a constant term by default). The output is the following:

Figure 2: Stata Output for the Regression of Course Evaluation on Gender

The table is easy to read. In the top-right corner we can find the number of observations and the $R^2$. In the main body of the table we can find the coefficients. What is the interpretation of the female coefficient? By the end of next week's section you will be able to use hypothesis testing to assess the statistical effect of being a woman on course evaluations.
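Since Section 2.3 recommends heteroskedasticity-robust standard errors by default, in practice we would add the robust option (a sketch, assuming the variable is named course_eval as above):

    * Same regression with heteroskedasticity-robust standard errors
    regress course_eval female, vce(robust)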

5 Recap - Confidence interval and Hypothesis testing
• Confidence interval
Confidence interval for $\theta$ with confidence level $(1 - \alpha)$:

$$\left| \frac{\hat\theta - \theta}{\sigma_{\hat\theta}} \right| \le t_{(c, \frac{\alpha}{2})} \implies CI_{(1-\alpha)}(\theta) = \left[ \hat\theta - t_{(c, \frac{\alpha}{2})} \cdot \sigma_{\hat\theta},\ \hat\theta + t_{(c, \frac{\alpha}{2})} \cdot \sigma_{\hat\theta} \right]$$

Notice that the true parameter $\theta$ is not a random variable; it is the confidence interval that is random, with an ex-ante probability $(1 - \alpha)$ of containing $\theta$. That is, before the sample is drawn, we expect the $(1 - \alpha)$ confidence interval to contain $\theta$ $100(1 - \alpha)\%$ of the time. However, once the sample is drawn and the confidence interval is actually constructed, the true parameter either lies in it or it doesn't (and we have no way of knowing which). Therefore it is WRONG to say that the realized confidence interval contains the true parameter $100(1 - \alpha)\%$ of the time.

• Hypothesis testing: There are several ways to test a null hypothesis $H_0: \theta = \theta_0$ against the alternative $H_1: \theta \ne \theta_0$ (the same is true for one-sided hypotheses): you can use a t-statistic, build a confidence interval, or compute a p-value. Keep in mind that using a t-statistic is by far the most common way to do hypothesis testing.

1. First approach: t-statistic

– null hypothesis vs. alternative hypothesis: $H_0: \theta = \theta_0$ vs. $H_1: \theta \ne \theta_0$ (two-sided hypothesis test)
– test statistic: $t\text{-stat} = \frac{\hat\theta - \theta_0}{\sigma_{\hat\theta}}$
– comparison of the t-stat with the critical value $t_c$ from the relevant distribution:
  if $|t\text{-stat}| \ge t_{(c, \frac{\alpha}{2})}$ $\implies$ reject $H_0$
  if $|t\text{-stat}| < t_{(c, \frac{\alpha}{2})}$ $\implies$ fail to reject $H_0$
Remarks: $t_c$ depends on the significance level $\alpha$;
for one-sided hypothesis tests, use $t_{(c, \alpha)}$ instead of $t_{(c, \frac{\alpha}{2})}$

2. Second approach: Confidence interval
– build the confidence interval as outlined above
– if the hypothesized value $\theta_0$ does not lie in the confidence interval, then you reject the null hypothesis; if, on the contrary, the confidence interval contains $\theta_0$, you fail to reject the null.

3. Third approach: p-value
– compute the p-value of your t-statistic
– compare the p-value you calculated with $\alpha$: reject $H_0$ if your p-value $\le \alpha$.

Important remark: You can either reject the null hypothesis $H_0$ or fail to reject it. You NEVER accept $H_0$. Be extremely careful with wording, because saying that you accept the null hypothesis is WRONG: the fact that your data do not contradict the hypothesis you are testing does not mean that the hypothesis is actually correct; it can still be wrong.

Table 1: Critical values for the t-statistic

                                          Significance level
                                          10%      5%      1%
  Two-sided test
    Reject if |t| is greater than         1.64     1.96    2.58
  One-sided test
    Reject if t is greater than           1.28     1.64    2.33
    Reject if t is smaller than          -1.28    -1.64   -2.33
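Tying the recap back to regression output: after any regress command, the coefficient and its standard error are stored, so the t-statistic and confidence interval can be rebuilt by hand. A sketch with placeholder names y and x, testing $H_0: \beta_1 = 0$ at the 5% level:

    * Sketch: t-statistic and 95% CI for the slope from stored results
    quietly regress y x, vce(robust)
    scalar b  = _b[x]
    scalar se = _se[x]
    display "t-stat = " b / se             // reject H0 if |t-stat| >= 1.96
    display "95% CI: [" b - 1.96*se ", " b + 1.96*se "]"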

6 Exercises
Stock & Watson, Exercise 4.6 Suppose that $Y_i = \beta_0 + \beta_1 X_i + u_i$. Show that the assumption $E(u_i | X_i) = 0$ implies that $E(Y_i | X_i) = \beta_0 + \beta_1 X_i$.

Stock & Watson, Exercise 4.11 Consider the regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$.

a. Suppose you know that $\beta_0 = 0$. Derive a formula for the least squares estimator of $\beta_1$.

b. Suppose you know that $\beta_0 = 4$. Derive a formula for the least squares estimator of $\beta_1$.

Stock & Watson, Exercise 4.14 Show that the sample regression line passes through the point
(X̄, Ȳ ).

Stock & Watson, Exercise 5.3. Suppose that a random sample of 200 twenty-year-old men is
selected from a population, and their heights and weights are recorded. A regression of weight on
height yields
$$\widehat{Weight} = \underset{(2.15)}{-99.41} + \underset{(0.31)}{3.94} \times Height, \qquad R^2 = 0.81, \quad SER = 10.2$$

where Weight is measured in pounds and Height is measured in inches. A man has a late growth spurt and grows 1.5 inches over the course of a year. Construct a 99% confidence interval for the person's weight gain.

Stock & Watson, Exercise 5.4. Using a sample of 2829 full-time workers in the United States
in 2012, ages 29 and 30, with between 6 and 18 years of education (from March 2013 CPS), we run
the following regression:

$$\widehat{Hourly\ earnings} = \underset{(1.10)}{-7.29} + \underset{(0.08)}{1.93} \times Years\ education, \qquad R^2 = 0.162, \quad SER = 10.29$$

a. A randomly selected 30-year-old worker reports an education level of 16 years. What is the worker's expected average hourly earnings?
b. A high school graduate (12 years of education) is contemplating going to a community college for a 2-year degree. How much is this worker's average hourly earnings expected to increase?

Stock & Watson, Exercise 5.5. In the 1980s, Tennessee conducted an experiment in which kindergarten students were randomly assigned to "regular" and "small" classes and given standardized tests at the end of the year. (Regular classes contained approximately 24 students, and small classes contained approximately 15 students.) Suppose that, in the population, the standardized tests have a mean score of 925 points and a standard deviation of 75 points. Let SmallClass denote a binary variable equal to 1 if the student is assigned to a small class, and 0 otherwise. A regression of TestScore on SmallClass yields
$$\widehat{TestScore} = \underset{(1.6)}{918.0} + \underset{(2.5)}{13.9} \times SmallClass, \qquad R^2 = 0.01, \quad SER = 74.6$$

a. Do small classes improve test scores? By how much? Is this effect large? Explain.
b. Is the estimated effect of class size on test scores statistically significant? Carry out a test at the 5% level.
c. Construct a 99% confidence interval for the effect of SmallClass on TestScore.

Stock & Watson, Exercise 5.6. Refer to the regression described in Exercise 5.5.
a. Do you think that the regression errors are plausibly homoskedastic? Explain.

b. $SE(\hat\beta_1)$ was computed using Equation (5.3) in the notes above. Suppose that the regression errors were homoskedastic: would this affect the validity of the hypothesis testing of Exercise 5.5(c)? Explain.

Stock & Watson, Exercise 5.9 Consider the regression model

$$Y_i = \beta X_i + u_i$$

where $u_i$ and $X_i$ satisfy the key least squares assumptions listed above. Let $\bar\beta$ denote an estimator of $\beta$ that is constructed as $\bar\beta = \bar Y / \bar X$.
a. Show that $\bar\beta$ is a linear function of $Y_1, Y_2, \dots, Y_n$.
b. Show that $\bar\beta$ is conditionally unbiased.

7 Solutions
Stock & Watson, Exercise 4.6

$$E(Y_i | X_i) = E(\beta_0 + \beta_1 X_i + u_i | X_i) = E(\beta_0 | X_i) + E(\beta_1 X_i | X_i) + E(u_i | X_i) = \beta_0 + \beta_1 E(X_i | X_i) + 0 = \beta_0 + \beta_1 X_i.$$

Stock & Watson, Exercise 4.11

a. Our problem reduces to finding the value of $b_1$ minimizing the following:

$$\min_{b_1} \sum (y_i - 0 - b_1 x_i)^2.$$

The first-order condition with respect to $b_1$ is

$$-2 \sum (y_i - b_1 x_i) x_i = 0.$$

Therefore $\hat\beta_1 \sum x_i^2 = \sum x_i y_i$, or

$$\hat\beta_1 = \frac{\sum x_i y_i}{\sum x_i^2}.$$

b. Similar to the above derivation, we find the value of $b_1$ minimizing the following:

$$\min_{b_1} \sum (y_i - 4 - b_1 x_i)^2.$$

We can directly show that

$$\hat\beta_1 = \frac{\sum x_i (y_i - 4)}{\sum x_i^2}.$$

Stock & Watson, Exercise 4.14 We know that $\bar y = \hat\beta_0 + \hat\beta_1 \bar x$. At the same time, the sample regression line is given by

$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i.$$

Therefore, when $x_i = \bar x$, $\hat y_i = \hat\beta_0 + \hat\beta_1 \bar x = \bar y$. In other words, the sample regression line passes through $(\bar x, \bar y)$.

Stock & Watson, Exercise 5.3 The 99% confidence interval is $1.5 \times (3.94 \pm 2.58 \times 0.31)$, or $4.71\ \text{lbs} \le WeightGain \le 7.11\ \text{lbs}$.
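The endpoints can be checked with two display commands in Stata:

    display 1.5 * (3.94 - 2.58*0.31)   // lower bound, about 4.71
    display 1.5 * (3.94 + 2.58*0.31)   // upper bound, about 7.11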

Stock & Watson, Exercise 5.4.

a. $-7.29 + 1.93 \times 16 = 23.59$ dollars per hour.

b. The hourly wage is expected to increase by $1.93 \times 2 = 3.86$ dollars.

Stock & Watson, Exercise 5.5. (All numbers follow directly from the regression output given in the exercise.)

a. Yes: the estimated effect of a small class is an improvement of 13.9 points. Relative to the population standard deviation of 75 points, this is about 0.19 standard deviations, a rather modest effect.

b. The t-statistic is $13.9 / 2.5 = 5.56$, which exceeds the 5% two-sided critical value of 1.96, so the effect is statistically significant at the 5% level.

c. The 99% confidence interval is $13.9 \pm 2.58 \times 2.5$, i.e., $7.45 \le \beta_1 \le 20.35$.
Stock & Watson, Exercise 5.6.

a. Not necessarily: the variability of test scores may well differ between small and regular classes, so it is prudent to treat the errors as potentially heteroskedastic.

b. No. Heteroskedasticity-robust standard errors remain valid when the errors happen to be homoskedastic, so the hypothesis test and confidence interval of Exercise 5.5(c) would still be valid.

Stock & Watson, Exercise 5.9. a. $\bar\beta = \frac{\frac{1}{n}(Y_1 + Y_2 + \dots + Y_n)}{\bar X} = \sum_{i=1}^n \frac{1}{n \bar X} Y_i$, so it is a linear function of $Y_1, Y_2, \dots, Y_n$.

b. $E(Y_i | X_1, \dots, X_n) = E(\beta X_i + u_i | X_1, \dots, X_n) = \beta X_i + E(u_i | X_1, \dots, X_n) = \beta X_i$, as $E(u_i | X_1, \dots, X_n) = 0$ by assumption. So

$$E(\bar\beta | X_1, \dots, X_n) = \frac{\frac{1}{n}\left[E(Y_1 | X_1, \dots, X_n) + E(Y_2 | X_1, \dots, X_n) + \dots + E(Y_n | X_1, \dots, X_n)\right]}{\bar X} = \frac{\frac{1}{n} \beta (X_1 + X_2 + \dots + X_n)}{\bar X} = \beta,$$

so $\bar\beta$ is conditionally unbiased.
