
Violation of (Weak) Exogeneity Assumption

Endogeneity from Omitted Variables

Regression model with an additive omitted variable:

E(y|x1, x2, . . . , xK ) = β0 + β1x1 + β2x2 + . . . + βK xK + γq,

where q is the omitted variable. We seek to estimate βj consistently, holding all other explanatory variables (those ≠ j), including q, constant.

E.g. y = log(wage), q includes ability, xK is some measure of education. Here βK denotes the partial effect of education on wages controlling for the level of ability. This effect is interesting from a policy perspective as it reflects the (marginal) returns to education.

The regression in the error form is:

y = β0 + β1x1 + β2x2 + . . . + βK xK + γq + v,
with E(v|x1, x2, . . . , xK, q) = 0, where v is the structural error. Note that E(v|·) = 0 implies (i) E(v) = 0 and (ii) the covariance between v and any function of (x, q) is zero.

One way to handle q is to put it into the error term. Assume, WLG,
E(q) = 0 (as there is an intercept in the model). Thus,

y = β0 + β1x1 + β2x2 + . . . + βK xK + u,

where u = γq + v. Note that E(u|x1, x2, . . . , xK) = E(v|x1, x2, . . . , xK) + γE(q|x1, x2, . . . , xK) and E(u) = E(v) + γE(q) = 0.

However, E(u|x1, x2, . . . , xK) = 0 requires E(q|x1, x2, . . . , xK) = 0, i.e.
cov(u, x) = 0 iff cov(q, x) = 0. If q is correlated with any xj then u
is also correlated with xj, violating the CLRM assumption E(x′u) = 0
and causing endogeneity in the estimable equation.
If we continue estimation of the model ignoring the unobserved omit-
ted variable q, OLS slope coefficients are likely to be inconsistent. To
see this, write a linear projection of q onto the observed explanatory
variables as

q = δ0 + δ1x1 + . . . + δK xK + r,

where r, the linear projection error, satisfies E(r) = 0 and cov(xj, r) = 0 ∀j = 1, . . . , K by definition.

A Digression on Linear Projections

Let x = (x1, x2, . . . , xK) and let the K × K variance matrix of x be non-singular. Then the linear projection of y on (1, x1, x2, . . . , xK) is unique:

L(y|1, x1, x2, . . . , xK) = L(y|1, x) = β0 + β1x1 + . . . + βKxK = β0 + xβ,


where β = [var(x)]⁻¹cov(x, y), β0 = E(y) − E(x)β = E(y) − β1E(x1) −
β2E(x2) − . . . − βKE(xK). Note that var(x) is the K × K positive definite
symmetric matrix whose (j, k)-th element is cov(xj, xk) and cov(x, y)
is the K × 1 vector with j-th element cov(xj, y). When K = 1, we
have the familiar result: β1 = cov(x1, y)/var(x1), β0 = E(y) − β1E(x1).
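The K = 1 formulas can be checked numerically. A minimal sketch in Python/NumPy (the data-generating numbers are purely illustrative): compute β1 and β0 from sample moments and compare with least squares on (1, x1).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulate a regressor and an outcome with a known linear relationship.
x1 = rng.normal(2.0, 1.5, n)
y = 1.0 + 0.5 * x1 + rng.normal(0.0, 1.0, n)

# Linear projection coefficients from sample moments:
# beta1 = cov(x1, y)/var(x1), beta0 = E(y) - beta1*E(x1).
# bias=True uses the same (population) normalisation as np.var.
beta1 = np.cov(x1, y, bias=True)[0, 1] / np.var(x1)
beta0 = y.mean() - beta1 * x1.mean()

# The same coefficients via least squares of y on (1, x1).
X = np.column_stack([np.ones(n), x1])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta0, beta1)   # agrees with the OLS intercept and slope
print(b_ols)
```

The moment formulas and the least-squares solution coincide exactly (up to floating point), which is the sense in which the linear projection is the population analogue of OLS.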

If the model does not have an intercept:

L(y|x1, . . . , xK) = γ1x1 + γ2x2 + . . . + γKxK = xγ,

where γ = [E(x′x)]⁻¹E(x′y). Note that γ ≠ β unless E(x) = 0.

An alternative way of writing the linear projection of y on (1, x1, x2, . . . , xK) is

y = β0 + β1x1 + β2x2 + . . . + βKxK + r,

with E(r) = 0, cov(xj, r) = 0 ∀j = 1, . . . , K. This definition is identical to the previous one: given the error form, the parameters must be as stated above, and given the parameters as stated above, the properties of r in the error form always hold.

Some Properties of Linear Projections

1. If E(y|x) = xβ , then L(y|x) = xβ .

2. L(y|x) = L[L(y|x, z)|x].

3. L(y|x) = L[E(y|x, z)|x].

Corollary: L(y|x) = L[E(y|x)|x].


4. Let L(y|x, z) = xβ + zγ. Let r = x − L(x|z) and v = y − L(y|z). Then L(v|r) = rβ and L(y|r) = rβ.

Thus, write the structural model in terms of observables, i.e. replacing q by the linear projection of q on (1, x1, x2, . . . , xK), as

y = (β0 + γδ0) + (β1 + γδ1)x1 + . . . + (βK + γδK )xK + (v + γr).

Note that E(v + γr) = E(v) + γE(r) = 0 and v + γr is uncorrelated with any xj, since cov(xj, v) = cov(xj, r) = 0 ∀j = 1, . . . , K. Hence the above (estimable) equation satisfies all the CLRM assumptions.

Therefore, plim(β̂j) = βj + γδj ∀j = 0, 1, . . . , K, where β̂j is the estimated coefficient from the regression of y on (1, x1, x2, . . . , xK).

As a special case, let δj = 0 ∀j, except j = 0, K (i.e. only xK is an endogenous regressor). Then plim(β̂j) = βj ∀j = 1, . . . , K − 1, but plim(β̂K) = βK + γδK = βK + γ·cov(xK, q)/var(xK) [i.e. q = δ0 + δKxK + r implies δK = cov(xK, q)/var(xK), δ0 = E(q) − δKE(xK)].

Hence OLS ignoring the omitted variable gives an inconsistent estimate of βK. This is called OLS Omitted Variables Bias.

If γ > 0 and xK and q are positively correlated then the bias is positive, i.e. OLS tends to overestimate βK.
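The formula plim(β̂K) = βK + γ·cov(xK, q)/var(xK) can be verified by simulation. A sketch with a hypothetical data-generating process (all parameter values illustrative), using only xK as regressor:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical DGP: y = b0 + bK*xK + gamma*q + v,
# with q correlated with xK, so delta_K = cov(xK, q)/var(xK) > 0.
q = rng.normal(0.0, 1.0, n)
xK = 1.0 + 0.8 * q + rng.normal(0.0, 1.0, n)   # corr(xK, q) > 0
b0, bK, gamma = 2.0, 1.0, 0.5
y = b0 + bK * xK + gamma * q + rng.normal(0.0, 1.0, n)

# OLS of y on (1, xK), ignoring q.
X = np.column_stack([np.ones(n), xK])
bK_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Predicted probability limit: bK + gamma * cov(xK, q)/var(xK).
deltaK = np.cov(xK, q, bias=True)[0, 1] / np.var(xK)
print(bK_hat, bK + gamma * deltaK)   # the two nearly coincide, both > bK
```

With γ > 0 and cov(xK, q) > 0, the OLS estimate settles above the true βK, exactly as the bias formula predicts.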

E.g. log(wage) = β0 + β1exper + β2exper² + β3educ + γabil + v, where E(v|exper, educ, abil) = 0. If abil is uncorrelated with exper and exper² once educ is partialed out, i.e. abil = δ0 + δ3educ + r (r being uncorrelated with exper and exper²), then plim(β̂3) = β3 + γδ3. If δ3 > 0 (and γ > 0 by postulation) then plim(β̂3) > β3, i.e. returns to education are likely overestimated in large samples.

The Proxy Variable – OLS Solution to the Omitted Variables Problem

If a proxy variable can be obtained for the unobserved q, the bias can be eliminated.

A proxy variable is an observed variable (part of the collected data or sample) which must satisfy two conditions:

1. Proxy variable should be redundant ≫ If z is a proxy for q then E(y|x, q, z) = E(y|x, q) ⇒ z is irrelevant for explaining y once x and q have been controlled for.

E.g. In the wage equation, let z be the IQ score (which is observable). By definition, it is ability that affects wage: IQ would not matter if true ability were known.

2. Correlation between the omitted variable and the other regressors should be zero after partialing out the effect of the proxy variable ≫

L(q|1, x1, . . . , xK, z) = L(q|1, z),

or, equivalently, q = θ0 + θ1z + r, where E(r) = 0 and cov(z, r) = 0. Then cov(xj, r) = 0 ∀j = 1, . . . , K, i.e. z is so closely related to q that once its effect on q is accounted for, the xj s are not correlated with the residual variation in q.

Thus replace q = θ0 + θ1z + r in the structural equation to obtain

y = (β0 + γθ0) + β1x1 + . . . + βK xK + γθ1z + (γr + v).

Note that,

• γr + v is uncorrelated with z (as cov(z, r) = 0 and cov(z, v) = 0 from redundancy)

• γr + v is uncorrelated with x1, x2, . . . , xK (as cov(xj, v) = 0 and cov(xj, r) = 0)

• E(γr + v) = γE(r) + E(v) = 0

Hence, OLS regression of y on 1, x1, . . . , xK, z produces consistent estimates of β0 + γθ0, β1, β2, . . . , βK, γθ1. Thus we can consistently estimate the partial effect of each xj on y.
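A small simulation illustrates how adding a valid proxy removes the omitted-variable bias. The DGP below is hypothetical (a single observed regressor x1; z proxies q with both proxy conditions satisfied by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical DGP: q (think "ability") relates to the proxy z via
# q = theta1*z + r with cov(z, r) = 0, and x1 is correlated with q
# only through z, so condition 2 holds.
z = rng.normal(0.0, 1.0, n)
r = rng.normal(0.0, 0.3, n)
q = 0.9 * z + r                                  # theta1 = 0.9
x1 = 1.0 + 0.7 * z + rng.normal(0.0, 1.0, n)
y = 2.0 + 1.0 * x1 + 0.5 * q + rng.normal(0.0, 1.0, n)   # beta1 = 1.0

def slope_on_x1(*cols):
    """OLS coefficient on x1 from regressing y on an intercept and cols."""
    X = np.column_stack([np.ones(n), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_no_proxy = slope_on_x1(x1)        # biased upward: q is in the error
b_with_proxy = slope_on_x1(x1, z)   # consistent for beta1 = 1.0
print(b_no_proxy, b_with_proxy)
```

Without the proxy the slope converges to β1 + γ·cov(x1, q)/var(x1) > β1; including z restores consistency for β1, although the coefficient on z estimates γθ1, not γ.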

When z is an imperfect proxy, however, condition 2 is violated ⇒ q = θ0 + ρ1x1 + ρ2x2 + . . . + ρKxK + θ1z + r (i.e. after controlling for z, residual variation in q is still attributable to variation in x1, x2, . . . , xK).

Then y = (β0 + γθ0) + (β1 + γρ1)x1 + . . . + (βK + γρK)xK + γθ1z + (γr + v).

Hence, OLS regression on the above yields plim(β̂j) = βj + γρj, ∀j = 1, . . . , K. Thus OLS with an imperfect proxy is still inconsistent. However, the hope is that the presence of z in the linear projection makes |ρj| < |δj| (part of the variation in q is explained away by z). Also var(γr + v) < var(γq + v) as var(r) < var(q), implying smaller standard errors and better precision of the estimator compared to the total absence of a proxy.

E.g. 1: log(wage) = β0 + β1exper + β2tenure + β3married + β4south + β5urban + β6black + β7educ + γabil + v. Assume IQ is a proxy for abil satisfying both conditions. Hence abil = θ0 + θ1IQ + r, E(r) = 0, cov(IQ, r) = 0. Also, r is uncorrelated with exper, tenure, married, south, urban, black, educ. OLS ignoring abil results in a high value of β̂7; OLS including IQ as a proxy for abil results in a lower value of β̂7, indicating that educ and abil are (partially) positively correlated (i.e. cov(xK, q) > 0).

E.g. 2: Effectiveness of a Job Training Grant on Workers' Productivity: The average scrap rate (percentage of produced items which must be scrapped) should be lower (workers more productive) for firms receiving a job-training grant (used for training the workers). The problem is that grants are not assigned randomly; whether a firm receives a grant could be related to factors unobserved to the econometrician. This is a situation where the unobserved factors are not readily identifiable and consequently factor-specific proxies may not be available.

log(scrap) = β0 + β1grant + γq + v,

where v is orthogonal to grant (a binary variable denoting whether or not the firm has received a grant during a certain period), but q may contain unobserved productivity factors that might determine whether the firm is eligible for a grant. Hence u = γq + v may be correlated with grant.

Use log(scrap−1) (the scrap rate of the previous year) as a proxy for q: q = θ0 + θ1log(scrap−1) + r, E(r) = 0, cov(log(scrap−1), r) = 0 and r is uncorrelated with grant.

Therefore, log(scrap) = δ0 + β1grant + γθ1log(scrap−1) + γr + v.

Here β1 measures the proportionate difference in scrap rates for two firms having the same scrap rate in the previous year, where one firm received a grant and the other did not. OLS on the above should give a consistent estimate of β1.

[The use of the lagged value of the dependent variable as a proxy captures the time-invariant productivity-related factors specific to the firm which might be correlated with grant.]
Models with Non-Additive Omitted Variables: Interactions in
Unobservables

y = β0 + β1x1 + . . . + βK xK + γ1q + γ2xK q + v,

where E(v|x, q) = 0. For simplicity assume q interacts only with xK .

Interpretation of Parameters: If xK is continuous, the partial effect of xK on E(y|x, q) is ∂E(y|x, q)/∂xK = βK + γ2q. Thus the partial effect of xK is a function of q. Since q is unobserved, the partial effect is not measurable. So the standard interpretation of regression coefficients as partial effects of regressors does not hold in this case. Measure the average partial effect (APE) instead, averaged over the population distribution of q, i.e. E(βK + γ2q) = βK, assuming E(q) = 0.

If xK is discrete, similar interpretations hold. E.g. if xK is binary then E(y|x1, . . . , xK−1, 1, q) − E(y|x1, . . . , xK−1, 0, q) = βK + γ2q and again E(βK + γ2q) = βK, assuming E(q) = 0. In this case βK is called the average treatment effect (ATE): if xK represents receiving some "treatment" (e.g. participation in a job training programme), the effect of that treatment averaged over the entire population of q is given by βK.

The assumption E(q) = 0 is WLG, as the model has an intercept. If E(q) = µq ≠ 0, the population average of βK + γ2q is βK + γ2µq, which is therefore the APE. Thus the population-averaged coefficient on xK always measures the APE, whether or not E(q) = 0.

Estimation of the Parameters: There can be broadly two cases.

Case 1. Let E(q|x) = 0. Then E(y|x, q) = β0 + β1x1 + . . . + βKxK + γ1q + γ2xKq. Apply the LIE to get E(y|x) = β0 + β1x1 + . . . + βKxK + γ1E(q|x) + γ2xKE(q|x) = β0 + β1x1 + . . . + βKxK. Thus we can consistently estimate the βj s by treating γ1q + γ2xKq as part of the error term.

Note that E(q|x) = 0 implies not just E(q) = 0 and cov(xj, q) = 0 ∀j = 1, . . . , K, but also that q is uncorrelated with any function of the xj s, e.g. cov(xKq, xj) = 0, and this is necessary for consistency (as xKq is part of the error term). Thus assuming only E(q) = 0 and cov(xj, q) = 0 ∀j = 1, . . . , K may not give consistency.

Case 2. Let q and x be correlated. Here we need a proxy for q. Assume that the proxy variable, z, satisfies

1. Redundancy: E(y|x, q, z) = E(y|x, q)

2. Observed regressors uncorrelated with q after controlling for z: E(q|x, z) = E(q|z) = θ1z, where we assume that z has zero mean in the population.

Hence E(y|x, z) = β0 + β1x1 + . . . + βKxK + γ1θ1z + γ2θ1xKz. As all regressors of the above estimable equation are observed, the βj s can be estimated consistently by OLS.

If z does not have zero mean, then it must be that E(q|z) = θ0 + θ1z,
as we continue to assume E(q) = E[E(q|z)] = θ0 + θ1E(z) = 0 for
interpretational convenience. Then

y = (β0+γ1θ0)+β1x1+. . .+βK−1xK−1+(βK +γ2θ0)xK +(γ1θ1+γ2xK θ1)z+u.

So the coefficient on xK is (βK + γ2θ0), and OLS will not yield a consistent estimate of βK. Since in practice we may not know E(z), z should at least be demeaned in the sample before interacting it with xK.
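The effect of demeaning can be checked by simulation. A sketch with a hypothetical single-regressor DGP (all numbers illustrative): q has mean zero, the proxy z has non-zero mean, and E(q|z) = θ0 + θ1z with θ0 = −θ1E(z).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300_000

# Hypothetical DGP: E(z) = 2, q = 0.8*(z - 2) + noise, so theta1 = 0.8
# and theta0 = -1.6; xK is correlated with q through z.
z = rng.normal(2.0, 1.0, n)
q = 0.8 * (z - 2.0) + rng.normal(0.0, 0.3, n)
xK = 0.5 * z + rng.normal(0.0, 1.0, n)
# Interaction model: beta_K = 2.0 is the APE since E(q) = 0.
y = 1.0 + 2.0 * xK + 0.7 * q + 0.4 * xK * q + rng.normal(0.0, 1.0, n)

def coef_on_xK(zvar):
    """OLS coefficient on xK, with zvar and xK*zvar as controls."""
    X = np.column_stack([np.ones(n), xK, zvar, xK * zvar])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

c_raw = coef_on_xK(z)             # -> beta_K + gamma2*theta0 ≈ 1.36
c_dm = coef_on_xK(z - z.mean())   # -> beta_K = 2.0 (the APE)
print(c_raw, c_dm)
```

With raw z the coefficient on xK converges to βK + γ2θ0 = 2.0 + 0.4·(−1.6) = 1.36, not the APE; demeaning z before interacting restores the APE. (The composite error here contains γ2xKr, so its variance depends on xK, which is why robust standard errors are advisable in this estimable model.)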
Also, var(y|x, z) = E[var(y|x, z, q)|x, z] + var[E(y|x, z, q)|x, z]. As var(y|x, z, q) = var(y|x, q) = σ² by redundancy, var(y|x, z) = σ² + var[E(y|x, q)|x, z] = σ² + (γ1 + γ2xK)²var(q|x, z).

So even if the original model is homoskedastic, i.e. var(y|x, q, z) = var(y|x, q) = σ², in the estimable model var(y|x, z) is a function of xK, even if var(q|x, z) is constant. The estimable model therefore has heteroskedasticity, and robust inference is always warranted.
Endogeneity from Measurement Error

Distinction between omitted variables and measurement error problems

1. In the omitted variables model the unobserved variable is often not quantifiable (e.g. ability). In measurement error problems the unobserved variable (e.g. true income of a person, marginal tax rate) is well defined, but we cannot obtain a correct measure of it (e.g. reported income, average tax rate, etc. are erroneous measures of the unobserved variables).

2. In the proxy variable problem the coefficient on the omitted variable cannot be estimated. In the measurement error problem the partial effect of the mismeasured variable is often of chief interest.

Measurement Error in Dependent Variable

Let y∗ be the true measure of the dependent variable (e.g. actual income). Hence the structural model is y∗ = β0 + β1x1 + . . . + βKxK + v, where E(v|x) = 0. Let y be the observed measure of y∗ (e.g. reported income). The measurement error is e0 = y − y∗. Hence y∗ = y − e0. Thus y = β0 + β1x1 + . . . + βKxK + v + e0. This is an estimable equation as y and x are observable.

Under what conditions are the coefficients consistently estimated?

1. E(e0) = 0. This assumption is WLG as there is an intercept in the structural model (if violated, only estimation of the intercept is affected).

2. The measurement error in the dependent variable is independent of the explanatory variables, which implies cov(x, e0) = 0.

3. cov(e0, v) = 0.

Under these conditions the composite error term, u = v + e0, has the
following properties: E(u) = 0, cov(u, x) = 0. Hence OLS produces
consistent estimates of each βj . Further, usual OLS inferences are
asymptotically valid under appropriate homoskedasticity assumptions.

However, var(v + e0) = σv² + σ0² > σv². Hence standard errors are larger than in the absence of measurement error.
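Both points (no bias, larger standard errors) show up in a simple Monte Carlo. A sketch under a hypothetical DGP (illustrative numbers), comparing regressions on the true y∗ and on the mismeasured y:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 5_000, 500

# Hypothetical DGP: y* = 1 + 2*x1 + v, observed y = y* + e0,
# with e0 independent of x1 (condition 2) and of v (condition 3).
b1_clean, b1_noisy = [], []
for _ in range(reps):
    x1 = rng.normal(0.0, 1.0, n)
    v = rng.normal(0.0, 1.0, n)
    e0 = rng.normal(0.0, 2.0, n)
    y_star = 1.0 + 2.0 * x1 + v
    y = y_star + e0
    X = np.column_stack([np.ones(n), x1])
    b1_clean.append(np.linalg.lstsq(X, y_star, rcond=None)[0][1])
    b1_noisy.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

b1_clean, b1_noisy = np.array(b1_clean), np.array(b1_noisy)
# Both estimators are centred on beta1 = 2 (no bias), but the
# mismeasured y gives a visibly larger sampling spread.
print(b1_clean.mean(), b1_noisy.mean())
print(b1_clean.std(), b1_noisy.std())
```

The means of both estimators sit at β1 = 2, while the standard deviation across replications is larger when y∗ is measured with error, mirroring var(v + e0) = σv² + σ0² > σv².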

Note: Suppose the dependent variable is in logarithmic form, i.e. log(y) = log(y∗) + e0. Then we are assuming a multiplicative measurement error: y = y∗·a0, where a0 = exp(e0), i.e. e0 = log(a0).
E.g. Suppose scrap rates are measured with error:

log(scrap∗) = β0 + β1grant + v,  log(scrap) = log(scrap∗) + e0.

This will be the case when the firm does not report true scrap rates, e.g. to enhance its chances of receiving a grant. It implies

log(scrap) = β0 + β1grant + v + e0 = β0 + β1grant + u.

Suppose the firm underreports scrap rates after receiving the grant to make it look like the grant had the intended effect. Then e0 and grant, and hence u and grant, are negatively correlated. This would result in a downward bias for β̂1, tending to make the training programme look more effective than it actually is [note that condition 2 above is being violated here].
Measurement Error in Explanatory Variables

y = β0 + β1x1 + . . . + βK x∗K + v,

where y, x1, x2, . . . , xK−1 are observed but x∗K is not. We make the following assumptions:

1. E(v) = 0, cov(v, x) = 0.

We have an observed measure of x∗K, viz. xK, for which we assume 2. cov(v, xK) = 0.

Condition 2 follows from the redundancy condition on xK, i.e. E(y|x1, . . . , xK−1, x∗K, xK) = E(y|x1, . . . , xK−1, x∗K), which implies that xK has no effect on y once the other explanatory variables, including x∗K, have been controlled for.

The measurement error in x∗K is eK = xK − x∗K. Assume 3. E(eK) = 0, 4. cov(v, eK) = 0 [as cov(v, x∗K) = 0, cov(v, xK) = 0]. Finally, 5. E(xjeK) = 0, ∀j = 1, . . . , K − 1 [eK is uncorrelated with all explanatory variables except the K-th].

The key assumption concerns the relationship between eK and xK (or x∗K). There are two sets of models based on two alternative assumptions on this relationship, along with assumptions 1 – 5.

Assumption 6a. cov(xK , eK ) = 0 :

Plug x∗K = xK − eK into the structural model to obtain the estimable equation

y = β0 + β1x1 + . . . + βKxK + (v − βKeK).

Note: (i) E(v − βKeK) = E(v) − βKE(eK) = 0. (ii) v and eK are uncorrelated with x1, . . . , xK−1, hence u = v − βKeK is also uncorrelated with x1, . . . , xK−1. (iii) v is uncorrelated with xK [assumption 2] and eK is uncorrelated with xK [assumption 6a], so u is uncorrelated with xK. Together, (ii) and (iii) imply u is uncorrelated with all observed regressors.

Hence OLS on the estimable equation produces consistent estimates of βj ∀j = 1, . . . , K.

However, var(v − βKeK) = σv² + βK²σeK² > σv². Thus, except when βK = 0, measurement error in the K-th regressor will increase the standard errors of the estimates.
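The consistency claim under assumption 6a can be checked by simulation. The trick in this hypothetical sketch is to generate the observed xK first and then construct x∗K = xK − eK, so that cov(xK, eK) = 0 holds by design (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Assumption 6a: eK is uncorrelated with the OBSERVED measure xK.
# Draw xK first, then set x*K = xK - eK with eK independent of xK.
xK = rng.normal(0.0, 1.0, n)
eK = rng.normal(0.0, 0.5, n)      # cov(xK, eK) = 0 by construction
xK_star = xK - eK
y = 1.0 + 2.0 * xK_star + rng.normal(0.0, 1.0, n)   # beta_K = 2.0

# OLS of y on the observed xK: the error v - beta_K*eK is
# uncorrelated with xK, so the estimate remains consistent.
X = np.column_stack([np.ones(n), xK])
bK_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(bK_hat)   # close to 2.0
```

The slope stays centred on βK = 2.0; only the error variance (and hence the standard error) grows, by βK²σeK².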
Assumption 6b. cov(x∗K, eK) = 0 [Classical Errors-in-Variables]:

Since xK = x∗K + eK and cov(x∗K, eK) = 0, the covariance between xK and eK comes entirely from the variance of the measurement error: cov(xK, eK) = E(xKeK) = E(x∗KeK) + E(eK²) = 0 + σeK² = σeK². Hence xK and eK must be correlated. But then in the estimable equation, y = β0 + β1x1 + . . . + βKxK + (v − βKeK), the error term is correlated with xK, i.e. cov(v − βKeK, xK) = −βKcov(eK, xK) = −βKσeK². Hence OLS will not produce a consistent estimate of any βj.

To characterise the asymptotic bias in the estimator, consider the following. First, take the linear projection of y on 1, x1, . . . , xK−1, xK:

y = θ0 + θ1x1 + . . . + θK−1xK−1 + θKxK + ε.

The θj s are consistently estimable by OLS. What is θK?

We know that xK = x∗K + eK. Hence

L(xK|1, x1, . . . , xK−1) = L(x∗K|1, x1, . . . , xK−1) + L(eK|1, x1, . . . , xK−1)
= δ0 + δ1x1 + . . . + δK−1xK−1 + 0
= δ0 + δ1x1 + . . . + δK−1xK−1,

where L(eK|1, x1, . . . , xK−1) = 0 follows because if L(eK|1, x1, . . . , xK−1) = γ0 + γ1x1 + . . . + γK−1xK−1 = γ0 + ẍγ̈, then γ̈ = [var(ẍ)]⁻¹cov(ẍ, eK), and cov(xj, eK) = 0 ∀j = 1, . . . , K − 1 by assumption 5, so γ̈ = 0. Also, γ0 = E(eK) − γ1E(x1) − . . . − γK−1E(xK−1) = 0 − 0 − . . . − 0 = 0.
Again,

rK = xK − L(xK|1, x1, . . . , xK−1)
= (δ0 + δ1x1 + . . . + δK−1xK−1 + r∗K + eK) − (δ0 + δ1x1 + . . . + δK−1xK−1)
= r∗K + eK,

since xK = x∗K + eK and x∗K = δ0 + δ1x1 + . . . + δK−1xK−1 + r∗K.

Now, L(y|rK) = α1rK, where α1 = cov(y, rK)/var(rK), as E(rK) = 0. But

cov(y, rK) = cov(β0 + β1x1 + . . . + βK−1xK−1 + βKxK + v − βKeK, rK)
= βKcov(xK, rK) − βKcov(eK, rK),

as rK is uncorrelated with x1, . . . , xK−1 (and with v) by definition.

Now, xK = δ0 + δ1x1 + . . . + δK−1xK−1 + rK, and rK is uncorrelated with x1, . . . , xK−1. Hence cov(xK, rK) = var(rK) = var(r∗K + eK) = σr∗² + σeK².

Also, cov(eK, rK) = cov(eK, xK − δ0 − δ1x1 − . . . − δK−1xK−1) = cov(eK, xK) = σeK², as each of x1, x2, . . . , xK−1 is uncorrelated with eK by assumption 5.

Hence, cov(y, rK) = βK(σr∗² + σeK² − σeK²) = βKσr∗².

Therefore, α1 = βKσr∗² / (σr∗² + σeK²).

So we have L(y|1, x1, . . . , xK−1, xK) = θ0 + θ1x1 + . . . + θK−1xK−1 + θKxK and rK = xK − L(xK|1, x1, . . . , xK−1). Hence by Property 4 of L.P. (Property LP.7 in Wooldridge, Ch. 2 Appendix), L(y|rK) = θKrK. Thus, by comparison, θK = α1. Since θK is consistently estimated by OLS,

plim(β̂K) = βK [σr∗² / (σr∗² + σeK²)].

Note that 0 < σr∗² / (σr∗² + σeK²) < 1, implying |plim(β̂K)| < |βK|.

This is called "attenuation bias" in the estimated coefficient on the explanatory variable measured with error. It implies that if βK > 0 it is underestimated by OLS and if βK < 0 it is overestimated.
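The attenuation factor σr∗²/(σr∗² + σeK²) is easy to reproduce numerically. A sketch with a hypothetical CEV design and no other regressors, so σr∗² is just var(x∗K) (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# Classical errors-in-variables: eK is independent of the TRUE x*K,
# so the observed xK = x*K + eK is correlated with eK.
xK_star = rng.normal(0.0, 1.0, n)   # sigma^2_{r*} = 1 (no other x's)
eK = rng.normal(0.0, 1.0, n)        # sigma^2_{eK} = 1
xK = xK_star + eK
y = 1.0 + 2.0 * xK_star + rng.normal(0.0, 1.0, n)   # beta_K = 2.0

X = np.column_stack([np.ones(n), xK])
bK_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Attenuation factor: 1/(1 + 1) = 0.5, so plim of the OLS slope
# is 2.0 * 0.5 = 1.0, half the true coefficient.
print(bK_hat)   # close to 1.0, not 2.0
```

With equal signal and noise variances the estimated slope converges to half the true βK, exactly the attenuation factor derived above.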

Note that the variance of x∗K itself, σx∗K², does not determine plim(β̂K); what matters is the net variance of x∗K after netting out the effect of the other explanatory variables, i.e. σr∗². If x∗K is more and more collinear with the other explanatory variables, the residual variation σr∗² is smaller (refer to the LP of x∗K on 1, x1, . . . , xK−1) and hence the attenuation bias is worse.

E.g. 1: Consider the effect of family income on college GPA after controlling for high school GPA and SAT score:

colGPA = β0 + β1faminc∗ + β2hsGPA + β3SAT + v.

Students may misreport true family income, faminc∗, so that faminc = faminc∗ + e. Suppose cov(faminc∗, e) = 0. Then this is a case of CEV. As a result, β̂1 will be attenuated towards 0, and the significance of the regressor family income will consequently be understated by the OLS regression.

E.g. 2: The CEV assumption, cov(x∗K, eK) = 0, is certainly false if σxK² < σx∗K² [as xK = x∗K + eK, so var(xK) < var(x∗K) implies cov(x∗K, eK) < 0]. We do not know the values of these population moments with certainty, but in certain cases we can use introspection. Consider actual days of schooling versus reported days of schooling. Reported schooling is a rounded-off version of actual schooling, so reported schooling should have less variance than actual schooling and the CEV assumption can be ruled out.
