SOLUTIONS TO CHAPTER 12 PROBLEMS

12.1. Take the conditional expectation of equation (12.4) with respect to x, and use E(u|x) = 0:

    E{[y - m(x,Q)]^2|x} = E(u^2|x) + 2[m(x,Qo) - m(x,Q)]E(u|x) + E{[m(x,Qo) - m(x,Q)]^2|x}
                        = E(u^2|x) + 0 + [m(x,Qo) - m(x,Q)]^2
                        = E(u^2|x) + [m(x,Qo) - m(x,Q)]^2.
The first term does not depend on Q and the second term is clearly minimized

at Q = Qo for any x. Therefore, the parameters of a correctly specified

conditional mean function minimize the squared error conditional on any value

of x. Usually, there would be multiple solutions for a particular x -- for

example, in the linear case m(x,Q) = xQ, any Q such that x(Qo - Q) = 0 sets

m(x,Qo) - m(x,Q) to zero. Uniqueness of Qo as a minimizer holds only after

we integrate out x to obtain E{[y - m(x,Q)]^2}.

12.2. a. Since u = y - E(y|x), Var(y|x) = Var(u|x) = E(u^2|x) because E(u|x) = 0. So E(u^2|x) = exp(ao + xGo).
b. If we knew the ui = yi - m(xi,Qo), then we could do a nonlinear regression of ui^2 on exp(a + xiG) and just use the asymptotic theory for nonlinear regression. The NLS estimators of a and G would then solve

    min over (a,G) of Σ_{i=1}^N [ui^2 - exp(a + xiG)]^2.

The problem is that Qo is unknown. When we replace Qo with its NLS estimator, Q^ -- that is, we replace ui^2 with (^ui)^2, the squared NLS residuals -- we are solving the problem

    min over (a,G) of Σ_{i=1}^N {[yi - m(xi,Q^)]^2 - exp(a + xiG)}^2.

This objective function has the form of a two-step M-estimator in Section 12.4. Since Q^ is generally consistent for Qo, the two-step M-estimator is generally consistent for ao and Go (under weak regularity and identification conditions). In fact, √N-consistency of ^a and ^G holds very generally.

c. We now estimate Qo by solving

    min over Q of Σ_{i=1}^N [yi - m(xi,Q)]^2/exp(^a + xi^G),

where ^a and ^G are from part b. The general theory of WNLS under Assumptions WNLS.1 to WNLS.3 can be applied.

d. Using the definition of v, write u^2 = exp(ao + xGo)v^2. Taking logs gives log(u^2) = ao + xGo + log(v^2). Now, if v is independent of x, so is log(v^2). Therefore, E[log(u^2)|x] = ao + xGo + E[log(v^2)|x] = ao + xGo + ko, where ko ≡ E[log(v^2)]. So, if we could observe the ui, an OLS regression of log(ui^2) on 1, xi would be consistent for (ao + ko), Go; in fact, it would be unbiased. By two-step estimation theory, consistency still holds if ui is replaced with ^ui, by essentially the same argument as in part b. So, if m(x,Q) is linear in Q, we can carry out a weighted NLS procedure without ever doing nonlinear estimation.
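
To make parts b-d concrete, here is a minimal Python sketch of the feasible procedure, using simulated data and a linear mean (as part d notes, everything is then linear). All names and parameter values are illustrative, not from the text.

import numpy as np

rng = np.random.default_rng(0)
N = 5000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])          # mean regressors (1, x)
beta_o = np.array([1.0, 2.0])                 # E(y|x) = xB
gamma_o = np.array([-1.0, 0.8])               # Var(u|x) = exp(ao + x*Go)
v = rng.normal(size=N)                        # v independent of x
u = np.exp(X @ gamma_o / 2) * v
y = X @ beta_o + u

# Step 1: initial least squares fit and residuals.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
uhat = y - X @ beta_hat

# Step 2 (part d): regress log(uhat^2) on 1, x.  The slope consistently
# estimates Go; the intercept estimates ao + ko, with ko = E[log(v^2)].
delta_hat = np.linalg.lstsq(X, np.log(uhat**2), rcond=None)[0]

# Step 3 (part c): weighted least squares with weights 1/h_i, where
# h_i = exp(ahat + x_i*Ghat).  The constant ko rescales all weights by the
# same factor, so it does not affect the WNLS estimates.
h = np.exp(X @ delta_hat)
Xw = X / h[:, None]
beta_wnls = np.linalg.solve(Xw.T @ X, Xw.T @ y)
print(beta_hat, beta_wnls)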

e. If we have misspecified the variance function -- or, for example, we use the approach in part d but v is not independent of x -- then we should use a fully robust variance-covariance matrix. It looks just like (12.52) except that ^ui and Dqm^i are each multiplied by exp[-(^a + xi^G)/2], or simply by exp(-xi^G/2): the constant factor exp(-^a/2) cancels out of the sandwich formula.

12.3. a. The approximate elasticity is dlog[E^(y|z)]/dlog(z1) = d[^q1 + ^q2·log(z1) + ^q3·z2]/dlog(z1) = ^q2.

b. This is approximated by 100·dlog[E^(y|z)]/dz2 = 100·^q3.

c. Since dE^(y|z)/dz2 = exp[^q1 + ^q2·log(z1) + ^q3·z2 + ^q4·z2^2]·(^q3 + 2·^q4·z2), the turning point is z2* = ^q3/(-2·^q4).

d. Since Dqm(x,Q) = exp(x1Q1 + x2Q2)x, the gradient of the mean function evaluated under the null is Dqm~i = exp(xi1Q~1)xi ≡ m~i·xi, where Q~1 is the restricted NLS estimator. From (12.72), we can compute the usual LM statistic as N·Ru^2 from the regression u~i on m~i·xi1, m~i·xi2, i = 1,...,N, where u~i = yi - m~i. For the robust test, we first regress m~i·xi2 on m~i·xi1 and obtain the 1 × K2 residuals, r~i. Then we compute the statistic as in regression (12.75).
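
A small numerical sketch of the nonrobust LM computation in part d may help; the statistic is N times the uncentered R-squared from regressing the restricted residuals on the full gradient. The data and names below are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
N = 2000
x1 = np.column_stack([np.ones(N), rng.normal(size=N)])   # included under H0
x2 = rng.normal(size=(N, 1))                             # excluded under H0
theta1_o = np.array([0.2, 0.5])
y = np.exp(x1 @ theta1_o) + rng.normal(size=N)           # generated under H0

# Restricted NLS for m = exp(x1*Q1) via Gauss-Newton iterations.
t1 = np.zeros(2)
for _ in range(100):
    m = np.exp(x1 @ t1)
    G1 = m[:, None] * x1                                 # gradient wrt Q1
    t1 = t1 + np.linalg.lstsq(G1, y - m, rcond=None)[0]

m_til = np.exp(x1 @ t1)
u_til = y - m_til
Grad = m_til[:, None] * np.column_stack([x1, x2])        # m~i*(xi1, xi2)

# LM = N * uncentered R^2 from the regression of u~ on the full gradient.
b = np.linalg.lstsq(Grad, u_til, rcond=None)[0]
e = u_til - Grad @ b
LM = N * (1 - (e @ e) / (u_til @ u_til))
print(LM)     # compare with chi-square(K2) critical values; here K2 = 1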

12.4. a. Write the objective function as (1/2) Σ_{i=1}^N [yi - m(xi,Q)]^2/h(xi,^G). The objective function for a generic observation, at any value of G, is q(wi,Q;G) = (1/2)[yi - m(xi,Q)]^2/h(xi,G). Taking the gradient with respect to Q gives

    Dq q(wi,Q;G) = -Dqm(xi,Q)[yi - m(xi,Q)]/h(xi,G) = -Dqm(xi,Q)ui(Q)/h(xi,G).

Taking the transpose gives us the score for any Q and any G.

b. This follows because, under WNLS.1, ui ≡ ui(Qo) has a zero mean given xi:

    E[si(Qo;G)|xi] = -Dqm(xi,Qo)'E(ui|xi)/h(xi,G) = 0;

the value of G plays no role.

c. First, the Jacobian of si(Qo;G) with respect to G is

    Dg si(Qo;G) = Dqm(xi,Qo)'ui·Dg h(xi,G)/[h(xi,G)]^2.

Everything but ui is a function only of xi, so

    E[Dg si(Qo;G)|xi] = Dqm(xi,Qo)'E(ui|xi)·Dg h(xi,G)/[h(xi,G)]^2 = 0.

It follows by the LIE that the unconditional expectation is zero, too. In other words, we have shown that the key condition (12.37) holds (whether or not WNLS.3 holds).

d. We would use the analog of equation (12.52):

    Avar^(Q^) = [Σ_{i=1}^N Dqmˇi'Dqmˇi]^{-1}[Σ_{i=1}^N uˇi^2·Dqmˇi'Dqmˇi][Σ_{i=1}^N Dqmˇi'Dqmˇi]^{-1},

where uˇi ≡ ^ui/ĥi^{1/2} and Dqmˇi ≡ Dqm^i/ĥi^{1/2}.

12.5. We need the gradient of m(xi,Q) evaluated under the null hypothesis. By the chain rule,

    Dbm(x,Q) = g[xB + d1(xB)^2 + d2(xB)^3]·[1 + 2d1(xB) + 3d2(xB)^2]x,
    Ddm(x,Q) = g[xB + d1(xB)^2 + d2(xB)^3]·[(xB)^2,(xB)^3].

Let B~ denote the NLS estimator with d1 = d2 = 0 imposed. Then Dbm(xi,Q~) = g(xiB~)xi and Ddm(xi,Q~) = g(xiB~)[(xiB~)^2,(xiB~)^3]. Therefore, the usual LM statistic can be obtained as N·Ru^2 from the regression u~i on g~i·xi, g~i·(xiB~)^2, g~i·(xiB~)^3, where g~i ≡ g(xiB~). If G(·) is the identity function, g(·) ≡ 1, and we get RESET.

12.6. a. The pooled NLS estimator of Qo solves

    min over Q of Σ_{i=1}^N Σ_{t=1}^T [yit - m(xit,Q)]^2/2,

and so, as in the hint in part b, we can take qi(Q) = Σ_{t=1}^T [yit - m(xit,Q)]^2/2. Then the score is si(Q) = -Σ_{t=1}^T Dqm(xit,Q)'uit(Q). Without further assumptions, a consistent estimator of Bo is N^{-1} Σ_{i=1}^N si(Q^)si(Q^)', where Q^ is the pooled NLS estimator. Now, the Hessian for observation i can be written as

    Hi(Q) = Dqsi(Q) = -Σ_{t=1}^T D²qm(xit,Q)uit(Q) + Σ_{t=1}^T Dqm(xit,Q)'Dqm(xit,Q).

When we plug in Qo and use the fact that E(uit|xit) = 0, all t = 1,...,T, then

    Ao ≡ E[Hi(Qo)] = -Σ_{t=1}^T E[D²qm(xit,Qo)uit] + Σ_{t=1}^T E[Dqm(xit,Qo)'Dqm(xit,Qo)]
                   = Σ_{t=1}^T E[Dqm(xit,Qo)'Dqm(xit,Qo)]

because E[D²qm(xit,Qo)uit] = 0, t = 1,...,T. By the usual law of large numbers argument,

    N^{-1} Σ_{i=1}^N Σ_{t=1}^T Dqm(xit,Q^)'Dqm(xit,Q^) ≡ N^{-1} Σ_{i=1}^N Âi

is a consistent estimator of Ao. Then, we just use the usual "sandwich" formula in (12.49).

b. As in the hint, I show that Bo = so^2·Ao. First, write si(Q) = Σ_{t=1}^T sit(Q), where sit(Q) ≡ -Dqm(xit,Q)'uit(Q). Under dynamic completeness of the mean, these scores are serially uncorrelated across t (when evaluated at Qo, of course). This is very similar to the linear regression case from Chapter 7. Let r < t for concreteness. Then

    E[sit(Qo)sir(Qo)'|xit,xir,uir] = E(uit|xit,xir,uir)uirDqm(xit,Qo)'Dqm(xir,Qo) = 0

because E(uit|xit,ui,t-1,xi,t-1,...) = 0 and r < t. Now apply the LIE to conclude E[sit(Qo)sir(Qo)'] = 0. So we have shown that Bo = Σ_{t=1}^T E[sit(Qo)sit(Qo)']. But for each t,

    E[sit(Qo)sit(Qo)'] = E[uit^2·Dqm(xit,Qo)'Dqm(xit,Qo)]
                       = E[E(uit^2|xit)Dqm(xit,Qo)'Dqm(xit,Qo)]   (by the law of iterated expectations)
                       = so^2·E[Dqm(xit,Qo)'Dqm(xit,Qo)]

because E(uit^2|xit) = so^2.

Next, the usual two-step estimation argument shows that (NT)^{-1} Σ_{i=1}^N Σ_{t=1}^T (^uit)^2 →p T^{-1} Σ_{t=1}^T E(uit^2) = so^2 as N → ∞. The degrees of freedom correction does not affect consistency.

The variance matrix obtained by ignoring the time dimension and assuming homoskedasticity is simply

    ŝ^2·[Σ_{i=1}^N Σ_{t=1}^T Dqm(xit,Q^)'Dqm(xit,Q^)]^{-1},

and we just showed that N times this matrix is a consistent estimator of Avar √N(Q^ - Qo).
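For reference, a compact simulation sketch of the pooled NLS sandwich in part a, with a scalar parameter so the algebra stays transparent (simulated data; all names hypothetical):

import numpy as np

rng = np.random.default_rng(2)
N, T = 1000, 4
x = rng.normal(size=(N, T))
theta_o = 0.5
u = rng.normal(size=(N, T)) * (1 + 0.5 * np.abs(x))    # heteroskedastic errors
y = np.exp(theta_o * x) + u

# Pooled NLS for scalar theta via Gauss-Newton.
t = 0.0
for _ in range(100):
    m = np.exp(t * x)
    g = m * x                                          # dm/dtheta
    t = t + (g * (y - m)).sum() / (g * g).sum()

m = np.exp(t * x)
g = m * x
A_hat = (g * g).sum() / N                 # N^{-1} sum_i sum_t (dm)^2
s_i = (g * (y - m)).sum(axis=1)           # score for each cross-section unit
B_hat = (s_i ** 2).sum() / N              # clusters on i, as in part a
se_robust = np.sqrt(B_hat / A_hat**2 / N) # sandwich A^{-1} B A^{-1}, then /N
print(t, se_robust)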

12.7. a. For each i and g, define uig ≡ yig - m(xig,Qo), so that E(uig|xi) = 0, g = 1,...,G. Further, let ui be the G × 1 vector containing the uig. Then E(uiui'|xi) = E(uiui') = Ωo. Let ûi be the vector of nonlinear least squares residuals; that is, do NLS for each g, and collect the residuals. Then, by standard arguments, a consistent estimator of Ωo is

    Ω^ ≡ N^{-1} Σ_{i=1}^N ûiûi'

because each NLS estimator, Q^g, is consistent for Qog as N → ∞.
b. This part involves several steps, and I will sketch how each one

goes. First, let G be the vector of distinct elements of Ω -- the nuisance

parameters in the context of two-step M-estimation. Then, the score for

observation i is

    s(wi,Q;G) = -Dqm(xi,Q)'Ω^{-1}ui(Q),

where, hopefully, the notation is clear. With this definition, we can verify

condition (12.37), even though the actual derivatives are complicated. Each

element of s(wi,Q;G) is a linear combination of ui(Q). So Dgsj(wi,Qo;G) is a


linear combination of ui(Qo) ≡ ui, where the linear combination is a function
of (xi,Qo,G). Since E(ui|xi) = 0, E[Dgsj(wi,Qo;G)|xi] = 0, and so its

unconditional expectation is zero, too. This shows that we do not have to

adjust for the first-stage estimation of Ωo. Alternatively, one can verify

the hint directly, which has the same consequence.

Next, we derive Bo ≡ E[si(Qo;Go)si(Qo;Go)']:

    E[si(Qo;Go)si(Qo;Go)'] = E[Dqmi(Qo)'Ωo^{-1}uiui'Ωo^{-1}Dqmi(Qo)]
                           = E{E[Dqmi(Qo)'Ωo^{-1}uiui'Ωo^{-1}Dqmi(Qo)|xi]}
                           = E[Dqmi(Qo)'Ωo^{-1}E(uiui'|xi)Ωo^{-1}Dqmi(Qo)]
                           = E[Dqmi(Qo)'Ωo^{-1}ΩoΩo^{-1}Dqmi(Qo)] = E[Dqmi(Qo)'Ωo^{-1}Dqmi(Qo)].
184
Next, we have to derive Ao ≡ E[Hi(Qo;Go)] and show that Bo = Ao. The Hessian itself is complicated, but its expected value is not. The Jacobian of si(Q;G) with respect to Q can be written

    Hi(Q;G) = Dqm(xi,Q)'Ω^{-1}Dqm(xi,Q) + [IP ⊗ ui(Q)']F(xi,Q;G),

where F(xi,Q;G) is a GP × P matrix, P being the total number of parameters, that involves the Jacobians of the rows of Ω^{-1}Dqmi(Q) with respect to Q. The key is that F(xi,Q;G) depends on xi, not on yi. So,

    E[Hi(Qo;Go)|xi] = Dqmi(Qo)'Ωo^{-1}Dqmi(Qo) + [IP ⊗ E(ui|xi)']F(xi,Qo;Go)
                    = Dqmi(Qo)'Ωo^{-1}Dqmi(Qo).

Now iterated expectations gives Ao = E[Dqmi(Qo)'Ωo^{-1}Dqmi(Qo)]. So, we have verified (12.37) and that Ao = Bo. Therefore, from Theorem 12.3,

    Avar √N(Q^ - Qo) = Ao^{-1} = {E[Dqmi(Qo)'Ωo^{-1}Dqmi(Qo)]}^{-1}.
c. As usual, we replace expectations with sample averages and unknown parameters with estimates, and divide the result by N to get Avar^(Q^):

    Avar^(Q^) = [N^{-1} Σ_{i=1}^N Dqmi(Q^)'(Ω^)^{-1}Dqmi(Q^)]^{-1}/N
              = [Σ_{i=1}^N Dqmi(Q^)'(Ω^)^{-1}Dqmi(Q^)]^{-1}.

The estimate Ω^ can be based on the multivariate NLS residuals or can be updated after the nonlinear SUR estimates have been obtained.

d. First, note that Dqmi(Qo) is a block-diagonal matrix with blocks Dqgmig(Qog), each 1 × Pg. (I implicitly assumed that there are no cross-equation restrictions imposed in the nonlinear SUR estimation.) If Ωo is diagonal, so is its inverse. Standard matrix multiplication shows that

    Dqmi(Qo)'Ωo^{-1}Dqmi(Qo) = diag[so1^{-2}·Dq1mi1(Qo1)'Dq1mi1(Qo1), ..., soG^{-2}·DqGmiG(QoG)'DqGmiG(QoG)].

Taking expectations and inverting the result shows that Avar √N(Q^g - Qog) = sog^2·[E(Dqgmig(Qog)'Dqgmig(Qog))]^{-1}, g = 1,...,G. (Note also that the nonlinear SUR estimators are asymptotically uncorrelated across equations.) These asymptotic variances are easily seen to be the same as those for nonlinear least squares on each equation; see p. 360.

e. I cannot see a nonlinear analogue of Theorem 7.7. The first hint

given in Problem 7.5 does not extend readily to nonlinear models, even when

the same regressors appear in each equation. The key is that Xi is replaced

with Dqm(xi,Qo). While this G * P matrix has a block-diagonal form, as


described in part d, the blocks are not the same even when the same

regressors appear in each equation. In the linear case, Dqgmg(xi,Qog) = xi


for all g. But, unless Qog is the same in all equations -- a very

restrictive assumption -- Dqgmg(xi,Qog) varies across g. For example, if

mg(xi,Qog) = exp(xiQog) then Dqgmg(xi,Qog) = exp(xiQog)xi, and the gradients


differ across g.

12.8. As stated in the hint, we can use (12.33) and an updated version of (12.66),

    N^{-1/2} Σ_{i=1}^N si(Q~;Ĝ) = N^{-1/2} Σ_{i=1}^N si(Qo;Ĝ) + Ao·√N(Q~ - Qo) + op(1),

to show that √N(Q~ - Q^) = Ao^{-1}·N^{-1/2} Σ_{i=1}^N si(Q~;Ĝ) + op(1); this is just standard algebra. Under (12.37), N^{-1/2} Σ_{i=1}^N si(Q~;Ĝ) = N^{-1/2} Σ_{i=1}^N si(Q~;Go) + op(1), by a mean value expansion similar to that used for the unconstrained two-step M-estimator:

    N^{-1/2} Σ_{i=1}^N si(Q~;Ĝ) = N^{-1/2} Σ_{i=1}^N si(Q~;Go) + E[Dgsi(Qo;Go)]·√N(Ĝ - Go) + op(1),

and use E[Dgsi(Qo;Go)] = 0. Now, the second-order Taylor expansion gives

    Σ_{i=1}^N q(wi,Q~;Ĝ) - Σ_{i=1}^N q(wi,Q^;Ĝ) = [Σ_{i=1}^N si(Q^;Ĝ)]'(Q~ - Q^) + (1/2)(Q~ - Q^)'[Σ_{i=1}^N H¨i](Q~ - Q^)
        = (1/2)(Q~ - Q^)'[Σ_{i=1}^N H¨i](Q~ - Q^),

where H¨i denotes the Hessian evaluated at mean values between Q~ and Q^, and the first-order term vanishes because Q^ solves the first order condition. Therefore,

    2[Σ_{i=1}^N q(wi,Q~;Ĝ) - Σ_{i=1}^N q(wi,Q^;Ĝ)] = [√N(Q~ - Q^)]'Ao[√N(Q~ - Q^)] + op(1)
        = [N^{-1/2} Σ_{i=1}^N s~i]'Ao^{-1}[N^{-1/2} Σ_{i=1}^N s~i] + op(1),

where s~i = si(Q~;Go). Again, this shows the asymptotic equivalence of the QLR and LM statistics. To complete the problem, we should verify that the LM statistic is not affected by Ĝ either, but that follows from N^{-1/2} Σ_{i=1}^N si(Q~;Ĝ) = N^{-1/2} Σ_{i=1}^N si(Q~;Go) + op(1).

12.9. a. We cannot say anything in general about Med(y|x), since Med(y|x) =

m(x,Bo) + Med(u|x), and Med(u|x) could be a general function of x.

b. If u and x are independent, then E(u|x) and Med(u|x) are both

constants, say a and d. Then E(y|x) - Med(y|x) = a - d, which does not


depend on x.

c. When u and x are independent, the partial effects of xj on the

conditional mean and conditional median are the same, and there is no

ambiguity about what is "the effect of xj on y," at least when only the mean

and median are in the running. Then, we could interpret large differences

between LAD and NLS as perhaps indicating an outlier problem. But it could

just be that u and x are not independent.

12.10. The conditional mean function is m(xi,ni,B) = ni·p(xi,B). So we would, as usual, minimize the sum of squared residuals, Σ_{i=1}^N [yi - ni·p(xi,B)]^2, to get the nonlinear least squares estimator, say B^. Define the weights as ĥi ≡ ni·p(xi,B^)[1 - p(xi,B^)]. Then, the weighted NLS estimator minimizes Σ_{i=1}^N [yi - ni·p(xi,B)]^2/ĥi.

12.11. a. For consistency of the MNLS estimator, we need -- in addition to

the regularity conditions, which I will ignore -- the identification

condition. That is, Bo must uniquely minimize

    E[q(wi,B)] = E{[yi - m(xi,B)]'[yi - m(xi,B)]}
               = E({ui + [m(xi,Bo) - m(xi,B)]}'{ui + [m(xi,Bo) - m(xi,B)]})
               = E(ui'ui) + 2E{[m(xi,Bo) - m(xi,B)]'ui} + E{[m(xi,Bo) - m(xi,B)]'[m(xi,Bo) - m(xi,B)]}
               = E(ui'ui) + E{[m(xi,Bo) - m(xi,B)]'[m(xi,Bo) - m(xi,B)]}

because E(ui|xi) = 0 kills the cross-product term. Therefore, the identification assumption is that

    E{[m(xi,Bo) - m(xi,B)]'[m(xi,Bo) - m(xi,B)]} > 0, B ≠ Bo.

In a linear model, where m(xi,B) = XiB for Xi a G × K matrix, the condition is

    (Bo - B)'E(Xi'Xi)(Bo - B) > 0, B ≠ Bo,

and this holds provided E(Xi'Xi) is positive definite.

Provided m(x,·) is twice continuously differentiable, there are no problems in applying Theorem 12.3. Generally, Bo = E[Dbmi(Bo)'uiui'Dbmi(Bo)] and Ao = E[Dbmi(Bo)'Dbmi(Bo)]. These can be consistently estimated in the obvious way after obtaining the MNLS estimators.

b. We can apply the results on two-step M-estimation. The key is that, under general regularity conditions,

    N^{-1} Σ_{i=1}^N [yi - m(xi,B)]'[W(xi,D^)]^{-1}[yi - m(xi,B)]/2

converges uniformly in probability to

    E{[yi - m(xi,B)]'[W(xi,Do)]^{-1}[yi - m(xi,B)]}/2,

which is just to say that the usual consistency proof can be used provided we verify identification. But we can use an argument very similar to the unweighted case to show

    E{[yi - m(xi,B)]'[W(xi,Do)]^{-1}[yi - m(xi,B)]} = E{ui'[Wi(Do)]^{-1}ui}
        + E{[m(xi,Bo) - m(xi,B)]'[Wi(Do)]^{-1}[m(xi,Bo) - m(xi,B)]},

where E(ui|xi) = 0 is used to show the cross-product term, 2E{[m(xi,Bo) - m(xi,B)]'[Wi(Do)]^{-1}ui}, is zero (by iterated expectations, as always). As before, the first term does not depend on B and the second term is minimized at Bo; we would have to assume it is uniquely minimized.

To get the asymptotic variance, we proceed as in Problem 12.7. First, it can be shown that condition (12.37) holds. In particular, we can write Ddsi(Bo;Do) = (IP ⊗ ui)'G(xi,Bo;Do) for some function G(xi,Bo;Do). It follows easily that E[Ddsi(Bo;Do)|xi] = 0, which implies (12.37). This means that, under E(yi|xi) = m(xi,Bo), we can ignore the preliminary estimation of Do provided we have a √N-consistent estimator.

To obtain the asymptotic variance when the conditional variance matrix is correctly specified, that is, when Var(yi|xi) = Var(ui|xi) = W(xi,Do), we can use an argument very similar to the nonlinear SUR case in Problem 12.7:

    E[si(Bo;Do)si(Bo;Do)'] = E{Dbmi(Bo)'[Wi(Do)]^{-1}uiui'[Wi(Do)]^{-1}Dbmi(Bo)}
        = E{E[Dbmi(Bo)'[Wi(Do)]^{-1}uiui'[Wi(Do)]^{-1}Dbmi(Bo)|xi]}
        = E{Dbmi(Bo)'[Wi(Do)]^{-1}E(uiui'|xi)[Wi(Do)]^{-1}Dbmi(Bo)}
        = E{Dbmi(Bo)'[Wi(Do)]^{-1}Dbmi(Bo)}.

Now, the Hessian (with respect to B), evaluated at (Bo,Do), can be written as

    Hi(Bo;Do) = Dbm(xi,Bo)'[Wi(Do)]^{-1}Dbm(xi,Bo) + (IP ⊗ ui)'F(xi,Bo;Do),

for some complicated function F(xi,Bo;Do) that depends only on xi. Taking expectations gives

    Ao ≡ E[Hi(Bo;Do)] = E{Dbm(xi,Bo)'[Wi(Do)]^{-1}Dbm(xi,Bo)} = Bo.

Therefore, from the usual results on M-estimation, Avar √N(B^ - Bo) = Ao^{-1}, and a consistent estimator of Ao is

    Â = N^{-1} Σ_{i=1}^N Dbm(xi,B^)'[Wi(D^)]^{-1}Dbm(xi,B^).
c. The consistency argument in part b did not use the fact that W(x,D) is correctly specified for Var(y|x); exactly the same derivation goes through. But, of course, the asymptotic variance is affected because Ao ≠ Bo: the expression for Bo above no longer holds. The estimator of Ao in part b still works, of course. To consistently estimate Bo we use

    B^ = N^{-1} Σ_{i=1}^N Dbm(xi,B^)'[Wi(D^)]^{-1}ûiûi'[Wi(D^)]^{-1}Dbm(xi,B^).

Now, we estimate Avar √N(B^ - Bo) in the usual way: Â^{-1}B^Â^{-1}.
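
Given the pieces computed at the estimates, the robust variance matrix in part c is a few lines of linear algebra. A schematic sketch (the residual vectors, gradients, and weight matrices are assumed to have been computed beforehand; only the shapes matter here):

import numpy as np

def wmnls_sandwich(resids, grads, weights):
    # resids:  length-N list of G-vectors u^i = yi - m(xi,B^)
    # grads:   length-N list of (G x P) gradients Dbm(xi,B^)
    # weights: length-N list of (G x G) matrices W(xi,D^)
    N = len(resids)
    P = grads[0].shape[1]
    A = np.zeros((P, P))
    B = np.zeros((P, P))
    for u, D, W in zip(resids, grads, weights):
        WinvD = np.linalg.solve(W, D)      # W(xi,D^)^{-1} Dbm(xi,B^)
        A += D.T @ WinvD                   # contribution to A-hat
        s = WinvD.T @ u                    # score (up to sign, irrelevant here)
        B += np.outer(s, s)                # contribution to B-hat
    A /= N
    B /= N
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv / N             # estimate of Avar(B^)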

12.12. (Bonus Question): Let Q^ be an M-estimator with score si(Q) ≡ s(wi,Q) and expected Hessian Ao. Let g(w,Q) be an M × 1 vector function of the random vector w and the parameter vector, and suppose we wish to estimate Do ≡ E[g(wi,Qo)]. The natural estimator is D^ = N^{-1} Σ_{i=1}^N g(wi,Q^).

a. Assuming that g(w,·) is continuously differentiable on int(Θ), Qo ∈ int(Θ), and other standard regularity conditions hold, find Avar √N(D^ - Do).

b. How would you estimate Avar √N(D^ - Do)?

c. Show that if g(w,Q) = g(x,Q), and x is exogenous in the estimation problem used to obtain Q^ -- more precisely, E[s(wi,Qo)|xi] = 0 -- then

    Avar √N(D^ - Do) = Var[g(xi,Qo)] + Go[Avar √N(Q^ - Qo)]Go'.
Answer:

a. We use a mean value expansion, similar to the delta method from Chapter 3 but now allowing for the randomness of wi. By a mean value expansion, we can write

    N^{-1/2} Σ_{i=1}^N g(wi,Q^) = N^{-1/2} Σ_{i=1}^N g(wi,Qo) + [N^{-1} Σ_{i=1}^N G¨i]√N(Q^ - Qo),

where G¨i is the M × P Jacobian of g(wi,Q) evaluated at mean values between Qo and Q^. Now, since √N(Q^ - Qo) ~a Normal(0,Ao^{-1}BoAo^{-1}), it follows that √N(Q^ - Qo) = Op(1). Further, by Lemma 12.1, N^{-1} Σ_{i=1}^N G¨i →p E[Dqg(w,Qo)] ≡ Go, since the mean values converge in probability to Qo. Therefore,

    [N^{-1} Σ_{i=1}^N G¨i]√N(Q^ - Qo) = Go√N(Q^ - Qo) + op(1),

and so

    N^{-1/2} Σ_{i=1}^N g(wi,Q^) = N^{-1/2} Σ_{i=1}^N g(wi,Qo) + Go√N(Q^ - Qo) + op(1).

Since √N(Q^ - Qo) = -N^{-1/2} Σ_{i=1}^N Ao^{-1}si(Qo) + op(1), we can write

    N^{-1/2} Σ_{i=1}^N g(wi,Q^) = N^{-1/2} Σ_{i=1}^N [g(wi,Qo) - GoAo^{-1}si(Qo)] + op(1)

or, subtracting √N·Do from both sides,

    √N(D^ - Do) = N^{-1/2} Σ_{i=1}^N [g(wi,Qo) - Do - GoAo^{-1}si(Qo)] + op(1).

Since the term in the summation has zero mean, it follows from the CLT that

    √N(D^ - Do) ~a Normal(0,Vo),

where Vo ≡ Var(gi - Do - GoAo^{-1}si) and, hopefully, the shorthand is clear. This differs from the usual delta method result by the presence of gi = gi(Qo).

b. We assume we have Â consistent for Ao. By the usual arguments, Ĝ = N^{-1} Σ_{i=1}^N Dqg(wi,Q^) is consistent for Go. Then

    V^ = N^{-1} Σ_{i=1}^N (ĝi - D^ - ĜÂ^{-1}ŝi)(ĝi - D^ - ĜÂ^{-1}ŝi)'

is consistent for Vo, where the "^" denotes evaluation at Q^.

c. Using our shorthand notation, if E(si|xi) = 0 then gi is uncorrelated with si (since gi is a function of xi), which means (gi - Do) is uncorrelated with GoAo^{-1}si. But then Vo = Var(gi - Do - GoAo^{-1}si) = Var(gi) + GoAo^{-1}BoAo^{-1}Go', which is what we wanted to show.
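
This result is, for example, what justifies the standard errors reported for average partial effects. A schematic numpy version of the estimator in part b (the inputs are assumed to come from the estimation step; the expected-Hessian estimate is taken to be symmetric):

import numpy as np

def avar_delta_hat(g, G, A, s):
    # g: (N x M) rows g(wi,Q^);  G: (M x P) average Jacobian of g;
    # A: (P x P) expected-Hessian estimate;  s: (N x P) rows si(Q^).
    d_hat = g.mean(axis=0)
    adj = s @ np.linalg.solve(A, G.T)      # rows (Go Ao^{-1} si)'; A symmetric
    e = g - d_hat - adj                    # rows g^i - D^ - G^ A^{-1} s^i
    return e.T @ e / g.shape[0]            # consistent for Vo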

SOLUTIONS TO CHAPTER 13 PROBLEMS

13.1. No. We know that Qo solves

    max over Q ∈ Θ of E[log f(yi|xi;Q)],

where the expectation is over the joint distribution of (xi,yi). Therefore, because exp(·) is an increasing function, Qo also maximizes exp{E[log f(yi|xi;Q)]} over Θ. The problem is that the expectation and the exponential function cannot be interchanged: E[f(yi|xi;Q)] ≠ exp{E[log f(yi|xi;Q)]}. In fact, Jensen's inequality tells us that E[f(yi|xi;Q)] > exp{E[log f(yi|xi;Q)]}.

13.2. a. Since

    f(y|xi) = (2πso^2)^{-1/2}·exp[-(y - m(xi,Bo))^2/(2so^2)],

it follows that for each i,

    li(B,s^2) = -(1/2)log(2π) - (1/2)log(s^2) - [yi - m(xi,B)]^2/(2s^2).

Only the last of these terms depends on B. Further, for any s^2 > 0, maximizing Σ_{i=1}^N li(B,s^2) with respect to B is the same as minimizing

    Σ_{i=1}^N [yi - m(xi,B)]^2.    (13.66)

Thus, regardless of the MLE for s^2, the MLE B^ of Bo minimizes (13.66).

b. First,

    Dbli(B,s^2) = Dbm(xi,B)[yi - m(xi,B)]/s^2;

note that Dbm(xi,B) is 1 × P. Next,

    dli(B,s^2)/ds^2 = -1/(2s^2) + [yi - m(xi,B)]^2/(2s^4).

For notational simplicity, define the residual function ui(B) ≡ yi - m(xi,B). Then the score, stacking the B-block over the s^2-block, is

    si(Q) = [ Dbmi(B)'ui(B)/s^2 ; -1/(2s^2) + ui(B)^2/(2s^4) ],

where Dbmi(B) ≡ Dbm(xi,B).

Define the errors as ui ≡ ui(Bo), so that E(ui|xi) = 0 and E(ui^2|xi) = Var(yi|xi) = so^2. Then, since Dbmi(Bo) is a function of xi, it is easily seen that E[si(Qo)|xi] = 0. Note that we only use the fact that E(yi|xi) = m(xi,Bo) and Var(yi|xi) = so^2 in showing this. In other words, only the first two conditional moments of yi need to be correctly specified; nothing else about the normal distribution is used.


The equation used to obtain ŝ^2 is

    Σ_{i=1}^N {-1/(2ŝ^2) + [yi - m(xi,B^)]^2/(2ŝ^4)} = 0,

where B^ is the nonlinear least squares estimator. Solving gives

    ŝ^2 = N^{-1} Σ_{i=1}^N (ûi)^2,

where ûi ≡ yi - m(xi,B^). Thus, the MLE of s^2 is the sum of squared residuals divided by N. In practice, N is often replaced with N - P as a degrees-of-freedom adjustment, but this makes no difference as N → ∞.


c. The derivations are a bit tedious but fairly straightforward:

    Hi(Q) = [ -Dbmi(B)'Dbmi(B)/s^2 + D²bmi(B)ui(B)/s^2    -Dbmi(B)'ui(B)/s^4 ]
            [ -Dbmi(B)ui(B)/s^4                            1/(2s^4) - ui(B)^2/s^6 ],

where D²bmi(B) is the P × P Hessian of mi(B) with respect to B.

d. From part c,

    -E[Hi(Qo)|xi] = [ Dbmi(Bo)'Dbmi(Bo)/so^2    0        ]
                    [ 0                          1/(2so^4) ],    (13.67)

where again I have used E(ui|xi) = 0 and E(ui^2|xi) = so^2.

e. To show that (13.67) equals E[si(Qo)si(Qo)'|xi], you need to know that, with ui defined as above, E(ui^3|xi) = 0 and E(ui^4|xi) = 3so^4.

f. From general MLE theory, we know that Avar √N(B^ - Bo) is the P × P upper left hand block of {E[Ai(Qo)]}^{-1}, where Ai(Qo) is the matrix in (13.67). Because this matrix is block diagonal, it is easily seen that

    Avar √N(B^ - Bo) = so^2{E[Dbmi(Bo)'Dbmi(Bo)]}^{-1},

and this is consistently estimated by

    ŝ^2·[N^{-1} Σ_{i=1}^N Dbm^i'Dbm^i]^{-1},    (13.68)

which means that Avar^(B^) is (13.68) divided by N, or

    Avar^(B^) = ŝ^2·[Σ_{i=1}^N Dbm^i'Dbm^i]^{-1}.    (13.69)

If the model is linear, Dbm^i = xi, and we obtain exactly the asymptotic variance estimator for the OLS estimator under homoskedasticity.

13.3. a. The conditional log-likelihood for observation i is

li(Q) = yilog[G(xi,Q)] + (1 - yi)log[1 - G(xi,Q)].

b. The derivation for the probit case in Example 13.1 extends

immediately:

si(Q) = yiDqG(xi,Q)’/G(xi,Q) - (1 - yi)DqG(xi,Q)’/[1 - G(xi,Q)]

= DqG(xi,Q)’[yi - G(xi,Q)]/{G(xi,Q)[1 - G(xi,Q)]}.


If we plug in Qo for Q and take the expectation conditional on xi we get

E[si(Qo)|xi] = 0 because E[yi - G(xi,Qo)|xi] = E(yi|xi) - G(xi,Qo) = 0, and

the functions multiplying yi - G(xi,Qo) depend only on xi.

c. We need to evaluate the score and the expected Hessian with respect to the full set of parameters, but then evaluate these at the restricted estimates. Now,

    DqG(xi,B,0) = f(xiB)[xi,(xiB)^2,(xiB)^3],

a 1 × (K + 2) vector. Let B~ denote the probit estimates of B, obtained under the null. The score for observation i, evaluated at the null estimates, is the (K + 2) × 1 vector

    si(Q~) = DqG(xi,B~,0)'[yi - F(xiB~)]/{F(xiB~)[1 - F(xiB~)]}
           = f(xiB~)zi'[yi - F(xiB~)]/{F(xiB~)[1 - F(xiB~)]},

where zi ≡ [xi,(xiB~)^2,(xiB~)^3]. The negative of the expected Hessian, evaluated under the null, is the (K + 2) × (K + 2) matrix

    A(xi,Q~) = [f(xiB~)]^2·zi'zi/{F(xiB~)[1 - F(xiB~)]}.

These can be plugged into the second expression in equation (13.26) to obtain a nonnegative, well-behaved LM statistic. Simple algebra shows that the statistic can be computed as the explained sum of squares from the regression

    u~i/[F~i(1 - F~i)]^{1/2} on f~i·xi/[F~i(1 - F~i)]^{1/2}, f~i·(xiB~)^2/[F~i(1 - F~i)]^{1/2}, f~i·(xiB~)^3/[F~i(1 - F~i)]^{1/2},

i = 1,...,N, where "~" denotes evaluation at (B~,0) and u~i = yi - F~i. Under H0, LM is distributed asymptotically as χ2^2.

13.4. If the density of y given x is correctly specified, then E[s(w,Qo)|x] =

0. But then E[a(x,Qo)s(w,Qo)|x] = a(x,Qo)E[s(w,Qo)|x] = 0, which, of course,

implies E[g(w,Qo)] = 0 where g(w,Q) = a(x,Q)s(w,Q).

13.5. a. Since si^g(Fo) = [G(Qo)']^{-1}si(Qo),

    E[si^g(Fo)si^g(Fo)'|xi] = E{[G(Qo)']^{-1}si(Qo)si(Qo)'[G(Qo)]^{-1}|xi}
                            = [G(Qo)']^{-1}E[si(Qo)si(Qo)'|xi][G(Qo)]^{-1}
                            = [G(Qo)']^{-1}Ai(Qo)[G(Qo)]^{-1}.

b. In part b, we just replace Qo with Q~ and Fo with F~:

    A~i^g = [G(Q~)']^{-1}A~i[G(Q~)]^{-1} ≡ G~'^{-1}A~iG~^{-1}.

c. The expected Hessian form of the statistic is given in the second part of equation (13.36), but where it is based on s~i^g and A~i^g:

    LM^g = [Σ s~i^g]'[Σ A~i^g]^{-1}[Σ s~i^g]
         = [Σ G~'^{-1}s~i]'[Σ G~'^{-1}A~iG~^{-1}]^{-1}[Σ G~'^{-1}s~i]
         = [Σ s~i]'G~^{-1}G~[Σ A~i]^{-1}G~'G~'^{-1}[Σ s~i]
         = [Σ s~i]'[Σ A~i]^{-1}[Σ s~i] = LM,

where all sums run from i = 1 to N.

13.6. a. No, for two reasons. First, just specifying a distribution of yit

given xit says nothing, in general, about the distribution of yit given xi ≡
(xi1,...,xiT). We could assume these two are the same, which is a strict

exogeneity assumption. But, even under strict exogeneity, we would have to

specify something about joint or conditional distributions involving the

different time periods. We could assume independence (conditional on xi) or

make a dynamic completeness assumption. But, without substantially more

assumptions, we cannot derive the distribution of yi given xi.

b. This is given in a more general case in equation (19.46). Specializing to the mean exp(xitQ) (and using Q in place of B) gives

    li(Q) = Σ_{t=1}^T [yit·xitQ - exp(xitQ)] ≡ Σ_{t=1}^T lit(Q).

Taking the gradient and transposing gives

    si(Q) = Σ_{t=1}^T xit'[yit - exp(xitQ)] ≡ Σ_{t=1}^T sit(Q).

c. First, we need minus the Hessian for each i, which is easily obtained as -Dqsi(Q):

    Hi(Q) = Σ_{t=1}^T exp(xitQ)xit'xit,

which, in this example, does not depend on the yit: Ait(Qo) = Hit(Qo). Therefore,

    Â = N^{-1} Σ_{i=1}^N Σ_{t=1}^T exp(xitQ^)xit'xit,

where Q^ is the partial MLE. Further,

    B^ = N^{-1} Σ_{i=1}^N si(Q^)si(Q^)',

and then Avar^(Q^) is estimated as

    [Σ_{i=1}^N Σ_{t=1}^T exp(xitQ^)xit'xit]^{-1}[Σ_{i=1}^N si(Q^)si(Q^)'][Σ_{i=1}^N Σ_{t=1}^T exp(xitQ^)xit'xit]^{-1}.

d. If E(yit|xit,yi,t-1,xi,t-1,...) = E(yit|xit), then

    E[sit(Qo)|xit,yi,t-1,xi,t-1,...] = xit'[E(yit|xit,yi,t-1,xi,t-1,...) - exp(xitQo)] = 0.

This implies that sit(Qo) and sir(Qo) are uncorrelated, t ≠ r. Therefore,

    Bo = Σ_{t=1}^T E[sit(Qo)sit(Qo)'] = Σ_{t=1}^T E(uit^2·xit'xit),

where uit ≡ yit - E(yit|xit) = yit - exp(xitQo). But, by the Poisson assumption, E(uit^2|xit) = Var(yit|xit) = exp(xitQo). By iterated expectations,

    Bo = Σ_{t=1}^T E[exp(xitQo)xit'xit] = Ao.

(We have really just verified the conditional information matrix equality for each t, in the special case of Poisson regression with an exponential mean function.) Therefore, we can estimate Avar^(Q^) as

    [Σ_{i=1}^N Σ_{t=1}^T exp(xitQ^)xit'xit]^{-1},

which is exactly what we get by using pooled Poisson estimation and ignoring the time dimension.
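
A simulation sketch of the pooled Poisson calculations in parts b-d (hypothetical data; the mean is correctly specified, so the nominal and robust standard errors should be close in large samples):

import numpy as np

rng = np.random.default_rng(3)
N, T = 1000, 5
x = rng.normal(size=(N, T))
theta_o = 0.3
y = rng.poisson(np.exp(theta_o * x))

# Pooled Poisson MLE for scalar theta by Newton's method:
# solve sum_i sum_t x_it [y_it - exp(theta*x_it)] = 0.
t = 0.0
for _ in range(100):
    m = np.exp(t * x)
    t = t + (x * (y - m)).sum() / (m * x * x).sum()

m = np.exp(t * x)
A_hat = (m * x * x).sum() / N
s_i = (x * (y - m)).sum(axis=1)                 # unit-level scores
B_hat = (s_i ** 2).sum() / N
se_nominal = np.sqrt(1 / (N * A_hat))           # uses Bo = Ao from part d
se_robust = np.sqrt(B_hat / A_hat**2 / N)       # sandwich from part c
print(t, se_nominal, se_robust)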

13.7. a. The joint density is simply g(y1|y2,x;Qo)·h(y2|x;Qo). The log-likelihood for observation i is

    li(Q) ≡ log g(yi1|yi2,xi;Q) + log h(yi2|xi;Q),

and we would use this in a standard MLE analysis (conditional on xi).

b. First, we know that, for all (yi2,xi), Qo maximizes E[li1(Q)|yi2,xi]. Since ri2 is a function of (yi2,xi),

    E[ri2li1(Q)|yi2,xi] = ri2E[li1(Q)|yi2,xi];

since ri2 ≥ 0, Qo maximizes E[ri2li1(Q)|yi2,xi] for all (yi2,xi), and therefore Qo maximizes E[ri2li1(Q)]. Similarly, Qo maximizes E[li2(Q)], and so it follows that Qo maximizes E[ri2li1(Q) + li2(Q)]. For identification, we have to assume or verify uniqueness.

c. The score is

    si(Q) = ri2si1(Q) + si2(Q),

where si1(Q) ≡ Dqli1(Q)' and si2(Q) ≡ Dqli2(Q)'. Therefore,

    E[si(Qo)si(Qo)'] = E[ri2si1(Qo)si1(Qo)'] + E[si2(Qo)si2(Qo)']
        + E[ri2si1(Qo)si2(Qo)'] + E[ri2si2(Qo)si1(Qo)'],

where ri2^2 = ri2 (a binary indicator) is used in the first term. Now by the usual conditional MLE theory, E[si1(Qo)|yi2,xi] = 0 and, since ri2 and si2(Q) are functions of (yi2,xi), it follows that E[ri2si1(Qo)si2(Qo)'|yi2,xi] = 0, and so its transpose also has zero conditional expectation. As usual, this implies zero unconditional expectation. We have shown

    E[si(Qo)si(Qo)'] = E[ri2si1(Qo)si1(Qo)'] + E[si2(Qo)si2(Qo)'].

Now, by the unconditional information matrix equality for the density h(y2|x;Q), E[si2(Qo)si2(Qo)'] = -E[Hi2(Qo)], where Hi2(Q) = Dqsi2(Q). Further, by the conditional IM equality for the density g(y1|y2,x;Q),

    E[si1(Qo)si1(Qo)'|yi2,xi] = -E[Hi1(Qo)|yi2,xi],    (13.70)

where Hi1(Q) = Dqsi1(Q). Since ri2 is a function of (yi2,xi), we can put ri2 inside both expectations in (13.70). Then, by iterated expectations,

    E[ri2si1(Qo)si1(Qo)'] = -E[ri2Hi1(Qo)].

Combining all the pieces, we have shown that

    E[si(Qo)si(Qo)'] = -E[ri2Hi1(Qo)] - E[Hi2(Qo)]
        = -E[ri2Dqsi1(Qo) + Dqsi2(Qo)] = -E[Dqsi(Qo)] ≡ -E[Hi(Qo)].

So we have verified that an unconditional IM equality holds, which means we can estimate the asymptotic variance of √N(Q^ - Qo) by {-E[Hi(Qo)]}^{-1}.

d. From part c, one consistent estimator of Avar √N(Q^ - Qo) is

    [-N^{-1} Σ_{i=1}^N (ri2Hi1^ + Hi2^)]^{-1},

where the notation should be obvious. But, as we discussed in Chapters 12 and 13, this estimator need not be positive definite. Instead, we can break the problem into finding consistent estimators of -E[ri2Hi1(Qo)] and -E[Hi2(Qo)], for which we can use iterated expectations. Since, by definition, Ai2(Qo) ≡ -E[Hi2(Qo)|xi], N^{-1} Σ_{i=1}^N Âi2 is consistent for -E[Hi2(Qo)] by the usual iterated expectations argument. Similarly, since Ai1(Qo) ≡ -E[Hi1(Qo)|yi2,xi], and ri2 is a function of (yi2,xi), it follows that E[ri2Ai1(Qo)] = -E[ri2Hi1(Qo)]. This implies that, under general regularity conditions, N^{-1} Σ_{i=1}^N ri2Âi1 consistently estimates -E[ri2Hi1(Qo)]. This completes what we needed to show. Interestingly, even though we do not have a true conditional maximum likelihood problem, we can still use the conditional expectations of the Hessians -- but conditioned on different sets of variables, (yi2,xi) in one case and xi in the other -- to consistently estimate the asymptotic variance of the partial MLE.

e. Bonus Question: Show that if we were able to use the entire random sample, the resulting conditional MLE would be more efficient than the partial MLE based on the selected sample.

Answer: We use a basic fact about positive definite matrices: if A and B are P × P positive definite matrices, then A - B is p.s.d. if and only if B^{-1} - A^{-1} is p.s.d. Now, as we showed in part d, the asymptotic variance of the partial MLE is {E[ri2Ai1(Qo) + Ai2(Qo)]}^{-1}. If we could use the entire random sample for both terms, the asymptotic variance would be {E[Ai1(Qo) + Ai2(Qo)]}^{-1}. But {E[ri2Ai1(Qo) + Ai2(Qo)]}^{-1} - {E[Ai1(Qo) + Ai2(Qo)]}^{-1} is p.s.d. because

    E[Ai1(Qo) + Ai2(Qo)] - E[ri2Ai1(Qo) + Ai2(Qo)] = E[(1 - ri2)Ai1(Qo)]

is p.s.d. (since Ai1(Qo) is p.s.d. and 1 - ri2 ≥ 0).

13.8. a. The score with respect to Q, for observation i, is

    si(Q;G) = f[hi(G)Q]·hi(G)'·{yi - F[hi(G)Q]}/(F[hi(G)Q]{1 - F[hi(G)Q]}),

where hi(G) ≡ (xi, wi - ziG) is 1 × (K + 1). The negative of the expected Hessian, where the expectation is conditional on (xi,zi,vi) or, equivalently, (xi,zi,wi), evaluated at the true parameters, has the standard probit form:

    Ai(Qo;Go) = {f[hi(Go)Qo]}^2·hi(Go)'hi(Go)/(F[hi(Go)Qo]{1 - F[hi(Go)Qo]}).

The consistent estimator of Ao that we would use is the positive definite matrix

    N^{-1} Σ_{i=1}^N [f(ĥiQ^)]^2·ĥi'ĥi/{F(ĥiQ^)[1 - F(ĥiQ^)]},

where ĥi ≡ (xi,v^i) contains the "generated regressor." This is just the usual probit estimator that we would use without generated regressors. The tricky part is in adjusting the score. For this, we need to estimate Fo ≡ E[Dgsi(Qo;Go)], which we can do by first estimating E[Dgsi(Qo;Go)|xi,zi,wi] and then averaging. But, by the same argument used to simplify the expected Hessian, all but one term has zero conditional expectation. In particular,

    E[Dgsi(Qo;Go)|xi,zi,wi] = -[f(hiQo)]^2·hi'Qo'Dghi(Go)'/{F(hiQo)[1 - F(hiQo)]},

where hi ≡ (xi,vi) and

    Dghi(Go)' = [0 ; -zi] ≡ -Ri,

a (K + 1) × M matrix, where zi is 1 × M and the zero block is K × M. Therefore, a consistent estimator of Fo is

    F^ = N^{-1} Σ_{i=1}^N [f(ĥiQ^)]^2·ĥi'Q^'Ri/{F(ĥiQ^)[1 - F(ĥiQ^)]}.

Now, for the r^i implicit in (12.61), we can take r^i = (Z'Z/N)^{-1}zi'v^i. Then we form ĝi as in (12.61).

b. In general, Qo'Ri = (Do',ro)[0 ; zi] = rozi. If ro = 0 then Qo'Ri = 0, which means Fo = 0. Then, we need not adjust the score for the first-step estimation of Go.

c. From part b, the usual probit t statistic on the generated regressor v^i is asymptotically standard normal under H0: ro = 0, so we can just ignore the generated regressor aspect of v^i when carrying out the test.

13.9. a. Under the Markov assumption, the joint density of (yi0,...,yiT) is

given by

fT(yT|yT-1)·fT-1(yT-1|yT-2)···f1(y1|y0)·f0(y0),

so we would need to model f0(y0) to obtain a model of the joint density.

b. The log-likelihood

    li(Q) = Σ_{t=1}^T log[ft(yit|yi,t-1;Q)]
is the conditional log-likelihood for the density of (yi1,...,yiT) given yi0,

and so the usual theory of conditional maximum likelihood applies. In

practice, this is MLE pooled across i and t.

c. Because we have the density of (yi1,...,yiT) given yi0, we can use any of the three asymptotic variance estimators implied by the information matrix equality. However, we can also use the simplifications due to dynamic completeness of each conditional density. Let sit(Q) = Dqlog[ft(yit|yi,t-1;Q)], Hit(Q) = Dqsit(Q), and Ait(Qo) = -E[Hit(Qo)|yi,t-1], t = 1,...,T. Then Avar √N(Q^ - Qo) is consistently estimated using the inverse of any of the three matrices in equation (13.50). If we have a canned package that computes a particular MLE, we can just use any of the usual asymptotic variance estimates obtained from the pooled MLE.

13.10. a. Because of conditional independence, by the usual product rule,

    f(y1,y2,...,yG|x,c) = ∏_{g=1}^G fg(yg|x,c).

b. Let g(y1,...,yG|x) be the joint density of yi given xi = x. Then

    g(y1,...,yG|x) = ∫ f(y1,y2,...,yG|x,c)h(c|x)dc,

where the integral is over the range of c.

c. The density g(y1,...,yG|x) is now

    g(y1,...,yG|x;Go,Do) = ∫ f(y1,y2,...,yG|x,c;Go)h(c|x;Do)dc
                         = ∫ ∏_{g=1}^G fg(yg|x,c;Go)h(c|x;Do)dc,

and so the log likelihood for observation i is

    log[g(yi1,...,yiG|xi;Go,Do)] = log[∫ ∏_{g=1}^G fg(yig|xi,c;Go)h(c|xi;Do)dc].
d. This setup has some features in common with a linear SUR model,

although here the correlation across equations is assumed to come through a

single common component, c. Because of computational issues with general

nonlinear models -- especially if G is large and some of the models are for

qualitative response -- one probably needs to restrict the cross correlation

somehow.

13.11. a. For each t ≥ 1, the density of yit given yi,t-1 = yt-1, yi,t-2 = yt-2, ..., yi0 = y0, and ci = c is

    ft(yt|yt-1,c) = (2πse^2)^{-1/2}·exp[-(yt - ryt-1 - c)^2/(2se^2)].

Therefore, the density of (yi1,...,yiT) given yi0 = y0 and ci = c is obtained as the product of these densities:

    ∏_{t=1}^T (2πse^2)^{-1/2}·exp[-(yt - ryt-1 - c)^2/(2se^2)].

If we plug in the data for observation i and take the log we get

    Σ_{t=1}^T {-(1/2)log(se^2) - (yit - ryi,t-1 - ci)^2/(2se^2)},

where we have dropped the term that does not depend on the parameters. It is not a good idea to "estimate" the ci along with r and se^2, as the incidental parameters problem causes inconsistency -- severe in some cases -- in the estimator of r.
b. If we write ci = a0 + a1yi0 + ai, under the maintained assumption, then the density of (yi1,...,yiT) given (yi0 = y0, ai = a) is

    ∏_{t=1}^T (2πse^2)^{-1/2}·exp[-(yt - ryt-1 - a0 - a1y0 - a)^2/(2se^2)].

Now, to get the density conditional on yi0 = y0 only, we integrate this density against the density of ai given yi0 = y0. But ai and yi0 are independent, and ai ~ Normal(0,sa^2). So the density of (yi1,...,yiT) given yi0 = y0 is

    ∫_{-∞}^{∞} {∏_{t=1}^T (2πse^2)^{-1/2}·exp[-(yt - ryt-1 - a0 - a1y0 - a)^2/(2se^2)]}·sa^{-1}f(a/sa)da.

If we now plug in the data (yi0,yi1,...,yiT) for each i and take the log we get a conditional log-likelihood (conditional on yi0) for each i. We can estimate the parameters by maximizing the sum of the log-likelihoods across i.
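
The integral over a can be computed accurately by Gauss-Hermite quadrature. A Python sketch of the log likelihood for one unit under the assumptions of part b (normal ai, the device ci = a0 + a1yi0 + ai; names hypothetical):

import numpy as np
from numpy.polynomial.hermite import hermgauss

def unit_loglik(y, r, a0, a1, sig_e, sig_a, K=15):
    # y is the (T+1,) array (y_0, y_1, ..., y_T) for one unit.
    # Substituting a = sqrt(2)*sig_a*z turns the integral against
    # sa^{-1} f(a/sa) da into an integral against exp(-z^2)/sqrt(pi) dz.
    z, w = hermgauss(K)
    total = 0.0
    for zk, wk in zip(z, w):
        a = np.sqrt(2.0) * sig_a * zk
        e = y[1:] - r * y[:-1] - a0 - a1 * y[0] - a
        dens = np.exp(-e**2 / (2 * sig_e**2)) / np.sqrt(2 * np.pi * sig_e**2)
        total += wk * dens.prod()
    return np.log(total / np.sqrt(np.pi))

# The sample log likelihood is the sum of unit_loglik over i; maximize it
# over (r, a0, a1, sig_e, sig_a) with any numerical optimizer.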

c. As before, we can replace ci with a0 + a1yi0 + ai. Then, the density of yit given (yi,t-1,...,yi1,yi0,ai) is

    Normal[ryi,t-1 + a0 + a1yi0 + ai + d(a0 + a1yi0 + ai)yi,t-1, se^2],

t = 1,...,T. Using the same argument as in part b, we just integrate out ai to get the density of (yi1,...,yiT) given yi0 = y0:

    ∫_{-∞}^{∞} {∏_{t=1}^T (2πse^2)^{-1/2}·exp[-(yt - ryt-1 - a0 - a1y0 - a - d(a0 + a1y0 + a)yt-1)^2/(2se^2)]}·sa^{-1}f(a/sa)da.

Numerically, this could be a difficult MLE problem to solve. Assuming we can get the MLEs, we would estimate r + E(ci) as r^ + a^0 + a^1·ȳ0, where ȳ0 is the cross-sectional average of the initial observation.

d. The log likelihood for observation i, now conditional on (yi0,zi), is the log of

    ∫_{-∞}^{∞} {∏_{t=1}^T (2πse^2)^{-1/2}·exp[-(yit - ryi,t-1 - zitB - a0 - a1yi0 - z̄iD - a)^2/(2se^2)]}·sa^{-1}f(a/sa)da.

The assumption that we can put in the time average, z̄i, to account for correlation between ci and (yi0,zi), may be too strong. It would be better to put in the full vector zi, although this leads to many more parameters to estimate.

13.12 (Bonus Question): Let {ft(yt|xt;Q): t = 1,...,T} be a sequence of correctly specified densities for yit given xit. That is, assume that there is Qo ∈ int(Θ) such that ft(yt|xt;Qo) is the density of yit given xit = xt. Also assume that {xit: t = 1,2,...,T} is strictly exogenous for each t:

    D(yit|xi1,...,xiT) = D(yit|xit).

a. Is it true that, under the standard regularity conditions for partial MLE, E[sit(Qo)|xi1,...,xiT] = 0, where sit(Q) = Dqlog ft(yit|xit;Q)'?

b. Under the assumptions given, is {sit(Qo): t = 1,...,T} necessarily serially uncorrelated?

c. Let ci be "unobserved heterogeneity" for cross section unit i, and assume that, for each t, D(yit|zi1,...,ziT,ci) = D(yit|zit,ci). In other words, {zit: t = 1,...,T} is strictly exogenous conditional on ci. Further, assume that D(ci|zi1,...,ziT) = D(ci|z̄i), where z̄i = T^{-1}(zi1 + ... + ziT) is the time average. Assuming that well-behaved, correctly specified conditional densities are available, how do we choose xit to make part a applicable?

Answer:

a. This is true because, by the general theory for partial MLE, we know that E[sit(Qo)|xit] = 0, t = 1,...,T. But if D(yit|xi1,...,xiT) = D(yit|xit) then, for any function mt(yit,xit), E[mt(yit,xit)|xi1,...,xiT] = E[mt(yit,xit)|xit], including the score function.

b. No. Strict exogeneity and complete dynamic specification of the conditional density are entirely different. Saying that D(yit|xi1,...,xiT) does not depend on xis, s ≠ t, says nothing about whether yir, r < t, appears in D(yit|xit,yi,t-1,xi,t-1,...).

c. We take xit = (zit,z̄i), t = 1,...,T. If gt(yt|zt,c;G) is correctly specified for the density of yit given (zit = zt, ci = c), and h(c|z̄;D) is correctly specified for the density of ci given z̄i = z̄, then the density of yit given (zit,z̄i) is obtained as

    ft(yt|zit,z̄i;Qo) = ∫_C gt(yt|zit,c;Go)h(c|z̄i;Do)ν(dc).

Under the assumptions given, D(yit|zi1,...,ziT) = D(yit|zit,z̄i), t = 1,...,T. However, we have not eliminated the serial dependence in {yit} after conditioning only on (zit,z̄i): the part of ci not explained by z̄i affects yit in each time period.

SOLUTIONS TO CHAPTER 14 PROBLEMS

14.1. a. The simplest way to estimate (14.35) is by 2SLS, using instruments (x1,x2). Nonlinear functions of these can be added to the instrument list -- these would generally improve efficiency if g2 ≠ 1. If E(u2^2|x) = s2^2, 2SLS using the given list of instruments is the efficient, single-equation GMM estimator. Otherwise, the optimal weighting matrix that allows heteroskedasticity of unknown form should be used. Finally, one could try to use the optimal instruments derived in Section 14.5.3. Even under homoskedasticity, these are difficult, if not impossible, to find analytically if g2 ≠ 1.

b. No. If g1 = 0, the parameter g2 does not appear in the model. Of course, if we knew g1 = 0, we would consistently estimate D1 by OLS.

c. We can see this by obtaining E(y1|x):

    E(y1|x) = x1D1 + g1E(y2^{g2}|x) + E(u1|x) = x1D1 + g1E(y2^{g2}|x).

Now, when g2 ≠ 1, E(y2^{g2}|x) ≠ [E(y2|x)]^{g2}, so we cannot write

    E(y1|x) = x1D1 + g1(xD2)^{g2};

in fact, we cannot find E(y1|x) without more assumptions. While the regression of y2 on x consistently estimates D2, the two-step NLS estimator from the regression yi1 on xi1, (xiD^2)^{g2} will not be consistent for D1 and g2. (This is an example of a "forbidden regression.") When g2 = 1, the plug-in method works: it is just the usual 2SLS estimator.

14.2. a. When r1 = 1, we obtain the level-level model, hours = -g1 + z1D1 + g1·wage + u1. Using the hint, let r1 → 0 to get hours = z1D1 + g1·log(wage) + u1.

b. We cannot use a standard t test after estimating the full model (say, by nonlinear 2SLS), because r1 cannot be estimated under H0. The score test and QLR test also fail because of lack of identification under H0. What we can do is fix a value for r1, and then use a t test on (wage^{r1} - 1)/r1 after 2SLS (or GMM more generally). This need not be a very good test for detecting g1 ≠ 0 if our guess for r1 is not close to the actual value. There is a growing literature on testing hypotheses when parameters are not identified under the null.

c. If Var(u1|z) = s1^2, use nonlinear 2SLS, where we would use z and functions of z as IVs. If we are not willing to assume homoskedasticity, GMM is generally more efficient.

d. The residual function is r(Q) = hours - z1D1 - g1(wage^{r1} - 1)/r1, where Q = (D1',g1,r1)'. The gradient is

    Dqr(Q) = {-z1, -(wage^{r1} - 1)/r1, g1[(wage^{r1} - 1) - r1·wage^{r1}·log(wage)]/r1^2},

where I used the hint. The score is just the transpose.

e. Estimate D1 and g1 by 2SLS, or use the GMM estimator that accounts for heteroskedasticity, under the restriction r1 = 1. Suppose the instruments are zi, a 1 × L vector. This is just linear estimation because the model is linear under H0. Then, using Zi = zi and

    r~i(Q~) = hoursi - zi1D~1 - g~1(wagei - 1),
    Dqr~i(Q~) = {-zi1, -(wagei - 1), g~1[(wagei - 1) - wagei·log(wagei)]},

use the score statistic in equation (14.32).

14.3. Let Zi* be the matrix of optimal instruments in (14.63), where we suppress its dependence on xi. Let Zi be a G × L matrix that is a function of xi, and let Λo be the probability limit of the weighting matrix. Then the asymptotic variance of the GMM estimator has the form (14.10) with Go ≡ E[Zi'Ro(xi)]. So, in (14.54), take A ≡ Go'ΛoGo and s(wi) ≡ Go'ΛoZi'r(wi,Qo). The optimal score function is s*(wi) ≡ Ro(xi)'Ωo(xi)^{-1}r(wi,Qo). Now we can verify (14.57) with r = 1:

    E[s(wi)s*(wi)'] = Go'ΛoE[Zi'r(wi,Qo)r(wi,Qo)'Ωo(xi)^{-1}Ro(xi)]
                    = Go'ΛoE[Zi'E{r(wi,Qo)r(wi,Qo)'|xi}Ωo(xi)^{-1}Ro(xi)]
                    = Go'ΛoE[Zi'Ωo(xi)Ωo(xi)^{-1}Ro(xi)] = Go'ΛoGo = A.

14.4. a. The residual function for the conditional mean model E(yi|xi) = m(xi,Bo) is ri(B) ≡ yi - m(xi,B). Then Ωo(xi) in (14.61) is just a scalar, Ωo(xi) = Var(yi|xi) ≡ wo(xi). Under WNLS.3, wo(xi) = so^2·h(xi,Go) for a known function h(·). Further, Ro(xi) ≡ E[Dbri(Bo)|xi] = -Dbm(xi,Bo), and so the optimal instruments are Dbm(xi,Bo)/wo(xi). The asymptotic variance of the efficient IV estimator is obtained from (14.66):

    {E[Dbm(xi,Bo)'[wo(xi)]^{-1}Dbm(xi,Bo)]}^{-1} = so^2{E[Dbm(xi,Bo)'Dbm(xi,Bo)/h(xi,Go)]}^{-1},

which is the asymptotic variance of the WNLS estimator under WNLS.1, WNLS.2, and WNLS.3.

b. If Var(yi|xi) = so^2 then NLS achieves the efficiency bound, as is seen by setting h(x,Go) ≡ 1 in part a.

c. Now let ri1(B) ≡ ui(B) ≡ yi - m(xi,B) and ri2(B,s^2) ≡ [yi - m(xi,B)]^2 - s^2. Let ri(Q) denote the 2 × 1 vector obtained by stacking the two residual functions. Then the moment conditions can be written as

    E[ri(Qo)|xi] = 0,

where Qo = (Bo',so^2)'. To obtain the efficient IVs, we first need E[Dqri(Qo)|xi]. But

    Dqri(Q) = [ -Dbmi(B)         0 ]
              [ -2Dbmi(B)ui(B)  -1 ].

Evaluating at Qo and using E[ui(Bo)|xi] = 0 gives

    Ro(xi) = E[Dqri(Qo)|xi] = [ -Dbmi(Bo)   0 ]
                              [     0      -1 ].

We also need

    Ωo(xi) = E[ri(Qo)ri(Qo)'|xi] = [    so^2         E(ui^3|xi)      ]
                                   [ E(ui^3|xi)   E(ui^4|xi) - so^4 ],

where ui ≡ yi - m(xi,Bo). The optimal IVs are [Ωo(xi)]^{-1}Ro(xi). If E(ui^3|xi) = 0, as occurs under conditional symmetry of ui, then the asymptotic variance matrix of the optimal IV estimator is block diagonal, and for B^ it is the same as for NLS. In other words, adding the moment condition corresponding to the homoskedasticity assumption does not improve efficiency over NLS under symmetry, even if E(ui^4|xi) is not constant. If, in addition, E(ui^4|xi) is constant, then the usual estimator of so^2 based on the sum of squared NLS residuals is efficient.

14.5. We can write the unrestricted linear projection as

    yit = pt0 + xiPt + vit, t = 1,2,3,

where each Pt is 3K × 1; together with the intercept pt0, each period contributes 1 + 3K parameters, and P is the (3 + 9K) × 1 vector obtained by stacking (pt0,Pt')', t = 1,2,3. Let Q = (j,L1',L2',L3',B')'. With the restrictions imposed on the Pt we have

    pt0 = j, t = 1,2,3,
    P1 = [(L1 + B)',L2',L3']', P2 = [L1',(L2 + B)',L3']', P3 = [L1',L2',(L3 + B)']'.

Therefore, we can write P = HQ for the (3 + 9K) × (1 + 4K) matrix H defined by

    H = [ 1   0   0   0   0
          0  IK   0   0  IK
          0   0  IK   0   0
          0   0   0  IK   0
          1   0   0   0   0
          0  IK   0   0   0
          0   0  IK   0  IK
          0   0   0  IK   0
          1   0   0   0   0
          0  IK   0   0   0
          0   0  IK   0   0
          0   0   0  IK  IK ],

where each 0 denotes a conformable block of zeros.

14.6. By the hint, it suffices to show that

    [Avar √N(Q^ - Qo)]^{-1} - [Avar √N(Q~ - Qo)]^{-1}

is p.s.d. This difference is Ho'Λo^{-1}Ho - Ho'Ξo^{-1}Ho = Ho'(Λo^{-1} - Ξo^{-1})Ho. This is positive semi-definite if Λo^{-1} - Ξo^{-1} is p.s.d., which again holds by the hint because Ξo - Λo is assumed to be p.s.d.

14.7. With h(Q) = HQ, the minimization problem becomes

    min over Q of (P^ - HQ)'(Λ^)^{-1}(P^ - HQ),

where Q ranges freely (no restrictions are placed on Q). The first order condition is easily seen to be

    -2H'(Λ^)^{-1}(P^ - HQ^) = 0, or [H'(Λ^)^{-1}H]Q^ = H'(Λ^)^{-1}P^.

Therefore, assuming H'(Λ^)^{-1}H is nonsingular -- which occurs w.p.a.1 when H'Λo^{-1}H is nonsingular -- we have

    Q^ = [H'(Λ^)^{-1}H]^{-1}H'(Λ^)^{-1}P^.
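
In code, the minimum distance estimator is a single solve. A schematic sketch (the unrestricted estimates and weighting matrix are assumed given; H is as constructed in Problem 14.5):

import numpy as np

def cmd_estimator(pi_hat, W, H):
    # pi_hat: (S,) unrestricted estimates;  W: (S x S) weighting matrix,
    # playing the role of the inverted matrix above;  H: (S x P1) with P = H*Q.
    # Returns Q^ = (H'WH)^{-1} H'W pi_hat.
    HtW = H.T @ W
    return np.linalg.solve(HtW @ H, HtW @ pi_hat)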

14.8. From the efficiency result of maximum likelihood -- see the discussion

on page 439 -- it is no less asymptotically efficient to use the density of

(yi0,yi1,...,yiT) than to use the conditional distribution of (yi1,...,yiT)

given yi0. The cost of the asymptotic efficiency is that if we misspecify

f0(y0;Q), then the unconditional MLE will generally be inconsistent for Qo.
The MLE that conditions on yi0 is consistent provided we have the densities

ft(yt|yt-1;Q) correctly specified, t = 1,...,T. As ft(yt|yt-1;Q) is the density of

interest, we are usually willing to put more effort into testing our

specification of it.

14.9. We have to verify equations (14.55) and (14.56) for the random effects and fixed effects estimators. The choices of si1, si2 (with added i subscripts for clarity), A1, and A2 are given in the hint. Now, from Chapter 10, we know that E(riri'|xi) = su^2·IT under RE.1, RE.2, and RE.3, where ri ≡ vi - λjTv̄i is the vector of quasi-demeaned errors. Therefore,

    E(si1si1') = E(Xˇi'riri'Xˇi) = su^2·E(Xˇi'Xˇi) ≡ su^2·A1

by the usual iterated expectations argument. This means that, in (14.55), r ≡ su^2. Now, we just need to verify (14.56) for this choice of r. But si2si1' = X¨i'uiri'Xˇi. Now, as described in the hint,

    X¨i'ri = X¨i'(vi - λjTv̄i) = X¨i'vi = X¨i'(cijT + ui) = X¨i'ui.

So si2si1' = X¨i'riri'Xˇi and therefore E(si2si1'|xi) = X¨i'E(riri'|xi)Xˇi = su^2·X¨i'Xˇi. It follows that E(si2si1') = su^2·E(X¨i'Xˇi). To finish off the proof, note that

    X¨i'Xˇi = X¨i'(Xi - λjTx̄i) = X¨i'Xi = X¨i'X¨i.

This verifies (14.56) with r = su^2.

SOLUTIONS TO CHAPTER 15 PROBLEMS

15.1. a. Since the regressors are all orthogonal by construction -- dki·dmi = 0 for k ≠ m, and all i -- the coefficient on dm is obtained from the regression yi on dmi, i = 1,...,N. But this is easily seen to be the fraction

of yi in the sample falling into category m. Therefore, the fitted values are

just the cell frequencies, and these are necessarily in [0,1].

b. The fitted values for each category will be the same. If we drop d1

but add an overall intercept, the overall intercept is the cell frequency for

the first category, and the coefficient on dm becomes the difference in cell

frequency between category m and category one, m = 2, ..., M.

15.2. a. First, since utility is increasing in both c and q, the budget constraint is binding at the optimum: ci + piqi = mi. Plugging c = mi - piq into the utility function reduces the problem to

    max over q ≥ 0 of (mi - piq) + ai·log(1 + q).

Define utility as a function of q, as

    si(q) ≡ (mi - piq) + ai·log(1 + q).

Then, for all q ≥ 0,

    dsi/dq (q) = -pi + ai/(1 + q).

The optimal solution is qi = 0 if the marginal utility of charitable giving at q = 0 is nonpositive, that is, if

    dsi/dq (0) = -pi + ai ≤ 0, or ai ≤ pi.

(This can also be obtained by solving the Kuhn-Tucker conditions.) Thus, for this utility function, ai can be interpreted as the reservation price above which no charitable contributions will be made; in other words, we have the corner solution qi = 0 whenever the price of charitable giving is too high relative to the marginal utility of charitable giving. On the other hand, if ai > pi, then an interior solution exists (qi > 0) and necessarily solves the first order condition

    dsi/dq (qi) = -pi + ai/(1 + qi) ≡ 0, or 1 + qi = ai/pi.

b. By definition of yi, yi = 1 if and only if ai/pi > 1, or log(ai/pi) > 0. If ai = exp(ziG + vi), this is equivalent to ziG + vi - log(pi) > 0. Therefore,

    P(yi = 1|zi,mi,pi) = P(yi = 1|zi,pi) = P[ziG + vi - log(pi) > 0|zi,pi]
                       = P[vi/s > (-ziG + log(pi))/s]
                       = 1 - F[(-ziG + log(pi))/s] = F[(ziG - log(pi))/s],

where F is the cdf of vi/s and the last equality follows by symmetry of the distribution of vi/s.

15.3. a. If P(y = 1|z1,z2) = F(z1D1 + g1z2 + g2z2^2) then

    dP(y = 1|z1,z2)/dz2 = (g1 + 2g2z2)·f(z1D1 + g1z2 + g2z2^2);

for given z, this is estimated as

    (^g1 + 2^g2z2)·f(z1^D1 + ^g1z2 + ^g2z2^2),

where, of course, the estimates are the probit estimates.

b. In the model

    P(y = 1|z1,z2,d1) = F(z1D1 + g1z2 + g2d1 + g3z2d1),

the partial effect of z2 is

    dP(y = 1|z1,z2,d1)/dz2 = (g1 + g3d1)·f(z1D1 + g1z2 + g2d1 + g3z2d1).

The effect of d1 is measured as the difference in the probabilities at d1 = 1 and d1 = 0:

    P(y = 1|z,d1 = 1) - P(y = 1|z,d1 = 0) = F[z1D1 + (g1 + g3)z2 + g2] - F(z1D1 + g1z2).

Again, to estimate these effects at given z and -- in the first case -- d1, we just replace the parameters with their probit estimates, and use average or other interesting values of z.

c. We would apply the delta method from Chapter 3. Thus, we would require the full variance matrix of the probit estimates as well as the gradient of the expression of interest, such as (g1 + 2g2z2)·f(z1D1 + g1z2 + g2z2^2), with respect to all probit parameters. (Not with respect to the zj.)
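
A sketch of the delta-method computation in part c, using a statsmodels probit on simulated data (the model and all names are illustrative):

import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(4)
N = 2000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
X = np.column_stack([np.ones(N), z1, z2, z2**2])   # [1, z1, z2, z2^2]
b_o = np.array([0.1, 0.5, -0.4, 0.2])
y = (X @ b_o + rng.normal(size=N) > 0).astype(int)

res = sm.Probit(y, X).fit(disp=0)
b, V = res.params, res.cov_params()

# Partial effect of z2 at chosen values, here z1 = 0, z2 = 1.
z1v, z2v = 0.0, 1.0
xrow = np.array([1.0, z1v, z2v, z2v**2])
xb = xrow @ b
pe = (b[2] + 2*b[3]*z2v) * norm.pdf(xb)

# Gradient with respect to the probit parameters (uses f'(t) = -t*f(t)).
grad = -(b[2] + 2*b[3]*z2v) * xb * norm.pdf(xb) * xrow
grad[2] += norm.pdf(xb)                  # direct effect through g1
grad[3] += 2*z2v * norm.pdf(xb)          # direct effect through g2
se = np.sqrt(grad @ V @ grad)
print(pe, se)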

15.4. This is the kind of statement that arises out of failure to distinguish

between the underlying latent variable model and the model for P(y = 1|x).

The linear probability model assumes P(y = 1|x) = xB while, for example, the

probit model assumes that P(y = 1|x) = F(xB). Thus, both models make very

particular functional form assumptions on the response probabilities. The

fact that the probit model can be derived from a latent variable model with a

normal, homoskedastic error does not make it less plausible than the LPM. In

fact, we know that the probit functional form has some attractive properties

that the linear model does not have: F(xB) is always between zero and one,

and the marginal effect of any xj is diminishing after some point.

Incidentally, the LPM can be obtained from a latent variable model by assuming

that e has a uniform distribution over [-1,1] (actually, any symmetric,

uniform interval will do).

15.5. a. If P(y = 1|z,q) = F(z1D1 + g1z2q) then

    dP(y = 1|z,q)/dz2 = g1q·f(z1D1 + g1z2q),

assuming that z2 is not functionally related to z1.

b. Write y* = z1D1 + r, where r = g1z2q + e, and e is independent of (z,q) with a standard normal distribution. Because q is assumed independent of z, r|z ~ Normal(0, g1^2z2^2 + 1); the zero mean follows because E(r|z) = g1z2E(q|z) + E(e|z) = 0. Also,

    Var(r|z) = g1^2z2^2Var(q|z) + Var(e|z) + 2g1z2Cov(q,e|z) = g1^2z2^2 + 1

because Cov(q,e|z) = 0 by independence between e and (z,q). Thus, r/√(g1^2z2^2 + 1) has a standard normal distribution independent of z. It follows that

    P(y = 1|z) = F[z1D1/√(g1^2z2^2 + 1)].    (15.90)

c. Because P(y = 1|z) depends only on g1^2, this is what we can estimate along with D1. (For example, g1 = -2 and g1 = 2 give exactly the same model for P(y = 1|z).) This is why we define r1 ≡ g1^2. Testing H0: r1 = 0 is most easily done using the score or LM test because, under H0, we have a standard probit model. Let D~1 denote the probit estimates under the null that r1 = 0. Define F~i = F(zi1D~1), f~i = f(zi1D~1), and u~i = yi - F~i. The gradient of the mean function in (15.90) with respect to D1, evaluated at the null estimates, is simply f~izi1. The only other quantity needed is the gradient with respect to r1, evaluated at the null estimates. The partial derivative of (15.90) with respect to r1 is, for each i,

    -(zi1D1)(zi2^2/2)(r1zi2^2 + 1)^{-3/2}·f[zi1D1/√(r1zi2^2 + 1)].

When we evaluate this at r1 = 0 and D~1 we get -(zi1D~1)(zi2^2/2)f~i. Then, the score statistic can be obtained as N·Ru^2 from the regression

    u~i/√[F~i(1 - F~i)] on f~izi1/√[F~i(1 - F~i)], (zi1D~1)zi2^2·f~i/√[F~i(1 - F~i)], i = 1,...,N;

under H0, N·Ru^2 ~a χ1^2.

d. The model can be estimated by MLE using the formulation with r1 in place of g1^2. But this is not a standard probit estimation.

15.6. a. What we would like to know is that, if we exogenously change the

number of cigarettes that someone smokes per day, what effect would this have

on the probability of missing work over a three-month period? In other words,

we want to infer causality, not just find a correlation between missing work and

cigarette smoking.

b. Since people choose whether and how much to smoke, we certainly cannot

treat the data as coming from the experiment we have in mind in part a. (That

is, we cannot randomly assign people a daily cigarette consumption.) It is

possible that smokers are less healthy to begin with, or have other attributes

that cause them to miss work more often. Or, it could go the other way:

cigarette consumption may be related to personality traits that make people

harder workers. In any case, cigs might be correlated with the unobservables

in the equation.

c. If we start with the model

P(y = 1|z,cigs,q1) = F(z1D1 + g1cigs + q1), (15.91)

but ignore q1 when it is correlated with cigs, we will not consistently

estimate anything of interest, whether the model is linear or nonlinear.

Thus, we would not be estimating a causal effect. If q1 is independent of

cigs, then probit ignoring q1 does estimate the average partial effect of

another cigarette.

d. No. There are many people in the working population who do not smoke.

Thus, the distribution of cigs piles up at zero, conditional or unconditional

on z. Also, since cigs takes on integer values, it cannot be normally

distributed. But it is really the pile up at zero that is the most serious

issue.
^
e. Use the Rivers-Vuong test. Obtain the residuals, r2, from the
^
regression cigs on z. Then, estimate the probit of y on z1, cigs, r2 and use
^
a standard t test on r2. This does not rely on normality of r2 (or cigs). It

does, of course, rely on the probit model being correct for y under H0.
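
A minimal Stata sketch of this procedure (hypothetical names: y the missed-work indicator, z1a and z1b the exogenous variables in the structural equation, z2ex the instrument excluded from it):

reg cigs z1a z1b z2ex         // first stage: cigs on all exogenous variables
predict r2hat, resid
probit y z1a z1b cigs r2hat   // add the first-stage residual
test r2hat                    // H0: cigs exogenous (equivalent to the t test)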

f. Assuming people will not immediately move out of their state of

residence when the state implements no-smoking laws in the workplace, and that

state of residence is roughly independent of general health in the population,

a dummy indicator for whether the person works in a state with a new law can

be treated as exogenous and excluded from (15.91). (These situations are

often called "natural experiments.") Further, cigs is likely to be correlated

with the state law indicator, since people will not be able to smoke as much

as they otherwise would. Thus, it seems to be a reasonable instrument for

cigs.

15.7. a. The following Stata output is for part a:

. reg arr86 pcnv avgsen tottime ptime86 inc86 black hispan born60

Source | SS df MS Number of obs = 2725


-------------+------------------------------ F( 8, 2716) = 30.48
Model | 44.9720916 8 5.62151145 Prob > F = 0.0000
Residual | 500.844422 2716 .184405163 R-squared = 0.0824
-------------+------------------------------ Adj R-squared = 0.0797
Total | 545.816514 2724 .20037317 Root MSE = .42942
------------------------------------------------------------------------------
arr86 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.1543802 .0209336 -7.37 0.000 -.1954275 -.1133329
avgsen | .0035024 .0063417 0.55 0.581 -.0089326 .0159374
tottime | -.0020613 .0048884 -0.42 0.673 -.0116466 .007524
ptime86 | -.0215953 .0044679 -4.83 0.000 -.0303561 -.0128344
inc86 | -.0012248 .000127 -9.65 0.000 -.0014738 -.0009759
black | .1617183 .0235044 6.88 0.000 .1156299 .2078066
hispan | .0892586 .0205592 4.34 0.000 .0489454 .1295718
born60 | .0028698 .0171986 0.17 0.867 -.0308539 .0365936
_cons | .3609831 .0160927 22.43 0.000 .329428 .3925382
------------------------------------------------------------------------------

. reg arr86 pcnv avgsen tottime ptime86 inc86 black hispan born60, robust

Regression with robust standard errors Number of obs = 2725


F( 8, 2716) = 37.59
Prob > F = 0.0000
R-squared = 0.0824
Root MSE = .42942

------------------------------------------------------------------------------
| Robust
arr86 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.1543802 .018964 -8.14 0.000 -.1915656 -.1171948
avgsen | .0035024 .0058876 0.59 0.552 -.0080423 .0150471
tottime | -.0020613 .0042256 -0.49 0.626 -.010347 .0062244
ptime86 | -.0215953 .0027532 -7.84 0.000 -.0269938 -.0161967
inc86 | -.0012248 .0001141 -10.73 0.000 -.0014487 -.001001
black | .1617183 .0255279 6.33 0.000 .1116622 .2117743
hispan | .0892586 .0210689 4.24 0.000 .0479459 .1305714
born60 | .0028698 .0171596 0.17 0.867 -.0307774 .036517
_cons | .3609831 .0167081 21.61 0.000 .3282214 .3937449
------------------------------------------------------------------------------

The estimated effect from increasing pcnv from .25 to .75 is about -.154(.5) =

-.077, so the probability of arrest falls by about 7.7 points. There are no

important differences between the usual and robust standard errors. In fact,

in a couple of cases the robust standard errors are notably smaller.

b. The robust statistic and its p-value are obtained by using the "test"

command after appending "robust" to the regression command:

. test avgsen tottime

( 1) avgsen = 0.0
( 2) tottime = 0.0

F( 2, 2716) = 0.18
Prob > F = 0.8320

. qui reg arr86 pcnv avgsen tottime ptime86 inc86 black hispan born60

. test avgsen tottime

( 1) avgsen = 0.0
( 2) tottime = 0.0

F( 2, 2716) = 0.18
Prob > F = 0.8360

c. The probit model is estimated as follows:

. probit arr86 pcnv avgsen tottime ptime86 inc86 black hispan born60

Iteration 0: log likelihood = -1608.1837


Iteration 1: log likelihood = -1486.3157
Iteration 2: log likelihood = -1483.6458
Iteration 3: log likelihood = -1483.6406

Probit estimates Number of obs = 2725


LR chi2(8) = 249.09
Prob > chi2 = 0.0000
Log likelihood = -1483.6406 Pseudo R2 = 0.0774

------------------------------------------------------------------------------
arr86 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.5529248 .0720778 -7.67 0.000 -.6941947 -.4116549
avgsen | .0127395 .0212318 0.60 0.548 -.028874 .0543531
tottime | -.0076486 .0168844 -0.45 0.651 -.0407414 .0254442
ptime86 | -.0812017 .017963 -4.52 0.000 -.1164085 -.0459949
inc86 | -.0046346 .0004777 -9.70 0.000 -.0055709 -.0036983
black | .4666076 .0719687 6.48 0.000 .3255516 .6076635
hispan | .2911005 .0654027 4.45 0.000 .1629135 .4192875
born60 | .0112074 .0556843 0.20 0.840 -.0979318 .1203466
_cons | -.3138331 .0512999 -6.12 0.000 -.4143791 -.213287
------------------------------------------------------------------------------

Now, we must compute the difference in the normal cdf at the two different

values of pcnv, black = 1, hispan = 0, born60 = 1, and at the average values

of the remaining variables:

. sum avgsen tottime ptime86 inc86

Variable | Obs Mean Std. Dev. Min Max


---------+-----------------------------------------------------
avgsen | 2725 .6322936 3.508031 0 59.2
tottime | 2725 .8387523 4.607019 0 63.4
ptime86 | 2725 .387156 1.950051 0 12
inc86 | 2725 54.96705 66.62721 0 541

. di -.313 + .0127*.632 - .0076*.839 - .0812*.387 - .0046*54.97 + .467 + .0112


-.1174364

. di normprob(-.553*.75 - .117) - normprob(-.553*.25 - .117)


-.10181543

This last command shows that the probability falls by about .10, which is

somewhat larger than the effect obtained from the LPM.

d. To obtain the percent correctly predicted for each outcome, we first

generate the predicted values of arr86 as described on page 465:

. predict phat
(option p assumed; Pr(arr86))

. gen arr86h = phat > .5

. tab arr86h arr86

| arr86
arr86h | 0 1 | Total
-----------+----------------------+----------
0 | 1903 677 | 2580
1 | 67 78 | 145
-----------+----------------------+----------
Total | 1970 755 | 2725

. di 1903/1970
.96598985

. di 78/755
.10331126

For men who were not arrested, the probit predicts correctly about 96.6% of

the time. Unfortunately, for the men who were arrested, the probit is correct

only about 10.3% of the time. The overall percent correctly predicted is

quite high, but we cannot very well predict the outcome we would most like to

predict.

e. Adding the quadratic terms gives

. probit arr86 pcnv avgsen tottime ptime86 inc86 black hispan born60 pcnvsq
pt86sq inc86sq

Iteration 0: log likelihood = -1608.1837


Iteration 1: log likelihood = -1452.2089
Iteration 2: log likelihood = -1444.3151
Iteration 3: log likelihood = -1441.8535
Iteration 4: log likelihood = -1440.268
Iteration 5: log likelihood = -1439.8166
Iteration 6: log likelihood = -1439.8005
Iteration 7: log likelihood = -1439.8005

Probit estimates Number of obs = 2725


LR chi2(11) = 336.77
Prob > chi2 = 0.0000
Log likelihood = -1439.8005 Pseudo R2 = 0.1047

------------------------------------------------------------------------------
arr86 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | .2167615 .2604937 0.83 0.405 -.2937968 .7273198
avgsen | .0139969 .0244972 0.57 0.568 -.0340166 .0620105
tottime | -.0178158 .0199703 -0.89 0.372 -.056957 .0213253
ptime86 | .7449712 .1438485 5.18 0.000 .4630333 1.026909
inc86 | -.0058786 .0009851 -5.97 0.000 -.0078094 -.0039478
black | .4368131 .0733798 5.95 0.000 .2929913 .580635
hispan | .2663945 .067082 3.97 0.000 .1349163 .3978727
born60 | -.0145223 .0566913 -0.26 0.798 -.1256351 .0965905
pcnvsq | -.8570512 .2714575 -3.16 0.002 -1.389098 -.3250042
pt86sq | -.1035031 .0224234 -4.62 0.000 -.1474522 -.059554
inc86sq | 8.75e-06 4.28e-06 2.04 0.041 3.63e-07 .0000171
_cons | -.337362 .0562665 -6.00 0.000 -.4476423 -.2270817
------------------------------------------------------------------------------

note: 51 failures and 0 successes completely determined.

. test pcnvsq pt86sq inc86sq

( 1) pcnvsq = 0.0
( 2) pt86sq = 0.0
( 3) inc86sq = 0.0
chi2( 3) = 38.54
Prob > chi2 = 0.0000

The quadratics are individually and jointly significant. The quadratic in


pcnv means that, at low levels of pcnv, there is actually a positive
relationship between probability of arrest and pcnv, which does not make much
sense. The turning point is easily found as .217/(2*.857) ~ .127, which means
that there is an estimated deterrent effect over most of the range of pcnv.

15.8. a. Here is my Stata session:

. gen smokes = cigs > 0

. tab smokes
smokes| Freq. Percent Cum.
------------+-----------------------------------
0 | 1176 84.73 84.73
1 | 212 15.27 100.00
------------+-----------------------------------
Total | 1388 100.00

. probit smokes motheduc white lfaminc

Probit Estimates Number of obs = 1387


chi2(3) = 92.67
Prob > chi2 = 0.0000
Log Likelihood = -546.76991 Pseudo R2 = 0.0781

------------------------------------------------------------------------------
smokes | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
motheduc | -.1450599 .0207899 -6.977 0.000 -.1858074 -.1043124
white | .1896765 .1098804 1.726 0.084 -.0256852 .4050382
lfaminc | -.1669109 .0498894 -3.346 0.000 -.2646923 -.0691296
_cons | 1.126276 .2504608 4.497 0.000 .6353822 1.617171
------------------------------------------------------------------------------

. sum faminc

Variable | Obs Mean Std. Dev. Min Max


---------+-----------------------------------------------------
faminc | 1388 29.02666 18.73928 .5 65

. di 1.126 - .167*log(29.027)
.56350619

. di normprob(-.145*16 + .5635) - normprob(-.145*12 + .5635)
-.08019603

For nonwhite women at the average income level, the estimated difference in

the probability of smoking between college graduates and high school graduates

is about -.08; that is, women with a college education are .08 less likely to smoke.

b. faminc is probably not exogenous, since, at a minimum, it is likely

correlated with quality of health care. It might also be correlated with

unobserved cultural factors that are correlated with smoking.

c. The reduced form equation for lfaminc is estimated as follows:

. reg lfaminc motheduc white fatheduc

Source | SS df MS Number of obs = 1191


---------+------------------------------ F( 3, 1187) = 119.23
Model | 140.936735 3 46.9789115 Prob > F = 0.0000
Residual | 467.690904 1187 .394010871 R-squared = 0.2316
---------+------------------------------ Adj R-squared = 0.2296
Total | 608.627639 1190 .511451797 Root MSE = .6277

------------------------------------------------------------------------------
lfaminc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
motheduc | .0709044 .0098338 7.210 0.000 .0516109 .090198
white | .3452115 .050418 6.847 0.000 .2462931 .4441298
fatheduc | .0616625 .008708 7.081 0.000 .0445777 .0787473
_cons | 1.241413 .1103648 11.248 0.000 1.024881 1.457945
------------------------------------------------------------------------------

. predict v2hat, resid


(197 missing values generated)

As expected, fatheduc has a positive partial effect on lfaminc, and the

relationship is statistically significant. We need the residuals from this

regression for the next part. Note that we lose 197 observations due to

missing data on fatheduc.

d. To test the null of exogeneity, we estimate the probit that includes

^v2:

. probit smokes motheduc white lfaminc v2hat

Probit Estimates Number of obs = 1191


chi2(4) = 79.43
Prob > chi2 = 0.0000
Log Likelihood = -432.06242 Pseudo R2 = 0.0842

------------------------------------------------------------------------------
smokes | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
motheduc | -.0826247 .0465203 -1.776 0.076 -.1738029 .0085535
white | .4611075 .1965242 2.346 0.019 .0759272 .8462879
lfaminc | -.7622559 .3652944 -2.087 0.037 -1.47822 -.046292
v2hat | .6107298 .3708066 1.647 0.100 -.1160378 1.337497
_cons | 1.98796 .5996364 3.315 0.000 .8126946 3.163226
------------------------------------------------------------------------------

The evidence of endogeneity is not especially strong (the p-value on v2hat is .100), but even if it were, we

would not really know whether it is because lfaminc is endogenous or because

fatheduc belongs in the structural equation. Remember, the test can be

interpreted as a test for endogeneity of lfaminc only when we maintain that

fatheduc is exogenous.

This is not a very good example, but it shows you how to mechanically

carry out the tests.

15.9. a. Let P(y = 1|x) = xB, where x1 = 1. Then, for each i,

li(B) = yilog(xiB) + (1 - yi)log(1 - xiB),

which is only well-defined for 0 < xiB < 1.

b. For any possible estimate ^B, the log-likelihood function is well-

defined only if 0 < xi^B < 1 for all i = 1,...,N. Therefore, during the

iterations to obtain the MLE, this condition must be checked. It may be

impossible to find an estimate that satisfies these inequalities for every

observation, especially if N is large.
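
A quick Stata check of how binding these inequalities are (hypothetical names: y binary, x1 and x2 the regressors); if even the OLS fitted values leave the unit interval, an MLE satisfying all N constraints is unlikely to exist:

reg y x1 x2
predict yh                    // fitted values xi*Bhat
count if yh <= 0 | yh >= 1    // observations where log(xiB) or log(1 - xiB) fails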

c. This follows from the KLIC: the true density of y given x --

evaluated at the true values, of course -- maximizes the expected log likelihood (that is, minimizes the KLIC). Since the MLEs

are consistent for the unknown parameters, asymptotically the true density

will produce the highest average log likelihood function. So, just as we can

use an R-squared to choose among different functional forms for E(y|x), we can

use values of the log-likelihood to choose among different models for P(y =

1|x) when y is binary.

15.10. a. There are several possibilities. One is to define ^pi = ^P(y = 1|xi)

-- the estimated response probabilities -- and obtain the square of the

correlation between yi and ^pi. For the LPM, this is just the usual R-squared.

For the general index model, G(xi^B) is the estimate of E(y|xi), and so it

makes sense to compute an analogous goodness-of-fit measure.

always between zero and one. An alternative is to use the sum of squared

residuals form. While this produces the same R-squared measure for the linear

model, it does not for nonlinear models.

b. I will report the square of the correlation between yi and the fitted

probabilities for the LPM and probit. The LPM R-squared is about .106 and

that for probit is higher, about .116. So probit is preferred, marginally, on

this goodness-of-fit measure.
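
A minimal Stata sketch of this comparison (hypothetical names: y the binary response, x1 and x2 the explanatory variables):

reg y x1 x2
predict ph_lpm                // LPM fitted values
corr y ph_lpm
di "LPM R-squared: " r(rho)^2
probit y x1 x2
predict ph_pro                // probit response probabilities
corr y ph_pro
di "probit R-squared: " r(rho)^2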

15.11. We really need to make two assumptions. The first is a conditional

independence assumption: given xi = (xi1,...,xiT), (yi1,...,yiT) are

independent. This allows us to write

f(y1,...,yT|xi) = f1(y1|xi)WWWfT(yT|xi),

that is, the joint density (conditional on xi) is the product of the marginal

densities (each conditional on xi). The second assumption is a strict

exogeneity assumption: D(yit|xi) = D(yit|xit), t = 1,...,T. When we add the

standard assumption for pooled probit -- that D(yit|xit) follows a probit

model -- then

f(y1,...,yT|xi) = prod(t=1 to T) [F(xitB)]^yt[1 - F(xitB)]^(1-yt),

and so pooled probit is the conditional MLE.
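
If one doubts the conditional independence assumption but maintains strict exogeneity and the probit model for D(yit|xit), pooled probit remains consistent; only the standard errors need adjusting for serial correlation in the scores. A sketch with hypothetical names (long-format panel with identifier id):

probit y x1 x2, vce(cluster id)    // pooled probit, panel-robust SEs
* older syntax: probit y x1 x2, robust cluster(id)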

15.12. We can extend the T = 2 case on page 491:

P(yi1 = 1|xi,ci,ni = 1) = P(yi1 = 1,ni = 1|xi,ci)/P(ni = 1|xi,ci)

= P(yi1 = 1,yi2 = 0,yi3 = 0|xi,ci)/{P(yi1 = 1,yi2 = 0,yi3 = 0|xi,ci)

+ P(yi1 = 0,yi2 = 1,yi3 = 0|xi,ci) + P(yi1 = 0,yi2 = 0,yi3 = 1|xi,ci)}.

Now, we just use the conditional independence assumption (across t) and the

logistic functional form:

P(yi1 = 1,yi2 = 0,yi3 = 0|xi,ci) = L(xi1B + ci)[1 - L(xi2B + ci)]

W[1 - L(xi3B + ci)]


P(yi1 = 0,yi2 = 1,yi3 = 0|xi,ci) = [1 - L(xi1B + ci)]L(xi2B + ci)

W[1 - L(xi3B + ci)]


and

P(yi1 = 0,yi2 = 0,yi3 = 1|xi,ci) = [1 - L(xi1B + ci)]

W[1 - L(xi2B + ci)]L(xi3B + ci).


Now, the term

1/{[1 + exp(xi1B + ci)]W[1 + exp(xi2B + ci)]W[1 + exp(xi3B + ci)]}

appears multiplicatively in both the numerator and denominator, and so they

cancel. Therefore,

P(yi1 = 1|xi,ci,ni = 1) = exp(xi1B + ci)/[exp(xi1B + ci)

+ exp(xi2B + ci) + exp(xi3B + ci)]

= exp(xi1B)/[exp(xi1B) + exp(xi2B) + exp(xi3B)].

Also,

P(yi2 = 1|xi,ci,ni = 1) = exp(xi2B)/[exp(xi1B) + exp(xi2B) + exp(xi3B)]

and

P(yi3 = 1|xi,ci,ni = 1) = exp(xi3B)/[exp(xi1B) + exp(xi2B) + exp(xi3B)],

which are of the conditional logit form in (15.80). A consistent estimator of

B is obtained using only the ni = 1 observations and applying conditional

logit. This, however, would be inefficient because it does not use the ni = 2

observations.

A similar argument can be used for the three possible configurations with

ni = 2, which leads to the log-likelihood conditional on (xi,ni), where ci has

dropped out. For example,

P(yi1 = 1,yi2 = 1|xi,ci,ni = 2) = exp[(xi1 + xi2)B]

/{exp[(xi1 + xi2)B] + exp[(xi1 + xi3)B] + exp[(xi2 + xi3)B]}.

Again, this has the conditional logit form, but where the explanatory

variables consist of sums of explanatory variables across two different time

periods.
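
As a practical note, the conditional MLE that pools the ni = 1 and ni = 2 configurations is exactly Chamberlain's fixed effects logit, which Stata implements directly. A sketch with hypothetical names (long-format panel with identifier id and time-varying regressors x1, x2):

clogit y x1 x2, group(id)     // Chamberlain's conditional (fixed effects) logit
* equivalently, after xtset id: xtlogit y x1 x2, fe

Individuals with yit constant across the three periods (ni = 0 or ni = 3) drop out automatically, since ni then perfectly predicts their outcomes.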

15.13. a. If there are no covariates, there is no point in using any method

other than a straight comparison of means. The estimated probabilities for

the treatment and control groups, both before and after the policy change,

will be identical across models.

b. Let d2 be a binary indicator for the second time period, and let dB be

an indicator for the treatment group. Then a probit model to evaluate the

treatment effect is

P(y = 1|x) = F(d0 + d1d2 + d2dB + d3d2WdB + xG),

where x is a vector of covariates. We would estimate all parameters from a

probit of y on 1, d2, dB, d2WdB, and x using all observations. Once we have

the estimates, we need to compute the "difference-in-differences" estimate,

which requires either plugging in a value for x, say the sample average xbar,

or averaging the differences across xi. In the former case, we have

^q _ [F(^d0 + ^d1 + ^d2 + ^d3 + xbar^G) - F(^d0 + ^d2 + xbar^G)]

        - [F(^d0 + ^d1 + xbar^G) - F(^d0 + xbar^G)],

and in the latter we have

~q _ N^-1 S(i=1 to N) {[F(^d0 + ^d1 + ^d2 + ^d3 + xi^G) - F(^d0 + ^d2 + xi^G)]

        - [F(^d0 + ^d1 + xi^G) - F(^d0 + xi^G)]}.

Both are estimates of the difference, between groups B and A, of the change in

the response probability over time.

c. We would have to use the delta method to obtain a valid standard error

for either ^q or ~q.
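
A minimal Stata sketch of computing ~q after the probit (hypothetical names: y, d2, dB, and covariates x1, x2):

gen d2dB = d2*dB
probit y d2 dB d2dB x1 x2
gen xg = _b[x1]*x1 + _b[x2]*x2            // covariate index xi*Ghat
gen qi = (normprob(_b[_cons] + _b[d2] + _b[dB] + _b[d2dB] + xg)    ///
        - normprob(_b[_cons] + _b[dB] + xg))                       ///
        - (normprob(_b[_cons] + _b[d2] + xg) - normprob(_b[_cons] + xg))
sum qi                                    // the mean of qi is ~q

In newer versions of Stata, the margins command computes such averaged contrasts along with delta-method standard errors.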

15.14. a. The following Stata output contains the linear regression results.

Since pctstck is discrete (taking on only 0, 50, 100), it seems likely that

heteroskedasticity is present in a linear model. In fact, the robust standard

errors are not very different from the usual ones (not reported).

. reg pctstck choice age educ female black married finc25-finc101 wealth89
prftshr, robust

Regression with robust standard errors Number of obs = 194


F( 14, 179) = 2.15
Prob > F = 0.0113
R-squared = 0.0998
Root MSE = 39.134

------------------------------------------------------------------------------
| Robust
pctstck | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
choice | 12.04773 5.994437 2.01 0.046 .2188715 23.87658
age | -1.625967 .8327895 -1.95 0.052 -3.269315 .0173813
educ | .7538685 1.172328 0.64 0.521 -1.559493 3.06723
female | 1.302856 7.148595 0.18 0.856 -12.80351 15.40922
black | 3.967391 8.974971 0.44 0.659 -13.74297 21.67775
married | 3.303436 8.369616 0.39 0.694 -13.21237 19.81924
finc25 | -18.18567 16.00485 -1.14 0.257 -49.76813 13.39679
finc35 | -3.925374 15.86275 -0.25 0.805 -35.22742 27.37668
finc50 | -8.128784 15.3762 -0.53 0.598 -38.47072 22.21315
finc75 | -17.57921 16.6797 -1.05 0.293 -50.49335 15.33493
finc100 | -6.74559 16.7482 -0.40 0.688 -39.7949 26.30372
finc101 | -28.34407 16.57814 -1.71 0.089 -61.05781 4.369671
wealth89 | -.0026918 .0114136 -0.24 0.814 -.0252142 .0198307
prftshr | 15.80791 8.107663 1.95 0.053 -.190984 31.80681
_cons | 134.1161 58.87288 2.28 0.024 17.9419 250.2902
------------------------------------------------------------------------------

b. With relatively few husband-wife pairs -- 23 in this application -- we

do not expect big differences in standard errors, and we do not see them:

. reg pctstck choice age educ female black married finc25-finc101 wealth89
prftshr, robust cluster(id)

Regression with robust standard errors Number of obs = 194


F( 14, 170) = 2.12
Prob > F = 0.0128
R-squared = 0.0998
Number of clusters (id) = 171 Root MSE = 39.134

------------------------------------------------------------------------------
| Robust
pctstck | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
choice | 12.04773 6.184085 1.95 0.053 -.1597615 24.25521
age | -1.625967 .8192942 -1.98 0.049 -3.243267 -.0086663
educ | .7538685 1.1803 0.64 0.524 -1.576064 3.083801
female | 1.302856 7.000538 0.19 0.853 -12.51632 15.12203
black | 3.967391 8.711611 0.46 0.649 -13.22948 21.16426
married | 3.303436 8.624168 0.38 0.702 -13.72082 20.32769
finc25 | -18.18567 16.82939 -1.08 0.281 -51.40716 15.03583
finc35 | -3.925374 16.17574 -0.24 0.809 -35.85656 28.00581
finc50 | -8.128784 15.91447 -0.51 0.610 -39.54421 23.28665
finc75 | -17.57921 17.2789 -1.02 0.310 -51.68804 16.52963
finc100 | -6.74559 17.24617 -0.39 0.696 -40.78983 27.29865
finc101 | -28.34407 17.10783 -1.66 0.099 -62.1152 5.427069
wealth89 | -.0026918 .0119309 -0.23 0.822 -.0262435 .02086
prftshr | 15.80791 8.356266 1.89 0.060 -.6874976 32.30332
_cons | 134.1161 58.1316 2.31 0.022 19.36333 248.8688
------------------------------------------------------------------------------

For later use, the predicted pctstck for the person described in the problem,

with choice = 0, is about 38.37. With choice, it is roughly 50.42.

c. The ordered probit estimates follow, including commands that provide

the predictions for pctstck with and without choice:

. oprobit pctstck choice age educ female black married finc25-finc101 wealth89
prftshr

Iteration 0: log likelihood = -212.37031


Iteration 1: log likelihood = -202.0094
Iteration 2: log likelihood = -201.9865
Iteration 3: log likelihood = -201.9865

Ordered probit estimates Number of obs = 194


LR chi2(14) = 20.77
Prob > chi2 = 0.1077
Log likelihood = -201.9865 Pseudo R2 = 0.0489

------------------------------------------------------------------------------
pctstck | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
choice | .371171 .1841121 2.02 0.044 .010318 .7320241
age | -.0500516 .0226063 -2.21 0.027 -.0943591 -.005744
educ | .0261382 .0352561 0.74 0.458 -.0429626 .0952389
female | .0455642 .206004 0.22 0.825 -.3581963 .4493246
black | .0933923 .2820403 0.33 0.741 -.4593965 .6461811
married | .0935981 .2332114 0.40 0.688 -.3634878 .550684
finc25 | -.5784299 .423162 -1.37 0.172 -1.407812 .2509524
finc35 | -.1346721 .4305242 -0.31 0.754 -.9784841 .7091399
finc50 | -.2620401 .4265936 -0.61 0.539 -1.098148 .5740681
finc75 | -.5662312 .4780035 -1.18 0.236 -1.503101 .3706385
finc100 | -.2278963 .4685942 -0.49 0.627 -1.146324 .6905316
finc101 | -.8641109 .5291111 -1.63 0.102 -1.90115 .1729279
wealth89 | -.0000956 .0003737 -0.26 0.798 -.0008279 .0006368
prftshr | .4817182 .2161233 2.23 0.026 .0581243 .905312
-------------+----------------------------------------------------------------
_cut1 | -3.087373 1.623765 (Ancillary parameters)
_cut2 | -2.053553 1.618611
------------------------------------------------------------------------------

. * The estimated cut points are -3.087 and -2.053.


. * Now compute index to obtain prediction for the person described.
. * First, without choice.
. di - .050*60 + .026*12 + .046 - .262 - .000096*150
-2.9184

. * Now, with choice:


. di -2.918 + .371
-2.547

. * Now, compute probabilities.


. * First, P(pctstck = 50):

. di normprob(-2.054 + 2.918) - normprob(-3.087 + 2.918)


.37330773

. * Now, P(pctstck = 100)

. di 1 - normprob(-2.054 + 2.918)
.19379395

. * Now estimate the expected value:

. di 50*.373 + 100*.194
38.05

. * With choice:

. di normprob(-2.054 + 2.547) - normprob(-3.087 + 2.547)


.39439519

. di 1 - normprob(-2.054 + 2.547)
.31100629

. di 50*.394 + 100*.311
50.8

. di 50.8 - 38.05
12.75

. * So, using the ordered probit, the effect of choice for this person is
. * about 12.8 percentage points more in stock, which is not far from the
. * 12.1 points obtained with the linear model.

d. We can compute an R-squared for the ordered probit model by using the

squared correlation between the predicted pctstcki and the actual. The

following Stata session does this, after using the "oprobit" command:

. predict p1 p2 p3
(option p assumed; predicted probabilities)
(32 missing values generated)

. sum p1 p2 p3

Variable | Obs Mean Std. Dev. Min Max


-------------+-----------------------------------------------------
p1 | 194 .331408 .1327901 .0685269 .8053644
p2 | 194 .3701685 .0321855 .1655734 .3947809
p3 | 194 .2984236 .1245914 .0290621 .6747374

. gen pctstcko = 50*p2 + 100*p3


(32 missing values generated)

. corr pctstck pctstcko


(obs=194)

| pctstck pctstcko
-------------+------------------
pctstck | 1.0000
pctstcko | 0.3119 1.0000

. di .321^2
.103041

The R-squared for the linear regression was about .100, so the R-squared is

only slightly higher for ordered probit. In fact, the correlation between the

fitted values for the linear regression and ordered probit is .998, so the

fitted values are very similar.

15.15. We should use an interval regression model, that is, ordered probit

with known cut points. We would be assuming that the underlying GPA is

normally distributed conditional on x, but we only observe interval coded

data. (Clearly a conditional normal distribution for the GPAs is at best an

approximation.) Along with the bj -- including an intercept -- we estimate

s^2. The estimated coefficients are interpreted as if we had done a linear

regression with actual GPAs.
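
Stata implements this estimator directly as interval regression. A sketch with hypothetical names, where gpa_lo and gpa_hi hold the known endpoints of each student's reported GPA interval:

intreg gpa_lo gpa_hi x1 x2    // ordered probit with known cut points

The reported coefficients and sigma are on the scale of the underlying GPA, as described above.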

15.16. a. P(yi = 1|xi,ri) = P(xiB + ui > ri|xi,ri) = P[ui/s > (ri - xiB)/s] =

1 - F[(ri - xiB)/s] = F[xi(B/s) - (1/s)ri].


b. From part a, plim ^G = B/s and plim ^d = -1/s. So a consistent

estimator of s is -1/^d and a consistent estimator of B is -^G/^d.
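
A Stata sketch of these back-of-the-envelope estimators (hypothetical names: y the yes/no response, x1 and x2 the covariates, r the presented threshold ri):

probit y x1 x2 r
di "sigma hat = " -1/_b[r]
di "beta hat on x1 = " -_b[x1]/_b[r]
di "beta hat on x2 = " -_b[x2]/_b[r]

Standard errors for these ratios require the delta method, which is one reason the direct MLE in part c is more convenient.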

c. Just maximize the log-likelihood function with respect to B and s;

this would make it easy to obtain standard errors and other test statistics

involving B (and s). For each i,

li(B,s) = yilog{F[xi(B/s) - (1/s)ri]}

+ (1 - yi)log{1 - F[xi(B/s) - (1/s)ri]}.

d. As in part a,

P(yi = 1|xi,ri) = P(xiB + ui > ri|xi,ri) = P(ui > ri - xiB) = 1

- G(ri - xiB;D). Therefore,

li(B,D) = yilog[1 - G(ri - xiB;D)]

+ (1 - yi)log[G(ri - xiB;D)].

Note how this derivation allows for an asymmetric distribution of u.

e. Yes, because B has quantitative meaning in this particular binary

response application: bj measures the effect of xj on the average willingness

to pay. Our choice of distribution for u can certainly affect our estimation

of B. If we could observe wtpi for each i, we would use linear regression and

never have to specify a distribution for u.

15.17. a. We obtain the joint density by the product rule, since we have

independence conditional on (x,c):

f(y1,...,yG|x,c;Go) = f1(y1|x,c;Go)f2(y2|x,c;Go)WWWfG(yG|x,c;Go).

b. The density of (y1,...,yG) given x is obtained by integrating out with

respect to the distribution of c given x:

g(y1,...,yG|x;Go) = INT(-inf to inf) [prod(g=1 to G) fg(yg|x,c;Go)] h(c|x;Do) dc,

where c is a dummy argument of integration. Because c appears in each

D(yg|x,c), y1,...,yG are dependent without conditioning on c.

c. The log likelihood for each i is

log{ INT(-inf to inf) [prod(g=1 to G) fg(yig|xi,c;G)] h(c|xi;D) dc }.
As expected, this depends only on the observed data, (xi,yi1,...,yiG), and the

unknown parameters.

15.18. a. The probability is the same as under (15.67):

P(yit = 1|xi,ai) = F(j + xitB + x̄iX + ai), t = 1,2,...,T.

The fact that ai given xi is heteroskedastic has no effect when we are

obtaining the distribution conditional on (xi,ai).

b. Let gt(yt|xi,ai;Q) = [F(j + xitB + x̄iX + ai)]^yt[1 - F(j + xitB + x̄iX +

ai)]^(1-yt). Then, by the product and integration rules,

f(y1,...,yT|xi;Q) = INT(-inf to inf) [prod(t=1 to T) gt(yt|xi,a;Q)] h(a|xi;D) da,

where h(W|xi;D) is the Normal[0,sa^2 exp(x̄iL)] density. We get the log-

likelihood by plugging in the yit and taking the natural log.

15.19. a, b. Here is the Stata output for black men. I have balanced the

panel, so that only men in the sample from 1981 through 1987 appear.

. probit employ employ_1 if black

Iteration 0: log likelihood = -2793.6715


Iteration 1: log likelihood = -2251.6435
Iteration 2: log likelihood = -2248.0357
Iteration 3: log likelihood = -2248.0349

Probit estimates Number of obs = 4038
LR chi2(1) = 1091.27
Prob > chi2 = 0.0000
Log likelihood = -2248.0349 Pseudo R2 = 0.1953

------------------------------------------------------------------------------
employ | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
employ_1 | 1.389433 .0437182 31.78 0.000 1.303747 1.475119
_cons | -.5396127 .0281709 -19.15 0.000 -.5948268 -.4843987
------------------------------------------------------------------------------

. di normprob(-.540)
.29459852

. di normprob(-.540 + 1.389)
.80205935

. di .802 - .295
.507

The difference in employment probabilities this year, based on employment

status last year, is .507.

c. With year dummies, the story is very similar:

. probit employ employ_1 y83-y87 if black

Iteration 0: log likelihood = -2793.6715


Iteration 1: log likelihood = -2220.9214
Iteration 2: log likelihood = -2215.1822
Iteration 3: log likelihood = -2215.1795

Probit estimates Number of obs = 4038


LR chi2(6) = 1156.98
Prob > chi2 = 0.0000
Log likelihood = -2215.1795 Pseudo R2 = 0.2071

------------------------------------------------------------------------------
employ | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
employ_1 | 1.321349 .0453568 29.13 0.000 1.232452 1.410247
y83 | .3427664 .0749844 4.57 0.000 .1957997 .4897331
y84 | .4586078 .0755742 6.07 0.000 .3104852 .6067304
y85 | .5200576 .0767271 6.78 0.000 .3696753 .6704399
y86 | .3936516 .0774703 5.08 0.000 .2418125 .5454907
y87 | .5292136 .0773031 6.85 0.000 .3777023 .6807249
_cons | -.8850412 .0556041 -15.92 0.000 -.9940233 -.7760591
------------------------------------------------------------------------------

. di normprob(-.885 + .529)
.36092028

. di normprob(-.885 + .529 + 1.321)


.83272759

The estimated state dependence in 1987 is about .472. Employment probabilities

are generally rising over this period.

d. Here is one way to estimate the unobserved effects model:

. gen employ81 = employ if y81


(10428 missing values generated)

. replace employ81 = employ[_n-1] if y82


(1738 real changes made)

. replace employ81 = employ[_n-2] if y83


(1738 real changes made)

. replace employ81 = employ[_n-3] if y84


(1738 real changes made)

. replace employ81 = employ[_n-4] if y85


(1738 real changes made)

. replace employ81 = employ[_n-5] if y86


(1738 real changes made)

. replace employ81 = employ[_n-6] if y87


(1738 real changes made)

. xtprobit employ employ_1 employ81 y83-y87 if black, re

Fitting comparison model:

Iteration 0: log likelihood = -2793.6715


Iteration 1: log likelihood = -2207.2397
Iteration 2: log likelihood = -2200.3265
Iteration 3: log likelihood = -2200.3214

Fitting full model:

rho = 0.0 log likelihood = -2200.3214


rho = 0.1 log likelihood = -2189.493
rho = 0.2 log likelihood = -2194.3834
Iteration 0: log likelihood = -2189.493
Iteration 1: log likelihood = -2179.9725
Iteration 2: log likelihood = -2176.3849
Iteration 3: log likelihood = -2176.3738
Iteration 4: log likelihood = -2176.3738

Random-effects probit Number of obs = 4038


Group variable (i) : id Number of groups = 673

Random effects u_i ~ Gaussian Obs per group: min = 6


avg = 6.0
max = 6

Wald chi2(7) = 677.59


Log likelihood = -2176.3738 Prob > chi2 = 0.0000

------------------------------------------------------------------------------
employ | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
employ_1 | .8987858 .0677035 13.28 0.000 .7660893 1.031482
employ81 | .5662849 .088493 6.40 0.000 .3928418 .739728
y83 | .4339896 .0804062 5.40 0.000 .2763964 .5915828
y84 | .6563064 .0841192 7.80 0.000 .4914358 .821177
y85 | .7919761 .0887153 8.93 0.000 .6180972 .9658549
y86 | .6896298 .0901566 7.65 0.000 .5129262 .8663335
y87 | .8381973 .0910525 9.21 0.000 .6597376 1.016657
_cons | -1.0051 .0660937 -15.21 0.000 -1.134641 -.8755586
-------------+----------------------------------------------------------------
/lnsig2u | -1.178755 .1995222 -1.569811 -.7876984
-------------+----------------------------------------------------------------
sigma_u | .5546726 .0553347 .4561628 .6744557
rho | .2352762 .0358983 .1722434 .3126631
------------------------------------------------------------------------------
Likelihood ratio test of rho=0: chibar2(01) = 47.90 Prob >= chibar2 = 0.000

e. There is strong evidence of state dependence conditional on ci because

the coefficient on lagged employment is very significant, with t statistic =

13.3. As yet, we do not know how the coefficient .899 translates into the

estimated state dependence. Note that employ81 is also very significant,

showing that ci and employi,81 are positively correlated. The estimate of

sa^2 is (.555)^2, or ^sa^2 ~ .308.
f. The average state dependence, where we average out the distribution of

ci, is obtained as follows:

. gen prbdif87 = normprob((-1.005 + .838 + .899 + .566*employ81)/sqrt(1 +
.555^2)) - normprob((-1.005 + .838 + .566*employ81)/sqrt(1 + .555^2)) if y87
& black
(11493 missing values generated)

. sum prbdif87 if y87 & black

Variable | Obs Mean Std. Dev. Min Max


-------------+-----------------------------------------------------
prbdif87 | 673 .2831544 .025709 .2353894 .2969714

The estimated state dependence, averaged across the distribution of ci, is

.283. More precisely, this is our estimate of E[F(d87 + r + ci)] - E[F(d87 +

ci)], where the expectation is with respect to the distribution of ci. The

estimate is based on E{F[(j + d87 + r + xyi0)/(1 + sa^2)^1/2]} - E{F[(j + d87 +

xyi0)/(1 + sa^2)^1/2]} (by iterated expectations):

N^-1 S(i=1 to N) {F[(^j + ^d87 + ^r + ^xyi0)/(1 + ^sa^2)^1/2]

        - F[(^j + ^d87 + ^xyi0)/(1 + ^sa^2)^1/2]};

see page 495 in the text. Interestingly, .283 is just over half of the

estimated state dependence we obtain if we ignore ci.

15.20. Since y1* = z1D1 + g(y2)A1 + u1, and we can write, just as before, u1 =

q1v2 + e1, where e1|z,v2 ~ Normal(0,1 - r1^2), we have

P(y1 = 1|z,y2,v2) = F{[z1D1 + g(y2)A1 + q1v2]/(1 - r1^2)^1/2}.

Therefore, in Procedure 15.1, we will consistently estimate D1/(1 - r1^2)^1/2,

A1/(1 - r1^2)^1/2, and q1/(1 - r1^2)^1/2, where recall that A1 is now a vector of

parameters. The discussion for computing average partial effects is identical

to that on page 475.
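
A Stata sketch of this version of Procedure 15.1 with, say, g(y2) = (y2, y2^2) (hypothetical names: z1a, z1b in the structural equation and z2ex the excluded exogenous variable):

reg y2 z1a z1b z2ex           // reduced form for y2
predict v2, resid
gen y2sq = y2^2
probit y1 z1a z1b y2 y2sq v2  // estimates the scaled coefficients above

As before, the t statistic on v2 is a valid test of the null hypothesis that y2 is exogenous.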

It should also be clear that allowing for a 1 * G1 vector of endogenous


explanatory variables, y2, which appear as g1(y21), ..., gG1(y2,G1), is not

difficult. The key restriction is that the vector of reduced form errors, v2,

would have to be jointly normally distributed (along with u1). But the

endogeneity is solved by including the vector of reduced form residuals in

the probit. For details, see J.M. Wooldridge, "Unobserved Heterogeneity and

Estimation of Average Partial Effects," mimeo, Michigan State University

Department of Economics, 2002.
