SOLUTIONS TO CHAPTER 12 PROBLEMS

12.1. Take the conditional expectation of equation (12.4) with respect to x, and use E(u|x) = 0:

    E{[y - m(x,Q)]^2|x} = E(u^2|x) + 2[m(x,Qo) - m(x,Q)]E(u|x) + E{[m(x,Qo) - m(x,Q)]^2|x}
                        = E(u^2|x) + 0 + [m(x,Qo) - m(x,Q)]^2
                        = E(u^2|x) + [m(x,Qo) - m(x,Q)]^2.
The first term does not depend on Q and the second term is clearly minimized

at Q = Qo for any x. Therefore, the parameters of a correctly specified

conditional mean function minimize the squared error conditional on any value

of x. Usually, there would be multiple solutions for a particular x -- for

example, in the linear case m(x,Q) = xQ, any Q such that x(Qo - Q) = 0 sets

m(x,Qo) - m(x,Q) to zero. Uniqueness of Qo as a minimizer holds only after

we integrate out x to obtain E{[y - m(x,Q)]^2}.

12.2. a. Since u = y - E(y|x), Var(y|x) = Var(u|x) = E(u^2|x) because E(u|x) = 0. So E(u^2|x) = exp(ao + xGo).
b. If we knew the ui = yi - m(xi,Qo), then we could do a nonlinear regression of ui^2 on exp(a + xiG) and just use the asymptotic theory for nonlinear regression. The NLS estimators of a and G would then solve

    min over (a,G) of Σ_{i=1}^N [ui^2 - exp(a + xiG)]^2.

The problem is that Qo is unknown. When we replace Qo with its NLS estimator, Q^ -- that is, we replace ui^2 with (^ui)^2, the squared NLS residuals -- we are solving the problem

    min over (a,G) of Σ_{i=1}^N {[yi - m(xi,Q^)]^2 - exp(a + xiG)}^2.

This objective function has the form of a two-step M-estimator in Section 12.4. Since Q^ is generally consistent for Qo, the two-step M-estimator is generally consistent for ao and Go (under weak regularity and identification conditions). In fact, √N-consistency of ^a and ^G holds very generally.

c. We now estimate Qo by solving

    min over Q of Σ_{i=1}^N [yi - m(xi,Q)]^2/exp(^a + xi^G),

where ^a and ^G are from part b. The general theory of WNLS under Assumptions WNLS.1 to WNLS.3 can be applied.

d. Using the definition of v, write u^2 = exp(ao + xGo)v^2. Taking logs gives log(u^2) = ao + xGo + log(v^2). Now, if v is independent of x, so is log(v^2). Therefore, E[log(u^2)|x] = ao + xGo + E[log(v^2)|x] = ao + xGo + ko, where ko ≡ E[log(v^2)]. So, if we could observe the ui, an OLS regression of log(ui^2) on 1, xi would be consistent for (ao + ko), Go; in fact, it would be unbiased. By two-step estimation theory, consistency still holds if ui is replaced with ^ui, by essentially the same argument as in part b. So, if m(x,Q) is linear in Q, we can carry out a weighted NLS procedure without ever doing nonlinear estimation.
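
To make parts b-d concrete, here is a minimal Python sketch of the feasible procedure, using simulated data and a linear mean (as part d notes, everything is then linear). All names and parameter values are illustrative, not from the text.

import numpy as np

rng = np.random.default_rng(0)
N = 5000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])          # mean regressors (1, x)
beta_o = np.array([1.0, 2.0])                 # E(y|x) = xB
gamma_o = np.array([-1.0, 0.8])               # Var(u|x) = exp(ao + x*Go)
v = rng.normal(size=N)                        # v independent of x
u = np.exp(X @ gamma_o / 2) * v
y = X @ beta_o + u

# Step 1: initial least squares fit and residuals.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
uhat = y - X @ beta_hat

# Step 2 (part d): regress log(uhat^2) on 1, x.  The slope consistently
# estimates Go; the intercept estimates ao + ko, with ko = E[log(v^2)].
delta_hat = np.linalg.lstsq(X, np.log(uhat**2), rcond=None)[0]

# Step 3 (part c): weighted least squares with weights 1/h_i, where
# h_i = exp(ahat + x_i*Ghat).  The constant ko rescales all weights by the
# same factor, so it does not affect the WNLS estimates.
h = np.exp(X @ delta_hat)
Xw = X / h[:, None]
beta_wnls = np.linalg.solve(Xw.T @ X, Xw.T @ y)
print(beta_hat, beta_wnls)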

e. If we have misspecified the variance function -- or, for example, we use the approach in part d but v is not independent of x -- then we should use a fully robust variance-covariance matrix. It looks just like (12.52) except that ^ui and Dqm^i are each multiplied by exp[-(^a + xi^G)/2], or simply by exp(-xi^G/2): the constant factor exp(-^a/2) cancels out of the sandwich formula.

12.3. a. The approximate elasticity is dlog[E^(y|z)]/dlog(z1) = d[^q1 + ^q2·log(z1) + ^q3·z2]/dlog(z1) = ^q2.

b. This is approximated by 100·dlog[E^(y|z)]/dz2 = 100·^q3.

c. Since dE^(y|z)/dz2 = exp[^q1 + ^q2·log(z1) + ^q3·z2 + ^q4·z2^2]·(^q3 + 2·^q4·z2), the turning point is z2* = ^q3/(-2·^q4).

d. Since Dqm(x,Q) = exp(x1Q1 + x2Q2)x, the gradient of the mean function evaluated under the null is Dqm~i = exp(xi1Q~1)xi ≡ m~i·xi, where Q~1 is the restricted NLS estimator. From (12.72), we can compute the usual LM statistic as N·Ru^2 from the regression u~i on m~i·xi1, m~i·xi2, i = 1,...,N, where u~i = yi - m~i. For the robust test, we first regress m~i·xi2 on m~i·xi1 and obtain the 1 × K2 residuals, r~i. Then we compute the statistic as in regression (12.75).
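
A small numerical sketch of the nonrobust LM computation in part d may help; the statistic is N times the uncentered R-squared from regressing the restricted residuals on the full gradient. The data and names below are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
N = 2000
x1 = np.column_stack([np.ones(N), rng.normal(size=N)])   # included under H0
x2 = rng.normal(size=(N, 1))                             # excluded under H0
theta1_o = np.array([0.2, 0.5])
y = np.exp(x1 @ theta1_o) + rng.normal(size=N)           # generated under H0

# Restricted NLS for m = exp(x1*Q1) via Gauss-Newton iterations.
t1 = np.zeros(2)
for _ in range(100):
    m = np.exp(x1 @ t1)
    G1 = m[:, None] * x1                                 # gradient wrt Q1
    t1 = t1 + np.linalg.lstsq(G1, y - m, rcond=None)[0]

m_til = np.exp(x1 @ t1)
u_til = y - m_til
Grad = m_til[:, None] * np.column_stack([x1, x2])        # m~i*(xi1, xi2)

# LM = N * uncentered R^2 from the regression of u~ on the full gradient.
b = np.linalg.lstsq(Grad, u_til, rcond=None)[0]
e = u_til - Grad @ b
LM = N * (1 - (e @ e) / (u_til @ u_til))
print(LM)     # compare with chi-square(K2) critical values; here K2 = 1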

12.4. a. Write the objective function as (1/2) Σ_{i=1}^N [yi - m(xi,Q)]^2/h(xi,^G). The objective function for a generic observation, at any value of G, is q(wi,Q;G) = (1/2)[yi - m(xi,Q)]^2/h(xi,G). Taking the gradient with respect to Q gives

    Dq q(wi,Q;G) = -Dqm(xi,Q)[yi - m(xi,Q)]/h(xi,G) = -Dqm(xi,Q)ui(Q)/h(xi,G).

Taking the transpose gives us the score for any Q and any G.

b. This follows because, under WNLS.1, ui ≡ ui(Qo) has a zero mean given xi:

    E[si(Qo;G)|xi] = -Dqm(xi,Qo)'E(ui|xi)/h(xi,G) = 0;

the value of G plays no role.

c. First, the Jacobian of si(Qo;G) with respect to G is

    Dg si(Qo;G) = Dqm(xi,Qo)'ui·Dg h(xi,G)/[h(xi,G)]^2.

Everything but ui is a function only of xi, so

    E[Dg si(Qo;G)|xi] = Dqm(xi,Qo)'E(ui|xi)·Dg h(xi,G)/[h(xi,G)]^2 = 0.

It follows by the LIE that the unconditional expectation is zero, too. In other words, we have shown that the key condition (12.37) holds (whether or not WNLS.3 holds).

d. We would use the analog of equation (12.52):

    Avar^(Q^) = [Σ_{i=1}^N Dqmˇi'Dqmˇi]^{-1}[Σ_{i=1}^N uˇi^2·Dqmˇi'Dqmˇi][Σ_{i=1}^N Dqmˇi'Dqmˇi]^{-1},

where uˇi ≡ ^ui/ĥi^{1/2} and Dqmˇi ≡ Dqm^i/ĥi^{1/2}.

12.5. We need the gradient of m(xi,Q) evaluated under the null hypothesis. By the chain rule,

    Dbm(x,Q) = g[xB + d1(xB)^2 + d2(xB)^3]·[1 + 2d1(xB) + 3d2(xB)^2]x,
    Ddm(x,Q) = g[xB + d1(xB)^2 + d2(xB)^3]·[(xB)^2,(xB)^3].

Let B~ denote the NLS estimator with d1 = d2 = 0 imposed. Then Dbm(xi,Q~) = g(xiB~)xi and Ddm(xi,Q~) = g(xiB~)[(xiB~)^2,(xiB~)^3]. Therefore, the usual LM statistic can be obtained as N·Ru^2 from the regression u~i on g~i·xi, g~i·(xiB~)^2, g~i·(xiB~)^3, where g~i ≡ g(xiB~). If G(·) is the identity function, g(·) ≡ 1, and we get RESET.

12.6. a. The pooled NLS estimator of Qo solves

    min over Q of Σ_{i=1}^N Σ_{t=1}^T [yit - m(xit,Q)]^2/2,

and so, as in the hint in part b, we can take qi(Q) = Σ_{t=1}^T [yit - m(xit,Q)]^2/2. Then the score is si(Q) = -Σ_{t=1}^T Dqm(xit,Q)'uit(Q). Without further assumptions, a consistent estimator of Bo is N^{-1} Σ_{i=1}^N si(Q^)si(Q^)', where Q^ is the pooled NLS estimator. Now, the Hessian for observation i can be written as

    Hi(Q) = Dqsi(Q) = -Σ_{t=1}^T D²qm(xit,Q)uit(Q) + Σ_{t=1}^T Dqm(xit,Q)'Dqm(xit,Q).

When we plug in Qo and use the fact that E(uit|xit) = 0, all t = 1,...,T, then

    Ao ≡ E[Hi(Qo)] = -Σ_{t=1}^T E[D²qm(xit,Qo)uit] + Σ_{t=1}^T E[Dqm(xit,Qo)'Dqm(xit,Qo)]
                   = Σ_{t=1}^T E[Dqm(xit,Qo)'Dqm(xit,Qo)]

because E[D²qm(xit,Qo)uit] = 0, t = 1,...,T. By the usual law of large numbers argument,

    N^{-1} Σ_{i=1}^N Σ_{t=1}^T Dqm(xit,Q^)'Dqm(xit,Q^) ≡ N^{-1} Σ_{i=1}^N Âi

is a consistent estimator of Ao. Then, we just use the usual "sandwich" formula in (12.49).

b. As in the hint, I show that Bo = so^2·Ao. First, write si(Q) = Σ_{t=1}^T sit(Q), where sit(Q) ≡ -Dqm(xit,Q)'uit(Q). Under dynamic completeness of the mean, these scores are serially uncorrelated across t (when evaluated at Qo, of course). This is very similar to the linear regression case from Chapter 7. Let r < t for concreteness. Then

    E[sit(Qo)sir(Qo)'|xit,xir,uir] = E(uit|xit,xir,uir)uirDqm(xit,Qo)'Dqm(xir,Qo) = 0

because E(uit|xit,ui,t-1,xi,t-1,...) = 0 and r < t. Now apply the LIE to conclude E[sit(Qo)sir(Qo)'] = 0. So we have shown that Bo = Σ_{t=1}^T E[sit(Qo)sit(Qo)']. But for each t,

    E[sit(Qo)sit(Qo)'] = E[uit^2·Dqm(xit,Qo)'Dqm(xit,Qo)]
                       = E[E(uit^2|xit)Dqm(xit,Qo)'Dqm(xit,Qo)]   (by the law of iterated expectations)
                       = so^2·E[Dqm(xit,Qo)'Dqm(xit,Qo)]

because E(uit^2|xit) = so^2.

Next, the usual two-step estimation argument shows that (NT)^{-1} Σ_{i=1}^N Σ_{t=1}^T (^uit)^2 →p T^{-1} Σ_{t=1}^T E(uit^2) = so^2 as N → ∞. The degrees of freedom correction does not affect consistency.

The variance matrix obtained by ignoring the time dimension and assuming homoskedasticity is simply

    ŝ^2·[Σ_{i=1}^N Σ_{t=1}^T Dqm(xit,Q^)'Dqm(xit,Q^)]^{-1},

and we just showed that N times this matrix is a consistent estimator of Avar √N(Q^ - Qo).
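For reference, a compact simulation sketch of the pooled NLS sandwich in part a, with a scalar parameter so the algebra stays transparent (simulated data; all names hypothetical):

import numpy as np

rng = np.random.default_rng(2)
N, T = 1000, 4
x = rng.normal(size=(N, T))
theta_o = 0.5
u = rng.normal(size=(N, T)) * (1 + 0.5 * np.abs(x))    # heteroskedastic errors
y = np.exp(theta_o * x) + u

# Pooled NLS for scalar theta via Gauss-Newton.
t = 0.0
for _ in range(100):
    m = np.exp(t * x)
    g = m * x                                          # dm/dtheta
    t = t + (g * (y - m)).sum() / (g * g).sum()

m = np.exp(t * x)
g = m * x
A_hat = (g * g).sum() / N                 # N^{-1} sum_i sum_t (dm)^2
s_i = (g * (y - m)).sum(axis=1)           # score for each cross-section unit
B_hat = (s_i ** 2).sum() / N              # clusters on i, as in part a
se_robust = np.sqrt(B_hat / A_hat**2 / N) # sandwich A^{-1} B A^{-1}, then /N
print(t, se_robust)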

12.7. a. For each i and g, define uig ≡ yig - m(xig,Qo), so that E(uig|xi) = 0, g = 1,...,G. Further, let ui be the G × 1 vector containing the uig. Then E(uiui'|xi) = E(uiui') = Ωo. Let ûi be the vector of nonlinear least squares residuals; that is, do NLS for each g, and collect the residuals. Then, by standard arguments, a consistent estimator of Ωo is

    Ω^ ≡ N^{-1} Σ_{i=1}^N ûiûi'

because each NLS estimator, Q^g, is consistent for Qog as N → ∞.
b. This part involves several steps, and I will sketch how each one

goes. First, let G be the vector of distinct elements of Ω -- the nuisance

parameters in the context of two-step M-estimation. Then, the score for

observation i is

    s(wi,Q;G) = -Dqm(xi,Q)'Ω^{-1}ui(Q),

where, hopefully, the notation is clear. With this definition, we can verify

condition (12.37), even though the actual derivatives are complicated. Each

element of s(wi,Q;G) is a linear combination of ui(Q). So Dgsj(wi,Qo;G) is a


linear combination of ui(Qo) ≡ ui, where the linear combination is a function
of (xi,Qo,G). Since E(ui|xi) = 0, E[Dgsj(wi,Qo;G)|xi] = 0, and so its

unconditional expectation is zero, too. This shows that we do not have to

adjust for the first-stage estimation of Ωo. Alternatively, one can verify

the hint directly, which has the same consequence.

Next, we derive Bo ≡ E[si(Qo;Go)si(Qo;Go)']:

    E[si(Qo;Go)si(Qo;Go)'] = E[Dqmi(Qo)'Ωo^{-1}uiui'Ωo^{-1}Dqmi(Qo)]
                           = E{E[Dqmi(Qo)'Ωo^{-1}uiui'Ωo^{-1}Dqmi(Qo)|xi]}
                           = E[Dqmi(Qo)'Ωo^{-1}E(uiui'|xi)Ωo^{-1}Dqmi(Qo)]
                           = E[Dqmi(Qo)'Ωo^{-1}ΩoΩo^{-1}Dqmi(Qo)] = E[Dqmi(Qo)'Ωo^{-1}Dqmi(Qo)].
184
Next, we have to derive Ao ≡ E[Hi(Qo;Go)] and show that Bo = Ao. The Hessian itself is complicated, but its expected value is not. The Jacobian of si(Q;G) with respect to Q can be written

    Hi(Q;G) = Dqm(xi,Q)'Ω^{-1}Dqm(xi,Q) + [IP ⊗ ui(Q)']F(xi,Q;G),

where F(xi,Q;G) is a GP × P matrix, P being the total number of parameters, that involves the Jacobians of the rows of Ω^{-1}Dqmi(Q) with respect to Q. The key is that F(xi,Q;G) depends on xi, not on yi. So,

    E[Hi(Qo;Go)|xi] = Dqmi(Qo)'Ωo^{-1}Dqmi(Qo) + [IP ⊗ E(ui|xi)']F(xi,Qo;Go)
                    = Dqmi(Qo)'Ωo^{-1}Dqmi(Qo).

Now iterated expectations gives Ao = E[Dqmi(Qo)'Ωo^{-1}Dqmi(Qo)]. So, we have verified (12.37) and that Ao = Bo. Therefore, from Theorem 12.3,

    Avar √N(Q^ - Qo) = Ao^{-1} = {E[Dqmi(Qo)'Ωo^{-1}Dqmi(Qo)]}^{-1}.
c. As usual, we replace expectations with sample averages and unknown parameters with estimates, and divide the result by N to get Avar^(Q^):

    Avar^(Q^) = [N^{-1} Σ_{i=1}^N Dqmi(Q^)'(Ω^)^{-1}Dqmi(Q^)]^{-1}/N
              = [Σ_{i=1}^N Dqmi(Q^)'(Ω^)^{-1}Dqmi(Q^)]^{-1}.

The estimate Ω^ can be based on the multivariate NLS residuals or can be updated after the nonlinear SUR estimates have been obtained.

d. First, note that Dqmi(Qo) is a block-diagonal matrix with blocks Dqgmig(Qog), each 1 × Pg. (I implicitly assumed that there are no cross-equation restrictions imposed in the nonlinear SUR estimation.) If Ωo is diagonal, so is its inverse. Standard matrix multiplication shows that

    Dqmi(Qo)'Ωo^{-1}Dqmi(Qo) = diag[so1^{-2}·Dq1mi1(Qo1)'Dq1mi1(Qo1), ..., soG^{-2}·DqGmiG(QoG)'DqGmiG(QoG)].

Taking expectations and inverting the result shows that Avar √N(Q^g - Qog) = sog^2·[E(Dqgmig(Qog)'Dqgmig(Qog))]^{-1}, g = 1,...,G. (Note also that the nonlinear SUR estimators are asymptotically uncorrelated across equations.) These asymptotic variances are easily seen to be the same as those for nonlinear least squares on each equation; see p. 360.

e. I cannot see a nonlinear analogue of Theorem 7.7. The first hint

given in Problem 7.5 does not extend readily to nonlinear models, even when

the same regressors appear in each equation. The key is that Xi is replaced

with Dqm(xi,Qo). While this G * P matrix has a block-diagonal form, as


described in part d, the blocks are not the same even when the same

regressors appear in each equation. In the linear case, Dqgmg(xi,Qog) = xi


for all g. But, unless Qog is the same in all equations -- a very

restrictive assumption -- Dqgmg(xi,Qog) varies across g. For example, if

mg(xi,Qog) = exp(xiQog) then Dqgmg(xi,Qog) = exp(xiQog)xi, and the gradients


differ across g.

12.8. As stated in the hint, we can use (12.33) and an updated version of (12.66),

    N^{-1/2} Σ_{i=1}^N si(Q~;Ĝ) = N^{-1/2} Σ_{i=1}^N si(Qo;Ĝ) + Ao·√N(Q~ - Qo) + op(1),

to show that √N(Q~ - Q^) = Ao^{-1}·N^{-1/2} Σ_{i=1}^N si(Q~;Ĝ) + op(1); this is just standard algebra. Under (12.37), N^{-1/2} Σ_{i=1}^N si(Q~;Ĝ) = N^{-1/2} Σ_{i=1}^N si(Q~;Go) + op(1), by a mean value expansion similar to that used for the unconstrained two-step M-estimator:

    N^{-1/2} Σ_{i=1}^N si(Q~;Ĝ) = N^{-1/2} Σ_{i=1}^N si(Q~;Go) + E[Dgsi(Qo;Go)]·√N(Ĝ - Go) + op(1),

and use E[Dgsi(Qo;Go)] = 0. Now, the second-order Taylor expansion gives

    Σ_{i=1}^N q(wi,Q~;Ĝ) - Σ_{i=1}^N q(wi,Q^;Ĝ) = [Σ_{i=1}^N si(Q^;Ĝ)]'(Q~ - Q^) + (1/2)(Q~ - Q^)'[Σ_{i=1}^N H¨i](Q~ - Q^)
        = (1/2)(Q~ - Q^)'[Σ_{i=1}^N H¨i](Q~ - Q^),

where H¨i denotes the Hessian evaluated at mean values between Q~ and Q^, and the first-order term vanishes because Q^ solves the first order condition. Therefore,

    2[Σ_{i=1}^N q(wi,Q~;Ĝ) - Σ_{i=1}^N q(wi,Q^;Ĝ)] = [√N(Q~ - Q^)]'Ao[√N(Q~ - Q^)] + op(1)
        = [N^{-1/2} Σ_{i=1}^N s~i]'Ao^{-1}[N^{-1/2} Σ_{i=1}^N s~i] + op(1),

where s~i = si(Q~;Go). Again, this shows the asymptotic equivalence of the QLR and LM statistics. To complete the problem, we should verify that the LM statistic is not affected by Ĝ either, but that follows from N^{-1/2} Σ_{i=1}^N si(Q~;Ĝ) = N^{-1/2} Σ_{i=1}^N si(Q~;Go) + op(1).

12.9. a. We cannot say anything in general about Med(y|x), since Med(y|x) =

m(x,Bo) + Med(u|x), and Med(u|x) could be a general function of x.

b. If u and x are independent, then E(u|x) and Med(u|x) are both

constants, say a and d. Then E(y|x) - Med(y|x) = a - d, which does not


depend on x.

c. When u and x are independent, the partial effects of xj on the

conditional mean and conditional median are the same, and there is no

ambiguity about what is "the effect of xj on y," at least when only the mean

and median are in the running. Then, we could interpret large differences

between LAD and NLS as perhaps indicating an outlier problem. But it could

just be that u and x are not independent.

12.10. The conditional mean function is m(xi,ni,B) = ni·p(xi,B). So we would, as usual, minimize the sum of squared residuals, Σ_{i=1}^N [yi - ni·p(xi,B)]^2, to get the nonlinear least squares estimator, say B^. Define the weights as ĥi ≡ ni·p(xi,B^)[1 - p(xi,B^)]. Then, the weighted NLS estimator minimizes Σ_{i=1}^N [yi - ni·p(xi,B)]^2/ĥi.

12.11. a. For consistency of the MNLS estimator, we need -- in addition to

the regularity conditions, which I will ignore -- the identification

condition. That is, Bo must uniquely minimize

    E[q(wi,B)] = E{[yi - m(xi,B)]'[yi - m(xi,B)]}
               = E({ui + [m(xi,Bo) - m(xi,B)]}'{ui + [m(xi,Bo) - m(xi,B)]})
               = E(ui'ui) + 2E{[m(xi,Bo) - m(xi,B)]'ui} + E{[m(xi,Bo) - m(xi,B)]'[m(xi,Bo) - m(xi,B)]}
               = E(ui'ui) + E{[m(xi,Bo) - m(xi,B)]'[m(xi,Bo) - m(xi,B)]}

because E(ui|xi) = 0 kills the cross-product term. Therefore, the identification assumption is that

    E{[m(xi,Bo) - m(xi,B)]'[m(xi,Bo) - m(xi,B)]} > 0, B ≠ Bo.

In a linear model, where m(xi,B) = XiB for Xi a G × K matrix, the condition is

    (Bo - B)'E(Xi'Xi)(Bo - B) > 0, B ≠ Bo,

and this holds provided E(Xi'Xi) is positive definite.

Provided m(x,·) is twice continuously differentiable, there are no problems in applying Theorem 12.3. Generally, Bo = E[Dbmi(Bo)'uiui'Dbmi(Bo)] and Ao = E[Dbmi(Bo)'Dbmi(Bo)]. These can be consistently estimated in the obvious way after obtaining the MNLS estimators.

b. We can apply the results on two-step M-estimation. The key is that, under general regularity conditions,

    N^{-1} Σ_{i=1}^N [yi - m(xi,B)]'[W(xi,D^)]^{-1}[yi - m(xi,B)]/2

converges uniformly in probability to

    E{[yi - m(xi,B)]'[W(xi,Do)]^{-1}[yi - m(xi,B)]}/2,

which is just to say that the usual consistency proof can be used provided we verify identification. But we can use an argument very similar to the unweighted case to show

    E{[yi - m(xi,B)]'[W(xi,Do)]^{-1}[yi - m(xi,B)]} = E{ui'[Wi(Do)]^{-1}ui}
        + E{[m(xi,Bo) - m(xi,B)]'[Wi(Do)]^{-1}[m(xi,Bo) - m(xi,B)]},

where E(ui|xi) = 0 is used to show the cross-product term, 2E{[m(xi,Bo) - m(xi,B)]'[Wi(Do)]^{-1}ui}, is zero (by iterated expectations, as always). As before, the first term does not depend on B and the second term is minimized at Bo; we would have to assume it is uniquely minimized.

To get the asymptotic variance, we proceed as in Problem 12.7. First, it can be shown that condition (12.37) holds. In particular, we can write Ddsi(Bo;Do) = (IP ⊗ ui)'G(xi,Bo;Do) for some function G(xi,Bo;Do). It follows easily that E[Ddsi(Bo;Do)|xi] = 0, which implies (12.37). This means that, under E(yi|xi) = m(xi,Bo), we can ignore the preliminary estimation of Do provided we have a √N-consistent estimator.

To obtain the asymptotic variance when the conditional variance matrix is correctly specified, that is, when Var(yi|xi) = Var(ui|xi) = W(xi,Do), we can use an argument very similar to the nonlinear SUR case in Problem 12.7:

    E[si(Bo;Do)si(Bo;Do)'] = E{Dbmi(Bo)'[Wi(Do)]^{-1}uiui'[Wi(Do)]^{-1}Dbmi(Bo)}
        = E{E[Dbmi(Bo)'[Wi(Do)]^{-1}uiui'[Wi(Do)]^{-1}Dbmi(Bo)|xi]}
        = E{Dbmi(Bo)'[Wi(Do)]^{-1}E(uiui'|xi)[Wi(Do)]^{-1}Dbmi(Bo)}
        = E{Dbmi(Bo)'[Wi(Do)]^{-1}Dbmi(Bo)}.

Now, the Hessian (with respect to B), evaluated at (Bo,Do), can be written as

    Hi(Bo;Do) = Dbm(xi,Bo)'[Wi(Do)]^{-1}Dbm(xi,Bo) + (IP ⊗ ui)'F(xi,Bo;Do),

for some complicated function F(xi,Bo;Do) that depends only on xi. Taking expectations gives

    Ao ≡ E[Hi(Bo;Do)] = E{Dbm(xi,Bo)'[Wi(Do)]^{-1}Dbm(xi,Bo)} = Bo.

Therefore, from the usual results on M-estimation, Avar √N(B^ - Bo) = Ao^{-1}, and a consistent estimator of Ao is

    Â = N^{-1} Σ_{i=1}^N Dbm(xi,B^)'[Wi(D^)]^{-1}Dbm(xi,B^).
c. The consistency argument in part b did not use the fact that W(x,D) is correctly specified for Var(y|x); exactly the same derivation goes through. But, of course, the asymptotic variance is affected because Ao ≠ Bo: the expression for Bo above no longer holds. The estimator of Ao in part b still works, of course. To consistently estimate Bo we use

    B^ = N^{-1} Σ_{i=1}^N Dbm(xi,B^)'[Wi(D^)]^{-1}ûiûi'[Wi(D^)]^{-1}Dbm(xi,B^).

Now, we estimate Avar √N(B^ - Bo) in the usual way: Â^{-1}B^Â^{-1}.
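
Given the pieces computed at the estimates, the robust variance matrix in part c is a few lines of linear algebra. A schematic sketch (the residual vectors, gradients, and weight matrices are assumed to have been computed beforehand; only the shapes matter here):

import numpy as np

def wmnls_sandwich(resids, grads, weights):
    # resids:  length-N list of G-vectors u^i = yi - m(xi,B^)
    # grads:   length-N list of (G x P) gradients Dbm(xi,B^)
    # weights: length-N list of (G x G) matrices W(xi,D^)
    N = len(resids)
    P = grads[0].shape[1]
    A = np.zeros((P, P))
    B = np.zeros((P, P))
    for u, D, W in zip(resids, grads, weights):
        WinvD = np.linalg.solve(W, D)      # W(xi,D^)^{-1} Dbm(xi,B^)
        A += D.T @ WinvD                   # contribution to A-hat
        s = WinvD.T @ u                    # score (up to sign, irrelevant here)
        B += np.outer(s, s)                # contribution to B-hat
    A /= N
    B /= N
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv / N             # estimate of Avar(B^)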

12.12. (Bonus Question): Let Q^ be an M-estimator with score si(Q) ≡ s(wi,Q) and expected Hessian Ao. Let g(w,Q) be an M × 1 vector function of the random vector w and the parameter vector, and suppose we wish to estimate Do ≡ E[g(wi,Qo)]. The natural estimator is D^ = N^{-1} Σ_{i=1}^N g(wi,Q^).

a. Assuming that g(w,·) is continuously differentiable on int(Θ), Qo ∈ int(Θ), and other standard regularity conditions hold, find Avar √N(D^ - Do).

b. How would you estimate Avar √N(D^ - Do)?

c. Show that if g(w,Q) = g(x,Q), and x is exogenous in the estimation problem used to obtain Q^ -- more precisely, E[s(wi,Qo)|xi] = 0 -- then

    Avar √N(D^ - Do) = Var[g(xi,Qo)] + Go[Avar √N(Q^ - Qo)]Go'.
Answer:

a. We use a mean value expansion, similar to the delta method from Chapter 3 but now allowing for the randomness of wi. By a mean value expansion, we can write

    N^{-1/2} Σ_{i=1}^N g(wi,Q^) = N^{-1/2} Σ_{i=1}^N g(wi,Qo) + [N^{-1} Σ_{i=1}^N G¨i]√N(Q^ - Qo),

where G¨i is the M × P Jacobian of g(wi,Q) evaluated at mean values between Qo and Q^. Now, since √N(Q^ - Qo) ~a Normal(0,Ao^{-1}BoAo^{-1}), it follows that √N(Q^ - Qo) = Op(1). Further, by Lemma 12.1, N^{-1} Σ_{i=1}^N G¨i →p E[Dqg(w,Qo)] ≡ Go, since the mean values converge in probability to Qo. Therefore,

    [N^{-1} Σ_{i=1}^N G¨i]√N(Q^ - Qo) = Go√N(Q^ - Qo) + op(1),

and so

    N^{-1/2} Σ_{i=1}^N g(wi,Q^) = N^{-1/2} Σ_{i=1}^N g(wi,Qo) + Go√N(Q^ - Qo) + op(1).

Since √N(Q^ - Qo) = -N^{-1/2} Σ_{i=1}^N Ao^{-1}si(Qo) + op(1), we can write

    N^{-1/2} Σ_{i=1}^N g(wi,Q^) = N^{-1/2} Σ_{i=1}^N [g(wi,Qo) - GoAo^{-1}si(Qo)] + op(1)

or, subtracting √N·Do from both sides,

    √N(D^ - Do) = N^{-1/2} Σ_{i=1}^N [g(wi,Qo) - Do - GoAo^{-1}si(Qo)] + op(1).

Since the term in the summation has zero mean, it follows from the CLT that

    √N(D^ - Do) ~a Normal(0,Vo),

where Vo ≡ Var(gi - Do - GoAo^{-1}si) and, hopefully, the shorthand is clear. This differs from the usual delta method result by the presence of gi = gi(Qo).

b. We assume we have Â consistent for Ao. By the usual arguments, Ĝ = N^{-1} Σ_{i=1}^N Dqg(wi,Q^) is consistent for Go. Then

    V^ = N^{-1} Σ_{i=1}^N (ĝi - D^ - ĜÂ^{-1}ŝi)(ĝi - D^ - ĜÂ^{-1}ŝi)'

is consistent for Vo, where the "^" denotes evaluation at Q^.

c. Using our shorthand notation, if E(si|xi) = 0 then gi is uncorrelated with si (since gi is a function of xi), which means (gi - Do) is uncorrelated with GoAo^{-1}si. But then Vo = Var(gi - Do - GoAo^{-1}si) = Var(gi) + GoAo^{-1}BoAo^{-1}Go', which is what we wanted to show.
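
This result is, for example, what justifies the standard errors reported for average partial effects. A schematic numpy version of the estimator in part b (the inputs are assumed to come from the estimation step; the expected-Hessian estimate is taken to be symmetric):

import numpy as np

def avar_delta_hat(g, G, A, s):
    # g: (N x M) rows g(wi,Q^);  G: (M x P) average Jacobian of g;
    # A: (P x P) expected-Hessian estimate;  s: (N x P) rows si(Q^).
    d_hat = g.mean(axis=0)
    adj = s @ np.linalg.solve(A, G.T)      # rows (Go Ao^{-1} si)'; A symmetric
    e = g - d_hat - adj                    # rows g^i - D^ - G^ A^{-1} s^i
    return e.T @ e / g.shape[0]            # consistent for Vo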

SOLUTIONS TO CHAPTER 13 PROBLEMS

13.1. No. We know that Qo solves

    max over Q ∈ Θ of E[log f(yi|xi;Q)],

where the expectation is over the joint distribution of (xi,yi). Therefore, because exp(·) is an increasing function, Qo also maximizes exp{E[log f(yi|xi;Q)]} over Θ. The problem is that the expectation and the exponential function cannot be interchanged: E[f(yi|xi;Q)] ≠ exp{E[log f(yi|xi;Q)]}. In fact, Jensen's inequality tells us that E[f(yi|xi;Q)] > exp{E[log f(yi|xi;Q)]}.

13.2. a. Since

    f(y|xi) = (2πso^2)^{-1/2}·exp[-(y - m(xi,Bo))^2/(2so^2)],

it follows that for each i,

    li(B,s^2) = -(1/2)log(2π) - (1/2)log(s^2) - [yi - m(xi,B)]^2/(2s^2).

Only the last of these terms depends on B. Further, for any s^2 > 0, maximizing Σ_{i=1}^N li(B,s^2) with respect to B is the same as minimizing

    Σ_{i=1}^N [yi - m(xi,B)]^2.    (13.66)

Thus, regardless of the MLE for s^2, the MLE B^ of Bo minimizes (13.66).

b. First,

    Dbli(B,s^2) = Dbm(xi,B)[yi - m(xi,B)]/s^2;

note that Dbm(xi,B) is 1 × P. Next,

    dli(B,s^2)/ds^2 = -1/(2s^2) + [yi - m(xi,B)]^2/(2s^4).

For notational simplicity, define the residual function ui(B) ≡ yi - m(xi,B). Then the score, stacking the B-block over the s^2-block, is

    si(Q) = [ Dbmi(B)'ui(B)/s^2 ; -1/(2s^2) + ui(B)^2/(2s^4) ],

where Dbmi(B) ≡ Dbm(xi,B).

Define the errors as ui ≡ ui(Bo), so that E(ui|xi) = 0 and E(ui^2|xi) = Var(yi|xi) = so^2. Then, since Dbmi(Bo) is a function of xi, it is easily seen that E[si(Qo)|xi] = 0. Note that we only use the fact that E(yi|xi) = m(xi,Bo) and Var(yi|xi) = so^2 in showing this. In other words, only the first two conditional moments of yi need to be correctly specified; nothing else about the normal distribution is used.


The equation used to obtain ŝ^2 is

    Σ_{i=1}^N {-1/(2ŝ^2) + [yi - m(xi,B^)]^2/(2ŝ^4)} = 0,

where B^ is the nonlinear least squares estimator. Solving gives

    ŝ^2 = N^{-1} Σ_{i=1}^N (ûi)^2,

where ûi ≡ yi - m(xi,B^). Thus, the MLE of s^2 is the sum of squared residuals divided by N. In practice, N is often replaced with N - P as a degrees-of-freedom adjustment, but this makes no difference as N → ∞.


c. The derivations are a bit tedious but fairly straightforward:

    Hi(Q) = [ -Dbmi(B)'Dbmi(B)/s^2 + D²bmi(B)ui(B)/s^2    -Dbmi(B)'ui(B)/s^4 ]
            [ -Dbmi(B)ui(B)/s^4                            1/(2s^4) - ui(B)^2/s^6 ],

where D²bmi(B) is the P × P Hessian of mi(B) with respect to B.

d. From part c,

    -E[Hi(Qo)|xi] = [ Dbmi(Bo)'Dbmi(Bo)/so^2    0        ]
                    [ 0                          1/(2so^4) ],    (13.67)

where again I have used E(ui|xi) = 0 and E(ui^2|xi) = so^2.

e. To show that (13.67) equals E[si(Qo)si(Qo)'|xi], you need to know that, with ui defined as above, E(ui^3|xi) = 0 and E(ui^4|xi) = 3so^4.

f. From general MLE theory, we know that Avar √N(B^ - Bo) is the P × P upper left hand block of {E[Ai(Qo)]}^{-1}, where Ai(Qo) is the matrix in (13.67). Because this matrix is block diagonal, it is easily seen that

    Avar √N(B^ - Bo) = so^2{E[Dbmi(Bo)'Dbmi(Bo)]}^{-1},

and this is consistently estimated by

    ŝ^2·[N^{-1} Σ_{i=1}^N Dbm^i'Dbm^i]^{-1},    (13.68)

which means that Avar^(B^) is (13.68) divided by N, or

    Avar^(B^) = ŝ^2·[Σ_{i=1}^N Dbm^i'Dbm^i]^{-1}.    (13.69)

If the model is linear, Dbm^i = xi, and we obtain exactly the asymptotic variance estimator for the OLS estimator under homoskedasticity.

13.3. a. The conditional log-likelihood for observation i is

li(Q) = yilog[G(xi,Q)] + (1 - yi)log[1 - G(xi,Q)].

b. The derivation for the probit case in Example 13.1 extends

immediately:

si(Q) = yiDqG(xi,Q)’/G(xi,Q) - (1 - yi)DqG(xi,Q)’/[1 - G(xi,Q)]

= DqG(xi,Q)’[yi - G(xi,Q)]/{G(xi,Q)[1 - G(xi,Q)]}.


If we plug in Qo for Q and take the expectation conditional on xi we get

E[si(Qo)|xi] = 0 because E[yi - G(xi,Qo)|xi] = E(yi|xi) - G(xi,Qo) = 0, and

the functions multiplying yi - G(xi,Qo) depend only on xi.

c. We need to evaluate the score and the expected Hessian with respect to the full set of parameters, but then evaluate these at the restricted estimates. Now,

    DqG(xi,B,0) = f(xiB)[xi,(xiB)^2,(xiB)^3],

a 1 × (K + 2) vector. Let B~ denote the probit estimates of B, obtained under the null. The score for observation i, evaluated at the null estimates, is the (K + 2) × 1 vector

    si(Q~) = DqG(xi,B~,0)'[yi - F(xiB~)]/{F(xiB~)[1 - F(xiB~)]}
           = f(xiB~)zi'[yi - F(xiB~)]/{F(xiB~)[1 - F(xiB~)]},

where zi ≡ [xi,(xiB~)^2,(xiB~)^3]. The negative of the expected Hessian, evaluated under the null, is the (K + 2) × (K + 2) matrix

    A(xi,Q~) = [f(xiB~)]^2·zi'zi/{F(xiB~)[1 - F(xiB~)]}.

These can be plugged into the second expression in equation (13.26) to obtain a nonnegative, well-behaved LM statistic. Simple algebra shows that the statistic can be computed as the explained sum of squares from the regression

    u~i/[F~i(1 - F~i)]^{1/2} on f~i·xi/[F~i(1 - F~i)]^{1/2}, f~i·(xiB~)^2/[F~i(1 - F~i)]^{1/2}, f~i·(xiB~)^3/[F~i(1 - F~i)]^{1/2},

i = 1,...,N, where "~" denotes evaluation at (B~,0) and u~i = yi - F~i. Under H0, LM is distributed asymptotically as χ2^2.

13.4. If the density of y given x is correctly specified, then E[s(w,Qo)|x] =

0. But then E[a(x,Qo)s(w,Qo)|x] = a(x,Qo)E[s(w,Qo)|x] = 0, which, of course,

implies E[g(w,Qo)] = 0 where g(w,Q) = a(x,Q)s(w,Q).

13.5. a. Since si^g(Fo) = [G(Qo)']^{-1}si(Qo),

    E[si^g(Fo)si^g(Fo)'|xi] = E{[G(Qo)']^{-1}si(Qo)si(Qo)'[G(Qo)]^{-1}|xi}
                            = [G(Qo)']^{-1}E[si(Qo)si(Qo)'|xi][G(Qo)]^{-1}
                            = [G(Qo)']^{-1}Ai(Qo)[G(Qo)]^{-1}.

b. In part b, we just replace Qo with Q~ and Fo with F~:

    A~i^g = [G(Q~)']^{-1}A~i[G(Q~)]^{-1} ≡ G~'^{-1}A~iG~^{-1}.

c. The expected Hessian form of the statistic is given in the second part of equation (13.36), but where it is based on s~i^g and A~i^g:

    LM^g = [Σ s~i^g]'[Σ A~i^g]^{-1}[Σ s~i^g]
         = [Σ G~'^{-1}s~i]'[Σ G~'^{-1}A~iG~^{-1}]^{-1}[Σ G~'^{-1}s~i]
         = [Σ s~i]'G~^{-1}G~[Σ A~i]^{-1}G~'G~'^{-1}[Σ s~i]
         = [Σ s~i]'[Σ A~i]^{-1}[Σ s~i] = LM,

where all sums run from i = 1 to N.

13.6. a. No, for two reasons. First, just specifying a distribution of yit

given xit says nothing, in general, about the distribution of yit given xi ≡
(xi1,...,xiT). We could assume these two are the same, which is a strict

exogeneity assumption. But, even under strict exogeneity, we would have to

specify something about joint or conditional distributions involving the

different time periods. We could assume independence (conditional on xi) or

make a dynamic completeness assumption. But, without substantially more

assumptions, we cannot derive the distribution of yi given xi.

b. This is given in a more general case in equation (19.46). Specializing to the mean exp(xitQ) (and using Q in place of B) gives

    li(Q) = Σ_{t=1}^T [yit·xitQ - exp(xitQ)] ≡ Σ_{t=1}^T lit(Q).

Taking the gradient and transposing gives

    si(Q) = Σ_{t=1}^T xit'[yit - exp(xitQ)] ≡ Σ_{t=1}^T sit(Q).

c. First, we need minus the Hessian for each i, which is easily obtained as -Dqsi(Q):

    Hi(Q) = Σ_{t=1}^T exp(xitQ)xit'xit,

which, in this example, does not depend on the yit: Ait(Qo) = Hit(Qo). Therefore,

    Â = N^{-1} Σ_{i=1}^N Σ_{t=1}^T exp(xitQ^)xit'xit,

where Q^ is the partial MLE. Further,

    B^ = N^{-1} Σ_{i=1}^N si(Q^)si(Q^)',

and then Avar^(Q^) is estimated as

    [Σ_{i=1}^N Σ_{t=1}^T exp(xitQ^)xit'xit]^{-1}[Σ_{i=1}^N si(Q^)si(Q^)'][Σ_{i=1}^N Σ_{t=1}^T exp(xitQ^)xit'xit]^{-1}.

d. If E(yit|xit,yi,t-1,xi,t-1,...) = E(yit|xit), then

    E[sit(Qo)|xit,yi,t-1,xi,t-1,...] = xit'[E(yit|xit,yi,t-1,xi,t-1,...) - exp(xitQo)] = 0.

This implies that sit(Qo) and sir(Qo) are uncorrelated, t ≠ r. Therefore,

    Bo = Σ_{t=1}^T E[sit(Qo)sit(Qo)'] = Σ_{t=1}^T E(uit^2·xit'xit),

where uit ≡ yit - E(yit|xit) = yit - exp(xitQo). But, by the Poisson assumption, E(uit^2|xit) = Var(yit|xit) = exp(xitQo). By iterated expectations,

    Bo = Σ_{t=1}^T E[exp(xitQo)xit'xit] = Ao.

(We have really just verified the conditional information matrix equality for each t, in the special case of Poisson regression with an exponential mean function.) Therefore, we can estimate Avar^(Q^) as

    [Σ_{i=1}^N Σ_{t=1}^T exp(xitQ^)xit'xit]^{-1},

which is exactly what we get by using pooled Poisson estimation and ignoring the time dimension.
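
A simulation sketch of the pooled Poisson calculations in parts b-d (hypothetical data; the mean is correctly specified, so the nominal and robust standard errors should be close in large samples):

import numpy as np

rng = np.random.default_rng(3)
N, T = 1000, 5
x = rng.normal(size=(N, T))
theta_o = 0.3
y = rng.poisson(np.exp(theta_o * x))

# Pooled Poisson MLE for scalar theta by Newton's method:
# solve sum_i sum_t x_it [y_it - exp(theta*x_it)] = 0.
t = 0.0
for _ in range(100):
    m = np.exp(t * x)
    t = t + (x * (y - m)).sum() / (m * x * x).sum()

m = np.exp(t * x)
A_hat = (m * x * x).sum() / N
s_i = (x * (y - m)).sum(axis=1)                 # unit-level scores
B_hat = (s_i ** 2).sum() / N
se_nominal = np.sqrt(1 / (N * A_hat))           # uses Bo = Ao from part d
se_robust = np.sqrt(B_hat / A_hat**2 / N)       # sandwich from part c
print(t, se_nominal, se_robust)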

13.7. a. The joint density is simply g(y1|y2,x;Qo)·h(y2|x;Qo). The log-likelihood for observation i is

    li(Q) ≡ log g(yi1|yi2,xi;Q) + log h(yi2|xi;Q),

and we would use this in a standard MLE analysis (conditional on xi).

b. First, we know that, for all (yi2,xi), Qo maximizes E[li1(Q)|yi2,xi]. Since ri2 is a function of (yi2,xi),

    E[ri2li1(Q)|yi2,xi] = ri2E[li1(Q)|yi2,xi];

since ri2 ≥ 0, Qo maximizes E[ri2li1(Q)|yi2,xi] for all (yi2,xi), and therefore Qo maximizes E[ri2li1(Q)]. Similarly, Qo maximizes E[li2(Q)], and so it follows that Qo maximizes E[ri2li1(Q) + li2(Q)]. For identification, we have to assume or verify uniqueness.

c. The score is

    si(Q) = ri2si1(Q) + si2(Q),

where si1(Q) ≡ Dqli1(Q)' and si2(Q) ≡ Dqli2(Q)'. Therefore,

    E[si(Qo)si(Qo)'] = E[ri2si1(Qo)si1(Qo)'] + E[si2(Qo)si2(Qo)']
        + E[ri2si1(Qo)si2(Qo)'] + E[ri2si2(Qo)si1(Qo)'],

where ri2^2 = ri2 (a binary indicator) is used in the first term. Now by the usual conditional MLE theory, E[si1(Qo)|yi2,xi] = 0 and, since ri2 and si2(Q) are functions of (yi2,xi), it follows that E[ri2si1(Qo)si2(Qo)'|yi2,xi] = 0, and so its transpose also has zero conditional expectation. As usual, this implies zero unconditional expectation. We have shown

    E[si(Qo)si(Qo)'] = E[ri2si1(Qo)si1(Qo)'] + E[si2(Qo)si2(Qo)'].

Now, by the unconditional information matrix equality for the density h(y2|x;Q), E[si2(Qo)si2(Qo)'] = -E[Hi2(Qo)], where Hi2(Q) = Dqsi2(Q). Further, by the conditional IM equality for the density g(y1|y2,x;Q),

    E[si1(Qo)si1(Qo)'|yi2,xi] = -E[Hi1(Qo)|yi2,xi],    (13.70)

where Hi1(Q) = Dqsi1(Q). Since ri2 is a function of (yi2,xi), we can put ri2 inside both expectations in (13.70). Then, by iterated expectations,

    E[ri2si1(Qo)si1(Qo)'] = -E[ri2Hi1(Qo)].

Combining all the pieces, we have shown that

    E[si(Qo)si(Qo)'] = -E[ri2Hi1(Qo)] - E[Hi2(Qo)]
        = -E[ri2Dqsi1(Qo) + Dqsi2(Qo)] = -E[Dqsi(Qo)] ≡ -E[Hi(Qo)].

So we have verified that an unconditional IM equality holds, which means we can estimate the asymptotic variance of √N(Q^ - Qo) by {-E[Hi(Qo)]}^{-1}.

d. From part c, one consistent estimator of Avar √N(Q^ - Qo) is

    [-N^{-1} Σ_{i=1}^N (ri2Hi1^ + Hi2^)]^{-1},

where the notation should be obvious. But, as we discussed in Chapters 12 and 13, this estimator need not be positive definite. Instead, we can break the problem into finding consistent estimators of -E[ri2Hi1(Qo)] and -E[Hi2(Qo)], for which we can use iterated expectations. Since, by definition, Ai2(Qo) ≡ -E[Hi2(Qo)|xi], N^{-1} Σ_{i=1}^N Âi2 is consistent for -E[Hi2(Qo)] by the usual iterated expectations argument. Similarly, since Ai1(Qo) ≡ -E[Hi1(Qo)|yi2,xi], and ri2 is a function of (yi2,xi), it follows that E[ri2Ai1(Qo)] = -E[ri2Hi1(Qo)]. This implies that, under general regularity conditions, N^{-1} Σ_{i=1}^N ri2Âi1 consistently estimates -E[ri2Hi1(Qo)]. This completes what we needed to show. Interestingly, even though we do not have a true conditional maximum likelihood problem, we can still use the conditional expectations of the Hessians -- but conditioned on different sets of variables, (yi2,xi) in one case and xi in the other -- to consistently estimate the asymptotic variance of the partial MLE.

e. Bonus Question: Show that if we were able to use the entire random sample, the resulting conditional MLE would be more efficient than the partial MLE based on the selected sample.

Answer: We use a basic fact about positive definite matrices: if A and B are P × P positive definite matrices, then A - B is p.s.d. if and only if B^{-1} - A^{-1} is p.s.d. Now, as we showed in part d, the asymptotic variance of the partial MLE is {E[ri2Ai1(Qo) + Ai2(Qo)]}^{-1}. If we could use the entire random sample for both terms, the asymptotic variance would be {E[Ai1(Qo) + Ai2(Qo)]}^{-1}. But {E[ri2Ai1(Qo) + Ai2(Qo)]}^{-1} - {E[Ai1(Qo) + Ai2(Qo)]}^{-1} is p.s.d. because

    E[Ai1(Qo) + Ai2(Qo)] - E[ri2Ai1(Qo) + Ai2(Qo)] = E[(1 - ri2)Ai1(Qo)]

is p.s.d. (since Ai1(Qo) is p.s.d. and 1 - ri2 ≥ 0).

13.8. a. The score with respect to Q, for observation i, is

    si(Q;G) = f[hi(G)Q]·hi(G)'·{yi - F[hi(G)Q]}/(F[hi(G)Q]{1 - F[hi(G)Q]}),

where hi(G) ≡ (xi, wi - ziG) is 1 × (K + 1). The negative of the expected Hessian, where the expectation is conditional on (xi,zi,vi) or, equivalently, (xi,zi,wi), evaluated at the true parameters, has the standard probit form:

    Ai(Qo;Go) = {f[hi(Go)Qo]}^2·hi(Go)'hi(Go)/(F[hi(Go)Qo]{1 - F[hi(Go)Qo]}).

The consistent estimator of Ao that we would use is the positive definite matrix

    N^{-1} Σ_{i=1}^N [f(ĥiQ^)]^2·ĥi'ĥi/{F(ĥiQ^)[1 - F(ĥiQ^)]},

where ĥi ≡ (xi,v^i) contains the "generated regressor." This is just the usual probit estimator that we would use without generated regressors. The tricky part is in adjusting the score. For this, we need to estimate Fo ≡ E[Dgsi(Qo;Go)], which we can do by first estimating E[Dgsi(Qo;Go)|xi,zi,wi] and then averaging. But, by the same argument used to simplify the expected Hessian, all but one term has zero conditional expectation. In particular,

    E[Dgsi(Qo;Go)|xi,zi,wi] = -[f(hiQo)]^2·hi'Qo'Dghi(Go)'/{F(hiQo)[1 - F(hiQo)]},

where hi ≡ (xi,vi) and

    Dghi(Go)' = [0 ; -zi] ≡ -Ri,

a (K + 1) × M matrix, where zi is 1 × M and the zero block is K × M. Therefore, a consistent estimator of Fo is

    F^ = N^{-1} Σ_{i=1}^N [f(ĥiQ^)]^2·ĥi'Q^'Ri/{F(ĥiQ^)[1 - F(ĥiQ^)]}.

Now, for the r^i implicit in (12.61), we can take r^i = (Z'Z/N)^{-1}zi'v^i. Then we form ĝi as in (12.61).

b. In general, Qo'Ri = (Do',ro)[0 ; zi] = rozi. If ro = 0 then Qo'Ri = 0, which means Fo = 0. Then, we need not adjust the score for the first-step estimation of Go.

c. From part b, the usual probit t statistic on the generated regressor v^i is asymptotically standard normal under H0: ro = 0, so we can just ignore the generated regressor aspect of v^i when carrying out the test.

13.9. a. Under the Markov assumption, the joint density of (yi0,...,yiT) is

given by

fT(yT|yT-1)·fT-1(yT-1|yT-2)···f1(y1|y0)·f0(y0),

so we would need to model f0(y0) to obtain a model of the joint density.

b. The log-likelihood

    li(Q) = Σ_{t=1}^T log[ft(yit|yi,t-1;Q)]
is the conditional log-likelihood for the density of (yi1,...,yiT) given yi0,

and so the usual theory of conditional maximum likelihood applies. In

practice, this is MLE pooled across i and t.

c. Because we have the density of (yi1,...,yiT) given yi0, we can use any of the three asymptotic variance estimators implied by the information matrix equality. However, we can also use the simplifications due to dynamic completeness of each conditional density. Let sit(Q) = Dqlog[ft(yit|yi,t-1;Q)], Hit(Q) = Dqsit(Q), and Ait(Qo) = -E[Hit(Qo)|yi,t-1], t = 1,...,T. Then Avar √N(Q^ - Qo) is consistently estimated using the inverse of any of the three matrices in equation (13.50). If we have a canned package that computes a particular MLE, we can just use any of the usual asymptotic variance estimates obtained from the pooled MLE.

13.10. a. Because of conditional independence, by the usual product rule,

    f(y1,y2,...,yG|x,c) = ∏_{g=1}^G fg(yg|x,c).

b. Let g(y1,...,yG|x) be the joint density of yi given xi = x. Then

    g(y1,...,yG|x) = ∫ f(y1,y2,...,yG|x,c)h(c|x)dc,

where the integral is over the range of c.

c. The density g(y1,...,yG|x) is now

    g(y1,...,yG|x;Go,Do) = ∫ f(y1,y2,...,yG|x,c;Go)h(c|x;Do)dc
                         = ∫ ∏_{g=1}^G fg(yg|x,c;Go)h(c|x;Do)dc,

and so the log likelihood for observation i is

    log[g(yi1,...,yiG|xi;Go,Do)] = log[∫ ∏_{g=1}^G fg(yig|xi,c;Go)h(c|xi;Do)dc].
d. This setup has some features in common with a linear SUR model,

although here the correlation across equations is assumed to come through a

single common component, c. Because of computational issues with general

nonlinear models -- especially if G is large and some of the models are for

qualitative response -- one probably needs to restrict the cross correlation

somehow.

13.11. a. For each t ≥ 1, the density of yit given yi,t-1 = yt-1, yi,t-2 = yt-2, ..., yi0 = y0, and ci = c is

    ft(yt|yt-1,c) = (2πse^2)^{-1/2}·exp[-(yt - ryt-1 - c)^2/(2se^2)].

Therefore, the density of (yi1,...,yiT) given yi0 = y0 and ci = c is obtained as the product of these densities:

    ∏_{t=1}^T (2πse^2)^{-1/2}·exp[-(yt - ryt-1 - c)^2/(2se^2)].

If we plug in the data for observation i and take the log we get

    Σ_{t=1}^T {-(1/2)log(se^2) - (yit - ryi,t-1 - ci)^2/(2se^2)},

where we have dropped the term that does not depend on the parameters. It is not a good idea to "estimate" the ci along with r and se^2, as the incidental parameters problem causes inconsistency -- severe in some cases -- in the estimator of r.
b. If we write ci = a0 + a1yi0 + ai, under the maintained assumption, then the density of (yi1,...,yiT) given (yi0 = y0, ai = a) is

    ∏_{t=1}^T (2πse^2)^{-1/2}·exp[-(yt - ryt-1 - a0 - a1y0 - a)^2/(2se^2)].

Now, to get the density conditional on yi0 = y0 only, we integrate this density against the density of ai given yi0 = y0. But ai and yi0 are independent, and ai ~ Normal(0,sa^2). So the density of (yi1,...,yiT) given yi0 = y0 is

    ∫_{-∞}^{∞} {∏_{t=1}^T (2πse^2)^{-1/2}·exp[-(yt - ryt-1 - a0 - a1y0 - a)^2/(2se^2)]}·sa^{-1}f(a/sa)da.

If we now plug in the data (yi0,yi1,...,yiT) for each i and take the log we get a conditional log-likelihood (conditional on yi0) for each i. We can estimate the parameters by maximizing the sum of the log-likelihoods across i.
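
The integral over a can be computed accurately by Gauss-Hermite quadrature. A Python sketch of the log likelihood for one unit under the assumptions of part b (normal ai, the device ci = a0 + a1yi0 + ai; names hypothetical):

import numpy as np
from numpy.polynomial.hermite import hermgauss

def unit_loglik(y, r, a0, a1, sig_e, sig_a, K=15):
    # y is the (T+1,) array (y_0, y_1, ..., y_T) for one unit.
    # Substituting a = sqrt(2)*sig_a*z turns the integral against
    # sa^{-1} f(a/sa) da into an integral against exp(-z^2)/sqrt(pi) dz.
    z, w = hermgauss(K)
    total = 0.0
    for zk, wk in zip(z, w):
        a = np.sqrt(2.0) * sig_a * zk
        e = y[1:] - r * y[:-1] - a0 - a1 * y[0] - a
        dens = np.exp(-e**2 / (2 * sig_e**2)) / np.sqrt(2 * np.pi * sig_e**2)
        total += wk * dens.prod()
    return np.log(total / np.sqrt(np.pi))

# The sample log likelihood is the sum of unit_loglik over i; maximize it
# over (r, a0, a1, sig_e, sig_a) with any numerical optimizer.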

c. As before, we can replace ci with a0 + a1yi0 + ai. Then, the density of yit given (yi,t-1,...,yi1,yi0,ai) is

    Normal[ryi,t-1 + a0 + a1yi0 + ai + d(a0 + a1yi0 + ai)yi,t-1, se^2],

t = 1,...,T. Using the same argument as in part b, we just integrate out ai to get the density of (yi1,...,yiT) given yi0 = y0:

    ∫_{-∞}^{∞} {∏_{t=1}^T (2πse^2)^{-1/2}·exp[-(yt - ryt-1 - a0 - a1y0 - a - d(a0 + a1y0 + a)yt-1)^2/(2se^2)]}·sa^{-1}f(a/sa)da.

Numerically, this could be a difficult MLE problem to solve. Assuming we can get the MLEs, we would estimate r + E(ci) as r^ + a^0 + a^1·ȳ0, where ȳ0 is the cross-sectional average of the initial observation.

d. The log likelihood for observation i, now conditional on (yi0,zi), is the log of

    ∫_{-∞}^{∞} {∏_{t=1}^T (2πse^2)^{-1/2}·exp[-(yit - ryi,t-1 - zitB - a0 - a1yi0 - z̄iD - a)^2/(2se^2)]}·sa^{-1}f(a/sa)da.

The assumption that we can put in the time average, z̄i, to account for correlation between ci and (yi0,zi), may be too strong. It would be better to put in the full vector zi, although this leads to many more parameters to estimate.

13.12 (Bonus Question): Let {ft(yt|xt;Q): t = 1,...,T} be a sequence of correctly specified densities for yit given xit. That is, assume that there is Qo ∈ int(Θ) such that ft(yt|xt;Qo) is the density of yit given xit = xt. Also assume that {xit: t = 1,2,...,T} is strictly exogenous for each t:

    D(yit|xi1,...,xiT) = D(yit|xit).

a. Is it true that, under the standard regularity conditions for partial MLE, E[sit(Qo)|xi1,...,xiT] = 0, where sit(Q) = Dqlog ft(yit|xit;Q)'?

b. Under the assumptions given, is {sit(Qo): t = 1,...,T} necessarily serially uncorrelated?

c. Let ci be "unobserved heterogeneity" for cross section unit i, and assume that, for each t, D(yit|zi1,...,ziT,ci) = D(yit|zit,ci). In other words, {zit: t = 1,...,T} is strictly exogenous conditional on ci. Further, assume that D(ci|zi1,...,ziT) = D(ci|z̄i), where z̄i = T^{-1}(zi1 + ... + ziT) is the time average. Assuming that well-behaved, correctly specified conditional densities are available, how do we choose xit to make part a applicable?

Answer:

a. This is true because, by the general theory for partial MLE, we know that E[sit(Qo)|xit] = 0, t = 1,...,T. But if D(yit|xi1,...,xiT) = D(yit|xit) then, for any function mt(yit,xit), E[mt(yit,xit)|xi1,...,xiT] = E[mt(yit,xit)|xit], including the score function.

b. No. Strict exogeneity and complete dynamic specification of the conditional density are entirely different. Saying that D(yit|xi1,...,xiT) does not depend on xis, s ≠ t, says nothing about whether yir, r < t, appears in D(yit|xit,yi,t-1,xi,t-1,...).

c. We take xit = (zit,z̄i), t = 1,...,T. If gt(yt|zt,c;G) is correctly specified for the density of yit given (zit = zt, ci = c), and h(c|z̄;D) is correctly specified for the density of ci given z̄i = z̄, then the density of yit given (zit,z̄i) is obtained as

    ft(yt|zit,z̄i;Qo) = ∫_C gt(yt|zit,c;Go)h(c|z̄i;Do)ν(dc).

Under the assumptions given, D(yit|zi1,...,ziT) = D(yit|zit,z̄i), t = 1,...,T. However, we have not eliminated the serial dependence in {yit} after conditioning only on (zit,z̄i): the part of ci not explained by z̄i affects yit in each time period.

SOLUTIONS TO CHAPTER 14 PROBLEMS

14.1. a. The simplest way to estimate (14.35) is by 2SLS, using instruments (x1,x2). Nonlinear functions of these can be added to the instrument list -- these would generally improve efficiency if g2 ≠ 1. If E(u2^2|x) = s2^2, 2SLS using the given list of instruments is the efficient, single-equation GMM estimator. Otherwise, the optimal weighting matrix that allows heteroskedasticity of unknown form should be used. Finally, one could try to use the optimal instruments derived in Section 14.5.3. Even under homoskedasticity, these are difficult, if not impossible, to find analytically if g2 ≠ 1.

b. No. If g1 = 0, the parameter g2 does not appear in the model. Of course, if we knew g1 = 0, we would consistently estimate D1 by OLS.

c. We can see this by obtaining E(y1|x):

    E(y1|x) = x1D1 + g1E(y2^{g2}|x) + E(u1|x) = x1D1 + g1E(y2^{g2}|x).

Now, when g2 ≠ 1, E(y2^{g2}|x) ≠ [E(y2|x)]^{g2}, so we cannot write

    E(y1|x) = x1D1 + g1(xD2)^{g2};

in fact, we cannot find E(y1|x) without more assumptions. While the regression of y2 on x consistently estimates D2, the two-step NLS estimator from the regression yi1 on xi1, (xiD^2)^{g2} will not be consistent for D1 and g2. (This is an example of a "forbidden regression.") When g2 = 1, the plug-in method works: it is just the usual 2SLS estimator.

14.2. a. When r1 = 1, we obtain the level-level model, hours = -g1 + z1D1 + g1·wage + u1. Using the hint, let r1 → 0 to get hours = z1D1 + g1·log(wage) + u1.

b. We cannot use a standard t test after estimating the full model (say, by nonlinear 2SLS), because r1 cannot be estimated under H0. The score test and QLR test also fail because of lack of identification under H0. What we can do is fix a value for r1, and then use a t test on (wage^{r1} - 1)/r1 after 2SLS (or GMM more generally). This need not be a very good test for detecting g1 ≠ 0 if our guess for r1 is not close to the actual value. There is a growing literature on testing hypotheses when parameters are not identified under the null.

c. If Var(u1|z) = s1^2, use nonlinear 2SLS, where we would use z and functions of z as IVs. If we are not willing to assume homoskedasticity, GMM is generally more efficient.

d. The residual function is r(Q) = hours - z1D1 - g1(wage^{r1} - 1)/r1, where Q = (D1',g1,r1)'. The gradient is

    Dqr(Q) = {-z1, -(wage^{r1} - 1)/r1, g1[(wage^{r1} - 1) - r1·wage^{r1}·log(wage)]/r1^2},

where I used the hint. The score is just the transpose.

e. Estimate D1 and g1 by 2SLS, or use the GMM estimator that accounts for heteroskedasticity, under the restriction r1 = 1. Suppose the instruments are zi, a 1 × L vector. This is just linear estimation because the model is linear under H0. Then, using Zi = zi and

    r~i(Q~) = hoursi - zi1D~1 - g~1(wagei - 1),
    Dqr~i(Q~) = {-zi1, -(wagei - 1), g~1[(wagei - 1) - wagei·log(wagei)]},

use the score statistic in equation (14.32).

14.3. Let Zi* be the matrix of optimal instruments in (14.63), where we suppress its dependence on xi. Let Zi be a G × L matrix that is a function of xi, and let Λo be the probability limit of the weighting matrix. Then the asymptotic variance of the GMM estimator has the form (14.10) with Go ≡ E[Zi'Ro(xi)]. So, in (14.54), take A ≡ Go'ΛoGo and s(wi) ≡ Go'ΛoZi'r(wi,Qo). The optimal score function is s*(wi) ≡ Ro(xi)'Ωo(xi)^{-1}r(wi,Qo). Now we can verify (14.57) with r = 1:

    E[s(wi)s*(wi)'] = Go'ΛoE[Zi'r(wi,Qo)r(wi,Qo)'Ωo(xi)^{-1}Ro(xi)]
                    = Go'ΛoE[Zi'E{r(wi,Qo)r(wi,Qo)'|xi}Ωo(xi)^{-1}Ro(xi)]
                    = Go'ΛoE[Zi'Ωo(xi)Ωo(xi)^{-1}Ro(xi)] = Go'ΛoGo = A.

14.4. a. The residual function for the conditional mean model E(yi|xi) = m(xi,Bo) is ri(B) ≡ yi - m(xi,B). Then Ωo(xi) in (14.61) is just a scalar, Ωo(xi) = Var(yi|xi) ≡ wo(xi). Under WNLS.3, wo(xi) = so^2·h(xi,Go) for a known function h(·). Further, Ro(xi) ≡ E[Dbri(Bo)|xi] = -Dbm(xi,Bo), and so the optimal instruments are Dbm(xi,Bo)/wo(xi). The asymptotic variance of the efficient IV estimator is obtained from (14.66):

    {E[Dbm(xi,Bo)'[wo(xi)]^{-1}Dbm(xi,Bo)]}^{-1} = so^2{E[Dbm(xi,Bo)'Dbm(xi,Bo)/h(xi,Go)]}^{-1},

which is the asymptotic variance of the WNLS estimator under WNLS.1, WNLS.2, and WNLS.3.

b. If Var(yi|xi) = so^2 then NLS achieves the efficiency bound, as is seen by setting h(x,Go) ≡ 1 in part a.

c. Now let ri1(B) ≡ ui(B) ≡ yi - m(xi,B) and ri2(B,s^2) ≡ [yi - m(xi,B)]^2 - s^2. Let ri(Q) denote the 2 × 1 vector obtained by stacking the two residual functions. Then the moment conditions can be written as

    E[ri(Qo)|xi] = 0,

where Qo = (Bo',so^2)'. To obtain the efficient IVs, we first need E[Dqri(Qo)|xi]. But

    Dqri(Q) = [ -Dbmi(B)         0 ]
              [ -2Dbmi(B)ui(B)  -1 ].

Evaluating at Qo and using E[ui(Bo)|xi] = 0 gives

    Ro(xi) = E[Dqri(Qo)|xi] = [ -Dbmi(Bo)   0 ]
                              [     0      -1 ].

We also need

    Ωo(xi) = E[ri(Qo)ri(Qo)'|xi] = [    so^2         E(ui^3|xi)      ]
                                   [ E(ui^3|xi)   E(ui^4|xi) - so^4 ],

where ui ≡ yi - m(xi,Bo). The optimal IVs are [Ωo(xi)]^{-1}Ro(xi). If E(ui^3|xi) = 0, as occurs under conditional symmetry of ui, then the asymptotic variance matrix of the optimal IV estimator is block diagonal, and for B^ it is the same as for NLS. In other words, adding the moment condition corresponding to the homoskedasticity assumption does not improve efficiency over NLS under symmetry, even if E(ui^4|xi) is not constant. If, in addition, E(ui^4|xi) is constant, then the usual estimator of so^2 based on the sum of squared NLS residuals is efficient.

14.5. We can write the unrestricted linear projection as

    yit = pt0 + xiPt + vit, t = 1,2,3,

where each Pt is 3K × 1; together with the intercept pt0, each period contributes 1 + 3K parameters, and P is the (3 + 9K) × 1 vector obtained by stacking (pt0,Pt')', t = 1,2,3. Let Q = (j,L1',L2',L3',B')'. With the restrictions imposed on the Pt we have

    pt0 = j, t = 1,2,3,
    P1 = [(L1 + B)',L2',L3']', P2 = [L1',(L2 + B)',L3']', P3 = [L1',L2',(L3 + B)']'.

Therefore, we can write P = HQ for the (3 + 9K) × (1 + 4K) matrix H defined by

    H = [ 1   0   0   0   0
          0  IK   0   0  IK
          0   0  IK   0   0
          0   0   0  IK   0
          1   0   0   0   0
          0  IK   0   0   0
          0   0  IK   0  IK
          0   0   0  IK   0
          1   0   0   0   0
          0  IK   0   0   0
          0   0  IK   0   0
          0   0   0  IK  IK ],

where each 0 denotes a conformable block of zeros.

14.6. By the hint, it suffices to show that

    [Avar √N(Q^ - Qo)]^{-1} - [Avar √N(Q~ - Qo)]^{-1}

is p.s.d. This difference is Ho'Λo^{-1}Ho - Ho'Ξo^{-1}Ho = Ho'(Λo^{-1} - Ξo^{-1})Ho. This is positive semi-definite if Λo^{-1} - Ξo^{-1} is p.s.d., which again holds by the hint because Ξo - Λo is assumed to be p.s.d.

14.7. With h(Q) = HQ, the minimization problem becomes

    min over Q of (P^ - HQ)'(Λ^)^{-1}(P^ - HQ),

where Q ranges freely (no restrictions are placed on Q). The first order condition is easily seen to be

    -2H'(Λ^)^{-1}(P^ - HQ^) = 0, or [H'(Λ^)^{-1}H]Q^ = H'(Λ^)^{-1}P^.

Therefore, assuming H'(Λ^)^{-1}H is nonsingular -- which occurs w.p.a.1 when H'Λo^{-1}H is nonsingular -- we have

    Q^ = [H'(Λ^)^{-1}H]^{-1}H'(Λ^)^{-1}P^.
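
In code, the minimum distance estimator is a single solve. A schematic sketch (the unrestricted estimates and weighting matrix are assumed given; H is as constructed in Problem 14.5):

import numpy as np

def cmd_estimator(pi_hat, W, H):
    # pi_hat: (S,) unrestricted estimates;  W: (S x S) weighting matrix,
    # playing the role of the inverted matrix above;  H: (S x P1) with P = H*Q.
    # Returns Q^ = (H'WH)^{-1} H'W pi_hat.
    HtW = H.T @ W
    return np.linalg.solve(HtW @ H, HtW @ pi_hat)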

14.8. From the efficiency result of maximum likelihood -- see the discussion

on page 439 -- it is no less asymptotically efficient to use the density of

(yi0,yi1,...,yiT) than to use the conditional distribution of (yi1,...,yiT)

given yi0. The cost of the asymptotic efficiency is that if we misspecify

f0(y0;Q), then the unconditional MLE will generally be inconsistent for Qo.
The MLE that conditions on yi0 is consistent provided we have the densities

ft(yt|yt-1;Q) correctly specified, t = 1,...,T. As ft(yt|yt-1;Q) is the density of

interest, we are usually willing to put more effort into testing our

specification of it.

14.9. We have to verify equations (14.55) and (14.56) for the random effects and fixed effects estimators. The choices of si1, si2 (with added i subscripts for clarity), A1, and A2 are given in the hint. Now, from Chapter 10, we know that E(riri'|xi) = su^2·IT under RE.1, RE.2, and RE.3, where ri ≡ vi - λjTv̄i is the vector of quasi-demeaned errors. Therefore,

    E(si1si1') = E(Xˇi'riri'Xˇi) = su^2·E(Xˇi'Xˇi) ≡ su^2·A1

by the usual iterated expectations argument. This means that, in (14.55), r ≡ su^2. Now, we just need to verify (14.56) for this choice of r. But si2si1' = X¨i'uiri'Xˇi. Now, as described in the hint,

    X¨i'ri = X¨i'(vi - λjTv̄i) = X¨i'vi = X¨i'(cijT + ui) = X¨i'ui.

So si2si1' = X¨i'riri'Xˇi and therefore E(si2si1'|xi) = X¨i'E(riri'|xi)Xˇi = su^2·X¨i'Xˇi. It follows that E(si2si1') = su^2·E(X¨i'Xˇi). To finish off the proof, note that

    X¨i'Xˇi = X¨i'(Xi - λjTx̄i) = X¨i'Xi = X¨i'X¨i.

This verifies (14.56) with r = su^2.

SOLUTIONS TO CHAPTER 15 PROBLEMS

15.1. a. Since the regressors are all orthogonal by construction -- dki·dmi = 0 for k ≠ m, and all i -- the coefficient on dm is obtained from the regression yi on dmi, i = 1,...,N. But this is easily seen to be the fraction

of yi in the sample falling into category m. Therefore, the fitted values are

just the cell frequencies, and these are necessarily in [0,1].

b. The fitted values for each category will be the same. If we drop d1

but add an overall intercept, the overall intercept is the cell frequency for

the first category, and the coefficient on dm becomes the difference in cell

frequency between category m and category one, m = 2, ..., M.

15.2. a. First, since utility is increasing in both c and q, the budget constraint is binding at the optimum: ci + piqi = mi. Plugging c = mi - piq into the utility function reduces the problem to

    max over q ≥ 0 of (mi - piq) + ai·log(1 + q).

Define utility as a function of q, as

    si(q) ≡ (mi - piq) + ai·log(1 + q).

Then, for all q ≥ 0,

    dsi/dq (q) = -pi + ai/(1 + q).

The optimal solution is qi = 0 if the marginal utility of charitable giving at q = 0 is nonpositive, that is, if

    dsi/dq (0) = -pi + ai ≤ 0, or ai ≤ pi.

(This can also be obtained by solving the Kuhn-Tucker conditions.) Thus, for this utility function, ai can be interpreted as the reservation price above which no charitable contributions will be made; in other words, we have the corner solution qi = 0 whenever the price of charitable giving is too high relative to the marginal utility of charitable giving. On the other hand, if ai > pi, then an interior solution exists (qi > 0) and necessarily solves the first order condition

    dsi/dq (qi) = -pi + ai/(1 + qi) ≡ 0, or 1 + qi = ai/pi.

b. By definition of yi, yi = 1 if and only if ai/pi > 1, or log(ai/pi) > 0. If ai = exp(ziG + vi), this is equivalent to ziG + vi - log(pi) > 0. Therefore,

    P(yi = 1|zi,mi,pi) = P(yi = 1|zi,pi) = P[ziG + vi - log(pi) > 0|zi,pi]
                       = P[vi/s > (-ziG + log(pi))/s]
                       = 1 - F[(-ziG + log(pi))/s] = F[(ziG - log(pi))/s],

where F is the cdf of vi/s and the last equality follows by symmetry of the distribution of vi/s.

15.3. a. If P(y = 1|z1,z2) = F(z1D1 + g1z2 + g2z2^2) then

    dP(y = 1|z1,z2)/dz2 = (g1 + 2g2z2)·f(z1D1 + g1z2 + g2z2^2);

for given z, this is estimated as

    (^g1 + 2^g2z2)·f(z1^D1 + ^g1z2 + ^g2z2^2),

where, of course, the estimates are the probit estimates.

b. In the model

    P(y = 1|z1,z2,d1) = F(z1D1 + g1z2 + g2d1 + g3z2d1),

the partial effect of z2 is

    dP(y = 1|z1,z2,d1)/dz2 = (g1 + g3d1)·f(z1D1 + g1z2 + g2d1 + g3z2d1).

The effect of d1 is measured as the difference in the probabilities at d1 = 1 and d1 = 0:

    P(y = 1|z,d1 = 1) - P(y = 1|z,d1 = 0) = F[z1D1 + (g1 + g3)z2 + g2] - F(z1D1 + g1z2).

Again, to estimate these effects at given z and -- in the first case -- d1, we just replace the parameters with their probit estimates, and use average or other interesting values of z.

c. We would apply the delta method from Chapter 3. Thus, we would require the full variance matrix of the probit estimates as well as the gradient of the expression of interest, such as (g1 + 2g2z2)·f(z1D1 + g1z2 + g2z2^2), with respect to all probit parameters. (Not with respect to the zj.)
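
A sketch of the delta-method computation in part c, using a statsmodels probit on simulated data (the model and all names are illustrative):

import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(4)
N = 2000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
X = np.column_stack([np.ones(N), z1, z2, z2**2])   # [1, z1, z2, z2^2]
b_o = np.array([0.1, 0.5, -0.4, 0.2])
y = (X @ b_o + rng.normal(size=N) > 0).astype(int)

res = sm.Probit(y, X).fit(disp=0)
b, V = res.params, res.cov_params()

# Partial effect of z2 at chosen values, here z1 = 0, z2 = 1.
z1v, z2v = 0.0, 1.0
xrow = np.array([1.0, z1v, z2v, z2v**2])
xb = xrow @ b
pe = (b[2] + 2*b[3]*z2v) * norm.pdf(xb)

# Gradient with respect to the probit parameters (uses f'(t) = -t*f(t)).
grad = -(b[2] + 2*b[3]*z2v) * xb * norm.pdf(xb) * xrow
grad[2] += norm.pdf(xb)                  # direct effect through g1
grad[3] += 2*z2v * norm.pdf(xb)          # direct effect through g2
se = np.sqrt(grad @ V @ grad)
print(pe, se)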

15.4. This is the kind of statement that arises out of failure to distinguish

between the underlying latent variable model and the model for P(y = 1|x).

The linear probability model assumes P(y = 1|x) = xB while, for example, the

probit model assumes that P(y = 1|x) = F(xB). Thus, both models make very

particular functional form assumptions on the response probabilities. The

fact that the probit model can be derived from a latent variable model with a

normal, homoskedastic error does not make it less plausible than the LPM. In

fact, we know that the probit functional form has some attractive properties

that the linear model does not have: F(xB) is always between zero and one,

and the marginal effect of any xj is diminishing after some point.

Incidentally, the LPM can be obtained from a latent variable model by assuming

that e has a uniform distribution over [-1,1] (actually, any symmetric,

uniform interval will do).

15.5. a. If P(y = 1|z,q) = F(z1D1 + g1z2q) then

    dP(y = 1|z,q)/dz2 = g1q·f(z1D1 + g1z2q),

assuming that z2 is not functionally related to z1.

b. Write y* = z1D1 + r, where r = g1z2q + e, and e is independent of (z,q) with a standard normal distribution. Because q is assumed independent of z, r|z ~ Normal(0, g1^2z2^2 + 1); the zero mean follows because E(r|z) = g1z2E(q|z) + E(e|z) = 0. Also,

    Var(r|z) = g1^2z2^2Var(q|z) + Var(e|z) + 2g1z2Cov(q,e|z) = g1^2z2^2 + 1

because Cov(q,e|z) = 0 by independence between e and (z,q). Thus, r/√(g1^2z2^2 + 1) has a standard normal distribution independent of z. It follows that

    P(y = 1|z) = F[z1D1/√(g1^2z2^2 + 1)].    (15.90)

c. Because P(y = 1|z) depends only on g1^2, this is what we can estimate along with D1. (For example, g1 = -2 and g1 = 2 give exactly the same model for P(y = 1|z).) This is why we define r1 ≡ g1^2. Testing H0: r1 = 0 is most easily done using the score or LM test because, under H0, we have a standard probit model. Let D~1 denote the probit estimates under the null that r1 = 0. Define F~i = F(zi1D~1), f~i = f(zi1D~1), and u~i = yi - F~i. The gradient of the mean function in (15.90) with respect to D1, evaluated at the null estimates, is simply f~izi1. The only other quantity needed is the gradient with respect to r1, evaluated at the null estimates. The partial derivative of (15.90) with respect to r1 is, for each i,

    -(zi1D1)(zi2^2/2)(r1zi2^2 + 1)^{-3/2}·f[zi1D1/√(r1zi2^2 + 1)].

When we evaluate this at r1 = 0 and D~1 we get -(zi1D~1)(zi2^2/2)f~i. Then, the score statistic can be obtained as N·Ru^2 from the regression

    u~i/√[F~i(1 - F~i)] on f~izi1/√[F~i(1 - F~i)], (zi1D~1)zi2^2·f~i/√[F~i(1 - F~i)], i = 1,...,N;

under H0, N·Ru^2 ~a χ1^2.

d. The model can be estimated by MLE using the formulation with r1 in place of g1^2. But this is not a standard probit estimation.

15.6. a. What we would like to know is that, if we exogenously change the

number of cigarettes that someone smokes per day, what effect would this have

on the probability of missing work over a three-month period? In other words,

we want to infer causality, not just find a correlation between missing work and

cigarette smoking.

b. Since people choose whether and how much to smoke, we certainly cannot

treat the data as coming from the experiment we have in mind in part a. (That

is, we cannot randomly assign people a daily cigarette consumption.) It is

possible that smokers are less healthy to begin with, or have other attributes

that cause them to miss work more often. Or, it could go the other way:

cigarette consumption may be related to personality traits that make people

harder workers. In any case, cigs might be correlated with the unobservables

in the equation.

c. If we start with the model

P(y = 1|z,cigs,q1) = F(z1D1 + g1cigs + q1), (15.91)

but ignore q1 when it is correlated with cigs, we will not consistently

estimate anything of interest, whether the model is linear or nonlinear.

Thus, we would not be estimating a causal effect. If q1 is independent of

cigs, then probit ignoring q1 does estimate the average partial effect of

another cigarette.

d. No. There are many people in the working population who do not smoke.

Thus, the distribution of cigs piles up at zero, conditional or unconditional

on z. Also, since cigs takes on integer values, it cannot be normally

distributed. But it is really the pile up at zero that is the most serious

issue.
^
e. Use the Rivers-Vuong test. Obtain the residuals, r2, from the
^
regression cigs on z. Then, estimate the probit of y on z1, cigs, r2 and use
^
a standard t test on r2. This does not rely on normality of r2 (or cigs). It

does, of course, rely on the probit model being correct for y under H0.
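
A minimal Stata sketch of this procedure (hypothetical names: y the missed-work indicator, z1a and z1b the exogenous variables in the structural equation, z2ex the instrument excluded from it):

reg cigs z1a z1b z2ex         // first stage: cigs on all exogenous variables
predict r2hat, resid
probit y z1a z1b cigs r2hat   // add the first-stage residual
test r2hat                    // H0: cigs exogenous (equivalent to the t test)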

f. Assuming people will not immediately move out of their state of

residence when the state implements no-smoking laws in the workplace, and that

state of residence is roughly independent of general health in the population,

a dummy indicator for whether the person works in a state with a new law can

be treated as exogenous and excluded from (15.91). (These situations are

often called "natural experiments.") Further, cigs is likely to be correlated

with the state law indicator, since people will not be able to smoke as much

as they otherwise would. Thus, it seems to be a reasonable instrument for

cigs.

15.7. a. The following Stata output is for part a:

. reg arr86 pcnv avgsen tottime ptime86 inc86 black hispan born60

Source | SS df MS Number of obs = 2725


-------------+------------------------------ F( 8, 2716) = 30.48
Model | 44.9720916 8 5.62151145 Prob > F = 0.0000
Residual | 500.844422 2716 .184405163 R-squared = 0.0824
-------------+------------------------------ Adj R-squared = 0.0797
Total | 545.816514 2724 .20037317 Root MSE = .42942
------------------------------------------------------------------------------
arr86 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.1543802 .0209336 -7.37 0.000 -.1954275 -.1133329
avgsen | .0035024 .0063417 0.55 0.581 -.0089326 .0159374
tottime | -.0020613 .0048884 -0.42 0.673 -.0116466 .007524
ptime86 | -.0215953 .0044679 -4.83 0.000 -.0303561 -.0128344
inc86 | -.0012248 .000127 -9.65 0.000 -.0014738 -.0009759
black | .1617183 .0235044 6.88 0.000 .1156299 .2078066
hispan | .0892586 .0205592 4.34 0.000 .0489454 .1295718
born60 | .0028698 .0171986 0.17 0.867 -.0308539 .0365936
_cons | .3609831 .0160927 22.43 0.000 .329428 .3925382
------------------------------------------------------------------------------

. reg arr86 pcnv avgsen tottime ptime86 inc86 black hispan born60, robust

Regression with robust standard errors Number of obs = 2725


F( 8, 2716) = 37.59
Prob > F = 0.0000
R-squared = 0.0824
Root MSE = .42942

------------------------------------------------------------------------------
| Robust
arr86 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.1543802 .018964 -8.14 0.000 -.1915656 -.1171948
avgsen | .0035024 .0058876 0.59 0.552 -.0080423 .0150471
tottime | -.0020613 .0042256 -0.49 0.626 -.010347 .0062244
ptime86 | -.0215953 .0027532 -7.84 0.000 -.0269938 -.0161967
inc86 | -.0012248 .0001141 -10.73 0.000 -.0014487 -.001001
black | .1617183 .0255279 6.33 0.000 .1116622 .2117743
hispan | .0892586 .0210689 4.24 0.000 .0479459 .1305714
born60 | .0028698 .0171596 0.17 0.867 -.0307774 .036517
_cons | .3609831 .0167081 21.61 0.000 .3282214 .3937449
------------------------------------------------------------------------------

The estimated effect from increasing pcnv from .25 to .75 is about -.154(.5) =

-.077, so the probability of arrest falls by about 7.7 points. There are no

important differences between the usual and robust standard errors. In fact,

in a couple of cases the robust standard errors are notably smaller.

b. The robust statistic and its p-value are obtained by using the "test"

command after appending "robust" to the regression command:

. test avgsen tottime

( 1) avgsen = 0.0
( 2) tottime = 0.0

F( 2, 2716) = 0.18
Prob > F = 0.8320

. qui reg arr86 pcnv avgsen tottime ptime86 inc86 black hispan born60

. test avgsen tottime

( 1) avgsen = 0.0
( 2) tottime = 0.0

F( 2, 2716) = 0.18
Prob > F = 0.8360

c. The probit model is estimated as follows:

. probit arr86 pcnv avgsen tottime ptime86 inc86 black hispan born60

Iteration 0: log likelihood = -1608.1837


Iteration 1: log likelihood = -1486.3157
Iteration 2: log likelihood = -1483.6458
Iteration 3: log likelihood = -1483.6406

Probit estimates Number of obs = 2725


LR chi2(8) = 249.09
Prob > chi2 = 0.0000
Log likelihood = -1483.6406 Pseudo R2 = 0.0774

------------------------------------------------------------------------------
arr86 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.5529248 .0720778 -7.67 0.000 -.6941947 -.4116549
avgsen | .0127395 .0212318 0.60 0.548 -.028874 .0543531
tottime | -.0076486 .0168844 -0.45 0.651 -.0407414 .0254442
ptime86 | -.0812017 .017963 -4.52 0.000 -.1164085 -.0459949
inc86 | -.0046346 .0004777 -9.70 0.000 -.0055709 -.0036983
black | .4666076 .0719687 6.48 0.000 .3255516 .6076635
hispan | .2911005 .0654027 4.45 0.000 .1629135 .4192875
born60 | .0112074 .0556843 0.20 0.840 -.0979318 .1203466
_cons | -.3138331 .0512999 -6.12 0.000 -.4143791 -.213287
------------------------------------------------------------------------------

Now, we must compute the difference in the normal cdf at the two different

values of pcnv, black = 1, hispan = 0, born60 = 1, and at the average values

of the remaining variables:

. sum avgsen tottime ptime86 inc86

Variable | Obs Mean Std. Dev. Min Max


---------+-----------------------------------------------------
avgsen | 2725 .6322936 3.508031 0 59.2
tottime | 2725 .8387523 4.607019 0 63.4
ptime86 | 2725 .387156 1.950051 0 12
inc86 | 2725 54.96705 66.62721 0 541

. di -.313 + .0127*.632 - .0076*.839 - .0812*.387 - .0046*54.97 + .467 + .0112


-.1174364

. di normprob(-.553*.75 - .117) - normprob(-.553*.25 - .117)


-.10181543

This last command shows that the probability falls by about .10, which is

somewhat larger than the effect obtained from the LPM.

d. To obtain the percent correctly predicted for each outcome, we first

generate the predicted values of arr86 as described on page 465:

. predict phat
(option p assumed; Pr(arr86))

. gen arr86h = phat > .5

. tab arr86h arr86

| arr86
arr86h | 0 1 | Total
-----------+----------------------+----------
0 | 1903 677 | 2580
1 | 67 78 | 145
-----------+----------------------+----------
Total | 1970 755 | 2725

. di 1903/1970
.96598985

. di 78/755
.10331126

For men who were not arrested, the probit predicts correctly about 96.6% of

the time. Unfortunately, for the men who were arrested, the probit is correct

only about 10.3% of the time. The overall percent correctly predicted is

quite high, but we cannot very well predict the outcome we would most like to

predict.

e. Adding the quadratic terms gives

. probit arr86 pcnv avgsen tottime ptime86 inc86 black hispan born60 pcnvsq
pt86sq inc86sq

Iteration 0: log likelihood = -1608.1837


Iteration 1: log likelihood = -1452.2089
Iteration 2: log likelihood = -1444.3151
Iteration 3: log likelihood = -1441.8535
Iteration 4: log likelihood = -1440.268
Iteration 5: log likelihood = -1439.8166
Iteration 6: log likelihood = -1439.8005
Iteration 7: log likelihood = -1439.8005

Probit estimates Number of obs = 2725


LR chi2(11) = 336.77
Prob > chi2 = 0.0000
Log likelihood = -1439.8005 Pseudo R2 = 0.1047

------------------------------------------------------------------------------
arr86 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | .2167615 .2604937 0.83 0.405 -.2937968 .7273198
avgsen | .0139969 .0244972 0.57 0.568 -.0340166 .0620105
tottime | -.0178158 .0199703 -0.89 0.372 -.056957 .0213253
ptime86 | .7449712 .1438485 5.18 0.000 .4630333 1.026909
inc86 | -.0058786 .0009851 -5.97 0.000 -.0078094 -.0039478
black | .4368131 .0733798 5.95 0.000 .2929913 .580635
hispan | .2663945 .067082 3.97 0.000 .1349163 .3978727
born60 | -.0145223 .0566913 -0.26 0.798 -.1256351 .0965905
pcnvsq | -.8570512 .2714575 -3.16 0.002 -1.389098 -.3250042
pt86sq | -.1035031 .0224234 -4.62 0.000 -.1474522 -.059554
inc86sq | 8.75e-06 4.28e-06 2.04 0.041 3.63e-07 .0000171
_cons | -.337362 .0562665 -6.00 0.000 -.4476423 -.2270817
------------------------------------------------------------------------------

note: 51 failures and 0 successes completely determined.

. test pcnvsq pt86sq inc86sq

( 1) pcnvsq = 0.0
( 2) pt86sq = 0.0
( 3) inc86sq = 0.0
chi2( 3) = 38.54
Prob > chi2 = 0.0000

The quadratics are individually and jointly significant. The quadratic in


pcnv means that, at low levels of pcnv, there is actually a positive
relationship between probability of arrest and pcnv, which does not make much
sense. The turning point is easily found as .217/(2*.857) ~ .127, which means
that there is an estimated deterrent effect over most of the range of pcnv.

15.8. a. Here is my Stata session:

. gen smokes = cigs > 0

. tab smokes
smokes| Freq. Percent Cum.
------------+-----------------------------------
0 | 1176 84.73 84.73
1 | 212 15.27 100.00
------------+-----------------------------------
Total | 1388 100.00

. probit smokes motheduc white lfaminc

Probit Estimates Number of obs = 1387


chi2(3) = 92.67
Prob > chi2 = 0.0000
Log Likelihood = -546.76991 Pseudo R2 = 0.0781

------------------------------------------------------------------------------
smokes | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
motheduc | -.1450599 .0207899 -6.977 0.000 -.1858074 -.1043124
white | .1896765 .1098804 1.726 0.084 -.0256852 .4050382
lfaminc | -.1669109 .0498894 -3.346 0.000 -.2646923 -.0691296
_cons | 1.126276 .2504608 4.497 0.000 .6353822 1.617171
------------------------------------------------------------------------------

. sum faminc

Variable | Obs Mean Std. Dev. Min Max


---------+-----------------------------------------------------
faminc | 1388 29.02666 18.73928 .5 65

. di 1.126 - .167*log(29.027)
.56350619

. di normprob(-.145*16 + .5635) - normprob(-.145*12 + .5635)
-.08019603

For nonwhite women at the average income level, the estimated difference in

the probability of smoking between college graduates and high school graduates

is about -.08; that is, women with a college education are .08 less likely to smoke.

b. faminc is probably not exogenous, since, at a minimum, it is likely

correlated with quality of health care. It might also be correlated with

unobserved cultural factors that are correlated with smoking.

c. The reduced form equation for lfaminc is estimated as follows:

. reg lfaminc motheduc white fatheduc

Source | SS df MS Number of obs = 1191


---------+------------------------------ F( 3, 1187) = 119.23
Model | 140.936735 3 46.9789115 Prob > F = 0.0000
Residual | 467.690904 1187 .394010871 R-squared = 0.2316
---------+------------------------------ Adj R-squared = 0.2296
Total | 608.627639 1190 .511451797 Root MSE = .6277

------------------------------------------------------------------------------
lfaminc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
motheduc | .0709044 .0098338 7.210 0.000 .0516109 .090198
white | .3452115 .050418 6.847 0.000 .2462931 .4441298
fatheduc | .0616625 .008708 7.081 0.000 .0445777 .0787473
_cons | 1.241413 .1103648 11.248 0.000 1.024881 1.457945
------------------------------------------------------------------------------

. predict v2hat, resid


(197 missing values generated)

As expected, fatheduc has a positive partial effect on lfaminc, and the

relationship is statistically significant. We need the residuals from this

regression for the next part. Note that we lose 197 observations due to

missing data on fatheduc.

d. To test the null of exogeneity, we estimate the probit that includes

^v2:

. probit smokes motheduc white lfaminc v2hat

Probit Estimates Number of obs = 1191


chi2(4) = 79.43
Prob > chi2 = 0.0000
Log Likelihood = -432.06242 Pseudo R2 = 0.0842

------------------------------------------------------------------------------
smokes | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
motheduc | -.0826247 .0465203 -1.776 0.076 -.1738029 .0085535
white | .4611075 .1965242 2.346 0.019 .0759272 .8462879
lfaminc | -.7622559 .3652944 -2.087 0.037 -1.47822 -.046292
v2hat | .6107298 .3708066 1.647 0.100 -.1160378 1.337497
_cons | 1.98796 .5996364 3.315 0.000 .8126946 3.163226
------------------------------------------------------------------------------

The evidence of endogeneity is not especially strong (the p-value on v2hat is .100), but even if it were, we

would not really know whether it is because lfaminc is endogenous or because

fatheduc belongs in the structural equation. Remember, the test can be

interpreted as a test for endogeneity of lfaminc only when we maintain that

fatheduc is exogenous.

This is not a very good example, but it shows you how to mechanically

carry out the tests.

15.9. a. Let P(y = 1|x) = xB, where x1 = 1. Then, for each i,

li(B) = yilog(xiB) + (1 - yi)log(1 - xiB),

which is only well-defined for 0 < xiB < 1.

b. For any possible estimate ^B, the log-likelihood function is well-

defined only if 0 < xi^B < 1 for all i = 1,...,N. Therefore, during the

iterations to obtain the MLE, this condition must be checked. It may be

impossible to find an estimate that satisfies these inequalities for every

observation, especially if N is large.
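
A quick Stata check of how binding these inequalities are (hypothetical names: y binary, x1 and x2 the regressors); if even the OLS fitted values leave the unit interval, an MLE satisfying all N constraints is unlikely to exist:

reg y x1 x2
predict yh                    // fitted values xi*Bhat
count if yh <= 0 | yh >= 1    // observations where log(xiB) or log(1 - xiB) fails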

c. This follows from the KLIC: the true density of y given x --

evaluated at the true values, of course -- maximizes the expected log likelihood (that is, minimizes the KLIC). Since the MLEs

are consistent for the unknown parameters, asymptotically the true density

will produce the highest average log likelihood function. So, just as we can

use an R-squared to choose among different functional forms for E(y|x), we can

use values of the log-likelihood to choose among different models for P(y =

1|x) when y is binary.

15.10. a. There are several possibilities. One is to define ^pi = ^P(y = 1|xi)

-- the estimated response probabilities -- and obtain the square of the

correlation between yi and ^pi. For the LPM, this is just the usual R-squared.

For the general index model, G(xi^B) is the estimate of E(y|xi), and so it

makes sense to compute an analogous goodness-of-fit measure.

always between zero and one. An alternative is to use the sum of squared

residuals form. While this produces the same R-squared measure for the linear

model, it does not for nonlinear models.

b. I will report the square of the correlation between yi and the fitted

probabilities for the LPM and probit. The LPM R-squared is about .106 and

that for probit is higher, about .116. So probit is preferred, marginally, on

this goodness-of-fit measure.
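
A minimal Stata sketch of this comparison (hypothetical names: y the binary response, x1 and x2 the explanatory variables):

reg y x1 x2
predict ph_lpm                // LPM fitted values
corr y ph_lpm
di "LPM R-squared: " r(rho)^2
probit y x1 x2
predict ph_pro                // probit response probabilities
corr y ph_pro
di "probit R-squared: " r(rho)^2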

15.11. We really need to make two assumptions. The first is a conditional

independence assumption: given xi = (xi1,...,xiT), (yi1,...,yiT) are

independent. This allows us to write

f(y1,...,yT|xi) = f1(y1|xi)WWWfT(yT|xi),

that is, the joint density (conditional on xi) is the product of the marginal

densities (each conditional on xi). The second assumption is a strict

exogeneity assumption: D(yit|xi) = D(yit|xit), t = 1,...,T. When we add the

standard assumption for pooled probit -- that D(yit|xit) follows a probit

model -- then

f(y1,...,yT|xi) = prod(t=1 to T) [F(xitB)]^yt[1 - F(xitB)]^(1-yt),

and so pooled probit is the conditional MLE.
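
If one doubts the conditional independence assumption but maintains strict exogeneity and the probit model for D(yit|xit), pooled probit remains consistent; only the standard errors need adjusting for serial correlation in the scores. A sketch with hypothetical names (long-format panel with identifier id):

probit y x1 x2, vce(cluster id)    // pooled probit, panel-robust SEs
* older syntax: probit y x1 x2, robust cluster(id)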

15.12. We can extend the T = 2 case on page 491:

P(yi1 = 1|xi,ci,ni = 1) = P(yi1 = 1,ni = 1|xi,ci)/P(ni = 1|xi,ci)

= P(yi1 = 1,yi2 = 0,yi3 = 0|xi,ci)/{P(yi1 = 1,yi2 = 0,yi3 = 0|xi,ci)

+ P(yi1 = 0,yi2 = 1,yi3 = 0|xi,ci) + P(yi1 = 0,yi2 = 0,yi3 = 1|xi,ci)}.

Now, we just use the conditional independence assumption (across t) and the

logistic functional form:

P(yi1 = 1,yi2 = 0,yi3 = 0|xi,ci) = L(xi1B + ci)[1 - L(xi2B + ci)]

W[1 - L(xi3B + ci)]


P(yi1 = 0,yi2 = 1,yi3 = 0|xi,ci) = [1 - L(xi1B + ci)]L(xi2B + ci)

W[1 - L(xi3B + ci)]


and

P(yi1 = 0,yi2 = 0,yi3 = 1|xi,ci) = [1 - L(xi1B + ci)]

W[1 - L(xi2B + ci)]L(xi3B + ci).


Now, the term

1/{[1 + exp(xi1B + ci)]W[1 + exp(xi2B + ci)]W[1 + exp(xi3B + ci)]}

appears multiplicatively in both the numerator and denominator, and so they

cancel. Therefore,

P(yi1 = 1|xi,ci,ni = 1) = exp(xi1B + ci)/[exp(xi1B + ci)

+ exp(xi2B + ci) + exp(xi3B + ci)]

= exp(xi1B)/[exp(xi1B) + exp(xi2B) + exp(xi3B)].

Also,

P(yi2 = 1|xi,ci,ni = 1) = exp(xi2B)/[exp(xi1B) + exp(xi2B) + exp(xi3B)]

and

P(yi3 = 1|xi,ci,ni = 1) = exp(xi3B)/[exp(xi1B) + exp(xi2B) + exp(xi3B)],

which are of the conditional logit form in (15.80). A consistent estimator of

B is obtained using only the ni = 1 observations and applying conditional

logit. This, however, would be inefficient because it does not use the ni = 2

observations.

A similar argument can be used for the three possible configurations with

ni = 2, which leads to the log-likelihood conditional on (xi,ni), where ci has

dropped out. For example,

P(yi1 = 1,yi2 = 1|xi,ci,ni = 2) = exp[(xi1 + xi2)B]

/{exp[(xi1 + xi2)B] + exp[(xi1 + xi3)B] + exp[(xi2 + xi3)B]}.

Again, this has the conditional logit form, but where the explanatory

variables consist of sums of explanatory variables across two different time

periods.
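
As a practical note, the conditional MLE that pools the ni = 1 and ni = 2 configurations is exactly Chamberlain's fixed effects logit, which Stata implements directly. A sketch with hypothetical names (long-format panel with identifier id and time-varying regressors x1, x2):

clogit y x1 x2, group(id)     // Chamberlain's conditional (fixed effects) logit
* equivalently, after xtset id: xtlogit y x1 x2, fe

Individuals with yit constant across the three periods (ni = 0 or ni = 3) drop out automatically, since ni then perfectly predicts their outcomes.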

15.13. a. If there are no covariates, there is no point in using any method

other than a straight comparison of means. The estimated probabilities for

the treatment and control groups, both before and after the policy change,

will be identical across models.

b. Let d2 be a binary indicator for the second time period, and let dB be

an indicator for the treatment group. Then a probit model to evaluate the

treatment effect is

P(y = 1|x) = F(d0 + d1d2 + d2dB + d3d2WdB + xG),

where x is a vector of covariates. We would estimate all parameters from a

probit of y on 1, d2, dB, d2WdB, and x using all observations. Once we have

the estimates, we need to compute the "difference-in-differences" estimate,

which requires either plugging in a value for x, say the sample average xbar,

or averaging the differences across xi. In the former case, we have

^q _ [F(^d0 + ^d1 + ^d2 + ^d3 + xbar^G) - F(^d0 + ^d2 + xbar^G)]

        - [F(^d0 + ^d1 + xbar^G) - F(^d0 + xbar^G)],

and in the latter we have

~q _ N^-1 S(i=1 to N) {[F(^d0 + ^d1 + ^d2 + ^d3 + xi^G) - F(^d0 + ^d2 + xi^G)]

        - [F(^d0 + ^d1 + xi^G) - F(^d0 + xi^G)]}.

Both are estimates of the difference, between groups B and A, of the change in

the response probability over time.

c. We would have to use the delta method to obtain a valid standard error

for either ^q or ~q.
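
A minimal Stata sketch of computing ~q after the probit (hypothetical names: y, d2, dB, and covariates x1, x2):

gen d2dB = d2*dB
probit y d2 dB d2dB x1 x2
gen xg = _b[x1]*x1 + _b[x2]*x2            // covariate index xi*Ghat
gen qi = (normprob(_b[_cons] + _b[d2] + _b[dB] + _b[d2dB] + xg)    ///
        - normprob(_b[_cons] + _b[dB] + xg))                       ///
        - (normprob(_b[_cons] + _b[d2] + xg) - normprob(_b[_cons] + xg))
sum qi                                    // the mean of qi is ~q

In newer versions of Stata, the margins command computes such averaged contrasts along with delta-method standard errors.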

15.14. a. The following Stata output contains the linear regression results.

Since pctstck is discrete (taking on only 0, 50, 100), it seems likely that

heteroskedasticity is present in a linear model. In fact, the robust standard

errors are not very different from the usual ones (not reported).

. reg pctstck choice age educ female black married finc25-finc101 wealth89
prftshr, robust

Regression with robust standard errors Number of obs = 194


F( 14, 179) = 2.15
Prob > F = 0.0113
R-squared = 0.0998
Root MSE = 39.134

------------------------------------------------------------------------------
| Robust
pctstck | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
choice | 12.04773 5.994437 2.01 0.046 .2188715 23.87658
age | -1.625967 .8327895 -1.95 0.052 -3.269315 .0173813
educ | .7538685 1.172328 0.64 0.521 -1.559493 3.06723
female | 1.302856 7.148595 0.18 0.856 -12.80351 15.40922
black | 3.967391 8.974971 0.44 0.659 -13.74297 21.67775
married | 3.303436 8.369616 0.39 0.694 -13.21237 19.81924
finc25 | -18.18567 16.00485 -1.14 0.257 -49.76813 13.39679
finc35 | -3.925374 15.86275 -0.25 0.805 -35.22742 27.37668
finc50 | -8.128784 15.3762 -0.53 0.598 -38.47072 22.21315
finc75 | -17.57921 16.6797 -1.05 0.293 -50.49335 15.33493
finc100 | -6.74559 16.7482 -0.40 0.688 -39.7949 26.30372
finc101 | -28.34407 16.57814 -1.71 0.089 -61.05781 4.369671
wealth89 | -.0026918 .0114136 -0.24 0.814 -.0252142 .0198307
prftshr | 15.80791 8.107663 1.95 0.053 -.190984 31.80681
_cons | 134.1161 58.87288 2.28 0.024 17.9419 250.2902
------------------------------------------------------------------------------

b. With relatively few husband-wife pairs -- 23 in this application -- we

do not expect big differences in standard errors, and we do not see them:

. reg pctstck choice age educ female black married finc25-finc101 wealth89
prftshr, robust cluster(id)

Regression with robust standard errors Number of obs = 194


F( 14, 170) = 2.12
Prob > F = 0.0128
R-squared = 0.0998
Number of clusters (id) = 171 Root MSE = 39.134

------------------------------------------------------------------------------
| Robust
pctstck | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
choice | 12.04773 6.184085 1.95 0.053 -.1597615 24.25521
age | -1.625967 .8192942 -1.98 0.049 -3.243267 -.0086663
educ | .7538685 1.1803 0.64 0.524 -1.576064 3.083801
female | 1.302856 7.000538 0.19 0.853 -12.51632 15.12203
black | 3.967391 8.711611 0.46 0.649 -13.22948 21.16426
married | 3.303436 8.624168 0.38 0.702 -13.72082 20.32769
finc25 | -18.18567 16.82939 -1.08 0.281 -51.40716 15.03583
finc35 | -3.925374 16.17574 -0.24 0.809 -35.85656 28.00581
finc50 | -8.128784 15.91447 -0.51 0.610 -39.54421 23.28665
finc75 | -17.57921 17.2789 -1.02 0.310 -51.68804 16.52963
finc100 | -6.74559 17.24617 -0.39 0.696 -40.78983 27.29865
finc101 | -28.34407 17.10783 -1.66 0.099 -62.1152 5.427069
wealth89 | -.0026918 .0119309 -0.23 0.822 -.0262435 .02086
prftshr | 15.80791 8.356266 1.89 0.060 -.6874976 32.30332
_cons | 134.1161 58.1316 2.31 0.022 19.36333 248.8688
------------------------------------------------------------------------------

For later use, the predicted pctstck for the person described in the problem,

with choice = 0, is about 38.37. With choice, it is roughly 50.42.

c. The ordered probit estimates follow, including commands that provide

the predictions for pctstck with and without choice:

. oprobit pctstck choice age educ female black married finc25-finc101 wealth89
prftshr

Iteration 0: log likelihood = -212.37031


Iteration 1: log likelihood = -202.0094
Iteration 2: log likelihood = -201.9865
Iteration 3: log likelihood = -201.9865

Ordered probit estimates Number of obs = 194


LR chi2(14) = 20.77
Prob > chi2 = 0.1077
Log likelihood = -201.9865 Pseudo R2 = 0.0489

------------------------------------------------------------------------------
pctstck | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
choice | .371171 .1841121 2.02 0.044 .010318 .7320241
age | -.0500516 .0226063 -2.21 0.027 -.0943591 -.005744
educ | .0261382 .0352561 0.74 0.458 -.0429626 .0952389
female | .0455642 .206004 0.22 0.825 -.3581963 .4493246
black | .0933923 .2820403 0.33 0.741 -.4593965 .6461811
married | .0935981 .2332114 0.40 0.688 -.3634878 .550684
finc25 | -.5784299 .423162 -1.37 0.172 -1.407812 .2509524
finc35 | -.1346721 .4305242 -0.31 0.754 -.9784841 .7091399
finc50 | -.2620401 .4265936 -0.61 0.539 -1.098148 .5740681
finc75 | -.5662312 .4780035 -1.18 0.236 -1.503101 .3706385
finc100 | -.2278963 .4685942 -0.49 0.627 -1.146324 .6905316
finc101 | -.8641109 .5291111 -1.63 0.102 -1.90115 .1729279
wealth89 | -.0000956 .0003737 -0.26 0.798 -.0008279 .0006368
prftshr | .4817182 .2161233 2.23 0.026 .0581243 .905312
-------------+----------------------------------------------------------------
_cut1 | -3.087373 1.623765 (Ancillary parameters)
_cut2 | -2.053553 1.618611
------------------------------------------------------------------------------

. * The estimated cut points are -3.087 and -2.053.


. * Now compute index to obtain prediction for the person described.
. * First, without choice.
. di - .050*60 + .026*12 + .046 - .262 - .000096*150
-2.9184

. * Now, with choice:


. di -2.918 + .371
-2.547

. * Now, compute probabilities.


. * First, P(pctstck = 50):

. di normprob(-2.054 + 2.918) - normprob(-3.087 + 2.918)


.37330773

. * Now, P(pctstck = 100)

. di 1 - normprob(-2.054 + 2.918)
.19379395

. * Now estimate the expected value:

. di 50*.373 + 100*.194
38.05

. * With choice:

. di normprob(-2.054 + 2.547) - normprob(-3.087 + 2.547)


.39439519

. di 1 - normprob(-2.054 + 2.547)
.31100629

. di 50*.394 + 100*.311
50.8

. di 50.8 - 38.05
12.75

. * So, using the ordered probit, the effect of choice for this person is
. * about 12.8 percentage points more in stock, which is not far from the
. * 12.1 points obtained with the linear model.

d. We can compute an R-squared for the ordered probit model by using the

squared correlation between the predicted pctstcki and the actual. The

following Stata session does this, after using the "oprobit" command:

. predict p1 p2 p3
(option p assumed; predicted probabilities)
(32 missing values generated)

. sum p1 p2 p3

Variable | Obs Mean Std. Dev. Min Max


-------------+-----------------------------------------------------
p1 | 194 .331408 .1327901 .0685269 .8053644
p2 | 194 .3701685 .0321855 .1655734 .3947809
p3 | 194 .2984236 .1245914 .0290621 .6747374

. gen pctstcko = 50*p2 + 100*p3


(32 missing values generated)

. corr pctstck pctstcko


(obs=194)

| pctstck pctstcko
-------------+------------------
pctstck | 1.0000
pctstcko | 0.3119 1.0000

. di .321^2
.103041

The R-squared for the linear regression was about .100, so the R-squared is

only slightly higher for ordered probit. In fact, the correlation between the

fitted values for the linear regression and ordered probit is .998, so the

fitted values are very similar.

15.15. We should use an interval regression model, that is, ordered probit

with known cut points. We would be assuming that the underlying GPA is

normally distributed conditional on x, but we only observe interval coded

data. (Clearly a conditional normal distribution for the GPAs is at best an

approximation.) Along with the bj -- including an intercept -- we estimate

s^2. The estimated coefficients are interpreted as if we had done a linear

regression with actual GPAs.
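
Stata implements this estimator directly as interval regression. A sketch with hypothetical names, where gpa_lo and gpa_hi hold the known endpoints of each student's reported GPA interval:

intreg gpa_lo gpa_hi x1 x2    // ordered probit with known cut points

The reported coefficients and sigma are on the scale of the underlying GPA, as described above.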

15.16. a. P(yi = 1|xi,ri) = P(xiB + ui > ri|xi,ri) = P[ui/s > (ri - xiB)/s] =

1 - F[(ri - xiB)/s] = F[xi(B/s) - (1/s)ri].


b. From part a, plim ^G = B/s and plim ^d = -1/s. So a consistent

estimator of s is -1/^d and a consistent estimator of B is -^G/^d.
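
A Stata sketch of these back-of-the-envelope estimators (hypothetical names: y the yes/no response, x1 and x2 the covariates, r the presented threshold ri):

probit y x1 x2 r
di "sigma hat = " -1/_b[r]
di "beta hat on x1 = " -_b[x1]/_b[r]
di "beta hat on x2 = " -_b[x2]/_b[r]

Standard errors for these ratios require the delta method, which is one reason the direct MLE in part c is more convenient.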

c. Just maximize the log-likelihood function with respect to B and s;

this would make it easy to obtain standard errors and other test statistics

involving B (and s). For each i,

li(B,s) = yilog{F[xi(B/s) - (1/s)ri]}

+ (1 - yi)log{1 - F[xi(B/s) - (1/s)ri]}.

d. As in part a,

P(yi = 1|xi,ri) = P(xiB + ui > ri|xi,ri) = P(ui > ri - xiB) = 1

- G(ri - xiB;D). Therefore,

li(B,D) = yilog[1 - G(ri - xiB;D)]

+ (1 - yi)log[G(ri - xiB;D)].

Note how this derivation allows for an asymmetric distribution of u.

e. Yes, because B has quantitative meaning in this particular binary

response application: bj measures the effect of xj on the average willingness

to pay. Our choice of distribution for u can certainly affect our estimation

of B. If we could observe wtpi for each i, we would use linear regression and

never have to specify a distribution for u.

15.17. a. We obtain the joint density by the product rule, since we have

independence conditional on (x,c):

f(y1,...,yG|x,c;Go) = f1(y1|x,c;Go)f2(y2|x,c;Go)WWWfG(yG|x,c;Go).

b. The density of (y1,...,yG) given x is obtained by integrating out with

respect to the distribution of c given x:

g(y1,...,yG|x;Go) = INT(-inf to inf) [prod(g=1 to G) fg(yg|x,c;Go)] h(c|x;Do) dc,

where c is a dummy argument of integration. Because c appears in each

D(yg|x,c), y1,...,yG are dependent without conditioning on c.

c. The log likelihood for each i is

log{ INT(-inf to inf) [prod(g=1 to G) fg(yig|xi,c;G)] h(c|xi;D) dc }.
As expected, this depends only on the observed data, (xi,yi1,...,yiG), and the

unknown parameters.

15.18. a. The probability is the same as under (15.67):

P(yit = 1|xi,ai) = F(j + xitB + x̄iX + ai), t = 1,2,...,T.

The fact that ai given xi is heteroskedastic has no effect when we are

obtaining the distribution conditional on (xi,ai).

b. Let gt(yt|xi,ai;Q) = [F(j + xitB + x̄iX + ai)]^yt[1 - F(j + xitB + x̄iX +

ai)]^(1-yt). Then, by the product and integration rules,

f(y1,...,yT|xi;Q) = INT(-inf to inf) [prod(t=1 to T) gt(yt|xi,a;Q)] h(a|xi;D) da,

where h(W|xi;D) is the Normal[0,sa^2 exp(x̄iL)] density. We get the log-

likelihood by plugging in the yit and taking the natural log.

15.19. a, b. Here is the Stata output for black men. I have balanced the

panel, so that only men in the sample from 1981 through 1987 appear.

. probit employ employ_1 if black

Iteration 0: log likelihood = -2793.6715


Iteration 1: log likelihood = -2251.6435
Iteration 2: log likelihood = -2248.0357
Iteration 3: log likelihood = -2248.0349

Probit estimates Number of obs = 4038
LR chi2(1) = 1091.27
Prob > chi2 = 0.0000
Log likelihood = -2248.0349 Pseudo R2 = 0.1953

------------------------------------------------------------------------------
employ | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
employ_1 | 1.389433 .0437182 31.78 0.000 1.303747 1.475119
_cons | -.5396127 .0281709 -19.15 0.000 -.5948268 -.4843987
------------------------------------------------------------------------------

. di normprob(-.540)
.29459852

. di normprob(-.540 + 1.389)
.80205935

. di .802 - .295
.507

The difference in employment probabilities this year, based on employment

status last year, is .507.

c. With year dummies, the story is very similar:

. probit employ employ_1 y83-y87 if black

Iteration 0: log likelihood = -2793.6715


Iteration 1: log likelihood = -2220.9214
Iteration 2: log likelihood = -2215.1822
Iteration 3: log likelihood = -2215.1795

Probit estimates Number of obs = 4038


LR chi2(6) = 1156.98
Prob > chi2 = 0.0000
Log likelihood = -2215.1795 Pseudo R2 = 0.2071

------------------------------------------------------------------------------
employ | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
employ_1 | 1.321349 .0453568 29.13 0.000 1.232452 1.410247
y83 | .3427664 .0749844 4.57 0.000 .1957997 .4897331
y84 | .4586078 .0755742 6.07 0.000 .3104852 .6067304
y85 | .5200576 .0767271 6.78 0.000 .3696753 .6704399
y86 | .3936516 .0774703 5.08 0.000 .2418125 .5454907
y87 | .5292136 .0773031 6.85 0.000 .3777023 .6807249
_cons | -.8850412 .0556041 -15.92 0.000 -.9940233 -.7760591
------------------------------------------------------------------------------

. di normprob(-.885 + .529)
.36092028

. di normprob(-.885 + .529 + 1.321)


.83272759

The estimated state dependence in 1987 is about .472. Employment probabilities

are generally rising over this period.

d. Here is one way to estimate the unobserved effects model:

. gen employ81 = employ if y81


(10428 missing values generated)

. replace employ81 = employ[_n-1] if y82


(1738 real changes made)

. replace employ81 = employ[_n-2] if y83


(1738 real changes made)

. replace employ81 = employ[_n-3] if y84


(1738 real changes made)

. replace employ81 = employ[_n-4] if y85


(1738 real changes made)

. replace employ81 = employ[_n-5] if y86


(1738 real changes made)

. replace employ81 = employ[_n-6] if y87


(1738 real changes made)

. xtprobit employ employ_1 employ81 y83-y87 if black, re

Fitting comparison model:

Iteration 0: log likelihood = -2793.6715


Iteration 1: log likelihood = -2207.2397
Iteration 2: log likelihood = -2200.3265
Iteration 3: log likelihood = -2200.3214

Fitting full model:

rho = 0.0 log likelihood = -2200.3214


rho = 0.1 log likelihood = -2189.493
rho = 0.2 log likelihood = -2194.3834
Iteration 0: log likelihood = -2189.493
Iteration 1: log likelihood = -2179.9725
Iteration 2: log likelihood = -2176.3849
Iteration 3: log likelihood = -2176.3738
Iteration 4: log likelihood = -2176.3738

Random-effects probit Number of obs = 4038


Group variable (i) : id Number of groups = 673

Random effects u_i ~ Gaussian Obs per group: min = 6


avg = 6.0
max = 6

Wald chi2(7) = 677.59


Log likelihood = -2176.3738 Prob > chi2 = 0.0000

------------------------------------------------------------------------------
employ | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
employ_1 | .8987858 .0677035 13.28 0.000 .7660893 1.031482
employ81 | .5662849 .088493 6.40 0.000 .3928418 .739728
y83 | .4339896 .0804062 5.40 0.000 .2763964 .5915828
y84 | .6563064 .0841192 7.80 0.000 .4914358 .821177
y85 | .7919761 .0887153 8.93 0.000 .6180972 .9658549
y86 | .6896298 .0901566 7.65 0.000 .5129262 .8663335
y87 | .8381973 .0910525 9.21 0.000 .6597376 1.016657
_cons | -1.0051 .0660937 -15.21 0.000 -1.134641 -.8755586
-------------+----------------------------------------------------------------
/lnsig2u | -1.178755 .1995222 -1.569811 -.7876984
-------------+----------------------------------------------------------------
sigma_u | .5546726 .0553347 .4561628 .6744557
rho | .2352762 .0358983 .1722434 .3126631
------------------------------------------------------------------------------
Likelihood ratio test of rho=0: chibar2(01) = 47.90 Prob >= chibar2 = 0.000

e. There is strong evidence of state dependence conditional on ci because

the coefficient on lagged employment is very significant, with t statistic =

13.3. As yet, we do not know how the coefficient .899 translates into the

estimated state dependence. Note that employ81 is also very significant,

showing that ci and employi,81 are positively correlated. The estimate of

sa^2 is (.555)^2, or ^sa^2 ~ .308.
f. The average state dependence, where we average out the distribution of

ci, is obtained as follows:

. gen prbdif87 = normprob((-1.005 + .838 + .899 + .566*employ81)/sqrt(1 +
.555^2)) - normprob((-1.005 + .838 + .566*employ81)/sqrt(1 + .555^2)) if y87
& black
(11493 missing values generated)

. sum prbdif87 if y87 & black

Variable | Obs Mean Std. Dev. Min Max


-------------+-----------------------------------------------------
prbdif87 | 673 .2831544 .025709 .2353894 .2969714

The estimated state dependence, averaged across the distribution of ci, is

.283. More precisely, this is our estimate of E[F(d87 + r + ci)] - E[F(d87 +

ci)], where the expectation is with respect to the distribution of ci. The

estimate is based on E{F[(j + d87 + r + xyi0)/(1 + sa^2)^1/2]} - E{F[(j + d87 +

xyi0)/(1 + sa^2)^1/2]} (by iterated expectations):

N^-1 S(i=1 to N) {F[(^j + ^d87 + ^r + ^xyi0)/(1 + ^sa^2)^1/2]

        - F[(^j + ^d87 + ^xyi0)/(1 + ^sa^2)^1/2]};

see page 495 in the text. Interestingly, .283 is just over half of the

estimated state dependence we obtain if we ignore ci.

15.20. Since y1* = z1D1 + g(y2)A1 + u1, and we can write, just as before, u1 =

q1v2 + e1, where e1|z,v2 ~ Normal(0,1 - r1^2), we have

P(y1 = 1|z,y2,v2) = F{[z1D1 + g(y2)A1 + q1v2]/(1 - r1^2)^1/2}.

Therefore, in Procedure 15.1, we will consistently estimate D1/(1 - r1^2)^1/2,

A1/(1 - r1^2)^1/2, and q1/(1 - r1^2)^1/2, where recall that A1 is now a vector of

parameters. The discussion for computing average partial effects is identical

to that on page 475.
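
A Stata sketch of this version of Procedure 15.1 with, say, g(y2) = (y2, y2^2) (hypothetical names: z1a, z1b in the structural equation and z2ex the excluded exogenous variable):

reg y2 z1a z1b z2ex           // reduced form for y2
predict v2, resid
gen y2sq = y2^2
probit y1 z1a z1b y2 y2sq v2  // estimates the scaled coefficients above

As before, the t statistic on v2 is a valid test of the null hypothesis that y2 is exogenous.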

It should also be clear that allowing for a 1 * G1 vector of endogenous


explanatory variables, y2, which appear as g1(y21), ..., gG1(y2,G1), is not

difficult. The key restriction is that the vector of reduced form errors, v2,

would have to be jointly normally distributed (along with u1). But the

endogeneity is solved by including the vector of reduced form residuals in

the probit. For details, see J.M. Wooldridge, "Unobserved Heterogeneity and

Estimation of Average Partial Effects," mimeo, Michigan State University

Department of Economics, 2002.
