
Contents

5 Maximum Likelihood
  5.1 Examples and definitions
    5.1.1 Non-normality
    5.1.2 Probability Model
    5.1.3 The Likelihood Function
  5.2 Maximum Likelihood Estimation
    5.2.1 The Estimator
    5.2.2 Identification
    5.2.3 The Score
    5.2.4 The Information Matrix
    5.2.5 The Fréchet-Darmois-Cramer-Rao Lower Bound
  5.3 Asymptotic Properties of MLE
    5.3.1 Consistency and Asymptotic Normality
    5.3.2 Asymptotic Efficiency
    5.3.3 Variance Estimation
  5.4 Binary Dependent Variable
    5.4.1 Linear Probability Model
    5.4.2 Probit and Logit
    5.4.3 Interpretation of Results
  5.5 Tests
    5.5.1 Wald Test
    5.5.2 Score Test
    5.5.3 Likelihood Ratio Test
    5.5.4 Invariance
    5.5.5 Consistency
    5.5.6 Confidence Region

A Further and Deeper Details
  A.1 Proof of the Strict Expected Log-Likelihood Inequality, Lemma 5.2.1
  A.2 Proof of Lemma 5.2.2
  A.3 Proof of Lemma 5.2.3
  A.4 Non invariance of the Wald test
Chapter 5

Maximum Likelihood

5.1 Examples and definitions


5.1.1 Non-normality
Some variables are not “normal”
,→ Income, wealth: positive, not symmetric and usually skewed.
,→ Return from equities: more extreme values.

Student or Laplace distribution (see Figure 5.1)


,→ Discrete variables: binary (working/not working), multinomial (transportation
mode), count data (number of visits to the hospital)
Bernoulli distribution (for a binary variable), multinomial distribution (for
a multinomial variable)
Maximum likelihood estimation can be used in such cases.

5.1.2 Probability Model


{y, x} are random variables. We know the conditional distribution, either the cu-
mulative distribution function (c.d.f.) F (y|x; θ) or the probability density function
(p.d.f.) f (y|x; θ), but the true values of parameters θ0 are unknown.
Note: In case we have a discrete variable y, we use a “Probability Mass Function”
(P.M.F.) instead of “Probability Density Function” (P.D.F.), see later.
,→ Normal distribution:
\[
f(y \mid x; \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - x'\beta)^2}{2\sigma^2}\right)
\]

Figure 5.1: Student and Laplace Distributions

,→ Laplace distribution:
\[
f(y \mid x; \beta, \sigma^2) = \frac{1}{\sqrt{2\sigma^2}} \exp\left(-\frac{\sqrt{2}\,|y - x'\beta|}{\sigma}\right)
\]

5.1.3 The Likelihood Function


Let the sample {(yi , xi ) : i = 1, . . . , n} be independent and identically distributed
(i.i.d.).
Likelihood function: l (θ; y|x) = f (y|x; θ).
It measures how likely a parameter value θ is when we observe value y, given x.
,→ The only difference with the conditional density is that the likelihood is
considered as a function of parameters θ.
Log-likelihood: L (θ; y|x) = log f (y|x; θ).
Log-likelihood of the sample:

\[
\begin{aligned}
L(\theta; y|X) \equiv L(\theta; y_1, \ldots, y_n \mid x_1, \ldots, x_n)
  &= \log f(y_1, \ldots, y_n \mid x_1, \ldots, x_n; \theta) \\
  &= \log \prod_{i=1}^{n} f(y_i \mid x_i; \theta) \quad \text{since i.i.d.} \\
  &= \sum_{i=1}^{n} \log f(y_i \mid x_i; \theta) .
\end{aligned}
\]


It measures how likely a parameter value θ is when we observe the sample {y1 , . . . , yn },
given {x1 , . . . , xn }.
Some examples:
,→ Normal data generating process: y|x ∼ N (x0 β, σ 2 )

– Likelihood of individual i:
\[
l(\beta, \sigma^2; y_i|x_i) = f(y_i|x_i; \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i'\beta)^2}{2\sigma^2}\right)
\]

– Log-likelihood of individual i:
\[
L(\beta, \sigma^2; y_i|x_i) = \log l(\beta, \sigma^2; y_i|x_i) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - x_i'\beta)^2}{2\sigma^2}
\]

– Log-likelihood of the sample:
\[
\begin{aligned}
L(\beta, \sigma^2; y|X) &= \sum_{i=1}^{n} L(\beta, \sigma^2; y_i|x_i) \quad \text{because of independence} \\
&= -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{n} \frac{(y_i - x_i'\beta)^2}{2\sigma^2} \\
&= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}
\end{aligned}
\]
– Note: the density may not be defined for some values of the parameters.
Here, we must have σ > 0.

,→ Laplace data generating process: y|x ∼ L(x'β, σ²)

\[
L(\beta, \sigma^2; y|X) = -\frac{n}{2}\log(2\sigma^2) - \frac{\sqrt{2}\sum_{i=1}^{n}|y_i - x_i'\beta|}{\sigma}
\]

5.2 Maximum Likelihood Estimation


5.2.1 The Estimator
We denote by E_n the empirical expectation (or sample average), i.e.
\[
E_n L(\theta; y|x) = n^{-1}\sum_{i=1}^{n} L(\theta; y_i|x_i) = n^{-1} L(\theta; y_1, \ldots, y_n \mid x_1, \ldots, x_n) .
\]


Maximum likelihood estimator:
\[
\hat{\theta} = \arg\max_{\theta \in \Theta} E_n L(\theta; y|x)
\]

where Θ is the parameter space.
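In practice this maximization is usually done numerically. Below is a minimal sketch (simulated data, illustrative variable names, not part of the original notes) that maximizes E_n L for the normal model with scipy; σ² is parametrized through its logarithm so that it stays positive.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # regressors (with a constant)
beta0, sigma0 = np.array([1.0, -2.0]), 1.5
y = X @ beta0 + sigma0 * rng.normal(size=n)

def neg_avg_loglik(theta):
    # theta = (beta, log sigma^2); minimizing -E_n L is the same as maximizing E_n L
    beta, sig2 = theta[:-1], np.exp(theta[-1])
    resid = y - X @ beta
    return 0.5 * np.log(2 * np.pi * sig2) + np.mean(resid**2) / (2 * sig2)

res = minimize(neg_avg_loglik, x0=np.zeros(X.shape[1] + 1), method="BFGS")
beta_ml, sig2_ml = res.x[:-1], np.exp(res.x[-1])
print(beta_ml, sig2_ml)   # should match the closed-form OLS / RSS/n values derived below
```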

Normal data generating process


Vector of unknown parameters: θ = (β', σ²)'
\[
\begin{aligned}
E_n L(\beta, \sigma^2; y|x) &= -\frac{1}{2}\log(2\pi\sigma^2) - \frac{E_n (y - x'\beta)^2}{2\sigma^2} \\
&= -\frac{1}{2}\log(2\pi\sigma^2) - \frac{n^{-1}\sum_{i=1}^{n}(y_i - x_i'\beta)^2}{2\sigma^2} \\
&= -\frac{1}{2}\log(2\pi\sigma^2) - \frac{n^{-1}\|y - X\beta\|^2}{2\sigma^2} \\
&= -\frac{1}{2}\log(2\pi\sigma^2) - \frac{n^{-1}(y - X\beta)'(y - X\beta)}{2\sigma^2} .
\end{aligned}
\]

First-order derivatives:
\[
\begin{aligned}
\frac{\partial}{\partial \beta} E_n L(\beta, \sigma^2; y|x) &= \frac{n^{-1} X'(y - X\beta)}{\sigma^2} \\
\frac{\partial}{\partial \sigma^2} E_n L(\beta, \sigma^2; y|x) &= -\frac{1}{2\sigma^2} + \frac{n^{-1}\|y - X\beta\|^2}{2\sigma^4} .
\end{aligned}
\]
This yields the First-Order Conditions (FOC)
\[
\begin{aligned}
\frac{\partial}{\partial \beta} E_n L(\hat{\beta}, \hat{\sigma}^2; y|x) &= \frac{n^{-1} X'(y - X\hat{\beta})}{\hat{\sigma}^2} = 0 \\
\frac{\partial}{\partial \sigma^2} E_n L(\hat{\beta}, \hat{\sigma}^2; y|x) &= -\frac{1}{2\hat{\sigma}^2} + \frac{n^{-1}\|y - X\hat{\beta}\|^2}{2\hat{\sigma}^4} = 0 .
\end{aligned}
\]
Therefore,

,→ β̂_ML = (X'X)⁻¹X'y
,→ σ̂²_ML = n⁻¹‖y − Xβ̂‖² = (n−K)/n · s²


Thus, β̂_ML = β̂_OLS.


Second-order derivatives (Hessian matrix):
\[
\frac{\partial^2}{\partial\theta\partial\theta'} E_n L(\theta; y|x) =
\begin{pmatrix}
-\dfrac{X'X}{n\sigma^2} & -\dfrac{X'(y - X\beta)}{n\sigma^4} \\[1ex]
-\dfrac{[X'(y - X\beta)]'}{n\sigma^4} & \dfrac{1}{2\sigma^4} - \dfrac{\|y - X\beta\|^2}{n\sigma^6}
\end{pmatrix} .
\]
Second-Order Conditions (SOC):
\[
\frac{\partial^2}{\partial\theta\partial\theta'} E_n L(\hat{\theta}_{ML}; y|x) =
\begin{pmatrix}
-\dfrac{X'X}{n\hat{\sigma}^2} & 0 \\[1ex]
0 & -\dfrac{1}{2\hat{\sigma}^4}
\end{pmatrix}
\]
is negative definite if X has full rank, so θ̂_ML is a local, and in fact global, maximum.

Laplace data generating process

\[
\begin{aligned}
E_n L(\beta, \sigma^2; y|x) &= -\frac{1}{2}\log(2\sigma^2) - \frac{\sqrt{2}\, E_n |y - x'\beta|}{\sigma} \\
&= -\frac{1}{2}\log(2\sigma^2) - \frac{\sqrt{2}\, n^{-1}\sum_{i=1}^{n}|y_i - x_i'\beta|}{\sigma} .
\end{aligned}
\]
Then $\hat{\beta}_{ML} = \arg\min_\beta n^{-1}\sum_{i=1}^{n} |y_i - x_i'\beta|$. This is the same as the Minimum
Absolute Deviations (MAD) estimator.
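A minimal sketch (simulated data, illustrative names) of this equivalence: the Laplace MLE of β is obtained by numerically minimizing the mean absolute deviation, and it can be compared with OLS, which is the normal MLE.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([1.0, -2.0])
y = X @ beta0 + rng.laplace(scale=1.0, size=n)           # Laplace errors

# Laplace MLE of beta = minimizer of the mean absolute deviation (MAD)
mad = lambda b: np.mean(np.abs(y - X @ b))
beta_mad = minimize(mad, x0=np.zeros(2), method="Nelder-Mead").x   # derivative-free: |.| is not smooth

# Normal MLE of beta = OLS
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_mad, beta_ols)
```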
Exercise 5.1. Consider the (unconditional) exponential model where
\[
f(y; \theta) = \frac{1}{\theta}\exp\left(-\frac{y}{\theta}\right), \quad y > 0, \qquad E(y) = \theta, \quad Var(y) = \theta^2 .
\]
Show that the MLE is the sample average.

5.2.2 Identification
Definition 1. θ0 is globally identified in Θ if no other value of θ gives the same
model, i.e. f(y|x; θ) ≠ f(y|x; θ0) for all θ ∈ Θ different from θ0.

Assumption 2 (Identification). Every value of the vector of parameters in Θ
is identified, i.e.
\[
\Pr\left[f(y|x; \theta_1) \neq f(y|x; \theta_2)\right] > 0 \qquad \forall \theta_1, \theta_2, \; \theta_1 \neq \theta_2 .
\]


Normal data generating process: y|x ∼ N(x'β, σ²)

For a given variance σ², the densities are equal for β0 ≠ β1 if and only if (iff)
x'(β0 − β1) = 0, i.e. if there is collinearity between the components of x.
Exercise 5.2. Consider the (unconditional) exponential model. Show that the
unconditional version of Assumption 2 holds, i.e.
Pr[f(y; θ0) ≠ f(y; θ1)] > 0 ∀θ0, θ1, θ0 ≠ θ1.
Assumption 3 (Dominance). E [supθ∈Θ |L (θ; y|x) |] < ∞.
Lemma 5.2.1 (Strict Expected Log-Likelihood Inequality). Under Assumptions
Dominance and Identification, if θ0 is the true value of the parameter, then
E[L(θ; y|x)] < E[L(θ0; y|x)] ∀θ ≠ θ0.

NB: When we write E , it means expectation with respect to the true distribution
f (y|x; θ0 ).

Proof: see A.1.

Remark: The non-conditional version is also true: if f(y; θ) is the probability model
and θ0 is the true value, then
E[L(θ; y)] < E[L(θ0; y)] ∀θ ≠ θ0
under unconditional versions of Assumptions Dominance and Identification.
Here E means expectation with respect to the distribution f(y; θ0).
Consequences:
,→ θ0 = arg max_{θ∈Θ} E[L(θ; y|x)]. The Maximum Likelihood Estimator is defined
similarly using the sample analog:
\[
\hat{\theta} = \arg\max_{\theta \in \Theta} E_n L(\theta; y|x) .
\]

,→ If the log-likelihood function is differentiable, and the true parameter θ0 is
in the interior of Θ,
\[
\frac{\partial}{\partial\theta} E[L(\theta_0; y|x)] = 0 \quad \text{and} \quad \frac{\partial^2}{\partial\theta\partial\theta'} E[L(\theta_0; y|x)] \;\text{negative definite}.
\]
Similarly, for the Maximum Likelihood Estimator θ̂, we should have
\[
\frac{\partial}{\partial\theta} E_n\left[L(\hat{\theta}; y|x)\right] = 0 \quad \text{and} \quad \frac{\partial^2}{\partial\theta\partial\theta'} E_n\left[L(\hat{\theta}; y|x)\right] \;\text{negative definite}.
\]

Normal data generating process


,→ The true model is:
\[
y = x'\beta_0 + \varepsilon \quad \text{with} \quad \varepsilon|x \sim N(0, \sigma_0^2)
\]
,→ The expected log-likelihood can be written, for any value of β and σ², as:
\[
\begin{aligned}
E\left[L(\beta, \sigma^2; y|x) \,|\, x\right] &= -\frac{1}{2}\log(2\pi\sigma^2) - \frac{E[(y - x'\beta)^2|x]}{2\sigma^2} \\
&= -\frac{1}{2}\left[\log(2\pi\sigma^2) + \frac{\sigma_0^2 + (x'(\beta_0 - \beta))^2}{\sigma^2}\right] .
\end{aligned}
\]
– Steps to calculate the expected value:
\[
\begin{aligned}
E[(y - x'\beta)^2|x] &= E[(y - x'\beta_0 + x'\beta_0 - x'\beta)^2|x] \\
&= E[(\varepsilon + x'\beta_0 - x'\beta)^2|x] \\
&= E[\varepsilon^2|x] + E[(x'\beta_0 - x'\beta)^2|x] + 2E[\varepsilon(x'\beta_0 - x'\beta)|x] \\
&= Var(\varepsilon|x) + (x'\beta_0 - x'\beta)^2 + 2E[\varepsilon|x](x'\beta_0 - x'\beta) \\
&= \sigma_0^2 + (x'\beta_0 - x'\beta)^2 \quad \text{because } E[\varepsilon|x] = 0 .
\end{aligned}
\]
The expected log-likelihood is maximum when β = β0 and σ² = σ0².
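This can be illustrated numerically: by the LLN, with a large simulated sample E_n L approximates the expected log-likelihood, and it is largest near (β0, σ0²). A minimal sketch with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta0, sig2_0 = np.array([1.0, -2.0]), 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ beta0 + np.sqrt(sig2_0) * rng.normal(size=n)

def avg_loglik(beta, sig2):
    resid = y - X @ beta
    return -0.5 * np.log(2 * np.pi * sig2) - np.mean(resid**2) / (2 * sig2)

print(avg_loglik(beta0, sig2_0))        # largest of the three values
print(avg_loglik(beta0 + 0.3, sig2_0))  # smaller
print(avg_loglik(beta0, 2 * sig2_0))    # smaller
```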


Exercise 5.3. Consider the (unconditional) exponential model and show that the
expected log-likelihood inequality holds.

5.2.3 The Score


To simplify, we will use the notation L(θ) for L(θ; y|x) (or L(θ; y) for the
unconditional model).
Assumption 4 (Differentiability). The density f (y|x; θ) is twice continu-
ously differentiable in θ for all θ ∈ Θ. The support of the density does not depend
on the value of θ, and differentiation and integration are interchangeable.
In particular, this implies
\[
\frac{\partial}{\partial\theta} E\, L(\theta) = E\, \frac{\partial}{\partial\theta} L(\theta)
\quad \text{and} \quad
\frac{\partial^2}{\partial\theta\partial\theta'} E\, L(\theta) = E\, \frac{\partial^2}{\partial\theta\partial\theta'} L(\theta) .
\]
By definition of MLE, if it is an interior solution, the FOC are
\[
\frac{\partial}{\partial\theta} E_n L(\hat{\theta}) = 0 .
\]
Those equations are called the likelihood equations.


Score function
The score function is defined as
\[
L_\theta(\theta) = \frac{\partial}{\partial\theta} L(\theta) .
\]
Lemma 5.2.2. Under Assumption Differentiability, E [Lθ (θ0 ) |x] = 0.
(Non-conditional version also holds.)

Proof: see A.2.

Normal data generating process

\[
\begin{aligned}
E\left[L_\beta(\beta, \sigma^2; y|x) \,|\, x\right] &= \frac{\partial}{\partial\beta} E\left[L(\beta, \sigma^2; y|x)\,|\,x\right] = \frac{(xx')(\beta_0 - \beta)}{\sigma^2} \\
E\left[L_{\sigma^2}(\beta, \sigma^2; y|x) \,|\, x\right] &= \frac{\partial}{\partial\sigma^2} E\left[L(\beta, \sigma^2; y|x)\,|\,x\right] = -\frac{1}{2\sigma^2} + \frac{\sigma_0^2 + [x'\beta_0 - x'\beta]^2}{2\sigma^4} .
\end{aligned}
\]
These are equal to zero for β = β0 and σ² = σ0².
Exercise 5.4. Consider the (unconditional) exponential model. Derive the score
function. Check that the lemma holds in this case.

5.2.4 The Information Matrix


We have seen that, if the log-likelihood function is differentiable, we should have
\[
\frac{\partial^2}{\partial\theta\partial\theta'} E[L(\theta_0; y|x)\,|\,x] = E[L_{\theta\theta}(\theta_0; y|x)\,|\,x] \;\text{negative definite}.
\]
This matrix is related to the information matrix.

Conditional Information Matrix


\[
I(\theta_0|x) = Var[L_\theta(\theta_0)|x]
\]
Since E[Lθ(θ0)|x] = 0 by the score identity, we get:
\[
I(\theta_0|x) = E[L_\theta(\theta_0) L_\theta'(\theta_0)|x] .
\]


Unconditional Information Matrix


\[
I(\theta_0) = Var[L_\theta(\theta_0)] = E[L_\theta(\theta_0) L_\theta'(\theta_0)] .
\]
Note: the unconditional variance is just the expectation of the conditional variance, as
\[
\begin{aligned}
Var[L_\theta(\theta_0)] &= E\left[Var[L_\theta(\theta_0)|x]\right] + Var\left[E[L_\theta(\theta_0)|x]\right] \quad \text{by the l.i.e.} \\
&= E\left[Var[L_\theta(\theta_0)|x]\right] \quad \text{because } E[L_\theta(\theta_0)|x] = 0 .
\end{aligned}
\]

Lemma 5.2.3 (Information Matrix Equality). Under Assumption Differentiability
and Var[Lθ(θ0)] < ∞,
\[
I(\theta_0|x) = Var[L_\theta(\theta_0)|x] = -E[L_{\theta\theta}(\theta_0)|x] .
\]
This matrix is positive semi-definite.

Unconditional version: Var[Lθ(θ0)] = −E[Lθθ(θ0)].


Proof: see A.3.

More generally
One can define the conditional information matrix (of the sample) as
\[
\begin{aligned}
I(\theta|X) &= n\, Var\left(E_n L_\theta(\theta)\,|\,X\right) \\
&= n\, Var\left(\frac{1}{n}\sum_{i=1}^{n} L_\theta(\theta)\,\Big|\,X\right) \\
&= \frac{1}{n}\sum_{i=1}^{n} Var\left(L_\theta(\theta)\,|\,x_i\right) \quad \text{because i.i.d.} \\
&= \frac{1}{n}\sum_{i=1}^{n} I(\theta|x_i) .
\end{aligned}
\]
Similarly, we have
\[
I(\theta|X) = -\frac{1}{n}\sum_{i=1}^{n} E\left(L_{\theta\theta}(\theta)\,|\,x_i\right) .
\]


Normal data generating process


We have (check as an exercise!)
\[
I(\theta_0|X) =
\begin{pmatrix}
\dfrac{X'X}{n\sigma_0^2} & 0 \\[1ex]
0 & \dfrac{1}{2\sigma_0^4}
\end{pmatrix} .
\]

Exercise 5.5. Consider the (unconditional) exponential model. Derive the
information matrix.

Assumption 5 (Non-singular information). The information matrix I(θ|x)


exists and is non-singular for all θ.

Since by definition, the information matrix is positive semi-definite, this assump-


tion implies that it is positive definite.

5.2.5 The Fréchet-Darmois-Cramer-Rao Lower Bound


Under some technical assumptions, every unbiased estimator θ̃ is such that
\[
n\, E\left[(\tilde{\theta} - \theta_0)(\tilde{\theta} - \theta_0)'\right] - I^{-1}(\theta_0) = n\, Var(\tilde{\theta}) - I^{-1}(\theta_0)
\]
is positive semi-definite.
Conditional version is also true and we can show that the Gauss-Markov Theorem
is a consequence of this latter result.

Normal data generating process


\[
I^{-1}(\theta_0|X) =
\begin{pmatrix}
\sigma_0^2 \left(\dfrac{X'X}{n}\right)^{-1} & 0 \\[1ex]
0 & 2\sigma_0^4
\end{pmatrix} .
\]
Therefore, the OLS/ML estimator of β0 attains the Cramer-Rao lower bound.

5.3 Asymptotic Properties of MLE


MLE is not generally unbiased, so we cannot apply the FDCR lower bound directly.

,→ We don’t always have an analytic formula for MLE, so small sample proper-
ties difficult to establish.

,→ MLE is not necessarily unbiased, but it has nice large sample properties.


There is however one important small-sample property of MLE: invariance.

If we transform θ into γ = h(θ), where h(·) is one-to-one, then
\[
\hat{\gamma} = \arg\max_{\gamma} E_n L(\gamma)
\]
is such that γ̂ = h(θ̂).

5.3.1 Consistency and Asymptotic Normality


Assumption 6 (Iid). (yi, x1i, . . . , xKi), i = 1, . . . , n, are independent and identically
distributed.
Assumption 7 (Compactness). The parameter space Θ is compact.
Theorem 5.3.1. Under Assumptions Iid, Dominance, Identification, Compactness,
\[
\hat{\theta}_{ML} \xrightarrow{p} \theta_0 .
\]
Under the supplementary Assumptions Differentiability and Non-singular information,
if θ0 belongs to the interior of Θ and E supΘ |Lθθ(θ)| < ∞,
\[
\sqrt{n}\left(\hat{\theta}_{ML} - \theta_0\right) \xrightarrow{d} N\left(0, I^{-1}(\theta_0)\right) .
\]


Exercise 5.6. Consider the (unconditional) exponential model, where θ̂_ML = ȳ
as shown previously. Show that the estimator is consistent and give its asymptotic
distribution. Check that your result agrees with the general result above.

Asymptotic Analysis
Convergence

For convergence, the intuition is that
\[
\hat{\theta}_{ML} = \arg\max E_n L(\theta) \quad \text{should be close to} \quad \theta_0 = \arg\max E\, L(\theta)
\]
with probability approaching 1 when n increases, because of the LLN (we actually
need a "uniform" LLN for this argument to be correct).
Asymptotic Normality
Now, if θ0 is in the interior of Θ, then θ̂_ML should also be in the interior of Θ
with probability approaching 1 when n increases.
Under Differentiability, a Taylor expansion of the FOC gives
\[
E_n\left[L_\theta(\hat{\theta})\right] = 0 = E_n\left[L_\theta(\theta_0)\right] + E_n\left[L_{\theta\theta}(\bar{\theta})\right]\left(\hat{\theta} - \theta_0\right),
\]


where θ̄ is such that ‖θ̄ − θ0‖ ≤ ‖θ̂ − θ0‖.
Hence, solving the previous equation for (θ̂ − θ0), we get:
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) = -\left[E_n L_{\theta\theta}(\bar{\theta})\right]^{-1} \sqrt{n}\, E_n\left[L_\theta(\theta_0)\right] .
\]

Now

1. E_n[Lθ(θ0)] is a sample average, so √n E_n[Lθ(θ0)] is asymptotically normal.
\[
\sqrt{n}\, E_n\left[L_\theta(\theta_0)\right] = \sqrt{n}\, \frac{1}{n}\sum_{i=1}^{n} L_\theta(\theta_0; y_i|x_i)
\]
with E[Lθ(θ0; y|x)] = 0 and Var[Lθ(θ0; y|x)] = I(θ0) (here we consider
the unconditional variance).
Hence the CLT yields
\[
\sqrt{n}\, E_n\left[L_\theta(\theta_0)\right] \xrightarrow{d} N\left(0, I(\theta_0)\right) .
\]

2. −E_n[Lθθ(θ̄)] converges to the information matrix.
Since θ̄ →p θ0, the LLN yields −E_n[Lθθ(θ̄)] →p I(θ0).

3. By Slutsky's theorem,
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \xrightarrow{d} N\left(0, I^{-1}(\theta_0)\right) .
\]
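A small Monte Carlo sketch (an assumed design, not from the notes) of this result for σ̂²_ML in a normal model with only a constant: the information for σ² is 1/(2σ0⁴), so √n(σ̂² − σ0²) should be centered at 0 with variance close to 2σ0⁴.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, sig2_0 = 200, 5000, 2.0
draws = np.empty(reps)
for r in range(reps):
    y = rng.normal(loc=1.0, scale=np.sqrt(sig2_0), size=n)
    sig2_ml = np.mean((y - y.mean())**2)        # MLE of sigma^2
    draws[r] = np.sqrt(n) * (sig2_ml - sig2_0)

print(draws.mean())   # close to 0
print(draws.var())    # close to 2 * sig2_0**2 = 8
```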


5.3.2 Asymptotic Efficiency


MLE is not generally unbiased, so we cannot apply the FDCR lower bound directly.

Definition 8.
,→ An estimator is CAN if it is Consistent and Asymptotically Normal.

,→ An estimator is CUAN if it is Consistent and Uniformly Asymptotically
Normal over compact sets of Θ.

,→ An estimator is asymptotically efficient if it has the “smallest” (in the matrix
sense) asymptotic variance among CUAN estimators.

Under the assumptions of Theorem 5.3.1, MLE is asymptotically efficient.


5.3.3 Variance Estimation


Useful for tests and confidence regions. Three consistent ways of estimating
I(θ0 ) = Var (Lθ (θ0 ; y|x)):

1. I(θ̂) if we know the exact form of the information matrix I(θ).
In the case where y|x ∼ N(x'β, σ²), since
\[
I(\theta_0|X) =
\begin{pmatrix}
\dfrac{X'X}{n\sigma_0^2} & 0 \\[1ex]
0 & \dfrac{1}{2\sigma_0^4}
\end{pmatrix},
\]
we can estimate it by
\[
I(\hat{\theta}|X) =
\begin{pmatrix}
\dfrac{X'X}{n\hat{\sigma}^2} & 0 \\[1ex]
0 & \dfrac{1}{2\hat{\sigma}^4}
\end{pmatrix}
\]
where σ̂² = n⁻¹‖y − Xβ̂‖². Alternatively, we can use s² instead of σ̂².

2. Empirical variance of the score
\[
Var_n\left(L_\theta(\hat{\theta}; y|x)\right) = E_n\left[L_\theta(\hat{\theta}) L_\theta'(\hat{\theta})\right]
= n^{-1}\sum_{i=1}^{n} \left(\frac{\partial}{\partial\theta}\log f(y_i|x_i; \hat{\theta})\right)\left(\frac{\partial}{\partial\theta}\log f(y_i|x_i; \hat{\theta})\right)' .
\]
This is the outer-product of gradients estimator.

In the case where y|x ∼ N(x'β, σ²), this is
\[
\begin{pmatrix}
\dfrac{1}{\hat{\sigma}^4} E_n\left[xx'\left(y - x'\hat{\beta}\right)^2\right] & \dfrac{1}{2\hat{\sigma}^6} E_n\left[x\left(y - x'\hat{\beta}\right)^3\right] \\[1ex]
\cdot & \dfrac{1}{4\hat{\sigma}^8} E_n\left[\left(y - x'\hat{\beta}\right)^4 - \hat{\sigma}^4\right]
\end{pmatrix} .
\]

3. Minus the empirical Hessian
\[
E_n\left[-L_{\theta\theta}(\hat{\theta}; y|x)\right] = -n^{-1}\sum_{i=1}^{n} \frac{\partial^2}{\partial\theta\partial\theta'}\log f(y_i|x_i; \hat{\theta}) .
\]
In the case where y|x ∼ N(x'β, σ²), this yields the same estimator as the
first one.


The three estimators are generally different. Careful, as the third one is not
necessarily positive definite.
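A minimal sketch (simulated data, illustrative names) computing these estimators for the normal linear model with the formulas above; in this model minus the empirical Hessian coincides with the plug-in estimator I(θ̂|X), while the outer-product-of-gradients estimate generally differs in finite samples.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0, sigma0 = np.array([1.0, -2.0]), 1.5
y = X @ beta0 + sigma0 * rng.normal(size=n)

# ML estimates: OLS for beta, RSS/n for sigma^2
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
u = y - X @ beta_hat
sig2_hat = np.mean(u**2)
K = X.shape[1]

# 1. Plug-in information matrix I(theta_hat | X)  (equal to 3., minus the empirical Hessian, in this model)
I_hat = np.zeros((K + 1, K + 1))
I_hat[:K, :K] = X.T @ X / (n * sig2_hat)
I_hat[K, K] = 1 / (2 * sig2_hat**2)

# 2. Outer product of the per-observation scores (OPG)
scores = np.column_stack([X * (u / sig2_hat)[:, None],
                          -1 / (2 * sig2_hat) + u**2 / (2 * sig2_hat**2)])
opg = scores.T @ scores / n

print(I_hat)
print(opg)
```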
Exercise 5.7. Check that in the exponential model where
\[
f(y; \theta) = \frac{1}{\theta}\exp\left(-\frac{y}{\theta}\right), \quad y > 0, \qquad E\,y = \theta, \quad Var\,y = \theta^2 ,
\]
the three estimators are respectively
\[
\begin{aligned}
I(\hat{\theta}) &= \frac{1}{\bar{y}^2} \\
Var_n\left(L_\theta(\hat{\theta})\right) &= \frac{1}{n}\sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{\bar{y}^4} \neq I(\hat{\theta}) \\
E_n\left[-L_{\theta\theta}(\hat{\theta})\right] &= -\frac{1}{\bar{y}^2} + 2\frac{\bar{y}}{\bar{y}^3} = \frac{1}{\bar{y}^2} .
\end{aligned}
\]

5.4 Binary Dependent Variable


Limited dependent variable: range of possible values is limited, e.g.
,→ Discrete: includes binary, multinomial
,→ Count data, ordered multinomial

Bernoulli model
If ỹ is Bernoulli with parameter p,
\[
\Pr(\tilde{y} = y; p) = p^{y}(1 - p)^{1-y} .
\]
The likelihood is l(p; y) = Pr(ỹ = y; p) and the log-likelihood is
\[
L(p; y) = y \log p + (1 - y)\log(1 - p) .
\]

Exercise 5.8. Check that p̂_ML = ȳ. Derive the score function and the information
matrix.

5.4.1 Linear Probability Model


For y binary, E(y|x) = Pr(y = 1|x). The linear regression
\[
y = x'\beta + \varepsilon, \qquad E(\varepsilon|x) = 0,
\]
is called the linear probability model. Indeed, since Pr(y = 1|x) = E(y|x), the model
implies that Pr(y = 1|x) = x'β. But


,→ ε is not normally distributed, so y is not conditionally normal

,→ ε cannot be homoscedastic
As y = y²,
\[
Var(\varepsilon|x) = Var(y|x) = E(y^2|x) - E^2(y|x) = E(y|x)\left[1 - E(y|x)\right] = x'\beta(1 - x'\beta)
\]
and this varies with x, unless E(y|x) is constant.

,→ Fitted values can be outside (0, 1) (see Figure 5.2):
\[
\widehat{\Pr}(y = 1|x) = x'\hat{\beta}
\]

5.4.2 Probit and Logit


We specify
\[
E(y|x) = \Pr(y = 1|x) = F(x'\beta)
\]
where F(·) is a cdf.
Motivation: y = 1 if a latent variable y* = x'β − u ≥ 0. E.g. one works if the
differential / latent utility y* is positive. Then
\[
\begin{aligned}
\Pr(y = 1|x) &= \Pr(y^* \geq 0|x) = \Pr(u \leq x'\beta|x) = F(x'\beta) \\
\Pr(y = 0|x) &= \Pr(y^* < 0|x) = \Pr(u > x'\beta|x) = 1 - F(x'\beta)
\end{aligned}
\]
where F(·) is the cdf of u (independent of x).

Probit
u ∼ N(0, 1), so F(·) = Φ(·).
σ²_u = 1 is a normalization, because if σ² is unrestricted,
\[
F(x'\beta) = \Phi\left(\frac{x'\beta}{\sigma}\right)
\]
and only β/σ is identified.

Logit
u is Logistic, F(t) = Λ(t) = 1/(1 + exp(−t)), and Var(u) = π²/3 ≈ 3.29.


Figure 5.2: Linear Probability and Probit Models: Fitted Values


Formulas
Note that Var(y|x) = F(x'β)(1 − F(x'β)).

,→ Log-likelihood:
\[
L(\beta; y|x) = y \log F(x'\beta) + (1 - y)\log\left(1 - F(x'\beta)\right)
\]

,→ Sample average log-likelihood:
\[
E_n L(\beta; y|x) = n^{-1}\sum_{i=1}^{n} \left[ y_i \log F(x_i'\beta) + (1 - y_i)\log\left(1 - F(x_i'\beta)\right) \right]
\]

,→ Score for the sample (where f is the pdf of u):
\[
\begin{aligned}
E_n L_\beta(\beta; y|x) &= n^{-1}\sum_{i=1}^{n} \left[ y_i \frac{x_i f(x_i'\beta)}{F(x_i'\beta)} - (1 - y_i)\frac{x_i f(x_i'\beta)}{1 - F(x_i'\beta)} \right] \\
&= n^{-1}\sum_{i=1}^{n} x_i \frac{f(x_i'\beta)}{F(x_i'\beta)\left(1 - F(x_i'\beta)\right)}\left(y_i - F(x_i'\beta)\right)
\end{aligned}
\]

,→ Information matrix for the sample:
\[
I(\beta|X) = n^{-1}\sum_{i=1}^{n} x_i x_i' \frac{f^2(x_i'\beta)}{F(x_i'\beta)\left(1 - F(x_i'\beta)\right)} .
\]
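A minimal numerical sketch (simulated data, illustrative names) of Probit estimation: the sample average log-likelihood is maximized with scipy, and standard errors are obtained from the information matrix formula above, using the fact that the approximate variance of β̂ is I(β̂|X)⁻¹/n.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([0.5, 1.0])
y = (X @ beta0 - rng.normal(size=n) >= 0).astype(float)    # latent-variable model with u ~ N(0,1)

def neg_avg_loglik(beta):
    p = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)      # F(x'beta), clipped for numerical stability
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

beta_hat = minimize(neg_avg_loglik, x0=np.zeros(X.shape[1]), method="BFGS").x

# Information-based standard errors, using I(beta|X) from the formula above
xb = X @ beta_hat
w = norm.pdf(xb)**2 / (norm.cdf(xb) * (1 - norm.cdf(xb)))
I_hat = (X * w[:, None]).T @ X / n
se = np.sqrt(np.diag(np.linalg.inv(I_hat) / n))
print(beta_hat, se)
```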

Exercise 5.9. Find the score function and check that its conditional expectation
is zero. Find the conditional information and the unconditional version. Check
that the information matrix equality holds.

5.4.3 Interpretation of Results


,→ There is no analytic formula for the estimator; the MLE must be determined
numerically (as in the sketch above).

,→ Since F (·) is strictly increasing (and f (·) always positive), βj > 0 implies
a positive relation between xj and y (respectively βj < 0 means negative
relation).

,→ A test of H0 : βj = 0 is a test of significance of the effect of xj on y.


,→ In a linear regression, marginal effects are constant. In Probit and Logit
models,
\[
\frac{\partial}{\partial x_j} E(y|x) = \frac{\partial}{\partial x_j} F(x'\beta) = \beta_j f(x'\beta) .
\]
Nonlinear models yield varying marginal effects (see the sketch after this list).

,→ In practice, one computes predicted probabilities and marginal effects for
particular values of the explanatory variables (e.g. for mean or median values).
For a binary explanatory variable x_k, the effect is determined as the difference
between the predicted probabilities for x_k = 1 and x_k = 0,
\[
\widehat{\Pr}(y = 1|x_k = 1, \ldots) - \widehat{\Pr}(y = 1|x_k = 0, \ldots) ,
\]
when the other explanatory variables are fixed at some values (e.g. median
values).

,→ The ratios of marginal effects are constant:
\[
\frac{\dfrac{\partial}{\partial x_j} E(y|x)}{\dfrac{\partial}{\partial x_l} E(y|x)} = \frac{\beta_j f(x'\beta)}{\beta_l f(x'\beta)} = \frac{\beta_j}{\beta_l} .
\]
So we can compare coefficients.

,→ We cannot directly compare Probit and Logit coefficients, in particular because
the variance normalization is different. But we can compare parameter
ratios and predicted probabilities. In practice, predictions are often similar,
because the two cdfs look alike.
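As announced above, a hypothetical illustration (the numerical values of β̂ and of the regressor means are made up) of marginal effects at the mean and of the probability difference for a binary regressor in a Probit:

```python
import numpy as np
from scipy.stats import norm

# Illustrative (made-up) Probit estimates and regressor means: (constant, x_2)
beta_hat = np.array([0.48, 1.02])
x_bar = np.array([1.0, 0.05])

# Marginal effect of each regressor at the mean: beta_j * f(x_bar' beta_hat)
me_at_mean = beta_hat * norm.pdf(x_bar @ beta_hat)
print(me_at_mean)

# If x_2 were binary: difference of predicted probabilities at x_2 = 1 and x_2 = 0,
# holding the other regressors at chosen values (here, their means)
x1, x0 = x_bar.copy(), x_bar.copy()
x1[1], x0[1] = 1.0, 0.0
print(norm.cdf(x1 @ beta_hat) - norm.cdf(x0 @ beta_hat))
```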

5.5 Tests
We will assume that the null hypothesis of interest is H0: r(θ0) = 0 for some
function r(·) from R^p into R^q; that is, there are q restrictions on the p parameters
in θ0. We will look at the trinity of tests.

5.5.1 Wald Test


Principle (already known): Is r(θ̂) close to 0? The Wald test statistic is
\[
W = r(\hat{\theta})' \left[\widehat{Var}\left(r(\hat{\theta})\right)\right]^{-1} r(\hat{\theta}) .
\]


Assume r(·) is differentiable and
\[
R(\theta) = \frac{\partial}{\partial\theta'} r(\theta)
\]
is such that R(θ0) has full rank, which means that there are no redundant constraints.
Then the delta method yields
\[
Var\left(r(\hat{\theta})\right) \approx n^{-1} R(\theta_0)\, I^{-1}(\theta_0)\, R(\theta_0)' .
\]
We use
\[
\widehat{Var}\left(r(\hat{\theta})\right) = n^{-1} R(\hat{\theta})\, \hat{I}^{-1} R(\hat{\theta})' .
\]
Then
\[
W = n\, r(\hat{\theta})' \left[R(\hat{\theta})\, \hat{I}^{-1} R(\hat{\theta})'\right]^{-1} r(\hat{\theta}) ,
\]
where Î is a consistent estimator of I(θ0), see Section 5.3.3.

We usually use an estimator that relies on the MLE θ̂, so that the test statistic
requires only the unconstrained estimator to be computed.
Under H0, W →d χ²_q, and the rejection region is W > χ²_{q,1−α}.

Linear restrictions when data generating process is normal


We have seen that for testing H0: Rβ = c,
\[
\begin{aligned}
W &= \left(R\hat{\beta} - c\right)' \left[\widehat{Var}\left(R\hat{\beta} - c\right)\right]^{-1} \left(R\hat{\beta} - c\right) \\
&= \frac{\left(R\hat{\beta} - c\right)' \left(R(X'X)^{-1}R'\right)^{-1} \left(R\hat{\beta} - c\right)}{\hat{\sigma}^2} = \frac{RSS_R - RSS}{RSS/n} .
\end{aligned}
\]

,→ In the linear model, we used s² = RSS/(n−K) to estimate σ², so that we had
\[
W = qF = \frac{RSS_R - RSS}{RSS/(n - K)} .
\]

,→ Here, we have used the MLE σ̂² = RSS/n, so there is a slight difference in the
formula. This should not matter much in practice if n is large.


5.5.2 Score Test


The constrained (or restricted) estimator can be obtained as
\[
\hat{\theta}_R = \arg\max_{\theta \in \Theta_R} E_n L(\theta)
\]
where Θ_R is the subspace of parameters satisfying the constraints.


The Score test statistic, due to Rao, evaluates whether the full set of likelihood
equations is close to 0 at the restricted estimator, that is
\[
S = n\, \left[E_n L_\theta(\hat{\theta}_R)\right]' \hat{I}^{-1} \left[E_n L_\theta(\hat{\theta}_R)\right] .
\]
The test is also known as the Lagrange Multiplier test, because S can be rewritten
as a function of the Lagrange multipliers of the constrained problem. The test
statistic is then also labeled LM.
We usually use for Î a consistent estimator that depends only on the restricted
estimator θ̂_R. This estimator will then be consistent under H0 (but not necessarily
under H1).
One can show that under H0,
\[
S - W \xrightarrow{p} 0 \quad \Rightarrow \quad S \xrightarrow{d} \chi^2_q .
\]
The rejection region is S > χ²_{q,1−α}.

Linear restrictions when data generating process is normal


Under the restrictions H0: Rβ = c,
\[
\begin{aligned}
E_n L_\beta(\hat{\theta}_R) &= n^{-1}\sum_{i=1}^{n} \frac{x_i (y_i - x_i'\hat{\beta}_R)}{\hat{\sigma}_R^2} \\
&= \frac{1}{n\hat{\sigma}_R^2} R' \left(R(X'X)^{-1}R'\right)^{-1} \left(R\hat{\beta} - c\right) \\
E_n L_{\sigma^2}(\hat{\theta}_R) &= 0 .
\end{aligned}
\]
Then
\[
S = \frac{\left(R\hat{\beta} - c\right)' \left(R(X'X)^{-1}R'\right)^{-1} \left(R\hat{\beta} - c\right)}{\hat{\sigma}_R^2} = n\, \frac{RSS_R - RSS}{RSS_R}
\]
where σ̂²_R = RSS_R/n is the MLE for σ² in the restricted model.
Note that in practice some of the components of the average score E_n Lβ(θ̂_R) will
be zero. For instance, if we impose β2 = 0, then
\[
E_n L_{\beta_1}(\hat{\theta}_R) = 0 .
\]

5.5.3 Likelihood Ratio Test


Evaluates the difference between constrained and unconstrained maximum log-
likelihoods, so that
\[
LR = 2n\left[E_n L(\hat{\theta}) - E_n L(\hat{\theta}_R)\right] = 2\log\left(\frac{\prod_{i=1}^{n} f(y_i|x_i, \hat{\theta})}{\prod_{i=1}^{n} f(y_i|x_i, \hat{\theta}_R)}\right) .
\]
We need to deal with the two problems, constrained and unconstrained, but we
don't need to choose how to evaluate the information I(θ0).
One can show that under H0,
\[
LR - W \xrightarrow{p} 0 \quad \Rightarrow \quad LR \xrightarrow{d} \chi^2_q .
\]
The rejection region is LR > χ²_{q,1−α}.

Linear restrictions when data generating process is normal


Under the restrictions H0: Rβ = c,
\[
\begin{aligned}
E_n L(\hat{\theta}) &= -\frac{1}{2}\log(2\pi\hat{\sigma}^2) - \frac{n^{-1}\|y - X\hat{\beta}\|^2}{2\hat{\sigma}^2} = -\frac{1}{2}\log(2\pi\hat{\sigma}^2) - \frac{1}{2} \\
E_n L(\hat{\theta}_R) &= -\frac{1}{2}\log(2\pi\hat{\sigma}_R^2) - \frac{n^{-1}\|y - X\hat{\beta}_R\|^2}{2\hat{\sigma}_R^2} = -\frac{1}{2}\log(2\pi\hat{\sigma}_R^2) - \frac{1}{2} \\
LR &= n \log\frac{\hat{\sigma}_R^2}{\hat{\sigma}^2} = n \log\frac{RSS_R}{RSS} .
\end{aligned}
\]
The three asymptotic tests use the same rejection region, but not the
same statistics. So their outcomes may not coincide.
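A minimal sketch (simulated data) computing the three statistics with the RSS-based formulas above for the restriction H0: β2 = 0 in a normal linear model, and comparing them to the χ²₁ critical value.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.0]) + rng.normal(size=n)    # H0: beta_2 = 0 holds in this simulation

b_u = np.linalg.lstsq(X, y, rcond=None)[0]           # unrestricted OLS/ML
RSS = np.sum((y - X @ b_u)**2)
Xr = X[:, [0]]                                       # restricted model drops x_2
b_r = np.linalg.lstsq(Xr, y, rcond=None)[0]
RSS_R = np.sum((y - Xr @ b_r)**2)

W  = n * (RSS_R - RSS) / RSS
S  = n * (RSS_R - RSS) / RSS_R
LR = n * np.log(RSS_R / RSS)
print(W, LR, S, chi2.ppf(0.95, df=1))   # here W >= LR >= S, so the decisions can differ
```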

Exercise 5.10. In the exponential model, consider testing the null hypothesis
that θ = θ0 against θ ≠ θ0. Show that
\[
\begin{aligned}
W &= n\, \frac{(\bar{y} - \theta_0)^2}{\bar{y}^2} \\
S &= n\, \frac{(\bar{y} - \theta_0)^2}{\theta_0^2} \\
LR &= 2n\left[\log\left(\frac{\theta_0}{\bar{y}}\right) + \frac{\bar{y} - \theta_0}{\theta_0}\right] .
\end{aligned}
\]
What are the decision rules for each test?


5.5.4 Invariance
There may be different ways of writing the restrictions, e.g. exp(β1 ) − 1 = 0 ⇔
β1 = 0. One would like to obtain the same decision in any case: this property is
called invariance. But this is not always the case. It is true for the LR test, not
true for the Wald test, and true in some cases for the Score test.

The Wald test is not invariant


See Exercise 3.10 and Appendix A.4.

The LR test is invariant

For two equivalent restrictions r1(θ) = 0 and r2(θ) = 0, the value of the maximum
average log-likelihood E_n L(θ̂_R) is also the same.
This is related to the invariance of MLE.

The Score test may be invariant


For two equivalent restrictions r1(θ) = 0 and r2(θ) = 0, the constrained MLE θ̂_R
does not change. Then E_n Lθ(θ̂_R) does not change either.
But the invariance of the Score test depends on how we estimate the information
matrix.

,→ If we use the outer-product of gradients form, we have invariance.

,→ If we use the empirical average of the second derivatives, we generally lose
invariance.

5.5.5 Consistency
The Wald test is consistent
To see it, assume r(θ0) ≠ 0; then
,→ r(θ̂) →p r(θ0) ≠ 0
,→ R(θ̂) Î⁻¹ R(θ̂)' →p R(θ0) I⁻¹(θ0) R(θ0)', which is positive definite.
,→ Hence W →p ∞.
,→ Hence Pr[W > χ²_{q,1−α}] → 1.


The LR test is consistent


Moreover, the LR test is optimal (i.e. with the highest power) in simple situations.

Lemma 5.5.1 (Neyman-Pearson Lemma). For a non-conditional model, i.e.
{f(y|θ), θ ∈ Θ}, and two simple hypotheses
H0: θ = θ0 against H1: θ = θ1,
the LR test is the most powerful test of H0 against H1.

The Score test is consistent in usual cases


For example, testing restrictions in the linear model. But if there are several local
maxima in the log-likelihood, that is several values of θ such that E Lθ (θ) = 0,
then the Score test may fail to be consistent.

5.5.6 Confidence Region


Definition 9. A confidence region for θ at asymptotic confidence level 1 − α is
the set of values θ0 that cannot be rejected by a test of H0 : θ = θ0 at asymptotic
level α.

,→ We can consider confidence regions associated with any test of the trinity.

,→ Practitioners most often use Wald to derive confidence intervals, because
the test is easier to apply.

,→ Might not be the best choice, as it is not invariant.

Appendix A

Further and Deeper Details

A.1 Proof of the Strict Expected Log-Likelihood Inequality, Lemma 5.2.1
,→ This proof uses Jensen's Inequality: if z is a random variable and h(·) is a
concave function, then E(h(z)) ≤ h(E(z)).
,→ Application of Jensen's Inequality to our framework:
– The random variable is l(θ; y|x)/l(θ0; y|x)
– The concave function is log(·)
,→ Therefore:
\[
E\left[\log\left(\frac{l(\theta; y|x)}{l(\theta_0; y|x)}\right)\right] < \log\left(E\left[\frac{l(\theta; y|x)}{l(\theta_0; y|x)}\right]\right) \quad \text{for } \theta \neq \theta_0
\]
Note: E is the expectation with respect to the true distribution.
Note: we have the strict inequality because Identification ensures that l(θ; y|x) ≠
l(θ0; y|x) with positive probability for θ ≠ θ0, so the ratio l(θ; y|x)/l(θ0; y|x) cannot
be (almost surely) constant.
,→ The right-hand side of the inequality is the following:
\[
E\left[\frac{l(\theta; y|x)}{l(\theta_0; y|x)}\right] = \int_y \frac{l(\theta; y|x)}{l(\theta_0; y|x)}\, l(\theta_0; y|x)\, dy = \int_y l(\theta; y|x)\, dy = 1
\]

,→ The inequality becomes:
\[
\begin{aligned}
E\left[\log\left(\frac{l(\theta; y|x)}{l(\theta_0; y|x)}\right)\right] &< \log(1) = 0 \\
E\left[\log\left(l(\theta; y|x)\right)\right] - E\left[\log\left(l(\theta_0; y|x)\right)\right] &< 0 \\
E\left[L(\theta; y|x)\right] &< E\left[L(\theta_0; y|x)\right]
\end{aligned}
\]

A.2 Proof of Lemma 5.2.2


For all θ,
\[
\int f(y|x; \theta)\, dy = 1 \quad \Rightarrow \quad \frac{\partial}{\partial\theta}\int f(y|x; \theta)\, dy = 0 .
\]
We have
\[
L_\theta(\theta) = \frac{\partial}{\partial\theta} L(\theta) = \frac{f_\theta(y|x; \theta)}{f(y|x; \theta)} .
\]
Therefore,
\[
E\left[L_\theta(\theta)\,|\,x\right] = \int \frac{f_\theta(y|x; \theta)}{f(y|x; \theta)}\, f(y|x; \theta_0)\, dy .
\]
\[
\Rightarrow \quad E\left[L_\theta(\theta_0)\,|\,x\right] = \int \frac{f_\theta(y|x; \theta_0)}{f(y|x; \theta_0)}\, f(y|x; \theta_0)\, dy
= \int f_\theta(y|x; \theta_0)\, dy = \frac{\partial}{\partial\theta}\int f(y|x; \theta_0)\, dy = 0 .
\]

A.3 Proof of Lemma 5.2.3


From the proof in A.2, we have:
\[
E\left[L_\theta(\theta_0)\,|\,x\right] = \int L_\theta(\theta_0)\, f(y|x; \theta_0)\, dy = 0 .
\]
Differentiating the integrand with respect to θ (and using shortcut notations) gives:
\[
\begin{aligned}
\frac{\partial (L_\theta f)}{\partial\theta'} &= \frac{\partial L_\theta}{\partial\theta'}\, f + L_\theta\, \frac{\partial f}{\partial\theta'} \\
&= L_{\theta\theta}\, f + L_\theta\, (f_\theta)' \\
&= L_{\theta\theta}\, f + L_\theta L_\theta'\, f \quad \text{since } L_\theta = \frac{f_\theta}{f} \\
&= (L_{\theta\theta} + L_\theta L_\theta')\, f .
\end{aligned}
\]
Using Assumption Differentiability and differentiating both sides of the identity above, we get:
\[
\int L_\theta(\theta_0)\, f(y|x; \theta_0)\, dy = 0 \quad \Rightarrow \quad \int \left(L_{\theta\theta}(\theta_0) + L_\theta(\theta_0) L_\theta'(\theta_0)\right) f(y|x; \theta_0)\, dy = 0 ,
\]
which is equivalent to
\[
-E\left[L_{\theta\theta}(\theta_0; y|x)\,|\,x\right] = E\left[L_\theta(\theta_0; y|x)\, L_\theta'(\theta_0; y|x)\,|\,x\right] = Var\left[L_\theta(\theta_0; y|x)\,|\,x\right] .
\]
Since the expected Hessian equals minus a variance matrix (which is positive semidefinite),
it is negative semidefinite.


A.4 Non invariance of the Wald test


For a scalar θ0, consider H0: θ0 = 0.
We know that the MLE satisfies √n(θ̂ − θ0) →d N(0, I⁻¹(θ0)). We will assume for
simplicity that we know exactly I⁻¹(θ0) = ω².
The Wald test statistic for H0: θ0 = 0 is then
\[
W = n\hat{\theta}^2 / \omega^2 .
\]
Choose some α > 0; then H0′: θ0/(α − θ0) = 0 is equivalent to H0. What is the Wald test
statistic for this hypothesis?
As
\[
\frac{\partial}{\partial\theta}\left(\frac{\theta}{\alpha - \theta}\right) = \frac{\alpha}{(\alpha - \theta)^2} ,
\]
the delta method yields
\[
\sqrt{n}\left(\frac{\hat{\theta}}{\alpha - \hat{\theta}} - \frac{\theta_0}{\alpha - \theta_0}\right) \xrightarrow{d} N\left(0, \frac{\alpha^2\omega^2}{(\alpha - \theta_0)^4}\right) .
\]
The Wald test statistic for H0′: θ0/(α − θ0) = 0 is then
\[
W' = n\, \frac{\left(\hat{\theta}/(\alpha - \hat{\theta})\right)^2}{\alpha^2\omega^2/(\alpha - \hat{\theta})^4} = W\left(1 - \frac{\hat{\theta}}{\alpha}\right)^2 .
\]

For α close to θ̂, W' is "as small as we like."
For α close to 0, W' is "as large as we like." So we can obtain any outcome by changing
the null hypothesis.
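A tiny numerical illustration of the formula above (made-up values of n, θ̂ and ω²): the same sample leads to very different Wald statistics depending on how the null hypothesis is written.

```python
import numpy as np

n, theta_hat, omega2 = 100, 0.30, 1.0          # illustrative values
W = n * theta_hat**2 / omega2                  # Wald statistic for H0: theta_0 = 0
for alpha in (0.31, 1.0, 10.0, 0.01):
    W_prime = W * (1 - theta_hat / alpha)**2   # Wald statistic for the equivalent H0'
    print(alpha, W_prime)                      # tiny when alpha is close to theta_hat, huge when alpha is close to 0
```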

