Contents

1.4 Expectation
3.1 Consistency
3.2 Estimators
4 Hypothesis Testing
4.3.1 χ² Tests
5 Linear Models
Introduction
I will attempt a project this term: to LaTeX the lecture notes continuously, and to send them out to some people each week for revision purposes. At the end of Week 10, I should have a full set of lecture notes.
Credit goes to Professor Wilfrid Kendall, Dr Jon Warren and Dr Bärbel Finkenstädt for the lecture notes.
Thanks goes to Vainius Indilas for pointing out the many mistakes I have made.
Disclaimer: Nothing here is guaranteed to be 100% accurate. If in doubt, check your own
lecture notes.
Often, we are interested in several random variables, which may or may not be related to each other. We want to answer certain questions about them, such as finding out how they vary, or in which ranges their values lie, and so on.
We start by giving a few examples of two random variables that we may be interested in.
Note that this can be extended to three, four, ... random variables, but we keep the examples
at two random variables to make things simpler.
2) Pick a random member of the Warwick staff, and measure their blood pressure. We
denote their systolic blood pressure as X1 , and their diastolic blood pressure as X2 . Both
these random variables X1 and X2 are continuous random variables.
3) We wish to find the number of people in a sample of n theatre patrons who a) live more than 25 miles away from a theatre (denoted by X), and b) are older than 60 years of age (denoted by Y). The random variables X and Y are both discrete random variables.
4) We want to find the estimates of mean and variance for n independent identically dis-
tributed random variables, X1 , X2 , ..., Xn . Then we want to find (µ̂, σ̂ 2 ), where µ̂ and σ̂ 2 are
the random variables we are interested in.
5) We let θ ∼ Uniform(0, 1) and X|θ ∼ Bin(n, θ). Note that while this binomial distribution depends on θ, the random variable θ is continuous whereas the random variable X is discrete.
We will now cover concepts that describe how the probability of these random variables is distributed over one region as opposed to another.
Definition 1.1. The bivariate cumulative distribution function (c.d.f.), also called
the joint c.d.f., is defined as
FX,Y (x, y) = P[X ≤ x, Y ≤ y]
for any x, y ∈ R. This definition holds no matter whether X, Y are discrete or continuous.
Given λ₁, λ₂, I get a particular list of probabilities for various outcomes (x₁, x₂) under the model of Exercise 1.1. If I choose λ, θ so that λ₁ = λθ and λ₂ = λ(1 − θ), then under the model of Exercise 1.2, I get exactly the same list of probabilities.
So p.m.f.s are the same, and we are not going to be able to tell the difference between the
two models by observing pig litters.
A discrete random variable has a countable state space (without loss of generality repre-
sentable using N = {0, 1, 2, 3, ...}) and so we can always tabulate a bivariate p.m.f.
           y = 0            y = 1            y = 2            ⋯
x = 0      P_{X,Y}(0, 0)    P_{X,Y}(0, 1)    P_{X,Y}(0, 2)    ⋯
x = 1      P_{X,Y}(1, 0)    P_{X,Y}(1, 1)    P_{X,Y}(1, 2)    ⋯
x = 2      P_{X,Y}(2, 0)    P_{X,Y}(2, 1)    P_{X,Y}(2, 2)    ⋯
x = 3      ⋮                ⋮                ⋮                ⋱
\[
P[(X, Y) \in E] = \sum_{(x,y) \in E} P[X = x, Y = y] = \sum_{(x,y) \in E} P_{X,Y}(x, y)
\]
Exercise 1.4. Suppose we work with the model of Exercise 1.2 with N ∼ Po(λ) and X₁|N ∼ Bin(N, θ). Then
\[
\begin{aligned}
P[X_1 = x_1] &= \sum_{x_2 = 0}^{\infty} f_{X_1,X_2}(x_1, x_2)
= \sum_{x_2 = 0}^{\infty} \frac{(\lambda\theta)^{x_1} (\lambda(1-\theta))^{x_2}}{x_1!\, x_2!}\, e^{-\lambda} \\
&= \frac{(\lambda\theta)^{x_1}}{x_1!}\, e^{-\lambda} \underbrace{\sum_{x_2 = 0}^{\infty} \frac{(\lambda(1-\theta))^{x_2}}{x_2!}}_{e^{\lambda(1-\theta)}}
= \frac{(\lambda\theta)^{x_1}}{x_1!}\, e^{-\lambda\theta}
\end{aligned}
\]
So we verify X₁ ∼ Po(λθ).
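This "thinning" property is easy to check empirically. Below is a minimal simulation sketch (not from the original notes; it assumes numpy is available, and the values of λ and θ are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, theta, reps = 5.0, 0.3, 200_000

# Simulate N ~ Po(lambda), then X1 | N ~ Bin(N, theta).
N = rng.poisson(lam, size=reps)
X1 = rng.binomial(N, theta)

# For a Poisson distribution the mean equals the variance, so both
# sample moments of X1 should be close to lambda * theta = 1.5.
print(X1.mean(), X1.var())
```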
Definition 1.3. The random variables X and Y have an (absolutely) continuous joint distribution if there is a function (a joint density) f_{X,Y} = f with
\[
P[(X, Y) \in A] = \iint_A f(x, y)\, dx\, dy
\]
for all A ⊆ ℝ². We say f_{X,Y}(x, y) is the joint probability density of X, Y.
Remarks:
• Since probability is non-negative, f_{X,Y}(x, y) ≥ 0 everywhere.
• ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.
• f(x, y) could always be changed at a countable set of points without changing any of the integrals ∬_A f(x, y) dx dy.
i) What is c?

We know that ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1. So we integrate the p.d.f. to see what we get:
\[
\begin{aligned}
\int_0^1\!\int_0^1 f(x, y)\, dx\, dy &&& (f \text{ is zero elsewhere})\\
&= \int_0^1\!\int_0^y f(x, y)\, dx\, dy && (\text{as } x < y)\\
&= \int_0^1\!\int_0^y c\, dx\, dy
= c \int_0^1 y\, dy = c/2.
\end{aligned}
\]
Since this must equal 1, we get c = 2.
Example 1.2. Suppose we have x₁, x₂, ..., xₙ being independent identically distributed random variables distributed as N(µ, σ²).

X ∼ N(µ_S, σ_S²),  Y | X ∼ N(α + βX, σ_D²).

In fact, we will see f_{X,Y}(x, y) = f_X(x) f_{Y|X=x}(y|x), which will be discussed later.
Definition 1.5. F_X(x) = lim_{y→∞} F_{X,Y}(x, y) is the marginal cumulative distribution function of X.

Discrete case: P_X(x) = P[X = x] is the marginal p.m.f.
Continuous case: f_X(x) = (d/dx) F_X(x) is the marginal p.d.f.
\[
\begin{aligned}
f_X(x) = \frac{d}{dx} F_X(x)
&= \frac{d}{dx} \int_{-\infty}^{x} \int_{-\infty}^{\infty} f_{X,Y}(a, y)\, dy\, da \\
&= \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy
\end{aligned}
\]
Very similar results hold for marginal p.m.f.s and marginal densities of joint distributions of three or more random variables.
We pick a coin at random, each with probability 1/5, and then toss it twice.
Let θ = P[Head | coin] (so θ varies depending on which coin has been picked), and let
X = number of heads observed.
           X = 0                  X = 1                      X = 2
θ = 0      (1/5) × 1              (1/5) × 0                  (1/5) × 0
θ = 1/4    (1/5) × (3/4)(3/4)     (1/5) × 2 × (1/4)(3/4)     (1/5) × (1/4)(1/4)
θ = 1/2    (2/5) × (1/2)(1/2)     (2/5) × 2 × (1/2)(1/2)     (2/5) × (1/2)(1/2)
θ = 1      (1/5) × 0              (1/5) × 0                  (1/5) × 1
We calculate out the values, and then tabulate the marginal p.m.f.s of X and θ as well. We call the p.m.f. marginal, because it appears in the margins of the columns and rows.

The top row gives the possible values of X, and the left-hand column gives the possible values of θ. The right-hand column gives the probability of picking up a particular coin, i.e. the marginals of θ.
                  X = 0     X = 1     X = 2     marginal of θ
θ = 0             16/80     0         0         1/5
θ = 1/4           9/80      6/80      1/80      1/5
θ = 1/2           8/80      16/80     8/80      2/5
θ = 1             0         0         16/80     1/5
marginal of X     33/80     22/80     25/80
Definition 1.6. If X, Y have a discrete joint distribution with p.m.f. PX,Y (x, y), then the
conditional probability mass function of Y given X = x is
\[
P_{Y|X}(y|x) = \frac{P_{X,Y}(x, y)}{P_X(x)}
\]
\[
P_{X|\theta=\frac14}\big(x \,\big|\, \theta = \tfrac14\big) = \frac{P_{X,\theta}(x, \frac14)}{P_\theta(\frac14)}
\]
We refer to the θ = 1/4 row of the joint p.m.f. table above and divide by its marginal entry 1/5 to give us probabilities summing up to 1. Therefore, what we get is:

x                         0        1        2
P_{X,θ}(x, 1/4)           9/80     6/80     1/80
P_{X|θ=1/4}(x | 1/4)      9/16     6/16     1/16

So the p.m.f. is P[X = 0] = 9/16, P[X = 1] = 6/16 and P[X = 2] = 1/16.
Exercise 1.8. In Exercise 1.6, find the conditional distribution of θ | X = 0.
\[
P_{\theta|X=0}(\theta \mid x = 0) = \frac{P_{\theta,X}(\theta, 0)}{P_X(0)}
\]
We refer to the X = 0 column of the joint p.m.f. table above and divide by its marginal entry 33/80 to give us probabilities summing up to 1. Therefore, what we get is:

θ                       0        1/4      1/2      1
P_{θ,X}(θ, 0)           16/80    9/80     8/80     0
P_{θ|X=0}(θ | 0)        16/33    9/33     8/33     0

So the p.m.f. is P[θ = 0] = 16/33, P[θ = 1/4] = 9/33, P[θ = 1/2] = 8/33 and P[θ = 1] = 0.
In this case, P[X = x] = 0, so a p.m.f.-type definition leads to 0/0, which is BAD.
and by the formula for the marginal density, it is immediate that ∫_{−∞}^{∞} f_{Y|X}(y|X = x) dy = 1.
Exercise 1.9. Show that the random variables X and Y are independent according to (†)
if and only if
FX,Y (x, y) = FX (x)FY (y) (††)
For (†),
P[X ∈ A, Y ∈ B] = P[X ∈ A] P[Y ∈ B]
Similarly, (d/dy) F_Y(y) = f_Y(y), so
\[
\frac{\partial^2}{\partial x\, \partial y}\, F_X(x) F_Y(y) = f_X(x) f_Y(y).
\]
On the other hand,
\[
\frac{\partial^2}{\partial x\, \partial y}\, F_{X,Y}(x, y)
= \frac{\partial^2}{\partial x\, \partial y} \int_{-\infty}^{x}\!\int_{-\infty}^{y} f_{X,Y}(a, b)\, db\, da
= \frac{\partial}{\partial x} \int_{-\infty}^{x} f_{X,Y}(a, y)\, da
= f_{X,Y}(x, y).
\]
It is now immediate that if X, Y are independent with joint density, then fX|Y (x|Y = y) =
fX (x).
Vector notation can be helpful, with X = (X₁, ..., Xₙ)ᵀ.
We see that Definition 1.8 is very similar to the definition for the bivariate case. In
particular, the marginal and conditional distribution functions are similar.
Definition 1.9. If X = (X₁, X₂, ..., Xₙ) has a discrete distribution (only a countable range of values), then its probability mass function is P_X(x) = P[X₁ = x₁, ..., Xₙ = xₙ].

Definition 1.11. The f_{X₁,X₂,...,Xₙ}(x₁, x₂, ..., xₙ) in Definition 1.10 is called the multivariate probability density function (multivariate p.d.f.).
For random vectors in ℝⁿ:
• The c.d.f. of (X1 , X2 , ...Xn ) at x = (x1 , x2 , ..., xn ) is FX (x) = FX1 ,X2 ,...,Xn (x1 , x2 , ..., xn ).
• The p.m.f. of (X1 , X2 , ...Xn ) at x = (x1 , x2 , ..., xn ) in the discrete case is PX (x) =
PX1 ,X2 ,...,Xn (x1 , x2 , ..., xn ).
• The p.d.f. of (X1 , X2 , ...Xn ) at x = (x1 , x2 , ..., xn ) in the continuous case is fX (x) =
fX1 ,X2 ,...,Xn (x1 , x2 , ..., xn ).
• The marginal c.d.f. of Xⱼ is Fⱼ(xⱼ) = lim_{xᵢ→∞, i≠j} F_{X₁,X₂,...,Xₙ}(x₁, x₂, ..., xₙ).
• The marginal p.m.f. of Xⱼ is Pⱼ(xⱼ) = Σ_{x₁} ⋯ Σ_{xⱼ₋₁} Σ_{xⱼ₊₁} ⋯ Σ_{xₙ} P_{X₁,X₂,...,Xₙ}(x₁, x₂, ..., xₙ).
• The marginal p.d.f. for both Xj and Xk is fj,k (xj , xk ) where we integrate out all variables
excluding xj , xk .
• The marginal p.m.f. for both Xj and Xk is Pj,k (xj , xk ) where we sum all variables excluding
xj , xk .
1.4 Expectation
Recall that if a discrete distribution X has a probability mass function P_X(x) = P[X = x], then E[X] = Σₓ x P_X(x).

Similarly, if a continuous distribution X has a probability density function f_X(x), then E[X] = ∫_{−∞}^{∞} x f_X(x) dx.

More generally, for a discrete distribution X, E[h(X)] = Σₓ h(x) P_X(x), and for a continuous distribution X, E[h(X)] = ∫_{−∞}^{∞} h(x) f_X(x) dx.
• Var[aX + b] = a² Var[X]
• Cor[X₁, X₂] = Cov[X₁, X₂] / √(Var[X₁] Var[X₂]) = Cov[X₁, X₂] / (s.d.[X₁] s.d.[X₂])
Definition 1.12. (Multivariate expectation) Given an n-dimensional vector X = (X₁, X₂, ..., Xₙ),
E[h(X)] = E[h(X₁, X₂, ..., Xₙ)].
• Discrete case: E[h(X)] = Σₓ h(x) P_X(x), where P_X(x) is the joint p.m.f. of X.
• Continuous case: E[h(X)] = ∫_{ℝⁿ} h(x) f_X(x) dx, where f_X(x) is the joint p.d.f. of X, i.e.
\[
E[h(X)] = \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}}_{n \text{ integrals}} h(x_1, \ldots, x_n)\, f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\, dx_1 \cdots dx_n.
\]
Then the discriminant of the quadratic equation must be negative, as there are no real roots, i.e.
\[
(2E[XY])^2 - 4E[X^2]E[Y^2] < 0 \;\Rightarrow\; E[XY] < \sqrt{E[X^2]\,E[Y^2]}.
\]
Case 2: h(t) = 0 for a single unique value of t, say t₀. So the quadratic has repeated roots. Then
\[
(2E[XY])^2 - 4E[X^2]E[Y^2] = 0 \;\Rightarrow\; E[XY] = \sqrt{E[X^2]\,E[Y^2]}.
\]
This is the Cauchy-Schwarz inequality but with the equality holding.
We usually have a common practical problem: say we have two random variables X₁ and X₂ which may be related to each other. If we observe X₂ = x₂, what is the expected value of X₁? This is the conditional expectation E[X₁ | X₂ = x₂].
Comments:
1) E[X₁|X₂] is really random because it depends on the value of X₂.
2) If we think of E[X1 |X2 = x2 ] as a function of x2 , then it describes a curve: the regression
curve. For example: a regression line E[X1 |X2 = x2 ] = β0 + β1 x2 .
Since E[X1 |X2 ] is really random, we can ask questions about its distribution.
Theorem 1.2. The marginal expectation theorem states that E[E[X₁|X₂]] = E[X₁].
We follow the same argument to prove the discrete case, but with sums instead of integrals.
Exercise 1.12. Suppose X ∼ Uniform(0, 1) and Y | X = x ∼ Uniform(x, 1). Find E[Y | X = x] and hence show E[Y] = 3/4.
Since Y | X = x is uniform on (x, 1), E[Y | X = x] = (1 + x)/2.
So
\[
E[Y] = E[E[Y|X]] = E\Big[\frac{1+X}{2}\Big] = \frac12 + \frac12 E[X].
\]
As X ∼ Uniform(0, 1), E[X] = 1/2, so E[Y] = 1/2 + (1/2)(1/2) = 3/4.
Exercise 1.13. Suppose Θ ∼ Uniform(0, 1) and X | Θ = θ ∼ Binomial(2, θ). Find E[X | Θ] and hence show E[X] = 1.

E[X | Θ = θ] = 2θ, so E[X] = E[2Θ] = 2 E[Θ] = 2 × 1/2 = 1.
Proof. The proof is exactly the same as in Theorem 1.2, but we replace x1 by h(x1 )
instead.
Vital special case: h(X1 ) = (X1 − µ1 )2 where µ1 = E[X1 ]. Then we will get the following
theorem.
Theorem 1.4. (Marginal variance theorem) Var[X₁] = E[Var[X₁|X₂]] + Var[E[X₁|X₂]].

Proof. We note that Var[X₁|X₂ = x₂] = E[X₁²|X₂ = x₂] − (E[X₁|X₂ = x₂])², so
E[Var[X₁|X₂]] = E[E[X₁²|X₂]] − E[(E[X₁|X₂])²].  (1)
Also
Var[E[X₁|X₂]] = E[(E[X₁|X₂])²] − (E[E[X₁|X₂]])².  (2)
(1) + (2) gives E[Var[X₁|X₂]] + Var[E[X₁|X₂]] = E[E[X₁²|X₂]] − (E[E[X₁|X₂]])².
So using Theorem 1.2 and Theorem 1.3, we get E[Var[X₁|X₂]] + Var[E[X₁|X₂]] = E[X₁²] − (E[X₁])² = Var[X₁].
Exercise 1.14. We recall Exercise 1.13, where Θ ∼ Uniform(0, 1) and X | Θ = θ ∼ Binomial(2, θ). Show that Var[X] = 2/3 and comment on the effect of observing Θ = θ.

Now E[X | Θ = θ] = 2θ, so
\[
\mathrm{Var}[E[X|\Theta]] = E[(2\Theta)^2] - (E[2\Theta])^2 = \int_0^1 4\theta^2\, d\theta - 1 = \frac43 - 1 = \frac13.
\]
On average, E[Var[X|Θ]] = 1/3.

So we have the variance of X reducing from 2/3 to an average of 1/3 if we observe Θ.
Recall that a random variable X is normal with mean µ and variance σ², distributed as N(µ, σ²), if it has p.d.f.
\[
f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big[\frac{-(x-\mu)^2}{2\sigma^2}\Big],
\]
with parameters µ ∈ ℝ, σ > 0.
Let Z₁, ..., Zₙ be independent random variables, each N(0, 1). As they are independent, their joint p.d.f. is a product of their individual p.d.f.s. We can write
• Z ∼ MVN(0, I)
• Z ∼ MVNₙ(0, I)
• Z ∼ Nₙ(0, I)
which all mean the same thing.
We can use this notation to produce MVN random variables with different means, variances,
and covariances.
Question: How do we obtain joint p.d.f.s of random variables produced in this kind of way?
If the transformation functions rᵢ are smooth, there is a formula for the joint p.d.f. of Y.
As the transformation is 1 to 1, we can invert it to get:
X1 = s1 (Y1 , Y2 , . . . , Yn )
X2 = s2 (Y1 , Y2 , . . . , Yn )
..
.
Xn = sn (Y1 , Y2 , . . . , Yn )
Now it can be shown that the joint p.d.f. of Y is
\[
f_Y(y) = \begin{cases} |J|\, f_X(s(y)) & \text{for possible } y \\ 0 & \text{otherwise.} \end{cases}
\]
Compute the joint p.d.f. of Y₁, Y₂ where Y₁ = X₁/X₂ and Y₂ = X₁X₂.

Step 1: Invert the transformation: X₁ = s₁(Y₁, Y₂) = √(Y₁Y₂) and X₂ = s₂(Y₁, Y₂) = √(Y₂/Y₁).

[Figures: shaded regions showing where the values of (x₁, x₂) and (y₁, y₂) must lie; the (y₁, y₂) region is bounded by the curves y₁ = y₂ and y₁y₂ = 1.]
Step 2: We calculate J.
\[
J = \det \begin{pmatrix} \frac{\partial s_1}{\partial y_1} & \frac{\partial s_1}{\partial y_2} \\[4pt] \frac{\partial s_2}{\partial y_1} & \frac{\partial s_2}{\partial y_2} \end{pmatrix}
= \det \begin{pmatrix} \frac12 \sqrt{\frac{y_2}{y_1}} & \frac12 \sqrt{\frac{y_1}{y_2}} \\[4pt] -\frac12 \sqrt{\frac{y_2}{y_1^3}} & \frac12 \frac{1}{\sqrt{y_1 y_2}} \end{pmatrix}
= \frac12\sqrt{\frac{y_2}{y_1}} \cdot \frac12\frac{1}{\sqrt{y_1 y_2}} - \Big({-\frac12\sqrt{\frac{y_2}{y_1^3}}}\Big) \cdot \frac12\sqrt{\frac{y_1}{y_2}}
= \frac{1}{2y_1}.
\]
Hence
\[
f_{Y_1,Y_2}(y_1, y_2) = \begin{cases} \dfrac{1}{2y_1} \times 4\sqrt{y_1 y_2} \times \sqrt{\dfrac{y_2}{y_1}} = \dfrac{2y_2}{y_1} & \text{for } y_1, y_2 \text{ in the appropriate ranges} \\[4pt] 0 & \text{otherwise.} \end{cases}
\]
Now suppose the transformation is linear: Yᵢ = Σⱼ Aᵢⱼ Xⱼ. The Jacobian is J = det(A⁻¹) = 1/det(A). Then
\[
f_Y(y) = \frac{1}{|\det(A)|}\, f_X(A^{-1} y).
\]
(See Exercise Sheet 2.)
Definition 2.1. If A is a non-singular n × n matrix and Y = AX, and if X has a joint p.d.f. f_X(x), then the joint p.d.f. of Y is
\[
f_Y(y) = \frac{1}{|\det(A)|}\, f_X(A^{-1} y).
\]
Example 2.1. Let Z₁, Z₂ be independent N(0, 1) random variables, with joint p.d.f.
\[
f_{Z_1,Z_2}(z_1, z_2) = \frac{1}{2\pi} \exp\Big[-\frac12 (z_1^2 + z_2^2)\Big] = \frac{e^{-\frac12 |z|^2}}{2\pi},
\]
and set
\[
X_1 = \sigma_1 Z_1 + \mu_1, \qquad
X_2 = \sigma_2\big(\rho Z_1 + \sqrt{1-\rho^2}\, Z_2\big) + \mu_2.
\]
Here,
\[
A = \begin{pmatrix} \sigma_1 & 0 \\ \sigma_2 \rho & \sigma_2\sqrt{1-\rho^2} \end{pmatrix},
\qquad
\det(A) = \sqrt{1-\rho^2}\,\sigma_1\sigma_2,
\qquad
A^{-1} = \frac{1}{\sqrt{1-\rho^2}\,\sigma_1\sigma_2} \begin{pmatrix} \sqrt{1-\rho^2}\,\sigma_2 & 0 \\ -\sigma_2\rho & \sigma_1 \end{pmatrix}.
\]
Then
\[
f_X(x) = \frac{1}{|\det(A)|}\, f_Z\big(A^{-1}(x - \mu)\big)
= \frac{1}{2\pi\, |\det(A)|} \exp\Big[-\frac12 \big|A^{-1}(x-\mu)\big|^2\Big]
= \frac{1}{\sqrt{1-\rho^2}\,\sigma_1\sigma_2\, 2\pi} \exp\Big[-\frac12 \frac{Q(x)}{1-\rho^2}\Big]
\]
where
\[
Q(x) = \Big(\frac{x_1-\mu_1}{\sigma_1}\Big)^2 - 2\rho\Big(\frac{x_1-\mu_1}{\sigma_1}\Big)\Big(\frac{x_2-\mu_2}{\sigma_2}\Big) + \Big(\frac{x_2-\mu_2}{\sigma_2}\Big)^2.
\]
E[X₁] = µ₁, E[X₂] = µ₂, Var[X₁] = σ₁², Var[X₂] = σ₂², Cov[X₁, X₂] = ρσ₁σ₂.

Then
\[
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right).
\]
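The construction in Example 2.1 is exactly how correlated bivariate normals are simulated in practice. A minimal sketch (assuming numpy; the parameter values are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 0.5, 0.7

Z = rng.standard_normal((2, 500_000))             # independent N(0,1) pairs
A = np.array([[s1, 0.0],
              [s2 * rho, s2 * np.sqrt(1 - rho**2)]])
X = A @ Z + np.array([[mu1], [mu2]])              # X = AZ + mu

print(np.cov(X))   # close to [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]]
```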
Exercise 2.2. From Example 2.1, compute the
a) Marginal distribution of X1
b) Conditional distribution of X1 if X2 = x2 .
For a), integrating out x₂, one finds that the marginal density satisfies
\[
f_{X_1}(x_1) \propto \exp\Big[-\frac12\Big(\frac{x_1-\mu_1}{\sigma_1}\Big)^2\Big].
\]
Two proportional p.d.f.s must be equal since they both integrate to 1, so the marginal p.d.f.
is N (µ1 , σ12 ).
For b), we note that f_{X₁|X₂=x₂}(x₁|x₂) ∝ f_{X₁,X₂}(x₁, x₂) as a function of x₁. This is proportional to the p.d.f. of
\[
N\Big(\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2),\; \sigma_1^2(1 - \rho^2)\Big).
\]
Notice: Marginal distributions are normal and easy to compute, and conditional distributions
are normal.
If correlation is zero, then the random variables are independent. You can substitute ρ = 0
and check.
These features persist with essentially the same proofs for multivariate normal distribution.
where µ = (µ₁, ..., µ_p)ᵀ is the mean vector
and V is the p × p, symmetric, positive definite variance-covariance matrix with (i, j)th entry Cov[Xᵢ, Xⱼ].
We can see that µ = (µ₁, ..., µ_p)ᵀ = E[X] = (E[X₁], ..., E[X_p])ᵀ, so µᵢ = E[Xᵢ].
Similarly, we see that
\[
V = \begin{pmatrix}
\sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \cdots & \rho_{1p}\sigma_1\sigma_p \\
\rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \cdots & \rho_{2p}\sigma_2\sigma_p \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{1p}\sigma_1\sigma_p & \rho_{2p}\sigma_2\sigma_p & \cdots & \sigma_p^2
\end{pmatrix}.
\]
Then for our parameters, we have p means µᵢ, p variances σᵢ², and ½p(p − 1) correlations ρᵢⱼ. We write X ∼ MVN(µ, V).
Question: Why is V positive definite?
Consider the scalar p.d.f. (1/√(2πσ²)) e^{−u²/2σ²}. We know that if this is a p.d.f., integrating it must give us the value 1. In other words, this implies that ∫_{−∞}^{∞} e^{−u²/2σ²} du < ∞. If we replace σ² by a negative number, this integral would be infinite, and hence it is not possible for the function to be a p.d.f. Similarly, V has to be positive definite or else the joint p.d.f. will not integrate properly. The positive definiteness of V also means the correlations ρᵢⱼ must satisfy suitable relationships.
So √(det(BBᵀ)) = √(det(B) det(Bᵀ)) = √((det B)²) = |det(B)|.
Therefore
\[
f_Y(y) = \frac{1}{(2\pi)^{n/2}\, |\det(B)|} \exp\Big[-\frac12 \big(B^{-1}(y-a)\big)^{T} \big(B^{-1}(y-a)\big)\Big]
= \frac{1}{\sqrt{(2\pi)^n \det(BB^T)}} \exp\Big[-\frac12 (y-a)^T (BB^T)^{-1} (y-a)\Big],
\]
so Y ∼ MVN(a, BBᵀ).
Suppose X⁽¹⁾, X⁽²⁾, ..., X⁽ⁿ⁾ are independent p-vectors, each MVN(µ, V), and we want estimates of µ and V. In other words, we want to estimate each µᵢ, σᵢ² and ρᵢⱼ.

We want to construct statistics from the data, close to the µᵢ, σᵢ², ρᵢⱼ, and we do this by using maximum likelihood estimates (m.l.e.).

The m.l.e. idea is to take the joint p.d.f. of the data X⁽¹⁾, ..., X⁽ⁿ⁾, i.e. f_{X⁽¹⁾}(x⁽¹⁾) f_{X⁽²⁾}(x⁽²⁾) ⋯ f_{X⁽ⁿ⁾}(x⁽ⁿ⁾), and choose µ, V to maximize this. In practice, we differentiate with respect to µᵢ, σᵢ² and ρᵢⱼ, set the derivatives to zero, and solve.
Eventually, we get
\[
\hat{\mu}_j = \frac{1}{n}\sum_{i=1}^{n} X_j^{(i)}, \qquad
\hat{\sigma}_j^2 = \frac{1}{n}\sum_{i=1}^{n} \big(X_j^{(i)} - \hat{\mu}_j\big)^2, \qquad
\hat{\rho}_{jk} = \frac{\frac{1}{n}\sum_{i=1}^{n} \big(X_j^{(i)} - \hat{\mu}_j\big)\big(X_k^{(i)} - \hat{\mu}_k\big)}{\hat{\sigma}_j \hat{\sigma}_k}.
\]
Typically, if the number of parameters p is fixed, the amount of data grows, and one uses the m.l.e. in regular situations, then the p maximum likelihood estimators for the p parameters approximately have an MVN distribution.
Here are some diagrams of the shape of the density function with varying n.

[Figures: density curves f(x) of the χ²ₙ distribution for various values of n.]
Special cases:
χ²ₙ = Gamma(n/2, 2)
χ²₂ = Exponential with parameter ½, i.e. mean = 2

If X ∼ χ²ₙ, then E[X] = n, Var[X] = 2n, and
\[
M_X(t) = E[e^{tX}] = \Big(\frac{1}{1-2t}\Big)^{n/2}, \qquad \text{for } t < \tfrac12.
\]
By the Central Limit Theorem, χ²ₙ is approximately N(n, 2n) as n → ∞.
There is no explicit c.d.f. formula for this distribution, but there are tables provided to
approximate the c.d.f. of this distribution.
This distribution is symmetric about 0 and converges to N (0, 1) as n → ∞ but it has heavier
tails.
Special cases:
t1 is Cauchy distribution
There is no explicit c.d.f. formula for this distribution except in the case n = 1, but there
are tables provided to approximate the c.d.f. of this distribution.
This sets the scene for Chapter 3 (Maximum Likelihood Estimation). Maximum Likelihood
Estimators are typically easy to compute, but how good are they, especially when we have
lots of data? We will find out in Chapter 3.
We get data which leads to a list of cases, each case having one or more responses. We can then put the data into a data matrix, for example:

            Responses
Cases       X₁    Y₁    Z₁    ⋯
1           ⋯     ⋯     ⋯     ⋯
2           ⋯     ⋯     ⋯     ⋯
3           ⋯     ⋯     ⋯     ⋯
⋮           ⋮     ⋮     ⋮     ⋱
We can then generate a statistical model with parameters θ₁, ..., θ_p, yielding a joint p.m.f. or joint p.d.f. f_θ(x) for the data.

So we have the relation:

Parameter space: θ ∈ Ω_θ, a subset of Euclidean space.
Observation space: X = (x₁, ..., xₙ), a subset of Euclidean space.
The model f_θ(x₁) ⋯ f_θ(xₙ) takes parameters to distributions for the observations; estimators t(x) take observations back to parameter estimates.
What we want to do is to
1) Construct estimators using f_θ(x)
2) Decide how good they are.

Idea: We look for estimators which are good when the amount n of data becomes large.
We write Xₙ →ᴾ X, or Xₙ → X in probability.
Definition 3.2. The estimators θ̂₁, θ̂₂, ... are consistent for θ ∈ Ω_θ if, for all ε > 0 and all θ ∈ Ω_θ, θ̂ₙ → θ in probability.

Important: When calculating P[|θ̂ₙ − θ| ≥ ε], we need to use f_θ, the p.m.f./p.d.f. given by θ.
Now, X̄ₙ has mean µ and variance σ²/n.

Fix ε > 0, and set rσ/√n = ε, so r = ε√n/σ.

So we find, by Chebyshev's inequality,
\[
P[|\bar{X}_n - \mu| \ge \varepsilon] \le \frac{1}{r^2} = \frac{\sigma^2}{\varepsilon^2 n} \to 0 \text{ as } n \to \infty.
\]
Sometimes, we estimate something and then decide we really want to estimate a function of it. Suppose sₙ² estimates σ² consistently. Then how do we estimate the standard deviation σ? sₙ = √(sₙ²) will then estimate σ = √(σ²) consistently.
a) Before we begin, we note that Var[Z] = E[Z²] − (E[Z])², and hence E[Z²] = Var[Z] + (E[Z])². (†)

Therefore
\[
\begin{aligned}
E[s_n^2] &= E\Big[\frac{1}{n-1}\Big(\sum_{i=1}^{n} (X_i - \bar{X}_n)^2\Big)\Big]
= E\Big[\frac{1}{n-1}\Big(\sum_{i=1}^{n} X_i^2 - n\bar{X}_n^2\Big)\Big] \\
&= \frac{1}{n-1}\big(n E[X_1^2] - n E[\bar{X}_n^2]\big) \\
&= \frac{1}{n-1}\big(n\,\mathrm{Var}[X_1] + n(E[X_1])^2 - n\,\mathrm{Var}[\bar{X}_n] - n(E[\bar{X}_n])^2\big) \qquad \text{by (†)} \\
&= \frac{1}{n-1}\Big(n\sigma^2 + n\mu^2 - n\frac{\sigma^2}{n} - n\mu^2\Big)
= \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2.
\end{aligned}
\]
b) We note that (1/n)Σᵢ Xᵢ² and (1/n)Σᵢ Xᵢ = X̄ₙ can be subjected to the Weak Law of Large Numbers as in Exercise 3.1. Furthermore, we also know from Exercise 3.1 that
\[
\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{prob}} E[X_1] = \mu
\qquad\text{and}\qquad
\frac{1}{n}\sum_{i=1}^{n} X_i^2 \xrightarrow{\text{prob}} E[X_1^2] = \sigma^2 + \mu^2. \qquad (\dagger\dagger)
\]
Therefore
\[
\begin{aligned}
s_n^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2
= \frac{1}{n-1}\Big(\sum_{i=1}^{n} X_i^2 - n\bar{X}_n^2\Big) \\
&= \frac{n}{n-1}\Bigg(\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big)^2\Bigg)
\xrightarrow{\text{prob}} \lim_{n\to\infty} \frac{n}{n-1}\,\big(\sigma^2 + \mu^2 - \mu^2\big) = \sigma^2.
\end{aligned}
\]
Notice that s̃ₙ² = (1/n) Σᵢ(Xᵢ − X̄ₙ)² is therefore also consistent for σ² (but not unbiased).
3.1 Consistency
Consistency in MSE (mean square error) is a sufficient (but not necessary) condition for
consistency.
We can use Chebyshev's Inequality to deduce θ̂ₙ →ᵖʳᵒᵇ θ, hence θ̂ₙ is consistent.
\[
\hat{\mu}_1 = \bar{X}_n, \qquad
\hat{\mu}_2 = \sum_{i=1}^{n} a_i X_i \text{ for } \sum_{i=1}^{n} a_i = 1, \qquad
\hat{\mu}_3 = \frac12 (X_1 + X_n), \qquad
\hat{\mu}_4 = \frac{1}{n-2}\sum_{i=1}^{n} X_i \;\; (\text{for } n > 2).
\]
We test for unbiasedness. Note that if an estimator is unbiased, then it is naturally asymptotically unbiased as well.

E[µ̂₁] = (1/n) Σᵢ E[Xᵢ] = µ. (Unbiased, and hence asymptotically unbiased.)
E[µ̂₂] = Σᵢ E[aᵢXᵢ] = Σᵢ aᵢµ = µ. (Unbiased, and hence asymptotically unbiased.)
E[µ̂₃] = ½(E[X₁] + E[Xₙ]) = (µ + µ)/2 = µ. (Unbiased, and hence asymptotically unbiased.)
E[µ̂₄] = (1/(n−2)) Σᵢ E[Xᵢ] = (n/(n−2))µ ≠ µ. (Biased.)

However, we note that as n → ∞, n/(n−2) → 1, and therefore (n/(n−2))µ → µ as n → ∞, so µ̂₄ is asymptotically unbiased.
We now check for consistency in MSE. Since all four are asymptotically unbiased, we just need to show that Var[µ̂ᵢ] → 0 as n → ∞.

Var[µ̂₁] = σ²/n → 0 as n → ∞. (Consistent in MSE.)
Var[µ̂₂] = Var[Σᵢ aᵢXᵢ] = Σᵢ aᵢ²σ². (So this is consistent in MSE only if the aᵢ depend on n in such a way that Σᵢ aᵢ² → 0 as n → ∞.)
Var[µ̂₃] = ¼ · 2σ² = σ²/2 ↛ 0 as n → ∞. (Not consistent in MSE.)
Var[µ̂₄] = nσ²/(n−2)² = (n/(n−2)) · σ²/(n−2) → 0 as n → ∞. (Consistent in MSE.)
3.2 Estimators
The maximum likelihood estimator is θ̂ₙ such that Lₙ(θ; x) is as large as possible.

Note that f(x₁; θ) ⋯ f(xₙ; θ) is the joint p.m.f./p.d.f. of X₁, ..., Xₙ if we know θ, viewed the other way round as a function of θ supposing we have observed X = x.

Note that the MLE may not be unique, in the sense that there may be different values of θ that give the same maximum. The values of θ̂ are typically found using calculus.
Notice Lₙ(θ; x) is maximized exactly when log Lₙ(θ; x) (the log-likelihood) is maximized. We write
\[
l_n(\theta; x) = \log L_n(\theta; x) = \sum_{i=1}^{n} \log f(x_i; \theta).
\]
For the m.l.e., we use the θ̂ which makes L (or equivalently l) as large as possible.
We solve the system of equations
\[
\frac{\partial l}{\partial \theta_1} = 0, \quad \frac{\partial l}{\partial \theta_2} = 0, \quad \ldots, \quad \frac{\partial l}{\partial \theta_p} = 0
\]
for θ₁, ..., θ_p.
We can solve these equations either analytically or numerically (e.g. by methods like Newton-Raphson).
We can write the problem in brief as: solve for θ in ∂l(x; θ)/∂θ = 0.
\[
I_{ii}(\theta) = \mathrm{Var}\Big[\frac{\partial}{\partial \theta_i} \log f(X; \theta)\Big]
= -E\Big[\frac{\partial^2}{\partial \theta_i^2} \log f(X; \theta)\Big]
\]
Comments: 1) The proof of the theorem is given in advanced statistics books; for example, it can be found in Lehmann & Casella (1998), Theory of Point Estimation, 2nd Edition, Springer.
2) Uniqueness is not assured, but if the maximum is unique, then consistency and asymptotic normality follow.
3) Regularity conditions:
(R0) The p.m.f.s/p.d.f.s are distinct for different θ: if θ ≠ θ′, then f(X; θ) ≠ f(X; θ′).
(R1) The support of the p.d.f./p.m.f. does not depend on θ. (The support is the region of possible values of x.)
(R2) There is an open subset Ω₀ ⊆ Ω_θ such that the true θ ∈ Ω₀, and all third partial derivatives of f(x; θ) exist for all θ ∈ Ω₀.
(R3) The expectation of the score function vanishes: E_θ[∂/∂θⱼ log f(x; θ)] = 0.
(R4) I(θ) is positive definite for all θ ∈ Ω₀ (and hence invertible).
(R5) There are functions M_{jkl}(x) such that |∂³/(∂θⱼ∂θₖ∂θₗ) log f(x; θ)| ≤ M_{jkl}(x) and E_θ[M_{jkl}(x)] < ∞.
So consider, e.g., E[∂ log f(x; θ)/∂θⱼ] = 0 (by the regularity conditions).

Since ∫ f(x; θ) dx = 1, we have ∂/∂θⱼ ∫ f(x; θ) dx = 0.

Under the regularity conditions (in particular, the range of possible values for X doesn't depend on θ), we may differentiate under the integral sign:
\[
\int \frac{1}{f(x;\theta)}\, \frac{\partial}{\partial\theta_j} f(x;\theta)\; f(x;\theta)\, dx = 0,
\qquad\text{i.e.}\qquad
\int \Big(\frac{\partial}{\partial\theta_j} \log f(x;\theta)\Big) f(x;\theta)\, dx = 0,
\]
and the LHS = E[∂ log f(x; θ)/∂θⱼ].
Differentiating once more, with respect to θₖ,
\[
0 = \int \Big[\frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(x;\theta)\Big] f(x;\theta) + \Big[\frac{\partial}{\partial\theta_j} \log f(x;\theta)\Big] \Big[\frac{\partial}{\partial\theta_k} f(x;\theta)\Big]\, dx
\]
\[
\Rightarrow \int \Big[\frac{\partial}{\partial\theta_j} \log f(x;\theta)\Big] \Big[\frac{\partial}{\partial\theta_k} f(x;\theta)\Big] dx
= -\int \Big[\frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(x;\theta)\Big] f(x;\theta)\, dx
= -E\Big[\frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(x;\theta)\Big].
\]
Now use (1/f) ∂f/∂θₖ = ∂(log f)/∂θₖ, so
\[
\text{LHS} = \int \Big[\frac{\partial}{\partial\theta_j} \log f(x;\theta)\Big]\Big[\frac{\partial}{\partial\theta_k} \log f(x;\theta)\Big] f(x;\theta)\, dx
= E\Big[\frac{\partial}{\partial\theta_j} \log f \cdot \frac{\partial}{\partial\theta_k} \log f\Big]
= -E\Big[\frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f\Big].
\]
Exercise 3.4. Find the maximum likelihood estimate and Fisher information matrix if f is the N(µ, σ²) density.
θ = (µ, σ²)ᵀ
\[
\begin{aligned}
l(\mu, \sigma) &= \sum_{i=1}^{n} \log f(x_i; \mu, \sigma^2)
= \sum_{i=1}^{n}\Big({-\tfrac12}\log 2\pi - \log\sigma - \frac{1}{2\sigma^2}(x_i-\mu)^2\Big) \\
&= -\frac{n}{2}\log 2\pi - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2
\end{aligned}
\]
\[
\frac{\partial}{\partial\mu}\, l(\mu,\sigma) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
\;\Rightarrow\; \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}
\]
\[
\frac{\partial}{\partial\sigma}\, l(\mu,\sigma) = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i-\mu)^2 = 0
\;\Rightarrow\; \hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}
\]
For the Fisher information,
\[
\log f(x; \mu, \sigma^2) = -\tfrac12\log 2\pi - \log\sigma - \frac{1}{2\sigma^2}(x-\mu)^2
\]
\[
\frac{\partial}{\partial\mu}\log f = \frac{1}{\sigma^2}(x-\mu), \qquad
\frac{\partial}{\partial\sigma}\log f = -\frac{1}{\sigma} + \frac{1}{\sigma^3}(x-\mu)^2
\]
\[
\frac{\partial^2}{\partial\mu^2}\log f = -\frac{1}{\sigma^2}, \qquad
\frac{\partial^2}{\partial\sigma^2}\log f = \frac{1}{\sigma^2} - \frac{3}{\sigma^4}(x-\mu)^2, \qquad
\frac{\partial^2}{\partial\mu\,\partial\sigma}\log f = -\frac{2}{\sigma^3}(x-\mu).
\]
Using E[X − µ] = 0 and E[(X − µ)²] = σ²,
\[
E\Big[\frac{\partial^2}{\partial\mu^2}\log f\Big] = -\frac{1}{\sigma^2}, \qquad
E\Big[\frac{\partial^2}{\partial\sigma^2}\log f\Big] = \frac{1}{\sigma^2} - \frac{3\sigma^2}{\sigma^4} = -\frac{2}{\sigma^2}, \qquad
E\Big[\frac{\partial^2}{\partial\mu\,\partial\sigma}\log f\Big] = 0,
\]
so
\[
I(\mu,\sigma) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2} \end{pmatrix},
\qquad
V = I(\mu,\sigma)^{-1} = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \frac{\sigma^2}{2} \end{pmatrix}.
\]
Summary:
• √n(µ̂ − µ) ∼ N(0, σ²) for large n
• √n(σ̂ − σ) ∼ N(0, σ²/2), and asymptotically µ̂, σ̂ are independent.

Since we don't know µ, σ, if we need them to estimate the variance terms we plug in the m.l.e.
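The two asymptotic variances can be checked by simulating many samples (an illustrative sketch assuming numpy; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, reps = 0.0, 2.0, 400, 10_000

X = rng.normal(mu, sigma, size=(reps, n))
mu_hat = X.mean(axis=1)
sigma_hat = np.sqrt(((X - mu_hat[:, None])**2).mean(axis=1))

print(np.var(np.sqrt(n) * (mu_hat - mu)))        # close to sigma^2 = 4
print(np.var(np.sqrt(n) * (sigma_hat - sigma)))  # close to sigma^2 / 2 = 2
```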
We know that √n(θ̂ − θ) ∼ MVN(0, I(θ)⁻¹) approximately, with θ₁ = µ, θ₂ = σ. Take
\[
B = \begin{pmatrix} \frac{\partial g_1}{\partial\mu} & \frac{\partial g_1}{\partial\sigma} \\[4pt] \frac{\partial g_2}{\partial\mu} & \frac{\partial g_2}{\partial\sigma} \end{pmatrix},
\qquad g_1(\mu,\sigma) = \mu, \quad g_2(\mu,\sigma) = \sigma^2,
\]
so
\[
B = \begin{pmatrix} 1 & 0 \\ 0 & 2\sigma \end{pmatrix},
\qquad
I^{-1} = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \frac{\sigma^2}{2} \end{pmatrix}.
\]
So
\[
B I^{-1} B^{T} = \begin{pmatrix} 1 & 0 \\ 0 & 2\sigma \end{pmatrix}\begin{pmatrix} \sigma^2 & 0 \\ 0 & \frac{\sigma^2}{2} \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 2\sigma \end{pmatrix}
= \begin{pmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{pmatrix}.
\]
Consequently for large n,
\[
\hat{\eta} = g(\hat{\theta}) = \begin{pmatrix} \bar{X}_n \\ \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 \end{pmatrix}
\quad\text{has approximate distribution}\quad
MVN\left(\begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix}, \frac{1}{n}\begin{pmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{pmatrix}\right).
\]
Remarks:
Using the Cramér-Rao lower bound, the lowest possible variance for an unbiased estimator of σ² can be seen to be 2σ⁴/n.

Here, our estimator σ̂² = (1/n) Σᵢ(Xᵢ − X̄ₙ)² is biased, though asymptotically unbiased and consistent. Its asymptotic variance is the Cramér-Rao lower bound.

The estimator s² = (1/(n−1)) Σᵢ(Xᵢ − X̄ₙ)² is unbiased, with variance 2σ⁴/(n−1) > 2σ⁴/n.

In fact s² has greater mean square error than σ̂². It is not much greater (for large n, 1/(n−1) ≈ 1/n), but this demonstrates that exactly unbiased estimators need not be best.
In general,
\[
\sqrt{n}\,(\hat{\theta}_j - \theta_j) \xrightarrow{\text{dist}} N\big(0, \hat{I}_{jj}(\theta)^{-1}\big),
\]
so we deal with the unknown θ in the variance term by plugging in the maximum likelihood estimate. We can get asymptotic 95% confidence intervals for θⱼ as
\[
\hat{\theta}_j \pm 1.96\, \sqrt{\frac{\hat{I}_{jj}^{-1}}{n}}.
\]
The likelihood function is proportional to the joint density of the data, given the parameters. In Bayesian inference, we model uncertainty about θ by supposing it random with density π(θ), and then we can compute the conditional distribution of the parameter θ given the data X:
\[
\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{f(x)} \propto f(x \mid \theta)\,\pi(\theta).
\]
The m.l.e. theory tells us what f(x | θ) looks like when we have a large amount n of data.
4 Hypothesis Testing
Recall the set up: in a statistical model, data x₁, ..., xₙ are modelled as observed values of random variables X₁, ..., Xₙ.
A hypothesis test is a procedure for comparing two statements about the true value of the
parameter θ.
The null hypothesis, usually denoted by H₀, is a statement of the form that the true value of θ belongs to a given subset Θ₀ ⊆ Θ.
The alternative hypothesis, usually denoted by H₁, is a statement of the form that the true value of θ belongs to a given subset Θ₁.
Usually, the null hypothesis corresponds to some simplifying assumptions or natural special
case.
Example 4.1. Data on blood pressure where women at age 35 and then at age 40 get their
blood pressure measured.
The parameter θ might represent the mean change in blood pressure between the two ages.
Natural null hypothesis θ = 0 which represents “no change in mean blood pressure”
versus alternative hypothesis θ ∈ R.
Remark: We might take Θ1 = R \ {0} instead but it wouldn’t change the analysis.
A hypothesis test is a procedure for choosing between H0 and H1 in light of the data. But it
is not symmetric: we choose to accept the null hypothesis unless the data suggests sufficiently
strongly that the alternative is a better explanation or fit.
We choose a critical region C so that the data values in C correspond to evidence against θ ∈ Θ₀, with some θ ∈ Θ₁ being more compatible with the data. Then if our data x ∈ C, we reject H₀ in favour of H₁; otherwise we accept H₀.
The critical region is usually constructed by using a "test statistic" T(x), which is some function of the data, such that large values of T are more compatible with H₁ than with H₀. Then we set
\[
C = \{x \in \mathbb{R}^n : T(x) \ge \lambda\}
\]
where λ is some critical value that we must choose, so that
\[
\sup_{\theta \in \Theta_0} P_\theta[X \in C] = \sup_{\theta \in \Theta_0} P_\theta[T(X) \ge \lambda] = \alpha
\]
where α is called the size of the test and must be chosen in advance.
Thus the size of the test is the maximum probability of a type I error, and this we usually
fix in advance at one of a number of typical values, α = 0.05, 0.01.
One general way of choosing a test statistic or critical region is the generalized likelihood
ratio. We may take
\[
C = \Bigg\{x \in \mathbb{R}^n : \frac{\sup_{\theta \in \Theta_1} L(\theta; x)}{\sup_{\theta \in \Theta_0} L(\theta; x)} \ge \lambda\Bigg\}
\]
Here L(θ; x) is the observed likelihood of θ if x is the data, and λ is chosen to fix the size of the test.
Suppose X₁, ..., Xₙ are independent N(θ, 1) random variables. Find the critical region for the likelihood ratio test of H₀: θ = θ₀ = 0 against H₁: θ = θ₁ = 1 with size α = 0.01.
\[
L(\theta; x) = \frac{1}{(2\pi)^{n/2}} \exp\Big\{-\frac12 \sum_{i=1}^{n} (x_i - \theta)^2\Big\}
\]
Consequently,
\[
\frac{L(\theta_1)}{L(\theta_0)} = \frac{\exp\big\{{-\frac12}\sum_{i=1}^{n}(x_i-1)^2\big\}}{\exp\big\{{-\frac12}\sum_{i=1}^{n} x_i^2\big\}}
= \exp\Big\{\sum_{i=1}^{n} x_i - \frac{n}{2}\Big\}
= \exp\Big\{n\Big(\bar{X} - \frac12\Big)\Big\}.
\]
So we reject H₀ if exp{n(X̄ − ½)} is too large. Because z ↦ exp{n(z − ½)} is increasing, this is equivalent to rejecting H₀ if X̄ is too large.
So C = {x ∈ ℝⁿ : X̄ ≥ A} for a value of A that is chosen to fix the size of the test.
\[
\alpha = 0.01 = P_{\theta_0}[X \in C] = P_{\theta_0}[\bar{X} \ge A]
\]
but under H₀, X̄ = (1/n) Σᵢ Xᵢ ∼ N(0, 1/n), so √n X̄ ∼ N(0, 1) and
\[
P_{\theta_0}[\bar{X} \ge A] = P_{\theta_0}[\sqrt{n}\,\bar{X} \ge \sqrt{n}\,A] = 1 - \Phi(\sqrt{n}\,A).
\]
Thus A = Φ⁻¹(1 − α)/√n.
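Both the critical value and the size of this test are easy to verify numerically. A sketch (assuming numpy and scipy; n = 25 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import norm

alpha, n = 0.01, 25
A = norm.ppf(1 - alpha) / np.sqrt(n)    # critical value for rejecting H0

# Check the size by simulation under H0: theta = 0, X_i ~ N(0, 1).
rng = np.random.default_rng(7)
xbar = rng.normal(0, 1, size=(200_000, n)).mean(axis=1)
print(A, np.mean(xbar >= A))            # rejection rate close to 0.01
```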
Definition 4.1. The power of a test is the function β : Θ₁ → [0, 1] defined by β(θ) = P_θ[X ∈ C].
This describes the probability of correctly rejecting the null hypothesis in favour of the
alternative hypothesis, i.e. ‘1 - type II’ error probability!
Usually if we decrease the size of the test, we will decrease the power too because we become
more willing to accept H0 .
Similarly, if we increase the size of the test, we can increase the power, and we become more
willing to reject H0 .
For a given fixed size of the test, we would like to make the power as high as possible.
So Theorem 4.1 implies that if we want to test two simple hypotheses, the likelihood ratio
test is best.
Proof. Let
\[
C_0 = \Big\{x \in \mathbb{R}^n : \frac{f(x; \theta_1)}{f(x; \theta_0)} \ge A\Big\}
\]
where A is chosen so that P_{θ₀}[X ∈ C₀] = α, which is the size of the test.
Let C₁ be the critical region of any other test of size α, and write B₁ = C₀ ∩ C₁, B₂ = C₀ \ C₁ and B₃ = C₁ \ C₀, so that C₀ = B₁ ∪ B₂ and C₁ = B₁ ∪ B₃.
Consider
\[
\begin{aligned}
P_{\theta_1}[X \in C_0] - P_{\theta_1}[X \in C_1]
&= P_{\theta_1}[X \in B_2] - P_{\theta_1}[X \in B_3] \qquad (\text{as } B_1 \text{ cancels out})\\
&= \int_{B_2} f(x; \theta_1)\, dx - \int_{B_3} f(x; \theta_1)\, dx. \qquad (\dagger)
\end{aligned}
\]
Now B₂ ⊆ C₀, and so f(x; θ₁)/f(x; θ₀) ≥ A for x ∈ B₂; and B₃ ⊆ C₀ᶜ, and so f(x; θ₁)/f(x; θ₀) < A for x ∈ B₃.

Using these facts, we can see that (†) is at least
\[
A \int_{B_2} f(x; \theta_0)\, dx - A \int_{B_3} f(x; \theta_0)\, dx.
\]
But
\[
\begin{aligned}
A \int_{B_2} f(x; \theta_0)\, dx - A \int_{B_3} f(x; \theta_0)\, dx
&= A \int_{B_2 \cup B_1} f(x; \theta_0)\, dx - A \int_{B_3 \cup B_1} f(x; \theta_0)\, dx \\
&= A \int_{C_0} f(x; \theta_0)\, dx - A \int_{C_1} f(x; \theta_0)\, dx \\
&= A\big\{P_{\theta_0}[X \in C_0] - P_{\theta_0}[X \in C_1]\big\} = A\{\alpha - \alpha\} = 0.
\end{aligned}
\]
The Neyman-Pearson Lemma tells us that the Likelihood Ratio Test is optimal where both
hypotheses are simple. But what about composite hypotheses?
Definition 4.2. A hypothesis test of H₀: θ ∈ Θ₀ against H₁: θ ∈ Θ₁ with critical region C is called the uniformly most powerful (UMP) test at size α if
\[
\sup_{\theta \in \Theta_0} P_\theta[X \in C] = \alpha
\]
and, for any other test with critical region C′ satisfying
\[
\sup_{\theta \in \Theta_0} P_\theta[X \in C'] \le \alpha,
\]
we have
\[
P_\theta[X \in C] \ge P_\theta[X \in C'] \quad \text{for all } \theta \in \Theta_1.
\]
The uniformly most powerful (UMP) test doesn’t always exist. Typically, as we try to
increase the power at one value of θ ∈ Θ1 , we will decrease the power somewhere else.
Corollary 4.1. If H₀: θ = θ₀ is simple and H₁: θ ∈ Θ₁, then a UMP test of size α exists if and only if the likelihood ratio test of H₀: θ = θ₀ against H₁: θ = θ₁ of size α has the same critical region for every θ₁ ∈ Θ₁.
Exercise 4.1. Suppose the random variables X1 , . . . , Xn are independently identically dis-
tributed with N (0, σ 2 ) where σ is an unknown parameter.
The Neyman-Pearson Lemma then implies that this is the UMP test of H₀: σ = 1 against H₁: σ > 1.

Once again, our procedure is the same for all values σ₁ < 1. So by the Neyman-Pearson Lemma, this is the UMP test of H₀: σ = 1 against H₁: 0 < σ < 1.

c) Since the likelihood ratio test is different for values of σ₁ < 1 and values of σ₁ > 1, there can be no UMP test of H₀: σ = 1 against H₁: σ ≠ 1.
The Neyman-Pearson Lemma says that for simple hypotheses, the Likelihood Ratio Test is
optimal. It also implies in certain restricted situations that likelihood ratios give U.M.P.
tests.
Motivated by these good properties of likelihood ratios, we can define what a generalized
likelihood ratio test is.
Definition 4.3. The generalized likelihood ratio test of H₀: θ ∈ Θ₀ against H₁: θ ∈ Θ₁ is the test with critical region
\[
C = \Bigg\{x \in \mathbb{R}^n : \frac{\sup_{\theta \in \Theta_1} L(\theta; x)}{\sup_{\theta \in \Theta_0} L(\theta; x)} \ge \lambda\Bigg\}
\]
where λ is chosen to fix the size of the test.
The t-test rejects H₀ when t(x) = x̄/(s/√n) is too large (in the positive or negative sense). Here, x̄ = (1/n) Σᵢ xᵢ and s² = (1/(n−1)) Σᵢ (xᵢ − x̄)².

By Fisher's Theorem,
\[
\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1} \text{ distribution,}
\]
where X̄ = (1/n) Σᵢ Xᵢ and s² = (1/(n−1)) Σᵢ (Xᵢ − X̄)².

So t(X) = X̄/(s/√n) ∼ t_{n−1} under H₀.
Under H₀, L(µ, σ) is maximized by choosing µ = 0 and σ̂₀² = (1/n) Σᵢ xᵢ². So
\[
L(0, \hat{\sigma}_0; x) = \frac{1}{(2\pi)^{n/2}\, \hat{\sigma}_0^n} \exp\Big\{-\frac{n}{2}\Big\}.
\]
Under H₁, the M.L.E. of µ is µ̂ = (1/n) Σᵢ xᵢ = x̄, and the M.L.E. of σ² is σ̂² = (1/n) Σᵢ (xᵢ − x̄)². So
\[
L(\hat{\mu}, \hat{\sigma}; x) = \frac{1}{(2\pi)^{n/2}\, \hat{\sigma}^n} \exp\Big\{-\frac{n}{2}\Big\}.
\]
The generalized likelihood ratio is therefore
\[
\Big(\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2}\Big)^{n/2}
= \left(\frac{\sum x_i^2}{\sum (x_i - \bar{x})^2}\right)^{n/2}
= \left(\frac{\sum (x_i - \bar{x} + \bar{x})^2}{\sum (x_i - \bar{x})^2}\right)^{n/2}
= \left(\frac{\sum (x_i - \bar{x})^2 + n\bar{x}^2}{\sum (x_i - \bar{x})^2}\right)^{n/2}
= \left(1 + \frac{n\bar{x}^2}{\sum (x_i - \bar{x})^2}\right)^{n/2}
= \left(1 + \frac{t(x)^2}{n-1}\right)^{n/2}.
\]
Since z ↦ (1 + z/(n−1))^{n/2} is increasing in z, the likelihood ratio being too large corresponds to t(x)² being too large, or equivalently |t(x)| being too large.

Thus the generalized likelihood ratio test is the same as the t-test.
t(x) = −4.16. We reject H₀ if |t| > 2.145, which is the 97.5% point of t₁₄. So here, we reject H₀!
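In practice this test is one line with scipy; the following sketch checks the t statistic against the 97.5% point of t₁₄ (the data here are made up, since the original sample for this example is not reproduced above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.normal(0.5, 1.0, size=15)      # illustrative sample, n = 15

t = x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))
crit = stats.t.ppf(0.975, df=len(x) - 1)      # 2.145 for 14 d.f.
print(t, crit, abs(t) > crit)
print(stats.ttest_1samp(x, popmean=0.0))      # same t statistic, plus a p-value
```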
where
\[
s_{PL}^2 = \frac{1}{m+n-2}\Big\{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2\Big\}.
\]
Under H₀:
The m.l.e. of µ = µ_X = µ_Y is µ̂ = (1/(m+n)) [Σ xᵢ + Σ yᵢ].
The m.l.e. of σ² is σ̂₀² = (1/(m+n)) [Σ (xᵢ − µ̂)² + Σ (yᵢ − µ̂)²].
Under H₁:
The m.l.e. of µ_X is µ̂_X = (1/m) Σ xᵢ.
The m.l.e. of µ_Y is µ̂_Y = (1/n) Σ yᵢ.
The m.l.e. of σ² is σ̂² = (1/(m+n)) [Σ (xᵢ − x̄)² + Σ (yᵢ − ȳ)²].

Consequently, the generalized likelihood ratio is equal to
\[
\Big(\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2}\Big)^{(m+n)/2}.
\]
Now
\[
\begin{aligned}
\hat{\sigma}_0^2 &= \frac{1}{m+n}\Big\{\sum (x_i - \hat{\mu})^2 + \sum (y_i - \hat{\mu})^2\Big\}\\
&= \frac{1}{m+n}\Big\{\sum (x_i - \bar{x} + \bar{x} - \hat{\mu})^2 + \sum (y_i - \bar{y} + \bar{y} - \hat{\mu})^2\Big\}\\
&= \frac{1}{m+n}\Big\{\sum (x_i - \bar{x})^2 + m(\bar{x} - \hat{\mu})^2 + \sum (y_i - \bar{y})^2 + n(\bar{y} - \hat{\mu})^2\Big\}\\
&= \frac{1}{m+n}\Big\{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2 + \frac{mn}{m+n}(\bar{x} - \bar{y})^2\Big\}
\end{aligned}
\]
as µ̂ = (mx̄ + nȳ)/(m + n).
Consequently the generalized likelihood ratio is
\[
\left(1 + \frac{\frac{mn}{m+n}(\bar{x} - \bar{y})^2}{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2}\right)^{(m+n)/2}
= \left(1 + \frac{t^2}{m+n-2}\right)^{(m+n)/2}
\qquad\text{where } t = \frac{\bar{x} - \bar{y}}{s_{PL}\sqrt{\frac{1}{m} + \frac{1}{n}}}.
\]
Since z ↦ (1 + z/(m+n−2))^{(m+n)/2} is increasing, the G.L.R.T. rejects if |t| is too large.
so
\[
\frac{(m+n-2)\, s_{PL}^2}{\sigma^2} \sim \chi^2_{m+n-2}.
\]
Finally,
\[
T = \frac{A}{\sqrt{B}/\sqrt{m+n-2}}
\qquad\text{where}\qquad
A = \frac{\bar{X} - \bar{Y}}{\sigma\sqrt{\frac{1}{m} + \frac{1}{n}}} \sim N(0, 1)
\quad\text{and}\quad
B = \frac{(m+n-2)\, s_{PL}^2}{\sigma^2} \sim \chi^2_{m+n-2},
\]
with A and B independent, so T ∼ t_{m+n−2} distribution.
Example 4.6. Numerical example
Two groups of female rats placed on high/low protein diets. We want to measure the gain
in weight of the rats after a week. We test the hypothesis that the mean weight gain under
both diets is the same. We have the values m = 12, x̄ = 120, n = 7, ȳ = 101, s2P L = 446.12.
We get
\[
t = \frac{\bar{x} - \bar{y}}{s_{PL}\sqrt{\frac{1}{m} + \frac{1}{n}}} = 1.89.
\]
If the size of the test is to be 5%, then consider the 97.5% point of the t-distribution with 17 (= 12 + 7 − 2) degrees of freedom, which is 2.11. Since |t| = 1.89 < 2.11, we do not reject H₀.
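The arithmetic of Example 4.6 can be reproduced directly from the summary statistics (a sketch assuming numpy and scipy):

```python
import numpy as np
from scipy import stats

# Summary statistics from Example 4.6 (rats on high/low protein diets).
m, xbar, n, ybar, s2_PL = 12, 120.0, 7, 101.0, 446.12

t = (xbar - ybar) / np.sqrt(s2_PL * (1/m + 1/n))
crit = stats.t.ppf(0.975, df=m + n - 2)    # 97.5% point of t_17
print(t, crit, abs(t) > crit)              # 1.89 < 2.11: do not reject H0
```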
Under H₀, the m.l.e.s are: µ̂_X = x̄ = (1/m) Σ xᵢ, µ̂_Y = ȳ = (1/n) Σ yᵢ, and σ̂² = (S_xx + S_yy)/(m + n), where S_xx = Σ (xᵢ − x̄)² and S_yy = Σ (yᵢ − ȳ)².

Under H₁, the m.l.e.s are: µ̂_X = x̄, µ̂_Y = ȳ, σ̂²_X = S_xx/m and σ̂²_Y = S_yy/n.

So
\[
L(\hat{\mu}_X, \hat{\mu}_Y, \hat{\sigma}_X^2, \hat{\sigma}_Y^2) = \frac{1}{(2\pi)^{(m+n)/2}} \cdot \frac{1}{\hat{\sigma}_X^m\, \hat{\sigma}_Y^n} \exp\Big\{-\frac{m+n}{2}\Big\}.
\]
Hence the generalized likelihood ratio is
\[
\frac{\hat{\sigma}^{m+n}}{\hat{\sigma}_X^m\, \hat{\sigma}_Y^n}
= c(m, n) \cdot \left(1 + \frac{S_{xx}}{S_{yy}}\right)^{n/2} \cdot \left(1 + \frac{S_{yy}}{S_{xx}}\right)^{m/2}
\]
where c(m, n) is some function of m and n. Large values of the G.L.R. occur if the ratio S_xx/S_yy is either too large or too small.
Because the values of the G.L.R. corresponding to the two critical values F_{m−1,n−1}(α/2) and F_{m−1,n−1}(1 − α/2) are not equal, the F-test doesn't correspond to rejecting H₀ if the G.L.R. exceeds some single critical value.
In the last three examples, we have seen that the G.L.R. can be written as a function of
some simple statistic (t−statistic, F −statistic) whose sampling distribution under the null
hypothesis was known.
In general, we aren’t so lucky and exact distributions under the null hypothesis are not easy
to find.
Suppose there are k possible outcomes of a random trial, each occurring with probability θᵢ, i = 1, 2, ..., k, with Σᵢ θᵢ = 1. Suppose we repeat the trial n times, and let Yᵢ, i = 1, 2, ..., k, denote the number of times outcome i occurs in the n trials. If the trials are independent, then (Y₁, ..., Yₖ) has a joint distribution with support
\[
\Big\{(y_1, \ldots, y_k) \in \mathbb{Z}^k : y_i \ge 0,\; \sum_{i=1}^{k} y_i = n\Big\}
\]
and probability mass function
\[
f_{Y_1,\ldots,Y_k}(y) = \frac{n!}{y_1!\, y_2! \cdots y_k!}\, \theta_1^{y_1} \cdots \theta_k^{y_k}.
\]
Comments:
• We say (Y1 , . . . , Yk ) has a multinomial distribution.
• If we specify an ordering for the outcomes, then this gives a probability θ₁^{y₁} ⋯ θₖ^{yₖ}, and there are n!/(y₁! ⋯ yₖ!) possible orderings.
• The binomial distribution is the special case k = 2. Then (Y₁, Y₂) = (Y₁, n − Y₁) and Y₁ has a Binomial distribution, Bin(n, θ₁).
• In fact, the marginal distribution of Yi is Bin(n, θi ) because we can reclassify outcome i as
a “success” and call all other outcomes “failures” and thus obtain a sequence of n Bernoulli
trials.
4.3.1 χ² Tests
\[
= \sum_{i=1}^{k} y_i \log \hat{\theta}_i - \sum_{i=1}^{k} y_i \log \theta_i(\hat{\phi})
\]
Recall also that θ̂ᵢ = yᵢ/n. Writing eᵢ = nθᵢ(φ̂) for the expected counts and δᵢ = yᵢ − eᵢ,
\[
\begin{aligned}
2r(y) &= 2 \sum y_i \Big\{\log\frac{y_i}{n} - \log\frac{e_i}{n}\Big\}
= 2 \sum (e_i + \delta_i) \log\frac{e_i + \delta_i}{e_i}
= 2 \sum (e_i + \delta_i) \log\Big(1 + \frac{\delta_i}{e_i}\Big)\\
&= 2 \sum (e_i + \delta_i) \Big(\frac{\delta_i}{e_i} - \frac12\frac{\delta_i^2}{e_i^2} + \ldots\Big)
= 2 \underbrace{\sum \delta_i}_{=\,0} + \sum \frac{\delta_i^2}{e_i} + \text{higher order terms}\\
&\approx \sum_{i=1}^{k} \frac{(y_i - e_i)^2}{e_i} \qquad \longleftarrow \text{Pearson's statistic}
\end{aligned}
\]
We are assuming n is large and δᵢ/eᵢ is small.
d = dim(Θ) − dim(Θ0 ) = (k − 1) − k0
                          Ability in Maths
                          low    average    high
Interest     low          63     42         15
in Stats     average      58     61         31
             high         14     47         29
Let H0 be: ability in maths and interest in statistics are independent, and let H1 be: arbitrary
cell probabilities.
H₀ corresponds to assuming the probability θᵢⱼ for cell (i, j) is of the form θᵢⱼ = pᵢ p′ⱼ for some
p₁, p₂, p₃ being the probabilities of low, average and high ability in Maths, and
p′₁, p′₂, p′₃ being the probabilities of low, average and high interest in Stats.
The corresponding expected counts eᵢⱼ under H₀ (row total × column total / n, with n = 360) are:

            low      average    high
low         45       50         25
average     56.25    62.5       31.25
high        33.75    37.5       18.75
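scipy computes both the expected counts and the Pearson statistic for this table in one call (an illustrative sketch; scipy.stats.chi2_contingency implements exactly this independence test):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = interest in Stats, columns = ability in Maths.
obs = np.array([[63, 42, 15],
                [58, 61, 31],
                [14, 47, 29]])

chi2, p, dof, expected = chi2_contingency(obs)
print(expected)        # matches the 45, 50, 25, 56.25, ... table above
print(chi2, dof, p)    # Pearson statistic, 4 degrees of freedom, p-value
```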
Example 4.9. We look at the number of errors in a manuscript per page. Each page is a
trial, outcome = number of errors on that page.
More precisely, we assume the observed frequency vector (y₀, y₁, ..., y₈₊) is multinomial with
\[
\theta_i = e^{-\lambda}\frac{\lambda^i}{i!}, \quad i = 0, 1, \ldots, 7,
\qquad\text{and}\qquad
\theta_{8+} = e^{-\lambda}\sum_{i \ge 8} \frac{\lambda^i}{i!}
\]
for some λ. The Pearson statistic is
\[
\frac{(18 - 21.9)^2}{21.9} + \ldots = 6.83.
\]
We accept the null hypothesis at the 5% level; the data are fitted reasonably well by a Poisson distribution.
Further example. Go to the casino-coin example in MSA, and do as above, except using
uniform distribution.
5 Linear Models
Definition 5.1. A response variable is a random variable (or sometimes several copies of
a random variable) whose distribution depends on (perhaps several) explanatory variables
whose values are known as well as unknown parameters.
Example 5.1. A patient's blood pressure after a drug treatment can be considered as a response variable whose distribution depends on explanatory variables (blood pressure before treatment, age, and gender), together with unknown measures of how effective the drug is, which can be described via parameters.
Definition 5.2. A statistical model for the distribution of the response variable(s) is called a linear model if each response variable has an expectation which is a linear function of the parameters for any given values of the explanatory variables.
where A is an n × k matrix called the design matrix, whose entries depend on the explanatory variables.

Thus Y = Aβ + ε, where ε = (ε₁, ..., εₙ)ᵀ is a vector of random variables with E[εᵢ] = 0.

In a linear model, we often assume that the εᵢ are i.i.d.
Remarks:
• Sometimes, we do not treat explanatory variables as random; but then we condition on their observed values. In this case, we might write E[Y | X] = A(X)β.
• Sometimes, E[Y] is a linear function of β and of the explanatory variables as well, e.g. linear regression: response random variables Y₁, ..., Yₙ with Yᵢ = Σⱼ xᵢⱼβⱼ + εᵢ, where xᵢⱼ is the value of the jth explanatory variable on the ith response. But polynomial regression, Yᵢ = Σⱼ βⱼ xᵢ^{αⱼ} + εᵢ, is a linear model too!
β̂₀, β̂₁ are then called the least squares estimates of β₀ and β₁.

[Figure: scatter plot with fitted regression line y = β₀ + β₁x.]
Solve:
\[
\frac{\partial Q}{\partial \beta_0}\Big|_{\beta_0 = \hat{\beta}_0} = -2 \sum \big(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\big) = 0 \qquad (\dagger)
\]
\[
\frac{\partial Q}{\partial \beta_1}\Big|_{\beta_1 = \hat{\beta}_1} = -2 \sum \big(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\big)\, x_i = 0 \qquad (\dagger\dagger)
\]
From (†), Σ yᵢ = nβ̂₀ + β̂₁ Σᵢ xᵢ.
From (††), Σ xᵢyᵢ = β̂₀ Σ xᵢ + β̂₁ Σ xᵢ².

We get
\[
\hat{\beta}_1 = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.
\]
We can do something similar with general cases of linear models, in which data yᵢ are treated as observed values of random variables Yᵢ satisfying Y = Aβ + ε, with
ε an n-vector of random variables with E[εᵢ] = 0,
β a k-vector of unknown parameters,
A an n × k design matrix depending on the known values of explanatory variables.

So we want to estimate β by minimizing
\[
Q = \sum_{i=1}^{n} \big(y_i - (A\beta)_i\big)^2 = (y - A\beta)^T (y - A\beta).
\]
Now
\[
\frac{\partial Q}{\partial \beta_j} = -2 \sum_{i=1}^{n} \big(y_i - (A\beta)_i\big)\, a_{ij}.
\]
For the simple linear regression above,
\[
A^T A = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix},
\qquad
\big(A^T A\big)^{-1} = \frac{1}{\sum x_i^2 - n\bar{x}^2} \begin{pmatrix} \frac{1}{n}\sum x_i^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix},
\qquad
A^T y = \begin{pmatrix} n\bar{y} \\ \sum x_i y_i \end{pmatrix},
\]
so
\[
\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} = \big(A^T A\big)^{-1} A^T y
= \frac{1}{\sum x_i^2 - n\bar{x}^2} \begin{pmatrix} \sum x_i^2\, \bar{y} - \bar{x} \sum x_i y_i \\ \sum x_i y_i - n\bar{x}\bar{y} \end{pmatrix}.
\]
Proof.
1) Consider
\[
E[\hat{\beta}] = E\big[(A^T A)^{-1} A^T Y\big]
= (A^T A)^{-1} A^T E[Y]
= (A^T A)^{-1} A^T A \beta
= \beta.
\]
2) We use the following: if X is a random vector of dimension n with variance-covariance matrix Σ, and B is a non-random m × n matrix, then the variance-covariance matrix of BX is given by BΣBᵀ.
Such an estimator is called a best linear unbiased estimator if it has a smaller variance
than any other linear unbiased estimator.
Theorem 5.2. Gauss-Markov
The least squares estimator is the best linear unbiased estimator.
Definition 5.3. In the normal linear model, the response variables Y = (Y₁, ..., Yₙ)ᵀ satisfy Y = Aβ + ε, where
• A is an n × k design matrix depending on explanatory variables,
• β is a k-vector of unknown parameters,
• ε = (ε₁, ..., εₙ) are independent Gaussian random variables with mean 0 and unknown common variance σ² (homoscedasticity).

So E[Y] = Aβ, Var[Y] = σ²I and Y ∼ MVN(Aβ, σ²I).
Theorem 5.3. In the normal linear model the maximum likelihood estimates of β are given by the least squares estimates, and the maximum likelihood estimate for σ² is given by
\[
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - (A\hat{\beta})_i\big)^2.
\]
Proof.
\[
l(\beta, \sigma^2; y) = \log L(\beta, \sigma^2; y)
= \text{constant} - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2} \underbrace{(y - A\beta)^T (y - A\beta)}_{\sum_{i=1}^{n} (y_i - (A\beta)_i)^2}.
\]
To find the maximum likelihood estimates of β, we need to minimize Σᵢ (yᵢ − (Aβ)ᵢ)², and this gives the least squares estimates.

To find the m.l.e. of σ², we solve
\[
0 = \frac{\partial l}{\partial \sigma^2}\Big|_{\sigma^2 = \hat{\sigma}^2,\, \beta = \hat{\beta}}
= -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}\, (y - A\hat{\beta})^T (y - A\hat{\beta})
\;\Rightarrow\;
\hat{\sigma}^2 = \frac{1}{n}\, (y - A\hat{\beta})^T (y - A\hat{\beta}).
\]
Theorem 5.4. (Distribution of estimators in the normal linear model)
a) β̂ ∼ MVN(β, σ²(AᵀA)⁻¹)
b) s² = "residual sum of squares" = (Y − Aβ̂)ᵀ(Y − Aβ̂) is independent of β̂
c) s²/σ² ∼ χ²_{n−k}, where n = dim Y and k = dim β.
Proof.
a) β̂ must have a Gaussian distribution because it is constructed from a linear combination of the Yᵢ's, which are Gaussian. The mean and variance-covariance matrix of β̂ we already know from Theorem 5.1.
Remarks:
• In fact, Fisher's Theorem is a special case of Theorem 5.4. Take Yᵢ = µ + εᵢ with εᵢ ∼ N(0, σ²).
• An experiment is called an orthogonal design if AᵀA is a diagonal matrix. In this case, β̂₁, ..., β̂ₖ are independent.
So
\[
A^T A = \begin{pmatrix} n & 0 \\ 0 & \sum (x_i - \bar{x})^2 \end{pmatrix}.
\]
By Theorem 5.4,
\[
\hat{\alpha}_0 \sim N\Big(\alpha_0, \frac{\sigma^2}{n}\Big), \qquad
\hat{\alpha}_1 \sim N\Big(\alpha_1, \frac{\sigma^2}{\sum (x_i - \bar{x})^2}\Big), \qquad
\frac{s^2}{\sigma^2} \sim \chi^2_{n-2},
\]
and all these are independent.
Suppose we wish to test whether the data suggest that the response variable depends on the explanatory variable's value, i.e. H₀: α₁ = 0.

Since under H₀,
\[
\hat{\alpha}_1 \sim N\Big(0, \frac{\sigma^2}{\sum (x_i - \bar{x})^2}\Big)
\quad\text{independently of}\quad
\frac{s^2}{\sigma^2} \sim \chi^2_{n-2},
\]
then
\[
T = \frac{\hat{\alpha}_1}{\sqrt{s^2 \big/ \big((n-2) \sum (x_i - \bar{x})^2\big)}} \sim t_{n-2}.
\]
We assume yᵢⱼ, the jth observation from group i, is an observed value of a random variable Yᵢⱼ with Yᵢⱼ = µᵢ + εᵢⱼ, where A is the n × k matrix of group indicators, with n = n₁ + ⋯ + nₖ:
\[
A = \begin{pmatrix}
1 & 0 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 1\\
\vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 1
\end{pmatrix}
\]
(k columns; the first n₁ rows are (1, 0, ..., 0), the next n₂ rows are (0, 1, ..., 0), and so on, down to the last nₖ rows (0, ..., 0, 1)).
Then AᵀA = diag(n₁, n₂, ..., nₖ), so this is an orthogonal design.
Hence the least squares estimates are µ̂ᵢ = ȳᵢ, where ȳᵢ = (1/nᵢ) Σⱼ₌₁^{nᵢ} yᵢⱼ is the sample mean of group i.
We wish to test
H₀: µ₁ = µ₂ = ⋯ = µₖ versus
H₁: µ₁, ..., µₖ are unrestricted.

Under H₀, the experiment reduces to a sample of size n from some Gaussian distribution with unknown mean and variance. So we know that the m.l.e.s are:
\[
\hat{\mu}_0 = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij} = \text{overall mean } \bar{y},
\qquad
\hat{\sigma}_0^2 = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2.
\]
Now the likelihood is
\[
L(\mu, \sigma^2; y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\mu_i)^2\Big\},
\]
so the generalized likelihood ratio is
\[
\frac{L(\hat{\mu}, \hat{\sigma}^2; y)}{L(\hat{\mu}_0, \hat{\sigma}_0^2; y)}
= \left(\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y})^2}\right)^{-n/2}
= \left(\frac{s_0}{s_1}\right)^{n/2}
= \left(1 + \frac{s_2}{s_1}\right)^{n/2}
\]
where
s₀ = Σᵢ Σⱼ (yᵢⱼ − ȳ)² = "total sum of squares",
s₁ = Σᵢ Σⱼ (yᵢⱼ − ȳᵢ)² = "within group sum of squares",
s₂ = s₀ − s₁ = Σᵢ nᵢ(ȳᵢ − ȳ)² = "between group sum of squares".

The likelihood ratio statistic is an increasing function of s₂/s₁. We reject H₀ when this ratio is too large.
The final step is to decide what the critical value of s₂/s₁ should be if we wish the size of the test to be α, say. To do this we consider the distribution of the random variable corresponding to s₂/s₁.

Using (∗), we see that s₁ ∼ σ²χ²_{n−k} and s₂ ∼ σ²χ²_{k−1}, independently. Consequently,
\[
\frac{s_2/(k-1)}{s_1/(n-k)} \sim F_{k-1,\,n-k}.
\]
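To close, here is a small numerical sketch of this F-test (not from the original notes; the group means and sizes are made up for illustration, and scipy's f_oneway serves as a cross-check):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
groups = [rng.normal(mu, 1.0, size=ni)      # illustrative group data
          for mu, ni in [(0.0, 8), (0.5, 10), (1.0, 12)]]

k = len(groups)
n = sum(len(g) for g in groups)
ybar = np.concatenate(groups).mean()
s1 = sum(((g - g.mean())**2).sum() for g in groups)       # within-group SS
s2 = sum(len(g) * (g.mean() - ybar)**2 for g in groups)   # between-group SS

F = (s2 / (k - 1)) / (s1 / (n - k))
print(F, stats.f.sf(F, k - 1, n - k))       # F statistic and p-value
print(stats.f_oneway(*groups))              # scipy agrees
```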