
Mathematical Statistics B 2010

Typeset by Keegan Kang


Contents

1 Bivariate and Multivariate Distributions
  1.1 Bivariate Distributions
    1.1.1 Discrete Bivariate Distributions
    1.1.2 Continuous Bivariate Distributions
    1.1.3 Marginal Distributions
    1.1.4 Conditional Distributions
  1.2 Multivariate Distributions
  1.3 Review of Notation
  1.4 Expectation
  1.5 Conditional Expectation

2 Multivariate Normal Distribution (and associates)
  2.1 Transformation of a Multivariate p.d.f.
  2.2 Linear Transformations
  2.3 Maximum Likelihood Estimates
  2.4 Distributions Derived from the Gaussian

3 Maximum Likelihood Estimation
  3.1 Consistency
  3.2 Estimators

4 Hypothesis Testing
  4.1 Power of a test
  4.2 Generalized Likelihood Ratio Tests
    4.2.1 The t-test
    4.2.2 The two sample t-test
    4.2.3 The F-test
    4.2.4 Wilks' Theorem
  4.3 Multinomial Distribution
    4.3.1 χ²-Tests

5 Linear Models
  5.1 Simple Linear Regression
    5.1.1 The least squares criterion

Introduction

I will try a project this term to constantly LaTeX the lecture notes, and send them out
to some people each week for revision purposes. At the end of Week 10, I should have a full
set of lecture notes.

Credit goes to Professor Wilfrid Kendall and Dr Jon Warren and Dr Barbel Finkenstädt for
the lecture notes.

Thanks goes to Vainius Indilas for pointing out the many mistakes I have made.

Thanks goes to YSL too for everything :)

Disclaimer: Nothing here is guaranteed to be 100% accurate. If in doubt, check your own
lecture notes.

1 Bivariate and Multivariate Distributions

Often, we are interested in several random variables, which may or may not be related to
each other. We want to answer certain questions about them, such as finding out how they
vary, or the range in which their values lie, and so on.

Probability Distribution of Random Vectors

We start by giving a few examples of two random variables that we may be interested in.
Note that this can be extended to three, four, ... random variables, but we keep the examples
at two random variables to make things simpler.

Example 1.1. Bivariate random variables


1) Pick a person currently in Warwick University, and measure their age (denoted by X),
and their height (denoted by Y ). The random variable X is a discrete random variable, and
the random variable Y is a continuous random variable.

2) Pick a random member of the Warwick staff, and measure their blood pressure. We
denote their systolic blood pressure as X1 , and their diastolic blood pressure as X2 . Both
these random variables X1 and X2 are continuous random variables.

3) In a sample of n theatre patrons, we wish to count the number of people who a) live more
than 25 miles away from a theatre (denoted by X), and b) are older than 60 years of age
(denoted by Y). The random variables X and Y are both discrete random variables.

4) We want to find the estimates of mean and variance for n independent identically dis-
tributed random variables, X1 , X2 , ..., Xn . Then we want to find (µ̂, σ̂ 2 ), where µ̂ and σ̂ 2 are
the random variables we are interested in.

5) We let θ ∼ Uniform(0, 1) and X|θ ∼ Bin(n, θ). We see here that, although the binomial
distribution of X depends on θ, the random variable θ is continuous while the random
variable X is discrete.

We will now cover concepts that describe how probable it is for these random variables to
lie in one region rather than another.

1.1 Bivariate Distributions

Definition 1.1. The bivariate cumulative distribution function (c.d.f.), also called
the joint c.d.f., is defined as
FX,Y (x, y) = P[X ≤ x, Y ≤ y]
for any x, y ∈ R. This definition holds no matter whether X, Y are discrete or continuous.

If X and Y are discrete, it is often easier to use Definition 1.2.


Definition 1.2. The bivariate probability mass function (p.m.f.), also called the joint
p.m.f., is defined as
fX,Y (x, y) = P[X = x, Y = y]
for any x, y ∈ R. We often write PXY (x, y) and not fXY (x, y) for p.m.f.s.
Exercise 1.1. We consider a litter of piglets, and let X1 be a random variable denoting the
number of male piglets, and X2 be a random variable denoting the number of female piglets.
Let X1 and X2 be independent Poisson distributions with means λ1 and λ2 respectively.
Find the joint p.m.f. of X1 and X2 .

We know that X1 ∼ Poisson(λ1), X2 ∼ Poisson(λ2), and X1 ⊥⊥ X2 (the notation ⊥⊥ means
that X1 is independent of X2).

Then the joint p.m.f. is

f_{X1,X2}(x1, x2) = P[X1 = x1, X2 = x2]
                  = P[X1 = x1] × P[X2 = x2]
                  = e^{-λ1} λ1^{x1}/x1! × e^{-λ2} λ2^{x2}/x2!
                  = e^{-(λ1+λ2)} λ1^{x1} λ2^{x2} / (x1! x2!)
Exercise 1.2. We consider a litter of piglets, and let the total number N of piglets be a
random variable with a Poisson distribution with mean λ. Let the number of male piglets
X1 be a random variable with a Binomial distribution of (n, θ), conditional on N = n. Find
the joint p.m.f. of N and X1 .

We know that N = X1 + X2 ∼ Poisson(λ) and X1 | N ∼ Bin(N, θ), where | denotes that
X1 is conditional on N, and where X2 = N − X1 is the number of female piglets.

Then the joint p.m.f. is

f_{X1,X2}(x1, x2) = P[X1 = x1, X2 = x2]
                  = P[X1 = x1, N = x1 + x2]
                  = P[X1 = x1 | N = x1 + x2] × P[N = x1 + x2]   (as P[A ∩ B] = P[A | B] P[B])
                  = C(x1 + x2, x1) θ^{x1} (1 − θ)^{x2} × λ^{x1+x2} e^{-λ} / (x1 + x2)!
                  = ((x1 + x2)! / (x1! x2!)) θ^{x1} (1 − θ)^{x2} λ^{x1+x2} e^{-λ} / (x1 + x2)!
                  = (λθ)^{x1} (λ(1 − θ))^{x2} e^{-λ} / (x1! x2!)

(here C(x1 + x2, x1) is the binomial coefficient).
Exercise 1.3. Verify that the p.m.f.s in Exercise 1.1 and Exercise 1.2 are identical. i.e.,
the chances of getting a male or a female piglet in each litter are the same.

Consider the p.m.f. of Exercise 1.2:

f_{X1,X2}(x1, x2) = (λθ)^{x1} (λ(1 − θ))^{x2} e^{-λ} / (x1! x2!)   for x1, x2 = 0, 1, 2, ...   (†)

Setting λ1 = λθ and λ2 = λ(1 − θ), so that λ1 + λ2 = λ, we can rewrite (†) as
e^{-(λ1+λ2)} λ1^{x1} λ2^{x2} / (x1! x2!), which is the p.m.f. of Exercise 1.1.

What is this saying?

Given λ1, λ2, I get a particular list of probabilities for the various outcomes (x1, x2) under the
model of Exercise 1.1. If I choose λ, θ so that λ1 = λθ and λ2 = λ(1 − θ), then under the model of
Exercise 1.2 I get exactly the same list of probabilities.

So p.m.f.s are the same, and we are not going to be able to tell the difference between the
two models by observing pig litters.

1.1.1 Discrete Bivariate Distributions

A discrete random variable has a countable state space (without loss of generality repre-
sentable using N = {0, 1, 2, 3, ...}) and so we can always tabulate a bivariate p.m.f.
          Y = 0           Y = 1           Y = 2           ...
X = 0     P_{X,Y}(0, 0)   P_{X,Y}(0, 1)   P_{X,Y}(0, 2)   ...
X = 1     P_{X,Y}(1, 0)   P_{X,Y}(1, 1)   P_{X,Y}(1, 2)   ...
X = 2     P_{X,Y}(2, 0)   P_{X,Y}(2, 1)   P_{X,Y}(2, 2)   ...
...       ...             ...             ...

Hence if E is any event concerning X, Y (so E is a subset of {(r, s) : r, s ∈ N}), then

P[(X, Y) ∈ E] = Σ_{(x,y)∈E} P[X = x, Y = y] = Σ_{(x,y)∈E} P_{X,Y}(x, y)

Exercise 1.4. Suppose we work with the model of Exercise 1.2, with N ∼ Po(λ) and
X1 | N ∼ Bin(N, θ). Then

P[X1 = x1] = Σ_{x2 = 0}^{∞} f_{X1,X2}(x1, x2)
           = Σ_{x2 = 0}^{∞} (λθ)^{x1} (λ(1 − θ))^{x2} e^{-λ} / (x1! x2!)
           = ((λθ)^{x1} e^{-λ} / x1!) Σ_{x2 = 0}^{∞} (λ(1 − θ))^{x2} / x2!
           = ((λθ)^{x1} e^{-λ} / x1!) e^{λ(1−θ)}
           = (λθ)^{x1} e^{-λθ} / x1!

So we verify X1 ∼ Po(λθ).

1.1.2 Continuous Bivariate Distributions

Definition 1.3. The random variables X and Y have an (absolutely) continuous joint
distribution if there is a function (a joint density) f_{X,Y} = f with
P[(X, Y) ∈ A] = ∫∫_A f(x, y) dx dy
for all (nice) A ⊆ R². We say f_{X,Y}(x, y) is the joint probability density of X, Y.

Remarks:
• Since probability is non-negative, f_{X,Y}(x, y) ≥ 0 everywhere.
• ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.
• f(x, y) could always be changed at a countable set of points without changing any of the
integrals ∫∫_A f(x, y) dx dy.

Also, (a) P[(X, Y) = (u, v)] = P[X = u, Y = v] = 0 for fixed u, v;
(b) if C is a smooth curve in R², then P[(X, Y) ∈ C] = 0.

We often visualize f_{X,Y} by contour plots or other plots.


Exercise 1.5. The random variables X and Y have joint density

f_{X,Y}(x, y) = c if 0 < x < y < 1, and 0 otherwise.

i) What is c?

We know that ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1. So we integrate the p.d.f. to see what we get:

∫_0^1 ∫_0^1 f(x, y) dx dy   (f is zero elsewhere)
= ∫_0^1 ∫_0^y c dx dy       (as x < y)
= c ∫_0^1 y dy
= c/2

But then c/2 = 1 ⇒ c = 2.

ii) What about the relationship of X and Y?

From the definition of the p.d.f., P[X < Y] = 1.

Example 1.2. Suppose we have X1, X2, ..., Xn, independent identically distributed random
variables, each distributed as N(µ, σ²). Consider µ̂ (the sample mean) and σ̂² (the sample
variance). In fact, µ̂ and σ̂² are a) clearly random, and b) have a joint density.
Example 1.3. Let the random variable X denote the systolic blood pressure, and the
random variable Y denote the diastolic blood pressure. We know that X and Y have some
correlation.

So a possible model for the distributions of X and Y would be:

X ∼ N(µ_S, σ_S²)
Y | X ∼ N(α + βX, σ_D²)

In fact, we will see that f_{X,Y}(x, y) = f_X(x) f_{Y|X=x}(y|x), which will be discussed later.

If X, Y have a joint distribution (whether discrete or continuous), we are interested in


marginal distributions.

1.1.3 Marginal Distributions

Definition 1.4. If X, Y have joint cumulative distribution function

F_{X,Y}(x, y) = P[X ≤ x, Y ≤ y],

then the distribution functions of X, Y alone,

F_X(x) = P[X ≤ x] = lim_{y→∞} F_{X,Y}(x, y)
F_Y(y) = P[Y ≤ y] = lim_{x→∞} F_{X,Y}(x, y),

are the marginal distributions.

Definition 1.5. F_X(x) = lim_{y→∞} F_{X,Y}(x, y) is the marginal cumulative distribution function of X.
Discrete case: P_X(x) = P[X = x] is the marginal p.m.f.
Continuous case: f_X(x) = d/dx F_X(x) is the marginal p.d.f.

To see these are sensible definitions, notice that

P_X(x) = P[X = x] = Σ_y P[X = x, Y = y]   (summing over all possible values of y)
       = Σ_y P_{X,Y}(x, y)
       = P[X = x, −∞ < Y < ∞]

Similarly in the continuous case,

f_X(x) = d/dx F_X(x)
       = d/dx ∫_{−∞}^{x} ∫_{−∞}^{∞} f_{X,Y}(u, y) dy du
       = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy

Very similar results hold for marginal p.m.f.s and densities for joint distributions of three or
more random variables.

Exercise 1.6. We are given a bag with 5 coins in it:
1 coin is double-headed, 1 coin is double-tailed, 1 coin has P[head] = 1/4, and 2 coins are fair.

We pick a coin at random, each with probability 1/5, and then toss it twice.

Let θ = P[Head | coin] (so θ varies depending on which coin has been picked), and let
X = number of heads observed.

What is the joint distribution of θ, X?

It is discrete, as θ takes discrete values, so we find the joint p.m.f.

We construct a table of θ and X values below (rows indexed by θ, columns by X):

            X = 0                 X = 1                    X = 2
θ = 0       (1/5) × 1             (1/5) × 0                (1/5) × 0
θ = 1/4     (1/5) × (3/4)(3/4)    (1/5) × 2 × (1/4)(3/4)   (1/5) × (1/4)(1/4)
θ = 1/2     (2/5) × (1/2)(1/2)    (2/5) × 2 × (1/2)(1/2)   (2/5) × (1/2)(1/2)
θ = 1       (1/5) × 0             (1/5) × 0                (1/5) × 1

Figure 1: A rough table

We calculate out the values, and then tabulate the marginal p.m.f.s of X and θ as well. We
call the p.m.f. marginal, because it appears in the margins of the columns and rows: the
rightmost column gives the probability of picking a particular coin (the marginals of θ), and
the bottom row gives the marginal p.m.f. of X.

            X = 0     X = 1     X = 2     marginal of θ
θ = 0       16/80     0         0         1/5
θ = 1/4     9/80      6/80      1/80      1/5
θ = 1/2     8/80      16/80     8/80      2/5
θ = 1       0         0         16/80     1/5
marginal    33/80     22/80     25/80

Figure 2: The final table
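The table in Figure 2 can also be generated mechanically. The following sketch (written for this write-up rather than taken from the lectures; it assumes Python's fractions module) builds the joint p.m.f. from the coin probabilities and the Binomial(2, θ) toss distribution, then prints the row and column marginals:

```python
from fractions import Fraction as F

thetas = [F(0), F(1, 4), F(1, 2), F(1)]            # P[head] for each coin type
coin_probs = [F(1, 5), F(1, 5), F(2, 5), F(1, 5)]  # chance of picking that coin type

def binom2(theta, x):
    # P[X = x heads in 2 tosses | theta]
    coeff = [1, 2, 1][x]
    return coeff * theta**x * (1 - theta)**(2 - x)

joint = [[p * binom2(t, x) for x in range(3)] for t, p in zip(thetas, coin_probs)]

for t, row in zip(thetas, joint):
    print(f"theta={t}:", [str(v) for v in row], " row marginal:", sum(row))
print("marginal of X:", [sum(joint[i][x] for i in range(4)) for x in range(3)])
```

The printed fractions reproduce the entries of Figure 2 (with 16/80 reduced to 1/5, and so on).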

1.1.4 Conditional Distributions

Definition 1.6. If X, Y have a discrete joint distribution with p.m.f. P_{X,Y}(x, y), then the
conditional probability mass function of Y given X = x is

P_{Y|X}(y|x) = P_{X,Y}(x, y) / P_X(x)

Exercise 1.7. In Exercise 1.6, find the conditional distribution of X | θ = 1/4.

P_{X|θ}(x | θ = 1/4) = P_{X,θ}(x, 1/4) / P_θ(1/4)

We refer to the θ = 1/4 row of Figure 2 (the second row) and divide by its marginal entry 1/5
(in the rightmost column) to give probabilities summing to 1. Therefore, what we get is:

X | θ = 1/4      0                 1                 2
                 (9/80)/(1/5)      (6/80)/(1/5)      (1/80)/(1/5)
                 = 9/16            = 6/16            = 1/16

So the p.m.f. is P[X = 0] = 9/16, P[X = 1] = 6/16 and P[X = 2] = 1/16.
Exercise 1.8. In Exercise 1.6, find the conditional distribution of θ | X = 0.

P_{θ|X}(θ | X = 0) = P_{θ,X}(θ, 0) / P_X(0)

We refer to the X = 0 column of Figure 2 and divide by its marginal entry 33/80 (at the
bottom) to give probabilities summing to 1. Therefore, what we get is:

θ | X = 0        0                  1/4                1/2                1
                 (16/80)/(33/80)    (9/80)/(33/80)     (8/80)/(33/80)     0
                 = 16/33            = 9/33             = 8/33             = 0

So the p.m.f. is P[θ = 0] = 16/33, P[θ = 1/4] = 9/33, P[θ = 1/2] = 8/33 and P[θ = 1] = 0.

We now consider the continuous case.

In this case, P[X = x] = 0, so a p.m.f.-type definition leads to 0/0, which is BAD.

Limiting arguments lead instead to the consideration of conditional probability density
functions.

Definition 1.7. If X, Y have a continuous joint distribution with p.d.f. f_{X,Y}(x, y), then the
conditional probability density function of Y given X = x is

f_{Y|X}(y | X = x) = f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x)

and by the formula for the marginal density, it is immediate that ∫_{−∞}^{∞} f_{Y|X}(y | X = x) dy = 1.

Recall: X, Y are independent random variables if

P[X ∈ A, Y ∈ B] = P[X ∈ A] P[Y ∈ B]   (†)

Exercise 1.9. Show that the random variables X and Y are independent according to (†)
if and only if

F_{X,Y}(x, y) = F_X(x) F_Y(y)   (††)

for all −∞ < x, y < ∞, if and only if

f_{X,Y}(x, y) = f_X(x) f_Y(y)   (†††)

for all −∞ < x, y < ∞.

Proof. We will argue that (†) ⇒ (††) ⇒ (†††) ⇒ (†).

For (†) ⇒ (††): applying

P[X ∈ A, Y ∈ B] = P[X ∈ A] P[Y ∈ B]

with A = (−∞, x], B = (−∞, y] gives

P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y],

which is F_{X,Y}(x, y) = F_X(x) F_Y(y), i.e. (††). So (†) ⇒ (††).

We now proceed to densities. We know that F_X(x) = ∫_{−∞}^{x} f_X(a) da (using a dummy
variable a), so d/dx F_X(x) = f_X(x), and similarly d/dy F_Y(y) = f_Y(y). Hence

∂²/∂x∂y [F_X(x) F_Y(y)] = f_X(x) f_Y(y).

On the other hand, F_{X,Y}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{X,Y}(a, b) db da, so

∂²/∂x∂y F_{X,Y}(x, y) = f_{X,Y}(x, y).

If (††) holds, these two mixed partial derivatives agree, so f_{X,Y}(x, y) = f_X(x) f_Y(y).
(This step depends on X, Y having a joint density.) Therefore (††) ⇒ (†††).

Finally, if f_{X,Y}(x, y) = f_X(x) f_Y(y), then

P[X ∈ A, Y ∈ B] = ∫_A ∫_B f_{X,Y}(a, b) db da
                = ∫_A ∫_B f_X(a) f_Y(b) db da
                = ∫_A f_X(a) da ∫_B f_Y(b) db
                = P[X ∈ A] P[Y ∈ B]

which is what we want. Hence (†††) ⇒ (†).

It is now immediate that if X, Y are independent with a joint density, then f_{X|Y}(x | Y = y) =
f_X(x).

1.2 Multivariate Distributions

Consider random variables X1 , X2 , ..., Xn . Their joint distribution is multivariate. It can be


useful to view these variables as defining a point in Rn .
Vector notation can be helpful: we write X = (X1, ..., Xn)^T for the column vector with
entries X1, ..., Xn.

Definition 1.8. The joint cumulative distribution function of X1 , X2 , ..., Xn is

FX1 ,X2 ,...,Xn (x1 , x2 , ..., xn ) = P[X1 ≤ x1 , X2 ≤ x2 , ..., Xn ≤ xn ]

We see that Definition 1.8 is very similar to the definition for the bivariate case. In
particular, the marginal and conditional distribution functions are similar.

Definition 1.9. If X = (X1 , X2 , ..., Xn ) has a discrete distribution (only a countable range
of values), then its probability mass function is

PX1 ,X2 ,...,Xn (x1 , x2 , ..., xn ) = P[X1 = x1 , X2 = x2 , ..., Xn = xn ]

Note that we use vector notation in this definition.

Definition 1.10. X = (X1 , X2 , ..., Xn ) has (absolutely) continuous distribution if there is


non negative f (x1 , x2 , ..., xn ) such that for any (nice) A ⊆ Rn ,
Z Z
P[(X1 , X2 , ..., Xn ) ∈ A] = . . . fX1 ,X2 ,...,Xn (x1 , x2 , ..., xn ) dx1 dx2 . . . dxn
A

Definition 1.11. The fX1 ,X2 ,...,Xn (x1 , x2 , ..., xn ) in Definition 1.10 is called the multivari-
ate probability density function (multivariate p.d.f.).

Sometimes, we use vector notation:

P[X ∈ A] = ∫_A f(x) dx,

where x ranges over vectors in R^n.

By the Fundamental Theorem of Calculus,

∂^n / (∂x1 ∂x2 ... ∂xn) F_{X1,X2,...,Xn}(x1, x2, ..., xn) = f_{X1,X2,...,Xn}(x1, x2, ..., xn)

if X1, X2, ..., Xn have a multivariate p.d.f.

1.3 Review of Notation

• The c.d.f. of (X1, X2, ..., Xn) at x = (x1, x2, ..., xn) is F_X(x) = F_{X1,X2,...,Xn}(x1, x2, ..., xn).

• The p.m.f. of (X1, X2, ..., Xn) at x = (x1, x2, ..., xn) in the discrete case is P_X(x) =
P_{X1,X2,...,Xn}(x1, x2, ..., xn).

• The p.d.f. of (X1, X2, ..., Xn) at x = (x1, x2, ..., xn) in the continuous case is f_X(x) =
f_{X1,X2,...,Xn}(x1, x2, ..., xn).

• The marginal c.d.f. of Xj is F_j(xj) = lim_{x1,...,xj−1,xj+1,...,xn → ∞} F_{X1,X2,...,Xn}(x1, x2, ..., xn).

• The marginal p.m.f. of Xj is P_j(xj) = Σ ... Σ P_{X1,X2,...,Xn}(x1, x2, ..., xn), where we sum
over all variables except xj.

• The marginal p.d.f. of Xj is
f_j(xj) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f_{X1,X2,...,Xn}(x1, x2, ..., xn) dx1 ... dxj−1 dxj+1 ... dxn = d/dxj F_j(xj).

• The marginal p.d.f. of the pair (Xj, Xk) is f_{j,k}(xj, xk), where we integrate out all variables
excluding xj, xk.

• The marginal p.m.f. of the pair (Xj, Xk) is P_{j,k}(xj, xk), where we sum over all variables
excluding xj, xk.

• The conditional p.d.f. of Xj given the Xi for i ≠ j is

g_j(xj | x1, x2, ..., xj−1, xj+1, ..., xn) = f_{X1,...,Xn}(x1, ..., xn) / f(all x's except xj)
                                          = f_{X1,...,Xn}(x1, ..., xn) / ∫_{−∞}^{∞} f_{X1,...,Xn}(x1, ..., xn) dxj

1.4 Expectation

Recall that if a discrete random variable X has probability mass function P_X(x) = P[X = x],
then E[X] = Σ_x x P_X(x).

Similarly, if a continuous random variable X has probability density function f_X(x), then
E[X] = ∫_{−∞}^{∞} x f_X(x) dx.

More generally, for a discrete random variable X, E[h(X)] = Σ_x h(x) P_X(x), and for a
continuous random variable X, E[h(X)] = ∫_{−∞}^{∞} h(x) f_X(x) dx.

In both cases, we require absolute convergence for the expectation to be defined.

The following properties and definitions follow from the above.

• Var[X] = E[(X − µ)²] where µ = E[X]
         = E[X²] − (E[X])²

• E[aX + b] = a E[X] + b for a, b ∈ R

• E[(aX + b)²] = E[a²X² + 2abX + b²]
              = a² E[X²] + 2ab E[X] + b²

• Var[aX + b] = a² Var[X]

• Cov[X1, X2] = E[(X1 − µ1)(X2 − µ2)]
             = E[X1 X2] − µ1 µ2

If X1, X2 are independent, then E[X1 X2] = E[X1] E[X2] and consequently Cov[X1, X2] =
E[X1 X2] − µ1 µ2 = µ1 µ2 − µ1 µ2 = 0.

• s.d.[X] = √Var[X]

• Cor[X1, X2] = Cov[X1, X2] / √(Var[X1] Var[X2]) = Cov[X1, X2] / (s.d.[X1] s.d.[X2])

Definition 1.12. Multivariate expectation
Given an n-dimensional random vector X = (X1, X2, ..., Xn),

E[h(X)] = E[h(X1, X2, ..., Xn)]

• Discrete case: E[h(X)] = Σ_x h(x) P_X(x), where P_X(x) is the joint p.m.f. of X.

• Continuous case: E[h(X)] = ∫_{R^n} h(x) f_X(x) dx, where f_X(x) is the joint p.d.f. of X
  = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} h(x1, ..., xn) f_{X1,...,Xn}(x1, ..., xn) dx1 ... dxn   (n integrals)

Example 1.4. If X = (X1, X2, X3) has a continuous distribution, take h(x1, x2, x3) = x1;
then

E[h(X)] = E[X1] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 f_{X1,X2,X3}(x1, x2, x3) dx1 dx2 dx3

Exercise 1.10. Let X and Y be independent continuous random variables.

Prove that E[g(X)h(Y)] = E[g(X)] E[h(Y)].

E[g(X)h(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) f_{X,Y}(x, y) dx dy
            = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) f_X(x) f_Y(y) dx dy   (by independence, the joint density is the product of the marginals)
            = ∫_{−∞}^{∞} h(y) f_Y(y) ( ∫_{−∞}^{∞} g(x) f_X(x) dx ) dy   (by Fubini's Theorem)
            = ( ∫_{−∞}^{∞} g(x) f_X(x) dx ) ( ∫_{−∞}^{∞} h(y) f_Y(y) dy )
            = E[g(X)] E[h(Y)]
Exercise 1.11. Let X, Y, Z be independent random variables with Poisson distributions and
with means λ, µ and ν respectively. Find E[X²YZ].

By independence, we know that E[X²YZ] = E[X²] E[Y] E[Z].

We also know Var[X] = E[X²] − (E[X])², so λ = E[X²] − λ², hence E[X²] = λ² + λ.

Hence E[X²YZ] = E[X²] E[Y] E[Z] = (λ² + λ)µν.

Theorem 1.1. Cauchy-Schwarz Inequality
If X and Y are random variables with finite second moments, then E[XY] ≤ √(E[X²] E[Y²]).

Proof. Define h(t) = E[(tX − Y)²], t ∈ R.

We know that h(t) ≥ 0 as (tX − Y)² ≥ 0.

Also, h(t) = E[t²X² − 2tXY + Y²] = t² E[X²] − 2t E[XY] + E[Y²], which is a quadratic in t.

We consider two cases.

Case 1: h(t) > 0 everywhere.

Then the discriminant of the quadratic must be negative, as there are no real roots, i.e.

(2 E[XY])² − 4 E[X²] E[Y²] < 0 ⇒ E[XY] < √(E[X²] E[Y²])

Case 2: h(t) = 0 for a single unique value of t, say t0, so the quadratic has a repeated root.
Then

(2 E[XY])² − 4 E[X²] E[Y²] = 0 ⇒ E[XY] = √(E[X²] E[Y²])

This again gives the Cauchy-Schwarz Inequality, but with equality holding.

Note in this case, h(t0) = 0 ⇒ E[(t0 X − Y)²] = 0 ⇒ P[t0 X = Y] = 1.

1.5 Conditional Expectation

We often face a common practical problem: say we have two random variables X1 and
X2 which may be related to each other. If we observe X2 = x2, what is the expected value
of X1, given this new information?

Definition 1.13. The conditional expectation E[X1 | X2 = x2] is defined to be:
• ∫_{−∞}^{∞} x1 f_{X1|X2=x2}(x1 | x2) dx1 in the continuous case;
• Σ_{x1} x1 P_{X1|X2}(x1 | x2) in the discrete case.

Comments:
1) E[X1 | X2] is really random, because it depends on the value of X2.
2) If we think of E[X1 | X2 = x2] as a function of x2, then it describes a curve: the regression
curve. For example: a regression line E[X1 | X2 = x2] = β0 + β1 x2.

Since E[X1 | X2] is really random, we can ask questions about its distribution.

Theorem 1.2. The marginal expectation theorem states that E[E[X1 | X2]] = E[X1].

Proof. We can show this for the continuous case by using these two properties:
• f_{X1}(x1) = ∫_{−∞}^{∞} f_{X1,X2}(x1, x2) dx2   (†)
• f_{X1,X2}(x1, x2) = f_{X1|X2=x2}(x1 | x2) f_{X2}(x2)   (††)

Then:
E[X1] = ∫_{−∞}^{∞} x1 f_{X1}(x1) dx1   (integrating against the marginal density)
      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 f_{X1,X2}(x1, x2) dx2 dx1   by (†)
      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 f_{X1|X2=x2}(x1 | x2) f_{X2}(x2) dx2 dx1   by (††)
      = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} x1 f_{X1|X2=x2}(x1 | x2) dx1 ) f_{X2}(x2) dx2   (the inner integral is E[X1 | X2 = x2])
      = E[E[X1 | X2]]

We follow the same argument to prove the discrete case, but with sums instead of integrals.

Exercise 1.12. Suppose X ∼ Uniform(0, 1) and Y | X = x ∼ Uniform(x, 1). Find E[Y | X = x]
and hence show E[Y] = 3/4.

Since Y | X = x is uniform on (x, 1), E[Y | X = x] = (1 + x)/2.

So E[Y] = E[E[Y | X]]
        = E[(1 + X)/2]
        = 1/2 + (1/2) E[X]

As X ∼ Uniform(0, 1), E[X] = 1/2, so E[Y] = 1/2 + (1/2)(1/2) = 3/4.
Exercise 1.13. Suppose Θ ∼ Uniform(0, 1) and X | Θ = θ ∼ Binomial(2, θ). Find E[X | Θ]
and hence show E[X] = 1.

E[X | Θ = θ] = 2θ

So E[X] = E[E[X | Θ]] = E[2Θ] = 2 E[Θ] = 2(1/2) = 1.
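A quick Monte Carlo sketch (not in the original notes; it assumes numpy) confirms the two tower-property calculations, E[Y] = 3/4 for Exercise 1.12 and E[X] = 1 for Exercise 1.13:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Exercise 1.12: X ~ U(0,1), Y | X = x ~ U(x, 1); E[Y] should be 3/4
X = rng.uniform(0, 1, n)
Y = rng.uniform(X, 1)            # lower bound of each draw varies with X
print("E[Y] ~", Y.mean())        # about 0.75

# Exercise 1.13: Theta ~ U(0,1), X | Theta ~ Bin(2, Theta); E[X] should be 1
Theta = rng.uniform(0, 1, n)
X2 = rng.binomial(2, Theta)
print("E[X] ~", X2.mean())       # about 1.0
```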

Here is a simple extension of Theorem 1.2.

Theorem 1.3. Marginal expectation of a transformed random variable
Given random variables X1, X2 and a function h, E[E[h(X1) | X2]] = E[h(X1)].

Proof. The proof is exactly the same as in Theorem 1.2, but with x1 replaced by h(x1).

Vital special case: h(X1) = (X1 − µ1)² where µ1 = E[X1]. Then we get the following theorem.

Theorem 1.4. Marginal variance theorem

Var[X1] = E[Var[X1 | X2]] + Var[E[X1 | X2]]

Proof. We note that Var[X1 | X2 = x2] = E[X1² | X2 = x2] − (E[X1 | X2 = x2])², so

E[Var[X1 | X2]] = E[E[X1² | X2]] − E[(E[X1 | X2])²]   (1)

Also,

Var[E[X1 | X2]] = E[(E[X1 | X2])²] − (E[E[X1 | X2]])²   (2)

Adding (1) and (2) gives E[Var[X1 | X2]] + Var[E[X1 | X2]] = E[E[X1² | X2]] − (E[E[X1 | X2]])².

So using Theorem 1.2 and Theorem 1.3, we get E[Var[X1 | X2]] + Var[E[X1 | X2]] = E[X1²] −
(E[X1])² = Var[X1].
Exercise 1.14. We recall Exercise 1.13, where Θ ∼ Uniform(0, 1) and X | Θ = θ ∼
Binomial(2, θ).

Show that Var[X] = 2/3 and comment on the effect of observing Θ = θ.

Now E[X | Θ = θ] = 2θ, so

Var[E[X | Θ]] = E[(2Θ)²] − (E[2Θ])²
             = ∫_0^1 4θ² dθ − 1 = 1/3

E[Var[X | Θ]] = E[2Θ(1 − Θ)]
             = 2 E[Θ] − 2 E[Θ²]
             = 1 − 2 ∫_0^1 θ² dθ = 1/3

So Var[X] = E[Var[X | Θ]] + Var[E[X | Θ]] = 2/3.

If we observe Θ = θ, then the relevant measure of uncertainty is Var[X | Θ = θ] = 2θ(1 − θ).

On average, E[Var[X | Θ]] = 1/3.

So we have the variance of X reducing from 2/3 to an average of 1/3 if we observe Θ.
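As a sanity check on the marginal variance theorem, the following sketch (assuming numpy; illustrative only, not part of the notes) compares Var[X] with E[Var[X | Θ]] + Var[E[X | Θ]] by simulation for the model of Exercises 1.13 and 1.14:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

Theta = rng.uniform(0, 1, n)
X = rng.binomial(2, Theta)

var_x = X.var()                                  # should be about 2/3
e_cond_var = (2 * Theta * (1 - Theta)).mean()    # E[Var[X|Theta]] = E[2*Theta*(1-Theta)], about 1/3
var_cond_mean = (2 * Theta).var()                # Var[E[X|Theta]] = Var[2*Theta], about 1/3

print(var_x, e_cond_var + var_cond_mean)         # both close to 0.667
```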

2 Multivariate Normal Distribution (and associates)

Recall that a random variable X is normally distributed, X ∼ N(µ, σ²), with mean µ and
variance σ², if it has p.d.f.

f(x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),

with parameters µ ∈ R, σ > 0.

Consider random variables X1, X2, ..., Xn, each independent and identically distributed as
N(0, 1). As they are independent, their joint p.d.f. is the product of their individual p.d.f.s.

So f(x1, x2, ..., xn) = f(x1) f(x2) ... f(xn)
                      = (2π)^{−n/2} exp(−x1²/2 − x2²/2 − ... − xn²/2)
                      = (2π)^{−n/2} exp(−(1/2) x^T x)

where x is the column vector (x1, x2, ..., xn)^T.

Usually, we write X = (X1, X2, ..., Xn)^T ∼ N(0, I), where 0 is a vector of n zeroes, and I is
the n × n identity matrix.

However, there can be different notation for this. We can write:

• X ∼ MVN(0, I)
• X ∼ MVN_n(0, I)
• X ∼ N_n(0, I)

which all mean the same thing.

We can use this notation to produce MVN random variables with different means, variances,
and covariances.

In a nutshell, we take linear combinations of MVN(0, I) random variables to produce systems
of quite general MVN random variables.

Question: How do we obtain joint p.d.f.s of random variables produced in this kind of way?

2.1 Transformation of a Multivariate p.d.f.

Suppose X1, X2, ..., Xn have a joint p.d.f. f(x1, x2, ..., xn) = f_X(x).

Suppose further we have a one-to-one transformation of the state space for X (the range of
possible x = (x1, x2, ..., xn)^T) sending X1, X2, ..., Xn to new random variables:

Y1 = r1(X1, X2, ..., Xn)
Y2 = r2(X1, X2, ..., Xn)
...
Yn = rn(X1, X2, ..., Xn)

Hence we obtain Y = (Y1, Y2, ..., Yn)^T, ranging over a new state space of possible values
y = (y1, y2, ..., yn)^T.

If the r's are smooth, there is a formula for the joint p.d.f. of Y.

As the transformation is one-to-one, we can invert it to get:

X1 = s1(Y1, Y2, ..., Yn)
X2 = s2(Y1, Y2, ..., Yn)
...
Xn = sn(Y1, Y2, ..., Yn)

We need the s_i(...) to be continuous and differentiable.

We then want to calculate the Jacobian

J = ∂(s1, ..., sn)/∂(y1, ..., yn) = det [ ∂s_i/∂y_j ],

the determinant of the n × n matrix whose (i, j)th entry is ∂s_i/∂y_j.

Now it can be shown that the joint p.d.f. of Y is

f_Y(y) = |J| f_X(s(y)) for possible y, and 0 otherwise.

Exercise 2.1. The random variables X1 and X2 have joint p.d.f.

f_{X1,X2}(x1, x2) = 4 x1 x2 for 0 < x1 < 1, 0 < x2 < 1, and 0 otherwise.

Compute the joint p.d.f. of Y1, Y2 where Y1 = X1/X2 and Y2 = X1 X2.

[Figure 3: the support of (X1, X2), the unit square 0 < x1 < 1, 0 < x2 < 1. Figure 4: the
corresponding support of (Y1, Y2), the region bounded by the curves y2 = y1 and y1 y2 = 1.
The shaded regions in the figures show where the values of (x1, x2) and (y1, y2) must lie,
respectively.]

Step 1: We express X1, X2 in terms of Y1, Y2:

X1 = √(Y1 Y2) = s1(Y1, Y2),    X2 = √(Y2/Y1) = s2(Y1, Y2).

Step 2: We calculate J.

J = det [ ∂s1/∂y1   ∂s1/∂y2 ; ∂s2/∂y1   ∂s2/∂y2 ]
  = det [ (1/2)√(y2/y1)       (1/2)√(y1/y2) ;
          −(1/2)√(y2/y1³)     (1/2)·1/√(y1 y2) ]
  = (1/2)√(y2/y1) × (1/2)·1/√(y1 y2) − ( −(1/2)√(y2/y1³) ) × (1/2)√(y1/y2)
  = 1/(4 y1) + 1/(4 y1)
  = 1/(2 y1)

Step 3: Calculate the joint density.

So f_{Y1,Y2}(y1, y2) = |J| f_{X1,X2}(s1(y), s2(y)). Hence

f_{Y1,Y2}(y1, y2) = (1/(2 y1)) × 4 √(y1 y2) × √(y2/y1) = 2 y2/y1 for (y1, y2) in the appropriate range
(the region of Figure 4), and 0 otherwise.
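The Jacobian in Step 2 can also be checked symbolically. This sketch (assuming sympy is available; it is not part of the original notes) recomputes J = 1/(2 y1) and the transformed density 2 y2/y1:

```python
import sympy as sp

y1, y2 = sp.symbols('y1 y2', positive=True)

# Inverse transformation: x1 = sqrt(y1*y2), x2 = sqrt(y2/y1)
s1 = sp.sqrt(y1 * y2)
s2 = sp.sqrt(y2 / y1)

# Jacobian determinant of (s1, s2) with respect to (y1, y2)
J = sp.Matrix([[sp.diff(s1, y1), sp.diff(s1, y2)],
               [sp.diff(s2, y1), sp.diff(s2, y2)]]).det()
print(sp.simplify(J))                  # 1/(2*y1)

f_X = 4 * s1 * s2                      # original joint density 4*x1*x2 evaluated at s(y)
f_Y = sp.simplify(sp.Abs(J) * f_X)
print(f_Y)                             # 2*y2/y1
```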

2.2 Linear Transformations

Suppose A is an n × n non-singular matrix. We can use A as a linear transformation where

A (X1, ..., Xn)^T = (Y1, ..., Yn)^T,

i.e. Y_i = Σ_j A_{ij} X_j.

As A is non-singular, A^{−1} exists, and hence

(X1, ..., Xn)^T = A^{−1} (Y1, ..., Yn)^T.

The Jacobian is J = det(A^{−1}) = 1/det(A).

Then f_Y(y) = (1/|det(A)|) f_X(A^{−1} y).   (See Exercise Sheet 2.)

This leads us to the following definition.

Definition 2.1. If A is a non-singular n × n matrix and Y = AX, and if X has a joint p.d.f.
f_X(x), then the joint p.d.f. of Y is

f_Y(y) = (1/|det(A)|) f_X(A^{−1} y)

Example 2.1. Bivariate normal distribution

Let Z1, Z2 be two independent identically distributed random variables, both distributed as
N(0, 1). They are subjected to the linear transformation

X1 = σ1 Z1 + µ1
X2 = σ2 (ρ Z1 + √(1 − ρ²) Z2) + µ2

and we want to find the distribution of X1 and X2.

We can write this as

(X1, X2)^T = A (Z1, Z2)^T + (µ1, µ2)^T,   where A = [ σ1, 0 ; σ2 ρ, σ2 √(1 − ρ²) ].

Then det(A) = σ1 σ2 √(1 − ρ²), and

A^{−1} = (1/(σ1 σ2 √(1 − ρ²))) [ σ2 √(1 − ρ²), 0 ; −σ2 ρ, σ1 ].

The joint density of (Z1, Z2) is

f_{Z1,Z2}(z1, z2) = (1/2π) exp(−(z1² + z2²)/2) = (1/2π) e^{−|z|²/2}.

Then f_X(x) = (1/|det(A)|) f_Z(A^{−1}(x − µ))
            = (1/(2π |det(A)|)) exp( −(1/2) |A^{−1}(x − µ)|² )
            = (1/(2π σ1 σ2 √(1 − ρ²))) exp( −(1/2) Q(x)/(1 − ρ²) )

where Q(x) = ((x1 − µ1)/σ1)² − 2ρ ((x1 − µ1)/σ1)((x2 − µ2)/σ2) + ((x2 − µ2)/σ2)².

Arguing from X1 = σ1 Z1 + µ1 and X2 = σ2(ρ Z1 + √(1 − ρ²) Z2) + µ2,

E[X1] = µ1,   E[X2] = µ2,
Var[X1] = σ1²,   Var[X2] = σ2²,
Cov[X1, X2] = ρ σ1 σ2.

Then (X1, X2)^T ∼ N( (µ1, µ2)^T , [ σ1², ρσ1σ2 ; ρσ1σ2, σ2² ] ).
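The moment claims at the end of Example 2.1 are easy to check by simulation. A sketch (assuming numpy; the parameter values are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 0.5, 0.7   # illustrative parameters
n = 400_000

Z1 = rng.standard_normal(n)
Z2 = rng.standard_normal(n)
X1 = s1 * Z1 + mu1
X2 = s2 * (rho * Z1 + np.sqrt(1 - rho**2) * Z2) + mu2

print(X1.mean(), X2.mean())       # approximately mu1, mu2
print(X1.var(), X2.var())         # approximately s1^2, s2^2
print(np.cov(X1, X2)[0, 1])       # approximately rho*s1*s2 = 0.7
```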
Exercise 2.2. From Example 2.1, compute
a) the marginal distribution of X1;
b) the conditional distribution of X1 given X2 = x2.

For a), we need to compute the marginal density of X1:

∫_{−∞}^{∞} f_{X1,X2}(x1, x2) dx2
∝ ∫_{−∞}^{∞} exp( −(1/2) Q(x)/(1 − ρ²) ) dx2   (we complete the square in Q(x1, x2) with respect to x2)
= ∫_{−∞}^{∞} exp( −(1/(2(1 − ρ²))) [ ((x2 − µ2)/σ2 − ρ(x1 − µ1)/σ1)² + ((x1 − µ1)/σ1)² − ρ²((x1 − µ1)/σ1)² ] ) dx2
∝ exp( −(1/(2(1 − ρ²))) ( ((x1 − µ1)/σ1)² − ρ²((x1 − µ1)/σ1)² ) )
= exp( −(1/2) ((x1 − µ1)/σ1)² )

which is proportional to the p.d.f. of N(µ1, σ1²).

Two proportional p.d.f.s must be equal since they both integrate to 1, so the marginal
distribution is N(µ1, σ1²).

For b), we note that f_{X1|X2=x2}(x1 | x2) ∝ f_{X1,X2}(x1, x2) as a function of x1.

Treating all x2's as constant,

f_{X1,X2}(x1, x2) ∝ exp( −(1/(2(1 − ρ²))) ( (x1 − µ1)/σ1 − ρ(x2 − µ2)/σ2 )² )

This is proportional to the p.d.f. of N( µ1 + ρ(σ1/σ2)(x2 − µ2), σ1²(1 − ρ²) ).

So this is the conditional p.d.f. of X1 | X2 = x2.

Notice: marginal distributions are normal and easy to compute, and conditional distributions
are normal.

If the correlation is zero, then the random variables are independent. You can substitute ρ = 0
and check.

These features persist, with essentially the same proofs, for the multivariate normal distribution.

Definition 2.2. Multivariate normal distribution

X is multivariate normal if it has p.d.f.

f_X(x; µ, V) = (1/√((2π)^p det(V))) exp( −(1/2)(x − µ)^T V^{−1} (x − µ) )

where µ = (µ1, ..., µp)^T is the mean vector and V is the p × p symmetric, positive definite
variance-covariance matrix with (i, j)th entry Cov[Xi, Xj].

We can see that µ = (µ1, ..., µp)^T = E[X] = (E[X1], ..., E[Xp])^T, so µi = E[Xi].

Similarly, we see that V has diagonal entries σi² and off-diagonal (i, j) entries ρ_{ij} σi σj:

V = [ σ1²         ρ12 σ1 σ2    ...   ρ1p σ1 σp
      ρ12 σ1 σ2   σ2²          ...   ρ2p σ2 σp
      ...         ...          ...   ...
      ρ1p σ1 σp   ρ2p σ2 σp    ...   σp²       ]

We hence see that σi² = E[(Xi − µi)²] and ρ_{ij} σi σj = E[(Xi − µi)(Xj − µj)].

Also, V can be written as E[(X − µ)(X − µ)^T] = E[X X^T] − µ µ^T.

Then for our parameters, we have p means µi, p variances σi², and (1/2) p(p − 1) correlations ρ_{ij}.

We write X ∼ MVN(µ, V).

Question: Why is V positive definite?

Consider the scalar density (1/√(2πσ²)) e^{−u²/(2σ²)}. We know that if this is a p.d.f., integrating
it must give us the value 1; in other words, ∫_{−∞}^{∞} e^{−u²/(2σ²)} du < ∞. If we replaced σ² by a
negative number, this integral would be infinite, and hence it would not be possible for this to
be a p.d.f. Similarly, V has to be positive definite, or else the joint p.d.f. will not integrate
properly. The positive definiteness of V also means the correlations ρ_{ij} must satisfy suitable
relationships.

When p = 2, the contours of the joint p.d.f. are ellipses.

When p = 3, the contours of the joint p.d.f. are ellipsoids.

When p ≥ 4, the contours of the joint p.d.f. are hyper-ellipsoids.

All conditional and marginal distributions of an MVN are themselves MVN.


Exercise 2.3. Suppose Z1, Z2, ..., Zn are independent identically distributed N(0, 1).
Consequently, Z = (Z1, Z2, ..., Zn)^T is MVN(0, I_n), where I_n is the n × n identity matrix.

Show that Y = a + BZ ∼ MVN(a, BB^T) if B is a non-singular n × n matrix.

We show this by using calculations similar to the bivariate example.

Z = B^{−1}(Y − a), and by the transformation-of-variables formula above we get

f_Y(y) = (1/((2π)^{n/2} |det(B)|)) exp( −(1/2)(B^{−1}(y − a))^T (B^{−1}(y − a)) )

because the joint p.d.f. of Z1, Z2, ..., Zn is

(1/(2π)^{n/2}) exp( −(1/2)(z1² + ... + zn²) )

and z1² + ... + zn² = z^T z.

Now notice det(B) = det(B^T), and det(BB^T) = det(B) det(B^T).

So √(det(BB^T)) = √(det(B) det(B^T)) = √((det(B))²) = |det(B)|.

Also (B^{−1}(y − a))^T (B^{−1}(y − a)) = (y − a)^T (B^{−1})^T B^{−1} (y − a).

Furthermore, (B^{−1})^T = (B^T)^{−1}, so (B^{−1})^T B^{−1} = (B^T)^{−1} B^{−1} = (BB^T)^{−1} by matrix
algebra.

Therefore

f_Y(y) = (1/((2π)^{n/2} |det(B)|)) exp( −(1/2)(B^{−1}(y − a))^T (B^{−1}(y − a)) )
       = (1/√((2π)^n det(BB^T))) exp( −(1/2)(y − a)^T (BB^T)^{−1} (y − a) )

so Y ∼ MVN(a, BB^T).

Direct consequence of this: given V, an n × n symmetric, non-singular, positive definite
variance-covariance matrix, we can generate a multivariate normal density with this V
using Exercise 2.3 whenever we can find B with BB^T = V.
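In practice, such a factor B is usually taken to be the Cholesky factor of V. A sampler sketch along these lines (assuming numpy; the matrix V below is an illustrative choice, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, 0.0, -1.0])
V = np.array([[4.0, 1.2, 0.5],
              [1.2, 2.0, 0.3],
              [0.5, 0.3, 1.0]])          # symmetric positive definite (illustrative)

B = np.linalg.cholesky(V)                # lower-triangular B with B @ B.T == V
Z = rng.standard_normal((3, 100_000))
Y = mu[:, None] + B @ Z                  # columns are MVN(mu, V) draws, by Exercise 2.3

print(Y.mean(axis=1))                    # approximately mu
print(np.cov(Y))                         # approximately V
```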

2.3 Maximum Likelihood Estimates

Suppose X^(1), X^(2), ..., X^(n) are independent p-vectors, each MVN(µ, V), and we want
estimates of µ and V. In other words, we want to estimate each µj, σj² and ρ_{jk}.

We want to construct statistics from the data that are close to the µj, σj², ρ_{jk}, and we do
this by using maximum likelihood estimates (m.l.e.).

The m.l.e. idea is to take the joint p.d.f. of the data X^(1), ..., X^(n), i.e. the joint p.d.f.
f_{X^(1)}(x^(1)) f_{X^(2)}(x^(2)) ... f_{X^(n)}(x^(n)), and choose µ, V to maximize this. In practice, we
differentiate with respect to µj, σj² and ρ_{jk}, set the derivatives to zero, and solve.

Eventually, we get

µ̂j = (1/n) Σ_{i=1}^{n} X_j^{(i)}

σ̂j² = (1/n) Σ_{i=1}^{n} (X_j^{(i)} − µ̂j)²

ρ̂_{jk} = [ (1/n) Σ_{i=1}^{n} (X_j^{(i)} − µ̂j)(X_k^{(i)} − µ̂k) ] / (σ̂j σ̂k)
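These m.l.e. formulas are straightforward to evaluate on data. A sketch (assuming numpy; the data here are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
# n observations of a p = 2 vector, simulated just to have something to estimate from
X = rng.multivariate_normal(mean=[0.0, 3.0],
                            cov=[[1.0, 0.6], [0.6, 2.0]], size=1000)
n = X.shape[0]

mu_hat = X.mean(axis=0)                       # componentwise sample means
centred = X - mu_hat
sigma2_hat = (centred**2).mean(axis=0)        # (1/n) sums of squares
V_hat = centred.T @ centred / n               # m.l.e. of V (divisor n, not n - 1)
rho_hat = V_hat[0, 1] / np.sqrt(sigma2_hat[0] * sigma2_hat[1])

print(mu_hat, sigma2_hat, rho_hat)
```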
Recall Exercise 2.3: if Z ∼ MVN_n(0, I_n) and B is a non-singular n × n matrix, then
Y = BZ + a ∼ MVN_n(a, BB^T). In fact, we can get any variance-covariance matrix V = BB^T
this way.

The MVN arises a lot in statistics.

Typically, if the number of parameters p is fixed, one gets more and more data, and one
uses m.l.e. in regular situations, then the p maximum likelihood estimators for the p parameters
have approximately an MVN distribution.

Moreover, some significant functions of MVN distributions can be related, using the ideas in
Exercise 2.3, to classic distributions.

Furthermore, because of Exercise 2.3, we can concentrate on Z ∼ MVN_n(0, I).

2.4 Distributions Derived from the Gaussian

Definition 2.3. χ²-Distribution

X = Z1² + Z2² + ... + Zn² is χ² on n degrees of freedom, and we write X ∼ χ²_n. This arises
in variance estimation. χ²_n is absolutely continuous and has p.d.f.

f_{χ²_n}(x) = (1/(2^{n/2} Γ(n/2))) x^{n/2 − 1} e^{−x/2} for x > 0, and 0 otherwise.

This is a special case of the Gamma distribution.

Note that Γ(n/2) = (n/2 − 1)! if n is even.

[Figures 5-8: sketches of the χ²_n density for varying n: unbounded and decreasing near 0
when n = 1, exponential when n = 2, unimodal and right-skewed when n = 3, and
approximately normal about k when n = k is large.]

If X ∼ χ²_n and Y ∼ χ²_m and X ⊥⊥ Y, then X + Y ∼ χ²_{m+n}.

Special cases:
χ²_n = Gamma(n/2, 2)
χ²_2 = Exponential with parameter 1/2, i.e. mean = 2.

If X ∼ χ²_n, then E[X] = n, Var[X] = 2n, and M_X(t) = E[e^{tX}] = (1/(1 − 2t))^{n/2} for t < 1/2.

This is a positively skewed distribution, and it converges slowly to a normal distribution as
n → ∞.

There is no explicit c.d.f. formula for this distribution, but there are tables provided to
approximate the c.d.f. of this distribution.

Definition 2.4. Student t-distribution

If Z ∼ N(0, 1), Y ∼ χ²_n, and Z ⊥⊥ Y, then X = Z/√(Y/n) has the t-distribution with n
degrees of freedom, and we write X ∼ t_n. It has p.d.f.

f_X(x) = (Γ((n + 1)/2) / (√(nπ) Γ(n/2))) (1 + x²/n)^{−(n+1)/2} for all x.

This distribution is symmetric about 0 and converges to N(0, 1) as n → ∞, but it has heavier
tails.

Special case: t_1 is the Cauchy distribution.

For t_n, only the moments E[|X|^k] with k < n are finite.

There is no explicit c.d.f. formula for this distribution except in the case n = 1, but there
are tables provided to approximate the c.d.f. of this distribution.

Definition 2.5. Fisher F distribution

If Y ∼ χ²_m, Z ∼ χ²_n, and Y ⊥⊥ Z, then X = (Y/m)/(Z/n) has the F distribution on m, n
degrees of freedom, and we write X ∼ F_{m,n}. This distribution is positively skewed; for large
m, n we find F_{m,n} concentrated around 1. It has p.d.f.

f_X(x) = √( (mx)^m n^n / (mx + n)^{m+n} ) / ( x B(m/2, n/2) )

where B is the beta function, B(m/2, n/2) = ∫_0^1 t^{m/2 − 1} (1 − t)^{n/2 − 1} dt for m/2, n/2 > 0.

There is no explicit c.d.f. formula for this distribution, but there are tables provided to
approximate the c.d.f. of this distribution.
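The three constructions above can be reproduced directly from standard normals. A simulation sketch (assuming numpy; the degrees of freedom are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n_df, n_sims = 4, 7, 200_000

Z = rng.standard_normal((n_df, n_sims))
chi2_n = (Z**2).sum(axis=0)                  # chi^2 on n_df degrees of freedom
print(chi2_n.mean(), chi2_n.var())           # approximately n_df and 2*n_df

t_n = rng.standard_normal(n_sims) / np.sqrt(chi2_n / n_df)   # t on n_df d.o.f.
print(t_n.mean(), t_n.var())                 # approximately 0 and n_df/(n_df - 2)

chi2_m = (rng.standard_normal((m, n_sims))**2).sum(axis=0)   # independent chi^2_m
F = (chi2_m / m) / (chi2_n / n_df)           # F on (m, n_df) degrees of freedom
print(F.mean())                              # approximately n_df/(n_df - 2)
```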

Recall Fisher's Theorem.

Theorem 2.1. Fisher's Theorem

If X1, X2, ..., Xn are independent identically normally distributed random variables with
mean µ and variance σ², and X̄n = (1/n) Σ_{i=1}^{n} Xi and s_n² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄n)², then
• X̄n ∼ N(µ, σ²/n)
• (n − 1) s_n² / σ² ∼ χ²_{n−1}
• X̄n and s_n² are independent
• (X̄n − µ)/√(s_n²/n) ∼ t_{n−1}

This sets the scene for Chapter 3 (Maximum Likelihood Estimation). Maximum likelihood
estimators are typically easy to compute, but how good are they, especially when we have
lots of data? We will find out in Chapter 3.

3 Maximum Likelihood Estimation

We get data, which leads to a list of cases, each case having several (or just one) responses.

We can then put the data into a data matrix: rows indexed by the cases 1, 2, 3, ..., and
columns indexed by the responses X1, Y1, Z1, ....

We can then propose a statistical model with parameters θ1, ..., θp, yielding a joint p.m.f. or
joint p.d.f. f_θ(x) for the data.

So we have the following picture: a parameter θ lives in a parameter space Ω_θ (a subset of
Euclidean space), and the observations x1, ..., xn live in an observation space (also a subset
of Euclidean space). The model f_θ(x1) ... f_θ(xn) takes parameters to distributions for the
observations, while estimators t(x) take observations back to the parameter space; maximum
likelihood gives one such estimator.

What we want to do is to
1) construct estimators using f_θ(x);
2) decide how good they are.

Idea: We look for estimators which are good when the amount n of data becomes large.

Estimators t(x) depend on the observations X ≈ x, so they are random.

Let Tn be the estimator based on n cases.

We want Tn to get close to θ as n → ∞.

For example, the probability that Tn is not close to θ should tend to 0 as n → ∞.
Definition 3.1. The random variables X1, X2, ... converge in probability to a random
variable X if, for all ε > 0, P[|Xn − X| ≥ ε] → 0 as n → ∞.

We write Xn →^P X, or Xn → X in probability.

Note that X can be non-random. If X = a, we write Xn → a in probability.

Definition 3.2. The estimators θ̂1, θ̂2, ... are consistent for θ ∈ Ω_θ if, for all ε > 0 and all
θ ∈ Ω_θ,

θ̂n → θ in probability.

Important: When calculating P[|θ̂n − θ| ≥ ε], we need to use f_θ, the p.m.f. / p.d.f. given by θ.

Exercise 3.1. Suppose we have data X1, X2, ..., independent identically distributed with
some p.d.f. / p.m.f. f_θ(x), with E[Xi] = µ and Var[Xi] = σ², σ² < ∞.

Suppose we want to estimate µ, which will be one component of the parameter vector θ.

Try µ̂n = X̄n = (1/n) Σ_{i=1}^{n} Xi.

Is µ̂n = X̄n consistent for µ? (i.e. does µ̂n → µ in probability?)

We need to show that P[|X̄n − µ| ≥ ε] → 0 as n → ∞.

We can show this using Chebyshev's Inequality.

Recall: Chebyshev's Inequality states that if Z has mean µ and variance σ², then

P[|Z − µ| ≥ rσ] ≤ 1/r².

Now, X̄n has mean µ and variance σ²/n.

Then P[|X̄n − µ| ≥ r σ/√n] ≤ 1/r².

Fix ε > 0, and set rσ/√n = ε, so r = ε√n/σ.

So we find P[|X̄n − µ| ≥ ε] ≤ 1/(ε√n/σ)² = σ²/(ε² n) → 0 as n → ∞.
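A quick simulation sketch of this consistency statement (assuming numpy; Exponential data with µ = σ = 1 are used purely as an example, and ε = 0.1):

```python
import numpy as np

rng = np.random.default_rng(7)
eps, reps = 0.1, 2000

for n in [10, 100, 1000, 5000]:
    X = rng.exponential(1.0, size=(reps, n))     # E[Xi] = 1, Var[Xi] = 1
    xbar = X.mean(axis=1)
    prob = np.mean(np.abs(xbar - 1.0) >= eps)    # empirical P[|Xbar_n - mu| >= eps]
    print(n, prob, "Chebyshev bound:", 1.0 / (eps**2 * n))
```

The empirical probabilities shrink towards 0 as n grows, and always sit below the Chebyshev bound used in the argument above.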

Sometimes, we estimate something and then decide we really want to estimate a function of
it.

For example, we may want to estimate the variance σ².

Suppose s_n² estimates σ² consistently. Then how do we estimate the standard deviation σ?

s_n = √(s_n²) will then estimate σ = √(σ²) consistently.

Theorem 3.1. Suppose Xn → a in probability and the function g is continuous at a. Then

g(Xn) → g(a) in probability.

Proof. Left as an exercise.

So if θ̂n → θ in probability, then

θ̂n² → θ² in probability,
1/θ̂n → 1/θ in probability if θ ≠ 0,
√(θ̂n) → √θ in probability if θ > 0,
and so on.

So consistency respects continuous transformations.

Exercise 3.2. Suppose X1, ..., Xn are independent identically distributed variables as in
Exercise 3.1, with s_n² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄n)² and Var[Xi²] < ∞.

We want to show that

a) E[s_n²] = σ², so s_n² is an unbiased estimator for σ².
b) s_n² is a consistent estimator for σ².
Note that in general, s_n is not unbiased for σ, but it is still consistent.

a) Before we begin, we note that Var[Z] = E[Z²] − (E[Z])², and hence E[Z²] = Var[Z] + (E[Z])².   (†)

Therefore

E[s_n²] = E[ (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄n)² ]
        = E[ (1/(n−1)) ( Σ_{i=1}^{n} Xi² − n X̄n² ) ]
        = (1/(n−1)) ( n E[X1²] − n E[X̄n²] )
        = (1/(n−1)) ( n Var[X1] + n (E[X1])² − n Var[X̄n] − n (E[X̄n])² )   by (†)
        = (1/(n−1)) ( nσ² + nµ² − n(σ²/n) − nµ² )
        = (1/(n−1)) (n − 1) σ²
        = σ²

Hence s_n² is an unbiased estimator for σ².

b) We note that (1/n) Σ_{i=1}^{n} Xi² and (1/n) Σ_{i=1}^{n} Xi = X̄n can both be handled by the Weak
Law of Large Numbers as in Exercise 3.1. Furthermore, we also know from Exercise 3.1 that

(1/n) Σ_{i=1}^{n} Xi → E[X1] = µ in probability, and (1/n) Σ_{i=1}^{n} Xi² → E[X1²] = σ² + µ² in probability.   (††)

Therefore

s_n² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄n)²
     = (1/(n−1)) ( Σ_{i=1}^{n} Xi² − n X̄n² )
     = (n/(n−1)) ( (1/n) Σ_{i=1}^{n} Xi² − ( (1/n) Σ_{i=1}^{n} Xi )² )
     → lim_{n→∞} (n/(n−1)) ( σ² + µ² − µ² ) = σ² in probability,

using (††) and Theorem 3.1.

Hence there is consistency.

Notice that s̃_n² = (1/n) Σ_{i=1}^{n} (Xi − X̄n)² is therefore also consistent for σ² (but not unbiased).

3.1 Consistency

Consistency in MSE (mean square error) is a sufficient (but not necessary) condition for
consistency.

For consistency in MSE we require Bias[θ̂n] = E[θ̂n] − θ → 0 and Var[θ̂n] → 0.

In this case, the mean square error

MSE[θ̂n] = E[(θ̂n − θ)²] = Var[θ̂n] + (Bias[θ̂n])² → 0 as n → ∞.

We can then use Chebyshev's Inequality to deduce θ̂n → θ in probability, hence θ̂n is consistent.

Example 3.1. Suppose the random variables X1, X2, ..., Xn are independent identically
distributed, with E[Xi] = µ and Var[Xi] = σ².

We are given 4 estimators for µ:

µ̂1 = X̄n
µ̂2 = Σ_{i=1}^{n} ai Xi, where Σ_{i=1}^{n} ai = 1
µ̂3 = (X1 + Xn)/2
µ̂4 = (1/(n−2)) Σ_{i=1}^{n} Xi   (for n > 2)

We want to find out:

a) Which estimators are unbiased? (i.e. E[µ̂] = µ)
b) Which estimators are asymptotically unbiased? (i.e. E[µ̂n] → µ as n → ∞)
c) Which estimators are consistent in MSE?

We test for unbiasedness. Note that if an estimator is unbiased, then it is naturally
asymptotically unbiased as well.

E[µ̂1] = (1/n) Σ_{i=1}^{n} E[Xi] = µ   (unbiased, and hence asymptotically unbiased)
E[µ̂2] = Σ_{i=1}^{n} E[ai Xi] = Σ_{i=1}^{n} ai µ = µ   (unbiased, and hence asymptotically unbiased)
E[µ̂3] = (E[X1] + E[Xn])/2 = (µ + µ)/2 = µ   (unbiased, and hence asymptotically unbiased)
E[µ̂4] = (1/(n−2)) Σ_{i=1}^{n} E[Xi] = (n/(n−2)) µ ≠ µ   (biased)

However, we note that as n → ∞, n/(n−2) → 1, and therefore (n/(n−2))µ → µ as n → ∞, so
µ̂4 is asymptotically unbiased.

We now check for consistency in MSE. Since all four are asymptotically unbiased, we just
need to show whether Var[µ̂i] → 0 as n → ∞.

Var[µ̂1] = σ²/n → 0 as n → ∞   (consistent in MSE)
Var[µ̂2] = Var[ Σ_{i=1}^{n} ai Xi ] = ( Σ_{i=1}^{n} ai² ) σ²   (so this is only consistent in MSE if the ai's depend on n in such a way that Σ_{i=1}^{n} ai² → 0 as n → ∞)
Var[µ̂3] = (1/4)(2σ²) = σ²/2 ↛ 0 as n → ∞   (not consistent in MSE)
Var[µ̂4] = (1/(n−2)²)(nσ²) = (n/(n−2)) σ²/(n−2) → 0 as n → ∞   (consistent in MSE)
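A simulation sketch comparing three of these estimators (assuming numpy; the Xi here are N(µ, σ²) with µ = 2, σ = 3, chosen only for illustration; µ̂2 is omitted since it depends on an arbitrary choice of weights ai):

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, reps = 2.0, 3.0, 10_000

for n in [10, 100, 1000]:
    X = rng.normal(mu, sigma, size=(reps, n))
    mu1 = X.mean(axis=1)                        # sample mean
    mu3 = (X[:, 0] + X[:, -1]) / 2              # average of first and last observation
    mu4 = X.sum(axis=1) / (n - 2)               # divides by n - 2 instead of n
    for name, est in [("mu1", mu1), ("mu3", mu3), ("mu4", mu4)]:
        print(n, name,
              "bias:", round(est.mean() - mu, 3),
              "MSE:", round(np.mean((est - mu)**2), 3))
```

The output shows the MSE of µ̂1 and µ̂4 shrinking with n, while the MSE of µ̂3 stays near σ²/2, matching the calculations above.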

3.2 Estimators

How do we construct estimators?

Suppose X1, ..., Xn are independent identically distributed with p.d.f. f(x, θ) depending
on θ = (θ1, θ2, ..., θp), with θ ∈ Ω_θ ⊆ R^p (an open subset of R^p).

The maximum likelihood idea is to construct a likelihood function: if X1 = x1, ..., Xn = xn,
then for

L_n(θ; x) = f(x1, θ) ... f(xn, θ) = Π_{i=1}^{n} f(xi, θ),

the maximum likelihood estimator is the θ̂n such that L_n(θ; x) is as large as possible.

Note that f(x1, θ), ..., f(xn, θ) form the joint p.m.f. / p.d.f. of X1, ..., Xn if we know θ; viewed
the other way round, the likelihood is a function of θ, supposing I have observed X = x.

Note that the m.l.e. may not be unique, in the sense that there may be different values of θ
that give the same maximum. The values of θ are typically found using calculus.

Notice L_n(θ; x) is maximized exactly when log L_n(θ; x) (the log-likelihood) is maximized.

We write l_n(θ; x) = log L_n(θ; x) = Σ_{i=1}^{n} log f(xi; θ).

For the m.l.e., we use the θ̂ which makes L (or equivalently l) as large as possible.

We know if L (or respectively l) is well-behaved, we can do this by solving

∂l/∂θ1 = 0, ∂l/∂θ2 = 0, ..., ∂l/∂θp = 0

for θ1, ..., θp.

We can solve these equations either analytically, or numerically (e.g. by methods like
Newton-Raphson).

We can write the problem in brief as: solve ∂l(x; θ)/∂θ = 0 for θ.

Properties of the resulting m.l.e. θ̂:

In suitably regular situations, the θ̂ based on n observations → θ in probability, so we have
consistency.

If we are interested in g(θ), then η̂ = g(θ̂) is the m.l.e. of η = g(θ), so we have invariance.
Theorem 3.2. Let X1, ..., Xn be independent identically distributed random variables
with p.m.f. / p.d.f. f(x; θ). Suppose regularity conditions hold. Then
a) the likelihood equation ∂l_n(θ; x)/∂θ = 0 has a solution θ̂n, and θ̂n → θ in probability;
b) for any sequence θ̂n satisfying a), √n (θ̂n − θ) → MVN_p(0, I(θ)^{−1}) in distribution.

I(θ) is the Fisher Information matrix, where

I_{ii}(θ) = Var[ ∂/∂θi log f(X; θ) ]
          = −E[ ∂²/∂θi² log f(X; θ) ]

I_{ij}(θ) = Cov[ ∂/∂θi log f(X; θ), ∂/∂θj log f(X; θ) ]
          = −E[ ∂²/(∂θi ∂θj) log f(X; θ) ]   for i ≠ j

Comments:
1) The proof of the theorem is given in advanced statistics books; for example, it can be
found in Lehmann & Casella (1998), Theory of Point Estimation, 2nd Edition, Springer.
2) Uniqueness is not assured, but if the solution is unique, then consistency and asymptotic
normality follow.
3) Regularity conditions:
(R0) The p.m.f. / p.d.f.s are distinct for different θ: if θ ≠ θ', then f(X; θ) ≠ f(X; θ').
(R1) The support of the p.d.f. / p.m.f. does not depend on θ. (The support is the region
of possible values of x.)
(R2) There is an open subset Ω0 ⊆ Ω_θ such that the true θ ∈ Ω0, and all 3rd partial
derivatives of f(x; θ) exist for all θ ∈ Ω0.
(R3) The expectation of the score function vanishes: E_θ[ ∂/∂θj log f(x; θ) ] = 0.
(R4) I(θ) is positive definite for all θ ∈ Ω0 (and hence invertible).
(R5) There are functions M_{jkl}(x) such that |∂³/(∂θj ∂θk ∂θl) log f(x; θ)| ≤ M_{jkl}(x) and
E_θ[M_{jkl}(x)] < ∞.

Hence we get the corollary below.


Corollary 3.1. The θ̂n are asymptotically efficient, since for j = 1, 2, ..., p,

√n (θ̂_{n,j} − θj) → N(0, (I^{−1}(θ))_{jj}) in distribution as n → ∞,

where (I^{−1}(θ))_{jj} is the Cramér-Rao lower bound on the variance of an unbiased estimator
(per observation).
Exercise 3.3. Show that Cov[ ∂ log f(x;θ)/∂θj , ∂ log f(x;θ)/∂θk ] = −E[ ∂² log f(x;θ)/(∂θj ∂θk) ].

Recall Cov[U, V] = E[UV] − E[U] E[V].

So consider, e.g., E[ ∂ log f(x;θ)/∂θj ] = 0 (by the regularity conditions):

∫ f(x; θ) dx = 1, so ∂/∂θj ∫ f(x; θ) dx = 0.

Under the regularity conditions (in particular, the range of possible values of X doesn't depend
on θ), we can differentiate under the integral sign:

∫ (1/f(x;θ)) (∂f(x;θ)/∂θj) f(x;θ) dx = 0,

i.e. ∫ (∂ log f(x;θ)/∂θj) f(x;θ) dx = 0,

and the left-hand side is E[ ∂ log f(x;θ)/∂θj ].

So we need only consider

E[ (∂ log f(x;θ)/∂θj)(∂ log f(x;θ)/∂θk) ] = ∫ (∂ log f(x;θ)/∂θj)(∂ log f(x;θ)/∂θk) f(x;θ) dx.

Try differentiating the identity ∫ (∂ log f(x;θ)/∂θj) f(x;θ) dx = 0 with respect to θk. By the
product rule,

0 = ∫ [ (∂² log f(x;θ)/(∂θj ∂θk)) f(x;θ) + (∂ log f(x;θ)/∂θj)(∂f(x;θ)/∂θk) ] dx

⇒ ∫ (∂ log f(x;θ)/∂θj)(∂f(x;θ)/∂θk) dx = −∫ (∂² log f(x;θ)/(∂θj ∂θk)) f(x;θ) dx
                                        = −E[ ∂² log f(x;θ)/(∂θj ∂θk) ].

Now use (1/f) ∂f/∂θk = ∂(log f)/∂θk, so the left-hand side is

∫ (∂ log f(x;θ)/∂θj)(∂ log f(x;θ)/∂θk) f(x;θ) dx = E[ (∂ log f(x;θ)/∂θj)(∂ log f(x;θ)/∂θk) ]
                                                 = −E[ ∂² log f(x;θ)/(∂θj ∂θk) ].

Since both score terms have mean zero, this expectation is exactly the covariance, which
settles the result.

Exercise 3.4. Find the maximum likelihood estimate and Fisher information matrix if f
is the N(µ, σ²) density, with θ = (µ, σ)^T. State the asymptotic distribution of the m.l.e. θ̂.

Here Ω = (−∞, ∞) × (0, ∞), with θ1 = µ and θ2 = σ.

l(µ, σ) = Σ_{i=1}^{n} log f(xi; µ, σ²)
        = Σ_{i=1}^{n} ( −(1/2) log 2π − log σ − (xi − µ)²/(2σ²) )
        = −(n/2) log 2π − n log σ − (1/(2σ²)) Σ_{i=1}^{n} (xi − µ)²

∂l/∂µ = (1/σ²) Σ_{i=1}^{n} (xi − µ) = 0
      ⇒ µ̂ = (1/n) Σ_{i=1}^{n} xi = x̄

∂l/∂σ = −n/σ + (1/σ³) Σ_{i=1}^{n} (xi − µ)² = 0
      ⇒ σ̂ = √( (1/n) Σ_{i=1}^{n} (xi − x̄)² )

We now compute the information matrix.

log f(x; µ, σ²) = −(1/2) log 2π − log σ − (x − µ)²/(2σ²)
∂/∂µ log f = (x − µ)/σ²
∂/∂σ log f = −1/σ + (x − µ)²/σ³
∂²/∂µ² log f = −1/σ²
∂²/∂σ² log f = 1/σ² − 3(x − µ)²/σ⁴
∂²/(∂µ ∂σ) log f = −2(x − µ)/σ³

Replace x by X ∼ N(µ, σ²) and compute expectations, using E[X − µ] = 0 and E[(X − µ)²] = σ²:

E[ ∂²/∂µ² log f(X; µ, σ²) ] = −1/σ²
E[ ∂²/∂σ² log f(X; µ, σ²) ] = 1/σ² − (3/σ⁴) E[(X − µ)²] = −2/σ²
E[ ∂²/(∂µ ∂σ) log f(X; µ, σ²) ] = −(2/σ³) E[X − µ] = 0

I(µ, σ) = [ 1/σ², 0 ; 0, 2/σ² ],   V = (I(µ, σ))^{−1} = [ σ², 0 ; 0, σ²/2 ].

Summary:
• √n (µ̂ − µ) ∼ N(0, σ²) for large n;
• √n (σ̂ − σ) ∼ N(0, σ²/2) for large n, and asymptotically µ̂, σ̂ are independent.

Since we don't know µ, σ, if we need them to estimate the variance terms we plug in the
m.l.e.

So for large n, approximately:

• µ̂ ∼ normal, mean µ, variance σ̂²/n
• σ̂ ∼ normal, mean σ, variance σ̂²/(2n)
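The m.l.e. and the plug-in asymptotic variances are easy to compute from a sample. A sketch (assuming numpy; the data are simulated with µ = 5, σ = 2 just for illustration), together with the resulting approximate 95% intervals θ̂ ± 1.96 × (plug-in standard error):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(5.0, 2.0, size=200)               # simulated data, n = 200
n = x.size

mu_hat = x.mean()                                # m.l.e. of mu
sigma_hat = np.sqrt(((x - mu_hat)**2).mean())    # m.l.e. of sigma (divisor n)

# Plug-in asymptotic standard errors from the Fisher information above:
se_mu = sigma_hat / np.sqrt(n)                   # sd of mu_hat is about sigma/sqrt(n)
se_sigma = sigma_hat / np.sqrt(2 * n)            # sd of sigma_hat is about sigma/sqrt(2n)

print("mu_hat:", mu_hat, "approx 95% CI:", (mu_hat - 1.96 * se_mu, mu_hat + 1.96 * se_mu))
print("sigma_hat:", sigma_hat, "approx 95% CI:",
      (sigma_hat - 1.96 * se_sigma, sigma_hat + 1.96 * se_sigma))
```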
We know that √n (θ̂ − θ) ≈ MVN(0, I(θ)^{−1}), with θ1 = µ, θ2 = σ.

What happens if we replace θ by some smooth function η = g(θ)?

If we reparameterize, then we can say η̂ = g(θ̂).

If we compute the matrix B with entries

B_{ik} = ∂g_i(θ)/∂θ_k,

then, so long as B depends continuously on θ and doesn't become singular in a neighbourhood
of θ, it can be shown that √n (η̂ − η) → MVN(0, B I^{−1}(θ) B^T) in distribution.

Exercise 3.5. Let η = g(µ, σ) = (µ, σ²)^T.

Now find the asymptotic distribution of η̂ = g(θ̂) = (µ̂, σ̂²)^T.

B = [ ∂g1/∂µ, ∂g1/∂σ ; ∂g2/∂µ, ∂g2/∂σ ],   with g1(µ, σ) = µ and g2(µ, σ) = σ²,

so B = [ 1, 0 ; 0, 2σ ]   and   I^{−1} = [ σ², 0 ; 0, σ²/2 ].

So B I^{−1} B^T = [ 1, 0 ; 0, 2σ ] [ σ², 0 ; 0, σ²/2 ] [ 1, 0 ; 0, 2σ ]
               = [ σ², 0 ; 0, σ³ ] [ 1, 0 ; 0, 2σ ]
               = [ σ², 0 ; 0, 2σ⁴ ]

Consequently, for large n, η̂ = g(θ̂) = ( X̄n , (1/n) Σ_{i=1}^{n} (Xi − X̄n)² )^T has approximate
distribution

MVN( (µ, σ²)^T , (1/n) [ σ², 0 ; 0, 2σ⁴ ] ).
σ2 n 0 2σ 4

Remarks:
Using the Cramér-Rao lower bound, the lowest possible variance for an unbiased estimator
of σ² can be seen to be 2σ⁴/n.

Here, our estimator σ̂² = (1/n) Σ_{i=1}^{n} (Xi − X̄n)² is biased, though asymptotically unbiased and
consistent. Its asymptotic variance is the Cramér-Rao lower bound.

The estimator s² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄n)² is unbiased, with variance 2σ⁴/(n−1) > 2σ⁴/n.

In fact s² has greater mean square error than σ̂². It is not much greater (for large n,
1/(n−1) ≈ 1/n), but this demonstrates that exactly unbiased estimators need not be best.

In general,

√n (θ̂j − θj) → N(0, (I(θ)^{−1})_{jj}) in distribution,

so we deal with the unknown θ in the variance term by plugging in the maximum likelihood
estimate. One way or another, we can estimate (I(θ)^{−1})_{jj} by (Î(θ̂)^{−1})_{jj}.

Hence we can define confidence intervals for our estimator.

We can get asymptotic 95% confidence intervals for θj as θ̂j ± 1.96 √((Î^{−1})_{jj}) / √n.

Another important application:

The likelihood function is proportional to the joint density of the data, given the parameters.
In Bayesian inference, we model uncertainty about θ by supposing it random with density π(θ),
and then we can compute the conditional distribution of the parameter θ given the data X:

π(θ | x) = f(x | θ) π(θ) / f(x) ∝ f(x | θ) π(θ).

The m.l.e. theory tells us what f(x | θ) looks like when we have a large amount n of data.

For example, the m.l.e. θ̂ is where f(x | θ) is biggest as a function of θ.

Pros and cons of m.l.e.:
+ It is a defined procedure and can be calculated.
+ In regular cases, we know the asymptotic distribution (hence we can get confidence intervals for
our estimates).
+ It behaves well if we reparameterize.

− Sometimes there can be several maximum likelihood estimators.
− There can be correlation between parameters, though reparameterization can help.
− The m.l.e. cannot help if you have chosen a bad model for your data.

4 Hypothesis Testing

Recall the set-up: we have a statistical model with data x1, ..., xn modelled as observed values of
random variables X1, ..., Xn.

We consider several possible distributions for X1 , . . . , Xn parametrized by θ, some unknown


parameter(s) belonging to a set of possible values Θ.
The null hypothesis, usually denoted by H0, is the statement that the true value of θ
belongs to a given subset Θ0 ⊆ Θ.

The alternative hypothesis, usually denoted by H1, is the statement that the true value
of θ belongs to a given subset Θ1 ⊆ Θ.

Usually, the null hypothesis corresponds to some simplifying assumptions or natural special
case.

Example 4.1. Data on blood pressure where women at age 35 and then at age 40 get their
blood pressure measured.

The parameter θ might represent the mean change in blood pressure between the two ages.

The natural null hypothesis is θ = 0, which represents “no change in mean blood pressure”, versus the alternative hypothesis θ ∈ R.

i.e. Θ0 = {0} compared to Θ1 = R.

Remark: We might take Θ1 = R \ {0} instead but it wouldn’t change the analysis.

A hypothesis test is a procedure for choosing between H0 and H1 in light of the data. But it is not symmetric: we choose to accept the null hypothesis unless the data suggest sufficiently strongly that the alternative is a better explanation or fit.

A hypothesis test is specified by giving a critical region (rejection region) C ⊆ Rⁿ.

We choose C so that data values in C correspond to evidence against θ ∈ Θ0, with some θ ∈ Θ1 being more compatible with the data. Then if our data x ∈ C, we reject H0 in favour of H1; otherwise we accept H0.

The critical region is usually constructed by using a “test statistic” T(x), which is some function of the data, such that large values of T are more compatible with H1 than with H0. Then we set

C = { x ∈ Rⁿ : T(x) ≥ λ }

where λ is some critical value that we must choose, so that

sup_{θ∈Θ0} Pθ[X ∈ C] = sup_{θ∈Θ0} Pθ[T(X) ≥ λ] = α

where α is called the size of the test and must be chosen in advance.

A type I error is made if H0 is rejected but is in fact true.

A type II error is made if H0 is accepted but is in fact false.

Thus the size of the test is the maximum probability of a type I error, and we usually fix this in advance at one of a number of typical values, e.g. α = 0.05 or 0.01.

One general way of choosing a test statistic or critical region is the generalized likelihood ratio. We may take

C = { x ∈ Rⁿ : sup_{θ∈Θ1} L(θ; x) / sup_{θ∈Θ0} L(θ; x) ≥ λ }

Here L(θ; x) is the observed likelihood of θ if x is the data, and λ is chosen to fix the size of the test.

This is called a (generalized) likelihood ratio test.

Example 4.2. x1, . . . , xn are modelled as observed values of X1, . . . , Xn, which are independent identically distributed N(θ, 1) random variables.

Find the critical region for the likelihood ratio test of H0 : θ = θ0 = 0 against H1 : θ = θ1 = 1 with size α = 0.01.

L(θ; x) = (1/(2π)^{n/2}) exp{ −(1/2) Σᵢ (xi − θ)² }

Consequently,

L(θ1; x) / L(θ0; x) = exp{ −(1/2) Σᵢ (xi − 1)² } / exp{ −(1/2) Σᵢ xi² }
= exp{ Σᵢ xi − n/2 }
= exp{ n(x̄ − 1/2) }

So we reject H0 if exp{ n(x̄ − 1/2) } is too large.

Because z ↦ exp{ n(z − 1/2) } is increasing, this is equivalent to rejecting H0 if x̄ is too large.

So C = { x ∈ Rⁿ : x̄ ≥ A } for a value of A that is chosen to fix the size of the test:

α = 0.01 = Pθ0[X ∈ C] = Pθ0[X̄ ≥ A]

But under H0, X̄ = (1/n) Σᵢ Xi ∼ N(0, 1/n), so √n X̄ ∼ N(0, 1) and

Pθ0[X̄ ≥ A] = Pθ0[√n X̄ ≥ √n A] = 1 − Φ(√n A)

Thus A = Φ⁻¹(1 − α)/√n.
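As a hedged sketch (not from the notes; scipy is assumed and the data are simulated), the critical value A = Φ⁻¹(1 − α)/√n and the resulting decision can be computed as follows.

import numpy as np
from scipy.stats import norm

alpha, n = 0.01, 25                          # hypothetical sample size
A = norm.ppf(1 - alpha) / np.sqrt(n)         # critical value: reject H0 when xbar >= A

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=n)   # simulated data under H0: theta = 0
print(A, x.mean(), x.mean() >= A)            # True means reject H0 in favour of H1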

4.1 Power of a test

Definition 4.1. The power of a test is the function β : Θ1 → [0, 1] defined by β(θ) = Pθ[X ∈ C].

This describes the probability of correctly rejecting the null hypothesis in favour of the alternative hypothesis, i.e. it is one minus the type II error probability.

Usually if we decrease the size of the test, we will decrease the power too because we become
more willing to accept H0 .

Similarly, if we increase the size of the test, we can increase the power, and we become more
willing to reject H0 .

For a given fixed size of the test, we would like to make the power as high as possible.
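For concreteness, consider again the test of Example 4.2, which rejects when X̄ ≥ A; since X̄ ∼ N(θ, 1/n), its power is β(θ) = 1 − Φ(√n (A − θ)). The sketch below (my own illustration, assuming scipy is available) evaluates this power at a few values of θ.

import numpy as np
from scipy.stats import norm

alpha, n = 0.01, 25
A = norm.ppf(1 - alpha) / np.sqrt(n)      # critical value from Example 4.2

def power(theta):
    # P_theta[Xbar >= A] when Xbar ~ N(theta, 1/n)
    return 1 - norm.cdf(np.sqrt(n) * (A - theta))

for theta in [0.0, 0.25, 0.5, 1.0]:
    print(theta, power(theta))            # power increases towards 1 as theta grows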

Theorem 4.1. The Neyman-Pearson Lemma

Suppose X1, . . . , Xn have a joint distribution with density f(x; θ).

The likelihood ratio test of H0 : θ = θ0 against H1 : θ = θ1 is at least as powerful as any other test of the same size.

Remark: H0 : θ = θ0 , H1 : θ = θ1 are called simple hypotheses because they specify a


single value for θ. A hypothesis that allows for more than one value of θ is called composite.

So Theorem 4.1 implies that if we want to test two simple hypotheses, the likelihood ratio
test is best.

Proof. Let C0 = { x ∈ Rⁿ : f(x; θ1)/f(x; θ0) ≥ A } where A is chosen so that Pθ0[X ∈ C0] = α, which is the size of the test.

So C0 is the rejection region for the Likelihood Ratio Test (LRT).

Let C1 be another subset of Rⁿ with Pθ0[X ∈ C1] = α.

We wish to show that

Pθ1[X ∈ C0] ≥ Pθ1[X ∈ C1]

Let B1 = C0 ∩ C1 (both tests reject the null hypothesis),
B2 = C0 ∩ C1ᶜ (the LRT rejects the null hypothesis, but the other test does not),
B3 = C0ᶜ ∩ C1 (the LRT does not reject the null hypothesis, but the other test does);

then B1, B2, B3 are disjoint.

Note C0 = B1 ∪ B2 and C1 = B1 ∪ B3.

Consider

Pθ1[X ∈ C0] − Pθ1[X ∈ C1]
= Pθ1[X ∈ B2] − Pθ1[X ∈ B3]   (as C0 = B1 ∪ B2, C1 = B1 ∪ B3, and B1 cancels out)
= ∫_{B2} f(x; θ1) dx − ∫_{B3} f(x; θ1) dx   (†)

Now B2 ⊆ C0, and so f(x; θ1)/f(x; θ0) ≥ A for x ∈ B2, while

B3 ⊆ C0ᶜ, and so f(x; θ1)/f(x; θ0) < A for x ∈ B3.

Using these facts, we can see that (†) is at least

A ∫_{B2} f(x; θ0) dx − A ∫_{B3} f(x; θ0) dx

But

A ∫_{B2} f(x; θ0) dx − A ∫_{B3} f(x; θ0) dx
= A ∫_{B2∪B1} f(x; θ0) dx − A ∫_{B3∪B1} f(x; θ0) dx
= A ∫_{C0} f(x; θ0) dx − A ∫_{C1} f(x; θ0) dx
= A { Pθ0[X ∈ C0] − Pθ0[X ∈ C1] }
= A(α − α) = 0

So Pθ1[X ∈ C0] − Pθ1[X ∈ C1] ≥ 0, which proves the theorem.

The Neyman-Pearson Lemma tells us that the Likelihood Ratio Test is optimal where both
hypotheses are simple. But what about composite hypotheses?
Definition 4.2. A hypothesis test of H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 with critical region C is called the uniformly most powerful (UMP) test of size α if

sup_{θ∈Θ0} Pθ[X ∈ C] = α

and, for any other test with critical region C′ satisfying

sup_{θ∈Θ0} Pθ[X ∈ C′] ≤ α,

we have

Pθ[X ∈ C] ≥ Pθ[X ∈ C′]

for all θ ∈ Θ1.

The uniformly most powerful (UMP) test doesn’t always exist. Typically, as we try to
increase the power at one value of θ ∈ Θ1 , we will decrease the power somewhere else.
Corollary 4.1. If H0 : θ = θ0 is simple and H1 : θ ∈ Θ1, then a UMP test of size α exists if and only if the Likelihood Ratio Test of H0 : θ = θ0 against H1 : θ = θ1 of size α has the same critical region for every θ1 ∈ Θ1.
Exercise 4.1. Suppose the random variables X1, . . . , Xn are independent identically distributed N(0, σ²), where σ is an unknown parameter.

a) Find a UMP test of H0 : σ 2 = 1 against H1 : σ 2 > 1.


b) Find a UMP test of H0 : σ 2 = 1 against H1 : 0 < σ 2 < 1.
c ) Find a UMP test of H0 : σ 2 = 1 against H1 : σ 2 > 0.

Write down the likelihood:

L(σ; x) = (1/(√(2π) σ))ⁿ exp{ −(1/(2σ²)) Σᵢ xi² }

Consequently, the likelihood ratio to compare σ = σ1 with σ = 1 is

L(σ1; x) / L(1; x) = (1/σ1)ⁿ exp{ (1/2 − 1/(2σ1²)) Σᵢ xi² }

a) If σ1 > 1, the LRT of H0 : σ = 1 against H1 : σ = σ1 is of the form:

reject H0 if Σᵢ xi² ≥ A

because z ↦ (1/σ1)ⁿ exp{ (1/2 − 1/(2σ1²)) z } is an increasing function of z when σ1 > 1.

We choose A to fix the size of the test. We want

P[ Σᵢ Xi² ≥ A ] = α

under the null hypothesis, where α is the size of the test.

So since Σᵢ Xi² ∼ χ²ₙ under the null hypothesis, we see that A is the 100(1 − α)% point of the χ²ₙ distribution.

The same critical region works for every σ1 > 1, so the Neyman-Pearson Lemma (via Corollary 4.1) implies that this is the UMP test of H0 : σ = 1 against H1 : σ² > 1.

b) If σ1 < 1, the LRT of H0 : σ = 1 against H1 : σ = σ1 is of the form:

reject H0 if Σᵢ xi² ≤ A′

because z ↦ (1/σ1)ⁿ exp{ (1/2 − 1/(2σ1²)) z } is a decreasing function of z when σ1 < 1.

We choose A′ to fix the size of the test. We want

P[ Σᵢ Xi² ≤ A′ ] = α

under the null hypothesis, where α is the size of the test.

So since Σᵢ Xi² ∼ χ²ₙ under the null hypothesis, we see that A′ is the 100α% point of the χ²ₙ distribution.

Once again, our procedure is the same for all values σ1 < 1.

So by the Neyman-Pearson Lemma, this is the UMP test of H0 : σ = 1 against H1 : 0 < σ < 1.

c) Since the likelihood ratio test is different for values of σ1 < 1 and values of σ1 > 1 there
can be no UMP test of H0 : σ = 1 against H1 : σ > 0.
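The two one-sided critical values can be read off the χ²ₙ distribution directly. The sketch below (my own illustration, assuming scipy; the data are simulated under H0) computes A and A′ and evaluates both tests.

import numpy as np
from scipy.stats import chi2

alpha, n = 0.05, 30
A = chi2.ppf(1 - alpha, df=n)        # upper critical value, for H1: sigma^2 > 1
A_prime = chi2.ppf(alpha, df=n)      # lower critical value, for H1: sigma^2 < 1

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=n)     # simulated data under H0: sigma = 1
T = np.sum(x ** 2)
print(T, T >= A, T <= A_prime)       # reject in the first / second test respectively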

4.2 Generalized Likelihood Ratio Tests

The Neyman-Pearson Lemma says that for simple hypotheses, the Likelihood Ratio Test is
optimal. It also implies in certain restricted situations that likelihood ratios give U.M.P.
tests.

Motivated by these good properties of likelihood ratios, we can define what a generalized
likelihood ratio test is.
Definition 4.3. The generalized likelihood ratio test of H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 is the test with critical region

C = { x ∈ Rⁿ : sup_{θ∈Θ1} L(θ; x) / sup_{θ∈Θ0} L(θ; x) ≥ λ }

where λ is chosen to fix the size of the test.

4.2.1 The t−test

Example 4.3. The t−test


Data x1 , x2 , . . . , xn is modelled as observed values of X1 , . . . , Xn independent identically
distributed random variables with N (µ, σ 2 ), with µ and σ unknown. We want to test H0 :
µ = 0 against H1 : µ ∈ R.

X
The t-test rejects H0 when t(x) = √ is too large (in the +ve or -ve sense).
e s/ n

1X 1 X
Here, x̄ = xi , s 2 = (xi − x̄)2 .
n n−1

By Fisher’s Theorem, (X̄ − μ)/(s/√n) ∼ t_{n−1} distribution.

Here, X̄ = (1/n) Σᵢ Xi, s² = (1/(n−1)) Σᵢ (Xi − X̄)².

So t(X) = X̄/(s/√n) ∼ t_{n−1} under H0.

So if the size of the test is to be α, we reject H0 if:

t(x) ∉ ( t_{n−1}(α/2), t_{n−1}(1 − α/2) )

Question: Is the “t-test” a generalized likelihood ratio test?


 
Likelihood: L(μ, σ; x) = (1/((2π)^{n/2} σⁿ)) exp{ −(1/(2σ²)) Σᵢ (xi − μ)² }

Under H0, L(μ, σ) is maximized by choosing μ = 0 and σ̂0² = (1/n) Σᵢ xi².

So L(0, σ̂0; x) = (1/((2π)^{n/2} σ̂0ⁿ)) exp{ −n/2 }.

Under H1, the m.l.e. of μ is μ̂ = (1/n) Σᵢ xi = x̄, and the m.l.e. of σ² is σ̂² = (1/n) Σᵢ (xi − x̄)².

So L(μ̂, σ̂; x) = (1/((2π)^{n/2} σ̂ⁿ)) exp{ −n/2 }.

Consequently, the G.L.R. to compare H0 and H1 is:

L(μ̂, σ̂; x) / L(0, σ̂0; x) = (σ̂0/σ̂)ⁿ = ( Σᵢ xi² / Σᵢ (xi − x̄)² )^{n/2}

So what has this got to do with t(x)?

( Σᵢ xi² / Σᵢ (xi − x̄)² )^{n/2} = ( Σᵢ (xi − x̄ + x̄)² / Σᵢ (xi − x̄)² )^{n/2}
= ( ( Σᵢ (xi − x̄)² + n x̄² ) / Σᵢ (xi − x̄)² )^{n/2}
= ( 1 + n x̄² / Σᵢ (xi − x̄)² )^{n/2}
= ( 1 + t(x)²/(n − 1) )^{n/2}

Since z ↦ (1 + z/(n−1))^{n/2} is increasing in z, the L.R. being too large corresponds to t(x)² being too large, or equivalently to |t(x)| being too large.

Thus the generalized likelihood ratio test is the same as the t−test.

Example 4.4. Numerical example

Let xi = change in blood pressure of the ith patient after treatment by a new drug. We are given the values n = 15, x̄ = −9.267, s² = 74.21. Do we reject H0 : μ = 0 at 5%?

t(x) = −4.16.

We reject H0 if |t| > 2.145, which is the 97.5% point of t14. So here, we reject H0!
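A quick numerical check of this example (my own sketch, using only the summary statistics given; scipy assumed):

import numpy as np
from scipy.stats import t

n, xbar, s2 = 15, -9.267, 74.21
t_stat = xbar / np.sqrt(s2 / n)               # observed t statistic, about -4.16
crit = t.ppf(0.975, df=n - 1)                 # 97.5% point of t_14, about 2.145
p_value = 2 * t.sf(abs(t_stat), df=n - 1)     # two-sided p-value
print(t_stat, crit, abs(t_stat) > crit, p_value)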

4.2.2 The two sample t−test

Example 4.5. The two sample t−test

Let the data x1, . . . , xm and y1, . . . , yn be modelled as observations of random variables X1, . . . , Xm i.i.d. ∼ N(μX, σ²) and Y1, . . . , Yn i.i.d. ∼ N(μY, σ²), with μX, μY, σ unknown parameters.

Construct the G.L.R.T. of H0 : μX = μY against H1 : (μX, μY) ∈ R² and show it can be based on:

t(x, y) = (x̄ − ȳ) / ( s_PL √(1/m + 1/n) )

where s²_PL = (1/(m+n−2)) { Σᵢ (xi − x̄)² + Σᵢ (yi − ȳ)² }.

Here, the subscript PL stands for “pooled”.

Let’s write down the likelihood.

L(μX, μY, σ²; x, y) = (1/(2π)^{(m+n)/2}) · (1/σ^{m+n}) · exp{ −(1/(2σ²)) [ Σᵢ (xi − μX)² + Σᵢ (yi − μY)² ] }

Under H0:

The m.l.e. of μ = μX = μY is μ̂ = (1/(m+n)) [ Σᵢ xi + Σᵢ yi ].
The m.l.e. of σ² is σ̂0² = (1/(m+n)) [ Σᵢ (xi − μ̂)² + Σᵢ (yi − μ̂)² ].

Consequently,

sup_{μX=μY, σ²>0} L(μX, μY, σ²) = L(μ̂, μ̂, σ̂0²)
= (1/(2π)^{(m+n)/2}) · (1/σ̂0^{m+n}) · exp{ −(m+n)/2 }

Under H1:

The m.l.e. of μX is μ̂X = (1/m) Σᵢ xi.
The m.l.e. of μY is μ̂Y = (1/n) Σᵢ yi.
The m.l.e. of σ² is σ̂² = (1/(m+n)) [ Σᵢ (xi − x̄)² + Σᵢ (yi − ȳ)² ].

Consequently,

sup_{μX, μY, σ²>0} L(μX, μY, σ²) = L(μ̂X, μ̂Y, σ̂²)
= (1/(2π)^{(m+n)/2}) · (1/σ̂^{m+n}) · exp{ −(m+n)/2 }

The G.L.R. is therefore equal to (σ̂0²/σ̂²)^{(m+n)/2}.

We want to relate this quantity to the t statistic, so we have to rewrite σ̂0².

σ̂0² = (1/(m+n)) { Σᵢ (xi − μ̂)² + Σᵢ (yi − μ̂)² }
= (1/(m+n)) { Σᵢ (xi − x̄ + x̄ − μ̂)² + Σᵢ (yi − ȳ + ȳ − μ̂)² }
= (1/(m+n)) { Σᵢ (xi − x̄)² + m(x̄ − μ̂)² + Σᵢ (yi − ȳ)² + n(ȳ − μ̂)² }
= (1/(m+n)) { Σᵢ (xi − x̄)² + Σᵢ (yi − ȳ)² + (mn/(m+n)) (x̄ − ȳ)² }

since μ̂ = (m x̄ + n ȳ)/(m + n).

Consequently the G.L.R. is

( 1 + (mn/(m+n)) (x̄ − ȳ)² / [ Σᵢ (xi − x̄)² + Σᵢ (yi − ȳ)² ] )^{(m+n)/2}
= ( 1 + t²/(m+n−2) )^{(m+n)/2}

where t = (x̄ − ȳ) / ( s_PL √(1/m + 1/n) ).

Since z ↦ (1 + z/(m+n−2))^{(m+n)/2} is increasing, the G.L.R.T. rejects H0 if |t| is too large.

Question: How large is too large?

We need to find the distribution of T = t(X, Y) under the null hypothesis.
  
X̄ ∼ N(μ, σ²/m) and Ȳ ∼ N(μ, σ²/n) independently  ⇒  X̄ − Ȳ ∼ N( 0, σ²(1/m + 1/n) )

Furthermore, by Fisher’s Theorem,

Σᵢ (Xi − X̄)²/σ² ∼ χ²_{m−1} and Σᵢ (Yi − Ȳ)²/σ² ∼ χ²_{n−1}, independently,

so (m+n−2) s²_PL / σ² ∼ χ²_{m+n−2}.

Finally,

T = A / √( B/(m+n−2) )

where

A = (X̄ − Ȳ) / ( σ √(1/m + 1/n) ) ∼ N(0, 1)
B = (m+n−2) s²_PL / σ² ∼ χ²_{m+n−2}

independently, so T ∼ t_{m+n−2} distribution.
Example 4.6. Numerical example
Two groups of female rats placed on high/low protein diets. We want to measure the gain
in weight of the rats after a week. We test the hypothesis that the mean weight gain under
both diets is the same. We have the values m = 12, x̄ = 120, n = 7, ȳ = 101, s2P L = 446.12.

We get t = (x̄ − ȳ) / ( s_PL √(1/m + 1/n) ) = 1.89.

If the size of the test is to be 5%, then consider the 97.5% point of the t−distribution with 17 (= 12 + 7 − 2) degrees of freedom, which is 2.11.

So we accept the null hypothesis.
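The same calculation in code (my own sketch from the summary statistics above; scipy assumed):

import numpy as np
from scipy.stats import t

m, xbar = 12, 120.0
n, ybar = 7, 101.0
s2_pl = 446.12                                    # pooled variance estimate

t_stat = (xbar - ybar) / np.sqrt(s2_pl * (1 / m + 1 / n))
df = m + n - 2
crit = t.ppf(0.975, df=df)                        # 97.5% point of t_17, about 2.11
print(t_stat, crit, abs(t_stat) > crit)           # roughly 1.89 < 2.11, so accept H0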

4.2.3 The F −test

Example 4.7. The F −test


Suppose X1, . . . , Xm i.i.d. ∼ N(μX, σ²_X) and Y1, . . . , Yn i.i.d. ∼ N(μY, σ²_Y), where μX, μY, σX, σY are all unknown parameters.

We want to test H0 : σX = σY against H1 : σX ≠ σY.

Under H0, the m.l.e.s are:

μ̂X = x̄ = (1/m) Σᵢ xi
μ̂Y = ȳ = (1/n) Σᵢ yi
σ̂² = (1/(m+n)) (Sxx + Syy)

where Sxx = Σᵢ (xi − x̄)², Syy = Σᵢ (yi − ȳ)².

So L(μ̂X, μ̂Y, σ̂², σ̂²) = (1/(2π)^{(m+n)/2}) · (1/σ̂^{m+n}) · exp{ −(m+n)/2 }.

Under H1, the m.l.e.s are:

μ̂X = x̄ = (1/m) Σᵢ xi
μ̂Y = ȳ = (1/n) Σᵢ yi
σ̂²_X = (1/m) Sxx
σ̂²_Y = (1/n) Syy

So L(μ̂X, μ̂Y, σ̂²_X, σ̂²_Y) = (1/(2π)^{(m+n)/2}) · (1/(σ̂_X^m σ̂_Y^n)) · exp{ −(m+n)/2 }.

Hence the G.L.R. is

σ̂^{m+n} / ( σ̂_X^m σ̂_Y^n ) = c(m, n) · ( 1 + Sxx/Syy )^{n/2} · ( 1 + Syy/Sxx )^{m/2}

where c(m, n) is some function of m and n.

Large values of the G.L.R. occur if the ratio Sxx/Syy is either too large or too small.

Under the null hypothesis, writing σX = σY = σ,

Sxx/σ² ∼ χ²_{m−1} and Syy/σ² ∼ χ²_{n−1}, independently,

so (Sxx/(m−1)) / (Syy/(n−1)) ∼ F_{m−1,n−1}.

The F−test rejects H0 at size α if:

(Sxx/(m−1)) / (Syy/(n−1)) ≤ F_{m−1,n−1}(α/2)
or (Sxx/(m−1)) / (Syy/(n−1)) ≥ F_{m−1,n−1}(1 − α/2)

where, as usual, F_{k,l}(α) is the 100α% point of the F_{k,l} distribution.

Actually, the F−test isn’t the G.L.R. test. Why not?

Because the values of the G.L.R. corresponding to the two critical values F_{m−1,n−1}(α/2) and F_{m−1,n−1}(1 − α/2) are not equal. So the F−test doesn’t correspond to rejecting H0 if the G.L.R. exceeds some single critical value.
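A sketch of the two-sided F−test in code (my own illustration with invented sums of squares; scipy assumed):

from scipy.stats import f

m, n = 12, 7
Sxx, Syy = 4200.0, 1900.0                  # hypothetical sums of squares
F = (Sxx / (m - 1)) / (Syy / (n - 1))      # observed variance ratio

alpha = 0.05
lo = f.ppf(alpha / 2, dfn=m - 1, dfd=n - 1)
hi = f.ppf(1 - alpha / 2, dfn=m - 1, dfd=n - 1)
print(F, lo, hi, F <= lo or F >= hi)       # reject H0 if F falls outside (lo, hi)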

4.2.4 Wilk’s Theorem

In the last three examples, we have seen that the G.L.R. can be written as a function of
some simple statistic (t−statistic, F −statistic) whose sampling distribution under the null
hypothesis was known.

In general, we aren’t so lucky and exact distributions under the null hypothesis are not easy
to find.

But there is an amazing approximation result.


Theorem 4.2. Wilk’s Theorem
Suppose data x1 , . . . , xn is modelled as observed values of i.i.d. random variables X1 , . . . , Xn
having a distribution depending on unknown parameters θ = (θ1 , . . . , θd ) ∈ Θ ⊆ Rd .

Suppose we wish to test: H0 : θ ∈ θ0 against H1 : θ ∈ Θ where θ ∈ Θ0 has (m < d) ‘free’


parameters.

Write L(θ; x) for the observed likelihood and let


e 
sup L(θ; x)

θ∈Θ
T (x) = 2 log 
e 
e sup L(θ, x)
θ∈Θ0 e
ST217 Mathematical Statistics B 69

Then T (X ) converges in distribution under the null hypothesis as n → ∞ to the χ2 distri-


e (d − m) degrees of freedom.
bution with

4.3 Multinomial Distribution

Suppose there are k possible outcomes of a random trial, each occurring with probability θi, i = 1, 2, . . . , k, with Σᵢ θi = 1. Suppose we repeat the trial n times. Let Yi, i = 1, 2, . . . , k, denote the number of times outcome i occurs in the n trials. If the trials are independent, then (Y1, . . . , Yk) has a joint distribution with support { (y1, . . . , yk) ∈ Zᵏ : yi ≥ 0, Σᵢ yi = n } and probability mass function

f_{Y1,...,Yk}(y) = ( n! / (y1! y2! · · · yk!) ) θ1^{y1} · · · θk^{yk}.

Comments:
• We say (Y1, . . . , Yk) has a multinomial distribution.
• If we specify an ordering for the outcomes, then each such ordering has probability θ1^{y1} · · · θk^{yk}, and there are n!/(y1! · · · yk!) possible orderings.
• The binomial distribution is the special case k = 2. Then (Y1, Y2) = (Y1, n − Y1) and Y1 has a Bin(n, θ1) distribution.
• In fact, the marginal distribution of Yi is Bin(n, θi), because we can reclassify outcome i as a “success”, call all other outcomes “failures”, and thus obtain a sequence of n Bernoulli trials.

4.3.1 χ2 − Tests

Suppose our data is modelled by a multinomial distribution with unknown parameters.

Suppose we wish to test the null hypothesis that θ is some function θ(φ) of another parameter φ = (φ1, . . . , φ_{k0}) with k0 ≤ k, versus the alternative hypothesis that θ is arbitrary.

The log-likelihood ratio is given by

r(y) = log( L(θ̂; y) / L(θ(φ̂); y) )

= Σᵢ yi log θ̂i − Σᵢ yi log θi(φ̂)

Let’s write ei = n θi(φ̂) = “expected number of type i under the null hypothesis”, and write δi = yi − ei.

Recall also that θ̂i = yi/n.

So 2 r(y) = 2 Σᵢ yi { log(yi/n) − log(ei/n) }
= 2 Σᵢ (ei + δi) log( (ei + δi)/ei )
= 2 Σᵢ (ei + δi) log( 1 + δi/ei )
= 2 Σᵢ (ei + δi) ( δi/ei − δi²/(2ei²) + . . . )
= 2 Σᵢ δi + Σᵢ δi²/ei + higher order terms
  (and Σᵢ δi = Σᵢ yi − Σᵢ ei = n − n = 0)
= Σᵢ (yi − ei)²/ei   ←− Pearson’s statistic

We are assuming n is large and δi/ei is small.

By Wilk’s theorem, the corresponding random variable has a χ2 distribution approximately.

Pearson’s χ²−test rejects H0 if

Σᵢ (yi − ei)²/ei ≥ χ²_d(1 − α)

where χ²_d(1 − α) is the 100(1 − α)th percentile of the χ²_d distribution and

d = dim(Θ) − dim(Θ0) = (k − 1) − k0

Example 4.8. Suppose we are given a survey of 360 students.

                              Ability in Maths
                           low    average    high
  Interest      low         63       42       15
  in Stats      average     58       61       31
                high        14       47       29

Such a table of data is called a contingency table.

Let yij = count in (i, j)th cell of table.

Let H0 be: ability in Maths and interest in Stats are independent, and let H1 be: arbitrary cell probabilities.

H0 corresponds to assuming that the probability θij for cell (i, j) is of the form:

θij = pi p′j

for some

p1, p2, p3 being the probabilities of low, average, and high interest in Stats (the rows), and
p′1, p′2, p′3 being the probabilities of low, average, and high ability in Maths (the columns).

dim(Θ) = 8: if the factors are not independent, the nine cell probabilities are arbitrary apart from summing to 1, giving 9 − 1 = 8 free parameters.

dim(Θ0) = 4: we have the six parameters p1, p2, p3, p′1, p′2, p′3 with the two restrictions p1 + p2 + p3 = 1 and p′1 + p′2 + p′3 = 1, so 6 − 2 = 4.

The m.l.e.s of pi, p′j are given by:

p̂i = (1/n) Σ_{j=1}^{3} yij   (row totals divided by n)

p̂′j = (1/n) Σ_{i=1}^{3} yij   (column totals divided by n)

We have eij = “expected cell count under H0” = n · p̂i p̂′j, given by

   45      50      25
   56.25   62.5    31.25
   33.75   37.5    18.75

Pearson statistic = (63 − 45)²/45 + . . . + (29 − 18.75)²/18.75 = 32.14

If α = 0.01, χ²₄(0.99) = 13.277.

We reject H0 at the 1% significance level.
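For reference, scipy’s chi2_contingency carries out exactly this Pearson calculation (a sketch, using the observed counts from the table above):

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[63, 42, 15],
                     [58, 61, 31],
                     [14, 47, 29]])

stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(stat, dof, p_value)   # Pearson statistic about 32.1 on 4 degrees of freedom
print(expected)             # matches the table of expected counts above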

Example 4.9. We look at the number of errors in a manuscript per page. Each page is a
trial, outcome = number of errors on that page.

Number of errors    Observed frequency
       0                    18
       1                    53
       2                   103
       3                   107
       4                    82
       5                    46
       6                    18
       7                    10
       8+                    3    (2 pages with 8 errors, 1 page with 9)

Let H0 : # errors per page follows a Poisson distribution,
H1 : # errors per page has an arbitrary distribution.

More precisely, we assume the observed frequency vector (y0, y1, . . . , y_{8+}) is multinomial with

θi = e^{−λ} λⁱ/i!, i = 0, 1, . . . , 7,  and  θ_{8+} = Σ_{i≥8} e^{−λ} λⁱ/i!  for some λ.

dim(Θ) = 8, dim(Θ0) = 1 (as there is only one parameter, λ, to be estimated).


Under H0, the m.l.e. of λ is λ̂ = 3.0, so the expected counts are

Number of errors    Poisson probability    Expected count
       0                  0.0498                21.9
       1                  0.1494                65.7
      ...                   ...                  ...

Pearson statistic is (18 − 21.9)²/21.9 + . . . = 6.83.

χ²₇(0.95) = 14.06

We accept the null hypothesis at the 5% level; the data are fitted reasonably well by a Poisson distribution.
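A sketch of this goodness-of-fit computation (my own code, with the counts from the table and λ̂ = 3.0 as in the notes, pooling 8 or more errors into one cell; scipy assumed):

import numpy as np
from scipy.stats import poisson, chi2

counts = np.array([18, 53, 103, 107, 82, 46, 18, 10, 3])   # cells 0..7 and 8+
n = counts.sum()
lam = 3.0                                                   # m.l.e. of lambda

probs = poisson.pmf(np.arange(8), lam)                      # P(0), ..., P(7)
probs = np.append(probs, 1 - probs.sum())                   # P(8 or more)
expected = n * probs

pearson = np.sum((counts - expected) ** 2 / expected)
crit = chi2.ppf(0.95, df=7)                                 # d = 8 - 1 = 7
print(pearson, crit, pearson >= crit)                       # do not reject H0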

Further example. Go to the casino-coin example in MSA, and do as above, except using
uniform distribution.

5 Linear Models

Definition 5.1. A response variable is a random variable (or sometimes several copies of a random variable) whose distribution depends on (perhaps several) explanatory variables, whose values are known, as well as on unknown parameters.

Example 5.1. A patient’s blood pressure after a drug treatment can be considered as a response variable whose distribution depends on explanatory variables: blood pressure before treatment, age, and gender, together with unknown measures of how effective the drug is, which can be described via parameters.

Definition 5.2. A statistical model for the distribution of the response variable(s) is called a linear model if each response variable has an expectation which is a linear function of the parameters for any given values of the explanatory variables.

If Y = (Y1, . . . , Yn)ᵀ is our vector of response variables and β = (β1, . . . , βk)ᵀ is our vector of unknown parameters, then

E[Y] = Aβ

where A is a known constant n × k matrix called the design matrix, whose entries depend on the explanatory variables.

Thus Y = Aβ + ε, where ε = (ε1, . . . , εn)ᵀ is a vector of random variables with E[εi] = 0.

In a linear model, we often assume that the εi are i.i.d.

Remarks:
• Sometimes the explanatory variables are themselves random; we then condition on their observed values.
• In this case, we might write E[Y | X] = A(X)β.
• Sometimes E[Y] is a linear function of β and of the explanatory variables as well, e.g. linear regression with response random variables Y1, . . . , Yn and Yi = Σj xij βj + εi, where xij is the value of the jth explanatory variable on the ith response. But polynomial regression, in which Yi is a polynomial in xi plus noise (e.g. Yi = Σj βj xiʲ + εi), is a linear model too, since it is linear in the parameters βj.

5.1 Simple Linear Regression

Let the data be (xi, yi), i = 1, 2, . . . , n.

We treat yi as observed values of random variables Yi with

Yi = β0 + β1 xi + εi,  i = 1, 2, . . . , n

where the εi are independent random variables with mean 0, β0 and β1 are unknown parameters, and xi is the value of the explanatory variable corresponding to the ith observation of the response.

How do we estimate β0 and β1?



5.1.1 The least squares criterion

We estimate β0 and β1 by β̂0 and β̂1, which are chosen so that

Q = Σ_{i=1}^{n} ( yi − (β̂0 + β̂1 xi) )²  is minimized.

β̂0, β̂1 are then called the least squares estimates of β0 and β1.

[Figure: scatter of data points with the fitted straight line y = β0 + β1 x.]

The sum of squares of the vertical distances from the data points to the line is minimized.

We are calculating a “best fitting” straight line.

To calculate β̂0, β̂1, we proceed as follows.

Solve:

∂Q/∂β0 |_{β0=β̂0, β1=β̂1} = −2 Σᵢ ( yi − (β̂0 + β̂1 xi) ) = 0      (†)
∂Q/∂β1 |_{β0=β̂0, β1=β̂1} = −2 Σᵢ ( yi − (β̂0 + β̂1 xi) ) xi = 0   (††)

From (†),  Σᵢ yi = n β̂0 + β̂1 Σᵢ xi
From (††), Σᵢ xi yi = β̂0 Σᵢ xi + β̂1 Σᵢ xi²

We get

β̂1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² ),   β̂0 = ȳ − β̂1 x̄

We can do something similar with general cases of linear models, in which data yi are treated as observed values of random variables Yi satisfying Y = Aβ + ε, with

ε being an n−vector of random variables with E[εi] = 0,
β being a k−vector of unknown parameters,
A an n × k design matrix depending on the known values of the explanatory variables.

So we want to estimate β by minimizing

Q = Σ_{i=1}^{n} ( yi − (Aβ)i )² = ( y − Aβ )ᵀ ( y − Aβ )

Now

∂Q/∂βj = −2 Σ_{i=1}^{n} ( yi − (Aβ)i ) aij

So to find β̂, the least squares estimates, we solve the k equations

Σ_{i=1}^{n} ( yi − (Aβ̂)i ) aij = 0,  j = 1, 2, . . . , k

or in matrix form

Aᵀ y − Aᵀ A β̂ = 0

So provided AᵀA is an invertible k × k matrix (we say the experiment is non-singular), then

β̂ = (AᵀA)⁻¹ Aᵀ y
Example 5.2. Check that the general result is consistent with what we did for simple linear regression. What is A in this case?

Here A is the n × 2 matrix whose ith row is (1, xi), so

AᵀA = ( n  Σᵢ xi ; Σᵢ xi  Σᵢ xi² )

(AᵀA)⁻¹ = (1/(Σᵢ xi² − n x̄²)) ( (1/n) Σᵢ xi²  −x̄ ; −x̄  1 ),   Aᵀ y = ( n ȳ , Σᵢ xi yi )ᵀ

( β̂0 , β̂1 )ᵀ = (AᵀA)⁻¹ Aᵀ y = (1/(Σᵢ xi² − n x̄²)) ( ȳ Σᵢ xi² − x̄ Σᵢ xi yi ,  Σᵢ xi yi − n x̄ ȳ )ᵀ

which agrees with the least squares estimates β̂0 and β̂1 found above.
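As a quick numerical check of Example 5.2 (my own sketch; numpy assumed), the matrix formula reproduces the simple-regression estimates:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # same invented data as before
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

A = np.column_stack([np.ones(n), x])            # design matrix with rows (1, x_i)
beta_hat = np.linalg.solve(A.T @ A, A.T @ y)    # solves the normal equations

beta1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta_hat, (beta0, beta1))                 # the two computations agree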

Theorem 5.1. Properties of least squares estimators

Suppose Y = Aβ + ε where

E[εi] = 0,
E[εi εj] = 0 for i ≠ j,  E[εi²] = σ².

Assume AᵀA is non-singular and let

β̂ = (AᵀA)⁻¹ Aᵀ Y be the least squares estimator of β.

Then:
1) E[β̂i] = βi, i.e. β̂ is unbiased.
2) Cov[β̂i, β̂j] = σ² ( (AᵀA)⁻¹ )_{ij}

Proof.
1) Consider

E[β̂] = E[ (AᵀA)⁻¹ Aᵀ Y ]
= (AᵀA)⁻¹ Aᵀ E[Y]
= (AᵀA)⁻¹ Aᵀ A β
= β

2) We use the following fact: if X is a random vector of dimension n with variance-covariance matrix Σ, and B is a non-random m × n matrix, then the variance-covariance matrix of BX is given by B Σ Bᵀ.

Now Var[Y] = σ² I and

Var[β̂] = Var[ (AᵀA)⁻¹ Aᵀ Y ]
= (AᵀA)⁻¹ Aᵀ · σ² I · ( (AᵀA)⁻¹ Aᵀ )ᵀ
= σ² (AᵀA)⁻¹ Aᵀ A (AᵀA)⁻¹
= σ² (AᵀA)⁻¹

Notice that our estimator β̂i of βi, namely ( (AᵀA)⁻¹ Aᵀ Y )i, is a linear function of Y, and it is unbiased by Theorem 5.1.

Such an estimator is called a best linear unbiased estimator if it has a smaller variance than any other linear unbiased estimator.
Theorem 5.2. Gauss-Markov
The least squares estimator is the best linear unbiased estimator.

Definition 5.3. In the normal linear model, the response variables Y = (Y1, . . . , Yn)ᵀ satisfy Y = Aβ + ε where
• A is an n × k design matrix depending on the explanatory variables
• β is a k−vector of unknown parameters
• ε = (ε1, . . . , εn) are independent Gaussian random variables with mean 0 and unknown common variance σ² (homoscedasticity)

So E[Y] = Aβ, Var[Y] = σ² I and Y ∼ MVN(Aβ, σ² I).
Theorem 5.3. In the normal linear model the maximum likelihood estimates of β are given by the least squares estimates, and the maximum likelihood estimate of σ² is given by

σ̂² = (1/n) Σ_{i=1}^{n} ( yi − (Aβ̂)i )².

Proof. Let y = (y1, . . . , yn)ᵀ be the data.

The likelihood is given by:

L(β, σ²; y) = (1/((2π)^{n/2} σⁿ)) exp{ −(1/(2σ²)) (y − Aβ)ᵀ (y − Aβ) }

l(β, σ²; y) = log L(β, σ²; y)
= ‘constant’ − (n/2) log σ² − (1/(2σ²)) (y − Aβ)ᵀ (y − Aβ)

where (y − Aβ)ᵀ(y − Aβ) = Σ_{i=1}^{n} ( yi − (Aβ)i )².

To find the maximum likelihood estimates of β, we need to minimize Σ_{i=1}^{n} ( yi − (Aβ)i )², and this will give the least squares estimates.

To find the m.l.e. of σ², we solve

0 = ∂l/∂σ² |_{σ²=σ̂², β=β̂} = −n/(2σ̂²) + (1/(2σ̂⁴)) (y − Aβ̂)ᵀ (y − Aβ̂)

⇒ σ̂² = (1/n) (y − Aβ̂)ᵀ (y − Aβ̂)
Theorem 5.4. Distribution of estimators in the normal linear model
a) β̂ ∼ MVN( β, σ² (AᵀA)⁻¹ )
b) s² = “residual sum of squares” = ( Y − Aβ̂ )ᵀ ( Y − Aβ̂ ) is independent of β̂
c) s²/σ² ∼ χ²_{n−k} where n = dim(Y), k = dim(β)

Proof.
a) β̂ must have a Gaussian distribution because it is constructed from a linear combination of the Yi’s, which are Gaussian. Its mean and variance-covariance matrix we already know from Theorem 5.1.

b) and c) omitted. Similar to Fisher’s Theorem.

Remarks:
• In fact, Fisher’s Theorem is a special case of Theorem 5.4. Take Yi = μ + εi with εi ∼ N(0, σ²).
• An experiment is called an orthogonal design if AᵀA is a diagonal matrix. In this case, β̂1, . . . , β̂k are independent.

Example 5.3. Data y1, . . . , yn is modelled as observed values of Y1, . . . , Yn satisfying

Yi = β0 + β1 xi + εi with εi i.i.d. ∼ N(0, σ²),
xi known constants,
β0, β1, σ² unknown parameters.

We can also write Yi = α0 + α1(xi − x̄) + εi where α1 = β1, α0 − α1 x̄ = β0.

Then Y = Aα + ε with A the n × 2 matrix whose ith row is (1, xi − x̄).

So AᵀA = ( n  0 ; 0  Σᵢ (xi − x̄)² ).

By Theorem 5.4, α̂0 ∼ N( α0, σ²/n ), α̂1 ∼ N( α1, σ²/Σᵢ (xi − x̄)² ), and s²/σ² ∼ χ²_{n−2}. All these are independent.

Suppose we wish to test whether the data suggest that the response variable depends on the value of the explanatory variable.

More precisely, we wish to test H0 : α1 = 0 against H1 : α1 ∈ R.

So since, under H0,

α̂1 ∼ N( 0, σ²/Σᵢ (xi − x̄)² )  and  s²/σ² ∼ χ²_{n−2},  independently,

then

T = α̂1 / √( s² / ( (n − 2) Σᵢ (xi − x̄)² ) ) ∼ t_{n−2}

We reject H0 at significance level 100α% if

|T| > t_{n−2}(1 − α/2)

More generally, we find that

( α̂1 − t_{n−2}(1 − α/2) √( s²/((n − 2) Σᵢ (xi − x̄)²) ) ,  α̂1 + t_{n−2}(1 − α/2) √( s²/((n − 2) Σᵢ (xi − x̄)²) ) )

is a 100(1 − α)% confidence interval for α1.

Example 5.4. Analysis of variance (ANOVA)


Suppose we have k groups, with ni observations from group i and n1 + n2 + . . . + nk = n the total number of observations.

We assume yij, the jth observation from group i, is an observed value of a random variable Yij with Yij = μi + εij where

εij are i.i.d. N(0, σ²) with σ unknown,
μi = ith group mean (unknown).

This is a normal linear model. To put it in standard form, let:

Ỹ = (Y11, Y12, . . . , Y1n1, Y21, . . . , Y2n2, . . . , Yknk)ᵀ   (group 1 first, then group 2, and so on)
ε̃ = (ε11, . . . , εknk)ᵀ
β = (μ1, μ2, . . . , μk)ᵀ

Now Ỹ = Aβ + ε̃


  

with A equal to the n × k matrix (n = n1 + . . . + nk) whose columns indicate group membership:

A = ( 1_{n1}   0       . . .   0
      0        1_{n2}  . . .   0
      .        .        .      .
      0        0       . . .   1_{nk} )

where 1_{ni} denotes a column of ni ones and 0 a column of zeros of the matching length; that is, the first n1 rows of A are (1, 0, . . . , 0), the next n2 rows are (0, 1, 0, . . . , 0), and so on.

 
Then AᵀA = diag(n1, n2, . . . , nk), a diagonal matrix, so this is an orthogonal design.

The m.l.e. of β is:

β̂ = (AᵀA)⁻¹ Aᵀ y = (ȳ1, . . . , ȳk)ᵀ

where ȳi = (1/ni) Σ_{j=1}^{ni} yij = sample mean of group i.

The m.l.e. of σ² is:

σ̂² = (1/n) ( y − Aβ̂ )ᵀ ( y − Aβ̂ ) = (1/n) Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − ȳi)²

Moreover, Theorem 5.4 implies that the corresponding estimators have distributions satisfying:

μ̂i ∼ N( μi, σ²/ni )  and  n σ̂²/σ² ∼ χ²_{n−k}    (∗)

Moreover μ̂1, μ̂2, . . . , μ̂k, σ̂² are independent.

An important hypothesis test is:

H0 : µ1 = µ2 = . . . = µk versus
H1 : µ1 , . . . , µk are unrestricted

Let us derive the G.L.R. statistic to compare these hypotheses.

We just wrote down the m.l.e. under H1.

Under H0, the experiment reduces to a sample of size n from a single Gaussian distribution with unknown mean and variance. So we know that the m.l.e.s are:

μ̂0 = (1/n) Σ_{i=1}^{k} Σ_{j=1}^{ni} yij = overall mean ȳ

σ̂0² = (1/n) Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − ȳ)²

Now the likelihood is

L(μ, σ²; y) = ( 1/(2πσ²)^{n/2} ) · exp{ −(1/(2σ²)) Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − μi)² }

and the G.L.R. is given by:

L(μ̂, σ̂²; y) / L(μ̂0, σ̂0²; y) = ( Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − ȳi)² / Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − ȳ)² )^{−n/2}
= ( s0/s1 )^{n/2}
= ( 1 + s2/s1 )^{n/2}

where

s0 = Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − ȳ)² = “total sum of squares”
s1 = Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − ȳi)² = “within group sum of squares”
s2 = s0 − s1 = Σ_{i=1}^{k} ni (ȳi − ȳ)² = “between group sum of squares”

The likelihood ratio statistic is an increasing function of s2/s1. We reject H0 when this ratio is too large.

The final step is to decide what the critical value of s2/s1 should be if we wish the size of the test to be α, say. To do this we consider the distribution of the random variables corresponding to s1 and s2.

Using (∗), we see that under H0,

s1 ∼ σ² χ²_{n−k} and s2 ∼ σ² χ²_{k−1}, independently.

Consequently,

( s2/(k−1) ) / ( s1/(n−k) ) ∼ F_{k−1,n−k}.
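A sketch of this one-way ANOVA F−test in code (my own illustration with invented groups; scipy assumed), checked against scipy.stats.f_oneway:

import numpy as np
from scipy.stats import f, f_oneway

groups = [np.array([5.1, 4.9, 5.6, 5.3]),                   # hypothetical data
          np.array([6.2, 5.8, 6.5]),
          np.array([4.4, 4.8, 4.6, 5.0, 4.3])]
k = len(groups)
n = sum(len(g) for g in groups)
ybar = np.concatenate(groups).mean()

s1 = sum(np.sum((g - g.mean()) ** 2) for g in groups)       # within-group sum of squares
s2 = sum(len(g) * (g.mean() - ybar) ** 2 for g in groups)   # between-group sum of squares

F = (s2 / (k - 1)) / (s1 / (n - k))
p_value = f.sf(F, k - 1, n - k)
print(F, p_value)
print(f_oneway(*groups))                                    # same F statistic and p-value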
