
3. Multivariate Normal Distribution


The MVN distribution is a generalization of the univariate normal distribution, which has the density function (p.d.f.)
\[
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}, \qquad -\infty < x < \infty,
\]
where $\mu$ is the mean of the distribution and $\sigma^2$ the variance. In $p$ dimensions the density becomes
\[
f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}\, (\mathbf{x}-\boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol\mu) \right\}. \tag{3.1}
\]
Within the mean vector there are $p$ (independent) parameters and within the symmetric covariance matrix there are $\frac{1}{2}p(p+1)$ independent parameters [$\frac{1}{2}p(p+3)$ independent parameters in total]. We use the notation
\[
\mathbf{x} \sim N_p(\boldsymbol\mu, \Sigma) \tag{3.2}
\]
to denote a random vector $\mathbf{x}$ having the $p$-variate MVN distribution with
\[
E(\mathbf{x}) = \boldsymbol\mu, \qquad \mathrm{Cov}(\mathbf{x}) = \Sigma.
\]
Note that MVN distributions are entirely characterized by the first and second moments of the distribution.
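As a minimal numerical sketch of (3.1) and (3.2), the density can be evaluated directly from the formula and cross-checked against scipy's implementation; the particular $\boldsymbol\mu$, $\Sigma$ and evaluation point below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (arbitrary choices, p = 2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, -1.0])

# Density f(x) written exactly as in (3.1)
p = len(mu)
d = x - mu
quad_form = d @ np.linalg.solve(Sigma, d)          # (x - mu)^T Sigma^{-1} (x - mu)
f_manual = np.exp(-0.5 * quad_form) / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)

# Cross-check with scipy's N_p(mu, Sigma)
f_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(f_manual, f_scipy)   # the two values agree to machine precision
```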
3.1 Basic properties
If $\mathbf{x}$ ($p \times 1$) is MVN with mean $\boldsymbol\mu$ and covariance matrix $\Sigma$:

- Any linear combination of $\mathbf{x}$ is MVN. Let $\mathbf{y} = A\mathbf{x} + \mathbf{c}$ with $A$ ($q \times p$) and $\mathbf{c}$ ($q \times 1$); then
\[
\mathbf{y} \sim N_q\left( \boldsymbol\mu_y, \Sigma_y \right), \qquad \text{where } \boldsymbol\mu_y = A\boldsymbol\mu + \mathbf{c} \text{ and } \Sigma_y = A\Sigma A^T
\]
(illustrated in the sketch after this list).
- Any subset of variables in $\mathbf{x}$ has a MVN distribution.
- If a set of variables is uncorrelated, then they are independently distributed. In particular,
  i) if $\sigma_{ij} = 0$ then $x_i$, $x_j$ are independent;
  ii) if $\mathbf{x}$ is MVN with covariance matrix $\Sigma$, then $A\mathbf{x}$ and $B\mathbf{x}$ are independent if and only if
\[
\mathrm{Cov}(A\mathbf{x}, B\mathbf{x}) = A\Sigma B^T = 0. \tag{3.3}
\]
- Conditional distributions are MVN.
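The linear-combination property can be checked by simulation; this is a rough sketch in which $A$, $\mathbf{c}$, $\boldsymbol\mu$ and $\Sigma$ are arbitrary illustrative values, comparing the sample moments of $\mathbf{y} = A\mathbf{x} + \mathbf{c}$ with $A\boldsymbol\mu + \mathbf{c}$ and $A\Sigma A^T$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: x ~ N_3(mu, Sigma), y = A x + c with A (2 x 3), c (2 x 1)
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
c = np.array([0.5, -0.5])

# Simulate many draws of x and transform them
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + c

# Sample moments of y should be close to the theoretical A mu + c and A Sigma A^T
print(y.mean(axis=0), A @ mu + c)
print(np.cov(y, rowvar=False))
print(A @ Sigma @ A.T)
```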
Result
For the MVN distribution, variables are uncorrelated $\iff$ variables are independent.
Proof
Let $\mathbf{x}$ ($p \times 1$) be partitioned as
\[
\mathbf{x} = \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix},
\]
where $\mathbf{x}_1$ has $p_1$ components and $\mathbf{x}_2$ has $p_2$ components ($p_1 + p_2 = p$), with mean vector
\[
\boldsymbol\mu = \begin{pmatrix} \boldsymbol\mu_1 \\ \boldsymbol\mu_2 \end{pmatrix}
\]
and covariance matrix
\[
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]
i) Independent $\Rightarrow$ uncorrelated (always holds).
Suppose $\mathbf{x}_1$, $\mathbf{x}_2$ are independent. Then $f(\mathbf{x}_1, \mathbf{x}_2) = h(\mathbf{x}_1)\, g(\mathbf{x}_2)$ is a factorization of the multivariate p.d.f., and
\[
\Sigma_{12} = \mathrm{Cov}(\mathbf{x}_1, \mathbf{x}_2) = E\left[ (\mathbf{x}_1 - \boldsymbol\mu_1)(\mathbf{x}_2 - \boldsymbol\mu_2)^T \right]
\]
factorizes into the product of $E[(\mathbf{x}_1 - \boldsymbol\mu_1)]$ and $E\left[ (\mathbf{x}_2 - \boldsymbol\mu_2)^T \right]$, which are both zero since $E(\mathbf{x}_1) = \boldsymbol\mu_1$ and $E(\mathbf{x}_2) = \boldsymbol\mu_2$. Hence $\Sigma_{12} = 0$.
ii) Uncorrelated $\Rightarrow$ independent (for MVN).
This result depends on factorizing the p.d.f. (3.1) when $\Sigma_{12} = 0$. In this case $(\mathbf{x} - \boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol\mu)$ has the partitioned form
\[
\begin{pmatrix} \mathbf{x}_1^T - \boldsymbol\mu_1^T, & \mathbf{x}_2^T - \boldsymbol\mu_2^T \end{pmatrix}
\begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}^{-1}
\begin{pmatrix} \mathbf{x}_1 - \boldsymbol\mu_1 \\ \mathbf{x}_2 - \boldsymbol\mu_2 \end{pmatrix}
=
\begin{pmatrix} \mathbf{x}_1^T - \boldsymbol\mu_1^T, & \mathbf{x}_2^T - \boldsymbol\mu_2^T \end{pmatrix}
\begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}
\begin{pmatrix} \mathbf{x}_1 - \boldsymbol\mu_1 \\ \mathbf{x}_2 - \boldsymbol\mu_2 \end{pmatrix}
\]
\[
= (\mathbf{x}_1 - \boldsymbol\mu_1)^T \Sigma_{11}^{-1} (\mathbf{x}_1 - \boldsymbol\mu_1) + (\mathbf{x}_2 - \boldsymbol\mu_2)^T \Sigma_{22}^{-1} (\mathbf{x}_2 - \boldsymbol\mu_2),
\]
so that $\exp\{-\tfrac{1}{2}(\mathbf{x} - \boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol\mu)\}$ factorizes into the product of $\exp\{-\tfrac{1}{2}(\mathbf{x}_1 - \boldsymbol\mu_1)^T \Sigma_{11}^{-1} (\mathbf{x}_1 - \boldsymbol\mu_1)\}$ and $\exp\{-\tfrac{1}{2}(\mathbf{x}_2 - \boldsymbol\mu_2)^T \Sigma_{22}^{-1} (\mathbf{x}_2 - \boldsymbol\mu_2)\}$. Therefore the p.d.f. can be written as
\[
f(\mathbf{x}) = g(\mathbf{x}_1)\, h(\mathbf{x}_2),
\]
proving that $\mathbf{x}_1$ and $\mathbf{x}_2$ are independent.
3.2 Conditional distribution
Let
\[
\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}
\]
be a partitioned MVN random $p$-vector, where $\mathbf{X}_1$ has $p_1$ components and $\mathbf{X}_2$ has $p_2$ components ($p_1 + p_2 = p$), with mean
\[
\boldsymbol\mu = \begin{pmatrix} \boldsymbol\mu_1 \\ \boldsymbol\mu_2 \end{pmatrix}
\]
and covariance matrix
\[
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]
The conditional distribution of $\mathbf{X}_2$ given $\mathbf{X}_1 = \mathbf{x}_1$ is MVN with
\[
E(\mathbf{X}_2 \mid \mathbf{X}_1 = \mathbf{x}_1) = \boldsymbol\mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (\mathbf{x}_1 - \boldsymbol\mu_1) \tag{3.4a}
\]
\[
\mathrm{Cov}(\mathbf{X}_2 \mid \mathbf{X}_1 = \mathbf{x}_1) = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \tag{3.4b}
\]
Note: the notation $\mathbf{X}_1$ to denote the r.v. and $\mathbf{x}_1$ to denote a specific constant value (a realization of $\mathbf{X}_1$) will be very useful here.
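Formulas (3.4a) and (3.4b) translate directly into code; the sketch below assumes the first $p_1$ components form $\mathbf{X}_1$, and the helper name conditional_mvn and the numerical values are illustrative choices only.

```python
import numpy as np

def conditional_mvn(mu, Sigma, p1, x1):
    """Mean and covariance of X2 | X1 = x1 for X ~ N_p(mu, Sigma),
    where X1 holds the first p1 components, as in (3.4a)-(3.4b)."""
    mu1, mu2 = mu[:p1], mu[p1:]
    S11 = Sigma[:p1, :p1]
    S12 = Sigma[:p1, p1:]
    S21 = Sigma[p1:, :p1]
    S22 = Sigma[p1:, p1:]
    # (3.4a): mu2 + Sigma21 Sigma11^{-1} (x1 - mu1)
    cond_mean = mu2 + S21 @ np.linalg.solve(S11, x1 - mu1)
    # (3.4b): Sigma22 - Sigma21 Sigma11^{-1} Sigma12
    cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)
    return cond_mean, cond_cov

# Illustrative example with p = 3, conditioning on the first component
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
print(conditional_mvn(mu, Sigma, p1=1, x1=np.array([0.7])))
```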
Proof of 3.4a
Define a transformation from $(\mathbf{X}_1, \mathbf{X}_2)$ to new variables $\mathbf{X}_1$ and $\mathbf{X}_2' = \mathbf{X}_2 - \Sigma_{21}\Sigma_{11}^{-1}\mathbf{X}_1$. Note that this can be achieved by the linear transformation
\[
\begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2' \end{pmatrix}
= \begin{pmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}
\begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} \tag{3.5a}
\]
\[
\mathbf{Y} = A\mathbf{X}, \text{ say.} \tag{3.5b}
\]
This linear relationship shows that $\mathbf{X}_1$, $\mathbf{X}_2'$ are jointly MVN (by the first property of MVN stated above).
We now show that $\mathbf{X}_2'$ and $\mathbf{X}_1$ are independent by proving that $\mathbf{X}_1$ and $\mathbf{X}_2'$ are uncorrelated:
\[
\mathrm{Cov}\left( \mathbf{X}_1, \mathbf{X}_2' \right)
= \mathrm{Cov}\left( \mathbf{X}_1, \mathbf{X}_2 - \Sigma_{21}\Sigma_{11}^{-1}\mathbf{X}_1 \right)
= \mathrm{Cov}(\mathbf{X}_1, \mathbf{X}_2) - \mathrm{Cov}(\mathbf{X}_1, \mathbf{X}_1)\,\Sigma_{11}^{-1}\Sigma_{12}
= \Sigma_{12} - \Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} = 0.
\]
Note also that, in the notation of (3.3), we may write
\[
A = \begin{pmatrix} B \\ C \end{pmatrix}, \qquad \text{where } B = \begin{pmatrix} I & 0 \end{pmatrix} \text{ and } C = \begin{pmatrix} -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}.
\]
Hence again we see that
\[
\mathrm{Cov}\left( \mathbf{X}_1, \mathbf{X}_2' \right) = \mathrm{Cov}(B\mathbf{X}, C\mathbf{X}) = B\Sigma C^T
= \begin{pmatrix} I & 0 \end{pmatrix}
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix}
= \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \end{pmatrix}
\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix}
= 0.
\]
Since $\mathbf{X}_2'$ and $\mathbf{X}_1$ are MVN and uncorrelated, they are independent. Thus
\[
E\left( \mathbf{X}_2' \mid \mathbf{X}_1 = \mathbf{x}_1 \right) = E\left( \mathbf{X}_2' \right)
= E\left( \mathbf{X}_2 - \Sigma_{21}\Sigma_{11}^{-1}\mathbf{X}_1 \right)
= \boldsymbol\mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\boldsymbol\mu_1.
\]
Now, as $\mathbf{X}_2' = \mathbf{X}_2 - \Sigma_{21}\Sigma_{11}^{-1}\mathbf{X}_1$ and $\mathbf{X}_1 = \mathbf{x}_1$ is given, we have
\[
E(\mathbf{X}_2 \mid \mathbf{X}_1 = \mathbf{x}_1)
= E\left( \mathbf{X}_2' \mid \mathbf{X}_1 = \mathbf{x}_1 \right) + \Sigma_{21}\Sigma_{11}^{-1}\mathbf{x}_1
= \boldsymbol\mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\boldsymbol\mu_1 + \Sigma_{21}\Sigma_{11}^{-1}\mathbf{x}_1
= \boldsymbol\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(\mathbf{x}_1 - \boldsymbol\mu_1),
\]
as required.
Proof of 3.4b
Because $\mathbf{X}_2'$ is independent of $\mathbf{X}_1$,
\[
\mathrm{Cov}\left( \mathbf{X}_2' \mid \mathbf{X}_1 = \mathbf{x}_1 \right) = \mathrm{Cov}\left( \mathbf{X}_2' \right).
\]
The left-hand side is
\[
\mathrm{LHS} = \mathrm{Cov}\left( \mathbf{X}_2' \mid \mathbf{X}_1 = \mathbf{x}_1 \right)
= \mathrm{Cov}\left( \mathbf{X}_2 - \Sigma_{21}\Sigma_{11}^{-1}\mathbf{x}_1 \mid \mathbf{X}_1 = \mathbf{x}_1 \right)
= \mathrm{Cov}\left( \mathbf{X}_2 \mid \mathbf{X}_1 = \mathbf{x}_1 \right).
\]
The right-hand side is
\[
\mathrm{RHS} = \mathrm{Cov}\left( \mathbf{X}_2' \right)
= \mathrm{Cov}\left( \mathbf{X}_2 - \Sigma_{21}\Sigma_{11}^{-1}\mathbf{X}_1 \right)
= \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12},
\]
following from the general expansion
\[
\mathrm{Cov}\left( \mathbf{X}_2 - D\mathbf{X}_1 \right)
= \mathrm{Cov}(\mathbf{X}_2, \mathbf{X}_2) - D\,\mathrm{Cov}(\mathbf{X}_1, \mathbf{X}_2) - \mathrm{Cov}(\mathbf{X}_2, \mathbf{X}_1)\,D^T + D\,\mathrm{Cov}(\mathbf{X}_1, \mathbf{X}_1)\,D^T
\]
with $D = \Sigma_{21}\Sigma_{11}^{-1}$. Therefore
\[
\mathrm{Cov}(\mathbf{X}_2 \mid \mathbf{X}_1 = \mathbf{x}_1) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12},
\]
as required.
Example
Let $\mathbf{x}$ have a MVN distribution with covariance matrix
\[
\Sigma = \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & 0 \\ \rho^2 & 0 & 1 \end{pmatrix}.
\]
Show that the conditional distribution of $(X_1, X_2)$ given $X_3 = x_3$ is also MVN with mean
\[
\begin{pmatrix} \mu_1 + \rho^2 (x_3 - \mu_3) \\ \mu_2 \end{pmatrix}
\]
and covariance matrix
\[
\begin{pmatrix} 1 - \rho^4 & \rho \\ \rho & 1 \end{pmatrix}.
\]
Solution
Let
\[
\mathbf{Y}_1 = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \quad \text{and} \quad \mathbf{Y}_2 = (X_3);
\]
then
\[
E\,\mathbf{Y}_1 = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad E\,\mathbf{Y}_2 = (\mu_3).
\]
We have
\[
\mathrm{Cov}\begin{pmatrix} \mathbf{Y}_1 \\ \mathbf{Y}_2 \end{pmatrix}
= \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},
\]
where
\[
\Sigma_{11} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad
\Sigma_{12} = \begin{pmatrix} \rho^2 \\ 0 \end{pmatrix} = \Sigma_{21}^T, \qquad
\Sigma_{22} = [1].
\]
Hence
\[
E[\mathbf{Y}_1 \mid \mathbf{Y}_2 = x_3]
= E\,\mathbf{Y}_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_3 - \mu_3)
= \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \rho^2 \\ 0 \end{pmatrix}(x_3 - \mu_3)
= \begin{pmatrix} \mu_1 + \rho^2(x_3 - \mu_3) \\ \mu_2 \end{pmatrix}
\]
and
\[
\mathrm{Cov}[\mathbf{Y}_1 \mid \mathbf{Y}_2 = x_3]
= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
= \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} - \begin{pmatrix} \rho^2 \\ 0 \end{pmatrix}\begin{pmatrix} \rho^2 & 0 \end{pmatrix}
= \begin{pmatrix} 1 - \rho^4 & \rho \\ \rho & 1 \end{pmatrix}.
\]
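As a quick numerical check of this example (a sketch only; $\rho = 0.4$, the means and $x_3$ are arbitrary illustrative values), the partitioned formulas reproduce the stated conditional mean and covariance.

```python
import numpy as np

rho = 0.4                                  # illustrative value of rho
mu = np.array([1.0, 2.0, 3.0])             # illustrative means mu1, mu2, mu3
x3 = 3.5                                   # observed value of X3

Sigma = np.array([[1.0,    rho,  rho**2],
                  [rho,    1.0,  0.0   ],
                  [rho**2, 0.0,  1.0   ]])

# Partition with Y1 = (X1, X2) and Y2 = (X3)
S11, S12, S22 = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

cond_mean = mu[:2] + (S12 @ np.linalg.inv(S22)).ravel() * (x3 - mu[2])
cond_cov = S11 - S12 @ np.linalg.inv(S22) @ S12.T

print(cond_mean)   # [mu1 + rho^2 (x3 - mu3), mu2] = [1.08, 2.0]
print(cond_cov)    # [[1 - rho^4, rho], [rho, 1]] = [[0.9744, 0.4], [0.4, 1.0]]
```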
3.3 Maximum-likelihood estimation
Let $X^T = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ contain an independent random sample of size $n$ from $N_p(\boldsymbol\mu, \Sigma)$.
The maximum likelihood estimates (MLEs) of $\boldsymbol\mu$, $\Sigma$ are the sample mean and covariance matrix (with divisor $n$):
\[
\hat{\boldsymbol\mu} = \bar{\mathbf{x}} \tag{3.6a}
\]
\[
\hat\Sigma = S \tag{3.6b}
\]
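The MLEs in (3.6) are straightforward to compute; a minimal sketch with simulated data follows (the true parameters are arbitrary illustrative values). Note that $S$ here uses divisor $n$, whereas numpy's np.cov defaults to divisor $n-1$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: n = 500 observations from an illustrative N_2(mu, Sigma)
mu_true = np.array([2.0, -1.0])
Sigma_true = np.array([[1.0, 0.4],
                       [0.4, 2.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=500)   # data matrix, n x p
n = X.shape[0]

mu_hat = X.mean(axis=0)                       # (3.6a): sample mean
centred = X - mu_hat
S_hat = centred.T @ centred / n               # (3.6b): sample covariance with divisor n

print(mu_hat)
print(S_hat)
print(np.cov(X, rowvar=False, bias=True))     # same as S_hat (bias=True gives divisor n)
```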
The likelihood function is a function of the parameters $\boldsymbol\mu$, $\Sigma$ given the data $X$:
\[
L(\boldsymbol\mu, \Sigma \mid X) = \prod_{r=1}^{n} f(\mathbf{x}_r \mid \boldsymbol\mu, \Sigma). \tag{3.7}
\]
The RHS is evaluated by substituting the individual data vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in turn into the p.d.f. of $N_p(\boldsymbol\mu, \Sigma)$ and taking the product:
\[
\prod_{r=1}^{n} f(\mathbf{x}_r \mid \boldsymbol\mu, \Sigma)
= (2\pi)^{-np/2}\, |\Sigma|^{-n/2} \exp\left\{ -\frac{1}{2} \sum_{r=1}^{n} (\mathbf{x}_r - \boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x}_r - \boldsymbol\mu) \right\}.
\]
Maximizing $L$ is equivalent to minimizing the "log-likelihood" function
\[
\ell(\boldsymbol\mu, \Sigma) = -2\log L = -2\sum_{r=1}^{n} \log f(\mathbf{x}_r \mid \boldsymbol\mu, \Sigma)
= c + n\log|\Sigma| + \sum_{r=1}^{n} (\mathbf{x}_r - \boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x}_r - \boldsymbol\mu), \tag{3.8}
\]
where $c$ is a constant independent of $\boldsymbol\mu$, $\Sigma$.
Result 3.3
\[
\ell(\boldsymbol\mu, \Sigma) = n\left[ \log|\Sigma| + \mathrm{tr}\left\{ \Sigma^{-1}\left( S + \mathbf{d}\mathbf{d}^T \right) \right\} \right] \tag{3.9}
\]
up to an additive constant, where $\mathbf{d} = \bar{\mathbf{x}} - \boldsymbol\mu$.
Proof
Noting that $\mathbf{x}_r - \boldsymbol\mu = (\mathbf{x}_r - \bar{\mathbf{x}}) + \mathbf{d}$, the final term in the likelihood expression (3.8) becomes
\[
\sum_{r=1}^{n} (\mathbf{x}_r - \boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x}_r - \boldsymbol\mu)
= \sum_{r=1}^{n} (\mathbf{x}_r - \bar{\mathbf{x}})^T \Sigma^{-1} (\mathbf{x}_r - \bar{\mathbf{x}}) + n\,\mathbf{d}^T \Sigma^{-1} \mathbf{d}
= n\,\mathrm{tr}\left( \Sigma^{-1} S \right) + n\,\mathbf{d}^T \Sigma^{-1} \mathbf{d}
= n\,\mathrm{tr}\left\{ \Sigma^{-1}\left( S + \mathbf{d}\mathbf{d}^T \right) \right\},
\]
proving the expression (3.9). Note that the cross-product terms have vanished because $\sum_{r=1}^{n} \mathbf{x}_r = n\bar{\mathbf{x}}$ and therefore
\[
\sum_{r=1}^{n} \mathbf{d}^T \Sigma^{-1} (\mathbf{x}_r - \bar{\mathbf{x}})
= \mathbf{d}^T \Sigma^{-1} \sum_{r=1}^{n} (\mathbf{x}_r - \bar{\mathbf{x}})
= \sum_{r=1}^{n} (\mathbf{x}_r - \bar{\mathbf{x}})^T \Sigma^{-1} \mathbf{d} = 0.
\]
In (3.9) the dependence on $\boldsymbol\mu$ is entirely through $\mathbf{d}$. Now assume that $\Sigma$ is positive definite (p.d.); then so is $\Sigma^{-1}$, since
\[
\Sigma^{-1} = V \Lambda^{-1} V^T, \qquad \text{where } \Sigma = V \Lambda V^T
\]
is the eigenanalysis of $\Sigma$. Thus for all $\mathbf{d} \ne 0$ we have $\mathbf{d}^T \Sigma^{-1} \mathbf{d} > 0$. Hence $\ell(\boldsymbol\mu, \Sigma)$ is minimized with respect to $\boldsymbol\mu$, for fixed $\Sigma$, when $\mathbf{d} = 0$, i.e.
\[
\hat{\boldsymbol\mu} = \bar{\mathbf{x}}.
\]
Final part of proof: to minimize the log-likelihood $\ell(\hat{\boldsymbol\mu}, \Sigma)$ w.r.t. $\Sigma$, let
\[
\ell(\hat{\boldsymbol\mu}, \Sigma) = n\left[ \log|\Sigma| + \mathrm{tr}\left( \Sigma^{-1} S \right) \right] = \psi(\Sigma), \text{ say.} \tag{3.10}
\]
We show that $S$ minimizes $\psi(\Sigma)$ by proving that $\psi(S) \le \psi(\Sigma)$ for all $\Sigma$:
\[
\psi(\Sigma) - \psi(S)
= n\left[ \log|\Sigma| - \log|S| + \mathrm{tr}\left( \Sigma^{-1} S \right) - p \right]
= n\left[ \mathrm{tr}\left( \Sigma^{-1} S \right) - \log|\Sigma^{-1} S| - p \right] \tag{3.11}
\]
\[
\ge 0.
\]
Lemma 1
$\Sigma^{-1}S$ is positive semi-definite (proved elsewhere). Therefore its eigenvalues are non-negative, and they are strictly positive provided $S$ is non-singular, as assumed here.
Lemma 2
For any set of positive numbers,
\[
A \ge 1 + \log G,
\]
where $A$ and $G$ are the arithmetic and geometric means respectively.
Proof
For all $x$ we have $e^x \ge 1 + x$ (simple exercise). Consider a set of $n$ strictly positive numbers $y_i$:
\[
y_i \ge 1 + \log y_i
\quad\Rightarrow\quad
\sum y_i \ge n + \sum \log y_i
\quad\Rightarrow\quad
A \ge 1 + \log\left( \prod y_i \right)^{1/n} = 1 + \log G,
\]
as required.
Recall that for any $(n \times n)$ matrix $A$ we have $\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i$, the sum of the eigenvalues, and $|A| = \prod \lambda_i$, the product of the eigenvalues. Let $\lambda_i$ ($i = 1, \ldots, p$) be the positive eigenvalues of $\Sigma^{-1}S$ and substitute in (3.11), with $A$ and $G$ now denoting the arithmetic and geometric means of the $\lambda_i$:
\[
\log|\Sigma^{-1}S| = \log\left( \prod \lambda_i \right) = p\log G, \qquad
\mathrm{tr}\left( \Sigma^{-1}S \right) = \sum \lambda_i = pA.
\]
Hence
\[
\psi(\Sigma) - \psi(S) = np\left( A - \log G - 1 \right) \ge 0.
\]
This proves that the MLEs are as stated in (3.6).
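Result 3.3, and hence the pivotal identity used in this proof, can be checked numerically: up to the additive constant $np\log(2\pi)$, the direct sum in (3.8) and the trace form in (3.9) coincide. This is a sketch with simulated data and arbitrary trial values of $\boldsymbol\mu$ and $\Sigma$ (they need not be the true parameters).

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data and arbitrary trial parameter values
X = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=50)
n, p = X.shape
mu = np.array([0.3, -0.2])
Sigma = np.array([[1.5, 0.2],
                  [0.2, 0.8]])

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / n
d = xbar - mu
Sinv = np.linalg.inv(Sigma)

# (3.8) without its constant: n log|Sigma| + sum_r (x_r - mu)^T Sigma^{-1} (x_r - mu)
quad = np.einsum('ri,ij,rj->', X - mu, Sinv, X - mu)
ell_direct = n * np.log(np.linalg.det(Sigma)) + quad

# (3.9): n [ log|Sigma| + tr{ Sigma^{-1} (S + d d^T) } ]
ell_trace = n * (np.log(np.linalg.det(Sigma)) + np.trace(Sinv @ (S + np.outer(d, d))))

print(ell_direct, ell_trace)   # identical up to floating-point rounding
```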
3.3 Sampling distribution of $\bar{\mathbf{x}}$ and $S$

3.3.1 The Wishart distribution (Definition)
If $M$ ($p \times p$) can be written $M = X^T X$, where $X$ ($m \times p$) is a data matrix from $N_p(0, \Sigma)$, then $M$ is said to have a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom $m$. We write
\[
M \sim W_p(\Sigma, m). \tag{3.12}
\]
When $\Sigma = I_p$ the distribution is said to be in standard form.
Note: the Wishart distribution is the multivariate generalization of the chi-square ($\chi^2$) distribution.
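A sketch of the definition (3.12): draw a data matrix $X$ from $N_p(0, \Sigma)$, form $M = X^T X$, and check by Monte Carlo that $E(M) = m\Sigma$ (a standard Wishart fact); the dimension, degrees of freedom and $\Sigma$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

p, m = 3, 10                               # dimension and degrees of freedom
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

def wishart_draw():
    """One draw M = X^T X with X an (m x p) data matrix from N_p(0, Sigma), as in (3.12)."""
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
    return X.T @ X

# Monte Carlo check: E(M) = m * Sigma for M ~ W_p(Sigma, m)
M_bar = np.mean([wishart_draw() for _ in range(20_000)], axis=0)
print(M_bar)
print(m * Sigma)
```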
9
Additive property of matrices with a Wishart distribution
Let $M_1$, $M_2$ be matrices having the Wishart distribution
\[
M_1 \sim W_p(\Sigma, m_1), \qquad M_2 \sim W_p(\Sigma, m_2),
\]
independently; then
\[
M_1 + M_2 \sim W_p(\Sigma, m_1 + m_2).
\]
This property follows from the definition of the Wishart distribution, because data matrices are additive in the sense that if
\[
X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}
\]
is a combined data matrix consisting of $m_1 + m_2$ rows, then
\[
X^T X = X_1^T X_1 + X_2^T X_2
\]
is the matrix (known as the "Gram matrix") formed from the combined data matrix $X$.
Case of $p = 1$
When $p = 1$ we know, from the definition of $\chi^2_m$ as the distribution of the sum of squares of $m$ independent $N(0, 1)$ variates, that if $X_i \sim N(0, \sigma^2)$ then
\[
M = \sum_{i=1}^{m} X_i^2 \sim \sigma^2 \chi^2_m,
\]
so that
\[
W_1(\sigma^2, m) = \sigma^2 \chi^2_m.
\]
3.3.2 Sampling distributions
Let $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ be a random sample of size $n$ from $N_p(\boldsymbol\mu, \Sigma)$. Then
1. The sample mean $\bar{\mathbf{x}}$ has the normal distribution
\[
\bar{\mathbf{x}} \sim N_p\left( \boldsymbol\mu, \tfrac{1}{n}\Sigma \right).
\]
2. The (scaled) sample covariance matrix $S_u$ (the version with divisor $n - 1$) has the Wishart distribution:
\[
(n - 1)\, S_u \sim W_p(\Sigma, n - 1).
\]
3. The distributions of $\bar{\mathbf{x}}$ and $S_u$ are independent.
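These sampling results can be checked by simulation; the rough sketch below verifies that $\mathrm{Cov}(\bar{\mathbf{x}}) \approx \tfrac{1}{n}\Sigma$ and $E[(n-1)S_u] \approx (n-1)\Sigma$ for an illustrative choice of $\boldsymbol\mu$, $\Sigma$ and $n$.

```python
import numpy as np

rng = np.random.default_rng(4)

n, reps = 20, 20_000
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

xbars, scatters = [], []
for _ in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    xbars.append(X.mean(axis=0))
    Su = np.cov(X, rowvar=False)            # unbiased covariance, divisor n - 1
    scatters.append((n - 1) * Su)

xbars = np.array(xbars)
print(np.cov(xbars, rowvar=False))          # close to Sigma / n
print(Sigma / n)
print(np.mean(scatters, axis=0))            # close to (n - 1) * Sigma
print((n - 1) * Sigma)
```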
10
3.4 Estimators for special circumstances

3.4.1 $\boldsymbol\mu$ proportional to a given vector
Sometimes $\boldsymbol\mu$ is known to be proportional to a given vector, so $\boldsymbol\mu = k\boldsymbol\mu_0$ with $\boldsymbol\mu_0$ being a known vector.
For example, if $\mathbf{x}$ represents a sample of repeated measurements then $\boldsymbol\mu = k\mathbf{1}$, where $\mathbf{1} = (1, 1, \ldots, 1)^T$ is the $p$-vector of 1's.
We find the MLE of $k$ for this situation. Suppose $\Sigma$ is known and $\boldsymbol\mu = k\boldsymbol\mu_0$. Let $\mathbf{d}_0 = \bar{\mathbf{x}} - k\boldsymbol\mu_0$.
The log-likelihood is
\[
\ell(k) = -2\log L
= n\left[ \log|\Sigma| + \mathrm{tr}\left\{ \Sigma^{-1}\left( S + \mathbf{d}_0\mathbf{d}_0^T \right) \right\} \right]
= n\left[ \log|\Sigma| + \mathrm{tr}\left( \Sigma^{-1} S \right) + (\bar{\mathbf{x}} - k\boldsymbol\mu_0)^T \Sigma^{-1} (\bar{\mathbf{x}} - k\boldsymbol\mu_0) \right]
\]
\[
= n\left[ \bar{\mathbf{x}}^T \Sigma^{-1} \bar{\mathbf{x}} - 2k\,\boldsymbol\mu_0^T \Sigma^{-1} \bar{\mathbf{x}} + k^2\,\boldsymbol\mu_0^T \Sigma^{-1} \boldsymbol\mu_0 \right] + \text{constant terms independent of } k.
\]
Set $\dfrac{d\ell}{dk} = 0$ to minimize $\ell(k)$ w.r.t. $k$:
\[
-2\boldsymbol\mu_0^T \Sigma^{-1} \bar{\mathbf{x}} + 2\left( \boldsymbol\mu_0^T \Sigma^{-1} \boldsymbol\mu_0 \right) k = 0,
\]
from which
\[
\hat{k} = \frac{\boldsymbol\mu_0^T \Sigma^{-1} \bar{\mathbf{x}}}{\boldsymbol\mu_0^T \Sigma^{-1} \boldsymbol\mu_0}. \tag{3.13}
\]
Properties
We now show that $\hat{k}$ is an unbiased estimator of $k$ and determine the variance of $\hat{k}$.
In (3.13), $\hat{k}$ takes the form $\dfrac{1}{c}\,\mathbf{a}^T \bar{\mathbf{x}}$ with $\mathbf{a}^T = \boldsymbol\mu_0^T \Sigma^{-1}$ and $c = \boldsymbol\mu_0^T \Sigma^{-1} \boldsymbol\mu_0$, so
\[
E\left[ \hat{k} \right] = \frac{\mathbf{a}^T E[\bar{\mathbf{x}}]}{c} = \frac{k\,\mathbf{a}^T \boldsymbol\mu_0}{c} = \frac{k\,\boldsymbol\mu_0^T \Sigma^{-1} \boldsymbol\mu_0}{c},
\]
since $E[\bar{\mathbf{x}}] = k\boldsymbol\mu_0$. Hence
\[
E\left[ \hat{k} \right] = k, \tag{3.14}
\]
showing that $\hat{k}$ is an unbiased estimator.
Noting that $\mathrm{Var}[\bar{\mathbf{x}}] = \frac{1}{n}\Sigma$ and therefore that $\mathrm{Var}\left[ \mathbf{a}^T \bar{\mathbf{x}} \right] = \frac{1}{n}\,\mathbf{a}^T \Sigma \mathbf{a}$, we have
\[
\mathrm{Var}\left[ \hat{k} \right] = \frac{1}{n c^2}\,\mathbf{a}^T \Sigma \mathbf{a}
= \frac{1}{n}\,\frac{\boldsymbol\mu_0^T \Sigma^{-1} \boldsymbol\mu_0}{\left( \boldsymbol\mu_0^T \Sigma^{-1} \boldsymbol\mu_0 \right)^2}
= \frac{1}{n\,\boldsymbol\mu_0^T \Sigma^{-1} \boldsymbol\mu_0}. \tag{3.15}
\]
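A sketch of the estimator (3.13) together with a Monte Carlo check of (3.14) and (3.15); the values of $k$, $\boldsymbol\mu_0$, $\Sigma$ and $n$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

k_true = 2.0
mu0 = np.array([1.0, 1.0, 1.0])             # known direction (here the vector of ones)
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.2],
                  [0.0, 0.2, 1.5]])
n, reps = 30, 20_000

Sinv_mu0 = np.linalg.solve(Sigma, mu0)      # Sigma^{-1} mu0
denom = mu0 @ Sinv_mu0                      # mu0^T Sigma^{-1} mu0

k_hats = []
for _ in range(reps):
    X = rng.multivariate_normal(k_true * mu0, Sigma, size=n)
    xbar = X.mean(axis=0)
    k_hats.append((Sinv_mu0 @ xbar) / denom)        # (3.13)

k_hats = np.array(k_hats)
print(k_hats.mean())                        # close to k_true, as in (3.14)
print(k_hats.var(), 1.0 / (n * denom))      # close to 1/(n mu0^T Sigma^{-1} mu0), as in (3.15)
```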
3.4.2 Linear restriction on $\boldsymbol\mu$
We determine an estimator for $\boldsymbol\mu$ to satisfy a linear restriction
\[
A\boldsymbol\mu = \mathbf{b},
\]
where $A$ ($m \times p$) and $\mathbf{b}$ ($m \times 1$) are given constants and $\Sigma$ is assumed to be known.
We write the restriction in vector form $\mathbf{g}(\boldsymbol\mu) = 0$ and form the Lagrangean
\[
h(\boldsymbol\mu, \boldsymbol\lambda) = \ell(\boldsymbol\mu) + 2\boldsymbol\lambda^T \mathbf{g}(\boldsymbol\mu),
\]
where $\boldsymbol\lambda^T = (\lambda_1, \ldots, \lambda_m)$ is a vector of Lagrange multipliers (the factor 2 is inserted just for convenience). Then
\[
h(\boldsymbol\mu, \boldsymbol\lambda) = \ell(\boldsymbol\mu) + 2\boldsymbol\lambda^T (A\boldsymbol\mu - \mathbf{b})
= n\left[ (\bar{\mathbf{x}} - \boldsymbol\mu)^T \Sigma^{-1} (\bar{\mathbf{x}} - \boldsymbol\mu) + 2\boldsymbol\lambda^T (A\boldsymbol\mu - \mathbf{b}) \right],
\]
ignoring constant terms not involving $\boldsymbol\mu$.
Set $\dfrac{d}{d\boldsymbol\mu}\, h(\boldsymbol\mu, \boldsymbol\lambda) = 0$, using results from Example Sheet 2:
\[
-2\Sigma^{-1}(\bar{\mathbf{x}} - \boldsymbol\mu) + 2A^T \boldsymbol\lambda = 0
\quad\Rightarrow\quad
\bar{\mathbf{x}} - \boldsymbol\mu = \Sigma A^T \boldsymbol\lambda. \tag{3.16}
\]
We use the constraint $A\boldsymbol\mu = \mathbf{b}$ to evaluate the Lagrange multipliers $\boldsymbol\lambda$: premultiplying by $A$,
\[
A\bar{\mathbf{x}} - \mathbf{b} = A\Sigma A^T \boldsymbol\lambda
\quad\Rightarrow\quad
\boldsymbol\lambda = \left( A\Sigma A^T \right)^{-1} \left( A\bar{\mathbf{x}} - \mathbf{b} \right).
\]
Substituting into (3.16),
\[
\hat{\boldsymbol\mu} = \bar{\mathbf{x}} - \Sigma A^T \left( A\Sigma A^T \right)^{-1} \left( A\bar{\mathbf{x}} - \mathbf{b} \right). \tag{3.17}
\]
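A sketch of (3.17): the restricted estimate satisfies $A\hat{\boldsymbol\mu} = \mathbf{b}$ exactly. The restriction matrix $A$, the vector $\mathbf{b}$, $\Sigma$ and the simulated data below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative setup: p = 3 with the single restriction mu1 + mu2 + mu3 = 0 (m = 1)
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([0.0])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.5, 0.3],
                  [0.0, 0.3, 2.0]])

X = rng.multivariate_normal([0.5, -0.2, -0.3], Sigma, size=40)
xbar = X.mean(axis=0)

# (3.17): mu_hat = xbar - Sigma A^T (A Sigma A^T)^{-1} (A xbar - b)
SAt = Sigma @ A.T
mu_hat = xbar - SAt @ np.linalg.solve(A @ SAt, A @ xbar - b)

print(mu_hat)
print(A @ mu_hat)     # equals b (up to rounding), so the restriction holds
```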
3.4.3 Covariance matrix proportional to a given matrix
We consider estimating $k$ when $\Sigma = k\Sigma_0$, where $\Sigma_0$ is a given constant matrix. The likelihood (3.8) takes the form, when $\mathbf{d} = 0$ (i.e. $\hat{\boldsymbol\mu} = \bar{\mathbf{x}}$),
\[
\ell(k) = n\left[ \log|k\Sigma_0| + \mathrm{tr}\left( \tfrac{1}{k}\,\Sigma_0^{-1} S \right) \right]
\]
plus constant terms (not involving $k$). Hence
\[
\ell(k) = n\left[ p\log k + \tfrac{1}{k}\,\mathrm{tr}\left( \Sigma_0^{-1} S \right) \right] + \text{constant terms},
\]
and setting $\dfrac{d\ell}{dk} = 0$ gives
\[
\frac{p}{k} - \frac{1}{k^2}\,\mathrm{tr}\left( \Sigma_0^{-1} S \right) = 0.
\]
Hence
\[
\hat{k} = \frac{\mathrm{tr}\left( \Sigma_0^{-1} S \right)}{p}. \tag{3.18}
\]
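A sketch of (3.18), again with simulated data and an arbitrary illustrative $\Sigma_0$ and true $k$; the estimate $\hat{k} = \mathrm{tr}(\Sigma_0^{-1}S)/p$ should be close to the true proportionality constant.

```python
import numpy as np

rng = np.random.default_rng(7)

Sigma0 = np.array([[1.0, 0.5],
                   [0.5, 2.0]])             # given matrix Sigma_0
k_true = 3.0
n = 2_000

X = rng.multivariate_normal(np.zeros(2), k_true * Sigma0, size=n)
xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / n           # MLE covariance, divisor n
p = S.shape[0]

k_hat = np.trace(np.linalg.solve(Sigma0, S)) / p      # (3.18)
print(k_hat)                                 # close to k_true = 3.0
```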