
STAT446/614

Multivariate Methods/Analysis
Lecture Notes

Lecturer
Samuel Iddi (PhD)

Department of Statistics
University of Ghana
isamuel@ug.edu.gh

April 16, 2015



Course Information

Course Objective:
 To understand problems associated with multi-dimensional data.
 To study basic multivariate distribution theory and methods.
 To discuss some fundamental and important multivariate statistical
techniques.
 To understand inference and how to apply it to real-life problems
arising from various scientific fields.
 To learn methods for data reduction.
 To implement and apply the techniques in R.



Course Information

Textbook:
1 Johnson, R. A. and Wichern, D. W. (2007). Applied Multivariate
Statistical Analysis. 6th Ed. Prentice-Hall.
Reference:
1 Muirhead R. J. (2005). Aspects of Multivariate Statistical Theory. John
Wiley and Sons.
2 Hardle, W. K. and Simar, L. (2012). Applied Multivariate Statistical
Analysis. 3rd Ed. Springer.
3 Everitt, B. and Hothorn, T. (2011). An Introduction to Applied
Multivariate Analysis with R. Use R! Series. Springer.



Course Information

Office hour:
Thursday: 10:30am to 2:00pm or by appointment.
Phone Number: 0506783155, Office: 208
E-mail: isamuel@ug.edu.gh; isamuel.gh@gmail.com (sending an email is the
easiest way to contact me)
Personal Website: www.samueliddi.com

Teaching Assistant
TA: Enouch Sakyi-Yeboah
Tutorial: Thursday, 9:30am to 10:30am
Phone number: 0274873770
E-mail: enochsakyi10@gmail.com



Course Information

Grading
Homework Assignments (20%), Interim Assessment (30%) and Final Exams
(50%).
Guidelines
 Homework should be submitted within one week of the day it is assigned.
 R code should also be submitted via email.
 Late submissions will not be accepted.
 Duplicate solutions will not be graded.
 The TA will discuss solutions during tutorial hours.
 The interim assessment could take the form of a project, presentation and
defense.
Computing
The main software package for this course is R version 2.3.1 (or
higher). Install R by visiting the website: www.r-project.org. You may also
use RStudio, but it works only after you have installed R first.



Introduction

Introduction

 This course concerns the use of statistical methods for describing and
analyzing multivariate data.
 Research in the biological, physical and social sciences frequently involves
the collection of measurements on several variables.
 We will review some concepts of matrices and matrix manipulation
(matrix algebra) relevant for multivariate analysis.
 Also, we will acquire the knowledge to properly interpret models, select
appropriate techniques and understand their strengths and weaknesses.
 The R software for complex statistical analysis will be used.



Introduction Levels of Complexity

Levels of Complexity

 Simplest statistical analysis - a single outcome variable on a sample from a
homogeneous population.
◦ Standard procedures: location parameters (mean and median),
dispersion parameters (standard deviations or interquartile ranges)
◦ Example: the height of students in the class.
 First level of complexity - samples from two larger populations
◦ Example - the heights of male and female students in the class.
◦ Scientific question - we may be interested to know whether the mean
responses from the two populations differ
◦ The outcome variable is called the dependent variable
◦ The predictor variable is often called a covariate or independent variable
◦ Statistical tools - ANOVA, t-test, Wilcoxon test.



Introduction Levels of Complexity

Level of Complexity

 Next level - more than two levels of the predictor.
 This leads to a family of models called regression models
 The choice of the statistical analysis is driven by the outcome or
dependent variable rather than by the predictor variables
◦ When the outcome is continuous (eg. weight), we use linear
regression.
◦ Binary dependent variable - logistic regression
◦ For count outcomes - Poisson regression
◦ Predictor variables can be continuous, binary, categorical or
discrete.
 More than one predictor variable recorded
◦ Example: when the effect of gender and religion on the height of
students in the class is of interest.
◦ Two-way or multi-way ANOVA
◦ Multiple linear regression - extension of one-way ANOVA and simple
regression.
Introduction Levels of Complexity

Level of Complexity

 Finally, several dependent variables may be recorded and studied
simultaneously.
 Example - the weight and height simultaneously recorded for a group of
male and female students.
 Arguably, sex will influence height as well as weight. The two are likely
to be correlated or associated.
 Multivariate analysis is required for simultaneous measurements on
many dependent variables.
 Note that
◦ Multiple refers to several independent variables.
◦ Multivariate refers to several dependent variables.



Introduction Multivariate Analysis

Multivariate Analysis

 This refers to a set of techniques which allow the presence of more than
one outcome variable.

 The challenges in learning from large data sets have led to the development
and evolution of this field of statistical science.
 The job of the statistician is to extract useful information by identifying
◦ important patterns
◦ rules and trends in multivariate data
◦ relationships among various features
◦ classifications of multidimensional patterns.



Introduction Multivariate Analysis

Objectives of Multivariate Methods

 The objectives of scientific investigations to which multivariate methods


lend themselves include the following:
◦ Hypothesis construction and testing (Multivariate Analysis of
Variance (MANOVA), Profile Analysis (PA))
◦ Prediction (Multivariate Linear Regression)
◦ Data reduction or structural simplification (Principal Component
Analysis (PCA), Factor Analysis (FA))
◦ Investigation of the dependence among variables (Canonical
Correlation, CC)
◦ Sorting or grouping (Classification and Discriminant Analysis
(DA))



Aspects of Multivariate Analysis Organization of Data

Data Setup

Let i = 1, . . . , n index subjects, individuals, items, experimental units, etc., and
j = 1, 2, . . . , p index the variables or characteristics measured for each individual.
The jth measurement for the ith subject will be denoted by xij . For n
measurements on p variables, the data are set up as follows:

          Variable 1   Variable 2   . . .   Variable j   . . .   Variable p
Item 1       x11          x12       . . .      x1j       . . .      x1p
Item 2       x21          x22       . . .      x2j       . . .      x2p
  ..          ..           ..                   ..                   ..
Item i       xi1          xi2       . . .      xij       . . .      xip
  ..          ..           ..                   ..                   ..
Item n       xn1          xn2       . . .      xnj       . . .      xnp

Table: Data Setup



Aspects of Multivariate Analysis Organization of Data

The data can be grouped into a matrix form as:

X = (x_{ij})_{i,j} = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix}

Example 2.1
A selection of four receipts from the university bookstore was obtained in
order to investigate the nature of book sales. Each receipt provided, among
other things, the number of books sold and the total amount of each sale. Let
the first variable be the total Cedi sales and the second variable be the
number of books sold. Then we can regard the corresponding numbers on the
receipts as four measurements on two variables.



Aspects of Multivariate Analysis Organization of Data

Example 2.1 (cont.)


Suppose the data, in tabular form, are

Variable 1 (Cedi sales): 42 52 48 58


Variable 2 (number of books): 4 5 4 3

Solution 2.1
Using the notation just introduced, we have

x11 = 42, x21 = 52, x31 = 48, x41 = 58


x12 = 4, x22 = 5, x32 = 4, x42 = 3
and the data array X is

X = \begin{pmatrix} 42 & 4 \\ 52 & 5 \\ 48 & 4 \\ 58 & 3 \end{pmatrix}



Aspects of Multivariate Analysis Organization of Data

Descriptive statistics

 Sample mean
\bar{x}_j = \frac{1}{n}\sum_{i=1}^n x_{ij}, \quad j = 1, \dots, p.

Mean vector \bar{x} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_p)' = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}
 Sample variance
s_j^2 = \frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2.

A very common definition is
s_j^2 = \frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2.



Aspects of Multivariate Analysis Organization of Data

 When the sample variances are put along the main diagonal of a matrix, they
will be denoted by s_{jj} = s_j^2. Standard deviation: s_j = \sqrt{s_{jj}}.
 Linear association
◦ Sample covariance:
s_{jk} = \frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)

We denote the p × p symmetric covariance matrix by S_n = (s_{jk})_{j,k} and
the matrix of sums of squares and cross products by W = nS_n = (n-1)S_{n-1}.
◦ Sample correlation: (Pearson’s product moment correlation)
r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj} s_{kk}}} = \frac{\sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2 \sum_{i=1}^n (x_{ik} - \bar{x}_k)^2}}



Aspects of Multivariate Analysis Organization of Data

Remarks on correlation:
 unitless
 covariance of standardized variables
 range −1 ≤ rjk ≤ 1
 rjk = 0 implies no linear association
 rjk > 0 implies the two variables tend to deviate from their
respective means in the same direction.
 The correlation is invariant under the following transformation

yij = axij + b, i = 1, 2, . . . , n
yik = cxik + d, i = 1, 2, . . . , n

provided a and c have the same sign (ac > 0). What happens if ac < 0?



Aspects of Multivariate Analysis Organization of Data

 Do not overestimate the importance of linear association measures.


◦ They do not capture other kinds of association (eg. quadratic
relationship between variables)
◦ Correlation coefficient is very sensitive to outliers.
Matrix representation:
 Sample mean: \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_p)'
 Sample covariance matrix: S_n = (s_{jk}) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})', i.e.
S_n = \begin{pmatrix} s_{11} & s_{12} & \dots & s_{1p} \\ s_{21} & s_{22} & \dots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \dots & s_{pp} \end{pmatrix}
 Sample correlation matrix: R = (r_{jk}) with r_{jj} = 1, i.e.
R = \begin{pmatrix} 1 & r_{12} & \dots & r_{1p} \\ r_{21} & 1 & \dots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \dots & 1 \end{pmatrix}
Aspects of Multivariate Analysis Organization of Data

Example 2.2
Find the arrays x̄, Sn , R from the example data above.

Solution 2.2

\bar{x}_1 = \frac{1}{4}\sum_{i=1}^4 x_{i1} = \frac{1}{4}(42 + 52 + 48 + 58) = 50
\bar{x}_2 = \frac{1}{4}\sum_{i=1}^4 x_{i2} = \frac{1}{4}(4 + 5 + 4 + 3) = 4
\bar{x} = \begin{pmatrix} 50 \\ 4 \end{pmatrix}



Aspects of Multivariate Analysis Organization of Data

The sample variance and covariances


Solution 2.2 (cont.)

s_{11} = \frac{1}{4}\sum_{i=1}^4 (x_{i1} - \bar{x}_1)^2 = \frac{1}{4}\left[(42 - 50)^2 + (52 - 50)^2 + (48 - 50)^2 + (58 - 50)^2\right] = 34
s_{22} = 0.5
s_{12} = s_{21} = \frac{1}{4}\sum_{i=1}^4 (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2) = -1.5

and
S_n = \begin{pmatrix} 34 & -1.5 \\ -1.5 & 0.5 \end{pmatrix}



Aspects of Multivariate Analysis Organization of Data

The sample variance and covariances


Solution 2.2 (cont.)

r_{12} = r_{21} = \frac{s_{12}}{\sqrt{s_{11} s_{22}}} = \frac{-1.5}{\sqrt{34 \times 0.5}} = -0.36
so
R = \begin{pmatrix} 1 & -0.36 \\ -0.36 & 1 \end{pmatrix}
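As a quick check, the same quantities can be computed in R (a minimal sketch using the bookstore data from Example 2.1; the divisor n is used so the result matches Sn above):

X <- matrix(c(42, 4,
              52, 5,
              48, 4,
              58, 3), ncol = 2, byrow = TRUE)   # bookstore data: Cedi sales, books sold
n <- nrow(X)
xbar <- colMeans(X)                # mean vector (50, 4)
Sn   <- cov(X) * (n - 1) / n       # divisor n, as in the notes: 34, -1.5, 0.5
R    <- cor(X)                     # correlation matrix, r12 = -0.36
xbar; Sn; R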



Aspects of Multivariate Analysis Graphical Techniques

Visualization of Multidimensional data

 Graphical techniques help to visualize data and aid in data analysis.

 They increase understanding, suggest explanations or at the very least


suggest where to start looking for explanations.

 It is relatively easy to visualize 1, 2 and 3 dimensional data using


sophisticated computer programs and display equipment.

 1 dimension - dot diagrams; 2 dimensions - scatter plots; 3 dimensions -


3d plots.

 Often information from pairs of variables is valuable (eg. scatter plot


matrix).



Aspects of Multivariate Analysis Graphical Techniques

Two types of multivariate plots

 For complicated multivariate data, visualization is impossible without


some prior processing.

 Dimensionality reduction up to which visualization is possible (eg.


PCA).

 Variable space: n points in a p−dimensional space - each individual is


represented through p coordinates, its outcome on each of the p variables.

 Data space: p points in an n−dimensional space - each individual


represents a dimension in the space.

 see JW for examples.



Aspects of Multivariate Analysis Distance

Euclidean Distance

 Most multivariate techniques are based upon the simple concept of


distance.

 We should already be familiar with Straight-line or Euclidean distance

 Consider the point P = (x1 , x2 ) in the plane. The straight-line distance
d(O, P) from P to the origin O = (0, 0) is, according to the Pythagorean
theorem,
d(O, P) = \sqrt{x_1^2 + x_2^2}

 In general, for P = (x1 , x2 , . . . , xp ), the straight-line distance from P to
the origin O = (0, 0, . . . , 0) is
d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}



Aspects of Multivariate Analysis Distance

Euclidean Distance

 All points (x1 , x2 , . . . , xp ) that lie at a constant squared distance, such as c2 ,
from the origin satisfy the equation
d^2(O, P) = x_1^2 + x_2^2 + \cdots + x_p^2 = c^2 \quad \text{(hypersphere)}

 If p = 2, the equation is a circle.
 Points equidistant from the origin lie on a hypersphere.
 The straight-line distance from P = (x1 , x2 , . . . , xp ) to
Q = (y1 , y2 , . . . , yp ) is given by
d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}.



Aspects of Multivariate Analysis Distance

Statistical Distance

 Statistical distance is fundamental to multivariate analysis.

 Account for differences in variation.

 When coordinates represent measurements that are subject to random


fluctuations of differing magnitude.

 Depends on sample variances and covariances.

 For P = (x1 , x2 ) and O = (0, 0), the statistical distance is computed from the
standardized coordinates x_1^* = x_1/\sqrt{s_{11}}, x_2^* = x_2/\sqrt{s_{22}} as
d(O, P) = \sqrt{\left(\frac{x_1}{\sqrt{s_{11}}}\right)^2 + \left(\frac{x_2}{\sqrt{s_{22}}}\right)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}}

 If s11 = s22 , it is convenient to ignore the common divisor and use the
Euclidean distance.
Aspects of Multivariate Analysis Distance

Statistical Distance

 Ellipse of constant statistical distance
d^2(O, P) = \frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2
 All points with coordinates (x1 , x2 ) at constant squared distance c2 from the
origin must satisfy \frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2.



Aspects of Multivariate Analysis Distance

Generalized statistical distance

 Generalized statistical distance: From an arbitrary point P = (x1 , x2 ) to
any fixed point Q = (y1 , y2 ), the distance is given by
d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}}}
where s11 and s22 are the sample variances constructed from the measurements on x1 and x2
respectively.

 Extension to more than two dimensions: From P = (x1 , x2 , . . . , xp ) to
Q = (y1 , y2 , . . . , yp ) we have
d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}}
where s11 , s22 , . . . , spp are the sample variances constructed from n
measurements on x1 , x2 , . . . , xp respectively.
Aspects of Multivariate Analysis Distance

Alternative formulation

 Alternatively, let P = (x11 , x12 , . . . , x1p ) and Q = (x21 , x22 , . . . , x2p ); the
Euclidean distance is given by
d(P, Q) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + \cdots + (x_{1p} - x_{2p})^2}
        = \sqrt{\sum_{j=1}^p (x_{1j} - x_{2j})^2} = \sqrt{(x_1 - x_2)'(x_1 - x_2)}
        = \|x_1 - x_2\|_2

 Drawback
◦ All components contribute equally, although there may be random
fluctuations of a different magnitude in the components.
◦ Different pairs of components may be correlated differently.



Aspects of Multivariate Analysis Distance

Alternative formulation

 Statistical distance → take the variances into account
d(P, Q) = \sqrt{\sum_{j=1}^p \frac{(x_{1j} - x_{2j})^2}{s_{jj}}}

 Next we take correlation into account. The covariance matrix Sn will
play a key role.

 Transform the variables xi to uncorrelated ones, say yi :
y_i = C x_i \;\Rightarrow\; x_i = C^{-1} y_i
 The requirement is satisfied if
S_y = I = \begin{pmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{pmatrix} = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})'
Aspects of Multivariate Analysis Distance

Alternative formulation

S_y = \frac{1}{n}\sum_{i=1}^n C(x_i - \bar{x})(x_i - \bar{x})' C' = C S_x C'
which is satisfied by S = S_x = (C'C)^{-1}.
 Measuring the distance between P and Q can then be performed as a Euclidean
distance in the coordinate system of the yi , since they represent uncorrelated
variables with unit variance:
d^2(P, Q) = (y_1 - y_2)'(y_1 - y_2)
          = [C(x_1 - x_2)]'[C(x_1 - x_2)]
          = (x_1 - x_2)' C'C (x_1 - x_2)
          = (x_1 - x_2)' S_x^{-1} (x_1 - x_2)

 The last expression is the Mahalanobis distance.
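A small R sketch of this computation (the two points and the covariance matrix below are illustrative, not taken from the notes; stats::mahalanobis() returns the squared distance):

x1 <- c(1.0, 2.0)
x2 <- c(3.0, 1.0)
Sx <- matrix(c(4, 1,
               1, 9), nrow = 2)
euclid <- sqrt(sum((x1 - x2)^2))                              # straight-line distance
mahal  <- sqrt(drop(t(x1 - x2) %*% solve(Sx) %*% (x1 - x2)))  # sqrt of (x1-x2)' Sx^{-1} (x1-x2)
mahal2 <- mahalanobis(x1, center = x2, cov = Sx)              # squared Mahalanobis distance
c(euclid, mahal, mahal2)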


Matrix Algebra Some useful results of matrix theory

Some useful results of matrix theory

Definition 3.1
Let A be a square matrix of dimension k × k and let λ be an eigenvalue of A.
If x is a k × 1 nonzero vector (x ≠ 0) such that

Ax = λx

then x is said to be an eigenvector (characteristic vector) of the matrix A
associated with the eigenvalue λ.

Example 3.1
Let A = \begin{pmatrix} 1 & 0 \\ 1 & 3 \end{pmatrix}; find the eigenvalues and associated eigenvectors of A.



Matrix Algebra Some useful results of matrix theory

Some useful results of matrix theory

Solution 3.1
|A - \lambda I| = \begin{vmatrix} 1-\lambda & 0 \\ 1 & 3-\lambda \end{vmatrix} = (1-\lambda)(3-\lambda) = 0

implies the two roots are λ1 = 1 and λ2 = 3. The eigenvectors associated
with these eigenvalues are obtained by solving the following equations:
\begin{pmatrix} 1 & 0 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 1\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \qquad (Ax = \lambda_1 x)
\begin{pmatrix} 1 & 0 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 3\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \qquad (Ax = \lambda_2 x)
Matrix Algebra Some useful results of matrix theory

Some useful results of matrix theory

Solution 3.1 (cont.)


From the first expression (Ax = 1·x), the second row gives x1 + 3x2 = x2 , so
x1 = −2x2 .
There are many solutions for x1 and x2 . Setting x2 = 1 gives x1 = −2, and
hence x = \begin{pmatrix} -2 \\ 1 \end{pmatrix} is an eigenvector corresponding to eigenvalue 1. From the
second expression (Ax = 3·x),

x1 = 3x1
x1 + 3x2 = 3x2

implies that x1 = 0 and x2 = 1 (arbitrarily), and hence x = \begin{pmatrix} 0 \\ 1 \end{pmatrix} is the
eigenvector corresponding to the eigenvalue 3.



Matrix Algebra Some useful results of matrix theory

Some useful results of matrix theory

Solution 3.1 (cont.)


It is normal practice to determine an eigenvector so that it has unit length. We
take e = x/\sqrt{x'x} as the eigenvector corresponding to λ. For example, the
normalized eigenvector for λ1 = 1 is e' = (-2/\sqrt{5},\, 1/\sqrt{5}).
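The same calculation can be checked with R's eigen() (a quick sketch; eigen() already returns unit-length eigenvectors, possibly with the opposite sign):

A <- matrix(c(1, 0,
              1, 3), nrow = 2, byrow = TRUE)
e <- eigen(A)
e$values    # 3 and 1 (listed largest first)
e$vectors   # columns: (0, 1)' for lambda = 3 and (-2, 1)'/sqrt(5) for lambda = 1, up to sign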

Definition 3.2
A quadratic form Q(x) in the k variables x1 , x2 , . . . , xk is Q(x) = x0 Ax, where
x0 = (x1 , x2 , . . . , xk ) and A is a k × k symmetric matrix.

Note that a quadratic form can be written as Q(x) = \sum_{i=1}^k \sum_{j=1}^k a_{ij} x_i x_j. For
example,
Q(x) = (x_1, x_2)\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1^2 + 2x_1 x_2 + x_2^2.



Matrix Algebra Some useful results of matrix theory

Some useful results of matrix theory

Result 3.1 (Spectral Decomposition.)

Let A be a k × k symmetric matrix. Then A can be expressed in terms of its k
eigenvalue-eigenvector pairs (λi , ei ) as
A = \sum_{i=1}^k \lambda_i e_i e_i' = P \Lambda P'

where Λ = diag(λ1 , . . . , λk ) and P = (e1 , e2 , . . . , ek ) is the orthogonal matrix with the normalized
eigenvectors as its columns. If A is nonsingular,
A^{-1} = P \Lambda^{-1} P' = \sum_{i=1}^k \frac{1}{\lambda_i} e_i e_i'



Matrix Algebra Some useful results of matrix theory

Some useful results of matrix theory

Result 3.2 (Square-root matrix)

The square-root matrix of a positive definite matrix A,
A^{1/2} = \sum_{i=1}^k \sqrt{\lambda_i}\, e_i e_i' = P \Lambda^{1/2} P'

has the following properties:

a. (A^{1/2})' = A^{1/2} (that is, A^{1/2} is symmetric)
b. A^{1/2} A^{1/2} = A
c. (A^{1/2})^{-1} = \sum_{i=1}^k \frac{1}{\sqrt{\lambda_i}} e_i e_i' = P \Lambda^{-1/2} P', where Λ^{-1/2} is a diagonal matrix
with 1/\sqrt{\lambda_i} as the ith diagonal element.
d. A^{1/2} A^{-1/2} = A^{-1/2} A^{1/2} = I, and A^{-1/2} A^{-1/2} = A^{-1}, where
A^{-1/2} = (A^{1/2})^{-1}



Matrix Algebra Some useful results of matrix theory

Some useful results of matrix theory

Result 3.3 (Singular-Value Decomposition.)

Let A be an m × k matrix of real numbers. Then there exist an
m × m orthogonal matrix U and a k × k orthogonal matrix V such that

A = U Λ V'

where the m × k matrix Λ has (i, i) entry λi ≥ 0 for i = 1, 2, . . . , min(m, k)
and the other entries are zero. The positive constants λi are called the
singular values of A.

U has the m orthonormal eigenvectors of AA' as its columns and V has the k
orthonormal eigenvectors of A'A as its columns.
For every square matrix A with eigenvalues λ1 , λ2 , . . . , λk , we have that
\mathrm{tr}(A) = \sum_{i=1}^k \lambda_i, \qquad |A| = \prod_{i=1}^k \lambda_i
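In R, eigen() and svd() implement these decompositions; a small sketch with an illustrative symmetric matrix:

A <- matrix(c(4, 1,
              1, 3), nrow = 2)
sp <- eigen(A)                                    # spectral decomposition A = P Lambda P'
P <- sp$vectors; Lambda <- diag(sp$values)
max(abs(A - P %*% Lambda %*% t(P)))               # ~ 0: A is reconstructed
c(sum(sp$values), sum(diag(A)))                   # tr(A) equals the sum of the eigenvalues
c(prod(sp$values), det(A))                        # |A| equals the product of the eigenvalues
sv <- svd(A)                                      # singular-value decomposition A = U D V'
max(abs(A - sv$u %*% diag(sv$d) %*% t(sv$v)))     # ~ 0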



Random Vectors and Matrices Random Vectors and Matrices

Random vectors and matrices

 When several random variables are considered simultaneously, it is


convenient to group them into vectors.
 If we observe the same random vectors for n individuals, we can group
the result in an n × p matrix.
 A random vector is a vector whose elements are random variables.
 Similarly, random matrix is a matrix whose elements are random
variables.
 For an n × p random matrix X, the expected value of X is the n × p matrix
of numbers (if they exist) given by
E(X) = \begin{pmatrix} E(X_{11}) & E(X_{12}) & \dots & E(X_{1p}) \\ E(X_{21}) & E(X_{22}) & \dots & E(X_{2p}) \\ \vdots & \vdots & & \vdots \\ E(X_{n1}) & E(X_{n2}) & \dots & E(X_{np}) \end{pmatrix}
In particular, when the ith row of X is the random vector X_i', the ith row of E(X) is E(X_i')'.



Random Vectors and Matrices Random Vectors and Matrices

Random vectors and matrices

where
E(X_{ij}) = \begin{cases} \int_{-\infty}^{\infty} x_{ij}\, f_{ij}(x_{ij})\, dx_{ij} & \text{if } X_{ij} \text{ is a continuous random variable on } \mathbb{R} \\ \sum_{x_{ij}} x_{ij}\, p_{ij}(x_{ij}) & \text{if } X_{ij} \text{ is a discrete random variable} \end{cases}

with f the probability density function (pdf) and p the probability mass
function.
Example 3.2
Consider the random vector X' = (X1 , X2 ). Let the discrete random variable
X1 have the following probability function:

x1 -1 0 1
p1 (x1 ) 0.3 0.3 0.4



Random Vectors and Matrices Random Vectors and Matrices

Random vectors and matrices

Example 3.2 (cont.)


and the discrete random variable X2 have the probability function

x2 0 1
p2 (x2 ) 0.8 0.2

Find E(X)

Solution 3.2
E(X) = \begin{pmatrix} E(X_1) \\ E(X_2) \end{pmatrix} = \begin{pmatrix} \sum_{x_1} x_1 p_1(x_1) \\ \sum_{x_2} x_2 p_2(x_2) \end{pmatrix} = \begin{pmatrix} -1 \times 0.3 + 0 \times 0.3 + 1 \times 0.4 \\ 0 \times 0.8 + 1 \times 0.2 \end{pmatrix}

Thus, E(X) = \begin{pmatrix} 0.1 \\ 0.2 \end{pmatrix}
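The same expectations in R (a one-line check of Solution 3.2):

x1 <- c(-1, 0, 1); p1 <- c(0.3, 0.3, 0.4)
x2 <- c(0, 1);     p2 <- c(0.8, 0.2)
c(EX1 = sum(x1 * p1), EX2 = sum(x2 * p2))   # 0.1 and 0.2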



Random Vectors and Matrices Random Vectors and Matrices

Notation for Random Vectors

Immediate results:

E(X + Y) = E(X) + E(Y)


E(AXB) = AE(X)B

Let X be a random vector, X = (X_j)_j .

Definition 3.3
 The (marginal) means of X are defined by µj = E(Xj )
 The population mean vector is given by
\mu = E(X) = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix}



Random Vectors and Matrices Random Vectors and Matrices

Covariance Matrices

Definition 3.3 (cont.)


 The (marginal) (co)variances of X are defined by
\sigma_{jk} = E[(X_j - \mu_j)(X_k - \mu_k)] = \begin{cases} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x_j - \mu_j)(x_k - \mu_k) f_{jk}(x_j, x_k)\, dx_j\, dx_k \\ \sum_{x_j}\sum_{x_k} (x_j - \mu_j)(x_k - \mu_k)\, p_{jk}(x_j, x_k) \end{cases}

where fjk (xj , xk ) and pjk (xj , xk ) are the joint density function and joint
probability mass function of (xj , xk ) respectively.
 The population variance-covariance matrix is defined as
\Sigma = \mathrm{Cov}(X) = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \dots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \dots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \dots & \sigma_{pp} \end{pmatrix} = E(X - \mu)(X - \mu)'.

Clearly, Σ is a symmetric matrix.


Random Vectors and Matrices Random Vectors and Matrices

Covariance and Correlation Matrices

Definition 3.3 (cont.)


 The population correlations are constructed as follows:
\rho_{jk} = \frac{\sigma_{jk}}{\sqrt{\sigma_{jj}}\sqrt{\sigma_{kk}}} = \sigma_{jj}^{-1/2}\sigma_{jk}\sigma_{kk}^{-1/2},

\rho = \left(\frac{\sigma_{jk}}{\sqrt{\sigma_{jj}}\sqrt{\sigma_{kk}}}\right)_{j,k} = \begin{pmatrix} 1 & \rho_{12} & \dots & \rho_{1p} \\ \rho_{12} & 1 & \dots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{1p} & \rho_{2p} & \dots & 1 \end{pmatrix}

which yields Σ = V^{1/2} ρ V^{1/2}, with V = diag(σ11 , σ22 , . . . , σpp ), where ρ is
the population correlation matrix, V is the diagonal matrix of variances, and
V^{1/2} is the diagonal matrix of standard deviations. Σ can be obtained from
V^{1/2} and ρ, and ρ can be obtained from Σ.
Random Vectors and Matrices Random Vectors and Matrices

Covariance and Correlation Matrices

Example 3.3
Consider the previous example. Let the joint probability of X1 and X2 be
p12 (x1 , x2 ) represented by the entries in the following table.

x2
x1 0 1 p1 (x1 )
-1 0.24 0.06 0.3
0 0.16 0.14 0.3
1 0.40 0.00 0.4
p2 (x2 ) 0.8 0.2 1

Find the covariance matrix for the two random variables X1 and X2 .

Solution 3.3
We have already shown that µ1 = E(X1 ) = 0.1 and µ2 = E(X2 ) = 0.2.
Random Vectors and Matrices Random Vectors and Matrices

Covariance and Correlation Matrices

Solution 3.3 (cont.)

We have already shown that µ1 = E(X1 ) = 0.1 and µ2 = E(X2 ) = 0.2.
\sigma_{11} = E(X_1 - \mu_1)^2 = \sum_{x_1} (x_1 - 0.1)^2 p_1(x_1)
           = (-1 - 0.1)^2(0.3) + (0 - 0.1)^2(0.3) + (1 - 0.1)^2(0.4) = 0.69
\sigma_{22} = E(X_2 - \mu_2)^2 = \sum_{x_2} (x_2 - 0.2)^2 p_2(x_2)
           = (0 - 0.2)^2(0.8) + (1 - 0.2)^2(0.2) = 0.16
\sigma_{12} = E(X_1 - \mu_1)(X_2 - \mu_2) = \sum_{(x_1, x_2)} (x_1 - 0.1)(x_2 - 0.2)\, p_{12}(x_1, x_2)
           = (-1 - 0.1)(0 - 0.2)(0.24) + \cdots + (1 - 0.1)(1 - 0.2)(0.0)
           = -0.08 = \sigma_{21}



Random Vectors and Matrices Random Vectors and Matrices

Covariance and Correlation Matrices

Solution 3.3 (cont.)

\Rightarrow \mu = E(X) = \begin{pmatrix} E(X_1) \\ E(X_2) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} 0.1 \\ 0.2 \end{pmatrix} and
\Sigma = E(X - \mu)(X - \mu)' = E\begin{pmatrix} (X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) \\ (X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 \end{pmatrix}.
Thus,
\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} 0.69 & -0.08 \\ -0.08 & 0.16 \end{pmatrix}

Example 3.4
Suppose \Sigma = \begin{pmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix}. Obtain V^{1/2} and ρ.



Random Vectors and Matrices Random Vectors and Matrices

Covariance and Correlation Matrices

Solution 3.4
V^{1/2} = \begin{pmatrix} \sqrt{\sigma_{11}} & 0 & 0 \\ 0 & \sqrt{\sigma_{22}} & 0 \\ 0 & 0 & \sqrt{\sigma_{33}} \end{pmatrix} = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix}
and (V^{1/2})^{-1} = \begin{pmatrix} 1/2 & 0 & 0 \\ 0 & 1/3 & 0 \\ 0 & 0 & 1/5 \end{pmatrix}. Thus, the correlation matrix ρ is given by
\rho = (V^{1/2})^{-1}\,\Sigma\,(V^{1/2})^{-1} = \begin{pmatrix} 1/2 & 0 & 0 \\ 0 & 1/3 & 0 \\ 0 & 0 & 1/5 \end{pmatrix}\begin{pmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{pmatrix}\begin{pmatrix} 1/2 & 0 & 0 \\ 0 & 1/3 & 0 \\ 0 & 0 & 1/5 \end{pmatrix}
     = \begin{pmatrix} 1 & 1/6 & 1/5 \\ 1/6 & 1 & -1/5 \\ 1/5 & -1/5 & 1 \end{pmatrix}
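The same conversion in R, either directly or with the built-in helper cov2cor():

Sigma <- matrix(c(4, 1, 2,
                  1, 9, -3,
                  2, -3, 25), nrow = 3)
V_half <- diag(sqrt(diag(Sigma)))           # diag(2, 3, 5)
solve(V_half) %*% Sigma %*% solve(V_half)   # correlation matrix rho
cov2cor(Sigma)                              # same result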
Random Vectors and Matrices Partitioning Random Variables

Partitioning Random Variables

 Suppose in a given study, 4 medical measurements are recorded,
X1 , X2 , X3 , X4 , together with 6 socio-economic variables
X5 , X6 , X7 , X8 , X9 , X10 .
 It is then natural to partition X into two parts, say
X = \begin{pmatrix} X_1 \\ \vdots \\ X_4 \\ -- \\ X_5 \\ \vdots \\ X_{10} \end{pmatrix}



Random Vectors and Matrices Partitioning Random Variables

Partitioning Random Variables

In general, we will assume the p-dimensional random vector X is partitioned into
two subvectors of dimension q and p − q respectively,
X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} = \begin{pmatrix} X_1 \\ \vdots \\ X_q \\ -- \\ X_{q+1} \\ \vdots \\ X_p \end{pmatrix}.



Random Vectors and Matrices Partitioning Random Variables

Partitioning Random Variables

 The mean vector and the covariance matrix can be partitioned
accordingly as
\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_q \\ -- \\ \mu_{q+1} \\ \vdots \\ \mu_p \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
 The meaning of the component matrices of Σ is
\Sigma_{11} = \mathrm{Cov}(X^{(1)}), \quad \Sigma_{22} = \mathrm{Cov}(X^{(2)}),
\Sigma_{12} = \mathrm{Cov}(X^{(1)}, X^{(2)}), \quad \Sigma_{21} = \Sigma_{12}'.

Random Vectors and Matrices Partitioning Random Variables

Partitioning Random Variables

 Thus, we have
\Sigma = E(X - \mu)(X - \mu)'
       = \begin{pmatrix} E(X^{(1)} - \mu^{(1)})(X^{(1)} - \mu^{(1)})' & E(X^{(1)} - \mu^{(1)})(X^{(2)} - \mu^{(2)})' \\ E(X^{(2)} - \mu^{(2)})(X^{(1)} - \mu^{(1)})' & E(X^{(2)} - \mu^{(2)})(X^{(2)} - \mu^{(2)})' \end{pmatrix}

 These results also hold if the population quantities are replaced by their
appropriate sample counterparts.



Random Vectors and Matrices Linear Combinations of Random Variables

Linear combination of random vectors

 Let X be a p-dimensional random vector and c a p-dimensional vector of


constants.
 The linear combination c'X = c1 X1 + c2 X2 + · · · + cp Xp has
mean = E(c'X) = c'µ
variance = Var(c'X) = c'Σc
where µ = E(X) and Σ = Cov(X).
 In general, consider the q linear combinations of the p random variables
X1 , . . . , Xp :
Z_1 = c_{11} X_1 + c_{12} X_2 + \cdots + c_{1p} X_p
Z_2 = c_{21} X_1 + c_{22} X_2 + \cdots + c_{2p} X_p
\vdots
Z_q = c_{q1} X_1 + c_{q2} X_2 + \cdots + c_{qp} X_p
Random Vectors and Matrices Linear Combinations of Random Variables

Linear combination of random vectors

 or
Z = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_q \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} & \dots & c_{1p} \\ c_{21} & c_{22} & \dots & c_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ c_{q1} & c_{q2} & \dots & c_{qp} \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} = CX
 The linear combinations Z = CX have
\mu_Z = E(Z) = E(CX) = C\mu_X
\Sigma_Z = \mathrm{Cov}(Z) = \mathrm{Cov}(CX) = C\Sigma_X C'
where µX and ΣX are the mean vector and variance-covariance matrix
of X, respectively.



Random Vectors and Matrices Linear Combinations of Random Variables

Linear combination of random vectors

Example 3.5
Let X' = (X1 , X2 ) be a random vector with mean vector µ_X' = (µ1 , µ2 ) and
variance-covariance matrix \Sigma_X = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}. Find the mean vector and
covariance matrix for the linear combinations

Z1 = X1 − X2
Z2 = X1 + X2

Solution 3.5
Z = \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = CX
Random Vectors and Matrices Linear Combinations of Random Variables

Linear combination of random vectors

Solution 3.5 (cont.)
\mu_Z = C\mu_X = \begin{pmatrix} \mu_1 - \mu_2 \\ \mu_1 + \mu_2 \end{pmatrix}
\Sigma_Z = \mathrm{Cov}(Z) = C\Sigma_X C' = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}
       = \begin{pmatrix} \sigma_{11} - 2\sigma_{12} + \sigma_{22} & \sigma_{11} - \sigma_{22} \\ \sigma_{11} - \sigma_{22} & \sigma_{11} + 2\sigma_{12} + \sigma_{22} \end{pmatrix}

 If σ11 = σ22 , the off-diagonal entries vanish
 The sum and difference of two random variables with identical
variances are uncorrelated.
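A numerical version of Example 3.5 in R (the mean vector and covariance matrix below are illustrative values, chosen with σ11 = σ22 so that the zero off-diagonals are visible):

C <- matrix(c(1, -1,
              1,  1), nrow = 2, byrow = TRUE)
mu_X    <- c(2, 5)
Sigma_X <- matrix(c(3, 1,
                    1, 3), nrow = 2)
mu_Z    <- C %*% mu_X
Sigma_Z <- C %*% Sigma_X %*% t(C)
mu_Z; Sigma_Z          # off-diagonal entries of Sigma_Z are 0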



Random Vectors and Matrices Matrix Inequalities and Maximization

Matrix Inequalities

 Several multivariate techniques utilize the principle of maximization.
 e.g. in PCA we maximize variability; in DA we maximize separation
 Matrix inequalities can help derive certain maximization results.
 Cauchy-Schwarz inequality: For any two p-dimensional vectors b and
d, we have that
(b'd)^2 \le (b'b)(d'd)
with equality if and only if b = cd for some constant c.
 Extension: (b'd)^2 \le (b'Bb)(d'B^{-1}d) with equality if and only if
b = cB^{-1}d, for a p × p positive definite matrix B and some constant c.



Random Vectors and Matrices Matrix Inequalities and Maximization

Maximization

 Maximization Lemma: Let B be a p × p positive definite matrix and d a
p × 1 vector. Then, for an arbitrary nonzero vector x,
\frac{(x'd)^2}{x'Bx} \le d'B^{-1}d
with the maximum attained when x = cB^{-1}d for some nonzero constant
c.
 Maximization of quadratic forms: For a p × p positive definite matrix
B with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0 and corresponding eigenvectors
ei , i = 1, 2, . . . , p (see the R sketch below):
◦ \max_{x \ne 0} \frac{x'Bx}{x'x} = \lambda_1 (attained when x = e_1)
◦ \min_{x \ne 0} \frac{x'Bx}{x'x} = \lambda_p (attained when x = e_p)
◦ \max_{x \perp e_1, \dots, e_k} \frac{x'Bx}{x'x} = \lambda_{k+1} (attained when
x = e_{k+1}, k = 1, 2, . . . , p − 1)
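A quick numerical check of the quadratic-form result (B below is an illustrative positive definite matrix):

B <- matrix(c(5, 2,
              2, 3), nrow = 2)
e <- eigen(B)
rayleigh <- function(x) drop(t(x) %*% B %*% x / (t(x) %*% x))
rayleigh(e$vectors[, 1])   # equals the largest eigenvalue, e$values[1]
rayleigh(e$vectors[, 2])   # equals the smallest eigenvalue, e$values[2]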
Sample Geometry and Random Samples Sample Geometry

Sample Geometry

 The data matrix X consists of
◦ n observations (representing individuals) xi
◦ p variables (columns) y_j , i.e. y_j = (x_{1j}, x_{2j}, \dots, x_{nj})'
 The projection of yj on the unit vector 1n is written as
\left(\frac{y_j' 1_n}{1_n' 1_n}\right) 1_n = \bar{x}_j 1_n
 Denote the deviation vector from the mean vector by
d_j = y_j - \bar{x}_j 1_n = (x_{1j} - \bar{x}_j,\, x_{2j} - \bar{x}_j,\, \dots,\, x_{nj} - \bar{x}_j)'
 The sum of squared deviations (sum of squared residuals):
L_j^2 = d_j' d_j = \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2 = (n-1)s_{jj}



Sample Geometry and Random Samples Sample Geometry

Sample Geometry

 Also, the cross-product (dot product) of the jth and kth deviation vectors:
d_j' d_k = \sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) = (n-1)s_{jk}
 The angle between the deviation vectors dj and dk is obtained from
\cos(\theta_{jk}) = \frac{d_j' d_k}{[(d_j' d_j)(d_k' d_k)]^{1/2}} = \frac{(n-1)s_{jk}}{[(n-1)s_{jj}(n-1)s_{kk}]^{1/2}} = \frac{s_{jk}}{(s_{jj}s_{kk})^{1/2}} = r_{jk}

 Thus, the cosine of the angle between two deviation vectors is the
correlation between the two random variables Xj , Xk
 The following results can be observed (see the R check below):
◦ if θjk = 0 then rjk = 1, i.e. if the deviation vectors point in the same
direction, the two variables have a perfect positive linear correlation.
◦ if θjk = π/2, then rjk = 0, i.e. if the two deviation vectors are orthogonal, the
correlation between the variables is zero.
◦ if θjk = π, i.e. if they point in opposite directions, then rjk = −1.
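A check with the bookstore data from Example 2.1: the cosine of the angle between the two deviation vectors equals the sample correlation.

y1 <- c(42, 52, 48, 58)                      # Cedi sales
y2 <- c(4, 5, 4, 3)                          # number of books
d1 <- y1 - mean(y1); d2 <- y2 - mean(y2)     # deviation vectors
sum(d1 * d2) / sqrt(sum(d1^2) * sum(d2^2))   # cos(theta_12)
cor(y1, y2)                                  # same value, about -0.36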
Sample Geometry and Random Samples Mean and Variance Estimators

Mean and variance estimators

 Suppose we have a multivariate population with unknown mean vector µ
and unknown covariance matrix Σ.
 The number of unknown parameters is p + p(p + 1)/2
 To make inference about these two quantities and to have estimators
corresponding to µ and Σ, we take samples from the population.

Result 4.1
Assume that we take a random sample X1 , X2 , . . . , Xn from a multivariate
population with unknown mean vector µ and covariance matrix Σ. Then,
1 X̄ is an unbiased estimator for µ and has covariance matrix (1/n)Σ.
2 S_{n-1} = \frac{n}{n-1} S_n is an unbiased estimator for Σ.
3 Sn is a biased estimator for Σ; the bias is equal to −(1/n)Σ.



Sample Geometry and Random Samples Mean and Variance Estimators

Mean and variance estimators

Proof 4.1
(1) E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n}(n\mu) = \mu

\mathrm{Cov}(\bar{X}) = E(\bar{X} - \mu)(\bar{X} - \mu)' = E\left(\frac{1}{n}\sum_{i=1}^n (X_i - \mu)\right)\left(\frac{1}{n}\sum_{i=1}^n (X_i - \mu)\right)'
 = E\left(\frac{1}{n}[(X_1 - \mu) + \cdots + (X_n - \mu)]\right)\left(\frac{1}{n}[(X_1 - \mu)' + \cdots + (X_n - \mu)']\right)
 = \frac{1}{n^2} E\left[(X_1 - \mu)(X_1 - \mu)' + \cdots + (X_n - \mu)(X_n - \mu)'\right]
 = \frac{1}{n^2}(\Sigma + \cdots + \Sigma) = \frac{1}{n^2} n\Sigma = \frac{\Sigma}{n}


Sample Geometry and Random Samples Mean and Variance Estimators

Mean and variance estimators

Proof 4.1 (cont.)

Note that for i ≠ ℓ, E(X_i − µ)(X_ℓ − µ)' = 0, because each entry is the
covariance between a component of X_i and a component of X_ℓ , and these are
independent.

(2) E(S_{n-1}) = E\left[\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})'\right]

\Rightarrow (n-1)E(S_{n-1}) = E\left(\sum_{i=1}^n X_i X_i' - n\bar{X}\bar{X}'\right) = \sum_{i=1}^n E(X_i X_i') - nE(\bar{X}\bar{X}')

but \mathrm{Cov}(X_i) = \Sigma = E(X_i - \mu)(X_i - \mu)' = E(X_i X_i') - \mu\mu'
\Rightarrow E(X_i X_i') = \Sigma + \mu\mu'

also \mathrm{Cov}(\bar{X}) = \frac{\Sigma}{n} = E(\bar{X} - \mu)(\bar{X} - \mu)' = E(\bar{X}\bar{X}') - \mu\mu'
\Rightarrow E(\bar{X}\bar{X}') = \frac{\Sigma}{n} + \mu\mu'
Sample Geometry and Random Samples Mean and Variance Estimators

Mean and variance estimators

Proof 4.1 (cont.)

hence (n-1)E(S_{n-1}) = \sum_{i=1}^n (\Sigma + \mu\mu') - n\left(\frac{\Sigma}{n} + \mu\mu'\right)
 = n\Sigma + n\mu\mu' - \Sigma - n\mu\mu' = (n-1)\Sigma
\Rightarrow E(S_{n-1}) = \Sigma.

Thus, S_{n-1} is an unbiased estimator of Σ.

(3) Now, we show that Sn is not an unbiased estimator of Σ.
(n-1)S_{n-1} = nS_n \Rightarrow S_n = \frac{n-1}{n} S_{n-1}
E(S_n) = \frac{n-1}{n} E(S_{n-1}) = \frac{n-1}{n}\Sigma \ne \Sigma
Sample Geometry and Random Samples Mean and Variance Estimators

Mean and variance estimators

Proof 4.1 (cont.)

Now E(S_n) = \frac{n-1}{n}\Sigma, so
\text{bias} = E(S_n) - \Sigma = \frac{n-1}{n}\Sigma - \Sigma = -\frac{1}{n}\Sigma.

The biased estimator: S_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})'. Note that, as n → ∞, Sn
becomes unbiased (in the limit). We introduce a special notation for the unbiased
estimator:
S = \frac{n}{n-1} S_n = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})'.



Sample Geometry and Random Samples Generalized and Total Variances

Generalized variance

 With a single variable, the sample variance is often used to describe the
amount of variation in the measurements on that variable.
 When p variables are observed on each unit, the variation is described by
the sample variance-covariance matrix
S = \begin{pmatrix} s_{11} & s_{12} & \dots & s_{1p} \\ s_{21} & s_{22} & \dots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \dots & s_{pp} \end{pmatrix}, \quad s_{jk} = \frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)

 The sample covariance matrix contains p variances and p(p − 1)/2
potentially different covariances.
 It is sometimes desirable to assign a single numerical value to the
variation expressed by S.



Sample Geometry and Random Samples Generalized and Total Variances

Generalized variance

 One choice of a single value is the determinant of S, which reduces to the usual
sample variance of a single characteristic when p = 1.
 This determinant is called the generalized sample variance. Thus,
generalized sample variance = |S|.
 The generalized sample variance can also be based on standardized variables
and is then given by the determinant of the correlation matrix, |R|.
 The quantities |S| and |R| are connected by the relationship
|S| = (s11 s22 · · · spp )|R|.
 Another choice to summarize the information in S is the total sample
variance (variation). It is defined as
total sample variance = tr(S) = s11 + s22 + · · · + spp .
 Note that these are not sufficient to replace the p × p covariance matrix,
but these quantities are of interest at times.
Sample Geometry and Random Samples Generalized and Total Variances

Generalized variance

Example 4.1
The sample covariance matrix obtained from a data set with two variables is
given by S = \begin{pmatrix} 252.04 & -68.43 \\ -68.43 & 123.67 \end{pmatrix}. Evaluate the generalized and total
sample variance.

Solution 4.1

Generalized variance = |S| = (252.04)(123.67) − (−68.43)(−68.43) ≈ 26487

Total variance = tr(S) = 252.04 + 123.67 = 375.71.
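In R:

S <- matrix(c(252.04, -68.43,
              -68.43, 123.67), nrow = 2)
det(S)         # generalized sample variance, about 26487
sum(diag(S))   # total sample variance, 375.71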



Sample Geometry and Random Samples Generalized and Total Variances

Generalized variance

Example 4.2
Find the generalized variance and total variance of the population covariance
matrices \Sigma_1 = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix} and \Sigma_2 = \begin{pmatrix} 2 & -1 \\ -1 & 3 \end{pmatrix}. Comment.

Solution 4.2
tr(Σ1 ) = tr(Σ2 ) = 5, |Σ1 | = |Σ2 | = 5. The two covariance matrices have
the same measures as far as total variance and generalized variance are
concerned but clearly, the two matrices are different.

Exercise 4.1
(a) Verify the relationship between |S| and |R| using
S = \begin{pmatrix} 4 & 3 & 1 \\ 3 & 9 & 2 \\ 1 & 2 & 1 \end{pmatrix}
Sample Geometry and Random Samples Generalized and Total Variances

Exercise 4.2
(b) Using S = D1/2 RD1/2 , show that |S| = (s11 + s22 +
· · · + spp )|R|.

1 2 5
(c) Show that the generalized variance |S| = 0 for X = 4 1 6
4 0 4



Sample Geometry and Random Samples Sample Mean, Covariance, and Correlation as Matrix Operations

Sample Mean

\bar{x} = \begin{pmatrix} \frac{1}{n}\sum_{i=1}^n x_{i1} \\ \frac{1}{n}\sum_{i=1}^n x_{i2} \\ \vdots \\ \frac{1}{n}\sum_{i=1}^n x_{ip} \end{pmatrix} = \frac{1}{n}\begin{pmatrix} x_{11} + x_{21} + \dots + x_{n1} \\ x_{12} + x_{22} + \dots + x_{n2} \\ \vdots \\ x_{1p} + x_{2p} + \dots + x_{np} \end{pmatrix} = \frac{1}{n}\begin{pmatrix} x_{11} & x_{21} & \dots & x_{n1} \\ x_{12} & x_{22} & \dots & x_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \dots & x_{np} \end{pmatrix}\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} = \frac{1}{n} X' 1_n

\Rightarrow \bar{x}' = \frac{1}{n} 1_n' X
\Rightarrow 1_n\bar{x}' = \frac{1}{n} 1_n 1_n' X = \frac{1}{n} J_n X


Sample Geometry and Random Samples Sample Mean, Covariance, and Correlation as Matrix Operations

Sample Covariance

S = \frac{1}{n-1}(X - 1_n\bar{x}')'(X - 1_n\bar{x}')
  = \frac{1}{n-1}\left(X - \frac{1}{n}J_n X\right)'\left(X - \frac{1}{n}J_n X\right)
  = \frac{1}{n-1} X'\left(I_n - \frac{1}{n}J_n\right)'\left(I_n - \frac{1}{n}J_n\right) X
  = \frac{1}{n-1} X'\left(I_n - \frac{1}{n}J_n\right) X

since I_n − (1/n)J_n is symmetric and idempotent. Let D = diag(s_{jj}); then
R = D^{-1/2} S D^{-1/2}
S = D^{1/2} R D^{1/2}
Observe that these equations are analogous to their population counterparts.
Sample Geometry and Random Samples Sample Mean, Covariance, and Correlation as Matrix Operations

Sample values of linear combinations of variables

Result 4.2
The linear combinations

b'X = b1 X1 + b2 X2 + · · · + bp Xp
c'X = c1 X1 + c2 X2 + · · · + cp Xp

have sample means, variances, and covariances that are related to x̄ and S by

Sample mean of b'X = b'x̄
Sample mean of c'X = c'x̄
Sample variance of b'X = b'Sb
Sample variance of c'X = c'Sc
Sample covariance of b'X and c'X = b'Sc



Sample Geometry and Random Samples Sample Mean, Covariance, and Correlation as Matrix Operations

Example 4.3
Consider the data array
X = \begin{pmatrix} 42 & 4 \\ 52 & 5 \\ 48 & 4 \\ 58 & 3 \end{pmatrix}
Find x̄, S, R.

Solution 4.3
\bar{x} = \frac{1}{n} X' 1_n = \frac{1}{4}\begin{pmatrix} 42 & 52 & 48 & 58 \\ 4 & 5 & 4 & 3 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 50 \\ 4 \end{pmatrix}


Sample Geometry and Random Samples Sample Mean, Covariance, and Correlation as Matrix Operations

Solution 4.3 (cont.)
Using the divisor n (so that the result matches S_n from Example 2.2),
S_n = \frac{1}{n} X'\left(I_n - \frac{1}{n}J_n\right) X
   = \frac{1}{4}\begin{pmatrix} 42 & 52 & 48 & 58 \\ 4 & 5 & 4 & 3 \end{pmatrix}\left[\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} - \frac{1}{4}\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix}\right]\begin{pmatrix} 42 & 4 \\ 52 & 5 \\ 48 & 4 \\ 58 & 3 \end{pmatrix}
   = \begin{pmatrix} 34 & -1.5 \\ -1.5 & 0.5 \end{pmatrix}
(the unbiased version is S = \frac{n}{n-1} S_n = \frac{4}{3} S_n ; the correlation matrix below is the same either way).
R = D^{-1/2} S_n D^{-1/2}, \quad D^{1/2} = \begin{pmatrix} 5.83 & 0 \\ 0 & 0.71 \end{pmatrix}, \quad D^{-1/2} = \begin{pmatrix} 0.17 & 0 \\ 0 & 1.41 \end{pmatrix}
R = \begin{pmatrix} 0.17 & 0 \\ 0 & 1.41 \end{pmatrix}\begin{pmatrix} 34 & -1.5 \\ -1.5 & 0.5 \end{pmatrix}\begin{pmatrix} 0.17 & 0 \\ 0 & 1.41 \end{pmatrix} = \begin{pmatrix} 1 & -0.36 \\ -0.36 & 1 \end{pmatrix}
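The same matrix operations in R (a short sketch; diag(n) gives I_n and the outer product of a vector of ones gives J_n):

X <- matrix(c(42, 4, 52, 5, 48, 4, 58, 3), ncol = 2, byrow = TRUE)
n <- nrow(X)
ones <- rep(1, n)
Jn <- ones %*% t(ones)                          # n x n matrix of ones
xbar <- t(X) %*% ones / n
Sn <- t(X) %*% (diag(n) - Jn / n) %*% X / n     # divisor n, as above
D_inv_half <- diag(1 / sqrt(diag(Sn)))
R <- D_inv_half %*% Sn %*% D_inv_half
xbar; Sn; R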
Multivariate Normal Distribution

Introduction

 A generalization of the familiar bell-shaped density to several dimensions
plays a fundamental role in multivariate analysis.
 While real data are never exactly multivariate normal, the normal density
is often a useful approximation to the ‘true’ population distribution.
 One advantage of the multivariate normal distribution is that it has
mathematically elegant properties.
 The normal distributions are useful in practice for two reasons:
◦ it is a bona fide model for practice
◦ the sampling distribution of many statistics is approximately
normal (central limit effect)
 The univariate normal distribution:
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}
Multivariate Normal Distribution

Introduction

 Write the exponent as (x − µ)(σ^2)^{-1}(x − µ), which can be interpreted as
the squared distance between x and µ in standard deviation units.
 A multivariate version is immediate for a p × 1 vector x of observations
on several variables:

(x − µ)'Σ^{-1}(x − µ)

 The p × 1 vector µ represents the expected value of the random vector X
and the p × p matrix Σ is the variance-covariance matrix of X.
 We require the symmetric matrix Σ to be positive definite, so that
(x − µ)'Σ^{-1}(x − µ) is the square of the generalized distance from x to
µ.


Multivariate Normal Distribution

Introduction

 The corresponding density function is
f(x) = f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu) \right\}

where −∞ < xj < ∞ for j = 1, . . . , p.

 Often the notation φp is used to denote the multivariate normal density,
while Φp denotes the multivariate cumulative distribution function.
 A multivariate normal random variable will be indicated by

X ∼ Np (µ, Σ).


Multivariate Normal Distribution Bivariate Normal Density

Bivariate normal density

 Let p = 2. Write the bivariate normal density in terms of the individual parameters:
µ1 = E(X1 ), µ2 = E(X2 ), σ11 = Var(X1 ), σ22 = Var(X2 ),
\rho_{12} = \frac{\sigma_{12}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} = \mathrm{Corr}(X_1, X_2) \;\Rightarrow\; \sigma_{12} = \mathrm{Cov}(X_1, X_2) = \rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}.
 Write \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} \;\Rightarrow\; |\Sigma| = \sigma_{11}\sigma_{22} - \sigma_{12}^2
 Then,
\Sigma^{-1} = \frac{1}{\sigma_{11}\sigma_{22} - \sigma_{12}^2}\begin{pmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{pmatrix} = \frac{1}{\sigma_{11}\sigma_{22} - \sigma_{12}^2}\begin{pmatrix} \sigma_{22} & -\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}} \\ -\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}} & \sigma_{11} \end{pmatrix}



Multivariate Normal Distribution Bivariate Normal Density

Bivariate normal density

 The squared distance becomes
(x - \mu)'\Sigma^{-1}(x - \mu) = \frac{1}{\sigma_{11}\sigma_{22} - \sigma_{12}^2}\,(x_1 - \mu_1,\; x_2 - \mu_2)\begin{pmatrix} \sigma_{22} & -\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}} \\ -\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}} & \sigma_{11} \end{pmatrix}\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}

 This simplifies to
= \frac{1}{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)}\left[(x_1 - \mu_1)\sigma_{22} - \rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}(x_2 - \mu_2),\;\; -\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}(x_1 - \mu_1) + \sigma_{11}(x_2 - \mu_2)\right]\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}


Multivariate Normal Distribution Bivariate Normal Density

Bivariate normal density

= \frac{1}{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)}\left[(x_1 - \mu_1)^2\sigma_{22} - 2\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}(x_1 - \mu_1)(x_2 - \mu_2) + (x_2 - \mu_2)^2\sigma_{11}\right]

= \frac{1}{1 - \rho_{12}^2}\left[\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)^2 + \left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right)^2 - 2\rho_{12}\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right)\right]


Multivariate Normal Distribution Bivariate Normal Density

Bivariate normal density

 This implies
f(x_1, x_2) = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)}} \exp\left\{ -\frac{1}{2(1 - \rho_{12}^2)}\left[\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)^2 + \left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right)^2 - 2\rho_{12}\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right)\right]\right\}

 If X1 and X2 are uncorrelated, so that ρ12 = 0, then f(x1 , x2 ) = f(x1 )f(x2 ).
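A short R sketch that evaluates this density at a point, once directly from the formula above and once with dmvnorm() (this assumes the CRAN package mvtnorm is installed; the µ and Σ values are illustrative):

library(mvtnorm)
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.5,
                  0.5, 2), nrow = 2)
x <- c(0.3, -0.2)
dmvnorm(x, mean = mu, sigma = Sigma)
# hand-coded bivariate normal density for comparison
q <- drop(t(x - mu) %*% solve(Sigma) %*% (x - mu))
exp(-q / 2) / (2 * pi * sqrt(det(Sigma)))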



Multivariate Normal Distribution Additional Properties of the Normal Distribution

Properties of normal distribution

 If X has a multivariate normal distribution, then


◦ All linear combination of the components of X are normally
distributed.
◦ All subsets of X are multivariate normally distributed.
◦ cov(Xj , Xk ) = 0 implies Xj and Xk are independent
◦ Conditional distributions of the components are multivariate
normal.

Example 5.1
For X distributed as N3 (µ, Σ), find the distribution of
\begin{pmatrix} X_1 - X_2 \\ X_2 - X_3 \end{pmatrix} = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} = AX



Multivariate Normal Distribution Additional Properties of the Normal Distribution

Bivariate normal density

 
 Let Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} = AX

Y ∼ N2 (Aµ, AΣA')

 E(Y) = AE(X) = A\mu = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} = \begin{pmatrix} \mu_1 - \mu_2 \\ \mu_2 - \mu_3 \end{pmatrix}



Multivariate Normal Distribution Additional Properties of the Normal Distribution

Bivariate normal density

and the covariance matrix cov(Y) is given by

\mathrm{cov}(Y) = A\Sigma A' = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}\begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix}\begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix}
 = \begin{pmatrix} \sigma_{11} - 2\sigma_{12} + \sigma_{22} & \sigma_{12} + \sigma_{23} - \sigma_{22} - \sigma_{13} \\ \sigma_{12} + \sigma_{23} - \sigma_{22} - \sigma_{13} & \sigma_{22} - 2\sigma_{23} + \sigma_{33} \end{pmatrix}

Example 5.2 (The distribution of a subset of a normal random vector)

If X is distributed as N5 (µ, Σ), find the distribution of \begin{pmatrix} X_2 \\ X_4 \end{pmatrix}.
Set X^{(1)} = \begin{pmatrix} X_2 \\ X_4 \end{pmatrix}, \quad \mu^{(1)} = \begin{pmatrix} \mu_2 \\ \mu_4 \end{pmatrix}, \quad \Sigma_{11} = \begin{pmatrix} \sigma_{22} & \sigma_{24} \\ \sigma_{24} & \sigma_{44} \end{pmatrix}


Multivariate Normal Distribution Additional Properties of the Normal Distribution

Properties

Note that with this assignment, X, µ and Σ can respectively be rearranged and
partitioned as
X = \begin{pmatrix} X_2 \\ X_4 \\ - \\ X_1 \\ X_3 \\ X_5 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_2 \\ \mu_4 \\ - \\ \mu_1 \\ \mu_3 \\ \mu_5 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_{22} & \sigma_{24} & | & \sigma_{12} & \sigma_{23} & \sigma_{25} \\ \sigma_{24} & \sigma_{44} & | & \sigma_{14} & \sigma_{34} & \sigma_{45} \\ - & - & - & - & - & - \\ \sigma_{12} & \sigma_{14} & | & \sigma_{11} & \sigma_{13} & \sigma_{15} \\ \sigma_{23} & \sigma_{34} & | & \sigma_{13} & \sigma_{33} & \sigma_{35} \\ \sigma_{25} & \sigma_{45} & | & \sigma_{15} & \sigma_{35} & \sigma_{55} \end{pmatrix}

Thus X^{(1)} = \begin{pmatrix} X_2 \\ X_4 \end{pmatrix} has the distribution
N_2(\mu^{(1)}, \Sigma_{11}) = N_2\left(\begin{pmatrix} \mu_2 \\ \mu_4 \end{pmatrix}, \begin{pmatrix} \sigma_{22} & \sigma_{24} \\ \sigma_{24} & \sigma_{44} \end{pmatrix}\right)

It is clear from this example that the normal distribution for any subset can be
expressed by simply selecting the appropriate means and covariances from the
original µ and Σ.
Multivariate Normal Distribution Additional Properties of the Normal Distribution

Properties

Example 5.3 (The equivalence of zero covariance and independence for
normal variables)
Let X be N3 (µ, Σ) with
\Sigma = \begin{pmatrix} 4 & 1 & 0 \\ 1 & 3 & 0 \\ 0 & 0 & 2 \end{pmatrix}
Are X1 and X2 independent? What about (X1 , X2 ) and X3 ?

Since X1 and X2 have covariance σ12 = 1, they are not independent.
Partition X and Σ as
X = \begin{pmatrix} X_1 \\ X_2 \\ -- \\ X_3 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 4 & 1 & | & 0 \\ 1 & 3 & | & 0 \\ -- & -- & | & -- \\ 0 & 0 & | & 2 \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & | & \Sigma_{12} \\ -- & | & -- \\ \Sigma_{21} & | & \Sigma_{22} \end{pmatrix}
Multivariate Normal Distribution Additional Properties of the Normal Distribution

Properties
  
See that X^{(1)} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} and X_3 have covariance matrix \Sigma_{12} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. Therefore
(X1 , X2 ) and X3 are independent. This implies that X3 is independent of X1
and also of X2 .

Example 5.4 (Result)
Let \begin{pmatrix} X_1 \\ -- \\ X_2 \end{pmatrix} \sim N_p\left(\begin{pmatrix} \mu_1 \\ -- \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & | & \Sigma_{12} \\ -- & | & -- \\ \Sigma_{21} & | & \Sigma_{22} \end{pmatrix}\right)
Then (X1 |X2 = x2 ) ∼ Nq (µ1|2 , Σ1|2 ) with

\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)
\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.

Note two important features: the conditional mean is a linear function of x2
and the conditional covariance does not depend on x2 .
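A small R sketch of these formulas (the µ, Σ and x2 values are illustrative; here X1 is the first component and X2 the last two):

mu    <- c(1, 2, 0)
Sigma <- matrix(c(4, 1, 2,
                  1, 9, -3,
                  2, -3, 25), nrow = 3)
q <- 1
S11 <- Sigma[1:q, 1:q, drop = FALSE]
S12 <- Sigma[1:q, -(1:q), drop = FALSE]
S21 <- t(S12)
S22 <- Sigma[-(1:q), -(1:q)]
x2 <- c(3, 1)                                            # observed value of X2
mu_cond  <- mu[1:q] + S12 %*% solve(S22) %*% (x2 - mu[-(1:q)])
Sig_cond <- S11 - S12 %*% solve(S22) %*% S21
mu_cond; Sig_cond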
Multivariate Normal Distribution Additional Properties of the Normal Distribution

Conditional distribution

Indirect proof
Let A = \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix}. We know that X − µ ∼ N(0, Σ). This
implies
A(X - \mu) = \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix}\begin{pmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{pmatrix} \sim N(0, A\Sigma A')

i.e.
\begin{pmatrix} X_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(X_2 - \mu_2) \\ X_2 - \mu_2 \end{pmatrix} \sim N(0, A\Sigma A')

but A\Sigma A' = \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix}\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\begin{pmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{pmatrix}



Multivariate Normal Distribution Additional Properties of the Normal Distribution

Conditional distribution

A\Sigma A' = \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & \Sigma_{12} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\begin{pmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{pmatrix} = \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}

Thus,
\begin{pmatrix} X_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(X_2 - \mu_2) \\ X_2 - \mu_2 \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}\right)



Multivariate Normal Distribution Additional Properties of the Normal Distribution

Conditional distribution

This result implies that the two components are independent. As a result, we
may condition the first component on the value of the second without affecting its
distribution. Adding back the mean term \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), which is a constant once
X2 = x2 is fixed, gives

(X_1 \,|\, X_2 = x_2) \sim N\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)

or
(X_1 \,|\, X_2 = x_2) \sim N(\mu_{1|2}, \Sigma_{1|2})

where \mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) and \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.
Multivariate Normal Distribution Additional Properties of the Normal Distribution

Conditional distribution of bivariate normal

Example 5.5 (The conditional distribution of a bivariate normal distribution
using densities directly)
By using the bivariate normal density f(x1 , x2 ) derived above and taking
f(x_2) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}} e^{-\frac{(x_2 - \mu_2)^2}{2\sigma_{22}}},
it can easily be shown that

f(x_1 \,|\, x_2) = \frac{f(x_1, x_2)}{f(x_2)} = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{11}(1 - \rho_{12}^2)}}\, e^{-\frac{\left(x_1 - \mu_1 - \frac{\sigma_{12}}{\sigma_{22}}(x_2 - \mu_2)\right)^2}{2\sigma_{11}(1 - \rho_{12}^2)}}

(See textbook for proof.)
Thus, X_1 \,|\, X_2 = x_2 \sim N\left(\mu_1 + \frac{\sigma_{12}}{\sigma_{22}}(x_2 - \mu_2),\; \sigma_{11}(1 - \rho_{12}^2)\right).



Multivariate Normal Distribution Additional Properties of the Normal Distribution

Distributional properties

Example 5.6 (Property)

Let X ∼ Np (µ, Σ) with |Σ| > 0. Then
 (X − µ)'Σ^{-1}(X − µ) ∼ χ²_p .
 Np (µ, Σ) assigns probability (1 − α) to the solid ellipsoid
{x | (x − µ)'Σ^{-1}(x − µ) ≤ χ²_p(α)}, where χ²_p(α) denotes the upper (100α)th
percentile of the χ²_p distribution.
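A Monte Carlo illustration of this property in R (assumes the mvtnorm package for sampling; µ and Σ are illustrative):

library(mvtnorm)
set.seed(1)
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 2), nrow = 2)
X <- rmvnorm(10000, mean = mu, sigma = Sigma)
d2 <- mahalanobis(X, center = mu, cov = Sigma)   # (x - mu)' Sigma^{-1} (x - mu)
mean(d2 <= qchisq(0.95, df = 2))                 # close to 0.95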



Sampling from the Multivariate Normal Distribution and Maximum Likelihood Estimation

Multivariate normal likelihood function

 Let a random sample X1 , X2 , . . . , Xn be distributed as Np (µ, Σ). The
multivariate normal likelihood function is defined as:
L(\mu, \Sigma; x) = \prod_{i=1}^n \phi_p(x_i; \mu, \Sigma)
 = \prod_{i=1}^n \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x_i - \mu)'\Sigma^{-1}(x_i - \mu) \right\}
 = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}} \exp\left\{ -\frac{1}{2}\sum_{i=1}^n (x_i - \mu)'\Sigma^{-1}(x_i - \mu) \right\}

 Note: If A is a k × k symmetric matrix and x is a k × 1 vector, then
x'Ax = tr(x'Ax) = tr(Axx').
 Also, tr(AB) = tr(BA).



Sampling from the Multivariate Normal Distribution and Maximum Likelihood Estimation

Multivariate normal likelihood function

 Observe that
\sum_{i=1}^n (x_i - \mu)'\Sigma^{-1}(x_i - \mu) = \sum_{i=1}^n \mathrm{tr}\left[(x_i - \mu)'\Sigma^{-1}(x_i - \mu)\right]
 = \sum_{i=1}^n \mathrm{tr}\left[\Sigma^{-1}(x_i - \mu)(x_i - \mu)'\right]
 = \mathrm{tr}\left[\Sigma^{-1}\left\{\sum_{i=1}^n (x_i - \mu)(x_i - \mu)'\right\}\right]
 Now,
(x_i - \mu)(x_i - \mu)' = (x_i - \bar{x} + \bar{x} - \mu)(x_i - \bar{x} + \bar{x} - \mu)'
 = [(x_i - \bar{x}) + (\bar{x} - \mu)][(x_i - \bar{x}) + (\bar{x} - \mu)]'
Sampling from the Multivariate Normal Distribution and Maximum Likelihood Estimation

Multivariate normal likelihood function

 thus,
\sum_{i=1}^n (x_i - \mu)(x_i - \mu)' = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})' + n(\bar{x} - \mu)(\bar{x} - \mu)' + 2\sum_{i=1}^n (x_i - \bar{x})(\bar{x} - \mu)'
 = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})' + n(\bar{x} - \mu)(\bar{x} - \mu)'
since the cross-product term vanishes: \sum_{i=1}^n (x_i - \bar{x}) = 0.



Sampling from the Multivariate Normal Distribution and Maximum Likelihood Estimation

Multivariate normal likelihood function

 This fact enables us to rewrite the likelihood as
L(\mu, \Sigma; x) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}} \exp\left\{ -\frac{1}{2}\mathrm{tr}\left[\Sigma^{-1}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})'\right] - \frac{n}{2}(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu) \right\}

The first part involves the sample variance-covariance matrix and is free of
µ, while the second part contains µ.
Sampling from the Multivariate Normal Distribution and Maximum Likelihood Estimation

Maximum Likelihood Estimation

 The most likely parameter value is defined as the one yielding the maximum of
L, given the data.
 The corresponding parameter value is called the maximum likelihood
estimator or MLE (which is a random variable, and a function of the data).
 It is common practice to maximize the log-likelihood.
 In our case, the log-likelihood is
l(\mu, \Sigma) = \log L(\mu, \Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\mathrm{tr}\left[\Sigma^{-1}\left(\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})'\right)\right] - \frac{n}{2}(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu)



Sampling from the Multivariate Normal Distribution and Maximum Likelihood Estimation

Maximum Likelihood Estimation

 The derivative of the log-likelihood w.r.t. the mean vector µ is
\frac{\partial l}{\partial \mu} = n\Sigma^{-1}(\bar{x} - \mu) = 0
Thus, the MLE for µ is x̄, the sample mean.
 The derivative w.r.t. the covariance components is more complicated. The
maximization can be achieved in an alternative way.
Lemma 5.1
Let B and Σ be p × p symmetric positive definite matrices and let b > 0. Then
\frac{1}{|\Sigma|^b} \exp\left\{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}B)\right\} \le \frac{1}{|B|^b}(2b)^{pb} e^{-pb}

Equality holds iff
\Sigma = \frac{1}{2b}B
Sampling from the Multivariate Normal Distribution and Maximum Likelihood Estimation

Covariance Estimation

Theorem 5.2
Let X1 , X2 , . . . , Xn be independent and identically distributed Np (µ, Σ). Then
the maximum likelihood estimators for µ and Σ are
\hat{\mu} = \bar{X} \quad \text{and} \quad \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})' = \frac{n-1}{n}S = S_n

The maximum likelihood estimates are found by substituting the observed
values xi for Xi .

Proof
1. We have already derived the result for the mean.
2. The likelihood for µ = µ̂ becomes
L(\hat{\mu}, \Sigma) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}} \exp\left\{ -\frac{1}{2}\mathrm{tr}[\Sigma^{-1} n S_n] \right\}
 This quantity no longer depends on µ̂; it remains to maximize it over Σ, and it is a
simple function of |Σ| and tr(Σ^{-1}nS_n ).
Sampling from the Multivariate Normal Distribution and Maximum Likelihood Estimation

Covariance Estimation

 If we apply the previous lemma with b = n/2 and B = nS_n , then the
maximum value is achieved for
\hat{\Sigma} = \frac{1}{2(n/2)}(nS_n) = S_n

 Observe that the MLE for the mean is unbiased but the MLE for the covariance
matrix is biased.
 Substituting µ̂ and Σ̂ in L(µ, Σ) yields
L(\hat{\mu}, \hat{\Sigma}) = \frac{1}{(2\pi)^{np/2}} \exp\left\{ -\frac{1}{2}\mathrm{tr}[\hat{\Sigma}^{-1} n\hat{\Sigma}] \right\} \cdot \frac{1}{|\hat{\Sigma}|^{n/2}}
 = \frac{1}{(2\pi)^{np/2}} e^{-\frac{n}{2}\mathrm{tr}[I_p]} \cdot \frac{1}{|\hat{\Sigma}|^{n/2}} = \frac{1}{(2\pi)^{np/2}} e^{-\frac{np}{2}} \cdot \frac{1}{|\hat{\Sigma}|^{n/2}}
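In R, the MLEs are simple to compute from a data matrix (a sketch with simulated data; cov() uses the divisor n − 1, so the MLE rescales it):

set.seed(123)
X <- matrix(rnorm(50 * 3), ncol = 3)     # illustrative sample, n = 50, p = 3
n <- nrow(X)
mu_hat    <- colMeans(X)                 # MLE of mu
Sigma_hat <- cov(X) * (n - 1) / n        # MLE of Sigma (biased by -Sigma/n)
S         <- cov(X)                      # unbiased estimator S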



Multivariate Normal Distribution Sufficient Statistics

Sufficient Statistics

 The likelihood depends on x1 , x2 , . . . , xn only through
i. \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i (the mean vector)
ii. C = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})' = (n-1)S (the matrix of squares and cross products)

 Thus, x̄ and S are sufficient statistics.

 For other multivariate distributions, more information may be needed.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 102 / 261


Multivariate Normal Distribution Sampling Distributions

Sampling distributions

Univariate: Let X1, X2, . . . , Xn ∼ N1(µ, σ²). Then
1. X̄ ∼ N1(µ, σ²/n)
2. (n − 1)s² = Σ_{i=1}^n (Xi − X̄)² ∼ σ² χ²_{n−1}
3. X̄ and (n − 1)s² are statistically independent.

Multivariate: Let X1, X2, . . . , Xn ∼ Np(µ, Σ). Then
1. X̄ ∼ Np(µ, (1/n)Σ)
2. (n − 1)S = Σ_{i=1}^n (Xi − X̄)(Xi − X̄)' ∼ Wishart_{p,n−1}(Σ)
3. X̄ and (n − 1)S are statistically independent.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 103 / 261


Multivariate Normal Distribution Wishart Distribution

Wishart distribution

 Univariate case: Y ∼ σ² χ²_m means Y = Z₁² + Z₂² + · · · + Z_m² with Zi ∼ N(0, σ²).
 Similarly, for the multivariate case: Y ∼ Wishart_{p,m}(Σ) means Y = Σ_{j=1}^m Zj Zj' with Zj ∼ Np(0, Σ).
Properties
1. Let A1 ∼ W_{p,m1}(Σ) and A2 ∼ W_{p,m2}(Σ) be independent. Then A1 + A2 ∼ W_{p,m1+m2}(Σ).
2. Let B be a q × p matrix of constants and A ∼ W_{p,m}(Σ). Then BAB' ∼ W_{q,m}(BΣB').
3. Let b be a p × 1 vector of constants and A ∼ W_{p,m}(Σ). Then b'Ab ∼ σ² χ²_m with σ² = b'Σb.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 104 / 261
Multivariate Normal Distribution Sample Distribution Properties

Large sample properties

Theorem 5.3 (Law of Large Numbers)
Let X1, X2, . . . , Xn be an independent sample from a population with E(Xi) = µ. Then
X̄ → µ as n → ∞
[i.e. P(|X̄ − µ| < ε) → 1 as n → ∞ for every ε > 0]

Theorem 5.4 (Central Limit Theorem)
Let X1, X2, . . . , Xn be independent observations with mean µ and covariance Σ. Then, approximately,
√n (X̄ − µ) ∼ Np(0, Σ)
for large n.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 105 / 261


Multivariate Normal Distribution Sample Distribution Properties

Large sample properties

Note:
 n should be large compared to p
 the above result is exact for normal variables
 as a consequence of the central limit theorem,
n(X̄ − µ)' Σ^{-1} (X̄ − µ) ∼ χ²_p
 if n is sufficiently large, we may replace Σ by S, whence
n(X̄ − µ)' S^{-1} (X̄ − µ) ∼ χ²_p
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 106 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Assessing normality

 Many of the multivariate techniques depend on the assumption of


(multivariate) normality (except some large sample theory results)

 Even in large samples, the approximation will be better if the distribution


is closer to normality.

 The detection of departures from normality is crucial.

 Three properties of the normal distribution will be important, e.g.


◦ marginal distributions are normal
◦ linear combinations are normal
◦ density contours are ellipsoids

 The detection of OUTLIERS (wild observations) will play an important


role as well.

 Checks in univariate (bivariate) margins are usually not difficult.


Dr. S. Iddi (UG) STAT446/614 April 16, 2015 107 / 261
Multivariate Normal Distribution Assessing the Normality Assumption

Assessing normality (Univariate)

 Strictly speaking, the multivariate distribution as a whole should be investigated, but it is often sensible to restrict the investigation to univariate and bivariate margins.
 Univariate normality: assess symmetry in dot diagrams, histograms, etc.
 Use the property that the probability of
◦ [µj − √σjj , µj + √σjj ] is 68.3%
◦ [µj − 2√σjj , µj + 2√σjj ] is 95.4%
◦ [µj − 1.96√σjj , µj + 1.96√σjj ] is 95%
 Check this by comparing with the proportion of observations falling within
◦ [x̄j − √sjj , x̄j + √sjj ] (should be about 68.3%)
◦ [x̄j − 2√sjj , x̄j + 2√sjj ] (should be about 95.4%)
◦ [x̄j − 1.96√sjj , x̄j + 1.96√sjj ] (should be about 95%)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 108 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Quantile-Quantile Plot (QQ-Plot)

 Special plots called QQ Plots can be used to assess the assumption of


normality.
 They are plots of the sample quantiles versus the quantiles one would expect to observe if the observations actually were normally distributed.
 When the points lie very nearly along a straight line, the normality assumption remains tenable. Normality is suspect if the points deviate substantially from a straight line.
 Moreover, the pattern of the deviations can provide clues about the
nature of the nonnormality.
 Once the reasons for nonnormality are identified, corrective action is
often possible.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 109 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

QQ-Plot

 Let x1, x2, . . . , xn be n observations and denote their ordered values by
x(1) ≤ x(2) ≤ · · · ≤ x(n),
where the x(j)'s are the sample quantiles.
 When the x(j)'s are distinct, exactly j observations are less than or equal to x(j). The proportion j/n of the sample at or to the left of x(j) is often approximated by (j − 1/2)/n for analytical convenience (continuity correction).
 On the other hand, we can compute the quantile q(j) of the standard normal distribution from

P(Z ≤ q(j)) = ∫_{−∞}^{q(j)} (1/√(2π)) e^{−z²/2} dz = (j − 1/2)/n

 Solve this equation for q(j) and plot the pairs (q(j), x(j)). If normality holds, the expected sample quantile is x(j) ≈ σ q(j) + µ.
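This construction can be reproduced directly in R; a minimal sketch, using the n = 10 observations of the example that follows:

x <- c(-1, -0.1, 0.16, 0.41, 0.62, 0.80, 1.26, 1.54, 1.71, 2.30)
n <- length(x)

q <- qnorm((1:n - 0.5) / n)          # standard normal quantiles q_(j)
plot(q, sort(x),
     xlab = "Standard normal quantiles",
     ylab = "Sample quantiles",
     main = "QQ-plot")
abline(lm(sort(x) ~ q))              # reference line; slope ~ sigma, intercept ~ mu
## qqnorm(x); qqline(x)              # built-in equivalent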
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 110 / 261
Multivariate Normal Distribution Assessing the Normality Assumption

Example (Constructing a QQ-Plot)

A sample of n = 10 observations gives the values in the following table.

Ordered observation x(j)   Probability level (j − 1/2)/n   Standard normal quantile q(j)
−1.00   (1 − 1/2)/10 = 0.05   −1.645
-0.1 0.15 -1.036
0.16 0.25 -0.674
0.41 0.35 -0.384
0.62 0.45 -0.125
0.80 0.55 0.125
1.26 0.65 0.385
1.54 0.75 0.674
1.71 0.85 1.036
2.30 0.95 1.645

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 111 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Example (Constructing a QQ-Plot)


 Φ(0.385) = P(Z ≤ 0.385) = ∫_{−∞}^{0.385} (1/√(2π)) e^{−z²/2} dz = 0.65
 Note that Φ(z) has no closed-form solution, so numerical values from standard normal tables are used.

[Figure: QQ-plot of the sample quantiles x(i) against the standard normal quantiles q(i).]

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 112 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Example (Constructing a QQ-Plot)

 The pairs of points lie very nearly along a straight line, and we will not reject the notion that these data are normally distributed, particularly with a sample size as small as n = 10.
 As a rule of thumb: do not trust this procedure for samples of size n ≤ 20.
 There can be quite a bit of variability in the straightness of the QQ-plot for small samples, even when the observations are known to come from a normal population.
 Is the QQ-plot straight? If it were perfectly straight, the correlation between the pairs would be 1.
 We can therefore base a test on the correlation coefficient.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 113 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Test for Correlation Coefficient

 Correlation coefficient

rQ = Σ_{j=1}^n (x(j) − x̄)(q(j) − q̄) / √[ Σ_{j=1}^n (x(j) − x̄)² Σ_{j=1}^n (q(j) − q̄)² ]

 We compare rQ to a table of critical points for the QQ-plot correlation test for normality.
 If rQ > the critical value, we do not reject the hypothesis of normality.
 From our example,

Σ_{j=1}^n (x(j) − x̄) q(j) = 8.584,   Σ_{j=1}^n (x(j) − x̄)² = 8.472,   Σ_{j=1}^n (q(j))² = 8.795  (since q̄ = 0 always)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 114 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Table of Critical Points

Critical Points for the QQ-Plot Correlation Coefficient Test for Normality

Sample size n   α = 0.01   α = 0.05   α = 0.10
5               0.8299     0.8788     0.9032
10              0.8801     0.9198     0.9351
15              0.9126     0.9389     0.9503
20              0.9269     0.9508     0.9604
25              0.9410     0.9591     0.9665
30              0.9479     0.9652     0.9795
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 115 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Test for Correlation Coefficient

rQ = 8.584 / (√8.472 · √8.795) = 0.994
 The test for normality at the 10% level of significance is carried out by referring rQ = 0.994 to the entry in the table corresponding to n = 10 and α = 0.10. This entry is 0.9351.
 Since rQ = 0.994 > 0.9351, we do not reject the hypothesis of normality.
 As an alternative, the Shapiro-Wilk test statistic may be computed.
 For large sample sizes, the two statistics are nearly the same, so either can be used to judge lack of normality.
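A short R check of this calculation, assuming the ten observations of the QQ-plot example:

x <- c(-1, -0.1, 0.16, 0.41, 0.62, 0.80, 1.26, 1.54, 1.71, 2.30)
n <- length(x)
q <- qnorm((1:n - 0.5) / n)

rQ <- sum((sort(x) - mean(x)) * (q - mean(q))) /
      sqrt(sum((sort(x) - mean(x))^2) * sum((q - mean(q))^2))
rQ                      # about 0.994; compare with the critical point 0.9351
cor(sort(x), q)         # same quantity
shapiro.test(x)         # Shapiro-Wilk alternative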

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 116 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Evaluating Bivariate Normality

 Consider pairs of variables. If their plots do not look ellipsoidal, the distributional assumption is suspect.
 However, if the shape is ellipsoidal, there is no guarantee of bivariate normality.
 Several distributions belong to the class of elliptically contoured distributions (e.g. the multivariate t-distribution, the multivariate Cauchy, etc.).
 Remember the relationship between
(x − µ)' Σ^{-1} (x − µ)
and the χ²_2 distribution (for p = 2).
 If we compute the 50% quantile of the χ²_2 distribution, then we get the ellipse in which roughly 1/2 of the observations should lie. If this is not the case, we should again use caution.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 117 / 261
Multivariate Normal Distribution Assessing the Normality Assumption

Example

Consider the pairs of observations (x1 = sales, x2 = profits) for the 10 largest US industrial corporations.

Company   x1 = sales (mil dollars)   x2 = profits (mil dollars)
1 126974 4224
2 96933 3835
3 86656 3510
4 63438 3758
5 55264 3939
6 50976 1809
7 39069 2946
8 36156 359
9 35209 2480
10 32416 2413

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 118 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Evaluating Bivariate Normality

 These data give

x̄ = [62309, 2927]',   S = [[10005.20, 255.76], [255.76, 14.30]] × 10^5

So,

S^{-1} = (1/77661.18) [[14.30, −255.76], [−255.76, 10005.20]] × 10^{-5}
       = [[0.000184, −0.003293], [−0.003293, 0.128131]] × 10^{-5}

 Reading from the χ² table, χ²_2(0.5) = 1.39. Thus, any observation x' = (x1, x2) satisfying

[x1 − 62309, x2 − 2927] [[0.000184, −0.003293], [−0.003293, 0.128131]] × 10^{-5} [x1 − 62309, x2 − 2927]' ≤ 1.39

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 119 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Evaluating Bivariate Normality

is on or inside the estimated 50% contour. Otherwise the observation is outside.
 The first pair of observations is (x1, x2)' = (126974, 4224). In this case,

[126974 − 62309, 4224 − 2927] [[0.000184, −0.003293], [−0.003293, 0.128131]] × 10^{-5} [126974 − 62309, 4224 − 2927]' = 4.34 > 1.39,

and this observation falls outside the 50% contour.
 The remaining points have generalized distances from x̄ of 1.20, 0.59, 0.83, 1.88, 1.01, 1.02, 5.33, 0.81 and 0.97, respectively.
 Since 7 of these distances are less than 1.39, a proportion of 0.70 falls within the 50% contour.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 120 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Evaluating Bivariate Normality

 We would expect about 50% of observations within the contour. This


large difference will ordinarily provide evidence for rejecting the notion
of bivariate normality.

 However, our sample size of 10 is too small to reach this conclusion.

 Computing the fraction of points within a contour and subjectively


comparing it with the theoretical probability is a useful but rather rough
procedure.
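A hedged R sketch of this check, using the sales/profits data from the example:

sales   <- c(126974, 96933, 86656, 63438, 55264, 50976, 39069, 36156, 35209, 32416)
profits <- c(4224, 3835, 3510, 3758, 3939, 1809, 2946, 359, 2480, 2413)
x <- cbind(sales, profits)

d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))  # generalized squared distances
round(d2, 2)
mean(d2 <= qchisq(0.5, df = 2))   # proportion inside the estimated 50% contour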

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 121 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Multivariate: Gamma plot or Chi-square plot

 A somewhat more formal method for judging the joint normality of a


dataset is based on the squared generalized distance.

 Compute di2 = (xi − x̄)0 S−1 (xi − x̄) which are


◦ approximately χ2p if say n − p ≥ 25
◦ if the population is multivariate normal.

 Warnings:
◦ the χ2p is only an approximation, an F distribution can be used
instead.
◦ the quantities are not independent.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 122 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Construction of the Gamma Plot

 Order the squared distances:
d²(1) ≤ d²(2) ≤ · · · ≤ d²(n)
 Graph the pairs
( χ²_p((i − 1/2)/n), d²(i) ),
where χ²_p((i − 1/2)/n) is the 100(i − 1/2)/n percentile of the χ²_p distribution.
 The plot should be approximately a straight line.
 If a systematic pattern is detected, then a deviation from normality is likely.
Example:
The ordered distances and the corresponding chi-square percentiles for the previous example (n = 10, p = 2) are listed in the table below.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 123 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Construction of the Gamma Plot

 
i   (i − 1/2)/n   d²(i)   χ²_2 quantile at (i − 1/2)/n
1   0.05   0.59   0.10
2   0.15   0.81   0.33
3   0.25   0.83   0.58
4   0.35   0.97   0.86
5   0.45   1.01   1.20
6   0.55   1.02   1.60
7   0.65   1.20   2.10
8   0.75   1.88   2.77
9   0.85   4.34   3.79
10  0.95   5.33   5.99

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 124 / 261


Multivariate Normal Distribution Assessing the Normality Assumption

Construction of the Gamma Plot

The chi-square plot for the data is given below.

[Figure: Chi-square plot of the ordered distances d²(i) against the χ²_2 quantiles.]

 The points do not lie along a straight line with slope 1.
 These data do not appear to be bivariate normal.
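A minimal R sketch of the chi-square plot, continuing with the distances d2 computed in the previous sketch:

## d2 as computed above for the sales/profits data
n <- length(d2); p <- 2

plot(qchisq((1:n - 0.5) / n, df = p), sort(d2),
     xlab = "Chi-square quantiles",
     ylab = "Ordered squared distances",
     main = "Chi-square plot")
abline(0, 1)   # points near this line support multivariate normality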
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 125 / 261
Multivariate Normal Distribution Assessing the Normality Assumption

Outliers

 An important reason for deviation from normality: outliers.

 An outlier is an observation generated by another mechanism.

 How can we recognize them?


◦ histogram
◦ scatter-plots for pairs of variables
◦ χ2 −plots or di2
[see textbook on steps for detecting outliers.]

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 126 / 261


Multivariate Normal Distribution Transformation to Near Normality

Transformations

What if the assumption of normality is not tenable?
 Apply techniques for non-normal data
 Apply transformations: re-express the data in other units, which are sometimes more natural than the original units.
Some well-known examples:

Original scale       Transformed scale
count y              √y
proportion p         logit(p) = log( p/(1 − p) )
correlation r        Fisher's z(r) = (1/2) log( (1 + r)/(1 − r) )

 Another important transformation is the log-transform, log(y), which is often used in e.g. growth data.
 Transformations that reduce the tail of the distribution are the log-transform and the power transform y = x^{1/n}.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 127 / 261
Multivariate Normal Distribution Transformation to Near Normality

Box and Cox Transforms

 Allow the data to suggest a transformation.
 A family of transformations proposed by Box and Cox is

y = (x^λ − 1)/λ   if λ ≠ 0
y = log(x)        if λ = 0

 Note that the form for λ = 0 follows as the limit λ → 0.
 The parameter λ is to be estimated from the data.
 λ is estimated by maximum likelihood or by minimizing the sample variance of the transformed variables.
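In R, the profile likelihood for λ can be examined, for instance, with MASS::boxcox applied to an intercept-only model; a hedged sketch (the vector x below is hypothetical, positive-valued data):

library(MASS)

x  <- rlnorm(100, meanlog = 1, sdlog = 0.6)      # hypothetical skewed, positive data
bc <- boxcox(lm(x ~ 1), lambda = seq(-2, 2, by = 0.05))
lambda.hat <- bc$x[which.max(bc$y)]              # MLE of lambda on the grid
lambda.hat

y <- if (abs(lambda.hat) > 1e-8) (x^lambda.hat - 1) / lambda.hat else log(x)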

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 128 / 261


Inference About a mean vector Plausible Values for a Normal Population Mean

Introduction

 To determine whether a p × 1 vector µ0 is a plausible value for the mean of a multivariate normal distribution, the following test statistic (analogous to the univariate case) is used:

T² = (X̄ − µ0)' ((1/n) S)^{-1} (X̄ − µ0) = n (X̄ − µ0)' S^{-1} (X̄ − µ0),

where X̄ = (1/n) Σ_{i=1}^n Xi, S = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)(Xi − X̄)' and µ0 = (µ10, µ20, . . . , µp0)'.
 The test statistic T² is called Hotelling's T². Here (1/n) S is the estimated covariance matrix of X̄.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 129 / 261


Inference About a mean vector Plausible Values for a Normal Population Mean

Hotelling’s T 2

 If the statistical distance T² is too large, i.e. if X̄ is "too far" from µ0, the hypothesis H0 : µ = µ0 is rejected.
 T² is distributed as ((n − 1)p/(n − p)) F_{p,n−p}, where F_{p,n−p} denotes a random variable with an F-distribution with p and n − p degrees of freedom.
 So we test the hypothesis H0 : µ = µ0 versus H1 : µ ≠ µ0. At the α level of significance, we reject H0 in favor of H1 if the observed

T² > ((n − 1)p/(n − p)) F_{p,n−p}(α).

Example 5.7
Let the data matrix for a random sample of size n = 3 from a bivariate normal population be

X = [[6, 9], [10, 6], [8, 3]].

Evaluate T² for µ0' = [9, 5]. What is the sampling distribution of T² in this case?
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 130 / 261
Inference About a mean vector Plausible Values for a Normal Population Mean

Hotelling’s T 2

Solution

x̄ = [x̄1, x̄2]' = [(6 + 10 + 8)/3, (9 + 6 + 3)/3]' = [8, 6]'   and

s11 = [(6 − 8)² + (10 − 8)² + (8 − 8)²]/2 = 4
s12 = [(6 − 8)(9 − 6) + (10 − 8)(6 − 6) + (8 − 8)(3 − 6)]/2 = −3
s22 = [(9 − 6)² + (6 − 6)² + (3 − 6)²]/2 = 9

so

S = [[4, −3], [−3, 9]].
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 131 / 261
Inference About a mean vector Plausible Values for a Normal Population Mean

Hotelling’s T 2

Thus

S^{-1} = (1/(4(9) − (−3)(−3))) [[9, 3], [3, 4]] = [[1/3, 1/9], [1/9, 4/27]]

T² = 3 [8 − 9, 6 − 5] [[1/3, 1/9], [1/9, 4/27]] [8 − 9, 6 − 5]' = 3 [−1, 1] [−2/9, 1/27]' = 7/9

Before the sample is selected, T² has the distribution of a ((3 − 1)2/(3 − 2)) F_{2,3−2} = 4 F_{2,1} random variable.
Example 5.8
Test the hypothesis H0 : µ' = [4, 50, 10] against H1 : µ' ≠ [4, 50, 10] at level of significance α = 0.10, using perspiration data from 20 healthy females on which three components, X1 = sweat rate, X2 = sodium content and X3 = potassium content, were measured. The data yield

x̄' = [4.640, 45.4, 9.965]

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 132 / 261


Inference About a mean vector Plausible Values for a Normal Population Mean

Hotelling’s T 2

Example 5.8

S = [[2.879, 10.010, −1.810], [10.010, 199.788, −5.640], [−1.810, −5.640, 3.628]]   and

S^{-1} = [[0.586, −0.022, 0.258], [−0.022, 0.006, −0.002], [0.258, −0.002, 0.402]]

Solution

T² = 20 [4.640 − 4, 45.4 − 50, 9.965 − 10] S^{-1} [4.640 − 4, 45.4 − 50, 9.965 − 10]' = 9.74
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 133 / 261


Inference About a mean vector Plausible Values for a Normal Population Mean

Hotelling’s T 2

Comparing the observed T² = 9.74 with the critical value

((n − 1)p/(n − p)) F_{p,n−p}(0.10) = (19(3)/17) F_{3,17}(0.10) = 3.353(2.44) = 8.18,

we reject H0 if one or more of the mean components, or some combination of means, differs too much from the hypothesized values. We see that T² = 9.74 > 8.18 and consequently reject H0 at the 0.10 level of significance.
Note:
To apply this test we must assume the sweat data are multivariate normal; this can be checked by studying Q–Q plots of X1, X2, X3 and by ensuring that scatter plots for pairs of observations have approximately elliptical shapes.
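A hedged R sketch of the T² test of Example 5.8, using the reported summary statistics:

n <- 20; p <- 3
xbar <- c(4.640, 45.4, 9.965)
mu0  <- c(4, 50, 10)
S    <- matrix(c( 2.879,  10.010, -1.810,
                 10.010, 199.788, -5.640,
                 -1.810,  -5.640,  3.628), nrow = 3, byrow = TRUE)

T2   <- n * t(xbar - mu0) %*% solve(S) %*% (xbar - mu0)    # about 9.74
crit <- (n - 1) * p / (n - p) * qf(0.90, p, n - p)         # about 8.18 for alpha = 0.10
c(T2 = T2, critical = crit)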

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 134 / 261


Inference About a mean vector Plausible Values for a Normal Population Mean

Hotelling’s T 2

Property
One feature of the T² statistic is that it is invariant under changes in the units of measurement for X of the form
Y = CX + d,
where Y is p × 1, C is a nonsingular p × p matrix and d is p × 1.
Proof
Given yi = Cxi + d, i = 1, . . . , n, we have ȳ = Cx̄ + d and

Sy = (1/(n − 1)) Σ_{i=1}^n (yi − ȳ)(yi − ȳ)'
   = (1/(n − 1)) Σ_{i=1}^n (Cxi + d − Cx̄ − d)(Cxi + d − Cx̄ − d)'
   = (1/(n − 1)) Σ_{i=1}^n C(xi − x̄)(xi − x̄)' C' = CSC'
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 135 / 261
Inference About a mean vector Plausible Values for a Normal Population Mean

Hotelling’s T 2

Also, µy = E(Y) = E(CX + d) = Cµ + d. Therefore, T² computed with the y's and the hypothesized value µ_{y,0} = Cµ0 + d is

T² = n(ȳ − µ_{y,0})' Sy^{-1} (ȳ − µ_{y,0})
   = n(C(x̄ − µ0))' (CSC')^{-1} (C(x̄ − µ0))
   = n(x̄ − µ0)' C' (C')^{-1} S^{-1} C^{-1} C (x̄ − µ0)
   = n(x̄ − µ0)' S^{-1} (x̄ − µ0),

which is T² computed with the x's.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 136 / 261


Inference About a mean vector Hotelling’s T 2 and Likelihood Ratio Tests

Likelihood Ratio Tests

 An alternative and, for the multivariate normal, particularly convenient way to test the hypothesis H0 : µ = µ0 is the likelihood ratio test.
 Likelihood ratio tests have several optimal properties for large samples.
 To determine whether µ0 is a plausible value of µ, the maximum of L(µ0, Σ) is compared with the unrestricted maximum of L(µ, Σ).
 The resulting ratio is called the likelihood ratio statistic:

Λ = max_Σ L(µ0, Σ) / max_{µ,Σ} L(µ, Σ) = ( |Σ̂| / |Σ̂0| )^{n/2}
  = ( |Σ_{i=1}^n (xi − x̄)(xi − x̄)'| / |Σ_{i=1}^n (xi − µ0)(xi − µ0)'| )^{n/2}

 The equivalent statistic Λ^{2/n} = |Σ̂| / |Σ̂0| is called Wilk's lambda.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 137 / 261
Inference About a mean vector Hotelling’s T 2 and Likelihood Ratio Tests

Likelihood Ratio Tests

 If the observed value of this likelihood ratio is too small, the hypothesis H0 : µ = µ0 is unlikely to be true and is therefore rejected.
 Specifically, the likelihood ratio test of H0 : µ = µ0 against H1 : µ ≠ µ0 rejects H0 if Λ < cα, where cα is the lower (100α)th percentile of the distribution of Λ.
 Test procedure
◦ For small samples: determine the sampling distribution of Λ and choose cα accordingly.
◦ Asymptotic theory: the sampling distribution of −2 ln Λ is asymptotically χ² with degrees of freedom equal to dim(Θ) − dim(Θ0), where Θ is the parameter space and Θ0 the parameter space under H0.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 138 / 261


Inference About a mean vector Hotelling’s T 2 and Likelihood Ratio Tests

Hotelling’s T 2 and Likelihood Ratio Tests

Result 5.1
Let X1, X2, . . . , Xn be a random sample from an Np(µ, Σ) population. Then the test based on T² is equivalent to the likelihood ratio test of H0 : µ = µ0 versus H1 : µ ≠ µ0 because

Λ^{2/n} = ( 1 + T²/(n − 1) )^{-1} = |Σ̂| / |Σ̂0|

H0 is rejected for small values of Λ^{2/n} or, equivalently, for large values of T².

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 139 / 261


Inference About a mean vector Hotelling’s T 2 and Likelihood Ratio Tests

Optimality Properties

 Likelihood ratio tests are known to have certain optimality properties. For testing H0 : θ = θ0 versus H1 : θ = θ1 (a simple alternative), the likelihood ratio test has the highest power among all tests of a given significance level α.
 Recall
P(test rejects | H0) = α,   P(test rejects | H1) = 1 − β
 In the multivariate setting, likelihood ratio tests have asymptotically (for large samples) highest power.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 140 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Confidence region

 To make inferences from a sample, we can extend the concept of a univariate confidence interval to a multivariate confidence region. Given data X = (X1, X2, . . . , Xn)', denote a confidence region by R(X) ⊂ R^p.
 The region R(X) is called a 100(1 − α)% confidence region if the probability that it covers the true parameter θ is 1 − α. We call α the level of significance. Formally,
P(θ ∈ R(X)) = 1 − α
 The confidence region for the mean µ of a p-dimensional normal population is given by
P( n(X̄ − µ)' S^{-1} (X̄ − µ) ≤ ((n − 1)p/(n − p)) F_{p,n−p}(α) ) = 1 − α

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 141 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Confidence region

 For a given sample, we can compute x̄ and S to obtain the confidence region
n(x̄ − µ)' S^{-1} (x̄ − µ) ≤ ((n − 1)p/(n − p)) F_{p,n−p}(α)
 To determine whether a given µ0 lies within the confidence region, compute the generalized (Mahalanobis-type) distance between x̄ and µ0,
n(x̄ − µ0)' S^{-1} (x̄ − µ0).
 The vector µ0 lies within R(x̄) if this distance is no larger than ((n − 1)p/(n − p)) F_{p,n−p}(α).
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 142 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Example
 
Consider the following: n = 42,

x̄ = [0.564, 0.603]',   S = [[0.0144, 0.0117], [0.0117, 0.0146]],   S^{-1} = [[203.018, −163.391], [−163.391, 200.228]]

The 95% confidence region for µ consists of all values (µ1, µ2) satisfying

42 [0.564 − µ1, 0.603 − µ2] [[203.018, −163.391], [−163.391, 200.228]] [0.564 − µ1, 0.603 − µ2]' ≤ (2(41)/40) F_{2,40}(0.05)

Since F_{2,40}(0.05) = 3.23,

42(203.018)(0.564 − µ1)² + 42(200.228)(0.603 − µ2)² − 84(163.391)(0.564 − µ1)(0.603 − µ2) ≤ 6.62
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 143 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Example

To see whether µ0 = [0.562, 0.589]' is in the confidence region, we compute

42(203.018)(0.564 − 0.562)² + 42(200.228)(0.603 − 0.589)² − 84(163.391)(0.564 − 0.562)(0.603 − 0.589) = 1.30 ≤ 6.62

We conclude that µ0 = [0.562, 0.589]' is in the region.

Equivalently, a test of H0 : µ = [0.562, 0.589]' would not be rejected in favor of H1 : µ ≠ [0.562, 0.589]' at the α = 0.05 level of significance.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 144 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Simultaneous Confidence Statement

 Let X have an Np(µ, Σ) distribution. Then a linear combination Z = ℓ'X has a normal distribution with mean µz = ℓ'µ and variance σz² = ℓ'Σℓ. The sample mean z̄ and variance sz² are computed accordingly.

Result 5.2 (T²-interval, Scheffé's interval)
Let X1, X2, . . . , Xn be a random sample from an Np(µ, Σ) population with Σ positive definite. Then the probability that the interval

[ ℓ'x̄ − √( ((n − 1)p/(n − p)) F_{p,n−p}(α) ℓ'Sℓ/n ),  ℓ'x̄ + √( ((n − 1)p/(n − p)) F_{p,n−p}(α) ℓ'Sℓ/n ) ]

contains ℓ'µ simultaneously for all ℓ is equal to 1 − α.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 145 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Simultaneous Confidence Statement

Proof of Result

T² = n(x̄ − µ)' S^{-1} (x̄ − µ) ≤ c²  ⇔  n(ℓ'x̄ − ℓ'µ)²/(ℓ'Sℓ) ≤ c²  for all ℓ,

i.e.

ℓ'x̄ − c √(ℓ'Sℓ/n) ≤ ℓ'µ ≤ ℓ'x̄ + c √(ℓ'Sℓ/n)

Thus, the only probability we need to control is P(T² ≤ c²). We know from the distribution of T² that P(T² ≤ c²) = 1 − α is guaranteed by choosing c² = (p(n − 1)/(n − p)) F_{p,n−p}(α).
 In particular, a confidence interval for the jth mean µj is obtained by choosing ℓ = (0, . . . , 0, 1, 0, . . . , 0)':

x̄j − √( (p(n − 1)/(n − p)) F_{p,n−p}(α) ) √(sjj/n) ≤ µj ≤ x̄j + √( (p(n − 1)/(n − p)) F_{p,n−p}(α) ) √(sjj/n)
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 146 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Simultaneous Confidence Statement

 We may add as many intervals as we want without changing the confidence level (1 − α). For example, intervals for a contrast of two means:

(x̄j − x̄k) − √( (p(n − 1)/(n − p)) F_{p,n−p}(α) ) √((sjj − 2sjk + skk)/n) ≤ µj − µk ≤ (x̄j − x̄k) + √( (p(n − 1)/(n − p)) F_{p,n−p}(α) ) √((sjj − 2sjk + skk)/n)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 147 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Simultaneous Confidence Statement

Example [College Data]
Consider a data set with n = 87,

x̄ = [526.59, 54.69, 25.13]',   S = [[5691.34, 600.51, 217.25], [600.51, 126.05, 23.37], [217.25, 23.37, 23.11]]

a. Compute the 95 percent simultaneous confidence intervals for µ1, µ2 and µ3.
b. Construct a 95 percent confidence interval for µ2 − µ3.
c. Compute the 95 percent confidence region (confidence ellipse) for the pair (µ2, µ3).

Solution
(a) (p(n − 1)/(n − p)) F_{p,n−p}(α) = (3(87 − 1)/(87 − 3)) F_{3,84}(0.05) = (3(86)/84)(2.7) = 8.29
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 148 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Simultaneous Confidence Statement

526.59 − √8.29 √(5691.34/87) ≤ µ1 ≤ 526.59 + √8.29 √(5691.34/87),   i.e.   503.30 ≤ µ1 ≤ 549.88
54.69 − √8.29 √(126.05/87) ≤ µ2 ≤ 54.69 + √8.29 √(126.05/87),       i.e.   51.22 ≤ µ2 ≤ 58.16
25.13 − √8.29 √(23.11/87) ≤ µ3 ≤ 25.13 + √8.29 √(23.11/87),         i.e.   23.65 ≤ µ3 ≤ 26.61

(b) With ℓ' = [0, 1, −1], the interval for µ2 − µ3 has end points

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 149 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Simultaneous Confidence Statement

(x̄2 − x̄3) ± √( (p(n − 1)/(n − p)) F_{p,n−p}(0.05) ) √((s22 + s33 − 2 s23)/n)

(54.69 − 25.13) ± √8.29 √((126.05 + 23.11 − 2(23.37))/87) = 29.56 ± 3.12,

so [26.44, 32.68] is a 95% confidence interval for µ2 − µ3.

(c) 87 [54.69 − µ2, 25.13 − µ3] [[126.05, 23.37], [23.37, 23.11]]^{-1} [54.69 − µ2, 25.13 − µ3]'
  = 0.849(54.69 − µ2)² + 4.633(25.13 − µ3)² − 2(0.859)(54.69 − µ2)(25.13 − µ3) ≤ 8.29
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 150 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Bonferroni Intervals

 Consider a set of p confidence intervals for the mean components. Consider a partitioning of the overall significance level:
α = α1 + α2 + · · · + αp
 An obvious choice is αj = α/p.
 The confidence intervals become
x̄j − t_{n−1}(α/(2p)) √(sjj/n) ≤ µj ≤ x̄j + t_{n−1}(α/(2p)) √(sjj/n)
 Bonferroni and T² intervals both have the structure
ℓ'x̄ − (critical value) √(ℓ'Sℓ/n) ≤ ℓ'µ ≤ ℓ'x̄ + (critical value) √(ℓ'Sℓ/n)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 151 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Bonferroni Intervals

 The ratio of their lengths, for αj = α/m, is

t_{n−1}(α/(2m)) / √( (p(n − 1)/(n − p)) F_{p,n−p}(α) ),

which does not depend on the random quantities x̄ and S.
 For a small number of comparisons (e.g. m = p), the Bonferroni intervals are shorter (more precise). Because of this, intervals are often based on the Bonferroni method.
Example [College Data]
Bonferroni confidence intervals for µ1, µ2 and µ3, with αi = 0.05/3, i = 1, 2, 3, and n = 87:

x̄j ± t_{86}(0.05/(2 × 3)) √(sjj/n),   j = 1, 2, 3
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 152 / 261
Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Bonferroni Intervals

526.59 ± 2.44 √(5691.34/87)   or   507.77 ≤ µ1 ≤ 545.41
54.69 ± 2.44 √(126.05/87)     or   51.89 ≤ µ2 ≤ 57.49
25.13 ± 2.44 √(23.11/87)      or   23.93 ≤ µ3 ≤ 26.33

Observe that these intervals are smaller than those constructed for µ1, µ2, µ3 using T² (the T² intervals).
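The two multipliers can be compared directly in R; a small sketch for the college data (p = 3, n = 87):

n <- 87; p <- 3; alpha <- 0.05

t2.mult   <- sqrt((n - 1) * p / (n - p) * qf(1 - alpha, p, n - p))  # T^2 (Scheffe) multiplier
bonf.mult <- qt(1 - alpha / (2 * p), df = n - 1)                    # Bonferroni multiplier

c(T2 = t2.mult, Bonferroni = bonf.mult, ratio = bonf.mult / t2.mult)
## the Bonferroni multiplier is smaller, so its intervals are shorter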

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 153 / 261


Inference About a mean vector Confidence Region and Simultaneous Comparisons of Component Means

Exercise

 
For the example with n = 42, x̄ = [0.564, 0.603]' and

S = [[0.0144, 0.0117], [0.0117, 0.0146]],   S^{-1} = [[203.018, −163.391], [−163.391, 200.228]]

i. Conduct a test of the null hypothesis H0 : µ' = [0.55, 0.60] at the α = 0.05 level of significance.
ii. Find the simultaneous 95 percent T² confidence intervals for µ1 and µ2.
iii. Construct the 95 percent Bonferroni intervals for µ1 and µ2.
iv. Compare the two sets of intervals in (ii) and (iii).
v. Construct the 95 percent T² confidence interval for µ1 − µ2.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 154 / 261


Inference About a mean vector Large-Sample Inference About a Population Mean Vector

Large sample inference

 All large-sample inferences about µ are based on a χ²-distribution.
 We know that
(X̄ − µ)' ((1/n) S)^{-1} (X̄ − µ) = n(X̄ − µ)' S^{-1} (X̄ − µ)
is approximately χ² with p df and thus
P[ n(X̄ − µ)' S^{-1} (X̄ − µ) ≤ χ²_p(α) ] = 1 − α,
where χ²_p(α) is the upper (100α)th percentile of the χ²_p distribution.
 Hypothesis testing: let X1, X2, . . . , Xn be a random sample from a population with mean µ and positive definite covariance matrix Σ.
 When n − p is large, the hypothesis H0 : µ = µ0 is rejected in favor of H1 : µ ≠ µ0, at a level of significance approximately α,
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 155 / 261
Inference About a mean vector Large-Sample Inference About a Population Mean Vector

Large sample inference

 if the observed
n(x̄ − µ0)' S^{-1} (x̄ − µ0) > χ²_p(α),
where χ²_p(α) is the upper (100α)th percentile of a chi-square distribution with p df.
 Note: this test statistic yields essentially the same result as T² in situations where the χ²-test is appropriate. This follows from the fact that ((n − 1)p/(n − p)) F_{p,n−p}(α) and χ²_p(α) are approximately equal for n large relative to p.
 Confidence statement: let X1, X2, . . . , Xn be a random sample from a population with mean µ and positive definite covariance Σ. If n − p is large,
ℓ'X̄ ± √(χ²_p(α)) √(ℓ'Sℓ/n)
will contain ℓ'µ, for every ℓ, with probability approximately 1 − α.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 156 / 261
Inference About a mean vector Large-Sample Inference About a Population Mean Vector

Large sample inference

 Consequently, we can make 100(1 − α)% simultaneous confidence statements:
x̄j ± √(χ²_p(α)) √(sjj/n)   contains µj,
 and for all pairs (µj, µk), j, k = 1, 2, . . . , p, the ellipse
n [x̄j − µj, x̄k − µk] [[sjj, sjk], [sjk, skk]]^{-1} [x̄j − µj, x̄k − µk]' ≤ χ²_p(α)
contains (µj, µk).

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 157 / 261


Comparison of Several Multivariate Means Paired Comparisons

Paired comparisons

 Let Xijt be the outcome for unit i on the jth variable under treatment (experimental condition) t = 1, 2; j = 1, . . . , p; i = 1, 2, . . . , n.
 Then define Dij = Xij1 − Xij2. Let Di be the difference vector for the ith unit/pair: Di = (Di1, Di2, . . . , Dip)'.
 Denote E(Di) = δ and cov(Di) = Σd. Again, a T² statistic plays a central role:
T² = n(D̄ − δ)' Sd^{-1} (D̄ − δ).
 Assume Di ∼ Np(δ, Σd). Then (for a test of the hypothesis H0 : δ = 0 against H1 : δ ≠ 0)
T² ∼ ((n − 1)p/(n − p)) F_{p,n−p}
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 158 / 261
Comparison of Several Multivariate Means Paired Comparisons

Paired comparisons

 Furthermore, if n − p is large, then T² is approximately χ²_p regardless of the distribution of Di.
 Test / confidence region: reject H0 : δ = 0 if T² = n d̄' Sd^{-1} d̄ > ((n − 1)p/(n − p)) F_{p,n−p}(α); the confidence region for δ follows by inverting this inequality.
 T² (Scheffé) intervals:
d̄j − √( ((n − 1)p/(n − p)) F_{p,n−p}(α) ) √(s²_{dj}/n) ≤ δj ≤ d̄j + √( ((n − 1)p/(n − p)) F_{p,n−p}(α) ) √(s²_{dj}/n)
 Bonferroni intervals (g = number of intervals considered):
d̄j − t_{n−1}(α/(2g)) √(s²_{dj}/n) ≤ δj ≤ d̄j + t_{n−1}(α/(2g)) √(s²_{dj}/n)
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 159 / 261


Comparison of Several Multivariate Means Paired Comparisons

 Example: Measurements of biochemical oxygen demand (BOD) and suspended solids (SS) were obtained for n = 11 sample splits from two laboratories. The data are displayed in the table below.

            Commercial Lab              State Lab of Hygiene
Sample i   Xi11 (BOD)   Xi21 (SS)   Xi12 (BOD)   Xi22 (SS)
1 6 27 25 15
2 6 23 28 13
3 18 64 36 22
4 8 44 35 29
5 11 30 15 31
6 34 75 44 64
7 28 26 42 30
8 71 124 54 64
9 43 54 34 56
10 33 30 29 20
11 20 14 39 21

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 160 / 261


Comparison of Several Multivariate Means Paired Comparisons

Example

 Do the two laboratories chemical analysis agree? if differences exist ,


what is their nature.
 The T 2 statistics for testing H0 : δ 0 = [δ1 , δ2 ] = [0, 0] is constructed from
the differences of paired observations:

di1 = Xi11 − Xi12 -19 -22 -18 -27 -4 -10 -14 17 9 4 -19
di2 = Xi21 − Xi22 12 10 42 15 -1 11 -4 60 -2 10 -7

Here      
−9.36
d̄1 199.26 88.38
d̄ = = , Sd =
13.27
d̄2 88.38 418.61
  
2 199.26 88.38 −9.36
T = 11[−9.36, 13.27] = 13.6
88.38 418.61 13.27

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 161 / 261


Comparison of Several Multivariate Means Paired Comparisons

Example

 Taking α = 0.05, we find that
((n − 1)p/(n − p)) F_{p,n−p}(0.05) = (2(10)/9) F_{2,9}(0.05) = 9.47
 Since T² = 13.6 > 9.47, we reject H0 and conclude that there is a nonzero mean difference between the measurements of the two laboratories.
 By inspection, it appears that the commercial laboratory tends to produce lower BOD measurements and higher SS measurements than the State Laboratory of Hygiene.
 The 95% simultaneous confidence intervals for the mean differences δ1 and δ2 are

δj : d̄j ± √( ((n − 1)p/(n − p)) F_{p,n−p}(α) ) √(s²_{dj}/n)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 162 / 261


Comparison of Several Multivariate Means Paired Comparisons

Example


δ1 : −9.36 ± √9.47 √(199.26/11)   or   (−22.46, 3.74)
δ2 : 13.27 ± √9.47 √(418.61/11)   or   (−5.71, 32.25)

 Both 95% simultaneous confidence intervals include zero, yet the hypothesis H0 : δ = 0 was rejected at level α = 0.05. What are we to conclude?
 The evidence points towards real differences: the point δ = 0 falls outside the 95% confidence region for δ.
 Note that our analysis assumed a normal distribution for Di.
Exercise: Construct the 95% confidence region for δ and show that δ' = (0, 0) falls outside it. Find the Bonferroni simultaneous intervals. Do they cover zero?
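A hedged R sketch of the paired analysis, using the differences listed above:

d1 <- c(-19, -22, -18, -27, -4, -10, -14, 17, 9, 4, -19)   # BOD differences
d2 <- c( 12,  10,  42,  15, -1,  11,  -4, 60, -2, 10, -7)   # SS differences
D  <- cbind(d1, d2)
n  <- nrow(D); p <- ncol(D)

dbar <- colMeans(D)
Sd   <- cov(D)
T2   <- n * t(dbar) %*% solve(Sd) %*% dbar                  # about 13.6
crit <- (n - 1) * p / (n - p) * qf(0.95, p, n - p)          # about 9.47
c(T2 = T2, critical = crit)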
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 163 / 261
Comparison of Several Multivariate Means Repeated Measures Design

Repeated Measures

 We assume that each subject is measured q times in a row. Each measurement corresponds to a specific experimental condition.
 In doing so, there is no "between-subject" variability in the treatment comparisons, reducing the required sample size and simplifying the statistical analysis. The data for the ith subject form a q-dimensional vector
Xi = (Xi1, . . . , Xiq)'.
 Most often, not the mean vector µ itself but linear functions of its components, Cµ, are of interest. We call C a contrast matrix.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 164 / 261


Comparison of Several Multivariate Means Repeated Measures Design

Repeated measures

Examples are:
 Baseline comparison

[µ1 − µ2, µ1 − µ3, . . . , µ1 − µq]' = [[1, −1, 0, . . . , 0], [1, 0, −1, . . . , 0], . . . , [1, 0, 0, . . . , −1]] [µ1, µ2, . . . , µq]'

 Successive differences

[µ2 − µ1, µ3 − µ2, . . . , µq − µq−1]' = [[−1, 1, 0, . . . , 0, 0], [0, −1, 1, . . . , 0, 0], . . . , [0, 0, 0, . . . , −1, 1]] [µ1, µ2, . . . , µq]'

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 165 / 261


Comparison of Several Multivariate Means Repeated Measures Design

Test for Equality of Treatments

 Any contrast matrix C with q − 1 linearly independent rows may be used.
 It follows that the T² statistic
T² = n(Cx̄ − Cµ)' (CSC')^{-1} (Cx̄ − Cµ)
is independent of the choice of C.
 Let Xi ∼ Nq(µ, Σ) and let C be a contrast matrix. H0 : Cµ = 0 is rejected in favor of H1 : Cµ ≠ 0 at the α-level if
T² > ((n − 1)(q − 1)/(n − q + 1)) F_{q−1,n−q+1}(α)
 Confidence region:
n(Cx̄ − Cµ)' (CSC')^{-1} (Cx̄ − Cµ) ≤ ((n − 1)(q − 1)/(n − q + 1)) F_{q−1,n−q+1}(α)
 Simultaneous T² and Bonferroni intervals follow immediately.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 166 / 261
Comparison of Several Multivariate Means Repeated Measures Design

Question

A researcher considered 3 indices measuring the severity of heart attacks. The values of these indices for n = 40 heart-attack patients arriving at a hospital emergency room produced the summary statistics

x̄' = [46.1, 57.3, 50.4]   and   S = [[101.3, 63.0, 71.0], [63.0, 80.2, 55.6], [71.0, 55.6, 97.4]]

(a) All three indices are evaluated for each patient. Test for the equality of mean indices with α = 0.05.
(b) Judge the differences in pairs of mean indices using 95% simultaneous confidence intervals.
Note: c'x̄ ± √( ((n − 1)(q − 1)/(n − q + 1)) F_{q−1,n−q+1}(α) ) √(c'Sc/n), where c' is a row of the contrast matrix.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 167 / 261


Comparison of Several Multivariate Means Comparing Mean Vectors from Two Populations

Compare mean vectors from two populations

 Now we have 2 independent sets of responses: n1 subjects under experimental condition 1 and n2 subjects under experimental condition 2.
 The data setting is as follows. Let
◦ Xℓ1, Xℓ2, . . . , Xℓnℓ be a random sample from population ℓ (ℓ = 1, 2), which has mean µℓ and covariance Σℓ.
◦ The two samples are independent.
◦ The observed values are xℓ1, . . . , xℓnℓ with sample mean x̄ℓ and sample covariance Sℓ.
◦ (small samples only): both populations are multivariate normal.
◦ (small samples only): Σ1 = Σ2.
 Questions of interest:
◦ Is µ1 = µ2?
◦ Statements about the components of the mean vectors.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 168 / 261


Comparison of Several Multivariate Means Comparing Mean Vectors from Two Populations

Common Variance

 When the covariances are equal (Σ1 = Σ2 = Σ), it is common practice to construct a "pooled" estimate of covariance:

S_pooled = [ (n1 − 1)S1 + (n2 − 1)S2 ] / (n1 + n2 − 2)
         = [ Σ_{i=1}^{n1} (x1i − x̄1)(x1i − x̄1)' + Σ_{i=1}^{n2} (x2i − x̄2)(x2i − x̄2)' ] / (n1 + n2 − 2)

 Why do we have the factor n1 + n2 − 2 in the denominator?
◦ we know that when adding Wishart variables, the degrees of freedom also add
◦ there is likelihood support for this procedure
 An important hypothesis is H0 : µ1 − µ2 = δ0. Obviously,
E(X̄1 − X̄2) = µ1 − µ2
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 169 / 261


Comparison of Several Multivariate Means Comparing Mean Vectors from Two Populations

Common Variance

 The samples are independent
⇒ X̄1 and X̄2 are independent
⇒ cov(X̄1, X̄2) = 0

cov(X̄1 − X̄2) = cov(X̄1) + cov(X̄2) = (1/n1) Σ1 + (1/n2) Σ2 = (1/n1 + 1/n2) Σ,

which is estimated by (1/n1 + 1/n2) S_pooled.
 Now the confidence region follows from

T² = (x̄1 − x̄2 − δ0)' [ (1/n1 + 1/n2) S_pooled ]^{-1} (x̄1 − x̄2 − δ0) > c²

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 170 / 261


Comparison of Several Multivariate Means Comparing Mean Vectors from Two Populations

Common Variance

where c² follows from the distribution of T², which gives

c² = ((n1 + n2 − 2)p/(n1 + n2 − p − 1)) F_{p,n1+n2−p−1}(α)

 Simultaneous confidence intervals for all ℓ'(µ1 − µ2) are

ℓ'(X̄1 − X̄2) ± c √( ℓ' (1/n1 + 1/n2) S_pooled ℓ )

 In particular, a simultaneous confidence interval for µ1j − µ2j is

(X̄1j − X̄2j) ± c √( (1/n1 + 1/n2) s_{jj,pooled} )

 The Bonferroni confidence interval for µ1j − µ2j, when we consider the p population mean components, is

(X̄1j − X̄2j) ± t_{n1+n2−2}(α/(2p)) √( (1/n1 + 1/n2) s_{jj,pooled} )
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 171 / 261
Comparison of Several Multivariate Means Comparing Mean Vectors from Two Populations

Two-sample situation when Σ1 6= Σ2

 Problem if samples are small.
 Large sample results are available.
 Let n1 − p and n2 − p be large. An approximate 100(1 − α)% confidence ellipsoid for µ1 − µ2 is

[(X̄1 − X̄2) − (µ1 − µ2)]' [ S1/n1 + S2/n2 ]^{-1} [(X̄1 − X̄2) − (µ1 − µ2)] ≤ χ²_p(α)

 Simultaneous confidence intervals for ℓ'(µ1 − µ2) are

ℓ'(X̄1 − X̄2) ± √(χ²_p(α)) √( ℓ' (S1/n1 + S2/n2) ℓ )
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 172 / 261


Comparison of Several Multivariate Means Comparing Mean Vectors from Two Populations

 Example: (Calculate confidence intervals for differences in mean components)
Consider the following: n1 = 45, n2 = 55,

x̄1 = [204.4, 556.6]',   S1 = [[13825.3, 23823.4], [23823.4, 73107.4]]
x̄2 = [130.0, 355.0]',   S2 = [[8632.0, 19616.7], [19616.7, 55964.5]]

Calculate 95% simultaneous confidence intervals for the population mean differences.
Solution

S_pooled = ((n1 − 1)/(n1 + n2 − 2)) S1 + ((n2 − 1)/(n1 + n2 − 2)) S2 = [[10963.7, 21505.5], [21505.5, 63661.3]]

and c² = ((n1 + n2 − 2)p/(n1 + n2 − p − 1)) F_{p,n1+n2−p−1}(0.05) = (98(2)/97) F_{2,97}(0.05) = (2.02)(3.1) = 6.26


Comparison of Several Multivariate Means Comparing Mean Vectors from Two Populations

Example

µ1' − µ2' = [µ11 − µ21, µ12 − µ22]. The 95% simultaneous confidence intervals for the population differences are:

µ11 − µ21 : (204.4 − 130.0) ± √6.26 √( (1/45 + 1/55) 10963.7 ),   i.e.   21.7 ≤ µ11 − µ21 ≤ 127.1
µ12 − µ22 : (556.6 − 355.0) ± √6.26 √( (1/45 + 1/55) 63661.3 ),   i.e.   74.7 ≤ µ12 − µ22 ≤ 328.5

Exercise
Construct the 95% confidence ellipse for µ1 − µ2. Does it cover 0' = [0, 0]? [Or: will the T² statistic reject H0 : µ1 − µ2 = 0 at the 5% level?]
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 174 / 261
Comparison of Several Multivariate Means Comparing Mean Vectors from Two Populations

Example

Analyze the above example using the large-sample approach.

(1/n1) S1 + (1/n2) S2 = (1/45) [[13825.3, 23823.4], [23823.4, 73107.4]] + (1/55) [[8632.0, 19616.7], [19616.7, 55964.5]] = [[464.17, 886.08], [886.08, 2642.15]]

The 95% simultaneous confidence intervals for the linear combinations

µ11 − µ21 = (1, 0) [µ11 − µ21, µ12 − µ22]'   and   µ12 − µ22 = (0, 1) [µ11 − µ21, µ12 − µ22]'

are

µ11 − µ21 : 74.4 ± √5.99 √464.17   or   (21.7, 127.1)
µ12 − µ22 : 201.6 ± √5.99 √2642.15   or   (75.8, 327.4)
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 175 / 261


Comparison of Several Multivariate Means Comparing Mean Vectors from Two Populations

Example

The T² statistic for testing H0 : µ1 − µ2 = 0 is

T² = (x̄1 − x̄2)' [ (1/n1) S1 + (1/n2) S2 ]^{-1} (x̄1 − x̄2)
   = [204.4 − 130.0, 556.6 − 355.0] [[464.17, 886.08], [886.08, 2642.15]]^{-1} [204.4 − 130.0, 556.6 − 355.0]'
   = 15.66

For α = 0.05, the critical value is χ²_2(0.05) = 5.99 and since T² = 15.66 > 5.99, we reject H0.
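A short R sketch reproducing both versions of the two-sample test from the summary statistics (pooled small-sample critical value and the large-sample chi-square):

n1 <- 45; n2 <- 55; p <- 2; alpha <- 0.05
xbar1 <- c(204.4, 556.6); xbar2 <- c(130.0, 355.0)
S1 <- matrix(c(13825.3, 23823.4, 23823.4, 73107.4), 2)
S2 <- matrix(c( 8632.0, 19616.7, 19616.7, 55964.5), 2)

Sp <- ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
d  <- xbar1 - xbar2

T2.pooled <- t(d) %*% solve((1/n1 + 1/n2) * Sp) %*% d
c2        <- (n1 + n2 - 2) * p / (n1 + n2 - p - 1) * qf(1 - alpha, p, n1 + n2 - p - 1)

T2.large <- t(d) %*% solve(S1/n1 + S2/n2) %*% d          # about 15.66
c(T2.pooled = T2.pooled, c2 = c2, T2.large = T2.large, chisq = qchisq(1 - alpha, p))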

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 176 / 261


Comparison of Several Multivariate Means One-Way MANOVA

Comparison of g populations

 Let sample `(` = 1, 2, . . . , g) be X`1 , X`2 , . . . , X`n`


 Our interest are
◦ µ1 = µ2 = · · · = µg ?
◦ statements about components
 The data setting is as follows: Let
◦ X`1 , X`2 , . . . , X`n` be a random sample from population `, with
mean µ` and common covariance Σ
◦ all samples are independent
◦ observed values x`1 , x`2 , . . . , x`n` with mean x¯` and covariance S`
◦ (small samples) each population is multivariate normal

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 177 / 261


Comparison of Several Multivariate Means One-Way MANOVA

Multivariate Analysis of Variance

 MANOVA model for comparing g population mean vectors:
Xℓi = µ + τℓ + eℓi,   i = 1, 2, . . . , nℓ;   ℓ = 1, 2, . . . , g.
 We take eℓi ∼ Np(0, Σ) and E(Xℓi) = µ + τℓ, with µ = overall mean and τℓ = ℓth group (treatment) effect.
 The model is overspecified, so we impose the identifiability constraint Σ_{ℓ=1}^g nℓ τℓ = 0; that is, the group effects, weighted by the group sizes nℓ, add up to zero.
 The following terminology is sometimes useful:
◦ systematic part: µ + τℓ
◦ random part: eℓi
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 178 / 261


Comparison of Several Multivariate Means One-Way MANOVA

MANOVA

 The observations are decomposed as
xℓi = x̄ + (x̄ℓ − x̄) + (xℓi − x̄ℓ),
i.e.
observation = overall mean + treatment (group) effect + residual
 This implies
(xℓi − x̄)(xℓi − x̄)' = [(x̄ℓ − x̄) + (xℓi − x̄ℓ)][(x̄ℓ − x̄) + (xℓi − x̄ℓ)]'
= (x̄ℓ − x̄)(x̄ℓ − x̄)' + (x̄ℓ − x̄)(xℓi − x̄ℓ)' + (xℓi − x̄ℓ)(x̄ℓ − x̄)' + (xℓi − x̄ℓ)(xℓi − x̄ℓ)'

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 179 / 261


Comparison of Several Multivariate Means One-Way MANOVA

MANOVA

 Summing over i:

Σ_{i=1}^{nℓ} (xℓi − x̄)(xℓi − x̄)' = nℓ (x̄ℓ − x̄)(x̄ℓ − x̄)' + Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)',

since Σ_{i=1}^{nℓ} (xℓi − x̄ℓ) = 0, so the cross-product terms vanish.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 180 / 261


Comparison of Several Multivariate Means One-Way MANOVA

MANOVA

 Next, summing over ℓ:

Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄)(xℓi − x̄)' = Σ_{ℓ=1}^g nℓ (x̄ℓ − x̄)(x̄ℓ − x̄)' + Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)'

[Total (corrected) sum of squares and cross products] = [Treatment (between) sum of squares and cross products] + [Residual (within) sum of squares and cross products]

T = B + W
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 181 / 261
Comparison of Several Multivariate Means One-Way MANOVA

MANOVA

 The within matrix is only meaningful if Σ is constant over the samples. Indeed,

W = Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)' = (n1 − 1)S1 + (n2 − 1)S2 + · · · + (ng − 1)Sg,

which is reminiscent of S_pooled.
 The MANOVA table for testing τ1 = τ2 = · · · = τg = 0 has the same structure as the ANOVA table.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 182 / 261


Comparison of Several Multivariate Means One-Way MANOVA

How to construct Test Statistics

 Recall,

W = Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)'
B = Σ_{ℓ=1}^g nℓ (x̄ℓ − x̄)(x̄ℓ − x̄)'
B + W = Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄)(xℓi − x̄)'

 Remark:
◦ The "between" matrix B is often denoted by H, the "hypothesis" matrix.
◦ The "within" matrix W is often denoted by E, the "error" matrix.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 183 / 261
Comparison of Several Multivariate Means One-Way MANOVA

Construct Test Statistics

 The simplest way to characterize the information contained in p.s.d. matrices is via their eigenvalues.
 Consider the root equations

|W − λ(B + W)| = 0,   |B − θ(B + W)| = 0,   |B − φW| = 0

 These root equations give the eigenvalues of the matrices

W(B + W)^{-1},   B(B + W)^{-1},   W^{-1}B

 Indeed, |B − φW| = 0 ⇔ |W^{-1}B − φI| = 0.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 184 / 261


Comparison of Several Multivariate Means One-Way MANOVA

Construct Test Statistics

 Rephrasing the first and second root equations, we find
|(1 − λ)W − λB| = 0  ⇒  |B − ((1 − λ)/λ) W| = 0,
and
|(1 − θ)B − θW| = 0  ⇒  |B − (θ/(1 − θ)) W| = 0.
 Thus
φ = (1 − λ)/λ = θ/(1 − θ),   θ = φ/(1 + φ) = 1 − λ,   λ = 1/(1 + φ) = 1 − θ.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 185 / 261
Comparison of Several Multivariate Means One-Way MANOVA

Test Statistics

There are four statistics in common use.
 Wilk's Lambda:
Λ = |W| / |B + W| = det( W(B + W)^{-1} ) = Π_{j=1}^p λj = Π_{j=1}^p (1 − θj) = Π_{j=1}^p 1/(1 + φj)
 Pillai's Trace: tr( B(B + W)^{-1} ) = Σ_{j=1}^p θj
 Hotelling-Lawley Trace: tr( BW^{-1} ) = Σ_{j=1}^p φj = Σ_{j=1}^p θj/(1 − θj)
   Note that T² is defined slightly differently here.
 Roy's Greatest Root: the largest root of the equation |B − φW| = 0:
φmax = θmax/(1 − θmax)
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 186 / 261
Comparison of Several Multivariate Means One-Way MANOVA

MANOVA Table for Comparing Population Mean Vectors

Table: MANOVA Table

Source of Variation              Matrix of SS and CP   Degrees of Freedom     Test Statistic
Treatment                        B                     g − 1
Residual (Error)                 W                     Σ_{ℓ=1}^g nℓ − g       Λ* = |W| / |B + W|
Total (corrected for the mean)   B + W                 Σ_{ℓ=1}^g nℓ − 1

B = Σ_{ℓ=1}^g nℓ (x̄ℓ − x̄)(x̄ℓ − x̄)',   W = Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)'

 The quantity Λ* = |W| / |B + W| is convenient for testing
H0 : τ1 = τ2 = · · · = τg = 0
and is called Wilk's lambda.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 187 / 261
Comparison of Several Multivariate Means One-Way MANOVA

Exact distribution of Λ∗

The exact distribution of Λ* for special cases is listed below.

Table: Distribution of Wilk's Lambda, Λ* = |W| / |B + W|

No. of variables   No. of groups   Sampling distribution for multivariate normal data
p = 1              g ≥ 2           ((Σnℓ − g)/(g − 1)) (1 − Λ*)/Λ* ∼ F_{g−1, Σnℓ−g}
p = 2              g ≥ 2           ((Σnℓ − g − 1)/(g − 1)) (1 − √Λ*)/√Λ* ∼ F_{2(g−1), 2(Σnℓ−g−1)}
p ≥ 1              g = 2           ((Σnℓ − p − 1)/p) (1 − Λ*)/Λ* ∼ F_{p, Σnℓ−p−1}
p ≥ 2              g = 3           ((Σnℓ − p − 2)/p) (1 − √Λ*)/√Λ* ∼ F_{2p, 2(Σnℓ−p−2)}
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 188 / 261


Comparison of Several Multivariate Means One-Way MANOVA

Large sampling distribution of Λ∗

 If H0 is true and n = Σnℓ is large,
−(n − 1 − (p + g)/2) ln Λ*
has approximately a chi-square distribution with p(g − 1) df.
 Consequently, for large n, we reject H0 at level α if
−(n − 1 − (p + g)/2) ln Λ* ≥ χ²_{p(g−1)}(α),
where χ²_{p(g−1)}(α) is the upper (100α)th percentile of a chi-square distribution with p(g − 1) df.
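In R, the one-way MANOVA and the four test statistics above are available through manova(); a sketch on the built-in iris data, used purely as an illustration:

fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
              data = iris)

summary(fit, test = "Wilks")             # Wilk's lambda
summary(fit, test = "Pillai")            # Pillai's trace
summary(fit, test = "Hotelling-Lawley")  # Hotelling-Lawley trace
summary(fit, test = "Roy")               # Roy's greatest root
summary(fit)$SS                          # between (Species) and within (Residuals) SSCP matrices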

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 189 / 261


Comparison of Several Multivariate Means One-Way MANOVA

 Exercise: Observations on two responses are collected for two treatments. The observed vectors [x1, x2]' are

Treatment 1: [3, 3]', [1, 6]', [2, 3]'
Treatment 2: [2, 3]', [5, 1]', [3, 1]', [2, 3]'

(a) (i) Calculate S_pooled.
    (ii) Test H0 : µ1 − µ2 = 0 employing a two-sample approach with α = 0.01.
    (iii) Construct 99% simultaneous confidence intervals for the differences µ1i − µ2i, i = 1, 2.
(b) (i) Break up the observations into mean, treatment and residual components, using Xℓi = X̄ + (X̄ℓ − X̄) + (Xℓi − X̄ℓ).
    (ii) Construct the one-way MANOVA table.
    (iii) Evaluate Wilk's lambda Λ* and test for treatment effects with α = 0.01. Repeat the test using the chi-square approximation. Compare the conclusions.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 190 / 261
Comparison of Several Multivariate Means Profile Analysis

Profile analysis

 Let p measurements be made on all individuals in g groups. The measurements can refer to
◦ p different treatments, administered in a row
◦ p tests
◦ p questions in a survey
◦ p measurements of a characteristic, carried out at distinct time points (longitudinal data)
 Studying the profiles of the outcomes can yield valuable information about the data structure.
 For two groups, we have mean vectors
µ1 = (µ11, . . . , µ1p)',   µ2 = (µ21, . . . , µ2p)'

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 191 / 261


Comparison of Several Multivariate Means Profile Analysis

Profile analysis

 Questions that come intuitively to mind are:
(1) Are the profiles parallel?
H01 : µ1j − µ2j = µ1,j−1 − µ2,j−1,   2 ≤ j ≤ p
Equivalently,
H01 : µ1j − µ1,j−1 = µ2j − µ2,j−1,   2 ≤ j ≤ p
(2) Given that the profiles are parallel, are they also coincident?
H02 : µ1j = µ2j,   1 ≤ j ≤ p
(3) Given that the profiles are coincident, are they also level (horizontal)?
H03 : µ11 = · · · = µ1p = µ21 = · · · = µ2p

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 192 / 261


Comparison of Several Multivariate Means Profile Analysis

Profile analysis

 The null hypothesis in (1) can be written as H01 : Cµ1 = Cµ2, where C is the (p − 1) × p contrast matrix

C = [[−1, 1, 0, 0, . . . , 0, 0], [0, −1, 1, 0, . . . , 0, 0], . . . , [0, 0, 0, 0, . . . , −1, 1]]

 We don't need new methodology to carry out this test. We merely need to work with the transformed observations
Cx1i, i = 1, . . . , n1   and   Cx2i, i = 1, . . . , n2.
 The resulting observations have p − 1 components instead of p. Using these observations, the ordinary T² test applies immediately. In our case, this is a two-sample problem.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 193 / 261
Comparison of Several Multivariate Means Profile Analysis

Profile analysis

 Indeed,

CX1i ∼ N_{p−1}(Cµ1, CΣC'),   CX2i ∼ N_{p−1}(Cµ2, CΣC')

 The test statistic for parallel profiles becomes

T² = (Cx̄1 − Cx̄2)' [ (1/n1 + 1/n2) C S_pooled C' ]^{-1} (Cx̄1 − Cx̄2)

 Reject H01 (parallel profiles) at the α-level if T² > c², with
c² = ((n1 + n2 − 2)(p − 1)/(n1 + n2 − p)) F_{p−1,n1+n2−p}(α)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 194 / 261


Comparison of Several Multivariate Means Profile Analysis

Profile analysis

 When we fail to reject this hypothesis, we can conclude that parallel profiles are plausible. It is then reasonable to test for coincidence.
 We now look for a test of whether the distance between the two profiles is zero.
 Given that the profiles are parallel, the difference at any point (outcome) j is d = µ1j − µ2j, whence, if the profiles are coincident,
p d = Σ_j (µ1j − µ2j) = 0,
that is,
Σ_j µ1j = Σ_j µ2j

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 195 / 261


Comparison of Several Multivariate Means Profile Analysis

Profile analysis

 If we use the notation J = J_{p,1} for the column vector of p ones, then the above equality becomes
J'µ1 = J'µ2
 The vector J plays a role similar to that of C in the previous test. Now, however, we have a 'univariate' test. The numerator degrees of freedom for the F statistic is 1, and the test statistic is

T² = (J'x̄1 − J'x̄2)' [ (1/n1 + 1/n2) J' S_pooled J ]^{-1} (J'x̄1 − J'x̄2)
   = ( J'(x̄1 − x̄2) / √( (1/n1 + 1/n2) J' S_pooled J ) )²

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 196 / 261


Comparison of Several Multivariate Means Profile Analysis

Profile analysis

 Note that F_{1,n1+n2−2}(α) = t²_{n1+n2−2}(α/2).
 The next step, if this hypothesis is not rejected, is to look at level (horizontal) profiles. Given that the previous hypotheses are plausible, we have a single normal population with common mean vector µ and sample mean

x̄ = ( Σ_{i=1}^{n1} x1i + Σ_{i=1}^{n2} x2i ) / (n1 + n2) = (n1/(n1 + n2)) x̄1 + (n2/(n1 + n2)) x̄2

 The hypothesis now becomes µ1 = · · · = µp, which can be written as Cµ = 0, where C is exactly the same as for the first hypothesis.
 We derive the rejection criterion. The one-sample Hotelling statistic is n(x̄ − µ0)' S^{-1} (x̄ − µ0).
 In our case, we look at Cx̄ and Cµ0 = 0; thus we reject H03 if

(n1 + n2) x̄' C' (C S_pooled C')^{-1} C x̄ > ((n1 + n2 − 2)(p − 1)/(n1 + n2 − p)) F_{p−1,n1+n2−p}(α)
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 197 / 261
Comparison of Several Multivariate Means Profile Analysis

Profile analysis

Remark
 The hypothesis H01 can always be tested. It is the first hypothesis that usually comes to mind.
 The test for H02 as described is useless in the following cases:
◦ you are not interested in H01
◦ H01 is rejected.
Reason: H02 on its own is not the same as H02 given that H01 is true.
 A similar remark applies to H03 in relation to H01 and H02.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 198 / 261


Comparison of Several Multivariate Means Profile Analysis

Example

As part of a larger study of love and marriage, a sociologist surveyed adults with respect to their marriage "contributions" and "outcomes" and their level of "passionate" and "companionate" love. Recently married males and females were asked to respond to the following questions, using the 8-point scale in the table below.

Extremely negative  Very negative  Moderately negative  Slightly negative  Slightly positive  Moderately positive  Very positive  Extremely positive
1  2  3  4  5  6  7  8

(1) All things considered, how would you describe your contributions to the marriage?
(2) All things considered, how would you describe your outcomes from the marriage?
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 199 / 261


Comparison of Several Multivariate Means Profile Analysis

Example

Subjects were also asked to respond to the following questions, using the 5-point scale below.

None at all  Very little  Some  A great deal  Tremendous amount
1  2  3  4  5

(3) What is the level of passionate love that you feel for your partner?
(4) What is the level of companionate love that you feel for your partner?
Let
x1 = an 8-point scale response to Question 1
x2 = an 8-point scale response to Question 2
x3 = a 5-point scale response to Question 3
x4 = a 5-point scale response to Question 4
and the two populations be defined as
Population 1 = married men
Population 2 = married women
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 200 / 261
Comparison of Several Multivariate Means Profile Analysis

Example

Assuming a common covariance matrix Σ, it is of interest to see whether the profiles of males and females are the same. A sample of n1 = 30 males and n2 = 30 females gave the sample mean vectors

x̄1 = x̄m = (6.833, 7.033, 3.967, 4.700)′,   x̄2 = x̄f = (6.633, 7.000, 4.000, 4.533)′

and pooled covariance matrix

S_pooled = [ 0.606  0.262  0.066  0.161
             0.262  0.637  0.173  0.143
             0.066  0.173  0.810  0.029
             0.161  0.143  0.029  0.306 ]

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 201 / 261


Comparison of Several Multivariate Means Profile Analysis

Test for parallelism

To test for parallelism, H01: Cµ1 = Cµ2, we compute, with the contrast matrix

C = [ −1   1   0   0
       0  −1   1   0
       0   0  −1   1 ],

C S_pooled C′ = [  0.719  −0.268  −0.125
                  −0.268   1.101  −0.751
                  −0.125  −0.751   1.058 ]

and

C(x̄1 − x̄2) = C (0.200, 0.033, −0.033, 0.167)′ = (−0.167, −0.066, 0.200)′
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 202 / 261
Comparison of Several Multivariate Means Profile Analysis

Test for parallelism

Thus

T² = (−0.167, −0.066, 0.200) [ (1/30 + 1/30) C S_pooled C′ ]^{-1} (−0.167, −0.066, 0.200)′ = 15(0.067) = 1.005

With α = 0.05,

c² = [(30 + 30 − 2)(4 − 1)/(30 + 30 − 4)] F_{3,56}(0.05) = 3.11(2.8) = 8.7

Since T² = 1.005 < 8.7, we conclude that the hypothesis of parallel profiles for men and women is tenable.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 203 / 261
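A minimal R sketch of the parallelism test above, computed directly from the summary statistics on this slide (only base R is assumed; C is the contrast matrix of successive differences defined earlier):

# Parallelism test (H01) from summary statistics: a sketch in base R
n1 <- 30; n2 <- 30; p <- 4; alpha <- 0.05

Sp <- matrix(c(0.606, 0.262, 0.066, 0.161,
               0.262, 0.637, 0.173, 0.143,
               0.066, 0.173, 0.810, 0.029,
               0.161, 0.143, 0.029, 0.306), nrow = 4, byrow = TRUE)
xbar1 <- c(6.833, 7.033, 3.967, 4.700)   # males
xbar2 <- c(6.633, 7.000, 4.000, 4.533)   # females

# (p-1) x p contrast matrix of successive differences
C <- cbind(-diag(p - 1), 0) + cbind(0, diag(p - 1))

d  <- C %*% (xbar1 - xbar2)
T2 <- t(d) %*% solve((1/n1 + 1/n2) * C %*% Sp %*% t(C)) %*% d
c2 <- (n1 + n2 - 2) * (p - 1) / (n1 + n2 - p) * qf(1 - alpha, p - 1, n1 + n2 - p)
c(T2 = T2, c2 = c2)   # roughly 1.005 and 8.7; reject H01 if T2 > c2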
Comparison of Several Multivariate Means Profile Analysis

Test for Coincident Profiles

[Plot mean profiles for male and female]

Test for coincident profiles, H02: J′µ1 = J′µ2:

J′(x̄1 − x̄2) = (1, 1, 1, 1)(0.200, 0.033, −0.033, 0.167)′ = 0.367

J′ S_pooled J = (1, 1, 1, 1) S_pooled (1, 1, 1, 1)′ = 4.027

T² = [ 0.367 / √((1/30 + 1/30) 4.027) ]² = 0.501

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 204 / 261


Comparison of Several Multivariate Means Profile Analysis

Test for Coincident Profiles

With α = 0.05, F_{1,58}(0.05) = 4.0. Since T² = 0.501 < F_{1,58}(0.05) = 4.0, we cannot reject the hypothesis that the profiles are coincident. That is, the responses of the men and women to the four questions posed appear to be the same.
 We could now test for level profiles; however, it does not make sense to carry out this test, since Questions 1 and 2 were measured on a scale of 1–8, while Questions 3 and 4 were measured on a scale of 1–5.

 The incompatibility of these scales makes the test for level profiles meaningless.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 205 / 261
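Using the same summary statistics as above, a short base-R sketch of the coincidence test is:

# Coincidence test (H02) from summary statistics: a sketch in base R
n1 <- 30; n2 <- 30; alpha <- 0.05
Sp <- matrix(c(0.606, 0.262, 0.066, 0.161,
               0.262, 0.637, 0.173, 0.143,
               0.066, 0.173, 0.810, 0.029,
               0.161, 0.143, 0.029, 0.306), nrow = 4, byrow = TRUE)
dbar <- c(6.833, 7.033, 3.967, 4.700) - c(6.633, 7.000, 4.000, 4.533)
J <- rep(1, 4)

T2 <- (sum(J * dbar))^2 / ((1/n1 + 1/n2) * drop(t(J) %*% Sp %*% J))
crit <- qf(1 - alpha, 1, n1 + n2 - 2)
c(T2 = T2, crit = crit)   # roughly 0.501 and 4.0; reject H02 if T2 > crit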


Comparison of Several Multivariate Means Profile Analysis

Exercise

A project was designed to investigate how consumers in Accra would react to an electrical time-of-use pricing scheme. The cost of electricity during peak periods for some customers was set at eight times the cost of electricity during off-peak hours. Hourly consumption (in kilowatt-hours) was measured and compared with baseline consumption on a similar day before the experimental rates began. The responses, log(current consumption) − log(baseline consumption), for the hours ending 9am, 11am (a peak hour), 1pm and 3pm (a peak hour) produced the following summary statistics.

Test group:    n1 = 28, x̄1′ = (0.153, −0.231, −0.322, −0.339)
Control group: n2 = 58, x̄2′ = (0.151, 0.180, 0.256, 0.257)

and

S_pooled = [ 0.804  0.355  0.228  0.232
             0.355  0.722  0.233  0.199
             0.228  0.233  0.592  0.239
             0.232  0.199  0.239  0.479 ]

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 206 / 261


Comparison of Several Multivariate Means Profile Analysis

Exercise (cont.)

Perform a profile analysis. Does time-of-use pricing seem to make a difference in electrical consumption? What is the nature of this difference, if any? Comment. [Use α = 0.05 for any statistical tests.]

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 207 / 261


Principal Component Analysis

PCA

 Consider a data set with p response variables. The summary information for these data consists of the following pieces:
◦ p sample means
◦ p sample standard deviations; equivalently, p sample variances (ie, the diagonal of the covariance matrix)
◦ p(p−1)/2 sample covariances (ie, the off-diagonal elements of the covariance matrix); equivalently, the correlations could be computed from this covariance matrix.
 Principal components are used for several tasks:
◦ Data Reduction: When the data can be reduced from p to fewer variables, the manipulation of the data becomes easier. This is true when the data have to be used as input to other procedures such as a regression model.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 208 / 261


Principal Component Analysis

◦ Display of data: Of course, p > 3 variables cannot be plotted at once. Usually pairs of variables are plotted against each other. At best, fancy graphical systems allow rotating graphs of three variables.
◦ Interpretation: It is often hoped that the original p variables can be replaced by, for example, a smaller number k of new variables that have an intuitive interpretation. When the new variables are of greater scientific interest than the original variables, insight might be gained.
◦ Inference: The above tasks refer to manipulating a sample of n objects. Often, one wants to make statements about the population. Then, inferential tools have to be invoked. Of course, data reduction will always lead to information reduction or information loss.
 In summary, one wants to reduce p original variables to k new variables, where k is a compromise between:
◦ minimum number of variables
◦ maximum amount of information retained
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 209 / 261
Principal Component Analysis

PCA

 Do not
◦ overestimate the importance of principal component analysis.
◦ over-interpret principal component analysis.
 They can be useful:
◦ in the exploratory phase of a data analysis (to learn about the structure of the data)
◦ as input to other statistical procedures.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 210 / 261


Principal Component Analysis Population Principal Component

Population principal component


 Concepts can be developed both at the population level (ie based on the theoretical distribution) and at the sample level (ie based on the simple random sample to be studied).
 In virtually all cases, the two theories are dual to each other, although the assumptions involved and the generality of the conclusions can be different.
 Consider a random vector X = (X1, . . . , Xp)′ with covariance matrix Σ and correlation matrix ρ.
 We want to apply a linear transformation to X in order to obtain Y = (Y1, . . . , Yp)′:
Y1 = ℓ11 X1 + ℓ12 X2 + · · · + ℓ1p Xp
Y2 = ℓ21 X1 + ℓ22 X2 + · · · + ℓ2p Xp
...
Yp = ℓp1 X1 + ℓp2 X2 + · · · + ℓpp Xp
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 211 / 261
Principal Component Analysis Population Principal Component

 In matrix notation,

(Y1, Y2, . . . , Yp)′ = L (X1, X2, . . . , Xp)′,   where L = [ ℓ11 . . . ℓ1p
                                                               ..        ..
                                                              ℓp1 . . . ℓpp ]

 Two shorthand notations:

Y = LX   or   Yj = ℓj′X  (j = 1, 2, . . . , p)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 212 / 261


Principal Component Analysis Population Principal Component

Population PCA

 We want a few variables to explain the data.

 Two remarks:
◦ The correlation structure (p(p − 1)/2 covariances) is somewhat of a nuisance. It would be nice if the new variables were uncorrelated. For example, if p = 4, we have a total of 10 numbers (4 variances and 6 covariances). If the new variables are uncorrelated, then this would reduce 10 numbers to 4 numbers.
◦ The information in the data is provided by the variance-covariance structure. Suppose the covariances have been removed; then it is provided by the variance structure only, or simply: the variability. Should we be able to concentrate as much of the total variance as possible on one new variable (Y1), then probably the remaining variables (Y2, . . . , Yp) could be ignored. This would reduce the 4 remaining numbers to 1.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 213 / 261
Principal Component Analysis Population Principal Component

Population PCA

 Formalizing this discussion, we would require;


◦ the Yj are uncorrelated
◦ the Yj have maximal variance.
 The second requirement is a little bit strange. Indeed, suppose a variable
Y1 has been found with maximal variance. For example

Y1 = 3X1 + 0.5X2 − 2.6X3 − 3.2X4

 Let us then consider,

Z1 = 6X1 + 1.0X2 − 5.2X3 − 6.4X4

 The variable Z1 will have 4 times the variance of Y1 .


 Thus, it seems that the second condition is impossible: one can always multiply the variable by a well-chosen constant to increase its variance.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 214 / 261
Principal Component Analysis Population Principal Component

Population PCA
 This problem can be overcome by an additional requirement:
◦ the Yj are uncorrelated
◦ the Yj have maximal variance
◦ the coefficient vectors ℓj have unit length
 The new requirement is formalized as follows:

||ℓj||² = ℓj′ℓj = Σ_{k=1}^{p} ℓjk² = 1

 These requirements can be translated into an iterative procedure:
Y1 = ℓ1′X with maximal variance, with condition ℓ1′ℓ1 = 1
Y2 = ℓ2′X with maximal variance, with conditions ℓ2′ℓ2 = 1, cov(Y1, Y2) = 0
. . .
Yj = ℓj′X with maximal variance and conditions ℓj′ℓj = 1, cov(Y1, Yj) = · · · = cov(Yj−1, Yj) = 0
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 215 / 261
Principal Component Analysis Heuristic Argument

Heuristic argument

 Plotting two variables that follow a bivariate normal distribution shows a cloud of ellipsoidal form. The shape of the ellipse is found from the covariance matrix:
◦ the principal axes are in the direction of the eigenvectors of Σ
◦ the lengths of the principal axes are proportional to the eigenvalues of Σ.
 Now, for every bivariate sample, we can calculate the sample version of Σ and thus these axes can be considered.
 The variability of the original variables is reflected in the "width" of the cloud of points, found by projecting the cloud on the x-axis and on the y-axis respectively.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 216 / 261


Principal Component Analysis Mathematical Argument

Mathematical argument

 It turns out that the heuristic reasoning of the previous section can be
generalized and proven mathematically.
 Theorem: Let Σ be the covariance matrix of X with ordered eigenvalue-eigenvector pairs (λj, ej), j = 1, 2, . . . , p.
◦ the principal components are Yj = ej′X = ej1 X1 + · · · + ejp Xp
◦ var(Yj) = λj
◦ cov(Yj, Yk) = ej′Σek = 0, j ≠ k
 For distinct eigenvalues, the choice of the principal components is unique.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 217 / 261


Principal Component Analysis Mathematical Argument

Procedure

The procedure, deduced from this theorem is as follows:


(1) Determine the covariance matrix of your population.
(2) Calculate the eigenvectors ej , ordered such that they correspond to
decreasing eigenvalues (e1 corresponding to λ1 , the largest eigenvalue)
(3) Calculate the transformed variables:

Y1 = e11 X1 + e12 X2 + · · · + e1p Xp


Y2 = e21 X1 + e22 X2 + · · · + e2p Xp
..
.
Yp = ep1 X1 + ep2 X2 + · · · + epp Xp

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 218 / 261


Principal Component Analysis Mathematical Argument

Properties

The following properties ensure that the new variables are very desirable:
(1) The variance of the new variables is determined by the eigenvalues λj :
var(Yj ) = λj
(2) The covariance between pairs of distinct new variables is zero, and hence also the correlation:
Corr(Yj, Yk) = 0, j ≠ k
(3) The total variability of the original variables is recorded by the new
variables. More precisely, the sum of the diagonal elements of the
original covariance matrix equals the sum of the eigenvalues.
(4) The total population variance is λ1 + · · · + λp . The jth principal
component "explains"
λj
λ1 + · · · + λp
of the variance.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 219 / 261
Principal Component Analysis Mathematical Argument

It is common practice to choose the first few components so as to capture approximately 80 − 90% of the variation, such that only an acceptable amount of information is lost.
Example (calculating the population principal components): Suppose the random variables X1, X2, and X3 have the covariance matrix

Σ = [  1  −2   0
      −2   5   0
       0   0   2 ]

It may be verified that the eigenvalue-eigenvector pairs are

λ1 = 5.83,  e1′ = [0.383, −0.924, 0]
λ2 = 2.00,  e2′ = [0, 0, 1]
λ3 = 0.17,  e3′ = [0.924, 0.383, 0]

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 220 / 261


Principal Component Analysis Mathematical Argument

Therefore the principal components become

Y1 = e1′X = 0.383X1 − 0.924X2
Y2 = e2′X = X3
Y3 = e3′X = 0.924X1 + 0.383X2

The variable X3 is one of the principal components, because it is uncorrelated with the other variables.

var(Y1) = var(0.383X1 − 0.924X2)
        = 0.383² var(X1) + (−0.924)² var(X2) + 2(0.383)(−0.924) cov(X1, X2)
        = 0.147(1) + 0.854(5) − 0.708(−2) = 5.83 = λ1
cov(Y1, Y2) = cov(0.383X1 − 0.924X2, X3)
            = 0.383 cov(X1, X3) − 0.924 cov(X2, X3)
            = 0.383(0) − 0.924(0) = 0

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 221 / 261


Principal Component Analysis Mathematical Argument

 It is also readily apparent that

σ11 + σ22 + σ33 = 1 + 5 + 2 = λ1 + λ2 + λ3 = 5.83 + 2.0 + 0.17 = 8

 The proportion of the total variance accounted for by the first principal
component is
λ1 5.83
= = 0.73
λ1 + λ2 + λ3 8
 Further, the first two components Y1 and Y2 could replace the original
three variables with little loss of information.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 222 / 261
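The same calculation can be reproduced numerically. The base-R sketch below (nothing beyond eigen() is assumed) recovers the eigenvalue-eigenvector pairs and the proportions of variance explained for the 3 × 3 covariance matrix of this example:

# Population PCA of the 3x3 covariance matrix via its spectral decomposition
Sigma <- matrix(c( 1, -2, 0,
                  -2,  5, 0,
                   0,  0, 2), nrow = 3, byrow = TRUE)

eig <- eigen(Sigma)                 # eigenvalues returned in decreasing order
lambda <- eig$values                # approx 5.83, 2.00, 0.17
E <- eig$vectors                    # columns are e1, e2, e3 (up to sign)

prop <- lambda / sum(lambda)        # proportion of total variance per component
round(lambda, 2); round(E, 3); round(prop, 2)   # first proportion approx 0.73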


Principal Component Analysis Sample Principal Components

Sample PCs

 By replacing the population covariance matrix by the sample covariance matrix, the results carry over.
 From the sample version, inferences can be drawn about the population level. We may be led to believe that a principal component is essentially made up of the variables with large coefficients, and that those with considerably smaller coefficients can be ignored.
 However, this is merely jumping to conclusions, and one should also be guided by the correlations between the principal components and each of their constituents.
 A formula is given in the following property.
Property: Let X have covariance matrix Σ, with Yj = ej′X the principal components; then

ρ_{Yj, Xk} = ejk √λj / √σkk
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 223 / 261
Principal Component Analysis Sample Principal Components

Sample PCs

 In other words, the correlation is a function of the coefficient, but also of the original variance. While the eigenvalue is also involved, this is not important, since it is constant for a given eigenvector.
 Applied to principal component 1 of the previous example:

ρ_{Y1, X1} = e11 √λ1 / √σ11 = 0.383 √5.83 / √1 = 0.925
ρ_{Y1, X2} = e12 √λ1 / √σ22 = −0.924 √5.83 / √5 = −0.998

 Notice here that the variable X2, with coefficient −0.924, receives the greater weight in the component Y1. It also has the largest correlation (in absolute value) with Y1. The correlation of X1 with Y1, 0.925, is almost as large as that for X2, indicating that the variables are about equally important to the first principal component.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 224 / 261
Principal Component Analysis Sample Principal Components

Sample PCs

 The relative sizes of the coefficients of X1 and X2 suggest, however, that X2 contributes more to the determination of Y1 than does X1.
 Finally, ρ_{Y2,X1} = ρ_{Y2,X2} = 0 and ρ_{Y2,X3} = √λ2 / √σ33 = √2 / √2 = 1 (as it should be).
 The remaining correlations can be neglected, since the third component is unimportant.
Proof of Property: Define ℓk = (0, . . . , 0, 1, 0, . . . , 0)′, then Xk = ℓk′X and Yj = ej′X.

cov(Xk, Yj) = ℓk′ cov(X, X) ej = ℓk′ Σ ej = λj ℓk′ ej = λj ejk,   since Σej = λj ej

In conclusion, corr(Xk, Yj) = λj ejk / (√λj √σkk) = ejk √λj / √σkk, so the coefficient ejk is also a measure of the correlation between the jth principal component and the kth original variable.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 225 / 261
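These correlations can also be obtained directly from the spectral decomposition. A short base-R sketch, reusing the 3 × 3 covariance matrix from the example above, is:

# Correlations between principal components and original variables:
# rho(Yj, Xk) = e_jk * sqrt(lambda_j) / sqrt(sigma_kk)
Sigma <- matrix(c( 1, -2, 0,
                  -2,  5, 0,
                   0,  0, 2), nrow = 3, byrow = TRUE)
eig <- eigen(Sigma)
lambda <- eig$values
E <- eig$vectors                      # columns are eigenvectors (up to sign)

rho <- diag(sqrt(lambda)) %*% t(E) %*% diag(1 / sqrt(diag(Sigma)))
round(rho, 3)   # entry [j, k] gives corr(Yj, Xk); |rho[1, 2]| approx 0.998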
Principal Component Analysis Sample Principal Components

Exercise

(1) Determine the population principal components for the covariance matrices

(i) Σ = [ 5  2
          2  2 ]

(ii) Σ = [ 2  0  0
           0  4  0
           0  0  4 ]
(2) Also, calculate the proportion of the total population variance explained
by the principal components.
(3) Compute the correlation between principal components and the original
variables.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 226 / 261


Principal Component Analysis Sample Principal Components

PCA based on correlation

 Covariance or Correlation?
◦ The original variables could have different units. Then we are comparing apples and oranges.
◦ The original variables could have widely varying standard deviations.
 In both cases, the analysis is driven by the variables with large standard deviations.
 Kilometers versus millimeters (×10⁶) ⇒ the principal components will be pulled towards the variable measured in millimeters.
 A solution is provided by using the correlation matrix instead. This means that all variables are replaced by their standardized versions.
 The variances change to 1 and the total variability is equal to p.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 227 / 261


Principal Component Analysis Sample Principal Components

PCA based on correlation

 The proportion of the (standardized) population variance explained by the first k principal components is

(λ1 + λ2 + · · · + λk) / p

 The correlation between a new variable and an original (standardized) variable is

ρ_{Yj, Xk} = ejk √λj

Exercise 1: Consider the covariance matrix

Σ = [ 1    4
      4  100 ]

(i) Derive the correlation matrix ρ.
(ii) Perform PCA on ρ. What proportion of the total variability is captured by each principal component?
(iii) Find the correlations between the principal components and the original variables.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 228 / 261
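When only a covariance matrix is available, a correlation-based PCA can be carried out in base R with cov2cor() followed by eigen(); with a raw data matrix one would typically use prcomp(..., scale. = TRUE) instead. The sketch below uses an illustrative covariance matrix (not the one from the exercise) just to show the steps:

# PCA on the correlation matrix derived from a covariance matrix (illustrative values)
Sigma <- matrix(c(1.0, 0.6,
                  0.6, 9.0), nrow = 2, byrow = TRUE)   # hypothetical example matrix

R <- cov2cor(Sigma)          # correlation matrix: standardizes each variable
eig <- eigen(R)
prop <- eig$values / nrow(R) # total standardized variance equals p
list(correlation = R, eigenvalues = eig$values, proportion = prop)

# With a data matrix X, the equivalent analysis is prcomp(X, scale. = TRUE)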
Principal Component Analysis Sample Principal Components

Exercise 2: Find the principal components and the proportion of the total variance explained by each when the covariance matrix is

Σ = [ σ²   ρσ²  ρσ²
      ρσ²  σ²   ρσ²
      ρσ²  ρσ²  σ²  ]

Repeat using the correlation matrix ρ derived from Σ.

Other Analysis
 Marginal PCA and PCA by groups
 "Partialled out": fit a regression model for each of the outcome variables,

Yij = αj + βj Xi + εij

and compute the residuals: ε̂ij = Yij − (α̂j + β̂j Xi).

 The outcome vector is replaced with the corresponding residual vector. The correlation matrix of the residual vector is called the partial correlation matrix. In other words, the effect of the group variable is "taken out".
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 229 / 261
Discrimination and classification Introduction

Discrimination and classification

 In many cases in multivariate data, the subgroup (stratified) structure of the data will be the focus of scientific research interest.
 Two distinct situations can arise:
◦ Known Groups: Groups have been defined explicitly. This is often based on subject-matter (eg biological) knowledge. Though groups may be widely accepted by the scientific community, it does not imply that it is easy to discriminate between groups.
◦ Unknown Groups: The researcher has good grounds to believe in the existence of strata, even though they have not been clearly defined. There might even be uncertainty about the actual number of groups. This second situation is the topic of cluster analysis.
 The attention here is confined to discriminant analysis.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 230 / 261


Discrimination and classification Goal of Discriminate Analysis

Goal of Discriminate Analysis

 Discriminant analysis can serve two slightly different purposes.
◦ Discrimination: Suppose the groups in the data have been clearly defined. The question is then whether it is possible to distinguish among the groups in an "optimal" way. Indeed, when several characteristics are recorded, it might turn out that summary measures, such as means and standard deviations, are distinct among the groups. This can be viewed as an exploratory, descriptive technique.
◦ Classification: Given a sample for which group membership is known, and given a new observation, the question is how to allocate the new specimen to a particular population. Clearly, this is not a descriptive technique any more. A (mathematical) rule is required, an automated decision process, which indicates unambiguously to which group a new specimen should be allocated, given the values of the characteristics.
 In both cases, one often relies on an algebraic rule, a discriminant function.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 231 / 261
Discrimination and classification Scope of Discriminant Analysis

Fully Parametric or Not

 Discriminant analysis can be approached by two philosophically different roads:
◦ The parametric way: This method is based on assuming a parametric distributional form for the outcomes in each of the subgroups. Differences between these distributions (in terms of their parameters) are used to discriminate between them. The best known examples include normal and logistic discriminant analysis. We will focus on normal discriminant analysis.
◦ Fisher’s way: This method is concerned with finding the linear combination of the original variables that displays the group differences best. [Fisher’s Linear Discriminant Analysis]

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 232 / 261


Discrimination and classification Parametric Version: Two Populations

Two Populations

 For the parametric theory, we will restrict attention to two groups. Indicate the two classes of objects by π1 and π2.
 Recall that we want to:
◦ Distinguish between them
◦ Allocate objects to them
 For each object or individual i, a set of p measurements Xi = (Xi1, Xi2, . . . , Xip)′ is obtained.
 We hope that the measurements are "different" between the two groups (eg the mean of some of them is higher or lower in one group than in the other).
 A difference between populations is translated into statistical language by the claim that they are generated by a different stochastic mechanism, which in turn is characterized by a different distribution.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 233 / 261


Discrimination and classification Parametric Version: Two Populations

Two Populations

 An observation in population j = 1, 2 follows distribution Fj(x) with density fj(x).
 Divide the variable space Ω into 2 parts R1 and R2 (regions) and adopt the rule: a new observation is assumed to belong to πj if it falls in Rj.
 Of course, the regions should be a partition:
◦ The union of R1 and R2 fills the whole space Ω
◦ The intersection of R1 and R2 is empty (has probability zero).

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 234 / 261


Discrimination and classification Parametric Version: Two Populations

Classification Error

 Classification (and many other things. . . ) is bound to ERROR.

 In statistical terms, we have to study the misclassification errors:

◦ Xi belongs to π1 and is classified into π2
◦ Xi belongs to π2 and is classified into π1

 That is, based on the classification rule and the observations made for a particular specimen, we are led to believe that the specimen belongs to one subgroup, whereas in reality it belongs to the other.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 235 / 261


Discrimination and classification Parametric Version: Two Populations

Properties of a Good Classification Rule

 The following properties, discussed before, should be sought for a good classification rule:
(1) The misclassification probabilities are minimal.
(2) The prior probabilities are taken into account:
◦ If one population is much larger than another, classification into the larger one should be more frequent.
◦ eg there are more sound than bankrupt firms, whence classification of a firm as a bankruptcy candidate should occur only if the evidence is overwhelming.
(3) The cost of a misclassification error is taken into account (ethical cost, economic cost):
◦ eg classifying a healthy person as diseased implies further investigation and eventually the healthy condition of the person will be established, while the opposite misclassification is dramatic.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 236 / 261
Discrimination and classification Parametric Version: Two Populations

Formalizing the Classification Error

 Recall the setting:
◦ Two populations π1 and π2 with associated densities f1(x) and f2(x)
◦ Let Ω be the sample space and suppose it is partitioned as Ω = R1 ∪ R2, with Rj the set of x which we would classify as belonging to πj (which could be wrong).
 Note that, in general, an optimal classification is possible, but no perfect classification.
 The classification errors are:
◦ The probability of lying in the second region while belonging to the first population:
P(2|1) = P(X ∈ R2 | π1) = ∫_{R2} f1(x) dx
◦ The opposite classification error is:
P(1|2) = P(X ∈ R1 | π2) = ∫_{R1} f2(x) dx

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 237 / 261


Discrimination and classification Parametric Version: Two Populations

Formalizing the Classification Error

 Now, consider the prior probabilities


◦ p1 = prior probability of belonging to π1
◦ p2 = prior probability of belonging to π2
 Evidently, p1 + p2 = 1.
 There are 4 "classification probabilities". A specimen can belong to π1 or π2 and can be classified as π1 or π2, leading to a 2 × 2 factorial:
Classify as
True Population π1 π2
π1 P(1|1) P(2|1)
π2 P(1|2) P(2|2)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 238 / 261


Discrimination and classification Parametric Version: Two Populations

Formalizing the Classification Error

 The correct classification probability for population π1 :


P(correctly classified as π1 ) = P(X ∈ π1 , X ∈ R1 )
= P(X ∈ π1 )P(X ∈ R1 |X ∈ π1 )
= p1 P(1|1)
 and misclassification error is:
P(misclassified as π1 ) = P(X ∈ π2 , X ∈ R1 )
= P(X ∈ π2 )P(X ∈ R1 |X ∈ π2 )
= p2 P(1|2)
 In summary
P(correctly classified as π1 ) = p1 P(1|1)
P(misclassified as π1 ) = p2 P(1|2)
P(correctly classified as π2 ) = p2 P(2|2)
P(misclassified as π2 ) = p1 P(2|1)
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 239 / 261
Discrimination and classification Parametric Version: Two Populations

Expected Cost of Misclassification

 These probabilities are the first step to answer the following questions:
(1) What is the misclassification error?
(2) What is the misclassification cost?
 The cost matrix is very simple in the case of the two groups:
Classify as
True Population R1 R2
π1 0 c(2|1)
π2 c(1|2) 0

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 240 / 261


Discrimination and classification Parametric Version: Two Populations

Expected Cost of Misclassification

 We are now able to compute the Expected Cost of Misclassification (ECM):

ECM = c(2|1)P(2|1)p1 + c(1|2)P(1|2)p2
    = c(2|1)p1 ∫_{R2} f1(x) dx + c(1|2)p2 ∫_{R1} f2(x) dx
    = c(2|1)p1 ∫_{R2} f1(x) dx + c(1|2)p2 [ 1 − ∫_{R2} f2(x) dx ]
    = c(1|2)p2 + ∫_{R2} [ f1(x)c(2|1)p1 − f2(x)c(1|2)p2 ] dx

 Minimizing the Expected Cost of Misclassification is done by choosing for R2 those points that yield a negative contribution to the integral:

R2 = {x | f1(x)c(2|1)p1 − f2(x)c(1|2)p2 < 0}
   = {x | f1(x)c(2|1)p1 < f2(x)c(1|2)p2}
   = {x | f1(x)/f2(x) < c(1|2)p2 / (c(2|1)p1)}
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 241 / 261
Discrimination and classification Parametric Version: Two Populations

Structure of the ECM

 Similarly,

R1 = {x | f1(x)/f2(x) > c(1|2)p2 / (c(2|1)p1)}

and hence the regions are defined.
 What if f1(x)/f2(x) = c(1|2)p2 / (c(2|1)p1)?

 This boundary case is fairly arbitrary and the performance of the rule will not change whether we assign this curve to R1 or to R2.
 The classification rule is: assign an observation with outcome vector x to the first population π1 if

f1(x)/f2(x) > c(1|2)p2 / (c(2|1)p1)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 242 / 261
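To make the rule concrete, here is a minimal R sketch for two univariate normal populations with hypothetical parameters, priors and costs; it simply checks whether the density ratio exceeds the threshold:

# Minimum-ECM classification rule for two univariate normal populations (hypothetical values)
classify_ecm <- function(x, mu1 = 0, sd1 = 1, mu2 = 2, sd2 = 1.5,
                         p1 = 0.7, p2 = 0.3, c12 = 1, c21 = 5) {
  ratio     <- dnorm(x, mu1, sd1) / dnorm(x, mu2, sd2)   # f1(x) / f2(x)
  threshold <- (c12 * p2) / (c21 * p1)                   # c(1|2)p2 / (c(2|1)p1)
  ifelse(ratio > threshold, "pi1", "pi2")
}

classify_ecm(c(-1, 0.5, 1.8, 3))   # classify a few new observations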


Discrimination and classification Parametric Version: Two Populations

Structure of the ECM

 In other words, the ratio of the densities should exceed a threshold value.

 The boundary f1(x)/f2(x) = c(1|2)p2 / (c(2|1)p1) is called the discriminant function.

 Inspecting this ratio, it is clear that we only need:

◦ The prior probability ratio p1/p2. Of course, knowing the ratio is equivalent to knowing p1 or knowing p2 only. Indeed, the quantities sum to one, whence they represent only one independent quantity.
◦ The cost ratio c(1|2)/c(2|1). This is an important reduction of the information that needs to be found. Even if the components are hard to specify, the ratio can be much easier to establish. Indeed, one might have difficulty in calculating even a rough approximation of the actual cost involved in these misclassifications.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 243 / 261


Discrimination and classification Parametric Version: Two Populations

Structure of the ECM

But it is plausible that one has a rough idea about the relative severity of the misclassifications, eg that the second type of misclassification is, say, two times as bad as one of the first type.
 Remarks:
◦ The shape of the discriminant function depends on the form of f1(x) and f2(x). It will change with the parametric forms assumed for these densities (eg normal densities with equal or with unequal variances).
◦ If either the cost ratio or the prior probability ratio is unity, the definition of the regions simplifies accordingly.
◦ If the product of the cost ratio and the prior probability ratio is unity, then we actually allocate to the population with the highest probability. We then classify to R1 if f1(x)/f2(x) > 1, or equivalently f1(x) > f2(x).
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 244 / 261


Discrimination and classification Parametric Version: Two Populations

Structure of the ECM

 The ECM is not the only useful criterion to determine the classification boundary. A few alternatives are:
◦ Total probability of misclassification (TPM): the ECM for equal costs.
◦ Largest posterior probability: reduces to the TPM.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 245 / 261


Discrimination and classification Two Multivariate Normal Populations

Two Normal Populations

 Now that we have defined a classification criterion: Assign an


observation with outcome x to the first population π1 if
f1 (x) c(1|2)p2
>
f2 (x) c(2|1)p1
 We can focus on a few standard cases. Assume a multivariate normal
form for the two population:
π1 : Np (µ1 , Σ1 )
π2 : Np (µ2 , Σ2 )
where µ1 and µ2 are the mean vectors and Σ1 and Σ2 are the covariance matrices.
 Here µ1 , µ2 , Σ1 , Σ2 are assumed to be known.
 We need to distinguish between the situation where the covariance
matrices are equal or unequal.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 246 / 261
Discrimination and classification Two Multivariate Normal Populations

Equal Covariance Matrix

 In this case, we assume Σ1 = Σ2 = Σ.

 Explicitly, the densities are (i = 1, 2):

fi(x) = (2π)^{−p/2} |Σ|^{−1/2} exp[ −½ (x − µi)′ Σ^{-1} (x − µi) ]

 The classification rule is based on the ratio of the two densities evaluated at x:

f1(x)/f2(x) = exp[ −½ (x − µ1)′ Σ^{-1} (x − µ1) + ½ (x − µ2)′ Σ^{-1} (x − µ2) ]

 After some manipulation, the classification region R1 is found to be:

(µ1 − µ2)′ Σ^{-1} x − ½ (µ1 − µ2)′ Σ^{-1} (µ1 + µ2) ≥ ln[ c(1|2)p2 / (c(2|1)p1) ]

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 247 / 261


Discrimination and classification Two Multivariate Normal Populations

Sample Version
 In the above reasoning, µ1, µ2 and Σ are assumed to be known population values. However, in practice, they are unknown. This implies they have to be estimated from data.
 The following algorithm can be used.
◦ Collect n1 observations from π1 and n2 observations from π2.
◦ Construct the sample statistics x̄1, x̄2, S1 and S2, as estimators for µ1, µ2, Σ1 and Σ2 respectively.
◦ Since we assume a common Σ, it is necessary to construct a common S. In other words, S1 and S2 are assumed to estimate the same quantity, and therefore they should be combined in a so-called pooled covariance matrix:

S_pooled = [ (n1 − 1)S1 + (n2 − 1)S2 ] / (n1 + n2 − 2)

 Observe that, when the sample sizes n1 and n2 are equal, S_pooled is simply the average of S1 and S2; otherwise they are weighted by the sample sizes they are based upon.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 248 / 261
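The pooling step is a one-liner in R. A small sketch, assuming two data matrices X1 and X2 with the same columns (hypothetical simulated data), is:

# Pooled covariance matrix from two samples (hypothetical simulated data)
set.seed(1)
X1 <- matrix(rnorm(30 * 3), ncol = 3)            # n1 = 30 observations from pi1
X2 <- matrix(rnorm(22 * 3, mean = 1), ncol = 3)  # n2 = 22 observations from pi2

n1 <- nrow(X1); n2 <- nrow(X2)
S1 <- cov(X1); S2 <- cov(X2)
S_pooled <- ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
round(S_pooled, 3)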
Discrimination and classification Two Multivariate Normal Populations

Estimated Minimum ECM Rule for Two Normal Populations

 Allocate an observation with measurements x0 to π1 if

(x̄1 − x̄2)′ S_pooled^{-1} x0 − ½ (x̄1 − x̄2)′ S_pooled^{-1} (x̄1 + x̄2) ≥ ln[ c(1|2)p2 / (c(2|1)p1) ]

 If the product of the two ratios is unity, then

ln[ c(1|2)p2 / (c(2|1)p1) ] = 0

and the right hand side of the allocation rule vanishes, whence it can be rewritten as

(x̄1 − x̄2)′ S_pooled^{-1} x0 ≥ ½ (x̄1 − x̄2)′ S_pooled^{-1} (x̄1 + x̄2)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 249 / 261


Discrimination and classification Two Multivariate Normal Populations

Estimated Minimum ECM Rule for Two Normal Populations

 Define the linear combination vector

ℓ′ = (x̄1 − x̄2)′ S_pooled^{-1}

 This linear combination occurs both on the left hand side and on the right hand side of the classification rule.
 The rule can be rewritten as:

ℓ′x0 ≥ ½ (ℓ′x̄1 + ℓ′x̄2) = m

where ℓ is called the vector of discriminant coefficients.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 250 / 261


Discrimination and classification Two Multivariate Normal Populations

Example: (Classification with 2 normal populations - common Σ and equal costs)
Consider the following information from a study concerned with the detection of hemophilia A carriers, with X1 = log10(AHF activity) and X2 = log10(AHF-like antigen). The first group of n1 = 30 consists of women who do not carry the hemophilia gene; the second group of n2 = 22 consists of known hemophilia A carriers.

x̄1 = (−0.0065, −0.0390)′,  x̄2 = (−0.2483, 0.0262)′  and  S_pooled^{-1} = [ 131.158  −90.423
                                                                            −90.423  108.147 ]

The equal cost and equal prior discriminant function is:

ℓ′x = (x̄1 − x̄2)′ S_pooled^{-1} x
    = (0.2418, −0.0652) [ 131.158  −90.423
                          −90.423  108.147 ] (x1, x2)′
    = 37.61x1 − 28.92x2

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 251 / 261


Discrimination and classification Two Multivariate Normal Populations

Also,

ℓ′x̄1 = (x̄1 − x̄2)′ S_pooled^{-1} x̄1 = (37.61, −28.92)(−0.0065, −0.0390)′ = 0.88
ℓ′x̄2 = (x̄1 − x̄2)′ S_pooled^{-1} x̄2 = (37.61, −28.92)(−0.2483, 0.0262)′ = −10.10
m = ½(ℓ′x̄1 + ℓ′x̄2) = ½(0.88 − 10.10) = −4.61

Classify a woman with x1 = −0.210 and x2 = −0.044, ie x0 = (x1, x2)′: should this woman be classified as normal or as an obligatory carrier?
Assuming equal costs and equal priors, so that ln(1) = 0, we obtain:

Allocate x0 to π1 if ℓ′x0 ≥ m
Allocate x0 to π2 if ℓ′x0 < m

where x0 = (−0.210, −0.044)′.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 252 / 261


Discrimination and classification Two Multivariate Normal Populations

Since

ℓ′x0 = (37.61, −28.92)(−0.210, −0.044)′ = −6.62 < m = −4.61,

we classify the woman into π2, as an obligatory carrier.

Suppose now that the prior probabilities of group membership are known, say p1 = 0.75 and p2 = 0.25, and assume that the costs of misclassification are equal, so that c(1|2) = c(2|1). The classification statistic

w = (x̄1 − x̄2)′ S_pooled^{-1} x0 − ½ (x̄1 − x̄2)′ S_pooled^{-1} (x̄1 + x̄2) = −6.62 − (−4.61) = −2.01

is compared to ln(p2/p1) = ln(0.25/0.75) = −1.10.

Since w = −2.01 < ln(p2/p1) = −1.10, we again classify the woman into π2, as an obligatory carrier.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 253 / 261
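A base-R sketch that reproduces the calculations of this example from the summary statistics shown above (only S_pooled^{-1}, the two mean vectors and the new observation are needed):

# Linear discriminant rule for the hemophilia example, from summary statistics
Sp_inv <- matrix(c(131.158, -90.423,
                   -90.423, 108.147), nrow = 2, byrow = TRUE)
xbar1  <- c(-0.0065, -0.0390)   # non-carriers
xbar2  <- c(-0.2483,  0.0262)   # obligatory carriers
x0     <- c(-0.210, -0.044)     # new woman to classify

l <- drop(t(xbar1 - xbar2) %*% Sp_inv)      # discriminant coefficients, approx (37.61, -28.92)
m <- 0.5 * (sum(l * xbar1) + sum(l * xbar2))
score <- sum(l * x0)                        # approx -6.62, compared with m approx -4.61

# Equal priors and costs: allocate to pi1 if score >= m
ifelse(score >= m, "pi1 (non-carrier)", "pi2 (obligatory carrier)")

# Unequal priors p1 = 0.75, p2 = 0.25 with equal costs: compare w with log(p2/p1)
w <- score - m
ifelse(w >= log(0.25 / 0.75), "pi1 (non-carrier)", "pi2 (obligatory carrier)")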


Discrimination and classification Two Multivariate Normal Populations

Unequal Covariance Matrices

 We now allow Σ1 ≠ Σ2. Manipulating the ratio of the densities, R1 is defined as the set of vectors x satisfying

−½ x′(Σ1^{-1} − Σ2^{-1})x + (µ1′Σ1^{-1} − µ2′Σ2^{-1})x − k ≥ ln[ c(1|2)p2 / (c(2|1)p1) ]

where k = ½ ln(|Σ1| / |Σ2|) + ½ (µ1′Σ1^{-1}µ1 − µ2′Σ2^{-1}µ2).

 If Σ1 = Σ2 then the quadratic term vanishes and we obtain the linear discriminant function as before.
 Plugging in the sample versions, we obtain the quadratic classification rule.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 254 / 261


Discrimination and classification Two Multivariate Normal Populations

Quadratic Classification Rule for Two Normal Populations

 Allocate x0 to π1 if

−½ x0′(S1^{-1} − S2^{-1})x0 + (x̄1′S1^{-1} − x̄2′S2^{-1})x0 − k ≥ ln[ c(1|2)p2 / (c(2|1)p1) ]

 Guidelines
◦ If the populations are approximately normal and the covariance matrices are unequal, use the quadratic classification rule.
◦ BUT: the quadratic rule is sensitive to departures from normality, while the linear rule is much more generally valid, also outside the normal framework, as we will learn later from Fisher’s discriminant analysis.
◦ Carry out checks before performing a classification procedure:
† Transform to normality first
† Then check for homogeneity of the covariance matrices
The order is important since these homogeneity checks are sensitive to nonnormality.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 255 / 261
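In practice, linear and quadratic classification are usually carried out with lda() and qda() from the MASS package. A minimal sketch on hypothetical simulated data (the group labels and sample sizes are made up for illustration):

# Linear vs quadratic discriminant analysis with MASS (hypothetical simulated data)
library(MASS)

set.seed(123)
dat <- data.frame(
  group = factor(rep(c("pi1", "pi2"), times = c(60, 40))),
  x1 = c(rnorm(60, 0, 1), rnorm(40, 2, 2)),
  x2 = c(rnorm(60, 0, 1), rnorm(40, 1, 2))
)

fit_lda <- lda(group ~ x1 + x2, data = dat)   # assumes a common covariance matrix
fit_qda <- qda(group ~ x1 + x2, data = dat)   # allows unequal covariance matrices

newx <- data.frame(x1 = 1.5, x2 = 0.5)
predict(fit_lda, newx)$class
predict(fit_qda, newx)$class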
Discrimination and classification Fisher’s Discriminant Function - Separation of Populations

Fisher’s Discriminant Function

 As an alternative to the parametric approach developed so far, the philosophy of Fisher can be adopted too. The key differences are:
◦ The emphasis is on displaying group differences, rather than on classification;
◦ The concept of normality is replaced by the concept of linearity;
◦ The covariance matrices of the groups must be the same, because a pooled estimate is used.
 As usual, denote the original outcome vector by x, leading to observations
x11, . . . , x1n1,
x21, . . . , x2n2
 Construct a (scalar) linear combination y = ℓ′x. We get the observations
y11, . . . , y1n1,
y21, . . . , y2n2
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 256 / 261
Discrimination and classification Fisher’s Discriminant Function - Separation of Populations

Fisher’s Discriminant Function

 The separation of the groups could be measured by |ȳ1 − ȳ2|. However, it makes more sense to perform the measurement in standard deviation units.
 In this way, between-group variability is measured relative to within-group variability:

|ȳ1 − ȳ2| / s_y

with

s_y² = [ Σ_{i=1}^{n1} (y1i − ȳ1)² + Σ_{i=1}^{n2} (y2i − ȳ2)² ] / (n1 + n2 − 2)

 The squared distance can be expressed as

d² = |ȳ1 − ȳ2|² / s_y² = [ ℓ′(x̄1 − x̄2) ]² / (ℓ′ S_pooled ℓ)

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 257 / 261


Discrimination and classification Fisher’s Discriminant Function - Separation of Populations

Allocation Rule based on Fisher’s Discriminant Function

 It can be shown that d² is maximized for

ℓ ∝ S_pooled^{-1} (x̄1 − x̄2)

and that the maximized value of d² is

(x̄1 − x̄2)′ S_pooled^{-1} (x̄1 − x̄2)

 Allocate x0 to π1 if

y0 = (x̄1 − x̄2)′ S_pooled^{-1} x0 ≥ ½ (x̄1 − x̄2)′ S_pooled^{-1} (x̄1 + x̄2)

 Observe that

½ (x̄1 − x̄2)′ S_pooled^{-1} (x̄1 + x̄2) = ½ ℓ′(x̄1 + x̄2) = ½ (ȳ1 + ȳ2) = m̂,

the estimated midpoint of the two univariate means.
Dr. S. Iddi (UG) STAT446/614 April 16, 2015 258 / 261
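Fisher's discriminant direction and the midpoint rule can be coded directly. A short base-R sketch on hypothetical data matrices (two small groups with the same two columns) is shown below; the exercise at the end of the chapter can be solved along the same lines:

# Fisher's linear discriminant from two small data matrices (hypothetical values)
X1 <- matrix(c(1, 3,
               2, 5,
               3, 4), ncol = 2, byrow = TRUE)   # group pi1
X2 <- matrix(c(5, 8,
               6, 7,
               7, 9), ncol = 2, byrow = TRUE)   # group pi2

n1 <- nrow(X1); n2 <- nrow(X2)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)

l <- solve(Sp, xbar1 - xbar2)           # discriminant coefficients (up to a constant)
m <- 0.5 * sum(l * (xbar1 + xbar2))     # estimated midpoint

x0 <- c(4, 6)                           # new observation to allocate
ifelse(sum(l * x0) >= m, "allocate to pi1", "allocate to pi2")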
Discrimination and classification Fisher’s Discriminant Function - Separation of Populations

Fisher’s Discriminant Function

 In other words, the separation is based on d², the distance between the two population means µ1 and µ2.

 The "possibility to separate" the two populations can be assessed using the test statistic

[ (n1 + n2 − p − 1) / ((n1 + n2 − 2)p) ] · [ n1 n2 / (n1 + n2) ] d² ∼ F_{p, n1+n2−p−1}

 For large values (H0 rejected), we are able to discriminate between the two populations.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 259 / 261


Discrimination and classification Fisher’s Discriminant Function - Separation of Populations

Exercise

Consider the two data sets

X1 = [ 3  7
       2  4
       4  7 ]

and

X2 = [ 6  9
       5  7
       4  8 ]

(i) Construct Fisher’s (sample) linear discriminant function.

(ii) Assign the observation x0′ = (0, 1) to either population π1 or π2. Assume equal costs and equal prior probabilities.

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 260 / 261


Discrimination and classification Fisher’s Discriminant Function - Separation of Populations

Good Luck in Your Exams!!

Dr. S. Iddi (UG) STAT446/614 April 16, 2015 261 / 261
