
Sociology 761                                                        John Fox

Lecture Notes

Linear Models Using Matrices

Copyright 2014 by John Fox

1. Introduction

- The principal purpose of this lecture is to demonstrate how matrices can
  be used to simplify the development of statistical models.
- A secondary purpose is to review, and extend, some material on linear
  models.
- I will take up the following topics:
  - Expressing linear models for regression, dummy regression, and
    analysis of variance in matrix form.
  - Deriving the least-squares coefficients using matrices.
  - The distribution of the least-squares coefficients.
  - The least-squares coefficients as maximum-likelihood estimators.
  - Statistical inference for linear models.


2. Linear Models in Matrix Form

- The general linear model is
  \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i \]
  where
  - $y_i$ is the value of the response variable for the $i$th of $n$ observations.
  - $x_{i1}, x_{i2}, \ldots, x_{ik}$ are the values of $k$ regressors for observation $i$. In
    linear regression analysis, $x_{i1}, x_{i2}, \ldots, x_{ik}$ are the values of $k$
    quantitative explanatory variables.
  - $\beta_0, \beta_1, \ldots, \beta_k$ are $k + 1$ parameters to be estimated from the data,
    including the constant or intercept term, $\beta_0$.
  - $\varepsilon_i$ is the random error variable for the $i$th observation.


- The statistical assumptions of the linear model concern the behaviour of
  the errors; the standard assumptions include:
  - Linearity: The average error is zero, $E(\varepsilon_i) = 0$; equivalently,
    $E(y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$.
  - Constant error variance: The variance of the errors is the same for all
    observations, $V(\varepsilon_i) = \sigma_\varepsilon^2$; equivalently, $V(y_i) = \sigma_\varepsilon^2$.
  - Normality: The errors are normally distributed, and so $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$;
    equivalently, $y_i \sim N(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik},\; \sigma_\varepsilon^2)$.
  - Independence: The errors are independently sampled; that is, $\varepsilon_i$ and
    $\varepsilon_j$ are independent for $i \ne j$; equivalently, $y_i$ and $y_j$ are independent.
  - Either the $x$-values are fixed (with respect to repeated sampling) or, if
    random, the $x$s are independent of the errors.


- The linear model may be rewritten as
  \[
  y_i = [1, x_{i1}, x_{i2}, \ldots, x_{ik}]
        \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}
        + \varepsilon_i
      = \underset{(1 \times k+1)}{\mathbf{x}_i'}\,\underset{(k+1 \times 1)}{\boldsymbol{\beta}} + \varepsilon_i
  \]
  There is one such equation for each observation, $i = 1, \ldots, n$.


- Collecting these $n$ equations into a single matrix equation:
  \[
  \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
  =
  \begin{bmatrix}
    1 & x_{11} & \cdots & x_{1k} \\
    1 & x_{21} & \cdots & x_{2k} \\
    \vdots & \vdots & & \vdots \\
    1 & x_{n1} & \cdots & x_{nk}
  \end{bmatrix}
  \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}
  +
  \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
  \]
  \[
  \underset{(n \times 1)}{\mathbf{y}}
  = \underset{(n \times k+1)}{\mathbf{X}}\,\underset{(k+1 \times 1)}{\boldsymbol{\beta}}
  + \underset{(n \times 1)}{\boldsymbol{\varepsilon}}
  \]
  - The $\mathbf{X}$ matrix in the linear model is called the model matrix (or the
    design matrix).
  - Note the column of 1s for the constant.
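As a concrete illustration (a sketch in Python/NumPy with hypothetical data, not part of the notes themselves), the model matrix is assembled by binding a column of 1s to the regressors:

```python
import numpy as np

# Hypothetical data: n = 5 observations on k = 2 regressors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# The model matrix X binds a column of 1s (for the constant) to the regressors.
X = np.column_stack([np.ones_like(x1), x1, x2])

print(X.shape)  # (n, k + 1) = (5, 3)
```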


- Similarly, the assumptions of linearity, constant variance, normality, and
  independence can be recast as
  \[ \boldsymbol{\varepsilon} \sim N_n(\mathbf{0},\; \sigma_\varepsilon^2 \mathbf{I}_n) \]
  where $N_n(\mathbf{0},\, \sigma_\varepsilon^2 \mathbf{I}_n)$ denotes the multivariate-normal distribution with
  mean vector $\mathbf{0}$ and covariance matrix
  \[
  \sigma_\varepsilon^2 \mathbf{I}_n =
  \begin{bmatrix}
    \sigma_\varepsilon^2 & 0 & \cdots & 0 \\
    0 & \sigma_\varepsilon^2 & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & \sigma_\varepsilon^2
  \end{bmatrix}
  \]
  equivalently,
  \[ \mathbf{y} \sim N_n(\mathbf{X}\boldsymbol{\beta},\; \sigma_\varepsilon^2 \mathbf{I}_n) \]


2.1 Dummy Regression Models

- The matrix equation $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ suffices not just for linear-regression
  models but, with suitable specification of the regressors, for linear
  models generally.
- For example, consider the dummy-regression model
  \[ y_i = \alpha + \beta x_i + \gamma d_i + \delta(x_i d_i) + \varepsilon_i \]
  where
  - $y$ is income in dollars,
  - $x$ is years of education,
  - and the dummy regressor $d$ is coded 1 for men and 0 for women.


- Recall that this model implies potentially different intercepts and slopes,
  that is, potentially different regression lines, for the two groups:
  - for men,
    \[ y_i = \alpha + \beta x_i + \gamma(1) + \delta(x_i \cdot 1) + \varepsilon_i
         = (\alpha + \gamma) + (\beta + \delta)x_i + \varepsilon_i \]
  - for women,
    \[ y_i = \alpha + \beta x_i + \gamma(0) + \delta(x_i \cdot 0) + \varepsilon_i
         = \alpha + \beta x_i + \varepsilon_i \]
  - and so $\gamma$ is the difference in intercepts between men and women, and
    $\delta$ is the difference in slopes.
  - Because men and women can have different slopes, this model
    permits gender to interact with education in determining income.


- Written as a matrix equation, the dummy-regression model becomes
  \[
  \begin{bmatrix} y_1 \\ \vdots \\ y_{n_1} \\ y_{n_1+1} \\ \vdots \\ y_n \end{bmatrix}
  =
  \begin{bmatrix}
    1 & x_1 & 0 & 0 \\
    \vdots & \vdots & \vdots & \vdots \\
    1 & x_{n_1} & 0 & 0 \\
    1 & x_{n_1+1} & 1 & x_{n_1+1} \\
    \vdots & \vdots & \vdots & \vdots \\
    1 & x_n & 1 & x_n
  \end{bmatrix}
  \begin{bmatrix} \alpha \\ \beta \\ \gamma \\ \delta \end{bmatrix}
  +
  \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{n_1} \\ \varepsilon_{n_1+1} \\ \vdots \\ \varepsilon_n \end{bmatrix}
  \]
  \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]
  where, for clarity, the $n_1$ observations for women precede the $n - n_1$
  observations for men.
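A sketch of this model matrix in NumPy (the education values are hypothetical; the three women precede the three men, as in the layout above):

```python
import numpy as np

# Hypothetical data: education x and a gender dummy d (1 = men, 0 = women),
# with the n1 women preceding the n - n1 men.
x = np.array([8.0, 12.0, 16.0, 10.0, 14.0, 18.0])
d = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Columns of X correspond to alpha, beta, gamma, delta:
# the constant, x, the dummy d, and the interaction regressor x*d.
X = np.column_stack([np.ones_like(x), x, d, x * d])
```

For the women's rows the last two columns are zero, so only the constant and the education slope apply, exactly as the scalar derivation shows.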


- Reminder: When a categorical explanatory variable has more than two
  (say, $m$) categories, it generates a set of $m - 1$ dummy regressors,
  that is, one fewer dummy variable than the number of categories.
  - For example, a five-category regional classification might produce the
    following four dummy regressors:

        Region     d1  d2  d3  d4
        East        1   0   0   0
        Quebec      0   1   0   0
        Ontario     0   0   1   0
        Prairies    0   0   0   1
        BC          0   0   0   0

  - Here, BC is arbitrarily selected as the baseline category, to which the
    other categories will implicitly be compared.
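This coding table can be generated mechanically (a sketch; the category labels are those in the table, with BC as the omitted baseline):

```python
import numpy as np

# Five-category regional factor; BC is the baseline and gets no dummy.
region = np.array(["East", "Quebec", "Ontario", "Prairies", "BC"])
levels = ["East", "Quebec", "Ontario", "Prairies"]  # the m - 1 = 4 dummies

# Each dummy regressor is 1 when the observation falls in its category.
D = np.column_stack([(region == level).astype(float) for level in levels])
```

The BC row is all zeros, so its cases are carried entirely by the constant.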


2.2 Analysis of Variance Models

- Analysis of variance or ANOVA models are linear models in which all of
  the explanatory variables are factors, that is, categorical variables.
- The simplest case is one-way ANOVA, where there is a single factor.
  - The one-way ANOVA model is usually written with double-subscript
    notation as
    \[ y_{ij} = \mu + \alpha_j + \varepsilon_{ij} \]
    for levels $j = 1, \ldots, m$ of the factor, and observations $i = 1, \ldots, n_j$ of
    level $j$.


- The matrix form of the one-way ANOVA model is
  \[
  \begin{bmatrix}
    y_{11} \\ \vdots \\ y_{n_1,1} \\
    y_{12} \\ \vdots \\ y_{n_2,2} \\
    \vdots \\
    y_{1,m-1} \\ \vdots \\ y_{n_{m-1},m-1} \\
    y_{1m} \\ \vdots \\ y_{n_m,m}
  \end{bmatrix}
  =
  \begin{bmatrix}
    1 & 1 & 0 & \cdots & 0 & 0 \\
    \vdots & \vdots & \vdots & & \vdots & \vdots \\
    1 & 1 & 0 & \cdots & 0 & 0 \\
    1 & 0 & 1 & \cdots & 0 & 0 \\
    \vdots & \vdots & \vdots & & \vdots & \vdots \\
    1 & 0 & 1 & \cdots & 0 & 0 \\
    \vdots & \vdots & \vdots & & \vdots & \vdots \\
    1 & 0 & 0 & \cdots & 1 & 0 \\
    \vdots & \vdots & \vdots & & \vdots & \vdots \\
    1 & 0 & 0 & \cdots & 1 & 0 \\
    1 & 0 & 0 & \cdots & 0 & 1 \\
    \vdots & \vdots & \vdots & & \vdots & \vdots \\
    1 & 0 & 0 & \cdots & 0 & 1
  \end{bmatrix}
  \begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{m-1} \\ \alpha_m \end{bmatrix}
  +
  \begin{bmatrix}
    \varepsilon_{11} \\ \vdots \\ \varepsilon_{n_1,1} \\
    \varepsilon_{12} \\ \vdots \\ \varepsilon_{n_2,2} \\
    \vdots \\
    \varepsilon_{1,m-1} \\ \vdots \\ \varepsilon_{n_{m-1},m-1} \\
    \varepsilon_{1m} \\ \vdots \\ \varepsilon_{n_m,m}
  \end{bmatrix}
  \]
  \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]
  with the rows blocked by factor level: group 1, group 2, ..., group $m - 1$,
  group $m$.


- This formulation of the model is problematic because there is a
  redundant column in the model matrix (which is therefore of deficient
  rank $m$, rather than full column rank $m + 1$):
  - For example, the first column is the sum of the remaining columns.
  - This will create a problem when we try to fit the model by least
    squares, but, more fundamentally, it reflects a redundancy among the
    parameters of the model.
- A common solution to the problem is to reduce the parameters by one.
  There are many ways to do this, all providing equivalent fits to the data.
  For example:
  - Eliminating the constant, $\mu$, produces a so-called means model,
    \[ y_{ij} = \mu_j + \varepsilon_{ij} \]
    where $\mu_j$ now represents the population mean of level $j$.
  - Eliminating one of the $\alpha_j$ produces a dummy-variable solution, with
    the omitted coefficient corresponding to the baseline category (here
    category $m$):
  \[
  \begin{bmatrix}
    y_{11} \\ \vdots \\ y_{n_1,1} \\
    y_{12} \\ \vdots \\ y_{n_2,2} \\
    \vdots \\
    y_{1,m-1} \\ \vdots \\ y_{n_{m-1},m-1} \\
    y_{1m} \\ \vdots \\ y_{n_m,m}
  \end{bmatrix}
  =
  \begin{bmatrix}
    1 & 1 & 0 & \cdots & 0 \\
    \vdots & \vdots & \vdots & & \vdots \\
    1 & 1 & 0 & \cdots & 0 \\
    1 & 0 & 1 & \cdots & 0 \\
    \vdots & \vdots & \vdots & & \vdots \\
    1 & 0 & 1 & \cdots & 0 \\
    \vdots & \vdots & \vdots & & \vdots \\
    1 & 0 & 0 & \cdots & 1 \\
    \vdots & \vdots & \vdots & & \vdots \\
    1 & 0 & 0 & \cdots & 1 \\
    1 & 0 & 0 & \cdots & 0 \\
    \vdots & \vdots & \vdots & & \vdots \\
    1 & 0 & 0 & \cdots & 0
  \end{bmatrix}
  \begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{m-1} \end{bmatrix}
  +
  \begin{bmatrix}
    \varepsilon_{11} \\ \vdots \\ \varepsilon_{n_1,1} \\
    \varepsilon_{12} \\ \vdots \\ \varepsilon_{n_2,2} \\
    \vdots \\
    \varepsilon_{1,m-1} \\ \vdots \\ \varepsilon_{n_{m-1},m-1} \\
    \varepsilon_{1m} \\ \vdots \\ \varepsilon_{n_m,m}
  \end{bmatrix}
  \]
  \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]


- Alternatively, we can place a linear constraint on the parameters, most
  commonly the sigma constraint
  \[ \sum_{j=1}^{m} \alpha_j = 0 \]
  - Under this constraint,
    \[ \alpha_m = -\sum_{j=1}^{m-1} \alpha_j \]
    need not appear explicitly, producing the model matrix

  \[
  \underset{(n \times m)}{\mathbf{X}} =
  \begin{bmatrix}
    1 & 1 & 0 & \cdots & 0 \\
    \vdots & \vdots & \vdots & & \vdots \\
    1 & 1 & 0 & \cdots & 0 \\
    1 & 0 & 1 & \cdots & 0 \\
    \vdots & \vdots & \vdots & & \vdots \\
    1 & 0 & 1 & \cdots & 0 \\
    \vdots & \vdots & \vdots & & \vdots \\
    1 & 0 & 0 & \cdots & 1 \\
    \vdots & \vdots & \vdots & & \vdots \\
    1 & 0 & 0 & \cdots & 1 \\
    1 & -1 & -1 & \cdots & -1 \\
    \vdots & \vdots & \vdots & & \vdots \\
    1 & -1 & -1 & \cdots & -1
  \end{bmatrix}
  \]
  with columns corresponding to $\mu, \alpha_1, \alpha_2, \ldots, \alpha_{m-1}$, and rows blocked by
  factor level (group 1, group 2, ..., group $m - 1$, group $m$); the rows for
  group $m$ contain $-1$s because $\alpha_m = -\sum_{j=1}^{m-1} \alpha_j$.
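Both one-way ANOVA codings can be generated programmatically; the following is a sketch (the function name and the tiny balanced example are illustrative, not from the notes):

```python
import numpy as np

def anova_model_matrices(groups):
    """Build the dummy-coded (baseline = level m) and sigma-constrained
    model matrices for a one-way ANOVA factor, as described above.
    `groups` holds integer levels 1, ..., m."""
    groups = np.asarray(groups)
    m = int(groups.max())
    n = len(groups)
    X_dummy = np.ones((n, m))  # column 0 is the constant (mu)
    X_sigma = np.ones((n, m))
    for j in range(1, m):      # columns for alpha_1, ..., alpha_{m-1}
        indicator = (groups == j).astype(float)
        X_dummy[:, j] = indicator
        # Sigma coding: cases in level m get -1 in every alpha column.
        X_sigma[:, j] = np.where(groups == m, -1.0, indicator)
    return X_dummy, X_sigma

# Hypothetical factor with m = 3 levels, two observations per level.
Xd, Xs = anova_model_matrices([1, 1, 2, 2, 3, 3])
```

In the dummy coding the last group's rows are all zero apart from the constant; in the sigma coding they are all $-1$, matching the matrix above.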


3. Least-Squares Fit

- The fitted linear model is
  \[ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \]
  where
  - $\mathbf{b} = [b_0, b_1, \ldots, b_k]'$ is the vector of fitted coefficients.
  - $\mathbf{e} = [e_1, e_2, \ldots, e_n]' = \mathbf{y} - \mathbf{X}\mathbf{b}$ is the vector of residuals.
- We want the coefficient vector $\mathbf{b}$ that minimizes the residual sum of
  squares, expressed as a function of $\mathbf{b}$:
  \begin{align*}
  S(\mathbf{b}) = \sum e_i^2 = \mathbf{e}'\mathbf{e}
    &= (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b}) \\
    &= \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\mathbf{b} - \mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b} \\
    &= \mathbf{y}'\mathbf{y} - (2\mathbf{y}'\mathbf{X})\mathbf{b} + \mathbf{b}'(\mathbf{X}'\mathbf{X})\mathbf{b}
  \end{align*}
  - The last line of the equation is justified because
    $\underset{(1 \times n)}{\mathbf{y}'}\,\underset{(n \times k+1)}{\mathbf{X}}\,\underset{(k+1 \times 1)}{\mathbf{b}}$ and
    $\underset{(1 \times k+1)}{\mathbf{b}'}\,\underset{(k+1 \times n)}{\mathbf{X}'}\,\underset{(n \times 1)}{\mathbf{y}}$
    are both scalars, and consequently equal.


- Noting that $\mathbf{y}'\mathbf{y}$ is a constant (with respect to $\mathbf{b}$), $(2\mathbf{y}'\mathbf{X})\mathbf{b}$ is a linear
  function of $\mathbf{b}$, and $\mathbf{b}'(\mathbf{X}'\mathbf{X})\mathbf{b}$ is a quadratic form in $\mathbf{b}$,
  \[ \frac{\partial S(\mathbf{b})}{\partial \mathbf{b}} = \mathbf{0} - 2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b} \]
- Setting the derivative to $\mathbf{0}$ produces the normal equations for the linear
  model,
  \begin{align*}
  -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b} &= \mathbf{0} \\
  \mathbf{X}'\mathbf{X}\mathbf{b} &= \mathbf{X}'\mathbf{y}
  \end{align*}
  a system of $k + 1$ linear equations in $k + 1$ unknowns (i.e., $b_0, b_1, \ldots, b_k$).
- We can solve the normal equations uniquely for $\mathbf{b}$ as long as the
  $(k+1) \times (k+1)$ matrix $\mathbf{X}'\mathbf{X}$ is nonsingular, which will be the case
  provided that:
  - there are at least as many observations as coefficients, that is,
    $n \ge k + 1$; and
  - no column of the model matrix $\mathbf{X}$ is a perfect linear function of the
    other columns.

- When $\mathbf{X}'\mathbf{X}$ is nonsingular, the least-squares solution is
  \[ \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \]
- Looking inside of the matrices in the normal equations:
  - the matrix $\mathbf{X}'\mathbf{X}$ contains sums of squares and cross-products for the
    regressors (including the column of 1s).
  - $\mathbf{X}'\mathbf{y}$ contains sums of products between the regressors and the
    response.
  - The normal equations, therefore, are
    \begin{align*}
    b_0 n + b_1 \textstyle\sum x_{i1} + \cdots + b_k \textstyle\sum x_{ik} &= \textstyle\sum y_i \\
    b_0 \textstyle\sum x_{i1} + b_1 \textstyle\sum x_{i1}^2 + \cdots + b_k \textstyle\sum x_{i1} x_{ik} &= \textstyle\sum x_{i1} y_i \\
    &\;\;\vdots \\
    b_0 \textstyle\sum x_{ik} + b_1 \textstyle\sum x_{ik} x_{i1} + \cdots + b_k \textstyle\sum x_{ik}^2 &= \textstyle\sum x_{ik} y_i
    \end{align*}
- An example, using Duncan's regression of occupational prestige on the
  income and education levels of 45 U.S. occupations:


- Matrices of sums of squares and products:
  \[
  \mathbf{X}'\mathbf{X} = \begin{bmatrix} 45 & 1884 & 2365 \\ 1884 & 105{,}148 & 122{,}197 \\ 2365 & 122{,}197 & 163{,}265 \end{bmatrix}
  \qquad
  \mathbf{X}'\mathbf{y} = \begin{bmatrix} 2146 \\ 118{,}229 \\ 147{,}936 \end{bmatrix}
  \]
- The inverse of $\mathbf{X}'\mathbf{X}$:
  \[
  (\mathbf{X}'\mathbf{X})^{-1} = \begin{bmatrix}
    0.1021058996 & -0.0008495732 & -0.0008432006 \\
    -0.0008495732 & 0.0000801220 & -0.0000476613 \\
    -0.0008432006 & -0.0000476613 & 0.0000540118
  \end{bmatrix}
  \]
- The regression coefficients:
  \[
  \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \begin{bmatrix} -6.06466 \\ 0.59873 \\ 0.54583 \end{bmatrix}
  \]
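These numbers can be checked directly by solving the normal equations (a sketch, using the $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}'\mathbf{y}$ values given above):

```python
import numpy as np

# Sums of squares and products for Duncan's regression, as reported above.
XtX = np.array([[  45.0,   1884.0,   2365.0],
                [1884.0, 105148.0, 122197.0],
                [2365.0, 122197.0, 163265.0]])
Xty = np.array([2146.0, 118229.0, 147936.0])

# Solving the normal equations X'X b = X'y recovers the coefficients.
b = np.linalg.solve(XtX, Xty)
print(b)  # approximately [-6.06466, 0.59873, 0.54583]
```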


4. Distribution of the Least-Squares Coefficients

- It is simple to show that the least-squares coefficients are unbiased
  estimators of the population regression coefficients:
  \[ \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \]
  and so (assuming a fixed model matrix $\mathbf{X}$),
  \[ E(\mathbf{b}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{y}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta} \]
- The covariance matrix of $\mathbf{b}$ follows from the covariance matrix of $\mathbf{y}$,
  which is $\sigma_\varepsilon^2 \mathbf{I}_n$:
  \begin{align*}
  V(\mathbf{b}) &= \left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right] V(\mathbf{y}) \left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]' \\
    &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\sigma_\varepsilon^2 \mathbf{I}_n\,\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]' \\
    &= \sigma_\varepsilon^2 (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
    &= \sigma_\varepsilon^2 (\mathbf{X}'\mathbf{X})^{-1}
  \end{align*}


- Because the error variance $\sigma_\varepsilon^2$ is an unknown parameter, the
  covariance matrix of $\mathbf{b}$ must be estimated:
  \[ \widehat{V}(\mathbf{b}) = s_e^2 (\mathbf{X}'\mathbf{X})^{-1} \]
  where
  \[ s_e^2 = \frac{\sum e_i^2}{n - k - 1} \]
  is the estimated error variance, and $e_i$ is the residual for observation $i$.
- Because the response vector $\mathbf{y}$ is multinormally distributed, so is $\mathbf{b}$;
  that is,
  \[ \mathbf{b} \sim N_{k+1}\!\left[\boldsymbol{\beta},\; \sigma_\varepsilon^2 (\mathbf{X}'\mathbf{X})^{-1}\right] \]
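The whole estimation chain can be sketched in a few lines of NumPy (the simulated data and parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from a known linear model (hypothetical parameters).
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=3.0, size=n)

# Least squares, residuals, and the estimated covariance matrix of b.
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - k - 1)   # estimated error variance s_e^2
V_b = s2 * XtX_inv         # estimated covariance matrix of b
se = np.sqrt(np.diag(V_b)) # coefficient standard errors
```

A useful numerical check: the normal equations force the residuals to be orthogonal to every column of $\mathbf{X}$, so $\mathbf{X}'\mathbf{e} = \mathbf{0}$ up to rounding error.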


- Notice the strong analogy between the formulas for the slope coefficient
  in least-squares simple regression (i.e., with a single $x$) and for the
  coefficients of the linear model in matrix form:

                              Simple Regression                                        Linear Model
  Model                       $y_i = \alpha + \beta x_i + \varepsilon_i$               $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$
                              $y_i^* = \beta x_i^* + \varepsilon_i$
  Least-Squares Estimator     $b = \sum x_i^* y_i^* \big/ \sum x_i^{*2}$               $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
                              $\phantom{b} = \left(\sum x_i^{*2}\right)^{-1}\sum x_i^* y_i^*$
  Sampling Variance           $V(b) = \sigma_\varepsilon^2 \big/ \sum x_i^{*2}$        $V(\mathbf{b}) = \sigma_\varepsilon^2 (\mathbf{X}'\mathbf{X})^{-1}$
                              $\phantom{V(b)} = \sigma_\varepsilon^2 \left(\sum x_i^{*2}\right)^{-1}$
  Distribution                $b \sim N\!\left[\beta,\; \sigma_\varepsilon^2 \left(\sum x_i^{*2}\right)^{-1}\right]$    $\mathbf{b} \sim N_{k+1}\!\left[\boldsymbol{\beta},\; \sigma_\varepsilon^2 (\mathbf{X}'\mathbf{X})^{-1}\right]$


- In the scalar formulas the following shorthand notation is used:
  \[ x_i^* = x_i - \bar{x} \qquad y_i^* = y_i - \bar{y} \]


5. Maximum-Likelihood Estimation of the Normal Linear Model

- The standard assumptions of the linear model provide a probability
  model for the data $\mathbf{y}$ (thinking of the model matrix $\mathbf{X}$ as fixed, or
  conditioning on it):
  \[ \mathbf{y} \sim N_n(\mathbf{X}\boldsymbol{\beta},\; \sigma_\varepsilon^2 \mathbf{I}_n) \]
  Then, from the formula for the multivariate-normal distribution,
  \[
  p(\mathbf{y}) = \frac{1}{(2\pi\sigma_\varepsilon^2)^{n/2}}
    \exp\!\left[-\frac{(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}{2\sigma_\varepsilon^2}\right]
  \]
  - Note: $\exp(a)$ in a formula means $e^a$, for the constant $e \simeq 2.718$.
- In maximum-likelihood estimation, recall, we find the values of the
  parameters that make the probability of observing the data as high as
  possible.


- The likelihood function is the same as the probability (or probability
  density) of the data, except thought of as a function of the parameters.
  Here,
  \[
  L(\boldsymbol{\beta}, \sigma_\varepsilon^2) = \left(2\pi\sigma_\varepsilon^2\right)^{-n/2}
    \exp\!\left[-\frac{(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}{2\sigma_\varepsilon^2}\right]
  \]
- As is usually the case, it is simpler to work with the log of the likelihood.
  - Whatever values of the parameters maximize the log-likelihood also
    maximize the likelihood, since the log function is monotone (strictly
    increasing).
  - For the linear model:
    \[
    \log_e L(\boldsymbol{\beta}, \sigma_\varepsilon^2)
      = -\frac{n}{2}\log_e 2\pi - \frac{n}{2}\log_e \sigma_\varepsilon^2
        - \frac{1}{2\sigma_\varepsilon^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
    \]
  - To justify this result, recall that taking logs turns multiplication
    into addition, division into subtraction, and exponentiation into
    multiplication; moreover, $\log_e e^a = a$.


- To maximize the log-likelihood, we need its derivatives with respect to
  the parameters.
  - Finding the derivatives is simplified by noticing that
    $(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ is just the sum of squared errors.
  - Differentiating,
    \begin{align*}
    \frac{\partial \log_e L(\boldsymbol{\beta}, \sigma_\varepsilon^2)}{\partial \boldsymbol{\beta}}
      &= -\frac{1}{2\sigma_\varepsilon^2}\left(2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} - 2\mathbf{X}'\mathbf{y}\right) \\
    \frac{\partial \log_e L(\boldsymbol{\beta}, \sigma_\varepsilon^2)}{\partial \sigma_\varepsilon^2}
      &= -\frac{n}{2\sigma_\varepsilon^2} + \frac{1}{2\sigma_\varepsilon^4}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
    \end{align*}
  - Setting the partial derivatives to 0 and solving for the maximum-
    likelihood estimates of the parameters produces
    \[ \widehat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \]
    \[ \widehat{\sigma}_\varepsilon^2
       = \frac{(\mathbf{y} - \mathbf{X}\widehat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\widehat{\boldsymbol{\beta}})}{n}
       = \frac{\mathbf{e}'\mathbf{e}}{n} \]
    where $\mathbf{e} = \mathbf{y} - \mathbf{X}\widehat{\boldsymbol{\beta}}$ is the vector of residuals.


- Notice that:
  - The MLE $\widehat{\boldsymbol{\beta}}$ is just the least-squares coefficient vector $\mathbf{b}$.
  - The MLE of the error variance, $\widehat{\sigma}_\varepsilon^2 = \sum e_i^2 / n$, is biased.
    - The usual unbiased estimator, $s_e^2$, divides by the residual degrees of
      freedom $n - k - 1$ rather than by $n$.
    - The MLE is consistent, however, since the bias (along with the
      variance of the estimator) goes to zero as $n$ gets larger.
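The relationship between the two variance estimators is easy to see numerically (a sketch with hypothetical simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)

# With n observations and k + 1 coefficients, the MLE of the error variance
# divides e'e by n, while the unbiased estimator divides by n - k - 1.
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
sigma2_mle = e @ e / n       # biased: too small on average
s2 = e @ e / (n - k - 1)     # unbiased estimator

# The two differ by the factor (n - k - 1)/n, which tends to 1 as n grows,
# which is why the MLE is nonetheless consistent.
```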


6. Statistical Inference for Least-Squares Estimation

- Statistical inference for $\boldsymbol{\beta}$ based on the least-squares coefficients $\mathbf{b}$
  uses the estimated covariance matrix $\widehat{V}(\mathbf{b}) = s_e^2 (\mathbf{X}'\mathbf{X})^{-1}$.
- The simplest case is inference for an individual coefficient, $b_j$:
  - The standard error of the coefficient is the square root of the $j$th
    diagonal entry of the estimated covariance matrix (indexing the matrix
    from 0):
    \[ \mathrm{SE}(b_j) = \sqrt{s_e^2 \left[(\mathbf{X}'\mathbf{X})^{-1}\right]_{jj}} \]
  - Because the error variance has been estimated, hypothesis tests and
    confidence intervals use the $t$-distribution with $n - k - 1$ degrees of
    freedom.


- For example:
  - To test
    \[ H_0\colon \beta_j = 0 \]
    we compute
    \[ t_0 = \frac{b_j}{\mathrm{SE}(b_j)} \]
  - To form a 95-percent confidence interval for $\beta_j$ we take
    \[ \beta_j = b_j \pm t_{.975,\, n-k-1}\,\mathrm{SE}(b_j) \]
    where $t_{.975,\, n-k-1}$ is the .975 quantile of the $t$-distribution with
    $n - k - 1$ degrees of freedom.
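As a small sketch (the helper function is illustrative; the numbers are the income coefficient, its standard error, and the $t$ quantile reported elsewhere in these notes for Duncan's regression):

```python
def coef_inference(b_j, se_j, t_crit):
    """t statistic for H0: beta_j = 0, and the 95% confidence interval
    b_j +/- t_crit * SE(b_j), where t_crit is the .975 quantile
    t_{.975, n-k-1}."""
    t0 = b_j / se_j
    ci = (b_j - t_crit * se_j, b_j + t_crit * se_j)
    return t0, ci

# Duncan's income coefficient: b1 = 0.5987, SE = 0.1197, t_{.975,42} = 2.0181.
t0, ci = coef_inference(0.5987, 0.1197, 2.0181)
```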
- More generally, suppose that we want to test the linear hypothesis
  \[ H_0\colon \underset{(q \times k+1)}{\mathbf{L}}\,\underset{(k+1 \times 1)}{\boldsymbol{\beta}} = \underset{(q \times 1)}{\mathbf{c}} \]
  where the hypothesis matrix $\mathbf{L}$ and the right-hand-side vector $\mathbf{c}$ (usually
  $\mathbf{0}$) encode the hypothesis.


- For example, in Duncan's regression of prestige on income and
  education, the hypothesis matrix
  \[ \mathbf{L} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]
  and right-hand-side vector
  \[ \mathbf{c} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]
  specify the hypothesis
  \[ H_0\colon \beta_1 = 0,\; \beta_2 = 0 \]


- Likewise, again for Duncan's regression, the one-row hypothesis matrix
  \[ \mathbf{L} = \begin{bmatrix} 0 & 1 & -1 \end{bmatrix} \]
  and right-hand side $\mathbf{c} = [0]$ correspond to the hypothesis
  \[ H_0\colon \beta_1 - \beta_2 = 0 \]
  that is,
  \[ H_0\colon \beta_1 = \beta_2 \]
- Under the hypothesis $H_0$, the statistic
  \[
  F_0 = \frac{(\mathbf{L}\mathbf{b} - \mathbf{c})'\left[\mathbf{L}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{L}'\right]^{-1}(\mathbf{L}\mathbf{b} - \mathbf{c})}{q\, s_e^2}
  \]
  follows an $F$-distribution with $q$ and $n - k - 1$ degrees of freedom.
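The formula translates directly into code; the following is a sketch (the function name and the toy check at the end are illustrative, not from the notes):

```python
import numpy as np

def linear_hypothesis_F(L, b, c, XtX_inv, s2, q):
    """F statistic for H0: L beta = c, following the formula above.
    q is the number of rows of L (the hypothesis degrees of freedom)."""
    L = np.atleast_2d(L)
    d = L @ b - c
    middle = np.linalg.inv(L @ XtX_inv @ L.T)
    return float(d @ middle @ d) / (q * s2)

# Toy check with hypothetical values: testing a single coefficient
# (q = 1) reduces F0 to the square of the t statistic.
XtX_inv = np.diag([1.0, 0.5, 0.25])
b = np.array([1.0, 2.0, 3.0])
F = linear_hypothesis_F(np.array([[0.0, 1.0, 0.0]]), b,
                        np.array([0.0]), XtX_inv, 2.0, 1)
```

Here $t_0^2 = b_1^2 / (s_e^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{11}) = 4/(2 \times 0.5) = 4$, and the function returns the same value.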


- Example: For Duncan's regression, the sum of squared residuals is
  $\mathbf{e}'\mathbf{e} = 7506.699$, and so
  \[ s_e^2 = \frac{7506.699}{45 - 2 - 1} = 178.7309 \]
- The estimated covariance matrix of the least-squares coefficients is
  \begin{align*}
  \widehat{V}(\mathbf{b}) &= s_e^2 (\mathbf{X}'\mathbf{X})^{-1} \\
  &= 178.7309 \begin{bmatrix}
    0.1021058996 & -0.0008495732 & -0.0008432006 \\
    -0.0008495732 & 0.0000801220 & -0.0000476613 \\
    -0.0008432006 & -0.0000476613 & 0.0000540118
  \end{bmatrix} \\
  &= \begin{bmatrix}
    18.249387 & -0.151844 & -0.150705 \\
    -0.151844 & 0.014320 & -0.008519 \\
    -0.150705 & -0.008519 & 0.009653
  \end{bmatrix}
  \end{align*}


- The estimated standard errors of the regression coefficients are,
  therefore,
  \begin{align*}
  \mathrm{SE}(b_0) &= \sqrt{18.249387} = 4.272 \\
  \mathrm{SE}(b_1) &= \sqrt{0.014320} = 0.1197 \\
  \mathrm{SE}(b_2) &= \sqrt{0.009653} = 0.09825
  \end{align*}
  and a 95-percent confidence interval for $\beta_1$ (the income coefficient) is
  \[ \beta_1 = 0.5987 \pm 2.0181 \times 0.1197 = 0.5987 \pm 0.2416 \]


- To test the hypothesis that both slope coefficients are 0,
  \[ H_0\colon \beta_1 = \beta_2 = 0 \]
  we have
  \[ \mathbf{L} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]
  \[
  \mathbf{L}\mathbf{b} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
  \begin{bmatrix} -6.06466 \\ 0.59873 \\ 0.54583 \end{bmatrix}
  = \begin{bmatrix} 0.59873 \\ 0.54583 \end{bmatrix}
  \quad \text{(i.e., the two slopes)}
  \]

  \[
  F_0 = \frac{(\mathbf{L}\mathbf{b})'\left[\mathbf{L}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{L}'\right]^{-1}\mathbf{L}\mathbf{b}}{q\, s_e^2}
  \]
  \[
  = \frac{[0.599,\; 0.546]
    \left( \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
    \begin{bmatrix} 0.1021 & -0.0008 & -0.0008 \\ -0.0008 & 0.0001 & -0.0000 \\ -0.0008 & -0.0000 & 0.0001 \end{bmatrix}
    \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \right)^{-1}
    \begin{bmatrix} 0.599 \\ 0.546 \end{bmatrix}}
    {2 \times 178.7309}
  \]
  \[ = 101.22 \text{ with 2 and 42 degrees of freedom},\; p \simeq 0 \]


- To test the hypothesis that the slopes are equal:
  \[ \mathbf{L} = \begin{bmatrix} 0 & 1 & -1 \end{bmatrix} \]
  \[
  \mathbf{L}\mathbf{b} = \begin{bmatrix} 0 & 1 & -1 \end{bmatrix}
  \begin{bmatrix} -6.06466 \\ 0.59873 \\ 0.54583 \end{bmatrix}
  = 0.05290 \quad \text{(i.e., the difference in slopes)}
  \]
  \[
  F_0 = \frac{(\mathbf{L}\mathbf{b})'\left[\mathbf{L}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{L}'\right]^{-1}\mathbf{L}\mathbf{b}}{q\, s_e^2}
  = \frac{0.053 \left( \begin{bmatrix} 0 & 1 & -1 \end{bmatrix}
    \begin{bmatrix} 0.1021 & -0.0008 & -0.0008 \\ -0.0008 & 0.0001 & -0.0000 \\ -0.0008 & -0.0000 & 0.0001 \end{bmatrix}
    \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix} \right)^{-1} 0.053}
    {1 \times 178.7309}
  \]
  \[ = 0.068 \text{ with 1 and 42 degrees of freedom},\; p = .80 \]
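Both tests can be verified numerically from the $(\mathbf{X}'\mathbf{X})^{-1}$, $\mathbf{b}$, and $s_e^2$ values reported for Duncan's regression (a sketch):

```python
import numpy as np

# Quantities reported above for Duncan's regression.
XtX_inv = np.array([[ 0.1021058996, -0.0008495732, -0.0008432006],
                    [-0.0008495732,  0.0000801220, -0.0000476613],
                    [-0.0008432006, -0.0000476613,  0.0000540118]])
b = np.array([-6.06466, 0.59873, 0.54583])
s2 = 178.7309

def F_stat(L, q):
    # F0 = (Lb)'[L (X'X)^{-1} L']^{-1} (Lb) / (q * s_e^2), with c = 0.
    d = L @ b
    return float(d @ np.linalg.inv(L @ XtX_inv @ L.T) @ d) / (q * s2)

F_both  = F_stat(np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]), 2)  # both slopes 0
F_equal = F_stat(np.array([[0.0, 1.0, -1.0]]), 1)                  # equal slopes
```

Up to rounding in the reported matrices, `F_both` reproduces 101.22 and `F_equal` reproduces 0.068.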
