
Multiple Linear Regression
AMS 572 Group #2
Outline

Jinmiao Fu: Introduction and History
Ning Ma: Establishing and Fitting the Model
Ruoyu Zhou: Multiple Regression Model in Matrix Notation
Dawei Xu and Yuan Shang: Statistical Inference for Multiple Regression
Yu Mu: Regression Diagnostics
Chen Wang and Tianyu Lu: Topics in Regression Modeling
Tian Feng: Variable Selection Methods
Hua Mo: Chapter Summary and Modern Application

Introduction

Multiple linear regression models the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. Every value of the independent variables is associated with a value of the dependent variable y.

Example: the relationship between an adult's health and his/her daily intake of wheat, vegetables and meat.

History

Karl Pearson (1857–1936)

Lawyer, Germanist, eugenicist, mathematician and statistician. Contributions include the correlation coefficient, the method of moments, Pearson's system of continuous curves, chi distance and the p-value, statistical hypothesis testing theory and statistical decision theory, Pearson's chi-square test, and principal component analysis.

Sir Francis Galton FRS (16 February 1822 – 17 January 1911)

Anthropologist and polymath; doctoral student: Karl Pearson. In the late 1860s, Galton conceived the standard deviation. He created the statistical concept of correlation and also discovered the properties of the bivariate normal distribution and its relationship to regression.

Galton invented the use of the regression line (Bulmer 2003, p. 184), and was the first to describe and explain the common phenomenon of regression toward the mean, which he first observed in his experiments on the size of the seeds of successive generations of sweet peas.

The publication by his cousin Charles Darwin of The Origin of Species in 1859 was an event that changed Galton's life. He came to be gripped by the work, especially the first chapter on "Variation under Domestication," concerning the breeding of domestic animals.

Adrien-Marie Legendre (18 September 1752 – 10 January 1833) was a French mathematician. He made important contributions to statistics, number theory, abstract algebra and mathematical analysis. He developed the least squares method, which has broad application in linear regression, signal processing, statistics, and curve fitting.

Johann Carl Friedrich Gauss (30 April 1777 – 23 February 1855) was a German mathematician and scientist who contributed significantly to many fields, including number theory, statistics, analysis, differential geometry and geodesy.

Gauss, who was 23 at the time, heard about the problem and tackled it. After three months of intense work, he predicted a position for Ceres in December 1801 (just about a year after its first sighting), and this turned out to be accurate within a half-degree. In the process, he so streamlined the cumbersome mathematics of 18th-century orbital prediction that his work, published a few years later as Theory of Celestial Movement, remains a cornerstone of astronomical computation.

It introduced the Gaussian gravitational constant, and contained an influential treatment of the method of least squares, a procedure used in all sciences to this day to minimize the impact of measurement error. Gauss was able to prove the method in 1809 under the assumption of normally distributed errors (see Gauss–Markov theorem; see also Gaussian). The method had been described earlier by Adrien-Marie Legendre in 1805, but Gauss claimed that he had been using it since 1795.

Sir Ronald Aylmer Fisher FRS (17 February 1890 – 29 July 1962) was an English statistician, evolutionary biologist, eugenicist and geneticist. He was described by Anders Hald as "a genius who almost single-handedly created the foundations for modern statistical science," and Richard Dawkins described him as the greatest biologist since Darwin.

In addition to analysis of variance, Fisher invented the technique of maximum likelihood and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminator and Fisher information.

Establishing and Fitting the Model

Probabilistic Model

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n,$$

where $y_i$ is the observed value of the random variable (r.v.) $Y_i$, which depends on the fixed predictor values $x_{i1}, x_{i2}, \ldots, x_{ik}$; $\beta_0, \beta_1, \ldots, \beta_k$ are unknown model parameters; $n$ is the number of observations; and the random errors $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$.

Fitting the Model

The least squares (LS) method provides estimates of the unknown model parameters $\beta_0, \beta_1, \ldots, \beta_k$ that minimize

$$Q = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}) \right]^2.$$

Setting the partial derivatives $\partial Q/\partial \beta_j = 0$ (j = 0, 1, ..., k) gives the normal equations, whose solution is the LS estimates.

Tire tread wear vs. mileage (Example 11.1 in the textbook)

The table gives the measurements on the groove of one tire after every 4000 miles. Our goal: to build a model of the relationship between the mileage and the groove depth of the tire.

Mileage (in 1000 miles)   Groove Depth (in mils)
 0                        394.33
 4                        329.50
 8                        291.00
12                        255.17
16                        229.33
20                        204.83
24                        179.00
28                        163.83
32                        150.33

SAS code for fitting the model:

data example;
  input mile depth @@;
  sqmile = mile*mile;
  datalines;
0 394.33 4 329.5 8 291 12 255.17 16 229.33 20 204.83 24 179 28 163.83 32 150.33
;
run;

proc reg data=example;
  model depth = mile sqmile;
run;

The fitted quadratic model is Depth = 386.26 - 12.77 mile + 0.172 sqmile.

Goodness of Fit of the Model

Residuals: $e_i = y_i - \hat{y}_i$ $(i = 1, 2, \ldots, n)$, where the $\hat{y}_i$ are the fitted values

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik} \qquad (i = 1, 2, \ldots, n).$$

An overall measure of the goodness of fit:

Error sum of squares (SSE): $\min Q = SSE = \sum_{i=1}^{n} e_i^2$
Total sum of squares (SST): $SST = \sum (y_i - \bar{y})^2$
Regression sum of squares (SSR): $SSR = SST - SSE$

Multiple Regression Model in Matrix Notation

1. Transforming the formulas to matrix notation

Let $y = (y_1, \ldots, y_n)'$ and let $X$ be the $n \times (k+1)$ matrix of predictor values. The first column of $X$ is all 1's and denotes the constant term (we can treat it as a predictor $x_{i0} = 1$).

Finally, let $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$ and $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k)'$, where $\beta$ is the $(k+1) \times 1$ vector of unknown parameters and $\hat{\beta}$ the vector of their LS estimates.

The model formula then becomes $y = X\beta + \epsilon$, and the linear normal equations become $(X'X)\hat{\beta} = X'y$. Solving this equation with respect to $\hat{\beta}$, we get

$$\hat{\beta} = (X'X)^{-1}X'y$$

(if the inverse of the matrix exists).

2. Example 11.2 (Tire Wear Data: Quadratic Fit Using Hand Calculations)

We will do Example 11.1 again in this part, using the matrix approach, for the quadratic model to be fitted.

According to the formula $\hat{\beta} = (X'X)^{-1}X'y$, we need to calculate $X'X$ first, then invert it to get $(X'X)^{-1}$. Finally, we calculate the vector of LS estimates $\hat{\beta} = (X'X)^{-1}X'y$ (the intermediate matrices are shown on the original slides).

Therefore, the LS quadratic model is

$$\hat{y} = 386.26 - 12.77x + 0.172x^2,$$

which is the same as we obtained in Example 11.1.
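As a rough sketch (not part of the original slides), the same matrix computation can be reproduced in SAS with PROC IML; the data and the quadratic model are those of Example 11.1, and the variable names are ours.

proc iml;
  mile  = {0, 4, 8, 12, 16, 20, 24, 28, 32};
  depth = {394.33, 329.5, 291, 255.17, 229.33, 204.83, 179, 163.83, 150.33};
  X = j(nrow(mile), 1, 1) || mile || (mile##2);   /* columns: 1, x, x^2 */
  beta = inv(X`*X) * X` * depth;                  /* LS estimates (X'X)^{-1} X'y */
  print beta;
quit;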

Statistical Inference for Multiple Regression

Statistical Inference for Multiple Regression

To determine which predictor variables have statistically significant effects, we test the hypotheses

$$H_{0j}: \beta_j = 0 \quad \text{vs.} \quad H_{1j}: \beta_j \neq 0.$$

If we cannot reject $H_{0j}$, then $x_j$ is not a significant predictor of $y$.

Statistical Inference on the $\beta$'s

Review of statistical inference for simple linear regression:

$$\hat{\beta}_1 \sim N\!\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right), \qquad \frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0,1), \qquad \frac{(n-2)S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-2},$$

and therefore

$$t = \frac{N(0,1)}{\sqrt{W/(n-2)}} = \frac{(\hat{\beta}_1 - \beta_1)/(\sigma/\sqrt{S_{xx}})}{\sqrt{\dfrac{(n-2)S^2}{\sigma^2}\Big/(n-2)}} = \frac{\hat{\beta}_1 - \beta_1}{S/\sqrt{S_{xx}}} \sim t_{n-2}.$$

Statistical Inference on the $\beta$'s

What about multiple regression? The steps are similar:

$$\hat{\beta}_j \sim N(\beta_j, \sigma^2 V_{jj}), \qquad \frac{\hat{\beta}_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0,1), \qquad \frac{[n-(k+1)]S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-(k+1)},$$

and therefore

$$t = \frac{N(0,1)}{\sqrt{W/[n-(k+1)]}} = \frac{(\hat{\beta}_j - \beta_j)/(\sigma\sqrt{V_{jj}})}{\sqrt{\dfrac{[n-(k+1)]S^2}{\sigma^2}\Big/[n-(k+1)]}} = \frac{\hat{\beta}_j - \beta_j}{S\sqrt{V_{jj}}} \sim t_{n-(k+1)}.$$

Statistical Inference on the $\beta$'s

What is $V_{jj}$? Why is $\hat{\beta}_j \sim N(\beta_j, \sigma^2 V_{jj})$?

1. Mean. Recall from simple linear regression that the least squares estimators of the regression parameters $\beta_0$ and $\beta_1$ are unbiased: $E(\hat{\beta}_0) = \beta_0$, $E(\hat{\beta}_1) = \beta_1$. Here, the vector of least squares estimators is also unbiased:

$$E(\hat{\beta}_0) = \beta_0, \quad E(\hat{\beta}_1) = \beta_1, \quad \ldots, \quad E(\hat{\beta}_k) = \beta_k, \qquad \text{i.e.,} \quad E(\hat{\beta}) = \beta.$$

Statistical Inference on the $\beta$'s

2. Variance. Under the constant variance assumption $V(\epsilon_i) = \sigma^2$,

$$\operatorname{var}(Y) = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I.$$

Statistical Inference on the $\beta$'s

Write $\hat{\beta} = (X^T X)^{-1} X^T Y = CY$ with $C = (X^T X)^{-1} X^T$. Then

$$\operatorname{var}(\hat{\beta}) = \operatorname{var}(CY) = C \operatorname{var}(Y) C^T = (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}.$$

Let $V_{jj}$ be the $j$th diagonal entry of the matrix $(X^T X)^{-1}$; then $\operatorname{var}(\hat{\beta}_j) = \sigma^2 V_{jj}$.

Statistical Inference on the $\beta$'s

Summing up, $E(\hat{\beta}_j) = \beta_j$ and $\operatorname{var}(\hat{\beta}_j) = \sigma^2 V_{jj}$, so we get

$$\hat{\beta}_j \sim N(\beta_j, \sigma^2 V_{jj}), \qquad \frac{\hat{\beta}_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0,1).$$

Statistical Inference on the $\beta$'s

As in simple linear regression, the unbiased estimator of the unknown error variance $\sigma^2$ is given by

$$S^2 = \frac{SSE}{n-(k+1)} = \frac{\sum e_i^2}{n-(k+1)} = MSE \qquad \text{with } n-(k+1) \text{ d.f.},$$

$$W = \frac{[n-(k+1)]S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-(k+1)},$$

and $S^2$ and $\hat{\beta}_j$ are statistically independent.

Statistical Inference on the $\beta$'s

Therefore,

$$\frac{\hat{\beta}_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0,1), \qquad \frac{[n-(k+1)]S^2}{\sigma^2} \sim \chi^2_{n-(k+1)},$$

and combining the two,

$$t = \frac{(\hat{\beta}_j - \beta_j)/(\sigma\sqrt{V_{jj}})}{\sqrt{\dfrac{[n-(k+1)]S^2}{\sigma^2}\Big/[n-(k+1)]}} = \frac{\hat{\beta}_j - \beta_j}{S\sqrt{V_{jj}}} = \frac{\hat{\beta}_j - \beta_j}{SE(\hat{\beta}_j)} \sim t_{n-(k+1)},$$

where $SE(\hat{\beta}_j) = s\sqrt{v_{jj}}$.

Statistical Inference on the $\beta$'s

Derivation of the confidence interval for $\beta_j$:

$$P\!\left(-t_{n-(k+1),\alpha/2} \le \frac{\hat{\beta}_j - \beta_j}{SE(\hat{\beta}_j)} \le t_{n-(k+1),\alpha/2}\right) = 1-\alpha,$$

$$P\!\left(\hat{\beta}_j - t_{n-(k+1),\alpha/2}\,SE(\hat{\beta}_j) \le \beta_j \le \hat{\beta}_j + t_{n-(k+1),\alpha/2}\,SE(\hat{\beta}_j)\right) = 1-\alpha.$$

The $100(1-\alpha)\%$ confidence interval for $\beta_j$ is

$$\hat{\beta}_j \pm t_{n-(k+1),\alpha/2}\,SE(\hat{\beta}_j).$$

Statistical Inference on the $\beta$'s

An $\alpha$-level test of the hypotheses

$$H_{0j}: \beta_j = \beta_j^0 \quad \text{vs.} \quad H_{1j}: \beta_j \neq \beta_j^0$$

uses the critical value $c = t_{n-(k+1),\alpha/2}$, chosen so that $P(\text{Reject } H_{0j} \mid H_{0j} \text{ is true}) = P(|t_j| > c) = \alpha$. The test rejects $H_{0j}$ if

$$|t_j| = \left|\frac{\hat{\beta}_j - \beta_j^0}{SE(\hat{\beta}_j)}\right| > t_{n-(k+1),\alpha/2}.$$
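As a rough sketch (not on the original slides), the individual t-tests and the 100(1-alpha)% confidence intervals for the beta's can be obtained in SAS with the CLB option of PROC REG; here we reuse the tire data set example built earlier.

proc reg data=example;
  /* Parameter Estimates table gives each t_j and its p-value; CLB adds the CIs */
  model depth = mile sqmile / clb alpha=0.05;
run;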

Prediction of Future Observations

Having fitted a multiple regression model, suppose we wish to predict the future value of $Y^*$ for a specified vector of predictor values $x^* = (x_0^*, x_1^*, \ldots, x_k^*)'$.

One way is to estimate $E(Y^*)$ by a confidence interval (CI).

Prediction of Future Observations

$$\hat{\mu}^* = \hat{\beta}_0 + \hat{\beta}_1 x_1^* + \cdots + \hat{\beta}_k x_k^* = (x^*)^T\hat{\beta} \quad \text{estimates} \quad \mu^* = E(Y^*),$$

$$\operatorname{Var}\!\left[(x^*)^T\hat{\beta}\right] = (x^*)^T \operatorname{Var}(\hat{\beta})\, x^* = \sigma^2 (x^*)^T (X^T X)^{-1} x^* = \sigma^2 (x^*)^T V x^*.$$

Replacing $\sigma^2$ by its estimate $s^2 = MSE$, which has $n-(k+1)$ d.f., and using the same methods as in simple linear regression, a $(1-\alpha)$-level CI for $\mu^*$ is given by

$$\hat{\mu}^* - t_{n-(k+1),\alpha/2}\, s\sqrt{(x^*)^T V x^*} \;\le\; \mu^* \;\le\; \hat{\mu}^* + t_{n-(k+1),\alpha/2}\, s\sqrt{(x^*)^T V x^*}.$$
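As a hedged sketch (ours, not the slides'), SAS can compute these intervals directly: the OUTPUT statement of PROC REG returns the CI for the mean response (LCLM/UCLM) and the prediction interval for an individual future Y* (LCL/UCL). The tire data set example is assumed again.

proc reg data=example;
  model depth = mile sqmile;
  output out=pred p=fitted
         lclm=ci_lower  uclm=ci_upper    /* CI for the mean response   */
         lcl=pi_lower   ucl=pi_upper;    /* prediction interval for Y* */
run;

proc print data=pred; run;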

F-Test for the $\beta_j$'s

Consider

$$H_0: \beta_1 = \cdots = \beta_k = 0 \quad \text{vs.} \quad H_1: \text{at least one } \beta_j \neq 0.$$

Here $H_0$ is the overall null hypothesis, which states that none of the $x$ variables are related to $y$. The alternative states that at least one of them is related.

How to Build an F-Test

The test statistic $F = MSR/MSE$ follows an F-distribution with $k$ and $n-(k+1)$ d.f. The $\alpha$-level test rejects $H_0$ if

$$F = \frac{MSR}{MSE} > f_{k,\,n-(k+1),\,\alpha}.$$

Recall that the error mean square is

$$MSE = \frac{\sum_{i=1}^{n} e_i^2}{n-(k+1)},$$

with $n-(k+1)$ degrees of freedom.

The Relation Between F and $r^2$

F can be written as a function of $r^2$. Using the formulas $SSR = r^2\,SST$ and $SSE = (1-r^2)\,SST$, F can be expressed as

$$F = \frac{r^2\,[n-(k+1)]}{k\,(1-r^2)}.$$

We see that F is an increasing function of $r^2$ and tests its significance.

Analysis of Variance (ANOVA)

The relation between SST, SSR and SSE is

$$SST = SSR + SSE,$$

where

$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2.$$

The corresponding degrees of freedom (d.f.) are

$$d.f.(SST) = n-1, \qquad d.f.(SSR) = k, \qquad d.f.(SSE) = n-(k+1).$$

ANOVA Table for Multiple Regression

Source of Variation | Sum of Squares (SS) | Degrees of Freedom (d.f.) | Mean Square (MS)     | F
Regression          | SSR                 | k                         | MSR = SSR/k          | MSR/MSE
Error               | SSE                 | n-(k+1)                   | MSE = SSE/[n-(k+1)]  |
Total               | SST                 | n-1                       |                      |

This table gives us a clear view of the analysis of variance for multiple regression.

Extra Sum of Squares Method for Testing Subsets of Parameters

Previously we considered the full model with k predictors. Now consider the partial model

$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \epsilon_i \qquad (i = 1, 2, \ldots, n),$$

in which the last m coefficients are set to zero. We can test these m coefficients to check their significance:

$$H_0: \beta_{k-m+1} = \cdots = \beta_k = 0 \quad \text{vs.} \quad H_1: \text{at least one of } \beta_{k-m+1}, \ldots, \beta_k \neq 0.$$

Building an F-Test Using the Extra Sum of Squares Method

Let $SSR_{k-m}$ and $SSE_{k-m}$ be the regression and error sums of squares for the partial model. Since

$$SST = SSR_{k-m} + SSE_{k-m} = SSR_k + SSE_k$$

is fixed regardless of the particular model,

$$SSE_{k-m} - SSE_k = SSR_k - SSR_{k-m}.$$

Then we have the test: reject $H_0$ if

$$F = \frac{(SSE_{k-m} - SSE_k)/m}{SSE_k/[n-(k+1)]} > f_{m,\,n-(k+1),\,\alpha}.$$
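As a sketch under assumed names (a data set mydata with response y and predictors x1-x4; none of these come from the slides), this partial F-test can be carried out in SAS with the TEST statement of PROC REG, which compares the full model against the model with the listed coefficients set to zero.

proc reg data=mydata;
  model y = x1 x2 x3 x4;
  /* extra-SS (partial) F-test of H0: beta3 = beta4 = 0, with m = 2 */
  subset: test x3 = 0, x4 = 0;
run;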

Remarks on the F-Test

The numerator d.f. is m, the number of coefficients set to zero under $H_0$, while the denominator d.f. is n-(k+1), the error d.f. for the full model. The MSE in the denominator is the normalizing factor, which is an estimate of $\sigma^2$ for the full model.

Links Between ANOVA and the Extra Sum of Squares Method

Setting m = k (so the partial model contains only the intercept), we have

$$SSE_0 = \sum_{i=1}^{n}(y_i - \bar{y})^2 = SST, \qquad SSE_k = SSE.$$

From the above we can derive

$$SSE_0 - SSE_k = SST - SSE = SSR.$$

Hence the F-ratio equals

$$F = \frac{SSR/k}{SSE/[n-(k+1)]} = \frac{MSR}{MSE}$$

with k and n-(k+1) d.f.

Regression Diagnostics

5. Regression Diagnostics
5.1 Checking the Model Assumptions

Plots of the residuals against individual predictor variables: check for linearity.
A plot of the residuals against the fitted values: check for constant variance.
A normal plot of the residuals: check for normality.
A run chart of the residuals: check whether the random errors are autocorrelated.
Plots of the residuals against any omitted predictor variables: check whether any of the omitted predictor variables should be included in the model.

Example: plots of the residuals against individual predictor variables (figure and SAS code shown on the original slides).

Example: plot of the residuals against the fitted values (figure and SAS code shown on the original slides).

Example: normal plot of the residuals (figure and SAS code shown on the original slides).
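The slide's own SAS code is only available as images, so the statements below are a rough reconstruction of how these diagnostic plots could be produced for the tire data; the variable and data set names are ours.

proc reg data=example;
  model depth = mile sqmile;
  output out=diag p=fitted r=resid student=stdres h=leverage;
run;

/* residuals vs. an individual predictor and vs. the fitted values */
proc sgplot data=diag;
  scatter x=mile y=resid;
  refline 0 / axis=y;
run;

proc sgplot data=diag;
  scatter x=fitted y=resid;
  refline 0 / axis=y;
run;

/* normal (Q-Q) plot of the residuals */
proc univariate data=diag;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;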

5.2 Checking for Outliers and Influential Observations

Standardized residuals:

$$e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{s\sqrt{1-h_{ii}}}.$$

Large $|e_i^*|$ values indicate outlier observations.

Hat matrix:

$$H = X(X^T X)^{-1} X^T.$$

If the hat matrix diagonal $h_{ii} > \dfrac{2(k+1)}{n}$, then the $i$th observation is influential.

Example: graphical exploration of outliers and a leverage plot (figures shown on the original slides).
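As a small sketch (ours, not the slides'), the rule h_ii > 2(k+1)/n can be checked directly from the diag output data set created above; for the tire data k = 2 and n = 9.

data influential;
  set diag;
  cutoff = 2*(2+1)/9;                      /* 2(k+1)/n with k = 2, n = 9 */
  if leverage > cutoff then influential = 1;
  else influential = 0;
run;

proc print data=influential;
  var mile depth leverage stdres influential;
run;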

5.3 Data Transformation

Transformations of the variables (both y and the x's) are often necessary to satisfy the assumptions of linearity, normality, and constant error variance. Many seemingly nonlinear models can be written in the multiple linear regression form after a suitable transformation. For example, the multiplicative model

$$y = \beta_0\, x_1^{\beta_1} x_2^{\beta_2}$$

becomes, after taking logarithms,

$$\log y = \log\beta_0 + \beta_1 \log x_1 + \beta_2 \log x_2, \qquad \text{or} \qquad y^* = \beta_0^* + \beta_1^* x_1^* + \beta_2^* x_2^*.$$

Topics in Regression Modeling

Multicollinearity

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Examples of multicollinear predictors are height and weight of a person, years of education and income, and assessed value and square footage of a home.

Consequences of high multicollinearity:
a. Increased standard errors of the estimates of the $\beta$'s.
b. Often confusing and misleading results.

Detecting Multicollinearity

Easy way: compute the correlations between all pairs of predictors. If some of them are close to 1 or -1, remove one of the two correlated predictors from the model. (The slide illustrates this with a schematic correlation matrix for X1, X2, X3 in which the correlation between X1 and X2 equals 1, so X1 and X2 are collinear, while X2 and X3 are independent.)

Detecting Multicollinearity

Another way: calculate the variance inflation factor for each predictor $x_j$,

$$VIF_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination of the model that regresses $x_j$ on all the other predictors. If $VIF_j > 10$, then there is a multicollinearity problem.
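In SAS, the VIF option of PROC REG prints these factors; a minimal sketch (using the cement data set example1 that appears later in these slides) would be:

proc reg data=example1;
  model y = x1 x2 x3 x4 / vif tol;   /* VIF and tolerance (1/VIF) for each predictor */
run;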

Multicollinearity: Example

See Example 11.5 on page 416. The response is the heat evolved by cement on a per-gram basis (y), and the predictors are tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3) and dicalcium silicate (x4).

Multicollinearity: Example

Estimated parameters in the first-order model:

$$\hat{y} = 62.4 + 1.55x_1 + 0.510x_2 + 0.102x_3 - 0.144x_4.$$

F = 111.48 with p-value below 0.0001. The individual t statistics (p-values) are 2.08 (0.071), 0.70 (0.501), 0.14 (0.896) and -0.20 (0.844). Note that the sign on $\hat{\beta}_4$ is the opposite of what is expected, and such a high F value would suggest more than just one significant predictor.

Multicollinearity: Example

The pairwise correlations include $r_{13} = -0.824$ and $r_{24} = -0.973$, and the VIFs are all greater than 10. So there is a multicollinearity problem in this model, and we need an algorithm to help us select the variables that are necessary.

Multicollinearity: Subset Selection

Algorithms for selecting subsets:

All possible subsets
Only feasible with a small number of potential predictors (maybe 10 or fewer); one can then use one or more of the possible numerical criteria to find the overall best subset.

Leaps and bounds method
Identifies the best subsets for each value of p; requires fewer variables than observations; can be quite effective for medium-sized data sets; it is an advantage to have several slightly different models to compare.

Multicollinearity: Subset Selection

Forward stepwise regression
Start with no predictors. First include the predictor with the highest correlation with the response. In subsequent steps, add the predictor with the highest partial correlation with the response, controlling for the variables already in the equation. Stop when the numerical criterion signals a maximum (minimum); sometimes variables are eliminated when their t value gets too small. This is the only feasible method for very large predictor pools, but it optimizes locally at each step, with no guarantee of finding the overall optimum.

Backward elimination
Start with all predictors in the equation, remove the predictor with the smallest t value, and continue until the numerical criterion signals a maximum (minimum). It often produces a different final model than the forward stepwise method.

Multicollinearity: Best Subsets Criteria

Numerical criteria for choosing the best subsets: there is no single generally accepted criterion, and none should be followed too mindlessly. Most common criteria combine a measure of fit with an added penalty for increasing complexity (number of predictors).

Coefficient of determination (ordinary multiple R-square)
Always increases with an increasing number of predictors, so it is not very good for comparing models with different numbers of predictors.

Adjusted R-square
Will decrease if the increase in R-square with increasing p is small.

Multicollinearity: Best Subsets Criteria

Residual mean square (MSEp)
Equivalent to adjusted R-square, except one looks for the minimum. The minimum occurs when an added variable does not decrease the error sum of squares enough to offset the loss of an error degree of freedom.

Mallows' Cp statistic
Should be about equal to p; look for small values near p. Requires an estimate of the overall error variance.

PRESS statistic
The subset associated with the minimum value of PRESSp is chosen. Intuitively easier to grasp than the Cp criterion.

Multicollinearity: Forward Stepwise

First include the predictor with the highest correlation with the response; enter it if its F statistic exceeds $F_{IN} = 4$.

In subsequent steps, add the predictor with the highest partial correlation with the response, controlling for the variables already in the equation: enter $x_i$ if $F_i > F_{IN} = 4$, and remove $x_i$ if $F_i < F_{OUT} = 4$. (The step-by-step output for the cement data is shown as tables on the original slides.)

Multicollinearity: Forward Stepwise

Summarizing the stepwise algorithm, our best model should include only x1 and x2:

$$\hat{y} = 52.5773 + 1.4683x_1 + 0.6623x_2.$$

Multicollinearity: Forward Stepwise

Checking the significance of the model and of the individual parameters again, we find that the p-values are all small and each VIF is far less than 10 (see the SAS sketch below).
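A hedged sketch of how this selection and the follow-up check could be run in SAS (PROC REG uses significance levels SLENTRY/SLSTAY rather than the F_IN/F_OUT thresholds quoted above, so the correspondence is only approximate; example1 is the cement data set defined later in these slides):

proc reg data=example1;
  /* stepwise selection, roughly mirroring F_IN = F_OUT = 4 */
  stepsel: model y = x1 x2 x3 x4 / selection=stepwise slentry=0.05 slstay=0.05;
  /* re-check the chosen model: t-tests, CIs and VIFs */
  chosen:  model y = x1 x2 / clb vif;
run;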

Multicollinearity: Best Subsets

Alternatively, we can stop when a numerical criterion signals a maximum (minimum), and sometimes eliminate variables when their t value gets too small.

Multicollinearity: Best Subsets

The largest R-square value, 0.9824, is associated with the full model. The best subset minimizing the Cp criterion includes x1 and x2. The subset maximizing the adjusted R-square (equivalently, minimizing MSEp) is {x1, x2, x4}, but the adjusted R-square increases only from 0.9744 to 0.9763 by the addition of x4 to the model already containing x1 and x2. Thus the simpler model chosen by the Cp criterion is preferred, with fitted model

$$\hat{y} = 52.5773 + 1.4683x_1 + 0.6623x_2.$$

Polynomial Models

Polynomial models are useful in situations where the analyst knows that curvilinear effects are present in the true response function. With more than one explanatory variable, a (second-degree) polynomial regression model has the form

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \epsilon.$$

Multicollinearity in Polynomial Models

Multicollinearity is a problem in polynomial regression (with terms of second and higher order): x and $x^2$ tend to be highly correlated. A special solution in polynomial models is to use $z_i = x_i - \bar{x}_i$ instead of $x_i$: first subtract each predictor's mean from it, and then use the deviations in the model.

Multicollinearity in Polynomial Models

Example: for x = 2, 3, 4, 5, 6 we have $x^2$ = 4, 9, 16, 25, 36; as x increases, so does $x^2$, and $r_{x,x^2} = 0.98$. With $\bar{x} = 4$, the centered values are z = -2, -1, 0, 1, 2 and $z^2$ = 4, 1, 0, 1, 4. Thus z and $z^2$ are no longer correlated: $r_{z,z^2} = 0$.

We can recover the estimates of the $\beta$'s (the coefficients in terms of x) from the estimates of the $\gamma$'s (the coefficients in terms of z), since substituting $z = x - \bar{x}$ into $y = \gamma_0 + \gamma_1 z + \gamma_2 z^2$ and expanding gives $\beta_2 = \gamma_2$, $\beta_1 = \gamma_1 - 2\gamma_2\bar{x}$ and $\beta_0 = \gamma_0 - \gamma_1\bar{x} + \gamma_2\bar{x}^2$.
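A small sketch of centering in SAS, applied to the tire data from earlier (the mean mileage there is 16; the variable names are ours):

data example_c;
  set example;
  z   = mile - 16;     /* centered mileage */
  zsq = z*z;
run;

proc reg data=example_c;
  model depth = z zsq;   /* same fit as before, but z and zsq are uncorrelated */
run;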

Dummy Predictor Variables

The dummy variable is a simple and useful method of introducing into a regression analysis information contained in variables that are not conventionally measured on a numerical scale, e.g., race, gender, region, etc.

Dummy Predictor Variables

The categories of an ordinal variable can be assigned suitable numerical scores. A nominal variable with $c \ge 2$ categories can be coded using $c-1$ indicator variables, $X_1, \ldots, X_{c-1}$, called dummy variables:

$X_i = 1$ for the $i$th category and 0 otherwise; $X_1 = \cdots = X_{c-1} = 0$ for the $c$th category.

Dummy Predictor Variables

If y is a worker's salary and

$D_i = 1$ if a non-smoker, $D_i = 0$ if a smoker,

we can model this in the following way:

$$y_i = \alpha + \beta D_i + u_i.$$

Dummy Predictor Variables

Equally, we could use the dummy variable in a model with other explanatory variables. In addition to the dummy variable, we could add years of experience (x), to give

$$y_i = \alpha + \beta D_i + \gamma x_i + u_i,$$

$$E(y_i) = (\alpha + \beta) + \gamma x \quad \text{for a non-smoker}, \qquad E(y_i) = \alpha + \gamma x \quad \text{for a smoker}.$$

Dummy Predictor Variables

(Figure on the original slide: salary y versus experience x, with two parallel lines; the non-smoker line lies above the smoker line.)

Dummy Predictor Variables

We can also add the interaction between smoking and experience with respect to their effects on salary:

$$y_i = \alpha + \beta D_i + \gamma x_i + \delta D_i x_i + u_i,$$

$$E(y_i) = (\alpha + \beta) + (\gamma + \delta) x \quad \text{for a non-smoker}, \qquad E(y_i) = \alpha + \gamma x \quad \text{for a smoker}.$$

Dummy Predictor Variables

(Figure on the original slide: salary y versus experience x, with the non-smoker and smoker lines now differing in both intercept and slope.)
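A minimal SAS sketch of the interaction model (assuming a data set salary_data that already contains salary, the smoking dummy d and experience x; all names are hypothetical):

data salary2;
  set salary_data;
  dx = d*x;                 /* smoking-by-experience interaction */
run;

proc reg data=salary2;
  model salary = d x dx;    /* y = alpha + beta*d + gamma*x + delta*d*x + u */
run;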

Standardized Regression Coefficients

We typically want to compare predictors in terms of the magnitudes of their effects on the response variable. We use standardized regression coefficients to judge the effects of predictors with different units.

Standardized Regression Coefficients

They are the LS parameter estimates obtained by running a regression on the standardized variables, defined as follows:

$$y_i^* = \frac{y_i - \bar{y}}{s_y}, \qquad x_{ij}^* = \frac{x_{ij} - \bar{x}_j}{s_{x_j}} \qquad (i = 1, 2, \ldots, n;\; j = 1, 2, \ldots, k),$$

where $s_y$ and $s_{x_j}$ are the sample SDs of $y$ and $x_j$.

Standardized Regression Coefficients

We have $\hat{\beta}_0^* = 0$ and

$$\hat{\beta}_j^* = \hat{\beta}_j \left( \frac{s_{x_j}}{s_y} \right) \qquad (j = 1, 2, \ldots, k).$$

The magnitudes of the $\hat{\beta}_j^*$ can be directly compared to judge the relative effects of the $x_j$ on $y$.

Standardized Regression Coefficients

Since $\hat{\beta}_0^* = 0$, the constant can be dropped from the model. Let $y^*$ be the vector of the $y_i^*$'s and $x^*$ the $n \times k$ matrix of the $x_{ij}^*$'s. Then

$$\frac{1}{n-1}\, x^{*\prime} x^* = R = \begin{pmatrix} 1 & r_{x_1 x_2} & \cdots & r_{x_1 x_k} \\ r_{x_2 x_1} & 1 & \cdots & r_{x_2 x_k} \\ \vdots & \vdots & \ddots & \vdots \\ r_{x_k x_1} & r_{x_k x_2} & \cdots & 1 \end{pmatrix}, \qquad \frac{1}{n-1}\, x^{*\prime} y^* = r = \begin{pmatrix} r_{y x_1} \\ r_{y x_2} \\ \vdots \\ r_{y x_k} \end{pmatrix}.$$

Standardized Regression Coefficients

So we can get

$$\hat{\beta}^* = \begin{pmatrix} \hat{\beta}_1^* \\ \vdots \\ \hat{\beta}_k^* \end{pmatrix} = (x^{*\prime} x^*)^{-1} x^{*\prime} y^* = R^{-1} r.$$

This method of computing the $\hat{\beta}_j^*$'s is numerically more stable than computing the $\hat{\beta}_j$'s directly, because all entries of R and r are between -1 and 1.

Standardized Regression Coefficients

Example (given on page 424): from the calculation we obtain $\hat{\beta}_1 = 0.19244$ and $\hat{\beta}_2 = 0.3406$, and the sample standard deviations of $x_1$, $x_2$ and $y$ are $s_{x_1} = 6.830$, $s_{x_2} = 0.641$, $s_y = 1.501$. Then

$$\hat{\beta}_1^* = \hat{\beta}_1 \left(\frac{s_{x_1}}{s_y}\right) = 0.875, \qquad \hat{\beta}_2^* = \hat{\beta}_2 \left(\frac{s_{x_2}}{s_y}\right) = 0.105.$$

Note that $\hat{\beta}_1^* > \hat{\beta}_2^*$ although $\hat{\beta}_1 < \hat{\beta}_2$. Thus $x_1$ has a larger effect than $x_2$ on $y$.

Standardized Regression Coefficients

We can also use the matrix method to compute the standardized regression coefficients. First we compute the correlation matrix between $x_1$, $x_2$ and $y$, which gives $r_{x_1 x_2} = 0.913$, $r_{y x_1} = 0.971$ and $r_{y x_2} = 0.904$. Then

$$R = \begin{pmatrix} 1 & 0.913 \\ 0.913 & 1 \end{pmatrix}, \qquad r = \begin{pmatrix} 0.971 \\ 0.904 \end{pmatrix}.$$

Next we calculate

$$R^{-1} = \frac{1}{1 - r_{x_1 x_2}^2} \begin{pmatrix} 1 & -r_{x_1 x_2} \\ -r_{x_1 x_2} & 1 \end{pmatrix} = \begin{pmatrix} 6.009 & -5.486 \\ -5.486 & 6.009 \end{pmatrix}.$$

Hence

$$\hat{\beta}^* = R^{-1} r = \begin{pmatrix} 0.875 \\ 0.105 \end{pmatrix},$$

which is the same result as before.
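In practice one rarely computes these by hand; a hedged sketch (assuming a data set mydata holding y, x1 and x2, names ours) uses the STB option of PROC REG, which prints the standardized estimates alongside the ordinary ones:

proc reg data=mydata;
  model y = x1 x2 / stb;   /* STB adds the standardized regression coefficients */
run;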

Variable Selection Methods

How to decide their salaries?

Lionel Messi: 23 years old, attacker, 5 years, more than 20 goals per year, 10,000,000 EURO/yr.
Carles Puyol: 32 years old, defender, 11 years, less than 1 goal per year, 5,000,000 EURO/yr.

How to select variables?

1) Stepwise Regression
2) Best Subset Regression

Stepwise Regression: partial F-test, partial correlation coefficients, how to do it in SAS, and drawbacks.

Partial F-Test

(p-1)-variable model:
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i$$

p-variable model:
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \beta_p x_{ip} + \epsilon_i$$

How to do the test?

$$H_{0p}: \beta_p = 0 \quad \text{vs.} \quad H_{1p}: \beta_p \neq 0.$$

We reject $H_{0p}$ in favor of $H_{1p}$ at level $\alpha$ if

$$F_p = \frac{(SSE_{p-1} - SSE_p)/1}{SSE_p/[n-(p+1)]} > f_{1,\,n-(p+1),\,\alpha}.$$

Another way to interpret the test: use the test statistic

$$t_p = \frac{\hat{\beta}_p}{SE(\hat{\beta}_p)}, \qquad t_p^2 = F_p.$$

We reject $H_{0p}$ at level $\alpha$ if $|t_p| > t_{n-(p+1),\,\alpha/2}$.

Partial Correlation Coefficients

$$r^2_{yx_p \mid x_1,\ldots,x_{p-1}} = \frac{SSE_{p-1} - SSE_p}{SSE_{p-1}} = \frac{SSE(x_1,\ldots,x_{p-1}) - SSE(x_1,\ldots,x_p)}{SSE(x_1,\ldots,x_{p-1})}.$$

The test statistic can be written as

$$F_p = t_p^2 = \frac{r^2_{yx_p \mid x_1,\ldots,x_{p-1}}\,[n-(p+1)]}{1 - r^2_{yx_p \mid x_1,\ldots,x_{p-1}}}.$$

Add $x_p$ to the regression equation that already includes $x_1, \ldots, x_{p-1}$ only if $F_p$ is large enough.

How to do it in SAS? (Example 11.9, a continuation of Example 11.5)

The table shows data on the heat evolved in calories during the hardening of cement on a per-gram basis (y), along with the percentages of four ingredients: tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).

No.  x1  x2  x3  x4      y
 1    7  26   6  60   78.5
 2    1  29  15  52   74.3
 3   11  56   8  20  104.3
 4   11  31   8  47   87.6
 5    7  52   6  33   95.9
 6   11  55   9  22  109.2
 7    3  71  17   6  102.7
 8    1  31  22  44   72.5
 9    2  54  18  22   93.1
10   21  47   4  26  115.9
11    1  40  23  34   83.8
12   11  66   9  12  113.3
13   10  68   8  12  109.4

SAS code:

data example1;
  input x1 x2 x3 x4 y;
  datalines;
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
;
run;

proc reg data=example1;
  model y = x1 x2 x3 x4 / selection=stepwise;
run;

SAS output (stepwise selection summary, shown as tables on the original slides).

Interpretation

At the first step, x4 is chosen into the equation because it has the largest correlation with y among the four predictors.

At the second step, we choose x1 into the equation because it has the highest partial correlation with y controlling for x4.

At the third step, since $r_{yx_2 \mid x_4, x_1}$ is greater than $r_{yx_3 \mid x_4, x_1}$, x2 is chosen into the equation rather than x3.

Interpretation

At the fourth step, we remove x4 from the model since its partial F-statistic is too small.

From Example 11.5 we know that x4 is highly correlated with x2. Note that in Step 4 the R-square is 0.9787, which is slightly higher than 0.9725, the R-square of Step 2. This indicates that even though x4 is the best single predictor of y, the pair (x1, x2) is a better predictor than the pair (x1, x4).

Drawbacks

The final model is not guaranteed to be optimal in any specified sense. The method yields a single final model, while in practice there are often several equally good models.

Best Subsets Regression

Comparison with the stepwise method, optimality criteria, and how to do it in SAS.

Comparison with Stepwise Regression

In best subsets regression, a subset of variables is chosen that optimizes a well-defined objective criterion. The best subsets regression algorithm permits determination of a specified number of best subsets, from which the choice of the final model can be made by the investigator.

Optimality Criteria

$r_p^2$ criterion:

$$r_p^2 = \frac{SSR_p}{SST} = 1 - \frac{SSE_p}{SST}.$$

Adjusted $r_p^2$ criterion:

$$r_{adj,p}^2 = 1 - \frac{SSE_p/[n-(p+1)]}{SST/(n-1)} = 1 - \frac{MSE_p}{MST}.$$

Optimality Criteria

$C_p$ criterion. The standardized mean square error of prediction is

$$\Gamma_p = \frac{1}{\sigma^2} \sum_{i=1}^{n} E\!\left[\left(\hat{Y}_{ip} - E(Y_i)\right)^2\right].$$

$\Gamma_p$ involves unknown parameters, such as the $\beta_j$'s and $\sigma^2$, so we minimize a sample estimate of $\Gamma_p$, Mallows' $C_p$ statistic:

$$C_p = \frac{SSE_p}{\hat{\sigma}^2} + 2(p+1) - n.$$

Optimality Criteria

In practice, we often use the $C_p$ criterion because of its ease of computation and its ability to judge the predictive power of a model.

How to do it in SAS? (Example 11.9)

proc reg data=example1;
  model y = x1 x2 x3 x4 / selection=adjrsq mse cp;
run;

SAS output (best subsets summary table, shown on the original slides).

Interpretation

The best subset minimizing the $C_p$ criterion is {x1, x2}, which is the same model selected by stepwise regression in the previous example.

The subset maximizing $r_{adj,p}^2$ is {x1, x2, x4}. However, $r_{adj,p}^2$ increases only from 0.9744 to 0.9763 by the addition of x4 to the model that already contains x1 and x2.

Thus, the model chosen by the $C_p$ criterion is preferred.

Chapter Summary and Modern Application

Multiple Regression Model (extension of simple regression):

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,$$

where $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are unknown parameters.

Fitting the MLR model by the least squares method:

$$Q = \sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})\right]^2,$$

$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right] = 0,$$

$$\frac{\partial Q}{\partial \beta_j} = -2\sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right]x_{ij} = 0.$$

Goodness of fit of the model: $r^2 = SSR/SST$.

MLR model in matrix notation:

$$Y = X\beta + \epsilon, \qquad \hat{\beta} = (X'X)^{-1}X'Y.$$

Statistical inference on the $\beta$'s:

Hypotheses: $H_{0j}: \beta_j = 0$ vs. $H_{1j}: \beta_j \neq 0$. Test statistic:

$$T = \frac{Z}{\sqrt{W/[n-(k+1)]}} = \frac{\hat{\beta}_j - \beta_j}{S\sqrt{v_{jj}}} \sim t_{n-(k+1)}.$$

Overall test. Hypotheses: $H_0: \beta_1 = \cdots = \beta_k = 0$ vs. $H_a:$ at least one $\beta_j \neq 0$. Test statistic:

$$F = \frac{MSR}{MSE} = \frac{r^2\,[n-(k+1)]}{k\,(1-r^2)}.$$

Regression diagnostics: residual analysis, data transformation.

The general hypothesis test: compare

the full model: $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$,
the partial model: $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \epsilon_i$.

Hypotheses: $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ vs. $H_a:$ at least one of these $\beta_j \neq 0$. Test statistic:

$$F_0 = \frac{(SSE_{k-m} - SSE_k)/m}{SSE_k/[n-(k+1)]} \sim f_{m,\,n-(k+1)};$$

reject $H_0$ when $F_0 > f_{m,\,n-(k+1),\,\alpha}$.

Estimating and predicting future observations: let $x^* = (x_0^*, x_1^*, \ldots, x_k^*)'$ and

$$\hat{\mu}^* = \hat{Y}^* = \hat{\beta}_0 + \hat{\beta}_1 x_1^* + \cdots + \hat{\beta}_k x_k^* = (x^*)'\hat{\beta}, \qquad T = \frac{\hat{\mu}^* - \mu^*}{s\sqrt{(x^*)'Vx^*}} \sim t_{n-(k+1)}.$$

CI for the estimated mean $\mu^*$: $\hat{\mu}^* \pm t_{n-(k+1),\,\alpha/2}\, s\sqrt{(x^*)'Vx^*}$.

PI for the estimated $Y^*$: $\hat{Y}^* \pm t_{n-(k+1),\,\alpha/2}\, s\sqrt{1 + (x^*)'Vx^*}$.

Topics in regression modeling: multicollinearity, polynomial regression, dummy predictor variables, logistic regression model.

Variable selection methods:

Partial F-test and partial correlation coefficient:

$$r^2_{yx_p \mid x_1,\ldots,x_{p-1}} = \frac{SSE_{p-1} - SSE_p}{SSE_{p-1}}, \qquad F_p = \frac{r^2_{yx_p \mid x_1,\ldots,x_{p-1}}\,[n-(p+1)]}{1 - r^2_{yx_p \mid x_1,\ldots,x_{p-1}}}.$$

Stepwise regression (the stepwise regression algorithm) and best subsets regression; strategy for building an MLR model.

Application of the MLR Model

Linear regression is widely used in biology, chemistry, finance and the social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.

Typical application areas include financial markets, housing prices, biology (heredity), and chemistry.

Example

Broadly speaking, an asset pricing model can be expressed as

$$r_i = a_i + b_{i1} f_1 + b_{i2} f_2 + \cdots + b_{ik} f_k + \epsilon_i,$$

where $r_i$, $f_k$ and $k$ denote the expected return on asset $i$, the $k$th risk factor and the number of risk factors, respectively, and $\epsilon_i$ denotes the specific return on asset $i$.

The equation can also be expressed in matrix notation, $r = a + Bf + \epsilon$, where $B$ is called the factor loading matrix.

What factors are the most important? Candidates include GDP, the inflation rate, the interest rate, the rate of return on the market portfolio, the employment rate, and government policies.

Method

Step 1: Find the efficient factors (EM algorithm, maximum likelihood).
Step 2: Fit the model and estimate the factor loadings (multiple linear regression).

By running the data through a multiple linear regression in SAS, we can obtain the factor loadings and the coefficient of multiple determination $r^2$. From the SAS output we can identify the factors that most affect the return, build an appropriate multiple-factor model, and then use the model to predict the future return and make a good choice!
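A hedged sketch of such a factor regression in SAS (the data set returns and the factor names gdp, inflation, interest and market are hypothetical, chosen only to mirror the candidate factors listed above):

proc reg data=returns;
  /* stepwise selection keeps only the factors with significant loadings */
  model r = gdp inflation interest market / selection=stepwise;
run;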

Questions?

Thank you!
