Section 3

STAT318 Data Mining
Dr. Blair Robertson

University of Canterbury, Christchurch, New Zealand
Semester 2, 2016
Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with Applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
B. Robertson, University of Canterbury
STAT318 Data Mining
Linear regression
Linear regression is a simple parametric approach to supervised learning that assumes there is an approximately linear relationship between the predictors $X_1, X_2, \ldots, X_p$ and the response $Y$.
Although true regression functions are never linear, linear regression is an extremely useful and widely used method.
Linear regression: advertising data
[Figure: Sales plotted against TV, Radio, and Newspaper.]
Simple linear regression
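In simple (one predictor) linear regression, we assume a model
$$Y = \beta_0 + \beta_1 X + \epsilon,$$
where $\beta_0$ and $\beta_1$ are two unknown parameters and $\epsilon$ is an error term with $E(\epsilon) = 0$.
Given some parameter estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, the prediction of $Y$ at $X = x$ is given by
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$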
Estimating the parameters: least squares approach

Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction of $Y$ at $X = x_i$, the predictor value at the $i$th training observation. Then, the $i$th residual is defined as
$$e_i = y_i - \hat{y}_i,$$
where $y_i$ is the response value at the $i$th training observation.
The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the residual sum of squares (RSS)
$$\text{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$$
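To make the RSS criterion concrete, here is a minimal Python sketch (the course itself uses R). The five data points and the two candidate lines are hypothetical and are not taken from the advertising data; the function name `rss` is also just an illustrative choice.

```python
# Hypothetical toy data (not from the advertising data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def rss(b0, b1, x, y):
    """Residual sum of squares for the line y_hat = b0 + b1 * x."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return sum(e * e for e in residuals)


# Two candidate lines: the first happens to be the least squares line
# for these five points, the second is an arbitrary alternative.
rss_a = rss(2.2, 0.6, x, y)
rss_b = rss(2.0, 0.5, x, y)
print(rss_a, rss_b)
```

The least squares line gives the smaller of the two RSS values, which is exactly what the criterion asks for.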
Advertising example
[Figure: Sales against TV, with the fitted line $\hat{y} = 7.03 + 0.0475x$.]
Advertising example
[Figure: Contour plot of the RSS on the advertising data, using TV as the predictor.]
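Estimating the parameters: least squares approach
Using some calculus, we can show that
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
and
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, respectively.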
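The closed-form estimates translate directly into code. The following is a small Python sketch on hypothetical toy data (the course itself uses R); the function name `least_squares_fit` is an assumption made for illustration.

```python
# Hypothetical toy data (not from the advertising data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def least_squares_fit(x, y):
    """Return (b0_hat, b1_hat) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0_hat, b1_hat = least_squares_fit(x, y)
print(b0_hat, b1_hat)  # approximately 2.2 and 0.6
```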
Assessing the accuracy of the parameter estimates
[Figure: The true model (red) is $Y = 2 + 3X + \epsilon$, where $\epsilon \sim \text{Normal}(0, \sigma^2)$.]
Assessing the accuracy of the parameter estimates

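The standard errors for the parameter estimates are
$$\text{SE}(\hat{\beta}_0) = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
and
$$\text{SE}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}},$$
where $\sigma = \sqrt{V(\epsilon)}$.
Usually $\sigma$ is not known and needs to be estimated from data using the residual standard error (RSE)
$$\text{RSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}}.$$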
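As an illustration, the sketch below computes the RSE and plugs it in for the unknown standard deviation of the error term in both standard error formulas. It is written in Python on hypothetical toy data (the course itself uses R), and it assumes simple linear regression, so p = 1 and the RSE divisor is n - 2. The function name is illustrative only.

```python
import math

# Hypothetical toy data (not from the advertising data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def standard_errors(x, y):
    """Return (rse, se_b0, se_b1) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # With one predictor (p = 1), n - p - 1 = n - 2.
    rse = math.sqrt(rss / (n - 2))
    se_b0 = rse * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    se_b1 = rse / math.sqrt(sxx)
    return rse, se_b0, se_b1


rse, se_b0, se_b1 = standard_errors(x, y)
print(rse, se_b0, se_b1)
```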
Hypothesis testing
If $\beta_1 = 0$, then the simple linear model reduces to $Y = \beta_0 + \epsilon$, and $X$ is not associated with $Y$.
To test whether $X$ is associated with $Y$, we perform a hypothesis test:
$H_0: \beta_1 = 0$ (there is no relationship between $X$ and $Y$)
$H_A: \beta_1 \neq 0$ (there is some relationship between $X$ and $Y$)
If the null hypothesis is true ($\beta_1 = 0$), then
$$t = \frac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)}$$
will have a $t$-distribution with $n - 2$ degrees of freedom.

Results for the advertising data set
            Coefficient   Std. Error   t-statistic   p-value
Intercept   7.0325        0.4578       15.36         <0.0001
TV          0.0475        0.0027       17.67         <0.0001
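As a rough check, the t-statistics can be recomputed from the coefficient and standard error columns of the table. The Python sketch below does this (the course itself uses R; the function name is illustrative). The intercept reproduces the reported 15.36, while the TV row gives about 17.59 rather than the reported 17.67, presumably because the displayed coefficient and standard error are rounded.

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """t = (estimate - null_value) / SE(estimate)."""
    return (estimate - null_value) / std_error


# Coefficients and standard errors as displayed in the results table.
t_intercept = t_statistic(7.0325, 0.4578)
t_tv = t_statistic(0.0475, 0.0027)
print(round(t_intercept, 2), round(t_tv, 2))
```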
Assessing the overall accuracy

Once we have established that there is some relationship between $X$ and $Y$, we want to quantify the extent to which the linear model fits the data.
The residual standard error (RSE) provides an absolute measure of lack of fit for the linear model, but it is not always clear what a good RSE is.
An alternative measure of fit is R-squared ($R^2$),
$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$
where TSS is the total sum of squares.
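A short Python sketch of the R-squared calculation follows (the course itself uses R). The responses and fitted values are hypothetical toy numbers, and the function name is an illustrative assumption.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    y_bar = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - rss / tss


# Hypothetical responses and the fitted values from a least squares line.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
r2 = r_squared(y, y_hat)
print(r2)  # approximately 0.6
```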
Results for the advertising data set
Quantity                        Value
Residual standard error (RSE)   3.26
$R^2$                           0.612
The $R^2$ statistic has an interpretational advantage over RSE because it always lies between 0 and 1.
What counts as a good $R^2$ value usually depends on the application.
Multiple linear regression
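In multiple linear regression, we assume a model
$$Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon,$$
where $\beta_0, \beta_1, \ldots, \beta_p$ are $p + 1$ unknown parameters and $\epsilon$ is an error term with $E(\epsilon) = 0$.
Given some parameter estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$, the prediction of $Y$ at $X = x$ is given by
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \ldots + \hat{\beta}_p x_p.$$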
Multiple linear regression
[Figure: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$ plotted over $X_1$ and $X_2$, with response $Y$.]

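Estimating the parameters: least squares approach
The parameters $\beta_0, \beta_1, \ldots, \beta_p$ are estimated using the least squares approach.
We choose $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ to minimize the sum of squared residuals
$$\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \ldots - \hat{\beta}_p x_{ip})^2.$$
We will calculate these parameter estimates using R.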
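For readers who want to see what such a calculation involves, here is a minimal Python sketch that minimizes the RSS by solving the normal equations with Gaussian elimination. This is an assumption-laden illustration, not the method used in the course or by R: statistical software typically relies on more numerically stable routines. The data are hypothetical and noise-free, generated from y = 1 + 2 x1 + 3 x2, so the estimates should recover those coefficients.

```python
def fit_multiple(rows, y):
    """Least squares estimates [b0, b1, ..., bp] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]          # prepend intercept column
    k = len(X[0])
    # Normal equations: (X^T X) b = X^T y, stored as an augmented matrix M.
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Forward elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(M[idx][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, k):
            factor = M[row][col] / M[col][col]
            for j in range(col, k + 1):
                M[row][j] -= factor * M[col][j]
    # Back substitution.
    b = [0.0] * k
    for i in range(k - 1, -1, -1):
        s = sum(M[i][j] * b[j] for j in range(i + 1, k))
        b[i] = (M[i][k] - s) / M[i][i]
    return b


# Hypothetical noise-free data: y = 1 + 2*x1 + 3*x2.
rows = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0), (3.0, 1.0), (4.0, 3.0)]
y = [4.0, 3.0, 11.0, 10.0, 18.0]
beta_hat = fit_multiple(rows, y)
print(beta_hat)  # approximately [1.0, 2.0, 3.0]
```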
Results for the advertising data
            Coefficient   Std. Error   t-statistic   p-value
Intercept   2.939         0.3119       9.42          <0.0001
TV          0.046         0.0014       32.81         <0.0001
Radio       0.189         0.0086       21.89         <0.0001
Newspaper   -0.001        0.0059       -0.18         0.8599
Is there a relationship between Y and X?
To test whether $X$ is associated with $Y$, we perform a hypothesis test:
$H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$ (there is no relationship)
$H_A$: at least one $\beta_j$ is non-zero (there is some relationship)
If the null hypothesis is true (no relationship), then
$$F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)}$$
will have an $F$-distribution with parameters $p$ and $n - p - 1$.
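The F-statistic is a simple function of TSS, RSS, n and p. The Python sketch below evaluates it for hypothetical values (TSS = 6.0, RSS = 2.4, n = 5, p = 1), which are not taken from the advertising data; the function name is illustrative.

```python
def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1))."""
    return ((tss - rss) / p) / (rss / (n - p - 1))


# Hypothetical values, not from the advertising data.
f_value = f_statistic(tss=6.0, rss=2.4, n=5, p=1)
print(f_value)  # approximately 4.5
```

With a single predictor (p = 1), the F-statistic equals the square of the t-statistic for the slope, so the two tests agree in that case.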
Is the model a good fit?

Once we have established that there is some relationship between the response and the predictors, we want to quantify the extent to which the multiple linear model fits the data.
The residual standard error (RSE) and $R^2$ are commonly used.
For the advertising data we have:
Quantity                        Value
Residual standard error (RSE)   1.69
$R^2$                           0.897
F-statistic                     570
Extensions to the linear model
We can remove the additive assumption and allow for interaction effects.
Consider the standard linear model with two predictors
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon.$$
An interaction term is included by adding a third predictor to the standard model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon.$$
Results for the advertising data
Consider the model
$$\text{Sales} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{Radio} + \beta_3 (\text{TV} \times \text{Radio}) + \epsilon.$$
The results are:
             Coefficient   Std. Error   t-statistic   p-value
Intercept    6.7502        0.248        27.23         <0.0001
TV           0.0191        0.002        12.70         <0.0001
Radio        0.0289        0.009        3.24          0.0014
TV×Radio     0.0011        0.000        20.73         <0.0001
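One way to read the interaction model is that the slope for TV is no longer constant: it equals the TV coefficient plus the interaction coefficient times Radio. The Python sketch below illustrates this using the estimated coefficients from the table; the input values (TV = 100, Radio = 20) are hypothetical and chosen only for illustration, as are the function names.

```python
# Estimated coefficients as displayed in the results table.
B0, B_TV, B_RADIO, B_INTERACTION = 6.7502, 0.0191, 0.0289, 0.0011


def predict_sales(tv, radio):
    """Prediction from the interaction model."""
    return B0 + B_TV * tv + B_RADIO * radio + B_INTERACTION * tv * radio


def tv_slope(radio):
    """Change in predicted Sales per unit increase in TV at a given Radio."""
    return B_TV + B_INTERACTION * radio


slope_at_0 = tv_slope(0.0)     # 0.0191
slope_at_20 = tv_slope(20.0)   # 0.0191 + 0.0011 * 20
# A one-unit increase in TV at Radio = 20 changes the prediction by that slope.
diff = predict_sales(101.0, 20.0) - predict_sales(100.0, 20.0)
print(slope_at_0, slope_at_20, diff)
```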
Extensions to the linear model
We can accommodate non-linear relationships using polynomial regression.
Consider the simple linear model
$$Y = \beta_0 + \beta_1 X + \epsilon.$$
Non-linear relationships can be captured by including powers of $X$ in the model. For example, a quadratic model is
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon.$$
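A polynomial model is still linear in its coefficients: the powers of X simply act as extra predictors. The Python sketch below shows this view with hypothetical coefficients (1.0, 2.0, 0.5) that are not fitted to any data; the function names are illustrative.

```python
def polynomial_features(x, degree):
    """Return [x, x^2, ..., x^degree], the predictors of a polynomial model."""
    return [x ** d for d in range(1, degree + 1)]


def predict(coeffs, x):
    """Evaluate b0 + b1*x + ... + bd*x^d as a linear model in the features."""
    features = polynomial_features(x, len(coeffs) - 1)
    return coeffs[0] + sum(b * f for b, f in zip(coeffs[1:], features))


# Hypothetical quadratic coefficients [b0, b1, b2].
y_hat = predict([1.0, 2.0, 0.5], 3.0)
print(y_hat)  # 1 + 2*3 + 0.5*9 = 11.5
```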
Polynomial regression: Auto data
[Figure: Miles per gallon against Horsepower, with Linear, Degree 2, and Degree 5 fits.]
Results for the auto data

The figure suggests that
$$\text{mpg} = \beta_0 + \beta_1 \text{Horsepower} + \beta_2 \text{Horsepower}^2 + \epsilon$$
may fit the data better than a simple linear model.
The results are:
               Coefficient   Std. Error   t-statistic   p-value
Intercept      56.9001       1.8004       31.6          <0.0001
Horsepower     -0.4662       0.0311       -15.0         <0.0001
Horsepower^2   0.0012        0.0001       10.1          <0.0001
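To see how the fitted quadratic is used, the Python sketch below plugs the displayed coefficients into the model and predicts mpg at a hypothetical horsepower of 100 (a value chosen only for illustration). Because the displayed coefficients are rounded, the result is approximate.

```python
# Estimated coefficients as displayed in the results table.
B0, B1, B2 = 56.9001, -0.4662, 0.0012


def predict_mpg(horsepower):
    """Prediction from the fitted quadratic model."""
    return B0 + B1 * horsepower + B2 * horsepower ** 2


mpg_at_100 = predict_mpg(100.0)
print(round(mpg_at_100, 4))  # 56.9001 - 46.62 + 12.0 = 22.2801
```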
What we did not cover
Qualitative predictors need to be coded using dummy variables for linear regression (R does this automatically for us).
Deciding on important variables.
Outliers and high leverage points.
Non-constant variance and correlation of error terms.
Collinearity.
