
Simple Linear Regression and Correlation

Corresponds to Chapter 10 of Tamhane and Dunlop

Slides prepared by Elizabeth Newton (MIT), with some slides by Jacqueline Telford (Johns Hopkins University)
Simple linear regression analysis estimates the relationship between two variables.

One of the variables is regarded as the response or outcome variable (y). The other variable is regarded as the predictor or explanatory variable (x).

Sometimes it is not clear which of the two variables should be the response (e.g., height and weight). In this case, correlation analysis may be used.

Simple linear regression estimates relationships of the form y = a + bx.
Scatter plot of ozone concentration by temperature

[Figure: air$ozone (roughly 1 to 5) plotted against air$temperature (roughly 60 to 90).]

This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
A Probabilistic Model for Simple Linear Regression

Let x1, x2, ..., xn be specific settings of the predictor variable.
Let y1, y2, ..., yn be the corresponding values of the response variable.

Assume that yi is the observed value of a random variable (r.v.) Yi, which depends on xi according to the following model:

Yi = β0 + β1 xi + εi   (i = 1, 2, ..., n)

Here εi is the random error with E(εi) = 0 and Var(εi) = σ².

Thus, E(Yi) = µi = β0 + β1 xi (the true regression line).

The xi's usually are assumed to be fixed (not random variables).
A Probabilistic Model for Simple Linear Regression

See Figure 10.1, p. 348 and also see page 348 for the four
assumptions of a simple linear regression model.

Least Squares Line Mathematics (invented by Gauss)

Find the line, i.e., the values of β0 and β1, that minimize the sum of the squared deviations:

Q = Σ [yi − (β0 + β1 xi)]²   (sum over i = 1, ..., n)

How? Solve for the values of β0 and β1 for which

∂Q/∂β0 = 0 and ∂Q/∂β1 = 0
Finding Regression Coefficients

∂Q/∂β0 = −2 Σ [yi − (β0 + β1 xi)]

∂Q/∂β1 = −2 Σ xi [yi − (β0 + β1 xi)]
Normal Equations

n β0 + β1 Σ xi = Σ yi

β0 Σ xi + β1 Σ xi² = Σ xi yi
Solution to Normal Equations

β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = Sxy / Sxx

β̂0 = ȳ − β̂1 x̄

Note that the least squares line goes through (x̄, ȳ).
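The closed-form solution to the normal equations can be sketched in pure Python (a minimal illustration; the data in the usage line are made up and lie exactly on the line y = 1 + 2x):

```python
def least_squares(x, y):
    """Fit y = b0 + b1*x by least squares, using the normal-equations solution."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sxy and Sxx as defined in the solution to the normal equations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx             # slope: Sxy / Sxx
    b0 = y_bar - b1 * x_bar      # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])  # recovers b0 = 1.0, b1 = 2.0
```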
Fitted regression line

[Figure: the fitted line overlaid on the scatter plot of air$ozone against air$temperature.]
Fitted values of yi: ŷi = β̂0 + β̂1 xi,  i = 1, 2, ..., n
Residuals: ei = yi − ŷi = yi − (β̂0 + β̂1 xi),  i = 1, 2, ..., n

temperature ozone fitted resid
67 3.45 2.49 0.96
72 3.30 2.84 0.46
74 2.29 2.98 -0.69
62 2.62 2.14 0.48
65 2.84 2.35 0.50
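Using the coefficients from the ozone regression output shown later (Value column: −2.2260 for the intercept and 0.0704 for temperature), the fitted values and residuals in this table can be reproduced in a few lines of pure Python:

```python
# Coefficients from the slide's regression output for the ozone example.
b0, b1 = -2.2260, 0.0704
temperature = [67, 72, 74, 62, 65]
ozone = [3.45, 3.30, 2.29, 2.62, 2.84]

fitted = [b0 + b1 * t for t in temperature]      # y-hat_i = b0 + b1 * x_i
resid = [y - f for y, f in zip(ozone, fitted)]   # e_i = y_i - y-hat_i
```

Rounded to two decimals, these match the table above (e.g., fitted 2.49 and residual 0.96 at temperature 67).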
Matrix Approach to Simple Linear Regression
(what your regression package is really doing)

The model: y = Xβ + ε

y is n by 1
X is n by 2
β is 2 by 1
ε is n by 1
Y = Xβ + ε

⎡ y1 ⎤   ⎡ 1  x1 ⎤          ⎡ ε1 ⎤
⎢ y2 ⎥ = ⎢ 1  x2 ⎥ ⎡ β0 ⎤ + ⎢ ε2 ⎥
⎢ y3 ⎥   ⎢ 1  x3 ⎥ ⎣ β1 ⎦   ⎢ ε3 ⎥
⎣ y4 ⎦   ⎣ 1  x4 ⎦          ⎣ ε4 ⎦
Solution of linear equations

In linear algebra: find x which solves Ax = b.

In regression analysis: find β which solves Xβ = y.

Why can't we do this? (X is n by 2 with n > 2, so the system is overdetermined: in general no β satisfies all n equations exactly, which is why we minimize the squared error instead.)
Least Squares

Q = (y − Xβ)'(y − Xβ)
  = y'y − β'X'y − y'Xβ + β'X'Xβ
  = y'y − 2β'X'y + β'X'Xβ

∂Q/∂β = −2X'y + 2X'Xβ

∂Q/∂β = 0  →  X'y = X'Xb, where b = β̂
Least Squares continued

For simple linear regression:

X'X = ⎡ n     Σxi  ⎤
      ⎣ Σxi   Σxi² ⎦

X'y = ⎡ Σyi   ⎤
      ⎣ Σxiyi ⎦
Least Squares continued

X'X b = X'y

⎡ n     Σxi  ⎤ b = ⎡ Σyi   ⎤
⎣ Σxi   Σxi² ⎦     ⎣ Σxiyi ⎦

These are the normal equations as before.
Least Squares continued

X'X b = X'y
b = (X'X)⁻¹X'y (if X has linearly independent columns)

Solution by QR decomposition:
X = QR, with Q orthonormal (Q'Q = I) and R upper triangular and invertible.

b = (X'X)⁻¹X'y = (R'Q'QR)⁻¹R'Q'y = (R'R)⁻¹R'Q'y = R⁻¹Q'y
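The QR route can be sketched in pure Python for the two-column X of simple linear regression, using Gram-Schmidt on the columns (the data here are made up for illustration; a real package uses Householder reflections, but the algebra is the same):

```python
import math

# Made-up data for illustration.
x = [60.0, 62.0, 65.0, 67.0, 72.0, 74.0]
y = [1.80, 2.62, 2.84, 3.45, 3.30, 2.29]
n = len(x)

# Orthonormalize the columns of X = [1, x] (Gram-Schmidt): X = QR.
q1 = [1.0 / math.sqrt(n)] * n                    # normalized constant column
r11 = math.sqrt(n)
r12 = sum(q * xi for q, xi in zip(q1, x))        # component of x along q1
v = [xi - r12 * q for xi, q in zip(x, q1)]       # x minus its projection on q1
r22 = math.sqrt(sum(vi * vi for vi in v))
q2 = [vi / r22 for vi in v]

# b = R^{-1} Q'y: back-substitute the 2x2 upper-triangular system R b = Q'y.
c1 = sum(q * yi for q, yi in zip(q1, y))
c2 = sum(q * yi for q, yi in zip(q2, y))
b1 = c2 / r22
b0 = (c1 - r12 * b1) / r11
```

Working through the algebra, c2/r22 reduces to Sxy/Sxx and b0 to ȳ − b1 x̄, so the QR solution agrees with the normal-equations formulas.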
The Hat Matrix

b = (X'X)⁻¹X'y
ŷ = Xb = X(X'X)⁻¹X'y = Hy

H (n by n) is the hat matrix: it takes y to ŷ.
H is symmetric and idempotent (HH = H).
Diagonal elements of the hat matrix are useful in detecting influential observations.
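For simple linear regression the diagonal of H has the standard closed form hii = 1/n + (xi − x̄)²/Sxx, which makes the "influential observations" point concrete: the farther xi is from x̄, the larger its leverage. A pure-Python sketch with made-up x values:

```python
# Made-up x settings; the last one (90) is far from the mean on purpose.
x = [60.0, 62.0, 65.0, 67.0, 72.0, 74.0, 90.0]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Closed-form hat-matrix diagonals for X = [1, x].
h = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
```

The diagonals sum to 2 (trace(H) equals the number of columns of X), and the largest value belongs to x = 90, the most extreme setting.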
Expected value of b

E(b) = E[(X'X)⁻¹X'y]
     = E[(X'X)⁻¹X'(Xβ + ε)]
     = E[(X'X)⁻¹X'Xβ + (X'X)⁻¹X'ε]
     = β + (X'X)⁻¹X'E(ε) = β

Hence b is an unbiased estimator of β.
Covariance of b

The covariance matrix of y is σ²I.

b = (X'X)⁻¹X'y = Ay (where A is k by n)

Cov(b) = A Var(y) A' = A σ²I A' = σ²AA'
       = σ² (X'X)⁻¹X'X(X'X)⁻¹
       = σ² (X'X)⁻¹
Covariance of b

For simple linear regression:

σ²(X'X)⁻¹ = σ² ⎡ n     Σxi  ⎤⁻¹ = [σ² / (nΣxi² − (Σxi)²)] ⎡  Σxi²  −Σxi ⎤
               ⎣ Σxi   Σxi² ⎦                              ⎣ −Σxi    n   ⎦

SD(b0) = σ √(Σxi² / (n Sxx));  SD(b1) = σ / √Sxx
Estimation of σ²

s² = Σ ei² / (n − 2) = Σ (yi − ŷi)² / (n − 2)   (sums over i = 1, ..., n)

Note: The denominator is n − 2 since two parameters are being estimated (β0 and β1).

E[s²] = σ² (see proof in Seber, Linear Regression Analysis).
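The estimate can be sketched end to end in pure Python: fit the line, form the residuals, and divide their sum of squares by n − 2 (the data are made up for illustration):

```python
# Made-up data close to the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Least squares fit (normal-equations solution).
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residuals and the unbiased estimate of sigma^2: SSE / (n - 2).
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s2 = sum(e * e for e in resid) / (n - 2)
```

A side effect of including an intercept is that the residuals sum to zero, which is a handy sanity check on any implementation.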
Statistical Inference for β0 and β1

SE(β̂0) = s √(Σxi² / (n Sxx))  and  SE(β̂1) = s / √Sxx

For the ozone example:

Coefficients:
                Value Std. Error  t value Pr(>|t|)
(Intercept)   -2.2260     0.4614  -4.8243   0.0000
temperature    0.0704     0.0059  11.9511   0.0000

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
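The t value column in that output is simply coefficient divided by standard error; the rounded table entries reproduce it to within rounding error, as this pure-Python check shows:

```python
# (Value, Std. Error) pairs copied from the ozone regression output above.
coef = {
    "(Intercept)": (-2.2260, 0.4614),
    "temperature": (0.0704, 0.0059),
}

# t value = estimate / standard error; small discrepancies come from the
# table's rounding of the standard errors.
t_values = {name: est / se for name, (est, se) in coef.items()}
```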
Sums of Squares

Sum of Squares Total (SST): Σ (yi − ȳ)²

Sum of Squares for Error (SSE): Σ ei² = Σ (yi − ŷi)²

Sum of Squares for Regression (SSR): Σ (ŷi − ȳ)²

(sums over i = 1, ..., n)
Geometry of the Sums of Squares

yi − ȳ = (ŷi − ȳ) + (yi − ŷi)

SST = SSR + SSE (see derivation on p. 354).

J. Telford
Coefficient of Determination (R-squared)

r² = SSR/SST = 1 − SSE/SST

= proportion of the variance in y that is accounted for by the regression on x
= square of the correlation between y and ŷ

For the ozone example:

Multiple R-Squared: 0.5672
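Using the sums of squares from the ANOVA output on the next slide (SSR = 49.46178, SSE = 37.74698), the reported R-squared can be recovered directly:

```python
# Sums of squares from the ozone ANOVA table; SST = SSR + SSE.
ssr, sse = 49.46178, 37.74698
sst = ssr + sse

# r^2 = SSR / SST = 1 - SSE / SST; matches the reported 0.5672.
r_squared = ssr / sst
```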
Analysis of Variance (ANOVA)

H0: β1 = 0  vs.  H1: β1 ≠ 0

F = (SSR/1) / (SSE/(n − 2)) = MSR/MSE = t²

For the ozone example:

summary.aov(tmp)
             Df Sum of Sq  Mean Sq  F Value Pr(F)
temperature   1  49.46178 49.46178 142.8282     0
Residuals   109  37.74698  0.34630

This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
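The identities in that table, including F = t² for the slope, can be checked numerically from the printed values (agreement up to the table's rounding):

```python
# Values from the ozone ANOVA table and coefficient table above.
ssr, sse, df_err = 49.46178, 37.74698, 109
t_slope = 11.9511            # t value for temperature

msr = ssr / 1                # mean square for regression (1 df)
mse = sse / df_err           # mean square for error, SSE / (n - 2)
f_value = msr / mse          # F statistic, equals t^2 for the slope
```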
Regression Diagnostics

[Figure: residuals, resid(ozone.lm), plotted against observation number.]
Regression Diagnostics

[Figure: residuals, resid(ozone.lm), plotted against fitted values, fitted(ozone.lm).]
Regression Diagnostics

[Figure: residuals, resid(ozone.lm), plotted against x, air$temperature.]
Regression Diagnostics

[Figure: normal Q-Q plot of residuals, resid(ozone.lm), against quantiles of the standard normal.]
Hat Matrix Diagonals

[Figure: hat(model.matrix(ozone.lm)) plotted against observation number; values range from about 0.01 to 0.05.]
Some useful S-Plus commands

my.lm <- lm(y~x, data=mydata, na.action=na.omit)
    includes intercept term by default
summary(my.lm)
    gives coefficients, correlation of coefficients, R-squared, F-statistic, residual standard error
summary.aov(my.lm)
    gives ANOVA table
resid(my.lm)
    gives residuals
fitted(my.lm)
    gives fitted values
model.matrix(my.lm)
    gives model matrix
