• Introduction
(Age) x4
Data Cleanup?
[Figure: scatter plot of Income (0 to 30,000) against Hours per Week (0 to 60) with a fitted regression line]
ŷ = 2461 + 297x
R² = 0.311
Significance = 0.0031
Outliers:
• Rare, extreme values may distort the outcome.
• Could be an error.
• Could be a very important observation.
• Rule of thumb: more than 3 standard deviations from the mean.
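The 3-standard-deviation rule can be sketched in a few lines of Python. This is only an illustration: the function name `flag_outliers`, the threshold parameter `k`, and the `hours` data are hypothetical and do not come from the slides.

```python
from statistics import mean, stdev

def flag_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)
    return [v for v in values if abs(v - m) > k * s]

# Hypothetical weekly hours: 29 ordinary values and one extreme entry.
hours = [40] * 10 + [38] * 10 + [42] * 9 + [400]
print(flag_outliers(hours))  # the extreme entry 400 is flagged
```

A flagged value still has to be judged: it may be a data-entry error or a very important observation, so the rule only tells you where to look.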
Module 3.2 Copyrights 2017 © ProcessWhirl Management Consulting Pvt. Ltd.
M3.2 Regression – Simple Relationship
(Non Linear Relationship)
U-Shaped Relationship
Correlation = +0.12.
[Figure: scatter plot of Y against X (X from 0 to 12) showing a U-shaped pattern]
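A near-zero correlation does not mean there is no relationship, because the correlation coefficient only measures linear association. The sketch below uses made-up data (a perfectly symmetric U, not the points from the slide) and a hand-written `pearson_r` helper to show a correlation of zero despite an exact functional relationship.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y is an exact quadratic (U-shaped) function of x.
xs = list(range(1, 12))            # 1, 2, ..., 11
ys = [(x - 6) ** 2 for x in xs]    # symmetric U centred at x = 6

print(pearson_r(xs, ys))  # essentially zero, although y is fully determined by x
```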
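Types of regression models:
• Simple: Linear or Non-Linear
• Multiple: Linear or Non-Linear
Simple linear model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
• Y_i: Dependent (Response) Variable (e.g., CD+ c.)
• X_i: Independent (Explanatory) Variable (e.g., Years s. serocon.)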
EPI 809/Spring 2008
M3.2 Regression – Linear Regression
(Scatter plot)
• 1. Plot of all (Xi, Yi) pairs
• 2. Suggests how well the model will fit
[Figure: scatter plot of Y against X, both axes from 0 to 60]
M3.2 Regression – Linear Regression
(Thinking Challenge)
How would you draw a line through the points? How do you determine which
line ‘fits best’?
• So square errors!
$$\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$$
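One way to make "fits best" concrete is to compute the sum of squared errors for several candidate lines and prefer the smallest. The sketch below does this for a handful of illustrative points and three hand-picked candidate lines; the helper name `sse` and the candidates are assumptions chosen for the example.

```python
def sse(points, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)

# Illustrative (x, y) points and three candidate lines as (intercept, slope).
points = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 4)]
candidates = [(0.0, 0.5), (-0.1, 0.7), (1.0, 0.2)]

for b0, b1 in candidates:
    print(f"y = {b0} + {b1}x  ->  SSE = {sse(points, b0, b1):.2f}")

# The candidate with the smallest SSE fits best among those tried.
best = min(candidates, key=lambda line: sse(points, *line))
print("best candidate:", best)
```

Least squares goes one step further and finds the line with the smallest SSE among all possible lines, not just the ones tried here.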
Prediction equation:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
Sample Y-intercept:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
M3.2 Regression – Linear Regression
(Computation Table)
Interpretation of Coefficients
• 1. Slope ($\hat{\beta}_1$)
– Estimated Y changes by $\hat{\beta}_1$ for each 1-unit increase in X
• If $\hat{\beta}_1 = 2$, then Y is expected to increase by 2 for each 1-unit increase in X
• 2. Y-intercept ($\hat{\beta}_0$)
– Average value of Y when X = 0
• If $\hat{\beta}_0 = 4$, then average Y is expected to be 4 when X is 0
Estriol (mg/24h)    Birthweight (g/1000)
1                   1
2                   1
3                   2
4                   2
5                   4
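$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}}{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}} = \frac{37 - \dfrac{(15)(10)}{5}}{55 - \dfrac{(15)^2}{5}} = 0.70$$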
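The same computation can be checked with a short script. This is a minimal sketch using the estriol and birthweight values from the table; the function name `least_squares_fit` is chosen for the example, and the intercept is obtained from the sample means and the estimated slope.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n   # mean(y) - slope * mean(x)
    return intercept, slope

estriol = [1, 2, 3, 4, 5]        # mg/24h
birthweight = [1, 1, 2, 2, 4]    # g/1000

b0, b1 = least_squares_fit(estriol, birthweight)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

The slope agrees with the hand calculation (0.70); the intercept works out to about -0.10, so the fitted line is roughly ŷ = -0.10 + 0.70x.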
0 ≤ R² ≤ 1
• We may interpret R² as the proportionate reduction of total variability in y associated with the use of the independent variable x.
• The larger R² is, the more the total variation of y is reduced by including the variable x in the model.
• If all the observations fall on the fitted regression line, SSE = 0 and R² = 1.
• If the slope of the fitted regression line is b1 = 0, so that ŷ_i = ȳ for every observation, then SSE = SST and R² = 0.
• The closer R² is to 1, the greater the degree of linear association between x and y is said to be.
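The two boundary cases and a typical case can be verified numerically. The sketch below assumes the usual definition R² = 1 - SSE/SST; the helper name `r_squared` is hypothetical, and the data are the estriol and birthweight values with fitted values from the line ŷ = -0.1 + 0.7x.

```python
def r_squared(ys, fitted):
    """R^2 = 1 - SSE/SST for observed values ys and fitted values."""
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)             # total variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    return 1 - sse / sst

xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

perfect = r_squared(ys, ys)                    # every point on the line: SSE = 0
flat = r_squared(ys, [sum(ys) / len(ys)] * 5)  # slope 0: fitted value is the mean
line = r_squared(ys, [-0.1 + 0.7 * x for x in xs])

print(perfect, flat, round(line, 3))
```

For the fitted line, about 82% of the variation in birthweight is associated with estriol level.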
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i$$
• The β's are the coefficients of the independent variables in the true, or population, equation, and the x's are the values of the independent variables for the i-th member of the population.
All Variables Entered: R-Square = 0.7932 and C(p) = 5.0000

Analysis of Variance
Source    DF    Sum of Squares    Mean Square    F Value    Pr > F

      Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
b0    Intercept    18.47774              6.45406           39.00436      8.20       0.0119
b1    age           0.08424              0.18931            0.94239      0.20       0.6627
b2    ffnum         0.42292              0.13671           45.53958      9.57       0.0074
b3    exercise     -0.00107              0.00170            1.87604      0.39       0.5395
b4    beer          0.32601              0.11518           38.12111      8.01       0.0127
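We have
b0 = 18.478, b1 = 0.084, b2 = 0.423, b3 = -0.001, b4 = 0.326
So the fitted equation is
$$\hat{y} = 18.478 + 0.084\,\text{age} + 0.423\,\text{ffnum} - 0.001\,\text{exercise} + 0.326\,\text{beer}$$
The global hypothesis is
H0: β1 = β2 = β3 = β4 = 0 vs H1: at least one βj ≠ 0
The Analysis of Variance table in the output tests this hypothesis.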
R2 = SS(Model)/SS(Total) = 273.75/345.13
= 0.79 or 79%
Each coefficient is then tested individually, for example H0: β1 = 0 vs H1: β1 ≠ 0.
b1 = 0.084, P = 0.66
b2 = 0.423, P = 0.01
b3 = -0.001, P = 0.54
b4 = 0.326, P = 0.01
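For a test of a single coefficient, the F value reported in the output equals the square of the estimate divided by its standard error (the square of the t statistic). The sketch below recomputes this from the printed estimates and standard errors and compares it with the printed F values. Small differences are expected because the printed inputs are rounded. The dictionary layout and variable names are assumptions made for the example.

```python
# (parameter estimate, standard error, F value printed in the output)
rows = {
    "Intercept": (18.47774, 6.45406, 8.20),
    "age":       (0.08424,  0.18931, 0.20),
    "ffnum":     (0.42292,  0.13671, 9.57),
    "exercise":  (-0.00107, 0.00170, 0.39),
    "beer":      (0.32601,  0.11518, 8.01),
}

computed_f = {}
for name, (estimate, std_error, printed_f) in rows.items():
    t = estimate / std_error      # t statistic for H0: coefficient = 0
    computed_f[name] = t ** 2     # one-degree-of-freedom F statistic
    print(f"{name:10s} t = {t:7.3f}  F = {computed_f[name]:6.2f}  (printed {printed_f})")
```

Large F values (ffnum, beer) correspond to the small P values in the list above, and small F values (age, exercise) to the large ones.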
Backward elimination
Start with all independent variables and test the global hypothesis. If it is rejected, eliminate, step by step, those independent variables for which the hypothesis β = 0 is not rejected.
Forward
Start with a "core" subset of essential variables and add others step by step.
All variables left in the model are significant at the 0.0500 level.
Variable ffnum Entered: R-Square = 0.6613 and C(p) = 8.5625

Stepwise Selection: Step 2
Variable beer Entered: R-Square = 0.7788 and C(p) = 2.0402

Analysis of Variance
Source    DF    Sum of Squares    Mean Square    F Value    Pr > F

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept    20.29360              0.75579           3237.09859    720.97     <.0001
ffnum         0.46380              0.12693             59.94878     13.35     0.0020
beer          0.33375              0.11105             40.55414      9.03     0.0080

All variables left in the model are significant at the 0.0500 level.
No other variable met the 0.1500 significance level for entry into the model.
• Fit the simple linear regression of eruptions on waiting for the data set faithful.
> eruption.lm = lm(eruptions ~ waiting, data = faithful)
• Extract the parameters of the estimated regression equation with the coefficients function.
> coeffs = coefficients(eruption.lm)
> coeffs
(Intercept)     waiting
  -1.874016    0.075628
• Print out the F-statistic of the significance test with the summary function.
> summary(eruption.lm)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2992 -0.3769  0.0351  0.3491  1.1933

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.87402    0.16014   -11.7   <2e-16 ***
waiting      0.07563    0.00222    34.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.497 on 270 degrees of freedom
Multiple R-squared: 0.811, Adjusted R-squared: 0.811
F-statistic: 1.16e+03 on 1 and 270 DF, p-value: <2e-16

• Answer: As the p-value is much less than 0.05, we reject the null hypothesis that β = 0. Hence there is a significant relationship between the variables in the linear regression model of the data set faithful.
ln[p/(1-p)] = α + βX + e
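The log-odds (logit) transformation on the left-hand side maps a probability p in (0, 1) to the whole real line, and its inverse maps a linear predictor back to a probability. The sketch below illustrates both directions. The coefficient values `alpha` and `beta` are hypothetical, chosen only for the example, and the error term is omitted.

```python
from math import exp, log

def logit(p):
    """Log-odds of a probability p, 0 < p < 1."""
    return log(p / (1 - p))

def inv_logit(z):
    """Probability corresponding to a log-odds value z."""
    return 1 / (1 + exp(-z))

# Hypothetical coefficients for the linear predictor alpha + beta * x.
alpha, beta = -2.0, 0.5

for x in (0, 4, 8):
    z = alpha + beta * x
    print(f"x = {x}: log-odds = {z:+.1f}, p = {inv_logit(z):.3f}")
```

With these values the log-odds are 0 at x = 4, which corresponds to a probability of 0.5, and each 1-unit increase in x raises the log-odds by `beta`.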