Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R2 and Adjusted R2
Overall Validity of the Model (F test)
Testing for individual regressor (t test)
Problem of Multicollinearity
Example: Lung Capacity
[Scatter plot: lung capacity (Y) plotted against the number of cigarettes smoked (X); lung capacity falls as smoking increases.]
Cov(X, Y) = σ_XY = (1/n) Σ (xi − x̄)(yi − ȳ),  summing over i = 1, …, n
09-12-2015
Correlation
Properties of Covariance:
Cov(X, Y) = E(XY) − E(X)·E(Y)
Var(X) = E(X²) − [E(X)]²,  Var(Y) = E(Y²) − [E(Y)]²
Scatter Diagram
[Scatter plots of Y against X illustrating data that are positively correlated, negatively correlated, strongly correlated, weakly correlated, and not correlated.]
x      y     x − x̄   y − ȳ   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
1.25   125   -0.90    45      0.8100     2025       -40.50
1.75   105   -0.40    25      0.1600      625       -10.00
2.25    65    0.10   -15      0.0100      225        -1.50
2.00    85   -0.15     5      0.0225       25        -0.75
2.50    75    0.35    -5      0.1225       25        -1.75
2.25    80    0.10     0      0.0100        0         0.00
2.70    50    0.55   -30      0.3025      900       -16.50
2.50    55    0.35   -25      0.1225      625        -8.75
Total: 17.20  640     0       0       1.560 = SSX   4450 = SSY   -79.75 = SSXY
Cov(X, Y) = (1/n) Σ (xi − x̄)(yi − ȳ)
Var(X) = (1/n) Σ (xi − x̄)²,  Var(Y) = (1/n) Σ (yi − ȳ)²

r_XY = Corr(X, Y) = Cov(X, Y) / √[Var(X)·Var(Y)] = SSXY / √(SSX·SSY)

where SSX = Σ (x − x̄)², SSY = Σ (y − ȳ)², SSXY = Σ (x − x̄)(y − ȳ).

For the data above:
r_XY = SSXY / √(SSX·SSY) = -79.75 / √(1.56 × 4450) = -0.957
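These quantities are easy to verify numerically; a minimal sketch in Python, using the data from the table above:

```python
from math import sqrt

# Data from the table above
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Sums of squares and cross-products
ssx = sum((xi - xbar) ** 2 for xi in x)
ssy = sum((yi - ybar) ** 2 for yi in y)
ssxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

cov_xy = ssxy / n            # Cov(X, Y) = SSXY / n
r = ssxy / sqrt(ssx * ssy)   # correlation coefficient

print(round(ssx, 2), round(ssy, 2), round(ssxy, 2), round(r, 3))
# 1.56 4450.0 -79.75 -0.957
```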
Computational formulas:
SSX = Σx² − (Σx)²/n,  SSY = Σy² − (Σy)²/n,  SSXY = Σxy − (Σx)(Σy)/n

x      y     x²       y²      x·y
1.25   125   1.5625   15625   156.25
1.75   105   3.0625   11025   183.75
2.25    65   5.0625    4225   146.25
2.00    85   4.0000    7225   170.00
2.50    75   6.2500    5625   187.50
2.25    80   5.0625    6400   180.00
2.70    50   7.2900    2500   135.00
2.50    55   6.2500    3025   137.50
Total: 17.20  640   38.54   55650   1296.25

SSX = 38.54 − (17.20)²/8 = 1.56
SSY = 55650 − (640)²/8 = 4450
SSXY = 1296.25 − (17.20)(640)/8 = -79.75
Example: Cigarettes (X) and Lung Capacity (Y)

X     Y     X²    Y²     X·Y
0     45    0     2025   0
5     42    25    1764   210
10    33    100   1089   330
15    31    225   961    465
20    29    400   841    580
Total: 50   180   750    6680   1585

r_XY = [n Σxy − (Σx)(Σy)] / √{[n Σx² − (Σx)²][n Σy² − (Σy)²]}
     = [(5)(1585) − (50)(180)] / √[(1250)(1000)] = -1075/1118.03 = -0.9615

Gaurav Garg (IIM Lucknow)

Regression Analysis
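The shortcut formula can be checked directly; a small sketch using the cigarettes data:

```python
from math import sqrt

x = [0, 5, 10, 15, 20]     # cigarettes
y = [45, 42, 33, 31, 29]   # lung capacity
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
syy = sum(yi * yi for yi in y)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# r = [n*Sxy - Sx*Sy] / sqrt([n*Sxx - Sx^2][n*Syy - Sy^2])
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 4))  # -0.9615
```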
Types of Relationships
[Scatter plots illustrating linear and curvilinear relationships, strong and weak relationships, and no relationship between Y and X.]
Simple Linear Regression
We fit the line Ŷ = a + bX.
[Diagram: fitted line Ŷ = a + bX; for an observed point (xi, yi), the vertical deviation yi − (a + b·xi) is the error, and a is the intercept.]
The least squares estimates a and b minimize
SSE = Σ (Yi − a − bXi)²,  summing over i = 1, …, n.

∂SSE/∂a = 0  ⟹  -2 Σ (Yi − a − bXi) = 0  ⟹  Σ Yi = n·a + b Σ Xi          … (1)
∂SSE/∂b = 0  ⟹  -2 Σ (Yi − a − bXi)·Xi = 0  ⟹  Σ XiYi = a Σ Xi + b Σ Xi²   … (2)

Solving the normal equations (1) and (2):
b = [n Σ XiYi − (Σ Xi)(Σ Yi)] / [n Σ Xi² − (Σ Xi)²] = SSXY/SSX
a = Ȳ − b·X̄
Thus b = SSXY/SSX and a = Ȳ − b·X̄.
Also, the correlation coefficient between X and Y can be written as
r_XY = Cov(X, Y)/√[Var(X)·Var(Y)] = SSXY/√(SSX·SSY) = b·√(SSX/SSY)
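The closed-form estimates b = SSXY/SSX and a = Ȳ − b·X̄ can be computed directly; a minimal sketch with the example data:

```python
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
ssxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = ssxy / ssx        # slope
a = ybar - b * xbar   # intercept

print(round(a, 2), round(b, 2))  # 189.91 -51.12
```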
For the example data, X̄ = 2.15, Ȳ = 80, SSX = 1.56, SSY = 4450, SSXY = -79.75, and
r_XY = SSXY/√(SSX·SSY) = -0.957
b = SSXY/SSX = -79.75/1.56 = -51.12
a = Ȳ − b·X̄ = 80 − (-51.12)(2.15) = 189.91
Fitted line: Ŷ = 189.91 − 51.12·X
[Scatter plot of the data with the fitted line Ŷ = 189.91 − 51.12·X.]
Residuals: ei = Yi − Ŷi
The residual is the unexplained part of Y.
The smaller the residuals, the better the utility of the regression.
The sum of the residuals is always zero; the least squares procedure ensures that.
Residuals play an important role in investigating the adequacy of the fitted model.
We obtain the coefficient of determination (R²) using the residuals.
R² is used to examine the adequacy of the fitted linear model to the given data.
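That the residuals sum to zero under least squares can be checked numerically; a sketch continuing the same example:

```python
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

# Residuals e_i = y_i - yhat_i
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(abs(sum(residuals)) < 1e-9)  # True: least squares forces sum(e) = 0
```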
Coefficient of Determination
For each observation, (Y − Ȳ) = (Ŷ − Ȳ) + (Y − Ŷ). Squaring and summing over all n observations gives
Σ (Y − Ȳ)² = Σ (Ŷ − Ȳ)² + Σ (Y − Ŷ)²,  i.e.  SST = SSR + SSE.
R² = 1: perfect linear relationship between X and Y (corresponds to r = -1 or r = 1).
0 < R² < 1: weaker linear relationship; part of the variation in Y is explained by X.
R² = 0: no linear relationship; none of the variation in Y is explained by X.
x      y     ŷ       y − ȳ   y − ŷ   ŷ − ȳ   (y − ȳ)²   (y − ŷ)²   (ŷ − ȳ)²
1.25   125   126.0    45     -1.0    46.0    2025        1.00      2116.00
1.75   105   100.5    25      4.5    20.5     625       20.25       420.25
2.25    65    74.9   -15     -9.9    -5.1     225       98.01        26.01
2.00    85    87.7     5     -2.7     7.7      25        7.29        59.29
2.50    75    62.1    -5     12.9   -17.9      25      166.41       320.41
2.25    80    74.9     0      5.1    -5.1       0       26.01        26.01
2.70    50    51.9   -30     -1.9   -28.1     900        3.61       789.61
2.50    55    62.1   -25     -7.1   -17.9     625       50.41       320.41
Total:                0       0       0      4450 (SST)  ≈373.0 (SSE)  ≈4078.0 (SSR)

R² = SSR/SST ≈ 0.916  (= r², since r = -0.957)
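The SST = SSR + SSE decomposition and R² can be verified in code; a sketch using the same data and fitted line:

```python
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # explained
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained

r2 = ssr / sst
print(round(sst, 1), round(r2, 3))  # 4450.0 0.916
```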
Example: Watching television also reduces the amount of physical exercise, causing weight gain.
A sample of fifteen 10-year-old children was taken. The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight). The number of hours of television viewing per week was also recorded. These data are listed here.

TV (hours/week):   42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight (lbs):  18  6  0 -1 13 14  7  7 -9  8  8  5  3 14 -7

Fitted line: Ŷ = -24.709 + 0.967·X, with R² = 0.768.
[Plot of observed Y and predicted Y for the 15 children.]
Standard Error
Consider a dataset. All the observations cannot be exactly the same as the arithmetic mean (AM); the variability of the observations around the AM is measured by the standard deviation. Similarly, in regression, all Y values cannot be the same as the predicted Y values. The variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE, given by
S_YX = √[ SSE/(n − 2) ] = √[ Σ (Yi − Ŷi)² / (n − 2) ]
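A quick numerical check of S_YX for the running example (n = 8):

```python
from math import sqrt

x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_yx = sqrt(sse / (n - 2))  # standard error of the estimate

print(round(s_yx, 2))
```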
Assumptions
Linearity, independence, and equal variance.
[Residual plots against X illustrating: a linear versus a non-linear pattern; independent versus non-independent residuals; equal variance (homoscedastic) versus unequal variance (heteroscedastic) residuals.]
[Residual plots for the television example.]
Example: A distributor of frozen dessert pies wants to evaluate factors which influence demand.
Dependent variable — Y: pie sales (units per week).
Independent variables — X1: price (in $); X2: advertising expenditure (in $100s).

Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7
The multiple linear regression model:
Yi = β0 + β1·X1i + β2·X2i + … + βk·Xki + εi,   i = 1, 2, …, n
where β0 is the intercept, β1, …, βk are the slopes, and εi is the random error.
The estimated model:
Ŷi = b0 + b1·X1i + b2·X2i + … + bk·Xki,   i = 1, 2, …, n
where b0 is the estimate of the intercept and b1, …, bk are the estimates of the slopes.
In matrix notation, the model is Y = Xβ + ε, where Y is the n×1 vector of responses, X is the n×(k+1) design matrix (a column of ones followed by the columns X1, …, Xk), β is the (k+1)×1 vector of coefficients, and ε is the n×1 vector of random errors.
Assumptions
The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
Random errors are independent.
Random errors have the same variance (homoscedasticity): Var(εi) = σ².
In the long run, the mean effect of the random errors is zero: E(εi) = 0.
S(β) = Σ εi² = (Y − Xβ)′(Y − Xβ) = Y′Y − 2β′X′Y + β′X′Xβ
We differentiate S(β) with respect to β and equate to zero, i.e., ∂S/∂β = 0. This gives
b = (X′X)⁻¹ X′Y
For the pie sales data:
Intercept (b0) = 306.53
LSE of slope 1: Price (b1) = -24.98
LSE of slope 2: Advertising (b2) = 74.13
Fitted model: Ŷ = 306.53 − 24.98·X1 + 74.13·X2
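The estimate b = (X′X)⁻¹X′Y can be reproduced without a matrix library; a sketch that builds the normal equations and solves the 3×3 system by Gaussian elimination, using the pie sales data from the table above:

```python
# Pie sales data: Y = sales, X1 = price ($), X2 = advertising ($100s)
Y = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
X1 = [5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0]
X2 = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]

# Design matrix with a leading column of ones for the intercept
X = [[1.0, x1, x2] for x1, x2 in zip(X1, X2)]
p = 3

# Normal equations: (X'X) b = X'Y
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
XtY = [sum(row[i] * yi for row, yi in zip(X, Y)) for i in range(p)]

# Gaussian elimination with partial pivoting on the augmented matrix
M = [XtX[i][:] + [XtY[i]] for i in range(p)]
for c in range(p):
    piv = max(range(c, p), key=lambda r: abs(M[r][c]))
    M[c], M[piv] = M[piv], M[c]
    for r in range(c + 1, p):
        f = M[r][c] / M[c][c]
        for k in range(c, p + 1):
            M[r][k] -= f * M[c][k]

# Back substitution
b = [0.0] * p
for r in range(p - 1, -1, -1):
    b[r] = (M[r][p] - sum(M[r][k] * b[k] for k in range(r + 1, p))) / M[r][r]

print([round(v, 2) for v in b])  # approx [306.53, -24.98, 74.13]
```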
Prediction: predict sales for a week in which the selling price is $5.50 and advertising expenditure is $350 (i.e., X2 = 3.5, since advertising is measured in $100s):
Ŷ = 306.53 − 24.98·(5.50) + 74.13·(3.5) = 428.62
Y     X1    X2    Predicted Y   Residual
350   5.5   3.3   413.77        -63.80
460   7.5   3.3   363.81         96.15
350   8.0   3.0   329.08         20.88
430   8.0   4.5   440.28        -10.31
350   6.8   3.0   359.06         -9.09
380   7.5   4.0   415.70        -35.74
430   4.5   3.0   416.51         13.47
470   6.4   3.7   420.94         49.03
450   7.0   3.5   391.13         58.84
490   5.0   4.0   478.15         11.83
340   7.2   3.5   386.13        -46.16
300   7.9   3.2   346.40        -46.44
440   5.9   4.0   455.67        -15.70
450   5.0   3.5   441.09          8.89
300   7.0   2.7   331.82        -31.85
Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression:
R² = SSR/SST = 1 − (SSE/SST)
[Plot of observed Y and predicted Y for the 15 weeks.]
Since SST = SSR + SSE and all three quantities are non-negative, 0 ≤ SSR ≤ SST, so 0 ≤ SSR/SST ≤ 1, i.e., 0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good, and the X variables do not contribute to explaining the variability in Y.
When R² is close to 1, the linear fit is good.
In the pie sales example, R² = 0.5215. If we consider Y and X1 only, R² = 0.1965.
Adjusted R²
If one more regressor is added to the model, the value of R² will increase, regardless of the contribution of the newly added regressor. So an adjusted value of R² is defined, called the adjusted R²:
R²_Adj = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
where
SST = Σ (Yi − Ȳ)²,  SSE = Σ ei² = Σ (Yi − Ŷi)²,  SSR = SST − SSE.
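Plugging the pie sales example's sums of squares into these formulas; a quick sketch:

```python
# From the pie sales example: n = 15 weeks, k = 2 regressors
n, k = 15, 2
sse = 27033.31
sst = 56493.33

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))

print(round(r2, 4), round(r2_adj, 4))  # 0.5215 0.4417
```

Note how the adjustment pulls R² down from 0.5215 to 0.4417, penalizing the model for its two regressors.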
ANOVA table:

Source              df          SS    MS
Regression          k           SSR   MSR = SSR/k
Residual or Error   n − k − 1   SSE   MSE = SSE/(n − k − 1)
Total               n − 1       SST

Test statistic: Fc = MSR / MSE ~ F(k, n − k − 1)
For the pie sales example, we wish to test H0: β1 = β2 = 0 against H1: at least one βi ≠ 0.

ANOVA table:
Source              df   SS         MS
Regression          2    29460.03   14730.01
Residual or Error   12   27033.31   2252.78
Total               14   56493.33

Fc = 6.5386 > F(2,12)(0.05) = 3.89, so H0 is rejected at the 5% level.
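The F statistic follows directly from the table; a one-line check:

```python
# ANOVA figures for the pie sales example
msr = 29460.03 / 2    # MS regression = SSR / k
mse = 27033.31 / 12   # MS error = SSE / (n - k - 1)
fc = msr / mse

print(round(fc, 4))  # exceeds F(2,12)(0.05) = 3.89, so reject H0
```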
Test statistic for an individual regressor (H0: βj = 0):
Tc = bj / √(σ̂² Cjj),  where σ̂² = MSE and Cjj is the j-th diagonal element of (X′X)⁻¹.
In our example, σ̂² = 2252.7755.
Standard Error
As in simple regression, the variability of the Y values around the fitted model is measured by the standard error of the estimate; with k regressors it is
S_YX = √[ SSE/(n − k − 1) ] = √[ Σ (Yi − Ŷi)² / (n − k − 1) ]
The assumptions of linearity, independence, and equal variance are again examined through residual plots.
Assumption of Normality
When we use the F test or the t test, we assume that ε1, ε2, …, εn are normally distributed. This assumption can be examined by a histogram of the residuals.
[Histograms of residuals illustrating normal and non-normal shapes.]
Standardized Data
Ȳ = (1/n) Σ Yi,  sY = √[ (1/(n−1)) Σ (Yi − Ȳ)² ]
X̄1 = (1/n) Σ X1i,  sX1 = √[ (1/(n−1)) Σ (X1i − X̄1)² ]
X̄2 = (1/n) Σ X2i,  sX2 = √[ (1/(n−1)) Σ (X2i − X̄2)² ]
Standardized Yi = (Yi − Ȳ)/sY,  Standardized X1i = (X1i − X̄1)/sX1,  Standardized X2i = (X2i − X̄2)/sX2.

Week   Pie Sales   Price ($)   Advertising ($100s)
1      -0.78       -0.95       -0.37
2       0.96        0.76       -0.37
3      -0.78        1.18       -0.98
4       0.48        1.18        2.09
5      -0.78        0.16       -0.98
6      -0.30        0.76        1.06
7       0.48       -1.80       -0.98
8       1.11       -0.18        0.45
9       0.80        0.33        0.04
10      1.43       -1.38        1.06
11     -0.93        0.50        0.04
12     -1.56        1.10       -0.57
13      0.64       -0.61        1.06
14      0.80       -1.38        0.04
15     -1.56        0.33       -1.60
Fitting the regression to the standardized data gives
Y = 0 − 0.461·X1 + 0.570·X2
Since |-0.461| < 0.570, X2 contributes the most.
Note also the identities
R²_Adj = 1 − (1 − R²)(n − 1)/(n − k − 1)
Fc = (n − k − 1)·R² / [k·(1 − R²)]
Problem of Multicollinearity
Example: weekly Sales regressed on the number of ads (No_Adv) and advertising expenditure (Ex_Adv) over 12 weeks.

ANOVA(b)
Source      Sum of Squares   df   Mean Square   F       Sig.
Regression  309.986          2    154.993       9.741   .006(a)
Residual    143.201          9    15.911
Total       453.187          11
Coefficients(a)
Model        B       Std. Error   Beta    t       Sig.
(Constant)   6.584   8.542                .771    .461
No_Adv       .625    1.120        .234    .558    .591
Ex_Adv       2.139   1.470        .611    1.455   .180
a. Dependent Variable: Sales

CONTRADICTION: based on the ANOVA, H0 is rejected (the regression is significant), yet none of the individual t tests rejects β0 = 0, β1 = 0, or β2 = 0 (Sig. = .461, .591, .180). This contradiction is a symptom of multicollinearity.
Variance Inflation Factor: VIF_j = 1/(1 − R²_j), where R²_j is the R² from regressing Xj on the remaining regressors.
Coefficients(a) with collinearity statistics:
Model        B       Std. Error   Beta    t       Sig.   Tolerance   VIF
(Constant)   6.584   8.542                .771    .461
No_Adv       .625    1.120        .234    .558    .591   .199        5.022
Ex_Adv       2.139   1.470        .611    1.455   .180   .199        5.022
a. Dependent Variable: Sales

Tolerance = 1/VIF. A VIF greater than 5 indicates multicollinearity.
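The VIF/tolerance relation is easy to compute; a hedged sketch (the auxiliary R² of 0.8009 is back-calculated here from the reported VIF, it is not given in the slides):

```python
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j
# on the remaining regressors (0.8009 is back-calculated, not from the slides)
r2_j = 0.8009
vif = 1 / (1 - r2_j)
tolerance = 1 / vif   # tolerance is the reciprocal of VIF

print(round(vif, 2), round(tolerance, 3))
```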
Collinearity Diagnostics(a)
Model   Dimension   Eigenvalue   Condition Index   Variance Proportions
                                                   (Constant)   No_Adv   Ex_Adv
1       1           2.966        1.000             .00          .00      .00
        2           .030         9.882             .33          .17      .00
        3           .003         30.417            .67          .83      1.00
a. Dependent Variable: Sales

A negligible eigenvalue (.003) together with a large condition index (30.417) also indicates multicollinearity.
Stepwise Regression
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ε
Step 1: Run 5 simple linear regressions:
Y = β0 + β1X1
Y = β0 + β2X2
Y = β0 + β3X3
Y = β0 + β4X4   <==== has the lowest p-value (ANOVA) < 0.05
Y = β0 + β5X5
Step 2: Keeping X4, run 4 two-regressor regressions:
Y = β0 + β4X4 + β1X1
Y = β0 + β4X4 + β2X2
Y = β0 + β4X4 + β3X3   <= has the lowest p-value (ANOVA) < 0.05
Y = β0 + β4X4 + β5X5
Continue until no remaining regressor has p-value < 0.05, then STOP.
The best model is the one with X3 and X4 only.
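The first step of forward selection can be sketched generically; here the single-variable step picks the regressor with the largest squared correlation with Y, which is equivalent to picking the smallest simple-regression p-value. The data below are hypothetical, purely for illustration:

```python
from math import sqrt

def corr(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    ub, vb = sum(u) / n, sum(v) / n
    suv = sum((a - ub) * (b - vb) for a, b in zip(u, v))
    suu = sum((a - ub) ** 2 for a in u)
    svv = sum((b - vb) ** 2 for b in v)
    return suv / sqrt(suu * svv)

# Hypothetical data: y tracks x_a almost perfectly
y  = [2.1, 3.9, 6.2, 7.8, 10.1]
xs = {"x_a": [1, 2, 3, 4, 5], "x_b": [2, 1, 4, 3, 5]}

# Step 1: pick the regressor with the highest squared correlation with y
best = max(xs, key=lambda name: corr(xs[name], y) ** 2)
print(best)  # x_a enters the model first
```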
Regressing Sales on No_Adv alone:

ANOVA(b)
Source      Sum of Squares   df   Mean Square   F        Sig.
Regression  276.308          1    276.308       15.621   .003(a)
Residual    176.879          10   17.688
Total       453.187          11

Coefficients(a)
Model        B        Std. Error   Beta   t       Sig.
(Constant)   16.937   4.982               3.400   .007
No_Adv       2.083    .527         .781   3.952   .003
a. Dependent Variable: Sales
Regressing Sales on Ex_Adv alone:

Model Summary
Model   R       R Square   Adjusted R Square
1       .820a   .673       .640
a. Predictors: (Constant), Ex_Adv

ANOVA(b)
Source      Sum of Squares   df   Mean Square   F        Sig.
Regression  305.039          1    305.039       20.590   .001(a)
Residual    148.148          10   14.815
Total       453.187          11

Coefficients(a)
Model        B       Std. Error   Beta   t       Sig.
(Constant)   4.173   7.109               .587    .570
Ex_Adv       2.872   .633         .820   4.538   .001
a. Dependent Variable: Sales
Example: For the repair time data, the fitted simple regression is
Ŷ = 2.1473 + 0.3041·X1,  with R² = 0.534.
At the 5% level of significance, we reject
H0: β0 = 0 (using the t test)
H0: β1 = 0 (using the t test and the F test)
X1 alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable:
X2 = 0, if the type of repair is mechanical
X2 = 1, if the type of repair is electrical
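Encoding the dummy is a one-line transformation; a minimal sketch using the repair types from the data:

```python
repair_types = ["electrical", "mechanical", "electrical", "mechanical",
                "electrical", "electrical", "mechanical", "mechanical",
                "electrical", "electrical"]

# X2 = 1 for electrical repairs, 0 for mechanical repairs
x2 = [1 if t == "electrical" else 0 for t in repair_types]

print(x2)  # [1, 0, 1, 0, 1, 1, 0, 0, 1, 1]
```

The fitted model Ŷ = b0 + b1·X1 + b2·X2 then describes two parallel lines: one for mechanical repairs (X2 = 0) and one shifted by b2 for electrical repairs (X2 = 1).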
The repair data:
Type of Repair   Repair Time (Hours)
electrical       2.9
mechanical       3.0
electrical       4.8
mechanical       1.8
electrical       2.9
electrical       4.9
mechanical       4.2
mechanical       4.8
electrical       4.4
electrical       4.5
With both regressors (No_Adv and Ex_Adv):
Model Summary
Model   R       R Square   Adjusted R Square
1       .827a   .684       .614
a. Predictors: (Constant), Ex_Adv, No_Adv
Summary
Multiple linear regression model: Y = Xβ + ε
Least squares estimate of β: b = (X′X)⁻¹X′Y
R² and adjusted R²
Using ANOVA (F test), we examine whether all βs are zero or not.
A t test is conducted for each regressor separately; using the t test, we examine whether the β corresponding to that regressor is zero or not.
Problem of multicollinearity: VIF, eigenvalues
Dummy variables
Examining the assumptions: common variance, independence, normality