
Intro to Multiple Linear Regression

Multiple Linear Regression

Multiple Linear Regression: Model


• Multiple linear regression can be thought of as a generalization of
simple linear regression

• Simple linear regression:


– One numerical response variable (y)
– One numerical explanatory variable (x)
– Model: linear equation w/ two coefficients

y  yˆ    0  1 x 
• Model fit is characterized by $r$, which describes the strength and
direction of the relationship between $x$ and $y$, and by $r^2$, which
describes the proportion of variance explained by the model



Multiple Linear Regression

Multiple Linear Regression: Model


• Multiple linear regression can be thought of as a generalization of
simple linear regression

• Multiple linear regression:


– One numerical response variable (y)
– $k$ explanatory variables ($x_1, \ldots, x_k$)
– Model: linear equation w/ k + 1 coefficients
y  yˆ    0  1 x1   2 x2    k xk 
• Model fit is characterized by $R^2$ (multiple $R^2$), which describes the
proportion of the variance in $y$ explained by the model $\hat{y}$.

• A multiple $R$ does exist, but it represents something different: the
correlation between $y$ and the full model $\hat{y}$.



Multiple Linear Regression

Multiple Linear Regression: Model


Note that fitting an MLR is very different from fitting a separate simple
linear regression for each explanatory variable

In particular, all estimates in an MLR for a given variable are conditional
on all other variables being in the model, and should be interpreted that
way

• E.g., for the coefficient (slope) b associated with a variable x:

– If x is numerical, then, all else held constant, for a one-unit increase in x,
y is expected to change by b units

– If x is categorical, then, all else held constant, the expected effect of
factor x is a change of b units from baseline (more on this later)



Multiple Linear Regression

Multiple Linear Regression: Assumptions


1. Linearity (important for inference, critical for prediction)
– Assumes that the relationship among scores is best described by a
hyperplane
2. Normality (important for inference)
– Assumes that the residuals are normally distributed
3. Homoscedasticity (important for inference)
– Assumes that the variability of residuals is homogeneous across
different score values
4. Independence of residuals (critical for inference)
– Assumes that residuals are not systematically correlated
5. Non-multicollinearity (critical for inference)
– Assumes that none of the explanatory variables are highly correlated
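
In R, the first four assumptions are usually probed with the fitted model's
residual diagnostics (assumption 5 is checked with VIFs; see the collinearity
slide near the end). A minimal sketch, assuming a model fit with lm() on the
school data frame used in the example below:

fit <- lm(SAT ~ Expend + PctSAT, data = school)  # example fit; 'school' as in the slides below
par(mfrow = c(2, 2))  # arrange the four diagnostic panels
plot(fit)             # residuals vs. fitted (linearity, homoscedasticity),
                      # normal Q-Q (normality), scale-location, leverage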



Multiple Linear Regression

An Example
How does educational expenditure affect student performance?

State        Expend  PTratio  Salary  PctSAT  verbal  math   SAT
Alabama       4.405     17.2  31.144       8     491   538  1029
Alaska        8.963     17.6  47.951      47     445   489   934
Arizona       4.778     19.3  32.175      27     448   496   944
Arkansas      4.459     17.1  28.934       6     482   523  1005
California    4.992     24.0  41.078      45     417   485   902
Colorado      5.443     18.4  34.571      29     462   518   980
Connecticut   8.817     14.4  50.045      81     431   477   908
Delaware      7.030     16.6  39.076      68     429   468   897
Florida       5.718     19.1  32.588      48     420   469   889
...             ...      ...     ...     ...     ...   ...   ...

(Expend and Salary are in $1000s; PctSAT is the percent of students taking the SAT)



Multiple Linear Regression

1st Approach: Let's take a look at the relationship between per-student
expenditure and SAT scores using a simple linear regression.

Our Model:

$y = \beta_0 + \beta_1 x$

with y = “SAT” and x = “Expend”
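
In R, this simple regression can be fit with lm(); a minimal sketch, assuming
the data sit in a data frame named school (the name used in the lm() calls
later in these slides), with fit1 a hypothetical object name:

fit1 <- lm(SAT ~ Expend, data = school)  # regress state mean SAT on per-student expenditure
summary(fit1)                            # coefficient table, R-squared, F test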



Multiple Linear Regression

1st Approach: Let's take a look at the relationship between per-student
expenditure and SAT scores using a simple linear regression.

Is this relationship surprising?
Is it significant?



Multiple Linear Regression

lm(formula = SAT ~ Expend, data = school)

Residuals:
Min 1Q Median 3Q Max
-145.074 -46.821 4.087 40.034 128.489

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1089.294 44.390 24.539 < 2e-16 ***
Expend -20.892 7.328 -2.851 0.00641 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 69.91 on 48 degrees of freedom


Multiple R-squared: 0.1448, Adjusted R-squared: 0.127
F-statistic: 8.128 on 1 and 48 DF, p-value: 0.006408

Our Model:

y  1089.29  20.89 x 

with x = “Expend” and y = “SAT”



Multiple Linear Regression

That was weird: more spending predicting lower SAT scores? Let's try looking at another explanatory variable.



Multiple Linear Regression

lm(formula = SAT ~ PctSAT, data = school)

Residuals:
Min 1Q Median 3Q Max
-79.158 -27.364 3.308 19.876 66.080

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1053.3204 8.2112 128.28 <2e-16 ***
PctSAT -2.4801 0.1862 -13.32 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 34.89 on 48 degrees of freedom


Multiple R-squared: 0.787, Adjusted R-squared: 0.7825
F-statistic: 177.3 on 1 and 48 DF, p-value: < 2.2e-16

Our Model:

y  1053.32  2.48 x 

with x = “PctSAT” and y = “SAT”

What can we conclude at this point?



Multiple Linear Regression

2nd Approach: Let's model the effects of per-student expenditure and
test-taking percentage on SAT scores

Our Model:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

with y = “SAT”, x1 = “Expend”, and x2 = “PctSAT”
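
A minimal sketch of fitting this two-predictor model in R (this is the call
whose output appears on the next slide; fit2 is a hypothetical object name):

fit2 <- lm(SAT ~ Expend + PctSAT, data = school)  # both predictors enter the model
summary(fit2)        # each slope is now conditional on the other predictor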



Multiple Linear Regression

lm(formula = SAT ~ Expend + PctSAT, data = school)

Residuals:
Min 1Q Median 3Q Max
-88.400 -22.884 1.968 19.142 68.755

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 993.8317 21.8332 45.519 < 2e-16 ***
Expend 12.2865 4.2243 2.909 0.00553 **
PctSAT -2.8509 0.2151 -13.253 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.46 on 47 degrees of freedom


Multiple R-squared: 0.8195, Adjusted R-squared: 0.8118
F-statistic: 106.7 on 2 and 47 DF, p-value: < 2.2e-16

Our Model:

y  993.83  12.29 x1  2.85 x2  Interpretation?

with x1 =“Expend”, x2 = “PctSAT”, and y = “SAT”





Multiple Linear Regression

Predictions
Example: You learn that, in some other year, 50% of students take the
SAT in a state that spends $5k per pupil. What is the expected mean
SAT score for this state (assuming that the relationship holds across
years)?

$\text{SAT} = \beta_0 + \beta_1\,\text{Expend} + \beta_2\,\text{PctSAT}$
$= 993.83 + 12.29(5) - 2.85(50) = 912.78$



Multiple Linear Regression

Interpretation of Coefficients
• Again, each estimated coefficient represents the amount by which y is
expected to change when the value of the corresponding predictor
(explanatory variable) is increased by one, while holding constant
the values of all other predictors.

• Example: The estimated coefficient for “Expend” is 12.29.


– For each additional $1k spent per student, we expect the average SAT
score in that state to increase by 12.29 points, when holding all other
variables constant

• Estimated coefficient for “PctSAT” is -2.85.


– For each additional 1% of students taking the exam, we expect the
average SAT score in the state to decrease by 2.85 points, when
holding all other variables constant.



Multiple Linear Regression

Which Variable is the Strongest Predictor?


• The predictor with the strongest linear association with the
outcome variable is the one whose coefficient has the largest absolute
t value in the summary table

• Note that, as in simple linear regression, the size of the coefficient
itself is not reliable for determining strength: it is sensitive to the
scales of the different variables. The t statistic is not, because it is a
standardized measure.

• In our previous example, the percentage of students taking the
exam was a much better predictor of average SAT performance
than per-student expenditure was.
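
One way to make this comparison in R is to pull the t values out of the
summary table, or to refit with standardized (z-scored) variables; a sketch,
assuming the hypothetical fit2 from before:

summary(fit2)$coefficients[, "t value"]   # |t| is ~2.9 for Expend vs. ~13.3 for PctSAT

# the same ordering emerges from standardized (beta) coefficients
fit2_std <- lm(scale(SAT) ~ scale(Expend) + scale(PctSAT), data = school)
coef(fit2_std)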



Multiple Linear Regression

Modeling Categorical Predictors


• When predictors are categorical and assigned numbers, running a
regression on those numbers makes no sense

• Instead, you should create a separate binary variable for each
possible category (other than a 'baseline' category), setting the
variable to '1' if the record corresponds to the category and '0'
otherwise.
– E.g., if your categories for a particular factor corresponded to different
medical treatments, say 'control', 'drug', and 'exercise', then you would
break this up into two variables ('drug' and 'exercise'), each of which
could take on a value of zero or one.

• These are called ‘indicator’ or ‘dummy’ variables
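
R constructs these dummy variables automatically when a predictor is stored
as a factor; a minimal sketch using the hypothetical treatment example above:

treatment <- factor(c("control", "drug", "exercise", "drug", "control"))
model.matrix(~ treatment)   # columns 'treatmentdrug' and 'treatmentexercise';
                            # 'control' is the baseline (both dummies zero)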



Multiple Linear Regression

Hypothesis Tests and CI’s for Coefficients


• The reported t statistics (b/SE) and p-values are used to test
whether a particular coefficient differs significantly from 0, given that
all other coefficients are in the model.

• CI’s for coefficients are computed in the same way as for simple
linear regression (and for t-distributed variables generally)

CI1 ( i )  bi  ˆ bi t / 2
• The number of degrees of freedom for the t distribution is (n-k-1),
where n is the number of data points (records) and k is the number
of predictors (explanatory variables) in the model.
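
In R, confint() computes these intervals directly; a sketch using the
hypothetical fit2 and, for comparison, the by-hand formula for one coefficient:

confint(fit2, level = 0.95)   # t-based CIs on n - k - 1 = 47 df

# by hand for the Expend coefficient: b ± SE * t(alpha/2)
12.2865 + c(-1, 1) * 4.2243 * qt(0.975, df = 47)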



Multiple Linear Regression

Hypothesis Tests for Multiple Regression


• For the overall model, we have to test the value of R2 against its
expected value (0) under the null hypothesis. Note that the
distribution of R2 depends both on the number of data points that we
have and on the number of factors that we are using to fit the model.
• We can compute the significance of R2 by forming the following
statistic:

$F = \frac{R^2\,(n - k - 1)}{(1 - R^2)\,k}$
• which will be distributed approximately as an $F(k,\,n-k-1)$ statistic
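
As a check, plugging the two-predictor model from earlier ($R^2$ = 0.8195,
n = 50, k = 2) into this formula recovers the F statistic that summary() reported:

R2 <- 0.8195; n <- 50; k <- 2
Fstat <- (R2 * (n - k - 1)) / ((1 - R2) * k)             # = 106.7, on 2 and 47 df
pf(Fstat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)  # p-value, effectively zero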



Multiple Linear Regression

Building More Complex Models

State        Expend  PTratio  Salary  PctSAT  verbal  math   SAT
Alabama       4.405     17.2  31.144       8     491   538  1029
Alaska        8.963     17.6  47.951      47     445   489   934
Arizona       4.778     19.3  32.175      27     448   496   944
Arkansas      4.459     17.1  28.934       6     482   523  1005
California    4.992     24.0  41.078      45     417   485   902
Colorado      5.443     18.4  34.571      29     462   518   980
Connecticut   8.817     14.4  50.045      81     431   477   908
Delaware      7.030     16.6  39.076      68     429   468   897
Florida       5.718     19.1  32.588      48     420   469   889
...             ...      ...     ...     ...     ...   ...   ...





Multiple Linear Regression

Avoiding (Multi-)Collinearity
• When predictors are highly correlated, standard errors become
inflated

• Conceptual example:
– Suppose that two variables z and x are exactly the same.
– Suppose the population regression line of y is

y  10  5 x
– If you fit a regression using sample data of y on both x and z, you wind
up fitting
y  10  1 x   2 z
– You can see that any value will work for the two coefficients, as long as
they add up to 5. Equivalently, this means that the standard errors for
the coefficients are huge.
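
A standard numerical check is the variance inflation factor; a sketch,
assuming the car package is installed and fit2 is the hypothetical
two-predictor model from earlier:

library(car)
vif(fit2)    # VIF near 1 means little collinearity; values above ~5-10
             # usually signal inflated standard errors
cor(school$Expend, school$PctSAT)   # the pairwise correlation driving the VIFs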



Multiple Linear Regression

lm(formula = SAT ~ Expend + PctSAT + PTratio + is.northeast, data = school)

Residuals:
Min 1Q Median 3Q Max
-87.040 -14.739 -5.112 20.255 72.428

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1038.7458 48.6843 21.336 <2e-16 ***
Expend 7.9098 4.5649 1.733 0.0900 .
PctSAT -3.0762 0.2361 -13.030 <2e-16 ***
PTratio -1.0618 2.1860 -0.486 0.6295    ← Remove and refit
is.northeast 33.8557 16.5452 2.046 0.0466 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31.44 on 45 degrees of freedom


Multiple R-squared: 0.8378, Adjusted R-squared: 0.8234
F-statistic: 58.12 on 4 and 45 DF, p-value: < 2.2e-16



Multiple Linear Regression

lm(formula = SAT ~ Expend + PctSAT + is.northeast, data = school)

Residuals:
Min 1Q Median 3Q Max
-84.833 -18.528 -4.838 20.309 74.865

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1018.1301 23.6529 43.045 <2e-16 ***
Expend 8.3857 4.4214 1.897 0.0642 .    ← Remove and refit
PctSAT -3.0888 0.2327 -13.273 <2e-16 ***
is.northeast 35.5920 16.0197 2.222 0.0313 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31.18 on 46 degrees of freedom


Multiple R-squared: 0.837, Adjusted R-squared: 0.8263
F-statistic: 78.72 on 3 and 46 DF, p-value: < 2.2e-16



Multiple Linear Regression

Adjusted $R^2$
• $R^2$ means much the same thing in MLR as it does in simple
linear regression: the proportion of variance explained by the
model.
$R^2 = 1 - \frac{SS_{y-\hat{y}}}{SS_{\text{total}}}$

• However, when using $R^2$ to compare different models, this measure
will tend to favor (or be biased toward) models with more
explanatory variables. Therefore, when comparing models, we
generally use an adjusted term that corrects this bias somewhat.

$R^2_{\text{adj}} = 1 - \frac{SS_{y-\hat{y}}}{SS_{\text{total}}} \cdot \frac{n-1}{n-k-1}$
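
Equivalently, $R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$. Plugging in
the two-predictor model from earlier ($R^2$ = 0.8195, n = 50, k = 2) recovers
the adjusted value that summary() reported:

R2 <- 0.8195; n <- 50; k <- 2
1 - (1 - R2) * (n - 1) / (n - k - 1)   # = 0.8118, matching the summary output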

