
Intro to Multiple Linear Regression

Multiple Linear Regression

Multiple Linear Regression: Model


• Multiple linear regression can be thought of as a generalization of
simple linear regression

• Simple linear regression:


– One numerical response variable (y)
– One numerical explanatory variable (x)
– Model: linear equation w/ two coefficients

y  yˆ    0  1 x 
• Model fit is characterized by $r$, which describes the strength and
direction of the relationship between $x$ and $y$, and by $r^2$, which
describes the proportion of variance explained by the model



Multiple Linear Regression

Multiple Linear Regression: Model


• Multiple linear regression can be thought of as a generalization of
simple linear regression

• Multiple linear regression:


– One numerical response variable (y)
– $k$ explanatory variables ($x_1, \ldots, x_k$)
– Model: linear equation w/ k + 1 coefficients
y  yˆ    0  1 x1   2 x2    k xk 
• Model fit is characterized by $R^2$ (multiple $R^2$), which describes the
proportion of the variance in $y$ explained by the model $\hat{y}$.

• A multiple $R$ does exist, but it represents something different: the
correlation between $y$ and the full model $\hat{y}$.



Multiple Linear Regression

Multiple Linear Regression: Model


Note that fitting an MLR is very different from fitting a separate simple
linear regression for each explanatory variable

In particular, all estimates in an MLR for a given variable are conditional
on all other variables being in the model, and should be interpreted that
way

• E.g., for the coefficient (slope) b associated with a variable x:

– If x is numerical, then, all else held constant, for a one-unit increase in x,
y is expected to change by b units

– If x is categorical, then, all else held constant, the expected effect of
factor x is a change of b units from baseline (more on this later)



Multiple Linear Regression

Multiple Linear Regression: Assumptions


1. Linearity (important for inference, critical for prediction)
– Assumes that the relationship among scores is best described by a
hyperplane
2. Normality (important for inference)
– Assumes that the residuals are normally distributed
3. Homoscedasticity (important for inference)
– Assumes that the variability of residuals is homogeneous across
different score values
4. Independence of residuals (critical for inference)
– Assumes that residuals are not systematically correlated
5. Non-multicollinearity (critical for inference)
– Assumes that none of the explanatory variables are highly correlated
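
In R, the first four assumptions are usually probed with the fitted model's
residual diagnostics (assumption 5 is checked with VIFs; see the collinearity
slide near the end). A minimal sketch, assuming a model fit with lm() on the
school data frame used in the example below:

fit <- lm(SAT ~ Expend + PctSAT, data = school)  # example fit; 'school' as in the slides below
par(mfrow = c(2, 2))  # arrange the four diagnostic panels
plot(fit)             # residuals vs. fitted (linearity, homoscedasticity),
                      # normal Q-Q (normality), scale-location, leverage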



Multiple Linear Regression

An Example
How does educational expenditure affect student performance?

State        Expend  PTratio  Salary  PctSAT  verbal  math   SAT
Alabama       4.405     17.2  31.144       8     491   538  1029
Alaska        8.963     17.6  47.951      47     445   489   934
Arizona       4.778     19.3  32.175      27     448   496   944
Arkansas      4.459     17.1  28.934       6     482   523  1005
California    4.992     24.0  41.078      45     417   485   902
Colorado      5.443     18.4  34.571      29     462   518   980
Connecticut   8.817     14.4  50.045      81     431   477   908
Delaware      7.030     16.6  39.076      68     429   468   897
Florida       5.718     19.1  32.588      48     420   469   889
...             ...      ...     ...     ...     ...   ...   ...

(Expend and Salary are in $1000s; PctSAT is the percent of students taking the SAT)



Multiple Linear Regression

1st Approach: Let's take a look at the relationship between per-student
expenditure and SAT scores using a simple linear regression.

Our Model:

$y = \beta_0 + \beta_1 x$

with y = “SAT” and x = “Expend”
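
In R, this simple regression can be fit with lm(); a minimal sketch, assuming
the data sit in a data frame named school (the name used in the lm() calls
later in these slides), with fit1 a hypothetical object name:

fit1 <- lm(SAT ~ Expend, data = school)  # regress state mean SAT on per-student expenditure
summary(fit1)                            # coefficient table, R-squared, F test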



Multiple Linear Regression

1st Approach: Let's take a look at the relationship between per-student
expenditure and SAT scores using a simple linear regression.

Is this relationship surprising?
Is it significant?



Multiple Linear Regression

lm(formula = SAT ~ Expend, data = school)

Residuals:
Min 1Q Median 3Q Max
-145.074 -46.821 4.087 40.034 128.489

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1089.294 44.390 24.539 < 2e-16 ***
Expend -20.892 7.328 -2.851 0.00641 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 69.91 on 48 degrees of freedom


Multiple R-squared: 0.1448, Adjusted R-squared: 0.127
F-statistic: 8.128 on 1 and 48 DF, p-value: 0.006408

Our Model:

y  1089.29  20.89 x 

with x = “Expend” and y = “SAT”



Multiple Linear Regression

That was weird: more spending predicting lower SAT scores? Let's try looking at another explanatory variable.



Multiple Linear Regression

lm(formula = SAT ~ PctSAT, data = school)

Residuals:
Min 1Q Median 3Q Max
-79.158 -27.364 3.308 19.876 66.080

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1053.3204 8.2112 128.28 <2e-16 ***
PctSAT -2.4801 0.1862 -13.32 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 34.89 on 48 degrees of freedom


Multiple R-squared: 0.787, Adjusted R-squared: 0.7825
F-statistic: 177.3 on 1 and 48 DF, p-value: < 2.2e-16

Our Model:

y  1053.32  2.48 x 

with x = “PctSAT” and y = “SAT”

What can we conclude at this point?



Multiple Linear Regression

2nd Approach: Let's model the effects of per-student expenditure and
test-taking percentage on SAT scores

Our Model:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

with y = “SAT”, x1 = “Expend”, and x2 = “PctSAT”
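
A minimal sketch of fitting this two-predictor model in R (this is the call
whose output appears on the next slide; fit2 is a hypothetical object name):

fit2 <- lm(SAT ~ Expend + PctSAT, data = school)  # both predictors enter the model
summary(fit2)        # each slope is now conditional on the other predictor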



Multiple Linear Regression

lm(formula = SAT ~ Expend + PctSAT, data = school)

Residuals:
Min 1Q Median 3Q Max
-88.400 -22.884 1.968 19.142 68.755

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 993.8317 21.8332 45.519 < 2e-16 ***
Expend 12.2865 4.2243 2.909 0.00553 **
PctSAT -2.8509 0.2151 -13.253 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.46 on 47 degrees of freedom


Multiple R-squared: 0.8195, Adjusted R-squared: 0.8118
F-statistic: 106.7 on 2 and 47 DF, p-value: < 2.2e-16

Our Model:

y  993.83  12.29 x1  2.85 x2  Interpretation?

with x1 =“Expend”, x2 = “PctSAT”, and y = “SAT”





Multiple Linear Regression

Predictions
Example: You learn that, in some other year, 50% of students take the
SAT in a state that spends $5k per pupil. What is the expected mean
SAT score for this state (assuming that the relationship holds across
years)?

$\text{SAT} = \beta_0 + \beta_1\,\text{Expend} + \beta_2\,\text{PctSAT}$
$= 993.83 + 12.29(5) - 2.85(50) = 912.78$



Multiple Linear Regression

Interpretation of Coefficients
• Again, each estimated coefficient represents the amount by which y is
expected to change when the value of the corresponding predictor
(explanatory variable) is increased by one, while holding constant
the values of all other predictors.

• Example: The estimated coefficient for “Expend” is 12.29.


– For each additional $1k spent per student, we expect the average SAT
score in that state to increase by 12.29 points, when holding all other
variables constant

• Estimated coefficient for “PctSAT” is -2.85.


– For each additional 1% of students taking the exam, we expect the
average SAT score in the state to decrease by 2.85 points, when
holding all other variables constant.



Multiple Linear Regression

Which Variable is the Strongest Predictor?


• The predictor with the strongest linear association with the
outcome variable is the one whose coefficient has the largest absolute
t value in the summary table

• Note that, as in simple linear regression, the size of the coefficient
itself is not reliable for determining strength: it is sensitive to the
scales of the different variables. The t statistic is not, because it is a
standardized measure.

• In our previous example, the percentage of students taking the
exam was a much better predictor of average SAT performance
than per-student expenditure was.
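
One way to make this comparison in R is to pull the t values out of the
summary table, or to refit with standardized (z-scored) variables; a sketch,
assuming the hypothetical fit2 from before:

summary(fit2)$coefficients[, "t value"]   # |t| is ~2.9 for Expend vs. ~13.3 for PctSAT

# the same ordering emerges from standardized (beta) coefficients
fit2_std <- lm(scale(SAT) ~ scale(Expend) + scale(PctSAT), data = school)
coef(fit2_std)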



Multiple Linear Regression

Modeling Categorical Predictors


• When predictors are categorical and assigned numbers, running a
regression on those numbers makes no sense

• Instead, you should create a separate binary variable for each
possible category (other than a 'baseline' category), setting the
variable to '1' if the record corresponds to the category and '0'
otherwise.
– E.g., if your categories for a particular factor corresponded to different
medical treatments, say 'control', 'drug', and 'exercise', then you would
break this up into two variables ('drug' and 'exercise'), each of which
could take on a value of zero or one.

• These are called ‘indicator’ or ‘dummy’ variables
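
R constructs these dummy variables automatically when a predictor is stored
as a factor; a minimal sketch using the hypothetical treatment example above:

treatment <- factor(c("control", "drug", "exercise", "drug", "control"))
model.matrix(~ treatment)   # columns 'treatmentdrug' and 'treatmentexercise';
                            # 'control' is the baseline (both dummies zero)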



Multiple Linear Regression

Hypothesis Tests and CI’s for Coefficients


• The reported t statistics (b/SE) and p-values are used to test
whether a particular coefficient differs significantly from 0, given that
all other coefficients are in the model.

• CI’s for coefficients are computed in the same way as for simple
linear regression (and for t-distributed variables generally)

CI1 ( i )  bi  ˆ bi t / 2
• The number of degrees of freedom for the t distribution is (n-k-1),
where n is the number of data points (records) and k is the number
of predictors (explanatory variables) in the model.
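
In R, confint() computes these intervals directly; a sketch using the
hypothetical fit2 and, for comparison, the by-hand formula for one coefficient:

confint(fit2, level = 0.95)   # t-based CIs on n - k - 1 = 47 df

# by hand for the Expend coefficient: b ± SE * t(alpha/2)
12.2865 + c(-1, 1) * 4.2243 * qt(0.975, df = 47)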



Multiple Linear Regression

Hypothesis Tests for Multiple Regression


• For the overall model, we have to test the value of R2 against its
expected value (0) under the null hypothesis. Note that the
distribution of R2 depends both on the number of data points that we
have and on the number of factors that we are using to fit the model.
• We can compute the significance of R2 by forming the following
statistic:

$F = \frac{R^2\,(n - k - 1)}{(1 - R^2)\,k}$
• which will be distributed approximately as an $F(k,\,n-k-1)$ statistic
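
As a check, plugging the two-predictor model from earlier ($R^2$ = 0.8195,
n = 50, k = 2) into this formula recovers the F statistic that summary() reported:

R2 <- 0.8195; n <- 50; k <- 2
Fstat <- (R2 * (n - k - 1)) / ((1 - R2) * k)             # = 106.7, on 2 and 47 df
pf(Fstat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)  # p-value, effectively zero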



Multiple Linear Regression

Building More Complex Models

State        Expend  PTratio  Salary  PctSAT  verbal  math   SAT
Alabama       4.405     17.2  31.144       8     491   538  1029
Alaska        8.963     17.6  47.951      47     445   489   934
Arizona       4.778     19.3  32.175      27     448   496   944
Arkansas      4.459     17.1  28.934       6     482   523  1005
California    4.992     24.0  41.078      45     417   485   902
Colorado      5.443     18.4  34.571      29     462   518   980
Connecticut   8.817     14.4  50.045      81     431   477   908
Delaware      7.030     16.6  39.076      68     429   468   897
Florida       5.718     19.1  32.588      48     420   469   889
...             ...      ...     ...     ...     ...   ...   ...





Multiple Linear Regression

Avoiding (Multi-)Collinearity
• When predictors are highly correlated, standard errors become
inflated

• Conceptual example:
– Suppose that two variables z and x are exactly the same.
– Suppose the population regression line of y is

y  10  5 x
– If you fit a regression using sample data of y on both x and z, you wind
up fitting
y  10  1 x   2 z
– You can see that any value will work for the two coefficients, as long as
they add up to 5. Equivalently, this means that the standard errors for
the coefficients are huge.
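
A standard numerical check is the variance inflation factor; a sketch,
assuming the car package is installed and fit2 is the hypothetical
two-predictor model from earlier:

library(car)
vif(fit2)    # VIF near 1 means little collinearity; values above ~5-10
             # usually signal inflated standard errors
cor(school$Expend, school$PctSAT)   # the pairwise correlation driving the VIFs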



Multiple Linear Regression

lm(formula = SAT ~ Expend + PctSAT + PTratio + is.northeast, data = school)

Residuals:
Min 1Q Median 3Q Max
-87.040 -14.739 -5.112 20.255 72.428

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1038.7458 48.6843 21.336 <2e-16 ***
Expend 7.9098 4.5649 1.733 0.0900 .
PctSAT -3.0762 0.2361 -13.030 <2e-16 ***
PTratio -1.0618 2.1860 -0.486 0.6295    ← Remove and refit
is.northeast 33.8557 16.5452 2.046 0.0466 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31.44 on 45 degrees of freedom


Multiple R-squared: 0.8378, Adjusted R-squared: 0.8234
F-statistic: 58.12 on 4 and 45 DF, p-value: < 2.2e-16



Multiple Linear Regression

lm(formula = SAT ~ Expend + PctSAT + is.northeast, data = school)

Residuals:
Min 1Q Median 3Q Max
-84.833 -18.528 -4.838 20.309 74.865

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1018.1301 23.6529 43.045 <2e-16 ***
Expend 8.3857 4.4214 1.897 0.0642 .    ← Remove and refit
PctSAT -3.0888 0.2327 -13.273 <2e-16 ***
is.northeast 35.5920 16.0197 2.222 0.0313 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31.18 on 46 degrees of freedom


Multiple R-squared: 0.837, Adjusted R-squared: 0.8263
F-statistic: 78.72 on 3 and 46 DF, p-value: < 2.2e-16



Multiple Linear Regression

Adjusted $R^2$
• $R^2$ means much the same thing in MLR as it does in simple
linear regression: the proportion of variance explained by the
model.
$R^2 = 1 - \frac{SS_{y-\hat{y}}}{SS_{\text{total}}}$

• However, when using $R^2$ to compare different models, this measure
will tend to favor (or be biased toward) models with more
explanatory variables. Therefore, when comparing models, we
generally use an adjusted term that corrects this bias somewhat.

$R^2_{\text{adj}} = 1 - \frac{SS_{y-\hat{y}}}{SS_{\text{total}}} \cdot \frac{n-1}{n-k-1}$
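
Equivalently, $R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$. Plugging in
the two-predictor model from earlier ($R^2$ = 0.8195, n = 50, k = 2) recovers
the adjusted value that summary() reported:

R2 <- 0.8195; n <- 50; k <- 2
1 - (1 - R2) * (n - 1) / (n - k - 1)   # = 0.8118, matching the summary output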

