
Linear Regression

DSCI 5240
Data Mining and Machine Learning for Business

Russell R. Torres
“…the statistician knows…that in nature there never was a normal distribution,
there never was a straight line, yet with normal and linear assumptions, known to
be false, he can often derive results which match, to a useful approximation, those
found in the real world.”

George Box

Predicting Numbers

• Assume you randomly select 10 people that walk through a door and record their GPAs:

  3.35, 3.14, 3.22, 3.29, 3.30, 3.70, 3.50, 3.48, 2.90, 3.59

• Now I ask you to predict the GPA of the next person to walk through the door
• You have no additional variables to aid in your prediction
• How would you do it?
Predicting Numbers

• One approach would be to make an educated guess
• If we note that the minimum GPA we observed was 2.90 and the maximum was 3.70
• We could assume that the GPA of the next person falls somewhere between those numbers and make a random guess
Predicting Numbers

• A better approach would be to take the average… in this example, the mean GPA is 3.347
• The average takes into account the range of GPAs but also those GPAs that appear to be most common
• Guessing the GPA of the next person as the average GPA will probably be wrong… but probably less wrong than guessing at random
Predicting Numbers

• The average line can be considered a model, which can be used to predict future values
• It's not a great model; for the 10 observations we have, it is never exactly correct
• What if we had access to another variable that might help our prediction? Something like GRE score…

[Figure: GPA plotted by observation (1–10 plus the unknown next person), with a horizontal line at the average of 3.347]
Linear Regression

• Linear regression uses ordinary least squares to allow us to predict a target (dependent) variable based on one or more input (independent) variables

  Simple Linear Regression:    y = β0 + β1x + ε
  Multiple Linear Regression:  y = β0 + β1x1 + β2x2 + ⋯ + βnxn + ε

• Where
  • y represents the target/dependent variable
  • x represents the predictor/independent variable
  • β (beta) represents the coefficients used in the model
  • ε (epsilon) represents error
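Not part of the original slides, but as a minimal sketch of how such models might be fit in R (the data frame name grads is an assumption; the GPA and GRE values are the ten observations used later in this deck; WorkExp is a purely hypothetical second predictor):

  # Ten observed students (values taken from the example later in the deck)
  grads <- data.frame(
    GPA = c(3.35, 3.14, 3.22, 3.29, 3.30, 3.70, 3.50, 3.48, 2.90, 3.59),
    GRE = c(596, 473, 482, 527, 505, 693, 626, 663, 447, 588)
  )

  # Simple linear regression: GPA = b0 + b1*GRE + error
  simple_fit <- lm(GPA ~ GRE, data = grads)
  coef(simple_fit)   # b0 (intercept) and b1 (slope)

  # Multiple linear regression would simply add predictors, e.g.:
  # multi_fit <- lm(GPA ~ GRE + WorkExp, data = grads)   # WorkExp is hypothetical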
About Lines

• The general equation of a line that you learned in high school is

  y = mx + b        compare:  y = β0 + β1x + ε

• Where
  • y represents the y coordinate
  • x represents the x coordinate
  • m represents the slope of the line
  • b represents the y intercept
Requirements
• Dependent variable must be at least an interval variable
• Independent variables may be interval or nominal (via dummy coding)
• Predicted values represent the mean of the target variable at the given values of
the independent variables

Dummy Coding

• Dummy coding is the practice of converting a nominal variable into several binary variables (1s and 0s) that can be used in a regression model
• If your variable has n levels, dummy coding will always result in n − 1 variables
• SAS Enterprise Miner and most other modern data modeling tools will perform the dummy coding operation for you (a minimal R sketch follows below)

  Region   North  South  East
  North      1      0      0
  East       0      0      1
  South      0      1      0
  East       0      0      1
  East       0      0      1
  North      1      0      0
  West       0      0      0
  West       0      0      0
  South      0      1      0
  West       0      0      0

  (West is the reference level: it is coded with 0s in all three dummy variables)
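A minimal sketch of the same idea in R, assuming a hypothetical region vector; model.matrix() performs the n − 1 coding automatically:

  # Hypothetical nominal variable with four levels
  region <- factor(c("North", "East", "South", "East", "East",
                     "North", "West", "West", "South", "West"))

  # Make West the reference level so it becomes the all-zeros baseline,
  # matching the table above
  region <- relevel(region, ref = "West")

  # model.matrix() expands the factor into an intercept plus n - 1 dummies
  model.matrix(~ region)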
Back to Our Example

• We have reason to believe that GRE score and GPA are correlated
  • Higher GRE scores are likely to correspond to higher GPAs
  • Lower GRE scores are likely to correspond to lower GPAs
• A scatter plot may help us confirm this assumption

  GPA    GRE
  3.35   596
  3.14   473
  3.22   482
  3.29   527
  3.30   505
  3.70   693
  3.50   626
  3.48   663
  2.90   447
  3.59   588
  ?      600
Scatter Plot of GPA vs. GRE Score

[Figure: scatter plot of the 10 observations, GRE score (400–750) on the x axis and GPA (2.5–3.9) on the y axis, showing a clear positive relationship]
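A sketch of how this plot might be reproduced in R, assuming the grads data frame from the earlier sketch:

  plot(grads$GRE, grads$GPA,
       xlab = "GRE Score", ylab = "GPA",
       xlim = c(400, 750), ylim = c(2.5, 3.9),
       pch = 19)   # solid points, axes matching the slide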
Linear Regression

• Linear regression provides a mathematical approach to model this relationship
• We hope the regression model outperforms the "average" model
• Our goal is to identify the line through the data that minimizes error

[Figure: the scatter plot of GPA vs. GRE score]
Calculating β Values

• There are a variety of approaches that will allow you to calculate β values
• For simple linear regression, the following formulas may be used:

  b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
  b0 = Ȳ − b1X̄

  GRE    GPA    GRE−mean   GPA−mean   (GRE−mean)(GPA−mean)   (GRE−mean)²   (GPA−mean)²
  596    3.35     36.00      0.0030          0.1080           1296.00        0.0000
  473    3.14    -87.00     -0.2070         18.0090           7569.00        0.0428
  482    3.22    -78.00     -0.1270          9.9060           6084.00        0.0161
  527    3.29    -33.00     -0.0570          1.8810           1089.00        0.0032
  505    3.30    -55.00     -0.0470          2.5850           3025.00        0.0022
  693    3.70    133.00      0.3530         46.9490          17689.00        0.1246
  626    3.50     66.00      0.1530         10.0980           4356.00        0.0234
  663    3.48    103.00      0.1330         13.6990          10609.00        0.0177
  447    2.90   -113.00     -0.4470         50.5110          12769.00        0.1998
  588    3.59     28.00      0.2430          6.8040            784.00        0.0590

  Total  5600   33.47                      160.5500          65270.00        0.4890
  Mean    560   3.3470

  (Here X is GRE and Y is GPA; "mean" denotes the column mean, 560 for GRE and 3.3470 for GPA.)
Calculating β Values

Plugging the totals from the table above into the formulas:

  b1 = Σ(GREi − mean)(GPAi − mean) / Σ(GREi − mean)²
  b1 = 160.5500 / 65270.0000
  b1 ≈ 0.0025 (0.002460 before rounding)

  b0 = mean(GPA) − b1 × mean(GRE)
  b0 = 3.3470 − 0.002460 × 560
  b0 ≈ 1.9695

  Predicted GPA = 1.9695 + 0.0025 × GRE

(The unrounded b1 is used when computing b0 and the fitted values on the following slides; with the rounded 0.0025 you would get 1.9470 rather than 1.9695.)
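A short sketch verifying the hand calculation against R's lm(), again assuming the grads data frame from the earlier sketch:

  # Hand calculation, mirroring the table above
  b1 <- sum((grads$GRE - mean(grads$GRE)) * (grads$GPA - mean(grads$GPA))) /
        sum((grads$GRE - mean(grads$GRE))^2)
  b0 <- mean(grads$GPA) - b1 * mean(grads$GRE)
  c(b0 = b0, b1 = b1)   # approx. 1.9695 and 0.00246 (0.0025 rounded)

  # Ordinary least squares via lm() gives the same coefficients
  coef(lm(GPA ~ GRE, data = grads))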
All Models are Wrong!

“All models are wrong, but some are useful.”

George Box

A Closer Look

[Figure: the fitted regression line with the vertical distances from each observation to the line highlighted]

These residuals represent the amount of error present in this model.
Minimizing Error

• Regression seeks to find coefficient values (βs) that minimize the sum of all squared error terms (Σεi²)

  y = β0 + β1x + ε
  ŷ = b0 + b1x
  ε = y − ŷ

• With b0 = 1.9695 and b1 = 0.0025 (unrounded coefficients are used in the fitted values below):

  Predicted GPA = 1.9695 + 0.0025 × GRE

  GRE    GPA    Predicted GPA (ŷ)    ε         ε²
  596    3.35       3.4356        -0.0856    0.0073
  473    3.14       3.1330         0.0070    0.0000
  482    3.22       3.1551         0.0649    0.0042
  527    3.29       3.2658         0.0242    0.0006
  505    3.30       3.2117         0.0883    0.0078
  693    3.70       3.6742         0.0258    0.0007
  626    3.50       3.5093        -0.0093    0.0001
  663    3.48       3.6004        -0.1204    0.0145
  447    2.90       3.0690        -0.1690    0.0286
  588    3.59       3.4159         0.1741    0.0303

  Sum of Squared Errors (Residual) = 0.0941   ← we want to minimize this!
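The residual table can be reproduced in R along these lines (assuming the grads data frame from the earlier sketches):

  fit <- lm(GPA ~ GRE, data = grads)
  eps <- resid(fit)                      # epsilon = y - y_hat per observation
  cbind(y_hat = fitted(fit), eps = eps, eps_sq = eps^2)
  sum(eps^2)                             # SSE (Residual), approx. 0.0941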
Remember our Original "Model?"

  GPA    Predicted GPA (ŷ)    ε         ε²
  3.35       3.3470         0.0030    0.0000
  3.14       3.3470        -0.2070    0.0428
  3.22       3.3470        -0.1270    0.0161
  3.29       3.3470        -0.0570    0.0032
  3.30       3.3470        -0.0470    0.0022
  3.70       3.3470         0.3530    0.1246
  3.50       3.3470         0.1530    0.0234
  3.48       3.3470         0.1330    0.0177
  2.90       3.3470        -0.4470    0.1998
  3.59       3.3470         0.2430    0.0590

  Sum of Squared Errors (Total) = 0.4890

[Figure: GPA by observation with the flat average line at 3.347]
Every Regression is a Comparison of Two Models

• The Reduced (Null) Model predicts every GPA with the mean, 3.3470: Sum of Squared Errors (Total) = 0.4890 (the table on the previous slide)
• The Full Model predicts each GPA from its GRE score: Sum of Squared Errors (Residual) = 0.0941 (the table from "Minimizing Error")

Goal is SSE minimization.
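A sketch of the same comparison in R: the intercept-only model is the reduced (null) model, and anova() formalizes the comparison between the two SSEs:

  null_fit <- lm(GPA ~ 1, data = grads)    # predicts the mean GPA for everyone
  full_fit <- lm(GPA ~ GRE, data = grads)  # predicts GPA from GRE

  sum(resid(null_fit)^2)   # SSE (Total), approx. 0.4890
  sum(resid(full_fit)^2)   # SSE (Residual), approx. 0.0941

  anova(null_fit, full_fit)  # F test: does adding GRE significantly reduce SSE?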
Regression Assumptions

• Model Parameters
  • Linearity – The relationship between each independent variable and the dependent variable is a line, holding the other variables fixed
  • No Multicollinearity – If there are multiple predictors in the model, they should not be highly correlated with one another
• Residuals
  • Homoscedasticity (Constant Variance) – The variance of the residuals should be constant across the predicted values, neither growing nor shrinking
  • No Autocorrelation (Statistical Independence) – There should be no correlation between consecutive error terms
  • Normal Distribution – The residuals should be normally distributed
• Regression assumptions are generally evaluated visually using a plot of residuals (y − ŷ) on the y axis and predicted values (ŷ) on the x axis, as in the sketch below
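A minimal sketch of that diagnostic plot in R, assuming the grads data frame from the earlier sketches:

  fit <- lm(GPA ~ GRE, data = grads)
  plot(fitted(fit), resid(fit),
       xlab = "Predicted values (y-hat)",
       ylab = "Residuals (y - y-hat)")
  abline(h = 0, lty = 2)   # residuals should scatter randomly around this line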
Regression Assumptions – Linearity

[Figure: two residual plots of y − ŷ against ŷ: residuals scattered randomly around 0 (assumption satisfied) next to residuals forming a curved pattern (assumption violated)]
Regression Assumptions – Linearity

• Violations of the linearity assumption can usually be observed in a scatter plot of x and y
• The residual plot exaggerates the non-linearity and makes it easier to see
• To resolve violations
  • Add nonlinear transformations of the variables
  • Add new variables

[Figure: residual plot showing a pronounced curved pattern]
Regression Assumptions – Linearity

• Consider the following data
• Simple linear regression results in a poor model (high error)
• A better approach is to perform a quadratic regression by using a squared term to "bend" the line

  ŷ = b0 + b1x
  ŷ = b0 + b1x + b2x²

[Figure: scatter plot of Y against X (both 0–100) with a curved trend that a straight line fits poorly]
Regression Assumptions – Linearity

• How about this data?
• Again, simple linear regression results in a poor model (high error)
• Polynomial regression creates a better approximation of this data (see the sketch below)

  ŷ = b0 + b1x
  ŷ = b0 + b1x + b2x² + b3x³

[Figure: scatter plot of Y against X with an S-shaped trend that a cubic polynomial tracks far better than a straight line]
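A sketch of quadratic and cubic fits in R; the data here are hypothetical, generated only to illustrate the technique:

  # Hypothetical curved data
  set.seed(1)
  x <- seq(0, 100, by = 2)
  y <- 20 + 0.9 * x - 0.008 * x^2 + rnorm(length(x), sd = 3)

  linear_fit    <- lm(y ~ x)               # straight line: poor fit on curved data
  quadratic_fit <- lm(y ~ x + I(x^2))      # squared term "bends" the line
  cubic_fit     <- lm(y ~ poly(x, 3, raw = TRUE))  # adds x^2 and x^3 terms

  # The polynomial terms should reduce the residual error substantially
  sum(resid(linear_fit)^2)
  sum(resid(quadratic_fit)^2)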
Regression Assumptions – Homoscedasticity

[Figure: two residual plots of y − ŷ against ŷ: constant spread around 0 (assumption satisfied) next to a spread that fans out as ŷ grows (assumption violated)]
Regression Assumptions – Homoscedasticity

• Violation of the homoscedasticity assumption means that as you increase x, your model gets progressively worse (or better) at prediction
• To resolve violations
  • Transform variables
  • Add new variables

[Figure: residual plot with a fan-shaped spread]
Regression Assumptions – Autocorrelation

[Figure: two residual plots of y − ŷ against ŷ: patternless residuals (assumption satisfied) next to residuals that swing in a wave-like pattern (assumption violated)]
Regression Assumptions – Autocorrelation

• Autocorrelation means that the errors are not independent of each other (i.e., the magnitude of the last error somehow influences the magnitude of this error)
• Autocorrelation is an issue in time series data (think about how the season might impact sales)
• To resolve violations (see the sketch below)
  • Add new variables
  • Add lags as predictors

[Figure: residual plot with a wave-like pattern across consecutive observations]
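A minimal sketch of adding a lag as a predictor, using hypothetical monthly sales data; the commented-out Durbin-Watson test (from the lmtest package) is one common formal check for autocorrelation:

  # Hypothetical monthly sales with a seasonal swing
  set.seed(2)
  month <- 1:48
  sales <- 100 + 10 * sin(2 * pi * month / 12) + rnorm(48, sd = 2)

  fit <- lm(sales ~ month)
  # lmtest::dwtest(fit)   # Durbin-Watson test for autocorrelated residuals

  # Add the previous period's sales as a lagged predictor
  sales_lag1 <- c(NA, head(sales, -1))
  lag_fit <- lm(sales ~ month + sales_lag1)  # lm() drops the NA row automatically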
Regression Assumptions – Normal Distribution

[Figure: paired examples of residual distributions, one consistent with normally distributed errors (satisfied) and one clearly non-normal (violated)]
Regression Assumptions – Normal Distribution

• If there is no discernible pattern in your residual plot, your errors are probably normally distributed
• Violations of the assumption of normal distribution are generally related to non-normality of model parameters
• To resolve violations (see the sketch below)
  • Transform variables

[Figure: residual plot with no discernible pattern]
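A sketch of two standard normality checks in R, assuming the grads data frame and model from the earlier sketches:

  fit <- lm(GPA ~ GRE, data = grads)

  # Q-Q plot: points near the line suggest normally distributed residuals
  qqnorm(resid(fit))
  qqline(resid(fit))

  # A formal test (low powered with only n = 10 observations)
  shapiro.test(resid(fit))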
About Transformation

• Normal distributions are desirable but not always present in a given data set
• Transformation involves mathematically manipulating the data to make modeling more feasible
• Transformation can hurt the interpretability of your model
Positive Skewness

• Moderately positive
  x_new = √x
• Substantially positive
  x_new = log10(x)
• Substantially positive (with zero values)
  x_new = log10(x + C)
  where C is a constant added to each score so that the smallest score is 1

Adapted from: http://oak.ucc.nau.edu/rh232/courses/eps625/handouts/data transformation handout.pdf
Negative Skewness

• Moderately negative
  x_new = √(K − x)
• Substantially negative
  x_new = log10(K − x)
• Where K is a constant from which each score is subtracted so that the smallest score is 1
  • Usually equal to the largest score plus 1

Adapted from: http://oak.ucc.nau.edu/rh232/courses/eps625/handouts/data transformation handout.pdf
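A sketch of these transformations in R, using a hypothetical right-skewed variable:

  # Hypothetical right-skewed variable that contains zeros
  x <- c(0, 1, 2, 2, 3, 5, 8, 13, 40, 95)

  x_sqrt <- sqrt(x)          # moderately positive skew
  C      <- 1 - min(x)       # shift so the smallest score is 1
  x_log  <- log10(x + C)     # substantially positive skew with zero values

  # Negative skew mirrors the idea: subtract from K = max(x) + 1 first
  K     <- max(x) + 1
  x_neg <- log10(K - x)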
Regression Output

• The β values calculated in regression should be interpreted by the modeler
• The β tells us the change in the dependent variable associated with a one-unit increase in the associated independent variable, all other things being equal
• For instance, in our model:

  Predicted GPA = 1.9695 + 0.0025 × GRE

  A one-unit increase in GRE increases the predicted value of GPA by 0.0025. Accordingly, a 400-unit increase in GRE increases the predicted GPA by a full grade point (400 × 0.0025 = 1.0).
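A quick way to confirm this interpretation is with R's predict(), assuming the model fit in the earlier sketches:

  fit <- lm(GPA ~ GRE, data = grads)

  # Predictions one GRE point apart differ by the slope b1
  diff(predict(fit, newdata = data.frame(GRE = c(600, 601))))  # approx. 0.0025

  # Predictions 400 GRE points apart differ by a full grade point
  diff(predict(fit, newdata = data.frame(GRE = c(200, 600))))  # approx. 1.0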
Regression Output

Evaluating the output of a regression can be complicated, but there are a few key elements you should always consider (in this order):
• Significance of the Model (F test and associated p-value) – Tests the overall significance of the model. In essence, it compares the SSE of the null model to that of the full model. If the F test is insignificant, your model adds no value over using the average of the dependent variable as a predictor.
• Significance of the Predictors (t tests and associated p-values) – Test the likelihood that the β of a given predictor is zero. If the t test is insignificant, the β (slope) might be zero and thus the predictor variable does not influence the dependent variable.
• Proportion of Variance Explained (R² and Adjusted R²) – Indicates how much of the variation in the dependent variable is explained by the model. Adjusted R² is adjusted downward as the number of predictors increases.
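In R, all three elements appear in a single summary() call; a sketch of where to look, assuming the model from the earlier sketches:

  fit <- lm(GPA ~ GRE, data = grads)
  summary(fit)
  # Read the output in the order described above:
  #   1. Bottom line: F-statistic and its p-value (significance of the model)
  #   2. Coefficients table: t value and Pr(>|t|) per predictor
  #   3. Multiple R-squared / Adjusted R-squared (variance explained)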
Regression Output (Excel)

[Screenshot: Excel regression output, with callouts marking the R² figures (variance explained), the ANOVA F test (significance of the model), and the coefficient t tests (significance of the predictors)]
Regression Output (SAS Enterprise Miner)

[Screenshot: SAS Enterprise Miner regression output, with callouts marking the model F test (significance of the model), the fit statistics (variance explained), and the predictor t tests (significance of the predictors)]
Regression Output (R)

[Screenshot: R summary() output, with callouts marking the coefficient t tests (significance of the predictors), the R² values (variance explained), and the F statistic (significance of the model)]
