DSCI 5240
Data Mining and Machine Learning for Business
Russell R. Torres
“…the statistician knows…that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.”

George Box
Predicting Numbers

Observed GPAs: 3.35, 3.14, 3.22, 3.29, 3.30, 3.70, 3.50, 3.48, 2.90, 3.59. Next person: ?

• One approach would be to make an educated guess
• If we note that the minimum GPA we observed was 2.90 and the maximum was 3.70, we could assume that the GPA of the next person falls somewhere between those numbers and make a random guess
Predicting Numbers

• A better approach would be to take the average… in this example $\overline{GPA} = 3.347$
• The average takes into account the range of GPAs but also those GPAs that appear to be most common
• Guessing the GPA of the next person as the average GPA will probably be wrong… but probably less wrong than guessing at random
Predicting Numbers
[Figure: the observed GPAs plotted on a GPA axis (2.9–3.5) with the average model's prediction shown as a horizontal line]
Linear Regression
• Linear regression uses ordinary least squares to allow us to predict a target
(dependent) variable based on one or more input (independent) variables
Simple Linear Regression: $y = \beta_0 + \beta_1 x + \varepsilon$

Multiple Linear Regression: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon$
• Where
  • $y$ represents the target/dependent variable
  • $x$ represents the predictor/independent variable
  • $\beta$ (beta) represents the coefficients used in the model
  • $\varepsilon$ (epsilon) represents error
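Both forms can be fit with a single call in R. A minimal sketch on hypothetical data (the data frame df, the variable names x1/x2/y, and the coefficient values are made up for illustration):

```r
# Hypothetical data with known coefficients plus random error
set.seed(42)
df <- data.frame(x1 = rnorm(100, mean = 50, sd = 10),
                 x2 = rnorm(100, mean = 5, sd = 2))
df$y <- 2 + 0.5 * df$x1 - 1.2 * df$x2 + rnorm(100)

simple_fit   <- lm(y ~ x1, data = df)        # simple linear regression
multiple_fit <- lm(y ~ x1 + x2, data = df)   # multiple linear regression

coef(simple_fit)     # estimates of beta0 and beta1
coef(multiple_fit)   # estimates of beta0, beta1, and beta2
```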
About Lines
• The general equation of a line that you learned in high school is
$y = mx + b \quad\longleftrightarrow\quad y = \beta_0 + \beta_1 x + \varepsilon$

• Where
  • $y$ represents the y coordinate
  • $x$ represents the x coordinate
  • $m$ represents the slope of the line
  • $b$ represents the y intercept
Requirements
• Dependent variable must be at least an interval variable
• Independent variables may be interval or nominal (via dummy coding)
• Predicted values represent the mean of the target variable at the given values of
the independent variables
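Because lm() in R dummy-codes nominal predictors automatically, the mechanics are easy to see. A sketch using a few of the course's GPA/GRE values plus a hypothetical nominal variable major:

```r
# GPA and GRE from the course data; 'major' is hypothetical
df <- data.frame(
  gpa   = c(3.35, 3.14, 3.22, 3.29, 3.30, 3.70),
  gre   = c(596, 473, 482, 527, 505, 693),
  major = factor(c("Arts", "Science", "Business", "Arts", "Business", "Science"))
)

# lm() expands 'major' into 0/1 dummy columns, using the first level as baseline
fit <- lm(gpa ~ gre + major, data = df)
coef(fit)

# model.matrix() shows the dummy columns that were actually created
model.matrix(~ major, data = df)
```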
Dummy Coding

• We have reason to believe that GRE score and GPA are correlated
  • Higher GRE scores are likely to correspond to higher GPAs
  • Lower GRE scores are likely to correspond to lower GPAs
• A scatter plot may help us confirm this assumption

GPA    GRE
3.35   596
3.14   473
3.22   482
3.29   527
3.30   505
3.70   693
3.50   626
3.48   663
2.90   447
3.59   588
?      600
Scatter Plot of GPA vs. GRE Score
[Figure: scatter plot of GPA (2.9–3.7) against GRE score, with higher GPAs appearing at higher GRE scores]
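One way to reproduce this plot in base R with the course data:

```r
gre <- c(596, 473, 482, 527, 505, 693, 626, 663, 447, 588)
gpa <- c(3.35, 3.14, 3.22, 3.29, 3.30, 3.70, 3.50, 3.48, 2.90, 3.59)

# Scatter plot of GPA against GRE score
plot(gre, gpa, xlab = "GRE", ylab = "GPA",
     main = "Scatter Plot of GPA vs. GRE Score")
```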
Linear Regression
• We hope the regression model outperforms the “average” model

[Figure: GPA vs. GRE scatter plot with a fitted regression line]
Calculating β Values

• There are a variety of approaches that will allow you to calculate $\beta$ values
• For simple linear regression, the following formulas may be used:

$b_1 = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$

$b_0 = \bar{Y} - b_1 \bar{X} = 3.3470 - 0.0025 \times 560$

GRE (1) | GPA (2) | $GRE_i - \overline{GRE}$ (3) | $GPA_i - \overline{GPA}$ (4) | $(GRE_i - \overline{GRE})(GPA_i - \overline{GPA})$ (5) | $(GRE_i - \overline{GRE})^2$ (6) | $(GPA_i - \overline{GPA})^2$ (7)
596 | 3.35 | 36.0000 | 0.0030 | 0.1080 | 1296.0000 | 0.0000
473 | 3.14 | -87.0000 | -0.2070 | 18.0090 | 7569.0000 | 0.0428
482 | 3.22 | -78.0000 | -0.1270 | 9.9060 | 6084.0000 | 0.0161
527 | 3.29 | -33.0000 | -0.0570 | 1.8810 | 1089.0000 | 0.0032
505 | 3.30 | -55.0000 | -0.0470 | 2.5850 | 3025.0000 | 0.0022
693 | 3.70 | 133.0000 | 0.3530 | 46.9490 | 17689.0000 | 0.1246
626 | 3.50 | 66.0000 | 0.1530 | 10.0980 | 4356.0000 | 0.0234
663 | 3.48 | 103.0000 | 0.1330 | 13.6990 | 10609.0000 | 0.0177
447 | 2.90 | -113.0000 | -0.4470 | 50.5110 | 12769.0000 | 0.1998
588 | 3.59 | 28.0000 | 0.2430 | 6.8040 | 784.0000 | 0.0590
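These formulas are easy to verify by hand in R and to check against lm(). A sketch with the course data:

```r
gre <- c(596, 473, 482, 527, 505, 693, 626, 663, 447, 588)
gpa <- c(3.35, 3.14, 3.22, 3.29, 3.30, 3.70, 3.50, 3.48, 2.90, 3.59)

# Ordinary least squares coefficients from the formulas above
b1 <- sum((gre - mean(gre)) * (gpa - mean(gpa))) / sum((gre - mean(gre))^2)
b0 <- mean(gpa) - b1 * mean(gre)
c(b0 = b0, b1 = b1)    # approx. 1.9695 and 0.0025

# The same coefficients from lm()
coef(lm(gpa ~ gre))
```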
A Closer Look
[Figure: scatter plot with the fitted regression line and the vertical distance from each point to the line highlighted]

These residuals represent the amount of error present in this model.
Minimizing Error

• Regression seeks to find coefficient values ($\beta$s) that minimize the sum of all squared error terms ($\sum \varepsilon_i^2$)

$y = \beta_0 + \beta_1 x + \varepsilon$
$\hat{y} = b_0 + b_1 x$
$\varepsilon = y - \hat{y}$

• $b_0 = 1.9695$ and $b_1 = 0.0025$

$\widehat{GPA} = 1.9695 + 0.0025 \, GRE$

GRE (1) | GPA (2) | $\widehat{GPA}$ (3) | $\varepsilon$ (4) | $\varepsilon^2$ (5)
596 | 3.35 | 3.4356 | -0.0856 | 0.0073
473 | 3.14 | 3.1330 | 0.0070 | 0.0000
482 | 3.22 | 3.1551 | 0.0649 | 0.0042
527 | 3.29 | 3.2658 | 0.0242 | 0.0006
505 | 3.30 | 3.2117 | 0.0883 | 0.0078
693 | 3.70 | 3.6742 | 0.0258 | 0.0007
626 | 3.50 | 3.5093 | -0.0093 | 0.0001
663 | 3.48 | 3.6004 | -0.1204 | 0.0145
447 | 2.90 | 3.0690 | -0.1690 | 0.0286
588 | 3.59 | 3.4159 | 0.1741 | 0.0303

Sum of Squared Errors (Residual): 0.0941. We want to minimize this!
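The residuals and the SSE can be reproduced directly in R. A sketch with the course data:

```r
gre <- c(596, 473, 482, 527, 505, 693, 626, 663, 447, 588)
gpa <- c(3.35, 3.14, 3.22, 3.29, 3.30, 3.70, 3.50, 3.48, 2.90, 3.59)

fit <- lm(gpa ~ gre)
eps <- residuals(fit)    # epsilon_i = y_i - y-hat_i for each observation
sum(eps^2)               # sum of squared errors, approx. 0.094
```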
Remember our Original “Model?”

$\overline{GPA} = 3.347$

GPA (1) | $\widehat{GPA}$ (2) | $\varepsilon$ (3) | $\varepsilon^2$ (4)
3.35 | 3.3470 | 0.0030 | 0.0000
3.14 | 3.3470 | -0.2070 | 0.0428
3.22 | 3.3470 | -0.1270 | 0.0161
3.29 | 3.3470 | -0.0570 | 0.0032
3.30 | 3.3470 | -0.0470 | 0.0022
3.70 | 3.3470 | 0.3530 | 0.1246
3.50 | 3.3470 | 0.1530 | 0.0234
3.48 | 3.3470 | 0.1330 | 0.0177
2.90 | 3.3470 | -0.4470 | 0.1998
3.59 | 3.3470 | 0.2430 | 0.0590

[Figure: GPA values (2.9–3.9 axis) plotted with the constant prediction 3.347 as a horizontal line]
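Comparing the squared errors of the mean-only model and the regression model makes the improvement concrete. A sketch (the mean-only SSE is also the total sum of squares that R² is built from):

```r
gre <- c(596, 473, 482, 527, 505, 693, 626, 663, 447, 588)
gpa <- c(3.35, 3.14, 3.22, 3.29, 3.30, 3.70, 3.50, 3.48, 2.90, 3.59)

sse_mean <- sum((gpa - mean(gpa))^2)           # "average" model, approx. 0.489
sse_reg  <- sum(residuals(lm(gpa ~ gre))^2)    # regression model, approx. 0.094
c(mean_model = sse_mean, regression = sse_reg)
```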
Regression Assumptions – Linearity
[Figure: two residual plots of $y - \hat{y}$ against $\hat{y}$, one with residuals scattered randomly around zero and one with a curved pattern indicating a linearity violation]
Regression Assumptions – Linearity
• Violations of the linearity assumption can usually be observed in a scatter plot of x and y
• The residual plot exaggerates the non-linearity and makes it easier to see (one way to draw it is sketched below)
• To resolve violations
  • Add nonlinear transformations of the variables
  • Add new variables

[Figure: residual plot with a curved pattern]
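A residual plot takes only a few lines of base R. A sketch with the course data:

```r
gre <- c(596, 473, 482, 527, 505, 693, 626, 663, 447, 588)
gpa <- c(3.35, 3.14, 3.22, 3.29, 3.30, 3.70, 3.50, 3.48, 2.90, 3.59)
fit <- lm(gpa ~ gre)

# Residuals (y - y-hat) against fitted values (y-hat);
# a curved pattern here suggests a linearity violation
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```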
Regression Assumptions – Linearity

• We can add a squared term to “bend” the line

$\hat{y} = b_0 + b_1 x$
$\hat{y} = b_0 + b_1 x + b_2 x^2$

[Figure: scatter of Y against X (0–100) comparing the straight-line fit with the quadratic fit]
Regression Assumptions – Linearity

$\hat{y} = b_0 + b_1 x$
$\hat{y} = b_0 + b_1 x + b_2 x^2 + b_3 x^3$

[Figure: scatter of Y against X (0–100) comparing the straight-line fit with the cubic fit]
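In R, polynomial terms are added with I() inside the model formula. A sketch on hypothetical curved data:

```r
# Hypothetical nonlinear data for illustration
set.seed(1)
x <- seq(0, 100, by = 2)
y <- 5 + 0.8 * x - 0.007 * x^2 + rnorm(length(x), sd = 2)

linear    <- lm(y ~ x)                      # straight line
quadratic <- lm(y ~ x + I(x^2))             # squared term "bends" the line once
cubic     <- lm(y ~ x + I(x^2) + I(x^3))    # cubic term allows a second bend

# Compare the fits visually
plot(x, y)
lines(x, fitted(linear))
lines(x, fitted(quadratic), lty = 2)
lines(x, fitted(cubic), lty = 3)
```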
Regression Assumptions – Homoscedasticity
[Figure: two residual plots, one with constant spread around zero (homoscedastic) and one whose spread changes as $\hat{y}$ increases (heteroscedastic)]
Regression Assumptions – Homoscedasticity
• Violation of the homoscedasticity assumption means that as you increase x, your model gets progressively worse (or better) at prediction
• To resolve violations
  • Transform variables
  • Add new variables

[Figure: residual plot with a funnel-shaped spread]
Regression Assumptions – Autocorrelation
[Figure: two residual plots, one with independent errors and one where consecutive errors follow a systematic pattern (autocorrelation)]
Regression Assumptions – Autocorrelation
• Autocorrelation means that the errors are not independent of each other (i.e., the magnitude of the last error somehow influences the magnitude of this error)
• Autocorrelation is an issue in time series data (think about how the season might impact sales)
• To resolve violations
  • Add new variables
  • Add lags as predictors (sketched below)

[Figure: residual plot showing a wave-like pattern]
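A minimal sketch of the lagged-predictor idea, using hypothetical monthly sales with a seasonal pattern:

```r
# Hypothetical monthly sales with a 12-month seasonal cycle
set.seed(7)
sales <- 100 + 20 * sin(2 * pi * (1:48) / 12) + rnorm(48, sd = 5)

# Use last month's sales as a predictor of this month's sales
current <- sales[-1]              # observations 2..48
lagged  <- sales[-length(sales)]  # observations 1..47

summary(lm(current ~ lagged))
```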
Regression Assumptions – Normal Distribution

• The error terms ($\varepsilon$) are assumed to be normally distributed

[Figure: distribution of residuals compared against a normal curve]
About Transformation
Positive Skewness
• Moderately positive skew
  $x_{new} = \sqrt{x}$
• Substantially positive skew
  $x_{new} = \log_{10}(x + C)$
  where $C$ is a constant added to each score so that the smallest score is 1
Adapted from: http://oak.ucc.nau.edu/rh232/courses/eps625/handouts/data transformation handout.pdf
Negative Skewness
• Moderately negative skew
  $x_{new} = \sqrt{K - x}$
• Substantially negative skew
  $x_{new} = \log_{10}(K - x)$
  where $K$ is a constant (typically the largest score plus 1) from which each score is subtracted
Adapted from: http://oak.ucc.nau.edu/rh232/courses/eps625/handouts/data transformation handout.pdf
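These transformations are one-liners in R. A sketch on a hypothetical positively skewed variable (the reflection constant K follows the "largest score plus 1" convention above):

```r
# Hypothetical positively skewed variable
set.seed(3)
x <- rexp(200, rate = 0.2)

x_sqrt <- sqrt(x)          # moderate positive skew
C      <- 1 - min(x)       # shift so the smallest score is 1
x_log  <- log10(x + C)     # substantial positive skew

# For negative skew, reflect the variable first
K <- max(x) + 1
x_neg_sqrt <- sqrt(K - x)
x_neg_log  <- log10(K - x)
```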
Regression Output
• The $\beta$ values calculated in regression should be interpreted by the modeler
• The $\beta$ tells us the change in the dependent variable associated with a one-unit increase in the associated independent variable, all other things being equal (in the GRE model, $b_1 = 0.0025$ means each additional GRE point is associated with a 0.0025 increase in predicted GPA)
Regression Output
Evaluating the output of a regression can be complicated but there are a few key
elements you should always consider (in this order):
• Significance of the Model (F test and associated p-value) – Tests the overall significance of the model. In essence, it compares the SSE of the null model to that of the full model. If the F test is not significant, your model adds no value over using the average of the dependent variable as a predictor.
• Significance of the Predictors (t tests and associated p-values) – Test the likelihood that the $\beta$ of a given predictor is zero. If the t test is not significant, the $\beta$ (slope) might be zero, and thus the predictor may not influence the dependent variable.
• Proportion of Variance Explained (R² and Adjusted R²) – Indicates how much of the variation in the dependent variable is explained by the model. Adjusted R² is adjusted downward as the number of predictors increases.
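In R, all three elements appear in a single summary() call. A sketch with the course data, annotated in comments:

```r
gre <- c(596, 473, 482, 527, 505, 693, 626, 663, 447, 588)
gpa <- c(3.35, 3.14, 3.22, 3.29, 3.30, 3.70, 3.50, 3.48, 2.90, 3.59)

out <- summary(lm(gpa ~ gre))
out
# 1. Significance of the model: the F-statistic and its p-value (last line)
# 2. Significance of the predictors: t values and Pr(>|t|) in the coefficients table
# 3. Variance explained: Multiple R-squared and Adjusted R-squared

out$r.squared     # R^2
out$fstatistic    # F statistic with its degrees of freedom
```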
Regression Output (Excel)

[Figure: Excel regression output annotated to highlight Variance Explained, Significance of the Model, and Significance of the Predictors]
Regression Output (SAS Enterprise Miner)

[Figure: SAS Enterprise Miner regression output annotated to highlight Significance of the Model, Variance Explained, and Significance of the Predictors]
Regression Output (R)

[Figure: R regression output annotated to highlight Significance of the Predictors, Variance Explained, and Significance of the Model]