
Software Installation Reminder

• Install Spotfire before leaving campus; it requires a site license to activate.

• The XLMiner license, once applied, works while you are off campus; however, if it doesn't, you won't have the option to use the LRC machines.

• Use the quick videos as a reference while you are doing the exercise.
Linear Regression – for Prediction
Explanatory                                          | Predictive
1. Explain/describe population relationships         | 1. Predict values of new records
2. Small sample, few variables                       | 2. Large sample, many variables
3. Retrospective                                     | 3. Prospective
4. Find good-fitting regression model                | 4. Regression with high predictive power
5. Confidence intervals, hypothesis tests, p-values  | 5. Predictive power on holdout data
Example: Housing prices in MidCity

HousingPrices.xlsx (128 recent sales of single-family houses)
– Price: Final sale price
– SqFt: Floor area in ft²
– Bedrooms: # bedrooms
– Bathrooms: # bathrooms
– Offers: # offers made on the house prior to sale
– Brick: Brick construction? (yes/no)
– Neighborhood: East/West/North
Explanatory Objective:
• Estimate and interpret the pricing
structure of houses in MidCity

Predictive Objective:
• Predict the sale price of a house that
is on the market
Data Sample

Price   SqFt  Bedrooms  Bathrooms  Offers  Brick  Neighborhood
114300  1790     2         2         2      No    East
114200  2030     4         2         3      No    East
114800  1740     3         2         1      No    East
 94700  1980     3         2         3      No    East
119800  2130     3         3         3      No    East
114600  1780     3         2         2      No    North
151600  1830     3         3         3      Yes   West
150700  2160     4         2         2      No    West
Naïve Rule of Prediction
• Partition the data
• Predict the price of houses in the
validation set based on whatever is
the mean/median house price in the
training set

• Benchmark model (prediction using the naïve rule): any model you build must be better than this
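The naïve rule can be sketched in Python as a baseline: hold out a validation set, then predict every validation record with the training-set mean. A minimal sketch with a hypothetical toy price list, not the real HousingPrices.xlsx data:

```python
import numpy as np
import pandas as pd

def naive_benchmark(df, target="Price", train_frac=0.6, seed=0):
    """Predict every validation record with the training-set mean price."""
    train = df.sample(frac=train_frac, random_state=seed)
    valid = df.drop(train.index)
    pred = train[target].mean()          # one constant prediction for all records
    rmse = float(np.sqrt(((valid[target] - pred) ** 2).mean()))
    return pred, rmse

# Hypothetical toy prices for illustration only
df = pd.DataFrame({"Price": [114300, 114200, 114800, 94700,
                             119800, 114600, 151600, 150700]})
pred, rmse = naive_benchmark(df)
```

Any fitted regression model should beat this baseline RMSE on the same validation set.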
Fitting a Regression Model for Prediction
• Choose predictors (X)
• View scatter plots
• Partition data into training/validation
• Fit regression model to training data.
Get estimated model
• Evaluate predictive power on validation
• Re-fit final model
• Use model to score (=predict) new data
Choose predictors
• Goal: Predict price of new houses on the
market.

1) Price: Final sale price
2) SqFt: Floor area in ft²
3) Bedrooms: # bedrooms
4) Bathrooms: # bathrooms
5) Offers: # offers made on the house prior to
sale
6) Brick: Brick construction? (yes/no)
7) Neighborhood: East/West/North
Choose Variables
• Based on theory or domain knowledge

• A real-estate agent claims that the collected variables should all affect house price

• Very importantly, use only those variables that are available at the time of prediction.
Partition Data
XLMiner > Data Mining > Partition > Standard Partition

• XLMiner Defaults
– Observations
randomly assigned
to training and
validation sets
– 60% of data go to
training set, 40% to
validation

• See sheet
STDPartition
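Outside XLMiner, the same 60/40 random partition can be sketched with pandas (a rough analogue of Standard Partition, using a hypothetical stand-in frame of 128 rows):

```python
import pandas as pd

# Hypothetical stand-in for the 128 sales records
df = pd.DataFrame({"Price": range(128)})

# Randomly assign ~60% of rows to training; the remainder is validation
train = df.sample(frac=0.6, random_state=1)
valid = df.drop(train.index)
```

Fixing random_state pins the assignment so the partition is reproducible, much like setting a seed in XLMiner.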
XLMiner > Predict > Linear Regression
Estimate Model (pay attention to variable selection; why aren't all variables chosen?)
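One reason not all columns appear as-is: the categorical predictors (Brick, Neighborhood) must be recoded as 0/1 dummy variables, with one level of each dropped to avoid perfect collinearity with the intercept. A minimal pandas/NumPy sketch using the rows from the Data Sample slide (not XLMiner's actual output):

```python
import numpy as np
import pandas as pd

# Rows from the Data Sample slide
df = pd.DataFrame({
    "Price":     [114300, 114200, 114800, 94700, 119800, 114600, 151600, 150700],
    "SqFt":      [1790, 2030, 1740, 1980, 2130, 1780, 1830, 2160],
    "Bedrooms":  [2, 4, 3, 3, 3, 3, 3, 4],
    "Bathrooms": [2, 2, 2, 2, 3, 2, 3, 2],
    "Offers":    [2, 3, 1, 3, 3, 2, 3, 2],
    "Brick":     ["No", "No", "No", "No", "No", "No", "Yes", "No"],
    "Neighborhood": ["East", "East", "East", "East", "East", "North", "West", "West"],
})

# Recode categoricals as dummies; drop_first drops one level per factor
# (Brick_No, Neighborhood_East) to avoid the dummy-variable trap
X = pd.get_dummies(df.drop(columns="Price"), drop_first=True).astype(float)
X.insert(0, "Intercept", 1.0)
y = df["Price"].to_numpy(float)

# Ordinary least squares fit of Price on the recoded predictors
coef, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
```

Brick (2 levels) becomes one dummy and Neighborhood (3 levels) becomes two, so the design matrix has fewer dummy columns than category levels.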
Evaluating prediction accuracy
• Is R² useful for evaluating prediction quality?

• General evaluation method:
1. Partition data into training and validation sets
2. Fit model to training set
3. Evaluate predictions on validation set
Measures of prediction accuracy
1. Compare each prediction to its actual Y,
using one of the popular distances
2. Then, average across all records

e_i = y_i − ŷ_i
• Mean Absolute Deviation/Error (MAD or MAE)
• Sum Square Error (SSE)
• Mean Square Error (MSE)
• Root Mean Squared Error (RMSE)
• Mean Absolute % Error (MAPE)
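These measures can all be computed directly from the validation-set errors e_i = y_i − ŷ_i; a minimal NumPy sketch (function name and example values are illustrative):

```python
import numpy as np

def accuracy_measures(y, yhat):
    """Compute the slide's accuracy measures from errors e_i = y_i - yhat_i."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    e = y - yhat
    return {
        "MAE":  float(np.mean(np.abs(e))),
        "SSE":  float(np.sum(e ** 2)),
        "MSE":  float(np.mean(e ** 2)),
        "RMSE": float(np.sqrt(np.mean(e ** 2))),
        "MAPE": float(np.mean(np.abs(e / y)) * 100),  # undefined if any y == 0
    }

m = accuracy_measures([100, 200, 300], [110, 190, 330])
```

Note the division by y in MAPE, which is why that measure breaks down for small or zero actual values.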
If performance on validation data is substantially worse than on training data, it is an indication of overfitting.
XLMiner > Predict > Linear Regression
(same model as before)
Predictor Selection using
Stepwise
Metrics for comparing predictive models:
RSS = residual sum of squares (smaller = better)
Mallows' Cp: Cp should be close to (≈) the number of predictors, including the intercept if present, and/or Cp is at a minimum
Probability: higher = better (rule out a subset if < 0.05)
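Mallows' Cp can be sketched by fitting every predictor subset and comparing; a minimal NumPy/pandas sketch on synthetic data (not XLMiner's stepwise routine; by construction the full model's Cp equals its coefficient count):

```python
import numpy as np
import pandas as pd
from itertools import combinations

def rss(Xm, y):
    """Residual sum of squares of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), Xm])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ coef
    return float(e @ e)

def mallows_cp(X, y):
    """Cp = RSS_subset / s^2 - (n - 2p), p = #coefficients incl. intercept.

    s^2 is the full model's error variance; good subsets have Cp close to p.
    """
    n, k_full = len(y), X.shape[1]
    s2 = rss(X.to_numpy(float), y) / (n - k_full - 1)
    out = {}
    for k in range(1, k_full + 1):
        for subset in combinations(X.columns, k):
            p = k + 1
            out[subset] = rss(X[list(subset)].to_numpy(float), y) / s2 - (n - 2 * p)
    return out

# Synthetic data where only x1 matters
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"].to_numpy() + rng.normal(size=50)
cp = mallows_cp(X, y)
```

Here subsets containing x1 should score far lower Cp than those without it, illustrating how the criterion flags the informative predictors.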
Parsimony vs. Predictive accuracy

What if you have multiple choices for models based on the Cp criterion?
– Model 1: Lower error but a higher
number of variables
– Model 2: Higher error but a lower number
of variables
Prediction emphasis
• Interpretation is not the goal
• Statistical significance of predictors
not necessarily criterion for retaining
predictors
• Residual analysis not pivotal
• What matters is predictive accuracy
and parsimony
• BUT: any domain knowledge should
be included in choice of predictors!
Linear regression models
• Useful for
– EXPLAINING - Describe how, on average, an output variable is affected by input variables
– PREDICTING - Given a record, predict values of the output variable using info on the input variables

• Same tool, but you will get very different final models for each purpose
Explaining vs. Prediction
• Explanatory/descriptive modeling
– Statistical tests used to assess
generalizability to the population
– Meaning of regression coefficients
– Residual plots used for checking model
validity

• Predictive modeling
– Predictors must be available at the time of
prediction
– Evaluate prediction accuracy by
partitioning data into training/validation
and calculate accuracy measures
– Automated variable selection
Which of the following is/are measures of prediction accuracy?
A. Mean Absolute Error (MAE)
B. Root Mean Squared Error (RMSE)
C. R-Square
D. Adjusted R-Square
E. Mean Absolute Percentage Error
(MAPE)
• Which of the following measures of prediction accuracy do you think could be very sensitive to low values of y, the dependent variable (in fact, undefined when y equals zero)?

A. Mean Absolute Error (MAE)
B. Root Mean Squared Error (RMSE)
C. Mean Absolute Percentage Error (MAPE)
