
Software Installation Reminder

• Install Spotfire before leaving campus; it requires a site license to activate.

• The XLMiner license, once applied, works while you are off campus; however, if it doesn't, you won't have the option to use the LRC machines.

• Use the quick videos as a reference while you are doing the exercise.
Linear Regression – for Prediction
Explanatory                                          | Predictive
1. Explain/describe population relationships         | 1. Predict values of new records
2. Small sample, few variables                       | 2. Large sample, many variables
3. Retrospective                                     | 3. Prospective
4. Find good-fitting regression model                | 4. Regression with high predictive power
5. Confidence intervals, hypothesis tests, p-values  | 5. Predictive power on holdout data
Example: Housing prices in MidCity

HousingPrices.xlsx (128 recent sales of single-family houses)
– Price: Final sale price
– SqFt: Floor area in ft²
– Bedrooms: # bedrooms
– Bathrooms: # bathrooms
– Offers: # offers made on the house prior to sale
– Brick: Brick construction? (yes/no)
– Neighborhood: East/West/North
Explanatory Objective:
• Estimate and interpret the pricing
structure of houses in MidCity

Predictive Objective:
• Predict the sale price of a house that
is on the market
Data Sample

Price   SqFt  Bedrooms  Bathrooms  Offers  Brick  Neighborhood
114300  1790     2         2         2      No    East
114200  2030     4         2         3      No    East
114800  1740     3         2         1      No    East
 94700  1980     3         2         3      No    East
119800  2130     3         3         3      No    East
114600  1780     3         2         2      No    North
151600  1830     3         3         3      Yes   West
150700  2160     4         2         2      No    West
Naïve Rule of Prediction
• Partition the data
• Predict the price of houses in the
validation set based on whatever is
the mean/median house price in the
training set

• Benchmark model (prediction using the naïve rule): any model you build must be better than this
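The naïve rule can be sketched in Python as a baseline: hold out a validation set, then predict every validation record with the training-set mean. A minimal sketch with a hypothetical toy price list, not the real HousingPrices.xlsx data:

```python
import numpy as np
import pandas as pd

def naive_benchmark(df, target="Price", train_frac=0.6, seed=0):
    """Predict every validation record with the training-set mean price."""
    train = df.sample(frac=train_frac, random_state=seed)
    valid = df.drop(train.index)
    pred = train[target].mean()          # one constant prediction for all records
    rmse = float(np.sqrt(((valid[target] - pred) ** 2).mean()))
    return pred, rmse

# Hypothetical toy prices for illustration only
df = pd.DataFrame({"Price": [114300, 114200, 114800, 94700,
                             119800, 114600, 151600, 150700]})
pred, rmse = naive_benchmark(df)
```

Any fitted regression model should beat this baseline RMSE on the same validation set.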
Fitting a Regression Model for Prediction
• Choose predictors (X)
• View scatter plots
• Partition data into training/validation
• Fit regression model to training data.
Get estimated model
• Evaluate predictive power on validation
• Re-fit final model
• Use model to score (=predict) new data
Choose predictors
• Goal: Predict price of new houses on the
market.

1) Price: Final sale price
2) SqFt: Floor area in ft²
3) Bedrooms: # bedrooms
4) Bathrooms: # bathrooms
5) Offers: # offers made on the house prior to
sale
6) Brick: Brick construction? (yes/no)
7) Neighborhood: East/West/North
Choose Variables
• Based on theory or domain knowledge

• A real-estate agent claims that the collected variables should all affect house price

• Very importantly, use only those variables that are available at the time of prediction.
Partition Data
XLMiner > Data Mining > Partition > Standard Partition

• XLMiner Defaults
– Observations
randomly assigned
to training and
validation sets
– 60% of data go to
training set, 40% to
validation

• See sheet
STDPartition
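Outside XLMiner, the same 60/40 random partition can be sketched with pandas (a rough analogue of Standard Partition, using a hypothetical stand-in frame of 128 rows):

```python
import pandas as pd

# Hypothetical stand-in for the 128 sales records
df = pd.DataFrame({"Price": range(128)})

# Randomly assign ~60% of rows to training; the remainder is validation
train = df.sample(frac=0.6, random_state=1)
valid = df.drop(train.index)
```

Fixing random_state pins the assignment so the partition is reproducible, much like setting a seed in XLMiner.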
XLMiner > Predict > Linear Regression
Estimate Model (pay attention to variable selection; why aren't all variables chosen?)
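One reason not all columns appear as-is: the categorical predictors (Brick, Neighborhood) must be recoded as 0/1 dummy variables, with one level of each dropped to avoid perfect collinearity with the intercept. A minimal pandas/NumPy sketch using the rows from the Data Sample slide (not XLMiner's actual output):

```python
import numpy as np
import pandas as pd

# Rows from the Data Sample slide
df = pd.DataFrame({
    "Price":     [114300, 114200, 114800, 94700, 119800, 114600, 151600, 150700],
    "SqFt":      [1790, 2030, 1740, 1980, 2130, 1780, 1830, 2160],
    "Bedrooms":  [2, 4, 3, 3, 3, 3, 3, 4],
    "Bathrooms": [2, 2, 2, 2, 3, 2, 3, 2],
    "Offers":    [2, 3, 1, 3, 3, 2, 3, 2],
    "Brick":     ["No", "No", "No", "No", "No", "No", "Yes", "No"],
    "Neighborhood": ["East", "East", "East", "East", "East", "North", "West", "West"],
})

# Recode categoricals as dummies; drop_first drops one level per factor
# (Brick_No, Neighborhood_East) to avoid the dummy-variable trap
X = pd.get_dummies(df.drop(columns="Price"), drop_first=True).astype(float)
X.insert(0, "Intercept", 1.0)
y = df["Price"].to_numpy(float)

# Ordinary least squares fit of Price on the recoded predictors
coef, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
```

Brick (2 levels) becomes one dummy and Neighborhood (3 levels) becomes two, so the design matrix has fewer dummy columns than category levels.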
Evaluating prediction accuracy
• Is R² useful for evaluating prediction quality?

• General evaluation method:
1. Partition data into training and validation sets
2. Fit model to training set
3. Evaluate predictions on validation set
Measures of prediction accuracy
1. Compare each prediction to its actual Y,
using one of the popular distances
2. Then, average across all records

e_i = y_i − ŷ_i
• Mean Absolute Deviation/Error (MAD or MAE)
• Sum Square Error (SSE)
• Mean Square Error (MSE)
• Root Mean Squared Error (RMSE)
• Mean Absolute % Error (MAPE)
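These measures can all be computed directly from the validation-set errors e_i = y_i − ŷ_i; a minimal NumPy sketch (function name and example values are illustrative):

```python
import numpy as np

def accuracy_measures(y, yhat):
    """Compute the slide's accuracy measures from errors e_i = y_i - yhat_i."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    e = y - yhat
    return {
        "MAE":  float(np.mean(np.abs(e))),
        "SSE":  float(np.sum(e ** 2)),
        "MSE":  float(np.mean(e ** 2)),
        "RMSE": float(np.sqrt(np.mean(e ** 2))),
        "MAPE": float(np.mean(np.abs(e / y)) * 100),  # undefined if any y == 0
    }

m = accuracy_measures([100, 200, 300], [110, 190, 330])
```

Note the division by y in MAPE, which is why that measure breaks down for small or zero actual values.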
If performance on validation data is substantially worse than on training data, it is an indication of overfitting.
XLMiner > Predict > Linear Regression
(same model as before)
Predictor Selection using
Stepwise
Metrics for comparing predictive models:
RSS = residual sum of squares (smaller = better)
Mallows' Cp: Cp should be close to (≈) the number of predictors, including the intercept if present, and/or Cp is at a minimum
Probability: higher = better (rule out a subset if < 0.05)
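Mallows' Cp can be sketched by fitting every predictor subset and comparing; a minimal NumPy/pandas sketch on synthetic data (not XLMiner's stepwise routine; by construction the full model's Cp equals its coefficient count):

```python
import numpy as np
import pandas as pd
from itertools import combinations

def rss(Xm, y):
    """Residual sum of squares of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), Xm])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ coef
    return float(e @ e)

def mallows_cp(X, y):
    """Cp = RSS_subset / s^2 - (n - 2p), p = #coefficients incl. intercept.

    s^2 is the full model's error variance; good subsets have Cp close to p.
    """
    n, k_full = len(y), X.shape[1]
    s2 = rss(X.to_numpy(float), y) / (n - k_full - 1)
    out = {}
    for k in range(1, k_full + 1):
        for subset in combinations(X.columns, k):
            p = k + 1
            out[subset] = rss(X[list(subset)].to_numpy(float), y) / s2 - (n - 2 * p)
    return out

# Synthetic data where only x1 matters
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"].to_numpy() + rng.normal(size=50)
cp = mallows_cp(X, y)
```

Here subsets containing x1 should score far lower Cp than those without it, illustrating how the criterion flags the informative predictors.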
Parsimony vs. Predictive accuracy

What if you have multiple choices for models based on the Cp criterion?
– Model 1: Lower error but a higher
number of variables
– Model 2: Higher error but a lower number
of variables
Prediction emphasis
• Interpretation is not the goal
• Statistical significance of predictors
not necessarily criterion for retaining
predictors
• Residual analysis not pivotal
• What matters is predictive accuracy
and parsimony
• BUT: any domain knowledge should
be included in choice of predictors!
Linear regression models
• Useful for
– EXPLAINING - Describe how, on average, an output variable is affected by input variables
– PREDICTING - Given a record, predict values of the output variable using info on the input variables

• Same tool, but you will get very different final models for each purpose
Explaining vs. Prediction
• Explanatory/descriptive modeling
– Statistical tests used to assess
generalizability to the population
– Meaning of regression coefficients
– Residual plots used for checking model
validity

• Predictive modeling
– Predictors must be available at the time of
prediction
– Evaluate prediction accuracy by
partitioning data into training/validation
and calculate accuracy measures
– Automated variable selection
Which of the following is/are measures of prediction accuracy?
A. Mean Absolute Error (MAE)
B. Root Mean Squared Error (RMSE)
C. R-Square
D. Adjusted R-Square
E. Mean Absolute Percentage Error
(MAPE)
• Which of the following measures of prediction accuracy do you think could be very sensitive to low values of y, the dependent variable (in fact, undefined when y equals zero)?

A. Mean Absolute Error (MAE)
B. Root Mean Squared Error (RMSE)
C. Mean Absolute Percentage Error (MAPE)
