
Predictive Analytics:

Modeling the World


Richard D. De Veaux
Professor of Statistics, Williams College
January 28, 2005
OR/MS Seminar

Getting to Know Your Customers

50 years ago this was easy

Customer database could fit in one person's head


Retention of customers depended on the ability to do so

21st Century Databases

Ability to anticipate customers' needs is crucial for retention

Even Sam Walton didn't know all his customers' preferences

Amazon.com: Earth's biggest selection

$390,000 Diamond Necklace


World's biggest book
Yak Cheese from Tibet

No one can do this without help

Well, almost no one!

Direct Marketing Example

Paralyzed Veterans of America

KDD Cup 1998

Mailing list of 3.5 million potential donors

Lapsed donors

Made their last donation to PVA 13 to 24 months prior to June 1997

200,000 (training and test sets)

Who should get the current mailing?

Cost-effective strategy

Why is this Hard?

Amount of Information

Cross tabs / OLAP

481 predictors, 2 responses

How many combinations?


What to focus on?

Data Preparation

This alone can be 60-95% of the effort


Categorical vs. Quantitative

What's Hard? -- Example

T-Code

So, what does it mean?


T-Code   Title
0        _
1        MR.
2        MRS.
3        MISS
4        DR.
5        MADAME
6        SERGEANT
9        RABBI
10       PROFESSOR
11       ADMIRAL
12       GENERAL
13       COLONEL
14       CAPTAIN
15       COMMANDER
16       DEAN
17       JUDGE
18       MAJOR
19       SENATOR
20       GOVERNOR
24       LIEUTENANT
26       MONSIGNOR
27       REVEREND
28       MS.
29       BISHOP
31       AMBASSADOR
33       CANTOR
36       BROTHER
37       SIR
38       COMMODORE
40       FATHER
42       SISTER
43       PRESIDENT
44       MASTER
46       MOTHER
47       CHAPLAIN
48       CORPORAL
50       ELDER
56       MAYOR
62       LORD
63       CARDINAL
64       FRIEND
65       FRIENDS
68       ARCHDEACON
69       CANON
70       BISHOP
73       PASTOR
75       ARCHBISHOP
85       SPECIALIST
87       PRIVATE
89       SEAMAN
90       AIRMAN
91       JUSTICE
92       MR. JUSTICE
100      M.
103      MLLE.
104      CHANCELLOR
106      REPRESENTATIVE
107      SECRETARY
108      LT. GOVERNOR
109      LIC.
111      SA.
114      DA.
116      SR.
117      SRA.
118      SRTA.
120      YOUR MAJESTY
122      HIS HIGHNESS
123      HER HIGHNESS
124      COUNT
125      LADY
126      PRINCE
127      PRINCESS
128      CHIEF
129      BARON
130      SHEIK
131      PRINCE AND PRINCESS
132      YOUR IMPERIAL MAJESTY
135      M. ET MME.
210      PROF.
1001     MESSRS.
1002     MR. & MRS.
2002     MESDAMES
3003     MISSES
4002     DR. & MRS.
4004     DOCTORS
10002    PROFESSOR & MRS.
10010    PROFESSORS
11002    ADMIRAL & MRS.
12002    GENERAL & MRS.
13002    COLONEL & MRS.
14002    CAPTAIN & MRS.
15002    COMMANDER & MRS.
17002    JUDGE & MRS.
18002    MAJOR & MRS.
21002    SERGEANT & MRS.
22002    COLONEL & MRS.
28028    MSS.
31002    AMBASSADOR & MRS.
59002    LIEUTENANT & MRS.
72002    REVEREND & MRS.

Results for PVA Data Set

If the entire list (100,000 donors) is mailed, the net donation is $10,500

Using data mining techniques, this was increased by 41.37%

KDD CUP 98 Results

10

KDD CUP 98 Results 2

11

Data Mining Is
"the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." --- Fayyad

"finding interesting structure (patterns, statistical models, relationships) in databases." --- Fayyad, Chaudhuri and Bradley

"a knowledge discovery process of extracting previously unknown, actionable information from very large databases." --- Zornes

"a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions." --- Edelstein
12

Data Mining Is

13

Case Study I

Ingot Cracking

953 30,000-lb. ingots


20% cracking rate
$30,000 per recast
90 Potential Explanatory
Variables

Water composition
Metal composition
Process variables
Other environmental variables

Can we predict under what conditions ingots will crack?

14

Case Study II

Car Insurance

42,800 mature policies


65 Potential Predictors

Can we find a pattern for the unprofitable policies?

15

Case Study III

Breast Cancer Diagnosis

Mammograms used as
screening instrument

Expensive: must be read by a radiologist

Inaccurate

False positive and negative rates over 25%
Over a decade of screening, nearly a 100% chance of a false positive

Can we do better?

Automatically read by a scanning algorithm
Automatically diagnosed by a model
16

Why not Queries?

Queries Describe

Models promote understanding

Models can be assessed both by the understanding they provide and by their predictions

Queries are Event Driven

It's difficult to predict, especially the future

Models are phenomenon driven

Queries are reactive

Models are proactive

17

What Happened on the Titanic?


[Chart: Titanic passengers and crew by Class (Crew, First, Second, Third)]

18


Mosaic Plot
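A minimal sketch in R: the Titanic contingency table ships with R, so a mosaic plot of survival by class like this one can be reproduced directly.

    # Mosaic plot of Survived (dim 4) by Class (dim 1) from the built-in Titanic table
    mosaicplot(margin.table(Titanic, c(1, 4)),
               main = "Titanic: Survival by Class", color = TRUE)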

19

Models

Powerful predictors for optimizing performance

Powerful summaries for understanding

Used to explore the data set

Are not perfect

"All models are wrong, but some are useful." (George Box)


"Statisticians, like artists, have the bad habit of falling in love with their models." (George Box)

20

Tree Diagram
[Classification tree for Titanic survival, splitting on sex, age (adult vs. child), and class (1st, 2nd, 3rd, crew); leaf survival rates range from 14% to 100% (14%, 23%, 27%, 33%, 46%, 93%, 100%).]
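A minimal sketch in R of a tree like the one above, assuming the rpart package; the built-in Titanic table is expanded to one row per person before fitting.

    library(rpart)
    titanic_df <- as.data.frame(Titanic)      # columns: Class, Sex, Age, Survived, Freq
    titanic_df <- titanic_df[rep(seq_len(nrow(titanic_df)), titanic_df$Freq), 1:4]
    fit <- rpart(Survived ~ Class + Sex + Age, data = titanic_df, method = "class")
    plot(fit); text(fit, use.n = TRUE)         # draw the tree with counts at each leaf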

21

Why Models? What's interesting?

Most associated variables in the census
What's associated with shampoo purchases?

Beer and Diapers

In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated
Conclusions?
Actions?
22

Beer and Diapers

Picture from a Tandem™ ad


23

Toy Problem

[Panel of scatterplots: the response train2$y (roughly 5 to 25) plotted against each predictor train2[, i] (each scaled 0 to 1).]
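A minimal sketch in R reproducing such a panel, assuming the toy data frame train2 holds the response y plus predictors x1 through x10 in its first ten columns.

    par(mfrow = c(2, 5))                 # 2 x 5 grid of panels
    for (i in 1:10) {
      plot(train2[, i], train2$y, xlab = paste0("x", i), ylab = "y")
    }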

24

Familiar Models

Linear Regression

25

Logistic Regression
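A minimal sketch in R of a logistic regression, assuming a hypothetical data frame ingots with a 0/1 column cracked (names are illustrative, not from the talk).

    # Model the probability of a binary outcome with a logit link
    fit_logit <- glm(cracked ~ ., data = ingots, family = binomial)
    summary(fit_logit)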

26

Linear Regression
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept      0.806      0.427      1.890     0.059
x1             7.269      0.273     26.590    <.0001
x2             7.289      0.281     25.940    <.0001
x3            -0.719      0.287     -2.500     0.012
x4             9.769      0.273     35.810    <.0001
x5             4.834      0.275     17.590    <.0001
x6            -0.456      0.280     -1.630     0.104
x7             0.123      0.270      0.460     0.647
x8            -0.349      0.276     -1.270     0.206
x9            -0.578      0.285     -2.030     0.043
x10            0.080      0.280      0.280     0.777

R-squared: 76.1% Train

73.3% Test
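A minimal sketch in R of the fit above, assuming the toy data frame train2 with response y and predictors x1 through x10.

    fit_lm <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, data = train2)
    summary(fit_lm)      # coefficient table like the one shown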
27

Stepwise Regression
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept      0.561      0.328      1.710     0.087
x1             7.252      0.273     26.550    <.0001
x2             7.311      0.280     26.110    <.0001
x3            -0.767      0.286     -2.690     0.007
x4             9.747      0.272     35.790    <.0001
x5             4.799      0.274     17.510    <.0001
x9            -0.609      0.284     -2.140     0.032

R-squared 76.0% on Train

73.4% Test
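A minimal sketch in R: stepwise selection by AIC starting from the full linear model fit above (the retained set need not match the slide exactly).

    fit_step <- step(fit_lm, direction = "both", trace = FALSE)
    summary(fit_step)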

28

Stepwise 2ND Order Model


Term                         Estimate   Std Error   t Ratio   Prob>|t|
Intercept                       0.000       0.000        .         .
x1                              7.204       0.169    42.510    <.0001
(x1-0.49573)*(x1-0.49573)     -12.137       0.682   -17.790    <.0001
x2                              7.313       0.173    42.380    <.0001
(x2-0.48895)*(x2-0.48895)     -11.289       0.688   -16.410    <.0001
x3                             -1.010       0.179    -5.660    <.0001
(x3-0.46706)*(x3-0.46706)      20.658       0.703    29.390    <.0001
x4                             10.169       0.172    59.070     0.000
x5                              5.135       0.168    30.610    <.0001
(x5-0.49425)*(x5-0.49425)       1.714       0.694     2.470     0.014
x7                              0.244       0.165     1.480     0.140
x8                              0.079       0.171     0.460     0.646
(x1-0.49573)*(x2-0.48895)       2.370       0.639     3.710     0.000
(x2-0.48895)*(x4-0.49038)      -0.322       0.626    -0.510     0.607
(x3-0.46706)*(x7-0.4962)        1.273       0.626     2.030     0.042
(x4-0.49038)*(x8-0.4975)       -1.015       0.603    -1.680     0.092
(x7-0.4962)*(x8-0.4975)        -1.283       0.601    -2.130     0.033

R-squared 90.0% Train

88.5% Test
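A minimal sketch in R of a second-order search: squared terms and all two-way interactions of the retained predictors are offered to step(). The slide's output centers each predictor (JMP-style); this sketch skips the centering, which reparameterizes the model but describes the same kind of fit.

    fit_quad <- step(
      lm(y ~ (x1 + x2 + x3 + x4 + x5 + x7 + x8)^2
           + I(x1^2) + I(x2^2) + I(x3^2) + I(x5^2),
         data = train2),
      direction = "both", trace = FALSE)
    summary(fit_quad)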
29

Next Steps

Higher order terms?

When to stop?

Transformations?

Too simple: underfitting bias

Too complex: inconsistent predictions, overfitting, high variance

Selecting models is Occam's razor

Keep the goals of interpretation vs. prediction in mind

30

Tree Model
[Regression tree for the toy data, splitting on x1 through x5 and x8; leaf predictions range from about 4.4 to 25.3.]

R-squared: 82.3% Train, 67.2% Test
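A minimal sketch in R of such a regression tree, assuming the rpart package and the toy data frame train2; the tree is pruned back using the cross-validated complexity table.

    library(rpart)
    fit_tree <- rpart(y ~ ., data = train2, method = "anova")
    best_cp  <- fit_tree$cptable[which.min(fit_tree$cptable[, "xerror"]), "CP"]
    fit_tree <- prune(fit_tree, cp = best_cp)   # prune to the best cross-validated size
    plot(fit_tree); text(fit_tree)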

31

Feature Creation

New predictor based on the original predictors

Often linear: zi = b1 x1 + ... + bp xp (see the sketch after this list)

Principal components
Factor analysis
Multidimensional scaling
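A minimal sketch in R of linear feature creation via principal components, assuming predictors named x1 through x10 in train2.

    pcs <- prcomp(train2[, paste0("x", 1:10)], scale. = TRUE)
    train2$z1 <- pcs$x[, 1]   # first new feature: a linear combination of the x's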

32

Neural Nets

Don't resemble the brain

Are just a statistical model

33

A Single Neuron
[Diagram: inputs x1 through x5, with weights 0.3, 0.7, -0.2, 0.4, -0.5 and a constant input x0 with weight 0.8, combine into the input z1, which is passed through the activation s(z1) to give the output.]

z1 = 0.8 + 0.3 x1 + 0.7 x2 - 0.2 x3 + 0.4 x4 - 0.5 x5
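A minimal sketch in R of this single neuron; the slide does not specify s(), so a logistic activation is assumed here.

    s <- function(z) 1 / (1 + exp(-z))          # assumed logistic activation
    neuron <- function(x) {
      z1 <- 0.8 + 0.3*x[1] + 0.7*x[2] - 0.2*x[3] + 0.4*x[4] - 0.5*x[5]
      s(z1)
    }
    neuron(c(1, 0, 1, 0, 1))                    # output for one example input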

34

More exotic Neural networks


[Diagram: a feed-forward network with an input layer (x1, x2), a hidden layer (z1, z2, z3), and an output layer.]
35

Running a Neural Net
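A minimal sketch in R, assuming the nnet package: a single-hidden-layer network for the toy regression problem (the talk does not say which software produced the results on the next slide).

    library(nnet)
    set.seed(1)
    fit_nn <- nnet(y ~ ., data = train2, size = 6, linout = TRUE,
                   decay = 0.01, maxit = 500, trace = FALSE)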

36

Predictions for Example

R squared 92.7% Train

90.6% Test
37

What Does This Get Us?


Enormous flexibility

Ability to fit anything

Including noise

Interpretation?

38

Case Study: Warranty Data

A new backpack inkjet printer is showing higher than expected warranty claims

What are the important variables?
What's going on?

A neural network shows that ZIP code is the most important predictor

39

Spatial Analysis

Warranty data showing the problem with the inkjet printer

Use the model as a black box for variable selection

40

MARS
Multivariate Adaptive Regression Splines

What do they do?

Replace each step function in a tree model by a pair of linear functions.

[Plots: a single step function in x, as fit by a tree, and the corresponding pair of hinge (piecewise-linear) functions used by MARS.]
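A minimal sketch in R, assuming the earth package (an open-source MARS implementation, not necessarily the software used in the talk).

    library(earth)
    fit_mars <- earth(y ~ ., data = train2, degree = 2)  # hinge functions, pairwise interactions
    summary(fit_mars)
    evimp(fit_mars)                                       # variable importance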

41

MARS Variable Importance

R-squared 95.0% Train


(96.3%)

94.3% Test
(95.8%)
42

MARS Function Output

43

Collaborative Filtering

Goal: predict what movies people will like

Data: list of movies each person has watched

Lyle:   Andre, Starwars
Ellen:  Andre, Starwars, Coeur en Hiver
Fred:   Starwars, Batman
Dean:   Starwars, Batman, Rambo
Jason:  Coeur en Hiver, Chocolat

44

Data Base

Data can be represented as a sparse matrix

          Andre   Starwars   Batman   Rambo   Coeur   Chocolat
Lyle        y         y
Ellen       y         y                          y
Fred                  y          y
Dean                  y          y        y
Jason                                            y         y
Karen       y

Karen likes Andre. What else might she like?
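A minimal sketch in R of one simple way to answer this: encode the table above as a 0/1 matrix and rank the other movies by their cosine similarity to the Andre column (the talk does not say which collaborative-filtering algorithm was actually used).

    M <- matrix(c(1,1,0,0,0,0,
                  1,1,0,0,1,0,
                  0,1,1,0,0,0,
                  0,1,1,1,0,0,
                  0,0,0,0,1,1,
                  1,0,0,0,0,0), nrow = 6, byrow = TRUE,
                dimnames = list(c("Lyle","Ellen","Fred","Dean","Jason","Karen"),
                                c("Andre","Starwars","Batman","Rambo","Coeur","Chocolat")))
    cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
    sort(apply(M, 2, cosine, b = M[, "Andre"]), decreasing = TRUE)   # movies most like Andre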

CDNow doubled e-mail responses

45

How Do We Really Start?


Life is not so kind
Categorical variables
Missing data
500 variables, not 10
481 variables: where to start?

46

Where to Start?
EDM
Use a tree to find a smaller subset of
variables to investigate
Explore this set graphically

Start the modeling process over

Build model

Compare the model on the small subset with the full predictive model

47

Start With a Simple Model


Maybe a Tree:
[Small regression tree on the toy data, splitting on x1, x2, x4, and x5; leaf predictions range from about -2.6 to 12.2.]

48

Automatic Models
KXEN

49

PVA Results from KXEN

50

Combining Models -- Bagging

Bagging (Bootstrap Aggregation)

Bootstrap a data set repeatedly

Take many versions of the same model (e.g., a tree)
Form a committee of models
Take the majority rule of the predictions (see the sketch below)
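A minimal sketch in R of bagging by hand, assuming the rpart package and the toy data frame train2; for regression the committee's predictions are averaged, while classification would use a majority vote.

    library(rpart)
    bag_predict <- function(data, newdata, B = 100) {
      preds <- replicate(B, {
        boot <- data[sample(nrow(data), replace = TRUE), ]   # bootstrap resample
        predict(rpart(y ~ ., data = boot), newdata)          # one committee member
      })
      rowMeans(preds)                                        # combine the committee
    }
    yhat <- bag_predict(train2, train2)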

51

Combining Models -- Boosting

Take the data and apply a simple classifier

Reweight the data, weighting the misclassified data much higher

Reapply the classifier

Repeat over and over

The final prediction is a combination of the output of each classifier, weighted by the overall misclassification rate.

Details in Freund, Y., "Boosting a weak learning algorithm by majority," Information and Computation 121(2), 256-285.
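A minimal sketch in R, assuming the adabag package and a hypothetical data frame ingots whose cracked column is a factor (names are illustrative); boosting() reweights and refits classification trees in exactly this fashion.

    library(adabag)
    fit_boost <- boosting(cracked ~ ., data = ingots, mfinal = 100)   # 100 boosting rounds
    pred <- predict(fit_boost, newdata = ingots)
    pred$confusion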

52

Breast Cancer Diagnosis

53

Results from Random Forest


Results from 1000 splits of Training and Test data

Method            False Positive Rate   False Negative Rate
Tree                    32.20%                33.70%
Boosted Trees           24.90%                32.50%
Random Forest           19.30%                28.80%
Neural Network          25.50%                31.70%
Radiologists            22.40%                35.80%
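A minimal sketch in R, assuming the randomForest package and a hypothetical data frame mammo with a factor column diagnosis (the study data themselves are not shown here).

    library(randomForest)
    fit_rf <- randomForest(diagnosis ~ ., data = mammo, ntree = 1000)
    fit_rf$confusion      # out-of-bag class error rates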

54

Case Study: Ingot Failures

Ingot cracking

953 30,000-lb. ingots


20% cracking rate
$30,000 per recast
90 potential explanatory variables

Water composition (reduced)


Metal composition
Process variables
Other environmental variables

55

Model building process

Model building

Train
Test

Evaluate (see the sketch below)
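A minimal sketch in R of this loop for the toy data: hold out a test set, fit on the training set, and evaluate the test R-squared.

    set.seed(1)
    idx   <- sample(nrow(train2), floor(0.7 * nrow(train2)))
    train <- train2[idx, ]
    test  <- train2[-idx, ]
    fit   <- lm(y ~ ., data = train)
    1 - sum((test$y - predict(fit, test))^2) / sum((test$y - mean(test$y))^2)  # test R-squared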

56

Most Important Variable

Take one (here we started with trees): Alloy

We know that!

OK, take two: Yttrium

What do you think is in the alloy?

Third time's the charm? Selenium!

OH!
57

Case Study: Car Insurance

Now that we have 40,000 mature policies, can we find other factors to price policies better?

65 potential predictors

Industry, vehicle age, color, number of vehicles, usage, location, etc.

58

Fast Fail

Not every modeling effort is a success

A model search can save lots of queries

Data took 8 months to get ready

Analyst spent 2 months exploring it

A new model search program (KXEN) running for several hours found no out-of-sample predictive ability

Tree model gave similar results

59

PVA Recap

Remember --- 481 predictor variables

Need a way to trim this down

Need an exploratory model

Neural network?
Tree?

60

Students in Data Mining Class

Student #1 $15,024
Student #2 $14,695
Student #3 $14,345

61

Take Home Messages

What a great time to be a Statistician!

Problems are exciting

Research is exciting

Success in data mining

Requires teamwork
Requires flexibility in modeling
Means that you act on your results
Depends much more on the way you mine the data than on the specific model or tool that you use

Which method to use?

Yes!! Have fun!


62

Thank you!

deveaux@williams.edu

63
