
Predictive Analytics:

Modeling the World


Richard D. De Veaux
Professor of Statistics, Williams College
January 28, 2005
OR/MS Seminar

Getting to Know Your Customers

50 years ago this was easy

Customer database could fit in one person's head


Retention of customers depended on the ability to do so

21st Century Databases

Ability to anticipate customers' needs is crucial for retention

Even Sam Walton didn't know all his customers' preferences

Amazon.com: Earth's biggest selection

$390,000 Diamond Necklace


World's biggest book
Yak Cheese from Tibet

No one can do this without help

Well, almost no one!

Direct Marketing Example

Paralyzed Veterans of America

KDD Cup 1998

Mailing list of 3.5 million potential donors

Lapsed donors

Made their last donation to PVA 13 to 24 months prior to June 1997

200,000 (training and test sets)

Who should get the current mailing?

Cost-effective strategy

Why is this Hard?

Amount of Information

Cross tabs / OLAP

481 predictors, 2 responses

How many combinations?


What to focus on?

Data Preparation

This alone can be 60-95% of the effort


Categorical vs. Quantitative

What's Hard? -- Example

T-Code

So, what does it mean?


T-Code   Title
0        _
1        MR.
2        MRS.
3        MISS
4        DR.
5        MADAME
6        SERGEANT
9        RABBI
10       PROFESSOR
11       ADMIRAL
12       GENERAL
13       COLONEL
14       CAPTAIN
15       COMMANDER
16       DEAN
17       JUDGE
18       MAJOR
19       SENATOR
20       GOVERNOR
24       LIEUTENANT
26       MONSIGNOR
27       REVEREND
28       MS.
29       BISHOP
31       AMBASSADOR
33       CANTOR
36       BROTHER
37       SIR
38       COMMODORE
40       FATHER
42       SISTER
43       PRESIDENT
44       MASTER
46       MOTHER
47       CHAPLAIN
48       CORPORAL
50       ELDER
56       MAYOR
62       LORD
63       CARDINAL
64       FRIEND
65       FRIENDS
68       ARCHDEACON
69       CANON
70       BISHOP
73       PASTOR
75       ARCHBISHOP
85       SPECIALIST
87       PRIVATE
89       SEAMAN
90       AIRMAN
91       JUSTICE
92       MR. JUSTICE
100      M.
103      MLLE.
104      CHANCELLOR
106      REPRESENTATIVE
107      SECRETARY
108      LT. GOVERNOR
109      LIC.
111      SA.
114      DA.
116      SR.
117      SRA.
118      SRTA.
120      YOUR MAJESTY
122      HIS HIGHNESS
123      HER HIGHNESS
124      COUNT
125      LADY
126      PRINCE
127      PRINCESS
128      CHIEF
129      BARON
130      SHEIK
131      PRINCE AND PRINCESS
132      YOUR IMPERIAL MAJESTY
135      M. ET MME.
210      PROF.
1001     MESSRS.
1002     MR. & MRS.
2002     MESDAMES
3003     MISSES
4002     DR. & MRS.
4004     DOCTORS
10002    PROFESSOR & MRS.
10010    PROFESSORS
11002    ADMIRAL & MRS.
12002    GENERAL & MRS.
13002    COLONEL & MRS.
14002    CAPTAIN & MRS.
15002    COMMANDER & MRS.
17002    JUDGE & MRS.
18002    MAJOR & MRS.
21002    SERGEANT & MRS.
22002    COLONEL & MRS.
28028    MSS.
31002    AMBASSADOR & MRS.
59002    LIEUTENANT & MRS.
72002    REVEREND & MRS.

Results for PVA Data Set

If the entire list (100,000 donors) is mailed, the net donation is $10,500

Using data mining techniques, this was increased by 41.37%

KDD CUP 98 Results

10

KDD CUP 98 Results 2

11

Data Mining Is
"the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." --- Fayyad

"finding interesting structure (patterns, statistical models, relationships) in databases." --- Fayyad, Chaudhuri and Bradley

"a knowledge discovery process of extracting previously unknown, actionable information from very large databases." --- Zornes

"a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions." --- Edelstein
12

Data Mining Is

13

Case Study I

Ingot Cracking

953 30,000-lb. ingots


20% cracking rate
$30,000 per recast
90 Potential Explanatory
Variables

Water composition
Metal composition
Process variables
Other environmental variables

Can we predict under what conditions ingots will crack?

14

Case Study II

Car Insurance

42,800 mature policies


65 Potential Predictors

Can we find a pattern for the unprofitable policies?

15

Case Study III

Breast Cancer Diagnosis

Mammograms used as
screening instrument

Expensive: must be read by a radiologist

Inaccurate

False positive and negative rates over 25%
Over a decade of screening, nearly a 100% chance of a false positive

Can we do better?

Automatically read by a scanning algorithm
Automatically diagnosed by a model
16

Why not Queries?

Queries Describe

Models promote understanding

Models can be assessed both by the understanding they provide and by their predictions

Queries are Event Driven

It's difficult to predict, especially the future

Models are phenomenon driven

Queries are reactive

Models are proactive

17

What Happened on the Titanic?


[Chart: Titanic passengers and crew by Class (Crew, First, Second, Third)]

18


Mosaic Plot
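A minimal sketch in R: the Titanic contingency table ships with R, so a mosaic plot of survival by class like this one can be reproduced directly.

    # Mosaic plot of Survived (dim 4) by Class (dim 1) from the built-in Titanic table
    mosaicplot(margin.table(Titanic, c(1, 4)),
               main = "Titanic: Survival by Class", color = TRUE)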

19

Models

Powerful predictors for optimizing performance

Powerful summaries for understanding

Used to explore the data set

Are not perfect

"All models are wrong, but some are useful." (George Box)


"Statisticians, like artists, have the bad habit of falling in love with their models." (George Box)

20

Tree Diagram
[Classification tree for Titanic survival, splitting on sex, age (adult vs. child), and class (1st, 2nd, 3rd, crew); leaf survival rates range from 14% to 100% (14%, 23%, 27%, 33%, 46%, 93%, 100%).]
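A minimal sketch in R of a tree like the one above, assuming the rpart package; the built-in Titanic table is expanded to one row per person before fitting.

    library(rpart)
    titanic_df <- as.data.frame(Titanic)      # columns: Class, Sex, Age, Survived, Freq
    titanic_df <- titanic_df[rep(seq_len(nrow(titanic_df)), titanic_df$Freq), 1:4]
    fit <- rpart(Survived ~ Class + Sex + Age, data = titanic_df, method = "class")
    plot(fit); text(fit, use.n = TRUE)         # draw the tree with counts at each leaf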

21

Why Models? What's interesting?

Most associated variables in the census
What's associated with shampoo purchases?

Beer and Diapers

In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated
Conclusions?
Actions?
22

Beer and Diapers

Picture from a Tandem™ ad


23

Toy Problem

[Panel of scatterplots: the response train2$y (roughly 5 to 25) plotted against each predictor train2[, i] (each scaled 0 to 1).]
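A minimal sketch in R reproducing such a panel, assuming the toy data frame train2 holds the response y plus predictors x1 through x10 in its first ten columns.

    par(mfrow = c(2, 5))                 # 2 x 5 grid of panels
    for (i in 1:10) {
      plot(train2[, i], train2$y, xlab = paste0("x", i), ylab = "y")
    }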

24

Familiar Models

Linear Regression

25

Logistic Regression
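A minimal sketch in R of a logistic regression, assuming a hypothetical data frame ingots with a 0/1 column cracked (names are illustrative, not from the talk).

    # Model the probability of a binary outcome with a logit link
    fit_logit <- glm(cracked ~ ., data = ingots, family = binomial)
    summary(fit_logit)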

26

Linear Regression
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept      0.806      0.427      1.890     0.059
x1             7.269      0.273     26.590    <.0001
x2             7.289      0.281     25.940    <.0001
x3            -0.719      0.287     -2.500     0.012
x4             9.769      0.273     35.810    <.0001
x5             4.834      0.275     17.590    <.0001
x6            -0.456      0.280     -1.630     0.104
x7             0.123      0.270      0.460     0.647
x8            -0.349      0.276     -1.270     0.206
x9            -0.578      0.285     -2.030     0.043
x10            0.080      0.280      0.280     0.777

R-squared: 76.1% Train

73.3% Test
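A minimal sketch in R of the fit above, assuming the toy data frame train2 with response y and predictors x1 through x10.

    fit_lm <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, data = train2)
    summary(fit_lm)      # coefficient table like the one shown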
27

Stepwise Regression
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept      0.561      0.328      1.710     0.087
x1             7.252      0.273     26.550    <.0001
x2             7.311      0.280     26.110    <.0001
x3            -0.767      0.286     -2.690     0.007
x4             9.747      0.272     35.790    <.0001
x5             4.799      0.274     17.510    <.0001
x9            -0.609      0.284     -2.140     0.032

R-squared 76.0% on Train

73.4% Test
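A minimal sketch in R: stepwise selection by AIC starting from the full linear model fit above (the retained set need not match the slide exactly).

    fit_step <- step(fit_lm, direction = "both", trace = FALSE)
    summary(fit_step)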

28

Stepwise 2ND Order Model


Term                         Estimate   Std Error   t Ratio   Prob>|t|
Intercept                       0.000       0.000        .         .
x1                              7.204       0.169    42.510    <.0001
(x1-0.49573)*(x1-0.49573)     -12.137       0.682   -17.790    <.0001
x2                              7.313       0.173    42.380    <.0001
(x2-0.48895)*(x2-0.48895)     -11.289       0.688   -16.410    <.0001
x3                             -1.010       0.179    -5.660    <.0001
(x3-0.46706)*(x3-0.46706)      20.658       0.703    29.390    <.0001
x4                             10.169       0.172    59.070     0.000
x5                              5.135       0.168    30.610    <.0001
(x5-0.49425)*(x5-0.49425)       1.714       0.694     2.470     0.014
x7                              0.244       0.165     1.480     0.140
x8                              0.079       0.171     0.460     0.646
(x1-0.49573)*(x2-0.48895)       2.370       0.639     3.710     0.000
(x2-0.48895)*(x4-0.49038)      -0.322       0.626    -0.510     0.607
(x3-0.46706)*(x7-0.4962)        1.273       0.626     2.030     0.042
(x4-0.49038)*(x8-0.4975)       -1.015       0.603    -1.680     0.092
(x7-0.4962)*(x8-0.4975)        -1.283       0.601    -2.130     0.033

R-squared 90.0% Train

88.5% Test
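A minimal sketch in R of a second-order search: squared terms and all two-way interactions of the retained predictors are offered to step(). The slide's output centers each predictor (JMP-style); this sketch skips the centering, which reparameterizes the model but describes the same kind of fit.

    fit_quad <- step(
      lm(y ~ (x1 + x2 + x3 + x4 + x5 + x7 + x8)^2
           + I(x1^2) + I(x2^2) + I(x3^2) + I(x5^2),
         data = train2),
      direction = "both", trace = FALSE)
    summary(fit_quad)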
29

Next Steps

Higher order terms?

When to stop?

Transformations?

Too simple: underfitting bias

Too complex: inconsistent predictions, overfitting, high variance

Selecting models is Occam's razor

Keep the goals of interpretation vs. prediction in mind

30

Tree Model
[Regression tree for the toy data, splitting on x1 through x5 and x8; leaf predictions range from about 4.4 to 25.3.]

R-squared: 82.3% Train, 67.2% Test
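A minimal sketch in R of such a regression tree, assuming the rpart package and the toy data frame train2; the tree is pruned back using the cross-validated complexity table.

    library(rpart)
    fit_tree <- rpart(y ~ ., data = train2, method = "anova")
    best_cp  <- fit_tree$cptable[which.min(fit_tree$cptable[, "xerror"]), "CP"]
    fit_tree <- prune(fit_tree, cp = best_cp)   # prune to the best cross-validated size
    plot(fit_tree); text(fit_tree)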

31

Feature Creation

New predictor based on the original predictors

Often linear: zi = b1 x1 + ... + bp xp (see the sketch after this list)

Principal components
Factor analysis
Multidimensional scaling
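A minimal sketch in R of linear feature creation via principal components, assuming predictors named x1 through x10 in train2.

    pcs <- prcomp(train2[, paste0("x", 1:10)], scale. = TRUE)
    train2$z1 <- pcs$x[, 1]   # first new feature: a linear combination of the x's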

32

Neural Nets

Don't resemble the brain

Are just a statistical model

33

A Single Neuron
[Diagram: inputs x1 through x5, with weights 0.3, 0.7, -0.2, 0.4, -0.5 and a constant input x0 with weight 0.8, combine into the input z1, which is passed through the activation s(z1) to give the output.]

z1 = 0.8 + 0.3 x1 + 0.7 x2 - 0.2 x3 + 0.4 x4 - 0.5 x5
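A minimal sketch in R of this single neuron; the slide does not specify s(), so a logistic activation is assumed here.

    s <- function(z) 1 / (1 + exp(-z))          # assumed logistic activation
    neuron <- function(x) {
      z1 <- 0.8 + 0.3*x[1] + 0.7*x[2] - 0.2*x[3] + 0.4*x[4] - 0.5*x[5]
      s(z1)
    }
    neuron(c(1, 0, 1, 0, 1))                    # output for one example input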

34

More exotic Neural networks


[Diagram: a feed-forward network with an input layer (x1, x2), a hidden layer (z1, z2, z3), and an output layer.]
35

Running a Neural Net
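A minimal sketch in R, assuming the nnet package: a single-hidden-layer network for the toy regression problem (the talk does not say which software produced the results on the next slide).

    library(nnet)
    set.seed(1)
    fit_nn <- nnet(y ~ ., data = train2, size = 6, linout = TRUE,
                   decay = 0.01, maxit = 500, trace = FALSE)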

36

Predictions for Example

R squared 92.7% Train

90.6% Test
37

What Does This Get Us?


Enormous flexibility

Ability to fit anything

Including noise

Interpretation?

38

Case Study: Warranty Data

A new backpack inkjet printer is showing higher than expected warranty claims

What are the important variables?
What's going on?

A neural network shows that ZIP code is the most important predictor

39

Spatial Analysis

Warranty data showing the problem with the inkjet printer

Use the model as a black box for variable selection

40

MARS
Multivariate Adaptive Regression Splines

What do they do?

Replace each step function in a tree model by a pair of linear functions.

[Plots: a single step function in x, as fit by a tree, and the corresponding pair of hinge (piecewise-linear) functions used by MARS.]
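A minimal sketch in R, assuming the earth package (an open-source MARS implementation, not necessarily the software used in the talk).

    library(earth)
    fit_mars <- earth(y ~ ., data = train2, degree = 2)  # hinge functions, pairwise interactions
    summary(fit_mars)
    evimp(fit_mars)                                       # variable importance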

41

MARS Variable Importance

R-squared 95.0% Train


(96.3%)

94.3% Test
(95.8%)
42

MARS Function Output

43

Collaborative Filtering

Goal: predict what movies people will like

Data: list of movies each person has watched

Lyle:   Andre, Starwars
Ellen:  Andre, Starwars, Coeur en Hiver
Fred:   Starwars, Batman
Dean:   Starwars, Batman, Rambo
Jason:  Coeur en Hiver, Chocolat

44

Data Base

Data can be represented as a sparse matrix

          Andre   Starwars   Batman   Rambo   Coeur   Chocolat
Lyle        y         y
Ellen       y         y                          y
Fred                  y          y
Dean                  y          y        y
Jason                                            y         y
Karen       y

Karen likes Andre. What else might she like?
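A minimal sketch in R of one simple way to answer this: encode the table above as a 0/1 matrix and rank the other movies by their cosine similarity to the Andre column (the talk does not say which collaborative-filtering algorithm was actually used).

    M <- matrix(c(1,1,0,0,0,0,
                  1,1,0,0,1,0,
                  0,1,1,0,0,0,
                  0,1,1,1,0,0,
                  0,0,0,0,1,1,
                  1,0,0,0,0,0), nrow = 6, byrow = TRUE,
                dimnames = list(c("Lyle","Ellen","Fred","Dean","Jason","Karen"),
                                c("Andre","Starwars","Batman","Rambo","Coeur","Chocolat")))
    cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
    sort(apply(M, 2, cosine, b = M[, "Andre"]), decreasing = TRUE)   # movies most like Andre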

CDNow doubled e-mail responses

45

How Do We Really Start?


Life is not so kind
Categorical variables
Missing data
500 variables, not 10
481 variables: where to start?

46

Where to Start?
EDM
Use a tree to find a smaller subset of
variables to investigate
Explore this set graphically

Start the modeling process over

Build model

Compare the model on the small subset with the full predictive model

47

Start With a Simple Model


Maybe a Tree:
[Small regression tree on the toy data, splitting on x1, x2, x4, and x5; leaf predictions range from about -2.6 to 12.2.]

48

Automatic Models
KXEN

49

PVA Results from KXEN

50

Combining Models -- Bagging

Bagging (Bootstrap Aggregation)

Bootstrap a data set repeatedly

Take many versions of the same model (e.g., a tree)
Form a committee of models
Take the majority rule of the predictions (see the sketch below)
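A minimal sketch in R of bagging by hand, assuming the rpart package and the toy data frame train2; for regression the committee's predictions are averaged, while classification would use a majority vote.

    library(rpart)
    bag_predict <- function(data, newdata, B = 100) {
      preds <- replicate(B, {
        boot <- data[sample(nrow(data), replace = TRUE), ]   # bootstrap resample
        predict(rpart(y ~ ., data = boot), newdata)          # one committee member
      })
      rowMeans(preds)                                        # combine the committee
    }
    yhat <- bag_predict(train2, train2)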

51

Combining Models -- Boosting

Take the data and apply a simple classifier

Reweight the data, weighting the misclassified data much higher

Reapply the classifier

Repeat over and over

The final prediction is a combination of the output of each classifier, weighted by the overall misclassification rate.

Details in Freund, Y., "Boosting a weak learning algorithm by majority," Information and Computation 121(2), 256-285.
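A minimal sketch in R, assuming the adabag package and a hypothetical data frame ingots whose cracked column is a factor (names are illustrative); boosting() reweights and refits classification trees in exactly this fashion.

    library(adabag)
    fit_boost <- boosting(cracked ~ ., data = ingots, mfinal = 100)   # 100 boosting rounds
    pred <- predict(fit_boost, newdata = ingots)
    pred$confusion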

52

Breast Cancer Diagnosis

53

Results from Random Forest


Results from 1000 splits of Training and Test data

Method            False Positive Rate   False Negative Rate
Tree                    32.20%                33.70%
Boosted Trees           24.90%                32.50%
Random Forest           19.30%                28.80%
Neural Network          25.50%                31.70%
Radiologists            22.40%                35.80%
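A minimal sketch in R, assuming the randomForest package and a hypothetical data frame mammo with a factor column diagnosis (the study data themselves are not shown here).

    library(randomForest)
    fit_rf <- randomForest(diagnosis ~ ., data = mammo, ntree = 1000)
    fit_rf$confusion      # out-of-bag class error rates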

54

Case Study: Ingot Failures

Ingot cracking

953 30,000-lb. ingots


20% cracking rate
$30,000 per recast
90 potential explanatory variables

Water composition (reduced)


Metal composition
Process variables
Other environmental variables

55

Model building process

Model building

Train
Test

Evaluate (see the sketch below)
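A minimal sketch in R of this loop for the toy data: hold out a test set, fit on the training set, and evaluate the test R-squared.

    set.seed(1)
    idx   <- sample(nrow(train2), floor(0.7 * nrow(train2)))
    train <- train2[idx, ]
    test  <- train2[-idx, ]
    fit   <- lm(y ~ ., data = train)
    1 - sum((test$y - predict(fit, test))^2) / sum((test$y - mean(test$y))^2)  # test R-squared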

56

Most Important Variable

Take one (here we started with trees): Alloy

We know that!

OK, take two: Yttrium

What do you think is in the alloy?

Third time's the charm? Selenium!

OH!
57

Case Study: Car Insurance

Now that we have 40,000 mature policies, can we find other factors to price policies better?

65 potential predictors

Industry, vehicle age, color, number of vehicles, usage, location, etc.

58

Fast Fail

Not every modeling effort is a success

A model search can save lots of queries

Data took 8 months to get ready

Analyst spent 2 months exploring it

A new model search program (KXEN) running for several hours found no out-of-sample predictive ability

Tree model gave similar results

59

PVA Recap

Remember --- 481 predictor variables

Need a way to trim this down

Need an exploratory model

Neural network?
Tree?

60

Students in Data Mining Class

Student #1 $15,024
Student #2 $14,695
Student #3 $14,345

61

Take Home Messages

What a great time to be a Statistician!

Problems are exciting

Research is exciting

Success in data mining

Requires teamwork
Requires flexibility in modeling
Means that you act on your results
Depends much more on the way you mine the data than on the specific model or tool that you use

Which method to use?

Yes!! Have fun!


62

Thank you!

deveaux@williams.edu

63
