CART from A to B
Contents
An Insurance Example
Some Basic Theory
Suggested Uses of CART
Case Study: comparing CART with other methods
What is CART?
Classification And Regression Trees
Developed by Breiman, Friedman, Olshen, and Stone in
the early 1980s.
Introduced tree-based modeling into the statistical
mainstream
Rigorous approach involving cross-validation to select
the optimal tree
One of many tree-based modeling techniques.
CART -- the classic
CHAID
C5.0
Software package variants (SAS, S-Plus, R…)
Note: the “rpart” package in “R” is freely available
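For a flavor of the software: a minimal sketch with rpart on a small simulated dataset (the data frame and variable names here are invented for illustration).

```r
library(rpart)

# Simulate a toy policy dataset (purely illustrative).
set.seed(42)
n <- 5000
policies <- data.frame(
  num_veh   = sample(1:10, n, replace = TRUE),
  age_cat   = sample(1:6,  n, replace = TRUE),
  liab_only = rbinom(n, 1, 0.3)
)
# More vehicles -> more claims; liability-only -> fewer claims.
p <- plogis(-2 + 0.25 * policies$num_veh - 1.5 * policies$liab_only)
policies$claim <- factor(rbinom(n, 1, p))

fit <- rpart(claim ~ num_veh + age_cat + liab_only,
             data = policies, method = "class",
             control = rpart.control(cp = 0.001, xval = 10))
printcp(fit)           # cross-validated error for each candidate subtree
plot(fit); text(fit)   # draw the fitted tree
```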
Philosophy
An Insurance Example
Let’s Get Rolling
Suppose you have 3 variables:
# vehicles: {1,2,3…10+}
Age category: {1,2,3…6}
Liability-only: {0,1}
At each iteration, CART tests all 15 splits.
(#veh<2), (#veh<3),…, (#veh<10)
(age<2),…, (age<6)
(lia<1)
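In code, the candidate-split bookkeeping is trivial (variable names invented):

```r
# Sketch: enumerating the 15 candidate splits CART would test
# at the root for these three variables.
splits <- c(
  paste0("num_veh < ", 2:10),  # 9 candidate splits
  paste0("age_cat < ", 2:6),   # 5 candidate splits
  "liab_only < 1"              # 1 candidate split
)
length(splits)  # 15
```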
[Figure: the CART tree grown on the example data.
Node 1 (root): N = 57,203; class 0: 66.2%, class 1: 33.8%
- NUM_VEH <= 4.5 → Node 2 (N = 36,359), split on LIAB_ONLY:
  - LIAB_ONLY <= 0.5 → Node 3 (N = 28,489), split on FREQ1_F_RPT:
    - FREQ1_F_RPT <= 0.5 → Terminal Node 1: Class = 0 (N = 24,122; class 0: 78.7%, class 1: 21.3%)
    - FREQ1_F_RPT > 0.5 → Terminal Node 2: Class = 1 (N = 4,367; class 0: 57.4%, class 1: 42.6%)
  - LIAB_ONLY > 0.5 → Terminal Node 3: Class = 0 (N = 7,870; class 0: 96.5%, class 1: 3.5%)
- NUM_VEH > 4.5 → Node 4 (N = 20,844), split on NUM_VEH again:
  - NUM_VEH <= 10.5 → Node 5 (N = 11,707), split on AVGAGE_CAT:
    - AVGAGE_CAT <= 8.5 → Terminal Node 4: Class = 1 (N = 8,998; class 0: 48.1%, class 1: 51.9%)
    - AVGAGE_CAT > 8.5 → Terminal Node 5: Class = 0 (N = 2,709; class 0: 76.5%, class 1: 23.5%)
  - NUM_VEH > 10.5 → Terminal Node 6: Class = 1 (N = 9,137; class 0: 26.4%, class 1: 73.6%)]
Observations (Shaking the Tree)
First split (# vehicles) is rather obvious:
more exposure → more claims.
But it confirms that CART is doing something reasonable.
Also: the choice of split point (NUM_VEH <= 4.5 rather than, say, 6) is non-obvious.
Now let's look at the rest of the tree…
[Figure: the top of the example tree; root Node 1 (N = 57,203; class 0: 66.2%, class 1: 33.8%) and the first split, NUM_VEH <= 4.5.]
Weakness of CART
CART is struggling to capture a linear relationship.
The best CART can do is a step-function approximation of a linear trend.
[Figure: the full example tree, as reconstructed above; NUM_VEH is split on twice (<= 4.5, then <= 10.5), approximating the linear effect of # vehicles with steps.]
Interactions and Rules
This tree is obviously not the best way to
model this dataset.
But it turns up interesting pockets of the data:
Liability-only policies with fewer than 5 vehicles are almost claim-free
(Terminal Node 3: 96.5% class 0, N = 7,870).
Could be used as an underwriting rule…
…or as an interaction term in a GLM.
[Figure: the same example tree as above.]
High-Dimensional Predictors
Categorical predictors:
CART considers every possible binary grouping of the
category levels when searching for a split.
[Figure: tree with root split on the high-dimensional categorical variable LINE_IND$.]
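A quick sketch of this behavior with rpart, using an invented 50-level factor as a stand-in for something like zip code:

```r
# Sketch: CART automatically groups the levels of a high-cardinality
# factor ('zip' is a made-up stand-in for zip code).
library(rpart)
set.seed(1)
n   <- 5000
zip <- factor(sample(1:50, n, replace = TRUE))
y   <- factor(rbinom(n, 1, ifelse(as.integer(zip) %% 7 == 0, 0.6, 0.2)))
fit_zip <- rpart(y ~ zip, method = "class")
fit_zip   # the printed split shows which zip levels go left vs. right
```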
A Little Theory
Splitting Rules
Select the variable and split value (X <= t1) that
produce the greatest "separation" in the
target variable.
“Separation” defined in many ways.
Regression Trees (continuous target): use
sum of squared errors.
Classification Trees (categorical target):
choice of entropy, Gini measure, “twoing”
splitting rule.
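For intuition, a hand-rolled version of a Gini-based split search (rpart does all of this internally):

```r
# Sketch: Gini impurity and the impurity reduction from a split.
gini <- function(y) {              # y: vector of 0/1 labels
  p <- mean(y)
  2 * p * (1 - p)
}

split_gain <- function(x, y, t) {  # candidate split: x < t
  left  <- y[x <  t]
  right <- y[x >= t]
  wl <- length(left)  / length(y)
  wr <- length(right) / length(y)
  gini(y) - (wl * gini(left) + wr * gini(right))
}

# Evaluate every candidate cut point and keep the best:
x <- c(1, 2, 2, 3, 5, 6, 8, 9); y <- c(0, 0, 0, 0, 1, 1, 1, 1)
cuts <- sort(unique(x))[-1]
best <- cuts[which.max(sapply(cuts, function(t) split_gain(x, y, t)))]
best   # the split "x < 5" separates the classes perfectly here
```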
Regression Trees
Tree-based modeling for a continuous target variable
The most intuitively appropriate method for loss
ratio analysis
Find the split that produces the greatest separation in
∑[y − E(y)]²
i.e.: find nodes with minimal within-node variance,
and therefore greatest between-node variance
(equivalently: the greatest increase in node purity)
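The same idea for a regression tree, with sum of squared errors in place of impurity (again hand-rolled for intuition):

```r
# Sketch: choosing a regression-tree split by sum of squared errors.
sse <- function(y) sum((y - mean(y))^2)

sse_gain <- function(x, y, t) {   # candidate split: x < t
  sse(y) - (sse(y[x < t]) + sse(y[x >= t]))
}

x <- 1:10
y <- c(1.1, 0.9, 1.0, 1.2, 0.8, 4.9, 5.1, 5.0, 4.8, 5.2)
cuts <- 2:10
cuts[which.max(sapply(cuts, function(t) sse_gain(x, y, t)))]  # 6: split at x < 6
```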
Finding the Right Tree
"Inside every big tree is a small, perfect tree
waiting to come out."
Grow a large tree, then find that small one via
"weakest link" pruning.
Cost-Complexity Pruning
Definition: Cost-Complexity Criterion
R_α = MC + α·L
MC = misclassification rate
(relative to the # of misclassifications in the root node)
L = # of leaves (terminal nodes)
You get a credit for lower MC, but pay a penalty of α for each additional leaf.
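A toy illustration of the trade-off, with invented numbers for a few nested subtrees:

```r
# Sketch: the cost-complexity criterion R_alpha = MC + alpha * L
# for nested subtrees (all numbers invented for illustration).
trees <- data.frame(
  leaves = c(1, 3, 6, 15, 40),              # L: number of terminal nodes
  mc     = c(1.00, 0.60, 0.45, 0.30, 0.28)  # MC relative to the root node
)
alpha <- 0.01
trees$r_alpha <- trees$mc + alpha * trees$leaves
trees[which.min(trees$r_alpha), ]  # the subtree that wins at this alpha
```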
Grow trees on the "blue" (training) data.
Test them on the "red" (test) data.
Do this for different values of α.
Now go back to the full dataset.
[Table: the 10-fold scheme; in each of rounds 1 to 10, a different tenth of the data is held out as "test" and the remaining nine tenths are "train".]
How to Cross-Validate
Grow the tree on all the data: T0.
Now break the data into 10 equal-size pieces.
10 times: grow a tree on 90% of the data.
Drop the remaining 10% (test data) down the nested
trees corresponding to each value of α.
For each α, add up the errors across all 10 test sets.
Keep track of the α corresponding to the lowest test error.
This corresponds to one of the nested trees Tk ⊂ T0.
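In rpart this whole procedure is built in: xval sets the number of folds, and the cptable reports cross-validated error for each value of the penalty (which rpart calls cp). A sketch, continuing the simulated policies data from earlier:

```r
fit_big <- rpart(claim ~ ., data = policies, method = "class",
                 control = rpart.control(cp = 1e-4, xval = 10))
cpt     <- fit_big$cptable
best_cp <- cpt[which.min(cpt[, "xerror"]), "CP"]   # alpha with lowest CV error
fit_pruned <- prune(fit_big, cp = best_cp)         # the nested tree T_k
```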
Just Right
Relative error: the proportion of CV-test cases misclassified.
According to CV, the 15-node tree is nearly optimal.
In summary: grow the tree all the way out,
then weakest-link prune back to the 15-node tree.
[Plot: cross-validated relative error ("X-val Relative Error") vs. tree size (1 to 21 nodes) and complexity parameter cp (from Inf down to 0.0036); the curve bottoms out near the 15-node tree.]
CART in Practice
CART advantages
Nonparametric (no probabilistic assumptions)
Automatically performs variable selection
Uses any combination of continuous/discrete
variables
Very nice feature: ability to automatically bin
massively categorical variables into a few
categories.
zip code, business class, make/model…
Discovers “interactions” among variables
Good for “rules” search
Hybrid GLM-CART models
CART advantages
CART handles missing values automatically,
using "surrogate splits" (see the sketch after this list)
Invariant to monotonic transformations of
predictor variables
Not sensitive to outliers in predictor variables
Unlike regression
Great way to explore, visualize data
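A sketch of the surrogate-split behavior, continuing the simulated policies data (the knocked-out values are of course artificial):

```r
# rpart's default na.action keeps rows with missing predictors and
# uses surrogate splits in their place.
policies$num_veh[sample(nrow(policies), 500)] <- NA
fit_na <- rpart(claim ~ num_veh + age_cat + liab_only,
                data = policies, method = "class")
summary(fit_na)   # the "Surrogate splits" section shows the stand-ins
```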
CART Disadvantages
The model is a step function, not a continuous score.
So if a tree has 10 terminal nodes, the predicted value ŷ
can take on only 10 possible values.
MARS improves on this.
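For instance, continuing the pruning sketch from the cross-validation section:

```r
# The tree's predicted probability is constant within each leaf:
probs <- predict(fit_pruned, type = "prob")[, 2]
length(unique(probs))   # at most the number of terminal nodes
```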
Case Study:
Spam e-mail Detection
[Plot: cross-validated relative error vs. tree size (1 to 83 nodes) and cp (from Inf down to 2.4e-05) for the spam tree.]
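A sketch of reproducing a curve like this, assuming the UCI spambase data as shipped in the kernlab package (whose column names differ from the talk's, e.g. charDollar rather than freq_DOLLARSIGN):

```r
library(rpart)
library(kernlab)   # ships the UCI 'spam' data: 4,601 emails, 57 features
data(spam)

spam_fit <- rpart(type ~ ., data = spam, method = "class",
                  control = rpart.control(cp = 1e-5, xval = 10))
plotcp(spam_fit)   # X-val relative error vs. cp, as in the plot above
```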
Pruned Tree #1
The pruned tree is still pretty big.
Can we get away with pruning the tree back even further?
Let's be radical and prune way back to a tree we
actually wouldn't mind looking at.
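In rpart terms, continuing the spam_fit sketch:

```r
# Prune to the CV-optimal cp, then (radically) to a deliberately
# large cp for a small, readable tree.
cpt     <- spam_fit$cptable
pruned1 <- prune(spam_fit, cp = cpt[which.min(cpt[, "xerror"]), "CP"])
pruned2 <- prune(spam_fit, cp = 0.05)   # an aggressive, invented cutoff
plot(pruned2); text(pruned2)
```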
Pruned Tree #2
Suggests a rule:
Many "$" signs, caps, and "!", and few instances of the
company name ("HP") → spam!
[Tree: root split at freq_DOLLARSIGN < 0.0555; four leaves labeled 0, 1, 0, 1, with class counts 415/29, 0/13, 208/54, and 12/51.]
CART Gains Chart
How do the three trees compare?
Use a gains chart on test data.
Outer black line: the best one could do (the "perfect model").
45° line: a monkey throwing darts.
The bigger trees are about equally good.
We do lose something with the simpler tree.
[Gains chart "Spam Email Detection - Gains Charts": Perc.Spam vs. Perc.Total.Pop, comparing the perfect model, unpruned tree, pruned tree #1, and pruned tree #2.]
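A sketch of how such a gains chart can be computed (scored on the training data here for brevity; the talk uses held-out test data):

```r
# Rank cases by predicted spam probability; plot cumulative % of spam
# captured against cumulative % of the population.
gains <- function(p_hat, y) {
  ord <- order(p_hat, decreasing = TRUE)
  cbind(pop  = seq_along(y) / length(y),
        spam = cumsum(y[ord] == "spam") / sum(y == "spam"))
}
g <- gains(predict(pruned1, spam, type = "prob")[, "spam"], spam$type)
plot(g, type = "l", xlab = "Perc.Total.Pop", ylab = "Perc.Spam")
abline(0, 1, lty = 2)   # the monkey-throwing-darts line
```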
Other Models
Fit a purely additive MARS model to the data.
No interactions among basis functions
Fit a neural network with 3 hidden nodes.
Fit a logistic regression (GLM).
Using the 20 strongest variables
Fit an ordinary multiple regression.
A statistical sin: the target is binary, not normal
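A sketch of fitting these comparison models; the package choices (earth for MARS, nnet for the neural net) are assumptions, since the talk doesn't name its software:

```r
library(earth)   # MARS (package choice is an assumption)
library(nnet)    # single-hidden-layer neural network

mars_fit <- earth(type ~ ., data = spam, degree = 1,       # degree 1 =>
                  glm = list(family = binomial))           # no interactions
nnet_fit <- nnet(type ~ ., data = spam, size = 3,          # 3 hidden nodes
                 decay = 0.1, maxit = 200)
glm_fit  <- glm(type ~ ., data = spam, family = binomial)  # (talk used top 20 vars)
spam_ols <- transform(spam, type = as.numeric(type == "spam"))
ols_fit  <- lm(type ~ ., data = spam_ols)                  # the "statistical sin"
```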
GLM model
Logistic regression run on 20 of the most powerful predictive variables.
Neural Net Weights
Comparison of Techniques
All techniques add value.
MARS/NNET beats GLM.
But note: we used all variables for MARS/NNET; only 20 for GLM.
GLM beats CART.
In real life we'd probably use the GLM model,
but refer to the tree for "rules" and intuition.
[Gains chart "Spam Email Detection - Gains Charts": Perc.Spam vs. Perc.Total.Pop, comparing the perfect model, mars, neural net, pruned tree #1, glm, and regression.]
Parting Shot: Hybrid GLM model
We can use the simple decision tree (#3) to
motivate the creation of two ‘interaction’ terms:
“Goodnode”:
(freq_$ < .0565) & (freq_remove < .065) & (freq_! <.524)
“Badnode”:
(freq_$ > .0565) & (freq_hp <.16) & (freq_! > .375)
We read these off tree (#3)
Code them as {0,1} dummy variables
Include in GLM model
At the same time, remove terms no longer
significant.
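A sketch of this recipe, mapping the talk's variable names onto kernlab's spam columns (charDollar, remove, charExclamation, hp); the mapping is an assumption:

```r
# Dummy variables read off the small tree (thresholds from the talk):
spam$goodnode <- as.numeric(spam$charDollar < 0.0565 &
                            spam$remove < 0.065 &
                            spam$charExclamation < 0.524)
spam$badnode  <- as.numeric(spam$charDollar > 0.0565 &
                            spam$hp < 0.16 &
                            spam$charExclamation > 0.375)
hybrid_fit <- glm(type ~ ., data = spam, family = binomial)
# ...then drop terms no longer significant (e.g. inspect summary(), drop1()).
```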
Hybrid GLM model
The hybrid model improves on the original GLM.
See gains chart.
See confusion matrix.
Only a modest improvement in this particular model…
…but it proves the concept.
[Gains chart: Perc.Spam vs. Perc.Total.Pop, comparing the perfect model, neural net, decision tree #2, glm, and hybrid glm.]
Concluding Thoughts
In many cases, CART will likely under-perform
tried-and-true techniques like GLM.
Poor at handling linear structure
Data gets chopped thinner at each split