
DECISION TREES

Tree-based Methods

•Here we describe tree-based methods for regression and classification.

•These involve stratifying or segmenting the predictor space into a number of simple regions.

•Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision-tree methods.
Pros and Cons
•Tree-based methods are simple and useful for
interpretation.
•However they typically are not competitive with
the best supervised learning approaches in
terms of prediction accuracy.
•Hence we also discuss bagging, random forests,
and boosting. These methods grow multiple trees
which are then combined to yield a single
consensus prediction.
•Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.
The Basics of Decision Trees

•Decision trees can be applied to both regression and classification problems.

•We first consider regression problems, and then move on to classification.
Baseball salary data:
how would you stratify it?
Salary is color-coded from low (blue, green) to high (yellow, red).
Decision tree for these data
Details of previous figure
For the Hitters data, a regression tree for
predicting the log salary of a baseball player,
based on the number of years that he has played in
the major leagues and the number of hits that he
made in the previous year.
•At a given internal node, the label (of the form Xj
< tk) indicates the left-hand branch emanating from
that split, and the right-hand branch corresponds to
Xj ≥ tk. For instance, the split at the top of the
tree results in two large branches. The left-hand branch corresponds to Years<4.5, and the right-hand branch corresponds to Years>=4.5.
•The tree has two internal nodes and three terminal
nodes, or leaves. The number in each leaf is the
mean of the response for the observations that fall
there.
Results
Terminology for Trees

•In keeping with the tree analogy, the regions R1, R2, and R3 are known as terminal nodes.
•Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree.
•The points along the tree where the predictor space is split are referred to as internal nodes.
•In the Hitters tree, the two internal nodes are indicated by the text Years<4.5 and Hits<117.5.
Interpretation of Results
•Years is the most important factor in determining
Salary, and players with less experience earn
lower salaries than more experienced players.
•Given that a player is less experienced, the
number of Hits that he made in the previous year
seems to play little role in his Salary.
•But among players who have been in the major
leagues for five or more years, the number of Hits
made in the previous year does affect Salary, and
players who made more Hits last year tend to have
higher salaries.
•Surely an over-simplification, but compared to a regression model, it is easy to display, interpret, and explain.
Details of the
tree-building process
•We divide the predictor space, that is, the set of possible values for X1, X2, ..., Xp, into J distinct and non-overlapping regions, R1, R2, ..., RJ.

•For every observation that falls into the region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj.
More details of the
tree-building process
•In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model.

•The goal is to find boxes R1, ..., RJ that minimize the RSS, given by

RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2,

where \hat{y}_{R_j} is the mean response for the training observations within the jth box.
More details of the tree-
building process
• Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes.
• For this reason, we take a top-down, greedy approach that is known as recursive binary splitting.
• The approach is top-down because it begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down on the tree.
• It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
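To make the greedy step concrete, here is a minimal sketch (not from the slides; the function name and the NumPy-based setup are assumptions) of how recursive binary splitting might choose one split: for each predictor and each candidate cutpoint, compute the RSS of the two resulting regions and keep the pair with the lowest total.

import numpy as np

def best_split(X, y):
    # X: (n_samples, n_features) predictors, y: (n_samples,) responses
    best = (None, None, np.inf)  # (feature index, threshold, RSS)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            # RSS of a region = sum of squared deviations from its mean
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, t, rss)
    return best

A full tree would apply best_split recursively to each resulting region until a stopping rule (for example, a minimum node size) is met.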
Details Continued
Predictions

•We predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs.

•A five-region example of this approach is shown in the next slide.
Details of previous
figure
•Top Left: A partition of two-dimensional
feature space that could not result from
recursive binary splitting.
•Top Right: The output of recursive binary
splitting on a two-dimensional example.
•Bottom Left: A tree corresponding to the
partition in the top right panel.
•Bottom Right: A perspective plot of the prediction surface corresponding to that tree.
How to decide the depth of the
tree?
 One of the questions that arises in a decision tree
algorithm is the optimal size of the final tree.
 A tree that is too large risks overfitting the training
data and poorly generalizing to new samples.
 A small tree might not capture important structural information about the sample space.
 However, it is hard to tell when a tree algorithm should stop, because it is impossible to tell if the addition of a single extra node will dramatically decrease error.
 A common strategy is to grow the tree until each node contains a small number of instances, and then use pruning to remove nodes that do not provide additional information.
Pruning a Tree

 Pruning is a technique that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances.

 Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross-validation set.
Pruning Parameters

The following parameters are used for pruning a tree:

 max_leaf_nodes – Limit the number of leaf nodes.

 min_samples_leaf – Restrict the minimum number of samples required at a leaf node.

 max_depth – Limit the depth of the tree to build a more generalized tree.

 The optimal values of the above parameters can be found using cross-validation.
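As an illustration (assuming scikit-learn and pre-split training data X_train, y_train; the particular values are arbitrary), these parameters are passed directly to the tree constructor:

from sklearn.tree import DecisionTreeClassifier

# Illustrative values: a shallow tree with at least 5 samples per leaf
# and at most 20 leaves, which limits how finely the predictor space is cut
pruned_tree = DecisionTreeClassifier(max_depth=3,
                                     min_samples_leaf=5,
                                     max_leaf_nodes=20)
pruned_tree.fit(X_train, y_train)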
Cross Validation Revisited

 In machine learning, we fit the model on the training data, but we cannot assume that the model will work accurately on unseen data.
 Cross-validation is a technique in which we train our model using a subset of the data-set and then evaluate it using the complementary subset of the data-set.
 The three steps involved in cross-validation are as follows:
 Reserve some portion of the sample data-set.
 Train the model using the rest of the data-set.
 Test the model using the reserved portion of the data-set.
Evaluation using Cross
Validation
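A minimal sketch of such an evaluation (assuming scikit-learn and arrays X, y holding the predictors and the target; the 5-fold choice is arbitrary):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Each of the 5 folds is held out in turn; the tree is trained on the
# remaining data and scored on the reserved portion.
scores = cross_val_score(DecisionTreeRegressor(max_depth=3), X, y, cv=5)
print(scores.mean(), scores.std())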
Hyper-parameter Tuning

 Tuning is the process of maximizing a model's performance without overfitting or creating too high a variance. In machine learning, this is accomplished by selecting appropriate "hyperparameters."
 A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.
 Models can have many hyperparameters and finding the
best combination of parameters can be treated as a
search problem.
 Scikit-Learn’s GridSearchCV is used for this search
problem.
GridSearchCV() for Pruning
Parameters
 GridSearchCV lets you combine an estimator with a grid search over hyper-parameters.
 The method picks the optimal parameters from the grid search and uses them with the estimator selected by the user.
 The best parameter values can be obtained from the .best_params_ attribute.
 Example:

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Try tree depths 1-5, each evaluated with 10-fold cross-validation
parameters = {'max_depth': [1, 2, 3, 4, 5]}
model1 = DecisionTreeRegressor()
grid = GridSearchCV(model1, parameters, cv=10)
grid.fit(X_train, y_train)
grid.best_params_
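After fitting, the estimator refitted with the best depth is also available as grid.best_estimator_, which can be used directly for prediction on new data.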
Classification Trees

Decision Tree models where the target variable can take a discrete set of values are called classification trees.

A classification tree splits the dataset based on the homogeneity of data.

Measures of impurity like entropy or Gini index are used to quantify the homogeneity of the data when it comes to classification trees.
Classification Trees Intuition

 The plot shows sample data for two independent variables, x and y.

 Each data point is colored by the outcome variable, red or gray.

 CART tries to split this data into subsets, so that each subset is as pure or homogeneous as possible.
Classification Trees Intuition
Classification Trees Intuition

 The first split tests whether the variable x is less than 60.
 If yes, the model predicts red; if no, the model moves on to the next split.
 The second split checks whether or not the variable y is less than 20.
 If no, the model predicts gray; if yes, the model moves on to the next split.
 The third split checks whether or not the variable x is less than 85.
 If yes, the model predicts red; if no, the model predicts gray.
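Written as plain code, the three splits above amount to the following hand-built classifier (the function name is illustrative; the thresholds are the ones on the slide):

def predict_color(x, y):
    # Follow the three splits described above for a point (x, y)
    if x < 60:
        return "red"
    if y >= 20:
        return "gray"
    return "red" if x < 85 else "gray"

print(predict_color(50, 10))   # x < 60, so predict red
print(predict_color(90, 30))   # y >= 20, so predict gray
print(predict_color(70, 10))   # x < 85, so predict red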
Quick Quiz 1
 How many splits are there in the following tree?
Quick Quiz 2
 For which data observations should we predict "Red",
according to this tree? Select all that apply.
Deciding where to split

 A decision tree considers splits on all available variables and then selects the split which results in the most homogeneous sub-nodes.

 The creation of sub-nodes increases the homogeneity of the resultant sub-nodes.

 In other words, we can say that the purity of the node increases with respect to the target variable.

 Information Gain and the Gini Index are the most commonly used measures in decision trees.
Entropy

 Entropy is a measure of the randomness in the information being processed.

 If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided then it has an entropy of one.

 The equation for entropy is given as:

E(S) = -\sum_{i} p_i \log_2 p_i

where p_i is the proportion of samples in the ith class.
Entropy
 Consider a variable with two possible outcomes, say 'yes' and 'no'.
 Case 1: The samples are completely homogeneous, i.e. all 'yes' or all 'no'.
Then the entropy when all samples are 'yes' is calculated as:
E(s) = -p(yes) log2 p(yes)
     = -1 log2 1
     = 0
Similarly, when all samples are 'no' the entropy will be:
E(s) = -p(no) log2 p(no)
     = -1 log2 1
     = 0
Entropy

 Case 2: If the samples are equally divided i.e: equal no. of ‘yes’ and
equal no. of ‘no’.
Then, the entropy will be calculated as:
E(s) = -p(yes) log2 p(yes) + -p(no) log2 p(no)
= -0.5 log2 0.5 + (- 0.5 log2 0.5)
= -0.5 (-1) + (-0.5) (-1)
= 0.5 + 0.5
= 1

 The higher the entropy, the harder it is to draw


any conclusions from that information.
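A small sketch (illustrative, using only the Python standard library) that reproduces these two cases numerically:

from math import log2

def entropy(labels):
    # Entropy of a list of class labels, in bits
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * log2(p) for p in probs)

print(entropy(['yes', 'yes', 'yes', 'yes']))   # homogeneous sample -> zero entropy
print(entropy(['yes', 'yes', 'no', 'no']))     # equally divided -> 1.0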
Information Gain

 Information gain (IG) measures how much "information" a feature gives us about the class.
 Information gain is the main criterion used by decision tree algorithms to construct a decision tree.
 Decision tree algorithms will always try to maximize information gain.
 The attribute with the highest information gain will be selected for the first split.
 The equation for calculating information gain is:

IG = Entropy(parent) - weighted average of Entropy(children)
Information Gain Example

 Consider a dataset with 'Temperature' and 'Grades' as two independent variables and 'Ice-cream' as the dependent variable.

Temperature   Grades   Ice-cream
Hot           Good     Yes
Cold          Good     No
Hot           Bad      No
Hot           Good     Yes
Hot           Bad      Yes
Information Gain Example

 Step 1: Calculate the entropy of the dependent variable.

 Step 2: Calculate the entropy and information gain for the Temperature variable.

 Step 3: Calculate the entropy and information gain for the Grades variable.

 Step 4: The variable with the highest information gain will be selected for the first split.
Information Gain Example

Entropy for Ice-cream            -3/5 log2(3/5) - 2/5 log2(2/5)       0.970951
Entropy for Temperature(Hot)     -3/4 log2(3/4) - 1/4 log2(1/4)       0.811278
Entropy for Temperature(Cold)    -1 log2(1)                           0
IG from Temperature              0.9709 - (4/5*(0.8112) + 1/5*(0))    0.321928
Entropy for Grades(Good)         -2/3 log2(2/3) - 1/3 log2(1/3)       0.918296
Entropy for Grades(Bad)          -1/2 log2(1/2) - 1/2 log2(1/2)       1
IG from Grades                   0.9709 - (3/5*(0.9182) + 2/5*(1))    0.019973

 As the IG from Temperature is higher, the first split is made on the "Temperature" variable.
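A short sketch (illustrative, standard library only) that reproduces these information-gain numbers for the toy dataset:

from math import log2

temp   = ['Hot', 'Cold', 'Hot', 'Hot', 'Hot']
grades = ['Good', 'Good', 'Bad', 'Good', 'Bad']
ice    = ['Yes', 'No', 'No', 'Yes', 'Yes']

def entropy(labels):
    n = len(labels)
    return sum(-labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def info_gain(feature, target):
    # Entropy(parent) minus the size-weighted entropy of the children
    n = len(target)
    children = 0.0
    for v in set(feature):
        subset = [t for f, t in zip(feature, target) if f == v]
        children += len(subset) / n * entropy(subset)
    return entropy(target) - children

print(info_gain(temp, ice))     # ~0.3219, so Temperature gives the first split
print(info_gain(grades, ice))   # ~0.0200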
Gini Index

 The Gini index is a measure of total variance across the K classes.
 It is defined as:

G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}),

where \hat{p}_{mk} is the proportion of training observations in the mth region that are from the kth class.
 The Gini index takes on a small value if all the \hat{p}_{mk} are close to zero or one.
 For this reason the Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.
Gini Index Example

 Consider a dataset with 'Temperature' and 'Grades' as two independent variables and 'Ice-cream' as the dependent variable.

Temperature   Grades   Ice-cream
Hot           Good     Yes
Cold          Good     No
Hot           Bad      No
Hot           Good     Yes
Hot           Bad      Yes
Gini Index Example

Gini for Temperature(Hot)       3/4*(1 - 3/4) + 1/4*(1 - 1/4)    0.375
Gini for Temperature(Cold)      1*(1 - 1)                        0
Weighted Gini for Temperature   4/5*(0.375) + 1/5*(0)            0.30
Gini for Grades(Good)           2/3*(1 - 2/3) + 1/3*(1 - 1/3)    0.444
Gini for Grades(Bad)            1/2*(1 - 1/2) + 1/2*(1 - 1/2)    0.5
Weighted Gini for Grades        3/5*(0.444) + 2/5*(0.5)          0.467

 As the weighted Gini index for the Temperature split is lower, the split is made on the 'Temperature' variable.
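A short sketch (illustrative, standard library only) of the weighted-Gini computation for the same toy dataset:

temp   = ['Hot', 'Cold', 'Hot', 'Hot', 'Hot']
grades = ['Good', 'Good', 'Bad', 'Good', 'Bad']
ice    = ['Yes', 'No', 'No', 'Yes', 'Yes']

def gini(labels):
    # Gini impurity: sum over classes of p*(1 - p)
    n = len(labels)
    return sum((labels.count(c) / n) * (1 - labels.count(c) / n) for c in set(labels))

def weighted_gini(feature, target):
    # Size-weighted Gini impurity of the children produced by splitting on `feature`
    n = len(target)
    total = 0.0
    for v in set(feature):
        subset = [t for f, t in zip(feature, target) if f == v]
        total += len(subset) / n * gini(subset)
    return total

print(weighted_gini(temp, ice))     # ~0.30, lower, so split on Temperature
print(weighted_gini(grades, ice))   # ~0.47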
Classification
Trees Case Study
Example: heart data
•These data contain a binary outcome HD for 303
patients who presented with chest pain.
•An outcome value of Yes indicates the presence
of heart disease based on an angiographic test,
while No means no heart disease.
•There are 13 predictors including Age, Sex,
Chol (a cholesterol measurement), and other
heart and lung function measurements.
•Cross-validation yields a tree with six
terminal nodes. See next figure.
Trees Versus Linear Models

Top row: true linear boundary; bottom row: true non-linear boundary.
Left column: linear model; right column: tree-based model.
Advantages and
Disadvantages of Trees
