
Decision Trees

Introduction
We now discuss tree-based methods.
Note that our main goal is to predict a target variable based on
several input variables.
Decision trees can be applied to both regression and classification
problems.
Introduction
There are two main types:
1. Classification trees, used when the predicted outcome is a categorical
variable.
2. Regression trees, used when the predicted outcome is a quantitative
variable.
The term Classification And Regression Tree (CART) analysis is a
popular umbrella term used to refer to both of the above methods.
Regression Tree
Dataset: Baseball Players Salaries
Major League Baseball data covering two seasons.
A data frame with 322 observations of major league players on 20
variables.
Goal: To predict Salary based on a number of predictors, such as
various performance indicators, number of years, etc.
Dataset: Baseball Players Salaries
For the time being, we will consider the following three variables:
1. Salary (Thousands of dollars)
2. Years (Number of years he has played in the major leagues)
3. Hits (Number of hits he made in the previous year)
Goal: To predict Salary based on Years and Hits.
In order to reduce the skewness, we first log-transform Salary so that
it has more of a typical bell-shape.
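As a small illustration, the log transform might be done in Python roughly as follows (the file name Hitters.csv and the column name Salary are assumptions about how the data is stored):

```python
import numpy as np
import pandas as pd

# Assumed file/column names for the baseball salary data.
hitters = pd.read_csv("Hitters.csv")

# Log-transform Salary to reduce its right skew before building the tree.
hitters["LogSalary"] = np.log(hitters["Salary"])
```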
Baseball Players Salaries
The General View
Consider two predictors, X1 and X2.
The predictor space is segmented into five
distinct regions.
Depending upon which region our
observation comes from, we would make
one of five possible predictions for Y.
We typically use the mean of the training
observations belonging to a particular
region as the predicted value for the
region.
The General View
Typically we create the
partitions by iteratively
splitting the predictor space
along one of the variables
into two regions.
The General View
1. First split on X1 = t1.
2. If X1 ≤ t1, split on X2 = t2.
3. If X1 > t1, split on X1 = t3.
4. If X1 > t3, split on X2 = t4.
The General View
When we create partitions like this, we can always represent them
using a tree-like structure.
This tree-like representation provides a very simple way to explain the
model to a non-expert!
Baseball Players Salaries
The predictor space is segmented into
three regions:
R1 = {X | Years < 4.5},
R2 = {X | Years ≥ 4.5, Hits < 117.55},
R3 = {X | Years ≥ 4.5, Hits ≥ 117.55}.
The predicted Salary values for these three
groups are
$1,000 × e^5.107 ≈ $165,174,
$1,000 × e^5.998 ≈ $402,834,
$1,000 × e^6.740 ≈ $845,346,
respectively.
Baseball Players Salaries: Another Representation
Regression Tree: Two steps
1. Divide the predictor space, i.e., the set of possible values of the
predictors, into J distinct and non-overlapping regions R1, R2, ..., RJ.
2. For every observation that falls into region Rj, we make the
same prediction, which is simply the mean of the response values
of the training observations in Rj.
Some Natural Questions about Step 1
How do we construct the regions R1, R2, ..., RJ?
Though these regions could have any shape in theory, we choose to segment
the predictor space into high-dimensional rectangles, or boxes.
This is mainly done for simplicity and for ease of interpretation.
How should we decide where to split?
We take a top-down, greedy approach, that is known as recursive binary
splitting.
The approach is top-down, because it begins at the top of the tree.
The approach is greedy because at each step of the tree-building process, the
best split is made at that particular step, rather than looking ahead.
Where to split?
We consider splitting into two
regions, Xj < s and Xj ≥ s, for all
possible values of s and j = 1, 2.
We then choose the j and s that
result in the lowest MSE on the
training data.
Where to split?
Here the optimal split was on X1
at point t1.
We now repeat the process,
looking for the next best split,
except that now we must also
consider whether to split up the
first region or the second region.
Again the criterion is the smallest
MSE.
Where to split?
The optimal split was of the left
region, on X2 at point t2.
This process continues until
our regions have too few
observations to continue, e.g.,
until all regions have 5 or fewer
points.
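As a rough sketch of this greedy search (not the code used to produce the figures; the function name and interface below are hypothetical), one pass over all candidate pairs (j, s) might look like this:

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the single split (j, s) with the lowest training MSE.

    X: (n, p) array of predictors, y: (n,) array of responses.
    Returns (best_j, best_s, best_mse).
    """
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            # MSE of predicting each region by its own training mean.
            mse = (np.sum((left - left.mean()) ** 2) +
                   np.sum((right - right.mean()) ** 2)) / n
            if mse < best[2]:
                best = (j, s, mse)
    return best
```

Recursive binary splitting then applies the same search inside each resulting region until a stopping rule (e.g., at most five observations per region) is met.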
Baseball Players Salaries
We first remove all the rows that have missing values in any variable.
This reduces the number of observations to 263.
We then randomly split the observations into two parts: a training
set containing 132 observations and a test set containing 131
observations.
We will first build a tree based on the training data set.
Baseball Players Salaries
Tree Pruning
A large tree (i.e., one with many terminal nodes) may tend to over-fit
the training data.
It may lead to poor test set performance.
A smaller tree with fewer splits may be easier to interpret.
Generally, we can improve accuracy by pruning the tree, i.e., cutting
off some of the terminal nodes.
How do we know how far back to prune the tree?
We use six-fold cross validation to see which tree has the lowest
error rate.
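As a minimal sketch of this workflow, assuming recent scikit-learn (which exposes cost-complexity pruning via ccp_alpha) and assuming the data sits in a file named Hitters.csv with only Years and Hits used as predictors, the pruning level could be chosen by cross-validation roughly as follows; the split sizes and random seed are arbitrary choices, not the ones behind the slides:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Assumed file and column names for the Hitters data.
hitters = pd.read_csv("Hitters.csv").dropna()
X = hitters[["Years", "Hits"]]
y = np.log(hitters["Salary"])

# Mirror the slides' 132/131 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=132, random_state=1)

# Candidate pruning levels from the cost-complexity pruning path.
path = DecisionTreeRegressor(random_state=1).cost_complexity_pruning_path(
    X_train, y_train)

# Choose the alpha with the best 6-fold cross-validated MSE.
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=1),
                    X_train, y_train, cv=6,
                    scoring="neg_mean_squared_error").mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=1).fit(
    X_train, y_train)
```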
Cross-Validation
Pruned Tree
Test Error
RMSE = 408.29
Classification Tree
Example:
Let's say we have a sample of 30 students with three variables: Gender
(Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play
cricket in their leisure time. Now, we want to create a model to predict
who will play cricket during leisure time. In this problem, we need to
segregate the students who play cricket in their leisure time based on
the most significant input variable among the three.
This is where a decision tree helps: it segregates the students based
on all values of the three variables and identifies the variable that creates
the best homogeneous sets of students (which are heterogeneous to
each other). In the snapshot below, you can see that the variable Gender
is able to identify the best homogeneous sets compared to the other two
variables.
As mentioned above, a decision tree identifies the most significant
variable and the value of that variable which gives the best homogeneous
sets of the population. Now the question that arises is: how does it identify
the variable and the split? To do this, the decision tree uses various
algorithms, which we shall discuss later.
Important Terminology related to Decision
Trees
Root Node: It represents the entire population or sample, and this further gets
divided into two or more homogeneous sets.
Splitting: It is the process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a
decision node.
Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called
pruning. It is the opposite of splitting.
Branch/Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent
node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
These are the terms commonly used for decision trees. Since every algorithm has
advantages and disadvantages, below are the important factors one should know.
Carseats Data Set
A data set containing sales of child car seats at 400 different stores.
A data set with 400 observations on the following 11 variables.
The variables are as follows:
1. Sales: Unit sales (in thousands) at each location.
2. CompPrice: Price charged by competitor at each location
3. Income: Community income level (in thousands of dollars)
4. Advertising: Local advertising budget for company at each location
(in thousands of dollars)
5. Population: Population size in region (in thousands)
Carseats Data Set
6. Price: Price company charges for car seats at each site
7. ShelveLoc: A factor with levels Bad, Medium and Good
indicating the quality of the shelving location.
8. Age: Average Age of the local population
9. Education: Education level at each location
10. Urban: A factor with levels No and Yes to indicate whether the
store is in an urban or rural location
11. US: A factor with levels No and Yes to indicate whether the
store is in the US or not.
Carseats Data Set
We now recode Sales as a binary variable.
We create a dummy variable High, which takes the value Yes if
Sales exceeds 8 (thousand units) and No otherwise.
We will model High with the help of the ten remaining predictors.
Classification Tree
A classification tree is very similar to a regression tree, except that we
try to make a prediction for a categorical response rather than a
continuous one.
In a regression tree, the predicted response for an observation is
given by the average response of the training observations that
belong to the same terminal node.
In a classification tree, we predict that each observation belongs to
the most commonly occurring class of the training observations in the
region to which it belongs.
Classification Tree
The tree is grown in exactly the same manner as for a regression
tree.
However, for a classification tree, minimizing the MSE no longer makes
sense.
A natural alternative is classification error rate.
The classification error rate is simply the fraction of the training
observations in that region that do not belong to the most common
class.
There are several other criteria available as well, such as the
Gini index and the cross-entropy.
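For a node with class proportions p1, ..., pK, the three criteria can be computed as in the following generic sketch (illustrative only; not tied to any particular data set):

```python
import numpy as np

def node_impurity(p):
    """p: array of class proportions in a node (summing to 1)."""
    p = np.asarray(p, dtype=float)
    error = 1.0 - p.max()                              # classification error rate
    gini = 1.0 - np.sum(p ** 2)                        # Gini index (impurity form)
    cross_entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # cross-entropy
    return error, gini, cross_entropy

print(node_impurity([0.9, 0.1]))   # nearly pure node: all three measures are small
print(node_impurity([0.5, 0.5]))   # maximally impure two-class node
```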
Carseats Data Set
We split the observations into a training data set and a test data set.
Both the training set and the test set contain 200 observations.
We next build a tree using the training set, and then evaluate its
performance based on the test data.
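A minimal sketch of this workflow in scikit-learn might look as follows (the file name, the dummy-variable encoding of the factors, the 200/200 split, and the random seed are assumptions, so the resulting counts will not match the slides exactly):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

carseats = pd.read_csv("Carseats.csv")                 # assumed file name
# Recode Sales as the binary variable High.
carseats["High"] = (carseats["Sales"] > 8).map({True: "Yes", False: "No"})

# One-hot encode the factor predictors; drop Sales (and High) from the inputs.
X = pd.get_dummies(carseats.drop(columns=["Sales", "High"]))
y = carseats["High"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=200, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print(confusion_matrix(y_test, tree.predict(X_test), labels=["No", "Yes"]))
```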
Carseats Data Set: Unpruned Tree
Confusion Matrix based on Test Data

                        True High status
Predicted High status   No      Yes     Total
  No                    88      28      116
  Yes                   28      56       84
  Total                116      84      200

56/84 = 66.67%   (fraction of true Yes correctly predicted)
88/116 = 75.86%  (fraction of true No correctly predicted)
(28 + 28)/200 = 56/200 = 28%  (overall test error rate)
Cross Validation
We now consider whether pruning the tree leads to a better
performance.
We decide the optimal level of tree complexity using cross-validation.
Cross Validation

We select a tree with 9 terminal nodes.
Pruned Tree
Confusion Matrix based on Test Data for Pruned Tree

                        True High status
Predicted High status   No      Yes     Total
  No                    94      24      118
  Yes                   22      60       82
  Total                116      84      200

60/84 = 71.43%   (fraction of true Yes correctly predicted)
94/116 = 81.03%  (fraction of true No correctly predicted)
(24 + 22)/200 = 46/200 = 23%  (overall test error rate)
Example: Portuguese Bank data
summary(data)

 job               marital          education        default
 blue-collar:435   divorced: 228    primary  : 335   no :1961
 management :423   married :1201    secondary:1010   yes:  39
 technician :339   single  : 571    tertiary : 564
 admin.     :235                    unknown  :  91
 services   :168
 retired    : 92
 (Other)    :308

 housing     loan        contact           month         poutcome
 no : 916    no :1717    cellular :1287    may    :581   failure: 210
 yes:1084    yes: 283    telephone: 136    jul    :340   other  :  79
                         unknown  : 577    aug    :278   success:  58
                                           jun    :232   unknown:1653
                                           nov    :183
                                           apr    :118
                                           (Other):268

 subscribed
 no :1789
 yes: 211
Explanation of figure
At each split, the decision tree algorithm picks the most informative
attribute out of the remaining attributes. The extent to which an attribute
is informative is determined by measures such as entropy and information
gain.
At the first split, the decision tree algorithm chooses the poutcome
attribute. There are two nodes at depth=1. The left node is a leaf node
representing a group for which the outcome of the previous marketing
campaign contact is a failure, other, or unknown. For this group, 1,763 out
of 1,942 clients have not subscribed to the term deposit.
The right node represents the rest of the population, for which the
outcome of the previous marketing campaign contact is a success. For the
population of this node, 32 out of 58 clients have subscribed to the term
deposit.
Explanation of figure
This node further splits into two nodes based on the education level.
If the education level is either secondary or tertiary, then 26 out of 50
clients have not subscribed to the term deposit. If the
education level is primary or unknown, then 8 out of 8
clients have subscribed.
The left node at depth 2 further splits based on the attribute job. If the
occupation is admin, blue collar, management, retired, services, or
technician, then 26 out of 45 clients have not subscribed. If the
occupation is self-employed, student, or unemployed, then 5 out of 5
clients have subscribed.
The General Algorithm

The objective of a decision tree algorithm is to construct a tree T from
a training set S. If all the records in S belong to some class C
(subscribed = yes, for example), or if S is sufficiently pure (greater
than a preset threshold), then that node is considered a leaf node and
assigned the label C. The purity of a node is defined as its probability
of the corresponding class.
For example, in the figure, the root has
P(subscribed = yes) = 1 − 1789/2000 = 10.55%; therefore, the root
is only 10.55% pure on the subscribed = yes
class. Conversely, it is 89.45% pure on the subscribed = no class.
Steps
In contrast, if not all the records in S belong to class C, or if
S is not sufficiently pure, the algorithm selects the next most
informative attribute A (duration, marital, and so on) and
partitions S according to A's values. The algorithm
constructs subtrees T1, T2, ... for the subsets of S recursively,
until one of the following criteria is met:
All the leaf nodes in the tree satisfy the minimum purity threshold.
The tree cannot be further split with the preset minimum purity
threshold.
Any other stopping criterion is satisfied (such as the maximum depth
of the tree).
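The procedure described above can be sketched as a small ID3-style recursive builder. This is only a schematic illustration under assumed defaults (the purity threshold, depth limit, and data representation are placeholders), not the implementation behind the figures:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(S, attribute):
    """Base entropy minus the weighted entropy of the subsets produced by `attribute`."""
    labels = [y for _, y in S]
    remainder = 0.0
    for value in {x[attribute] for x, _ in S}:
        subset = [y for x, y in S if x[attribute] == value]
        remainder += len(subset) / len(S) * entropy(subset)
    return entropy(labels) - remainder

def grow_tree(S, attributes, depth=0, purity_threshold=0.95, max_depth=5):
    """Recursively build a tree from S, a list of (feature_dict, label) pairs."""
    labels = [y for _, y in S]
    majority = Counter(labels).most_common(1)[0][0]
    purity = labels.count(majority) / len(labels)

    # Leaf conditions: node is pure enough, no attributes remain, or max depth reached.
    if purity >= purity_threshold or not attributes or depth >= max_depth:
        return {"leaf": True, "label": majority}

    # Select the most informative remaining attribute.
    A = max(attributes, key=lambda a: information_gain(S, a))
    children = {
        value: grow_tree([(x, y) for x, y in S if x[A] == value],
                         [a for a in attributes if a != A],
                         depth + 1, purity_threshold, max_depth)
        for value in {x[A] for x, _ in S}
    }
    return {"leaf": False, "attribute": A, "children": children}
```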
Steps
The first step in constructing a decision tree is to choose the most
informative attribute. A common way to identify the most informative
attribute is to use entropy-based methods, which are used by
decision tree learning algorithms such as ID3 (or Iterative
Dichotomiser 3) and C4.5. The entropy methods select the most
informative attribute based on two basic measures:
Entropy, which measures the impurity of an attribute
Information gain, which measures the purity of an attribute
Entropy
We can thus conclude that a less impure node requires less
information to describe it, while a more impure node requires more
information. Information theory provides a measure of this degree of
disorganization in a system, known as entropy. If the sample is
completely homogeneous, the entropy is zero; if the sample
is equally divided (50%/50%), it has an entropy of one.
Entropy
Given a class X and its labels x ∈ X, let P(x) be the probability of
x. The entropy of X, denoted HX, is defined as

HX = − Σx∈X P(x) log2 P(x).

For an equally divided two-class sample,
HX = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1.
Entropy
For the bank marketing scenario, the output variable is subscribed.
The base entropy is defined as the entropy of the output variable, that is,
Hsubscribed. As seen previously, P(subscribed = yes) = 0.1055 and
P(subscribed = no) = 0.8945. According to the entropy equation, the
base entropy is
Hsubscribed = −0.1055 log2 0.1055 − 0.8945 log2 0.8945 ≈ 0.4862.
Conditional Entropy
The next step is to identify the conditional entropy for each attribute.
Given an attribute X with values x, and an outcome Y with values y,
the conditional entropy HY|X is the remaining entropy of Y given X,
formally defined as

HY|X = Σx P(x) HY|X=x = − Σx P(x) Σy P(y|x) log2 P(y|x).

Consider the bank marketing scenario: if the attribute contact is
chosen, X = {cellular, telephone, unknown}. The conditional entropy of
contact considers all three values.
Conditional Entropy Example
                               Cellular    Telephone    Unknown
P(contact)                     0.6435      0.0680       0.2885
P(subscribed = yes | contact)  0.1399      0.0809       0.0347
P(subscribed = no | contact)   0.8601      0.9192       0.9653

Hsubscribed|contact = −[0.6435 (0.1399 log2 0.1399 + 0.8601 log2 0.8601)
                      + 0.0680 (0.0809 log2 0.0809 + 0.9192 log2 0.9192)
                      + 0.2885 (0.0347 log2 0.0347 + 0.9653 log2 0.9653)]
                    = 0.4661
The computation inside each pair of parentheses is the entropy of the class
labels within a single contact value. Note that the conditional entropy
is always less than or equal to the base entropy; that is,
Hsubscribed|X ≤ Hsubscribed for any attribute X. The conditional entropy is smaller
than the base entropy when the attribute and the outcome are
correlated. In the worst case, when the attribute is uncorrelated with
the outcome, the conditional entropy equals the base entropy.
Information Gain:
Look at the image below and think about which node can be described
most easily. I am sure your answer is C, because it requires less information
since all of its values are similar. On the other hand, B requires more
information to describe it, and A requires the maximum information.
In other words, we can say that C is a pure node, B is less impure, and
A is more impure.
Steps to calculate entropy for a split
1. Calculate the entropy of the parent node.
2. Calculate the entropy of each individual node of the split, and compute the
weighted average of all sub-nodes in the split.
Information gain
The information gain of an attribute A is defined as the difference
between the base entropy and the conditional entropy of the
attribute:

InfoGainA = HS − HS|A

In the bank marketing example, the information gain of the contact
attribute is

InfoGaincontact = Hsubscribed − Hsubscribed|contact = 0.4862 − 0.4661 = 0.0201.
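These numbers can be reproduced with a few lines of Python, using the probabilities given in the table above:

```python
from math import log2

def entropy(probs):
    """Shannon entropy of a discrete distribution given as probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Base entropy of the output variable `subscribed`.
base = entropy([0.1055, 0.8945])                       # ~0.4862

# Conditional entropy of `subscribed` given `contact`.
p_contact = {"cellular": 0.6435, "telephone": 0.0680, "unknown": 0.2885}
p_yes     = {"cellular": 0.1399, "telephone": 0.0809, "unknown": 0.0347}
conditional = sum(p_contact[v] * entropy([p_yes[v], 1 - p_yes[v]])
                  for v in p_contact)                  # ~0.4661

info_gain = base - conditional                         # ~0.0201
print(round(base, 4), round(conditional, 4), round(info_gain, 4))
```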
Calculating Information Gain of Input
Variables for the First Split
Attribute Information Gain
poutcome 0.0289
contact 0.0201
housing 0.0133
job 0.0101
education 0.0034
marital 0.0018
loan 0.0010
default 0.0005
Decision Tree Algorithms
ID3 Algorithm
C4.5
CART
Gini Index
The Gini index says that if we select two items from a population at random,
then they must be of the same class; the probability of this is 1 if the
population is pure.
It works with a categorical target variable ("Success" or "Failure").
It performs only binary splits.
The higher the value of Gini, the higher the homogeneity.
CART (Classification and Regression Trees) uses the Gini method to create
binary splits.
Steps to Calculate Gini for a split
1. Calculate the Gini score for each sub-node, using the sum of the squares of
the class probabilities (p² + q² for a two-class problem).
2. Calculate the Gini score for the split as the weighted average of the
sub-node Gini scores.
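A small sketch of these two steps is given below; in this formulation (as in the text above) a higher Gini score means a more homogeneous node, and the class counts used in the example are made up to echo the earlier cricket illustration:

```python
def gini_score(counts):
    """Gini score of a node: sum of squared class proportions (higher = more homogeneous)."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)

def gini_for_split(children):
    """Weighted Gini score of a split, given the class counts of each child node."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini_score(c) for c in children)

# Hypothetical Gender split of the 30 students: one node with 2 players and
# 8 non-players, another with 13 players and 7 non-players (illustrative only).
print(gini_for_split([(2, 8), (13, 7)]))
```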
What are the key parameters of tree modeling and
how can we avoid over-fitting in decision trees?
Overfitting is one of the key challenges faced while modeling decision
trees. If no limit is set on the size of a decision tree, it will give you 100%
accuracy on the training set because, in the worst case, it will end up
making one leaf for each observation. Thus, preventing overfitting
is pivotal while modeling a decision tree, and it can be done in two ways:
Setting constraints on tree size
Tree pruning
Let's discuss both of these briefly.
Setting Constraints on Tree Size
This can be done by using the various parameters that define a tree.
First, let's look at the general structure of a decision tree.
The parameters used for defining a tree are explained below. They are
described independently of any particular tool; it is important to
understand the role they play in tree modeling. These parameters are
available in both R and Python.
Minimum samples for a node split
Defines the minimum number of samples (or observations) which are required in a node to
be considered for splitting.
Used to control over-fitting. Higher values prevent the model from learning relations that
might be highly specific to the particular sample selected for a tree.
Values that are too high can lead to under-fitting; hence, this parameter should be tuned using CV.
Minimum samples for a terminal node (leaf)
Defines the minimum samples (or observations) required in a terminal node or leaf.
Used to control over-fitting similar to min_samples_split.
Generally, lower values should be chosen for imbalanced class problems, because the regions
in which the minority class is in the majority will be very small.
Maximum depth of tree (vertical depth)
The maximum depth of a tree.
Used to control over-fitting, as a higher depth will allow the model to learn relations very
specific to a particular sample.
Should be tuned using CV.
Maximum number of terminal nodes
The maximum number of terminal nodes or leaves in a tree.
Can be defined in place of max_depth. Since binary trees are created, a depth of n would
produce a maximum of 2^n leaves.
Maximum features to consider for split
The number of features to consider while searching for the best split. These are selected
randomly.
As a rule of thumb, the square root of the total number of features works well, but we should
check up to 30–40% of the total number of features.
Higher values can lead to over-fitting, but this depends on the case.
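In scikit-learn, for example, these constraints correspond to the following estimator parameters (the specific values shown are arbitrary illustrations, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the constraints described above.
tree = DecisionTreeClassifier(
    min_samples_split=20,   # minimum samples in a node to consider splitting it
    min_samples_leaf=5,     # minimum samples required in a terminal node (leaf)
    max_depth=6,            # maximum (vertical) depth of the tree
    max_leaf_nodes=32,      # maximum number of terminal nodes; alternative to max_depth
    max_features="sqrt",    # features considered when searching for the best split
    random_state=0,
)
```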
Tree Pruning
As discussed earlier, the technique of setting constraints is a greedy
approach. In other words, it checks for the best split
instantaneously and moves forward until one of the specified stopping
conditions is reached. Let's consider the following scenario when you're
driving:
There are 2 lanes:
A lane with cars moving at 80km/h
A lane with trucks moving at 30km/h
At this instant, you are the yellow car and you have 2 choices:
Take a left and overtake the other 2 cars quickly
Keep moving in the present lane

Let's analyze these choices. In the former, you'll immediately overtake the car ahead
and reach behind the truck, moving at 30 km/h, looking for an opportunity to
move back right. All the cars originally behind you move ahead in the meanwhile. This would
be the optimal choice if your objective is to maximize the distance covered in the next, say, 10
seconds. In the latter, you sail through at the same speed, cross the trucks and then
overtake, depending on the situation ahead. Greedy you!
This is exactly the difference between a normal decision tree and pruning. A decision tree
with constraints won't see the truck ahead and will adopt a greedy approach by taking a left.
On the other hand, if we use pruning, we in effect look a few steps ahead and make a
choice.
So we know pruning is better. But how do we implement it in a decision tree? The idea is
simple.
We first grow the decision tree to a large depth.
Then we start at the bottom and remove leaves that give us negative
returns when compared from the top.
Suppose a split gives us a gain of, say, −10 (a loss of 10) and the next split on that node
gives us a gain of 20. A simple decision tree will stop at step 1, but with pruning we will see
that the overall gain is +10 and keep both leaves.
Note that older versions of sklearn's decision tree classifier did not support pruning (recent
versions provide cost-complexity pruning). Advanced packages like xgboost have adopted
tree pruning in their implementation, and the rpart library in R provides a function to prune.
Trees vs. Linear Models
Which model is better?
If the relationship between the predictors and the response is linear, then
classical linear models such as linear regression will tend to outperform regression
trees.
On the other hand, if the relationship between the predictors and the response is
non-linear, then decision trees will tend to outperform the classical approaches.
Trees vs. Linear Model: Classification Example
Top row: The true decision boundary is linear.
Left: linear model (better)
Right: decision tree
Bottom row: The true decision boundary is non-linear.
Left: linear model
Right: decision tree (better)
Advantages and Disadvantages of Decision
Trees
Advantages:
Trees are very easy to explain to people (even easier than linear regression).
Trees can be plotted graphically, and hence can be easily communicated even
to a non-expert.
They work fine for both classification and regression problems.

Disadvantages:
Trees don't have the same level of predictive accuracy as some of the more flexible
approaches available in practice.
