Introduction
We now discuss tree-based methods.
Note that our main goal is to predict a target variable based on
several input variables.
Decision trees can be applied to both regression and classification
problems.
Thus there are two main types:
1. Classification trees, used when the predicted outcome is a categorical variable.
2. Regression trees, used when the predicted outcome is a quantitative variable.
The term Classification And Regression Tree (CART) analysis is a popular umbrella term used to refer to both of the above methods.
Regression Tree
Dataset: Baseball Players Salaries
Major League Baseball data from two seasons.
A data frame with 322 observations of major league players on 20
variables.
Goal: To predict Salary based on a number of predictors, such as
various performance indicators, number of years, etc.
Dataset: Baseball Players Salaries
For the time being, we will consider the following three variables:
1. Salary (in thousands of dollars)
2. Years (number of years the player has played in the major leagues)
3. Hits (number of hits the player made in the previous year)
Goal: To predict Salary based on Years and Hits.
In order to reduce the skewness, we first log-transform Salary so that
it has more of a typical bell-shape.
Baseball Players Salaries
The General View
Consider two predictors, X1 and X2.
The predictor space is segmented into five distinct regions.
Depending upon which region our
observation comes from, we would make
one of five possible predictions for Y.
We typically use the mean of the training
observations belonging to a particular
region as the predicted value for the
region.
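As a minimal sketch of this prediction rule, using one predictor, a single hypothetical split point, and made-up data:

```python
import numpy as np

# Hypothetical training data: one predictor X and a response y.
X = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([10.0, 12.0, 11.0, 30.0, 28.0, 32.0])

# A single split at X = 4 creates two regions; each region
# predicts the mean response of its training observations.
split = 4.0
left_mean = y[X < split].mean()    # mean of the left region
right_mean = y[X >= split].mean()  # mean of the right region

def predict(x):
    """Predict the region mean for a new observation x."""
    return left_mean if x < split else right_mean
```

With more regions, the same idea applies: each region stores the mean of its own training observations.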
The General View
Typically we create the
partitions by iteratively
splitting one of the
variables into two regions.
The General View
1. First split on X1 = t1.
2. If X1 ≤ t1, split on X2 = t2.
3. If X1 > t1, split on X1 = t3.
4. If X1 > t3, split on X2 = t4.
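This sequence of splits can be sketched as a nested prediction rule. The thresholds t1..t4 and the region means below are hypothetical placeholders, since the fitted values are not given here:

```python
# Hypothetical thresholds for the four splits and mean responses
# for the five resulting regions R1..R5.
t1, t2, t3, t4 = 4.0, 3.0, 7.0, 5.0
means = {"R1": 10.0, "R2": 20.0, "R3": 15.0, "R4": 25.0, "R5": 30.0}

def region(x1, x2):
    """Route an observation down the splits to its region."""
    if x1 <= t1:                             # split 1 on X1 = t1
        return "R1" if x2 <= t2 else "R2"    # split 2 on X2 = t2
    if x1 <= t3:                             # split 3 on X1 = t3
        return "R3"
    return "R4" if x2 <= t4 else "R5"        # split 4 on X2 = t4

def predict(x1, x2):
    return means[region(x1, x2)]
```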
The General View
RMSE=408.29
Classification Tree
Example:
Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. We now want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate the students who play cricket in their leisure time based on the most significant input variable among the three.
This is where a decision tree helps: it segregates the students based on all values of the three variables and identifies the variable that creates the best homogeneous sets of students (sets which are heterogeneous to each other). For these data, the variable Gender identifies the best homogeneous sets compared to the other two variables.
As mentioned above, a decision tree identifies the most significant variable and the value of that variable that gives the best homogeneous sets of the population. The question that now arises is: how does it identify the variable and the split? To do this, decision trees use various algorithms, which we will discuss later.
Important Terminology related to Decision
Trees
Root Node: Represents the entire population or sample; it gets divided further into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: A sub-node that splits into further sub-nodes is called a decision node.
Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: Removing sub-nodes of a decision node is called pruning; it is the opposite of splitting.
Branch/Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
These are the terms commonly used for decision trees. As every algorithm has advantages and disadvantages, the important factors one should know are discussed later.
Carseats Data Set
A data set containing sales of child car seats at 400 different stores.
A data set with 400 observations on the following 11 variables.
The variables are as follows:
1. Sales: Unit sales (in thousands) at each location.
2. CompPrice: Price charged by competitor at each location
3. Income: Community income level (in thousands of dollars)
4. Advertising: Local advertising budget for company at each location
(in thousands of dollars)
5. Population: Population size in region (in thousands)
6. Price: Price company charges for car seats at each site
7. ShelveLoc: A factor with levels Bad, Medium and Good
indicating the quality of the shelving location.
8. Age: Average Age of the local population
9. Education: Education level at each location
10. Urban: A factor with levels No and Yes to indicate whether the
store is in an urban or rural location
11. US: A factor with levels No and Yes to indicate whether the
store is in the US or not.
Carseats Data Set
We now recode Sales as a binary variable.
We create a dummy variable High, which takes the value Yes if Sales exceeds 8 (i.e., 8,000 units) and No otherwise.
We will model High with the help of the ten remaining predictors.
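In pandas, this recoding is a one-liner; a sketch with made-up Sales values (the real Carseats data is an R dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with a numeric Sales column
# (unit sales in thousands), as in Carseats.
df = pd.DataFrame({"Sales": [9.5, 4.2, 11.3, 7.8]})

# High takes the value "Yes" if sales exceed 8 (thousand units).
df["High"] = np.where(df["Sales"] > 8, "Yes", "No")
```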
Classification Tree
A classification tree is very similar to a regression tree, except that we try to predict a categorical response rather than a continuous one.
In a regression tree, the predicted response for an observation is
given by the average response of the training observations that
belong to the same terminal node.
In a classification tree, we predict that each observation belongs to
the most commonly occurring class of the training observations in the
region to which it belongs.
Classification Tree
The tree is grown in exactly the same manner as with a regression
tree
However, for a classification tree, minimizing the MSE no longer makes sense.
A natural alternative is the classification error rate.
The classification error rate is simply the fraction of the training
observations in that region that do not belong to the most common
class.
There are several other criteria available as well, such as the Gini index and cross-entropy.
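All three criteria can be computed for a node directly from its class counts. A small sketch (the node counts here are made up):

```python
import numpy as np

def node_impurity(counts):
    """Error rate, Gini index, and cross-entropy for a node
    with the given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                          # class proportions
    error = 1.0 - p.max()                    # classification error rate
    gini = np.sum(p * (1.0 - p))             # Gini index
    p_nz = p[p > 0]
    entropy = -np.sum(p_nz * np.log2(p_nz))  # cross-entropy (bits)
    return error, gini, entropy

# e.g. a hypothetical node with 75 "No" and 25 "Yes" observations
err, gini, ent = node_impurity([75, 25])
```

All three measures are smallest when the node is pure, i.e. dominated by a single class.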
Carseats Data Set
We split the observations into a training data set and a test data set.
Both the training set and the test set contain 200 observations.
We next build a tree using the training set, and then evaluate its
performance based on the test data.
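A sketch of this workflow in scikit-learn, using synthetic data in place of the Carseats data (which is not bundled with scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 400 observations, 10 predictors, binary target.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 200 training and 200 test observations, as in the example.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=200, test_size=200, random_state=0)

# Grow an unpruned tree on the training set, evaluate on the test set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
```

An unpruned tree typically fits the training data (almost) perfectly, so the test accuracy is the honest measure of performance.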
Carseats Data Set: Unpruned Tree
Confusion Matrix based on Test Data
                          True High status
Predicted High status     No     Yes    Total
  No                      88      28     116
  Yes                     28      56      84
  Total                  116      84     200

Sensitivity = 56/84 = 66.67%
Specificity = 88/116 = 75.86%
Misclassification rate = 56/200 = 28%
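Recomputing the reported rates from the matrix:

```python
# Cell counts from the test confusion matrix above.
TN, FP = 88, 28   # true "No":  predicted No / predicted Yes
FN, TP = 28, 56   # true "Yes": predicted No / predicted Yes

sensitivity = TP / (TP + FN)                   # 56/84
specificity = TN / (TN + FP)                   # 88/116
error_rate = (FP + FN) / (TN + FP + FN + TP)   # 56/200
```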
Cross Validation
We now consider whether pruning the tree leads to better performance.
We choose the optimal level of tree complexity using cross-validation.
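One way to carry this out in scikit-learn is cost-complexity pruning: compute the pruning path and score each candidate complexity parameter by cross-validation. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate complexity parameters along the pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# 5-fold cross-validated accuracy for each candidate alpha.
cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                    X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
```

Refitting with `ccp_alpha=best_alpha` then yields the cross-validation-selected pruned tree.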
Based on cross-validation, we select a tree with 9 terminal nodes.
Pruned Tree
Confusion Matrix based on Test Data for
Pruned Tree
                          True High status
Predicted High status     No     Yes    Total
  No                      94      24     118
  Yes                     22      60      82
  Total                  116      84     200

Sensitivity = 60/84 = 71.43%
Specificity = 94/116 = 81.03%
Misclassification rate = 46/200 = 23%
Example: Portuguese Bank data
summary(data) (selected levels and counts):

job:        blue-collar: 435, retired: 92, (Other): 308
marital:    divorced: 228
education:  primary: 335
default:    no: 1961
housing:    no: 916, yes: 1084
loan:       no: 1717, yes: 283
contact:    cellular: 1287, telephone: 136, unknown: 577
month:      may: 581, jul: 340, aug: 278, jun: 232, nov: 183, apr: 118, (Other): 268
poutcome:   failure: 210, other: 79, success: 58, unknown: 1653
subscribed: no: 1789, yes: 211
Attribute job includes the following values.
Explanation of figure
At each split, the decision tree algorithm picks the most informative attribute out of the remaining attributes. The extent to which an attribute is informative is determined by measures such as entropy and information gain.
At the first split, the decision tree algorithm chooses the poutcome
attribute. There are two nodes at depth=1. The left node is a leaf node
representing a group for which the outcome of the previous marketing
campaign contact is a failure, other, or unknown. For this group, 1,763 out
of 1,942 clients have not subscribed to the term deposit.
The right node represents the rest of the population, for which the
outcome of the previous marketing campaign contact is a success. For the
population of this node, 32 out of 58 clients have subscribed to the term
deposit.
Explanation of figure
This node further splits into two nodes based on the education level.
If the education level is either secondary or tertiary, then 26 out of 50
of the clients have not subscribed to the term deposit. If the
education level is primary or unknown, then 8 out of 8 times the
clients have subscribed.
The left node at depth 2 further splits based on the attribute job. If the
occupation is admin, blue collar, management, retired, services, or
technician, then 26 out of 45 clients have not subscribed. If the
occupation is self-employed, student, or unemployed, then 5 out of 5
times the clients have subscribed.
The General Algorithm
Information gain for the contact attribute = 0.4862 - 0.4661 = 0.0201
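The first term, 0.4862, is the base entropy of the target, and it can be recomputed from the subscribed counts above (1,789 "no" vs. 211 "yes"); the second term is the conditional entropy of subscribed given the splitting attribute, whose per-level counts are not shown in this extract, so only the general formula is sketched:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(parent_counts, children_counts):
    """Entropy of the parent minus the size-weighted entropy
    of the child nodes."""
    n = sum(sum(c) for c in children_counts)
    conditional = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - conditional

# Base entropy of subscribed: 1,789 "no" vs 211 "yes".
base = entropy([1789, 211])
```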
Calculating Information Gain of Input
Variables for the First Split
Attribute Information Gain
poutcome 0.0289
contact 0.0201
housing 0.0133
job 0.0101
education 0.0034
marital 0.0018
loan 0.0010
default 0.0005
Decision Tree Algorithms
ID3 Algorithm
C4.5
CART
Gini Index
The Gini index is based on the following idea: if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure.
It works with a categorical target variable (Success or Failure).
It performs only binary splits.
The higher the value of the Gini measure, the higher the homogeneity.
CART (Classification and Regression Trees) uses the Gini method to create binary splits.
Steps to Calculate Gini for a split
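In the convention above (higher score means more homogeneous), the Gini score of a node with success probability p is p^2 + q^2, and the score of a split is the size-weighted average over its sub-nodes. A sketch, using hypothetical numbers consistent with the 30-student cricket example (a group of 10 with a 20% play rate vs. a group of 20 with a 65% play rate):

```python
def gini_node(p_success):
    """Gini score of a node: p^2 + q^2 (1.0 for a pure node)."""
    q = 1.0 - p_success
    return p_success**2 + q**2

def gini_split(sizes, success_rates):
    """Size-weighted Gini score of a split into sub-nodes."""
    n = sum(sizes)
    return sum(s / n * gini_node(p) for s, p in zip(sizes, success_rates))

# Hypothetical split of 30 students into groups of 10 and 20,
# with 20% and 65% of each group playing cricket.
score = gini_split([10, 20], [0.2, 0.65])
```

To choose a split, we would compute this score for each candidate variable and keep the one with the highest weighted score.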
What are the key parameters of tree modeling and
how can we avoid over-fitting in decision trees?
Overfitting is one of the key challenges faced while modeling decision trees. If no limit is set on the size of a decision tree, it will give you 100% accuracy on the training set, because in the worst case it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in two ways:
Setting constraints on tree size
Tree pruning
Let's discuss both of these briefly.
Setting Constraints on Tree Size
This can be done by using various parameters that define a tree.
The parameters used for defining a tree are explained below. They are independent of the tool being used, and it is important to understand their role in tree modeling. These parameters are available in both R and Python.
Minimum samples for a node split
Defines the minimum number of samples (or observations) which are required in a node to
be considered for splitting.
Used to control over-fitting. Higher values prevent a model from learning relations which
might be highly specific to the particular sample selected for a tree.
Values that are too high can lead to under-fitting; hence, this parameter should be tuned using cross-validation (CV).
Minimum samples for a terminal node (leaf)
Defines the minimum samples (or observations) required in a terminal node or leaf.
Used to control over-fitting similar to min_samples_split.
Generally, lower values should be chosen for imbalanced-class problems, because the regions in which the minority class is in the majority will be very small.
Maximum depth of tree (vertical depth)
The maximum depth of a tree.
Used to control over-fitting, as higher depth allows the model to learn relations very specific to a particular sample.
Should be tuned using CV.
Maximum number of terminal nodes
The maximum number of terminal nodes or leaves in a tree.
Can be defined in place of max_depth. Since binary trees are created, a depth of n would
produce a maximum of 2^n leaves.
Maximum features to consider for split
The number of features to consider while searching for a best split. These will be randomly
selected.
As a rule of thumb, the square root of the total number of features works well, but values up to 30-40% of the total number of features are worth checking.
Higher values can lead to over-fitting, but this depends on the case.
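In scikit-learn, the constraints above correspond to constructor parameters of DecisionTreeClassifier; the values below are illustrative, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_split=20,   # minimum samples for a node split
    min_samples_leaf=5,     # minimum samples for a terminal node (leaf)
    max_depth=6,            # maximum depth of tree
    max_leaf_nodes=32,      # maximum number of terminal nodes
    max_features="sqrt",    # features to consider for each split
    random_state=0,
)

# Fit on synthetic data to see the constraints take effect.
X, y = make_classification(n_samples=500, n_features=16, random_state=0)
tree.fit(X, y)
```

After fitting, the grown tree respects every constraint, e.g. its depth never exceeds max_depth.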
Tree Pruning
As discussed earlier, the technique of setting constraints is a greedy approach. In other words, it checks for the best split instantaneously and moves forward until one of the specified stopping conditions is reached. Let's consider the following case when you're driving:
There are 2 lanes:
A lane with cars moving at 80km/h
A lane with trucks moving at 30km/h
At this instant, you are the yellow car and you have 2 choices:
Take a left and overtake the other 2 cars quickly
Keep moving in the present lane
Let's analyze these choices. With the former choice, you'll immediately overtake the car ahead, end up behind the truck, and start moving at 30 km/h, looking for an opportunity to move back right. All the cars originally behind you move ahead in the meanwhile. This would be the optimum choice if your objective is to maximize the distance covered in, say, the next 10 seconds. With the latter choice, you sail through at the same speed, cross the trucks, and then overtake depending on the situation ahead.
This is exactly the difference between a normal decision tree and pruning. A decision tree grown with constraints won't see the truck ahead and will adopt the greedy approach of taking a left. On the other hand, if we use pruning, we in effect look a few steps ahead and make a choice.
So we know pruning is better. But how do we implement it in a decision tree? The idea is simple:
We first grow the decision tree to a large depth.
Then we start at the bottom and remove leaves that give us negative returns when compared from the top.
Suppose a split gives us a gain of, say, -10 (a loss of 10) and the next split on that node gives a gain of 20. A simple decision tree would stop at step 1, but with pruning we see that the overall gain is +10 and keep both splits.
Note that scikit-learn's decision tree classifier supports post-pruning via cost-complexity pruning (the ccp_alpha parameter, added in version 0.22). Advanced packages like xgboost have adopted tree pruning in their implementations, and the rpart library in R provides a prune function.
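Scikit-learn's cost-complexity pruning (ccp_alpha) implements exactly this grow-then-prune strategy; a sketch on synthetic data, with an arbitrary ccp_alpha value:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Grow a full tree, then refit with cost-complexity pruning enabled.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)
# Subtrees whose impurity improvement does not justify their
# complexity (as weighted by ccp_alpha) are collapsed away,
# so the pruned tree has at most as many leaves as the full tree.
```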
Trees vs. Linear Models
Which model is better?
If the relationship between the predictors and the response is linear, then classical linear models such as linear regression will tend to outperform regression trees.
On the other hand, if the relationship between the predictors and the response is non-linear, then decision trees will tend to outperform classical approaches.
Trees vs. Linear Model: Classification Example
Top row: the true decision boundary is linear.
Left: linear model (better). Right: decision tree.
Disadvantages:
Trees generally do not have the same predictive accuracy as some of the more flexible approaches available in practice.