
Chapter 6

Decision Trees
2
An Example
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
(Resulting decision tree: the root splits on Outlook. Outlook = overcast leads directly to class P; Outlook = sunny splits on Humidity (high -> N, normal -> P); Outlook = rain splits on Windy (true -> N, false -> P).)
3
Another Example - Grades
Percent >= 90%?
Yes: Grade = A
No: 89% >= Percent >= 80%?
Yes: Grade = B
No: 79% >= Percent >= 70%?
Yes: Grade = C
No: etc.
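The same tree reads as a cascade of simple rules. A minimal sketch in Python (the grade bands are taken from the slide; the function itself is illustrative):

def grade(percent):
    """Map a percentage score to a letter grade, mirroring the tree above."""
    if percent >= 90:
        return "A"
    elif percent >= 80:
        return "B"
    elif percent >= 70:
        return "C"
    else:
        return "etc."  # remaining grade bands continue the same pattern

print(grade(85))  # prints "B"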
4
Yet Another Example
1 of 2
5
Yet Another Example
2 of 2
English Rules (for example):
If tear production rate = reduced then recommendation = none.
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft.
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft.
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none.
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft.
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard.
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard.
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none.
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none.
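For illustration only, these rules can be encoded directly as a function. This is a sketch that checks the listed rules in order and falls back to "none" for any case the rules do not cover; the argument names are assumptions, not part of the original slide:

def recommend(age, prescription, astigmatic, tear_rate):
    """Return a lens recommendation by applying the English rules in order."""
    if tear_rate == "reduced":
        return "none"
    # From here on the tear production rate is assumed to be "normal".
    if astigmatic == "no":
        if age in ("young", "pre-presbyopic"):
            return "soft"
        if age == "presbyopic" and prescription == "myope":
            return "none"
        return "soft"  # hypermetrope, not astigmatic
    else:
        if prescription == "myope" or age == "young":
            return "hard"
        return "none"  # pre-presbyopic or presbyopic with hypermetrope

print(recommend("young", "myope", "no", "normal"))  # soft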
6
Decision Tree Template
Drawn top-to-bottom or left-to-right
Top (or left-most) node = Root Node
Descendent node(s) = Child Node(s)
Bottom (or right-most) node(s) = Leaf Node(s)
Unique path from root to each leaf = Rule
(Figure: a tree with a Root node at the top, Child nodes below it, and Leaf nodes at the bottom.)
7
Introduction
Decision Trees
Powerful/popular for classification & prediction
Represent rules
Rules can be expressed in English (see the sketch below):
IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
Rules can also be expressed in SQL for querying
Useful for exploring data to gain insight into how a large number of candidate input variables relate to a target (output) variable
You use mental decision trees often!
Game: "I'm thinking of ..." "Is it ...?"
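Returning to the rule above, a tiny illustration of "rules as code"; the field values come from the rule, but the function itself is hypothetical:

def life_insurance_promotion(age, sex, credit_card_insurance):
    """Apply the example rule: the promotion is predicted 'No' for this segment."""
    if age <= 43 and sex == "Male" and credit_card_insurance == "No":
        return "No"
    return "Unknown"  # the single rule says nothing about other segments

print(life_insurance_promotion(35, "Male", "No"))  # No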
8
Decision Tree: What is it?
A structure that can be used to divide up a large
collection of records into successively smaller
sets of records by applying a sequence of simple
decision rules
A decision tree model consists of a set of rules
for dividing a large heterogeneous population
into smaller, more homogeneous groups with
respect to a particular target variable
9
Decision Tree Types
Binary trees: only two choices in each split. Can be non-uniform (uneven) in depth.
N-way trees (e.g., ternary trees): three or more choices in at least one split (3-way, 4-way, etc.)
10
A binary decision tree
classification example
Classifies potential catalog recipients as likely (1) or unlikely (0) to place an order if sent a new catalog.
Each node is labeled with a node number in the upper-right corner and the predicted class in the center. The decision rules used to split each node are printed on the lines connecting each node to its children.
Any record that reaches leaf node 19, 14, 16, 17, or 18 is classified as likely to respond, because the predicted class in this case is 1. The paths to these leaf nodes describe the rules in the tree. For example, the rule for leaf 19 is: "If the customer has made more than 6.5 orders and it has been fewer than 765 days since the last order, the customer is likely to respond."
11
Scoring example
Often it is useful to show the proportion of
the data in each of the desired classes
Clarify Fig 6.2
12
Scoring example
13
A ternary decision tree example
14
Decision Tree Splits (Growth)
The best split at root or child nodes is defined as
one that does the best job of separating the data
into groups where a single class predominates
in each group
Example: US population data; candidate categorical input variables/attributes include:
Zip code
Gender
Age
Split the data according to the best-split rule above
15
Split Criteria
The best split is defined as one that does
the best job of separating the data into
groups where a single class predominates
in each group
The measure used to evaluate a potential split is purity
The best split is one that increases purity of
the sub-sets by the greatest amount
A good split also creates nodes of similar size
or at least does not create very small nodes
16
Example: Good & Poor Splits
The final split is a good one because:
- it leads to children of roughly the same size, and
- the children have much higher purity than the parent.
17
Tests for Choosing Best Split
Purity (Diversity) Measures:
For categorical targets:
Gini (population diversity)
Entropy (information gain)
Information Gain Ratio
Chi-square Test
For numeric targets:
Reduction in variance
F test
18
Gini (Population Diversity)
The Gini measure of a node is the sum of
the squares of the proportions of the
classes.
Root Node: 0.5^2 + 0.5^2 = 0.5 (even balance)
Leaf Nodes: 0.1^2 + 0.9^2 = 0.82 (close to pure)
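A minimal sketch of this measure in Python. Note that the slide's Gini measure is the sum of squared class proportions, so higher means purer; many other texts report the complementary Gini impurity, which is 1 minus this sum:

def gini_measure(counts):
    """Sum of squared class proportions in a node: 1.0 = pure, lower = more mixed."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

print(gini_measure([10, 10]))  # root node, 50/50 split -> 0.5
print(gini_measure([1, 9]))    # leaf node, 10/90 split -> about 0.82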
19
Entropy (Information Gain)
Entropy = a measure of how disorganized a system is.
The entropy of a node is the sum, over all the classes represented in the node, of the proportion of records belonging to a particular class multiplied by the base-two logarithm of that proportion (usually multiplied by -1 in order to obtain a positive number).
The entropy of a split is the sum of the entropies of all the nodes resulting from the split, weighted by each node's proportion of the records.
Parent node: -1 * (P(dark) * log2(P(dark)) + P(light) * log2(P(light)))
           = -1 * (0.5 * log2(0.5) + 0.5 * log2(0.5)) = 1
Each child node: -1 * (0.1 * log2(0.1) + 0.9 * log2(0.9)) = 0.33 + 0.14 = 0.47
Weighted average entropy of the split = 0.47
Total entropy reduction / information gain for the split = 1 - 0.47 = 0.53
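A minimal sketch reproducing the arithmetic above (it assumes the split sends half of the records to each child, which is what the 0.5/0.5 weights imply):

from math import log2

def entropy(counts):
    """Entropy of a node from its class counts, in bits."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent = entropy([10, 10])                     # 1.0
children = [entropy([1, 9]), entropy([9, 1])]  # about 0.47 each
split_entropy = 0.5 * children[0] + 0.5 * children[1]
information_gain = parent - split_entropy      # about 0.53
print(parent, split_entropy, information_gain)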
20
Information Gain Ratio
Problems with entropy:
By breaking the larger data set into many small
subsets, the number of classes represented in each
node tends to go down, and with it, the entropy
Decision trees tend to be quite bushy
In reaction to this problem, C5 and other descendants
of ID3 that once used information gain now use the
ratio of the total information gain due to a proposed
split to the intrinsic information attributable solely to
the number of branches created as the criterion for
evaluating proposed splits.
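A sketch of the gain ratio idea, following the standard C4.5/C5-style definition: information gain divided by the split's intrinsic information, i.e. the entropy of the branch sizes themselves. The function names and numbers are my own:

from math import log2

def intrinsic_information(branch_sizes):
    """Entropy of the branch sizes alone: many small branches -> high value."""
    total = sum(branch_sizes)
    return -sum((n / total) * log2(n / total) for n in branch_sizes if n > 0)

def gain_ratio(information_gain, branch_sizes):
    """Penalize bushy splits by dividing the gain by the intrinsic information."""
    return information_gain / intrinsic_information(branch_sizes)

# A two-way 10/10 split with gain 0.53 versus a bushy 20-way split with the same gain:
print(gain_ratio(0.53, [10, 10]))  # 0.53 / 1.0 = 0.53
print(gain_ratio(0.53, [1] * 20))  # 0.53 / ~4.32, a much lower ratio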
21
Chi-square
The chi-square test gives its name to CHAID, a well-known decision tree algorithm (Chi-square Automatic Interaction Detector).
CHAID makes use of the Chi-square test in
several ways:
to merge classes that do not have significantly
different effects on the target variable;
to choose a best split;
to decide whether it is worth performing any
additional splits on a node.
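As an illustration of testing one candidate split, a sketch using scipy's chi2_contingency; the counts are invented purely for the example:

from scipy.stats import chi2_contingency

# Rows = branches of a candidate split, columns = target classes (invented counts).
observed = [[90, 10],   # left child: 90 class P, 10 class N
            [40, 60]]   # right child: 40 class P, 60 class N

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # a small p-value suggests the split really separates the classes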
22
Reduction in Variance
The mean value in the parent node is 0.5 (target coded 0/1).
Every one of the 20 observations differs from the mean by 0.5, so the variance is (20 * 0.5^2) / 20 = 0.25.
After the split, the left child has 9 dark spots and one light spot, so the node mean is 0.9.
Nine of the observations differ from the mean value by 0.1 and one observation differs from the mean value by 0.9, so the variance is (0.9^2 + 9 * 0.1^2) / 10 = 0.09.
Since both nodes resulting from the split have variance 0.09, the total variance after the split is also 0.09.
The reduction in variance due to the split is 0.25 - 0.09 = 0.16.
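The same arithmetic as a small sketch (targets coded 0 for light and 1 for dark, as in the example):

def variance(values):
    """Population variance of a list of 0/1 target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

parent = [0] * 10 + [1] * 10   # 20 records, mean 0.5
left   = [1] * 9 + [0]         # 9 dark, 1 light, mean 0.9
right  = [0] * 9 + [1]         # mirror image of the left child

var_after = 0.5 * variance(left) + 0.5 * variance(right)
print(variance(parent) - var_after)  # reduction in variance: 0.25 - 0.09 = 0.16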
23
F Test
The F score is calculated by dividing the between-sample estimate of the variance by the pooled (within-sample) estimate of the variance.
The larger the score, the less likely it is that the
samples are all randomly drawn from the same
population.
In the decision tree context, a large F-score
indicates that a proposed split has successfully
split the population into subpopulations with
significantly different distributions.
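As an illustration only, scipy's f_oneway computes this one-way F statistic for the target values falling into each child of a proposed split; the 0/1 samples here are invented:

from scipy.stats import f_oneway

# Target values (0/1) of the records landing in each child of a proposed split (invented).
left_child  = [1] * 9 + [0] * 1
right_child = [0] * 9 + [1] * 1

f_stat, p_value = f_oneway(left_child, right_child)
print(f_stat, p_value)  # a large F (small p) suggests genuinely different subpopulations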
24
Pruning
The decision tree keeps growing as long as new
splits can be found that improve the ability of the
tree to separate the records of the training set
into increasingly pure subsets.
Such a tree has been optimized for the training set,
so eliminating any leaves would only increase the
error rate of the tree on the training set.
Does this imply that the full tree will also do the
best job of classifying new datasets?
25
Pruning
A decision tree algorithm makes its best split first, at
the root node where there is a large population of
records.
As the nodes get smaller, idiosyncrasies of the
particular training records at a node come to dominate
the process.
= the tree finds general patterns at the big nodes and
patterns specific to the training set in the smaller
nodes;
= the tree over-fits the training set.
The result is an unstable tree that will not make good
predictions.
26
Pruning
The cure is to eliminate the unstable splits
by merging smaller leaves through a
process called pruning.
Three general approaches to pruning:
CART
C5
Stability-based
27
CART Pruning
The goal is to prune first those branches providing the least additional
predictive power per leaf.
In order to identify these least useful branches, CART relies on a
concept called the adjusted error rate.
This is a measure that increases each node's misclassification rate on the training set by imposing a complexity penalty based on the number of leaves in the tree.
The adjusted error rate is used to identify weak branches (those
whose misclassification rate is not low enough to overcome the
penalty) and mark them for pruning.
Creating the Candidate Subtrees
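A sketch of the adjusted-error-rate idea described above, under the assumptions that the error is the training-set misclassification rate and alpha is the complexity penalty per leaf; the numbers and the function are illustrative, not CART's exact implementation:

def adjusted_error_rate(training_error, n_leaves, alpha):
    """Training error plus a complexity penalty that grows with the number of leaves."""
    return training_error + alpha * n_leaves

# Candidate subtrees: (training error, number of leaves), invented for illustration.
candidates = [(0.05, 40), (0.08, 15), (0.12, 5)]
alpha = 0.003
for err, leaves in candidates:
    print(leaves, adjusted_error_rate(err, leaves, alpha))
# With this alpha the 15-leaf subtree scores best: the 40-leaf tree's extra
# leaves do not buy enough extra accuracy to overcome the penalty.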
28
CART Pruning
Picking the Best Subtree
Each of the candidate subtrees is used to classify the records in the
validation set.
The tree that performs this task with the lowest overall error rate is
declared the winner.
The winning subtree has been pruned sufficiently to remove the effects
of overtraining, but not so much as to lose valuable information.
29
CART Pruning
Do not evaluate the performance of a model
by its lift or error rate on the validation set.
Like the training set, it has had a hand in creating the model and so will overstate the model's accuracy.
Always measure the model's accuracy on a test set that is drawn from the same population as the training and validation sets, but has not been used in any way to create the model.
Using the Test Set to Evaluate the Final Tree
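A minimal sketch of carving out train, validation, and test sets with scikit-learn; the 60/20/20 proportions and variable names are my own choice:

from sklearn.model_selection import train_test_split

# X, y stand in for any feature matrix and target vector.
X = list(range(100)); y = [v % 2 for v in X]

# First hold out 20% as the test set, then split the rest 75/25 into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

# Grow the tree on the training set, prune/select a subtree on the validation set,
# and report final accuracy only on the test set.
print(len(X_train), len(X_val), len(X_test))  # 60 20 20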
30
The C5 Pruning Algorithm
Like CART, the C5 algorithm first grows an
overfit tree and then prunes it back to
create a more stable model.
The pruning strategy is different:
C5 does not make use of a validation set to
choose from among candidate subtrees; the
same data used to grow the tree is also used
to decide how the tree should be pruned.
C5 prunes the tree by examining the error
rate at each node and assuming that the true
error rate is actually substantially worse.
31
Stability-Based Pruning
CART and C5 fail to prune some nodes that are clearly
unstable.
One of the main purposes of a model is to make
consistent predictions on previously unseen records.
Any rule that cannot achieve that goal should be
eliminated from the model.
32
Stability-Based Pruning
Small nodes cause big problems.
A common cause of unstable decision tree
models is allowing nodes with too few
records.
Most decision tree tools allow the user to
set a minimum node size.
As a rule of thumb, nodes that receive fewer
than about 100 training set records are likely
to be unstable.
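Most tools expose this limit directly; as a sketch, scikit-learn's min_samples_leaf parameter enforces it on synthetic data (RapidMiner's Decision Tree operator offers a comparable minimal leaf size setting):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Refuse to create leaves with fewer than 100 training records (the rule of thumb above).
tree = DecisionTreeClassifier(min_samples_leaf=100, random_state=0)
tree.fit(X, y)
print(tree.get_n_leaves())  # far fewer leaves than an unconstrained tree would have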
33
Extracting Rules from Trees
Example rule read off a tree: IF watch the game AND out with friends THEN beer.
34
Further Refinements
Using More Than One Field at a Time
35
Further Refinements
Tilting the Hyperplane
36
Alternate Representations for Decision Trees
Box Diagrams
Shading is proportional to the purity of the box; size is proportional to the number of records that land there.
37
Alternate Representations for Decision Trees
Tree Ring Diagrams
The circle at the center of the diagram represents the root node, before any splits have been made.
Moving out from the center, each concentric ring represents a new level in the tree.
The ring closest to the center represents the root node split.
The arc length is proportional to the number of records taking each of the two paths, and the shading represents the node's purity.
38
Decision Tree Advantages
1. Easy to understand
2. Map nicely to a set of business rules
3. Applied to real problems
4. Make no prior assumptions about the data
5. Able to process both numerical and
categorical data
39
Decision Tree Disadvantages
1. Output attribute must be categorical
2. Limited to one output attribute
3. Decision tree algorithms are unstable
4. Trees created from numeric datasets can
be complex
40
RapidMiner Practice
To watch:
Training Videos\01 - Ralf Klinkenberg RapidMinerResources\5 - Modelling - Classification -1- Decision trees - Basic.mp4
To practice:
Do the exercises presented in the video using the files Iris.ioo and Sonar.ioo.
41
RapidMiner Practice
To watch:
Training Videos\02 - Tom Ott - Neural Market Trends\06 - Creating a Decision Tree with Rapidminer 5.0
To practice:
Do the exercises presented in the video.
42
RapidMiner Practice
Data Preprocessing
GermanCredit.xls -> GermanCredit.ioo
Process design
Take a look at the .ioo file and its attributes / variables
Use a decision tree to predict the response (credit rating)
Try different criteria and options
Validate your model
Use the Validation operator
Inside it, put the Decision Tree learner (left side) and Apply Model and Performance (right side)
Read and interpret the results
