CS 6823 Data Mining: Classification Decision Tree

CS 6823
Data Mining
Lecture 03
Classification
Decision Tree
Sources:
Slides from the text book: Introduction to Data Mining 1/e, Tan, Steinbach, Kumar
Data Mining
Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4
Introduction to Data Mining by Tan, Steinbach, Kumar
4/18/2004
Classification: Definition
Given a collection of records (the training set):
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
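The division into training and test sets can be sketched as follows. This is only an illustration: the function names, the 30% test fraction, and the fixed seed are assumptions, not something the slides prescribe.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical toy records: (attributes, class label).
records = [({"x": i}, i % 2) for i in range(10)]
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is built only from `train`; `accuracy` is then measured on `test`, which the model has not seen.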
Illustrating Classification Task
Examples of Classification Task

Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Example of a Decision Tree
Training Data (attribute types: Refund is categorical, Marital Status is categorical, Taxable Income is continuous, Cheat is the class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes: NO
  No: MarSt?
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES
    Married: NO
Another Example of Decision Tree

Training Data: the same ten records (Tid 1 to 10) as in the previous example.

MarSt?
  Married: NO
  Single, Divorced: Refund?
    Yes: NO
    No: TaxInc?
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
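As a sanity check on the claim that more than one tree fits the same data, the two example trees can be written as plain functions and run over the ten training records. This is a minimal sketch: the function names are made up for illustration, and sending a record with exactly 80K to the upper branch is an assumption, since the slides label the branches only as < 80K and > 80K.

```python
# Training records from the slides:
# (Tid, Refund, Marital Status, Taxable Income in K, Cheat)
TRAINING_DATA = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def tree_refund_first(refund, marital_status, income_k):
    """First example: Refund at the root, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumption: exactly 80K is routed to the upper branch.
    return "No" if income_k < 80 else "Yes"

def tree_marst_first(refund, marital_status, income_k):
    """Second example: MarSt at the root, then Refund, then TaxInc."""
    if marital_status == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "No" if income_k < 80 else "Yes"

def training_errors(tree):
    """Tids of training records that the tree misclassifies."""
    return [tid for tid, refund, marst, income, cheat in TRAINING_DATA
            if tree(refund, marst, income) != cheat]

print(training_errors(tree_refund_first), training_errors(tree_marst_first))
```

Both lists come out empty, so both trees reproduce every training label even though they test the attributes in a different order.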
Decision Tree Classification Task
[Figure; surviving label: Decision Tree]
Apply Model to Test Data

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             (to be predicted)

Start from the root of the tree and, at each node, follow the branch that matches the record.

Refund?
  Yes: NO
  No: MarSt?                     <- Refund = No
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES
    Married: NO                  <- Marital Status = Married

Assign Cheat to "No".
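The walk from the root to a leaf can be sketched with the tree stored as nested tuples. The representation (attribute name, routing function, children) and all names below are assumptions chosen for illustration; the upper income branch is written as ">= 80K" so that every value has a branch, which goes slightly beyond the "> 80K" label on the slide.

```python
def same_value(value):
    """Categorical test: the branch key is the attribute value itself."""
    return value

def income_branch(income_k):
    """Continuous test on Taxable Income (in K)."""
    return "< 80K" if income_k < 80 else ">= 80K"

# Leaves are plain strings; internal nodes are (attribute, route, children).
TAXINC = ("Taxable Income", income_branch, {"< 80K": "NO", ">= 80K": "YES"})
MARST = ("Marital Status", same_value,
         {"Single": TAXINC, "Divorced": TAXINC, "Married": "NO"})
ROOT = ("Refund", same_value, {"Yes": "NO", "No": MARST})

def apply_model(node, record):
    """Follow matching branches from the root until a leaf is reached."""
    path = []
    while not isinstance(node, str):
        attribute, route, children = node
        branch = route(record[attribute])
        path.append((attribute, branch))
        node = children[branch]
    return node, path

test_record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
label, path = apply_model(ROOT, test_record)
print(label, path)
```

For the test record above, the path visits Refund and Marital Status only; the Taxable Income node is never reached because the Married branch ends in a leaf.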
Decision Tree Induction

Many algorithms:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
Any idea?
[Figure: a tree grown in stages on the training data, splitting first on Refund, then on Marital Status (Single, Divorced vs. Married), then on Taxable Income (< 80K vs. >= 80K), with leaves labeled Don't Cheat or Cheat]
General Structure of Hunt's Algorithm
Let D_t be the set of training records that reach a node t.
General procedure:
If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
If D_t is an empty set, then t is a leaf node labeled by the default class, y_d.
If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
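The three cases of the general procedure map directly onto a recursive function. The sketch below is not the textbook's implementation: choosing the first attribute that still takes more than one value, using multi-way splits on categorical values, passing the parent's majority class down as the default, and falling back to the majority class when no attribute can separate the records are all simplifying assumptions.

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """records: list of (attribute_dict, label).
    Returns a label (leaf) or (attribute, {value: subtree})."""
    if not records:                      # D_t is empty: default class
        return default_class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:            # D_t is pure: leaf with that class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # Assumption: take the first attribute that still separates the records.
    for attribute in attributes:
        values = {rec[attribute] for rec, _ in records}
        if len(values) > 1:
            break
    else:
        return majority                  # no attribute test can split D_t
    remaining = [a for a in attributes if a != attribute]
    children = {}
    for value in sorted(values):
        subset = [(rec, label) for rec, label in records
                  if rec[attribute] == value]
        children[value] = hunt(subset, remaining, majority)
    return (attribute, children)

# Hypothetical three-record example using two categorical attributes.
toy = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
]
tree = hunt(toy, ["Refund", "MarSt"], "No")
print(tree)
```

A real implementation would replace the "first usable attribute" rule with a criterion such as an impurity-based gain.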
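Hunt's Algorithm

Step 1: a single leaf
  Don't Cheat

Step 2: split on Refund
  Refund?
    Yes: Don't Cheat
    No: Don't Cheat

Step 3: split the Refund = No branch on Marital Status
  Refund?
    Yes: Don't Cheat
    No: Marital Status?
      Single, Divorced: Cheat
      Married: Don't Cheat

Step 4: split the Single, Divorced branch on Taxable Income
  Refund?
    Yes: Don't Cheat
    No: Marital Status?
      Single, Divorced: Taxable Income?
        < 80K: Don't Cheat
        >= 80K: Cheat
      Married: Don't Cheat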
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
Determine how to split the records
  How to specify the attribute test condition?
  How to determine the best split?
Determine when to stop splitting
How to Specify Test Condition?
Depends on attribute types:
  Nominal
  Ordinal
  Continuous
Depends on number of ways to split:
  2-way split
  Multi-way split
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as distinct values.
  CarType: Family | Luxury | Sports
Binary split: divides values into two subsets. Need to find optimal partitioning.
  CarType: {Sports, Luxury} vs. {Family}  OR  CarType: {Family, Luxury} vs. {Sports}
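Finding the optimal binary partitioning of a nominal attribute means looking at every way to divide its values into two non-empty groups; for k distinct values there are 2^(k-1) - 1 of them. A minimal sketch of that enumeration follows (the function name is hypothetical):

```python
from itertools import combinations

def binary_partitions(values):
    """All ways to divide the values into two non-empty groups."""
    ordered = sorted(values)
    first, rest = ordered[0], ordered[1:]
    partitions = []
    # Keep the first value in the left group so that mirror-image
    # partitions are not listed twice; never put every value on the left.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(ordered) - left
            partitions.append((left, right))
    return partitions

car_type_splits = binary_partitions(["Family", "Luxury", "Sports"])
for left, right in car_type_splits:
    print(sorted(left), "vs.", sorted(right))
```

For CarType this yields three candidate splits, two of which are the ones drawn on the slide; each candidate would then be scored with an impurity measure.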
Splitting Based on Ordinal Attributes
Multi-way split: use as many partitions as distinct values.
  Size: Small | Medium | Large
Binary split: divides values into two subsets. Need to find optimal partitioning.
  Size: {Small, Medium} vs. {Large}  OR  Size: {Medium, Large} vs. {Small}
What about this split?
  Size: {Small, Large} vs. {Medium}
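One way to read the question about {Small, Large} vs. {Medium} is that a grouping of an ordinal attribute should not break the order of its values. Under that reading, the valid binary splits are just the k - 1 cut points along the ordering. The helper names below are hypothetical.

```python
def ordinal_binary_splits(ordered_values):
    """Binary splits that cut the ordering at a single point."""
    return [(ordered_values[:i], ordered_values[i:])
            for i in range(1, len(ordered_values))]

def preserves_order(left, right, ordered_values):
    """True if every value of one group precedes every value of the other."""
    rank = {value: i for i, value in enumerate(ordered_values)}
    left_ranks = [rank[v] for v in left]
    right_ranks = [rank[v] for v in right]
    return (max(left_ranks) < min(right_ranks)
            or max(right_ranks) < min(left_ranks))

SIZES = ["Small", "Medium", "Large"]
print(ordinal_binary_splits(SIZES))
print(preserves_order({"Small", "Large"}, {"Medium"}, SIZES))
```

The two splits returned for Size are the ones shown with "OR" on the slide, while the {Small, Large} vs. {Medium} grouping fails the order check.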
Splitting Based on Continuous Attributes
Different ways of handling:
Discretization to form an ordinal categorical attribute
  Static: discretize once at the beginning
  Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
Binary decision: (A < v) or (A >= v)
  Consider all possible splits and find the best cut
  Can be more compute intensive
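Equal interval and equal frequency bucketing can be sketched as two small functions that return the interior cut points. The function names and the choice to place equal-frequency cuts halfway between neighbouring sorted values are assumptions; the incomes are the Taxable Income values (in K) from the training data.

```python
def equal_width_cuts(values, k):
    """k buckets of equal width between the minimum and maximum value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """k buckets holding (roughly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, k):
        idx = i * n // k
        # Assumption: cut halfway between the two neighbouring values.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
print(equal_width_cuts(incomes, 4))
print(equal_frequency_cuts(incomes, 5))
```

Because of the single large value 220, equal-width buckets leave most records in the lowest bucket, whereas equal-frequency buckets place two records in each of the five buckets.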
How to Determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1.
Which test condition is the best?
How to Determine the Best Split
Greedy approach: nodes with homogeneous class distribution are preferred.
Need a measure of node impurity:
  Non-homogeneous: high degree of impurity
  Homogeneous: low degree of impurity
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
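The three measures can be compared side by side on a vector of class counts. The sketch below uses the standard definitions (Gini = 1 - sum of squared class frequencies, entropy = -sum of p log2 p, error = 1 - largest class frequency); only the Gini formula is spelled out in this part of the notes, so the other two should be read as the usual textbook forms, and the function names are illustrative.

```python
import math

def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits; classes with zero records contribute nothing."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    """Misclassification error: 1 minus the majority class frequency."""
    n = sum(counts)
    return 1.0 - max(counts) / n

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts,
          round(gini(counts), 3),
          round(entropy(counts), 3),
          round(classification_error(counts), 3))
```

All three are zero for a pure node such as [0, 6] and largest for the evenly mixed node [3, 3].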
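How to Find the Best Split
Before splitting: the parent node has impurity M0.
Split on A? (Yes: Node N1 with impurity M1, No: Node N2 with impurity M2); combined impurity of the children: M12.
Split on B? (Yes: Node N3 with impurity M3, No: Node N4 with impurity M4); combined impurity of the children: M34.
Gain = M0 - M12 vs. M0 - M34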
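The comparison of M0 - M12 against M0 - M34 can be sketched numerically. The parent counts (10 records of each class) follow the earlier slide; the child counts for the two candidate splits are made up for illustration, and the Gini index is used as the impurity measure by assumption.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def combined_impurity(children, impurity=gini):
    """Impurity of the children, weighted by their share of the records."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * impurity(child) for child in children)

def gain(parent, children, impurity=gini):
    """Impurity before the split minus combined impurity after it."""
    return impurity(parent) - combined_impurity(children, impurity)

parent = [10, 10]                 # M0 is computed from these counts
split_a = [[8, 2], [2, 8]]        # hypothetical children N1, N2
split_b = [[5, 5], [5, 5]]        # hypothetical children N3, N4

gain_a = gain(parent, split_a)    # M0 - M12
gain_b = gain(parent, split_b)    # M0 - M34
print(round(gain_a, 3), round(gain_b, 3))
```

With these counts, split A has the larger gain and would be chosen; split B leaves both children as mixed as the parent, so its gain is zero.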
Measure of Impurity: GINI
Gini index for a given node t:
$\mathrm{GINI}(t) = 1 - \sum_j [p(j \mid t)]^2$
(NOTE: $p(j \mid t)$ is the relative frequency of class j at node t.)
Maximum ($1 - 1/n_c$, where $n_c$ is the number of classes) when records are equally distributed among all classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most interesting information.

C1  C2  Gini
0   6   0.000
1   5   0.278
2   4   0.444
3   3   0.500
Examples for computing GINI
$\mathrm{GINI}(t) = 1 - \sum_j [p(j \mid t)]^2$

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
Splitting Based on GINI
Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the quality of the split is computed as
$\mathrm{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, \mathrm{GINI}(i)$
where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index
Splits into two partitions.
Effect of weighting partitions: larger and purer partitions are sought.

Parent (C1, C2): Gini = 0.500
Split on B? (Yes: Node N1, No: Node N2)

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(Children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
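The arithmetic for the split on B can be reproduced in a few lines. The child class counts (5 and 2 in N1, 1 and 4 in N2) are read off the fractions in the Gini formulas; the parent counts of 6 and 6 are inferred from those children and the stated Gini of 0.500, and the function names are illustrative.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: sum over children of (n_i / n) * GINI(i)."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

node_n1 = [5, 2]   # class counts read from Gini(N1)
node_n2 = [1, 4]   # class counts read from Gini(N2)
parent = [a + b for a, b in zip(node_n1, node_n2)]   # inferred: [6, 6]

print(round(gini(parent), 3))                  # Gini of the parent
print(round(gini(node_n1), 3), round(gini(node_n2), 3))
print(round(gini_split([node_n1, node_n2]), 3))
```

The weighted Gini of the children (0.371) is lower than the parent's 0.500, so the split reduces impurity.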
Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the dataset.
Use the count matrix to make decisions.

Multi-way split
        CarType
        Family  Sports  Luxury
C1      1       2       1
C2      4       1       1
Gini    0.393

Two-way split (find best partition of values)
        CarType
        {Sports, Luxury}  {Family}
C1      3                 1
C2      2                 4
Gini    0.400

        CarType
        {Sports}  {Family, Luxury}
C1      2         2
C2      1         5
Gini    0.419
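The three Gini values in the count matrices can be checked from the per-value counts alone, since a two-way split just adds up the counts of the values it groups together. A minimal sketch, with hypothetical helper names:

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: sum over children of (n_i / n) * GINI(i)."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Count matrix from the multi-way split: value -> [C1 count, C2 count].
CAR_TYPE = {"Family": [1, 4], "Sports": [2, 1], "Luxury": [1, 1]}

def merge(values):
    """Class counts of a group of attribute values."""
    return [sum(CAR_TYPE[v][j] for v in values) for j in range(2)]

multi_way = gini_split(list(CAR_TYPE.values()))
sports_luxury = gini_split([merge(["Sports", "Luxury"]), merge(["Family"])])
family_luxury = gini_split([merge(["Sports"]), merge(["Family", "Luxury"])])
print(round(multi_way, 3), round(sports_luxury, 3), round(family_luxury, 3))
```

The multi-way split scores lowest here, which is expected: a two-way split merges partitions of the multi-way split and cannot be purer than it.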
Continuous Attributes: Computing Gini Index
Use binary decisions based on one value.
Several choices for the splitting value:
  Number of possible splitting values = number of distinct values
Each splitting value v has a count matrix associated with it:
  Class counts in each of the partitions, A < v and A >= v
Simple method to choose the best v:
  For each v, scan the database to gather the count matrix and compute its Gini index.
  Computationally inefficient! Repetition of work.
Continuous Attributes: Computing Gini Index...
For efficient computation, for each attribute:
  Sort the attribute on values.
  Linearly scan these values, each time updating the count matrix and computing the Gini index.
  Choose the split position that has the least Gini index.

Sorted values (Taxable Income, with Cheat label):
Taxable Income  60  70  75  85   90   95   100  120  125  220
Cheat           No  No  No  Yes  Yes  Yes  No   No   No   No

Split positions (each with a <= / > count matrix of Yes and No) and the resulting Gini index:
Split position  55     65     72     80     87     92     97     110    122    172    230
Gini            0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
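The sort-then-scan procedure can be sketched as a single pass that moves one record at a time from the right partition to the left and updates the counts incrementally. This is an illustration under stated assumptions: candidate cuts are exact midpoints between neighbouring sorted values (the slide appears to list truncated midpoints plus the two end positions 55 and 230), and the function names are hypothetical.

```python
def gini_two_class(positives, n):
    """Gini index of a partition with n records, `positives` of one class."""
    if n == 0:
        return 0.0
    p = positives / n
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def best_split(values, labels, positive="Yes"):
    """Return (cut, weighted Gini) for the best binary split A <= cut."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(1 for _, label in pairs if label == positive)
    left_n = left_pos = 0
    best = None
    for i in range(n - 1):
        value, label = pairs[i]
        left_n += 1                       # move one record to the left side
        left_pos += (label == positive)
        next_value = pairs[i + 1][0]
        if next_value == value:
            continue                      # no cut between equal values
        cut = (value + next_value) / 2
        right_n = n - left_n
        right_pos = total_pos - left_pos
        score = (left_n / n) * gini_two_class(left_pos, left_n) \
            + (right_n / n) * gini_two_class(right_pos, right_n)
        if best is None or score < best[1]:
            best = (cut, score)
    return best

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
cut, score = best_split(incomes, cheat)
print(cut, round(score, 3))
```

The scan picks the cut between 95 and 100 with a weighted Gini of 0.300, which corresponds to the slide's split position 97.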
