Data Mining
Lecture 03
Classification
Decision Tree
Sources:
Slides from the text book: Introduction to Data Mining 1/e, Tan, Steinbach, Kumar
Data Mining
Classification: Basic Concepts,
Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Tan,Steinbach, Kumar
4/18/2004
Classification: Definition
Classifying
Classifying
Categorizing
Classification Techniques
Decision
Training Data

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Decision tree (Splitting Attributes: Refund, MarSt, TaxInc)

Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO
The same training data (Tid 1 to 10), with a different tree:

MarSt
  Married: NO
  Single, Divorced: Refund
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES
Decision Tree
Test record:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Apply the tree to the test record, starting at the root:

Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO

Refund = No: follow the No branch to MarSt.
MarSt = Married: follow the Married branch to the leaf NO.
Assign Cheat to No.
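The traversal on this slide can be sketched in Python. The nested-dict encoding, the `classify` helper and the field names are assumptions made for illustration; only the tree shape and the test record come from the slides. Income is expressed in thousands. The slide labels the income branches "< 80K" and "> 80K" and leaves exactly 80K unassigned, so the sketch arbitrarily sends that value to the "> 80K" branch.

```python
# Tree from the slide, encoded as nested dicts (the encoding is an assumption).
# Internal node: {"test": function, "branches": {outcome: subtree}}; leaf: a class label.
TREE = {
    "test": lambda r: r["Refund"],
    "branches": {
        "Yes": "No",
        "No": {
            "test": lambda r: "Married" if r["MarSt"] == "Married" else "Single, Divorced",
            "branches": {
                "Married": "No",
                "Single, Divorced": {
                    # Exactly 80K is not covered by the slide's labels; sent right by assumption.
                    "test": lambda r: "< 80K" if r["TaxInc"] < 80 else "> 80K",
                    "branches": {"< 80K": "No", "> 80K": "Yes"},
                },
            },
        },
    },
}


def classify(tree, record):
    """Start at the root and follow the matching branch until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        outcome = node["test"](record)
        node = node["branches"][outcome]
    return node


test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
predicted = classify(TREE, test_record)
print(predicted)  # No
```

The record never reaches the TaxInc test: the Married branch ends in a leaf, so the income value plays no role in this prediction.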
Decision Tree
Algorithms:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
Any idea?
[Figure: successive stages of a decision tree built on the training data, splitting on Refund, then Marital Status (Single, Divorced / Married), then Taxable Income (< 80K / >= 80K), with leaves Don't Cheat and Cheat.]
Dt: the set of training records that reach a node t.

Hunt's Algorithm

The tree is grown in stages on the training data:

Stage 1:
Don't Cheat

Stage 2:
Refund
  Yes: Don't Cheat
  No: Don't Cheat

Stage 3:
Refund
  Yes: Don't Cheat
  No: Marital Status
    Single, Divorced: Cheat
    Married: Don't Cheat

Stage 4:
Refund
  Yes: Don't Cheat
  No: Marital Status
    Single, Divorced: Taxable Income
      < 80K: Don't Cheat
      >= 80K: Cheat
    Married: Don't Cheat
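The staged growth shown for Hunt's algorithm can be sketched as a recursive procedure: a node whose records all share one class becomes a leaf, otherwise the records are split by an attribute test and each subset is processed the same way. This is a minimal sketch under stated assumptions, not the textbook's pseudocode: the function `hunt`, the record encoding, and the hand-picked order of tests (Refund, Marital Status, Taxable Income) are hypothetical and chosen only so that the result matches the final stage of the figure. A practical implementation would select each test with an impurity measure.

```python
from collections import Counter

# Training data from the slides: (Refund, Marital Status, Taxable Income in K, Cheat).
ROWS = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
RECORDS = [dict(zip(("Refund", "MarSt", "TaxInc", "Cheat"), row)) for row in ROWS]

# Attribute tests, tried in this fixed order (an assumption, not a selection rule).
TESTS = [
    ("Refund", lambda r: r["Refund"]),
    ("MarSt", lambda r: "Married" if r["MarSt"] == "Married" else "Single, Divorced"),
    ("TaxInc", lambda r: "< 80K" if r["TaxInc"] < 80 else ">= 80K"),
]


def hunt(records, tests, default):
    """Return a class label (leaf) or a (test_name, {outcome: subtree}) pair."""
    if not records:                      # empty subset: use the default class
        return default
    labels = [r["Cheat"] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not tests:   # pure node, or no test left to apply
        return majority
    (name, outcome_of), remaining = tests[0], tests[1:]
    groups = {}
    for r in records:                    # partition the records by test outcome
        groups.setdefault(outcome_of(r), []).append(r)
    return (name, {o: hunt(g, remaining, majority) for o, g in groups.items()})


tree = hunt(RECORDS, TESTS, default="No")
print(tree)
```

Here the class labels "No" and "Yes" stand for Don't Cheat and Cheat in the figure.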
Tree Induction
Greedy strategy.
Split the records based on an attribute test that optimizes a certain criterion.
Issues
Depends on the number of ways to split:
  2-way split
  Multi-way split
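CarType (nominal attribute)
  Multi-way split: Family | Sports | Luxury
  Binary split: {Sports, Luxury} vs. {Family} OR {Family, Luxury} vs. {Sports}

Size (ordinal attribute)
  Multi-way split: Small | Medium | Large
  Binary split: {Small, Medium} vs. {Large} OR {Medium, Large} vs. {Small}
  The grouping {Small, Large} vs. {Medium} does not preserve the order of the values.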
Ways of handling continuous attributes:
Discretization to form an ordinal categorical attribute
  Static: discretize once at the beginning
  Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
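Equal-interval and equal-frequency bucketing can be contrasted with a short sketch. The helper names (`equal_width_edges`, `equal_frequency_edges`, `bucket`) and the choice of four buckets are assumptions for illustration; the values are the Taxable Income column of the training data, in thousands.

```python
def equal_width_edges(values, k):
    """Cut points that divide [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Cut points chosen so that each bucket holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def bucket(value, edges):
    """Index of the bucket a value falls into (a value equal to an edge goes right)."""
    return sum(value >= e for e in edges)


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # Taxable Income in K

width_edges = equal_width_edges(incomes, 4)
freq_edges = equal_frequency_edges(incomes, 4)

# Number of records that land in each of the four buckets.
width_counts = [sum(bucket(v, width_edges) == b for v in incomes) for b in range(4)]
freq_counts = [sum(bucket(v, freq_edges) == b for v in incomes) for b in range(4)]

print(width_edges, width_counts)
print(freq_edges, freq_counts)
```

On this column the single 220K record stretches the equal-width intervals, so most records fall into the first bucket, while the equal-frequency cut points keep the buckets balanced.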
Greedy approach:
Nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity:
  Non-homogeneous: high degree of impurity
  Homogeneous: low degree of impurity
Measures of node impurity:
  Gini index
  Entropy
  Misclassification error
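Before splitting: impurity M0

Split on A?
  Yes: Node N1 (impurity M1)
  No: Node N2 (impurity M2)
  Combined impurity of the children: M12

Split on B?
  Yes: Node N3 (impurity M3)
  No: Node N4 (impurity M4)
  Combined impurity of the children: M34

Gain = M0 - M12 vs. M0 - M34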
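$$GINI(t) = 1 - \sum_{j} [p(j \mid t)]^2$$

where $p(j \mid t)$ is the relative frequency of class $j$ at node $t$.

C1    0      1      2      3
C2    6      5      4      3
Gini  0.000  0.278  0.444  0.500

Examples:

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
$Gini = 1 - 0^2 - 1^2 = 0$

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
$Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278$

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
$Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444$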
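The Gini values for the two-class nodes above can be checked with a few lines of Python. The function name `gini` and the convention of returning 0 for an empty node are assumptions of this sketch; the class counts are those on the slide.

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, with p(j|t) estimated from the class counts at node t."""
    n = sum(counts)
    if n == 0:
        return 0.0  # empty node: treated as pure here (an assumption, not from the slides)
    return 1.0 - sum((c / n) ** 2 for c in counts)


# (C1, C2) counts from the slide.
nodes = [(0, 6), (1, 5), (2, 4), (3, 3)]
values = [round(gini(node), 3) for node in nodes]
print(values)  # [0.0, 0.278, 0.444, 0.5]
```

The index is 0 for a pure node and reaches its two-class maximum of 0.5 when both classes are equally frequent.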
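$$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, GINI(i)$$

where $n_i$ is the number of records at child $i$, $n$ is the number of records at the parent node, and $k$ is the number of children.

Binary split on B? (Yes: Node N1, No: Node N2)

Parent: C1 = 6, C2 = 6, Gini = 0.500

      N1  N2
C1    5   1
C2    2   4

$Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408$
$Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320$
$Gini(Children) = 7/12 \times 0.408 + 5/12 \times 0.320 = 0.371$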
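The weighted Gini of the children, and the gain over the parent (the M0 - M12 comparison from the earlier slide), can be reproduced as follows. The function names `gini` and `gini_split` are assumptions of this sketch; the child counts are those of nodes N1 and N2, and the parent counts are obtained by summing the children.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


def gini_split(children):
    """Sum over children of (n_i / n) * GINI(i), where n is the total record count."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


children = [[5, 2], [1, 4]]                      # (C1, C2) counts at N1 and N2
parent = [sum(col) for col in zip(*children)]    # [6, 6]

quality = gini_split(children)
gain = gini(parent) - quality

print(round(gini(children[0]), 3), round(gini(children[1]), 3))  # 0.408 0.32
print(round(quality, 3))  # 0.371
print(round(gain, 3))     # 0.129
```

A lower weighted Gini means purer children, so the split with the largest gain is preferred.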
Multi-way split

CarType  Family  Sports  Luxury
C1       1       2       1
C2       4       1       1
Gini     0.393

Two-way split

CarType  {Sports, Luxury}  {Family}
C1       3                 1
C2       2                 4
Gini     0.400

CarType  {Sports}  {Family, Luxury}
C1       2         2
C2       1         5
Gini     0.419
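The three CarType partitions can be compared with the same weighted Gini. The per-value counts below are an assumption in the sense that they are derived from the two two-way tables (Family and Sports are read directly, Luxury is the difference between {Sports, Luxury} and {Sports}); the helper names are hypothetical.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


# (C1, C2) counts per CarType value, derived from the two-way tables (see lead-in).
COUNTS = {"Family": (1, 4), "Sports": (2, 1), "Luxury": (1, 1)}


def partition_gini(groups):
    """Weighted Gini of a partition; each group is a list of CarType values merged into one child."""
    children = [[sum(COUNTS[v][j] for v in group) for j in range(2)] for group in groups]
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


multi_way = round(partition_gini([["Family"], ["Sports"], ["Luxury"]]), 3)
sports_luxury = round(partition_gini([["Sports", "Luxury"], ["Family"]]), 3)
family_luxury = round(partition_gini([["Sports"], ["Family", "Luxury"]]), 3)

print(multi_way, sports_luxury, family_luxury)  # 0.393 0.4 0.419
```

Among the two-way groupings, {Sports, Luxury} vs. {Family} gives the lower Gini; the multi-way split is lower still because it produces more, smaller children.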
Sorted values of Taxable Income (in K) with their class labels:

Taxable Income  60  70  75  85   90   95   100  120  125  220
Cheat           No  No  No  Yes  Yes  Yes  No   No   No   No

For each candidate split position v, the Yes and No records are counted on each side (<= v and > v) and the weighted Gini is computed:

Split position  55     65     72     80     87     92     97     110    122    172    230
Gini            0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The lowest Gini, 0.300, is at split position 97.
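The table of Gini values per split position can be recomputed directly from the sorted records. This is a brute-force sketch that recounts both sides for every candidate; the names `gini`, `split_gini` and `scores` are assumptions, while the income values, class labels and split positions are taken from the slide.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


# (Taxable Income in K, Cheat), already sorted by income.
records = [(60, "No"), (70, "No"), (75, "No"), (85, "Yes"), (90, "Yes"),
           (95, "Yes"), (100, "No"), (120, "No"), (125, "No"), (220, "No")]
split_positions = [55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230]


def split_gini(records, v):
    """Weighted Gini of the binary split (income <= v) vs. (income > v)."""
    left = [label for income, label in records if income <= v]
    right = [label for income, label in records if income > v]
    total = 0.0
    for side in (left, right):
        counts = [side.count("Yes"), side.count("No")]
        total += len(side) / len(records) * gini(counts)
    return total


scores = {v: round(split_gini(records, v), 3) for v in split_positions}
best = min(scores, key=scores.get)

print(scores)
print(best, scores[best])  # 97 0.3
```

Splitting at 97 puts all three Yes records on the left with three No records and leaves a pure No node on the right, which is why it scores lowest.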