
Data Mining

Classification
Hatem Haddad

Slides are largely based on those provided by Tan, Steinbach, and Kumar, Introduction to Data Mining.
What have we seen last time?
Knowledge Discovery (KDD) Process
This is a view from the typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.

[Figure: the KDD pipeline - Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data Mining Algorithms

Classification algorithms predict one or more discrete (categorical) variables, based on the other attributes in the dataset.
Regression algorithms predict one or more continuous numeric variables, such as profit or loss, based on other attributes in the dataset.
Clustering algorithms divide data into groups, or clusters, of items that have similar properties.
Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is creating association rules, which can be used in market basket analysis.
Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a series of clicks on a web site, or a series of log events preceding machine maintenance.
Data Mining Algorithms

Choosing an algorithm by task.
There is no reason that you should be limited to one algorithm in your solutions.
Supervised vs. Unsupervised Learning

Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  New data is classified based on the training set.

Unsupervised learning (clustering)
  The class labels of the training data are unknown.
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
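A minimal sketch contrasting the two settings with scikit-learn (the library choice and the toy data are illustrative assumptions, not part of the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # supervised: requires labels
from sklearn.cluster import KMeans                 # unsupervised: no labels

# Toy data: 6 observations with 2 measurements each (made-up values)
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [6.0, 7.0], [6.2, 6.8], [5.9, 7.1]])
y = np.array([0, 0, 0, 1, 1, 1])   # class labels, available only in the supervised case

# Supervised learning: fit on (X, y), then classify new data
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1.1, 2.0]]))   # predicted class label for a new observation

# Unsupervised learning: only X is given; the algorithm looks for clusters
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                  # cluster assignment for each observation
```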


Prediction Problems: Classification

Classification
  predicts categorical class labels (discrete or nominal)
  classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Classification: Definition

Given a collection of records (the training set):
  Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
70/30 rule: 70% for training and 30% for testing.
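A minimal sketch of the 70/30 split with scikit-learn (the library and the example dataset are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)   # any labeled dataset would do

# 70% of the records build the model, 30% are held out to validate it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Accuracy on the test set estimates how well previously unseen
# records are assigned a class
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```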
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

The training set is fed to a learning algorithm, which learns a model (induction).

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The model is then applied to the test set to assign a class to each unlabeled record (deduction).
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (Refund, MarSt, and TaxInc are the splitting attributes):

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married          -> NO

Another Example of Decision Tree

Same training data as before, different model:

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Same setup as before: the training set (Tid 1 to 10) is fed to a tree induction algorithm, which learns a model in the form of a decision tree (induction); the decision tree is then applied to the test set (Tid 11 to 15) to deduce the class of each unlabeled record (deduction).
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branches that match the record:
1. Refund? The record has Refund = No, so take the "No" branch to the MarSt node.
2. MarSt? The record has Marital Status = Married, so take the "Married" branch.
3. The "Married" branch leads to the leaf NO.

Assign Cheat = No to the test record.
Decision Trees as a Computer Program

Exercise: rewrite the previous decision trees as an if-then-else statement.
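One possible answer for the first example tree (Refund, then MarSt, then TaxInc), written as a plain Python sketch; the function name and record format are assumptions made here for illustration:

```python
def classify_cheat(refund, marital_status, taxable_income):
    """Apply the first example decision tree to a single record."""
    if refund == "Yes":
        return "No"                      # Refund = Yes  -> leaf NO
    else:                                # Refund = No
        if marital_status == "Married":
            return "No"                  # Married       -> leaf NO
        else:                            # Single or Divorced
            if taxable_income < 80_000:
                return "No"              # TaxInc < 80K  -> leaf NO
            else:
                return "Yes"             # TaxInc > 80K  -> leaf YES

# The test record from the previous slides: Refund = No, Married, 80K
print(classify_cheat("No", "Married", 80_000))   # -> "No"
```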


Tree Evaluation (next chapter)

Test set
Ground truth, data labeling, Mechanical Turk
Confusion matrix and cost matrix
True positive, true negative, false positive, false negative
Accuracy
Error rates
Exercise - Tree Induction: Training Dataset

Class: buys_computer

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Solution: interpretation?

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes
Decision Tree Classification Task

(Recap: training set -> tree induction algorithm -> decision tree model -> apply the model to the test set by deduction.)
Decision Tree Induction

Many algorithms:
  Hunt's Algorithm (one of the earliest)
  CART
  ID3, C4.5
  SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t (at the root, Dt is the full Refund / Marital Status / Taxable Income / Cheat training table shown earlier).

General procedure:
  If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets; recursively apply the procedure to each subset.
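A compact Python sketch of this recursive procedure (not the implementation from any particular system; the split-selection helper below is a deliberately naive placeholder, where a real implementation would use the impurity measures discussed later):

```python
from collections import Counter, defaultdict

def choose_split(records, labels):
    """Placeholder attribute test: split on the first attribute that still has
    more than one distinct value. A real implementation would pick the
    attribute with the best Gini index / information gain (see later slides)."""
    for attr in records[0]:
        values = {r[attr] for r in records}
        if len(values) > 1:
            partition = defaultdict(list)
            for i, r in enumerate(records):
                partition[r[attr]].append(i)
            return attr, partition
    return None, None

def hunts_algorithm(records, labels, default_class=None):
    # Dt is empty -> leaf labeled with the default class yd
    if not records:
        return {"leaf": default_class}
    # All records in Dt belong to the same class yt -> leaf labeled yt
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # More than one class -> apply an attribute test and recurse on each subset
    majority = Counter(labels).most_common(1)[0][0]
    attr, partition = choose_split(records, labels)
    if attr is None:                       # no attribute left to split on
        return {"leaf": majority}
    node = {"split on": attr, "children": {}}
    for value, idx in partition.items():
        node["children"][value] = hunts_algorithm(
            [records[i] for i in idx], [labels[i] for i in idx],
            default_class=majority)
    return node

# Tiny hypothetical example
tree = hunts_algorithm(
    [{"Refund": "Yes", "Marital": "Single"},
     {"Refund": "No", "Marital": "Married"},
     {"Refund": "No", "Marital": "Single"}],
    ["No", "No", "Yes"])
print(tree)
```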
General Structure of Hunt's Algorithm

Start with a single node; we will assume that Refund is the best criterion for splitting the data:

Refund?
  Yes -> Don't Cheat
  No  -> ?    (Dt = the records with Refund = No)

Hunt's algorithm is then applied recursively to each child of the root node.

Exercise: Continue the algorithm

Hunt's Algorithm on the Refund / Marital Status / Taxable Income / Cheat data:

Step 1:
Refund?
  Yes -> Don't Cheat
  No  -> ?

Step 2 (split the Refund = No records on Marital Status):
Refund?
  Yes -> Don't Cheat
  No  -> Marital Status?
           Single, Divorced -> ?
           Married          -> Don't Cheat

Step 3 (split the Single/Divorced records on Taxable Income):
Refund?
  Yes -> Don't Cheat
  No  -> Marital Status?
           Single, Divorced -> Taxable Income?
                                 < 80K  -> Don't Cheat
                                 >= 80K -> Cheat
           Married          -> Don't Cheat
Handling Missing Attribute Values

Missing values affect decision tree construction in three different ways:
  They affect how impurity measures are computed.
  They affect how an instance with a missing value is distributed to child nodes.
  They affect how a test instance with a missing value is classified.
Tree Induction

Greedy strategy:
  Split the records based on an attribute test that optimizes a certain criterion.

Issues:
  Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
  Determine when to stop splitting
  Determine how to cut back if the tree is too deep
    What is wrong with a tree that is too deep?
How to Specify the Test Condition?

Depends on the attribute type:
  Nominal
  Ordinal
  Continuous

Depends on the number of ways to split:
  2-way split
  Multi-way split
Splitting Based on Ordinal Attributes

Multi-way split: use as many partitions as distinct values.
  Size: Small / Medium / Large

Binary split: divides values into two subsets; need to find the optimal partitioning.
  Size: {Small, Medium} vs {Large}    OR    {Medium, Large} vs {Small}

  What about Size: {Small, Large} vs {Medium}? (This grouping does not preserve the order of the values.)
Splitting Based on Continuous Attributes

Different ways of handling:
  Discretization to form an ordinal categorical attribute
    Static: discretize once at the beginning
    Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  Binary decision: (A < v) or (A >= v)
    Consider all possible splits and find the best cut
    Can be more compute intensive
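A small sketch of static discretization with pandas, using the Taxable Income values from the example training data (pandas is an assumption; any equal-width / equal-frequency binning routine would do):

```python
import pandas as pd

# Taxable Income values from the example training data
income = pd.Series([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])

# Equal-interval bucketing: 4 bins of equal width over the value range
equal_width = pd.cut(income, bins=4)

# Equal-frequency bucketing (percentiles): 4 bins with roughly equal counts
equal_freq = pd.qcut(income, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```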
Splitting Based on Continuous Attributes

(i) Binary split:
  Taxable Income > 80K?  ->  Yes / No

(ii) Multi-way split:
  Taxable Income?  ->  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

Tree Induction

Greedy strategy:
  Split the records based on an attribute test that optimizes a certain criterion.

Issues:
  Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
  Determine when to stop splitting
How to Determine the Best Split

Before splitting: 10 records of class C0, 10 records of class C1.

Three candidate test conditions:

Own Car?      Yes: C0 6, C1 4          No: C0 4, C1 6
Car Type?     Family: C0 1, C1 3       Sports: C0 8, C1 0       Luxury: C0 1, C1 7
Student ID?   c1 .. c10: C0 1, C1 0 each        c11 .. c20: C0 0, C1 1 each

Which test condition is the best?
How to Determine the Best Split

Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity:

  C0: 5, C1: 5  ->  non-homogeneous, high degree of impurity
  C0: 9, C1: 1  ->  homogeneous, low degree of impurity
How to Find the Best Split (let M be the impurity measure)

Before splitting, compute the impurity of the parent node: M0 (class counts N00, N01).

Candidate split A? (Yes/No) produces nodes N1 and N2 with impurities M1 and M2; their weighted combination is M12.
Candidate split B? (Yes/No) produces nodes N3 and N4 with impurities M3 and M4; their weighted combination is M34.

Compare Gain = M0 - M12 vs. M0 - M34 and choose the split with the larger gain.
Measures of Node Impurity

Entropy
Gain
Gini Index
Classification Error
Entropy-Based Evaluation and Splitting

Entropy at a given node t:

  Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)

Measures the impurity of a node:
  Maximum (log2 nc), where nc = the number of classes, when records are equally distributed among all classes, implying least information.
  Minimum (0) when all records belong to one class, implying most information.
Examples for computing Entropy
Entropy (t ) p ( j | t ) log p ( j | t )
j 2

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Entropy = 0 log 0 1 log 1 = 0 0 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Entropy = (1/6) log2 (1/6) (5/6) log2 (1/6) = 0.65

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Entropy = (2/6) log2 (2/6) (4/6) log2 (4/6) = 0.92
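A quick sketch to reproduce these numbers (plain Python, standard library only):

```python
from math import log2

def entropy(counts):
    """Entropy of a node given the class counts [C1, C2, ...] at that node."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]   # 0 * log 0 is treated as 0
    return -sum(p * log2(p) for p in probs)

print(entropy([0, 6]))   # 0.0
print(entropy([1, 5]))   # ~0.650
print(entropy([2, 4]))   # ~0.918 (0.92 after rounding)
```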
Splitting Based on INFO...

Information Gain:

  GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i.

Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
Used in ID3 and C4.5.
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO...

Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = - Σ_{i=1..k} (n_i / n) log2 (n_i / n)

Parent node p is split into k partitions; n_i is the number of records in partition i.

Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
Used in C4.5.
Designed to overcome the disadvantage of Information Gain.
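A sketch computing both criteria for one candidate split; the example reuses the Own Car? counts from the earlier "best split" slide (parent 10/10, children 6/4 and 4/6), and log base 2 is assumed throughout:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children):
    """GAIN_split = Entropy(parent) - weighted entropy of the child partitions."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent_counts) - weighted

def gain_ratio(parent_counts, children):
    """GainRATIO_split = GAIN_split / SplitINFO."""
    n = sum(parent_counts)
    split_info = -sum(sum(child) / n * log2(sum(child) / n) for child in children)
    return information_gain(parent_counts, children) / split_info

parent = [10, 10]              # 10 records of C0, 10 of C1
own_car = [[6, 4], [4, 6]]     # Yes branch, No branch
print(information_gain(parent, own_car))   # ~0.029: a weak split
print(gain_ratio(parent, own_car))         # ~0.029: SplitINFO = 1 here
```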
Measure of Impurity: GINI

Gini index for a given node t:

  GINI(t) = 1 - Σ_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most interesting information.

  C1 = 0, C2 = 6:  Gini = 0.000
  C1 = 1, C2 = 5:  Gini = 0.278
  C1 = 2, C2 = 4:  Gini = 0.444
  C1 = 3, C2 = 3:  Gini = 0.500
Examples for Computing GINI

  GINI(t) = 1 - Σ_j [p(j|t)]^2

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
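The same calculations as a small Python sketch (standard library only):

```python
def gini(counts):
    """Gini index of a node given the class counts [C1, C2, ...] at that node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))   # 0.0
print(gini([1, 5]))   # ~0.278
print(gini([2, 4]))   # ~0.444
```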
Splitting Based on GINI (Overall Gini)

Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the quality of the split is computed as:

  GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)

where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing the GINI Index

Splits into two partitions.
Effect of weighting partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500

Split on B? (Yes -> Node N1, No -> Node N2):

        N1   N2
  C1     5    1
  C2     2    4

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
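A sketch checking the weighted split quality for the B? example above (the `gini` helper is repeated so the block stands alone):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """GINI_split = sum over children of (n_i / n) * GINI(child i)."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

parent = [6, 6]                 # C1 = 6, C2 = 6 at the parent
children = [[5, 2], [1, 4]]     # N1 (Yes branch) and N2 (No branch)

print(gini(parent))                              # 0.5
print([round(gini(c), 3) for c in children])     # [0.408, 0.32]
print(round(gini_split(children), 3))            # 0.371
```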
Multi-way Splits: Computing the Gini Index

For each distinct value, gather the counts for each class in the dataset.
Use the count matrix to make decisions.

Multi-way split:
  CarType:   Family   Sports   Luxury
  C1              1        2        1
  C2              4        1        1
  Gini = 0.393

Two-way split (find the best partition of values):
  CarType:   {Sports, Luxury}   {Family}
  C1                        3          1
  C2                        2          4
  Gini = 0.400

  CarType:   {Sports}   {Family, Luxury}
  C1                2                  2
  C2                1                  5
  Gini = 0.419
Continuous Attributes: Computing the Gini Index

Use binary decisions based on one value, e.g. Taxable Income > 80K? (Yes / No), on the Refund / Marital Status / Taxable Income / Cheat data from before.

Several choices for the splitting value:
  Number of possible splitting values = number of distinct values.
Each splitting value v has a count matrix associated with it:
  Class counts in each of the partitions, A < v and A >= v.

Simple method to choose the best v:
  For each v, scan the database to gather the count matrix and compute its Gini index.
  Computationally inefficient! Repetition of work.
Continuous Attributes: Computing the Gini Index...

For efficient computation, for each attribute:
  Sort the attribute on its values.
  Linearly scan these values, each time updating the count matrix and computing the Gini index.
  Choose the split position that has the least Gini index.

Cheat:                     No    No    No    Yes   Yes   Yes   No    No    No    No
Taxable Income (sorted):   60    70    75    85    90    95    100   120   125   220

Split positions:  55     65     72     80     87     92     97     110    122    172    230
Yes (<=, >):      0,3    0,3    0,3    0,3    1,2    2,1    3,0    3,0    3,0    3,0    3,0
No  (<=, >):      0,7    1,6    2,5    3,4    3,4    3,4    3,4    4,3    5,2    6,1    7,0
Gini:             0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The minimum Gini (0.300) is obtained at the split Taxable Income <= 97.
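A sketch of the scan over the sorted Taxable Income values from the slide (plain Python; for clarity it recomputes the counts at each candidate position instead of updating the count matrix incrementally, so it is less efficient than the procedure described above):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Sorted values, their labels, and the candidate split positions from the slide
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
splits = [55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230]

n = len(income)
for v in splits:
    left  = [c for x, c in zip(income, cheat) if x <= v]    # partition A <= v
    right = [c for x, c in zip(income, cheat) if x > v]     # partition A > v
    g = sum(len(part) / n * gini([part.count("Yes"), part.count("No")])
            for part in (left, right) if part)
    print(f"split at {v}: Gini = {g:.3f}")
# The minimum, 0.300, occurs at Taxable Income <= 97
```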
Classification Error

Classification error at a node t:

  Error(t) = 1 - max_i P(i|t)

Measures the misclassification error made by a node:
  Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information.
  Minimum (0.0) when all records belong to one class, implying most interesting information.
Examples for Computing Error
Error (t ) 1 max P(i | t )
i

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Error = 1 max (0, 1) = 1 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Error = 1 max (1/6, 5/6) = 1 5/6 = 1/6

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Error = 1 max (2/6, 4/6) = 1 4/6 = 1/3
Tree Induction

Greedy strategy:
  Split the records based on an attribute test that optimizes a certain criterion.

Issues:
  Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
  Determine when to stop splitting
Stopping Criteria for Tree Induction

Stop expanding a node when all the records belong to the same class

Stop expanding a node when all the records have similar attribute values
Exercise: from Introduction to Data Mining

  GINI(t) = 1 - Σ_j [p(j|t)]^2

Customer ID  Gender  Car Type  Shirt Size   Class
1            M       Family    Small        C0
2            M       Sports    Medium       C0
3            M       Sports    Medium       C0
4            M       Sports    Large        C0
5            M       Sports    Extra Large  C0
6            M       Sports    Extra Large  C0
7            F       Sports    Small        C0
8            F       Sports    Small        C0
9            F       Sports    Medium       C0
10           F       Luxury    Large        C0
11           M       Family    Large        C1
12           M       Family    Extra Large  C1
13           M       Family    Medium       C1
14           M       Luxury    Extra Large  C1
15           F       Luxury    Small        C1
16           F       Luxury    Small        C1
17           F       Luxury    Medium       C1
18           F       Luxury    Medium       C1
19           F       Luxury    Medium       C1
20           F       Luxury    Large        C1

Compute the Gini index for the Customer ID attribute.
Compute the Gini index for the Gender attribute.
Exercise: from Introduction to Data Mining book

(Same customer data as on the previous slide.)

Compute the Gini index for the Customer ID attribute.
Answer: the Gini for each Customer ID value is 0 (each value holds a single record, which is trivially pure). Therefore, the overall Gini for Customer ID is 0.

Compute the Gini index for the Gender attribute.
Answer: the Male partition holds 6 C0 and 4 C1 records, so its Gini is 1 - (6/10)^2 - (4/10)^2 = 0.48; the Female partition (4 C0, 6 C1) also has Gini 0.48. Therefore, the overall Gini for Gender is 0.5 * 0.48 + 0.5 * 0.48 = 0.48.
Exercise: from Introduction to Data Mining book

  GINI(t) = 1 - Σ_j [p(j|t)]^2        GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)

(Same customer data as on the previous slides.)

Compute the Gini index for the Car Type attribute using a multiway split.
Compute the Gini index for the Shirt Size attribute using a multiway split.
Which attribute is better: Gender, Car Type, or Shirt Size? Answer: Car Type.
Exercise: from Introduction to Data Mining book

Compute the Gini index for the Car Type attribute using a multiway split.
Answer: the Gini for Family car is 0.375, for Sports car is 0, and for Luxury car is 0.2188. The overall Gini is 0.1625.

Compute the Gini index for the Shirt Size attribute using a multiway split.
Answer: the Gini for Small shirt size is 0.48, Medium shirt size is 0.4898, Large shirt size is 0.5, and Extra Large shirt size is 0.5. The overall Gini for the Shirt Size attribute is 0.4914.

Which attribute is better: Gender, Car Type, or Shirt Size? Answer: Car Type, since it has the lowest overall Gini.
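A quick sketch verifying these exercise answers (standard library only; the [C0, C1] counts per attribute value are read off the customer table):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

gender     = [[6, 4], [4, 6]]                   # Male, Female
car_type   = [[1, 3], [8, 0], [1, 7]]           # Family, Sports, Luxury
shirt_size = [[3, 2], [3, 4], [2, 2], [2, 2]]   # Small, Medium, Large, Extra Large

print(round(gini_split(gender), 4))       # 0.48
print(round(gini_split(car_type), 4))     # 0.1625
print(round(gini_split(shirt_size), 4))   # 0.4914 -> Car Type has the lowest Gini
```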
Decision Tree Based Classification

Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for many simple data
sets
