
Data Mining

Classification
Hatem Haddad

Slides are largely based on those provided by Tan, Steinbach, and Kumar, Introduction to Data Mining.
What have we seen last time?
Knowledge Discovery (KDD) Process
This is a view from the typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.

[Figure: the KDD pipeline - Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data Mining Algorithms

Classification algorithms predict one or more discrete (categorical) variables, based on the other attributes in the dataset.
Regression algorithms predict one or more continuous numeric variables, such as profit or loss, based on other attributes in the dataset.
Clustering algorithms divide data into groups, or clusters, of items that have similar properties.
Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is creating association rules, which can be used in market basket analysis.
Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a series of clicks on a web site, or a series of log events preceding machine maintenance.
Data Mining Algorithms

Choosing an algorithm by task.
There is no reason that you should be limited to one algorithm in your solutions.
Supervised vs. Unsupervised Learning

Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  New data is classified based on the training set.

Unsupervised learning (clustering)
  The class labels of the training data are unknown.
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
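A minimal sketch contrasting the two settings with scikit-learn (the library choice and the toy data are illustrative assumptions, not part of the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # supervised: requires labels
from sklearn.cluster import KMeans                 # unsupervised: no labels

# Toy data: 6 observations with 2 measurements each (made-up values)
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [6.0, 7.0], [6.2, 6.8], [5.9, 7.1]])
y = np.array([0, 0, 0, 1, 1, 1])   # class labels, available only in the supervised case

# Supervised learning: fit on (X, y), then classify new data
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1.1, 2.0]]))   # predicted class label for a new observation

# Unsupervised learning: only X is given; the algorithm looks for clusters
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                  # cluster assignment for each observation
```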


Prediction Problems: Classification

Classification
  predicts categorical class labels (discrete or nominal)
  classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Classification: Definition

Given a collection of records (the training set):
  Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
70/30 rule: 70% for training and 30% for testing.
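A minimal sketch of the 70/30 split with scikit-learn (the library and the example dataset are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)   # any labeled dataset would do

# 70% of the records build the model, 30% are held out to validate it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Accuracy on the test set estimates how well previously unseen
# records are assigned a class
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```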
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

The training set is fed to a learning algorithm, which learns a model (induction).

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The model is then applied to the test set to assign a class to each unlabeled record (deduction).
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (Refund, MarSt, and TaxInc are the splitting attributes):

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married          -> NO

Another Example of Decision Tree

Same training data as before, different model:

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Same setup as before: the training set (Tid 1 to 10) is fed to a tree induction algorithm, which learns a model in the form of a decision tree (induction); the decision tree is then applied to the test set (Tid 11 to 15) to deduce the class of each unlabeled record (deduction).
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branches that match the record:
1. Refund? The record has Refund = No, so take the "No" branch to the MarSt node.
2. MarSt? The record has Marital Status = Married, so take the "Married" branch.
3. The "Married" branch leads to the leaf NO.

Assign Cheat = No to the test record.
Decision Trees as a Computer Program

Exercise: rewrite the previous decision trees as an if-then-else statement.
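One possible answer for the first example tree (Refund, then MarSt, then TaxInc), written as a plain Python sketch; the function name and record format are assumptions made here for illustration:

```python
def classify_cheat(refund, marital_status, taxable_income):
    """Apply the first example decision tree to a single record."""
    if refund == "Yes":
        return "No"                      # Refund = Yes  -> leaf NO
    else:                                # Refund = No
        if marital_status == "Married":
            return "No"                  # Married       -> leaf NO
        else:                            # Single or Divorced
            if taxable_income < 80_000:
                return "No"              # TaxInc < 80K  -> leaf NO
            else:
                return "Yes"             # TaxInc > 80K  -> leaf YES

# The test record from the previous slides: Refund = No, Married, 80K
print(classify_cheat("No", "Married", 80_000))   # -> "No"
```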


Tree Evaluation (next chapter)

Test set
Ground truth, data labeling, Mechanical Turk
Confusion matrix and cost matrix
True positive, true negative, false positive, false negative
Accuracy
Error rates
Exercise - Tree Induction: Training Dataset

Class: buys_computer

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Solution: interpretation?

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes
Decision Tree Classification Task

(Recap: training set -> tree induction algorithm -> decision tree model -> apply the model to the test set by deduction.)
Decision Tree Induction

Many algorithms:
  Hunt's Algorithm (one of the earliest)
  CART
  ID3, C4.5
  SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t (at the root, Dt is the full Refund / Marital Status / Taxable Income / Cheat training table shown earlier).

General procedure:
  If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets; recursively apply the procedure to each subset.
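A compact Python sketch of this recursive procedure (not the implementation from any particular system; the split-selection helper below is a deliberately naive placeholder, where a real implementation would use the impurity measures discussed later):

```python
from collections import Counter, defaultdict

def choose_split(records, labels):
    """Placeholder attribute test: split on the first attribute that still has
    more than one distinct value. A real implementation would pick the
    attribute with the best Gini index / information gain (see later slides)."""
    for attr in records[0]:
        values = {r[attr] for r in records}
        if len(values) > 1:
            partition = defaultdict(list)
            for i, r in enumerate(records):
                partition[r[attr]].append(i)
            return attr, partition
    return None, None

def hunts_algorithm(records, labels, default_class=None):
    # Dt is empty -> leaf labeled with the default class yd
    if not records:
        return {"leaf": default_class}
    # All records in Dt belong to the same class yt -> leaf labeled yt
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # More than one class -> apply an attribute test and recurse on each subset
    majority = Counter(labels).most_common(1)[0][0]
    attr, partition = choose_split(records, labels)
    if attr is None:                       # no attribute left to split on
        return {"leaf": majority}
    node = {"split on": attr, "children": {}}
    for value, idx in partition.items():
        node["children"][value] = hunts_algorithm(
            [records[i] for i in idx], [labels[i] for i in idx],
            default_class=majority)
    return node

# Tiny hypothetical example
tree = hunts_algorithm(
    [{"Refund": "Yes", "Marital": "Single"},
     {"Refund": "No", "Marital": "Married"},
     {"Refund": "No", "Marital": "Single"}],
    ["No", "No", "Yes"])
print(tree)
```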
General Structure of Hunt's Algorithm

Start with a single node; we will assume that Refund is the best criterion for splitting the data:

Refund?
  Yes -> Don't Cheat
  No  -> ?    (Dt = the records with Refund = No)

Hunt's algorithm is then applied recursively to each child of the root node.

Exercise: Continue the algorithm

Hunt's Algorithm on the Refund / Marital Status / Taxable Income / Cheat data:

Step 1:
Refund?
  Yes -> Don't Cheat
  No  -> ?

Step 2 (split the Refund = No records on Marital Status):
Refund?
  Yes -> Don't Cheat
  No  -> Marital Status?
           Single, Divorced -> ?
           Married          -> Don't Cheat

Step 3 (split the Single/Divorced records on Taxable Income):
Refund?
  Yes -> Don't Cheat
  No  -> Marital Status?
           Single, Divorced -> Taxable Income?
                                 < 80K  -> Don't Cheat
                                 >= 80K -> Cheat
           Married          -> Don't Cheat
Handling Missing Attribute Values

Missing values affect decision tree construction in three different ways:
  They affect how impurity measures are computed.
  They affect how an instance with a missing value is distributed to child nodes.
  They affect how a test instance with a missing value is classified.
Tree Induction

Greedy strategy:
  Split the records based on an attribute test that optimizes a certain criterion.

Issues:
  Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
  Determine when to stop splitting
  Determine how to cut back if the tree is too deep
    What is wrong with a tree that is too deep?
How to Specify the Test Condition?

Depends on the attribute type:
  Nominal
  Ordinal
  Continuous

Depends on the number of ways to split:
  2-way split
  Multi-way split
Splitting Based on Ordinal Attributes

Multi-way split: use as many partitions as distinct values.
  Size: Small / Medium / Large

Binary split: divides values into two subsets; need to find the optimal partitioning.
  Size: {Small, Medium} vs {Large}    OR    {Medium, Large} vs {Small}

  What about Size: {Small, Large} vs {Medium}? (This grouping does not preserve the order of the values.)
Splitting Based on Continuous Attributes

Different ways of handling:
  Discretization to form an ordinal categorical attribute
    Static: discretize once at the beginning
    Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  Binary decision: (A < v) or (A >= v)
    Consider all possible splits and find the best cut
    Can be more compute intensive
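A small sketch of static discretization with pandas, using the Taxable Income values from the example training data (pandas is an assumption; any equal-width / equal-frequency binning routine would do):

```python
import pandas as pd

# Taxable Income values from the example training data
income = pd.Series([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])

# Equal-interval bucketing: 4 bins of equal width over the value range
equal_width = pd.cut(income, bins=4)

# Equal-frequency bucketing (percentiles): 4 bins with roughly equal counts
equal_freq = pd.qcut(income, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```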
Splitting Based on Continuous Attributes

(i) Binary split:
  Taxable Income > 80K?  ->  Yes / No

(ii) Multi-way split:
  Taxable Income?  ->  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

Tree Induction

Greedy strategy:
  Split the records based on an attribute test that optimizes a certain criterion.

Issues:
  Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
  Determine when to stop splitting
How to Determine the Best Split

Before splitting: 10 records of class C0, 10 records of class C1.

Three candidate test conditions:

Own Car?      Yes: C0 6, C1 4          No: C0 4, C1 6
Car Type?     Family: C0 1, C1 3       Sports: C0 8, C1 0       Luxury: C0 1, C1 7
Student ID?   c1 .. c10: C0 1, C1 0 each        c11 .. c20: C0 0, C1 1 each

Which test condition is the best?
How to Determine the Best Split

Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity:

  C0: 5, C1: 5  ->  non-homogeneous, high degree of impurity
  C0: 9, C1: 1  ->  homogeneous, low degree of impurity
How to Find the Best Split (let M be the impurity measure)

Before splitting, compute the impurity of the parent node: M0 (class counts N00, N01).

Candidate split A? (Yes/No) produces nodes N1 and N2 with impurities M1 and M2; their weighted combination is M12.
Candidate split B? (Yes/No) produces nodes N3 and N4 with impurities M3 and M4; their weighted combination is M34.

Compare Gain = M0 - M12 vs. M0 - M34 and choose the split with the larger gain.
Measures of Node Impurity

Entropy
Gain
Gini Index
Classification Error
Entropy-Based Evaluation and Splitting

Entropy at a given node t:

  Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)

Measures the impurity of a node:
  Maximum (log2 nc), where nc = the number of classes, when records are equally distributed among all classes, implying least information.
  Minimum (0) when all records belong to one class, implying most information.
Examples for computing Entropy
Entropy (t ) p ( j | t ) log p ( j | t )
j 2

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Entropy = 0 log 0 1 log 1 = 0 0 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Entropy = (1/6) log2 (1/6) (5/6) log2 (1/6) = 0.65

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Entropy = (2/6) log2 (2/6) (4/6) log2 (4/6) = 0.92
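A quick sketch to reproduce these numbers (plain Python, standard library only):

```python
from math import log2

def entropy(counts):
    """Entropy of a node given the class counts [C1, C2, ...] at that node."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]   # 0 * log 0 is treated as 0
    return -sum(p * log2(p) for p in probs)

print(entropy([0, 6]))   # 0.0
print(entropy([1, 5]))   # ~0.650
print(entropy([2, 4]))   # ~0.918 (0.92 after rounding)
```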
Splitting Based on INFO...

Information Gain:

  GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i.

Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
Used in ID3 and C4.5.
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO...

Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = - Σ_{i=1..k} (n_i / n) log2 (n_i / n)

Parent node p is split into k partitions; n_i is the number of records in partition i.

Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
Used in C4.5.
Designed to overcome the disadvantage of Information Gain.
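A sketch computing both criteria for one candidate split; the example reuses the Own Car? counts from the earlier "best split" slide (parent 10/10, children 6/4 and 4/6), and log base 2 is assumed throughout:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children):
    """GAIN_split = Entropy(parent) - weighted entropy of the child partitions."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent_counts) - weighted

def gain_ratio(parent_counts, children):
    """GainRATIO_split = GAIN_split / SplitINFO."""
    n = sum(parent_counts)
    split_info = -sum(sum(child) / n * log2(sum(child) / n) for child in children)
    return information_gain(parent_counts, children) / split_info

parent = [10, 10]              # 10 records of C0, 10 of C1
own_car = [[6, 4], [4, 6]]     # Yes branch, No branch
print(information_gain(parent, own_car))   # ~0.029: a weak split
print(gain_ratio(parent, own_car))         # ~0.029: SplitINFO = 1 here
```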
Measure of Impurity: GINI

Gini index for a given node t:

  GINI(t) = 1 - Σ_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most interesting information.

  C1 = 0, C2 = 6:  Gini = 0.000
  C1 = 1, C2 = 5:  Gini = 0.278
  C1 = 2, C2 = 4:  Gini = 0.444
  C1 = 3, C2 = 3:  Gini = 0.500
Examples for Computing GINI

  GINI(t) = 1 - Σ_j [p(j|t)]^2

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
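The same calculations as a small Python sketch (standard library only):

```python
def gini(counts):
    """Gini index of a node given the class counts [C1, C2, ...] at that node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))   # 0.0
print(gini([1, 5]))   # ~0.278
print(gini([2, 4]))   # ~0.444
```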
Splitting Based on GINI (Overall Gini)

Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the quality of the split is computed as:

  GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)

where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing the GINI Index

Splits into two partitions.
Effect of weighting partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500

Split on B? (Yes -> Node N1, No -> Node N2):

        N1   N2
  C1     5    1
  C2     2    4

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
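A sketch checking the weighted split quality for the B? example above (the `gini` helper is repeated so the block stands alone):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """GINI_split = sum over children of (n_i / n) * GINI(child i)."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

parent = [6, 6]                 # C1 = 6, C2 = 6 at the parent
children = [[5, 2], [1, 4]]     # N1 (Yes branch) and N2 (No branch)

print(gini(parent))                              # 0.5
print([round(gini(c), 3) for c in children])     # [0.408, 0.32]
print(round(gini_split(children), 3))            # 0.371
```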
Multi-way Splits: Computing the Gini Index

For each distinct value, gather the counts for each class in the dataset.
Use the count matrix to make decisions.

Multi-way split:
  CarType:   Family   Sports   Luxury
  C1              1        2        1
  C2              4        1        1
  Gini = 0.393

Two-way split (find the best partition of values):
  CarType:   {Sports, Luxury}   {Family}
  C1                        3          1
  C2                        2          4
  Gini = 0.400

  CarType:   {Sports}   {Family, Luxury}
  C1                2                  2
  C2                1                  5
  Gini = 0.419
Continuous Attributes: Computing the Gini Index

Use binary decisions based on one value, e.g. Taxable Income > 80K? (Yes / No), on the Refund / Marital Status / Taxable Income / Cheat data from before.

Several choices for the splitting value:
  Number of possible splitting values = number of distinct values.
Each splitting value v has a count matrix associated with it:
  Class counts in each of the partitions, A < v and A >= v.

Simple method to choose the best v:
  For each v, scan the database to gather the count matrix and compute its Gini index.
  Computationally inefficient! Repetition of work.
Continuous Attributes: Computing the Gini Index...

For efficient computation, for each attribute:
  Sort the attribute on its values.
  Linearly scan these values, each time updating the count matrix and computing the Gini index.
  Choose the split position that has the least Gini index.

Cheat:                     No    No    No    Yes   Yes   Yes   No    No    No    No
Taxable Income (sorted):   60    70    75    85    90    95    100   120   125   220

Split positions:  55     65     72     80     87     92     97     110    122    172    230
Yes (<=, >):      0,3    0,3    0,3    0,3    1,2    2,1    3,0    3,0    3,0    3,0    3,0
No  (<=, >):      0,7    1,6    2,5    3,4    3,4    3,4    3,4    4,3    5,2    6,1    7,0
Gini:             0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The minimum Gini (0.300) is obtained at the split Taxable Income <= 97.
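A sketch of the scan over the sorted Taxable Income values from the slide (plain Python; for clarity it recomputes the counts at each candidate position instead of updating the count matrix incrementally, so it is less efficient than the procedure described above):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Sorted values, their labels, and the candidate split positions from the slide
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
splits = [55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230]

n = len(income)
for v in splits:
    left  = [c for x, c in zip(income, cheat) if x <= v]    # partition A <= v
    right = [c for x, c in zip(income, cheat) if x > v]     # partition A > v
    g = sum(len(part) / n * gini([part.count("Yes"), part.count("No")])
            for part in (left, right) if part)
    print(f"split at {v}: Gini = {g:.3f}")
# The minimum, 0.300, occurs at Taxable Income <= 97
```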
Classification Error

Classification error at a node t:

  Error(t) = 1 - max_i P(i|t)

Measures the misclassification error made by a node:
  Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information.
  Minimum (0.0) when all records belong to one class, implying most interesting information.
Examples for Computing Error
Error (t ) 1 max P(i | t )
i

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Error = 1 max (0, 1) = 1 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Error = 1 max (1/6, 5/6) = 1 5/6 = 1/6

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Error = 1 max (2/6, 4/6) = 1 4/6 = 1/3
Tree Induction

Greedy strategy:
  Split the records based on an attribute test that optimizes a certain criterion.

Issues:
  Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
  Determine when to stop splitting
Stopping Criteria for Tree Induction

Stop expanding a node when all the records belong to the same class

Stop expanding a node when all the records have similar attribute values
Exercise: from Introduction to Data Mining

  GINI(t) = 1 - Σ_j [p(j|t)]^2

Customer ID  Gender  Car Type  Shirt Size   Class
1            M       Family    Small        C0
2            M       Sports    Medium       C0
3            M       Sports    Medium       C0
4            M       Sports    Large        C0
5            M       Sports    Extra Large  C0
6            M       Sports    Extra Large  C0
7            F       Sports    Small        C0
8            F       Sports    Small        C0
9            F       Sports    Medium       C0
10           F       Luxury    Large        C0
11           M       Family    Large        C1
12           M       Family    Extra Large  C1
13           M       Family    Medium       C1
14           M       Luxury    Extra Large  C1
15           F       Luxury    Small        C1
16           F       Luxury    Small        C1
17           F       Luxury    Medium       C1
18           F       Luxury    Medium       C1
19           F       Luxury    Medium       C1
20           F       Luxury    Large        C1

Compute the Gini index for the Customer ID attribute.
Compute the Gini index for the Gender attribute.
Exercise: from Introduction to Data Mining book

(Same customer data as on the previous slide.)

Compute the Gini index for the Customer ID attribute.
Answer: the Gini for each Customer ID value is 0 (each value holds a single record, which is trivially pure). Therefore, the overall Gini for Customer ID is 0.

Compute the Gini index for the Gender attribute.
Answer: the Male partition holds 6 C0 and 4 C1 records, so its Gini is 1 - (6/10)^2 - (4/10)^2 = 0.48; the Female partition (4 C0, 6 C1) also has Gini 0.48. Therefore, the overall Gini for Gender is 0.5 * 0.48 + 0.5 * 0.48 = 0.48.
Exercise: from Introduction to Data Mining book

  GINI(t) = 1 - Σ_j [p(j|t)]^2        GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)

(Same customer data as on the previous slides.)

Compute the Gini index for the Car Type attribute using a multiway split.
Compute the Gini index for the Shirt Size attribute using a multiway split.
Which attribute is better: Gender, Car Type, or Shirt Size? Answer: Car Type.
Exercise: from Introduction to Data Mining book

Compute the Gini index for the Car Type attribute using a multiway split.
Answer: the Gini for Family car is 0.375, for Sports car is 0, and for Luxury car is 0.2188. The overall Gini is 0.1625.

Compute the Gini index for the Shirt Size attribute using a multiway split.
Answer: the Gini for Small shirt size is 0.48, Medium shirt size is 0.4898, Large shirt size is 0.5, and Extra Large shirt size is 0.5. The overall Gini for the Shirt Size attribute is 0.4914.

Which attribute is better: Gender, Car Type, or Shirt Size? Answer: Car Type, since it has the lowest overall Gini.
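A quick sketch verifying these exercise answers (standard library only; the [C0, C1] counts per attribute value are read off the customer table):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

gender     = [[6, 4], [4, 6]]                   # Male, Female
car_type   = [[1, 3], [8, 0], [1, 7]]           # Family, Sports, Luxury
shirt_size = [[3, 2], [3, 4], [2, 2], [2, 2]]   # Small, Medium, Large, Extra Large

print(round(gini_split(gender), 4))       # 0.48
print(round(gini_split(car_type), 4))     # 0.1625
print(round(gini_split(shirt_size), 4))   # 0.4914 -> Car Type has the lowest Gini
```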
Decision Tree Based Classification

Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for many simple data
sets
