Chap4 - Basic - Classification-Admin and Economy

Data Mining
Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4
Introduction to Data Mining

by
Tan, Steinbach, Kumar
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Classification: Definition
• Given a collection of records (the training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
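
The division into training and test sets described above can be sketched as follows. This is a minimal illustration that assumes a simple random holdout; the function name `train_test_split`, the 30% test fraction, and the seed are hypothetical choices, not something the slides prescribe.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and return (training, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible (assumption: random holdout).
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Ten record ids stand in for the ten labeled records of a data set.
records = list(range(1, 11))
training, test = train_test_split(records, test_fraction=0.3, seed=42)
print(len(training), len(test))
```

The model is built from `training` only; `test` is held back so that the accuracy estimate comes from records the model has not seen.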

Illustrating Classification Task
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: Training Set -> Learning algorithm -> Learn Model (induction) -> Model -> Apply Model (deduction) -> Test Set]

Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines

Example of a Decision Tree
Training Data
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO

Another Example of Decision Tree
(Training data: the same ten records as in the previous example.)

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

[Figure: the training set (Tid 1 to 10) is fed to a tree induction algorithm, which learns the model, a decision tree (induction); the decision tree is then applied to the test set, Tid 11 to 15 (deduction).]

Apply Model to Test Data
Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and, at each node, follow the branch that matches the record:

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO

1. Refund = No: take the "No" branch to the MarSt node.
2. MarSt = Married: take the "Married" branch, which ends in the leaf NO.
3. Assign Cheat to “No”.
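
The traversal above can be sketched in a few lines. This is a minimal illustration, assuming the record is a dictionary keyed by the node names used in the tree and that income is given in thousands; the function name `classify` is hypothetical, and placing exactly 80K on the upper branch is an assumption, since the figure labels the branches only as "< 80K" and "> 80K".

```python
def classify(record):
    """Walk the example tree from the root and return the predicted Cheat label."""
    # Root test: Refund.
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: test marital status next.
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # Assumption: a value of exactly 80 goes to the upper branch.
    return "No" if record["TaxInc"] < 80 else "Yes"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))  # prints: No
```

The test record never reaches the TaxInc node, so the boundary assumption does not affect its label.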

General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t
• General Procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the default class, yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

[Figure: the ten training records from the earlier example (Tid, Refund, Marital Status, Taxable Income, Cheat) form Dt at the root; the attribute test to apply is still undetermined ("?").]
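
The three cases of the general procedure can be sketched recursively. This is a minimal illustration under stated assumptions, not the textbook's implementation: the function name `hunt` is hypothetical, attributes are tested in the order given rather than by a best-split criterion, Taxable Income is pre-binarized at 80K, and a node with mixed classes but no attributes left falls back to the majority class.

```python
from collections import Counter

def hunt(records, attributes, default_class, target="Cheat"):
    """Grow a tree with Hunt's procedure; returns a class label or a nested dict."""
    # Empty Dt: leaf labeled with the default class.
    if not records:
        return default_class
    labels = [r[target] for r in records]
    # All records in Dt share one class: leaf labeled with that class.
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # Assumption: with no attribute left to test, use the majority class.
    if not attributes:
        return majority
    # Mixed classes: split on an attribute test and recurse on each subset.
    # Assumption: take the next attribute in the list (no best-split search).
    attr, rest = attributes[0], attributes[1:]
    branches = {}
    for value in sorted({r[attr] for r in records}):
        subset = [r for r in records if r[attr] == value]
        branches[value] = hunt(subset, rest, majority, target)
    return {attr: branches}

# The ten training records (Refund, Marital Status, Taxable Income in K, Cheat).
rows = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
records = [
    {"Refund": r, "MarSt": m, "TaxInc": "<80K" if t < 80 else ">=80K", "Cheat": c}
    for r, m, t, c in rows
]
tree = hunt(records, ["Refund", "MarSt", "TaxInc"], default_class="No")
print(tree)
```

Because this sketch gives each marital status its own branch, the Divorced branch becomes a pure leaf at once, so the resulting tree is shaped slightly differently from the one in the slides while still fitting the same ten records.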

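Hunt’s Algorithm
[Figure: the tree grown in four steps on the training data]

Step 1: a single leaf
  Don’t Cheat

Step 2: split on Refund
  Refund?
    Yes -> Don’t Cheat
    No  -> Don’t Cheat

Step 3: split the Refund = No branch on Marital Status
  Refund?
    Yes -> Don’t Cheat
    No  -> Marital Status?
             Single, Divorced -> Cheat
             Married -> Don’t Cheat

Step 4: split the Single, Divorced branch on Taxable Income
  Refund?
    Yes -> Don’t Cheat
    No  -> Marital Status?
             Single, Divorced -> Taxable Income?
                                   < 80K -> Don’t Cheat
                                   >= 80K -> Cheat
             Married -> Don’t Cheat
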
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.
• Issues
– Determine how to split the records
  · How to specify the attribute test condition?
  · How to determine the best split?
– Determine when to stop splitting

How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split
– 2-way split
– Multi-way split

Splitting Based on Nominal Attributes
• Multi-way split: Use as many partitions as distinct values.
    CarType -> Family | Sports | Luxury
• Binary split: Divides values into two subsets. Need to find optimal partitioning.
    CarType -> {Sports, Luxury} | {Family}
    OR
    CarType -> {Family, Luxury} | {Sports}
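
Finding the optimal binary partitioning means considering every way to divide the distinct values into two non-empty subsets. A minimal sketch of that enumeration follows; the function name `binary_partitions` is hypothetical, and choosing among the candidates would still need a split criterion, which is not shown here.

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each way to divide the distinct values into two non-empty subsets."""
    values = sorted(set(values))
    if len(values) < 2:
        return  # nothing to split
    first, rest = values[0], values[1:]
    # Keep `first` on the left side so that mirror-image splits are not repeated.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            yield left, right

parts = list(binary_partitions(["Family", "Sports", "Luxury"]))
for left, right in parts:
    print(sorted(left), "|", sorted(right))
```

For k distinct values there are 2^(k-1) - 1 such partitions, so CarType with three values yields three candidates: the two shown above plus {Family, Sports} | {Luxury}.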

Splitting Based on Ordinal Attributes
• Multi-way split: Use as many partitions as distinct values.
    Size -> Small | Medium | Large
• Binary split: Divides values into two subsets. Need to find optimal partitioning.
    Size -> {Small, Medium} | {Large}
    OR
    Size -> {Medium, Large} | {Small}
• What about this split?
    Size -> {Small, Large} | {Medium}

Splitting Based on Continuous Attributes
• Different ways of handling
– Discretization to form an ordinal categorical attribute
  · Static – discretize once at the beginning
  · Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
– Binary Decision: (A < v) or (A ≥ v)
  · considers all possible splits and finds the best cut
  · can be more compute intensive

Splitting Based on Continuous Attributes
(i) Binary split
    Taxable Income > 80K?
      Yes | No
(ii) Multi-way split
    Taxable Income?
      < 10K | [10K,25K) | [25K,50K) | [50K,80K) | > 80K
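
For the binary decision (A < v) or (A ≥ v), "considering all possible splits" can be sketched as follows. This is a minimal illustration that assumes candidate cut points are the midpoints between consecutive distinct values, which is one common convention rather than something the slides specify; the function names are hypothetical, and the incomes and labels are the Taxable Income and Cheat columns of the training data.

```python
from collections import Counter

def candidate_cuts(values):
    """Return the midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

def split_counts(values, labels, v):
    """Count class labels on each side of the binary test (A < v) / (A >= v)."""
    below, at_or_above = Counter(), Counter()
    for a, label in zip(values, labels):
        (below if a < v else at_or_above)[label] += 1
    return dict(below), dict(at_or_above)

# Taxable Income (in K) and Cheat for Tid 1 to 10.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

cuts = candidate_cuts(incomes)
print(cuts)
print(split_counts(incomes, cheat, 80))
```

Every candidate cut requires a pass over the records to tally the classes on each side, which is why this approach can be more compute intensive than discretizing once.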

How to determine the Best Split
Before Splitting: 10 records of class 0, 10 records of class 1

Own Car?
  Yes: C0: 6, C1: 4
  No:  C0: 4, C1: 6

Car Type?
  Family: C0: 1, C1: 3
  Sports: C0: 8, C1: 0
  Luxury: C0: 1, C1: 7

Student ID?
  c1:  C0: 1, C1: 0
  ...
  c10: C0: 1, C1: 0
  c11: C0: 0, C1: 1
  ...
  c20: C0: 0, C1: 1

Which test condition is the best?
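
One rough way to compare the three test conditions is to count how many records would be misclassified if each child node predicted its own majority class. This criterion is an assumption made for illustration only; the slides above do not name a measure, and the function name `majority_error` is hypothetical.

```python
def majority_error(children):
    """Fraction of records misclassified when every child predicts its majority class."""
    total = sum(c0 + c1 for c0, c1 in children)
    wrong = sum(min(c0, c1) for c0, c1 in children)
    return wrong / total

# (C0, C1) counts for each child node, copied from the figure.
splits = {
    "Own Car": [(6, 4), (4, 6)],
    "Car Type": [(1, 3), (8, 0), (1, 7)],
    "Student ID": [(1, 0)] * 10 + [(0, 1)] * 10,
}
for name, children in splits.items():
    print(name, majority_error(children))
```

By this purity-only measure Student ID looks best because every child holds a single record, yet an identifier that is unique to each record cannot help classify unseen records, so purity alone does not settle the question.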

Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for many simple data sets

Model Evaluation
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
• Methods for Performance Evaluation
– How to obtain reliable estimates?
• Methods for Model Comparison
– How to compare the relative performance among competing models?

Metrics for Performance Evaluation
• Focus on the predictive capability of a model
– Rather than how long it takes to classify or build models, scalability, etc.
• Confusion Matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

Metrics for Performance Evaluation…
• Most widely-used metric (a, b, c, d are the confusion matrix cells TP, FN, FP, TN):

\[ \text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN} \]
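
The accuracy formula can be checked on a small example. This is a minimal sketch; the function names and the eight actual/predicted labels are made up for illustration and do not come from the slides.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (TP, FN, FP, TN), i.e. the cells a, b, c, d of the confusion matrix."""
    tp = fn = fp = tn = 0
    for y, y_hat in zip(actual, predicted):
        if y == positive and y_hat == positive:
            tp += 1
        elif y == positive:
            fn += 1
        elif y_hat == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + fn + fp + tn)

# Hypothetical labels for eight test records.
actual    = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "No",  "No", "No", "No", "Yes", "No"]

counts = confusion_counts(actual, predicted)
print(counts)             # prints: (2, 1, 1, 4)
print(accuracy(*counts))  # prints: 0.75
```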
