DATA ANALYTICS
[Figure: the trained Classifier is applied to Testing Data and then to Unseen Data, e.g. the unseen case (Jeff, Professor, 4): Tenured?]

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
EXAMPLES OF CLASSIFICATION
• An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.
• A decision is needed: whether to put a new patient in an
intensive-care unit.
• Due to the high cost of ICU, those patients who may survive less
than a month are given higher priority.
• Problem: to predict high-risk patients and discriminate them
from low-risk patients.
• Other applications:
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment,
sports, etc.
THE DATA AND THE GOAL
• Data: A set of data records (also called examples, instances, or cases) described by
• k attributes: A1, A2, …, Ak
• a class: each example is labelled with a pre-defined class (the class attribute)
• The class attribute takes one of n discrete values, n >= 2 (see the R sketch after this list)
• Goal: To learn a classification model from the data that
can be used to predict the classes of new (future, or
test) cases/instances.
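A small R sketch of such labelled records, using the tenure table from the opening figure (staff is an illustrative name, not from the slides):

# Four labelled examples: three attributes plus the class attribute TENURED.
staff <- data.frame(
  NAME    = c("Tom", "Merlisa", "George", "Joseph"),
  RANK    = c("Assistant Prof", "Associate Prof", "Professor", "Assistant Prof"),
  YEARS   = c(2, 7, 5, 7),
  TENURED = factor(c("no", "no", "yes", "yes"))  # n = 2 discrete class values
)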
[Figure: Training Data (attributes Tid, Refund, Marital Status, Taxable Income; class Cheat) and the Model: Decision Tree induced from it; internal nodes test the splitting attributes, leaves hold class labels]
USE THE DECISION TREE
Test Data (Tid 100): Refund = No, Taxable Income = 65K, Cheat = ?

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married          -> NO

With Refund = No and Taxable Income = 65K < 80K, the tree predicts Cheat = NO on either MarSt branch.
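A minimal R sketch that hard-codes the tree above and applies it to a record; classify and its argument names are illustrative, and the test record's marital status is an assumption in the example call:

# Walk the decision tree from the figure above.
classify <- function(refund, marital_status, taxable_income) {
  if (refund == "Yes") return("NO")              # Refund = Yes -> NO
  if (marital_status == "Married") return("NO")  # Married -> NO
  if (taxable_income < 80) "NO" else "YES"       # Single/Divorced: test TaxInc
}
classify("No", "Single", 65)  # -> "NO"; with income 65K the answer is NO either way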
Note: decision tree induction algorithms are heuristic; there could be more than one tree that fits the same data!
ALGORITHM FOR DECISION TREE INDUCTION
• Basic algorithm (a greedy algorithm; a code sketch follows this list)
• Tree is constructed in a top-down, recursive, divide-and-conquer manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
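A minimal R sketch of this greedy, recursive procedure under the assumptions above (categorical attributes, information gain as the selection measure); class_entropy, info_gain, grow_tree, and class_col are illustrative names, not from any library:

# Entropy of a vector of class labels (0 * log2(0) taken as 0).
class_entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Gain(A) = Info(D) - Info_A(D) when splitting data on attribute attr.
info_gain <- function(data, attr, class_col) {
  parts  <- split(data, data[[attr]])
  info_a <- sum(sapply(parts, function(dj)
    nrow(dj) / nrow(data) * class_entropy(dj[[class_col]])))
  class_entropy(data[[class_col]]) - info_a
}

# Top-down, recursive, divide-and-conquer tree construction.
grow_tree <- function(data, attrs, class_col) {
  y <- data[[class_col]]
  if (length(unique(y)) == 1)         # stop: all samples in one class
    return(as.character(y[1]))
  if (length(attrs) == 0)             # stop: no attributes left -> majority vote
    return(names(which.max(table(y))))
  gains <- sapply(attrs, function(a) info_gain(data, a, class_col))
  best  <- attrs[which.max(gains)]    # heuristic: highest information gain
  node  <- list()
  for (v in unique(data[[best]]))     # partition on the selected attribute
    node[[paste(best, v, sep = " = ")]] <-
      grow_tree(data[data[[best]] == v, , drop = FALSE],
                setdiff(attrs, best), class_col)
  node
}

With the PlayTennis table from the illustration example below loaded as a data frame play (a hypothetical name), grow_tree(play, c("Outlook", "Temperature", "Humidity", "Wind"), "PlayTennis") reproduces the tree derived there.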
DECISION TREE INDUCTION
• Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ, SPRINT
• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
HOW TO SPECIFY TEST CONDITION?
• Depends on attribute types
• Nominal
• Ordinal
• Continuous
Binary splits of the nominal attribute CarType group its values, e.g.:
  {Sports, Luxury} vs. {Family}   OR   {Family, Luxury} vs. {Sports}
• What about this split of the ordinal attribute Size: {Small, Large} vs. {Medium}? It violates the order of the values, so it is not a valid ordinal grouping.
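A short R sketch of such a grouping split on the nominal attribute CarType (the data frame cars is illustrative):

cars  <- data.frame(CarType = c("Sports", "Family", "Luxury", "Family"))
left  <- cars[ cars$CarType %in% c("Sports", "Luxury"), , drop = FALSE]  # {Sports, Luxury}
right <- cars[!cars$CarType %in% c("Sports", "Luxury"), , drop = FALSE]  # {Family}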
SPLITTING BASED ON CONTINUOUS ATTRIBUTES
• Different ways of handling
• Discretization to form an ordinal categorical
attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.
[Figure: two splits on Taxable Income: a binary split ("Taxable Income > 80K?" with branches Yes / No) and a multi-way split into ranges from < 10K up to > 80K]
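A short R sketch of these options (the income vector is illustrative):

income <- c(60, 70, 75, 85, 90, 95, 100, 120, 125, 220)
equal_width <- cut(income, breaks = 3)    # equal-interval bucketing
equal_freq  <- cut(income,                # equal-frequency bucketing (percentiles)
                   breaks = quantile(income, probs = seq(0, 1, length.out = 4)),
                   include.lowest = TRUE)
over_80k    <- income > 80                # binary split: Taxable Income > 80K?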
• Entropy of a discrete random variable Y with values y_1, …, y_n:
  H(Y) = −∑_{j=1}^{n} P(Y = y_j) log2 P(Y = y_j), where ∑_{j=1}^{n} P(Y = y_j) = 1
• Define 0 * log2(0) = 0
• Interpretation: higher entropy -> higher uncertainty
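A quick R illustration of this interpretation, with entropy defined on a vector of class probabilities:

entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }  # 0 * log2(0) taken as 0
entropy(c(0.5, 0.5))  # 1 bit:      maximal uncertainty over two classes
entropy(c(0.9, 0.1))  # 0.469 bits: less uncertainty
entropy(c(1.0, 0.0))  # 0 bits:     no uncertainty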
ATTRIBUTE SELECTION MEASURE: INFORMATION GAIN (ID3/C4.5)
• Select the attribute with the highest information gain
• Let p_i = P(Y = y_i) be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = −∑_{i=1}^{m} p_i log2(p_i)
• Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = ∑_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
• Information gained by branching on attribute A:
  Gain(A) = Info(D) − Info_A(D)
AN ILLUSTRATIVE EXAMPLE
Training examples
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rainy Mild High Strong No
Entropy(S) = Entropy([9+, 5-]) = 0.940

Humidity:    S_High = [3+, 4-], entropy 0.985;  S_Normal = [6+, 1-], entropy 0.592
             Gain(S, Humidity) = .940 - (7/14).985 - (7/14).592 = 0.151
Outlook:     S_Sunny = [2+, 3-], entropy 0.97095;  S_Overcast = [4+, 0-], entropy 0;  S_Rainy = [3+, 2-], entropy 0.97095
             Gain(S, Outlook) = .940 - (5/14).97095 - (4/14)0 - (5/14).97095 = 0.2467
Wind:        S_Weak = [6+, 2-], entropy 0.811;  S_Strong = [3+, 3-], entropy 1
             Gain(S, Wind) = .940 - (8/14).811 - (6/14)1.0 = 0.048
Temperature: S_Hot = [2+, 2-], entropy 1;  S_Mild = [4+, 2-], entropy 0.91826;  S_Cool = [3+, 1-], entropy 0.811
             Gain(S, Temperature) = .940 - (4/14)1 - (6/14).91826 - (4/14).811 = 0.029

Outlook has the highest information gain, so it is selected as the root test attribute.
AN ILLUSTRATIVE EXAMPLE (CONT'D.)
S = {D1, D2, …, D14}, [9+, 5-]

Outlook?
  Sunny    -> S_Sunny    = {D1, D2, D8, D9, D11},  [2+, 3-], entropy 0.97095 -> ?
  Overcast -> S_Overcast = {D3, D7, D12, D13},     [4+, 0-], entropy 0       -> Yes
  Rainy    -> S_Rainy    = {D4, D5, D6, D10, D14}, [3+, 2-], entropy 0.97095 -> ?

Therefore, Humidity is chosen as the next test attribute for the left (Sunny) branch; the Rainy branch is split on Wind. The resulting leaves:
  Sunny, Humidity = High   -> {D1, D2, D8},  [0+, 3-] -> No
  Sunny, Humidity = Normal -> {D9, D11},     [2+, 0-] -> Yes
  Rainy, Wind = Strong     -> {D6, D14},     [0+, 2-] -> No
  Rainy, Wind = Weak       -> {D4, D5, D10}, [3+, 0-] -> Yes
library("rpart")
library("rpart.plot")
data("iris")
str(iris) # explore data set structure
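The snippet stops after exploring the data; a natural continuation (a sketch, not part of the original code) fits and inspects a classification tree with rpart:

fit <- rpart(Species ~ ., data = iris, method = "class")  # induce a classification tree
rpart.plot(fit)  # draw the tree
printcp(fit)     # complexity table: splits vs. cross-validated error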