
INTRODUCTION TO DATA ANALYTICS

XIAOFENG ZHOU | ITEC3040

CLASSIFICATION
• Basic Concept
• Decision Tree Induction
• Bayes Classification Method
• k-NN Method
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy



SUPERVISED VS. UNSUPERVISED
LEARNING
• Supervised learning (classification)
• Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
FUNDAMENTAL ASSUMPTION OF
LEARNING
Assumption: The distribution of training examples is
identical to the distribution of test examples (including
future unseen examples).

• In practice, this assumption is often violated to a certain degree.
• Strong violations will clearly result in poor classification accuracy.
• To achieve good accuracy on the test data, training
examples must be sufficiently representative of the test
data.



CLASSIFICATION: DEFINITION
• Given a collection of records (training set)
• Each record contains a set of attributes, one of the
attributes is the class.
• Find a model for class attribute as a function of the
values of other attributes.
• Goal: previously unseen records should be assigned a
class as accurately as possible.
• A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with the training set used to
build the model and the test set used to validate it.



CLASSIFICATION—A TWO-STEP
PROCESS
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined
by the class label attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of each test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• The test set is independent of the training set (otherwise the accuracy estimate is inflated by overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set



PROCESS (1): MODEL CONSTRUCTION
Training Data:

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

Classification algorithm  ->  Classifier (Model):

  IF rank = 'professor' OR years > 6
  THEN tenured = 'yes'
PROCESS (2): USING THE MODEL IN
PREDICTION

Testing Data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen Data:  (Jeff, Professor, 4)  ->  Classifier  ->  Tenured?
EXAMPLES OF CLASSIFICATION
• An emergency room in a hospital measures 17 variables (e.g.,
blood pressure, age, etc.) of newly admitted patients.
• A decision is needed: whether to put a new patient in an
intensive-care unit.
• Due to the high cost of ICU, those patients who may survive less
than a month are given higher priority.
• Problem: to predict high-risk patients and discriminate them
from low-risk patients.
• Other applications:
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-
sheet, or random coil
• Categorizing news stories as finance, weather, entertainment,
sports, etc.
THE DATA AND THE GOAL
• Data: A set of data records (also called examples,
instances or cases) described by
• k attributes: A1, A2, … Ak.
• a class: Each example is labelled with a pre-defined
class – class attribute
• The class attribute has a set of n discrete values, with n ≥ 2
• Goal: To learn a classification model from the data that
can be used to predict the classes of new (future, or
test) cases/instances.



Table from Web Data Mining by Bing Liu (table not reproduced here)
DECISION TREE INDUCTION
• Decision tree induction is one of the most widely used techniques for classification.
  • Its classification accuracy is competitive with other methods, and
  • it is very efficient.
• The classification model is a tree, called a decision tree.
• How does a decision tree work?



EXAMPLE OF A DECISION TREE

Training Data:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes label the internal nodes; class labels appear at the leaves)

  Refund?
  ├─ Yes -> NO
  └─ No  -> MarSt?
            ├─ Single, Divorced -> TaxInc?
            │                      ├─ < 80K -> NO
            │                      └─ > 80K -> YES
            └─ Married -> NO
USE THE DECISION TREE
Test Data:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  100  No      Yes             65K             NO

Start from the root of the tree and follow the matching branches down to a leaf; the leaf gives the predicted class.

  Refund?
  ├─ Yes -> NO
  └─ No  -> MarSt?
            ├─ Single, Divorced -> TaxInc?
            │                      ├─ < 80K -> NO
            │                      └─ > 80K -> YES
            └─ Married -> NO
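To make the traversal concrete, here is a small hedged sketch (my own illustration, not from the slides) that encodes this particular tree as nested conditionals; the example call uses Refund = No, a 65K income, and an assumed marital status of Single:

```r
# Hand-coded version of the example tree (illustration only; real decision-tree
# software stores the model as a data structure rather than if/else code).
classify_cheat <- function(refund, marital_status, taxable_income) {
  if (refund == "Yes") return("No")                 # left branch: Refund = Yes -> NO
  if (marital_status == "Married") return("No")     # MarSt = Married -> NO
  if (taxable_income < 80) "No" else "Yes"          # income in thousands (K)
}

classify_cheat("No", "Single", 65)   # "No": Refund = No, MarSt = Single, TaxInc < 80K
```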


IS THE DECISION TREE UNIQUE?
• No. Here is a simpler tree that fits the same data:

    MarSt?
    ├─ Married -> NO
    └─ Single, Divorced -> Refund?
                           ├─ Yes -> NO
                           └─ No  -> TaxInc?
                                     ├─ < 80K -> NO
                                     └─ > 80K -> YES

• We want trees that are both small and accurate: they are easier to understand and tend to perform better.
• All current tree-building algorithms are heuristic algorithms.
• There can be more than one tree that fits the same data!
ALGORITHM FOR DECISION TREE
INDUCTION
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
DECISION TREE INDUCTION
• Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ, SPRINT
• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
HOW TO SPECIFY TEST CONDITION?
• Depends on attribute types
• Nominal
• Ordinal
• Continuous

• Depends on number of ways to split


• 2-way split
• Multi-way split



SPLITTING BASED ON NOMINAL
ATTRIBUTES
• Multi-way split: use as many partitions as there are distinct values.

    CarType?  ->  Family | Sports | Luxury

• Binary split: divides the values into two subsets; need to find the optimal partitioning.

    CarType?  ->  {Sports, Luxury} | {Family}      OR      CarType?  ->  {Family, Luxury} | {Sports}


SPLITTING BASED ON ORDINAL
ATTRIBUTES
• Multi-way split: use as many partitions as there are distinct values.

    Size?  ->  Small | Medium | Large

• Binary split: divides the values into two subsets; need to find the optimal partitioning.

    Size?  ->  {Small, Medium} | {Large}      OR      Size?  ->  {Medium, Large} | {Small}

• What about this split?   Size?  ->  {Small, Large} | {Medium}
  (It breaks the ordering of the values, so it is not used for an ordinal attribute.)
SPLITTING BASED ON CONTINUOUS
ATTRIBUTES
• Different ways of handling
• Discretization to form an ordinal categorical
attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.

• Binary decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut point
  • can be more computationally intensive (a small sketch follows below)

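As a minimal sketch of the binary-cut search (my own helper names, not the slides' algorithm): try the midpoint between every pair of adjacent sorted values and keep the threshold with the lowest weighted impurity. The Gini index is used as the impurity measure here; it is defined later in these slides.

```r
gini <- function(y) 1 - sum((table(y) / length(y))^2)   # impurity of a set of class labels

best_cut <- function(x, y) {
  v <- sort(unique(x))
  cuts <- (head(v, -1) + tail(v, -1)) / 2               # candidate thresholds (midpoints)
  impurity <- sapply(cuts, function(c) {
    left <- y[x < c]; right <- y[x >= c]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  cuts[which.min(impurity)]
}

# Example: the records that reach the TaxInc node in the earlier tree
# (Refund = No and Marital Status in {Single, Divorced}); incomes in thousands.
income <- c(70, 95, 85, 90)
cheat  <- c("No", "Yes", "Yes", "Yes")
best_cut(income, cheat)   # 77.5: a cut just below 80K, matching the "< 80K" branch
```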


SPLITTING BASED ON CONTINUOUS
ATTRIBUTES

(i) Binary split:      Taxable Income > 80K?   ->   Yes | No

(ii) Multi-way split:  Taxable Income?   ->   < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K


HOW TO DETERMINE THE BEST
SPLIT
Before splitting: 10 records of class C0 and 10 records of class C1.

  Own Car?      Yes: C0: 6, C1: 4        No: C0: 4, C1: 6
  Car Type?     Family: C0: 1, C1: 3     Sports: C0: 8, C1: 0     Luxury: C0: 1, C1: 7
  Student ID?   c1: C0: 1, C1: 0   ...   c10: C0: 1, C1: 0        c11: C0: 0, C1: 1   ...   c20: C0: 0, C1: 1

Which test condition is the best?


HOW TO DETERMINE THE BEST
SPLIT
• The key to building a decision tree - which
attribute to choose in order to branch?
• The objective is to reduce impurity or uncertainty in
data as much as possible.
• A subset of data is pure if all instances belong to
the same class.
• Need a measure of node impurity
• Information gain
• Gain ratio
• Gini Index



BRIEF INTRODUCTION OF ENTROPY
• Entropy (information theory)
  • A measure of the uncertainty associated with a random variable
  • Calculation: for a discrete random variable Y taking m distinct values {y_1, y_2, …, y_m},

      entropy(Y) = -\sum_{j=1}^{m} P(Y = y_j) \log_2 P(Y = y_j)

    where \sum_{j=1}^{m} P(Y = y_j) = 1

  • Define 0 * log2(0) = 0
  • Interpretation: higher entropy -> higher uncertainty
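As a quick numeric illustration (my own example, not from the slides), entropy of a two-valued distribution in R:

```r
# Entropy of a discrete distribution given as a probability vector.
entropy <- function(p) {
  p <- p[p > 0]            # use the convention 0 * log2(0) = 0
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))   # 1.0   -> maximum uncertainty for two classes
entropy(c(0.9, 0.1))   # 0.469 -> mostly one class, lower uncertainty
entropy(c(1.0, 0.0))   # 0.0   -> a pure distribution has no uncertainty
```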
ATTRIBUTE SELECTION MEASURE:
INFORMATION GAIN (ID3/C4.5)
• Select the attribute with the highest information gain
• Let p_i = P(Y = y_i) be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
• Expected information (entropy) needed to classify a tuple in D:

    Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

• Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

• Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)
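A minimal sketch of these formulas in R (the helper names info and gain are assumptions of mine, not from the slides); the last line reproduces the Gain(S, Outlook) value computed in the worked example below:

```r
info <- function(counts) {               # counts: vector of class counts at a node
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

gain <- function(parent_counts, partition_counts) {   # partition_counts: list of count vectors
  n <- sum(parent_counts)
  info_A <- sum(sapply(partition_counts, function(cnt) sum(cnt) / n * info(cnt)))
  info(parent_counts) - info_A
}

# PlayTennis data from the next slide: D has 9 "yes" and 5 "no";
# Outlook partitions D into Sunny [2+, 3-], Overcast [4+, 0-], Rainy [3+, 2-].
gain(c(9, 5), list(c(2, 3), c(4, 0), c(3, 2)))   # about 0.247
```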
AN ILLUSTRATION EXAMPLE
Training examples
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rainy Mild High Strong No



Which attribute is the best for the root?

S: [9+, 5-], Entropy(S) = 0.940

Humidity:
  High: [3+, 4-], Entropy = 0.985        Normal: [6+, 1-], Entropy = 0.592
  Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151

Outlook:
  Sunny: [2+, 3-], Entropy = 0.97095     Overcast: [4+, 0-], Entropy = 0     Rainy: [3+, 2-], Entropy = 0.97095
  Gain(S, Outlook) = 0.940 - (5/14)(0.97095) - (4/14)(0) - (5/14)(0.97095) = 0.2467

Wind:
  Weak: [6+, 2-], Entropy = 0.811        Strong: [3+, 3-], Entropy = 1.0
  Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048

Temperature:
  Hot: [2+, 2-], Entropy = 1.0           Mild: [4+, 2-], Entropy = 0.918     Cool: [3+, 1-], Entropy = 0.811
  Gain(S, Temperature) = 0.940 - (4/14)(1.0) - (6/14)(0.918) - (4/14)(0.811) = 0.029

Outlook gives the highest information gain, so it is chosen for the root.
An illustrative example (cont'd)

S: {D1, D2, …, D14}, [9+, 5-]

  Outlook?
  ├─ Sunny    -> S_Sunny    = {D1, D2, D8, D9, D11},  [2+, 3-], Entropy = 0.97095  -> ?
  ├─ Overcast -> S_Overcast = {D3, D7, D12, D13},     [4+, 0-], Entropy = 0        -> Yes
  └─ Rainy    -> S_Rain     = {D4, D5, D6, D10, D14}, [3+, 2-], Entropy = 0.97095  -> ?

Which attribute should be tested for the Sunny branch: Humidity, Temperature, or Wind?

  Gain(S_Sunny, Humidity)    = 0.97095 - (3/5)(0.0) - (2/5)(0.0)              = 0.97095
  Gain(S_Sunny, Temperature) = 0.97095 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.57095
  Gain(S_Sunny, Wind)        = 0.97095 - (2/5)(1.0) - (3/5)(0.918)            = 0.02015

Therefore, Humidity is chosen as the next test attribute for the left (Sunny) branch.


An illustrative example (cont'd)

S: {D1, D2, …, D14}, [9+, 5-]

  Outlook?
  ├─ Sunny -> S_Sunny = {D1, D2, D8, D9, D11}, [2+, 3-]
  │          Humidity?
  │          ├─ High   -> {D1, D2, D8}, [0+, 3-] -> No
  │          └─ Normal -> {D9, D11},    [2+, 0-] -> Yes
  ├─ Overcast -> S_Overcast = {D3, D7, D12, D13}, [4+, 0-] -> Yes
  └─ Rainy -> S_Rain = {D4, D5, D6, D10, D14}, [3+, 2-]
              Wind?
              ├─ Strong -> {D6, D14},     [0+, 2-] -> No
              └─ Weak   -> {D4, D5, D10}, [3+, 0-] -> Yes


BASIC DECISION TREE LEARNING ALGORITHM
1. Select the "best" attribute A for the root node.
2. Create a new descendant of the node for each value of A.
3. Put the training examples into the descendant nodes.
4. For each descendant node (this step decides when to terminate the recursive process):
   • if the training examples associated with the node belong to the same class, the node is marked as a leaf node and labeled with that class;
   • else if there are no remaining attributes on which the examples can be further partitioned, the node is marked as a leaf node and labeled with the most common class among its training examples;
   • else if there is no example for the node, the node is marked as a leaf node and labeled with the majority class of its parent node;
   • otherwise, recursively apply the process to the new node.
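The steps above can be written as a short recursive function. The following is a minimal, illustrative sketch in R under simplifying assumptions (categorical attributes only, information gain as the selection measure; the helper names entropy, info_gain and build_tree are my own, not from the slides):

```r
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

info_gain <- function(data, attr, target) {
  splits <- split(data, data[[attr]], drop = TRUE)
  entropy(data[[target]]) -
    sum(sapply(splits, function(s) nrow(s) / nrow(data) * entropy(s[[target]])))
}

build_tree <- function(data, target, attrs) {
  y <- data[[target]]
  if (length(unique(y)) == 1) return(as.character(y[1]))   # step 4a: pure node -> leaf
  if (length(attrs) == 0)                                  # step 4b: no attributes left
    return(names(which.max(table(y))))                     #          -> majority class
  gains <- sapply(attrs, function(a) info_gain(data, a, target))
  best <- attrs[which.max(gains)]                          # step 1: "best" attribute
  branches <- lapply(split(data, data[[best]], drop = TRUE),        # steps 2-3
                     function(s) build_tree(s, target, setdiff(attrs, best)))  # step 4d
  list(attribute = best, branches = branches)
}

# Example use (assuming the PlayTennis table is stored in a data frame `play`):
# tree <- build_tree(play, "PlayTennis", c("Outlook", "Temperature", "Humidity", "Wind"))
```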
GAIN RATIO FOR ATTRIBUTE SELECTION (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (a normalization of information gain):

    SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2 \left( \frac{|D_j|}{|D|} \right)

    GainRatio(A) = Gain(A) / SplitInfo_A(D)

• Ex. Temperature splits D into Hot (4 tuples), Mild (6 tuples), and Cool (4 tuples):

    SplitInfo_Temperature(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

    gain_ratio(Temperature) = 0.029 / 1.557 = 0.019

• The attribute with the maximum gain ratio is selected as the splitting attribute
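A small sketch of the same calculation in R (helper names are my own, not from the slides):

```r
split_info <- function(sizes) {           # sizes: number of tuples in each partition
  p <- sizes / sum(sizes)
  -sum(p * log2(p))
}
gain_ratio <- function(gain, sizes) gain / split_info(sizes)

# Temperature splits the 14 PlayTennis examples into groups of 4 (Hot), 6 (Mild), 4 (Cool):
split_info(c(4, 6, 4))          # 1.557
gain_ratio(0.029, c(4, 6, 4))   # about 0.019
```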
GINI INDEX
(CART, IBM INTELLIGENTMINER)
• If a data set D contains examples from n classes, the Gini index gini(D) is defined as

    gini(D) = 1 - \sum_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the Gini index gini_A(D) is defined as

    gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

• Reduction in impurity:

    \Delta gini(A) = gini(D) - gini_A(D)

• The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (this requires enumerating all the possible splitting points for each attribute)


COMPUTATION OF GINI INDEX
• Ex. D has 9 tuples with PlayTennis = "yes" and 5 with "no":

    gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

• Suppose the attribute Temperature partitions D into D1 = {Cool, Mild} with 10 tuples and D2 = {Hot} with 4 tuples:

    gini_{Temperature ∈ {Cool, Mild}}(D) = (10/14)(1 - (7/10)^2 - (3/10)^2) + (4/14)(1 - (2/4)^2 - (2/4)^2)
                                         = 0.443
                                         = gini_{Temperature ∈ {Hot}}(D)

  The split {Cool, Hot} | {Mild} gives 0.458 and {Mild, Hot} | {Cool} gives 0.450, so we split on {Cool, Mild} (versus {Hot}) since it has the lowest Gini index.
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes
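The numbers above can be checked with a few lines of R (a sketch with helper names of my own, taking the class counts of each subset):

```r
gini <- function(counts) 1 - sum((counts / sum(counts))^2)   # counts: class counts in a subset

gini_split <- function(part1, part2) {                       # weighted Gini of a binary split
  n <- sum(part1) + sum(part2)
  sum(part1) / n * gini(part1) + sum(part2) / n * gini(part2)
}

gini(c(9, 5))                    # 0.459 for the whole data set
gini_split(c(7, 3), c(2, 2))     # 0.443 for {Cool, Mild} | {Hot}
gini_split(c(5, 3), c(4, 2))     # 0.458 for {Cool, Hot}  | {Mild}
gini_split(c(6, 4), c(3, 1))     # 0.450 for {Mild, Hot}  | {Cool}
```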


COMPARING ATTRIBUTE SELECTION
MEASURES
• The three measures, in general, return good results but
• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is
much smaller than the others
• Gini index:
• biased towards multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and
purity in both partitions



LIMITATION: OVERFITTING
• Overfitting: An induced tree may overfit the training data
• Too many branches, some may reflect anomalies due to
noise or outliers
• Poor accuracy for unseen samples
• Two approaches to avoid overfitting
• Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
• Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide
which is the “best pruned tree”
• C4.5
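As a related illustration (not the C4.5 procedure mentioned above), rpart in R grows a tree and then post-prunes it by cost-complexity pruning; a sketch:

```r
library("rpart")
data("iris")

# Grow a deliberately large tree: cp = 0 removes the complexity penalty and
# minsplit = 2 allows splitting even very small nodes (likely to overfit).
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))
printcp(full_tree)   # cross-validated error for each candidate subtree size

# Post-prune back to the subtree with the lowest cross-validated error.
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)
```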
AN EXAMPLE: OVERFITTING(1)

This example is from Web Data Mining by Bing Liu (figure not reproduced here)



AN EXAMPLE: OVERFITTING(2)

This example is from Web Data Mining by Bing Liu (figure not reproduced here)


OTHER ISSUES IN DECISION TREE
INDUCTION
• From tree to rules, and rule pruning
• Handling of missing values
• Handling skewed class distributions
• Handling attributes and classes with different costs.
• Attribute construction
• Etc.



IRIS EXAMPLE
# install packages (only needed once)
install.packages("rpart")
install.packages("rpart.plot")

library("rpart")
library("rpart.plot")
data("iris")
str(iris)   # explore the data set structure

# divide the data into training and test sets (110 training rows, 40 test rows)
indexes <- sample(150, 110)
iris_train <- iris[indexes, ]
iris_test <- iris[-indexes, ]

# build the decision tree
decision_tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                       data = iris_train, method = "class")
rpart.plot(decision_tree)

# check the accuracy on the test set
predictions <- predict(decision_tree, iris_test, type = "class")
mean(predictions == iris_test$Species)   # fraction of test rows classified correctly
Measure used: Gini Index



Measure used: Information gain

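The two tree plots referenced above differ only in the splitting measure. A brief sketch of how the measure is chosen in rpart (the Gini index is rpart's default for classification; parms = list(split = "information") switches to information gain):

```r
library("rpart")
library("rpart.plot")
data("iris")

tree_gini <- rpart(Species ~ ., data = iris, method = "class",
                   parms = list(split = "gini"))          # the default measure
tree_info <- rpart(Species ~ ., data = iris, method = "class",
                   parms = list(split = "information"))   # information gain

rpart.plot(tree_gini)
rpart.plot(tree_info)
```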


REFERENCE
• Data Mining: Concepts and Techniques, Third Edition.
By Jiawei Han, Micheline Kamber, Jian Pei
• Chapter 8
• Introduction to Data Mining. By Pang-Ning Tan, Michael
Steinbach, Vipin Kumar
• Chapter 3



QUESTIONS?

