University of Human Development, College of Science and Technology, Computer Department, 4th class

CLASSIFICATION
Dr. Aso Mohammad Darwesh

Chapter Four - 1

aso.darwesh@yahoo.fr

Outline

- Classification and prediction
- Decision Tree Induction
- Bayesian classification
- Nearest Neighbor Classification
- Rule-Based Classification
- Artificial Neural Network
- Support Vector Machines

Data Mining - 4th class UHD, Aso M. Darwesh

Classification and prediction

Classification: assigning each object to one and only one class
- Classes are mutually exclusive and exhaustive
- Classification predicts categorical classes
- Example: medical diagnosis, analyzing a tumor (cancerous or benign)

Prediction: models continuous-valued functions
- Example: predicting the benefit (profit) expected from a new customer

Classification

Given dataset
- A collection of instances (records)

Model building
- Find a model for the class attribute as a function of the values of the other attributes

Goal
- Classify previously unseen instances as accurately as possible

Classification illustration

Learning set

Age  Gender  Specialty    Sportive
19   F       IT           Yes
21   F       IT           Yes
20   M       Medicine     No
35   M       Engineering  No
34   M       Medicine     Yes
28   M       Sociology    No
35   F       IT           Yes
40   F       Medicine     No
35   M       IT           Yes
23   M       IT           No
24   F       Engineering  No
23   F       Medicine     No
24   F       Sociology    Yes

Test set

Age  Gender  Specialty    Sportive
23   F       IT           ?
30   M       IT           ?
28   F       Medicine     ?
27   M       Engineering  ?
29   F       Sociology    ?

[Figure: the learning set is fed to the classification algorithm to build the model; the model is then used on the test set.]

Classification: model building

Model: represented as one of the following
- Classification rules (e.g., if x then y)
- Decision tree (automatically generating classification rules)
- Mathematical formulae (e.g., f(attributes) = class label)

Model building: the given dataset is divided into
- Training set: used to build the model
- Test set: used to validate it and find the accuracy of the model

Model building cont'd

Using the model
- For classifying future (unseen) instances

Estimating model accuracy
- The known label of each test set instance is compared with the classified result from the model
- Accuracy rate is the percentage of test set instances that are correctly classified by the model
- If the accuracy is acceptable, use the model to classify new data
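
As a rough illustration of the accuracy estimate described above, the following Python sketch compares the known labels of a test set with a model's output. The function name `accuracy_rate` and both label lists are hypothetical, made up for this example; they do not come from the slides.

```python
def accuracy_rate(known_labels, predicted_labels):
    """Return the percentage of test instances that were classified correctly."""
    if len(known_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not known_labels:
        raise ValueError("test set is empty")
    correct = sum(1 for k, p in zip(known_labels, predicted_labels) if k == p)
    return 100.0 * correct / len(known_labels)


# Hypothetical known labels of a five-instance test set and a model's output.
known = ["Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "Yes", "Yes", "No"]
print(accuracy_rate(known, predicted))  # 4 of the 5 instances match
```

Whether the resulting rate counts as "acceptable" depends on the application; the sketch only computes the percentage.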

Decision Tree Induction

- Partitioning the dataset based on the value of an attribute (or attributes)
- Creating a branch for each of its possible values
- For continuous attributes, the test is normally of the form "less than or equal to" or "greater than"
- The splitting process continues until each branch can be labeled with just one classification

Decision Tree Induction: Example

Dataset: the 13-instance learning set (Age, Gender, Specialty, Sportive) shown under "Classification illustration"

[Figure: decision tree built from this dataset. The root node tests Specialty, with branches Engineering, Medicine, Sociology and IT; lower nodes test Gender (F / M) and Age (<30 versus 30 and over); the leaves are labeled Yes or No.]

Decision Tree Induction cont'd

Used for all types of data
- Categorical
- Continuous

A decision tree has two main functions
- Data compression: the tree representation is equivalent to the dataset in the sense that the values of all attributes will lead to identical classifications
- Prediction: for unseen instances

TDIDT algorithm cont'd

- Top-Down Induction of Decision Trees
- Has no preconditions
- The same attribute cannot be selected twice in the same branch
- Produces decision rules in the implicit form of a decision tree
- At each non-leaf node an attribute is chosen for splitting

TDIDT algorithm problems

TDIDT algorithms are distinguished by two main aspects
- Impurity measure
- Selection method (underspecified): the algorithm specifies "Select an attribute A to split on" but no method is given for doing this

This led to the introduction of various TDIDT algorithms
- ID3 (Iterative Dichotomiser 3) [Quinlan]
- C4.5

TDIDT algorithm cont'd

To construct decision tree T from learning set S:
- If all examples in S belong to some class C, then make a leaf labeled C
- Otherwise:
  - select the most informative attribute A
  - partition S according to A's values
  - recursively construct subtrees T1, T2, ..., for the subsets of S
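
The recursive construction above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a full TDIDT implementation: the names `tdidt`, `choose_attribute` and `take_first` are hypothetical, the attribute selection rule is passed in as a parameter because the algorithm leaves it open, and the majority-class fallback for a branch with no attributes left is an addition of this sketch.

```python
from collections import Counter


def tdidt(instances, attributes, choose_attribute):
    """Build a decision tree from (attribute_dict, class_label) pairs.

    A leaf is a class label; an internal node is a pair
    (attribute, {value: subtree}).
    """
    labels = [label for _, label in instances]
    if len(set(labels)) == 1:
        return labels[0]                      # all examples belong to one class C
    if not attributes:
        # Assumption of this sketch: fall back to the majority class when
        # no attribute is left to split on.
        return Counter(labels).most_common(1)[0][0]
    a = choose_attribute(instances, attributes)
    remaining = [x for x in attributes if x != a]   # A is not reused in this branch
    subsets = {}
    for row, label in instances:              # partition S according to A's values
        subsets.setdefault(row[a], []).append((row, label))
    return (a, {value: tdidt(subset, remaining, choose_attribute)
                for value, subset in subsets.items()})


def take_first(instances, attributes):
    """Placeholder selection rule: simply take the first remaining attribute."""
    return attributes[0]


# The two Sociology instances from the learning set.
sociology = [
    ({"Gender": "M", "Specialty": "Sociology"}, "No"),
    ({"Gender": "F", "Specialty": "Sociology"}, "Yes"),
]
tree = tdidt(sociology, ["Specialty", "Gender"], take_first)
print(tree)
```

Replacing `take_first` with a rule based on an impurity measure such as entropy gives the usual "most informative attribute" behaviour.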

TDIDT algorithm cont'd

Resulting tree T is:

              A                 (attribute A)
        /     |     \
      v1     v2 ...  vn         (A's values)
      |       |       |
      T1     T2 ...  Tn         (subtrees)

Measurement of attribute selection

Select the attribute which partitions the learning set into subsets that are as pure as possible
- Entropy
- Gini index
- Frequency tables
- G2

Entropy

The average amount of information needed to classify an object

$$E = -\sum_{i=1}^{n} p_i \log_2 p_i$$

- n is the number of classes in the dataset
- $p_i$ is the relative frequency of class i
- Only classes with $p_i \neq 0$ are included in the sum
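
The entropy formula can be sketched in Python as below. The function name `entropy` and the example class counts are assumptions chosen for illustration; the function works from class counts and skips empty classes, in line with the condition that only non-zero relative frequencies are summed.

```python
from math import log2


def entropy(class_counts):
    """Entropy in bits for a list of class counts; empty classes are skipped."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("no instances")
    probabilities = [c / total for c in class_counts if c > 0]
    return sum(-p * log2(p) for p in probabilities)


print(entropy([5, 5]))    # two equally frequent classes
print(entropy([10, 0]))   # every instance in a single class
```

With two classes the value ranges from 0 (a pure set) to 1 (an even split).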

The Logarithm Function $\log_2 x$

- $\log_2 x = y$ means $2^y = x$ (for $x > 0$)
- e.g., $\log_2 8 = 3$, because $2^3 = 8$

Properties: the value of $\log_2 x$ is
- Positive when $x > 1$
- Negative when $0 < x < 1$
- Zero when $x = 1$

The Logarithm Function: properties

- $\log_2(ab) = \log_2 a + \log_2 b$
- $\log_2(a/b) = \log_2 a - \log_2 b$
- $\log_2(a^n) = n \log_2 a$
- $\log_2(1/a) = -\log_2 a$

The function $-x \log_2 x$

- The value of $-x \log_2 x$ lies in [0, 1] when x is in [0, 1] (taking the value at x = 0 to be 0)
- Its maximum value, about 0.53, is reached at $x = 1/e$ ($e \approx 2.71828$)
- The initial minus sign (-) is included to make the value of the function positive (or zero)
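
The location of the maximum can be checked with a short derivation. The symbol $f$ is introduced here only for convenience; it is not used in the slides.

```latex
\begin{aligned}
f(x) &= -x \log_2 x = -\frac{x \ln x}{\ln 2}, \\
f'(x) &= -\frac{\ln x + 1}{\ln 2} = 0 \iff \ln x = -1 \iff x = \frac{1}{e}, \\
f\!\left(\tfrac{1}{e}\right) &= \frac{1}{e \ln 2} \approx 0.5307 .
\end{aligned}
```

Since $f''(x) = -1/(x \ln 2) < 0$ for $x > 0$, this stationary point is a maximum.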

The logarithm of other bases

- Natural logarithm: $\log_e x$, written as $\ln x$
- The other commonly used base of logarithm is 10
- Change of base:

$$\log_a(x) = \frac{\log_{10}(x)}{\log_{10}(a)} = \frac{\log_e(x)}{\log_e(a)}$$

$$\log_a(x) = \frac{\ln(x)}{\ln(a)}$$
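
The change-of-base rule can be checked numerically with Python's standard `math` module. The helper name `log_base` is hypothetical; `math.log`, `math.log10` and `math.log2` are the standard library functions.

```python
import math


def log_base(a, x):
    """log_a(x), computed through natural logarithms."""
    return math.log(x) / math.log(a)


x = 8.0
print(log_base(2, x))                   # via ln
print(math.log10(x) / math.log10(2))    # via base-10 logarithms
print(math.log2(x))                     # built-in base-2 logarithm
```

All three lines should agree up to floating-point rounding, which is why any base can be used to evaluate $\log_2$.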

Entropy: Example

$$E = -\sum_{i} p_i \log_2 p_i$$

C1 = 0, C2 = 6
- $P_1 = 0/6 = 0$, $P_2 = 6/6 = 1$
- Entropy = $-0 \log_2 0 - 1 \log_2 1 = -0 - 0 = 0$ (taking $0 \log_2 0 = 0$)

C1 = 1, C2 = 5
- $P_1 = 1/6$, $P_2 = 5/6$
- Entropy = $-(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.65$

C1 = 2, C2 = 4
- $P_1 = 2/6$, $P_2 = 4/6$
- Entropy = $-(2/6)\log_2(2/6) - (4/6)\log_2(4/6) = 0.92$
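
For readers who want the intermediate arithmetic, the case with 1 instance in C1 and 5 in C2 can be expanded step by step as follows (values rounded to three decimals):

```latex
\begin{aligned}
E &= -\tfrac{1}{6}\log_2\tfrac{1}{6} - \tfrac{5}{6}\log_2\tfrac{5}{6} \\
  &= \tfrac{1}{6}\log_2 6 + \tfrac{5}{6}\log_2\tfrac{6}{5} \\
  &\approx \tfrac{1}{6}(2.585) + \tfrac{5}{6}(0.263) \\
  &\approx 0.431 + 0.219 = 0.650
\end{aligned}
```

The second line uses $\log_2(1/a) = -\log_2 a$, which turns each term into a positive quantity.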

Entropy: Example

Dataset: the 13-instance learning set (Age, Gender, Specialty, Sportive) shown under "Classification illustration"

For the initial dataset we calculate $E_{Start}$
- There are only two classes
- Class (Sportive = Yes) = 6
- Class (Sportive = No) = 7
- $P_{Yes} = 6/13$, $P_{No} = 7/13$

$E_{Start} = -(6/13)\log_2(6/13) - (7/13)\log_2(7/13) = 0.99572745$

Entropy: Example

Splitting on attribute Specialty

For subset (Specialty = IT)

Age  Gender  Specialty  Sportive
19   F       IT         Yes
21   F       IT         Yes
35   F       IT         Yes
35   M       IT         Yes
23   M       IT         No

- Class (Sportive = Yes) = 4
- Class (Sportive = No) = 1
- $P_{Yes} = 4/5$, $P_{No} = 1/5$

$E_{IT} = -[(4/5)\log_2(4/5) + (1/5)\log_2(1/5)] = 0.72192809$

Example cont'd

For subset (Specialty = Medicine)

Age  Gender  Specialty  Sportive
20   M       Medicine   No
34   M       Medicine   Yes
40   F       Medicine   No
23   F       Medicine   No

- Class (Sportive = Yes) = 1
- Class (Sportive = No) = 3
- $P_{Yes} = 1/4$, $P_{No} = 3/4$

$E_{Medicine} = -[(1/4)\log_2(1/4) + (3/4)\log_2(3/4)] = 0.81127812$

Example cont'd

For subset (Specialty = Engineering)

Age  Gender  Specialty    Sportive
35   M       Engineering  No
24   F       Engineering  No

- Class (Sportive = Yes) = 0
- Class (Sportive = No) = 2
- $P_{Yes} = 0/2$, $P_{No} = 2/2$

$E_{Engineering} = 0$ (when all instances belong to the same class)

Example cont'd

For subset (Specialty = Sociology)

Age  Gender  Specialty  Sportive
28   M       Sociology  No
24   F       Sociology  Yes

- Class (Sportive = Yes) = 1
- Class (Sportive = No) = 1
- $P_{Yes} = 1/2$, $P_{No} = 1/2$

$E_{Sociology} = 1$ (when instances are equally distributed amongst classes)

Example cont'd

Now, we calculate $E_{New}$
- $E_{New}$ is the weighted mean of the subset entropies $E_i$
- The weights are the proportions of the original instances in each subset

$E_{New} = (5/13)E_{IT} + (4/13)E_{Med} + (2/13)E_{Eng} + (2/13)E_{Soc}$
$= (5/13)(0.72192809) + (4/13)(0.81127812) + (2/13)(0) + (2/13)(1)$
$= 0.68113484$
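
The weighted mean can be reproduced from the per-subset class counts given on the preceding slides. This is a small sketch; the names `entropy`, `subsets` and `e_new` are chosen here for illustration.

```python
from math import log2


def entropy(counts):
    """Entropy in bits for a tuple of class counts; empty classes are skipped."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# (Yes, No) counts for each Specialty subset, as listed on the slides.
subsets = {
    "IT": (4, 1),
    "Medicine": (1, 3),
    "Engineering": (0, 2),
    "Sociology": (1, 1),
}
n = sum(sum(counts) for counts in subsets.values())   # 13 instances in total
# Each subset's entropy is weighted by its share of the instances.
e_new = sum(sum(counts) / n * entropy(counts) for counts in subsets.values())
print(round(e_new, 8))
```

The printed value should match the 0.68113484 obtained by hand, up to rounding.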

Example cont'd

We define Information Gain = $E_{Start} - E_{New}$
- IG = 0.99572745 - 0.68113484 = 0.31459261 (about 0.31)
- Homework: $E_{New}$ must also be calculated for all the other attributes of the original dataset

Example cont'd

- We define Information Gain = $E_{Start} - E_{New}$
- The entropy method of attribute selection is to split on the attribute that maximizes the value of Information Gain
- This is equivalent to minimizing the value of $E_{New}$, as $E_{Start}$ is fixed
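
The selection rule can be sketched in Python on the learning set from the slides. This is a minimal illustration under stated assumptions: the names `ROWS`, `ATTRIBUTES`, `entropy` and `information_gain` are hypothetical, and Age is left out because the sketch handles categorical attributes only (a continuous attribute would need a threshold test).

```python
from collections import Counter
from math import log2

# Learning set from the slides as (Gender, Specialty, Sportive) tuples.
ROWS = [
    ("F", "IT", "Yes"), ("F", "IT", "Yes"), ("M", "Medicine", "No"),
    ("M", "Engineering", "No"), ("M", "Medicine", "Yes"),
    ("M", "Sociology", "No"), ("F", "IT", "Yes"), ("F", "Medicine", "No"),
    ("M", "IT", "Yes"), ("M", "IT", "No"), ("F", "Engineering", "No"),
    ("F", "Medicine", "No"), ("F", "Sociology", "Yes"),
]
ATTRIBUTES = {"Gender": 0, "Specialty": 1}   # attribute name -> column index


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, column):
    """E_Start minus the weighted entropy E_New after splitting on `column`."""
    e_start = entropy([row[-1] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    e_new = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return e_start - e_new


gains = {name: information_gain(ROWS, col) for name, col in ATTRIBUTES.items()}
best = max(gains, key=gains.get)   # attribute with the largest gain
print(gains)
print(best)
```

Because $E_{Start}$ is the same for every attribute, picking the largest gain is the same as picking the smallest $E_{New}$.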

References

- Max Bramer, Principles of Data Mining. Springer-Verlag London Limited, 2006. 342 pages. ISSN 1863-7310.
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2006. 772 pages. ISBN 1-55860-489-8.
- Seyed R. Mousavi and Krysia Broda, Impact of Binary Coding on Multiway-split TDIDT Algorithms. International Journal of Electrical and Electronics Engineering 2:3, 2008, pp. 150-159.

CLASSIFICATION

Chapter Four - 2

Gini index

TDIDT algorithm

Principles of Data Mining, Max Bramer. Page 48

Bayesian classification

- Naïve Bayes classifiers
- Do not use rules, decision trees, etc.
- Use probability theory to find the most likely of the possible classifications
- The sum of the probabilities of a set of mutually exclusive and exhaustive events must always be 1
- The outcome of each trial is recorded in one row of a table. Each row must have one and only one classification

Bayesian classification

- Define the dataset as on page 26, along with the paragraph on what the training set constitutes
- The probability of an event occurring if we know that an attribute has a particular value (or that several variables have particular values) is called the conditional probability of the event occurring, and is written as, e.g., p(class = on time | season = winter)

Bayesian classification

- Named after Thomas Bayes (1702-1761)
- Combines the prior and conditional probabilities in a single formula
- Naïve: the effect of the value of one attribute on the probability of a given classification is assumed to be independent of the values of the other attributes
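
A minimal sketch of this idea in Python is given below. It is an illustration under stated assumptions, not the algorithm from the slides: it reuses the Sportive learning set from the decision-tree slides (Gender and Specialty only) rather than the dataset these slides refer to, it estimates every probability by a relative frequency with no correction for small counts, and the names `TRAIN`, `train_naive_bayes` and `classify` are hypothetical.

```python
from collections import Counter

# ((Gender, Specialty), Sportive) pairs from the learning set in the slides.
TRAIN = [
    (("F", "IT"), "Yes"), (("F", "IT"), "Yes"), (("M", "Medicine"), "No"),
    (("M", "Engineering"), "No"), (("M", "Medicine"), "Yes"),
    (("M", "Sociology"), "No"), (("F", "IT"), "Yes"), (("F", "Medicine"), "No"),
    (("M", "IT"), "Yes"), (("M", "IT"), "No"), (("F", "Engineering"), "No"),
    (("F", "Medicine"), "No"), (("F", "Sociology"), "Yes"),
]


def train_naive_bayes(data):
    """Count classes and (attribute index, value, class) combinations."""
    class_counts = Counter(label for _, label in data)
    value_counts = Counter()
    for values, label in data:
        for i, v in enumerate(values):
            value_counts[(i, v, label)] += 1
    return class_counts, value_counts


def classify(values, class_counts, value_counts):
    """Return the class with the highest prior x conditional-probability product."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total                                # prior probability
        for i, v in enumerate(values):
            score *= value_counts[(i, v, label)] / n     # p(value | class)
        scores[label] = score
    return max(scores, key=scores.get), scores


class_counts, value_counts = train_naive_bayes(TRAIN)
label, scores = classify(("F", "IT"), class_counts, value_counts)
print(label, scores)
```

The scores are not normalized, so they do not sum to 1; only their relative size matters for choosing the class.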

Naïve Bayes Classification: algorithm

Bayesian classification

Problems
- It relies on all attributes being categorical
- Estimating probabilities by relative frequencies can give a poor estimate if the number of instances with a given attribute/value combination is small

A test set is used to determine the accuracy of the model

Usually, the given data set is divided into
- Training set: used to build the model
- Test set: used to validate it

Classification and prediction cont'd

Model construction: j
Prediction: p

Nearest Neighbour Classification

- Used when all attribute values are continuous
- The idea is to estimate the classification of an unseen instance using the classification of the instance or instances that are closest to it, in some sense that we need to define
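
One possible reading of "closest" is Euclidean distance with a majority vote among the k nearest training instances. The Python sketch below assumes exactly that; the function name `knn_classify`, the choice of k and the training points are hypothetical and made up for illustration.

```python
from collections import Counter
from math import dist


def knn_classify(training, unseen, k=3):
    """Majority class among the k training points nearest to `unseen`."""
    # Euclidean distance is an assumption of this sketch; other measures work too.
    nearest = sorted(training, key=lambda item: dist(item[0], unseen))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training points with two continuous attributes each.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
print(knn_classify(training, (2.0, 2.0)))
print(knn_classify(training, (6.0, 5.0), k=1))
```

In practice the attributes are usually scaled first, since an attribute with a large numeric range would otherwise dominate the distance.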

Rule-Based Classification

Artificial Neural Network

Support Vector Machines