
# University of Human Development, College of Science and Technology, Computer Department, 4th class

CLASSIFICATION

Chapter Four

aso.darwesh@yahoo.fr

Outlines

- Classification and prediction
- Decision Tree Induction
- Bayesian classification
- Nearest Neighbor Classification
- Rule-Based Classification
- Artificial Neural Network
- Support Vector Machines

Data Mining - 4th class UHD, Aso M. Darwesh

Classes

Example: medical diagnosis (e.g., classifying a tumour as benign or not), or predicting a class label for new data.

Classification

- Given dataset: a collection of instances (records)
- Model building: find a model for the class attribute
- Goal: classify previously unseen instances as accurately as possible

Classification illustration

Learning set:

| Age | Gender | Specialty   | Sportive |
|-----|--------|-------------|----------|
| 19  | F      | IT          | Yes      |
| 21  | F      | IT          | Yes      |
| 20  | M      | Medicine    | No       |
| 35  | M      | Engineering | No       |
| 34  | M      | Medicine    | Yes      |
| 28  | M      | Sociology   | No       |
| 35  | F      | IT          | Yes      |
| 40  | F      | Medicine    | No       |
| 35  | M      | IT          | Yes      |
| 23  | M      | IT          | No       |
| 24  | F      | Engineering | No       |
| 23  | F      | Medicine    | No       |
| 24  | F      | Sociology   | Yes      |

Test set (labels to predict):

| Age | Gender | Specialty   | Sportive |
|-----|--------|-------------|----------|
| 23  | F      | IT          | ?        |
| 30  | M      | IT          | ?        |
| 28  | F      | Medicine    | ?        |
| 27  | M      | Engineering | ?        |
| 29  | F      | Sociology   | ?        |

A classification algorithm builds a model using the learning set; the test set is then used to evaluate it.
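For the worked examples that follow, the learning set above can be written down directly in Python. This is a sketch; the names `training_set` and `unseen` are my own, not from the slides.

```python
# The 13 training instances from the slide, as (age, gender, specialty, sportive).
training_set = [
    (19, "F", "IT", "Yes"), (21, "F", "IT", "Yes"),
    (20, "M", "Medicine", "No"), (35, "M", "Engineering", "No"),
    (34, "M", "Medicine", "Yes"), (28, "M", "Sociology", "No"),
    (35, "F", "IT", "Yes"), (40, "F", "Medicine", "No"),
    (35, "M", "IT", "Yes"), (23, "M", "IT", "No"),
    (24, "F", "Engineering", "No"), (23, "F", "Medicine", "No"),
    (24, "F", "Sociology", "Yes"),
]

# Unseen instances whose Sportive label the model must predict.
unseen = [(23, "F", "IT"), (30, "M", "IT"), (28, "F", "Medicine"),
          (27, "M", "Engineering"), (29, "F", "Sociology")]

labels = [row[3] for row in training_set]
print(labels.count("Yes"), labels.count("No"))  # 6 7
```

The class counts (6 Yes, 7 No) are reused later when computing the starting entropy.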

## Model: represents one of the following

- Classification rules (e.g., if x then y)
- Decision tree (automatically generated classification rules)
- Mathematical formulae (e.g., f(attributes) = class label)

Model building

The given dataset is divided into:

- Training set: used to build the model
- Test set: used to validate it and find the accuracy of the model

## Estimating model accuracy

- The known label of the test set is compared with the classified result from the model
- Accuracy rate is the percentage of test set instances that are correctly classified by the model
- If the accuracy is acceptable, use the model to classify new data
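The accuracy rate described above is simple to compute. The sketch below assumes a model is any function from an instance's attributes to a predicted label; the baseline `majority_no` model is hypothetical, used only to exercise the code.

```python
def accuracy(model, test_set):
    """Accuracy rate: fraction of test-set instances the model classifies
    correctly, comparing known labels with the model's predictions."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

def majority_no(features):
    """Hypothetical baseline model that always predicts 'No'."""
    return "No"

test_set = [((23, "F", "IT"), "No"),
            ((35, "M", "IT"), "Yes"),
            ((40, "F", "Medicine"), "No")]
print(accuracy(majority_no, test_set))  # 2 of 3 correct -> 0.666...
```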

## Decision Tree Induction

- Partition the dataset based on the value of an attribute
- Create a branch for each of the attribute's possible values
- For continuous attributes the test is normally of the form "less than or equal to" or "greater than"
- The splitting process continues until each branch can be labeled with just one classification

## Decision Tree Induction: Example

Using the learning set above (Age, Gender, Specialty, Sportive):

[Figure: decision tree built from the dataset. The root splits on Specialty (Engineering, Medicine, Sociology, IT); some branches end directly in a No leaf, while others split further on Gender (F: Yes, M: No) and on Age (< 30: No, otherwise Yes).]

## Used for all types of data

- Categorical
- Continuous

## Decision tree has two main functions

- Data compression: the tree representation is equivalent to the dataset, in the sense that the values of all attributes lead to the identical classification
- Prediction: classifying unseen instances

## TDIDT algorithm

- Top-Down Induction of Decision Trees
- Has no preconditions
- The same attribute cannot be chosen twice in the same branch
- Produces decision rules in the implicit form of a decision tree
- At each non-leaf node an attribute is chosen for splitting

## TDIDT algorithm problems

- TDIDT algorithms leave two main aspects underspecified: the impurity measure and the attribute-selection method
- The algorithm says "select an attribute A to split on", but no method is given for doing this
- Well-known instantiations: ID3 (Iterative Dichotomiser) [Quinlan] and C4.5

The recursive construction of the tree:

- If all examples in S belong to some class C, then make a leaf labeled C
- Otherwise:
  - select the most informative attribute A
  - partition S according to A's values
  - recursively construct subtrees T1, T2, ..., for the subsets of S
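The recursive procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the full TDIDT algorithm: for brevity it splits on the first listed attribute rather than selecting the most informative one (e.g., by Information Gain, as discussed later), and the example data is made up.

```python
from collections import Counter

def tdidt(instances, attributes):
    """Recursive TDIDT sketch.  `instances` are (attribute_dict, label) pairs.
    A real implementation would choose the most informative attribute here;
    this sketch simply takes the first one still available."""
    labels = [label for _, label in instances]
    if len(set(labels)) == 1:          # all examples belong to one class C
        return labels[0]               # -> leaf labeled C
    if not attributes:                 # no attribute left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    branches = {}
    for value in {attrs[attr] for attrs, _ in instances}:
        subset = [(a, l) for a, l in instances if a[attr] == value]
        branches[value] = tdidt(subset, rest)   # one branch per value of attr
    return (attr, branches)

examples = [({"specialty": "IT", "gender": "F"}, "Yes"),
            ({"specialty": "IT", "gender": "M"}, "No"),
            ({"specialty": "Engineering", "gender": "M"}, "No")]
tree = tdidt(examples, ["specialty", "gender"])
```

Note that, as required, an attribute used at one node never reappears lower down the same branch, because the recursion passes on only the remaining attributes.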

## Resulting tree T is:

[Figure: attribute A at the root, with one branch for each of A's values v1, v2, ..., vn, leading to subtrees T1, T2, ..., Tn.]

## Measurement of attribute selection

Select the attribute which partitions the learning set into subsets that are as pure as possible. Common measures:

- Entropy
- Gini index
- Frequency tables
- G2

Entropy

The average amount of information needed to classify an object, where n is the number of classes in the dataset:

E = -Σi pi log2 pi   (i = 1, ..., n)

where pi ≠ 0; for the K classes, pi is the relative frequency of class i.

## The Logarithm Function log2 x

log2 x = y means 2^y = x (x > 0); e.g., log2 8 = 3, because 2^3 = 8.

Properties: the value of log2 x is

- positive when x > 1
- negative when x < 1
- zero when x = 1

## The Logarithm Function: properties

18

log2(a b) = log2 a + log2 b log2 (a/b) = log2 a log2 b log2 (an) = n log2 a log2 (1/a) = log2 a

Aso M. Darwesh

## The function -x log2 x

- The value of -x log2 x is in [0, 1] when x is in [0, 1]
- The maximum value of -x log2 x occurs at x = 1/e (e ≈ 2.71828)
- The initial minus sign is included to make the value of the function positive (or zero)

## The logarithm of other bases

Natural logarithm: loge x, written as ln x. The other common base is 10. Change of base:

log_a(x) = log10(x) / log10(a) = loge(x) / loge(a) = ln(x) / ln(a)

Entropy: Example

E = -Σi pi log2 pi

| C1 | C2 | P1      | P2      | Entropy |
|----|----|---------|---------|---------|
| 0  | 6  | 0/6 = 0 | 6/6 = 1 | 0       |
| 1  | 5  | 1/6     | 5/6     | 0.65    |
| 2  | 4  | 2/6     | 4/6     | 0.92    |

For the last row: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
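The entropy of a set of class counts can be checked with a short function. This is a sketch; the convention 0·log2 0 = 0 handles empty classes, matching the condition pi ≠ 0 in the formula.

```python
from math import log2

def entropy(counts):
    """E = -sum(p_i * log2(p_i)) over the classes with non-zero count."""
    total = sum(counts)
    return 0.0 - sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy([0, 6]))             # 0.0  (a pure subset has zero entropy)
print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92, as computed on the slide
```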

Entropy: Example

For the initial dataset (the learning set above) we calculate Estart. There are only two classes:

- Class (Sportive = Yes) = 6, so PYes = 6/13
- Class (Sportive = No) = 7, so PNo = 7/13

Entropy: Example

For subset (Specialty = IT): Class (Sportive = Yes) = 4, Class (Sportive = No) = 1

EIT = -[(4/5) log2(4/5) + (1/5) log2(1/5)] = 0.72192809

Example contd

For subset (Specialty = Medicine):

| Age | Gender | Specialty | Sportive |
|-----|--------|-----------|----------|
| 20  | M      | Medicine  | No       |
| 34  | M      | Medicine  | Yes      |
| 40  | F      | Medicine  | No       |
| 23  | F      | Medicine  | No       |

Class (Sportive = Yes) = 1, Class (Sportive = No) = 3

EMedicine = -[(1/4) log2(1/4) + (3/4) log2(3/4)] = 0.81127812

Example contd

For subset (Specialty = Engineering):

| Age | Gender | Sportive |
|-----|--------|----------|
| 35  | M      | No       |
| 24  | F      | No       |

Class (Sportive = Yes) = 0, Class (Sportive = No) = 2; PYes = 0/2, PNo = 2/2

EEngineering = 0 (pure class)

Example contd

For subset (Specialty = Sociology):

| Age | Gender | Sportive |
|-----|--------|----------|
| 28  | M      | No       |
| 24  | F      | Yes      |

Class (Sportive = Yes) = 1, Class (Sportive = No) = 1

ESociology = 1

Example contd

Now we calculate ENew, the weighted mean of the Ei values; the weights are the proportions of the original instances in each subset:

ENew = (5/13)EIT + (4/13)EMed + (2/13)EEng + (2/13)ESoc
     = (5/13)(0.72192809) + (4/13)(0.81127812) + (2/13)(0) + (2/13)(1)
     = 0.68113484

Example contd

We define Information Gain = EStart - ENew:

IG = 0.99572745 - 0.68113484 = 0.31459261

ENew must be calculated in the same way for all other attributes of the original dataset (homework).
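The whole calculation for the Specialty attribute can be verified with a short script. The (Yes, No) counts per subset are taken directly from the worked example above.

```python
from math import log2

def entropy(counts):
    """E = -sum(p_i * log2(p_i)) over the classes with non-zero count."""
    total = sum(counts)
    return 0.0 - sum(c / total * log2(c / total) for c in counts if c > 0)

# (Yes, No) counts from the worked example.
e_start = entropy([6, 7])                        # whole learning set, 13 rows
subsets = {"IT": [4, 1], "Medicine": [1, 3],
           "Engineering": [0, 2], "Sociology": [1, 1]}
e_new = sum(sum(counts) / 13 * entropy(counts) for counts in subsets.values())
print(round(e_start, 8))          # 0.99572745
print(round(e_new, 8))            # 0.68113484
print(round(e_start - e_new, 8))  # Information Gain = 0.31459261
```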

Example contd

The entropy method of attribute selection is to split on the attribute that maximizes the value of Information Gain = EStart - ENew. This is equivalent to minimizing ENew, as EStart is fixed.

References

- Max Bramer, Principles of Data Mining, Springer-Verlag London Limited, 2006. 342 pages, ISSN 1863-7310.
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006. 772 pages, ISBN 1-55860-489-8.
- Seyed R. Mousavi and Krysia Broda, "Impact of Binary Coding on Multiway-split TDIDT Algorithms", International Journal of Electrical and Electronics Engineering 2:3, 2008, pp. 150-159.

University of Human Development, College of Science and Technology, Computer Department, 4th class

CLASSIFICATION

Chapter Four - 2

aso.darwesh@yahoo.fr

Gini index

TDIDT algorithm

Bayesian classification

- Naïve Bayes classifiers do not use rules, decision trees, etc.
- They use probability theory to find the most likely of the possible classifications
- The sum of the probabilities of a set of mutually exclusive and exhaustive events must always be 1
- The outcome of each trial is recorded in one row of a table; each row must have one and only one classification

Bayesian classification

The probability of an event occurring, given that an attribute has a particular value (or that several variables have particular values), is called the conditional probability of the event and is written, e.g.,

p(class = on time | season = winter)

Bayesian classification

- Named after Thomas Bayes (1702-1761)
- Combines the prior and conditional probabilities in a single formula
- Naïve: the effect of the value of one attribute on the probability of a given classification is assumed independent of the values of the other attributes
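The idea of combining a prior with conditional probabilities under the independence assumption can be sketched as follows. The season/wind dataset here is hypothetical, chosen to echo the p(class = on time | season = winter) example above; probabilities are estimated by simple relative frequencies, which (as noted below) can be poor when counts are small.

```python
from collections import Counter

def naive_bayes(training, instance):
    """Minimal Naive Bayes sketch for categorical attributes.  `training` holds
    (attribute_tuple, label) pairs.  Each class is scored by its prior times
    the product of conditional probabilities, assuming attribute independence."""
    priors = Counter(label for _, label in training)
    scores = {}
    for cls, count in priors.items():
        rows = [attrs for attrs, label in training if label == cls]
        score = count / len(training)                  # prior probability
        for i, value in enumerate(instance):
            matches = sum(1 for attrs in rows if attrs[i] == value)
            score *= matches / len(rows)               # relative-frequency estimate
        scores[cls] = score
    return max(scores, key=scores.get)

data = [(("winter", "windy"), "late"),    # hypothetical (season, wind) dataset
        (("winter", "calm"), "on time"),
        (("summer", "calm"), "on time"),
        (("summer", "windy"), "on time")]
print(naive_bayes(data, ("winter", "windy")))  # -> late
```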


Bayesian classification: problems

- It relies on all attributes being categorical
- Estimating probabilities by relative frequencies can give a poor estimate if the number of instances with a given attribute/value combination is small

- A test set is used to determine the accuracy of the model
- Usually, the given data set is divided into:
  - Training set: used to build the model
  - Test set: used to validate it


## Nearest Neighbour Classification

- Used when all attribute values are continuous
- The idea is to estimate the classification of an unseen instance using the classification of the instance or instances that are closest to it, in some sense that we need to define
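One common way to make "closest" concrete is Euclidean distance with a majority vote among the k nearest training instances. The sketch below assumes that choice; the two-dimensional points are made up for illustration.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def nearest_neighbour(training, instance, k=3):
    """k-nearest-neighbour sketch for continuous attributes: classify an
    unseen instance by majority vote among the k closest training instances."""
    neighbours = sorted(training, key=lambda row: dist(row[0], instance))[:k]
    votes = [label for _, label in neighbours]
    return max(set(votes), key=votes.count)

points = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"),
          ((5.1, 4.9), "B"), ((4.8, 5.2), "B")]
print(nearest_neighbour(points, (1.1, 0.9)))  # two "A" neighbours outvote one "B"
```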


Rule-Based Classification