
Data Warehouse Techniques

Type of data
Data preprocessing or data scrubbing
Data exploration, data cube and OLAP
Data Mining
Similarity between data objects
Data clustering and clustering evaluation
Data classification and classification evaluation
Interesting association rules

A.Merceron

Data Warehouse Techniques

2010

What is Cluster Analysis?


Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
Inter-cluster
distances are
maximized

Intra-cluster
distances are
minimized

A.Merceron

Data Warehouse Techniques

2010

Applications of Cluster Analysis


Understanding
Group students who
succeed and fail in
the same exercises
Summarization
Reduce the size of
large data sets

Clustering precipitation
in Australia
A.Merceron

Data Warehouse Techniques

2010

What is not Cluster Analysis?

Supervised classification
Have class label information

Simple segmentation
Dividing students into different registration groups
alphabetically, by last name

A.Merceron

Data Warehouse Techniques

2010

Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and
partitional sets of clusters

Partitional Clustering
A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly
one subset

Hierarchical clustering
A set of nested clusters organized as a hierarchical
tree
A.Merceron

Data Warehouse Techniques

2010

K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center
point)
Each point is assigned to the cluster with the closest
centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
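A minimal Python sketch of this basic algorithm (not part of the original slides; it assumes NumPy is available, uses random initial centroids, and does not handle empty clusters):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: assign each point to the closest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct points chosen at random (a common simple choice).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving: converged
            break
        centroids = new_centroids
    sse = ((points - centroids[labels]) ** 2).sum()   # Sum of Squared Errors of the result
    return labels, centroids, sse

# Toy usage: two obvious groups of 2-D points, K = 2.
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
print(kmeans(data, k=2))
```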

A.Merceron

Data Warehouse Techniques

2010

K-means Clustering

A.Merceron

Data Warehouse Techniques

2010

Partitional Clustering

Original Points
A.Merceron

Data Warehouse Techniques

2010

Partitional Clustering

Original Points with initial centres


A.Merceron

Data Warehouse Techniques

2010

Partitional Clustering

Original Points with clusters iteration 1


A.Merceron

Data Warehouse Techniques

2010

10

Partitional Clustering

Original Points with new centres


A.Merceron

Data Warehouse Techniques

2010

11

Partitional Clustering

Original Points with clusters and new centres iteration 2


A.Merceron

Data Warehouse Techniques

2010

12

Partitional Clustering

Original Points with clusters and new centres iteration 3


A.Merceron

Data Warehouse Techniques

2010

13

Partitional Clustering

Final clusters and centres


A.Merceron

Data Warehouse Techniques

A Partitional Clustering

2010

14

K-means Clustering Details


Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the
cluster.
Closeness is measured by Euclidean distance, cosine
similarity, correlation, etc.
K-means will converge for common similarity measures
mentioned above.
Complexity is O( n * K * I * d ): linear for n.
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes

A.Merceron

Data Warehouse Techniques

2010

15

Two different K-means Clusterings


[Figure: the same set of points clustered two ways by K-means]

Original Points | Optimal Clustering | Sub-optimal Clustering

A.Merceron

Data Warehouse Techniques
2010

16

Property of K-means

The Sum of Squared Error (SSE) diminishes after each
iteration.
The SSE reached at convergence is not necessarily the optimal one.

SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist²(m_i, x)

A.Merceron

Data Warehouse Techniques

2010

17

Advantages of K-means

Is efficient.
Can be computed in a distributed way.
Is easy to apply.

A.Merceron

Data Warehouse Techniques

2010

18

Limitations of K-means

How to determine the best K?


May give a sub-optimal solution.
K-means has problems when clusters are of
differing
Sizes
Densities
Non-globular shapes

K-means is sensitive to outliers.

A.Merceron

Data Warehouse Techniques

2010

19

Limitations of K-means: Non-globular Shapes

Original Points

A.Merceron

Data Warehouse Techniques

K-means (2 Clusters)

2010

20

Overcoming K-means Limitations

Original Points

A.Merceron

Data Warehouse Techniques

K-means Clusters

2010

21

Hierarchical Clustering
Produces a set of nested clusters organized as a
hierarchical tree
Can be visualized as a dendrogram
A tree-like diagram that records the sequence of
merges or splits

[Figure: nested clusters and the corresponding dendrogram]

A.Merceron

Data Warehouse Techniques

2010

22

Agglomerative Clustering Algorithm


More popular hierarchical clustering technique
Basic algorithm is straightforward
Compute the proximity matrix
Let each data point be a cluster
Repeat
Merge the two closest clusters
Update the proximity matrix
Until only a single cluster remains
Key operation is the computation of the proximity of two
clusters
Different approaches to defining the distance between
clusters distinguish the different algorithms
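A short Python sketch of this basic scheme with MIN (single-link) proximity; it is an illustrative, naive O(N³) implementation, not optimized code from the slides:

```python
import numpy as np

def single_link_agglomerative(points):
    """Naive agglomerative clustering with MIN (single-link) proximity.
    Returns the sequence of merges as (cluster_a, cluster_b, distance)."""
    points = np.asarray(points, dtype=float)
    clusters = {i: [i] for i in range(len(points))}   # start: every point is its own cluster
    # Proximity matrix between individual points.
    dmat = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters: under single link this is the smallest
        # point-to-point distance across the two clusters.
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    d = min(dmat[p, q] for p in clusters[a] for q in clusters[b])
                    if best is None or d < best[2]:
                        best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a][:], clusters[b][:], d))
        clusters[a] = clusters[a] + clusters[b]   # merge the two closest clusters
        del clusters[b]                           # the proximity is recomputed on the fly above
    return merges

# Usage with the 5-point example a..e used on the next slides.
pts = [(1, 1), (1, 2), (5, 4), (7, 5), (7, 7)]
for left, right, dist in single_link_agglomerative(pts):
    print(left, "+", right, "at distance", round(dist, 2))
```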
A.Merceron

Data Warehouse Techniques

2010

23

Distance between Clusters


Min (single link): smallest distance between an element
in one cluster and an element in the other, i.e.,
dis(K_i, K_j) = min { dist(t_ip, t_jq) }.
Max (complete link): largest distance between an
element in one cluster and an element in the other, i.e.,
dis(K_i, K_j) = max { dist(t_ip, t_jq) }.
Average: average distance between an element in one
cluster and an element in the other, i.e.,
dis(K_i, K_j) = avg { dist(t_ip, t_jq) }.
Centroid: distance between the centroids of the two
clusters, i.e., dis(K_i, K_j) = dist(C_i, C_j).
A.Merceron

Data Warehouse Techniques

2010

24

Distance between clusters: example

Consider 5 Objects a(1,1), b(1,2), c(5,4), d(7,5),


e(7,7) and two classes C_1 = {a, b} and C_2 =
{c,d,e}. Calculate d(C_1, C_2) with different
distances.

      a      b      c      d      e
a     0
b     1      0
c     5      4.5    0
d     7.2    6.7    2.2    0
e     8.5    7.8    3.6    2      0

A.Merceron

Data Warehouse Techniques

2010

25

Distance between clusters: example


C_1 = {a, b} and C_2 = {c,d,e}.
d_min(C_1, C_2) = 4.5
d_max(C_1, C_2) = 8.5
d_avg(C_1, C_2) = 6.62
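A quick Python check of these values (illustrative code, not from the slides; math.dist requires Python 3.8+):

```python
from itertools import product
from math import dist   # Euclidean distance between two points

pts = {"a": (1, 1), "b": (1, 2), "c": (5, 4), "d": (7, 5), "e": (7, 7)}
C1, C2 = ["a", "b"], ["c", "d", "e"]

pair_dists = [dist(pts[p], pts[q]) for p, q in product(C1, C2)]
print("min:", round(min(pair_dists), 2))                    # 4.47, reported as 4.5 in the table
print("max:", round(max(pair_dists), 2))                    # 8.49, reported as 8.5
print("avg:", round(sum(pair_dists) / len(pair_dists), 2))  # 6.61 exactly; 6.62 when averaging
                                                            # the one-decimal values of the table
```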

      a      b      c      d      e
a     0
b     1      0
c     5      4.5    0
d     7.2    6.7    2.2    0
e     8.5    7.8    3.6    2      0

A.Merceron

Data Warehouse Techniques

2010

26

Hierarchical Clustering: MIN

Nested Clusters
A.Merceron

Data Warehouse Techniques

Dendrogram
2010

27

Hierarchical Clustering: MAX


Nested Clusters
A.Merceron

Data Warehouse Techniques

Dendrogram
2010

28

Hierarchical Clustering: Group Average


Nested Clusters
A.Merceron

Data Warehouse Techniques

Dendrogram
2010

29

Hierarchical Clustering: Group Average


Compromise between Single and Complete
Link
Strengths
Less susceptible to noise and outliers

Limitations
Biased towards globular clusters

A.Merceron

Data Warehouse Techniques

2010

30

Strengths of Hierarchical Clustering


Do not have to assume any particular number of
clusters
Any desired number of clusters can be obtained by
cutting the dendrogram at the proper level

They may correspond to meaningful taxonomies


(also true for K-means):
Iris-Setosa
Points with high values for x and low values for y.

A.Merceron

Data Warehouse Techniques

2010

31

Hierarchical Clustering: Time and Space requirements

O(N²) space for storing the proximity matrix.

N is the number of points.

O(N³) time in many cases

There are N steps and at each step the proximity
matrix, of size N², must be updated and searched
Complexity can be reduced to O(N² log(N)) time for
some approaches

A.Merceron

Data Warehouse Techniques

2010

32

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters,


it cannot be undone
No global objective function is minimized
Different schemes have problems with one or
more of the following:
Sensitivity to noise and outliers
Difficulty handling different sized clusters and convex
shapes
Breaking large clusters
A.Merceron

Data Warehouse Techniques

2010

33

Curse of Dimensionality (K-Means)

n Attributes, m Objects: if n is large enough, compared


to m, clustering cannot be performed :
A: 1, 1, 0, 0, 0, 0
B: 0, 0, 1, 1, 0, 0
C: 1, 1, 1, 1, 1, 1

All objects are equally distant from each other: no


clustering is possible.

Concrete experience: clustering data collected with the


software pepite.
Students answer 72 questions in Math. Problem: cluster them
according to their abilities. Students who have
answered the same questions the same way should be
in the same clusters.

A.Merceron

Data Warehouse Techniques

2010

34

Curse of Dimensionality: Pepite

A.Merceron

Data Warehouse Techniques

2010

35

Curse of Dimensionality: Pepite

A.Merceron

Data Warehouse Techniques

2010

36

Curse of Dimensionality: Best Practice


The number of objects must be at least 3 times the number
of attributes.

A.Merceron

Data Warehouse Techniques

2010

37

Cluster Validity
How to evaluate the goodness of the resulting clusters?
Why do we want to evaluate them?

A.Merceron

To avoid finding patterns in noise (random data)


To compare clustering algorithms
To compare two sets of clusters
To compare two clusters

Data Warehouse Techniques

2010

38

Clusters found in Random Data

[Figure: random points and the clusters that DBSCAN, K-means, and Complete Link find in them]

A.Merceron

Data Warehouse Techniques

2010

39

Best Practice: Random Data

Explore the data first: do all objects seem


uniformly distributed through all possible values
for each attribute?

A.Merceron

Histogram
Spread

Data Warehouse Techniques

2010

40

Internal Measures: SSE - Cohesion

Clusters in more complicated figures aren't well separated


Internal Index: Used to measure the goodness of a clustering
structure without respect to external information
SSE

SSE is good for comparing two clusterings or two clusters


(average SSE).
Can also be used to estimate the number of clusters
[Figure: an example data set and the corresponding SSE curve as a function of the number of clusters K]
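A minimal sketch of using the SSE to estimate the number of clusters (the "elbow" idea), assuming scikit-learn is available; the data below is illustrative, not the data set of the figure:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs of 2-D points.
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                  for c in [(0, 0), (5, 5), (0, 5)]])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # inertia_ is the SSE; look for the "elbow" where it stops dropping sharply.
    print(k, round(km.inertia_, 1))
```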

A.Merceron

Data Warehouse Techniques

2010

41

30

Clustering: Large scale application

Size for the ready made clothes.


Measure waist, shoulders, arms and so on
A cluster gives a size.

A.Merceron

Data Warehouse Techniques

2010

42

Data Warehouse Techniques


Type of data
Data preprocessing or data scrubbing
Data exploration, data cube and OLAP
Data Mining
Similarity between data objects
Data clustering and clustering evaluation
Data classification and classification evaluation
Interesting association rules

A.Merceron

Data Warehouse Techniques

2010

43

Illustrating Classification Task


Training Set:

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Training Set --> Learning algorithm --> Induction (Learn Model) --> Model

Test Set:

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

Test Set --> Apply Model (Deduction) --> Class
A.Merceron

Data Warehouse Techniques

2010

44

Classification: Definition
Given a collection of records (training set )
Each record contains a set of attributes, one of the
attributes is the class.

Find a model for class attribute as a function


of the values of other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.

A.Merceron

Data Warehouse Techniques

2010

45

Examples of Classification Task


Predicting students' failure
Classifying credit card transactions
as legitimate or fraudulent
Classifying secondary structures of protein
as alpha-helix, beta-sheet, or random
coil
Categorizing e-mails as spam.
A.Merceron

Data Warehouse Techniques

2010

46

Many Classification Techniques


Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
K-Nearest Neighbours

A.Merceron

Data Warehouse Techniques

2010

47

Example of a Decision Tree


(attribute types: categorical, categorical, continuous; class)

Training Data:

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree (Refund is the splitting attribute at the root)

Refund?
|-- Yes: NO
|-- No: MarSt?
        |-- Married: NO
        |-- Single, Divorced: TaxInc?
                              |-- < 80K: NO
                              |-- > 80K: YES

A.Merceron

Data Warehouse Techniques

2010

48

Decision Tree Classification Task


Training Set (Tid 1-10) and Test Set (Tid 11-15): the same tables as on the
"Illustrating Classification Task" slide.

Training Set --> Tree Induction algorithm --> Induction (Learn Model) --> Model: Decision Tree

Test Set --> Apply Model (Deduction) --> Class
A.Merceron

Data Warehouse Techniques

2010

49

Apply Model to Test Data


Test Data:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Start from the root of the tree and follow the branch that matches the record:

Refund?
|-- Yes: NO
|-- No: MarSt?                      (Refund = No, so take this branch)
        |-- Married: NO             (MarSt = Married: the record is classified as NO)
        |-- Single, Divorced: TaxInc?
                              |-- < 80K: NO
                              |-- > 80K: YES

A.Merceron

Data Warehouse Techniques

2010

50

Decision Tree Induction


Many Algorithms:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ,SPRINT

A.Merceron

Data Warehouse Techniques

2010

51

Tree Induction
Greedy strategy.
Split the records based on an attribute test that
optimizes certain criterion.

Issues
Determine how to split the records
How

to specify the attribute test condition?

How

to determine the best split?

Determine when to stop splitting

A.Merceron

Data Warehouse Techniques

2010

52

General Structure of Hunt's Algorithm


Let Dt be the set of training records
that reach a node t
General Procedure:
If Dt contains records that
belong to the same class yt, then t
is a leaf node labeled as yt
If Dt is an empty set, then t is a
leaf node labeled by the default
class, yd

Training data Dt: the same Tid 1-10 table (Refund, Marital Status,
Taxable Income, Cheat) as before.

If Dt contains records that


belong to more than one class,
use an attribute test to split the
data into smaller subsets.
Recursively apply the
procedure to each subset.
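A compact Python sketch of this recursive procedure (illustrative code, not the slides' own implementation; it uses the GINI index defined later to pick the split, handles only categorical attributes, and assumes the data fits in memory):

```python
from collections import Counter

def gini(labels):
    """GINI index of a set of class labels: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def hunt(records, labels, attributes, default):
    """records: list of dicts, labels: class values, attributes: attribute names still usable."""
    if not records:                       # empty set -> leaf labeled with the default class
        return default
    if len(set(labels)) == 1:             # all records belong to the same class -> leaf
        return labels[0]
    if not attributes:                    # nothing left to split on -> majority class
        return Counter(labels).most_common(1)[0][0]

    def split_gini(attr):                 # weighted GINI of the multi-way split on attr
        groups = {}
        for rec, lab in zip(records, labels):
            groups.setdefault(rec[attr], []).append(lab)
        n = len(labels)
        return sum(len(g) / n * gini(g) for g in groups.values())

    best = min(attributes, key=split_gini)            # attribute test that optimizes the criterion
    majority = Counter(labels).most_common(1)[0][0]
    tree = {"split_on": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for v in {rec[best] for rec in records}:           # recursively apply the procedure to each subset
        sub = [(r, l) for r, l in zip(records, labels) if r[best] == v]
        tree["branches"][v] = hunt([r for r, _ in sub], [l for _, l in sub], remaining, majority)
    return tree

# Usage sketch (hypothetical toy call):
# tree = hunt(records, labels, ["Refund", "Marital Status"], default="No")
```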
A.Merceron

Data Warehouse Techniques

Dt

2010

53

How to Specify Test Condition?


Depends on attribute types
Nominal
Ordinal
Continuous

Depends on number of ways to split


2-way split or binary split
Multi-way split

A.Merceron

Data Warehouse Techniques

2010

54

Splitting Based on Nominal Attributes


Multi-way split: Use as many partitions as distinct
values.
CarType
Family

Luxury
Sports

Binary split: Divides values into two subsets.


Need to find optimal partitioning.
{Sports,
Luxury}

A.Merceron

CarType
{Family}

{Family,
Luxury}

OR

Data Warehouse Techniques

2010

CarType
{Sports}

55

Splitting Based on Continuous Attributes


Different ways of handling
Discretization to form an ordinal categorical attribute

Static: discretize once at the beginning

Dynamic: ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.

Binary Decision: (A < v) or (A ≥ v)

consider all possible splits and find the best cut
can be more compute intensive

A.Merceron

Data Warehouse Techniques

2010

56

Splitting Based on Continuous Attributes

(i) Binary split:      Taxable Income > 80K?  -->  Yes / No

(ii) Multi-way split:  Taxable Income?  -->  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K

A.Merceron

Data Warehouse Techniques

2010

57

Tree Induction
Greedy strategy.
Split the records based on an attribute test that
optimizes certain criterion.

Issues
Determine how to split the records
How

to specify the attribute test condition?

How

to determine the best split?

Determine when to stop splitting

A.Merceron

Data Warehouse Techniques

2010

58

How to determine the Best Split


Greedy approach:
Nodes with homogeneous class distribution are
preferred

Need a measure of node impurity:

C0: 5, C1: 5                     C0: 9, C1: 1
Non-homogeneous,                 Quite homogeneous,
high degree of impurity          low degree of impurity

A.Merceron

Data Warehouse Techniques

2010

59

Measures of Node Impurity


Gini Index
Entropy
Misclassification error

A.Merceron

Data Warehouse Techniques

2010

60

Measure of Impurity: GINI


Gini Index for a given node t:

GINI(t) = 1 - Σ_j [p(j|t)]²

(p(j|t) is the relative frequency of class j at node t).


Maximum (1 - 1/nc) when records are equally
distributed among all classes, implying least
interesting information
Minimum (0.0) when all records belong to one class,
implying most interesting information
C1: 0, C2: 6   Gini = 0.000
C1: 1, C2: 5   Gini = 0.278
C1: 2, C2: 4   Gini = 0.444
C1: 3, C2: 3   Gini = 0.500

A.Merceron

Data Warehouse Techniques

2010
61

Examples for computing GINI

GINI(t) = 1 - Σ_j [p(j|t)]²

C1: 0, C2: 6
P(C1) = 0/6 = 0     P(C2) = 6/6 = 1
Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0

C1: 1, C2: 5
P(C1) = 1/6         P(C2) = 5/6
Gini = 1 - (1/6)² - (5/6)² = 0.278

C1: 2, C2: 4
P(C1) = 2/6         P(C2) = 4/6
Gini = 1 - (2/6)² - (4/6)² = 0.444
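The same computations in a few lines of Python (a quick check, not from the original slides):

```python
def gini(counts):
    """GINI index of a node from its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5
```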


Data Warehouse Techniques

2010

62

Splitting Based on GINI


Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the
quality of split is computed as,

GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i = number of records at child i,
n = number of records at node p.

A.Merceron

Data Warehouse Techniques

2010

63

Splitting Based on GINI


GINI(t) = 1 - Σ_j [p(j|t)]²          GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

Training data: the same Tid 1-10 table (Refund, Marital Status, Taxable Income, Cheat) as before.

Example: split on Refund
GINI_Refund=Yes = 1 - ((0/3)² + (3/3)²) = 0
GINI_Refund=No  = 1 - ((3/7)² + (4/7)²) = 0.49
GINI_Refund = (3/10)·0 + (7/10)·0.49 = 0.343

A.Merceron

Data Warehouse Techniques

2010

64

Splitting Based on GINI


GINI(t) = 1 - Σ_j [p(j|t)]²          GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

Training data: the same Tid 1-10 table (Refund, Marital Status, Taxable Income, Cheat) as before.

Example: split on Marital Status
GINI_M,Single   = 1 - ((2/4)² + (2/4)²) = 0.5
GINI_M,Married  = 1 - ((0/4)² + (4/4)²) = 0
GINI_M,Divorced = 1 - ((1/2)² + (1/2)²) = 0.5
GINI_M = (4/10)·0.5 + (4/10)·0 + (2/10)·0.5 = 0.3
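A short Python check of both splits (illustrative code; the partitions are written as (cheat, no-cheat) class counts per child):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """partitions: class counts of each child, e.g. [(0, 3), (3, 4)]."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Refund: Yes -> (0 cheat, 3 no), No -> (3 cheat, 4 no)
print(round(gini_split([(0, 3), (3, 4)]), 3))            # 0.343
# Marital Status: Single (2, 2), Married (0, 4), Divorced (1, 1)
print(round(gini_split([(2, 2), (0, 4), (1, 1)]), 3))    # 0.3
```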

A.Merceron


2010

65

Alternative Splitting Criteria based on INFO


Entropy at a given node t:

Entropy(t) = - Σ_j p(j|t) · log₂ p(j|t)

(NOTE: p( j | t) is the relative frequency of class j at node t).

Measures homogeneity of a node.


Maximum

(log nc) when records are equally distributed among all


classes implying least information

Minimum

(0.0) when all records belong to one class, implying


most information

Entropy based computations are similar to the GINI index


computations

A.Merceron

Data Warehouse Techniques

2010

66

Examples for computing Entropy

Entropy(t) = - Σ_j p(j|t) · log₂ p(j|t)

C1: 0, C2: 6
P(C1) = 0/6 = 0     P(C2) = 6/6 = 1
Entropy = - 0 log₂ 0 - 1 log₂ 1 = - 0 - 0 = 0

C1: 1, C2: 5
P(C1) = 1/6         P(C2) = 5/6
Entropy = - (1/6) log₂ (1/6) - (5/6) log₂ (5/6) = 0.65

C1: 2, C2: 4
P(C1) = 2/6         P(C2) = 4/6
Entropy = - (2/6) log₂ (2/6) - (4/6) log₂ (4/6) = 0.92
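The same computations in Python (a quick check, not from the original slides; 0·log 0 is taken as 0):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(entropy(counts), 2))   # 0.0, 0.65, 0.92
```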


Data Warehouse Techniques

2010

67

Splitting Criteria based on Classification Error


Classification error at a node t :

Error(t) = 1 - max_i P(i|t)

Measures misclassification error made by a node.


Maximum

(1 - 1/nc) when records are equally distributed


among all classes, implying least interesting information

Minimum

(0.0) when all records belong to one class, implying


most interesting information

A.Merceron

Data Warehouse Techniques

2010

68

Examples for Computing Error

Error(t) = 1 - max_i P(i|t)

C1: 0, C2: 6
P(C1) = 0/6 = 0     P(C2) = 6/6 = 1
Error = 1 - max(0, 1) = 1 - 1 = 0

C1: 1, C2: 5
P(C1) = 1/6         P(C2) = 5/6
Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

C1: 2, C2: 4
P(C1) = 2/6         P(C2) = 4/6
Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
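The same computations in Python (a quick check, not from the original slides):

```python
def classification_error(counts):
    """Misclassification error of a node from its class counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(classification_error(counts), 3))   # 0.0, 0.167, 0.333
```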


Data Warehouse Techniques

2010

69

Misclassification Error vs Gini


Parent: C1 = 7, C2 = 3, Gini = 0.42

Split on A?
Node N1 (A = Yes): C1 = 3, C2 = 0
Node N2 (A = No):  C1 = 4, C2 = 3

Gini(N1) = 1 - (3/3)² - (0/3)² = 0
Gini(N2) = 1 - (4/7)² - (3/7)² = 0.489
Gini(Children) = 3/10 · 0 + 7/10 · 0.489 = 0.342

Misclassification error: parent = 3/10 = 0.3;
children = (3/10) · 0 + (7/10) · (3/7) = 0.3, i.e. unchanged.

Gini improves, but misclassification error does not!

A.Merceron

Data Warehouse Techniques

2010

70

Tree Induction
Greedy strategy.
Split the records based on an attribute test that
optimizes certain criterion.

Issues
Determine how to split the records
How

to specify the attribute test condition?

How

to determine the best split?

Determine when to stop splitting

A.Merceron

Data Warehouse Techniques

2010

71

Stopping Criteria for Tree Induction


Stop expanding a node when all the records
belong to the same class
Stop expanding a node when all the records have
similar attribute values
Early termination (like depth)

A.Merceron

Data Warehouse Techniques

2010

72

Decision Tree Based Classification


Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification
techniques for many simple data sets

A.Merceron

Data Warehouse Techniques

2010

73

Practical Issues of Classification


Underfitting and Overfitting
Performance and Costs of Classification

A.Merceron

Data Warehouse Techniques

2010

74

Underfitting and Overfitting


Underfitting: tree is not complete enough and does
not predict well.
Overfitting: tree predicts well only the training data:
many long branches with few objects in each leaf,
that reflect particular cases in the training set.

A.Merceron

Data Warehouse Techniques

2010

75

Overfitting: Best Practice


Overfitting results in decision trees that are more
complex than necessary
Make sure the tree makes sense: each branch
says something meaningful and sensible.
Branches should not be too long!
Build two trees, one with no-pruning, one with
pruning and compare.

A.Merceron

Data Warehouse Techniques

2010

76

Estimating Generalization Errors


Re-substitution errors: error on the training set ( e(t) )
Generalization errors: error on the test set ( e'(t) )
Methods for estimating generalization errors:
Optimistic approach: e'(t) = e(t)
Pessimistic approach:

For each leaf node: e'(t) = e(t) + 0.5

Total errors: e'(T) = e(T) + N × 0.5 (N: number of leaf nodes)
For a tree with 30 leaf nodes and 10 errors on training
(out of 1000 instances):
Training error = 10/1000 = 1%
Generalization error = (10 + 30 × 0.5)/1000 = 2.5%

Reduced error pruning (REP):

A.Merceron

uses validation data set to estimate generalization


error

Data Warehouse Techniques

2010

77

Occam's Razor
Given two models of similar generalization errors,
one should prefer the simpler model over the
more complex model
For complex models, there is a greater chance
that it was fitted accidentally by errors in data
Therefore, one should include model complexity
when evaluating a model

A.Merceron

Data Warehouse Techniques

2010

78

How to Address Overfitting


Pre-Pruning (Early Stopping Rule)
Stop the algorithm before it becomes a fully-grown tree
Typical stopping conditions for a node:

Stop if all instances belong to the same class

Stop if all the attribute values are the same

More restrictive conditions:


Stop if the number of instances is less than some user-specified
threshold (see RapidMiner)

Stop if the class distribution of instances is independent of the

available features (e.g., using the χ² test)

A.Merceron

Stop if expanding the current node does not improve impurity


measures (e.g., Gini or information gain).

Data Warehouse Techniques

2010

79

How to Address Overfitting


Post-pruning
Grow decision tree to its entirety
Trim the nodes of the decision tree in a bottom-up
fashion
If generalization error improves after trimming, replace
sub-tree by a leaf node.
Class label of leaf node is determined from majority
class of instances in the sub-tree
Can use MDL (Minimum Description Length) for postpruning

A.Merceron

Data Warehouse Techniques

2010

80

Example of Post-Pruning
Training Error (Before splitting) = 10/30
Node before splitting: Class = Yes: 20, Class = No: 10, Error = 10/30

Pessimistic error (Before splitting) = (10 + 0.5)/30 = 10.5/30

Training Error (After splitting on A, with children A1, A2, A3, A4) = 9/30

Pessimistic error (After splitting)
= (9 + 4 × 0.5)/30 = 11/30

=> PRUNE the split!

A.Merceron

Data Warehouse Techniques

2010

81

Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?

Methods for Performance Evaluation


How to obtain reliable estimates?

A.Merceron

Data Warehouse Techniques

2010

82

Metrics for Performance Evaluation


Focus on the predictive capability of a model
Rather than how fast it takes to classify or build
models, scalability, etc.

Confusion Matrix:

                          PREDICTED CLASS
                          Class=Yes               Class=No
ACTUAL    Class=Yes       a: TP (true positive)   b: FN (false negative)
CLASS     Class=No        c: FP (false positive)  d: TN (true negative)

A.Merceron

Data Warehouse Techniques

2010

83

Metrics for Performance Evaluation


                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL   Class=Yes   a (TP)       b (FN)
CLASS    Class=No    c (FP)       d (TN)

Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
A.Merceron

Data Warehouse Techniques

2010

84

Limitation of Accuracy
False positives and false negatives may not have
the same weight: wrongly predicting a student's
failure (and having her abandon a degree) is
worse than wrongly predicting a student's success
(and encouraging her to continue her degree
though she may fail).

A.Merceron

Data Warehouse Techniques

2010

85

Accuracy: Best Practice


Do false positives and false negatives have the
same significance for your data? And handle
accordingly: minimize one category, or establish
a cost matrix.

A.Merceron

Data Warehouse Techniques

2010

86

Cost Matrix

                       PREDICTED CLASS
C(i|j)                 Class=Yes    Class=No
ACTUAL    Class=Yes    C(Yes|Yes)   C(No|Yes)
CLASS     Class=No     C(Yes|No)    C(No|No)

C(i|j): cost of misclassifying a class j example as class i

A.Merceron

Data Warehouse Techniques

2010

87

Computing Cost of Classification


Cost Matrix:
                       PREDICTED CLASS
C(i|j)                 Class=Yes    Class=No
ACTUAL    Class=Yes    -1           100
CLASS     Class=No     1            0

Model M1:
                       PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    150          40
CLASS     Class=No     60           250

Accuracy = 80%
Cost = 3910

Model M2:
                       PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    250          45
CLASS     Class=No     5            200

Accuracy = 90%
Cost = 4255
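These numbers can be reproduced with a few lines of Python (illustrative check; the two cost-matrix entries not visible on the slide and M2's false-positive count are the values implied by the reported accuracies and costs):

```python
# Confusion matrices and cost matrix as {(actual, predicted): value}.
cost_matrix = {("Yes", "Yes"): -1, ("Yes", "No"): 100, ("No", "Yes"): 1, ("No", "No"): 0}
M1 = {("Yes", "Yes"): 150, ("Yes", "No"): 40, ("No", "Yes"): 60, ("No", "No"): 250}
M2 = {("Yes", "Yes"): 250, ("Yes", "No"): 45, ("No", "Yes"): 5, ("No", "No"): 200}

def accuracy(cm):
    return sum(n for (a, p), n in cm.items() if a == p) / sum(cm.values())

def cost(cm):
    return sum(n * cost_matrix[key] for key, n in cm.items())

print(accuracy(M1), cost(M1))   # 0.8, 3910
print(accuracy(M2), cost(M2))   # 0.9, 4255
```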
2010

88

Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?

Methods for Performance Evaluation


How to obtain reliable estimates?
Training

A.Merceron

and test

Data Warehouse Techniques

2010

89

Data Warehouse Techniques


Type of data
Data preprocessing or data scrubbing
Data exploration, data cube and OLAP
Data Mining
Similarity between data objects
Data clustering and clustering evaluation
Data classification and classification evaluation
Interesting association rules

A.Merceron

Data Warehouse Techniques

2010

90

Amazon.com Example

A.Merceron

Data Warehouse Techniques

2010

91

Association Rule Mining


Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions:

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

A.Merceron

Data Warehouse Techniques


2010

92

Definition: Frequent Itemset


Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset: an itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset, e.g.
σ({Milk, Bread, Diaper}) = 2
Support (s)
Fraction of transactions that contain an
itemset, e.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or
equal to a minsup threshold

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
A.Merceron

Data Warehouse Techniques

2010

93

Definition: Association Rule


Association Rule

An implication expression of the form X → Y, where X and Y are disjoint itemsets
Example: {Milk, Diaper} → {Beer}

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Rule Evaluation Metrics

Support (s)
Fraction of transactions that contain both X and Y
Confidence (c)
Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
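A short Python check of these two values (illustrative code, not from the original slides):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)   # itemset <= t: subset test

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = support_count(X | Y) / len(transactions)
confidence = support_count(X | Y) / support_count(X)
print(round(support, 2), round(confidence, 2))   # 0.4, 0.67
```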

A.Merceron

Data Warehouse Techniques

2010

94

Evaluation metrics and probabilities


What gives its direction to X → Y?

Rule Evaluation Metrics

Support (s) = P(X ∪ Y), the probability that a transaction contains both X and Y.
It is symmetric.

Confidence (c) = P(Y | X).
It is not symmetric and gives its direction to a rule.

Example: {Milk, Diaper} → {Beer} (same transactions as before)

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

A.Merceron

Data Warehouse Techniques

2010

95

Association Rule Mining Task


Given a set of transactions T, the goal of
association rule mining is to find all rules having
support minsup threshold
confidence minconf threshold

Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf
thresholds

Computationally prohibitive!
A.Merceron

Data Warehouse Techniques

2010

96

Mining Association Rules


TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Rules:

{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but
can have different confidence
Thus: decouple the support and confidence requirements
A.Merceron

Data Warehouse Techniques

2010

97

Mining Association Rules


Two-step approach:
Frequent Itemset Generation

Generate all itemsets whose support ≥ minsup

Apriori algorithm

Rule Generation

Generate high confidence rules from each frequent


itemset, where each rule is a binary partitioning of a
frequent itemset

Frequent itemset generation is still


computationally expensive
A.Merceron

Data Warehouse Techniques

2010

98

Illustrating Apriori Principle


Items (1-itemsets):                 Minimum Support = 3

Item      Count
Bread     4
Coke      2
Milk      4
Beer      3
Diaper    4
Eggs      1

Pairs (2-itemsets):
(no need to generate candidates involving Coke or Eggs)

Itemset             Count
{Bread, Milk}       3
{Bread, Beer}       2
{Bread, Diaper}     3
{Milk, Beer}        2
{Milk, Diaper}      3
{Beer, Diaper}      3

Triplets (3-itemsets):

Itemset                   Count
{Bread, Milk, Diaper}     2

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13

A.Merceron

Data Warehouse Techniques

2010

99

Apriori Algorithm
Ck: candidate itemsets of size k; Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = apriori-gen(Lk);
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
A.Merceron

Data Warehouse Techniques

2010

100

Apriori-Gen

Suppose the items in Lk-1 are listed in an order

Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
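A compact Python sketch of the whole Apriori loop, including this candidate generation and pruning step (illustrative code for small in-memory transaction lists, not the slides' own implementation):

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {k: v for k, v in counts.items() if v >= min_support_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # apriori-gen, step 1 (self-join): merge frequent k-itemsets sharing k-1 items.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1:
                    candidates.add(union)
        # apriori-gen, step 2 (prune): every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count the surviving candidates in one pass over the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Usage on the market-basket example with minimum support count 3:
baskets = [{"Bread", "Milk"},
           {"Bread", "Diaper", "Beer", "Eggs"},
           {"Milk", "Diaper", "Beer", "Coke"},
           {"Bread", "Milk", "Diaper", "Beer"},
           {"Bread", "Milk", "Diaper", "Coke"}]
for itemset, count in sorted(apriori(baskets, 3).items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(set(itemset), count)
# Prints the frequent 1- and 2-itemsets; the candidate triplet {Bread, Milk, Diaper}
# has support count 2 and is therefore not frequent at minsup = 3.
```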


A.Merceron

Data Warehouse Techniques

2010

101

Exercise 01
Find all frequent itemsets using the Apriori algorithm.
Minimum support count: 2.

I1, I2, I5
I2, I4
I2, I3
I1, I2, I4
I1, I3
I2, I3
I1, I3
I1, I2, I3, I5
I1, I2, I3, I5

A.Merceron

Data Warehouse Techniques

2010

102

Effect of Support Distribution


How to set the appropriate minsup threshold?
If minsup is set too high, we could miss itemsets
involving interesting rare items (e.g., expensive
products)
If minsup is set too low, it is computationally
expensive and the number of itemsets is very large

Using a single minimum support threshold may


not be effective
A.Merceron

Data Warehouse Techniques

2010

103

Support: Best practice


With data exploration get an overview of the
items or products.
Use this exploration to select the items you want
to find associations for and to fix support and
confidence.

A.Merceron

Data Warehouse Techniques

2010

104

Rule Generation
Given a frequent itemset L, find all non-empty
subsets f ⊂ L such that f → L − f satisfies the
minimum confidence requirement
If {A,B,C,D} is a frequent itemset, candidate rules:

ABC → D,  ABD → C,  ACD → B,  BCD → A,
A → BCD,  B → ACD,  C → ABD,  D → ABC,
AB → CD,  AC → BD,  AD → BC,  BC → AD,
BD → AC,  CD → AB

If |L| = k, then there are 2^k − 2 candidate

association rules (ignoring L → ∅ and ∅ → L)

A.Merceron

Data Warehouse Techniques

2010

105

Rule Generation
How to efficiently generate rules from frequent
itemsets?
In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)

But confidence of rules generated from the same

itemset has an anti-monotone property
e.g., L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone w.r.t. the number of items on the
RHS of the rule

A.Merceron

Data Warehouse Techniques

2010

106

Pattern Evaluation
Association rule algorithms tend to produce too
many rules
many of them are uninteresting or redundant
Redundant if {A,B,C} → {D} and {A,B} → {D}
have same support & confidence

Interestingness measures can be used to


prune/rank the derived patterns
In the original formulation of association rules,
support & confidence are the only measures used
A.Merceron

Data Warehouse Techniques

2010

107

Support & Confidence are limited


5000 transactions
Case 1: |X, Y| = 1000, |X| = 1000 and |Y| = 2500
Case 2: |X, Y| = 1000, |X| = 1000 and |Y| = 5000
In both cases: support(X → Y) = 20%, confidence(X → Y) = 100%

Case 1:            Y      !Y
        X       1000       0     1000
        !X      1500    2500     4000
                2500    2500     5000

Case 2:            Y      !Y
        X       1000       0     1000
        !X      4000       0     4000
                5000       0     5000

A.Merceron

Data Warehouse Techniques

2010

108

Interesting rules

5000 transactions

X and Y:

|X| = |X ∧ Y| = 1000, |Y| = 2500
sup(X → Y) = 20 %
conf(X → Y) = 100 %

|X| = |X ∧ Y| = 1000, |Y| = 5000
sup(X → Y) = 20 %
conf(X → Y) = 100 %

|X| = |X ∧ Y| = 4800, |Y| = 5000
sup(X → Y) = 96 %
conf(X → Y) = 100 %
109

Application of Interestingness Measure


[Figure: the knowledge discovery pipeline: Data -> Selection -> Selected Data ->
Preprocessing -> Preprocessed Data -> Mining -> Patterns -> Postprocessing with
Interestingness Measures -> Knowledge]
A.Merceron

Data Warehouse Techniques

2010

110

Interestingness Measures
Lift, Added Value, cosine

A.Merceron

Data Warehouse Techniques

2010

111

Added Value
X and Y are related if the proportion of transactions
containing Y among the transactions containing X is
greater than the proportion of transactions containing
Y among all transactions. AV(X → Y) and AV(Y → X)
are linked!

AV(X → Y) = P(Y | X) − P(Y) = conf(X → Y) − P(Y)
AV(Y → X) = P(X | Y) − P(X) = conf(Y → X) − P(X)

A.Merceron

Data Warehouse Techniques

2010

112

Added Value and lift


Lift is exactly 1 when added value is 0, greater than 1
when added value is positive and below 1 when added
value is negative.
A rule is not interesting if its lift is around or below 1.
Lift is 1: X and Y are independent in the sense of
probability theory.

lift(X → Y) = P(X, Y) / (P(X) · P(Y)) = conf(X → Y) / P(Y) = (|X, Y| · n) / (|X| · |Y|)

A.Merceron

Data Warehouse Techniques

2010

113

Drawback of Lift with strong symmetric rules


Lift does not have the null-invariant property: it is
sensitive to transactions containing neither item X
nor item Y.

A.Merceron

Data Warehouse Techniques

2010

114

Cosine
A, B two vectors of length n: A = (a1, ..., an), B = (b1, ..., bn)

cosine(A, B) = (A · B) / (||A|| · ||B||)

A · B = Σ_{k=1..n} a_k b_k          ||X|| = sqrt( Σ_{k=1..n} x_k² )

A.Merceron

Data Warehouse Techniques

2010

115

Cosine X → Y
X = (x1, ..., xn)
x_k is 1 if transaction t_k contains X, 0 otherwise.
Example: X is {Bread, Milk}, Y is {Diaper} gives
vector X = (1, 0, 0, 1, 1) and vector Y = (0, 1, 1, 1, 1)

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

A.Merceron

Data Warehouse Techniques

2010

116

Cosine
A rule is not interesting if its cosine is below 0.66.
Null-invariant property: transactions containing
neither item X nor item Y have no influence.
X and Y are the most related (value 1) when each
transaction contains either both X and Y or neither X
nor Y.

cosine(X → Y) = P(X, Y) / sqrt( P(X) · P(Y) ) = |X, Y| / sqrt( |X| · |Y| )
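A quick Python check of lift and cosine on the three cases of the "Interesting rules" slides (illustrative code; only the counts are needed):

```python
from math import sqrt

def lift(n_xy, n_x, n_y, n):
    return (n_xy * n) / (n_x * n_y)

def cosine(n_xy, n_x, n_y):
    return n_xy / sqrt(n_x * n_y)

cases = [  # (|X, Y|, |X|, |Y|) with n = 5000 transactions
    (1000, 1000, 2500),
    (1000, 1000, 5000),
    (4800, 4800, 5000),
]
for n_xy, n_x, n_y in cases:
    print(round(cosine(n_xy, n_x, n_y), 2), round(lift(n_xy, n_x, n_y, 5000), 2))
# -> 0.63 2.0 ; 0.45 1.0 ; 0.98 1.0
```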
A.Merceron

Data Warehouse Techniques

2010

117

Interesting rules

5000 transactions

X and Y:

|X| = |X ∧ Y| = 1000, |Y| = 2500
cosine(X → Y) = 0.63
lift(X → Y) = 2

|X| = |X ∧ Y| = 1000, |Y| = 5000
cosine(X → Y) = 0.45
lift(X → Y) = 1

|X| = |X ∧ Y| = 4800, |Y| = 5000
cosine(X → Y) = 0.98
lift(X → Y) = 1
118

Interestingness: Best practice


Prune the rules with two distinct measures like lift
and cosine.
If cosine and lift agree, easy.
If they do not agree: look at support and confidence.
Strong rule? Follow cosine.
Is knowing that X occurred more important than knowing that it
did not occur? If yes, follow cosine, if not follow lift.

Ponder whether the associations make sense.

A.Merceron

Data Warehouse Techniques

2010

119

Tools
Commercial:
IBM: Intelligent Miner.
SPSS: Clementine.

Open source:
Weka
RAPIDMINER
KNIME

A.Merceron

Data Warehouse Techniques

2010

120
