
Data Warehouse Techniques

Type of data
Data preprocessing or data scrubbing
Data exploration, data cube and OLAP
Data Mining
Similarity between data objects
Data clustering and clustering evaluation
Data classification and classification evaluation
Interesting association rules

A.Merceron

Data Warehouse Techniques

2010

What is Cluster Analysis?


Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
Inter-cluster
distances are
maximized

Intra-cluster
distances are
minimized

A.Merceron

Data Warehouse Techniques

2010

Applications of Cluster Analysis


Understanding
Group students who
succeed and fail in
the same exercises
Summarization
Reduce the size of
large data sets

Clustering precipitation
in Australia
A.Merceron

Data Warehouse Techniques

2010

What is not Cluster Analysis?

Supervised classification
Have class label information

Simple segmentation
Dividing students into different registration groups
alphabetically, by last name

A.Merceron

Data Warehouse Techniques

2010

Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and
partitional sets of clusters

Partitional Clustering
A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly
one subset

Hierarchical clustering
A set of nested clusters organized as a hierarchical
tree
A.Merceron

Data Warehouse Techniques

2010

K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center
point)
Each point is assigned to the cluster with the closest
centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
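A minimal Python sketch of this basic algorithm (not part of the original slides; it assumes NumPy is available, uses random initial centroids, and does not handle empty clusters):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: assign each point to the closest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct points chosen at random (a common simple choice).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving: converged
            break
        centroids = new_centroids
    sse = ((points - centroids[labels]) ** 2).sum()   # Sum of Squared Errors of the result
    return labels, centroids, sse

# Toy usage: two obvious groups of 2-D points, K = 2.
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
print(kmeans(data, k=2))
```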

A.Merceron

Data Warehouse Techniques

2010

K-means Clustering

A.Merceron

Data Warehouse Techniques

2010

Partitional Clustering

Original Points
A.Merceron

Data Warehouse Techniques

2010

Partitional Clustering

Original Points with initial centres


A.Merceron

Data Warehouse Techniques

2010

Partitional Clustering

Original Points with clusters iteration 1


A.Merceron

Data Warehouse Techniques

2010

10

Partitional Clustering

Original Points with new centres


A.Merceron

Data Warehouse Techniques

2010

11

Partitional Clustering

Original Points with clusters and new centres iteration 2


A.Merceron

Data Warehouse Techniques

2010

12

Partitional Clustering

Original Points with clusters and new centres iteration 3


A.Merceron

Data Warehouse Techniques

2010

13

Partitional Clustering

Final clusters and centres


A.Merceron

Data Warehouse Techniques

A Partitional Clustering

2010

14

K-means Clustering Details


Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the
cluster.
Closeness is measured by Euclidean distance, cosine
similarity, correlation, etc.
K-means will converge for common similarity measures
mentioned above.
Complexity is O( n * K * I * d ): linear for n.
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes

A.Merceron

Data Warehouse Techniques

2010

15

Two different K-means Clusterings


[Figure: the same set of points clustered two ways by K-means]

Original Points | Optimal Clustering | Sub-optimal Clustering

A.Merceron

Data Warehouse Techniques
2010

16

Property of K-means

The Sum of Squared Error (SSE) diminishes after each
iteration.
The SSE reached at convergence is not necessarily the optimal one.

SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist²(m_i, x)

A.Merceron

Data Warehouse Techniques

2010

17

Advantages of K-means

Is efficient.
Can be computed in a distributed way.
Is easy to apply.

A.Merceron

Data Warehouse Techniques

2010

18

Limitations of K-means

How to determine the best K?


May give a sub-optimal solution.
K-means has problems when clusters are of
differing
Sizes
Densities
Non-globular shapes

K-means is sensitive to outliers.

A.Merceron

Data Warehouse Techniques

2010

19

Limitations of K-means: Non-globular Shapes

Original Points

A.Merceron

Data Warehouse Techniques

K-means (2 Clusters)

2010

20

Overcoming K-means Limitations

Original Points

A.Merceron

Data Warehouse Techniques

K-means Clusters

2010

21

Hierarchical Clustering
Produces a set of nested clusters organized as a
hierarchical tree
Can be visualized as a dendrogram
A tree-like diagram that records the sequence of
merges or splits

[Figure: nested clusters and the corresponding dendrogram]

A.Merceron

Data Warehouse Techniques

2010

22

Agglomerative Clustering Algorithm


More popular hierarchical clustering technique
Basic algorithm is straightforward
Compute the proximity matrix
Let each data point be a cluster
Repeat
Merge the two closest clusters
Update the proximity matrix
Until only a single cluster remains
Key operation is the computation of the proximity of two
clusters
Different approaches to defining the distance between
clusters distinguish the different algorithms
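A short Python sketch of this basic scheme with MIN (single-link) proximity; it is an illustrative, naive O(N³) implementation, not optimized code from the slides:

```python
import numpy as np

def single_link_agglomerative(points):
    """Naive agglomerative clustering with MIN (single-link) proximity.
    Returns the sequence of merges as (cluster_a, cluster_b, distance)."""
    points = np.asarray(points, dtype=float)
    clusters = {i: [i] for i in range(len(points))}   # start: every point is its own cluster
    # Proximity matrix between individual points.
    dmat = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters: under single link this is the smallest
        # point-to-point distance across the two clusters.
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    d = min(dmat[p, q] for p in clusters[a] for q in clusters[b])
                    if best is None or d < best[2]:
                        best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a][:], clusters[b][:], d))
        clusters[a] = clusters[a] + clusters[b]   # merge the two closest clusters
        del clusters[b]                           # the proximity is recomputed on the fly above
    return merges

# Usage with the 5-point example a..e used on the next slides.
pts = [(1, 1), (1, 2), (5, 4), (7, 5), (7, 7)]
for left, right, dist in single_link_agglomerative(pts):
    print(left, "+", right, "at distance", round(dist, 2))
```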
A.Merceron

Data Warehouse Techniques

2010

23

Distance between Clusters


Min (single link): smallest distance between an element
in one cluster and an element in the other, i.e.,
dis(K_i, K_j) = min { dist(t_ip, t_jq) }.
Max (complete link): largest distance between an
element in one cluster and an element in the other, i.e.,
dis(K_i, K_j) = max { dist(t_ip, t_jq) }.
Average: average distance between an element in one
cluster and an element in the other, i.e.,
dis(K_i, K_j) = avg { dist(t_ip, t_jq) }.
Centroid: distance between the centroids of the two
clusters, i.e., dis(K_i, K_j) = dist(C_i, C_j).
A.Merceron

Data Warehouse Techniques

2010

24

Distance between clusters: example

Consider 5 Objects a(1,1), b(1,2), c(5,4), d(7,5),


e(7,7) and two classes C_1 = {a, b} and C_2 =
{c,d,e}. Calculate d(C_1, C_2) with different
distances.

      a      b      c      d      e
a     0
b     1      0
c     5      4.5    0
d     7.2    6.7    2.2    0
e     8.5    7.8    3.6    2      0

A.Merceron

Data Warehouse Techniques

2010

25

Distance between clusters: example


C_1 = {a, b} and C_2 = {c,d,e}.
d_min(C_1, C_2) = 4.5
d_max(C_1, C_2) = 8.5
d_avg(C_1, C_2) = 6.62
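A quick Python check of these values (illustrative code, not from the slides; math.dist requires Python 3.8+):

```python
from itertools import product
from math import dist   # Euclidean distance between two points

pts = {"a": (1, 1), "b": (1, 2), "c": (5, 4), "d": (7, 5), "e": (7, 7)}
C1, C2 = ["a", "b"], ["c", "d", "e"]

pair_dists = [dist(pts[p], pts[q]) for p, q in product(C1, C2)]
print("min:", round(min(pair_dists), 2))                    # 4.47, reported as 4.5 in the table
print("max:", round(max(pair_dists), 2))                    # 8.49, reported as 8.5
print("avg:", round(sum(pair_dists) / len(pair_dists), 2))  # 6.61 exactly; 6.62 when averaging
                                                            # the one-decimal values of the table
```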

      a      b      c      d      e
a     0
b     1      0
c     5      4.5    0
d     7.2    6.7    2.2    0
e     8.5    7.8    3.6    2      0

A.Merceron

Data Warehouse Techniques

2010

26

Hierarchical Clustering: MIN

Nested Clusters
A.Merceron

Data Warehouse Techniques

Dendrogram
2010

27

Hierarchical Clustering: MAX


Nested Clusters
A.Merceron

Data Warehouse Techniques

Dendrogram
2010

28

Hierarchical Clustering: Group Average


Nested Clusters
A.Merceron

Data Warehouse Techniques

Dendrogram
2010

29

Hierarchical Clustering: Group Average


Compromise between Single and Complete
Link
Strengths
Less susceptible to noise and outliers

Limitations
Biased towards globular clusters

A.Merceron

Data Warehouse Techniques

2010

30

Strengths of Hierarchical Clustering


Do not have to assume any particular number of
clusters
Any desired number of clusters can be obtained by
cutting the dendrogram at the proper level

They may correspond to meaningful taxonomies


(also true for K-means):
Iris-Setosa
Points with high values for x and low values for y.

A.Merceron

Data Warehouse Techniques

2010

31

Hierarchical Clustering: Time and Space requirements

O(N²) space for storing the proximity matrix.

N is the number of points.

O(N³) time in many cases

There are N steps and at each step the proximity
matrix, of size N², must be updated and searched
Complexity can be reduced to O(N² log(N)) time for
some approaches

A.Merceron

Data Warehouse Techniques

2010

32

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters,


it cannot be undone
No global objective function is minimized
Different schemes have problems with one or
more of the following:
Sensitivity to noise and outliers
Difficulty handling different sized clusters and convex
shapes
Breaking large clusters
A.Merceron

Data Warehouse Techniques

2010

33

Curse of Dimensionality (K-Means)

n Attributes, m Objects: if n is large enough, compared


to m, clustering cannot be performed :
A: 1, 1, 0, 0, 0, 0
B: 0, 0, 1, 1, 0, 0
C: 1, 1, 1, 1, 1, 1

All objects are equally distant from each other: no


clustering is possible.

Concrete experience: clustering data collected with the


software pepite.
Students answer 72 questions in Math. Problem: cluster them
according to their abilities. Students who have
answered the same questions the same way should be
in the same clusters.

A.Merceron

Data Warehouse Techniques

2010

34

Curse of Dimensionality: Pepite

A.Merceron

Data Warehouse Techniques

2010

35

Curse of Dimensionality: Pepite

A.Merceron

Data Warehouse Techniques

2010

36

Curse of Dimensionality: Best Practice


The number of objects must be at least 3 times the number
of attributes.

A.Merceron

Data Warehouse Techniques

2010

37

Cluster Validity
How to evaluate the goodness of the resulting clusters?
Why do we want to evaluate them?

A.Merceron

To avoid finding patterns in noise (random data)


To compare clustering algorithms
To compare two sets of clusters
To compare two clusters

Data Warehouse Techniques

2010

38

Clusters found in Random Data

[Figure: random points and the clusters that DBSCAN, K-means, and Complete Link find in them]

A.Merceron

Data Warehouse Techniques

2010

39

Best Practice: Random Data

Explore the data first: do all objects seem


uniformly distributed through all possible values
for each attribute?

A.Merceron

Histogram
Spread

Data Warehouse Techniques

2010

40

Internal Measures: SSE - Cohesion

Clusters in more complicated figures aren't well separated


Internal Index: Used to measure the goodness of a clustering
structure without respect to external information
SSE

SSE is good for comparing two clusterings or two clusters


(average SSE).
Can also be used to estimate the number of clusters
[Figure: an example data set and the corresponding SSE curve as a function of the number of clusters K]
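A minimal sketch of using the SSE to estimate the number of clusters (the "elbow" idea), assuming scikit-learn is available; the data below is illustrative, not the data set of the figure:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs of 2-D points.
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                  for c in [(0, 0), (5, 5), (0, 5)]])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # inertia_ is the SSE; look for the "elbow" where it stops dropping sharply.
    print(k, round(km.inertia_, 1))
```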

A.Merceron

Data Warehouse Techniques

2010

41

30

Clustering: Large scale application

Size for the ready made clothes.


Measure waist, shoulders, arms and so on
A cluster gives a size.

A.Merceron

Data Warehouse Techniques

2010

42

Data Warehouse Techniques


Type of data
Data preprocessing or data scrubbing
Data exploration, data cube and OLAP
Data Mining
Similarity between data objects
Data clustering and clustering evaluation
Data classification and classification evaluation
Interesting association rules

A.Merceron

Data Warehouse Techniques

2010

43

Illustrating Classification Task


Training Set:

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Training Set --> Learning algorithm --> Induction (Learn Model) --> Model

Test Set:

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

Test Set --> Apply Model (Deduction) --> Class
A.Merceron

Data Warehouse Techniques

2010

44

Classification: Definition
Given a collection of records (training set )
Each record contains a set of attributes, one of the
attributes is the class.

Find a model for class attribute as a function


of the values of other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.

A.Merceron

Data Warehouse Techniques

2010

45

Examples of Classification Task


Predicting students' failure
Classifying credit card transactions
as legitimate or fraudulent
Classifying secondary structures of protein
as alpha-helix, beta-sheet, or random
coil
Categorizing e-mails as spam.
A.Merceron

Data Warehouse Techniques

2010

46

Many Classification Techniques


Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
K-Nearest Neighbours

A.Merceron

Data Warehouse Techniques

2010

47

Example of a Decision Tree


(attribute types: categorical, categorical, continuous; class)

Training Data:

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree (Refund is the splitting attribute at the root)

Refund?
|-- Yes: NO
|-- No: MarSt?
        |-- Married: NO
        |-- Single, Divorced: TaxInc?
                              |-- < 80K: NO
                              |-- > 80K: YES

A.Merceron

Data Warehouse Techniques

2010

48

Decision Tree Classification Task


Training Set (Tid 1-10) and Test Set (Tid 11-15): the same tables as on the
"Illustrating Classification Task" slide.

Training Set --> Tree Induction algorithm --> Induction (Learn Model) --> Model: Decision Tree

Test Set --> Apply Model (Deduction) --> Class
A.Merceron

Data Warehouse Techniques

2010

49

Apply Model to Test Data


Test Data:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Start from the root of the tree and follow the branch that matches the record:

Refund?
|-- Yes: NO
|-- No: MarSt?                      (Refund = No, so take this branch)
        |-- Married: NO             (MarSt = Married: the record is classified as NO)
        |-- Single, Divorced: TaxInc?
                              |-- < 80K: NO
                              |-- > 80K: YES

A.Merceron

Data Warehouse Techniques

2010

50

Decision Tree Induction


Many Algorithms:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ,SPRINT

A.Merceron

Data Warehouse Techniques

2010

51

Tree Induction
Greedy strategy.
Split the records based on an attribute test that
optimizes certain criterion.

Issues
Determine how to split the records
How

to specify the attribute test condition?

How

to determine the best split?

Determine when to stop splitting

A.Merceron

Data Warehouse Techniques

2010

52

General Structure of Hunt's Algorithm


Let Dt be the set of training records
that reach a node t
General Procedure:
If Dt contains records that
belong to the same class yt, then t
is a leaf node labeled as yt
If Dt is an empty set, then t is a
leaf node labeled by the default
class, yd

Training data Dt: the same Tid 1-10 table (Refund, Marital Status,
Taxable Income, Cheat) as before.

If Dt contains records that


belong to more than one class,
use an attribute test to split the
data into smaller subsets.
Recursively apply the
procedure to each subset.
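A compact Python sketch of this recursive procedure (illustrative code, not the slides' own implementation; it uses the GINI index defined later to pick the split, handles only categorical attributes, and assumes the data fits in memory):

```python
from collections import Counter

def gini(labels):
    """GINI index of a set of class labels: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def hunt(records, labels, attributes, default):
    """records: list of dicts, labels: class values, attributes: attribute names still usable."""
    if not records:                       # empty set -> leaf labeled with the default class
        return default
    if len(set(labels)) == 1:             # all records belong to the same class -> leaf
        return labels[0]
    if not attributes:                    # nothing left to split on -> majority class
        return Counter(labels).most_common(1)[0][0]

    def split_gini(attr):                 # weighted GINI of the multi-way split on attr
        groups = {}
        for rec, lab in zip(records, labels):
            groups.setdefault(rec[attr], []).append(lab)
        n = len(labels)
        return sum(len(g) / n * gini(g) for g in groups.values())

    best = min(attributes, key=split_gini)            # attribute test that optimizes the criterion
    majority = Counter(labels).most_common(1)[0][0]
    tree = {"split_on": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for v in {rec[best] for rec in records}:           # recursively apply the procedure to each subset
        sub = [(r, l) for r, l in zip(records, labels) if r[best] == v]
        tree["branches"][v] = hunt([r for r, _ in sub], [l for _, l in sub], remaining, majority)
    return tree

# Usage sketch (hypothetical toy call):
# tree = hunt(records, labels, ["Refund", "Marital Status"], default="No")
```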
A.Merceron

Data Warehouse Techniques

Dt

2010

53

How to Specify Test Condition?


Depends on attribute types
Nominal
Ordinal
Continuous

Depends on number of ways to split


2-way split or binary split
Multi-way split

A.Merceron

Data Warehouse Techniques

2010

54

Splitting Based on Nominal Attributes


Multi-way split: Use as many partitions as distinct
values.
CarType
Family

Luxury
Sports

Binary split: Divides values into two subsets.


Need to find optimal partitioning.
{Sports,
Luxury}

A.Merceron

CarType
{Family}

{Family,
Luxury}

OR

Data Warehouse Techniques

2010

CarType
{Sports}

55

Splitting Based on Continuous Attributes


Different ways of handling
Discretization to form an ordinal categorical attribute

Static: discretize once at the beginning

Dynamic: ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.

Binary Decision: (A < v) or (A ≥ v)

consider all possible splits and find the best cut
can be more compute intensive

A.Merceron

Data Warehouse Techniques

2010

56

Splitting Based on Continuous Attributes

(i) Binary split:      Taxable Income > 80K?  -->  Yes / No

(ii) Multi-way split:  Taxable Income?  -->  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K

A.Merceron

Data Warehouse Techniques

2010

57

Tree Induction
Greedy strategy.
Split the records based on an attribute test that
optimizes certain criterion.

Issues
Determine how to split the records
How

to specify the attribute test condition?

How

to determine the best split?

Determine when to stop splitting

A.Merceron

Data Warehouse Techniques

2010

58

How to determine the Best Split


Greedy approach:
Nodes with homogeneous class distribution are
preferred

Need a measure of node impurity:

C0: 5, C1: 5                     C0: 9, C1: 1
Non-homogeneous,                 Quite homogeneous,
high degree of impurity          low degree of impurity

A.Merceron

Data Warehouse Techniques

2010

59

Measures of Node Impurity


Gini Index
Entropy
Misclassification error

A.Merceron

Data Warehouse Techniques

2010

60

Measure of Impurity: GINI


Gini Index for a given node t:

GINI(t) = 1 - Σ_j [p(j|t)]²

(p(j|t) is the relative frequency of class j at node t).


Maximum (1 - 1/nc) when records are equally
distributed among all classes, implying least
interesting information
Minimum (0.0) when all records belong to one class,
implying most interesting information
C1: 0, C2: 6   Gini = 0.000
C1: 1, C2: 5   Gini = 0.278
C1: 2, C2: 4   Gini = 0.444
C1: 3, C2: 3   Gini = 0.500

A.Merceron

Data Warehouse Techniques

2010
61

Examples for computing GINI

GINI(t) = 1 - Σ_j [p(j|t)]²

C1: 0, C2: 6
P(C1) = 0/6 = 0     P(C2) = 6/6 = 1
Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0

C1: 1, C2: 5
P(C1) = 1/6         P(C2) = 5/6
Gini = 1 - (1/6)² - (5/6)² = 0.278

C1: 2, C2: 4
P(C1) = 2/6         P(C2) = 4/6
Gini = 1 - (2/6)² - (4/6)² = 0.444
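The same computations in a few lines of Python (a quick check, not from the original slides):

```python
def gini(counts):
    """GINI index of a node from its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5
```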


Data Warehouse Techniques

2010

62

Splitting Based on GINI


Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the
quality of split is computed as,

GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i = number of records at child i,
n = number of records at node p.

A.Merceron

Data Warehouse Techniques

2010

63

Splitting Based on GINI


GINI(t) = 1 - Σ_j [p(j|t)]²          GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

Training data: the same Tid 1-10 table (Refund, Marital Status, Taxable Income, Cheat) as before.

Example: split on Refund
GINI_Refund=Yes = 1 - ((0/3)² + (3/3)²) = 0
GINI_Refund=No  = 1 - ((3/7)² + (4/7)²) = 0.49
GINI_Refund = (3/10)·0 + (7/10)·0.49 = 0.343

A.Merceron

Data Warehouse Techniques

2010

64

Splitting Based on GINI


GINI(t) = 1 - Σ_j [p(j|t)]²          GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

Training data: the same Tid 1-10 table (Refund, Marital Status, Taxable Income, Cheat) as before.

Example: split on Marital Status
GINI_M,Single   = 1 - ((2/4)² + (2/4)²) = 0.5
GINI_M,Married  = 1 - ((0/4)² + (4/4)²) = 0
GINI_M,Divorced = 1 - ((1/2)² + (1/2)²) = 0.5
GINI_M = (4/10)·0.5 + (4/10)·0 + (2/10)·0.5 = 0.3
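A short Python check of both splits (illustrative code; the partitions are written as (cheat, no-cheat) class counts per child):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """partitions: class counts of each child, e.g. [(0, 3), (3, 4)]."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Refund: Yes -> (0 cheat, 3 no), No -> (3 cheat, 4 no)
print(round(gini_split([(0, 3), (3, 4)]), 3))            # 0.343
# Marital Status: Single (2, 2), Married (0, 4), Divorced (1, 1)
print(round(gini_split([(2, 2), (0, 4), (1, 1)]), 3))    # 0.3
```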

A.Merceron


2010

65

Alternative Splitting Criteria based on INFO


Entropy at a given node t:

Entropy(t) = - Σ_j p(j|t) · log₂ p(j|t)

(NOTE: p( j | t) is the relative frequency of class j at node t).

Measures homogeneity of a node.


Maximum

(log nc) when records are equally distributed among all


classes implying least information

Minimum

(0.0) when all records belong to one class, implying


most information

Entropy based computations are similar to the GINI index


computations

A.Merceron

Data Warehouse Techniques

2010

66

Examples for computing Entropy

Entropy(t) = - Σ_j p(j|t) · log₂ p(j|t)

C1: 0, C2: 6
P(C1) = 0/6 = 0     P(C2) = 6/6 = 1
Entropy = - 0 log₂ 0 - 1 log₂ 1 = - 0 - 0 = 0

C1: 1, C2: 5
P(C1) = 1/6         P(C2) = 5/6
Entropy = - (1/6) log₂ (1/6) - (5/6) log₂ (5/6) = 0.65

C1: 2, C2: 4
P(C1) = 2/6         P(C2) = 4/6
Entropy = - (2/6) log₂ (2/6) - (4/6) log₂ (4/6) = 0.92
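The same computations in Python (a quick check, not from the original slides; 0·log 0 is taken as 0):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(entropy(counts), 2))   # 0.0, 0.65, 0.92
```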


Data Warehouse Techniques

2010

67

Splitting Criteria based on Classification Error


Classification error at a node t :

Error(t) = 1 - max_i P(i|t)

Measures misclassification error made by a node.


Maximum

(1 - 1/nc) when records are equally distributed


among all classes, implying least interesting information

Minimum

(0.0) when all records belong to one class, implying


most interesting information

A.Merceron

Data Warehouse Techniques

2010

68

Examples for Computing Error

Error(t) = 1 - max_i P(i|t)

C1: 0, C2: 6
P(C1) = 0/6 = 0     P(C2) = 6/6 = 1
Error = 1 - max(0, 1) = 1 - 1 = 0

C1: 1, C2: 5
P(C1) = 1/6         P(C2) = 5/6
Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

C1: 2, C2: 4
P(C1) = 2/6         P(C2) = 4/6
Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
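The same computations in Python (a quick check, not from the original slides):

```python
def classification_error(counts):
    """Misclassification error of a node from its class counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(classification_error(counts), 3))   # 0.0, 0.167, 0.333
```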


Data Warehouse Techniques

2010

69

Misclassification Error vs Gini


Parent: C1 = 7, C2 = 3, Gini = 0.42

Split on A?
Node N1 (A = Yes): C1 = 3, C2 = 0
Node N2 (A = No):  C1 = 4, C2 = 3

Gini(N1) = 1 - (3/3)² - (0/3)² = 0
Gini(N2) = 1 - (4/7)² - (3/7)² = 0.489
Gini(Children) = 3/10 · 0 + 7/10 · 0.489 = 0.342

Misclassification error: parent = 3/10 = 0.3;
children = (3/10) · 0 + (7/10) · (3/7) = 0.3, i.e. unchanged.

Gini improves, but misclassification error does not!

A.Merceron

Data Warehouse Techniques

2010

70

Tree Induction
Greedy strategy.
Split the records based on an attribute test that
optimizes certain criterion.

Issues
Determine how to split the records
How

to specify the attribute test condition?

How

to determine the best split?

Determine when to stop splitting

A.Merceron

Data Warehouse Techniques

2010

71

Stopping Criteria for Tree Induction


Stop expanding a node when all the records
belong to the same class
Stop expanding a node when all the records have
similar attribute values
Early termination (like depth)

A.Merceron

Data Warehouse Techniques

2010

72

Decision Tree Based Classification


Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification
techniques for many simple data sets

A.Merceron

Data Warehouse Techniques

2010

73

Practical Issues of Classification


Underfitting and Overfitting
Performance and Costs of Classification

A.Merceron

Data Warehouse Techniques

2010

74

Underfitting and Overfitting


Underfitting: tree is not complete enough and does
not predict well.
Overfitting: tree predicts well only the training data:
many long branches with few objects in each leaf,
that reflect particular cases in the training set.

A.Merceron

Data Warehouse Techniques

2010

75

Overfitting: Best Practice


Overfitting results in decision trees that are more
complex than necessary
Make sure the tree makes sense: each branch
says something meaningful and sensible.
Branches should not be too long!
Build two trees, one with no-pruning, one with
pruning and compare.

A.Merceron

Data Warehouse Techniques

2010

76

Estimating Generalization Errors


Re-substitution errors: error on the training set ( e(t) )
Generalization errors: error on the test set ( e'(t) )
Methods for estimating generalization errors:
Optimistic approach: e'(t) = e(t)
Pessimistic approach:

For each leaf node: e'(t) = e(t) + 0.5

Total errors: e'(T) = e(T) + N × 0.5 (N: number of leaf nodes)
For a tree with 30 leaf nodes and 10 errors on training
(out of 1000 instances):
Training error = 10/1000 = 1%
Generalization error = (10 + 30 × 0.5)/1000 = 2.5%

Reduced error pruning (REP):

A.Merceron

uses validation data set to estimate generalization


error

Data Warehouse Techniques

2010

77

Occam's Razor
Given two models of similar generalization errors,
one should prefer the simpler model over the
more complex model
For complex models, there is a greater chance
that it was fitted accidentally by errors in data
Therefore, one should include model complexity
when evaluating a model

A.Merceron

Data Warehouse Techniques

2010

78

How to Address Overfitting


Pre-Pruning (Early Stopping Rule)
Stop the algorithm before it becomes a fully-grown tree
Typical stopping conditions for a node:

Stop if all instances belong to the same class

Stop if all the attribute values are the same

More restrictive conditions:


Stop if the number of instances is less than some user-specified
threshold (see RapidMiner)

Stop if the class distribution of instances is independent of the

available features (e.g., using the χ² test)

A.Merceron

Stop if expanding the current node does not improve impurity


measures (e.g., Gini or information gain).

Data Warehouse Techniques

2010

79

How to Address Overfitting


Post-pruning
Grow decision tree to its entirety
Trim the nodes of the decision tree in a bottom-up
fashion
If generalization error improves after trimming, replace
sub-tree by a leaf node.
Class label of leaf node is determined from majority
class of instances in the sub-tree
Can use MDL (Minimum Description Length) for postpruning

A.Merceron

Data Warehouse Techniques

2010

80

Example of Post-Pruning
Training Error (Before splitting) = 10/30
Node before splitting: Class = Yes: 20, Class = No: 10, Error = 10/30

Pessimistic error (Before splitting) = (10 + 0.5)/30 = 10.5/30

Training Error (After splitting on A, with children A1, A2, A3, A4) = 9/30

Pessimistic error (After splitting)
= (9 + 4 × 0.5)/30 = 11/30

=> PRUNE the split!

A.Merceron

Data Warehouse Techniques

2010

81

Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?

Methods for Performance Evaluation


How to obtain reliable estimates?

A.Merceron

Data Warehouse Techniques

2010

82

Metrics for Performance Evaluation


Focus on the predictive capability of a model
Rather than how fast it takes to classify or build
models, scalability, etc.

Confusion Matrix:

                          PREDICTED CLASS
                          Class=Yes               Class=No
ACTUAL    Class=Yes       a: TP (true positive)   b: FN (false negative)
CLASS     Class=No        c: FP (false positive)  d: TN (true negative)

A.Merceron

Data Warehouse Techniques

2010

83

Metrics for Performance Evaluation


                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL   Class=Yes   a (TP)       b (FN)
CLASS    Class=No    c (FP)       d (TN)

Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
A.Merceron

Data Warehouse Techniques

2010

84

Limitation of Accuracy
False positives and false negatives may not have
the same weight: wrongly predicting a student's
failure (and having her abandon a degree) is
worse than wrongly predicting a student's success
(and encouraging her to continue her degree
though she may fail).

A.Merceron

Data Warehouse Techniques

2010

85

Accuracy: Best Practice


Do false positives and false negatives have the
same significance for your data? And handle
accordingly: minimize one category, or establish
a cost matrix.

A.Merceron

Data Warehouse Techniques

2010

86

Cost Matrix

                       PREDICTED CLASS
C(i|j)                 Class=Yes    Class=No
ACTUAL    Class=Yes    C(Yes|Yes)   C(No|Yes)
CLASS     Class=No     C(Yes|No)    C(No|No)

C(i|j): cost of misclassifying a class j example as class i

A.Merceron

Data Warehouse Techniques

2010

87

Computing Cost of Classification


Cost Matrix:
                       PREDICTED CLASS
C(i|j)                 Class=Yes    Class=No
ACTUAL    Class=Yes    -1           100
CLASS     Class=No     1            0

Model M1:
                       PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    150          40
CLASS     Class=No     60           250

Accuracy = 80%
Cost = 3910

Model M2:
                       PREDICTED CLASS
                       Class=Yes    Class=No
ACTUAL    Class=Yes    250          45
CLASS     Class=No     5            200

Accuracy = 90%
Cost = 4255
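These numbers can be reproduced with a few lines of Python (illustrative check; the two cost-matrix entries not visible on the slide and M2's false-positive count are the values implied by the reported accuracies and costs):

```python
# Confusion matrices and cost matrix as {(actual, predicted): value}.
cost_matrix = {("Yes", "Yes"): -1, ("Yes", "No"): 100, ("No", "Yes"): 1, ("No", "No"): 0}
M1 = {("Yes", "Yes"): 150, ("Yes", "No"): 40, ("No", "Yes"): 60, ("No", "No"): 250}
M2 = {("Yes", "Yes"): 250, ("Yes", "No"): 45, ("No", "Yes"): 5, ("No", "No"): 200}

def accuracy(cm):
    return sum(n for (a, p), n in cm.items() if a == p) / sum(cm.values())

def cost(cm):
    return sum(n * cost_matrix[key] for key, n in cm.items())

print(accuracy(M1), cost(M1))   # 0.8, 3910
print(accuracy(M2), cost(M2))   # 0.9, 4255
```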
2010

88

Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?

Methods for Performance Evaluation


How to obtain reliable estimates?
Training

A.Merceron

and test

Data Warehouse Techniques

2010

89

Data Warehouse Techniques


Type of data
Data preprocessing or data scrubbing
Data exploration, data cube and OLAP
Data Mining
Similarity between data objects
Data clustering and clustering evaluation
Data classification and classification evaluation
Interesting association rules

A.Merceron

Data Warehouse Techniques

2010

90

Amazon.com Example

A.Merceron

Data Warehouse Techniques

2010

91

Association Rule Mining


Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions:

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

A.Merceron

Data Warehouse Techniques


2010

92

Definition: Frequent Itemset


Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset: an itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset, e.g.
σ({Milk, Bread, Diaper}) = 2
Support (s)
Fraction of transactions that contain an
itemset, e.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or
equal to a minsup threshold

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
A.Merceron

Data Warehouse Techniques

2010

93

Definition: Association Rule


Association Rule

An implication expression of the form X → Y, where X and Y are disjoint itemsets
Example: {Milk, Diaper} → {Beer}

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Rule Evaluation Metrics

Support (s)
Fraction of transactions that contain both X and Y
Confidence (c)
Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
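A short Python check of these two values (illustrative code, not from the original slides):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)   # itemset <= t: subset test

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = support_count(X | Y) / len(transactions)
confidence = support_count(X | Y) / support_count(X)
print(round(support, 2), round(confidence, 2))   # 0.4, 0.67
```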

A.Merceron

Data Warehouse Techniques

2010

94

Evaluation metrics and probabilities


What gives its direction to X → Y?

Rule Evaluation Metrics

Support (s) = P(X ∪ Y), the probability that a transaction contains both X and Y.
It is symmetric.

Confidence (c) = P(Y | X).
It is not symmetric and gives its direction to a rule.

Example: {Milk, Diaper} → {Beer} (same transactions as before)

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

A.Merceron

Data Warehouse Techniques

2010

95

Association Rule Mining Task


Given a set of transactions T, the goal of
association rule mining is to find all rules having
support minsup threshold
confidence minconf threshold

Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf
thresholds

Computationally prohibitive!
A.Merceron

Data Warehouse Techniques

2010

96

Mining Association Rules


TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Rules:

{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but
can have different confidence
Thus: decouple the support and confidence requirements
A.Merceron

Data Warehouse Techniques

2010

97

Mining Association Rules


Two-step approach:
Frequent Itemset Generation

Generate all itemsets whose support ≥ minsup

Apriori algorithm

Rule Generation

Generate high confidence rules from each frequent


itemset, where each rule is a binary partitioning of a
frequent itemset

Frequent itemset generation is still


computationally expensive
A.Merceron

Data Warehouse Techniques

2010

98

Illustrating Apriori Principle


Items (1-itemsets):                 Minimum Support = 3

Item      Count
Bread     4
Coke      2
Milk      4
Beer      3
Diaper    4
Eggs      1

Pairs (2-itemsets):
(no need to generate candidates involving Coke or Eggs)

Itemset             Count
{Bread, Milk}       3
{Bread, Beer}       2
{Bread, Diaper}     3
{Milk, Beer}        2
{Milk, Diaper}      3
{Beer, Diaper}      3

Triplets (3-itemsets):

Itemset                   Count
{Bread, Milk, Diaper}     2

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13

A.Merceron

Data Warehouse Techniques

2010

99

Apriori Algorithm
Ck: candidate itemsets of size k; Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = apriori-gen(Lk);
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
A.Merceron

Data Warehouse Techniques

2010

100

Apriori-Gen

Suppose the items in Lk-1 are listed in an order

Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
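A compact Python sketch of the whole Apriori loop, including this candidate generation and pruning step (illustrative code for small in-memory transaction lists, not the slides' own implementation):

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {k: v for k, v in counts.items() if v >= min_support_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # apriori-gen, step 1 (self-join): merge frequent k-itemsets sharing k-1 items.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1:
                    candidates.add(union)
        # apriori-gen, step 2 (prune): every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count the surviving candidates in one pass over the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Usage on the market-basket example with minimum support count 3:
baskets = [{"Bread", "Milk"},
           {"Bread", "Diaper", "Beer", "Eggs"},
           {"Milk", "Diaper", "Beer", "Coke"},
           {"Bread", "Milk", "Diaper", "Beer"},
           {"Bread", "Milk", "Diaper", "Coke"}]
for itemset, count in sorted(apriori(baskets, 3).items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(set(itemset), count)
# Prints the frequent 1- and 2-itemsets; the candidate triplet {Bread, Milk, Diaper}
# has support count 2 and is therefore not frequent at minsup = 3.
```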


A.Merceron

Data Warehouse Techniques

2010

101

Exercise 01
Find all frequent itemsets using the Apriori algorithm.
Minimum support count: 2.

I1, I2, I5
I2, I4
I2, I3
I1, I2, I4
I1, I3
I2, I3
I1, I3
I1, I2, I3, I5
I1, I2, I3, I5

A.Merceron

Data Warehouse Techniques

2010

102

Effect of Support Distribution


How to set the appropriate minsup threshold?
If minsup is set too high, we could miss itemsets
involving interesting rare items (e.g., expensive
products)
If minsup is set too low, it is computationally
expensive and the number of itemsets is very large

Using a single minimum support threshold may


not be effective
A.Merceron

Data Warehouse Techniques

2010

103

Support: Best practice


With data exploration get an overview of the
items or products.
Use this exploration to select the items you want
to find associations for and to fix support and
confidence.

A.Merceron

Data Warehouse Techniques

2010

104

Rule Generation
Given a frequent itemset L, find all non-empty
subsets f ⊂ L such that f → L − f satisfies the
minimum confidence requirement
If {A,B,C,D} is a frequent itemset, candidate rules:

ABC → D,  ABD → C,  ACD → B,  BCD → A,
A → BCD,  B → ACD,  C → ABD,  D → ABC,
AB → CD,  AC → BD,  AD → BC,  BC → AD,
BD → AC,  CD → AB

If |L| = k, then there are 2^k − 2 candidate

association rules (ignoring L → ∅ and ∅ → L)

A.Merceron

Data Warehouse Techniques

2010

105

Rule Generation
How to efficiently generate rules from frequent
itemsets?
In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)

But confidence of rules generated from the same

itemset has an anti-monotone property
e.g., L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone w.r.t. the number of items on the
RHS of the rule

A.Merceron

Data Warehouse Techniques

2010

106

Pattern Evaluation
Association rule algorithms tend to produce too
many rules
many of them are uninteresting or redundant
Redundant if {A,B,C} → {D} and {A,B} → {D}
have same support & confidence

Interestingness measures can be used to


prune/rank the derived patterns
In the original formulation of association rules,
support & confidence are the only measures used
A.Merceron

Data Warehouse Techniques

2010

107

Support & Confidence are limited


5000 transactions
Case 1: |X, Y| = 1000, |X| = 1000 and |Y| = 2500
Case 2: |X, Y| = 1000, |X| = 1000 and |Y| = 5000
In both cases: support(X → Y) = 20%, confidence(X → Y) = 100%

Case 1:            Y      !Y
        X       1000       0     1000
        !X      1500    2500     4000
                2500    2500     5000

Case 2:            Y      !Y
        X       1000       0     1000
        !X      4000       0     4000
                5000       0     5000

A.Merceron

Data Warehouse Techniques

2010

108

Interesting rules

5000 transactions

X and Y:

|X| = |X ∧ Y| = 1000, |Y| = 2500
sup(X → Y) = 20 %
conf(X → Y) = 100 %

|X| = |X ∧ Y| = 1000, |Y| = 5000
sup(X → Y) = 20 %
conf(X → Y) = 100 %

|X| = |X ∧ Y| = 4800, |Y| = 5000
sup(X → Y) = 96 %
conf(X → Y) = 100 %
109

Application of Interestingness Measure


[Figure: the knowledge discovery pipeline: Data -> Selection -> Selected Data ->
Preprocessing -> Preprocessed Data -> Mining -> Patterns -> Postprocessing with
Interestingness Measures -> Knowledge]
A.Merceron

Data Warehouse Techniques

2010

110

Interestingness Measures
Lift, Added Value, cosine

A.Merceron

Data Warehouse Techniques

2010

111

Added Value
X and Y are related if the proportion of transactions
containing Y among the transactions containing X is
greater than the proportion of transactions containing
Y among all transactions. AV(X → Y) and AV(Y → X)
are linked!

AV(X → Y) = P(Y | X) − P(Y) = conf(X → Y) − P(Y)
AV(Y → X) = P(X | Y) − P(X) = conf(Y → X) − P(X)

A.Merceron

Data Warehouse Techniques

2010

112

Added Value and lift


Lift is exactly 1 when added value is 0, greater than 1
when added value is positive and below 1 when added
value is negative.
A rule is not interesting if its lift is around or below 1.
Lift is 1: X and Y are independent in the sense of
probability theory.

lift(X → Y) = P(X, Y) / (P(X) · P(Y)) = conf(X → Y) / P(Y) = (|X, Y| · n) / (|X| · |Y|)

A.Merceron

Data Warehouse Techniques

2010

113

Drawback of Lift with strong symmetric rules


Lift does not have the null-invariant property: it is
sensitive to transactions containing neither item X
nor item Y.

A.Merceron

Data Warehouse Techniques

2010

114

Cosine
A, B two vectors of length n: A = (a1, ..., an), B = (b1, ..., bn)

cosine(A, B) = (A · B) / (||A|| · ||B||)

A · B = Σ_{k=1..n} a_k b_k          ||X|| = sqrt( Σ_{k=1..n} x_k² )

A.Merceron

Data Warehouse Techniques

2010

115

Cosine X → Y
X = (x1, ..., xn)
x_k is 1 if transaction t_k contains X, 0 otherwise.
Example: X is {Bread, Milk}, Y is {Diaper} gives
vector X = (1, 0, 0, 1, 1) and vector Y = (0, 1, 1, 1, 1)

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

A.Merceron

Data Warehouse Techniques

2010

116

Cosine
A rule is not interesting if its cosine is below 0.66.
Null-invariant property: transactions containing
neither item X nor item Y have no influence.
X and Y are the most related (value 1) when each
transaction contains either both X and Y or neither X
nor Y.

cosine(X → Y) = P(X, Y) / sqrt( P(X) · P(Y) ) = |X, Y| / sqrt( |X| · |Y| )
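A quick Python check of lift and cosine on the three cases of the "Interesting rules" slides (illustrative code; only the counts are needed):

```python
from math import sqrt

def lift(n_xy, n_x, n_y, n):
    return (n_xy * n) / (n_x * n_y)

def cosine(n_xy, n_x, n_y):
    return n_xy / sqrt(n_x * n_y)

cases = [  # (|X, Y|, |X|, |Y|) with n = 5000 transactions
    (1000, 1000, 2500),
    (1000, 1000, 5000),
    (4800, 4800, 5000),
]
for n_xy, n_x, n_y in cases:
    print(round(cosine(n_xy, n_x, n_y), 2), round(lift(n_xy, n_x, n_y, 5000), 2))
# -> 0.63 2.0 ; 0.45 1.0 ; 0.98 1.0
```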
A.Merceron

Data Warehouse Techniques

2010

117

Interesting rules

5000 transactions

X and Y:

|X| = |X ∧ Y| = 1000, |Y| = 2500
cosine(X → Y) = 0.63
lift(X → Y) = 2

|X| = |X ∧ Y| = 1000, |Y| = 5000
cosine(X → Y) = 0.45
lift(X → Y) = 1

|X| = |X ∧ Y| = 4800, |Y| = 5000
cosine(X → Y) = 0.98
lift(X → Y) = 1
118

Interestingness: Best practice


Prune the rules with two distinct measures like lift
and cosine.
If cosine and lift agree, easy.
If they do not agree: look at support and confidence.
Strong rule? Follow cosine.
Is knowing that X occurred more important than knowing that it
did not occur? If yes, follow cosine, if not follow lift.

Ponder whether the associations make sense.

A.Merceron

Data Warehouse Techniques

2010

119

Tools
Commercial:
IBM: Intelligent Miner.
SPSS: Clementine.

Open source:
Weka
RAPIDMINER
KNIME

A.Merceron

Data Warehouse Techniques

2010

120
