Type of data
Data preprocessing or data scrubbing
Data exploration, data cube and OLAP
Data Mining
Similarity between data objects
Data clustering and clustering evaluation
Data classification and classification evaluation
Interesting association rules
Intra-cluster distances are minimized.
Clustering precipitation in Australia
What is not cluster analysis
Supervised classification: has class label information.
Simple segmentation: dividing students into different registration groups alphabetically, by last name.
Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and
partitional sets of clusters
Partitional Clustering
A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly
one subset
Hierarchical clustering
A set of nested clusters organized as a hierarchical
tree
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center
point)
Each point is assigned to the cluster with the closest
centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
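As a concrete sketch of these steps in Python (assuming NumPy; the function and parameter names are illustrative, not from the slides):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Basic K-means: X is an (n_points, n_features) array, k the number of clusters.
    rng = np.random.default_rng(seed)
    # 1. Pick k initial centroids at random from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids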
Figure: a set of original points and a partitional clustering of them.
Figure: the same original points clustered two ways, an optimal clustering and a sub-optimal clustering (K-means can converge to either, depending on the initial centroids).
Property of K-means
K-means minimizes the sum of squared errors (SSE):
SSE = Σ_{i=1..K} Σ_{x ∈ Ci} dist²(mi, x)
where mi is the centroid of cluster Ci.
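A small companion sketch for the SSE of a clustering, reusing the labels and centroids produced by the K-means sketch above (illustrative names):

import numpy as np

def sse(X, labels, centroids):
    # Sum over all points of the squared distance to the centroid of their cluster.
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))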
Advantages of K-means
Is efficient.
Can be computed in a distributed way.
Is easy to apply.
Limitations of K-means
Figures: original points compared with the clusters K-means finds (e.g. 2 clusters), illustrating these limitations.
Hierarchical Clustering
Produces a set of nested clusters organized as a
hierarchical tree
Can be visualized as a dendrogram
A tree-like diagram that records the sequences of merges or splits
Figure: an example dendrogram.
Example: distance matrix for five objects a to e.

      a     b     c     d     e
a     0     1     5     7.2   8.5
b           0     4.5   6.7   7.8
c                 0     2.2   3.6
d                       0     2
e                             0
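Assuming SciPy (and matplotlib for drawing), agglomerative clustering of these five objects could be sketched as follows; single linkage is chosen here only for illustration, the slides do not fix a linkage method:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = ["a", "b", "c", "d", "e"]
# Full symmetric distance matrix, values taken from the table above.
D = np.array([
    [0.0, 1.0, 5.0, 7.2, 8.5],
    [1.0, 0.0, 4.5, 6.7, 7.8],
    [5.0, 4.5, 0.0, 2.2, 3.6],
    [7.2, 6.7, 2.2, 0.0, 2.0],
    [8.5, 7.8, 3.6, 2.0, 0.0],
])

# linkage() expects a condensed distance matrix; squareform() converts it.
Z = linkage(squareform(D), method="single")
dendrogram(Z, labels=labels)  # the dendrogram records the sequence of merges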
Figures: further examples of nested clusters and the corresponding dendrograms.
Limitations
Biased towards globular clusters
Cluster Validity
How do we evaluate the goodness of the resulting clusters?
And why do we want to evaluate them at all?
Figure: clusters found in random points by DBSCAN, K-means, and complete link.
Figure: SSE plotted against the number of clusters.
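One way to produce such a plot is to run K-means for several values of K and record the SSE each time. A sketch with scikit-learn, whose inertia_ attribute is exactly the SSE (the data array X is assumed to exist):

from sklearn.cluster import KMeans

def sse_per_k(X, k_values):
    # Run K-means for each candidate K and keep the resulting SSE.
    sses = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sses[k] = km.inertia_  # inertia_ = sum of squared distances to the closest centroid
    return sses

# sse_per_k(X, range(2, 11)): plot K against SSE and look for an elbow.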
Classification framework: a learning algorithm performs induction on the Training Set to learn a Model; the Model is then applied to the Test Set (deduction) to predict the class of each test record.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Classification: Definition
Given a collection of records (training set).
Each record contains a set of attributes; one of the attributes is the class.
Goal: learn a model for the class attribute as a function of the other attributes, so that previously unseen records can be assigned a class.
Example of a decision tree built from the training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Splitting attributes: Refund, MarSt, TaxInc.
Refund = Yes: NO
Refund = No: test MarSt
  MarSt = Married: NO
  MarSt = Single or Divorced: test TaxInc
    TaxInc < 80K: NO
    TaxInc > 80K: YES
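For illustration, a decision tree can be fitted to this training data with scikit-learn; the encoding of the nominal attributes and the column names below are my own choices, not prescribed by the slides:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The ten training records from the table above.
data = pd.DataFrame({
    "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital": ["Single", "Married", "Single", "Married", "Divorced",
                "Married", "Divorced", "Single", "Married", "Single"],
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # taxable income in K
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# One-hot encode the nominal attributes so the tree can split on them.
X = pd.get_dummies(data[["Refund", "Marital", "Income"]])
y = data["Cheat"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # textual view of the learned splits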
The same induction/deduction framework, specialized to decision trees: a tree induction algorithm learns a decision tree model from the Training Set; the decision tree is then applied to the Test Set.
Applying the model to test data: take the test record Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?. Starting from the root of the tree: Refund = No leads to MarSt; MarSt = Married leads to the leaf NO. The record is therefore assigned Cheat = No.
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records:
How to specify the attribute test condition?
How to determine the best split?
The tree is grown from the training data above (Refund, Marital Status, Taxable Income, Cheat); Dt denotes the set of training records that reach a node t.
Splitting on a nominal attribute such as CarType (values Family, Sports, Luxury):
Multi-way split: one branch per value (Family, Sports, Luxury).
Binary split: CarType in {Sports, Luxury} vs. {Family}, OR CarType in {Family, Luxury} vs. {Sports}.
Splitting on the continuous attribute Taxable Income:
Binary split: Taxable Income > 80K? (Yes / No).
Multi-way split: Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records:
How to specify the attribute test condition?
How to determine the best split?
C0: 5, C1: 5: non-homogeneous, high degree of impurity.
C0: 9, C1: 1: quite homogeneous, low degree of impurity.
A measure of node impurity is needed to compare splits.
GINI index
GINI(t) = 1 - Σ_j [p(j | t)]²

Examples (class counts C1 and C2 at a node):
C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
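A small sketch of this computation (the function name is mine), reproducing the example values:

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, where counts are the class counts at node t.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # 0.000
print(gini([1, 5]))  # ~0.278
print(gini([2, 4]))  # ~0.444
print(gini([3, 3]))  # 0.500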
Computing GINI(t) = 1 - Σ_j [p(j | t)]²:
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1, Gini = 1 - 0² - 1² = 0.000
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6, Gini = 1 - (1/6)² - (5/6)² = 0.278
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6, Gini = 1 - (2/6)² - (4/6)² = 0.444
When a node is split into k partitions (children):
GINIsplit = Σ_{i=1..k} (ni / n) · GINI(i)
where ni is the number of records at child i and n the number of records at the parent node.
Example: Refund (using the training data above; 3 records have Refund = Yes, 7 have Refund = No)
GINI(Refund = Yes) = 1 - ((0/3)² + (3/3)²) = 0
GINI(Refund = No) = 1 - ((3/7)² + (4/7)²) = 0.49
GINIsplit(Refund) = (3/10) · 0 + (7/10) · 0.49 = 0.343
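The weighted GINIsplit for Refund can be sketched in the same style, reusing the gini helper above:

def gini_split(partitions):
    # GINIsplit = sum_i (n_i / n) * GINI(i) over the children of a split.
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Refund = Yes -> (0 Cheat, 3 no Cheat); Refund = No -> (3 Cheat, 4 no Cheat)
print(gini_split([[0, 3], [3, 4]]))  # ~0.343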
The same GINIsplit computation applies to splits on the continuous attribute Taxable Income of the training data.
Entropy
Entropy(t) = - Σ_j p(j | t) · log2 p(j | t)
Minimum (0.0) when all records belong to one class.
Computing Entropy(t) = - Σ_j p(j | t) · log2 p(j | t):
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1, Entropy = - 0·log2(0) - 1·log2(1) = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6, Entropy = - (1/6)·log2(1/6) - (5/6)·log2(5/6) ≈ 0.65
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6, Entropy = - (2/6)·log2(2/6) - (4/6)·log2(4/6) ≈ 0.92
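The same example counts with entropy instead of GINI (illustrative sketch):

import math

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0 * log2(0) taken as 0.
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))  # 0.00
print(entropy([1, 5]))  # ~0.65
print(entropy([2, 4]))  # ~0.92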
Classification error
Error(t) = 1 - max_i P(i | t)
Minimum (0.0) when all records belong to one class.
Computing Error(t) = 1 - max_i P(i | t):
C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1, Error = 1 - 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6, Error = 1 - 5/6 = 1/6
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6, Error = 1 - 4/6 = 1/3
Example: splitting on a binary attribute A (Yes leads to node N1, No to node N2).
Class counts: N1: C1 = 3, C2 = 0; N2: C1 = 4, C2 = 3. Parent node (C1 = 7, C2 = 3): Gini = 0.42.
Gini(N1) = 1 - (3/3)² - (0/3)² = 0
Gini(N2) = 1 - (4/7)² - (3/7)² = 0.489
Gini(Children) = 3/10 · 0 + 7/10 · 0.489 = 0.342
The split lowers the Gini index.
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records:
How to specify the attribute test condition?
How to determine the best split?
Occam's Razor
Given two models with similar generalization errors, one should prefer the simpler model over the more complex model.
For complex models, there is a greater chance that the model was fitted accidentally to errors in the data.
Therefore, one should include model complexity when evaluating a model.
Example of Post-Pruning
Node before splitting: Class = Yes: 20, Class = No: 10.
Training error (before splitting) = 10/30; with a penalty of 0.5 per leaf, the pessimistic error is (10 + 0.5)/30 = 10.5/30.
After splitting attribute A into four children (A1, A2, A3, A4), the training error is 9/30 and the pessimistic error is (9 + 4 · 0.5)/30 = 11/30.
Since 11/30 > 10.5/30, the subtree is pruned: PRUNE!
Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Confusion Matrix (rows: ACTUAL CLASS, columns: PREDICTED CLASS):

                    PREDICTED Class=Yes     PREDICTED Class=No
ACTUAL Class=Yes    a: TP (true positive)   b: FN (false negative)
ACTUAL Class=No     c: FP (false positive)  d: TN (true negative)
From the confusion matrix (a = TP, b = FN, c = FP, d = TN):

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
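As a minimal sketch (cell names as above):

def accuracy(tp, fn, fp, tn):
    # Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)

# e.g. with the confusion matrix of model M1 further below: accuracy(150, 40, 60, 250) = 0.8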
Limitation of Accuracy
False positives and false negatives may not have the same weight: wrongly predicting a student's failure (and having her abandon a degree) is worse than wrongly predicting a student's success (and encouraging her to continue her degree even though she may fail).
Cost Matrix
C(i | j): cost of predicting class i when the actual class is j.

                    PREDICTED Class=Yes    PREDICTED Class=No
ACTUAL Class=Yes    C(Yes | Yes)           C(No | Yes)
ACTUAL Class=No     C(Yes | No)            C(No | No)
Computing the cost of classification with the cost matrix C(Yes | Yes) = -1, C(No | Yes) = 100, C(Yes | No) = 1, C(No | No) = 0:

Model M1: TP = 150, FN = 40, FP = 60, TN = 250. Accuracy = 80%, Cost = 150·(-1) + 40·100 + 60·1 + 250·0 = 3910.
Model M2: TP = 250, FN = 45, FP = 5, TN = 200. Accuracy = 90%, Cost = 250·(-1) + 45·100 + 5·1 + 200·0 = 4255.

M2 is more accurate, but M1 has the lower cost.
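A sketch of this cost computation; the dictionaries mirror the cost matrix and the two confusion matrices above (confusion matrices keyed by (actual, predicted), costs keyed by (predicted, actual)):

def total_cost(confusion, cost):
    # Sum over all cells of count * C(predicted | actual).
    return sum(count * cost[(pred, actual)]
               for (actual, pred), count in confusion.items())

cost = {("Yes", "Yes"): -1, ("No", "Yes"): 100, ("Yes", "No"): 1, ("No", "No"): 0}
m1 = {("Yes", "Yes"): 150, ("Yes", "No"): 40, ("No", "Yes"): 60, ("No", "No"): 250}
m2 = {("Yes", "Yes"): 250, ("Yes", "No"): 45, ("No", "Yes"): 5, ("No", "No"): 200}

print(total_cost(m1, cost))  # 3910
print(total_cost(m2, cost))  # 4255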
Model Evaluation
Metrics for performance evaluation: how to evaluate the performance of a model?
Methods for performance evaluation: reliable estimates require separate training and test sets.
Amazon.com Example
Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Support (s): the fraction of transactions that contain both X and Y, s = P(X ∪ Y). It is symmetric.
Confidence (c): how often items in Y appear in transactions that contain X, c = P(Y | X). It is not symmetric and gives its direction to a rule.

Example (using the transactions above): {Milk, Diaper} → Beer
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
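A small sketch computing support and confidence of {Milk, Diaper} → Beer on these five transactions:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    # Number of transactions that contain all items of the itemset.
    return sum(itemset <= t for t in transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = sigma(X | Y) / len(transactions)  # 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)        # 2/3, about 0.67
print(support, confidence)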
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive!
Consider rules formed from the itemset {Milk, Diaper, Beer}, e.g. {Milk, Diaper} → {Beer}, {Milk, Beer} → {Diaper}, {Beer} → {Milk, Diaper}, ...

Observations:
All these rules are binary partitions of the same itemset {Milk, Diaper, Beer}.
Rules originating from the same itemset have identical support but can have different confidence.
Thus: decouple the support and confidence requirements.
Two-step approach:
Frequent Itemset Generation: generate all itemsets whose support ≥ minsup.
Rule Generation: generate high-confidence rules from each frequent itemset.
Generating frequent itemsets with Minimum Support (count) = 3:

Items (1-itemsets):
Itemset    Count
{Bread}    4
{Coke}     2
{Milk}     4
{Beer}     3
{Diaper}   4
{Eggs}     1

Pairs (2-itemsets), no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3
Apriori Algorithm
Ck: candidate itemsets of size k; Lk: frequent itemsets of size k.

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = apriori-gen(Lk);
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
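A compact Python sketch of this loop; it is a straightforward, unoptimized reading of the pseudocode, with a simplified join step in the candidate generation:

from itertools import combinations

def count(itemset, transactions):
    # Support count: number of transactions containing the itemset.
    return sum(itemset <= set(t) for t in transactions)

def apriori(transactions, min_support):
    # Return every frequent itemset (as a frozenset) with its support count.
    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: count(c, transactions) for c in items}
    level = {c: s for c, s in level.items() if s >= min_support}  # L1
    frequent = dict(level)
    k = 1
    while level:
        # apriori-gen, join step: combine frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # apriori-gen, prune step: drop candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = {c: count(c, transactions) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}  # Lk+1
        frequent.update(level)
        k += 1
    return frequent

# Usage: apriori(list_of_transactions, min_support_count), with each transaction given as a set of items.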
Apriori-Gen
Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck;
Exercise 01
Find all frequent itemsets using the Apriori algorithm. Minimum support count: 2.

Transactions:
I1, I2, I5
I2, I4
I2, I3
I1, I2, I4
I1, I3
I2, I3
I1, I3
I1, I2, I3, I5
I1, I2, I3
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L - f satisfies the minimum confidence requirement.
If {A,B,C,D} is a frequent itemset, candidate rules include:
ABC → D, A → BCD, AB → CD, BD → AC, ...
Rule Generation
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).
Pattern Evaluation
Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant.
A rule is redundant if, for example, {A,B,C} → {D} and {A,B} → {D} have the same support and confidence.
Two example contingency tables for items X and Y (5000 transactions each):

        Y      !Y     total
X       1000   0      1000
!X      1500   2500   4000
total   2500   2500   5000

        Y      !Y     total
X       1000   0      1000
!X      4000   0      4000
total   5000   0      5000
Interesting rules
5000 transactions. Are X and Y really related in each of these cases?
(1) sup(X → Y) = 20 %, conf(X → Y) = 100 %
(2) sup(X → Y) = 20 %, conf(X → Y) = 100 %
(3) sup(X → Y) = 96 %, conf(X → Y) = 100 %
Support and confidence alone cannot tell cases (1) and (2) apart, even though the underlying contingency tables above differ.
Figure: the knowledge discovery process, Data → Selection → Selected Data → Preprocessing → Preprocessed Data → Mining → Patterns → Postprocessing, with interestingness measures applied during postprocessing.
Interestingness Measures
Lift, Added Value, cosine
Added Value
X and Y are related if the proportion of transactions containing Y among the transactions containing X is greater than the proportion of transactions containing Y among all transactions. AV(X → Y) and AV(Y → X) are linked!
AV(X → Y) = P(Y | X) - P(Y) = conf(X → Y) - P(Y)
AV(Y → X) = P(X | Y) - P(X) = conf(Y → X) - P(X)
Lift
lift(X → Y) = P(X, Y) / (P(X) · P(Y)) = conf(X → Y) / P(Y) = σ(X, Y) · n / (σ(X) · σ(Y))
Cosine
A, B two vectors of length n: A = (a1, ..., an), B = (b1, ..., bn)
cosine(A, B) = A·B / (‖A‖ · ‖B‖)
A·B = Σ_{k=1..n} ak·bk
‖X‖ = √( Σ_{k=1..n} xk² )
Cosine of X → Y
X = (x1, ..., xn), where xk is 1 if transaction tk contains X and 0 otherwise (and similarly for Y).
Example: with the five transactions above, X = {Bread, Milk} and Y = {Diaper} give vector X = (1, 0, 0, 1, 1) and vector Y = (0, 1, 1, 1, 1).
Cosine
A rule is not considered interesting if its cosine is below 0.66.
Null-invariant property: transactions containing neither item X nor item Y have no influence.
X and Y are the most related (value 1) when each transaction contains either both X and Y or neither X nor Y.
cosine(X → Y) = P(X, Y) / √(P(X) · P(Y)) = σ(X, Y) / √(σ(X) · σ(Y))
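Reusing the transactions list and the sigma helper from the support/confidence sketch earlier, the three measures can be computed for a rule X → Y (illustrative function name):

from math import sqrt

def measures(X, Y):
    # Added value, lift, and cosine of the rule X -> Y.
    n = len(transactions)
    p_xy = sigma(X | Y) / n
    p_x, p_y = sigma(X) / n, sigma(Y) / n
    conf = p_xy / p_x
    av = conf - p_y                   # AV(X -> Y) = conf(X -> Y) - P(Y)
    lift = p_xy / (p_x * p_y)         # lift(X -> Y) = P(X,Y) / (P(X) * P(Y))
    cosine = p_xy / sqrt(p_x * p_y)   # cosine(X -> Y) = P(X,Y) / sqrt(P(X) * P(Y))
    return av, lift, cosine

print(measures({"Milk", "Diaper"}, {"Beer"}))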
Interesting rules
5000 transactions; the same three cases as before:
(1) cosine(X → Y) = 0.63, lift(X → Y) = 2
(2) cosine(X → Y) = 0.45, lift(X → Y) = 1
(3) cosine(X → Y) = 0.98, lift(X → Y) = 1
Lift and cosine now separate cases that support and confidence could not.
Tools
Commercial:
IBM: Intelligent Miner.
SPSS: Clementine.
Open source:
Weka
RAPIDMINER
KNIME