Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Outline
Introduction
K-means clustering
Hierarchical clustering: COBWEB
Clustering
Unsupervised learning:
Finds natural grouping of
instances given un-labeled data
Clustering Methods
Many different method and algorithms:
For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up
Clusters:
exclusive vs. overlapping
Simple 2-D representation
Venn diagram
Non-overlapping
Overlapping
d
j
k
g
c
h
i
j
k
g
c
h
i
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across
clusters
k1
Y
Pick 3
initial
cluster
centers
(randomly)
k2
k3
X
k1
Y
Assign
each point
to the closest
cluster
center
k2
k3
X
k1
k1
Y
Move
each cluster
center
to the mean
of each cluster
k2
k3
k2
k3
X
Reassign
points
Y
closest to a
different new
cluster center
Q: Which
points are
reassigned?
k1
k3
k2
k1
Y
A: three
points with
animation
k3
k2
k1
Y
re-compute
cluster
means
k3
k2
move cluster
centers to
cluster means
k2
k3
Discussion, 1
What can be the problems with
K-means clustering?
Discussion, 2
Example:
initial cluster
centers
instances
Discussion, 3
A: To increase chance of finding
global optimum: restart with
different random seeds.
K-means clustering
summary
Advantages
Disadvantages
Simple,
understandable
items automatically
assigned to clusters
K-means variations
K-medoids instead of mean, use medians
of each cluster
Mean of 1, 3, 5, 7, 9 is5
Mean of 1, 3, 5, 7, 1009 is205
Median of 1, 3, 5, 7, 1009 is5
Median advantage: not affected by extreme
values
*Hierarchical clustering
Bottom up
Top down
g a c i e d k b j f h
*Incremental clustering
Start:
Then:
Outlook
Temp.
Humidity
Windy
Sunny
Hot
High
False
Sunny
Hot
High
True
Overcast
Hot
High
False
Rainy
Mild
High
False
Rainy
Cool
Normal
False
Rainy
Cool
Normal
True
Overcast
Cool
Normal
True
Sunny
Mild
High
False
Sunny
Cool
Normal
False
Rainy
Mild
Normal
False
Sunny
Mild
Normal
True
Overcast
Mild
High
True
Overcast
Hot
Normal
False
Rainy
Mild
High
True
Outlook
Temp.
Humidity
Windy
Sunny
Hot
High
False
Sunny
Hot
High
True
Overcast
Hot
High
False
Rainy
Mild
High
False
Rainy
Cool
Normal
False
Rainy
Cool
Normal
True
Overcast
Cool
Normal
True
Sunny
Mild
High
False
Sunny
Cool
Normal
False
Rainy
Mild
Normal
False
Sunny
Mild
Normal
True
Overcast
Mild
High
True
Overcast
Hot
Normal
False
Rainy
Mild
High
True
5
Merge best
host and
runner-up
*Final hierarchy
ID
Outlook
Temp.
Humidity
Windy
Sunny
Hot
High
False
Sunny
Hot
High
True
Overcast
Hot
High
False
Rainy
Mild
High
False
(subset)
*Category utility
Pr[Cl ]
maximum
*Overfitting-avoidance
heuristic
If every instance gets put into a different category
the numerator becomes (maximal):
n Pr[a v ]2
i ij
i j
Maximum
Maximumvalue
valueof
ofCU
CU
Discussion
pre-processing step
Examples of Clustering
Applications
Marketing: discover customer groups and use
them for targeted marketing and re-organization
Astronomy: find groups of similar stars and
galaxies
Earth-quake studies: Observed earth quake
epicenters should be clustered along continent
faults
Genomics: finding groups of gene with similar
expressions
Clustering Summary
unsupervised
many approaches
K-means simple, sometimes useful
K-medoids is less sensitive to outliers
Evaluation is a problem