Clustering: Overview and K-means algorithm
• Given set of objects and measure of similarity between them, group similar objects together
Issues
• What attributes represent items for clustering purposes?
• What is measure of similarity between items?
  • General objects and matrix of pairwise similarities
  • Objects with specific properties that allow other specifications of measure
    – Most common: Objects are d-dimensional vectors
      » Euclidean distance
      » cosine similarity
• What is measure of similarity between clusters?
• Clustering methods? - MANY!

Issues continued
• Cluster goals?
  – Number of clusters?
  – flat or hierarchical clustering?
  – cohesiveness of clusters?
• How evaluate cluster results?
  – relates to measure of closeness between clusters
• Efficiency of clustering algorithms
  – large data sets => external storage
• Maintain clusters in dynamic setting?
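The two vector measures named under Issues (Euclidean distance and cosine similarity) can be sketched as follows. This is a minimal illustration, assuming objects are plain Python lists of floats; the function names `euclidean_distance` and `cosine_similarity` are illustrative and do not come from the slides.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two d-dimensional vectors (smaller = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero d-dimensional vectors (larger = more similar)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

u = [3.0, 4.0]
v = [6.0, 8.0]  # same direction as u, twice the length

dist_uv = euclidean_distance(u, v)
cos_uv = cosine_similarity(u, v)
print(dist_uv, cos_uv)
```

The pair `u`, `v` shows why the choice matters: the vectors point in the same direction, so cosine similarity treats them as maximally similar, while Euclidean distance still separates them by their difference in length.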
Vector model: K-means algorithm
• Well known, well used

K-means overview
• Choose k points among set to cluster
  – Call them k centroids
An Example
[Figures not recoverable; slide captions in order:]
• start: choose centroids and cluster
• recompute centroids
• re-cluster around new centroids
• 2nd recompute centroids and re-cluster
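The steps captioned in the example (choose centroids and cluster, recompute centroids, re-cluster, repeat) can be sketched in a few lines. This is a minimal sketch, not the lecture's own code. It assumes points are tuples of floats and uses squared Euclidean distance; the stopping rule "no change in centroids", the choice to leave an empty cluster's centroid where it was, and all function and parameter names (`k_means`, `initial_centroids`, `seed`, and so on) are assumptions made for illustration.

```python
import random

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def assign(points, centroids):
    """For each point, return the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

def recompute(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids

def k_means(points, k, initial_centroids=None, max_iterations=100, seed=0):
    """Return (centroids, labels) after iterating until the centroids stop changing."""
    if initial_centroids is None:
        # Choose k points among the set to cluster; call them the k centroids.
        initial_centroids = random.Random(seed).sample(points, k)
    centroids = list(initial_centroids)
    labels = assign(points, centroids)                    # cluster
    for _ in range(max_iterations):
        new_centroids = recompute(points, labels, centroids)  # recompute centroids
        if new_centroids == centroids:                    # no change: stop
            break
        centroids = new_centroids
        labels = assign(points, centroids)                # re-cluster
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Passing `initial_centroids` explicitly makes a run reproducible, which is useful because the result of K-means depends on the starting centroids.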
K-means performance
• Can prove RSS (residual sum of squares) decreases or stays the same with each iteration, so the algorithm converges
• Can achieve local optimum (not necessarily the global one)
  – No change in centroids
• Running time depends on how demanding the stopping criteria are
• Works well in practice
  – speed
  – quality

Time Complexity of K-means
• Let tdist be the time to calculate the distance between two objects
• Each iteration time complexity: O(K*n*tdist)
  – n = number of objects
• Bound number of iterations by I, giving O(I*K*n*tdist)
• For m-dimensional vectors: O(I*K*n*m)
  – m large and centroids not sparse
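Both claims above (RSS does not increase from one iteration to the next, and the work is K*n distance computations per iteration) can be checked on a toy data set. The sketch below is an illustration under assumptions: RSS is taken as the sum of squared Euclidean distances from each point to its assigned centroid, the run stops when the centroids no longer change, and the name `k_means_trace` and the sample data are hypothetical.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means_trace(points, centroids, max_iterations=100):
    """Run K-means from the given centroids.

    Returns (rss_history, distance_evaluations): the RSS measured after each
    assignment step, and the total number of point-to-centroid distances computed.
    """
    rss_history = []
    distance_evaluations = 0
    for _ in range(max_iterations):
        # Assignment step: K distance computations for each of the n points.
        labels = []
        rss = 0.0
        for p in points:
            dists = [squared_distance(p, c) for c in centroids]
            distance_evaluations += len(centroids)
            best = min(range(len(centroids)), key=dists.__getitem__)
            labels.append(best)
            rss += dists[best]
        rss_history.append(rss)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # assumption: keep an empty cluster's centroid
        if new_centroids == centroids:     # stopping criterion: no change in centroids
            break
        centroids = new_centroids
    return rss_history, distance_evaluations

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
# Deliberately poor start: both initial centroids lie in the same group.
rss_history, evaluations = k_means_trace(points, [(0.0, 0.0), (0.0, 1.0)])
print(rss_history, evaluations)
```

On this data the run takes I = 3 iterations with K = 2 and n = 6, so the counter reports I*K*n = 36 distance computations, and each RSS value is no larger than the one before it.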
K-means weakness
• Wrong number of clusters
• Outliers and empty clusters
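The original figures for these weaknesses are not available, so the following is only a hedged one-dimensional illustration of the outlier and empty-cluster cases, with made-up numbers. It assumes a centroid is the arithmetic mean of its cluster and that each point is assigned to its nearest centroid.

```python
def mean(values):
    return sum(values) / len(values)

# Outlier: a single distant point drags the centroid away from the bulk of the cluster.
cluster = [1.0, 2.0, 3.0]
centroid_without_outlier = mean(cluster)
centroid_with_outlier = mean(cluster + [100.0])

# Empty cluster: a centroid that no point is closest to ends up with no members.
points = [1.0, 2.0, 3.0]
centroids = [2.0, 50.0]
labels = [
    min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
    for p in points
]
cluster_sizes = [labels.count(j) for j in range(len(centroids))]

print(centroid_without_outlier, centroid_with_outlier, cluster_sizes)
```

An implementation has to decide what to do with a cluster of size zero, since the mean of no points is undefined; common choices include keeping the old centroid or re-seeding it, but the slides do not say which is intended.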