
Clustering: Overview and K-means algorithm

(K-means illustrations thanks to 2006 student Martin Makowiecki)

Informal goal

•  Given a set of objects and a measure of similarity between them, group similar objects together
•  What do we mean by “similar”?
•  What is a good grouping?
•  Computation time / quality tradeoff

General types of clustering

•  “Soft” versus “hard” clustering
   –  Hard: partition the objects
      •  each object in exactly one partition
   –  Soft: assign a degree to which an object is in a cluster
      •  view as a probability or score
•  Flat versus hierarchical clustering
   –  hierarchical = clusters within clusters

Applications: Many

•  biology
•  astronomy
•  computer-aided design of circuits
•  information organization
•  marketing
•  …

Clustering in information search and analysis

•  Group information objects
   ⇒  discover topics
   ?  other groupings desirable
•  Clustering versus classifying
   –  classifying: have pre-determined classes with example members
   –  clustering:
      •  get groups of similar objects
      •  added problem of labeling clusters by topic
      •  e.g. common terms within a cluster of docs.

Example applications in search

•  Query evaluation: cluster pruning (§7.1.6)
   –  cluster all documents
   –  choose a representative for each cluster
   –  evaluate query w.r.t. cluster reps.
   –  evaluate query for docs in cluster(s) having most similar cluster rep.(s)
   (a sketch of this scheme follows below)
•  Results presentation: labeled clusters
   –  cluster only query results
   –  e.g. Yippy.com (metasearch)
   –  hard / soft? flat / hierarchical?
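As an illustration of the cluster-pruning scheme above, a minimal Python sketch; the function and parameter names and the sim() similarity function are assumptions for illustration, not from the slides:

```python
# Cluster pruning sketch: score the query against cluster representatives
# only, then score just the documents in the best-matching cluster(s).
# `sim` is any similarity function (e.g. cosine); higher = more similar.
def cluster_pruned_search(query, reps, clusters, sim, n_best_clusters=1):
    # Rank clusters by how similar their representative is to the query.
    ranked = sorted(range(len(reps)),
                    key=lambda i: sim(query, reps[i]), reverse=True)
    # Evaluate the query only against documents in the top cluster(s).
    candidates = [d for i in ranked[:n_best_clusters] for d in clusters[i]]
    return sorted(candidates, key=lambda d: sim(query, d), reverse=True)
```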

Issues

•  What attributes represent items for clustering purposes?
•  What is the measure of similarity between items?
   –  General objects and a matrix of pairwise similarities
   –  Objects with specific properties that allow other specifications of measure
      –  Most common: objects are d-dimensional vectors
         »  Euclidean distance
         »  cosine similarity
         (see the sketch below)
•  What is the measure of similarity between clusters?

Issues continued

•  Cluster goals?
   –  Number of clusters?
   –  flat or hierarchical clustering?
   –  cohesiveness of clusters?
•  How to evaluate cluster results?
   –  relates to measure of closeness between clusters
•  Efficiency of clustering algorithms
   –  large data sets => external storage
•  Maintain clusters in dynamic setting?
•  Clustering methods? - MANY!
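For the common d-dimensional-vector case just listed, a minimal Python sketch of the two measures; the function names are illustrative:

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two d-dimensional vectors (lower = more similar)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (higher = more similar)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)
```

Note the two measures point in opposite directions: Euclidean distance is a dissimilarity, cosine a similarity.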

Quality of clustering methods

•  In applications, the quality of a clustering depends on how well it solves the problem at hand
•  An algorithm uses a measure of quality that can be optimized, but that may or may not do a good job of capturing the application's needs
•  Underlying graph-theoretic problems usually NP-complete
   –  e.g. graph partitioning
•  Usually the algorithm is not finding an optimal clustering

General types of clustering

•  Constructive versus iterative improvement
   –  constructive: decide in what cluster each object belongs and don’t change it
      •  often faster
   –  iterative improvement: start with a clustering and move objects around to see if the clustering can be improved
      •  often slower but better

K-means overview

•  Well known, well used
•  Flat clustering
•  Number of clusters picked ahead of time
•  Iterative improvement
•  Uses notion of centroid
•  Typically uses Euclidean distance

Vector model: K-means algorithm

•  Choose k points among the set to cluster
   –  Call them the k centroids
•  For each point not selected, assign it to its closest centroid
   –  All assignments give the initial clustering
•  Until “happy” do:
   –  Recompute centroids of clusters
      •  New centroids may not be points of the original set
   –  Reassign all points to their closest centroid
      •  Updates clusters
   (a runnable sketch of this loop follows)
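Below is a minimal Python sketch of the loop just described, assuming points are numeric tuples and using squared Euclidean distance; the names, the iteration cap, and the "no centroid moved" stopping test are illustrative choices, not prescribed by the slides:

```python
import random

def euclidean_sq(x, y):
    """Squared Euclidean distance between two same-length tuples."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def kmeans(points, k, max_iters=100):
    # Step 1: choose k of the points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: euclidean_sq(p, centroids[i]))
            clusters[i].append(p)
        # Step 3: recompute each centroid as the mean of its cluster;
        # the new centroid need not be one of the original points.
        # (An empty cluster keeps its old centroid.)
        new_centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # "happy": no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters
```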

An Example (figure sequence)

•  start: choose centroids and cluster
•  recompute centroids
•  re-cluster around new centroids
•  2nd recompute centroids and re-cluster

•  3rd (final) recompute and re-cluster

Details for K-means

•  Need definition of centroid: for the ith cluster Ci containing objects x,
      ci = (1/|Ci|) ∑x∈Ci x
   –  notion of sum of objects?
•  Need definition of distance to (similarity to) centroid
•  Typically vector model with Euclidean distance
•  Minimizing the sum of squared distances of each point to its centroid = Residual Sum of Squares:
      RSS = ∑i=1..K ∑x∈Ci dist(ci, x)²
   (see the sketch below)
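A short Python sketch of the two definitions above in the vector model; the names are illustrative:

```python
def centroid(cluster):
    """ci = (1/|Ci|) * the coordinate-wise sum of the vectors x in Ci."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def rss(clusters, centroids):
    """RSS = sum over clusters i of sum over x in Ci of dist(ci, x)^2."""
    return sum(
        sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        for c, cluster in zip(centroids, clusters)
        for x in cluster
    )
```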

K-means performance

•  Can prove RSS decreases with each iteration, so the algorithm converges
•  Can achieve a local optimum
   –  no change in centroids
•  Running time depends on how demanding the stopping criteria are
•  Works well in practice
   –  speed
   –  quality

Time Complexity of K-means

•  Let tdist be the time to calculate the distance between two objects
•  Each iteration has time complexity O(K*n*tdist)
   –  n = number of objects
•  Bounding the number of iterations by I gives O(I*K*n*tdist)
•  For m-dimensional vectors: O(I*K*n*m)
   –  m large and centroids not sparse
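To make the bound concrete with purely illustrative numbers: clustering n = 10⁶ points in m = 100 dimensions into K = 100 clusters, capped at I = 50 iterations, costs on the order of I*K*n*m = 50 * 100 * 10⁶ * 100 = 5 × 10¹¹ coordinate operations per run.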

Space Complexity of K-means

•  Store points and centroids
   –  vector model: O((n + K)m)
•  External algorithm versus internal?
   –  store the K centroids in memory
   –  run through the points each iteration
   (a streaming sketch follows)

Choosing Initial Centroids

•  Bad initialization leads to poor results
   (figure: an optimal versus a non-optimal clustering, from different seeds)
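One way to realize the external variant suggested above, sketched in Python: keep only the K centroids in memory and stream the points from storage once per iteration, accumulating per-cluster sums and counts. The file format (one comma-separated point per line) and the names are assumptions for illustration.

```python
def kmeans_pass(path, centroids):
    """One k-means iteration over points stored on disk; returns new centroids."""
    k = len(centroids)
    m = len(centroids[0])
    sums = [[0.0] * m for _ in range(k)]   # per-cluster coordinate sums
    counts = [0] * k                       # per-cluster point counts
    with open(path) as f:
        for line in f:
            p = tuple(float(v) for v in line.split(","))
            # Assign the point to its closest centroid (squared Euclidean).
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            for j in range(m):
                sums[i][j] += p[j]
            counts[i] += 1
    # New centroid = mean of assigned points; keep the old one if empty.
    return [
        tuple(s / counts[i] for s in sums[i]) if counts[i] else centroids[i]
        for i in range(k)
    ]
```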

Choosing Initial Centroids (continued)

Many people have spent much time examining how to choose seeds:
•  Random
   –  fast and easy, but often poor results
•  Run random multiple times, take the best
   –  slower, and still no guarantee of results
   (see the sketch below)
•  Pre-conditioning
   –  remove outliers
•  Choose seeds algorithmically
   –  run hierarchical clustering on a sample of the points and use the resulting centroids
   –  works well on small samples and for few initial centroids

K-means weakness

•  Non-globular clusters (figure)
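A sketch of the "run random multiple times, take the best" strategy, reusing the kmeans() and rss() sketches from earlier; the number of restarts is an arbitrary illustrative default:

```python
def kmeans_best_of(points, k, restarts=10):
    """Run k-means from several random seeds; keep the lowest-RSS result."""
    best = None
    best_rss = float("inf")
    for _ in range(restarts):
        centroids, clusters = kmeans(points, k)
        r = rss(clusters, centroids)
        if r < best_rss:
            best, best_rss = (centroids, clusters), r
    return best
```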

K-means weakness

•  Wrong number of clusters (figure)
•  Outliers and empty clusters (figure)

Real cases tend to be harder

•  Different attributes of the feature vector have vastly different sizes
   –  e.g. the size of a star versus its color
•  Can weight different features
   –  how you weight them greatly affects the outcome
   (a standardization sketch follows)
•  Difficulties can be overcome
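One common remedy for mismatched feature scales, sketched below: standardize each coordinate to zero mean and unit variance, with optional per-feature weights applied afterwards. The weights themselves are a modeling choice and, as the slide warns, they strongly affect the outcome; the function name and defaults are illustrative.

```python
def standardize(points, weights=None):
    """Scale each coordinate to zero mean / unit variance, then weight it."""
    m = len(points[0])
    n = len(points)
    means = [sum(p[j] for p in points) / n for j in range(m)]
    # Population standard deviation; a constant feature (std 0) is left at 1.
    stds = [
        (sum((p[j] - means[j]) ** 2 for p in points) / n) ** 0.5 or 1.0
        for j in range(m)
    ]
    w = weights or [1.0] * m
    return [
        tuple(w[j] * (p[j] - means[j]) / stds[j] for j in range(m))
        for p in points
    ]
```

Clustering the output of standardize() instead of the raw vectors keeps a feature like stellar size from drowning out a feature like color simply because its units produce larger numbers.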
