
Alternative Clusterings: Current Progress and Open Challenges

James Bailey

Department of Computer Science and Software Engineering

The University of Melbourne, Australia

Cluster analysis: group similar
objects into clusters

No single solution

Cluster by pose or individual ?

=> Equally important, different views or
hypotheses regarding the data
Multiple explanations of the data
user doesn't initially know what they want, needs
different viewpoints of users
may be aiming to verify that multiple explanations do
not exist (hypothesis verification, or for benchmarking
clustering algorithms)
Contrast with consensus clustering
Every clustering should be accompanied by at least
one alternative clustering !?
Alternative Clustering: Is it new ?
From one perspective, alternative clustering is not so new
Generation of clusterings often goes like this:
Generate and assess a clustering with 2 clusters
Generate and assess a clustering with 3 clusters
...
Generate and assess a clustering with k clusters
We now have k-1 alternative clusterings.
But some of them may be very similar
Alternative Clustering Algorithms

Growing number of approaches

ADFT, CAMI, COALA, Condens, Convolutional EM,
Decorrelated k-means, MAXIMUS, Meta clustering,
Multiview orthogonal clustering, NACI, Non redundant

Papers have appeared at

KDD'10, ICML'10, SDM'10, KDD'09,
How do these approaches differ ?

Task formulation:
Number of alternatives to generate
Sequential or Simultaneous Generation
Mathematical basis
Linear algebra
Information theory
Other objective functions
Sequential Alternative
Clustering Generation
Task: Given input clusterings {C1, ..., Cn}, generate an
alternative clustering C, such that C is of high quality
and C is different from {C1, ..., Cn}
Important special case: n=1

[Figure: existing clusterings C1, C2 --generate--> alternative clustering C]

Simultaneous Alternative
Clustering Generation
Task: Simultaneously generate n clusterings
{C1, ..., Cn}, such that each Ci is of high quality
and each pair (Ci, Cj) is different from one another
Important special case: n=2

[Figure: simultaneous generation of the clusterings, ending in C2]

Sequential vs. Simultaneous
Sequential (greedy)
For i = 2 to n
{generate the optimal alternative clustering with
respect to the previous i-1 clusterings}
Locally optimal at each step
Simultaneous (non-greedy)
In parallel, generate optimal set of n clusterings
Globally optimal clustering collection
but might miss some strong clusterings which would
be generated by a sequential technique
More difficult optimisation problem
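
The two strategies above can be sketched by brute force on a toy dataset. Everything in this sketch is an assumption made for illustration: the eight points in `POINTS`, the quality score (fraction of variance explained), the dissimilarity score (1 minus the Rand index), the weight `lam`, and the rule of scoring a candidate by its quality plus `lam` times its smallest dissimilarity to the clusterings already found. It is not the procedure of any algorithm named in these slides, and exhaustive search is only feasible for a handful of points.

```python
from itertools import combinations, product

# Hypothetical toy data: two points near each corner of a 3 x 2 rectangle.
POINTS = [(0.0, 0.0), (0.2, 0.0), (0.0, 2.0), (0.2, 2.0),
          (3.0, 0.0), (3.2, 0.0), (3.0, 2.0), (3.2, 2.0)]

def sum_sq(points):
    """Sum of squared distances from the points to their centroid."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in points)

def quality(labels):
    """Assumed quality score: fraction of total variance explained (0 to 1)."""
    within = sum(sum_sq([p for p, l in zip(POINTS, labels) if l == k])
                 for k in set(labels))
    return 1.0 - within / sum_sq(POINTS)

def dissimilarity(a, b):
    """1 - Rand index: share of point pairs the two clusterings disagree on."""
    pairs = list(combinations(range(len(a)), 2))
    disagree = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return disagree / len(pairs)

# All 2-cluster labelings; the first point is fixed to label 0 so that
# a labeling and its mirror image are not both listed.
CANDIDATES = [(0,) + rest
              for rest in product((0, 1), repeat=len(POINTS) - 1) if 1 in rest]

def sequential(n, lam=1.0):
    """Greedy: each new clustering is scored against those already found."""
    found = [max(CANDIDATES, key=quality)]
    while len(found) < n:
        found.append(max(
            CANDIDATES,
            key=lambda c: quality(c)
            + lam * min(dissimilarity(c, f) for f in found)))
    return found

def simultaneous_pair(lam=1.0):
    """Non-greedy, n = 2: every pair of candidates is scored jointly."""
    return max(combinations(CANDIDATES, 2),
               key=lambda ab: quality(ab[0]) + quality(ab[1])
               + lam * dissimilarity(ab[0], ab[1]))

if __name__ == "__main__":
    print(sequential(2))
    print(simultaneous_pair())
```

On this particular toy data both strategies return the same pair, the left/right split and the top/bottom split. The difference is in the search: the sequential version scans the candidate list once per step, while the simultaneous version scores every pair of candidates, which is one way to see why it is the harder optimisation problem.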
Style of Algorithm
Projection based
Project the data into an orthogonal subspace and then
Appealing linear algebra formulation
Relatively efficient
Orthogonality may be too strict
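
A minimal sketch of the projection idea, assuming two existing clusters in two dimensions: remove the direction that separates the existing cluster centroids, then re-cluster what is left. The toy `POINTS`, the given clustering `EXISTING`, and all function names are hypothetical, and the brute-force re-clustering step stands in for whatever clustering algorithm would be used in practice. This is loosely in the spirit of orthogonalization-style methods, not a reproduction of any specific one.

```python
from itertools import product
from math import hypot

# Hypothetical toy data: two points near each corner of a 3 x 2 rectangle.
POINTS = [(0.0, 0.0), (0.2, 0.0), (0.0, 2.0), (0.2, 2.0),
          (3.0, 0.0), (3.2, 0.0), (3.0, 2.0), (3.2, 2.0)]
EXISTING = (0, 0, 0, 0, 1, 1, 1, 1)  # assumed given clustering: left vs right

def centroid(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def scatter(points):
    """Sum of squared distances from the points to their centroid."""
    cx, cy = centroid(points)
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)

def project_out(points, labels):
    """Drop the component along the line joining the two cluster centroids."""
    c0 = centroid([p for p, l in zip(points, labels) if l == 0])
    c1 = centroid([p for p, l in zip(points, labels) if l == 1])
    ux, uy = c1[0] - c0[0], c1[1] - c0[1]
    norm = hypot(ux, uy)
    ux, uy = ux / norm, uy / norm
    projected = []
    for x, y in points:
        along = x * ux + y * uy  # coordinate along the old separating direction
        projected.append((x - along * ux, y - along * uy))
    return projected

def best_two_clustering(points):
    """Brute force: the 2-clustering with the least within-cluster scatter."""
    candidates = [(0,) + rest
                  for rest in product((0, 1), repeat=len(points) - 1)
                  if 1 in rest]

    def within(labels):
        return sum(scatter([p for p, l in zip(points, labels) if l == k])
                   for k in (0, 1))

    return min(candidates, key=within)

def alternative_by_projection(points, existing):
    return best_two_clustering(project_out(points, existing))

if __name__ == "__main__":
    print(alternative_by_projection(POINTS, EXISTING))
```

Here the existing clusters differ only along the horizontal axis, so the projection keeps only the vertical coordinate and the re-clustering finds the top/bottom split. When the interesting alternative is not orthogonal to the existing one, the same step discards useful structure, which is the sense in which orthogonality may be too strict.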
More complex objective function
Generate the alternative clustering, trading off
dissimilarity and quality in the objective function
More flexible
May require parameter choices
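
One common way to write such a trade-off, given here only as an illustrative form: the symbols Q (a quality score), D (a dissimilarity score between two clusterings) and the weight λ are assumptions, not notation from these slides, and individual algorithms define and combine these terms differently.

```latex
C^{*} \;=\; \arg\max_{C} \; \Big[\, Q(C) \;+\; \lambda \, \min_{1 \le i \le n} D(C, C_i) \,\Big],
\qquad \lambda \ge 0
```

With λ = 0 this reduces to ordinary clustering by quality alone. Larger λ pushes C away from the given clusterings C1, ..., Cn at the expense of quality, and λ is an example of the kind of parameter choice such methods may require.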
Simple Example

Most existing techniques seem to work well (a canonical example)
Circle of Gaussians

- Techniques which trade off dissimilarity and quality are
more likely to produce the second clustering
- Orthogonal projection doesn't work so well here
Other issues
Evaluation: Measuring quality/dissimilarity of alternatives
Clustering setting:
Desired shape of clusters: spherical versus elongated, linear
versus non linear separation
low versus high dimensionality data
continuous versus discrete features
soft versus hard clusters
EM versus K-means versus hierarchical versus constraint
Number of clusters desired in each clustering
Alternative Clustering
Measuring dissimilarity: Mathematical measures - Rand
index, Jaccard index, normalised mutual information
Measuring quality:
Internal validation measures: Dunn index, Davies-Bouldin
index, silhouette width
External validation: Synthetic examples
Combine dissimilarity and quality into a single number, or
present separately ?
Are these numbers useful ?
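
For concreteness, here is a small pure-Python sketch of some of the measures named above: the Rand index, the Jaccard index and normalised mutual information for comparing two clusterings, and the silhouette width as an internal quality measure. The function names are hypothetical. NMI has several normalisations in use; the geometric-mean form below is one assumed choice. The silhouette code assumes at least two clusters and Euclidean distance.

```python
from collections import Counter
from itertools import combinations
from math import dist, log, sqrt

def pair_counts(a, b):
    """Count point pairs by whether each clustering puts them together."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither

def rand_index(a, b):
    both, only_a, only_b, neither = pair_counts(a, b)
    return (both + neither) / (both + only_a + only_b + neither)

def jaccard_index(a, b):
    both, only_a, only_b, _ = pair_counts(a, b)
    denom = both + only_a + only_b
    return both / denom if denom else 1.0

def nmi(a, b):
    """Mutual information divided by the geometric mean of the entropies."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    h_a = -sum(c / n * log(c / n) for c in ca.values())
    h_b = -sum(c / n * log(c / n) for c in cb.values())
    if h_a == 0 or h_b == 0:
        return 1.0 if h_a == h_b else 0.0
    mi = sum(c / n * log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items())
    return mi / sqrt(h_a * h_b)

def silhouette_width(points, labels):
    """Mean silhouette over all points; singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # Mean distance to the other members of the point's own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (4, 0), (4, 1)]
    c1, c2 = [0, 0, 1, 1], [0, 1, 0, 1]
    print(rand_index(c1, c2), jaccard_index(c1, c2), nmi(c1, c2))
    print(silhouette_width(pts, c1), silhouette_width(pts, c2))
```

In the usage example the two clusterings disagree as much as two balanced 2-clusterings of four points can (Jaccard and NMI are both 0), while the silhouette width is clearly higher for the first clustering than for the second. Reporting the two kinds of number separately keeps that trade-off visible.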
Where are we ?
Good existing algorithms for generation of one or two
Sequential generation
Simultaneous generation
Not yet deployed on very large datasets
Validated using assorted benchmark datasets and
internal metrics
Open Issues
What's the killer application ?
Deployment of alternative clusterings
Need convincing use cases where consensus clustering is
Objective function and performance measures
How many alternatives is enough ?
How many clusters should be in an alternative
clustering ?
the same number as the original clustering ?
Open Issues cont.
How to find alternative subspace clusters (rather than
clusterings) ?
Visualisation of alternative clusterings
More focused alternatives
"Give me another clustering which is similar in
these respects and different in these other
respects to the previous clustering"
Moving Forward
Central repository of code and canonical examples
(synthetic and real)
Make alternative clustering algorithms accessible
Identify cases in the literature of missing alternative
E. Bae, J. Bailey and G. Dong. A Clustering Comparison Measure Using Density Profiles and its
Application to the Discovery of Alternate Clusterings. To appear in Data Mining and Knowledge
D. Niu, J. G. Dy, and M. I. Jordan. Multiple non-redundant spectral clustering views. Proc. of
ICML 2010.
X. H. Dang and J. Bailey. A Hierarchical Information Theoretic Technique for the Discovery of Non
Linear Alternative Clusterings. Proc. of KDD 2010.
X. H. Dang and J. Bailey. Generation of alternative clusterings using the CAMI approach. Proc. of
SDM 2010.
Z. Qi and I. Davidson. A principled and flexible framework for finding alternative clusterings. Proc.
of KDD 2009.
P. Jain, R. Meka, and I. S. Dhillon. Simultaneous unsupervised learning of disparate clusterings.
Proc. of SDM 2008.
I. Davidson and Z. Qi. Finding alternative clusterings using constraints. Proc. of ICDM 2008.
Y. Cui, X. Z. Fern, and J. G. Dy. Non-redundant multi-view clustering via orthogonalization. Proc.
of ICDM 2007.
E. Bae and J. Bailey. COALA: A novel approach for the extraction of an alternate clustering of high
quality and high dissimilarity. Proc. of ICDM 2006.
R. Caruana, M. Elhawary, N. Nguyen, and C. Smith. Meta clustering. Proc. of ICDM 2006.
D. Gondek and T. Hofmann. Non-redundant clustering with conditional ensembles. Proc. of KDD
D. Gondek and T. Hofmann. Non-redundant data clustering. Proc. of ICDM 2004.