
Alternative Clusterings: Current Progress and Open Challenges

James Bailey

Department of Computer Science and Software Engineering


The University of Melbourne, Australia

Introduction
Cluster analysis: group similar objects into clusters

No single solution

Cluster by pose or by individual?

=> Both are equally important: different views or hypotheses regarding the data
Motivations
Multiple explanations of the data
The user doesn't initially know what they want and needs options
Different users have different viewpoints
The user may be aiming to verify that multiple explanations do not exist (hypothesis verification, or benchmarking of clustering algorithms)
Contrast with consensus clustering
Every clustering should be accompanied by at least one alternative clustering!?
Alternative Clustering: Is it new?
From one perspective, alternative clustering is not so new
Generation of clusterings often proceeds as follows:
Generate and assess a clustering with 2 clusters
Generate and assess a clustering with 3 clusters
...
Generate and assess a clustering with k clusters
We now have k-1 alternative clusterings.
But some of them may be very similar
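
The k = 2, 3, ..., k procedure above can be sketched as follows. This is only an illustration under stated assumptions: the synthetic four-blob data, the plain Lloyd's k-means, the value K = 5, and the use of the Rand index as the similarity score are all choices made for this example, not part of the talk.

```python
import itertools
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns one label per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # k randomly chosen data points as initial centres
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: (x - centres[c][0]) ** 2 + (y - centres[c][1]) ** 2)
            for x, y in points
        ]
        # Update step: move each non-empty centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels

def rand_index(a, b):
    """Fraction of point pairs on which two clusterings agree."""
    pairs = list(itertools.combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Hypothetical data: four Gaussian blobs at the corners of a square.
rng = random.Random(1)
blob_centres = [(0, 0), (5, 0), (0, 5), (5, 5)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in blob_centres for _ in range(15)]

K = 5  # assumed upper limit on the number of clusters
clusterings = {k: kmeans(points, k) for k in range(2, K + 1)}

# Pairwise similarity: values close to 1 flag near-duplicate "alternatives".
similarity = {
    (i, j): rand_index(clusterings[i], clusterings[j])
    for i, j in itertools.combinations(clusterings, 2)
}
for (i, j), score in similarity.items():
    print(f"k={i} vs k={j}: Rand index = {score:.3f}")
```

Pairs with a Rand index near 1 carry little new information, which is the sense in which some of the k-1 clusterings "may be very similar".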
Alternative Clustering Algorithms

Growing number of approaches

ADFT, CAMI, COALA, Condens, Convolutional EM,
Decorrelated k-means, MAXIMUS, Meta clustering,
Multiview orthogonal clustering, NACI, Non-redundant
clustering, ...

Papers have appeared at

KDD'10, ICML'10, SDM'10, KDD'09, SDM'09, ICDM'08,
ICDM'07, ICDM'06, KDD'05, ICDM'04, ..., DMKD, KAIS, ...
How do these approaches differ?

Task formulation:
Number of alternatives to generate
Sequential or Simultaneous Generation
Mathematical basis
Linear algebra
Information theory
Other objective functions
Sequential Alternative Clustering Generation
Task: Given input clusterings {C1, ..., Cn}, generate an
alternative clustering C, such that C is of high quality
and C is different from {C1, ..., Cn}
Important special case: n=1

[Diagram: existing clusterings C1, C2, ..., Cn --generate--> alternative clustering C]
Simultaneous Alternative Clustering Generation
Task: Simultaneously generate n clusterings
{C1, ..., Cn}, such that each Ci is of high quality
and each pair (Ci, Cj) is different from one
another
Important special case: n=2
[Diagram: generate ----------> alternatives C1, C2, ..., Cn]
Sequential vs. Simultaneous
Sequential (greedy)
Semi-supervised
For i=2 to n
{generate the optimal alternative clustering with
respect to the previous i-1 clusterings}
Locally optimal at each step
Simultaneous (non-greedy)
Unsupervised
In parallel, generate optimal set of n clusterings
Globally optimal clustering collection
But might miss some strong clusterings that a
sequential technique would generate
More difficult optimisation problem
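
The contrast between the two formulations can be made concrete on a toy problem small enough to solve by brute force. Everything in this sketch is an assumption made for illustration: the eight hypothetical points, the quality score (fraction of variance explained), the dissimilarity score (1 - Rand index), the additive objective, and the weight LAM. It is not the objective of any particular published algorithm.

```python
import functools
import itertools

# Hypothetical toy data: two points near each corner of a 3 x 2 rectangle.
POINTS = [(0.0, 0.0), (0.2, 0.1), (0.0, 2.0), (0.1, 2.2),
          (3.0, 0.0), (3.1, 0.2), (3.0, 2.0), (3.2, 2.1)]
LAM = 1.0  # assumed weight of dissimilarity relative to quality

def sse(points):
    """Sum of squared distances from the points to their mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    return sum((x - mx) ** 2 + (y - my) ** 2 for x, y in points)

TOTAL = sse(POINTS)

@functools.lru_cache(maxsize=None)
def quality(labels):
    """Fraction of the total variance explained by the clustering."""
    within = sum(sse([p for p, lab in zip(POINTS, labels) if lab == c])
                 for c in set(labels))
    return 1.0 - within / TOTAL

def dissimilarity(a, b):
    """1 - Rand index between two labelings."""
    pairs = list(itertools.combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return 1.0 - agree / len(pairs)

# Every 2-cluster labeling; the first label is fixed to 0 to skip mirror images.
CANDIDATES = [(0,) + rest
              for rest in itertools.product((0, 1), repeat=len(POINTS) - 1)
              if any(rest)]

def sequential(n):
    """Greedy: each new clustering is optimal given the ones already found."""
    found = [max(CANDIDATES, key=quality)]
    for _ in range(2, n + 1):
        def score(c):
            return quality(c) + LAM * min(dissimilarity(c, prev) for prev in found)
        found.append(max(CANDIDATES, key=score))
    return found

def joint_score(a, b):
    return quality(a) + quality(b) + LAM * dissimilarity(a, b)

def simultaneous_pair():
    """Non-greedy: search over all pairs of clusterings at once (n = 2)."""
    return max(itertools.product(CANDIDATES, repeat=2),
               key=lambda ab: joint_score(*ab))

c1, c2 = sequential(2)
pair = simultaneous_pair()
print("sequential  :", c1, c2, round(joint_score(c1, c2), 3))
print("simultaneous:", pair[0], pair[1], round(joint_score(*pair), 3))
```

The simultaneous search scores at least as well on the joint objective because the sequential pair is one of the pairs it considers, but it examines the square of the number of candidates, which illustrates why it is the harder optimisation problem.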
Style of Algorithm
Projection based
Project the data into an orthogonal subspace and then
re-cluster
Appealing linear algebra formulation
Relatively efficient
Orthogonality may be too strict
More complex objective function
Generate the alternative clustering, trading off
dissimilarity and quality in the objective function
More flexible
May require parameter choices
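
A minimal sketch of the projection-based style, assuming 2-D data and an existing clustering with exactly two clusters. The direction joining the two cluster means is removed, and the data are re-clustered on what remains. The toy points, the helper names, and the 1-D two-means step are assumptions for illustration; this is a simplified variant, not the exact procedure of any method named in this talk.

```python
import math

def mean(values):
    return sum(values) / len(values)

def project_out(points, labels):
    """Coordinates of 2-D points in the subspace orthogonal to the
    direction joining the means of clusters 0 and 1."""
    m0 = (mean([x for (x, _), lab in zip(points, labels) if lab == 0]),
          mean([y for (_, y), lab in zip(points, labels) if lab == 0]))
    m1 = (mean([x for (x, _), lab in zip(points, labels) if lab == 1]),
          mean([y for (_, y), lab in zip(points, labels) if lab == 1]))
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    norm = math.hypot(dx, dy)
    ux, uy = dx / norm, dy / norm
    # In 2-D the orthogonal complement of u = (ux, uy) is spanned by v = (-uy, ux).
    return [-uy * x + ux * y for x, y in points]

def two_means_1d(values, iters=20):
    """Two-cluster k-means on scalars, started from the minimum and maximum."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if group0:
            lo = mean(group0)
        if group1:
            hi = mean(group1)
    return labels

# Hypothetical data: two points near each corner of a rectangle.
points = [(0.0, 0.0), (0.2, 0.1), (0.0, 2.0), (0.1, 2.2),
          (3.0, 0.0), (3.1, 0.2), (3.0, 2.0), (3.2, 2.1)]
existing = [0, 0, 0, 0, 1, 1, 1, 1]      # given clustering: left vs right

residual = project_out(points, existing)  # what is left once left/right is removed
alternative = two_means_1d(residual)      # re-cluster in the orthogonal subspace
print("existing   :", existing)
print("alternative:", alternative)
```

On this data the alternative separates bottom from top. Because the projection removes the old direction entirely, the new clustering cannot reuse any of it, which is one way to read the remark that orthogonality may be too strict.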
Simple Example

Most existing techniques seem to work well (a
canonical example)
Circle of Gaussians

- Techniques which trade off dissimilarity and quality
are more likely to produce the second clustering
- Orthogonal projection doesn't work so well here
Other issues
Evaluation: Measuring quality/dissimilarity of alternatives
Clustering setting:
Desired shape of clusters: spherical versus elongated, linear
versus non-linear separation
Low versus high dimensionality data
Continuous versus discrete features
Soft versus hard clusters
EM versus k-means versus hierarchical versus constraint-based
Number of clusters desired in each clustering
Alternative Clustering Evaluation
Measuring dissimilarity: Mathematical measures - Rand
index, Jaccard index, normalised mutual information
Measuring quality:
Internal validation measures: Dunn index, Davies-Bouldin
index, silhouette width
External validation: Synthetic examples
Combine dissimilarity and quality into a single number, or
present separately?
Are these numbers useful?
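
For reference, the three dissimilarity measures named above can be computed directly from two label vectors. The sketch below uses the standard pair-counting definitions of the Rand and Jaccard indices. For normalised mutual information it assumes the geometric-mean normalisation I(A;B) / sqrt(H(A) H(B)), which is one of several conventions in use, and it returns 0 when either clustering has a single cluster. The two example labelings are hypothetical.

```python
import itertools
import math
from collections import Counter

def pair_counts(a, b):
    """Count point pairs by whether they share a cluster in a, in b, or in both."""
    both = only_a = only_b = neither = 0
    for i, j in itertools.combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither

def rand_index(a, b):
    both, only_a, only_b, neither = pair_counts(a, b)
    return (both + neither) / (both + only_a + only_b + neither)

def jaccard_index(a, b):
    both, only_a, only_b, _ = pair_counts(a, b)
    denom = both + only_a + only_b
    return both / denom if denom else 1.0  # assumed convention when no pair is co-clustered

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Normalised mutual information, geometric-mean normalisation (assumed)."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mi = sum((c / n) * math.log((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
             for (x, y), c in joint.items())
    denom = math.sqrt(entropy(a) * entropy(b))
    return mi / denom if denom else 0.0

# Hypothetical example: two clusterings of eight objects that cut across each other.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
print("Rand   :", rand_index(a, b))
print("Jaccard:", jaccard_index(a, b))
print("NMI    :", nmi(a, b))
```

All three are similarity scores, so a good alternative clustering should score low against the original. Note that they need not agree on how low: here the Rand index stays well above zero because it also counts pairs separated in both clusterings, while NMI is zero.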
Where are we?
Good existing algorithms for generation of one or two alternatives
Sequential generation
Simultaneous generation
Not yet deployed on very large datasets
Validated using assorted benchmark datasets and
internal metrics
Open Issues
What's the killer application?
Deployment of alternative clusterings
Need convincing use cases where consensus clustering is limited
Objective function and performance measures
How many alternatives are enough?
How many clusters should be in an alternative clustering?
The same number as the original clustering?
Open Issues cont.
How to find alternative subspace clusters (rather than clusterings)?
Visualisation of alternative clusterings
More focused alternatives
"Give me another clustering which is similar in these respects and different in these other respects to the previous clustering"
Moving Forward
Central repository of code and canonical examples
(synthetic and real)
Make alternative clustering algorithms accessible
Identify cases in the literature of missing alternative clusterings
Bibliography
E. Bae, J. Bailey, and G. Dong. A Clustering Comparison Measure Using Density Profiles and its Application to the Discovery of Alternate Clusterings. To appear in Data Mining and Knowledge Discovery.
D. Niu, J. G. Dy, and M. I. Jordan. Multiple non-redundant spectral clustering views. Proc. of ICML 2010.
X. H. Dang and J. Bailey. A Hierarchical Information Theoretic Technique for the Discovery of Non Linear Alternative Clusterings. Proc. of KDD 2010.
X. H. Dang and J. Bailey. Generation of alternative clusterings using the CAMI approach. Proc. of SDM 2010.
Z. Qi and I. Davidson. A principled and flexible framework for finding alternative clusterings. Proc. of KDD 2009.
P. Jain, R. Meka, and I. S. Dhillon. Simultaneous unsupervised learning of disparate clusterings. Proc. of SDM 2008.
I. Davidson and Z. Qi. Finding alternative clusterings using constraints. Proc. of ICDM 2008.
Y. Cui, X. Z. Fern, and J. G. Dy. Non-redundant multi-view clustering via orthogonalization. Proc. of ICDM 2007.
E. Bae and J. Bailey. COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. Proc. of ICDM 2006.
R. Caruana, M. Elhawary, N. Nguyen, and C. Smith. Meta clustering. Proc. of ICDM 2006.
D. Gondek and T. Hofmann. Non-redundant clustering with conditional ensembles. Proc. of KDD 2005.
D. Gondek and T. Hofmann. Non-redundant data clustering. Proc. of ICDM 2004.