
Advanced Review

Objective function-based
clustering
Lawrence O. Hall

Clustering is typically applied for data exploration when there are no or very
few labeled data available. The goal is to find groups or clusters of like data. The
clusters will be of interest to the person applying the algorithm. An objective
function-based clustering algorithm tries to minimize (or maximize) a function
such that the clusters that are obtained when the minimum/maximum is reached
are homogeneous. One needs to choose a good set of features and the appropriate
number of clusters to generate a good partition of the data into maximally
homogeneous groups. Objective functions for clustering are introduced. Clustering
algorithms generated from the given objective functions are shown, with a
number of examples of widely used approaches discussed. © 2012 Wiley Periodicals, Inc.
How to cite this article:
WIREs Data Mining Knowl Discov 2012, 2:326–339. doi: 10.1002/widm.1059
INTRODUCTION
Consider the case in which you are given an electronic repository of news articles. You want to
try to determine the future direction of a commodity
like oil, but do not want to sift through the 50,000 articles by hand. You would like to have them grouped
into categories and then you could browse the appropriate ones. You might use the count of words
appearing in the document as features. Having words
such as commodity or oil or wheat appear multiple
times in an article would be good clues as to what it
was concerned with.
Clustering can do the grouping into categories
for you. Objective function-based clustering is one
way of accomplishing the grouping. In this type of
clustering algorithm, there is a function (the objective
function) which you try to minimize or maximize. The
examples or objects to be partitioned into clusters are
described by a set of s features. To begin, we will think
of all features as being continuous numeric values
such that we can measure distances between them.
A challenge of using objective function-based
clustering lies in the fact that it is an optimization
problem (Refs 1, 2). As such, it is sensitive to the initialization
that is provided. This means that you can get different
final partitions of the data depending upon where you
start.

Correspondence to: hall@cse.usf.edu
Department of Computer Science and Engineering, University of South Florida, Tampa, FL, USA
DOI: 10.1002/widm.1059
In this article, we will cover a selection of objective function-based clustering algorithms. There are
many such algorithms and we will focus on the more
basic ones and build on them to illustrate more complex approaches. The document clustering example
above can be well solved by some, but not all, algorithms described here. Latent Dirichlet allocation (Ref 3)
is a more complex algorithm which can do the document clustering task well, but is not covered here
because it requires that more advanced concepts be
introduced in limited space. The algorithms here
produce clusters of data that are, ideally, homogeneous, or all from the same (unknown) class. We say
the result of clustering the data is a partition into a set
of k clusters. The partition may be hard, all examples
belong to only one cluster; or soft, the examples may
belong to multiple clusters; or probabilistic, in which
each example has a probability of belonging to each
cluster.

Some critical issues when clustering data are
how many clusters for which to search; a reasonable
initialization of the algorithm; how to define distances
between examples; computational complexity, or run
time; and how to tell when a partition of the data
is good. As with any kind of data analysis one must
be concerned with noise, for which robust clustering
approaches exist (Refs 4–6), which is not a focus here. They
include reformulations of the objective function to
incorporate different distance functions and different
integrations with robust statistics. The choice of the
number of clusters to search for can be domain driven,
but when data driven it is inter-related to the question
of determining that a partition is good. Cluster or
partition validity metrics are useful for determining
whether a partition into k = 2 or k = 3 or k = 4
clusters, for example, is best.

The most straightforward idea for evaluation of
a partition, as well as creating one, is to look at how
tight the clusters are and how well separated they are.
Oversimplifying, we prefer clusters which have small
standard deviations from their centroid (or mean) and
whose centroids are more distant from each other.

In the proceeding, the k-means algorithm will
first be introduced in detail as the baseline example of
objective function-based clustering algorithms. For a
broader look at how k-means and related algorithms
fit together in a common mathematical framework,
see Ref 7. Then a selected set of other clustering algorithms will be discussed in detail and related to
k-means. There will be a brief discussion of validity
metrics and distance measures.
k-MEANS CLUSTERING
One of the earliest objective function-based clustering
approaches was called k-means clustering (Refs 8–10). The k
refers to the number of clusters that are desired. The
idea behind the objective function for k-means is both
simple and elegant: minimize the within-cluster scatter
and maximize the between-cluster scatter. So, you try
to minimize the distances between examples assigned
to a cluster and maximize the distances between examples assigned to different clusters.
The equation for the k-means objective function is

J_1(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij} \, D(x_j, v_i),   (1)

where

x_j ∈ X represents one of n feature vectors of dimension s;

v_i ∈ V is an s-dimensional cluster center, representing the average value of the examples assigned to a cluster;

U is the k × n matrix of example assignments to cluster centers, and u_ij ∈ U with u_ij ∈ {0, 1} indicates whether the jth example belongs to the ith cluster;

k is the number of clusters;
1. Initialize the k cluster centers V^0, choose ε, and set T = 1.
2. For all x_j do
   u_ij = 1 if i = argmin_i D(x_j, v_i), and u_ij = 0 otherwise.   (2)
3. For i = 1 to k do
   v_i = Σ_{j=1}^{n} u_ij x_j / Σ_{j=1}^{n} u_ij
4. If ‖V^T − V^{T−1}‖ < ε stop;
   else T = T + 1 and go to 2.
FIGURE 1 | k-Means clustering algorithm.
and D(x_j, v_i) = ‖x_j − v_i‖^2 is the distance, for example, the Euclidean distance, which is also known as the L_2 norm.
In (1), we add up the distances between the examples assigned to a cluster and the corresponding
cluster center. The value J_1 is to be minimized. The
way to accomplish the minimization is to fix V or U
and calculate the other, and then reverse the process.
This requires an initialization to which the algorithm
is quite sensitive (Refs 11–13). The good news is that the algorithm is guaranteed to converge (Ref 14). The k-means clustering algorithm is shown in Figure 1. We have used
bold notation to indicate that x, v are vectors and U,
V are matrices. We are going to drop the bold in the
proceeding for convenience, expecting the reader will
recall they are vectors or matrices.
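To make the alternating scheme of Figure 1 concrete, a minimal sketch in Python with NumPy is given below (illustrative only; the function name kmeans and its arguments are choices made for this sketch, not part of any particular library):

import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    # Minimal hard k-means for Eq. (1): alternately fix V to update U, then fix U to update V.
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=k, replace=False)]      # initial cluster centers
    for _ in range(n_iter):
        # squared Euclidean distances D(x_j, v_i), shape (n, k)
        d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                          # hard assignments, i.e., U
        V_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else V[i]
                          for i in range(k)])              # recompute centers
        if np.linalg.norm(V_new - V) < tol:                # stop when ||V^T - V^{T-1}|| < eps
            V = V_new
            break
        V = V_new
    return labels, V

Different seeds give different initial centers, which is exactly the sensitivity to initialization discussed above.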
Now, we have our first objective function-based clustering algorithm defined. We can use it to cluster a four-dimensional, three-class dataset as an illustrative example. Here, the Weka data mining tool (Ref 15) has been
used to cluster and display the Iris data (Ref 16). The dataset
shown in Figure 2 describes Iris plants and has 150
examples with 50 examples in each of three classes.
There were four numeric features measured: sepal
length, sepal width, petal length, and petal width. The
projection here is into two dimensions, petal length
and petal width. You can see it looks like there might
be two classes, as two overlap in Figure 2(a) with one
clearly separate. In Figure 2(b), you see a partition of
the data into three classes and in Figure 2(c) a different partition (from a different initialization, thus
illustrating the sensitivity to initialization).

It is important to note that the expert who
created the Iris dataset recognized the three classes
of Iris flowers. However, the features recorded do
not necessarily disambiguate the classes with 100%
accuracy. This is true for many more complex
FIGURE 2 | The Iris data (a) with labels, (b) clustered by k-means with a "Good" initialization, and (c) clustered by k-means with a "Bad" initialization.
real-world problems. The point is that, without labels, the features may give us a different number of
classes than the known number. In this case, we might
want better features (or a different algorithm).

Note that for the Iris data some claim the features really only allow for two clusters (Refs 13, 17). Real-world
ground truth tells us that there are three clusters or
classes. Which is correct? Perhaps both, with the given
features?
FUZZY k-MEANS
If you allow an example x_j to partially belong to more
than one cluster, with a membership in cluster i of
μ_i(x_j) ∈ [0, 1], this is called a fuzzy membership (Ref 18).
Using fuzzy memberships allows the creation of the
fuzzy k-means (FKM) algorithm in which each example has some membership in each cluster. The
algorithm was originally called fuzzy c-means. Like
k-means, an educated guess of the number of
clusters using domain knowledge is required of
the user.

The objective function for FKM is J_m in Eq. (3):
J_m(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} w_j \, u_{ij}^m \, D(x_j, v_i),   (3)
where u_ij is the membership value of the jth example,
x_j, in the ith cluster; v_i is the ith cluster centroid; n is
the number of examples; k is the number of clusters;
and m controls the fuzziness of the memberships, with m
very close to one causing the partition to be nearly
crisp and to approximate k-means. Higher values cause
fuzzier partitions, spreading the memberships across
more clusters.

D(x_j, v_i) = ‖x_j − v_i‖^2 is the norm, for example, the Euclidean distance.

w_j is the weight of the jth example. For FKM, w_j = 1 for all j. We will use this value, which is not typically shown, later.
1. l = 0
2. Initialize the cluster centers (v_i's) to get V^0.
3. Repeat
   l = l + 1,
   calculate U^l according to Eq. (4),
   calculate V^l according to Eq. (5)
4. Until ‖V^l − V^{l−1}‖ < ε
FIGURE 3 | FKM algorithm.
U and V can be calculated as

u_{ij} = \frac{D(x_j, v_i)^{\frac{1}{1-m}}}{\sum_{l=1}^{k} D(x_j, v_l)^{\frac{1}{1-m}}},   (4)

v_i = \frac{\sum_{j=1}^{n} w_j (u_{ij})^m x_j}{\sum_{j=1}^{n} w_j (u_{ij})^m}.   (5)
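As a small illustration of how Eqs (4) and (5) translate into code, the following Python/NumPy sketch performs one FKM iteration with w_j = 1 for all j (the names and the small constant guarding zero distances are choices made for this sketch):

import numpy as np

def fkm_step(X, V, m=2.0, eps=1e-9):
    # One FKM iteration: update U via Eq. (4), then V via Eq. (5), with w_j = 1.
    d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + eps   # D(x_j, v_i), shape (n, k)
    inv = d ** (1.0 / (1.0 - m))                     # D^{1/(1-m)}
    U = (inv / inv.sum(axis=1, keepdims=True)).T     # memberships u_ij, shape (k, n)
    Um = U ** m
    V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)               # Eq. (5)
    return U, V_new

Iterating fkm_step until ‖V^l − V^{l−1}‖ falls below a small ε reproduces the loop of Figure 3.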
The clustering algorithm is shown in Figure 3.
There is some extra computation when compared to
k-means and we must choose the value for m. There
are many papers that describe approaches to choosing
m (Refs 19, 20), some of which are automatic. A default choice
of m = 2 often works reasonably well as a starting
point. There is a convergence theorem (Ref 21) for FKM that
shows that it ends up in local minima or saddle points,
but will stop iterating.

This algorithm is very useful if you know you
have examples that truly are a mixture of classes.
It also allows for the easy identification of examples that do not fit well into a single cluster and
provides information on how well any example fits
a cluster. An interesting case of its use is with the
previously mentioned Iris dataset. FKM in many experiments (over 5000 random initializations) always
converged to the same final partition, which had 16
errors when searching for three clusters, as shown in
Figure 2. k-Means converged to one of three partitions, with the most likely one the same as FKM's (Ref 13).
The other two were local extrema that resulted in significantly higher values of J_1, one of which is shown in
Figure 2. Of course, both algorithms are sensitive to
initialization and for other datasets FKM will not always converge to the same solution.

The FKM algorithm with the Euclidean distance function has a bias toward hyperspherical clusters that are equal sized. If you change the distance
function (Refs 22, 23), hyperellipsoidal clusters can be found.
EXPECTATION MAXIMIZATION CLUSTERING
Consider the case in which you want to assign probabilities to whether an example is in one cluster or
another. We might have the case that x_5 belongs to
cluster A with a probability of 0.9, cluster B with a
probability of 0, and cluster C with a probability of
0.1 for a three-cluster or class problem. Note, without
labels the cluster designations are arbitrary. The clustering algorithm to use is based on the idea of expectation maximization (Ref 24), or finding the maximum likelihood solution for estimating the model parameters.
We are going to give a simplified, clustering-focused
version of the algorithm here, and more general details
can be found in Refs 15 and 24. The algorithm does
come with a convergence proof (Ref 25) which guarantees
that we can find a solution (although not necessarily
the optimal solution) under conditions that usually
are not violated by real data.

We want to find the set of parameters Θ that maximizes the log likelihood of generating our data X.
Now Θ will consist of our probability function
for examples belonging to classes which, in the simplest case, requires us to find the centroid of clusters and the standard deviation of them. More generally, the necessary parameters to maximize Eq. (6)
are found:

\Theta = \arg\max_{\Theta} \sum_{i=1}^{n} \log(P(x_i \mid \Theta)).   (6)
Let p_j be the probability of class j. Let z_ij = 1 if
example i belongs to class j, and 0 otherwise.
Now our objective function will be

L(X, \Theta) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \log\!\left(p_j \, P(x_i \mid j)\right).   (7)
Now, how do we calculate P(x_i | j)? A simple formulation is given in Eq. (8) using a Gaussian-based distance:

P(x_i \mid j) = f(x_i; \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \, e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}.   (8)
This works for roughly balanced classes with
a spherical shape in feature space. A more general
description, which works better for data that does not
fit the constraints of the previous sentence, is given in
Ref 26. Our objective function depends on μ and σ.
We observe that

p_j = \sum_{i=1}^{n} P(x_i \mid j) / n.   (9)
1. Initialize the μ_j's and σ_j's by running one iteration of k-means and taking the k cluster centers
   and their standard deviations. Initialize ε and L^0 = −∞.
2. Repeat
3. E-step: Calculate P(x_i | j) as in Eq. (8), where L is calculated from Eq. (7).
4. M-step: n_j = Σ_{i=1}^{n} P(x_i | j), 1 ≤ j ≤ k
   p_j = n_j / n
   μ_j = Σ_{i=1}^{n} P(x_i | j) x_i / n_j, 1 ≤ j ≤ k
   σ_j^2 = Σ_{i=1}^{n} P(x_i | j)(x_i − μ_j)^2 / Σ_{i=1}^{n} P(x_i | j)
5. Until |L^t − L^{t−1}| < ε
FIGURE 4 | The EM clustering algorithm.
FIGURE 5 | The Iris data clustered by the EM algorithm.
The EM algorithm is shown in Figure 4. We
have applied the EM algorithm as implemented in
Weka to the Iris data. A projection of the partition
obtained when searching for three classes is shown
in Figure 5. The final partition differs, albeit slightly,
from that found in Figure 2. However, it is interesting
that even on this simple dataset there are disagreements which, unsurprisingly, involve the two overlapping classes.
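For a one-dimensional feature, a compact sketch of the EM loop of Figure 4 is given below (Python with NumPy, illustrative only). Note that the E-step here computes the usual posterior responsibilities, p_j f(x_i; μ_j, σ_j) normalized over j, which the simplified description in Figure 4 folds into P(x_i | j):

import numpy as np

def em_gauss_1d(x, k, n_iter=100, tol=1e-6, seed=0):
    # EM for a one-dimensional Gaussian mixture, in the spirit of Figure 4.
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)            # initialize means from data points
    sigma = np.full(k, x.std() + 1e-9)
    p = np.full(k, 1.0 / k)
    log_lik = -np.inf
    for _ in range(n_iter):
        # E-step: Gaussian densities f(x_i; mu_j, sigma_j) from Eq. (8), weighted by p_j
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        weighted = dens * p
        r = weighted / weighted.sum(axis=1, keepdims=True)   # responsibilities, shape (n, k)
        # M-step: update mixing weights, means, and standard deviations
        n_j = r.sum(axis=0)
        p = n_j / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_j
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_j) + 1e-9
        new_log_lik = np.log(weighted.sum(axis=1)).sum()     # log likelihood, as in Eq. (7)
        if abs(new_log_lik - log_lik) < tol:
            break
        log_lik = new_log_lik
    return p, mu, sigma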
POSSIBILISTIC k-MEANS
The clustering algorithm discussed in this section was
designed to be able to cluster data like that shown
in Figure 6. Visually, there is a bar with a spherical
object at each end. A person would most likely say
there are three clusters: the two spheres and the linear
bar. The problem for the clustering algorithms discussed thus far is that they must find very different
shapes (spherical and linear) and are not designed to
do so. The possibilistic k-means (PKM) algorithm can
find nonspherical clusters together with ellipsoidal or
spherical clusters (Ref 27). The algorithm, originally named
possibilistic c-means, is also significantly more noise
tolerant than FKM (Refs 28, 29).

FIGURE 6 | Three-cluster problem that is difficult.

The approach is more computationally complex
and requires some attention to parameter setting (Ref 28).
The innovation is to view the examples as being possible members of clusters. Possibility theory is utilized to create the objective function (Ref 30). So, an example might have a possibility of 1 (potentially complete
belonging) to more than one cluster. The membership
value can also be viewed as the compatibility of the
assignment of an example to a cluster.

The objective function for PKM looks like that
for FKM with an extra term and some different
constraints on the membership values, as shown in
Eq. (10). The second term forces the u_ij to be as large
as possible to avoid the trivial solution:
J_m(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m \, D(x_j, v_i) + \sum_{i=1}^{k} \eta_i \sum_{j=1}^{n} (1 - u_{ij})^m,   (10)
where the η_i are suitable positive numbers; u_ij is the
membership value of the jth example, x_j, in the
ith cluster such that u_ij ∈ [0, 1], 0 < Σ_{j=1}^{n} u_ij ≤ n,
and max_j u_ij > 0 for all i. Critically, the memberships are
1. Run FKM for one full iteration. Choose m and k.
2. l = 0
3. Initialize the cluster centers (v_i's) from FKM to get V^0.
4. Estimate η_i using Eq. (12).
5. Repeat
   l = l + 1,
   calculate U^l according to Eq. (11),
   calculate V^l according to Eq. (5)
6. Until ‖V^l − V^{l−1}‖ < ε
Now, if you want to know the shapes of the possibility distributions, first re-estimate η_i with Eq. (13). This is optional.
1. Repeat
   l = l + 1,
   calculate U^l according to Eq. (11),
   calculate V^l according to Eq. (5)
2. Until ‖V^l − V^{l−1}‖ < ε
FIGURE 7 | PKM algorithm.
now not constrained to add to 1 across classes, with
the only requirement being that every example have
nonzero membership (or possibility) for at least one
cluster; v_i is the ith cluster centroid; n is the number of examples; k is the number of clusters; m affects
the possibilistic membership values, with values closer to 1
making the results closer to a hard partition (which
has memberships of only 0 or 1); and D(x_j, v_i) = ‖x_j − v_i‖^2 is the norm, such as the Euclidean distance.

Now, the calculation for the cluster centers is
still done by Eq. (5) with w_j = 1 for all j. The calculation
for the possibilistic memberships is given by Eq. (11):
for the possibilsitic memberships is given by Eq. (11):
u
i j
=
1
1 +
_
D(x
j
,v
i
)

i
_ 1
m1
. (11)
The value of η_i has the effect of determining the
distance at which an example's membership becomes
0.5. It should be chosen based on the bandwidth of
the desired membership distribution for a cluster. In
practice, a value proportional to the average fuzzy
intracluster distance can work, as in Eq. (12). The
authors of the approach note that R = 1 is a typical
choice:

\eta_i = R \, \frac{\sum_{j=1}^{n} (u_{ij})^m \, D(x_j, v_i)}{\sum_{j=1}^{n} (u_{ij})^m}.   (12)
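A short sketch of the two PKM quantities just defined is given below (Python/NumPy, illustrative only; here the memberships are stored example-by-cluster, i.e., the transpose of the U used in the text):

import numpy as np

def pkm_memberships(X, V, eta, m=2.0):
    # Possibilistic memberships of Eq. (11); rows need not sum to 1 across clusters.
    d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # D(x_j, v_i), shape (n, k)
    return 1.0 / (1.0 + (d / eta) ** (1.0 / (m - 1.0)))

def estimate_eta(U, d, m=2.0, R=1.0):
    # Eq. (12): eta_i proportional to the average fuzzy intracluster distance, one value per cluster.
    Um = U ** m                                              # U and d both shaped (n, k) here
    return R * (Um * d).sum(axis=0) / Um.sum(axis=0)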
The PKM algorithm is quite sensitive to the
choice of m (Refs 28, 29). You can fix the η_i's or calculate them
each iteration. When fixed, you have guaranteed convergence. The algorithm that will allow you to find
clusters such as in Figure 6 benefits from the use of
Eq. (13) to generate η_i after convergence is achieved
using Eq. (12). Consider Π_i to contain all the
memberships for the ith cluster. Then (Π_i)_α contains
all membership values above α and, in terms of possibility theory, is called an α-cut. So, for example, with
an α of 0.5 you get a good set of members that have
a pretty strong affinity for the cluster:

\eta_i = R \, \frac{\sum_{x_j \in (\Pi_i)_\alpha} D(x_j, v_i)}{|(\Pi_i)_\alpha|}.   (13)
The algorithm for PKM is shown in Figure 7.
A nice advantage of this algorithm is its performance when your dataset is noisy, as well as its ability
to nicely extract different shapes, although this
typically requires a distance function that is a bit more
complex than the Euclidean distance. Two other distance functions that can be used are shown in Eqs (14)
and (16). The scaled Mahalanobis distance (Ref 31) allows
for nonspherical shapes. If you have spherical shells
potentially in your data, then the distance measure (Ref 32)
shown in Eq. (16) can be effective. However, the introduction of the radius into the distance measure
requires a new set of updated equations for finding
the cluster centers, which can be found in Ref 27:

D_{ij} = |F_i|^{1/n} \, (x_j - v_i)^T F_i^{-1} (x_j - v_i),   (14)
1. Given R = [r_ij], initialize 2 ≤ k < n, and initialize U^0 ∈ M_k with u_ij ∈ {0, 1}, the constraints
   of Eq. (17) holding, and T = 1.
2. Calculate the k mean vectors v_i to create V^T as
   v_i = (u_i1, u_i2, . . . , u_in)^T / Σ_{j=1}^{n} u_ij   (19)
3. Update U^T using Eq. (2), where the distance is
   D(x_j, v_i) = (R v_i)_j − (v_i^T R v_i)/2   (20)
4. If ‖U^T − U^{T−1}‖ < ε stop;
   else T = T + 1 and go to 2.
FIGURE 8 | Relational k-means clustering algorithm.
where F_i is the fuzzy covariance matrix of cluster v_i
and can be updated with Eq. (15):

F_i = \frac{\sum_{j=1}^{n} u_{ij}^m (x_j - v_i)(x_j - v_i)^T}{\sum_{j=1}^{n} u_{ij}^m},   (15)
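A sketch of how Eqs (14) and (15) can be computed is given below (Python/NumPy, illustrative only). Note that the determinant is raised to the power 1/s, with s the feature dimension, following the Gustafson-Kessel convention, and a small regularizer keeps F_i invertible:

import numpy as np

def gk_distances(X, V, U, m=2.0, reg=1e-8):
    # Scaled Mahalanobis distances of Eq. (14) using the fuzzy covariances of Eq. (15).
    n, s = X.shape
    k = V.shape[0]
    D = np.empty((n, k))
    for i in range(k):
        w = U[i] ** m                                  # u_ij^m for cluster i, shape (n,)
        diff = X - V[i]                                # x_j - v_i
        F = (w[:, None] * diff).T @ diff / w.sum()     # fuzzy covariance, Eq. (15)
        F += reg * np.eye(s)                           # guard against a singular matrix
        Finv = np.linalg.inv(F)
        scale = np.linalg.det(F) ** (1.0 / s)          # volume normalization
        D[:, i] = scale * np.einsum('nj,jk,nk->n', diff, Finv, diff)
    return D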
D_{ij} = \left( \|x_j - v_i\| - r_i \right)^2,   (16)

where r_i is the radius of cluster i. With new calculations to find the cluster centers, this results in an
algorithm called possibilistic k-shells. It is quite effective if you have a cluster within a cluster (like a big O
containing a small o).
There are a number of alternative formulations
of possibilistic clustering, such as that in Ref 33, where
the authors argue their approach is less sensitive to
parameter setting. In Ref 34, a mutual cluster repulsion term is introduced to solve the technical problem
of coincident cluster centers providing the best minimization, and it introduces some other potentially useful properties.

The Mahalanobis distance measure can be used
in k-means (with just the covariance matrix), FKM,
and EM as well. In FKM, the use of Eq. (14) as the
distance measure gives the so-called GK clustering
algorithm (Ref 23), which is known for its ability to capture
hyperellipsoidal clusters.
RELATIONAL CLUSTERING
How do we cluster data for which the attributes are
not all numeric? Relational clustering is one approach
that can be used stand alone or in an ensemble of clustering algorithms (Refs 35, 36). Cluster ensembles can provide
other options for dealing with mixed attribute types.
How about if all we know is how examples relate to
one another in terms of how similar they are? Relational clustering works when, for x_i ∈ X, ρ(x_i, x_j) = r_ij ∈ [0, 1]. We can think of ρ as a binary fuzzy relation. R = [r_ij] is a fuzzy relation (or a typical relation matrix if
r_ij ∈ {0, 1}). Relational clustering algorithms are typically associated with graphs because R can always be
viewed as the adjacency matrix of a weighted digraph
on the n examples (nodes) in X.

Graph clustering is not typically addressed with
an objective function-based clustering approach (Ref 37).
However, there are relational versions of k-means and
FKM (Ref 38) which are applicable to graphs and will be discussed. First, our U matrix of example memberships
in clusters can be put into a context that allows for
both k-means and FKM to be described with one
objective function:
M_{fk} = \left\{ U \in \mathbb{R}^{k \times n} \;\middle|\; 0 \le u_{ij} \le 1, \; \sum_{i=1}^{k} u_{ij} = 1 \text{ for } 1 \le j \le n, \; \sum_{j=1}^{n} u_{ij} > 0 \text{ for } 1 \le i \le k \right\}   (17)
In Eq. (17) the memberships can be fuzzy or in
{0, 1}, called crisp. The same constraints as for FKM
and k-means hold. Our objective function is
JR_m(U) = \sum_{i=1}^{k} \left[ \frac{\sum_{j=1}^{n} \sum_{l=1}^{n} u_{ij}^m u_{il}^m \, r_{jl}}{2 \sum_{t=1}^{n} u_{it}^m} \right],   (18)

where m ≥ 1. If we have numeric data, we can create
r_il = δ_il^2 = ‖x_i − x_l‖^2 for some distance function. The
square just ensures a positive number. The algorithm
for relational k-means is shown in Figure 8.
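A compact sketch of the loop in Figure 8 over a given dissimilarity matrix R is shown below (Python/NumPy, illustrative only; the empty-cluster guard is a choice made for this sketch):

import numpy as np

def relational_kmeans(R, k, n_iter=100, seed=0):
    # Relational hard k-means over an n x n dissimilarity matrix R, following Figure 8.
    n = R.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)               # random initial crisp assignment
    for _ in range(n_iter):
        U = np.zeros((k, n))
        U[labels, np.arange(n)] = 1.0                 # crisp membership matrix
        counts = U.sum(axis=1, keepdims=True)
        counts[counts == 0] = 1.0                     # guard against empty clusters
        V = U / counts                                # relational "centers", Eq. (19)
        RV = R @ V.T                                  # (R v_i)_j, shape (n, k)
        quad = np.einsum('ij,ji->i', V @ R, V.T)      # v_i^T R v_i, one value per cluster
        D = RV - 0.5 * quad                           # distances of Eq. (20)
        new_labels = D.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels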
This algorithm has a convergence proof based
on the nonrelational case. It allows us to do graph
1. Given R = [r_ij], initialize 2 ≤ k < n, and initialize U^0 ∈ M_k with u_ij ∈ [0, 1], the constraints
   of Eq. (17) holding, and T = 1. Choose m > 1.
2. Calculate the k mean vectors v_i to create V^T as
   v_i = (u_i1^m, u_i2^m, . . . , u_in^m)^T / Σ_{j=1}^{n} u_ij^m   (21)
3. Update U^T using Eq. (4), where the distance is
   D(x_j, v_i) = (R v_i)_j − (v_i^T R v_i)/2   (22)
4. If ‖U^T − U^{T−1}‖ < ε stop;
   else T = T + 1 and go to 2.
FIGURE 9 | Relational fuzzy k-means clustering algorithm.
clustering using an objective function-based clustering algorithm. The fuzzy version (with the memberships relaxed to be in [0, 1]) is shown in Figure 9.
It also converges (Ref 38) and provides a second option
for relational clustering with objective function-based
algorithms.

Another approach to fuzzy relational clustering
is fuzzy medoid clustering (Ref 39). A set of k fuzzy representatives from the data (medoids) is found to minimize
the dissimilarity in the created clusters. The approach
typically requires less computational time than that
described here.
ADJUSTING THE PERFORMANCE OF OBJECTIVE FUNCTION-BASED ALGORITHMS
The algorithms discussed thus far are very good clustering algorithms. However, for the most part, they
have important limitations. With the exception of
PKM, noise will have a strong negative effect on them.
Unless otherwise noted, there is a built-in bias for hyperspherical clusters that are equal sized. That is a
problem if you have, for example, a cluster of interest
which is rare and, hence, small.

To get different cluster shapes, the distance measure can be changed. We have seen an example of
the Mahalanobis distance and discussed the easy-to-compute Euclidean distance. There are lots of other
choices, such as those given in Ref 40. Any that involve
the use of the covariance matrix of a cluster, or some
variation of it, typically will not require changes in
the clustering algorithm.
For example, we can change our probability calculation in EM to be as follows:

P(x_i \mid j) = f(x_i; \mu_j, \Sigma_j) = \frac{\exp\!\left\{ -\frac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j) \right\}}{(2\pi)^{s/2} \, |\Sigma_j|^{1/2}},   (23)
where Σ_j is the covariance matrix for the jth cluster,
s is the number of features, and μ_j is the centroid
(average) of cluster j. This gives us ellipsoidal clusters.
Now, if we want to have different shapes for clusters,
we can look at parameterizations of the covariance
matrix. An eigenvalue decomposition is

\Sigma_j = \lambda_j D_j A_j D_j^T,   (24)

where D_j is the orthogonal matrix of eigenvectors, A_j
is a diagonal matrix whose elements are proportional
to the eigenvalues of Σ_j, and λ_j is a scalar value (Ref 41).
This formulation can be used in k-means and FKM,
and really PKM, with a fuzzy covariance matrix for
the latter two. It is a very flexible formulation.
The orientation of the principal components of
Σ_j is determined by D_j, and the shape of the density contours is determined by A_j. Then λ_j specifies the volume
of the corresponding ellipsoid, which is proportional
to λ_j^s |A_j|. The orientation, volume, and shape of the
distributions can be estimated from the data and be
allowed to vary for each cluster (although you can
constrain them to be the same for all).

In Ref 22, a fuzzified version of Eq. (23) is
given and modifications are made to FKM to result
in the so-called Gath–Geva clustering algorithm.
1. l = 0
2. Initialize U^0, perhaps with one iteration of FKM or k-means. Initialize m.
3. Repeat
   l = l + 1
   For FKM: calculate U^l according to Eq. (4) using Eq. (29) for the distance. Note
   that the distances depend on the previous membership values. You might add a step to
   calculate them all to improve computation time, if you like.
   For k-means: calculate U^l according to Eq. (2) using Eq. (29) for the distance with m = 1.
4. Until ‖U^l − U^{l−1}‖ < ε
FIGURE 10 | Kernel-based k-means/FKM algorithm.
The Gath–Geva algorithm also has a built-in method to discover the
number of clusters in the data, which will be addressed in the proceeding.
Kernel-Based Clustering
Another interesting way to change distance functions
is to use the kernel trick (Ref 42) associated with support
vector machines, which are trained with labeled data.
A very simplified explanation of the idea, which is
well explained by Burges (Ref 43), is the following. Consider projecting the data into a different space, where
it may be more simply separable. From a clustering
standpoint, we might think of a three-cluster problem
where the clusters are touching in some dimensions.
When projected, they may nicely group for any clustering algorithm to find them (Ref 44).

Now consider Φ: R^s → H to be a nonlinear mapping function from the original input space
to a high-dimensional feature space H. By applying
the nonlinear mapping function Φ, the dot product
x_i · x_j in our original space is mapped to Φ(x_i) · Φ(x_j)
in feature space. The key notion in kernel-based learning is that the mapping function need not be explicitly specified. The kernel function K(x_i, x_j) in the original space R^s can be used to calculate the dot product
Φ(x_i) · Φ(x_j).
First, we introduce three (of many) potential kernels
which satisfy Mercer's condition (Ref 43):

K(x_i, x_j) = (x_i \cdot x_j + b)^d,   (25)

where d is the degree of the polynomial and b is some constant offset;

K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}},   (26)

where σ^2 is a variance parameter;

K(x_i, x_j) = \tanh(\alpha (x_i \cdot x_j) + \beta),   (27)

where α and β are constants that shape the sigmoid function.
Practically, the kernel function K(x_i, x_j) can be
integrated into the distance function of a clustering
algorithm, which changes the update equations (Refs 45, 46).
The most general approach is to construct the cluster
center prototypes in kernel space (Ref 46) because it allows
for more kernel functions to be used. Here, we will
take a look at hard and fuzzy k-means approaches to objective function-based clustering with kernels. Now our
distance becomes D(x_j, v_i) = ‖Φ(x_j) − v_i‖^2. So our
objective function reads as

J_{K,m} = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m \, \|\Phi(x_j) - v_i\|^2.   (28)
For k-means, we just need m = 1 with u_ij ∈ {0, 1}
as usual. We see our objective function has Φ(x_j) in it,
so our update equations and distance equation must
change. Before discussing the modified algorithm, we
introduce the distance function.

Our distance will be as shown below in Eq. (29) for
the fuzzy case. It is simple to modify for k-means. It is
important to note that the cluster centers themselves
do not show up in the distance equation:
D(x_j, v_i) = K(x_j, x_j) - \frac{2 \sum_{l=1}^{n} u_{il}^m K(x_l, x_j)}{\sum_{l=1}^{n} u_{il}^m} + \frac{\sum_{q=1}^{n} \sum_{l=1}^{n} u_{iq}^m u_{il}^m K(x_q, x_l)}{\left( \sum_{q=1}^{n} u_{iq}^m \right)^2}.   (29)
The algorithm for k-means and FKM is then
shown in Figure 10.
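A sketch of the kernel distance of Eq. (29), using the Gaussian kernel of Eq. (26), is given below (Python/NumPy, illustrative only); as noted above, the cluster centers never appear explicitly:

import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gram matrix K(x_i, x_j) for the Gaussian kernel of Eq. (26).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_distances(K, U, m=2.0):
    # Kernel-space distances D(x_j, v_i) of Eq. (29) from memberships U, shape (k, n).
    Um = U ** m
    s = Um.sum(axis=1, keepdims=True)                      # sum_l u_il^m, per cluster
    term2 = 2.0 * (Um @ K) / s                             # second term of Eq. (29)
    term3 = np.einsum('iq,ql,il->i', Um, K, Um) / (s[:, 0] ** 2)   # third term
    return (np.diag(K)[None, :] - term2 + term3[:, None]).T        # shape (n, k)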
Choosing the Right Number of Clusters: Cluster Validity
There are many functions (Refs 47–49) that can be applied to
evaluate the partitions produced by a clustering algorithm using different numbers of cluster centers. The
silhouette criterion (Ref 50) and fuzzy silhouette criterion (Ref 51)
measure the similarity of objects to others in their
1. Set initial number of clusters I, typically 2. Set maximum number of clusters MC, MC << n
   unless something is unusual. Initialize T = I, k = 0. Choose the validity metric to be used
   and parameters for it.
2. While (T ≤ MC and k == 0) do
3.    Cluster data into T clusters
4.    k = checkvalidity  /* Returns the number of clusters if applicable, or 0. */
5.    T = T + 1
   Return (IF (k == 0) return MC ELSE return k)
FIGURE 11 | Finding the right number of clusters with a partition validity metric. Any validity metric that applies to a particular objective function-based clustering algorithm can be applied.
own cluster against the nearest object from another.
A nice comparison for k-means and some hierarchical approaches is found in Ref 52. Perhaps the simplest (but far from only) approach is that shown in
Figure 11. Here you start with 2 clusters and run the
clustering algorithm to completion (or enough iterations to have a reasonable partition), then increase to
3, 4, . . ., MC clusters and repeat the process. A validity metric is applied to each partition and can be used
to pick out the right one according to it. This will
typically be determined at the point when the declining or increasing value of the metric changes direction
(increases/decreases). Note that you cannot use the
objective function because it will prefer many clusters
(sometimes as many as there are examples).

A very good, simple cluster validity metric
for fuzzy partitions of data is the Xie–Beni index
[Eq. (30)] (Refs 53, 54). It uses the value m with u_ij and you
can set m to the same value as used in clustering or
simply a default (say m = 2). The search is for the
smallest S:
S = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m \, D(x_j, v_i)}{n \, \min_{i \ne j} \{ D(v_i, v_j) \}}.   (30)
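A sketch of the index is given below (Python/NumPy, illustrative only), with U shaped (k, n) as in the text:

import numpy as np

def xie_beni(X, V, U, m=2.0):
    # Xie-Beni index of Eq. (30); smaller values indicate a better partition.
    d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)        # D(x_j, v_i), shape (n, k)
    num = ((U.T ** m) * d).sum()
    centre_d = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(centre_d, np.inf)                            # ignore i == j
    return num / (len(X) * centre_d.min())

One would cluster for k = 2, . . ., MC as in Figure 11, compute S for each partition, and keep the k giving the smallest value.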
The generalized Dunn index (Ref 47) has proved to be
a good validity metric for k-means partitions. One
good version of it is shown in Eq. (31), with a nice
discussion given in Ref 47. It looks at between-cluster
scatter versus within-cluster scatter measured by the
biggest intracluster distance:

S_{gd} = \min_{1 \le s \le k} \left\{ \min_{1 \le t \le k, \, t \ne s} \left[ \frac{\sum_{z \in v_s, \, w \in v_t} D(z, w)}{|v_s| \, |v_t| \, \max_{1 \le l \le k} \left( \max_{x, y \in v_l} D(x, y) \right)} \right] \right\}.   (31)
A relatively new approach to determining the
number of clusters involves the user of the clustering
algorithm. These approaches are called visual assessment techniques (Refs 55, 56). The examples or objects are ordered to reflect dissimilarities (rows and columns in
the dissimilarity matrix are moved). The newly ordered matrix of pairwise example dissimilarities is displayed as an intensity image. Clusters are indicated by
dark blocks of pixels along the diagonal. The viewer
can decide on how many clusters one sees.
Scaling Clustering Algorithms to Large Datasets
The clustering algorithms we have discussed take a
significant amount of time when the number of examples or features or both are large. For example,
the k-means run time requires checking the distance of
n examples of dimension s against k cluster centers,
with an update of the (n × k) U matrix during each
iteration. The run time is proportional to (nsk + nk)t
for t iterations. Using big O notation (Ref 57), the average
run time is O(nskt). This is linear in n, it is true, but
the distances are computationally costly to compute.
As n gets very large, we would like to reduce the time
required to accomplish clustering.

Clustering can be sped up with distributed
approaches (Refs 58–60), where the algorithm is generalized
to work on multiple processors. An early approach to
speeding up k-means clustering is given in Ref 61.
They provide a four-step process, shown below, to
allow just one pass through the data, assuming a
maximum-sized piece of memory can be used to store
data, as shown in Figure 12. An advantage of this
approach is that you are only loading the data from
disk one time, meaning it can be much larger than the
available memory. The clustering is done in step 2.
1. Obtain the next available (possibly random) sample from the dataset and put it in free memory
   space. Initially, you may fill it all.
2. Update the current model using the data in memory.
3. Based on the updated model, classify the loaded examples as ones that
   (a) need to be retained in memory
   (b) can be discarded with updates to the sufficient statistics
   (c) can be reduced via compression and summarized as sufficient statistics
4. Determine if the stopping criteria are satisfied. If so, terminate; else go to 1.
FIGURE 12 | An approach to speeding up k-means (applied in step 2) with one pass through the data.
Now, to speed up FKM, the single-pass algorithm can be used (Refs 62, 63). It makes use of the weights
shown in Eq. (3). The approach is pretty simple. Break
the data into c chunks.

(1) Cluster the first chunk.

(2) Create weights for the cluster centers based
on the amount of membership assigned to
them with Eq. (32). Here, n_d is the number
of examples being clustered for a chunk of
data.

(3) Bring in the next chunk of data and the k
weighted cluster centers from the previous
step and apply FKM.

(4) Go to step 2 until all chunks are processed.

So, if c = 10, in the first step you process 10%
of the n examples and n_d = 0.1n. In the next 9 steps,
there are n_d = 0.1n + k examples to cluster:

w_j = \sum_{l=1}^{n_d} u_{jl} \, w_l, \quad 1 \le j \le k.   (32)
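A sketch of this chunked scheme is given below (Python, illustrative only). The helper weighted_fkm is hypothetical: it stands for any routine that runs FKM with the example weights of Eq. (3) and returns the membership matrix U (shaped k × n_d) and the centers V for one chunk:

import numpy as np

def single_pass_fkm(chunks, k, weighted_fkm):
    # Single-pass FKM: cluster each chunk together with the weighted centers carried forward.
    V, w_centers = None, None
    for X in chunks:
        if V is None:
            data, weights = X, np.ones(len(X))                 # first chunk: unit weights
        else:
            data = np.vstack([X, V])                           # new chunk plus k weighted centers
            weights = np.concatenate([np.ones(len(X)), w_centers])
        U, V = weighted_fkm(data, weights, k)                  # hypothetical weighted FKM routine
        w_centers = (U * weights).sum(axis=1)                  # Eq. (32): new weight per center
    return V, w_centers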
SUMMARY AND DISCUSSION
Objective function-based clustering describes an approach to grouping unlabeled data in which there is a
global function to be minimized or maximized to find
the best data partition. The approach is sensitive to
initialization and a brief example of this was given. It
is necessary to specify the number of clusters for these
approaches, although algorithms can easily be built (Ref 22)
to incorporate methods to determine the number of
clusters.

Four major approaches to objective function-based clustering have been covered: k-means, FKM,
PKM, and expectation maximization. Relational clustering can be done with any of the major clustering
algorithms by using a matrix of distances between
examples based on some relation. We have briefly
discussed the importance of the distance measure in
determining the shapes of the clusters that can be
found.

A kernel-based approach to objective function-based clustering has been introduced. It has the
promise, with the choice of the right kernel function,
of allowing most any shape of cluster to be found.

A section on how to determine the right number
of clusters (cluster validity) was included. It shows just
two of many validity measures; however, they both
have performed well.

With the exception of PKM, the other approaches discussed here are sensitive to noise. Most
clustering problems are not highly noisy. If yours is
and you want to use objective function-based techniques, take a look at Refs 4–6, which contain modified
algorithms that are essential for this problem.

For a clustering problem, one needs to think
about the data and choose a type of algorithm. The expected shape of clusters, number of features, amount
of noise, whether examples of mixed classes exist,
and amount of data are among the critical considerations. You can choose some expected values for k,
the number of clusters, or use a validity function to
tell you the best number. You might want to try multiple initializations and take the lowest (highest) value
from the objective function, which indicates the best
partition. There are a number of publicly available
clustering algorithms that can be tried. In particular,
the freely available Weka (Ref 15) data mining tool has several, including k-means and EM. A couple of fuzzy
clustering algorithms are available (Refs 64, 65).

There are a lot of approaches that are not discussed here. This includes time series clustering (Refs 66–68).
Some other notable ones are in Refs 69–74. They all
contain some advances that might be helpful for your
problem. Happy clustering.
ACKNOWLEDGMENTS
This work was partially supported by grant 1U01CA143062-01, Radiomics of NSCLC, from the National Institutes of Health.
REFERENCES
1. Jain A, Dubes R. Algorithms for Clustering Data. Upper Saddle River, NJ: Prentice-Hall; 1988.
2. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv 1999, 31:264–323.
3. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res 2003, 3:993–1022.
4. Dave RN, Krishnapuram R. Robust clustering methods: a unified view. IEEE Trans Fuzzy Syst 1997, 5:270–293.
5. Kim J, Krishnapuram R, Dave R. Application of the least trimmed squares technique to prototype-based clustering. Pattern Recognit Lett 1996, 17:633–641.
6. Wu K-L, Yang M-S. Alternative c-means clustering algorithms. Pattern Recognit 2002, 35:2267–2278.
7. Banerjee A, Merugu S, Dhillon IS, Ghosh J. Clustering with Bregman divergences. J Mach Learn Res 2005, 6:1705–1749.
8. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Los Angeles, CA: University of California Press; 1967, 281–297.
9. Jain AK. Data clustering: 50 years beyond k-means. Pattern Recognit Lett 2010, 31:651–666. Award winning papers from the 19th International Conference on Pattern Recognition (ICPR).
10. Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In: The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Boston, MA; August 20–23, 2000.
11. Redmond SJ, Heneghan C. A method for initialising the k-means clustering algorithm using kd-trees. Pattern Recognit Lett 2007, 28:965–973.
12. He J, Lan M, Tan C-L, Sung S-Y, Low H-B. Initialization of cluster refinement algorithms: a review and comparative study. In: 2004 IEEE International Joint Conference on Neural Networks; 2004, 1–4:(xlvii+3302).
13. Hall LO, Ozyurt IB, Bezdek JC. Clustering with a genetically optimized approach. IEEE Trans Evolut Comput 1999, 3:103–112.
14. Selim SZ, Ismail MA. K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 1984, PAMI-6(1):81–87.
15. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco: Morgan Kaufmann; 2005.
16. Bezdek JC, Keller JM, Krishnapuram R, Kuncheva LI, Pal NR. Will the real iris data please stand up? IEEE Trans Fuzzy Syst 1999, 7:368–369.
17. Kothari R, Pitts D. On finding the number of clusters. Pattern Recognit Lett 1999, 20:405–416.
18. Kandel A. Fuzzy Mathematical Techniques With Applications. Boston, MA: Addison-Wesley; 1986.
19. Wu K-L. Analysis of parameter selections for fuzzy c-means. Pattern Recognit 2012, 45:407–415. http://dx.doi.org/10.1016/j.patcog.2011.07.01
20. Yu J, Cheng Q, Huang H. Analysis of the weighting exponent in the FCM. IEEE Trans Syst Man Cybern, Part B: Cybern 2004, 34:634–639.
21. Bezdek J, Hathaway R, Sobin M, Tucker W. Convergence theory for fuzzy c-means: counterexamples and repairs. IEEE Trans Syst Man Cybern 1987, 17:873–877.
22. Gath I, Geva AB. Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 1989, 11:773–780.
23. Gustafson DE, Kessel WC. Fuzzy clustering with a fuzzy covariance matrix. In: Proc IEEE CDC 1979, 761–766.
24. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 1977, 39:1–38.
25. Wu CFJ. On the convergence properties of the EM algorithm. Ann Stat 1983, 11:95–103.
26. Fraley C, Raftery AE. How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 1998, 41:579–588.
27. Krishnapuram R, Keller JM. A possibilistic approach to clustering. IEEE Trans Fuzzy Syst 1993, 1:98–110.
28. Krishnapuram R, Keller JM. The possibilistic c-means algorithm: insights and recommendations. IEEE Trans Fuzzy Syst 1996, 4:385–393.
29. Barni M, Cappellini V, Mecocci A. Comments on "A possibilistic approach to clustering". IEEE Trans Fuzzy Syst 1996, 4:393–396.
30. Dubois D, Prade H. Possibility theory, probability theory and multiple-valued logics: a clarification. Ann Math Artif Intell 2001, 32:35–66.
31. Sung K-K, Poggio T. Example-based learning for view-based human face detection. IEEE Trans Pattern Anal Mach Intell 1998, 20:39–51.
32. Krishnapuram R, Nasraoui O, Frigui H. The fuzzy c spherical shells algorithm: a new approach. IEEE Trans Neural Netw 1992, 3:663–671.
33. Yang M-S, Wu K-L. Unsupervised possibilistic clustering. Pattern Recognit 2006, 39:5–21.
34. Timm H, Borgelt C, Doring C, Kruse R. An extension to possibilistic fuzzy cluster analysis. Fuzzy Sets Syst 2004, 147:3–16.
35. Ghosh J, Acharya A. Cluster ensembles. WIREs Data Min Knowl Discov 2011, 1:305–315.
36. Strehl A, Ghosh J, Cardie C. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 2002, 3:583–617.
37. Schaeffer SE. Graph clustering. Comput Sci Rev 2007, 1:27–64.
38. Hathaway RJ, Davenport JW, Bezdek JC. Relational duals of the c-means clustering algorithms. Pattern Recognit 1989, 22:205–212.
39. Krishnapuram R, Joshi A, Nasraoui O, Yi L. Low-complexity fuzzy relational clustering algorithms for web mining. IEEE Trans Fuzzy Syst 2001, 9:595–607.
40. Aggarwal C, Hinneburg A, Keim D. On the surprising behavior of distance metrics in high dimensional space. Lecture Notes in Computer Science. Springer; 2001, 420–434.
41. Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993, 49:803–821.
42. Hofmann T, Schölkopf B, Smola AJ. Kernel methods in machine learning. Ann Stat 2008, 36:1171–1220.
43. Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998, 2:121–167.
44. Kim D-W, Lee KY, Lee DH, Lee KH. Evaluation of the performance of clustering algorithms in kernel-induced feature space. Pattern Recognit 2005, 38:607–611.
45. Heo G, Gader P. An extension of global fuzzy c-means using kernel methods. In: 2010 IEEE International Conference on Fuzzy Systems (FUZZ); 2010, 1–6.
46. Chen L, Chen CLP, Lu M. A multiple-kernel fuzzy c-means algorithm for image segmentation. IEEE Trans Syst Man Cybern, Part B: Cybern 2011, 99:1–12.
47. Bezdek JC, Pal NR. Some new indexes of cluster validity. IEEE Trans Syst Man Cybern, Part B: Cybern 1998, 28:301–315.
48. Wang J-S, Chiang J-C. A cluster validity measure with outlier detection for support vector clustering. IEEE Trans Syst Man Cybern, Part B: Cybern 2008, 38:78–89.
49. Pal NR, Bezdek JC. On cluster validity for the fuzzy c-means model. IEEE Trans Fuzzy Syst 1995, 3:370–379.
50. Kaufman L, Rousseeuw P. Finding Groups in Data. New York: John Wiley & Sons; 1990.
51. Campello RJGB, Hruschka ER. A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets Syst 2006, 157:2858–2875.
52. Vendramin L, Campello RJGB, Hruschka ER. Relative clustering validity criteria: a comparative overview. Stat Anal Data Min 2010, 3:209–235.
53. Xie XL, Beni G. A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 1991, 13:841–847.
54. Pal NR, Bezdek JC. Correction to "On cluster validity for the fuzzy c-means model" [correspondence]. IEEE Trans Fuzzy Syst 1997, 5:152–153.
55. Bezdek J, Hathaway R. VAT: a tool for visual assessment of (cluster) tendency. In: Proceedings of International Joint Conference on Neural Networks 2002, 2225–2230.
56. Bezdek J, Hathaway R, Huband J. Visual assessment of clustering tendency for rectangular dissimilarity matrices. IEEE Trans Fuzzy Syst 2007, 15:890–903.
57. Sedgewick R, Flajolet P. An Introduction to the Analysis of Algorithms. Boston, MA: Addison-Wesley; 1995.
58. Kargupta H, Huang W, Sivakumar K, Johnson E. Distributed clustering using collective principal component analysis. Knowl Inf Syst 2001, 3:422–448.
59. Kriegel H-P, Krieger P, Pryakhin A, Schubert M. Effective and efficient distributed model-based clustering. IEEE Int Conf Data Min 2005, 0:258–265.
60. Olman V, Mao F, Wu H, Xu Y. Parallel clustering algorithm for large data sets with applications in bioinformatics. IEEE/ACM Trans Comput Biol Bioinf 2009, 6:344–352.
61. Bradley PS, Fayyad U, Reina C. Scaling clustering algorithms to large databases. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining 1998, 9–15.
62. Hore P, Hall LO, Goldgof DB. Single pass fuzzy c means. In: IEEE International Fuzzy Systems Conference, FUZZ IEEE 2007. IEEE; 2007, 1–7.
63. Hore P, Hall L, Goldgof D, Gu Y, Maudsley A, Darkazanli A. A scalable framework for segmenting magnetic resonance images. J Sig Process Syst 2009, 54:183–203.
64. Eschrich S, Ke J, Hall LO, Goldgof DB. Fast accurate fuzzy clustering through data reduction. IEEE Trans Fuzzy Syst 2003, 11:262–270.
65. Hore P, Hall LO, Goldgof DB, Gu Y. Scalable clustering code. Available at: http://www.csee.usf.edu/hall/scalable. (Accessed April 26, 2012).
66. D'Urso P. Fuzzy clustering for data time arrays with inlier and outlier time trajectories. IEEE Trans Fuzzy Syst 2005, 13:583–604.
67. Coppi R, D'Urso P. Fuzzy unsupervised classification of multivariate time trajectories with the Shannon entropy regularization. Comput Stat Data Anal 2006, 50:1452–1477.
68. Liao TW. Clustering of time series data - a survey. Pattern Recognit 2005, 38:1857–1874.
69. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96. New York: ACM; 1996, 103–114.
70. Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD International Conference on Management of Data; 1998, 73–74.
71. Aggarwal CC, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: Proceedings of the International Conference on Very Large Data Bases; 2003.
72. Gupta C, Grossman R. GenIc: a single pass generalized incremental algorithm for clustering. In: Proceedings of the Fourth SIAM International Conference on Data Mining (SDM 04), 2004, 22–24.
73. Dhillon IS, Mallela S, Kumar R. A divisive information theoretic feature clustering algorithm for text classification. J Mach Learn Res 2003, 3:1265–1287.
74. Linde Y, Buzo A, Gray R. An algorithm for vector quantizer design. IEEE Trans Commun 1980, 28:84–95.