
IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 26, NO. 6, DECEMBER 2018

Discriminative Fuzzy C-Means as a Large Margin Unsupervised Metric Learning Algorithm

Zahra Moslehi, Mahsa Taheri, Abdolreza Mirzaei, and Mehran Safayani

Manuscript received November 9, 2017; revised March 17, 2018; accepted May 3, 2018. Date of publication May 15, 2018; date of current version November 29, 2018. (Corresponding author: Abdolreza Mirzaei.) The authors are with the Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran (e-mail: z.moslehi@ec.iut.ac.ir; mahsa.taheri@ec.iut.ac.ir; mirzaei@cc.iut.ac.ir; safayani@cc.iut.ac.ir). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TFUZZ.2018.2836338.

Abstract—In this paper, a new unsupervised metric learning algorithm with a real-world application in clustering is proposed. To obtain a desirable clustering, the separability among different classes of data needs to be improved. A common way to accomplish this objective is to exploit the advantages of metric learning in clustering and vice versa. Clustering provides an estimation of class labels, and metric learning maximizes the separability among these different estimated classes of data. This procedure is usually performed in an iterative fashion, alternating between clustering and metric learning. Here, a new method is proposed, called discriminative fuzzy c-means (Dis-FCM), in which FCM and metric learning are integrated into one joint formulation. Unlike traditional approaches, which simply alternate between clustering and metric learning, Dis-FCM applies both simultaneously. Here, FCM provides an estimation of class labels. This can avoid the problem of fast convergence, which is common in previous algorithms. Moreover, Dis-FCM is able to handle not only numerical data but also categorical data, a capability not found in traditional methods. The experimental results indicate its superiority over other state-of-the-art algorithms in terms of extrinsic and intrinsic clustering measures.

Index Terms—Categorical data, discriminative learning, fuzzy c-means (FCM), large margin techniques, numerical data, unsupervised metric learning.

I. INTRODUCTION

MEASURING distances among pairs of data points is a necessary step in most algorithms in machine learning, pattern recognition, and data mining. In distance metric learning, the objective is to find a good distance measure that best describes the relation of the input data. A considerable number of supervised, weakly supervised, and unsupervised metric learning algorithms exist [1], [2]. Supervised algorithms require a collection of labeled training data, while weakly supervised metric learning algorithms only consider the similarity and dissimilarity of the input data points. In these two types of algorithms, the objective is to find the best representation of data, through which the class separation is maximized. Maximizing separability among different classes, named the large margin property, can be achieved by pulling similar data points together and pushing dissimilar ones apart. In unsupervised metric learning algorithms, unlike the other two categories, there is no need for any additional information. Due to the lack of label information, this category has more challenges than the previous two [3].

In this paper, we focus on unsupervised metric learning with a real-world application in clustering. In unsupervised clustering, the objective is to find a set of clusters, where each cluster only contains data of the same class, while compactness and separability over all of them are well defined. Without the label information, finding such well-defined clusters is impossible. However, one way to achieve better results is to first apply a dimensionality reduction algorithm like PCA or ISOMAP and then cluster the data in the low-dimensional space. By applying a dimensionality reduction algorithm, the intrinsic structure of the data points is captured. Thus, a better clustering result can more probably be achieved in the low-dimensional embedded space. Another way is to first perform a clustering method in the input space to provide an estimation of the class labels. These labels are then fed into a supervised or weakly supervised metric learning algorithm to learn a better metric, through which the estimated class separation is maximized. These two steps are repeated several times in an alternating fashion. In this approach, the first clustering does not benefit from the data's underlying lower dimensional manifold and can mislead the other steps. A better way is to perform both clustering and metric learning together. This is achieved by proposing a joint formulation, in which the cluster indicator and the metric learning parameters are optimized simultaneously.

Applying a combination of linear discriminant analysis (LDA), as one of the supervised metric learning algorithms, with k-means as a clustering method is assessed by Ding et al. [4], [5]. In this method, the new space for clustering is provided by LDA and the label information for LDA is provided by k-means. It should be noted that no joint framework is observed in [4], [5]; the method simply alternates between finding the transformation matrix and the cluster indicator. Following these works, Torre and Kanade proposed discriminative cluster analysis (DCA), where a low-dimensional projection and the cluster indicator are found in a joint framework [6]. The discriminative k-means proposed by Ye et al. is an extension of DCA with a simpler formulation [7]. The nonlinear version of this method is formulated as a semidefinite program (SDP) in [7]. Ye et al. introduced adaptive metric learning (AML), which runs distance metric learning and k-means clustering in a simultaneous manner in the form of a trace maximization problem [3].


In this method, in order to provide more separability among different clusters, an iterative procedure is applied. This procedure obtains the best transformation and cluster indicator matrices iteratively. A nonlinear version of this method, named nonlinear AML (NAML), is presented by Chen et al. [8]. Following these works, a method called optimized kernel k-means clustering (OKKC) is proposed by Yu et al. with a simpler procedure and lower computational complexity than NAML [9]. Unsupervised neighborhood component analysis [10] is another algorithm, which is a combination of NCA and k-means.

The first drawback of these methods is that in some of them, no joint formulation is observed and they simply alternate between clustering and metric learning. As mentioned before, in the first step the data points are clustered in the input space, so the intrinsic structure of the data points is ignored and the other steps of the algorithm are misled. The second drawback is that, in all of them, k-means is applied as the clustering algorithm, while other clustering methods exist that can outperform k-means [11]–[13]. Moreover, the use of k-means causes the problem of rapid convergence. Thus, some heuristic methods need to be applied to avoid this problem [14].

In this paper, fuzzy c-means (FCM) is applied to obtain label information. Then, the proposed metric learning algorithm uses this information to estimate the new space in which the distance among the different estimated classes is maximized. Here, both the metric learning and the FCM clustering are integrated into a joint formulation, which is named discriminative FCM (Dis-FCM). Thus, both clustering and metric learning adapt themselves simultaneously. Moreover, in FCM, each data point belongs to all clusters with different degrees of membership. As a result, the use of FCM allows achieving higher performance in terms of clustering results, especially when the clusters are not well separated and they overlap [13]. The use of FCM also avoids the problem of rapid convergence. Another benefit of this proposed formulation is that it is able to handle numerical and categorical data differently, something not yet considered in the available literature. The experimental results on different datasets indicate that Dis-FCM is a significantly improved version of its kind. Briefly, the main contributions of this paper are as follows:
1) A new framework is provided in which both clustering and metric learning are integrated in a joint formulation.
2) Applying FCM allows achieving higher performance in terms of clustering results and avoids the problem of rapid convergence.
3) Both categorical and numerical data are handled differently through this algorithm, which is not found in the available literature.
4) It is revealed that the results of this algorithm are not influenced by the value of the parameters, while parameter tuning is one of the most important issues in the existing algorithms.

The remainder of this paper is organized as follows. Some notations and background are presented in Section II. The formulation is introduced in Section III. The computational complexity of the proposed model is discussed in Section IV. The experimental results on different datasets are presented in Section V, and the paper is concluded in Section VI.

II. BACKGROUND AND NOTATIONS

In order to formulate the objective and propose the new algorithm, it is necessary to briefly review the following concepts.

A. Weakly Supervised Metric Learning

Assume that X = {x_i | x_i ∈ R^p}_{i=1}^{n} is the input data, where x_i ∈ R^p is the ith data point with p as the number of attributes, n is the number of data points, and S and D represent the similar and dissimilar sets, respectively:

    S = {(x_i, x_j) | x_i and x_j belong to the same class}            (1)
    D = {(x_i, x_j) | x_i and x_j belong to different classes}.        (2)

Most of the weakly supervised metric learning algorithms are based on these sets. Large separability, as the main objective of these algorithms, is defined as follows: provided that x_i and x_j belong to S, they should be close together, and provided that they belong to D, they should be far apart. In the linear approach, this is achieved by learning a linear transformation and projecting the data into the new space: L : x_i ← L x_i. The squared Euclidean distance in this projected space is calculated as follows:

    d_M(x_i, x_j) = ||L(x_i − x_j)||_2^2
                  = (x_i − x_j)^T L^T L (x_i − x_j)
                  = (x_i − x_j)^T M (x_i − x_j)                          (3)

where d_M(x_i, x_j) is the Mahalanobis distance, in which M is a positive semidefinite matrix [1]. This equation indicates that computing the Euclidean distance after performing a linear transformation is equivalent to computing the Mahalanobis distance in the input space. This reveals the fact that metric learning and data projection can be considered as a unified concept.

B. Fuzzy C-Means

FCM is a method of fuzzy clustering, where each data point belongs to more than one cluster [11]. Assume X is the input data, C = {c_l | c_l ∈ R^p}_{l=1}^{k} is the set of cluster centers with k as the number of clusters, and Q = {q_il | q_il ∈ R} is the fuzzy membership matrix, where each q_il is the degree of membership of x_i in cluster l. The objective function of FCM is defined as follows:

    minimize_{Q,C}  Σ_{i=1}^{n} Σ_{l=1}^{k} q_il^u ||x_i − c_l||^2   subject to:
    1) 0 < Σ_{i=1}^{n} q_il < n   ∀l
    2) Σ_{l=1}^{k} q_il = 1   ∀i                                         (4)

where u is the level of fuzziness. This is a nonconvex optimization problem with a biconvex objective function, which can be solved efficiently through an alternating optimization scheme. It is convex with respect to Q when C is fixed and with respect to C when Q is fixed. Consequently, we initially consider C to be fixed and optimize the problem over Q, and then repeat this procedure for C. This process continues until convergence is achieved [11]. The update formulas are:

    q_il' = 1 / Σ_{l=1}^{k} ( ||x_i − c_l'||^2 / ||x_i − c_l||^2 )^{1/(u−1)}      (5)

    c_l = Σ_{i=1}^{n} q_il^u x_i / Σ_{i=1}^{n} q_il^u.                            (6)

The time complexity of FCM is O(Γ n p k^2), where Γ is the number of iterations [15].
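To make the alternating updates concrete, a minimal NumPy sketch of classical FCM, i.e., the updates (5) and (6) under the constraints of (4), is given below. It is an illustrative re-implementation with our own function and variable names, not the authors' code.

```python
import numpy as np

def fcm(X, k, u=2.0, n_iter=100, tol=1e-6, seed=0):
    """Classical fuzzy c-means: alternate the membership update (5) and the center update (6).

    X : (n, p) data matrix, k : number of clusters, u : level of fuzziness (> 1).
    Returns the membership matrix Q of shape (n, k) and the cluster centers C of shape (k, p).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Q = rng.random((n, k))
    Q /= Q.sum(axis=1, keepdims=True)            # constraint 2) of (4): each row sums to 1

    for _ in range(n_iter):
        Qu = Q ** u
        C = (Qu.T @ X) / Qu.sum(axis=0)[:, None]                   # update (6)

        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)   # ||x_i - c_l||^2, shape (n, k)
        d2 = np.maximum(d2, 1e-12)                                  # guard against division by zero
        Q_new = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (u - 1))).sum(axis=2)  # update (5)

        if np.abs(Q_new - Q).max() < tol:
            Q = Q_new
            break
        Q = Q_new
    return Q, C
```

For a hard assignment, the cluster of x_i is simply the index l that maximizes q_il.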

C. Semidefinite Programming (SDP)

An SDP is a kind of convex optimization problem, expressed as follows:

    minimize_Y  trace(GY)   subject to:
    1) trace(H_i Y) = h_i,   i = 1, .., T                                  (7)
    2) Y ⪰ 0

where Y ∈ S^{N×N} is the variable, S^{N×N} denotes the set of symmetric N × N matrices, and H_1, ..., H_T ∈ S^{N×N}, G ∈ S^{N×N}, and h_1, ..., h_T ∈ R are all given parameters. In this optimization problem, T is the number of equality constraints, and Y ⪰ 0 means that Y is a positive semidefinite matrix [16].

Expression (7) is the standard form of a semidefinite program. More general semidefinite programs with linear inequality constraints next to the linear equality constraints also exist. In this case, to reach the standard form, the usual approach is to add a nonnegative slack variable to each inequality constraint and convert it into an equality constraint. Here, the positive semidefinite matrix Y is extended to the block diagonal matrix

    Y' = diag(Y, y_1, ..., y_{T'})

where T' is the number of inequality constraints and each nonnegative slack variable y_1, ..., y_{T'} corresponds to one of them. It is clear that Y' is a positive semidefinite matrix if and only if Y ⪰ 0 and y_1, ..., y_{T'} ≥ 0 [16].

D. Categorical Distance Measures

In this section, all the numerical and categorical distance measures that are used in the rest of this paper are introduced. Assume that the vector

    d^pp(x_i, x_j) = [d^pp(x_{i,1}, x_{j,1}), ..., d^pp(x_{i,r}, x_{j,r}), d^pp(x_{i,r+1}, x_{j,r+1}), ..., d^pp(x_{i,r+s}, x_{j,r+s})]^T

denotes the point-to-point distance vector between two data points x_i and x_j, where r and s are the numbers of categorical and numerical attributes, respectively.

For any categorical attribute f ∈ {1, .., r}, the overlap distance d^pp_ORLP and the ESK distance d^pp_ESK are defined as follows:

    d^pp_ORLP(x_{i,f}, x_{j,f}) = 0 if x_{i,f} = x_{j,f};  1 otherwise                              (8)

    d^pp_ESK(x_{i,f}, x_{j,f}) = 0 if x_{i,f} = x_{j,f};  1 − m_f^2 / (m_f^2 + 2) otherwise          (9)

where m_f is the number of different values taken by the fth attribute [17]. These two measures are chosen because they are recommended among the many existing categorical measures [18]. The distance for any numerical attribute f ∈ {r+1, .., r+s} is computed as follows:

    d^pp(x_{i,f}, x_{j,f}) = x_{i,f} − x_{j,f}.                                                      (10)

In a similar sense, the associated point-to-cluster distance vector between each x_i ∈ X and c_l ∈ C is expressed as

    d^pc(x_i, c_l) = [d^pc(x_{i,1}, c_{l,1}), ..., d^pc(x_{i,r}, c_{l,r}), d^pc(x_{i,r+1}, c_{l,r+1}), ..., d^pc(x_{i,r+s}, c_{l,r+s})]^T.

For any f ∈ {1, .., r}, next to the overlap and ESK distances, another measure is computed as follows:

    d^pc_Cheung(x_{i,f}, c_{l,f}) = 1 − δ_{A_f = x_{i,f}}(c_{l,f}) / δ_{A_f ≠ null}(c_{l,f})          (11)

where δ_{A_f = x_{i,f}}(c_{l,f}) counts the number of data points in cluster l in which the attribute A_f has the value x_{i,f}, and δ_{A_f ≠ null}(c_{l,f}) has the same meaning with the difference that this time the attribute A_f only has to be non-null [19]. In (11), when there is no crisp value for the membership of each data point to cluster l, the corresponding fuzzy membership values are aggregated. This measure can be applied only as a point-to-cluster distance and cannot be applied as a point-to-point distance. Accordingly, the two different vectors d^pc and d^pp have been defined separately. The distance measure for any numerical attribute is computed as follows:

    d^pc(x_{i,f}, c_{l,f}) = x_{i,f} − c_{l,f}   ∀ f ∈ {r + 1, .., r + s}.                           (12)

All the notations in this paper are tabulated in Table I; some of them have already been defined and the others will be defined as the study continues.

TABLE I
NOTATIONS
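The categorical measures above can be summarized in a few lines of Python; the sketch below is illustrative only, assumes categorical values are stored as strings or integers (missing values as None), and uses the fuzzy memberships Q as the weighted counts δ(·) in (11).

```python
import numpy as np

def d_overlap(a, b):
    """Overlap distance (8): 0 if the two categories match, 1 otherwise."""
    return 0.0 if a == b else 1.0

def d_esk(a, b, m_f):
    """ESK distance (9); m_f is the number of distinct values of attribute f."""
    return 0.0 if a == b else 1.0 - m_f ** 2 / (m_f ** 2 + 2.0)

def d_cheung(x_if, X_f, Q, l):
    """Cheung point-to-cluster distance (11) for a categorical attribute f.

    X_f : the f-th categorical column over all n points, Q : (n, k) memberships, l : cluster index.
    The memberships of cluster l play the role of the (fuzzy) counts delta(.) in (11).
    """
    X_f = np.asarray(X_f, dtype=object)
    match = np.array([v == x_if for v in X_f])        # points whose attribute f equals x_if
    not_null = np.array([v is not None for v in X_f]) # points whose attribute f is not null
    return 1.0 - Q[match, l].sum() / Q[not_null, l].sum()
```

For example, d_esk('red', 'blue', m_f=3) returns 1 − 9/11 ≈ 0.18, so a mismatch on an attribute with many categories is penalized less than one on a near-binary attribute.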

III. ALGORITHM

In order to propose the new algorithm, it is necessary to describe the problem formulation and the computation of its variables.

A. Problem Formulation

Recall that X = {x_i | x_i ∈ R^p}_{i=1}^{n} is the set of input data and L is the linear transformation (i.e., L : x_i ← L x_i). The focus here is on linear metric learning (i.e., on Mahalanobis distance learning approaches). This algorithm is based on weakly supervised metric learning, with the difference that here there exists no side information in the form of a similar set S and a dissimilar set D. Due to the lack of this side information, the clustering algorithm FCM is performed to obtain an estimate of these sets. Pairs of data points within the same cluster belong to S, while those that belong to two distinct clusters would be in D.

The clustering method provides the estimated label information for all data points. When this information is available, a weakly supervised metric learning method can be adopted to find a new data representation such that the separability among the different estimated classes of data is maximized.

In this algorithm, instead of applying crisp similarity and dissimilarity sets, the classical FCM provides us with fuzzy versions of them. Recall that Q = {q_il | q_il ∈ R} is the fuzzy membership matrix and C = {c_l | c_l ∈ R^p}_{l=1}^{k} is the set of cluster centers. The notation d̃_ij ∈ R indicates the dissimilarity rate of data points i and j through their fuzzy membership values as follows:

    d̃_ij = 1 − q_i q_j^T / (||q_i||_2 × ||q_j||_2).                        (13)

Here, q_i and q_j represent the ith and jth rows of matrix Q, respectively. A quantitative estimate of the dissimilarity between the ith and jth data points is obtained based on whether their membership degrees in all clusters are different or similar. This means that if these data points strongly belong to different clusters, their membership values to all clusters (i.e., q_i and q_j) would be different. In this case, the pairs of associated entries of q_i and q_j (i.e., q_il and q_jl) are not only different from each other but, in addition, at least one of them is close to zero. Thus, their dissimilarity measure d̃_ij would be close to one.

Now, the joint clustering and metric learning can be formulated as an optimization problem such that the estimated class separation is maximized.

One criterion for this desired metric is to have a small distance between all similar data points. All data within almost the same cluster (i.e., x_i and x_j within cluster l) are considered as a pair in S, which are pushed toward each other with an orientation towards the cluster center c_l. Here, because S is a fuzzy similar set, this constraint should be satisfied in proportion to the membership value of x_i in cluster l (i.e., q_il). To accomplish this objective, it is sufficient to replace the Euclidean distance in the FCM objective function (4) with the Mahalanobis distance (3). The first criterion is formulated as a loss function expressed as:

    minimize_{Q,C,M}  Σ_{i=1}^{n} Σ_{l=1}^{k} q_il^u d^pc(x_i, c_l)^T M d^pc(x_i, c_l).      (14)

The second criterion is to assure that all dissimilar data points in D are distanced from each other. This assurance is made by adding the following constraint to our formulation:

    d^pp(x_i, x_j)^T M d^pp(x_i, x_j) ≥ ε   ∀ x_i, x_j ∈ D                  (15)

where ε is a constant greater than zero. Because here D is a fuzzy dissimilarity relation, this newly added constraint must be satisfied in proportion to the rate d̃_ij. The slack variable ζ_ij ≥ 0 associated with this constraint is introduced to measure the amount of its violation. Hence, our second criterion is defined as the sum of all ζ_ij, where each one of them is multiplied by d̃_ij. The problem is thus formulated as follows:

    minimize_{Q,C,M,ζ}  (1−α) Σ_{i=1}^{n} Σ_{l=1}^{k} q_il^u d^pc(x_i, c_l)^T M d^pc(x_i, c_l) + α Σ_{i=1}^{n} Σ_{j=1}^{n} d̃_ij ζ_ij   subject to:
    1) d^pp(x_i, x_j)^T M d^pp(x_i, x_j) ≥ ε − ζ_ij,   ∀ i, j ∈ {1, . . . , n}
    2) ζ_ij ≥ 0
    3) M ⪰ 0
    4) 0 < Σ_{i=1}^{n} q_il < n   ∀ l ∈ {1, . . . , k}
    5) Σ_{l=1}^{k} q_il = 1   ∀ i ∈ {1, . . . , n}                           (16)

where α is a tradeoff parameter between the two terms, M is a positive semidefinite matrix (M ⪰ 0), and ε is a constant value greater than zero. In this optimization problem, if d̃_ij = 0, the corresponding term is 0; thus, the value of ζ_ij is not of major importance. On the other hand, if d̃_ij = 1, the value of ζ_ij specifies the contribution of the given pair of points to the second term. Constraints 4) and 5) are added to this formulation because they are needed in the FCM formulation (4).
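A vectorized sketch of the fuzzy dissimilarity rate (13) may help; it assumes Q is the (n, k) membership matrix, and the names are ours.

```python
import numpy as np

def fuzzy_dissimilarity(Q, eps=1e-12):
    """d_tilde[i, j] = 1 - <q_i, q_j> / (||q_i||_2 ||q_j||_2) for all pairs, as in (13)."""
    norms = np.linalg.norm(Q, axis=1) + eps       # row norms ||q_i||_2
    cosine = (Q @ Q.T) / np.outer(norms, norms)   # cosine similarity between membership rows
    return 1.0 - cosine
```

Rows that concentrate their membership on the same cluster give d̃_ij close to 0, while rows that concentrate on different clusters give d̃_ij close to 1, matching the discussion above.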

This proposed method is not like its counterparts, where k-means is considered as a crisp method to provide label information. Such k-means based methods have the problem of rapid convergence [14]: they first learn the cluster indicator according to k-means and then adjust the transformation matrix to fit this cluster indicator in a rapid manner. In the second iteration, the new cluster indicator remains the same as before, and thus the method cannot learn a new transformation matrix in this iteration. To avoid this problem, the authors have to apply some heuristic methods [14]. In the newly proposed method, due to the application of FCM, the pairs of data points are not completely similar or dissimilar. Thus, the satisfaction degree of the two criteria in the objective function (16) is obtained according to the fuzzy membership values q_il and dissimilarity measures d̃_ij, which avoids the rapid convergence.

The proposed optimization formula in (16) is not convex, making it difficult to find a global solution. However, it becomes convex in each variable if the other variables are fixed, and it can be solved efficiently by an alternating optimization.

B. Fixing M, ζ, Q, and Updating C

When all parameters except C are fixed, the second term of objective function (16) becomes constant and there is no constraint on the parameter C. In the case of numerical attributes, updating each one of the cluster centers is similar to that of the original FCM and can be calculated through (6).

One point of major importance is how to update the cluster centers c_l for the categorical attributes. For this purpose, the following theorem is adopted.

Theorem 1 (The Fuzzy k-Modes Update Method [20]): Assume that X^c = {x^c_i}_{i=1}^{n} is a set of categorical objects described through categorical attributes A_1, A_2, ..., A_r and Domain(A_f) = {a_f^(1), a_f^(2), ..., a_f^(n_f)}, where n_f is the number of categories of attribute A_f for 1 ≤ f ≤ r. Assume that the cluster centers c_l are represented by [c_{l,1}, c_{l,2}, ..., c_{l,r}] for 1 ≤ l ≤ k. Then the value Σ_{i=1}^{n} Σ_{l=1}^{k} q_il^u d^pc(x_i, c_l)^T d^pc(x_i, c_l) is minimized iff c_{l,f} = a_f^(t) ∈ Domain(A_f), where

    Σ_{i=1, x_{i,f} = a_f^(t)}^{n} q_il^u  ≥  Σ_{i=1, x_{i,f} = a_f^(t')}^{n} q_il^u,   1 ≤ t' ≤ n_f        (17)

for 1 ≤ f ≤ r.

According to this theorem, for every categorical attribute, the category of each attribute of the cluster center c_l is given by the category that obtains the maximum of the summation of q_il^u over cluster l among all categories. The authors of this theorem assume the overlap measure as the categorical distance measure, but it is not hard to see that the theorem can be generalized to the other measures applied in this paper.

C. Fixing M, ζ, C, and Updating Q

When all the parameters are fixed except Q, the optimal Q for problem (16) can be obtained in closed form. In this case, the second term of objective function (16) becomes constant, and the optimal value of Q is obtained by taking the derivative of the first term subject to constraints 4) and 5) of optimization problem (16). The optimal value for each q_il' is calculated as follows:

    q_il' = 1 / Σ_{l=1}^{k} ( d^pc(x_i, c_l')^T M d^pc(x_i, c_l') / d^pc(x_i, c_l)^T M d^pc(x_i, c_l) )^{1/(u−1)}.       (18)

The proof of (18) is very similar to that of the classical FCM [11]. An essential point here is that the definition of d̃_ij in the second term of the objective function (16) depends on q_il', but this parameter is set from the previous iteration and its value is considered as a constant in this step.

D. Fixing Q, C, and Updating M, ζ

For a given cluster membership matrix Q and cluster centers matrix C, computing the optimal M and ζ amounts to solving the following optimization problem:

    minimize_{M,ζ}  (1−α) Σ_{i=1}^{n} Σ_{l=1}^{k} q_il^u d^pc(x_i, c_l)^T M d^pc(x_i, c_l) + α Σ_{i=1}^{n} Σ_{j=1}^{n} d̃_ij ζ_ij   subject to:
    1) d^pp(x_i, x_j)^T M d^pp(x_i, x_j) ≥ ε − ζ_ij,   ∀ i, j ∈ {1, . . . , n}
    2) ζ_ij ≥ 0
    3) M ⪰ 0.                                                               (19)

This optimization formula is an instance of a semidefinite program, which is a kind of convex optimization problem. By defining new slack variables ζ̃_ij, the linear inequality constraints can be converted into linear equality constraints. Thus, it can easily be reformulated in the form of a well-defined SDP. This SDP can be solved in polynomial time through existing publicly available packages.

E. Main Algorithm

Based on the above analyses, an iterative algorithm named Dis-FCM is proposed to solve the optimization problem (16). The pseudocode of this algorithm is given in Algorithm 1. In each step of this method, all the variables are updated according to their corresponding formulas. This procedure continues until the variables converge.

Algorithm 1 Dis-FCM
• Input: Mixed data X, α, ε.
• Output: Q, C, M, ζ.
• Initialize Q, C, ζ randomly such that for each q_il and ζ_ij, constraints 2), 4), and 5) in (16) are satisfied.
• Set M ← I. Compute d̃_ij, ∀i, j by (13).
• Repeat
  1. Update C by (6) while fixing M, ζ, Q, d̃_ij.
  2. Update Q by (18) while fixing M, ζ, C, d̃_ij, and then set d̃_ij, ∀i, j by (13).
  3. Update M, ζ by optimizing objective function (19) while fixing Q, C, d̃_ij.
  4. First, using the Cholesky decomposition of M, set L such that M = L^T L. Then, update d^pp(x_i, x_j) through d^pp(x_i, x_j) ← L d^pp(x_i, x_j) and d^pc(x_i, c_l) through d^pc(x_i, c_l) ← L d^pc(x_i, c_l), respectively.
  5. M ← I.
• Until convergence

As mentioned before, learning the Mahalanobis distance is equivalent to learning the linear transformation matrix and computing the Euclidean distance in the linearly transformed space. Accordingly, after updating M, the numerical data can be projected into the new space and the procedure can be continued in this space. Projecting categorical data through a linear transformation, however, generates new data in the real space, which is meaningless for categorical data. Consequently, instead of projecting each data point into the new space, the new point-to-point and point-to-cluster distance vectors are computed after learning the new linear transformation. In step 4 of Algorithm 1, the extra variable L is defined to calculate the new point-to-point and point-to-cluster distance vectors. After learning the new M in each iteration, these distance vectors are updated and the procedure continues on the basis of these new vectors.
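Step 4 of Algorithm 1 amounts to whitening the stored distance vectors with a Cholesky factor of M, so that d^T M d becomes a plain squared Euclidean norm in the following iteration. A minimal sketch, with our own naming, is:

```python
import numpy as np

def transform_distance_vectors(M, Dpp, Dpc, ridge=1e-10):
    """Step 4 of Algorithm 1: factor M = L^T L and replace every stored distance vector d by L d.

    Dpp : (n, n, p) point-to-point vectors, Dpc : (n, k, p) point-to-cluster vectors.
    A small ridge keeps the factorization well defined when M is only positive semidefinite.
    """
    p = M.shape[0]
    chol = np.linalg.cholesky(M + ridge * np.eye(p))   # lower-triangular factor, M = chol @ chol.T
    L = chol.T                                         # then M = L^T L, as required by Algorithm 1
    Dpp_new = Dpp @ L.T                                # applies d -> L d to every stored vector d
    Dpc_new = Dpc @ L.T
    return Dpp_new, Dpc_new
```

After this step, a plain inner product of the stored vectors reproduces the Mahalanobis value of the previous iteration, which is presumably why Algorithm 1 can reset M ← I in step 5.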

IV. COMPUTATIONAL COMPLEXITY

The time complexity of Dis-FCM is analyzed in two different settings: full metric M and diagonal M. In Dis-FCM, Q and C are updated in O(Γ n p k^2), similar to the original FCM. When Q and C are fixed, computing the variables M and ζ is equivalent to solving a well-defined SDP. Many efficient SDP solvers exist based on interior point methods (IPM). In this implementation, CVX, one of whose core solvers is SeDuMi, is applied [21]. SeDuMi is based on a primal-dual IPM, whose time complexity is O((T N'^3 + T^2 N'^2 + T^3) √N' log(1/E)), where N' is the number of variables, T is the number of equality constraints, and E is the accuracy rate [22]. According to the optimization formula (19), in the case that M is full, its size and the number of slack variables ζ_ij are p^2 and C(n, 2), respectively. The slack variables ζ̃_ij have the same size as ζ_ij. Hence, N' = p^2 + 2 C(n, 2) and T = C(n, 2), where C(n, 2) is the number of 2-combinations of n items. As to the diagonal M, it is easy to show that the SDP problem simply transforms into a linear programming problem, where N' = p + 2 C(n, 2) and T = C(n, 2). The Cholesky decomposition in step 4 of Algorithm 1 takes O(υ^3), where υ is the size of matrix M. Finally, assuming the number of iterations is Γ, the time complexity of Dis-FCM is O(Γ(n p k^2 + (T N'^3 + T^2 N'^2 + T^3) √N' log(1/E) + υ^3)).

Recall that the time complexity of the original FCM is O(Γ n p k^2), which is lower than that of Dis-FCM. Since the objective function of the proposed algorithm is more complicated than that of the original FCM, its computational complexity increases accordingly.

V. EXPERIMENTS

The effectiveness of Dis-FCM is assessed empirically through different experiments. Each experiment is repeated ten times and the averaged results are tabulated in Table III through Table VII.

A. Experimental Setting

The experiments are run on nine datasets of the UCI repository [23]. There are four numerical, one categorical, and four mixed datasets composed of numerical and categorical attributes. Details of each dataset, consisting of the number of data points, attributes, and classes, are tabulated in Table II. To improve the computational speed of Dis-FCM on the Abalone and Poker data, 200 and 300 of these data points are chosen uniformly at random, respectively.

TABLE II
DETAILS OF THE DATASETS APPLIED IN THE EXPERIMENTS. THE SYMBOLS r AND s DENOTE THE NUMBER OF CATEGORICAL AND NUMERICAL ATTRIBUTES, RESPECTIVELY

The results of each experiment can be assessed by comparing the predicted label of each data point with its true label. For this purpose, the "clustering accuracy" and "normalized mutual information (NMI)" are applied as the two criteria for comparing different methods [10].

In each cluster, the most frequent class label is assigned to all of its data points. Then, the accuracy of this assignment is measured by counting the total number of correctly assigned data points and dividing it by the number of all data points:

    Accuracy = ( Σ_{i=1}^{n} δ(y_i, map(x_i)) ) / n × 100                      (20)

where n is the number of data points, y_i is the correct label, map is a function which assigns a label to x_i according to the majority label of its cluster, and the delta function δ(s, t) = 1 when s = t and 0 otherwise.

NMI is computed as follows:

    NMI(Y, I) = 200 × [ Σ_{y_i ∈ Y, clust_j ∈ I} p(y_i, clust_j) · log( p(y_i, clust_j) / (p(y_i) · p(clust_j)) ) ] / ( H(Y) + H(I) )      (21)

where Y and I are the sets of true labels and cluster indicators; p(y_i), p(clust_j), and p(y_i, clust_j) are the probabilities that a randomly selected data point belongs to class y_i, to cluster clust_j, and to the intersection of y_i and clust_j, respectively; and H(Y) and H(I) are the entropies of Y and I, respectively [10].

Both clustering accuracy and NMI are extrinsic measures and assume knowledge of the ground truth. Without exploiting this knowledge, to assess whether a given fuzzy partition fits the data, two cluster validity measures, the "Partition Index (SC)" and the "Fukuyama–Sugeno Index (FS)", are also applied.

SC is the ratio of the sum of compactness and separation of the clusters [24] and is computed as follows:

    SC = Σ_{l=1}^{k} [ Σ_{i=1}^{n} q_il^u d^pc(x_i, c_l)^T M d^pc(x_i, c_l) ] / [ n_l Σ_{l'=1}^{k} d^pp(c_l, c_{l'})^T M d^pp(c_l, c_{l'}) ].      (22)

FS is aimed at the identification of "compact and well separated clusters" and is computed as follows [25]:

    FS = Σ_{l=1}^{k} Σ_{i=1}^{n} q_il^u × [ d^pc(x_i, c_l)^T M d^pc(x_i, c_l) − d^pp(c_l, c̄)^T M d^pp(c_l, c̄) ].      (23)
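A short sketch of the clustering accuracy (20): each cluster is mapped to its majority true label and the fraction of correctly mapped points is reported. The code is illustrative, with our own names; for a fuzzy output Q, the crisp cluster of x_i is taken as argmax_l q_il.

```python
import numpy as np

def clustering_accuracy(y_true, cluster_ids):
    """Accuracy (20): map every cluster to its most frequent true label,
    then report the percentage of points whose mapped label matches y_true."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    correct = 0
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        labels, counts = np.unique(y_true[members], return_counts=True)
        majority = labels[np.argmax(counts)]              # the map(.) function in (20)
        correct += int(np.sum(y_true[members] == majority))
    return 100.0 * correct / len(y_true)
```

Up to the factor of 100, (21) should correspond to the usual NMI with arithmetic normalization; scikit-learn's normalized_mutual_info_score can serve as a reference implementation.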

In all experiments, the number of clusters is set equal to the number of classes, which is assumed to be known. In this method, two kinds of stopping criteria are of concern:
1) reaching a predefined number of iterations; or
2) the values of the loss function do not decrease or decrease very slowly.

B. Toy Example

We use two demonstrative examples to show the effectiveness of Dis-FCM in combining weakly supervised metric learning and FCM clustering.

For the first example, PCA is run on the Iris dataset to reduce the dimensionality to 2. The resulting data, shown in Fig. 1(a), are the original data. The embedded data after applying the ISOMAP algorithm are shown in Fig. 1(b). Fig. 1(c) and (d) shows the output of the proposed method with diagonal and full metric matrices, respectively. The following observations can be made from this figure. In the case of diagonal M, the proposed method cannot significantly change the space of the data. In the case of full M, it is intriguing to see that the full metric M can project the data onto a line while the separability of different class data is well preserved. By applying the proposed d̃_ij of (13), different class data remain at a reasonable distance from each other while same class data are drawn together in the new space.

Fig. 1. (a) Original dataset. The same symbols define the same class data. (b) Rescaling data with ISOMAP. (c) and (d) Rescaling data with Dis-FCM in the case of diagonal and full metric.

The second example indicates the effectiveness of this method on a dataset generated from two different two-dimensional normal distributions. These two distributions have mean vectors [1, 1] and [1, 10] and covariance matrices [0.1 0; 0 10] and [10 0; 0 0.1]. This dataset contains 80 data points that belong to two different classes. The result of this method in the case of diagonal M is compared with that of PCA and ISOMAP in Fig. 2. If the scale of the data points on the X dimension (10^-5) is of concern, it is observed that the diagonal metric M can correctly ignore the X dimension. In this space, the separation of different class data is still preserved well. Finding such a projected space, in which different class data are well separated, is the objective of the newly proposed Dis-FCM.

Fig. 2. (a) Original dataset. Same symbols define the same class data. (b) Rescaling data with PCA. (c) Rescaling data with ISOMAP. (d) Rescaling data with Dis-FCM in the case of diagonal metric.

C. Clustering Quality Comparison

First, the accuracy and NMI results of different well-known clustering algorithms are provided in Table III to emphasize how Dis-FCM is able to improve the clustering results. In this table, single-linkage, complete-linkage, and average-linkage are hierarchical clustering algorithms, while GMM and DBSCAN belong to EM-based and density-based clustering methods, respectively [26]. Then, the accuracy and NMI results of Dis-FCM and more closely related algorithms are tabulated in Table IV. Dis-FCM is tested against standard FCM, PCA-FCM, and ISOMAP-FCM; in PCA-FCM and ISOMAP-FCM, PCA and ISOMAP are performed, respectively, before applying the standard FCM while preserving 95% of the data variance. There are also two Mahalanobis-based FCM algorithms, called GK [27] and FCM-σ [28], and four recently introduced algorithms, called AML [3], NAML [8], OKKC [9], and UNCA [10]. The results of AML, NAML, OKKC, and UNCA are the best averaged results obtained by tuning their parameters among the values considered in their corresponding articles. The best averaged results of AML and UNCA are reported by tuning λ among [0.001, 0.01, 0.1, 1, 10, 100, 1000], and the best averaged results of NAML are reported by tuning λ among [10^-2, 10^-4, 10^-6].

TABLE III
ACC/NMI COMPARISON OF DIFFERENT METHODS ON 9 UCI DATASETS. ACCS/NMIS IN BOLD FONT ARE THE HIGHEST AVERAGED VALUE ON EACH ROW

TABLE IV
ACC/NMI COMPARISON OF DIFFERENT METHODS ON 9 UCI DATASETS. ACCS/NMIS IN BOLD FONT ARE THE HIGHEST AVERAGED VALUE ON EACH ROW

All of the compared methods treat the categorical attributes similarly to the numerical attributes: they assign a numerical value to each category and then apply their methods to these numerical values. In the newly proposed method, a categorical distance measure different from the numerical measure is used. The Dis-FCM results are reported in the two different settings of diagonal and full metric M. In the case of diagonal M, the accuracy and NMI associated with the minimum loss function among all ten trials are also presented in Table IV. As the experimental results described later suggest, the parameters of Dis-FCM are set to α = 0.4 and ε = 5. For categorical attributes, the ESK measure is used as the point-to-cluster distance and ORLP as the point-to-point distance.

By comparing the results in Tables III and IV with one another, the following can be observed.
1) Dis-FCM outperforms the other clustering algorithms on most of the datasets. Of the nine datasets, Dis-FCM with diagonal M has the highest accuracy in seven and the highest NMI in four. In most cases, the results associated with the minimum loss function are much better than the averaged results. This is because a significant correlation exists between the objective function and the clustering performance. Thus, to obtain high accuracy and NMI, this algorithm can be repeated several times and the results associated with the minimum loss function reported.
2) The method with full metric M outperforms most of its counterparts, but it does not outperform its own diagonal metric. The reason for the better result with diagonal M is that in this case the columns of the transformation matrix L are orthogonal to each other; thus, the features in the new space are uncorrelated. This is not the case when M is not diagonal. In addition, in terms of computational speed, the full metric M is much slower than the diagonal metric M. Hence, in the rest of this paper, the focus is on the proposed method Dis-FCM with the diagonal metric M.

So far, both accuracy and NMI assume knowledge of the ground truth. The results of SC and FS, as two intrinsic measures, are tabulated in Table V. A lower value of these two measures indicates a better clustering. In this experiment, the results of Dis-FCM are compared with those of the other fuzzy clustering algorithms. Of the nine datasets, Dis-FCM has the lowest SC in five and the lowest FS in six.

TABLE V
SC/FS COMPARISON OF DIFFERENT METHODS ON NINE UCI DATASETS. SCS/FSS IN BOLD FONT ARE THE BEST AVERAGED VALUE ON EACH DATASET

D. Sensitivity Study

In Dis-FCM, the parameters that need tuning are α and ε. For the categorical datasets, the best categorical measure also has to be selected. The behavior of Dis-FCM with different parameters and categorical measures is assessed as follows.

1) Clustering With Different Categorical Distance Measures: Here, Dis-FCM is assessed by applying different types of categorical distance measures. For this purpose, five categorical and mixed datasets are assessed with different distance measures. The results are reported in Table VI. The point-to-cluster distance measure is selected among the ESK, ORLP, and Cheung measures, and the point-to-point distance is selected between the ESK and ORLP measures. It can be observed that the pairs (d^pc_ESK, d^pp_ESK) and (d^pc_ESK, d^pp_ORLP) perform similarly on average and their results are often better than those of the other distance measures of this study. Hence, for comparison purposes, ESK and ORLP are selected as the point-to-cluster and the point-to-point categorical distance measures, respectively.

TABLE VI
IMPACT OF DIFFERENT CATEGORICAL DISTANCES ON THE ACCURACY OF FIVE DATASETS. THE RESULTS ARE OBTAINED WITH α = 0.4, ε = 5. VALUES IN BOLD FONT ARE THE HIGHEST AVERAGED VALUE ON EACH ROW

2) Clustering With Different Parameters: To investigate the impact of α and ε, we run experiments with different values of these two parameters. The results are presented in Table VII. As can be observed, for a fixed α (α = 0.4), different values of ε do not have a significant impact on the results. Moreover, as observed in this table, the performance of the algorithm may be affected by the α parameter. If this parameter is set to an appropriate value such as 0.4, the best results are obtained. Hence, for comparison purposes, α is set to 0.4.

TABLE VII
IMPACT OF DIFFERENT PARAMETER VALUES OF α AND ε AND ALSO INITIALIZATION POINTS ON THE ACCURACY OF NINE UCI DATASETS

3) Clustering With Different Initialization Points: The problem formulation (16) is not convex; therefore, this nonconvex problem could be sensitive to the initialization points.

The impact of initializing the variables on the behavior of Dis-FCM is assessed here. As observed in the Dis-FCM algorithm, M is initialized to I, and Q and C are initialized in a random manner. To obtain a better initialization, FCM clustering can first be applied to learn Q and C, and Dis-FCM then begins its operation. The accuracy results of Dis-FCM when Q and C are initialized to the FCM solution, compared with their random settings, are tabulated in Table VII, columns 3 and 8. As observed in this table, on average, initializing these variables with FCM does not have any positive effect on the behavior of Dis-FCM. When Q and C are initialized to the FCM solution, the best values of these variables are obtained from the structure of the input data. As discussed before, this can mislead the other steps of the algorithm, due to ignoring the latent space of the input data in the first step. Thus, it is deduced that a better clustering result is achieved by optimizing both the clustering and metric learning parameters simultaneously through a joint formulation.

4) Clustering With Different Cluster Numbers: To assess the effect of the cluster number on the clustering results, the accuracy results of Dis-FCM and the other existing algorithms as functions of the cluster number on the BreastTissue (BT) and Wine datasets are shown in Fig. 3. In this figure, the number of clusters ranges from the number of classes up to four times the number of classes. The results indicate that Dis-FCM outperforms the other existing algorithms for different cluster numbers.

Fig. 3. Variation of clustering accuracy with the number of clusters for: (a) BT and (b) Wine datasets.

E. Convergence Behavior

To gain some further insight into the proposed method, its convergence behavior is considered. The values of the objective function, accuracy, and NMI as functions of the iteration are shown in Fig. 4. This figure corresponds to the experiments with the minimum loss function value among all of the ten trials. Each one of the iterations in this figure corresponds to updating the Q, C, M, and ζ variables. By analyzing this figure, the following is observed: in all cases, there exists a general tendency for the values of the Dis-FCM objective function to decrease, while the values of accuracy and NMI increase. In very few cases, the objective function values increase, due to the update of Q proposed in Section III-C. As discussed before, for updating Q, the second term of (16) is ignored even though this term depends on Q. After updating Q, the loss function therefore does not necessarily decrease, and a few increasing points are observed along the curve.

Fig. 4. Clustering accuracy, NMI, and the value of the objective function in terms of iterations. (a) Iris. (b) Wine. (c) Lenses. (d) Abalone datasets.

F. Statistical Tests

The Friedman test is run to assess whether there exists a statistically significant difference between the results of this method and the other existing algorithms.

The Friedman test is a nonparametric statistical test for comparing multiple group effects in a two-way layout. This test is applied when the measurements are subject to the following assumptions: first, each group is measured on different occasions; second, there exists an ordinal scale on a specific group; and third, the samples do not need to be normally distributed [29].

For this test, this method (with diagonal M, α = 0.4, and ε = 5) is applied together with the other algorithms considered in Tables III and IV. Each group shows the averaged accuracy results of a specific algorithm on all datasets.

The null hypothesis is rejected when at least one group-sample median is significantly different from the others. In this experiment, a p-value of at most 0.05 indicates the rejection of the null hypothesis. The Friedman p-value for this test is 7.8419e-08, which indicates that the group-sample median of this method is significantly different from the others. Thus, there exists a statistically significant difference between this method and its counterparts.
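For reference, the Friedman test above can be reproduced with SciPy; the sketch below uses made-up accuracy numbers purely to show the call pattern.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# rows: datasets (blocks), columns: competing algorithms; the numbers are made up
acc = np.array([
    [72.1, 68.4, 75.9, 81.2],
    [55.3, 57.0, 56.8, 60.1],
    [90.2, 88.7, 91.0, 93.4],
    [63.5, 61.2, 64.8, 66.0],
])

# friedmanchisquare expects one sample per algorithm, measured over the same datasets
stat, p_value = friedmanchisquare(*acc.T)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.3g}")
# p <= 0.05 rejects the null hypothesis that all algorithms perform equally well
```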

VI. CONCLUSION

In this paper, a new unsupervised metric learning method is proposed for large separability among different estimated classes of data. Both numerical and categorical data can be handled through the proposed algorithm. The algorithm is assessed against single-linkage, complete-linkage, average-linkage, GMM, DBSCAN, FCM, PCA-FCM, ISOMAP-FCM, GK, FCM-σ, AML, NAML, OKKC, and UNCA. The clustering quality comparison obtained by applying different extrinsic and intrinsic measures shows the superiority of the proposed method.

Here, CVX is applied in updating M and ζ. Though CVX is a well-known package for solving convex problems, it does not solve the problems efficiently and in appropriate time. Thus, proposing an efficient solver for the given problem (19) will be a focus of future research. Also, as noted in Section V-A, in all experiments the number of classes is assumed to be known and is taken as the number of clusters in this method. In real-world applications, this value is unknown; hence, this could make it hard to apply this method in some real applications. Further assessments are to be run on determining the number of clusters in an automated sense in the future.

REFERENCES

[1] F. Wang and J. Sun, "Survey on distance metric learning and dimensionality reduction in data mining," Data Mining Knowl. Discovery, vol. 29, pp. 534–564, 2014.
[2] A. Bellet, A. Habrard, and M. Sebban, "A survey on metric learning for feature vectors and structured data," 2013, arXiv:1306.6709.
[3] J. Ye, Z. Zheng, and L. Huan, "Adaptive distance metric learning for clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–7.
[4] C. Ding and T. Li, "Adaptive dimension reduction using discriminant analysis and k-means clustering," in Proc. Int. Conf. Mach. Learn., 2007, pp. 521–528.
[5] C. Ding, X. He, H. Zha, and H. D. Simon, "Adaptive dimension reduction for clustering high dimensional data," in Proc. Int. Conf. Mach. Learn., 2002, pp. 147–154.
[6] F. De La Torre and T. Kanade, "Discriminative cluster analysis," in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 241–248.
[7] J. Ye, Z. Zhao, and M. Wu, "Discriminative k-means for clustering," in Proc. Neural Inf. Process. Syst., 2008, pp. 1649–1656.
[8] J. Chen, Z. Zheng, J. Ye, and L. Huan, "Nonlinear adaptive distance metric learning for clustering," in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2007, pp. 123–132.
[9] S. Yu, L. Tranchevent, X. Liu, W. Glanzel, J. A. Suykens, B. De Moor, and Y. Moreau, "Optimized data fusion for kernel k-means clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 5, pp. 1031–1039, May 2012.
[10] C. Qin, S. Song, G. Huang, and L. Zhu, "Unsupervised neighborhood component analysis for clustering," Neurocomputing, vol. 168, pp. 609–617, 2015.
[11] J. C. Bezdek, "Objective function clustering," in Pattern Recognition With Fuzzy Objective Function Algorithms. Berlin, Germany: Springer, 1981, pp. 65–85.
[12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[13] C. Budayan, I. Dikmen, and M. T. Birgonul, "Comparing the performance of traditional cluster analysis, self-organizing maps and fuzzy C-means method for strategic grouping," Expert Syst. Appl., vol. 36, no. 9, pp. 11772–11781, 2009.
[14] C. Hou, F. Nie, D. Yi, and D. Tao, "Discriminative embedded clustering: A framework for grouping high-dimensional data," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 6, pp. 1287–1299, Jun. 2015.
[15] S. Ghosh and S. K. Dubey, "Comparative analysis of k-means and fuzzy c-means algorithms," Int. J. Adv. Comput. Sci. Appl., vol. 4, no. 4, pp. 35–39, 2013.
[16] B. Gärtner and J. Matousek, "Semidefinite programming," in Approximation Algorithms and Semidefinite Programming. Berlin, Germany: Springer, 2012, pp. 15–20.
[17] S. Boriah, V. Chandola, and V. Kumar, "Similarity measures for categorical data: A comparative evaluation," in Proc. SIAM Int. Conf. Data Mining, 2008, pp. 243–254.
[18] T. R. dos Santos and L. E. Zárate, "Categorical data clustering: What similarity measure to recommend?," Expert Syst. Appl., vol. 42, pp. 1247–1260, 2013.
[19] Y. M. Cheung and H. Jia, "Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number," Pattern Recognit., vol. 46, pp. 2228–2238, 2013.
[20] Z. Huang and M. K. Ng, "A fuzzy k-modes algorithm for clustering categorical data," IEEE Trans. Fuzzy Syst., vol. 7, no. 4, pp. 446–452, Aug. 1999.
[21] M. Grant, S. Boyd, and Y. Ye, CVX Users' Guide. Stanford, CA, USA: Stanford Univ., 2009.
[22] D. Avis, A. Hertz, and O. Marcotte, "Interior point and semidefinite approaches in combinatorial optimization," in Graph Theory and Combinatorial Optimization. Berlin, Germany: Springer, 2005, pp. 123–126.
[23] K. Bache and M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[24] B. Balasko, J. Abonyi, and B. Feil, Fuzzy Clustering and Data Analysis Toolbox for Use With MATLAB, 2005. [Online]. Available: http://www.fmt.vein.hu/softcomp
[25] Y. Fukuyama and M. Sugeno, "A new method of choosing the number of clusters for the fuzzy c-means method," (in Japanese) in Proc. 5th Fuzzy Syst. Symp., 1989, pp. 247–250.
[26] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed. Orlando, FL, USA: Academic, 2009.
[27] D. E. Gustafson and W. C. Kessel, "Fuzzy clustering with a fuzzy covariance matrix," in Proc. IEEE Conf. Decis. Control, 1979, pp. 761–766.
[28] D. M. Tsai and C. C. Lin, "Fuzzy C-means based clustering for linearly and nonlinearly separable data," Pattern Recognit., vol. 44, no. 8, pp. 1750–1760, 2011.
[29] J. D. Gibbons and S. Chakraborti, Nonparametric Statistical Inference. Berlin, Germany: Springer, 2011.

Authors' photographs and biographies not available at the time of publication.
