Abstract: Recently, semi-supervised clustering has attracted attention in many research fields. In semi-supervised clustering, prior knowledge or information is often formulated as pairwise constraints, that is, must-link and cannot-link. Such pairwise constraints are frequently used to improve clustering properties. In this paper, we propose a new semi-supervised fuzzy c-means clustering method that uses clusterwise tolerance and pairwise constraints. First, the concepts of clusterwise tolerance and pairwise constraints are introduced. Second, the optimization problem of fuzzy c-means clustering using clusterwise tolerance based pairwise constraints is formulated. In particular, only the must-link constraint is considered as the pairwise constraint. Third, a new clustering algorithm is constructed based on the above discussion. Finally, the effectiveness of the proposed algorithm is verified through numerical examples.
Keywords: semi-supervised clustering, fuzzy c-means clustering, clusterwise tolerance, pairwise constraints.
I. INTRODUCTION
The aim of data analysis methods is to discover important properties or knowledge from a large quantity of data. Recently, semi-supervised learning has also attracted attention in many studies [1]-[8]. In the field of clustering [12], [14], pairwise constraints are frequently used to improve clustering results by using background knowledge or prior information [2], [3]. Pairwise constraint problems have also been addressed with a probabilistic model [4], fuzzy clustering models [5], [8], and agglomerative hierarchical clustering [9]-[11]. In addition, soft constraints, which are introduced as penalty terms in the objective function, are another approach [7], [8]. With soft constraints, the pairwise constraints are not always satisfied. Both hard and soft constraints are frequently considered in semi-supervised learning methods.
In recent years, semi-supervised clustering methods based on k-means, fuzzy c-means clustering, and kernel methods have been widely discussed [2], [5], [7], [8]. In these methods, pairwise constraints referred to as must-link and cannot-link are used as prior or background knowledge
about which objects should be in the same or different cluster
978-0-7695-4161-7/10 $26.00 2010 IEEE
DOI 10.1109/GrC.2010.149
II. PRELIMINARIES
Let a datum, a cluster, and its cluster center be $x_k = (x_{k1}, \dots, x_{kp})^T \in \mathbb{R}^p$ $(k = 1, \dots, n)$, $C_i$ $(i = 1, \dots, c)$, and $v_i = (v_{i1}, \dots, v_{ip})^T \in \mathbb{R}^p$, respectively. Moreover, $u_{ki}$ is the membership grade of $x_k$ belonging to $C_i$, and we denote a partition matrix by $U = (u_{ki})_{1 \le k \le n,\, 1 \le i \le c}$. The set of data and the set of cluster centers are $X = \{x_1, \dots, x_n\}$ and $V = \{v_1, \dots, v_c\}$, respectively.
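The objective functions of standard and entropy regularized fuzzy c-means clustering are, respectively,
$$\sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ki})^m d_{ki},$$
$$\sum_{k=1}^{n} \sum_{i=1}^{c} u_{ki} d_{ki} + \lambda^{-1} \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ki} \log u_{ki}.$$
The constraint for $u_{ki}$ is
$$\sum_{i=1}^{c} u_{ki} = 1, \quad \forall k, \tag{1}$$
and the dissimilarity between $x_k$ and $v_i$ is
$$d_{ki} = \|x_k - v_i\|^2 = \sum_{j=1}^{p} (x_{kj} - v_{ij})^2.$$
The clusterwise tolerance vector $\varepsilon_{ki}$ defined below is constrained by its clusterwise tolerance $\kappa_{ki}$:
$$\|\varepsilon_{ki}\|^2 \le (\kappa_{ki})^2 \quad (\kappa_{ki} \ge 0), \quad \forall k, i. \tag{2}$$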
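As a minimal sketch, the two fuzzy c-means objective functions (the sum of $(u_{ki})^m d_{ki}$, and the sum of $u_{ki} d_{ki}$ plus $\lambda^{-1}$ times the sum of $u_{ki} \log u_{ki}$) can be evaluated for a given partition matrix and set of centers as follows. The function names `j_standard` and `j_entropy`, the defaults `m=2.0` and `lam=1.0`, the toy data, and the convention $0 \log 0 = 0$ are assumptions made for illustration and do not come from the paper.

```python
import math

def sq_dist(x, v):
    """Squared Euclidean distance d_ki = ||x_k - v_i||^2."""
    return sum((xj - vj) ** 2 for xj, vj in zip(x, v))

def j_standard(X, V, U, m=2.0):
    """Standard objective: sum_k sum_i (u_ki)^m * d_ki."""
    return sum(U[k][i] ** m * sq_dist(X[k], V[i])
               for k in range(len(X)) for i in range(len(V)))

def j_entropy(X, V, U, lam=1.0):
    """Entropy-regularized objective: sum u*d + (1/lam) * sum u*log(u)."""
    fit = sum(U[k][i] * sq_dist(X[k], V[i])
              for k in range(len(X)) for i in range(len(V)))
    # Convention assumed here: 0 * log(0) is treated as 0.
    ent = sum(u * math.log(u) for row in U for u in row if u > 0.0)
    return fit + ent / lam

X = [(0.0, 0.0), (2.0, 0.0)]   # two hypothetical data points
V = [(0.0, 0.0), (2.0, 0.0)]   # two hypothetical cluster centers
U = [[0.5, 0.5], [0.5, 0.5]]   # maximally fuzzy memberships
print(j_standard(X, V, U))     # 2.0
print(j_entropy(X, V, U))      # 4.0 + 2.0 * log(0.5)
```

With a crisp partition that assigns each point to the center it coincides with, both functions return 0, which is a quick sanity check on the indexing.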
B. Pairwise constraints
Typical examples of pairwise constraints are must-link and cannot-link [2]. These constraints are regarded as prior or background knowledge about which objects should be in the same or different clusters. A set $ML = \{(x_k, x_l)\} \subset X \times X$ consists of must-link pairs, so that $x_k$ and $x_l$ should be in the same cluster, while another set $CL = \{(x_k, x_l)\} \subset X \times X$ consists of cannot-link pairs, so that $x_k$ and $x_l$ should be in different clusters. $ML$ and $CL$ are assumed to be symmetric, that is, if $(x_k, x_l) \in ML$ then $(x_l, x_k) \in ML$, and if $(x_k, x_l) \in CL$ then $(x_l, x_k) \in CL$.
In many studies, these pairwise constraints are treated as hard or soft constraints. A hard constraint means that the pairwise constraints $ML$ and $CL$ are always satisfied in the clustering procedure and its results, while they are not always satisfied in the case of a soft constraint. Many semi-supervised clustering methods based on such hard or soft constraints have been proposed to improve clustering results by using background knowledge or prior information about the data set [2]-[8].
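The symmetry assumption on the must-link and cannot-link sets can be illustrated with a short sketch. The helper names `symmetric_closure` and `linked_to`, and the use of 0-based integer indices in place of the data themselves, are illustrative assumptions.

```python
def symmetric_closure(pairs):
    """Return the pair set with (l, k) added for every (k, l)."""
    closed = set()
    for k, l in pairs:
        closed.add((k, l))
        closed.add((l, k))
    return closed

def linked_to(pairs, k):
    """Indices l such that (k, l) is in the (symmetric) pair set."""
    return {l for (a, l) in pairs if a == k}

ML = symmetric_closure({(0, 1)})   # x_1 and x_2 should share a cluster
CL = symmetric_closure({(1, 2)})   # x_2 and x_3 should be separated
print(sorted(ML))                  # [(0, 1), (1, 0)]
print(linked_to(CL, 1))            # {2}
```

Storing both orientations of each pair keeps later lookups of the partners of a given datum to a single pass over the set.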
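A. Clusterwise tolerance
First, we define a clusterwise tolerance and a clusterwise tolerance vector. A clusterwise tolerance vector $\varepsilon_{ki} = (\varepsilon_{ki1}, \dots, \varepsilon_{kip})^T$ is defined between a datum $x_k$ and a cluster center $v_i$, and a clusterwise tolerance $\kappa_{ki}$ means the admissible range of each clusterwise tolerance vector. The set of clusterwise tolerance vectors is denoted by $E = \{\varepsilon_{ki}\}$.

Figure 1. [caption partly lost] in $\mathbb{R}^2$.

For each datum $x_k$, the sets of must-linked and cannot-linked objects are
$$ML(x_k) = \{x_l \in X \mid (x_k, x_l) \in ML\}, \tag{3}$$
$$CL(x_k) = \{x_l \in X \mid (x_k, x_l) \in CL\}. \tag{4}$$
The concept of clusterwise tolerance based pairwise constraints uses these sets to calculate the upper bound $|K(x_k, v_i)|$ of the clusterwise tolerance vector, which is defined between a datum and a cluster center. The value of $K(x_k, v_i)$ is calculated as the sum of the clusterwise tolerances $\kappa_{ki}$ over the set of must-linked or cannot-linked objects:
$$K(x_k, v_i) = \kappa_{ki} + \sum_{x_q \in ML(x_k)} \kappa_{qi} - \sum_{x_r \in CL(x_k)} \kappa_{ri}. \tag{5}$$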
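The computation of $K(x_k, v_i)$ can be sketched as below. This is a hedged illustration rather than the authors' code: it assumes that the tolerances of must-linked objects are added and those of cannot-linked objects are subtracted, and that every $\kappa$ equals 1.0. Under those assumptions it reproduces the value $K(x_2, v_1) = 1.0$ quoted in the text for $ML(x_2) = \{x_1\}$ and $CL(x_2) = \{x_3\}$. The function name `tolerance_bound` and the dictionary layout are hypothetical.

```python
def tolerance_bound(k, i, kappa, ml, cl):
    """K(x_k, v_i): own tolerance, plus must-linked ones, minus cannot-linked ones."""
    return (kappa[k][i]
            + sum(kappa[q][i] for q in ml.get(k, ()))
            - sum(kappa[r][i] for r in cl.get(k, ())))

# Assumed setup: 3 data, 2 clusters, every clusterwise tolerance equal to 1.0.
kappa = [[1.0, 1.0] for _ in range(3)]
ml = {0: {1}, 1: {0}}   # ML(x_1) = {x_2}, ML(x_2) = {x_1}  (0-based indices)
cl = {1: {2}, 2: {1}}   # CL(x_2) = {x_3}, CL(x_3) = {x_2}

print(tolerance_bound(1, 0, kappa, ml, cl))  # 1.0, i.e. K(x_2, v_1)
print(tolerance_bound(1, 1, kappa, ml, cl))  # 1.0, i.e. K(x_2, v_2)
```

Because the result can be negative when cannot-link partners dominate, a caller that needs an upper bound on a norm would take the absolute value, as the text does with $|K(x_k, v_i)|$.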
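For $x_2$,
$$ML(x_2) = \{x_1\}, \quad CL(x_2) = \{x_3\},$$
$$K(x_2, v_1) = 1.0, \quad K(x_2, v_2) = 1.0.$$
For $x_3$,

In this section, we consider semi-supervised fuzzy c-means clustering using clusterwise tolerance based pairwise constraints (SSFCMCT). In particular, we introduce only the must-link constraint as the clusterwise tolerance based pairwise constraint. Therefore, (5) is rewritten as follows:
$$K(x_k, v_i) = \kappa_{ki} + \sum_{x_q \in ML(x_k)} \kappa_{qi}. \tag{6}$$

A. Standard model
The objective function of semi-supervised standard fuzzy c-means clustering using clusterwise tolerance based pairwise constraints (SSsFCMCT) is as follows:
$$J_{sct}(U, E, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ki})^m d_{ki},$$
where, in this model, $d_{ki} = \|x_k + \varepsilon_{ki} - v_i\|^2$. The constraint for $u_{ki}$ is (1), and the constraint for $\varepsilon_{ki}$ is
$$\|\varepsilon_{ki}\|^2 \le \{K(x_k, v_i)\}^2, \quad \forall k, i. \tag{7}$$
The Lagrangian function is
$$L = J_{sct}(U, E, V) + \sum_{k=1}^{n} \gamma_k \left( \sum_{i=1}^{c} u_{ki} - 1 \right) + \sum_{k=1}^{n} \sum_{i=1}^{c} \delta_{ki} \left( \|\varepsilon_{ki}\|^2 - \{K(x_k, v_i)\}^2 \right).$$
The optimal solutions for $u_{ki}$ and $v_i$ are
$$u_{ki} = \frac{\left( \frac{1}{d_{ki}} \right)^{\frac{1}{m-1}}}{\sum_{l=1}^{c} \left( \frac{1}{d_{kl}} \right)^{\frac{1}{m-1}}}, \tag{8}$$
$$v_i = \frac{\sum_{k=1}^{n} (u_{ki})^m (x_k + \varepsilon_{ki})}{\sum_{k=1}^{n} (u_{ki})^m}. \tag{9}$$
From the Karush-Kuhn-Tucker conditions, the optimal $\varepsilon_{ki}$ satisfies
$$\frac{\partial L}{\partial \varepsilon_{ki}} = 0, \quad \delta_{ki} \frac{\partial L}{\partial \delta_{ki}} = 0, \quad \frac{\partial L}{\partial \delta_{ki}} \le 0, \quad \delta_{ki} \ge 0. \tag{10}$$
From $\frac{\partial L}{\partial \varepsilon_{ki}} = 0$, we can get
$$\varepsilon_{ki} = -\frac{(u_{ki})^m (x_k - v_i)}{(u_{ki})^m + \delta_{ki}}. \tag{11}$$
From $\delta_{ki} \frac{\partial L}{\partial \delta_{ki}} = 0$,
$$\delta_{ki} \left( \|\varepsilon_{ki}\|^2 - \{K(x_k, v_i)\}^2 \right) = 0.$$
By (11), $\varepsilon_{ki}$ can be written as
$$\varepsilon_{ki} = -\alpha_{ki} (x_k - v_i), \tag{12}$$
where $\alpha_{ki} = (u_{ki})^m / \left( (u_{ki})^m + \delta_{ki} \right)$. If $\delta_{ki} = 0$, then $\alpha_{ki} = 1$ and $\varepsilon_{ki} = -(x_k - v_i)$. If $\delta_{ki} > 0$, then $\|\varepsilon_{ki}\|^2 = \{K(x_k, v_i)\}^2$, so that
$$\frac{(u_{ki})^m}{(u_{ki})^m + \delta_{ki}} = \frac{K(x_k, v_i)}{\|x_k - v_i\|}. \tag{13}$$
Hence,
$$\alpha_{ki} = \min \left\{ \frac{K(x_k, v_i)}{\|x_k - v_i\|},\ 1 \right\}. \tag{14}$$

B. Entropy model
The objective function of the entropy model (SSeFCMCT) is as follows:
$$J_{ect}(U, E, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ki} d_{ki} + \lambda^{-1} \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ki} \log u_{ki}. \tag{15}$$
The constraints for $u_{ki}$ and $\varepsilon_{ki}$ remain the same as (1) and (7). The optimal solutions for $u_{ki}$ and $v_i$ are
$$u_{ki} = \frac{\exp(-\lambda d_{ki})}{\sum_{l=1}^{c} \exp(-\lambda d_{kl})}, \tag{16}$$
$$v_i = \frac{\sum_{k=1}^{n} u_{ki} (x_k + \varepsilon_{ki})}{\sum_{k=1}^{n} u_{ki}}, \tag{17}$$
and the optimal solution for $\varepsilon_{ki}$ is again given by (12) with $\alpha_{ki}$ from (14).

C. Algorithms
The algorithm of SSFCMCT is described as Algorithm 1. Equations A, B, and C are listed in Table I.

Algorithm 1 SSFCMCT
SSFCMCT1 Set the initial values and parameters.
SSFCMCT2 Calculate $u_{ki} \in U$ using Equation A.
SSFCMCT3 Calculate $v_i \in V$ using Equation B.
SSFCMCT4 Calculate $\varepsilon_{ki} \in E$ using Equation C.
SSFCMCT5 If the convergence criterion is satisfied, stop. Otherwise, go back to SSFCMCT2.

In these algorithms, the convergence criterion is convergence of each variable, convergence of the value of the objective function, or the number of iterations.

Table I
THE OPTIMAL SOLUTIONS OF SSFCMCT.
Algorithm    Equation A    Equation B    Equation C
SSsFCMCT     (8)           (9)           (14)
SSeFCMCT     (16)          (17)          (14)

V. NUMERICAL EXAMPLES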
Table II
DATA SET $\{x_k \mid x_k \in \mathbb{R}^p,\ k = 1, \dots, 9\}$.
k    $(x_{k1}, x_{k2})$
1    (0.0, 0.0)
2    (0.0, 10.0)
3    (2.0, 5.0)
4    (5.0, 5.0)
5    (8.0, 5.0)
6    (10.0, 0.0)
7    (10.0, 10.0)
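The iteration of Algorithm 1 with the standard-model updates can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the authors' implementation or experimental setup. It uses only the seven data rows listed above, a single shared tolerance `kappa` for every datum and cluster, a hypothetical must-link pair $(x_1, x_6)$, hypothetical initial centers, and a fixed number of iterations as the stopping rule. The tolerance vector is taken as $\varepsilon_{ki} = -\alpha_{ki}(x_k - v_i)$ with $\alpha_{ki} = \min\{K(x_k, v_i) / \|x_k - v_i\|, 1\}$, and the dissimilarity as $d_{ki} = \|x_k + \varepsilon_{ki} - v_i\|^2$. The guard for $d_{ki} \approx 0$ is an added safeguard for the case where the tolerance absorbs the whole gap.

```python
import math

def ssfcmct(X, ml_pairs, v_init, kappa=1.0, m=2.0, n_iter=50):
    """Sketch of Algorithm 1 (standard model); returns (U, V, E, K)."""
    n, p, c = len(X), len(X[0]), len(v_init)
    V = [list(v) for v in v_init]
    ml = [set() for _ in range(n)]          # ML(x_k), stored symmetrically
    for k, l in ml_pairs:
        ml[k].add(l)
        ml[l].add(k)
    # Must-link-only bound K(x_k, v_i); with one shared kappa (an assumption)
    # it does not depend on the cluster index i.
    K = [kappa * (1 + len(ml[k])) for k in range(n)]
    E = [[[0.0] * p for _ in range(c)] for _ in range(n)]
    U = [[1.0 / c] * c for _ in range(n)]
    for _ in range(n_iter):                 # stopping rule: iteration count
        # Equation A: memberships from d_ki = ||x_k + eps_ki - v_i||^2.
        for k in range(n):
            d = [sum((X[k][j] + E[k][i][j] - V[i][j]) ** 2 for j in range(p))
                 for i in range(c)]
            hit = [i for i in range(c) if d[i] < 1e-12]
            if hit:                         # zero distance: split membership
                U[k] = [1.0 / len(hit) if i in hit else 0.0 for i in range(c)]
            else:
                w = [(1.0 / di) ** (1.0 / (m - 1.0)) for di in d]
                total = sum(w)
                U[k] = [wi / total for wi in w]
        # Equation B: centers as weighted means of the shifted data.
        for i in range(c):
            den = sum(U[k][i] ** m for k in range(n))
            if den > 0.0:
                V[i] = [sum(U[k][i] ** m * (X[k][j] + E[k][i][j])
                            for k in range(n)) / den for j in range(p)]
        # Equation C: tolerance vectors, clipped to length K(x_k, v_i).
        for k in range(n):
            for i in range(c):
                diff = [X[k][j] - V[i][j] for j in range(p)]
                norm = math.sqrt(sum(t * t for t in diff))
                alpha = 1.0 if norm == 0.0 else min(K[k] / norm, 1.0)
                E[k][i] = [-alpha * t for t in diff]
    return U, V, E, K

X = [(0.0, 0.0), (0.0, 10.0), (2.0, 5.0), (5.0, 5.0),
     (8.0, 5.0), (10.0, 0.0), (10.0, 10.0)]
V0 = [(2.0, 5.0), (8.0, 5.0)]               # hypothetical initial centers
U, V, E, K = ssfcmct(X, ml_pairs=[(0, 5)], v_init=V0)
for k, row in enumerate(U, start=1):
    print(k, [round(u, 3) for u in row])
```

Setting `kappa=0.0` and passing no must-link pairs makes every tolerance vector zero, so the sketch reduces to ordinary fuzzy c-means, which gives a convenient baseline for comparison.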
[4] S. Basu, M. Bilenko, R. J. Mooney, A probabilistic framework for semi-supervised clustering, Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), pp. 59-68, 2004.
[5] S. Miyamoto, M. Yamazaki, A. Terami, On semi-supervised clustering with pairwise constraints, Proc. of The 7th International Conference on Modeling Decisions for Artificial Intelligence (MDAI 2009), pp. 245-254, 2009 (CD-ROM).
[6] Y. Endo, Y. Hamasuna, M. Yamashiro, S. Miyamoto, On semi-supervised fuzzy c-means clustering, Proc. of 2009 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2009), pp. 1119-1124, 2009.
[7] B. Yan, C. Domeniconi, An adaptive kernel method for semi-supervised clustering, Proc. of 17th European Conference on Machine Learning (ECML 2006), pp. 521-532, 2006.
[8] B. Kulis, S. Basu, I. Dhillon, R. Mooney, Semi-supervised graph clustering: a kernel approach, Machine Learning, Vol. 74, No. 1, pp. 1-22, 2009.
VI. CONCLUSIONS
In this paper, we have proposed semi-supervised fuzzy c-means clustering using clusterwise tolerance based pairwise constraints. By using the concept of clusterwise tolerance, the proposed method can handle pairwise constraints without breaking the Euclidean space. Moreover, we have shown the effectiveness of the proposed method through numerical examples. The proposed method differs from other semi-supervised clustering methods in that it handles pairwise constraints by means of the clusterwise tolerance vector.
In future work, we will consider how to handle the cannot-link constraint with the proposed method. Next, we will compare the proposed method with other semi-supervised clustering methods through numerical examples with various kinds of data sets. Moreover, we will apply the proposed method to fuzzy c-means clustering for data with clusterwise tolerance based on regularization [17], [18].
ACKNOWLEDGMENTS
[13] S. Miyamoto, M. Mukaidono, Fuzzy c-means as a regularization and maximum entropy approach, Proceedings of the 7th International Fuzzy Systems Association World Congress (IFSA'97), Vol. 2, pp. 86-92, 1997.
[15] Y. Hamasuna, Y. Endo, S. Miyamoto, On Tolerant Fuzzy c-Means, Journal of Advanced Computational Intelligence and Intelligent Informatics (JACIII), Vol. 13, No. 4, pp. 421-427, 2009.
[16] Y. Endo, R. Murata, H. Haruyama, S. Miyamoto, Fuzzy c-means for data with tolerance, Proc. of International Symposium on Nonlinear Theory and Its Applications (Nolta'05), pp. 345-348, 2005.
Figures 3 to 7. [Plots not recoverable; both axes range from 0 to 10. The only surviving caption text is "Result with $ML = \emptyset$."]