Abstract—Hierarchical clustering is one of the most important tasks in data mining. However, the existing hierarchical clustering algorithms are time-consuming and have low clustering quality because they ignore the constraints. In this paper, a Hierarchical Clustering Algorithm based on K-means with Constraints (HCAKC) is proposed. In HCAKC, in order to improve the clustering efficiency, the Improved Silhouette is defined to determine the optimal number of clusters. In addition, to improve the hierarchical clustering quality, the existing pairwise must-link and cannot-link constraints are adopted to update the cohesion matrix between clusters. A penalty factor is introduced to modify the similarity metric to address constraint violation. The experimental results show that HCAKC has lower computational complexity and better clustering quality than the existing algorithm CSM.

Keywords—Hierarchical clustering; Improved Silhouette; K-means; constraints

I. INTRODUCTION

Clustering is an important analysis tool in many fields, such as pattern recognition, image classification, biological sciences, marketing, city planning and document retrieval. Hierarchical clustering is one of the most widely used clustering methods.

At present, several existing clustering algorithms focus on combining the advantages of hierarchical and partitioning clustering algorithms [1-2]. K-means, one of the representative partitioning methods, obtains the number of clusters by minimizing its objective function. K-means has higher efficiency than the hierarchical methods. However, the number of clusters K needs to be fixed iteratively; thus, K-means often has to be run many times and is computationally expensive. How to determine the number of clusters therefore becomes an increasingly important problem. The common trial-and-error method [3] generally depends on a particular clustering algorithm and is inefficient when the dataset is large. Besides, the existing algorithms that combine hierarchical clustering and K-means [4] ignore the available constraints [5].

Clustering is traditionally considered an unsupervised method for data analysis. However, in some cases background knowledge is known in addition to the data instances, typically in the form of pairwise constraints (must-link and cannot-link). Clustering quality can be improved by exploiting these constraints. Carlos Ruiz enhanced the density-based algorithm DBSCAN with constraints upon data points to obtain the new algorithm C-DBSCAN [6]. C-DBSCAN has superior performance to DBSCAN even with a small number of constraints; however, its efficiency is poor. How the K-means clustering algorithm can be profitably modified to make use of the constraints is demonstrated in cop-kmeans [7]. Although the clustering accuracy of cop-kmeans is improved, the constraint violation problem has not been well addressed. I. Davidson [8] incorporated the pairwise constraints into agglomerative hierarchical clustering to improve the clustering quality, but the problem of constraint violation is still not solved.

In this paper, we propose HCAKC, a new method for hierarchical clustering based on K-means with existing pairwise constraints. In HCAKC, the Improved Silhouette and CUCMC (Constraints-based Update of the Cohesion Matrix between Clusters) are defined. The optimal number of clusters is determined by computing the average Improved Silhouette of the dataset, so that the time complexity can be reduced. The initial clusters of HCAKC are obtained by running K-means. To improve the quality of the hierarchical clustering, the existing pairwise must-link and cannot-link constraints are incorporated into the agglomerative hierarchical clustering, and CUCMC is performed based on these constraints. The penalty factor [9] is introduced into the cohesion [2] similarity metric to address constraint violation.

This paper is organized as follows. In section II, we give the basic concepts and definitions. In section III, we present our hierarchical clustering algorithm, HCAKC. Section IV shows the experimental results of the clustering algorithm. Finally, we conclude the paper in section V.

* This work is supported by the National High Technology Research and Development Program ("863" Program) of China (No. 2009AA01Z433) and the Natural Science Foundation of Hebei Province, P.R. China (No. F2008000888).
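Concretely, the pairwise background knowledge discussed above is just two sets of point pairs, and a candidate cluster assignment can be screened against them in the spirit of cop-kmeans [7]. The following is a minimal illustrative sketch; the `violates` helper, the index pairs and the toy assignments are ours, not taken from the paper or from [7]:

```python
# Pairwise constraints kept as sets of point-index pairs.
must_link = {(0, 1), (2, 3)}   # each pair must share a cluster
cannot_link = {(1, 4)}         # each pair must be in different clusters

def violates(assign, must_link, cannot_link):
    """Return True if the assignment breaks any pairwise constraint.
    `assign` maps point index -> cluster label."""
    for i, j in must_link:
        if assign[i] != assign[j]:
            return True        # a must-link pair was separated
    for i, j in cannot_link:
        if assign[i] == assign[j]:
            return True        # a cannot-link pair was merged
    return False

# A cop-kmeans-style check rejects assignments that violate constraints.
ok = {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'b'}
broken = {0: 'a', 1: 'b', 2: 'b', 3: 'b', 4: 'b'}  # splits must-link (0, 1)

print(violates(ok, must_link, cannot_link))      # False
print(violates(broken, must_link, cannot_link))  # True
```

In cop-kmeans this check is applied during point assignment; HCAKC instead folds the constraints into the cohesion matrix, as described in Section III.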
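The route to the optimal cluster number sketched above (and detailed in Section III as algorithm Find-K) is to score each candidate partition count by the dataset's average Improved Silhouette and keep the count at which the curve peaks. The Improved Silhouette itself is defined in the paper's Section II, which is not reproduced in this excerpt, so the sketch below substitutes the standard silhouette coefficient and a deterministic 1-D K-means; all function names and data are illustrative only:

```python
def kmeans_1d(xs, k, iters=100):
    """Deterministic Lloyd's algorithm on 1-D data (k >= 2);
    centroids start evenly spaced through the sorted points."""
    xs = sorted(xs)
    cents = [xs[round(i * (len(xs) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda c: abs(x - cents[c]))].append(x)
        new = [sum(g) / len(g) if g else cents[ci] for ci, g in enumerate(groups)]
        if new == cents:
            break                      # assignments are stable
        cents = new
    return [g for g in groups if g]

def silhouette(clusters):
    """Average silhouette coefficient of a partition (singletons score 0)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for ci, c in enumerate(clusters):
        if len(c) == 1:
            continue                   # singleton clusters contribute 0
        for x in c:
            a = sum(abs(x - y) for y in c) / (len(c) - 1)    # intra-cluster
            b = min(sum(abs(x - y) for y in o) / len(o)      # nearest other
                    for cj, o in enumerate(clusters) if cj != ci)
            total += (b - a) / max(a, b)
    return total / n

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
scores = {t: silhouette(kmeans_1d(data, t)) for t in range(2, 6)}
best_t = max(scores, key=scores.get)
print(best_t)  # 3
```

On this clearly three-group toy set the average silhouette peaks at t = 3, mirroring how Find-K stops at the maximum of its curve.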
III. HIERARCHICAL CLUSTERING ALGORITHM BASED ON K-MEANS WITH CONSTRAINTS

CSM [2] needs K to be specified, and different values of K lead to different clustering results; thus, how to determine an appropriate K becomes especially important. Besides, the existing constraints are not considered in CSM, so the accuracy of its clustering results is limited.

In HCAKC, we plot the curve of the average IS of the dataset to be clustered against the number of partitions. The optimal number of clusters is determined by the maximum of the curve, since the average IS of a dataset reflects not only the density of the clusters but also the dissimilarity between them. The cohesion matrix X is constructed according to the cohesion between every two clusters. The existing pairwise constraints are incorporated into the hierarchical clustering, and CUCMC is implemented based on them; thus, the clustering results are greatly improved. In our algorithms, S is the dataset to be clustered; K is the optimal number of clusters; n is the size of S; m is the number of sub-clusters; M = {(Ci, Cj)} (i, j ∈ [1, t]) is the set of existing must-link constraints; and C = {(Ci, Cj)} (i, j ∈ [1, t]) is the set of existing cannot-link constraints.

Algorithm Find-K
Input: S
Output: K
begin
1: partition S into t clusters C1, C2, ..., Ct according to the geometric distribution of S;
2: repeat {
3: for (i = 1; i <= t; i++)
4: { for (each object x in Ci)
5: calculate IS_i(x), the improved silhouette of x;
6: calculate \overline{IS}, the average Improved Silhouette of S: \overline{IS} = \frac{1}{n} \sum_{i=1}^{t} \sum_{x \in C_i} IS_i(x);
7: plot the point (t, \overline{IS}) in the 2-dimensional coordinate system; }
8: t := t + 1;
9: } until (the curve reaches its maximum)
10: K := t;
end

In algorithm Find-K, the dataset S is first partitioned into t clusters C1, C2, ..., Ct according to the geometric distribution of S. IS is introduced into the algorithm: the closer the improved silhouette of a cluster is to 1, the more similar the objects within that cluster are. The curve of the cluster number t against the average IS is plotted, and the number of clusters corresponding to the maximum of the curve is taken as the optimal number of clusters.

Algorithm HCAKC
Input: S, n, m, M, C
Output: the K clusters
begin
1: call algorithm Find-K;
2: initially select t points arbitrarily as the centroids of sub-clusters (t >> K);
3: repeat
4: { for (each point x in S)
5: assign x to the closest sub-cluster based on the distance to its centroid;
6: update the centroid of each sub-cluster;
7: } until (no points change between the t clusters) // run K-means on S with K equal to t
8: compute the cohesion matrix X between the t clusters;
9: if ((Ci, Cj) ∈ M or (Ci, Cj) ∈ C)
10: implement CUCMC;
11: if ((Ci, Cj) violates M (or C))
12: enforce the penalty factor w≠M (w≠C) on the cohesion matrix;
13: do { extract the pair (Ci, Cj) with the maximal cohesion chs(Ci, Cj);
14: if (Ci and Cj do not belong to the same sub-cluster)
15: merge the two sub-clusters they belong to into a new sub-cluster;
16: t := t - 1; } while (t > K)
end

In HCAKC, Find-K is first run to determine the optimal cluster number K of the dataset to be clustered. Then K-means is adopted to form t initial clusters, where t is larger than K. The cohesion matrix X between the t clusters is obtained based on formula (3). Afterwards, the existing constraint sets M = {(Ci, Cj)} and C = {(Ci, Cj)} are used to implement CUCMC. The penalty factor is introduced to address constraint violation: when a must-link constraint (Ci, Cj) is violated, w≠M(Ci, Cj) is enforced on the similarity metric according to formula (5), and the entry in row i, column j of X is set to Sim(Ci, Cj).

IV. EXPERIMENTAL RESULTS

All of our experiments have been conducted on a computer with a 2.4 GHz Intel CPU and 512 MB of main memory, running Microsoft Windows XP. HCAKC is compared with CSM to evaluate its clustering quality and time performance. Both algorithms are implemented in Microsoft Visual C++ 6.0.

We performed our experiments on the UCI datasets Ionosphere, iris, breast-cancer, credit-g and page-blocks. The must-link and cannot-link constraints are generated artificially using the same method as [7]. The details of the datasets are shown in Tab. 1. For instance, D1 is the Ionosphere dataset, consisting of 355 instances from two clusters. Accuracy [7], one of the clustering quality measures, is computed to compare the clustering results of HCAKC and CSM. We averaged the measures over 100 trials on each dataset. Fig. 3 and Fig. 4 show the experimental results comparing HCAKC with CSM.

HCAKC and CSM are run on D1 (i.e. the Ionosphere dataset) with constraints. Fig. 3 shows the accuracy results of the two algorithms on this dataset. From Fig. 3, we can conclude that CSM has lower accuracy than HCAKC for varying numbers of constraints.
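The excerpt does not reproduce the definition of the accuracy measure adopted from [7]; a common choice in the constrained-clustering literature, and the one we assume here, is a Rand-index style score: the fraction of point pairs on which the produced partition and the reference partition agree. A small sketch (the function name and toy labels are ours):

```python
from itertools import combinations

def rand_accuracy(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree:
    a pair counts if it is together in both or apart in both."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]   # one point placed in the wrong cluster
print(rand_accuracy(truth, found))  # 10 of the 15 pairs agree -> 0.666...
```

Because the score is pair-based, it is insensitive to how the cluster labels themselves are numbered, which makes it convenient for comparing two algorithms' partitions.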
In CSM, the constraints are not considered. In HCAKC, the constraints are incorporated into the hierarchical clustering to update the cohesion matrix, and constraint violation is addressed as well. Thus, HCAKC is better in terms of clustering accuracy.

The experiments have also been conducted on the iris, breast-cancer, credit-g and page-blocks datasets to compare the time efficiency of HCAKC with that of CSM. From Fig. 4, we can conclude that HCAKC outperforms CSM in CPU running time on the different datasets.

The cluster number K needs to be specified as a parameter before CSM runs, and the cost of setting this parameter is high, since K-means needs to be run iteratively. HCAKC finds the optimal K by computing the average IS of the points in the dataset, and the time cost of this process is insignificant. The time advantage of HCAKC is evident even when the scale of the dataset is large.

TABLE 1. PARAMETERS OF THE TESTING DATASETS

V. CONCLUSION

In order to improve the time efficiency and clustering quality of CSM, a new method named HCAKC has been proposed in this paper. In the proposed algorithm, the curve of the average IS of the dataset against the partition number is plotted, and the optimal number of clusters is determined by locating the maximum of this curve. As a result, the complexity of determining the number of clusters is reduced. Thereafter, the existing constraints are incorporated to perform CUCMC during the hierarchical clustering process, and the penalty factor is introduced to address constraint violation. Hence, the clustering quality is improved. The experimental results demonstrate that HCAKC is effective in reducing the time complexity and increasing the clustering quality.