Email: chinna_chetan05@yahoo.com
Visit: www.geocities.com/chinna_chetan05/forfriends.html
Contents
1. Abstract
2. Keywords
3. Introduction
4. Clustering
5. Partitional Algorithms
6. K-medoid Algorithms
6.1 PAM
6.2 CLARA
6.3 CLARANS
7. Analysis
8. Conclusion
1. ABSTRACT
In the last few years there has been tremendous research interest in devising efficient data mining algorithms. Clustering is an essential component of data mining techniques. Interestingly, the special nature of data mining makes classical clustering algorithms unsuitable: the datasets involved are usually very large, they need not be numeric, and hence efficient input and output operations matter more than algorithmic complexity. As a result, a number of clustering algorithms have been proposed for data mining in recent years. The present paper gives a brief overview of the partitional clustering algorithms used in data mining. The first part of the paper gives an overview of the clustering techniques used in data mining. The second part discusses the different partitional clustering algorithms used in mining of data.
2. KEYWORDS:
Knowledge discovery in databases, data mining, clustering, partitional algorithms, PAM, CLARA, CLARANS.
3. INTRODUCTION:
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Knowledge discovery in databases (KDD) is a well-defined process consisting of several distinct steps, and data mining is the core step, the one that results in the discovery of knowledge. Data mining is a high-level technique used to present and analyze data for decision-makers. There is an enormous wealth of information embedded in the huge databases belonging to enterprises, and this has spurred tremendous interest in the areas of knowledge discovery and data mining. The fundamental goals of data mining are prediction and description. Prediction makes use of existing variables in the database in order to predict unknown or future values of interest, while description focuses on finding patterns that describe the data and on their subsequent presentation for user interpretation. There are several mining techniques for prediction and description, categorized as association, classification, sequential patterns, and clustering. The basic
premise of association is to find all associations such that the presence of one set of items in a transaction implies the presence of other items. Classification develops profiles of different groups. Sequential-pattern mining identifies sequential patterns subject to a user-specified minimum support constraint. Clustering segments a database into subsets, or clusters.
4. Clustering
Clustering is a useful technique for discovering the distribution and patterns of the underlying data. The goal of clustering is to discover the dense and sparse regions in a dataset. Data clustering has been studied in the statistics, machine learning, and database communities with diverse emphases. There are two main types of clustering techniques: partitional and hierarchical. Partitional clustering techniques construct a partition of the database into a predefined number of clusters. Hierarchical clustering techniques produce a sequence of partitions in which each partition is nested into the next partition in the sequence.
5. PARTITIONAL ALGORITHMS
Partitional clustering algorithms divide a database of n objects into k subsets. An exhaustive enumeration could find the globally optimal partition, but it is practically infeasible when n and k are large. A partitional clustering algorithm therefore usually adopts an iterative optimization paradigm: it starts with an initial partition and uses an iterative control strategy, trying swaps of data points to see whether such a swap improves the quality of the clustering. When no swap yields an improvement, the algorithm has found a locally optimal partition. The quality of this clustering is very sensitive to the initially selected partition. There are two main categories of partitioning algorithms:
• k-means algorithms, where each cluster is represented by the center of gravity of the cluster.
• k-medoid algorithms, where each cluster is represented by one of the objects of the cluster, located near its center.
Most of the special clustering algorithms designed for data mining are k-medoid algorithms. Three such k-medoid algorithms are PAM, CLARA, and CLARANS.
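The difference between the two representations can be illustrated with a small sketch (the function names and the toy one-dimensional data are our own, for illustration only):

```python
def centroid(cluster):
    """k-means representative: the mean (center of gravity) of the cluster."""
    return sum(cluster) / len(cluster)

def medoid(cluster):
    """k-medoid representative: the cluster member with the smallest
    total dissimilarity to all other members."""
    return min(cluster, key=lambda c: sum(abs(c - x) for x in cluster))

cluster = [1.0, 2.0, 3.0, 100.0]   # one outlier
print(centroid(cluster))   # 26.5 -- dragged toward the outlier
print(medoid(cluster))     # 2.0  -- an actual data object
```

Note that the medoid is always one of the data objects, which is why k-medoid methods also apply to non-numeric data for which a mean is undefined.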
6. k-Medoid Algorithms
6.1 PAM
PAM uses a k-medoid method to identify the clusters. It selects k objects arbitrarily from the data as medoids. In each step, a swap between a selected object Oi and a non-selected object Oh is made as long as such a swap would improve the quality of the clustering. To evaluate the effect of such a swap between Oi and Oh, a cost Cih is computed, which reflects the quality of partitioning the non-selected objects into the k clusters represented by the medoids. At this stage it is therefore necessary to first understand how the data objects are partitioned when a set of k medoids is given.
Partitioning
If Oj is a non-selected object and Oi is a medoid, we say that Oj belongs to the cluster represented by Oi if d(Oj,Oi) = minOe d(Oj,Oe), where the minimum is taken over all medoids Oe and d(Oa,Ob) denotes the distance, or dissimilarity, between objects Oa and Ob. The dissimilarity matrix is known before PAM commences. The quality of the clustering
is measured by the average dissimilarity between an object and the medoid of the cluster to
which the object belongs.
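The partitioning rule and the quality measure just described can be sketched as follows (the function name is our own, and absolute difference on numbers stands in for the dissimilarity matrix):

```python
def partition(objects, medoids, d):
    """Assign each object to the cluster of its nearest medoid and
    return the clusters plus the average dissimilarity (the quality
    measure: lower is better)."""
    clusters = {m: [] for m in medoids}
    total = 0.0
    for o in objects:
        nearest = min(medoids, key=lambda m: d(o, m))
        clusters[nearest].append(o)
        total += d(o, nearest)
    return clusters, total / len(objects)

d = lambda a, b: abs(a - b)                  # toy dissimilarity on numbers
clusters, quality = partition([1, 2, 8, 9, 25], medoids=[2, 9], d=d)
print(clusters)   # {2: [1, 2], 9: [8, 9, 25]}
print(quality)    # 3.6
```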
Let us assume that O1, O2, …, Ok are the k medoids selected at some stage, and denote by C1, C2, …, Ck the respective clusters. From the foregoing discussion, a non-selected object Oj belongs to Ch if min(1≤i≤k) d(Oj,Oi) = d(Oj,Oh). Let us now analyze the effect of swapping Oi and Oh; in other words, let us compare the quality of the clustering if we select as the k medoids O1, O2, …, Oi-1, Oh, Oi+1, …, Ok, where Oh replaces Oi. Due to the change in the set of medoids, three types of change can occur in the actual clustering:
• A non-selected object Oj such that Oj Є Ci before the swap and Oj Є Ch after it. This case arises when mine d(Oj,Oe) = d(Oj,Oi) before the swap and mine≠i d(Oj,Oe) = d(Oj,Oh) after it. Define the cost as Cjih = d(Oj,Oh) - d(Oj,Oi).
• A non-selected object Oj such that Oj Є Ci before the swap and Oj Є Cj΄, j΄ ≠ h, after it. This case arises when mine d(Oj,Oe) = d(Oj,Oi) and mine≠i d(Oj,Oe) = d(Oj,Oj΄), j΄ ≠ h. Define the cost as Cjih = d(Oj,Oj΄) - d(Oj,Oi).
• A non-selected object Oj such that Oj Є Cj΄, j΄ ≠ i, before the swap and Oj Є Ch after it. This case arises when mine d(Oj,Oe) = d(Oj,Oj΄) before the swap and mine≠i d(Oj,Oe) = d(Oj,Oh) after it. Define the cost as Cjih = d(Oj,Oh) - d(Oj,Oj΄).
Define the total cost of swapping Oi and Oh as Cih = ∑j Cjih. If Cih is negative, the quality of the clustering is improved by making Oh a medoid in place of Oi. The process is repeated until no negative Cih can be found.
ALGORITHM
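The swap-based procedure described above can be sketched as follows (a minimal illustrative rendering in Python; the function name and the greedy accept-any-improving-swap strategy are our own choices, not prescribed by the original algorithm):

```python
import itertools

def pam(objects, k, d):
    """Minimal PAM sketch: start from arbitrary medoids, then keep
    performing swaps that reduce the total cost until no single swap
    improves the clustering (a locally optimal partition)."""
    medoids = list(objects[:k])              # arbitrary initial medoids

    def total_cost(meds):
        # Sum over all objects of the distance to the nearest medoid;
        # minimizing this also minimizes the average dissimilarity.
        return sum(min(d(o, m) for m in meds) for o in objects)

    improved = True
    while improved:
        improved = False
        for i, h in itertools.product(range(k), objects):
            if h in medoids:
                continue
            candidate = medoids[:i] + [h] + medoids[i + 1:]
            # Cih < 0 means swapping medoids[i] and h improves quality
            if total_cost(candidate) < total_cost(medoids):
                medoids = candidate
                improved = True
    return medoids
```

On a toy dataset such as `[1, 2, 3, 8, 9, 10]` with k = 2 and absolute difference as the dissimilarity, this sketch converges to the medoids 2 and 9, one per natural cluster.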
6.2 CLARA
It can be observed that the major computational effort of PAM lies in determining the k medoids through iterative optimization. CLARA, though it follows the same principle, attempts to reduce this computational effort by relying on sampling to handle large datasets. Instead of finding representative objects for the entire dataset, CLARA draws a sample of the dataset, applies PAM to this sample, and finds the medoids of the sample. If the sample is drawn in a sufficiently random way, the medoids of the sample approximate the medoids of the entire dataset. The steps of CLARA are summarized below:
ALGORITHM
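The sampling procedure just described can be sketched in Python as follows (illustrative only; the function name and the defaults for the number of samples and the sample size are assumptions of this sketch, not values from the original):

```python
import itertools
import random

def clara(objects, k, d, samples=5, sample_size=40):
    """CLARA sketch: draw several random samples, run PAM on each
    sample, and keep the sample medoids that score best on the FULL
    dataset."""

    def cost(meds, data):
        return sum(min(d(o, m) for m in meds) for o in data)

    def pam_on(data):
        # The same swap-based PAM idea, applied to the sample only.
        meds = list(data[:k])
        improved = True
        while improved:
            improved = False
            for i, h in itertools.product(range(k), data):
                if h in meds:
                    continue
                cand = meds[:i] + [h] + meds[i + 1:]
                if cost(cand, data) < cost(meds, data):
                    meds, improved = cand, True
        return meds

    best, best_cost = None, float("inf")
    for _ in range(samples):
        sample = random.sample(objects, min(sample_size, len(objects)))
        meds = pam_on(sample)
        c = cost(meds, objects)            # judge medoids on everything
        if c < best_cost:
            best, best_cost = meds, c
    return best
```

The key design point is that PAM runs only on the sample, but each candidate set of medoids is evaluated against the whole dataset before being accepted as the best so far.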
6.3 CLARANS
CLARANS views the clustering as a search through a graph in which each node is a set of k medoids and two nodes are neighbours if their sets differ in exactly one object. Its steps are summarized below:
• Set mincost to a large number and e = 1.
• Do while (e ≤ numlocal)
o Set current to an arbitrary node and j = 1.
o Do while (j ≤ maxneighbour)
 Consider randomly a pair (i,h) such that Oi is a selected object and Oh is a non-selected object.
 Calculate the cost Cih.
 If Cih is negative, “update current”: mark Oi non-selected and Oh selected, and set j = 1;
 Else increment j ← j + 1.
o End do
o Compare the cost of the clustering of current with mincost:
 If current_cost < mincost
 mincost ← current_cost
 best_node ← current
o Increment e ← e + 1
• End do
• Return best_node.
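The randomized search above can be sketched in Python (illustrative; the function name and the parameter defaults are our own choices, and the cost of a node is evaluated as the total dissimilarity over the whole dataset):

```python
import random

def clarans(objects, k, d, numlocal=2, maxneighbour=20):
    """CLARANS sketch: a randomized search over 'nodes', where a node
    is a set of k medoids and a neighbour differs from it by one swap.
    numlocal restarts the search; maxneighbour bounds how many random
    neighbours are examined before a node is declared a local minimum."""

    def cost(meds):
        return sum(min(d(o, m) for m in meds) for o in objects)

    mincost, best_node = float("inf"), None
    for _ in range(numlocal):
        current = random.sample(objects, k)          # arbitrary node
        j = 1
        while j <= maxneighbour:
            # Random neighbour: swap one selected object for a non-selected one.
            i = random.randrange(k)
            h = random.choice([o for o in objects if o not in current])
            neighbour = current[:i] + [h] + current[i + 1:]
            if cost(neighbour) < cost(current):      # Cih is negative
                current, j = neighbour, 1            # move and restart count
            else:
                j += 1
        if cost(current) < mincost:
            mincost, best_node = cost(current), current
    return best_node
```

Unlike PAM, which examines every possible swap, this search only samples maxneighbour random swaps per node, which is what makes it feasible on large datasets at the price of possibly missing a true local minimum.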
7. ANALYSIS
PAM is very robust to the existence of outliers, and the clusters it finds do not depend on the order in which the objects are examined. However, it cannot handle very large data. CLARA samples the large dataset and applies PAM to the sample, so its result depends on the sample. CLARANS applies randomized iterative optimization to determine the medoids and can be applied to large datasets as well. It is more efficient than the earlier medoid-based methods but suffers from two major drawbacks: it assumes that all objects fit in main memory, and its result is very sensitive to the input order. In addition, it may not find a real local minimum, because the trimming of the search is controlled by ‘maxneighbour’.
8. CONCLUSION
The PAM algorithm is efficient and gives good results when the data is small; however, it cannot be applied to large datasets. CLARA's efficiency is determined by the sample of data taken in the sampling phase. CLARANS is efficient for large datasets. Since the datasets from which the required data is mined are large, CLARANS is the preferred partitional algorithm, more efficient than PAM and CLARA.