

DATA MINING AND DATA WAREHOUSING WITH SPECIAL REFERENCE
TO
PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING


Contents

1. Abstract

2. Keywords

3. Introduction

4. Clustering

5. Partitional Algorithms

6. K-medoid Algorithms

6.1 PAM

6.2 CLARA

6.3 CLARANS

7. Analysis

8. Conclusion



PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING

1. ABSTRACT

In the last few years there has been tremendous research interest in devising efficient data
mining algorithms, and clustering is an essential component of data mining techniques.
Interestingly, the special nature of data mining makes the classical clustering algorithms
unsuitable: the datasets involved are usually very large and need not be numeric, so importance
should be given to efficient input and output operations rather than to algorithmic complexity
alone. As a result, a number of clustering algorithms have been proposed for data mining in
recent years. The present paper gives a brief overview of the partitional clustering algorithms
used in data mining. The first part of the paper gives an overview of the clustering technique
as used in data mining; the second part discusses the different partitional clustering
algorithms used in mining of data.

2. KEYWORDS:
Knowledge discovery in databases, data mining, clustering, partitional
algorithms, PAM, CLARA, CLARANS.

3. INTRODUCTION:

Data mining is the non-trivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data. Knowledge discovery in databases (KDD) is a well-
defined process consisting of several distinct steps; data mining is the core step, the one that
results in the discovery of knowledge. Data mining is a high-level application technique used
to present and analyze data for decision-makers. There is an enormous wealth of information
embedded in the huge databases belonging to enterprises, and this has spurred tremendous
interest in the areas of knowledge discovery and data mining. The fundamental goals of data
mining are prediction and description. Prediction makes use of existing variables in the
database in order to predict unknown or future values of interest, while description focuses
on finding patterns describing the data and on the subsequent presentation for user
interpretation. There are several mining techniques for prediction and description; they are
categorized as association, classification, sequential patterns and clustering. The basic

premise of association is to find all associations such that the presence of one set of items in
a transaction implies the presence of other items. Classification develops profiles of different
groups. Sequential-pattern mining identifies sequential patterns subject to a user-specified
minimum support constraint. Clustering segments a database into subsets, or clusters.

4. Clustering

Clustering is a useful technique for discovery of data distribution and patterns in the
underlying data. The goal of clustering is to discover dense and sparse regions in a data set.
Data clustering has been studied in the statistics, machine learning, and database
communities with diverse emphases. There are two main types of clustering techniques:
partitional clustering techniques and hierarchical clustering techniques. Partitional
clustering techniques construct a partition of the database into a predefined number of clusters.
Hierarchical clustering techniques produce a sequence of partitions in which each partition is
nested into the next partition in the sequence.

[Figure: the same dataset before clustering (left) and after clustering (right)]

5. PARTITIONAL ALGORITHMS

Partitional algorithms construct a partition of a database of n objects into a set of k clusters.
The construction involves determining the optimal partition with respect to an objective
function. There are approximately kⁿ/k! ways of partitioning a set of n data points into k

subsets. An exhaustive enumeration method could find the globally optimal partition, but it is
practically infeasible unless n and k are very small. Partitional clustering algorithms
therefore usually adopt an iterative-optimization paradigm: starting with an initial partition,
an iterative control strategy tries swapping data points to see whether such a swap improves
the quality of the clustering. When no swap yields an improvement, a locally optimal partition
has been found. The quality of this clustering is very sensitive to the initially selected
partition. There are two main categories of partitioning algorithms:

• k-means algorithms, where each cluster is represented by the center of gravity of the
cluster.

• k-medoid algorithms, where each cluster is represented by one of the objects of the
cluster, located near its center.

Most of the special clustering algorithms designed for data mining are k-medoid algorithms.
The principal k-medoid algorithms are PAM, CLARA and CLARANS.
6. k-Medoid Algorithms

6.1 PAM
PAM (Partitioning Around Medoids) uses the k-medoid method to identify clusters. PAM selects k
objects arbitrarily from the data as medoids. In each step, a swap between a selected object Oi
and a non-selected object Oh is made as long as such a swap would result in an improvement of
the quality of the clustering. To calculate the effect of such a swap between Oi and Oh, a cost
Cih is computed, which is related to the quality of partitioning the non-selected objects into
k clusters represented by the medoids. At this stage it is therefore necessary first to
understand how the data objects are partitioned when a set of k medoids is given.

Partitioning
If Oj is a non-selected object and Oi is a medoid, we say that Oj belongs to the cluster
represented by Oi if d(Oj, Oi) = min_e d(Oj, Oe), where the minimum is taken over all medoids
Oe and d(Oa, Ob) denotes the distance, or dissimilarity, between objects Oa and Ob. The
dissimilarity matrix is known prior to the commencement of PAM. The quality of the clustering


is measured by the average dissimilarity between an object and the medoid of the cluster to
which the object belongs.
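
As a concrete illustration, here is a minimal Python sketch of this partitioning rule and of
the average-dissimilarity quality measure. The names (assign_to_medoids, average_dissimilarity)
are illustrative, and the dissimilarity d is assumed to be precomputed and indexable as
d[a][b], as the text assumes for PAM.

```python
def assign_to_medoids(objects, medoids, d):
    """Assign each non-selected object to the cluster of its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for o in objects:
        if o in medoids:
            continue
        # Oj joins the cluster of the medoid Oi minimizing d(Oj, Oi)
        nearest = min(medoids, key=lambda m: d[o][m])
        clusters[nearest].append(o)
    return clusters

def average_dissimilarity(clusters, d):
    """Quality of clustering: mean dissimilarity of objects to their medoids."""
    total, count = 0.0, 0
    for medoid, members in clusters.items():
        for o in members:
            total += d[o][medoid]
            count += 1
    return total / count if count else 0.0
```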

Iterative Selection of Medoids

Let us assume that O1, O2, …, Ok are the k medoids selected at any stage, and let C1, C2, …,
Ck denote the respective clusters. From the foregoing discussion, if a non-selected object Oj
belongs to Cm, then min_(1 ≤ i ≤ k) d(Oj, Oi) = d(Oj, Om). Let us now analyze the effect of
swapping a medoid Oi with a non-selected object Oh; in other words, let us compare the quality
of the clustering if we select the k medoids as O1, O2, …, Oi−1, Oh, Oi+1, …, Ok, where Oh
replaces Oi as one of the medoids. Due to the change in the set of medoids, three types of
changes can occur in the actual clustering (objects whose assignment is unaffected contribute
nothing):

• A non-selected object Oj such that Oj Є Ci before the swap and Oj Є Ch after it. This case
arises when min_e d(Oj, Oe) = d(Oj, Oi) before the swap and min_(e ≠ i) d(Oj, Oe) = d(Oj, Oh)
after it. Define the cost as Cjih = d(Oj, Oh) − d(Oj, Oi).

• A non-selected object Oj such that Oj Є Ci before the swap and Oj Є Cj΄, j΄ ≠ h, after it.
This case arises when min_e d(Oj, Oe) = d(Oj, Oi) before the swap and min_(e ≠ i) d(Oj, Oe) =
d(Oj, Oj΄) after it. Define the cost as Cjih = d(Oj, Oj΄) − d(Oj, Oi).

• A non-selected object Oj such that Oj Є Cj΄, j΄ ≠ i, before the swap and Oj Є Ch after it.
This case arises when min_e d(Oj, Oe) = d(Oj, Oj΄) before the swap and min_e d(Oj, Oe) =
d(Oj, Oh) after it. Define the cost as Cjih = d(Oj, Oh) − d(Oj, Oj΄).

Define the total cost of swapping Oi and Oh as Cih = Σj Cjih. If Cih is negative, then the
quality of the clustering is improved by making Oh a medoid in place of Oi. The process is
repeated until we cannot find a negative Cih.
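
The total cost can be sketched as follows. Rather than enumerating the three cases, this
(equivalent, though less incremental) formulation recomputes the total dissimilarity before
and after the proposed swap; up to the bookkeeping for Oi and Oh themselves, the difference is
exactly the sum of the per-object costs Cjih. A zero diagonal, d[a][a] = 0, is assumed.

```python
def total_dissimilarity(objects, medoids, d):
    """Sum, over all objects, of the dissimilarity to the nearest medoid.
    Medoids themselves contribute d[m][m] = 0 (zero diagonal assumed)."""
    return sum(min(d[o][m] for m in medoids) for o in objects)

def swap_cost(objects, medoids, i, h, d):
    """Total cost Cih of swapping medoid Oi with the non-selected object Oh.
    A negative value means the swap improves the clustering."""
    new_medoids = [h if m == i else m for m in medoids]
    return (total_dissimilarity(objects, new_medoids, d)
            - total_dissimilarity(objects, medoids, d))
```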

The algorithm can be stated as follows:


ALGORITHM

• Input: a database of objects D and the number of clusters k.

• Select arbitrarily k representative objects. Mark these objects as “selected” and mark
the remaining as “non-selected”.
• Repeat until no improving swap remains:
  Do for all selected objects Oi
    Do for all non-selected objects Oh
      Compute Cih
    End do
  End do
  Select imin, hmin such that Cimin,hmin = Min i,h Cih.
  If Cimin,hmin < 0,
    then mark Oimin as non-selected and Ohmin as selected, and repeat.
• Finally, assign each non-selected object to its nearest medoid to obtain the clusters
C1, C2, …, Ck.
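
Putting the pieces together, a minimal sketch of the PAM loop might look as follows; it reuses
swap_cost and assign_to_medoids from the sketches above, and the random initial selection
stands in for “select arbitrarily”.

```python
import random

def pam(objects, k, d, seed=None):
    """Sketch of PAM: arbitrary initial medoids, then greedy improving swaps."""
    rng = random.Random(seed)
    medoids = rng.sample(list(objects), k)
    while True:
        best_pair, best_cost = None, 0.0
        for i in medoids:
            for h in (o for o in objects if o not in medoids):
                c = swap_cost(objects, medoids, i, h, d)
                if c < best_cost:              # keep only strictly negative Cih
                    best_pair, best_cost = (i, h), c
        if best_pair is None:                  # no negative Cih: local optimum
            break
        i, h = best_pair
        medoids = [h if m == i else m for m in medoids]
    return medoids, assign_to_medoids(objects, medoids, d)
```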

6.2 CLARA

It can be observed that the major computational effort of PAM is to determine the k
medoids through an iterative optimization. CLARA, though it follows the same principle,
attempts to reduce the computational effort by relying on sampling to handle large datasets.
Instead of finding representative objects for the entire dataset, CLARA draws a sample of the
dataset, applies PAM to this sample and finds the medoids of the sample. If the sample is
drawn in a sufficiently random way, the medoids of the sample approximate the medoids of the
entire dataset. The steps of CLARA are summarized below:


ALGORITHM

• Input: a database of objects D, the number of clusters k, and the number of samples to draw.

• Repeat for each sample:
1. Draw a sample S ⊆ D randomly from D.
2. Call PAM(S, k) to get k medoids of the sample.
3. Classify the entire dataset D into C1, C2, …, Ck using these medoids.
4. Calculate the quality of the clustering as the average dissimilarity, and retain the
medoids with the best quality found so far.
• End.
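
A corresponding CLARA sketch, reusing the pam() function above. The defaults are illustrative:
five samples of size 40 + 2k are the values commonly suggested in the literature, but nothing
in the text fixes them.

```python
import random

def clara(objects, k, d, num_samples=5, sample_size=None, seed=None):
    """Sketch of CLARA: run PAM on samples, keep the medoids best for all of D."""
    rng = random.Random(seed)
    objects = list(objects)
    if sample_size is None:
        sample_size = min(len(objects), 40 + 2 * k)  # common heuristic size
    best_medoids, best_quality = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(objects, sample_size)
        medoids, _ = pam(sample, k, d)               # medoids of the sample
        clusters = assign_to_medoids(objects, medoids, d)  # classify all of D
        quality = average_dissimilarity(clusters, d)
        if quality < best_quality:
            best_medoids, best_quality = medoids, quality
    return best_medoids, best_quality
```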

6.3 CLARANS

CLARANS (Clustering Large Applications based on RANdomized Search) is similar to PAM, but it
applies a randomized iterative optimization for the determination of the medoids. It is easy
to see that in PAM, at every iteration, we examine k(N − k) swaps to determine the pair
corresponding to the minimum cost. CLARA, on the other hand, tries to examine fewer pairs by
restricting its search to a smaller sample of the database: if the sample size is s ≤ N, it
examines at most k(s − k) pairs at every iteration. CLARANS does not restrict the search to
any particular subset of objects, but neither does it search the entire dataset; it randomly
selects a few pairs for swapping at the current state. CLARANS, like PAM, starts with a
randomly selected set of k medoids. It checks at most maxneighbour pairs for swapping, and if
a pair with negative cost is found, it updates the medoid set and continues. Otherwise, it
records the current selection of medoids as a local optimum and restarts with a new randomly
selected medoid set to search for another local optimum. CLARANS stops after numlocal local
optimal medoid sets have been determined and returns the best among these.
ALGORITHM
ALGORITHM
• Input: D, k, maxneighbour and numlocal.
• Select arbitrarily k representative objects. Mark these objects as “selected” and all
other objects as “non-selected”. Call this set of medoids current.
• Set e = 1.
• Do while (e ≤ numlocal)
Set j = 1.

• Do while (j ≤ maxneighbour)
o Consider randomly a pair (i, h) such that Oi is a selected object and Oh is a
non-selected object.
o Calculate the cost Cih.
o If Cih is negative, update current: mark Oi as non-selected, Oh as selected,
and reset j = 1.
o Else increment j ← j + 1.
• End do
• Compare the cost of the clustering for current with mincost.
• If current_cost < mincost
o mincost ← current_cost
o best_node ← current
• Increment e ← e + 1.
• End do
• Return best_node.
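
A matching CLARANS sketch, reusing swap_cost and total_dissimilarity from above. The default
values of maxneighbour and numlocal are placeholders, not prescriptions from the text.

```python
import random

def clarans(objects, k, d, maxneighbour=250, numlocal=2, seed=None):
    """Sketch of CLARANS: examine up to maxneighbour random swaps per node,
    restart numlocal times, and return the best local optimum found."""
    rng = random.Random(seed)
    objects = list(objects)
    best_node, mincost = None, float("inf")
    for _ in range(numlocal):
        current = rng.sample(objects, k)       # random starting medoid set
        j = 1
        while j <= maxneighbour:
            i = rng.choice(current)
            h = rng.choice([o for o in objects if o not in current])
            if swap_cost(objects, current, i, h, d) < 0:
                current = [h if m == i else m for m in current]  # update current
                j = 1                          # reset the neighbour counter
            else:
                j += 1
        cost = total_dissimilarity(objects, current, d)
        if cost < mincost:
            mincost, best_node = cost, current
    return best_node, mincost
```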

7. ANALYSIS

PAM is very robust to the existence of outliers, and the clusters it finds do not depend on
the order in which the objects are examined. However, it cannot handle very large data. CLARA
samples the large dataset and applies PAM to this sample, so the result depends on the sample.
CLARANS applies a randomized iterative optimization for the determination of the medoids and
can be applied to large datasets as well. It is more efficient than the earlier medoid-based
methods, but it suffers from two major drawbacks: it assumes that all objects fit in main
memory, and the result is very sensitive to the input order. In addition, it may not find a
real local minimum, because the trimming of the search is controlled by maxneighbour.
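
To make the comparison concrete, the three sketches can be run side by side on a toy dataset;
the data below are synthetic and purely illustrative.

```python
import random

random.seed(0)
n, k = 30, 3
objects = list(range(n))
# synthetic 2-D points, purely for illustration
points = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(n)]
# precomputed Euclidean dissimilarity matrix, as PAM assumes
d = [[((points[a][0] - points[b][0]) ** 2
       + (points[a][1] - points[b][1]) ** 2) ** 0.5 for b in objects]
     for a in objects]

for name, medoids in [("PAM", pam(objects, k, d, seed=1)[0]),
                      ("CLARA", clara(objects, k, d, seed=1)[0]),
                      ("CLARANS", clarans(objects, k, d, seed=1)[0])]:
    clusters = assign_to_medoids(objects, medoids, d)
    print(name, "medoids:", sorted(medoids),
          "avg dissimilarity:", round(average_dissimilarity(clusters, d), 3))
```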


8. CONCLUSION

The PAM algorithm is efficient and gives good results when the data is small; however, it
cannot be applied to large datasets. CLARA's efficiency is determined by the sample of data
taken in the sampling phase. CLARANS is efficient for large datasets. As the datasets from
which the required data is mined are large, CLARANS is used in practice and is the more
efficient partitional algorithm compared to PAM and CLARA.

