
Data Mining: Clustering Approaches & Techniques

based on Real-life Data


24/01/2014

White Paper
BSNL CDR Project

Hrishav Bakul Barua
&
Anupam Roy
Telecom
hrishav.barua@tcs.com,
anupam1.r@tcs.com




Confidentiality Statement
Confidentiality and Non-Disclosure Notice
The information contained in this document is confidential and proprietary to TATA
Consultancy Services. This information may not be disclosed, duplicated or used for any
other purposes. The information contained in this document may not be released in
whole or in part outside TCS for any purpose without the express written permission of
TATA Consultancy Services.












Tata Code of Conduct
We, in our dealings, are self-regulated by a Code of Conduct as enshrined in the Tata
Code of Conduct. We request your support in helping us adhere to the Code in letter and
spirit. We request that any violation or potential violation of the Code by any person be
promptly brought to the notice of the Local Ethics Counselor or the Principal Ethics
Counselor or the CEO of TCS. All communication received in this regard will be treated
and kept as confidential.




Table of Contents
Abstract
About the Authors
1. Data Mining
1.1 Cluster Analysis
1.1.1 What Does a Good Clustering Technique/Algorithm Demand?
1.1.2 A Categorization of Major Clustering Approaches
1.1.3 Hierarchical Method
1.1.4 Partitioning Method
1.1.5 Density-Based Method
1.1.6 Grid-Based Methods
1.1.7 Constraint-Based Clustering
1.1.8 Clustering Over Multi-Density Data Space
1.1.9 Clustering Over Variable-Density Space
1.1.10 Clustering Higher Dimensional Data
1.1.11 Massive Data Clustering Using Distributed and Parallel Approach
1.1.12 How Are Clustering Algorithms Compared?
1.1.13 Cluster Validation
2. Conclusion
3. Acknowledgements
4. References


Abstract


Finding meaningful patterns and useful trends in large datasets has attracted considerable interest recently. One of the most widely studied problems in this area is the identification and formation of clusters, or densely populated regions, in a dataset. Cluster analysis divides data into meaningful or useful groups called clusters. The objective of this paper is to present a clear analysis and survey of the various existing clustering approaches and techniques, along with some of the famous and pioneering algorithms applied under these approaches. In doing so, this paper will bring to light the best of these techniques and show why they stand out among all the techniques.

In this paper, the technique of data clustering has been examined, which is a particular kind of data mining problem. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters [1]. Given a large set of data points (that is, data objects), the data space is usually not uniformly occupied. Data clustering identifies the sparse and the crowded places and hence discovers the overall distribution patterns of the dataset. Besides, the derived clusters can be visualised more efficiently and effectively than the original dataset. Mining knowledge from large amounts of spatial data is known as spatial data mining. It has become a highly demanding field because huge amounts of spatial data have been collected in various applications, ranging from geospatial and industrial data to biomedical knowledge. The amount of spatial data being collected is increasing exponentially and has far exceeded humans' ability to analyse it. Recently, clustering has been recognised as a primary data mining method for knowledge discovery in spatial databases. The development of clustering algorithms has received a lot of attention in the last few years, and new clustering algorithms continue to be proposed. A variety of algorithms have recently emerged that meet the requirements of data mining using cluster analysis and have been successfully applied to real-life data mining problems.

About the Authors

Hrishav Bakul Barua joined TCS on September 10, 2012. A student of Sikkim Manipal University (SMU), he has published his research work on Data Mining: Clustering Techniques in the International Journal of Computer Applications (FCS), New York, USA.
http://www.ijcaonline.org/archives/volume58/number2/9252-3418
Anupam Roy has four years of project experience at TCS. He is currently working on the BSNL CDR Project and pursuing an ME in Software Engineering from Jadavpur University, Kolkata. He has worked on attacks on distributed databases and intrusion detection/prevention systems.


1. Data Mining

Data mining refers to extracting or mining knowledge from large volumes of data. Many other terms carry a
similar or slightly different meaning, such as knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology and data dredging. Many people treat data mining as a
synonym for another popularly used term, Knowledge Discovery from Data, or KDD.

1.1 Cluster Analysis

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A
cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to
the objects in other clusters [1].

Clustering is the subject of active research in several fields, such as statistics, pattern recognition and machine
learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of
very large datasets with many attributes of different types, which imposes unique computational requirements on
relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and
have been successfully applied to real-life data mining problems; they are the subject of this survey. From a machine
learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the
resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in
data mining applications such as scientific data exploration, information retrieval and text mining, spatial database
applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, business management,
archaeology, insurance, libraries and many others. In recent years, due to the rapid increase in online documents,
text clustering has become important.

A distance (similarity or dissimilarity) function determines clustering quality: the inter-cluster distance should be maximised while the intra-cluster distance is minimised.
















Figure 1: Formation of Clusters
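
To make this criterion concrete, the following minimal Python sketch (our own illustration, not part of the original survey; the toy data and function names are assumptions) scores a candidate clustering by the ratio of mean intra-cluster distance to the smallest inter-centroid distance, so lower scores indicate better clusterings:

    import numpy as np

    def clustering_quality(points, labels):
        # Toy quality score: mean intra-cluster distance divided by the
        # smallest inter-centroid distance. Lower is better, since a good
        # clustering minimises intra-cluster distance and maximises
        # inter-cluster distance.
        clusters = [points[labels == k] for k in np.unique(labels)]
        centroids = np.array([c.mean(axis=0) for c in clusters])
        intra = np.mean([np.linalg.norm(c - c.mean(axis=0), axis=1).mean()
                         for c in clusters])
        inter = min(np.linalg.norm(centroids[i] - centroids[j])
                    for i in range(len(centroids))
                    for j in range(i + 1, len(centroids)))
        return intra / inter

    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
    good = np.repeat([0, 1], 50)          # the true grouping of the two blobs
    bad = rng.integers(0, 2, 100)         # a random grouping for comparison
    print(clustering_quality(pts, good))  # small value: compact and separated
    print(clustering_quality(pts, bad))   # large value: poor clustering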


1.1.1 What Does a Good Clustering Technique/Algorithm Demand?

A good clustering technique/algorithm demands the following:
Scalability: Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects. Clustering on a sample
of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based
(numerical) data. However, applications may require clustering other types of data, such as binary,
categorical (nominal) and ordinal data, or mixtures of these data types.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on
Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. However, a cluster could be of any shape. It is important
to develop algorithms that can detect clusters of arbitrary shape.

Minimal requirements for domain knowledge to determine input parameters: Many clustering
algorithms require users to input certain parameters in cluster analysis (such as the number of desired
clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult
to determine, especially for data sets containing high-dimensional objects. This not only burdens users,
but it also makes the quality of clustering difficult to control.

Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or
erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.

Incremental clustering and insensitivity to the order of input records: Some clustering algorithms
cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and
instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the
order of input data. That is, given a set of data objects, such an algorithm may return dramatically different
clusterings depending on the order of presentation of the input objects. It is important to develop
incremental clustering algorithms and algorithms that are insensitive to the order of input.

High dimensionality: A database or a data warehouse can contain several dimensions or attributes.
Many clustering algorithms are good at handling low-dimensional data, involving only two to three
dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding
clusters of data objects in high dimensional space is challenging, especially considering that such data
can be sparse and highly skewed.

Constraint-based clustering: Real-world applications may need to perform clustering under various
kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic
banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering
constraints such as the city's rivers and highway networks, and the type and number of customers per
cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified
constraints.

Interpretability and usability: Users expect clustering results to be interpretable, comprehensible and
usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of clustering features and methods.

Time Complexity: The time required for a particular clustering algorithm to run/execute and produce the
output.

Labeling or assignment: Hard or strict (each data object belongs to one and only one cluster) vs. soft or
fuzzy (each data object has a probability of belonging to each cluster).


1.1.2 A Categorization of Major Clustering Approaches

Hierarchical Method
Partitioning Method
Density-Based Methods
Grid-Based Methods
Methods Based on Co-Occurrence of Categorical Data
Constraint-Based Clustering
Clustering Algorithms Used in Machine Learning
Scalable Clustering Algorithms
Model-based Methods
Algorithms For High Dimensional Data
1.1.3 Hierarchical Method

Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram as
represented in the following figure:



Figure 2: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.

Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such
an approach allows exploring data on different levels of granularity. Hierarchical clustering methods are categorized
into agglomerative (bottom-up) and divisive (top-down). An agglomerative clustering starts with one-point (singleton)
clusters and recursively merges two or more most appropriate clusters. A divisive clustering starts with one cluster of
all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion
(frequently, the requested number k of clusters) is achieved.



Figure 3: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.



Advantages of hierarchical clustering include:
Embedded flexibility regarding the level of granularity
Ease of handling any forms of similarity or distance
Consequently, applicability to any attribute types

Disadvantages of hierarchical clustering are related to:
Vagueness of termination criteria
The fact that most hierarchical algorithms do not revisit once-constructed (intermediate) clusters with the
purpose of improving them

Hierarchical clustering based on linkage metrics results in clusters of proper (convex) shapes. Active contemporary
efforts to build cluster systems that incorporate our intuitive concept of clusters as connected components of arbitrary
shape, including the algorithms CURE and CHAMELEON [13], are surveyed in the sub-section Hierarchical Clusters
of Arbitrary Shapes. Divisive techniques based on binary taxonomies are presented in the sub-section Binary Divisive
Partitioning. The sub-section Other Developments contains information related to incremental learning, model-based
clustering and cluster refinement.

One of the most striking developments in hierarchical clustering is the algorithm BIRCH [8]. The data squashing used
by BIRCH to achieve scalability is of independent importance. Hierarchical clustering of large datasets can be very
sub-optimal, even if the data fits in memory. Compressing the data may improve the performance of hierarchical algorithms.
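
As a concrete illustration (a minimal sketch of our own, assuming SciPy is available; it is not code from any of the surveyed papers), agglomerative clustering can be run with a chosen linkage metric and the resulting tree cut at a requested number of clusters k:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    # Two groups of 2-D points, to be recovered from singletons upward.
    X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

    # Agglomerative (bottom-up) clustering: start from singleton clusters
    # and repeatedly merge the two closest clusters under the linkage metric.
    Z = linkage(X, method="average")   # Z encodes the dendrogram (merge tree)

    # Stopping criterion: cut the tree at the requested number of clusters k.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)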

1.1.4 Partitioning Method



In this section we survey data partitioning algorithms, which divide data into several subsets. Because checking all
possible subset systems is computationally infeasible, certain greedy heuristics are used in the form of iterative
optimization. Specifically, this means different relocation schemes that iteratively reassign points between the k
clusters. Unlike traditional hierarchical methods, in which clusters are not revisited after being constructed, relocation
algorithms gradually improve clusters. With appropriate data, this results in high quality clusters. One approach to data
partitioning is to take a conceptual point of view that identifies the cluster with a certain model whose unknown
parameters have to be found. More specifically, probabilistic models assume that the data comes from a mixture of
several populations whose distributions and priors we want to find. Corresponding algorithms are described in the
sub-section Probabilistic Clustering. One clear advantage of probabilistic methods is the interpretability of the
constructed clusters. Having a concise cluster representation also allows inexpensive computation of intra-cluster
measures of fit that give rise to a global objective function.


Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each
partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following
requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.
Most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each
cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each
cluster is represented by one of the objects located near the center of the cluster.
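
The k-means relocation scheme can be sketched in a few lines of NumPy (our own illustrative implementation, not a production version; it does not guard against empty clusters):

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialise centroids as k distinct randomly chosen data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Relocation step 1: assign each point to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Relocation step 2: each cluster is represented by the mean
            # value of the objects currently assigned to it.
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break                     # assignments stable: stop iterating
            centroids = new_centroids
        return labels, centroids

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 0.4, (30, 2)), rng.normal(3, 0.4, (30, 2))])
    labels, centroids = kmeans(X, k=2)
    print(centroids)   # approximately the two blob centers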


1.1.5 Density-Based Method

Most partitioning methods cluster objects based on the distance between objects. Such methods can find only
spherical-shaped clusters and encounter difficulty discovering clusters of arbitrary shapes. Other clustering
methods have been developed based on the notion of density. Their general idea is to continue growing a given
cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold; that
is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum
number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. The
density-based approach is famous for its capability of discovering arbitrarily shaped clusters of good quality even in
noisy datasets [2]. Figure 4 illustrates some cluster shapes that present a problem for partitioning relocation clustering
(e.g., k-means) but are handled properly by density-based algorithms. They also have good scalability.

Figure 4: Irregular shapes difficult for k-means

There are two major approaches for density-based methods. The first approach pins density to a training data point
and is reviewed in the sub-section Density-Based Connectivity. Representative algorithms include DBSCAN,
GDBSCAN, OPTICS and DBCLASD. The second approach pins density to a point in the attribute space and is
explained in the sub-section Density Functions. It includes the algorithm DENCLUE.

DBSCAN [2] and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-
based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value
distributions of density functions.
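
For instance (a sketch assuming scikit-learn is installed; the parameter values are illustrative assumptions), DBSCAN recovers two interleaved crescent shapes of the kind shown in Figure 4, while k-means splits them incorrectly:

    import numpy as np
    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.datasets import make_moons

    # Two crescent-shaped clusters: non-spherical, so distance-to-centroid
    # methods struggle while density-based connectivity succeeds.
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # MinPts = 5
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    print("DBSCAN clusters found:", len(set(db_labels) - {-1}))  # -1 marks noise
    print("k-means clusters:", len(set(km_labels)))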


1.1.6 Grid-Based Methods

The grid-based clustering approach uses a multi-resolution grid data structure. It quantizes the object space into a
finite number of cells that form a grid structure on which all of the operations for clustering are performed [3]. The main
advantage of the approach is its fast processing time, which is typically independent of the number of data objects
and depends only on the number of cells in each dimension of the quantized space.


There is a high probability that all data points falling into the same grid cell belong to the same cluster. Therefore, all
data points belonging to the same cell can be aggregated and treated as one object. It is due to this property that grid-
based clustering algorithms are computationally efficient: their cost depends on the number of cells in each dimension
of the quantized space. The approach has further advantages: the total number of grid cells is independent of the
number of data points, and it is insensitive to the order of the input data points.
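
The quantization step can be sketched as follows (our own NumPy illustration; the cell size is an assumed parameter): each point is mapped to a cell index, and per-cell counts replace the raw points, so subsequent work depends only on the number of occupied cells:

    import numpy as np
    from collections import Counter

    def grid_summary(points, cell_size):
        # Quantize 2-D points into square cells and aggregate per cell.
        # All points in one cell are treated as a single object (its count).
        cells = np.floor(points / cell_size).astype(int)
        return Counter(map(tuple, cells))   # cell index -> population

    rng = np.random.default_rng(3)
    pts = rng.normal(0, 1, (1000, 2))
    summary = grid_summary(pts, cell_size=0.5)
    # Dense cells (population above a threshold) would seed the clusters.
    print(len(summary), "occupied cells for 1000 points")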

Some of the popular grid-based clustering techniques are STING [4], WaveCluster [5], CLIQUE [6], pMAFIA [7] and so
on. CLIQUE [6] is a hybrid clustering method that combines the ideas of both the density-based and grid-based
approaches. pMAFIA [7] is an optimized and improved version of CLIQUE. It uses the concept of adaptive grids for
detecting clusters, and it scales exponentially with the dimension of the highest-dimensional cluster in the data set.

The algorithm STING (STatistical INformation Grid-based method) [4] works with numerical attributes (spatial data)
and is designed to facilitate region-oriented queries. In doing so, STING constructs data summaries in a way similar
to BIRCH [8]. It, however, assembles statistics in a hierarchical tree of nodes that are grid cells. Figure 5 presents the
proliferation of cells in 2-dimensional space and the construction of the corresponding tree. Each cell has four
(default) children and stores a point count and attribute-dependent measures: mean, standard deviation, minimum,
maximum and distribution type. Measures are accumulated starting from the bottom-level cells and are further
propagated to higher-level cells (e.g., the minimum is equal to the minimum among the children's minimums). Only the
distribution type presents a problem: a chi-square test is used after bottom-level cell distribution types are handpicked.
When the cell tree is constructed (in O(N) time), certain cells are identified and connected into clusters similarly to
DBSCAN. If the number of leaves is K, the cluster construction phase depends on K and not on N. This algorithm has
a simple structure suitable for parallelization and allows multi-resolution, though defining appropriate granularity is not
straightforward. STING has been further enhanced into the algorithm STING+ [9], which targets dynamically evolving
spatial databases and uses a hierarchical cell organization similar to its predecessor. In addition, STING+ enables
active data mining.

Figure 5: Cell generation and tree construction in STING

To do so, it supports user-defined trigger conditions (e.g., there is a region where at least 10 cellular phones are in use
per square mile with a total area of at least 10 square miles, or usage drops by 20% in a described region). The related
measures, called sub-triggers, are stored and updated over the hierarchical cell tree. They are suspended until the
trigger fires with a user-defined action. Four types of conditions are supported: absolute and relative conditions on
regions (a set of adjacent cells), and absolute and relative conditions on certain attributes.

1.1.7 Constraint-Based Clustering

In real-world applications customers are rarely interested in unconstrained solutions. Clusters are frequently subjected
to some problem-specific limitations that make them suitable for particular business actions. The building of such
conditioned cluster partitions is the subject of active research; for example, see the survey [10].


The framework for constraint-based clustering is introduced in [11]. The taxonomy of clustering constraints
includes constraints on individual objects (for example, customers who recently made a purchase) and parameter
constraints (like the number of clusters) that can be addressed through preprocessing or external cluster parameters.
The taxonomy also includes constraints on individual clusters that can be described in terms of bounds on aggregate
functions (min, avg, and so on) over each cluster. Another approach to building balanced clusters is to convert the task
into a graph partitioning problem [12].

An important constraint-based clustering application is clustering 2D spatial data in the presence of obstacles. Instead
of the regular Euclidean distance, the length of the shortest path between two points can be used as an obstacle
distance. The Clustering with Obstructed Distance (COD) algorithm [11] deals with this problem. It is best illustrated by
Figure 6, which shows the difference between constructing three clusters in the absence of an obstacle (left) and in
the presence of a river with a bridge (right).



Figure 6: Obstacle (river with the bridge) makes a difference
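
The obstructed-distance idea can be sketched as follows (a simplified illustration of our own, not the COD algorithm of [11]; the straight river and the single bridge are assumed geometry): when the straight line between two points crosses the obstacle, the path is rerouted through the bridge.

    import math

    RIVER_X = 5.0          # assumed: a vertical river along x = 5
    BRIDGE = (5.0, 2.0)    # assumed: a single bridge point on the river

    def obstructed_distance(p, q):
        # Shortest path from p to q that may cross the river only at the
        # bridge (a simplified stand-in for the general obstacle distance).
        crosses = (p[0] - RIVER_X) * (q[0] - RIVER_X) < 0
        if not crosses:
            return math.dist(p, q)            # no obstacle in the way
        return math.dist(p, BRIDGE) + math.dist(BRIDGE, q)

    # Same Euclidean gap, very different obstructed distances.
    print(obstructed_distance((4, 2), (6, 2)))   # near the bridge: short
    print(obstructed_distance((4, 9), (6, 9)))   # far from the bridge: long

Clustering with this distance in place of the Euclidean one is what pulls the clusters on the two banks apart, as in the right panel of Figure 6.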

1.1.8 Clustering Over Multi-Density Data Space

One of the main applications of clustering spatial databases is to find clusters of spatial objects that are close to
each other. Most traditional clustering algorithms try to discover clusters of arbitrary densities, shapes and sizes, but
very few clustering algorithms perform well when clustering multi-density datasets. This is partly because small
clusters with few points in a local area are likely to be missed by a global density threshold. TDCT [16] is a
triangle-density based clustering technique for large multi-density datasets as well as embedded clusters.

1.1.9 Clustering Over Variable-Density Space

Most real-life datasets have a skewed distribution and may also contain nested cluster structures, the discovery
of which is very difficult. Therefore, we discuss two density-based approaches, OPTICS [14] and EnDBSCAN [15],
which attempt to handle datasets with variable density successfully. OPTICS can identify embedded clusters over
varying density space. However, its execution time degrades for large datasets with variable density space, and it
cannot detect nested cluster structures successfully over massive datasets. In EnDBSCAN [15], an attempt is made
to detect embedded or nested clusters using an integrated approach. Based on our experimental analysis over very
large synthetic datasets, it has been observed that EnDBSCAN can detect embedded clusters; however, its
performance also degrades as the volume of data increases. EnDBSCAN is highly sensitive to the parameters
MinPts and ε. In addition to these parameters, OPTICS requires an additional parameter, ε'.

1.1.10 Clustering Higher Dimensional Data

Most of the clustering methods stated in section 1.1 are implemented on 2D spatial datasets, but clustering of 3D
spatial datasets is in high demand. For space research, geospatial data or 3D object detection, an efficient clustering
algorithm is required. CLIQUE is a dimension-growth subspace clustering method [12]: the process starts in
single-dimensional subspaces and extends to higher-dimensional ones. CLIQUE is a combination of the density-based
and grid-based clustering methods. In it, the data space is partitioned into non-overlapping rectangular units, and the
dense units among them are identified. 3D-CATD [17] is a clustering technique for massive numeric three-dimensional
(3D) datasets. The algorithm is based on the density approach and can detect global as well as embedded clusters.
Experimental results have been reported that establish the superiority of the algorithm on several synthetic data sets.
We have only considered three-dimensional objects here, but many real-life problems deal with dimensionalities
higher than 2D/3D datasets.


1.1.11 Massive Data Clustering Using Distributed and Parallel Approach

Parallel and distributed computing is expected to relieve current clustering methods from the sequential bottleneck,
providing the ability to scale to massive datasets and improving response time. Such algorithms divide the data into
partitions, which are processed in parallel. The results from the partitions are then merged. In [18], a Density Based
Distributed Clustering (DBDC) [21] algorithm was presented in which the data are first clustered locally at the different
sites, independently of each other. The aggregated information about the locally created clusters is extracted and
transmitted to a central site. At the central site, a global clustering is performed based on the local representatives,
and the result is sent back to the local sites. The local sites then update their clustering based on the global model,
that is, they merge two local clusters into one or assign local noise to global clusters. For both the local and the global
clustering, density-based algorithms are used. This approach is scalable to large datasets and gives clusters of good
quality. GDCT [19], [20] is a distributed algorithm for intrinsic cluster detection over large spatial data.
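
A much-simplified, single-machine sketch of this local-then-global scheme follows (our own illustration in the spirit of DBDC, using k-means at each step for brevity rather than the density-based algorithms the actual method employs; scikit-learn is assumed):

    import numpy as np
    from sklearn.cluster import KMeans

    def local_then_global(partitions, k_local=3, k_global=2):
        # Step 1: each "site" clusters its own partition independently and
        # forwards only its local representatives (centroids) to the center.
        reps = np.vstack([
            KMeans(n_clusters=k_local, n_init=10, random_state=0)
            .fit(part).cluster_centers_
            for part in partitions
        ])
        # Step 2: the central site clusters the local representatives to
        # form the global model, which would be sent back to the sites.
        return KMeans(n_clusters=k_global, n_init=10, random_state=0).fit(reps)

    rng = np.random.default_rng(4)
    data = np.vstack([rng.normal(0, 0.5, (200, 2)), rng.normal(6, 0.5, (200, 2))])
    sites = np.array_split(rng.permutation(data), 4)   # 4 simulated sites
    model = local_then_global(sites)
    print(model.cluster_centers_)

Only the representatives cross site boundaries, which is what keeps the communication cost low in the distributed setting.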


1.1.12 How Are Clustering Algorithms Compared?

There are many factors on the basis of which clustering algorithms are compared. A few of them are listed as follows:

The size of datasets
Number of clusters
Type of datasets
Type of software used for implementation
Time complexity of execution
Number of user-defined parameters
Noise handling accuracy


1.1.13 Cluster Validation

A large number of clustering algorithms have been developed to deal with specific applications. Several questions
arise like:
Which clustering algorithm is best suited for the application at hand?
How many clusters are there in the studied data?
Is there a better cluster scheme?

These questions are related to evaluating the quality of clustering results, that is, cluster validation. Cluster
validation is a procedure of assessing the quality of clustering results and finding a cluster strategy that fits a specific
application. It aims at finding the optimal cluster scheme and interpreting the cluster patterns.

Cluster validation is an indispensable part of cluster analysis, because no clustering algorithm can guarantee the
discovery of genuine clusters from real datasets, and different clustering algorithms often impose different cluster
structures on a data set even if there is no cluster structure present in it. Cluster validation is needed in data mining to
solve the following problems:



To measure a partition of a real data set generated by a clustering algorithm
To identify the genuine clusters from the partition
To interpret the clusters


Generally speaking, cluster validation approaches are classified into the following three categories:
Internal approaches
Relative approaches
External approaches

The cluster validation methods are discussed as follows:

1.1.13.1 Internal Approaches

Internal cluster validation evaluates the quality of clusters using statistics devised to capture the quality of the
induced clusters from the available data objects only. In other words, internal cluster validation excludes any
information beyond the clustering data and focuses solely on assessing cluster quality based on the data themselves.

Statistical methods of quality assessment are employed as internal criteria; for example, the root-mean-square
standard deviation (RMSSTD) is used for the compactness of clusters, R-squared (RS) for the dissimilarity between
clusters, and S_Dbw for a compound evaluation of compactness and dissimilarity [1]. The formulas of RMSSTD, RS
and S_Dbw are shown below.




RMSSTD = sqrt( Σ_{i=1..n_c} Σ_{j=1..d} Σ_{k=1..n_ij} (x_k - x_j)^2 / Σ_{i=1..n_c} Σ_{j=1..d} (n_ij - 1) )    (Formula 1)

where x_j is the expected value in the j-th dimension; n_ij is the number of elements of the i-th cluster in the j-th
dimension; n_j is the number of elements in the j-th dimension over the whole data set; n_c is the number of clusters;
and d is the number of dimensions.



RS = (SS_t - SS_w) / SS_t    (Formula 2)

where SS_t is the total sum of the squared distances of all data points from the grand mean of the data set, and
SS_w is the sum of the squared distances of the data points from their respective cluster means.

The formula of S_Dbw is given as:

S_Dbw = Scat(c) + Dens_bw(c)    (Formula 3)

where Scat(c) is the average scattering within the c clusters. Scat(c) is defined as:

Scat(c) = (1/c) Σ_{i=1..c} ||σ(v_i)|| / ||σ(S)||    (Formula 4)

The value of Scat(c) is the degree to which the data points are scattered within clusters; it reflects the compactness of
clusters. The term σ(S) is the variance of the data set, and the term σ(v_i) is the variance of cluster c_i. Dens_bw(c)
indicates the average number of points between the c clusters (that is, an indication of inter-cluster density) in relation
to the density within clusters. The formula of Dens_bw is given as:

Dens_bw(c) = [1 / (c(c-1))] Σ_{i=1..c} Σ_{j=1..c, j≠i} density(u_ij) / max{density(v_i), density(v_j)}    (Formula 5)

where u_ij is the middle point of the line segment between the centers of clusters v_i and v_j. The density function of
a point is defined as the number of points around that specific point within a given radius.
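
Both RMSSTD and RS can be computed directly from their definitions; the following NumPy sketch is our own illustration (variable names are assumptions):

    import numpy as np

    def rmsstd(X, labels):
        # Pooled within-cluster variation across all dimensions.
        # Lower values mean more compact clusters.
        num, den = 0.0, 0.0
        for k in np.unique(labels):
            C = X[labels == k]
            num += ((C - C.mean(axis=0)) ** 2).sum()
            den += (len(C) - 1) * X.shape[1]   # (n_ij - 1) summed over dims
        return np.sqrt(num / den)

    def r_squared(X, labels):
        # RS = (SS_t - SS_w) / SS_t: the fraction of total variation
        # explained by the clustering. Higher means better separation.
        sst = ((X - X.mean(axis=0)) ** 2).sum()
        ssw = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                  for k in np.unique(labels))
        return (sst - ssw) / sst

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(4, 0.3, (40, 2))])
    labels = np.repeat([0, 1], 40)
    print(rmsstd(X, labels), r_squared(X, labels))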


1.1.13.2 Relative Approaches

Relative assessment compares two structures and measures their relative merit. The idea is to run the clustering
algorithm for a range of parameter values (for example, for each possible number of clusters) and identify the
clustering scheme that best fits the dataset; that is, the clustering results are assessed by applying an algorithm with
different parameters on a data set and finding the optimal solution. In practice, relative criteria methods also use
RMSSTD, RS and S_Dbw to find the best cluster scheme, in terms of compactness and dissimilarity, among all the
clustering results. Relative cluster validity is also called cluster stability, and recent work on relative cluster validity is
available in the literature.



1.1.13.3 External Approaches

The results of a clustering algorithm are evaluated based on a pre-specified structure, which reflects the user's
intuition about the clustering structure of the data set. As a necessary post-processing step, external cluster validation
is a procedure of hypothesis testing: given a set of class labels produced by a cluster scheme, it is compared with the
clustering results obtained by applying the same cluster scheme to the other partitions of a database, as shown in
Figure 7.


Figure 7: External criteria based validation

External cluster validation is based on the assumption that an understanding of the output of the clustering algorithm
can be achieved by finding a resemblance of the clusters to existing classes. Statistical methods for quality
assessment are employed in external cluster validation, such as the Rand statistic, the Jaccard coefficient, the
Fowlkes and Mallows index, Hubert's Γ statistic and the normalized Γ statistic, and Monte Carlo methods, to measure
the similarity between the a priori modeled partitions and the clustering results of a dataset.
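
For example, the Rand statistic and the Jaccard coefficient count agreements between pairs of points under two partitions; the sketch below is our own, written from the standard pair-counting definitions:

    from itertools import combinations

    def rand_and_jaccard(labels_a, labels_b):
        # Compare two labelings of the same points by pair-counting.
        # ss: pairs together in both; dd: pairs apart in both;
        # sd/ds: pairs on which the two partitions disagree.
        ss = sd = ds = dd = 0
        for i, j in combinations(range(len(labels_a)), 2):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            if same_a and same_b:
                ss += 1
            elif same_a:
                sd += 1
            elif same_b:
                ds += 1
            else:
                dd += 1
        rand = (ss + dd) / (ss + sd + ds + dd)
        jaccard = ss / (ss + sd + ds)
        return rand, jaccard

    truth = [0, 0, 0, 1, 1, 1]   # a priori modeled partition
    found = [0, 0, 1, 1, 1, 1]   # clustering result to validate
    print(rand_and_jaccard(truth, found))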

Based on our selected survey and experimental analysis, it has been observed that:

The density-based approach is most suitable for quality cluster detection over massive datasets in 2D, 3D or
higher dimensions.
The grid-based approach is suitable for fast processing of large datasets in 2D, 3D or higher dimensions.
Almost all clustering algorithms require input parameters, which are very difficult to determine, especially
for real-world data sets containing high-dimensional objects. Moreover, the algorithms are highly sensitive to
those parameters.
The distributions of most real-life datasets are skewed, so handling such datasets for qualitative cluster
detection based on a single global input parameter seems impractical.
Only some of the techniques falling under the density or density-grid hybrid approaches (TDCT, GDCT, DGCL,
etc.) are capable of qualitatively handling multi-density datasets as well as multiple intrinsic or nested clusters
over massive datasets.
Only a few of the techniques (especially those under the grid-based approach) can handle higher-dimensional
datasets.
Algorithms under the density-based as well as the grid-based approaches employ fewer user-defined
parameters.
The density- and grid-based approaches can handle the single-linkage problem well and can detect
multi-density as well as embedded clusters.



A tabular comparison of various pioneering clustering algorithms under various approaches is represented as
follows:

Table 1: Clustering Algorithms

Approach | Sl. No. | Algorithm | No. of Parameters | Optimized for | Structure | Multi-Density Clusters | Embedded Clusters | Complexity | Noise Handling
Partitioning | 1 | K-means | No. of clusters | Separated clusters | Spherical | No | No | O(l_t kN) | No
Partitioning | 2 | K-medoids | No. of clusters | Separated clusters, large-valued objects | Spherical | No | No | O(k(N-k)^2) | No
Partitioning | 3 | K-modes | No. of clusters | Separated clusters, large datasets | Spherical | No | No | O(l_t k(N-k)^2) | No
Partitioning | 4 | FCM (Fuzzy C-means Clustering) | No. of clusters | Separated clusters | Non-convex shapes | No | No | O(N) | No
Partitioning | 5 | PAM (Partitioning Around Medoids) | No. of clusters | Separated clusters, large datasets | Spherical | No | No | O(l_t k(N-k)^2) | No
Partitioning | 6 | CLARA (Clustering LARge Applications) | No. of clusters | Relatively large datasets | Spherical | No | No | O(ks^2 + k(N-k)) | No
Partitioning | 7 | CLARANS (Clustering Algorithm based on RANdomized Search) | No. of clusters, max. no. of neighbors | Better than PAM & CLARA | Spherical | No | No | O(kN^2) | No
Hierarchical | 1 | BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) | Branching factor, diameter, threshold | Large data | Spherical | No | No | O(N) | Yes
Hierarchical | 2 | CURE (Clustering Using REpresentatives) | No. of clusters, no. of representatives | Any-shaped large data | Arbitrary | No | No | O(N^2 log N) | Yes
Hierarchical | 3 | ROCK (RObust Clustering using linKs) | No. of clusters | Small noisy data | Arbitrary | No | No | O(N^2 + N m_m m_a + N^2 log N) | Yes
Hierarchical | 4 | CHAMELEON | 3 (k-nearest neighbors, MIN-SIZE, α) | Small datasets | Arbitrary | Yes | No | O(N^2) | Yes
Density-Based | 1 | DBSCAN (Density Based Spatial Clustering of Applications with Noise) | 2 (MinPts, ε) | Large datasets | Arbitrary | No | No | O(N log N) using R*-tree | Yes
Density-Based | 2 | OPTICS (Ordering Points To Identify the Clustering Structure) | 3 (MinPts, ε, ε') | Large datasets | Arbitrary | Yes | Yes | O(N log N) using R*-tree | Yes
Density-Based | 3 | DENCLUE | 2 (MinPts, ε) | Large datasets | Arbitrary | No | No | O(N log N) using R*-tree | Yes
Density-Based | 4 | TDCT (Triangle-Density Clustering Technique) | 2 | Large spatial datasets | Arbitrary | Yes | Yes | O(n_c^2 m N) | Yes
Density-Based | 5 | 3D-CATD (3-Dimensional Clustering Algorithm using Tetrahedron Density) | 2 | Large datasets, 3D datasets | Arbitrary | Yes | Yes | O(n_c m N) | Yes
Grid-Based | 1 | WaveCluster | No. of cells per dimension, no. of applications of the transform | Any shape, large data | Any | Yes | No | O(N) | Yes
Grid-Based | 2 | STING | No. of cells at the lowest level, no. of objects per cell | Large spatial datasets | Vertical and horizontal boundaries | No | No | O(N) | Yes
Grid-Based | 3 | CLIQUE | Grid size, minimum no. of points per grid cell | High-dimensional, large datasets | Arbitrary | No | No | O(N) | Yes
Grid-Based | 4 | MAFIA | Grid size, minimum no. of points per grid cell | High-dimensional, large datasets | Arbitrary | No | No | O(c^k) | Yes
Grid-Density Hybrid | 1 | GDCT (Grid-Density Clustering Technique) | 2 (n, ε) | Large datasets, 2D datasets | Arbitrary | Yes | Yes | O(N/k + t) | Yes
Grid-Density Hybrid | 2 | GDCT using distributed computing | 2 (n, ε) | Large datasets, 2D datasets | Arbitrary | Yes | Yes | O(N) | Yes
Grid-Density Hybrid | 3 | DisClus (Distributed Clustering) | 2 (n, ε) | High-resolution multi-spectral satellite datasets | Arbitrary | Yes | Yes | O(N) | Yes
Graph-Based | 1 | AUTOCLUST | NIL | Massive data | Arbitrary | No | No | O(N log N) | Yes




2. Conclusion

Clustering lies at the heart of data analysis and data mining applications. The ability to discover highly correlated
regions of objects when their number becomes very large is highly desirable, as data sets grow and their properties
and data interrelationships change. Every research paper that presents a new clustering technique claims its
superiority over other techniques, which makes it hard to judge how well a technique will actually work. In this paper
we described the process of clustering from the data mining point of view. We gave the properties of a good clustering
technique and the methods used to find meaningful partitionings. We have also presented a selected survey of various
clustering approaches and the pioneering algorithms under these approaches. From the survey we can conclude that
the density-based and grid-based clustering approaches can produce optimal solutions in clustering. Density-based
clustering techniques can find clusters of any shape and size in large datasets, with good noise handling and fewer
parameters. Grid-based techniques can find clusters with very low time complexity, as they process datasets very
quickly. So, the density-grid hybrid clustering approach can be one of the best solutions for any kind of clustering
problem. GDCT, DGCL and DisClus, which fall into this category, are some of the best algorithms in the arena.


The clusters obtained from the techniques discussed can further be refined by smoothing the cluster boundaries. This
can be performed by employing membership functions and fuzzy logic on the boundary data points to find the
probability and membership of these points with respect to the clusters, and hence predicting the exact cluster each
point belongs to.
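
As a closing illustration of this refinement idea (our own sketch based on the standard fuzzy c-means membership formula, not a method taken from the surveyed papers), the membership of a boundary point in each cluster can be computed from its distances to the cluster centers:

    import numpy as np

    def fuzzy_memberships(point, centers, m=2.0):
        # FCM-style membership of one point in each cluster:
        # u_i = 1 / sum_j (d_i / d_j)^(2/(m-1)), where d_i is the distance
        # to center i and m > 1 is the fuzzifier. Assumes the point does
        # not coincide exactly with any center.
        d = np.linalg.norm(centers - point, axis=1)
        ratios = (d[:, None] / d[None, :]) ** (2.0 / (m - 1.0))
        return 1.0 / ratios.sum(axis=1)

    centers = np.array([[0.0, 0.0], [4.0, 0.0]])
    print(fuzzy_memberships(np.array([1.0, 0.0]), centers))  # leans to cluster 0
    print(fuzzy_memberships(np.array([2.0, 0.0]), centers))  # boundary: ~0.5/0.5

Boundary points with near-equal memberships are exactly the ones whose assignment the smoothing step would reconsider.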






3. Acknowledgements

We would sincerely like to thank Mr. Sarbeswar Das, Project Manager, BSNL CDR Project, for his encouragement to
write this paper and for sharing his experience and expertise.

Hrishav & Anupam



4. References

[1] J. Han and M. Kamber, (2004), Data Mining: Concepts and Techniques. India: Morgan Kaufmann Publishers.

[2] M. Ester, H. P. Kriegel, J. Sander and X. Xu, (1996), A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise, in International Conference on
Knowledge Discovery in Databases and Data Mining (KDD-96), Portland, Oregon, pp. 226-231.

[3] C. Hsu and M. Chen, (2004), Subspace Clustering of High Dimensional Spatial Data with
Noises, PAKDD, pp. 31-40.

[4] W. Wang, J. Yang and R. R. Muntz, (1997), STING: A Statistical Information Grid Approach to
Spatial Data Mining, in Proc. 23rd International Conference on Very Large Databases (VLDB),
Athens, Greece, Morgan Kaufmann Publishers, pp. 186-195.

[5] G. Sheikholeslami, S. Chatterjee and A. Zhang, (1998), WaveCluster: A Multi-resolution
Clustering Approach for Very Large Spatial Databases, in SIGMOD'98, Seattle.

[6] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, (1998), Automatic Subspace Clustering
of High Dimensional Data for Data Mining Applications, in SIGMOD Record, ACM Special Interest
Group on Management of Data, pp. 94-105.

[7] H. S. Nagesh, S. Goil and A. N. Choudhary, (2000), A Scalable Parallel Subspace Clustering
Algorithm for Massive Data Sets, in Proc. International Conference on Parallel Processing, pp. 477.

[8] Tian Zhang, Raghu Ramakrishnan and Miron Livny, (1996), BIRCH: An Efficient Data Clustering
Method for Very Large Databases, in Proceedings of the 1996 ACM SIGMOD International
Conference on Management of Data, pp. 103-114, ACM, New York, NY, USA.

[9] W. Wang, J. Yang and R. R. Muntz, (1999), STING+: An Approach to Active Spatial Data
Mining, in Proceedings 15th ICDE, pp. 116-125, Sydney, Australia.

[10] J. Han, M. Kamber and A. K. H. Tung, (2001), Spatial Clustering Methods in Data Mining: A
Survey, in Miller, H. and Han, J. (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor
and Francis.

[11] A. K. H. Tung, R. T. Ng, L. V. S. Lakshmanan and J. Han, (2001), Constraint-Based
Clustering in Large Databases, in Proceedings of the 8th ICDT, London, UK.

[12] A. Strehl and J. Ghosh, (2000), A Scalable Approach to Balanced, High-Dimensional
Clustering of Market Baskets, in Proceedings of the 17th International Conference on High
Performance Computing, Springer LNCS, pp. 525-536, Bangalore, India.


[13] L. Ertoz, M. Steinbach and V. Kumar, (2003), Finding Clusters of Different Sizes, Shapes, and
Densities in Noisy, High Dimensional Data, in SIAM International Conference on Data Mining
(SDM '03).

[14] M. Ankerst, M. M. Breunig, H. P. Kriegel and J. Sander, (1999), OPTICS: Ordering Points To
Identify the Clustering Structure, in ACM SIGMOD, pp. 49-60.

[15] S. Roy and D. K. Bhattacharyya, (2005), An Approach to Find Embedded Clusters Using
Density Based Techniques, in Proc. ICDCIT, LNCS 3816, pp. 523-535.

[16] Hrishav Bakul Barua, Dhiraj Kumar Das and Sauravjyoti Sarmah, (2012), A Density Based
Clustering Technique For Large Spatial Data Using Polygon Approach (TDCT), IOSR Journal of
Computer Engineering (IOSRJCE), ISSN: 2278-0661, Volume 3, Issue 6 (July-Aug. 2012), pp. 01-10.

[17] Hrishav Bakul Barua and Sauravjyoti Sarmah, (2012), An Extended Density Based
Clustering Algorithm for Large Spatial 3D Data Using Polyhedron Approach (3D-CATD),
International Journal of Computer Applications 58(2):4-15, November 2012. Published by
Foundation of Computer Science, New York, USA (ISBN: 973-93-80871-32-3), (ISSN: 0975-8887).

[18] E. Januzaj, H. P. Kriegel and M. Pfeifle, (2003), Towards Effective and Efficient Distributed
Clustering, Workshop on Clustering Large Data Sets, ICDM'03, Melbourne, Florida.

[19] S. Sarmah, R. Das and D. K. Bhattacharyya, (2007), Intrinsic Cluster Detection Using Adaptive
Grids, in Proc. ADCOM'07, Guwahati.

[20] S. Sarmah, R. Das and D. K. Bhattacharyya, (2008), A Distributed Algorithm for Intrinsic Cluster
Detection over Large Spatial Data: A Grid-Density Based Clustering Technique (GDCT), World
Academy of Science, Engineering and Technology 45, pp. 856-866.

[21] E. Januzaj et al., (2003), Towards Effective and Efficient Distributed Clustering, in
Proceedings of ICDM 2003.










Thank You












Contact
For more information, contact
hrishav.barua@tcs.com,
anupam1.r@tcs.com

About Tata Consultancy Services (TCS)
Tata Consultancy Services is an IT services, consulting and business solutions
organization that delivers real results to global business, ensuring a level of certainty no
other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-
enabled infrastructure, engineering and assurance services. This is delivered through its
unique Global Network Delivery Model (TM), recognized as the benchmark of excellence in
software development. A part of the Tata Group, India's largest industrial conglomerate,
TCS has a global footprint and is listed on the National Stock Exchange and Bombay
Stock Exchange in India.
For more information, visit us at www.tcs.com.

IT Services
Business Solutions
Consulting

All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content /
information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced,
republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS.
Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws,
and could result in criminal or civil penalties. Copyright 2011 Tata Consultancy Services Limited
