
Comparative Analysis of PSO and WDO Algorithms for the Application of Clustering in Data Mining

By Dr. Surender Singh


Chandigarh University
INTRODUCTION- DATA MINING
• “Data Mining” is the process of analyzing very large
data sets to extract and discover previously unknown
structures and relations from such huge heaps of
detail and to transform them into useful information.
• Data Mining consists of all methods and techniques
that fall under this definition, covering link
analysis/associations, sequential patterns, time-series
analysis, classification by decision trees or neural
networks, cluster analysis and scoring models.
DATA MINING FUNCTIONS
• Concept description- Concept description uses traditional
statistical methods to compute summary measures of the data items in
the database, such as totals, means and variances, in order to condense
the data and compare it with other objects.

• Correlation analysis- Correlation analysis finds interesting
associations or relations among the attributes of the large volumes of
continuously collected and stored data. This helps in building data
models and establishing association rules for mining.

• Classification and prediction- Classification and
prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future trends
from measured data.
DATA MINING FUNCTIONS
• Cluster analysis- Cluster analysis categorizes the data according to the principle of
maximum intra-cluster similarity and minimum inter-cluster similarity. It helps in managing and
analyzing huge amounts of data by grouping similar-looking data into one cluster. Clustering also
makes it easy to arrange the contents of a collection into a hierarchical structure that organizes
similar items together.

• Outlier analysis- A database may contain data objects whose general behavior
is inconsistent with the rest of the data or with the model. These data objects are called outliers.
Many data mining algorithms attempt to minimize the impact of outliers; however, in some
applications these isolated points (outliers) may carry a very important message.

• Time series analysis- Here we map or visualize how data attribute values
change over time, looking for historical patterns that help in studying the behavioral
characteristics of events. It is also used to forecast future values from the historical data of a
time series.
CLUSTER ANALYSIS
• Clustering is a division of data into groups of similar objects. A
cluster consists of objects that are similar to one another
and dissimilar to the objects of other clusters. From a mathematical
perspective, clusters correspond to data models, and from a
machine learning perspective, clusters correspond to hidden
patterns. The search for clusters is unsupervised learning, and
the resulting system represents a data concept.
• Clustering and classification are both fundamental tasks in Data
Mining. Classification is used mostly as a supervised learning
method, while clustering is used for unsupervised learning. The goal of
clustering is descriptive, while that of classification is predictive.
Cluster Analysis Techniques
• A clustering algorithm/technique assigns a large
number of data points to a smaller number of
groups such that data points in the same group
share similar properties, while data points in
different groups are dissimilar.
• Clustering has many applications, including part
family formation for group technology, image
segmentation, information retrieval, web pages
grouping, market segmentation, and scientific and
engineering analysis.
Cluster Analysis Techniques
• Farley and Raftery [Farley1998] and Berkhin
[Pave2002] suggest dividing the clustering
methods into two main groups: hierarchical and
partitioning methods.
• Han and Kamber [Han2001] suggest three
additional main categories: density-based
methods, model-based clustering and grid-based methods.
• [Farley1998] Farley C. and Raftery A.E., “How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster
Analysis”, Technical Report No. 329. Department of Statistics University of Washington, 1998.
• [Han2001] Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
Cluster Analysis Techniques
• Hierarchical Methods construct the clusters by recursively
partitioning the instances in either a top-down or bottom-up
fashion. [Jain1999]
– Agglomerative hierarchical clustering (Bottom-up)
– Divisive hierarchical clustering (top-down)
• Partitioning methods relocate instances by moving them from
one cluster to another, starting from an initial partitioning.
Such methods typically require the number of clusters to be
pre-set by the user (a minimal sketch of k-means follows the references below).
– K-means [Selim1984]
– K-medoid [Kaufman1987]
[Jain1999] Jain, A.K. Murty, M.N. and Flynn, P.J. Data Clustering: A Survey. ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[Kaufman1987] Kaufman, L. and Rousseeuw, P.J., Clustering by Means of Medoids, In Y. Dodge, editor, Statistical Data Analysis, based on the
L1 Norm, pp. 405- 416, 1987, Elsevier/North Holland, Amsterdam.
[Selim1984] Selim, S.Z., and Ismail, M.A. K-means-type algorithms: a generalized convergence theorem and characterization of local
optimality. In IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, no. 1, January 1984.
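As a concrete illustration of a partitioning method, a minimal k-means loop is sketched below in Python (an illustrative sketch only, not the implementation used in this work; the function name kmeans, the random initialization and the default parameters are assumptions):

import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    # data: (n_points, n_features) array; k: user-specified number of clusters.
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Relocate each centroid to the mean of the points assigned to it.
        new_centroids = np.array([data[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when centroids settle
            break
        centroids = new_centroids
    return centroids, labels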
Cluster Analysis Techniques
• Graph-theoretic methods are methods that produce clusters via graphs. The
edges of the graph connect the instances represented as nodes. A well-known
graph-theoretic algorithm is based on the Minimal Spanning Tree (MST)
[Zahn1971][Urquhart1982].
• Density-based methods assume that the points that belong to each cluster are
drawn from a specific probability distribution [Banfield1993]. The overall
distribution of the data is assumed to be a mixture of several distributions.
• Model-based clustering methods attempt to optimize the fit between the
given data and some mathematical model. Unlike conventional clustering,
which identifies groups of objects, model-based clustering methods also find
characteristic descriptions for each group, where each group represents a
concept or class. The most frequently used induction methods are decision
trees and neural networks [Fisher1987].
[Banfield1993] Banfield J. D. and Raftery A. E. . Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803-821, 1993.
[Fisher1987] Fisher, D., Knowledge acquisition via incremental conceptual clustering, in machine learning 2, pp. 139-172, 1987.
[Urquhart1982] Urquhart, R. Graph-theoretical clustering, based on limited neighborhood sets. Pattern recognition, vol. 15, pp. 173-187, 1982.
[Zahn1971] Zahn, C. T., Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. C-20 (Apr.), 68-86, 1971.
Soft Computing Techniques for Clustering
• Artificial Neural Networks are a powerful method for clustering and
visualization of high-dimensional data [Kohonen1990].
• Fuzzy soft clustering schema, in which each pattern is related to every cluster through a
membership function; that is, each cluster is a fuzzy set of all the patterns
[Hoppner2000].
• Evolutionary techniques are stochastic general-purpose methods for solving
optimization problems. Since the clustering problem can be defined as an optimization
problem, evolutionary approaches may be appropriate here. The idea is to use
evolutionary operators and a population of clustering structures to converge to a
globally optimal clustering. Candidate clusterings are encoded as chromosomes.
Various techniques are Genetic Algorithms (GA), Simulated Annealing (SA) and Particle
Swarm Optimization (PSO) [Selim1991][A1-Sultan1995].
[A1-Sultan1995] A1-Sultan K. S., A tabu search approach to the clustering problem, Pattern Recognition, 28:1443-1451,1995.
[Hoppner2000] Hoppner F. , Klawonn F., Kruse R., Runkler T., Fuzzy Cluster Analysis,Wiley, 2000.
[Kohonen1990] Teuvo Kohonen, “The Self-Organizing Map,” Proceedings of IEEE, vol. 78, no. 9, Sep 1990, pp. 1464-1480
[Selim1991] Selim, S. Z. and Al-Sultan, K. 1991. A simulated annealing algorithm for the clustering problem. Pattern Recognition, 24(10), 1003-1008, 1991.
PSO Techniques for Clustering
Another frequently used evolutionary algorithm is Particle Swarm Optimization (PSO). It is a
population-based stochastic optimization technique that can be used to find an optimal,
or near-optimal, solution to numerical and qualitative problems [Kennedy1995][Yuhui1998].
Several attempts have been proposed in the literature to apply PSO to the data clustering
problem [Steinbach2000, Alwee2009, Engelbrecht2003, Sherin2007 and Mariam2013].

In the PSO clustering algorithm, the whole dataset can be represented as a
multi-dimensional space containing a large number of points. One particle in the swarm
represents one possible solution for clustering the dataset. Each particle maintains a
matrix Xi = (C1, C2, ..., Cj, ..., Ck), where Cj represents the jth cluster centroid vector and k
represents the total number of clusters. According to its own experience and that of its
neighbors, the particle adjusts the centroid vector positions in the vector space at each
generation. The average distance of the data objects to their cluster centroid is used as the
fitness value to evaluate the solution represented by each particle. The fitness value is
measured by the Euclidean distance or the Sum of Squared Errors (SSE) between the
cluster centers and the instances.
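A minimal sketch of such a fitness function in Python (illustrative only; the function name clustering_fitness and the convention that lower values are better are assumptions, not taken from the slides):

import numpy as np

def clustering_fitness(centroids, data, use_sse=False):
    # centroids: (k, n_features) centroid positions held by one particle;
    # data: (n_points, n_features) dataset being clustered.
    # Distance from every data object to every cluster centroid.
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    # Each object is associated with its nearest centroid.
    nearest = dists.min(axis=1)
    # Average distance to the assigned centroid, or the Sum of Squared Errors.
    return float((nearest ** 2).sum()) if use_sse else float(nearest.mean())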
PSO Algorithm
The following steps illustrate the overall optimization scheme of PSO.
 
1. Initialize the particle population by randomly assigning locations (the X-vector for
each particle) and velocities (the V-vector, with random or zero velocities; in our case
it is initialized with the zero vector).
2. Evaluate the fitness of each individual particle, record the best fitness Pbest found by
each particle so far, and update the P-vector related to each Pbest.
3. Also find the highest fitness among all individuals, Gbest, and record the corresponding
position pg.
4. Modify the velocities based on the particle-best and global-best positions using
equation 1 (see the sketch after this list).
5. Update the particle positions using equation 2.
6. Terminate if the stopping condition is met.
7. Otherwise, go to step 2.
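Equations 1 and 2 are not reproduced on this slide. Assuming the standard PSO update rules with inertia weight [Kennedy1995][Yuhui1998], they take the form

v_i(t+1) = w * v_i(t) + c1 * r1 * (pbest_i - x_i(t)) + c2 * r2 * (gbest - x_i(t))    ... (1)
x_i(t+1) = x_i(t) + v_i(t+1)                                                          ... (2)

where w is the inertia weight, c1 and c2 are acceleration coefficients, and r1, r2 are random numbers drawn from [0, 1]. A minimal Python sketch of steps 4-5 under this assumption follows (the parameter values are common illustrative defaults, not the ones used in this thesis):

import numpy as np

def pso_update(x, v, pbest, gbest, w=0.72, c1=1.49, c2=1.49, rng=None):
    # x, v: current position and velocity of one particle;
    # pbest: best position found by this particle so far;
    # gbest: best position found by the whole swarm so far.
    rng = np.random.default_rng() if rng is None else rng
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # equation (1)
    x = x + v                                                   # equation (2)
    return x, v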
PROBLEM FORMULATION

Problem Identification:
• The major drawbacks of existing PSO
techniques are that their computation time is
very high and their convergence speed is low. So
a newer technique such as WDO has to be
implemented and evaluated for better
clustering efficiency.
Objectives

• To implement and evaluate the
efficiency of the WDO and PSO algorithms
for data clustering.
• To identify and investigate the factors
responsible for data clustering.
Experimentation
• For the experimental setup we will be using the MATLAB environment.
Simple MATLAB programs will be used to model the PSO and
WDO algorithms. For clustering we will be using normalized data with a
fitness function for judging the quality of the clusters (a sketch of the
normalization step follows this list). The clustering problems used for the
purpose of this thesis will be:
– Iris plants database: This is a well-understood database with 4 inputs, 3
classes and 150 data vectors.
– Wine: This is a classification problem with "well behaved" class structures.
There are 13 inputs, 3 classes and 178 data vectors.
– Hayes-Roth: 132 data vectors with 3 classes and 5 inputs.
– Diabetes: 768 data vectors with 2 classes and 8 inputs.
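A minimal sketch of the normalization step in Python (min-max scaling to [0, 1] is assumed here; the slides do not state which normalization will be used):

import numpy as np

def min_max_normalize(data):
    # Scale every attribute (column) of the dataset to the range [0, 1]
    # so that no single attribute dominates the distance-based fitness.
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (data - lo) / span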
Summary Synopsis
• There are various soft computing based algorithms
applied to data clustering, such as GA, SA, PSO and
fuzzy clustering, but these algorithms have large
computation times and slow convergence for large,
multi-dimensional data. In this thesis an effort
will be made to implement and evaluate the
efficiency of the WDO and PSO algorithms for data
clustering. We will also try to identify and
investigate the factors responsible for data
clustering.
References and Bibliography
• [A1-Sultan1995] A1-Sultan K. S., A tabu search approach to the clustering problem, Pattern Recognition, 28:1443-1451,1995.
• [Alwee2009] Razan Alwee, Siti Mariyam, Firdaus Aziz, K.H. Chey, Haza Nuzly, "The Impact of Social Network Structure in Particle Swarm Optimization for Classification Problems". International Journal of Soft Computing, Vol. 4, No. 4, 2009, pp. 151-156.
• [Banfield1993] Banfield J. D. and Raftery A. E. . Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803-821, 1993.
• [Cui2005] X. Cui, P. Palathingal, T.E. Potok, "Document Clustering using Particle Swarm Optimization". IEEE Swarm Intelligence Symposium 2005, Pasadena, California, pp. 185 - 191. doi:
10.1109/SIS.2005.1501621
• [Eberhart2001] Eberhart, R.C., & Shi, Y. Particle Swarm Optimization: Developments, Applications and Resources. Congress on Evolutionary Computation, Seoul, Korea, 81-86. 2001
• [Engelbrecht2003] Van D. M., Engelbrecht. A.P., "Data clustering using particle swarm optimization". Proceedings of IEEE Congress on Evolutionary Computation 2003, Canbella,
Australia. pp: 215-220. doi: 10.1109/CEC.2003.1299577
• [Farley1998] Farley C. and Raftery A.E., “How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis”, Technical Report No. 329. Department of Statistics
University of Washington, 1998.
• [Han2001] Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
• [Hoppner2000] Hoppner F. , Klawonn F., Kruse R., Runkler T., Fuzzy Cluster Analysis,Wiley, 2000.
• [Izakian2009] H. Izakian, A. Abraham, and V. Snásel, "Fuzzy Clustering using Hybrid Fuzzy c-means and Fuzzy Particle Swarm Optimization", World Congress on Nature & Biologically
Inspired Computing, NaBIC 2009. In Proc. NaBIC, pp.1690-1694, 2009. doi: 10.1109/NABIC.2009.5393618
• [Jain1999] Jain, A.K. Murty, M.N. and Flynn, P.J. Data Clustering: A Survey. ACM Computing Surveys, Vol. 31, No. 3, September 1999.
• [Kaufman1987] Kaufman, L. and Rousseeuw, P.J., Clustering by Means of Medoids, In Y. Dodge, editor, Statistical Data Analysis, based on the L1 Norm, pp. 405- 416, 1987, Elsevier/North
Holland, Amsterdam.
• [Kennedy1995] Kennedy, J., & Eberhart, R. Particle Swarm Optimization. IEEE Conference on Neural Networks, Perth, Australia, 1942-1948, 1995.
• [Kohonen1990] Teuvo Kohonen, “The Self-Organizing Map,” Proceedings of IEEE, vol. 78, no. 9, Sep 1990, pp. 1464-1480
• [Mariam2013] Mariam El-Tarabily, Rehab Abdel-Kader, Mahmoud Marie, Gamal Abdel-Azeem, " A PSO-Based Subtractive Data Clustering Algorithm ". International Journal of Research
in Computer Science, 3 [2]: pp. 1-9, March 2013. doi: 10.7815/ijorcs. 32.2013.060
• [Pave2002] Pavel Berkhin, "Survey of clustering data mining techniques". Accrue Software Research Paper, pp. 25-71, 2002. doi: 10.1007/3-540-28349-8_2
• [Selim1984] Selim, S.Z., and Ismail, M.A. K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. In IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. PAMI-6, no. 1, January 1984.
• [Selim1991] Selim, S. Z. and Al-Sultan, K. 1991. A simulated annealing algorithm for the clustering problem. Pattern Recognition, 24(10), 1003-1008, 1991.
• [Steinbach2000] Michael Steinbach, George Karypis, Vipin Kumar, "A Comparison of Document Clustering Techniques". Text Mining Workshop, KDD, 2000.
• [Urquhart1982] Urquhart, R. Graph-theoretical clustering, based on limited neighborhood sets. Pattern recognition, vol. 15, pp. 173-187, 1982.
• [Yuhui1998] Yuhui Shi, Russell C. Eberhart, "Parameter Selection in Particle Swarm Optimization". The 7th Annual Conference on Evolutionary Programming, San Diego, pp. 591-600,
1998. doi: 10.1007/BFb0040810
• [Zahn1971] Zahn, C. T., Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. C-20 (Apr.), 68-86, 1971.
Thanks!

Any Queries, Please!
