A Survey on Clustering Approaches for Gene Expression Patterns
Irene Maria 1 , Mathew Kurian 2
1 Department of Computer Science and Engineering, Karunya University, India
2 Department of Computer Science and Engineering, Karunya University, India
ABSTRACT Clustering is the process of organizing objects into groups whose members are similar in some way and different from members of other groups. It is an efficient data mining technique that is used in many fields. Clustering and classification complement each other and can be combined for better results; both are used in traditional pattern recognition methods. With advances in microarray technology, cluster analysis of genes has become possible and has many applications, providing insight into the structural, functional and organisational aspects of gene data sets. Traditional, hierarchical, density-based and evolutionary clustering algorithms for gene expression data are discussed. The evolutionary approaches include clustering based on genetic algorithms. Finally, the new challenges that gene-based clustering presents, owing to the special characteristics of gene expression data and the particular requirements of the biological domain, are listed.
Index Terms: Clustering, Fuzzy Partitioning, Gene Expression, Genetic Algorithm, Microarray.
1. Introduction
Clustering needs a unique and clear decision about the clusters to be formed. A few clustering algorithms that focus on categorical data have been developed, but in most cases the measures contributing to the clusters may not be appropriate. There is increasing interest in clustering methods for pattern recognition, image processing and information retrieval, and also in fields such as biology, geology and marketing.
1.1 Clusters and Clustering
Clustering is an important real-world problem and a typical example of unsupervised classification. It is the process of grouping data objects into a set of classes. These classes, called clusters, contain entities that are highly similar to one another and dissimilar to the entities in other clusters. Clustering can therefore be used to find rules for classifying objects. Detecting clusters of diverse shapes and sizes is a fundamental limitation of every clustering algorithm. Even with a clustering criterion, discovering the majority of the clusters present in the data is a difficult goal, and it becomes more difficult when little is known about how the data are organized.
1.2 Clustering and Classification
Clustering and classification are both very important for traditional pattern recognition, and they complement each other. Clustering can improve the generalization of classification, while class information can improve the accuracy of clustering solutions [1]. Many algorithms have been developed to combine the advantages of both learning methods. These approaches first optimize the clustering criterion and then the classification criterion obtained from the clustering solution.
1.3 Applications of clustering gene expression data
Microarray technologies have been developed to monitor the expression levels of genes. Clustering techniques prove helpful by giving insight into gene function, gene structure and cellular processes [2]. Genes with similar expression patterns, called co-expressed genes, can be clustered together and are likely to have similar cellular functions. Moreover, co-expressed genes in the same cluster are likely to be involved in the same cellular processes. Inferring regulation from clustered gene expression data also provides information about the mechanism of the transcriptional regulatory network. Traditional clustering algorithms include hierarchical, partitioning and density-based methods.
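To make the notion of co-expressed genes concrete, the following minimal sketch scores the similarity of expression profiles with the Pearson correlation coefficient and reports the gene pairs whose correlation exceeds a threshold. The gene names, the expression values, the function names and the threshold of 0.9 are all hypothetical choices made for this illustration; the survey does not prescribe a particular similarity measure or cutoff.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression levels of three genes across five conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 4.0, 6.2, 7.9, 10.0],  # rises together with geneA
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
}


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above the threshold."""
    names = sorted(profiles)
    pairs = []
    for i, g in enumerate(names):
        for h in names[i + 1:]:
            if pearson(profiles[g], profiles[h]) >= threshold:
                pairs.append((g, h))
    return pairs
```

With the toy data, only geneA and geneB are reported as co-expressed, because geneC moves in the opposite direction.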
International Journal of Computer Trends and Technology (IJCTT) volume 6 number 4 Dec 2013 ISSN: 2231-2803 http://www.ijcttjournal.org Page215
2. Hierarchical clustering
Hierarchical clustering generates a hierarchical series of clusters that can be represented graphically by a tree, termed a dendrogram. A hierarchical clustering method starts by defining a distance between two data points in the gene expression data. The two closest data points are grouped, and the process continues until a cluster tree has been created. One way to evaluate whether the obtained clusters are stable is to re-examine the original data set and see whether the same clusters are found again. Hierarchical clustering identifies sets of correlated genes that behave similarly across the samples, but it yields thousands of clusters in a tree-like structure that is difficult to understand and explore.
Hierarchical clustering can be further divided into agglomerative and divisive methods, based on how the dendrogram is formed [3],[4]. Agglomerative algorithms follow a bottom-up approach, repeatedly merging groups of data until some pre-defined threshold is reached. Here, linkage is the criterion used to determine the distance between two clusters. Single linkage is the minimum distance between an object of one cluster and an object of the other, complete linkage is the maximum such distance, and average linkage is the mean distance over every pair of objects from the two clusters. In agglomerative hierarchical clustering, every object begins in its own cluster. The pair of objects with the highest similarity is then merged into the same cluster. The result of agglomerative clustering can thus be viewed as a complete graph in which each node is related to all other nodes. Divisive hierarchical clustering is the opposite of agglomerative clustering and uses a top-down approach. It recursively divides the data until some pre-defined threshold is reached, splitting the complete graph into smaller components.
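The three linkage criteria and the bottom-up merging loop can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance is used as the point-to-point distance, the stopping rule is a target number of clusters `k`, and all function names are hypothetical. At every step the two clusters with the smallest linkage distance are merged.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(c1, c2):
    """Smallest distance between any object of c1 and any object of c2."""
    return min(euclidean(p, q) for p in c1 for q in c2)


def complete_linkage(c1, c2):
    """Largest distance between any object of c1 and any object of c2."""
    return max(euclidean(p, q) for p in c1 for q in c2)


def average_linkage(c1, c2):
    """Mean distance over all pairs of objects from the two clusters."""
    total = sum(euclidean(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, k, linkage=average_linkage):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # every object starts in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording the order and distance of each merge, rather than stopping at `k`, would give the information needed to draw a dendrogram.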
The resulting dendrogram has clusters as its branches and also shows the similarity between clusters. CURE [5] is an agglomerative hierarchical clustering algorithm that begins by choosing a constant number of well-scattered points from a cluster. These points can be used to identify the shape and size of the cluster. The next step of the algorithm shrinks the selected points toward the centroid of the cluster by a predetermined fraction. Another agglomerative hierarchical clustering algorithm relies on links rather than distances to measure the proximity between a pair of data points before merging. Some agglomerative hierarchical clustering algorithms, such as CHAMELEON [6], use a graph partitioning algorithm to partition the data into sub-clusters based on the nearest-neighbor approach, and then use an agglomerative hierarchical clustering algorithm to combine the sub-clusters and find the real clusters among them.
Hierarchical clustering methods are popular because of the way they present cluster results, and they are widely preferred by biologists. The method offers built-in flexibility regarding the level of granularity and easily handles any form of similarity or distance. Its effectiveness [3] depends on:
i. The appropriateness of the validity measure used.
ii. The need to incorporate expression values and other types of biological domain knowledge while exploring the clusters.
iii. The ability to be applied to high-dimensional numeric data.
iv. The representation of nested cluster structure, which is clear, although intersecting clusters are represented unsatisfactorily.
2. Partitional clustering
This clustering technique partitions the available data points into clusters based on a single center criterion.
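As a concrete instance of center-based partitioning, the sketch below implements the basic K-means iteration discussed in Section 2.1.1: assign each point to its nearest center, then recompute each center as the mean of its members, and repeat until the assignment stops changing. It is a minimal illustration, not the implementation used in any cited work; the function names, the `iterations` and `seed` parameters, and the random choice of initial centers are assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Partition points into k clusters; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct data points as initial centers
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignment is stable, so the centers will not move again
        labels = new_labels
        # Update step: each center becomes the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels
```

Because every point is forced into exactly one cluster and `k` must be supplied in advance, the sketch also exhibits the limitations noted for gene expression data in Section 2.1.1.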
2.1 Centroid models
In many clustering algorithms, the dissimilarity between points in a dataset is computed using proximity measures such as Euclidean distance, Mahalanobis distance and Pearson correlation. Some algorithms optimize validity measures such as the compactness of the clusters, their separation, or both.
2.1.1 K-means
K-means [7] is a simple and fast centroid-based clustering algorithm that divides the data into a pre-defined number of clusters in order to optimize a predefined criterion. The K-means algorithm calculates the center of a cluster as the mean of the features of that cluster's members [3]. To detect the optimal number of clusters, users run the algorithm repeatedly with different values of k and compare the clustering results. For a large gene expression dataset consisting of thousands of genes, this extensive process may not be practical. Also, gene expression data normally contain a large amount of noise, and the k-means algorithm forces each gene into a single cluster, which may produce biologically irrelevant clusters. It is also difficult to detect clusters of arbitrary shape and structure. Despite these difficulties, the k-means algorithm is frequently used to cluster gene expression data because of its simplicity, and it provides baseline results against which new clustering algorithms are compared. K-means and its variants are widely used for clustering gene expression data. The algorithm is generally inefficient on categorical datasets, which lack an inherent distance measure. Extensions of K-means that use cluster medoids or cluster modes instead of the cluster center have also been used. All these algorithms use a single function for partitioning the data. Where no natural ordering of entities exists, algorithms such as K-means and fuzzy C-means fail.
2.1.2 K-medoids
K-medoids is a variation of the K-means clustering method in which the cluster medoid, the most centrally located point of the cluster, is taken as its representative. The remaining data points are assigned to clusters based on the new medoid determined in each iteration. Early k-medoid methods are the algorithms PAM and CLARA [8]. PAM (Partitioning Around Medoids) uses dissimilarity values and an iterative approach to determine a representative object, called the medoid, for each cluster. Once the medoids have been found, each non-selected object is grouped with the medoid to which it is most similar. The quality of a clustering is measured by the average dissimilarity between an object and the medoid of its cluster. CLARA follows the same principle as PAM, but instead of finding representative objects for the entire dataset, it draws a sample of the dataset and applies PAM to this sample. It then classifies the remaining objects using the usual partitioning principles. CLARANS [9] performs a randomized search of a graph to find the medoids that represent the clusters. The algorithm takes maxneighbor and numlocal as input, selects a random node, and then checks a sample of the neighbors of that node.
If a better neighbor is found, it moves to that neighbor and continues the process until the maxneighbor criterion is met. Otherwise, it declares the current node a local minimum and starts a new search for other local minima. After a specified number of local minima (numlocal) have been collected, the algorithm returns the best of them as the medoids of the clusters.
2.1.3 K-modes
K-modes [7] works similarly to K-medoids, except that modes are used instead of medoids. This method can be regarded as a version of the fuzzy K-modes algorithm. The steps are similar to those of K-medoids except for the assignment of cluster centers. K-modes is a much faster extension of the k-means algorithm for handling categorical data; it uses a different similarity measure and, unlike k-means, a frequency-based method to update the modes. However, the k-modes algorithm can be applied to numeric data only after they have been converted into categorical data, which may cause information loss and in turn degrade the clustering result. All these traditional partitional clustering techniques use greedy search to optimize the compactness of the clusters. These approaches suffer from the problem of local optima, and only a single cluster validity index is optimized. In the context of gene expression clustering, partitioning-based algorithms can find separate clusters [3], but they suffer from the following problems:
i. The number of clusters is often not known a priori.
ii. The validity measures used by most of the above techniques are inadequate because of the large size of gene data.
iii. Gene data, apart from displaying disjoint cluster patterns, often show evidence of intersecting and embedded cluster patterns, which these techniques usually do not accommodate.
3. DBSCAN
This is a density-based clustering algorithm.
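A minimal sketch of the density-based idea is given here before the description continues. A point with at least `min_pts` points (itself included) within radius `eps` is treated as a core point, clusters grow by following core points, and points reachable from no core point are labelled as noise. The parameter names `eps` and `min_pts`, the label `-1` for noise, the function names and the use of Euclidean distance are assumptions made for this illustration.

```python
import math

NOISE = -1


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def region_query(points, i, eps):
    """Indices of all points within eps of point i (point i included)."""
    return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep growing
    return labels
```

Unlike K-means, the number of clusters is not supplied; it follows from the density parameters, and isolated points are left out as noise.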
The algorithm groups regions of high density into clusters and discovers clusters of arbitrary shape in spatial databases. It defines a cluster as a maximal set of density-connected points. A cluster continues to grow as long as the density, that is, the number of objects or data points in the neighborhood, exceeds a particular threshold. The algorithm can be used effectively to filter out noise. Statistically significant patterns can be derived from dense regions, which can then be used to identify genes of interest and eliminate others.
4. Soft Computing Techniques
4.1 Fuzzy partitioning
In classical fuzzy clustering, fuzziness is usually expressed as the possibility that each element belongs to different classes, with degrees of membership in [0, 1]. In this approach, the fuzziness of a clustering is evaluated in terms of the detail with which the properties of the classified elements are investigated. Gene expression data analysis sometimes encounters intersecting gene patterns, in which case crisp (hard) clustering may not yield a good result. In fuzzy clustering, however, a gene can belong to several clusters with certain degrees of membership.
4.1.1 Fuzzy C-means
Fuzzy C-means (FCM) [10] is a widely used technique that applies the principles of fuzzy theory to evolve a partition matrix. The FCM algorithm starts with K randomly selected initial cluster centers and, at every iteration, finds the fuzzy membership of each gene. The main advantage of this algorithm is that it covers both the determination of initial clusters and cluster validity, which are fundamental stages of a clustering process. Because it is based on fuzzy theory, it also offers a more sensitive analysis than the DBSCAN algorithm. Genetic algorithms (GAs) can be combined with fuzzy clustering to produce better clusters. In GA-based fuzzy clustering, real numbers represent the coordinates of the centers of the partitions.
5. Gene Expression Patterns and Clustering
Advances in microarray technology have made it possible to study the expression levels of genes. This technology finds application in various fields, including the medical diagnosis of diseases. Microarray data can be analyzed by clustering [11]. Most existing clustering algorithms appear to be sensitive to the set of features considered in the clustering process.
5.1.1 Single objective clustering
Clustering is generally unsupervised, and no external knowledge is available during the process. In single-objective clustering, the criterion may be based on the similarity or dissimilarity of the data items [12]. The clusters formed under such a criterion may not be correct, since the true clusters can take other, more complex structures, and the clustering can fail if the chosen objective function is not appropriate.
5.1.2 Multiobjective Clustering
In multiobjective clustering, several clustering algorithms are used along with different objective functions.
The result may contain not only the clusters but also the specific objective function that contributed to each cluster's formation [13]. Because this approach combines different objective functions, the problem of choosing a single objective function, as in single-objective clustering, is alleviated.
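The following sketch illustrates how several objective functions can be combined without collapsing them into one: each candidate partition is scored on two objectives, and only the partitions that are not dominated on both are kept (the Pareto front). The two objectives chosen here, within-cluster compactness and between-centroid separation, are assumptions made for this example and are not the specific objectives of [12] or [13]; all function names are hypothetical, and each partition is assumed to contain at least two non-empty clusters.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def compactness(partition):
    """Total squared distance of points to their cluster centroid (lower is better)."""
    return sum(sq_dist(p, centroid(c)) for c in partition for p in c)


def separation(partition):
    """Smallest squared distance between two centroids (higher is better)."""
    centers = [centroid(c) for c in partition]
    return min(sq_dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])


def objectives(partition):
    """Score vector in which every entry is to be minimized."""
    return (compactness(partition), -separation(partition))


def dominates(a, b):
    """True if score a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(partitions):
    """Keep the partitions whose scores are not dominated by any other."""
    scored = [(p, objectives(p)) for p in partitions]
    return [p for p, s in scored if not any(dominates(t, s) for _, t in scored)]
```

An evolutionary multiobjective method would generate and refine the candidate partitions; this sketch only shows how candidates are compared once they exist.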
6. Evolutionary methods
A genetic algorithm (GA) is a search technique used to find exact or approximate solutions to optimization and search problems. Genetic algorithms are categorized as global search techniques [14]. They are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology, such as inheritance, mutation, selection and crossover. The single-objective genetic algorithm applies a single function with different clustering algorithms. In this clustering technique, each algorithm works with its own function, so that each resulting cluster solution is associated with a function different from those of the others.
7. Challenges of gene clustering
Owing to the special characteristics of gene expression data and the particular requirements of the biological domain, gene-based clustering presents several new challenges [3][2]. First, cluster analysis is typically the first step in data mining and knowledge discovery. The purpose of clustering gene expression data is to reveal the natural data structures and gain some initial insight into the data distribution. Therefore, a good clustering algorithm should depend as little as possible on prior knowledge, which is usually not available before cluster analysis. Second, the effectiveness of a clustering technique is highly influenced by the proximity measure it uses, and choosing or finding an appropriate proximity measure is a challenging task. Third, gene expression data often contain a large amount of noise and many missing values, owing to the complex procedures of microarray experiments. Clustering algorithms for gene expression data should therefore be capable of extracting useful information from such data. Fourth, algorithms for gene-based clustering should handle effectively the situation in which gene expression data are highly connected and clusters are highly intersected with each other or even embedded in one another.
Finally, the associations between clusters and the relationships between genes within the same cluster are of great importance. A clustering algorithm that not only partitions the data set but also provides a graphical representation of the cluster structure would therefore be more useful. Clustering algorithms should also be efficient enough to scale with the increasing size and dimensionality of datasets.
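The third challenge, missing values, interacts with the second, the choice of proximity measure: a distance must still be defined when some measurements are absent. One common workaround, shown here purely as an illustrative assumption and not as a method proposed in the surveyed work, is to compute the distance over the conditions observed in both profiles and rescale it. Missing values are represented by `None`, and the function name is hypothetical.

```python
import math


def pairwise_complete_distance(x, y):
    """Euclidean distance over conditions observed in both profiles.

    Missing measurements are given as None. The sum of squares is rescaled
    by (total conditions / shared conditions) so that pairs with fewer
    shared conditions remain comparable. Returns None if nothing is shared.
    """
    shared = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not shared:
        return None
    total = sum((a - b) ** 2 for a, b in shared)
    return math.sqrt(total * len(x) / len(shared))
```

Imputing the missing values before clustering is an alternative; which option is appropriate depends on how much of the data is missing.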
Conclusion
Although gene expression clustering has been carried out with k-means, hierarchical clustering and DBSCAN, the desired features of a clustering method include minimal user input, the ability to find arbitrarily shaped clusters, robustness to noise and the ability to handle high-dimensional data. If outliers or noise are a problem and hierarchical clustering is to be used, a separate technique for outlier detection is needed. Fuzzy C-means and GA-based approaches support overlapping clusters of co-regulated genes but need user-specified parameters. A clustering algorithm's capability to cluster biological data sets depends on certain desirable features such as speed, a minimum number of input parameters, and robustness to noise and outliers. Although the principles of many available clustering algorithms satisfy these requirements, in most cases they do not provide accurate solutions and cannot be applied to the clustering of biological data. In this survey, an attempt has been made to provide a comprehensive and precise account of various clustering approaches in the context of pattern identification and recognition in gene expression data. The effectiveness of a clustering technique is highly influenced by the choice of algorithm and the criterion it uses. A short list of clustering approaches available for numeric data is provided. Each algorithm is analyzed and its shortcomings are enumerated. Finally, the challenges faced by the clustering schemes available for gene expression data are discussed.
Acknowledgement
I am pleased to be indebted to my guide, Mr Mathew Kurian, M.E., Assistant Professor, Department of Computer Science and Engineering, for his invaluable support, advice, encouragement and feedback.
References
[1] Weiling Cai, Songcan Chen, and Daoqiang Zhang, A Multiobjective Simultaneous Learning Framework for Clustering and Classification, IEEE, 2010.
[2] Daxin Jiang, Chun Tang, and Aidong Zhang, Cluster Analysis for Gene Expression Data: A Survey, IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370-1386, November 2004.
[3] Sajid Nagi, Dhruba K. Bhattacharyya, and Jugal K. Kalita, Gene Expression Data Clustering Analysis: A Survey, 2011.
[4] Harun Pirim, Burak Eksioglu, Andy D. Perkins, and Cetin Yuceer, Clustering of high throughput gene expression data, Computers and Operations Research, 39(12):3046-3061, 2012.
[5] S. Guha, R. Rastogi, and K. Shim, CURE: An Efficient Clustering Algorithm for Large Datasets, ACM SIGMOD Conf., 1998.
[6] G. Karypis, E. H. Han, and V. Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modelling, Computer, vol. 32, no. 8, pp. 68-75, 1999.
[7] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in Proc. of 5th Berkeley Symposium on Mathematics, Statistics and Probability, pp. 281-297, 1967.
[8] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[9] R. T. Ng and J. Han, Efficient and Effective Clustering Method for Spatial Data Mining, in VLDB '94, pp. 144-155, 1994.
[10] Gozde Ulutagay, An overview of fuzzy & crisp clustering algorithms, Izmir University, Department of Industrial Engineering, New Bulgarian University Lecture-3, May 31, 2012.
[11] Anirban Mukhopadhyay, Ujjwal Maulik, and Sanghamitra Bandyopadhyay, Simultaneous Informative Gene Selection and Clustering through Multiobjective Optimization, IEEE, 2010.
[12] J. Handl and J. Knowles, An evolutionary approach to multiobjective clustering, IEEE Trans., 2007.
[13] Martin H. C. Law, Alexander P. Topchy, and Anil K. Jain, Multiobjective Data Clustering, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
[14] Ujjwal Maulik and Sanghamitra Bandyopadhyay, Genetic algorithm-based clustering technique, 1999.
[15] A. K. Jain and R. C. Dubes, Data clustering: A review, 1999.
[16] A. Mukhopadhyay, U. Maulik, and S. Bandyopadhyay, Multi-objective genetic algorithm based fuzzy clustering of categorical attributes, IEEE, 2009.
[17] Ujjwal Maulik, Analysis of gene microarray data in a soft computing framework, Applied Soft Computing, vol. 11, no. 6, pp. 4152-4160, September 2011.
[18] Z. Huang, A Fast Clustering Algorithm to cluster very large categorical datasets in Data Mining, SIGMOD Workshop on Research Issues on DM and Knowledge Discovery, May 1997.