
O-Cluster: Scalable Clustering of Large High Dimensional Data Sets

Boriana L. Milenova
boriana.milenova@oracle.com

Marcos M. Campos
marcos.m.campos@oracle.com

Oracle Data Mining Technologies, 10 Van de Graaff Drive, Burlington, MA 01803, USA

Abstract
Clustering large data sets of high dimensionality has always been a serious challenge for clustering algorithms. Many recently developed clustering algorithms have attempted to address either data sets with a very large number of records or data sets with a very high number of dimensions. This paper provides a discussion of the advantages and limitations of existing algorithms when they operate on very large multidimensional data sets. To simultaneously overcome both the curse of dimensionality and the scalability problems associated with large amounts of data, we propose a new clustering algorithm called O-Cluster. This new clustering method combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their robustness to noise, and O-Cluster's excellent scalability.

1. Introduction
With an increasing number of new database applications dealing with very large high dimensional data sets, data mining on such data sets has emerged as an important research area. These applications include multimedia content-based retrieval, geographic and molecular biology data analysis, text mining, bioinformatics, medical applications, and time-series matching. For example, in multimedia retrieval, the objects (e.g., images) are represented by their features (e.g., color histograms, texture vectors, Fourier vectors, text descriptors, and shape descriptors), which define high dimensional feature spaces. In many of the above-mentioned applications the data sets are very large, consisting of millions of data objects with several hundreds to thousands of dimensions. Clustering of very large high dimensional data sets is an important problem. There are a number of different clustering algorithms that are applicable to very large data sets, and a few that address high dimensional data. Clustering algorithms can be divided into partitioning, hierarchical, locality-based, and grid-based algorithms.

Copyright 2002 Oracle Corporation.

Given a data set with n objects and k ≤ n, the number of desired clusters, partitioning algorithms divide the objects into k clusters. The clusters are formed so as to optimize an objective criterion such as distance. Each object is assigned to the closest cluster. Clusters are typically represented either by the mean of the objects assigned to the cluster (k-means [Mac67]) or by one representative object of the cluster (k-medoid [KR90]). CLARANS (Clustering Large Applications based upon RANdomized Search) [NH94] is a partitioning clustering algorithm developed for large data sets, which uses a randomized and bounded search strategy to improve the scalability of the k-medoid approach. CLARANS enables the detection of outliers and its computational complexity is about O(n²). CLARANS performance can be improved by exploiting spatial data structures such as R*-trees.

Hierarchical clustering algorithms work by grouping data objects into a hierarchy (e.g., a tree) of clusters. The hierarchy can be formed top-down (divisive hierarchical methods) or bottom-up (agglomerative hierarchical methods). Hierarchical methods rely on a distance function to measure the similarity between clusters. These methods do not scale well with the number of data objects; their computational complexity is usually O(n²). Some newer methods such as BIRCH [ZRL96] and CURE [GRS98] attempt to address the scalability problem and improve the quality of clustering results for hierarchical methods. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an efficient divisive hierarchical algorithm. It has O(n) computational complexity, can work with a limited amount of memory, and has efficient I/O. It uses a special data structure, the CF-tree (Cluster Feature tree), for storing summary information about subclusters of objects. The CF-tree structure can be seen as a multilevel compression of the data that attempts to preserve the clustering structure inherent in the data set. Because of the similarity measure it uses to determine the data items to be compressed, BIRCH only performs well on data sets with spherical clusters. CURE (Clustering Using REpresentatives) is an O(n²) algorithm that produces high-quality clusters in the presence of outliers and can identify clusters of complex shapes and different sizes. It employs a hierarchical clustering approach that uses a fixed number of representative points to define a cluster instead of a single centroid or object. CURE handles large data sets through a combination of random sampling and partitioning. Since CURE uses only a random sample of the data set, it manages to achieve good scalability for large data sets. CURE reports better times than BIRCH on the same benchmark data.

Locality-based clustering algorithms group neighboring data objects into clusters based on local conditions. These algorithms allow clustering to be performed in one scan of the data set. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [EKSX96] is a typical representative of this group of algorithms. It regards clusters as dense regions of objects in the input space that are separated by regions of low density. DBSCAN's basic idea is that the density of points in a radius around each point in a cluster has to be above a certain threshold. It grows a cluster as long as, for each data point within this cluster, a neighborhood of a given radius contains at least a minimum number of points. DBSCAN has computational complexity O(n²).
If a spatial index is used, the computational complexity is O(n log n). The clustering generated by DBSCAN is very sensitive to parameter choice. OPTICS (Ordering Points To Identify Clustering Structures) is another locality-based clustering algorithm. It computes an augmented cluster ordering for automatic and interactive clustering analysis. OPTICS has the same computational complexity as DBSCAN.

In general, partitioning, hierarchical, and locality-based clustering algorithms do not scale well with the number of objects in the data set. To improve efficiency, data summarization techniques integrated with the clustering process have been proposed. Besides the above-mentioned BIRCH and CURE algorithms, examples include active data clustering [HB97], ScalableKM [BFR98], and simple single pass k-means [FLE00]. Active data clustering utilizes principles from sequential experimental design in order to interleave data generation and data analysis. It infers from the available data not only the grouping structure, but also which data are most relevant for the clustering problem. The inferred relevance of the data is then used to control the re-sampling of the data set. ScalableKM requires at most one scan of the data set. The method identifies data points that can be effectively compressed, data points that must be maintained in memory, and data points that can be discarded. The algorithm operates within the confines of a limited memory buffer. Unfortunately, the compression schemes used by ScalableKM can introduce significant overhead. The simple single pass k-means algorithm is a simplification of ScalableKM. Like ScalableKM, it also uses a data buffer of fixed size. Experiments indicate that the simple single pass k-means algorithm is several times faster than standard k-means while producing clustering of comparable quality.

None of the above-mentioned methods is fully effective when clustering high dimensional data. Methods that rely on near or nearest neighbor information do not work well in high dimensional spaces. In high dimensional data sets, it is very unlikely that data points are significantly nearer to each other than the average distance between data points, because the space is sparsely populated. As a result, as the dimensionality of the space increases, the difference between the distance to the nearest and the farthest neighbors of a data object goes to zero [BGRS99, HAK00].

Grid-based clustering algorithms do not suffer from the nearest neighbor problem in high dimensional spaces. Examples include STING (STatistical INformation Grid) [WYM97], CLIQUE [AGGR98], DENCLUE [HK98], WaveCluster [SCZ98], and MAFIA (Merging Adaptive Finite Intervals And is more than a clique) [NGC99]. These methods divide the input space into hyper-rectangular cells, discard the low-density cells, and then combine adjacent high-density cells to form clusters. Grid-based methods are capable of discovering clusters of any shape and are also reasonably fast. However, none of these methods addresses how to efficiently cluster very large data sets that do not fit in memory. Furthermore, these methods only work well with input spaces of low to moderate dimensionality. As the dimensionality of the space increases, grid-based methods face serious problems: the number of cells grows exponentially and finding adjacent high-density cells to form clusters becomes prohibitively expensive [HK99].

In order to address the curse of dimensionality, several algorithms have focused on data projections in subspaces. Examples include PROCLUS [APYP99], OptiGrid [HK99], and ORCLUS [AY00]. PROCLUS uses axis-parallel partitions to identify subspace clusters. ORCLUS uses generalized projections to identify subspace clusters. OptiGrid is an especially interesting algorithm due to its simplicity and its ability to find clusters in high dimensional spaces in the presence of noise. OptiGrid constructs a grid partitioning of the data by calculating the partitioning hyperplanes using contracting projections of the data. OptiGrid looks for hyperplanes that satisfy two requirements: 1) separating hyperplanes should cut through regions of low density relative to the surrounding regions; and 2) separating hyperplanes should place individual clusters into different partitions. The first requirement aims at preventing oversplitting, that is, a cutting plane should not split a cluster. The second requirement attempts to achieve good cluster discrimination, that is, the cutting plane should contribute to finding the individual clusters. OptiGrid recursively constructs a multidimensional grid by partitioning the data using a set of cutting hyperplanes, each of which is orthogonal to at least one projection. At each step, the generation of the set of candidate hyperplanes is controlled by two threshold parameters. The implementation of OptiGrid described in the paper used axis-parallel partitioning hyperplanes. The authors show that the error introduced by axis-parallel partitioning decreases exponentially with the number of dimensions in the data space. This validates the use of axis-parallel projections as an effective approach for separating clusters in high dimensional spaces. OptiGrid, however, has two main shortcomings: it is sensitive to parameter choice, and it does not prescribe a strategy to efficiently handle data sets that do not fit in memory.

To overcome both the scalability problems associated with large amounts of data and the challenges of high dimensional input spaces, this paper introduces a new clustering algorithm called O-Cluster (Orthogonal partitioning CLUSTERing). This new clustering method combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data.

2. The O-Cluster Algorithm


O-Cluster is a method that builds upon the contracting projection concept introduced by OptiGrid. Our algorithm makes two major contributions:

• It proposes the use of a statistical test to validate the quality of a cutting plane. Such a test proves crucial for identifying good splitting points along data projections and makes possible the automated selection of high quality separators.

• It can operate on a small buffer containing a random sample from the original data set. Active sampling ensures that partitions are provided with additional data points if more information is needed to evaluate a cutting plane. Partitions that do not have ambiguities are frozen, and the data points associated with them are removed from the active buffer.

O-Cluster operates recursively. It evaluates possible splitting points for all projections in a partition, selects the best one, and splits the data into two new partitions. The algorithm proceeds by searching for good cutting planes inside the newly created partitions. Thus O-Cluster creates a hierarchical tree structure that tessellates the input space into rectangular regions. Figure 1 provides an outline of O-Cluster's algorithm.

The block diagram in Figure 1 can be read as follows:

1. Load buffer
2. Compute histograms for active partitions
3. Find 'best' splitting points for active partitions
4. Flag ambiguous and 'frozen' partitions
   - If splitting points exist: 5. Split active partitions and return to Step 2.
   - Otherwise, if ambiguous partitions exist and unseen data exist: 6. Reload buffer and return to Step 2.
   - Otherwise: EXIT.

Figure 1: O-Cluster algorithm block diagram.

The main processing stages are as follows:

1. Load data buffer: If the entire data set does not fit in the buffer, a random sample is used. O-Cluster assigns all points from the initial buffer to a root partition.

2. Compute histograms for active partitions: The goal is to determine a set of projections for the active partitions and compute histograms along these projections. Any partition that represents a leaf in the clustering hierarchy and is not explicitly marked ambiguous or frozen is considered active. The process whereby an active partition becomes ambiguous or frozen is explained in Step 4. It is essential to compute histograms that provide good resolution but also have data artifacts smoothed out. A number of studies have addressed the problem of how many bins can be supported by a given distribution [Sco79, Wan96]. Based on these studies, a reasonable, simple approach is to make the number of bins inversely proportional to the standard deviation of the data along a given dimension and directly proportional to N^(1/3), where N is the number of points inside a partition. Alternatively, one can use a global binning strategy and coarsen the histograms as the number of points inside the partitions decreases. O-Cluster is robust with respect to different binning strategies as long as the histograms do not significantly undersmooth or oversmooth the distribution density.

3. Find 'best' splitting points for active partitions: For each histogram, O-Cluster attempts to find the best valid cutting plane, if one exists. A valid cutting plane passes through a point of low density (a valley) in the histogram. Additionally, the point of low density should be surrounded on both sides by points of high density (peaks). O-Cluster attempts to find a pair of peaks with a valley between them where the difference between the peak and the valley histogram counts is statistically significant. Statistical significance is tested using a standard χ² test:

χ² = 2(observed − expected)² / expected ≥ χ²_{α,1},
where the observed value is equal to the histogram count of the valley and the expected value is the average of the histogram counts of the valley and the lower peak. The current implementation uses a 95% confidence level (χ²_{0.05,1} = 3.843). Since multiple splitting points per partition can be valid separators according to this test, O-Cluster chooses the one where the valley has the lowest histogram count as the best splitting point. Thus the cutting plane passes through the area of lowest density (a code sketch of this selection procedure is given at the end of this section).

4. Flag ambiguous and frozen partitions: If no valid splitting points are found, O-Cluster checks whether the χ² test would have found a valid splitting point at a lower confidence level (e.g., 90%, with χ²_{0.1,1} = 2.706). If that is the case, the current partition is considered ambiguous: more data points are needed to establish the quality of the splitting point. If no splitting points were found and there is no ambiguity, the partition is marked as frozen and the records associated with it are marked for deletion from the active buffer.

5. Split active partitions: If a valid separator exists, the data points are split along the cutting plane and two new active partitions are created from the original partition. For each new partition the processing proceeds recursively from Step 2.

6. Reload buffer: This step takes place after all recursive partitioning on the current buffer has completed. If all existing partitions are marked as frozen and/or there are no more data points available, the algorithm exits. Otherwise, if some partitions are marked as ambiguous and additional unseen data records exist, O-Cluster proceeds with reloading the data buffer. The new data replace records belonging to frozen partitions. When new records are read in, only data points that fall inside ambiguous partitions are placed in the active buffer; new records falling within a frozen partition are not loaded into the buffer. If it is desirable to maintain statistics of the data points falling inside partitions (including the frozen ones), such statistics can be continuously updated as each new record is read. Loading of new records continues until either: 1) the active buffer is filled again; 2) the end of the data set is reached; or 3) a reasonable number of records have been read, even if the active buffer is not full and more data remain. The reason for the last condition is that if the buffer is relatively large and many points are marked for deletion, it may take a long time to fill the entire buffer with data from the ambiguous regions. To avoid excessive reloading under these circumstances, the buffer reloading process is terminated after reading a number of records equal to the data buffer size. Once the buffer reload is completed, the algorithm proceeds from Step 2.

The algorithm requires, at most, a single pass through the entire data set. In addition to the major differences from OptiGrid noted at the beginning of this section, there are two other important distinctions:

• OptiGrid's choice of a valid cutting plane depends on a pair of global parameters: the noise level and the maximal splitting density. These two parameters act as thresholds for identifying valid splitting points. In OptiGrid, histogram peaks are required to be above the noise level parameter, while histogram valleys need to have density lower than the maximal splitting density. The maximal splitting density should be set above the noise level threshold (personal communication with OptiGrid's authors). Finding correct values for these parameters is critical for OptiGrid's performance. O-Cluster's χ² test for splitting points eliminates the need for preset thresholds: the algorithm can find valid cutting planes at any density level within a histogram. While not strictly necessary for O-Cluster's operation, it was found useful, in the course of algorithm evolution, to introduce a parameter called sensitivity. Analogous to OptiGrid's noise level, the role of this parameter is to suppress the creation of arbitrarily small clusters by setting a minimum count for O-Cluster's histogram peaks. The effect of the sensitivity parameter is illustrated in Section 4.

• While OptiGrid attempts to find good cutting planes that optimally traverse the input space, it is prone to oversplitting. By design, OptiGrid can partition simultaneously along several cutting planes. This may result in the creation of clusters (with few points) that need to be subsequently removed. Additionally, OptiGrid works with histograms that undersmooth the distribution density (personal communication with OptiGrid's authors). Undersmoothed histograms and the threshold-based mechanism of splitting point identification can lead to the creation of separators that cut through clusters. These issues are not necessarily a serious hindrance in OptiGrid's framework, since the algorithm attempts to construct a multidimensional grid where the highly populated cells are interpreted as clusters. O-Cluster, on the other hand, attempts to create a binary clustering tree whose leaves are regions with flat or unimodal density functions. Only a single cutting plane is applied at a time and the quality of the splitting point is statistically validated.

O-Cluster functions optimally for large-scale data sets with many records and high dimensionality. It is desirable to work with a sufficiently large active buffer in order to compute high quality histograms with good resolution. High dimensionality has been shown to significantly reduce the chance of cutting through data when using axis-parallel cutting planes [HK99]. There is no special handling for missing values: if a data record has missing values, the record does not contribute to the histogram counts along the affected dimensions. However, if a missing value is needed to assign the record to a partition, the record is not assigned and is marked for deletion from the active buffer.
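
To make Steps 2 and 3 concrete, the following Python sketch shows one possible rendering of the histogram binning heuristic and of the χ² validation of candidate valleys. It is a simplified illustration under stated assumptions (the Scott-style constants, the clipping limits, and the peak-search strategy are choices of this sketch, not specified by the paper), not the reference implementation.

```python
import numpy as np

# Critical chi-square values (1 degree of freedom) quoted in the paper.
CHI2_CRITICAL = {0.05: 3.843, 0.10: 2.706}

def bin_count(values, min_bins=8, max_bins=100):
    """Step 2 heuristic: the number of bins grows with N**(1/3) and shrinks
    with the standard deviation (a Scott-style rule); the constants and the
    clipping limits are illustrative assumptions."""
    n, std = len(values), float(np.std(values))
    if n < 2 or std == 0.0:
        return min_bins
    width = 3.5 * std * n ** (-1.0 / 3.0)              # Scott's bin-width rule
    bins = int(np.ceil((values.max() - values.min()) / width))
    return int(np.clip(bins, min_bins, max_bins))

def best_split(counts, alpha=0.05):
    """Step 3 sketch: scan one histogram for the best valid splitting point.
    A candidate bin is a valley flanked on both sides by higher peaks; it is
    valid when chi2 = 2 * (observed - expected)**2 / expected exceeds the
    critical value, with observed = valley count and expected = average of
    the valley and the lower peak counts.  The lowest-density valid valley
    (its bin index) is returned, or None if no valid cutting plane exists."""
    critical = CHI2_CRITICAL[alpha]
    best_bin, best_count = None, np.inf
    for v in range(1, len(counts) - 1):
        lower_peak = min(counts[:v].max(), counts[v + 1:].max())
        if counts[v] >= lower_peak:
            continue                                   # not a valley between two peaks
        observed, expected = counts[v], (counts[v] + lower_peak) / 2.0
        if 2.0 * (observed - expected) ** 2 / expected >= critical:
            if counts[v] < best_count:
                best_bin, best_count = v, counts[v]
    return best_bin
```

In terms of Step 4, a valley that passes this test at the 90% level but fails it at the 95% level would correspond to an ambiguous partition.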

3. O-Cluster Complexity
O-Cluster can use an arbitrary set of projections. Our current implementation is restricted to projections that are axis-parallel. The histogram computation step is of complexity O(N × d), where N is the number of data points in the buffer and d is the number of dimensions. The selection of the best splitting point for a single dimension is O(b), where b is the average number of histogram bins in a partition. Choosing the best splitting point over all dimensions is O(d × b). The assignment of data points to newly created partitions requires a comparison of an attribute value to the splitting point, and the complexity has an upper bound of O(N). Loading new records into the data buffer requires their insertion into the relevant partitions. The complexity associated with scoring a record depends on the depth s of the binary clustering tree, so the upper limit for filling the whole active buffer is O(N × s). The depth of the tree depends on the data set. In general, the total complexity can be approximated as O(N × d). Section 4 shows that O-Cluster scales linearly with the number of records and the number of dimensions.
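
To relate the complexity argument to the recursive structure, here is a toy, purely in-memory rendering of the partitioning loop. It reuses bin_count() and best_split() from the sketch in Section 2 and deliberately ignores the active buffer, reloading, and the frozen/ambiguous bookkeeping, so it should be read as an illustration rather than the algorithm itself.

```python
import numpy as np

def o_cluster_in_memory(points, depth=0, max_depth=20):
    """Recursively split an in-memory point set with axis-parallel cutting
    planes; returns the leaf partitions as (lower_corner, upper_corner) boxes."""
    if len(points) == 0:
        return []
    box = (points.min(axis=0), points.max(axis=0))
    if depth >= max_depth or len(points) < 2:
        return [box]
    best = None                                        # (valley count, dim, threshold)
    for d in range(points.shape[1]):                   # O(d) histograms per partition
        counts, edges = np.histogram(points[:, d], bins=bin_count(points[:, d]))
        b = best_split(counts)
        if b is not None and (best is None or counts[b] < best[0]):
            best = (counts[b], d, edges[b + 1])        # cut at the valley bin's right edge
    if best is None:
        return [box]                                   # leaf: flat or unimodal density
    _, d, threshold = best
    mask = points[:, d] <= threshold                   # O(N) assignment to the two halves
    return (o_cluster_in_memory(points[mask], depth + 1, max_depth) +
            o_cluster_in_memory(points[~mask], depth + 1, max_depth))
```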

4. Empirical Results
This section illustrates the general behavior of O-Cluster and evaluates the correctness of its solutions. The first series of tests was carried out on a two-dimensional data set, DS3 [ZRL96]. This is a particularly challenging benchmark. The low number of dimensions makes the use of any axis-parallel partitioning algorithm problematic. Also, the data set consists of 100 spherical clusters that vary significantly in their size and density. The number of points per cluster is a random number in the range [0, 2000] drawn from a uniform distribution, and the variance across dimensions for each cluster is a random number in the range [0, 2], also drawn from a uniform distribution.
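
For readers who want to reproduce a comparable setting, a DS3-like data set can be approximated as below. This is only a stand-in with assumed cluster-center coordinates in [0, 100] per dimension; it is not the original benchmark.

```python
import numpy as np

def make_ds3_like(n_clusters=100, dim=2, max_points=2000, max_var=2.0,
                  domain=100.0, seed=0):
    """Gaussian clusters whose sizes are uniform in [0, max_points] and whose
    variances are uniform in [0, max_var]; the center domain is an assumption."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, domain, size=(n_clusters, dim))
    blobs = [rng.normal(c, np.sqrt(rng.uniform(0.0, max_var)),
                        size=(rng.integers(0, max_points + 1), dim))
             for c in centers]
    return np.vstack(blobs), centers
```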

4.1. O-Cluster on DS3


Figure 2 depicts the partitions found by O-Cluster on the DS3 data set. The centers of the original clusters are marked with squares while the centroids of the points assigned to each O-Cluster partition are represented by stars.

Figure 2: O-Cluster partitions on the DS3 data set. The grid depicts the splitting planes found by O-Cluster. Squares represent the original cluster centroids; stars (*) represent the centroids of the points belonging to an O-Cluster partition. Recall = 71%, precision = 97%.

Although O-Cluster does not function optimally when the dimensionality is low, it produces a good set of partitions. It is noteworthy that O-Cluster finds cutting planes at different levels of density and successfully identifies nested clusters. Axis-parallel splits in low dimensions can easily lead to the creation of artifacts where cutting planes have to cut through parts of a cluster and data points are assigned to incorrect partitions. Such artifacts can either result in centroid imprecision or lead to further partitioning and the creation of spurious clusters. For example, in Figure 2 O-Cluster creates 73 partitions. Of these, 71 contain the centroid of at least one of the original clusters. The remaining 2 partitions were produced due to artifacts created by splits going through clusters.

In general, there are two potential sources of imprecision in the algorithm: 1) O-Cluster may fail to create partitions for all original clusters; and/or 2) O-Cluster may create spurious partitions that do not correspond to any of the original clusters. To measure these two effects separately, we use two metrics borrowed from the information retrieval domain: recall is defined as the percentage of the original clusters that were found and assigned to partitions; precision is defined as the percentage of the found partitions that contain at least one original cluster centroid. That is, in Figure 2 O-Cluster found 71 out of 100 original clusters (a recall of 71%), and 71 of the 73 partitions created contained at least one centroid of the original clusters (a precision of 97%).
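
Stated as code, and under a literal reading consistent with the numbers above, both metrics reduce to counting "hit" partitions. The sketch below assumes each partition is summarized by an axis-aligned bounding box (lower and upper corner); that representation is an assumption of the example.

```python
import numpy as np

def recall_precision(original_centroids, partition_boxes):
    """recall = hit partitions / number of original clusters,
    precision = hit partitions / number of partitions found,
    where a partition is a 'hit' if it contains at least one original centroid."""
    hits = 0
    for lo, hi in partition_boxes:
        inside = np.all((original_centroids >= lo) & (original_centroids <= hi), axis=1)
        hits += bool(inside.any())
    return hits / len(original_centroids), hits / len(partition_boxes)
```

With 73 partitions of which 71 contain a centroid, this yields 71/100 = 71% recall and 71/73 ≈ 97% precision, matching Figure 2.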

4.2. The Sensitivity Parameter


The effect of creating spurious clusters due to splitting artifacts can be alleviated by using O-Cluster's sensitivity parameter. Sensitivity is a parameter in the [0, 1] range that is inversely proportional to the minimum count required for a histogram peak. A value of 0 requires the histogram peaks to surpass the count corresponding to a global uniform level per dimension. The global uniform level is defined as the average histogram count that would have been observed if the data points in the buffer were drawn from a uniform distribution. A value of 0.5 sets the minimum histogram count for a peak to 50% of the global uniform level. A value of 1 removes the restriction on peak histogram counts, and the splitting point identification relies solely on the χ² test. The results shown in Figure 2 were produced with sensitivity = 0.95.

Figure 3 illustrates the effect of changing the sensitivity value. Increasing the sensitivity enables O-Cluster to grow the clustering hierarchy deeper and thus obtain improved recall. However, sensitivity values that are too high may result in excessive splitting and thus poor precision. It should be noted that the effect of the sensitivity parameter is magnified by the particular characteristics of the DS3 data set. The 2D dimensionality leads to splitting artifacts that become the main reason for oversplitting. Additionally, the original clusters in the DS3 data set vary significantly in their number of records, and low sensitivity values can filter out some of the weaker clusters. Higher dimensionality and more evenly represented clusters reduce O-Cluster's sensitivity to this parameter.
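
The peak threshold implied by the sensitivity parameter can be written down directly. The helper below is a sketch of that relationship; the function name and its use as a filter in the splitting-point search are assumptions of this example.

```python
def min_peak_count(n_buffer_points, n_bins, sensitivity):
    """Global uniform level = average bin count expected if the buffered data
    were uniformly distributed along the dimension.  sensitivity = 0 requires
    peaks above that level, 0.5 requires half of it, 1 removes the restriction."""
    uniform_level = n_buffer_points / n_bins
    return (1.0 - sensitivity) * uniform_level
```

Such a threshold could be checked against the flanking peaks before a candidate valley is accepted in a search like best_split() from Section 2.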


Figure 3: Effect of the sensitivity parameter. The grid depicts the splitting planes found by O-Cluster. Squares represent the original cluster centroids; stars (*) represent the centroids of the points belonging to an O-Cluster partition. (a) sensitivity = 0, recall = 22%, precision = 100%; (b) sensitivity = 0.5, recall = 40%, precision = 100%; (c) sensitivity = 0.75, recall = 56%, precision = 100%; (d) sensitivity = 1, recall = 72%, precision = 84%.

4.3. The Effect of Dimensionality


In order to illustrate the benefits of higher dimensionality, the DS3 data set was extended to 5 and 10 dimensions. The sensitivity was set to 1 for both experiments. Figure 4 shows the 2D projection of the data set, the original cluster centroids, and the centroids of O-Cluster's partitions in the plane specified by the original two dimensions. The O-Cluster grid is not shown, since the cutting planes in higher dimensions cannot be plotted in a meaningful way. It can be seen that O-Cluster's accuracy (both recall and precision) improves dramatically with increased dimensionality. The main reason for the remarkably good performance is that higher dimensionality allows O-Cluster to find cutting planes that do not produce splitting artifacts.


Figure 4: Effect of dimensionality. Squares represent the original cluster centroids; stars (*) represent the centroids of the points belonging to an O-Cluster partition. (a) dimensionality = 5, recall = 99%, precision = 96%; (b) dimensionality = 10, recall = 100%, precision = 100%.

4.4. The Effect of Uniform Noise

O-Cluster shares one remarkable feature with OptiGrid: its resistance to uniform noise. To test O-Cluster's robustness to uniform noise, a synthetic data set consisting of 100,000 points was generated. It consisted of 50 spherical clusters, with variance in the range [0, 2], each represented by 2,000 points. To introduce uniform noise into the data set, a certain percentage of the original records were replaced by records drawn from a uniform distribution on each dimension. O-Cluster was tested with 25%, 50%, 75%, and 90% noise. For example, when the percentage of noise was 90%, the original clusters were represented by 10,000 points (200 per cluster on average) and the remaining 90,000 points were uniform noise. All experiments were run with sensitivity = 0.8. Figure 5 illustrates O-Cluster's performance under noisy conditions. O-Cluster's accuracy degrades very gracefully with an increasing percentage of background noise. Higher dimensionality provides a slight advantage when handling noise.
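
The noise injection used in these experiments can be mimicked as follows; drawing the replacement points uniformly over the bounding box of the data is an assumption of this sketch.

```python
import numpy as np

def add_uniform_noise(points, fraction, seed=0):
    """Replace a given fraction of the records with points drawn uniformly
    over the bounding box of the data set."""
    rng = np.random.default_rng(seed)
    noisy = points.copy()
    n_noise = int(round(fraction * len(points)))
    idx = rng.choice(len(points), size=n_noise, replace=False)
    lo, hi = points.min(axis=0), points.max(axis=0)
    noisy[idx] = rng.uniform(lo, hi, size=(n_noise, points.shape[1]))
    return noisy
```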

Figure 5: Effect of uniform noise. (a) Recall for 5 and 10 dimensions; (b) Precision for 5 and 10 dimensions.


It should also be noted that once background noise is introduced, the centroids of the partitions produced by O-Cluster are offset from the original cluster centroids. In order to identify the original centers, it is necessary to discount the background noise from the histograms and compute centroids on the remaining points. This can be accomplished by filtering out the histogram bins that would fall below a level corresponding to the average bin count for this partition.
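
One way to realize the noise-discounting idea just described is sketched below: along each dimension, points falling in histogram bins below the partition's average bin count are dropped before the centroid is computed. The per-dimension filtering and the fixed bin count are assumptions of this example.

```python
import numpy as np

def denoised_centroid(points, bins=20):
    """Centroid of the points that survive bin filtering: a point is kept only
    if, along every dimension, it falls into a bin whose count is at least the
    average bin count of that histogram."""
    keep = np.ones(len(points), dtype=bool)
    for d in range(points.shape[1]):
        counts, edges = np.histogram(points[:, d], bins=bins)
        dense = counts >= counts.mean()                # bins at or above the average count
        bin_idx = np.digitize(points[:, d], edges[1:-1])
        keep &= dense[bin_idx]
    return points[keep].mean(axis=0) if keep.any() else points.mean(axis=0)
```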

4.5. O-Cluster Scalability


The next series of tests addresses O-Cluster's scalability with increasing numbers of records and dimensions. All data sets used in the experiments consisted of 50 clusters, and all 50 clusters were correctly identified in each test. When measuring scalability with an increasing number of records, the number of dimensions was set to 10. When measuring scalability with increasing dimensionality, the number of records was set to 100,000. Figure 6 shows a clear linear dependency of O-Cluster's processing time on both the number of records and the number of dimensions. In general, these timing results can be improved significantly, because the algorithm was implemented as a PL/SQL package in an Oracle9i database and there is an overhead associated with the fact that PL/SQL is an interpreted language.

Figure 6: Scalability. (a) Scalability with number of records (10 dimensions); (b) Scalability with number of dimensions (100,000 records).

4.6. Working with a Limited Buffer Size


In all tests described so far, O-Cluster had a sufficiently large buffer to accommodate the entire data set. The next set of results illustrates O-Cluster's behavior when the algorithm is required to have a small memory footprint, such that the active buffer can contain only a fraction of the entire data set. This series of tests reuses the data set described in Section 4.4 (50 clusters, 2,000 points each, 10 dimensions). For all tests, the sensitivity was set to 0.8. Figure 7 shows the timing and recall numbers for different buffer sizes (0.5%, 0.8%, 1%, 5%, and 10% of the entire data set). Very small buffer sizes may require multiple refills. For example, when the buffer size was 0.5%, O-Cluster needed to refill it 5 times; when the buffer size was 0.8% or 1%, O-Cluster had to refill it once. For larger buffer sizes, no refills were necessary. As a result, using a 0.8% buffer proved to be slightly faster than using a 0.5% buffer. If no buffer refills were required (buffer size greater than 1%), O-Cluster followed a linear scalability pattern, as shown in the previous section.

Regarding O-Cluster's accuracy, buffer sizes under 1% proved to be too small for the algorithm to find all existing clusters. For a buffer size of 0.5%, O-Cluster found 41 out of 50 clusters (82% recall), and for a buffer size of 0.8%, it found 49 out of 50 clusters (98% recall). Larger buffer sizes allowed O-Cluster to correctly identify all original clusters. For all buffer sizes (including those smaller than 1%), precision was 100%.

Figure 7: Buffer size. (a) Time scalability; (b) Recall.

5. Conclusions
The majority of existing clustering algorithms encounter serious scalability and/or accuracy problems when used on data sets with a large number of records and/or dimensions. We propose a new clustering algorithm, O-Cluster, capable of efficiently and effectively clustering large high dimensional data sets. It relies on a novel active sampling approach and uses an axis-parallel partitioning scheme to identify hyper-rectangular regions of unimodal density in the input feature space. O-Cluster has good accuracy and scalability, is robust to noise, automatically detects the number of clusters in the data, and can successfully operate with limited memory resources. Currently we are extending O-Cluster in a number of ways, including:

• Parallel implementation. The results presented in this paper used a serial implementation of O-Cluster. Performance can be significantly improved by parallelizing the following steps of O-Cluster:
  o Buffer filling;
  o Histogram computation and splitting point determination;
  o Assigning records to partitions.
• Cluster representation through rules, especially useful for noisy cases where centroids do not characterize a cluster well.
• Probabilistic modeling and scoring with missing values: missing values can be a problem during record assignment.
• Handling categorical and mixed (categorical and numerical) data sets.

These extensions will be reported in a future paper.


References
[AGGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 94-105, 1998.

[APYP99] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 61-72, 1999.

[AY00] C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional spaces. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), pages 70-81, 2000.

[BFR98] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 8-15, 1998.

[BGRS99] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In Proc. 7th Int. Conf. on Database Theory (ICDT'99), pages 217-235, 1999.

[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. In Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining (KDD'96), pages 226-231, 1996.

[FLE00] F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2:51-57, 2000.

[GRS98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 73-84, 1998.

[HAK00] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? In Proc. 26th Int. Conf. on Very Large Data Bases (VLDB'00), pages 506-515, 2000.

[HB97] T. Hofmann and J. Buhmann. Active data clustering. In Advances in Neural Information Processing Systems (NIPS'97), pages 528-534, 1997.

[HK98] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 58-65, 1998.

[HK99] A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proc. 25th Int. Conf. on Very Large Data Bases (VLDB'99), pages 506-517, 1999.

[KR90] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.

[Mac67] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.

[NGC99] H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report 9906-010, Northwestern University, June 1999.

[NH94] R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB'94), pages 144-155, 1994.

[Sco79] D. W. Scott. Multivariate density estimation. New York: John Wiley & Sons, 1979.

[SCZ98] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 1998 Int. Conf. on Very Large Data Bases (VLDB'98), pages 428-439, 1998.

[Wan96] M. P. Wand. Data-based choice of histogram bin width. The American Statistician, 51:59-64, 1996.

[WYM97] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proc. 1997 Int. Conf. on Very Large Data Bases (VLDB'97), pages 186-195, 1997.

[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), pages 103-114, 1996.
