Boriana L. Milenova
boriana.milenova@oracle.com
Marcos M. Campos
marcos.m.campos@oracle.com
Oracle Data Mining Technologies, 10 Van de Graaff Drive, Burlington, MA 01803, USA
Abstract
Clustering large data sets of high dimensionality has always been a serious challenge for clustering algorithms. Many recently developed clustering algorithms have attempted to handle either data sets with a very large number of records or data sets with a very large number of dimensions. This paper discusses the advantages and limitations of existing algorithms when they operate on very large multidimensional data sets. To simultaneously overcome both the curse of dimensionality and the scalability problems associated with large amounts of data, we propose a new clustering algorithm called O-Cluster. This new clustering method combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their robustness to noise, and O-Cluster's excellent scalability.
1. Introduction
With an increasing number of new database applications dealing with very large high dimensional data sets, data mining on such data sets has emerged as an important research area. These applications include multimedia content-based retrieval, geographic and molecular biology data analysis, text mining, bioinformatics, medical applications, and time-series matching. For example, in multimedia retrieval, the objects (e.g., images) are represented by their features (e.g., color histograms, texture vectors, Fourier vectors, text descriptors, and shape descriptors), which define high dimensional feature spaces. In many of the above-mentioned applications the data sets are very large, consisting of millions of data objects with several hundreds to thousands of dimensions. Clustering of very large high dimensional data sets is an important problem. There are a number of different clustering algorithms that are applicable to very large data sets, and a few that address high dimensional data. Clustering algorithms can be divided into partitioning, hierarchical, locality-based, and grid-based algorithms.
Given a data set with n objects and k (k ≤ n), the number of desired clusters, partitioning algorithms partition the objects into k clusters. The clusters are formed so as to optimize an objective criterion such as distance. Each object is assigned to the closest cluster. Clusters are typically represented either by the mean of the objects assigned to the cluster (k-means [Mac67]) or by one representative object of the cluster (k-medoid [KR90]). CLARANS (Clustering Large Applications based upon RANdomized Search) [NH94] is a partitioning clustering algorithm developed for large data sets, which uses a randomized and bounded search strategy to improve the scalability of the k-medoid approach. CLARANS enables the detection of outliers and its computational complexity is about O(n²). CLARANS performance can be improved by exploiting spatial data structures such as R*-trees. Hierarchical clustering algorithms work by grouping data objects into a hierarchy (e.g., a tree) of clusters. The hierarchy can be formed top-down (divisive hierarchical methods) or bottom-up (agglomerative hierarchical methods). Hierarchical methods rely on a distance function to measure the similarity between clusters. These methods do not scale well with the number of data objects; their computational complexity is usually O(n²). Some newer methods, such as BIRCH [ZRL96] and CURE [GRS98], attempt to address the scalability problem and improve the quality of clustering results for hierarchical methods. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an efficient divisive hierarchical algorithm. It has O(n) computational complexity, can work with a limited amount of memory, and has efficient I/O. It uses a special data structure, the CF-tree (Cluster Feature tree), for storing summary information about subclusters of objects. The CF-tree structure can be seen as a multilevel compression of the data that attempts to preserve the clustering structure inherent in the data set.
Because of the similarity measure it uses to determine the data items to be compressed, BIRCH only performs well on data sets with spherical clusters. CURE (Clustering Using REpresentatives) is an O(n²) algorithm that produces high-quality clusters in the presence of outliers and can identify clusters of complex shapes and different sizes. It employs a hierarchical clustering approach that uses a fixed number of representative points, instead of a single centroid or object, to define a cluster. CURE handles large data sets through a combination of random sampling and partitioning. Since CURE uses only a random sample of the data set, it achieves good scalability for large data sets. CURE reports better times than BIRCH on the same benchmark data. Locality-based clustering algorithms group neighboring data objects into clusters based on local conditions. These algorithms allow clustering to be performed in one scan of the data set. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [EKSX96] is a typical representative of this group of algorithms. It regards clusters as dense regions of objects in the input space that are separated by regions of low density. DBSCAN's basic idea is that the density of points in a radius around each point in a cluster has to be above a certain threshold. It grows a cluster as long as, for each data point within the cluster, a neighborhood of a given radius contains at least a minimum number of points. DBSCAN has computational complexity O(n²); if a spatial index is used, the complexity is O(n log n). The clustering generated by DBSCAN is very sensitive to parameter choice. OPTICS (Ordering Points To Identify Clustering
Structures) is another locality-based clustering algorithm. It computes an augmented cluster ordering for automatic and interactive clustering analysis. OPTICS has the same computational complexity as DBSCAN. In general, partitioning, hierarchical, and locality-based clustering algorithms do not scale well with the number of objects in the data set. To improve efficiency, data summarization techniques integrated with the clustering process have been proposed. Besides the above-mentioned BIRCH and CURE algorithms, examples include active data clustering [HB97], ScalableKM [BFR98], and simple single pass k-means [FLE00]. Active data clustering utilizes principles from sequential experimental design to interleave data generation and data analysis. It infers from the available data not only the grouping structure, but also which data are most relevant for the clustering problem. The inferred relevance of the data is then used to control the re-sampling of the data set. ScalableKM requires at most one scan of the data set. The method identifies data points that can be effectively compressed, data points that must be maintained in memory, and data points that can be discarded. The algorithm operates within the confines of a limited memory buffer. Unfortunately, the compression schemes used by ScalableKM can introduce significant overhead. The simple single pass k-means algorithm is a simplification of ScalableKM. Like ScalableKM, it uses a data buffer of fixed size. Experiments indicate that the simple single pass k-means algorithm is several times faster than standard k-means while producing clusterings of comparable quality. None of the above-mentioned methods is fully effective when clustering high dimensional data. Methods that rely on near or nearest neighbor information do not work well in high dimensional spaces.
In high dimensional data sets it is very unlikely that data points are significantly nearer to each other than the average distance between data points, because the space is sparsely filled. As a result, as the dimensionality of the space increases, the difference between the distance to the nearest and the farthest neighbor of a data object goes to zero [BGRS99, HAK00]. Grid-based clustering algorithms do not suffer from the nearest neighbor problem in high dimensional spaces. Examples include STING (STatistical INformation Grid) [WYM97], CLIQUE [AGGR98], DENCLUE [HK98], WaveCluster [SCZ98], and MAFIA (Merging Adaptive Finite Intervals And is more than a clique) [NGC99]. These methods divide the input space into hyper-rectangular cells, discard the low-density cells, and then combine adjacent high-density cells to form clusters. Grid-based methods are capable of discovering clusters of any shape and are also reasonably fast. However, none of these methods addresses how to efficiently cluster very large data sets that do not fit in memory. Furthermore, these methods only work well on input spaces with low to moderate numbers of dimensions. As the dimensionality of the space increases, grid-based methods face serious problems: the number of cells grows exponentially and finding adjacent high-density cells to form clusters becomes prohibitively expensive [HK99].
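The distance-concentration effect described above can be illustrated with a small Monte Carlo experiment (a sketch for intuition only, not an experiment from this paper): for points drawn uniformly in a unit hypercube, the relative spread of distances from the origin shrinks as the dimensionality grows.

```python
import random
import math

def concentration_ratio(n_points, dims, seed=0):
    """Relative spread (d_max - d_min) / d_min of distances from the origin
    to uniformly drawn points; this ratio shrinks as dimensionality grows."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dims)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return (max(dists) - min(dists)) / min(dists)

# In 2 dimensions the nearest and farthest points differ enormously;
# in 1000 dimensions all distances concentrate around the same value.
low_dim_spread = concentration_ratio(500, 2)
high_dim_spread = concentration_ratio(500, 1000)
```

With 500 points, the spread in 1000 dimensions is orders of magnitude smaller than in 2 dimensions, which is why nearest-neighbor-based methods lose discriminative power in high dimensional spaces.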
In order to address the curse of dimensionality, a number of algorithms have focused on data projections in subspaces. Examples include PROCLUS [APYP99], OptiGrid [HK99], and ORCLUS [AY00]. PROCLUS uses axis-parallel partitions to identify subspace clusters. ORCLUS uses generalized projections to identify subspace clusters. OptiGrid is an especially interesting algorithm due to its simplicity and its ability to find clusters in high dimensional spaces in the presence of noise. OptiGrid constructs a grid partitioning of the data by calculating the partitioning hyperplanes using contracting projections of the data. OptiGrid looks for hyperplanes that satisfy two requirements: 1) separating hyperplanes should cut through regions of low density relative to the surrounding regions; and 2) separating hyperplanes should place individual clusters into different partitions. The first requirement aims at preventing oversplitting, that is, a cutting plane should not split a cluster. The second requirement attempts to achieve good cluster discrimination, that is, the cutting plane should contribute to finding the individual clusters. OptiGrid recursively constructs a multidimensional grid by partitioning the data using a set of cutting hyperplanes, each of which is orthogonal to at least one projection. At each step, the generation of the set of candidate hyperplanes is controlled by two threshold parameters. The implementation of OptiGrid described in the paper uses axis-parallel partitioning hyperplanes. The authors show that the error introduced by axis-parallel partitioning decreases exponentially with the number of dimensions in the data space. This validates the use of axis-parallel projections as an effective approach for separating clusters in high dimensional spaces. OptiGrid, however, has two main shortcomings: it is sensitive to parameter choice, and it does not prescribe a strategy for efficiently handling data sets that do not fit in memory.
To overcome both the scalability problems associated with large amounts of data and high dimensional data input space, this paper introduces a new clustering algorithm called O-Cluster (Orthogonal partitioning CLUSTERing). This new clustering method combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data.
2. The O-Cluster Algorithm
O-Cluster uses a statistical test to validate the quality of a cutting plane. This test proves crucial for identifying good splitting points along data projections and makes automated selection of high quality separators possible. The algorithm can operate on a small buffer containing a random sample from the original data set. Active sampling ensures that partitions receive additional data points if more information is needed to evaluate a cutting plane. Partitions without ambiguities are frozen and the data points associated with them are removed from the active buffer.
O-Cluster operates recursively. It evaluates possible splitting points for all projections in a partition, selects the best one, and splits the data into two new partitions. The algorithm proceeds by searching for good cutting planes inside the newly created partitions. Thus O-Cluster creates a hierarchical tree structure that tessellates the input space into rectangular regions. Figure 1 provides an outline of O-Cluster's algorithm.
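The recursive tessellation can be outlined in a few lines (a hypothetical sketch only: `find_best_split` and `split` stand in for the histogram-based steps detailed in this section and are not part of the published pseudocode):

```python
def o_cluster(partition, find_best_split, split):
    """Recursively split a partition along axis-parallel cutting planes,
    returning the leaves of the resulting binary clustering tree.

    find_best_split(partition) -> best valid cutting plane, or None.
    split(partition, plane)    -> (left_partition, right_partition).
    """
    best = find_best_split(partition)
    if best is None:
        # No valid separator: this partition is a leaf (frozen or ambiguous).
        return [partition]
    left, right = split(partition, best)
    return (o_cluster(left, find_best_split, split) +
            o_cluster(right, find_best_split, split))
```

For example, with toy one-dimensional "partitions" (lists of values) and a splitter that cuts any partition with more than two points at value 2, the recursion returns the two rectangular leaf regions.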
Figure 1: Flowchart of the O-Cluster algorithm (1. load buffer; compute histograms for active partitions; find best splitting points; flag ambiguous and frozen partitions; split active partitions; 6. reload buffer; exit when no active or ambiguous partitions remain).
The main processing stages are as follows:

1. Load data buffer: If the entire data set does not fit in the buffer, a random sample is used. O-Cluster assigns all points from the initial buffer to a root partition.

2. Compute histograms for active partitions: The goal is to determine a set of projections for the active partitions and to compute histograms along these projections. Any partition that represents a leaf in the clustering hierarchy and is not explicitly marked ambiguous or frozen is considered active. The process whereby an active partition becomes ambiguous or frozen is explained in Step 4. It is essential to compute histograms that provide good resolution but also have data artifacts smoothed out. A number of studies have addressed the question of how many bins a given distribution can support [Sco79, Wan96]. Based on these studies, a reasonable, simple approach is to make the number of bins inversely proportional to the standard deviation of the data along a given dimension and directly proportional to N^(1/3), where N is the number of points inside a partition. Alternatively, one can use a global binning strategy and coarsen the histograms as the number of points inside the partitions decreases. O-Cluster is robust with respect to different binning strategies as long as the histograms do not significantly undersmooth or oversmooth the distribution density.

3. Find best splitting points for active partitions: For each histogram, O-Cluster attempts to find the best valid cutting plane, if one exists. A valid cutting plane passes through a point of low density (a valley) in the histogram. Additionally, the point of low density should be surrounded on both sides by points of high density (peaks). O-Cluster attempts to find a pair of peaks with a valley between them where the difference between the peak and valley histogram counts is statistically significant. Statistical significance is tested using a standard χ² test:
χ² = 2(observed - expected)² / expected ≥ χ²(α, 1),

where the observed value is the histogram count of the valley and the expected value is the average of the histogram counts of the valley and the lower peak. The current implementation uses a 95% confidence level (χ²(0.05, 1) = 3.843). Since multiple splitting points per partition can qualify as valid separators under this test, O-Cluster chooses as the best splitting point the one whose valley has the lowest histogram count. The cutting plane thus passes through the area of lowest density.
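The validity test and the selection of the best splitting point can be sketched as follows. This is a minimal illustration, not Oracle's implementation; the two-cell form of the statistic, 2(observed - expected)²/expected, follows from testing the valley count against an expected value equal to the average of the valley and the lower peak, and the simple peak/valley detection is an assumption for the sake of the example.

```python
CHI2_95 = 3.843  # chi-squared threshold at 95% confidence, 1 degree of freedom

def is_valid_split(valley_count, left_peak, right_peak, threshold=CHI2_95):
    """Chi-squared test: is the valley significantly lower than the peaks?"""
    observed = valley_count
    expected = (valley_count + min(left_peak, right_peak)) / 2.0
    if expected == 0:
        return False
    chi2 = 2.0 * (observed - expected) ** 2 / expected
    return chi2 > threshold

def best_split(histogram):
    """Among statistically valid valleys, pick the one with the lowest
    count (the cutting plane through the area of lowest density).
    Returns the bin index, or None if no valid splitting point exists."""
    best_idx, best_count = None, None
    for i in range(1, len(histogram) - 1):
        left_peak = max(histogram[:i])
        right_peak = max(histogram[i + 1:])
        if (histogram[i] < left_peak and histogram[i] < right_peak
                and is_valid_split(histogram[i], left_peak, right_peak)):
            if best_count is None or histogram[i] < best_count:
                best_idx, best_count = i, histogram[i]
    return best_idx
```

A histogram such as [50, 60, 2, 55, 65] yields a valid split at the deep valley (bin 2), while a flat histogram yields no split at all.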
4. Flag ambiguous and frozen partitions: If no valid splitting points are found, O-Cluster checks whether the χ² test would have found a valid splitting point at a lower confidence level (e.g., 90%, with χ²(0.1, 1) = 2.706). If so, the current partition is considered ambiguous: more data points are needed to establish the quality of the splitting point. If no splitting points were found and there is no ambiguity, the partition is marked as frozen and the records associated with it are marked for deletion from the active buffer.

5. Split active partitions: If a valid separator exists, the data points are split along the cutting plane and two new active partitions are created from the original partition. For each new partition, processing proceeds recursively from Step 2.

6. Reload buffer: This step takes place after all recursive partitioning on the current buffer has completed. If all existing partitions are marked as frozen and/or no more data points are available, the algorithm exits. Otherwise, if some partitions are marked as ambiguous and additional unseen data records exist, O-Cluster reloads the data buffer. The new data replace records belonging to frozen partitions. When new records are read in, only data points that fall inside ambiguous partitions are placed in the active buffer; new records falling within a frozen partition are not loaded into the buffer. If it is desirable to maintain statistics of the data points falling inside partitions (including the frozen partitions), such statistics can be updated continuously as each new record is read. Loading of new records continues until either: 1) the active buffer is filled again; 2) the end of the data set is reached; or 3) a reasonable number of records have been read, even if the active buffer is not full and more data remain. The rationale for the last condition is that if the buffer is relatively large and many points are marked for deletion, it may take a long time to fill the entire buffer with data from the ambiguous regions. To avoid excessive reloading under these circumstances, the buffer reloading process is terminated after reading a number of records equal to the data buffer size.
Once the buffer reload is completed, the algorithm proceeds from Step 2. The algorithm requires, at most, a single pass through the entire data set. In addition to the major differences from OptiGrid noted in the beginning of this section, there are two other important distinctions:
OptiGrid's choice of a valid cutting plane depends on a pair of global parameters: noise level and maximal splitting density. These two parameters act as thresholds for identifying valid splitting points. In OptiGrid, histogram peaks are required to be above the noise level parameter, while histogram valleys need to have density lower than the maximal splitting density. The maximal splitting density should be set above the noise level threshold (personal communication with OptiGrid's authors). Finding correct values for these parameters is critical for OptiGrid's performance. O-Cluster's χ² test for splitting points eliminates the need for preset thresholds: the algorithm can find valid cutting planes at any density level within a histogram. While not strictly necessary for O-Cluster's operation, it was found useful, in the course of the algorithm's evolution, to introduce a parameter called sensitivity (ρ). Analogous to OptiGrid's noise level, the role of this parameter is to suppress the creation of arbitrarily small clusters by setting a minimum count for O-Cluster's histogram peaks. The effect of ρ is illustrated in Section 4.
While OptiGrid attempts to find good cutting planes that optimally traverse the input space, it is prone to oversplitting. By design, OptiGrid can partition simultaneously along several cutting planes. This may result in the creation of clusters (with few points) that need to be subsequently removed. Additionally, OptiGrid works with histograms that undersmooth the distribution density (personal communication with OptiGrid's authors). Undersmoothed histograms and the threshold-based mechanism of splitting point identification can lead to the creation of separators that cut through clusters. These issues may not necessarily be a serious hindrance in OptiGrid's framework, since the algorithm attempts to construct a multidimensional grid where the highly populated cells are interpreted as clusters. O-Cluster, on the other hand, attempts to create a binary clustering tree whose leaves are regions with flat or unimodal density functions. Only a single cutting plane is applied at a time, and the quality of the splitting point is statistically validated.
O-Cluster functions optimally for large-scale data sets with many records and high dimensionality. It is desirable to work with a sufficiently large active buffer in order to calculate high quality histograms with good resolution. High dimensionality has been shown to significantly reduce the chance of cutting through data when using axis-parallel cutting planes [HK99]. There is no special handling for missing values: if a data record has missing values, the record does not contribute to the histogram counts along the affected dimensions. However, if a missing value is needed to assign the record to a partition, the record is not assigned and is marked for deletion from the active buffer.
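The missing-value policy described above can be sketched as two small helpers (a hypothetical illustration; the function names and data layout are assumptions, not part of the paper):

```python
def update_histograms(histograms, record, bin_of):
    """Add a record to per-dimension histograms, skipping missing values.

    histograms: dict mapping dimension index -> list of bin counts.
    bin_of(dim, value) -> bin index for a value along that dimension.
    """
    for dim, value in enumerate(record):
        if value is None:
            continue  # missing value: no contribution along this dimension
        histograms[dim][bin_of(dim, value)] += 1

def assign_side(record, split_dim, split_value):
    """Assign a record relative to an axis-parallel cutting plane.
    Returns 'left', 'right', or None when the needed value is missing
    (the record would be marked for deletion from the active buffer)."""
    value = record[split_dim]
    if value is None:
        return None
    return 'left' if value < split_value else 'right'
```

A record with a missing value thus still contributes to histograms along its observed dimensions, but cannot cross a split defined on the missing dimension.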
3. O-Cluster Complexity
O-Cluster can use an arbitrary set of projections. Our current implementation is restricted to axis-parallel projections. The histogram computation step has complexity O(N x d), where N is the number of data points in the buffer and d is the number of dimensions. The selection of the best splitting point for a single dimension is O(b), where b is the average number of histogram bins in a partition. Choosing the best splitting point over all dimensions is O(d x b). The assignment of data points to newly created partitions requires a comparison of an attribute value to the splitting point and has an upper bound of O(N). Loading new records into the data buffer requires their insertion into the relevant partitions. The complexity of scoring a record depends on the depth s of the binary clustering tree. The upper limit for filling the whole active buffer is O(N x s). The depth of the tree depends on the data set. In general, the total complexity can be approximated as O(N x d). Section 4 shows that O-Cluster scales linearly with the number of records and the number of dimensions.
4. Empirical Results
This section illustrates the general behavior of O-Cluster and evaluates the correctness of its solutions. The first series of tests was carried out on a two-dimensional data set, DS3 [ZRL96]. This is a particularly challenging benchmark. The low number of dimensions
makes the use of any axis-parallel partitioning algorithm problematic. Also, the data set consists of 100 spherical clusters that vary significantly in size and density. The number of points per cluster is a random number in the range [0, 2000] drawn from a uniform distribution, and the variance across dimensions for each cluster is a random number in the range [0, 2], also drawn from a uniform distribution.
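A data set in the spirit of this benchmark can be generated as follows. This is an illustrative sketch only, not the original DS3 generator; the center placement range and the generator seed are assumptions.

```python
import random

def make_ds3_like(n_clusters=100, dims=2, seed=42):
    """Spherical Gaussian clusters with sizes drawn from U[0, 2000] and
    per-cluster variance drawn from U[0, 2], as described in the text."""
    rng = random.Random(seed)
    points, centroids = [], []
    for _ in range(n_clusters):
        center = [rng.uniform(0, 100) for _ in range(dims)]  # assumed range
        n_pts = rng.randint(0, 2000)       # cluster size ~ U[0, 2000]
        sigma = rng.uniform(0, 2) ** 0.5   # std dev from variance ~ U[0, 2]
        centroids.append(center)
        for _ in range(n_pts):
            points.append([rng.gauss(c, sigma) for c in center])
    return points, centroids
```

Such a generator makes it easy to vary the number of clusters and dimensions when reproducing experiments of this kind.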
Figure 2: O-Cluster partitions on the DS3 data set. The grid depicts the splitting planes found by O-Cluster. Squares represent the original cluster centroids; stars (*) represent the centroids of the points belonging to an O-Cluster partition; recall = 71%, precision = 97%.
Although O-Cluster does not function optimally when the dimensionality is low, it produces a good set of partitions. It is noteworthy that O-Cluster finds cutting planes at different levels of density and successfully identifies nested clusters. Axis-parallel splits in low dimensions can easily lead to the creation of artifacts where cutting planes have to cut through parts of a cluster and data points are assigned to incorrect partitions. Such artifacts can either result in centroid imprecision or lead to further partitioning and creation of spurious clusters. For example, in Figure 2 O-Cluster creates 73 partitions. Of
these, 71 contain the centroids of at least one of the original clusters. The remaining 2 partitions were produced due to artifacts created by splits going through clusters. In general, there are two potential sources of imprecision in the algorithm: 1) O-Cluster may fail to create partitions for all original clusters; and/or 2) O-Cluster may create spurious partitions that do not correspond to any of the original clusters. To measure these two effects separately, we use two metrics borrowed from the information retrieval domain: Recall is defined as the percentage of the original clusters that were found and assigned to partitions; Precision is defined as the percentage of the found partitions that contain at least one original cluster centroid. That is, in Figure 2 O-Cluster found 71 out of 100 original clusters (resulting in recall of 71%), and 71 out of the 73 partitions created contained at least one centroid of the original clusters (a precision of 97%).
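The two metrics defined above can be computed directly from the found partitions and the original centroids (a minimal sketch; the membership test `contains` is a user-supplied assumption, e.g. a hyper-rectangle containment check):

```python
def recall_precision(original_centroids, partitions, contains):
    """Recall: fraction of original centroids located inside some partition.
    Precision: fraction of partitions containing at least one centroid.
    contains(partition, centroid) -> bool membership test."""
    found = sum(1 for c in original_centroids
                if any(contains(p, c) for p in partitions))
    with_centroid = sum(1 for p in partitions
                        if any(contains(p, c) for c in original_centroids))
    recall = found / len(original_centroids)
    precision = with_centroid / len(partitions)
    return recall, precision
```

With 100 original clusters, 71 centroids found, and 71 of 73 partitions containing a centroid, this yields the 71% recall and 97% precision quoted for Figure 2.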
Figure 3: Effect of the sensitivity parameter. The grid depicts the splitting planes found by O-Cluster. Squares represent the original cluster centroids; stars (*) represent the centroids of the points belonging to an O-Cluster partition. (a) ρ = 0, recall = 22%, precision = 100%; (b) ρ = 0.5, recall = 40%, precision = 100%; (c) ρ = 0.75, recall = 56%, precision = 100%; (d) ρ = 1, recall = 72%, precision = 84%.
Figure 4: Effect of dimensionality. Squares represent the original cluster centroids; stars (*) represent the centroids of the points belonging to an O-Cluster partition. (a) dimensionality = 5, recall = 99%, precision = 96%; (b) dimensionality = 10, recall = 100%, precision = 100%.
4.4. Robustness to Noise
O-Cluster shares one remarkable feature with OptiGrid: its resistance to uniform noise. To test O-Cluster's robustness to uniform noise, a synthetic data set consisting of 100,000 points was generated. It consisted of 50 spherical clusters, with variance in the range [0, 2], each represented by 2,000 points. To introduce uniform noise to the data set, a certain percentage of the original records were replaced by records drawn from a uniform distribution on each dimension. O-Cluster was tested with 25%, 50%, 75%, and 90% noise. For example, when the percentage of noise was 90%, the original clusters were represented by 10,000 points (200 on average per cluster) and the remaining 90,000 points were uniform noise. All experiments were run with ρ = 0.8. Figure 5 illustrates O-Cluster's performance under noisy conditions. O-Cluster's accuracy degrades very gracefully with an increasing percentage of background noise. Higher dimensionality provides a slight advantage when handling noise.
Figure 5: Effect of uniform noise. (a) Recall for 5 and 10 dimensions; (b) Precision for 5 and 10 dimensions.
It should also be noted that once background noise is introduced, the centroids of the partitions produced by O-Cluster are offset from the original cluster centroids. In order to identify the original centers, it is necessary to discount the background noise from the histograms and compute centroids on the remaining points. This can be accomplished by filtering out the histogram bins that would fall below a level corresponding to the average bin count for this partition.
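The noise-discounting step described above can be sketched along a single dimension (an illustrative assumption-laden helper, not the paper's implementation): drop the histogram bins whose count falls below the partition's average bin count, then compute the centroid from the surviving bins.

```python
def denoised_centroid(bin_centers, bin_counts):
    """Estimate a cluster centroid along one dimension after discounting
    uniform background noise: filter out bins below the partition's
    average bin count, then take the count-weighted mean of the rest."""
    avg = sum(bin_counts) / len(bin_counts)
    kept = [(c, n) for c, n in zip(bin_centers, bin_counts) if n >= avg]
    total = sum(n for _, n in kept)
    return sum(c * n for c, n in kept) / total
```

For a histogram where uniform noise adds a roughly constant count to every bin, the surviving high bins mark the cluster core, so the weighted mean lands near the original center rather than being pulled toward the partition middle.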
Figure 6: Scalability. (a) Scalability with number of records (10 dimensions); (b) Scalability with number of dimensions (100,000 records).
the previous section. Regarding O-Cluster's accuracy, buffer sizes under 1% proved too small for the algorithm to find all existing clusters. For a buffer size of 0.5%, O-Cluster found 41 of the 50 clusters (82% recall); for a buffer size of 0.8%, it found 49 of the 50 clusters (98% recall). Larger buffer sizes allowed O-Cluster to correctly identify all original clusters. For all buffer sizes (including those smaller than 1%), precision was 100%.
5. Conclusions
The majority of existing clustering algorithms encounter serious scalability and/or accuracy problems when used on data sets with a large number of records and/or dimensions. We propose a new clustering algorithm, O-Cluster, capable of clustering large high dimensional data sets efficiently and effectively. It relies on a novel active sampling approach and uses an axis-parallel partitioning scheme to identify hyper-rectangular regions of unimodal density in the input feature space. O-Cluster has good accuracy and scalability, is robust to noise, automatically detects the number of clusters in the data, and can operate successfully with limited memory resources. We are currently extending O-Cluster in a number of ways, including:
- Parallel implementation. The results presented in this paper used a serial implementation of O-Cluster. Performance can be significantly improved by parallelizing the following steps of O-Cluster: buffer filling; histogram computation and splitting point determination; assignment of records to partitions.
- Cluster representation through rules: especially useful for noisy cases, when centroids do not characterize a cluster well.
- Probabilistic modeling and scoring with missing values: missing values can be a problem during record assignment.
- Handling categorical and mixed (categorical and numerical) data sets.
References
[AGGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 94-105, 1998.
[APYP99] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 61-72, 1999.
[AY00] C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional spaces. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), pages 70-81, 2000.
[BFR98] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 8-15, 1998.
[BGRS99] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In Proc. 7th Int. Conf. on Database Theory (ICDT'99), pages 217-235, 1999.
[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining (KDD'96), pages 226-231, 1996.
[FLE00] F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2:51-57, 2000.
[GRS98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 73-84, 1998.
[HAK00] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? In Proc. 26th Int. Conf. on Very Large Data Bases (VLDB'00), pages 506-515, 2000.
[HB97] T. Hofmann and J. Buhmann. Active data clustering. In Advances in Neural Information Processing Systems (NIPS'97), pages 528-534, 1997.
[HK98] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 58-65, 1998.
[HK99] A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proc. 25th Int. Conf. on Very Large Data Bases (VLDB'99), pages 506-517, 1999.
[KR90] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.
[Mac67] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.
[NGC99] H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report 9906-010, Northwestern University, June 1999.
[NH94] R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB'94), pages 144-155, 1994.
[Sco79] D. W. Scott. On optimal and data-based histograms. Biometrika, 66:605-610, 1979.
[SCZ98] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 1998 Int. Conf. on Very Large Data Bases (VLDB'98), pages 428-439, 1998.
[Wan96] M. P. Wand. Data-based choice of histogram bin width. The American Statistician, 51:59-64, 1996.
[WYM97] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proc. 1997 Int. Conf. on Very Large Data Bases (VLDB'97), pages 186-195, 1997.
[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), pages 103-114, 1996.