
Available online at www.sciencedirect.com

ScienceDirect
Procedia Computer Science 148 (2019) 291–302

Second International Conference on Intelligent Computing in Data Sciences (ICDS 2018)

A survey of clustering algorithms for an industrial context


Abla Chaouni Benabdellah 1*, Asmaa Benghabrit 2, Imane Bouhaddou 3

1,3 LM2I Laboratory, ENSAM, Moulay Ismail University. E-mail: benabdellah.abla@gmail.com; b_imane@yahoo.fr
2 LMAID Laboratory, ENSMR, Mohamed V University. E-mail: a.benghabrit@gmail.com

Abstract

Across a wide variety of fields, and especially in industrial companies, data are being collected and accumulated at a dramatic pace from many
different resources and services. Hence, there is an urgent need for a new generation of computational theories and tools to assist humans in
extracting useful information from the rapidly growing volumes of digital data. Clustering is a well-known, fundamental data mining task for
extracting such information. However, as applications have multiplied across domains, researchers have developed and published a large number
of clustering algorithms, and this proliferation makes it difficult for researchers and practitioners to keep up with their development. Finding the
appropriate algorithm therefore helps significantly to organize information and to answer queries over databases correctly. In this respect, the aim
of this paper is to find the appropriate clustering algorithm for sparse industrial datasets. To achieve this goal, we first present related work from
the past twenty years that compares different clustering algorithms. We then categorize the clustering algorithms found in the literature by
matching their properties to the 4V's challenges of Big Data, which allows us to select the candidate clustering algorithms. Finally, using internal
validity indices, K-means, agglomerative hierarchical clustering, DBSCAN and SOM are implemented and compared on four datasets, and we
highlight the best-performing clustering algorithm, i.e. the one giving the most efficient clusters, for each dataset.

© 2019 The Authors. Published by Elsevier B.V.


This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the Second International Conference on Intelligent Computing in
Data Sciences (ICDS 2018).
Keywords: Clustering algorithms, Unsupervised learning, Sparse dataset, Aircraft, Automotive, Logistics, Industrial datasets.

* Abla Chaouni Benabdellah. Tel.: +212 5 35 46 71 60; fax: +212 5 35 46 71 64.


E-mail address: benabdellah.abla@gmail.com

1877-0509 © 2019 The Authors. Published by Elsevier B.V.


10.1016/j.procs.2019.01.022

1. Introduction

In the current digital era, owing to the massive progress and development of the internet and online technologies, the data generated by
machines and devices, product lifecycle management (PLM) solutions, supply chain processes, product development, production planning
systems, and quality and inventory management systems have reached a huge volume of more than a thousand exabytes per day and are
expected to keep increasing in the coming years. To capture long-term revenues and sustainable competitive advantages, companies must
manage their knowledge and have the right information at the right time and in the right format. Knowledge management is therefore a
crucial issue in industry [1], [2], [3]. Hence, to extract implicit, previously unknown and potentially useful information from data, we use the
data mining process, which lies at the intersection of artificial intelligence, machine learning, statistics, and database systems [4].
Cluster analysis is one of the most fundamental data mining tasks for exploratory data analysis, with applications ranging from image
processing [5] and speech processing [6] to information retrieval [7] and Web applications [8]. As a basic tool, clustering aims at grouping
data objects into clusters such that similar objects are aggregated in the same cluster while dissimilar ones belong to different clusters. From
an optimization perspective, the main goal of clustering is to maximize both the (internal) homogeneity within a cluster and the (external)
heterogeneity among different clusters [9]. However, unlike classification, clustering does not use a subset of data objects with known class
labels to learn a classification model. In other words, in machine learning terminology, clustering is an unsupervised task, which computes the
similarity between objects without any information about their correct distribution (also known as the ground truth). Due to this unsupervised
nature, clustering is known to be one of the most difficult tasks: the many algorithms developed over the years lead to different partitions of
the data, and even for the same algorithm the selection of different parameters or the presentation order of the data objects may greatly affect
the final clustering. Moreover, despite the vast number of surveys and comparative studies of clustering algorithms, identifying the algorithm
that best clusters a sparse industrial dataset remains an open issue. Therefore, the main goal of this paper is to provide a comprehensive review
of the clustering algorithms that optimally cluster sparse industrial datasets. To achieve this goal, we first analyze and compare five categories
of algorithms (partitioning, hierarchical, density, grid and model-based). Second, we provide the reader with a proper analysis of the selected
algorithms by comparing them experimentally on real datasets with respect to internal clustering validation.
This paper is structured as follows. Section 2 presents related work from the past twenty years on research and review articles, with a focus
on comparative research. Section 3 reviews the categories of clustering algorithms and explores the different algorithms with respect to
several criteria and properties linked to the 4V's challenges of Big Data; this categorized framework allows us to select the candidate
clustering algorithms, which are then evaluated properly. Section 4 introduces the datasets and the clustering evaluation measures used to
explore and evaluate the selected algorithms. Section 5 concludes the paper and discusses future research.

2. Related works

Over the past twenty years, clustering publications have become increasingly numerous, which shows that researchers are paying more and
more attention to this problem. Using the Science Direct database restricted to research and review articles, we classified the literature into
four main categories: reviews and surveys; comparative studies; clustering approaches, which introduce a new algorithm; and clustering
applications. According to Figure 1, and based on the methodology of [10], which selects papers according to several criteria (relevance,
citation, applicability, etc.), twenty-three percent of the publications focus on comparing different algorithms, while thirty-eight percent apply
clustering to domains such as image processing, speech processing, information retrieval, Web applications, and industry. This structured
approach ensured that the review was comprehensive, resulting in the analysis of a vast number of comparative studies on clustering
algorithms across various domains. To study these works in depth, among the 32 papers selected in the literature review, 20 were ultimately
used for comparative studies, while 10 were used for applying clustering methods in the industrial sector.
An in-depth analysis of these papers shows that several studies [15, 17, 20, 23] have extensively examined popular algorithms such as
K-means, DBSCAN, DENCLUE, K-NN, fuzzy k-means, and SOM, discussing their advantages and disadvantages and taking into account the
factors that may influence the choice of an appropriate clustering algorithm for a given dataset. Other studies [11-14, 16] have provided
surveys of clustering algorithms based on different criteria, such as their merits, the problems they solve, their applicability, the required
domain knowledge, and also the size of the dataset, the number of clusters, the type of dataset, the software used, time complexity, stability,
and so on. In other research [18, 19, 21, 22, 27], the authors linked Big Data challenges to clustering algorithms: they provide a categorizing
framework that systematically groups existing clustering algorithms into categories with respect to the 4V's of Big Data, namely Volume,
Variety, Velocity and Value, and conclude on the suitable algorithm for a variety of big datasets with respect to different types of
measurements: internal, external, stability and runtime performance indices. In the industrial domain in particular, researchers [26-36] have
applied algorithms such as K-means, DBSCAN, agglomerative hierarchical clustering and SOM to cluster, respectively, packaging and
environmental risk, financial, female worker, customer preference, industrial hygiene and forest industry datasets. However, these surveys and
comparisons still have some limitations: the characteristics of the algorithms are not well studied, and no rigorous empirical analysis has been
carried out to ascertain the benefit of one algorithm over another for a specific type of dataset.

[Figure 1 is a pie chart distributing the publications over the four categories, with shares of 10%, 23%, 29% and 38%; comparative studies account for 23% and clustering applications for 38%, as stated above.]

Figure 1: Classification of clustering publications from 1993 to 2017

In addition, except for two studies [24-25], which respectively compare DBSCAN and K-means on financial datasets and agglomerative
hierarchical clustering and SOM on packaging modularization datasets, no paper properly evaluates and compares different algorithms on real
industrial datasets. Therefore, identifying the algorithms that determine the best clusters for a sparse industrial dataset remains an open issue.
Motivated by these reasons, the next section proposes a categorization framework that groups existing algorithms into categories and
compares their advantages and drawbacks in order to select the candidate clustering algorithms to be properly evaluated.

3. Clustering algorithms review

Clustering algorithms have a strong relationship with many fields, especially statistics and computer science. Different starting points and
criteria usually lead to different taxonomies of clustering algorithms. A rough but widely agreed frame is to classify clustering techniques into
several groups. In this paper, we allocate the clustering algorithms to five categories: (1) Partitioning-based algorithms regard the center of the
data points as the center of the corresponding cluster; initial groups are specified and then iteratively reallocated [37]. (2) Hierarchical-based
algorithms express the relation between each pair of clusters, according to a measure of similarity or dissimilarity, in a hierarchical structure
called a dendrogram. (3) Density-based algorithms separate data objects based on regions of density, connectivity and boundary: data lying in
a high-density region of the data space are considered to belong to the same cluster [38]. (4) Grid-based algorithms change the original data
space into a grid structure of definite size, collect regional statistical data, and then perform the clustering on the grid instead of on the
database directly. (5) Model-based algorithms select a particular model for each cluster and find the best fit for that model; there are mainly
two kinds, one based on statistical learning and the other on neural network learning.
To facilitate the choice of the appropriate clustering algorithms, Tables 1 and 2 summarize the clustering algorithms with respect to the
relative strengths and weaknesses of the five categories described above and by matching the considered factors to the 4V's of Big Data,
namely Volume, Variety, Velocity and Value (see [39] for more information about these V's). According to the tables, the selected algorithms
are: the K-means algorithm, the agglomerative hierarchical algorithm with Ward distance, Density-Based Spatial Clustering of Applications
with Noise (DBSCAN) and the Self-Organizing Map (SOM). The general reasons for selecting these four algorithms are their popularity,
flexibility, and applicability to industrial datasets (inventory categorization, market analysis, financial analysis, environmental risks,
packaging, customer preferences and so on). We did not select a grid-based algorithm because of its inability to locate clusters in a
low-dimensional subspace and because it requires spatial datasets, which is not our case. Thus, the main focus of the next section is to
investigate the behavior of the selected algorithms on four validated industrial datasets (a brief sketch of how these four candidates can be
instantiated follows the tables).
Table 1: Categorization of clustering algorithms with respect to the 4V's of Big Data

| Category | Algorithm | Size of dataset (Volume) | High dimensionality (Volume) | Noisy data (Volume) | Type of dataset (Variety) | Cluster shape (Variety) | Time complexity (Velocity) | Inputs (Value) |
|---|---|---|---|---|---|---|---|---|
| Partitioning-based | K-means [40] | Large & Small | No | No | Numerical | Non-convex | O(nkd) | 1 |
| Partitioning-based | K-modes [41] | Large | Yes | No | Categorical | Non-convex | O(n) | 1 |
| Partitioning-based | K-medoids [42] | Small | Yes | Yes | Categorical | Non-convex | O(n²dt) | 1 |
| Partitioning-based | PAM [43] | Small | No | No | Numerical | Non-convex | O(k(n−k)²) | 1 |
| Partitioning-based | CLARA [44] | Large | No | No | Numerical | Non-convex | O(k(40+k)² + k(n−k)) | 1 |
| Hierarchical-based | Ward [45] | Large & Small | No | No | Numerical | Non-convex | O(n) | 1 |
| Hierarchical-based | BIRCH [46] | Large | No | No | Numerical | Non-convex | O(n) | 2 |
| Hierarchical-based | CURE [47] | Large | Yes | Yes | Numerical | Arbitrary | O(n² log n) | 2 |
| Hierarchical-based | ROCK [48] | Large | No | No | Categorical | Arbitrary | O(n² + n·m_m·m_a + n² log n) | 1 |
| Hierarchical-based | Chameleon [49] | Large | Yes | No | All types | Arbitrary | O(n²) | 3 |
| Density-based | DBSCAN [50] | Large | No | No | Numerical | Arbitrary | O(n log n) | 2 |
| Density-based | OPTICS [51] | Large | No | Yes | Numerical | Arbitrary | O(n log n) | 2 |
| Density-based | DENCLUE [52] | Large | Yes | Yes | Numerical | Arbitrary | O(log D) | 2 |
| Grid-based | WaveCluster [53] | Large | No | Yes | Spatial | Arbitrary | O(n) | 3 |
| Grid-based | STING [54] | Large | No | Yes | Spatial | Arbitrary | O(n) | 1 |
| Grid-based | CLIQUE [55] | Large | Yes | No | Numerical | Arbitrary | O(Ck + mk) | 2 |
| Grid-based | OptiGrid [56] | Large | Yes | Yes | Spatial | Arbitrary | O(nd) to O(nd log n) | 3 |
| Model-based | EM [57] | Large | Yes | No | Spatial | Non-convex | O(knp) | 3 |
| Model-based | COBWEB [58] | Small | No | No | Numerical | Non-convex | O(n²) | 1 |
| Model-based | SOM [59] | Small | Yes | No | Multivariate | Non-convex | O(n²m) | 2 |
Table 2: Advantages and disadvantages of the five clustering categories

| Category | Advantages | Disadvantages |
|---|---|---|
| Partitioning-based | Robust, scalable and simple; easy to understand; does not require domain knowledge. | Difficulty in predicting the number of clusters; sensitivity to scale, so normalization or standardization can completely change the results; the clusters change whenever the centroids are recomputed. |
| Hierarchical-based | No need to know the number of clusters in the initial phase; no input parameters are necessary; easy to implement. | Lack of interpretability; very sensitive to outliers; a previous step cannot be undone once instances have been assigned to a cluster. |
| Density-based | Arbitrarily shaped clusters can be formed; resistant to noise and outliers. | Cannot work efficiently with huge and sparse datasets; unsuitable for high-dimensional datasets because of the curse of dimensionality. |
| Grid-based | Quantizes the space into a finite number of cells that form a grid; statistical information exists independently of queries; incremental updates and efficiency for high-dimensional datasets. | Performs poorly in locating clusters embedded in a low-dimensional subspace of a high-dimensional dataset; requires a careful selection of the projections, the density estimate and the optimal cutting plane. |
| Model-based | The obtained partition can be interpreted from a statistical point of view; successfully used for vector quantization; flexibility in choosing the component distribution; provides a density estimate for each cluster. | Parameters need to be estimated, requiring more data points in each component; inflexibility; quality of predictions suffers because the whole dataset is not used. |
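
As a concrete illustration (a minimal sketch under our own assumptions, not the implementation used in the experiments), the four candidate algorithms can be instantiated in Python: scikit-learn covers K-means, Ward-linkage agglomerative clustering and DBSCAN, while the SOM relies here on the third-party minisom package, which is our choice rather than the authors'.

```python
# Minimal sketch of the four candidate algorithms (illustrative only).
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from minisom import MiniSom  # third-party SOM library, assumed for this sketch


def run_candidates(X, n_clusters=4, eps=0.5, min_samples=5):
    """Return a dictionary of label vectors, one per candidate algorithm."""
    labels = {
        "K-means": KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X),
        "Ward": AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(X),
        "DBSCAN": DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X),
    }
    # A 1 x n_clusters SOM grid, so that each neuron plays the role of one cluster.
    som = MiniSom(1, n_clusters, X.shape[1], sigma=0.5, learning_rate=0.5, random_seed=0)
    som.train_random(X, 1000)
    labels["SOM"] = np.array([som.winner(x)[1] for x in X])
    return labels
```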

4. Clustering comparison results: Experimental evaluation

4.1. The datasets

To compare the merits of the candidate clustering algorithms, experiments were carried out on four different industrial datasets. The first
dataset is taken from [60], where the author suggests that design for logistics encompasses four essential subsystems that interact to determine
its content. Each of the four subsystems is decomposed into 15 manageable and homogeneous units, which are further decomposed into 174
design factors; the aim with this dataset is to regroup modules into homogeneous clusters so as to decrease complexity and enhance
efficiency. The second dataset is generated from a list of surveys and studies, collected from several papers, that deal with quality and safety
requirements [61-63]; it contains 98 guidelines decomposed into 16 different modules, and clustering it amounts to grouping the 98 guidelines
on the basis of the 16 modules. The third dataset is generated from the quality requirements of the automotive sector following the
International Automotive Task Force standard IATF/ISO 16949; it contains 10 categories of requirements on the basis of 210 guidelines.
Finally, the fourth dataset is the aircraft dataset extracted from [64], which is well known and frequently used in information retrieval papers.
The aim with this dataset is to cluster 53 different transport aircraft into groups on the basis of 37 applications; the input data array indicates
the degree to which each aircraft can perform the indicated function.
After presenting the datasets used in the experimental comparison, and in order to improve their quality, one important step of the data mining
process is data pre-processing. This critical step deals with the preparation and transformation of the initial dataset and is divided into four
categories: data cleaning (removing noisy features), data integration, data transformation (vector space model) and data reduction [65-66]. In
addition, the incidence matrix formulated for the first three datasets is constructed with only two entries: an entry of 1 indicates that a
particular design factor belongs to a module, whereas an entry of 0 indicates that it does not, while the matrix of the fourth dataset is rated
subjectively on the interval [0, 2]. These steps were conducted to enhance the quality of our four datasets.
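
As an illustration of this pre-processing on this kind of data (a hedged sketch; the file name and layout are hypothetical), a 0/1 incidence matrix can be cleaned and rescaled as follows:

```python
# Hypothetical pre-processing sketch: read an incidence matrix (rows: design factors,
# columns: modules), drop all-zero rows and columns as a simple cleaning step, and
# rescale ratings to [0, 1] (use max_value=2.0 for the aircraft-style [0, 2] ratings).
import pandas as pd


def load_incidence_matrix(path, max_value=1.0):
    df = pd.read_csv(path, index_col=0)
    df = df.loc[df.sum(axis=1) > 0, df.sum(axis=0) > 0]   # remove empty rows/columns
    return (df / max_value).astype(float)


# Example (hypothetical file name):
# X = load_incidence_matrix("logistics_incidence.csv").to_numpy()
```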

4.2. Validity metrics

In order to investigate cluster validity, several authors have suggested various indices to facilitate decision making [67-70]. Cluster validity
falls into two main categories: external clustering validation, which compares the clustering result to a reference result considered as the
ground truth [71], and internal clustering validation, which evaluates the clustering using only the result itself, i.e. the structure of the found
clusters and their relations to each other. Therefore, when no external information is available, internal validation measures are the only
option for cluster validation. To detail the internal measures considered in this paper, we denote from now on by n the number of
observations, p the number of variables, q the number of clusters, and X the n×p matrix of the p variables measured on the n independent
observations (compact sketches of several of these indices are given after the list below).

 CH index [72] is a popular index using the ratio of the between-cluster and within-cluster sums of squares:
$$CH(q) = \frac{\operatorname{trace}(B_q)/(q-1)}{\operatorname{trace}(W_q)/(n-q)},$$
where $B_q = \sum_{k=1}^{q} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^T$ is the between-group dispersion matrix and
$W_q = \sum_{k=1}^{q} \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^T$ is the within-group dispersion matrix for data clustered into q
clusters. The value of q which maximizes $CH(q)$ is regarded as specifying the number of clusters.
 C-index [73] is calculated using the equation:
$$C\text{-}index = \frac{S_w - S_{min}}{S_{max} - S_{min}}, \qquad S_{min} \neq S_{max},$$
where $S_w$ is the sum of the within-cluster distances over the $N_w = \sum_{k=1}^{q} n_k(n_k-1)/2$ pairs of points belonging to the same
cluster, and $S_{min}$ (respectively $S_{max}$) is the sum of the $N_w$ smallest (respectively largest) distances among all
$N_t = n(n-1)/2$ pairs of points in the entire data. The minimum value of the index is used to indicate the optimal number of clusters.
 Connectivity index [74] measures the extent to which observations are placed in the same cluster as their nearest neighbors. Let $nn_i(j)$
denote the $j$-th nearest neighbor of observation $i$, and let $x_{i,nn_i(j)}$ be zero if $i$ and $nn_i(j)$ are in the same cluster and $1/j$
otherwise. Then, for a particular clustering partition $\zeta = \{C_1, \dots, C_q\}$ of the n observations into q disjoint clusters, the connectivity
is defined as
$$Conn(\zeta) = \sum_{i=1}^{n} \sum_{j=1}^{L} x_{i,nn_i(j)},$$
where L is a parameter giving the number of nearest neighbors to use. The connectivity takes values between zero and $\infty$ and should be
minimized.
 Gamma index [75] is based on the comparison of all within-cluster dissimilarities with all between-cluster dissimilarities:
$$Gamma = \frac{s(+) - s(-)}{s(+) + s(-)},$$
where $s(+)$ is the number of concordant comparisons and $s(-)$ the number of discordant comparisons. The maximum value of the index is
taken to indicate the correct number of clusters.
 Tau index [76] is computed between corresponding entries of two matrices: the first contains the distances between items, and the second is
a 0/1 matrix indicating whether or not each pair of points lies within the same cluster. The index is computed as
$$Tau = \frac{s(+) - s(-)}{\sqrt{\left(\frac{N_t(N_t-1)}{2} - t\right)\frac{N_t(N_t-1)}{2}}},$$
where $s(+)$ is the number of concordant comparisons, $s(-)$ the number of discordant comparisons, $N_t$ the total number of distances,
and $t$ the number of comparisons of two pairs of points. The maximum value of the index is taken as an indication of the correct number of
clusters.
 Ball-Hall index [77] is based on the average distance of the items to their respective cluster centroids. It is computed using the formula
$$Ball = \frac{W_q}{q},$$
where $W_q$ is the sum of the within-cluster dispersions. The largest difference between successive levels is used to indicate the optimal
solution.
 Davies-Bouldin (DB) index [78] is a function of the ratio of within-cluster scatter to between-cluster separation. It is calculated using the
equation
$$DB(q) = \frac{1}{q} \sum_{k=1}^{q} \max_{l \neq k} \frac{\delta_k + \delta_l}{d_{kl}}, \qquad k, l = 1, \dots, q,$$
where $d_{kl} = \left(\sum_{j=1}^{p} |c_{kj} - c_{lj}|^{\nu}\right)^{1/\nu}$ is the distance between the centroids of clusters $C_k$ and $C_l$,
and $\delta_k = \left(\frac{1}{n_k}\sum_{x_i \in C_k} \|x_i - c_k\|^{\nu}\right)^{1/\nu}$ is the dispersion measure of cluster $C_k$. The value
of q minimizing $DB(q)$ is regarded as specifying the number of clusters, and a DB value close to 0 indicates that the clusters are compact
and far from each other.
 Dunn index [79] is based on the idea of identifying sets of clusters that are compact and well separated. It is defined as the ratio between the
minimal inter-cluster distance and the maximal intra-cluster distance:
$$Dunn = \frac{\min_{1 \le i < j \le q} d(C_i, C_j)}{\max_{1 \le k \le q} diam(C_k)},$$
where $d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$ is the dissimilarity between two clusters and
$diam(C_k) = \max_{x, y \in C_k} d(x, y)$ is the maximum distance between observations in a cluster. The number of clusters that maximizes
Dunn is taken as the optimal number of clusters and indicates that the clusters are compact and well separated.
 Hubert statistic [80] is the point serial correlation coefficient between two matrices. Hubert's normalized $\Gamma$ statistic is given by
$$\Gamma = \frac{\frac{1}{N_t}\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (P_{ij} - \mu_P)(Q_{ij} - \mu_Q)}{\sigma_P \, \sigma_Q},$$
where $\mu_P$, $\mu_Q$, $\sigma_P$, $\sigma_Q$ are the respective means and standard deviations of the matrices P and Q. This index takes
values between -1 and 1; the number of clusters at which a knee occurs in its plot is an indication of the number of clusters underlying the
data.
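
For concreteness, the following hedged sketch (our own Python implementation, not the authors' code nor the NbClust source) shows how some of the indices above can be computed: CH and DB are available directly in scikit-learn, while the C-index, connectivity and Dunn index are written out from the definitions given in the list.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
from sklearn.neighbors import NearestNeighbors


def c_index(X, labels):
    """C-index: (S_w - S_min) / (S_max - S_min); lower values indicate better partitions."""
    labels = np.asarray(labels)
    d = squareform(pdist(X))
    rows, cols = np.triu_indices_from(d, k=1)
    dist = d[rows, cols]
    same = labels[rows] == labels[cols]            # pairs lying in the same cluster
    n_w = int(same.sum())
    ordered = np.sort(dist)
    s_w = dist[same].sum()
    s_min, s_max = ordered[:n_w].sum(), ordered[-n_w:].sum()
    return (s_w - s_min) / (s_max - s_min)


def connectivity(X, labels, L=10):
    """Connectivity: sum of 1/j penalties for j-th nearest neighbours placed outside the cluster."""
    labels = np.asarray(labels)
    _, idx = NearestNeighbors(n_neighbors=L + 1).fit(X).kneighbors(X)
    return sum(1.0 / j
               for i in range(X.shape[0])
               for j, nb in enumerate(idx[i, 1:], start=1)
               if labels[nb] != labels[i])


def dunn_index(X, labels):
    """Dunn: minimum between-cluster distance over maximum cluster diameter (to be maximized)."""
    labels = np.asarray(labels)
    clusters = [X[labels == k] for k in np.unique(labels)]
    min_between = min(cdist(a, b).min()
                      for i, a in enumerate(clusters) for b in clusters[i + 1:])
    max_diameter = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_between / max_diameter

# CH (to be maximized) and DB (to be minimized) are provided by scikit-learn:
#   calinski_harabasz_score(X, labels) and davies_bouldin_score(X, labels)
```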

4.3. Results and discussion

One of the most crucial questions in many real-world clustering applications is how to determine a suitable number of clusters q, also known
as the model selection problem. Without a priori knowledge there is no simple way of knowing that number; the only way is to compute
several indices, which allow us to choose the best number of clusters. In this respect, our experiments are divided into two parts. We first
define the appropriate number of clusters for each of the four datasets using quantitative and graphical interpretation. We then compute all the
indices presented above to compare the four selected algorithms and to select the most appropriate one for clustering small, numerical, sparse
datasets.
To find the relevant number of clusters in each dataset, we use the NbClust package developed by [81], which provides 30 indices, available
in SAS and R, in one package; the user can request the indices one by one by setting the argument index to the name of the index. According
to the majority rule, 4 is the best number of clusters for the logistics dataset, 3 for the aircraft dataset, 5 for the automotive quality system
dataset and 4 for the customer requirements dataset. Moreover, for the Hubert index, which is a graphical method, the optimal number of
clusters is identified by a significant knee in the plot of index values against the number of clusters. This knee corresponds to a significant
increase or decrease of the index as the number of clusters varies from the minimum to the maximum; in other words, a significant peak in
the plot of second differences indicates the relevant number of clusters. Hence, as shown in Figure 2 for the logistics and aircraft datasets, the
Hubert index confirms our choice and proposes 3 and 4 as the best numbers of clusters for the aircraft and logistics datasets respectively.
Consequently, using a single index alone, the user would still face the dilemma of choosing the best number of clusters; a rough analogue of
the majority rule is sketched after the figure.
Figure 2: (a) Hubert index for aircraft dataset; (b) Hubert index for logistics dataset
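
The paper relies on the R package NbClust for this step; as a rough Python analogue (an assumption on our part, not a port of NbClust), the majority rule can be imitated by letting a few scikit-learn indices vote on q:

```python
# Rough Python analogue of NbClust's majority rule (our own sketch, not NbClust itself):
# each index votes for the q it prefers, and the most frequent vote is returned.
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)


def majority_rule_q(X, q_range=range(2, 8)):
    partitions = {q: KMeans(n_clusters=q, n_init=10, random_state=0).fit_predict(X)
                  for q in q_range}
    votes = [
        max(q_range, key=lambda q: calinski_harabasz_score(X, partitions[q])),  # maximize CH
        min(q_range, key=lambda q: davies_bouldin_score(X, partitions[q])),     # minimize DB
        max(q_range, key=lambda q: silhouette_score(X, partitions[q])),         # maximize silhouette
    ]
    return Counter(votes).most_common(1)[0][0]
```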

Furthermore, having selected the best number of clusters for each dataset, we can now run the four selected algorithms in order to identify,
first, the best clustering algorithm and, second, the appropriate clustering for each dataset. To do so, we compute all the internal indices
presented in Section 4.2 on the resulting partitions. Table 3 reports the results of the candidate clustering algorithms according to the internal
validity measures, from which we can infer several observations (a sketch of how such a comparison can be assembled follows).
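
The following sketch (ours, not the authors' code) illustrates how such a per-algorithm comparison can be assembled for one dataset using the two indices that scikit-learn provides out of the box; the remaining indices of Section 4.2 and the SOM (e.g. via minisom, as above) can be added in the same way.

```python
# Sketch of a Table 3-style comparison on one dataset, using the two internal indices
# shipped with scikit-learn; DBSCAN may label points as noise (-1) or find a single
# cluster, in which case the scores are left undefined.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score


def compare_algorithms(X, n_clusters=4, eps=0.5, min_samples=5):
    algorithms = {
        "K-means": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        "Hierarchical": AgglomerativeClustering(n_clusters=n_clusters, linkage="ward"),
        "DBSCAN": DBSCAN(eps=eps, min_samples=min_samples),
    }
    rows = {}
    for name, algorithm in algorithms.items():
        labels = algorithm.fit_predict(X)
        if len(set(labels)) > 1:
            rows[name] = {"CH": calinski_harabasz_score(X, labels),
                          "DB": davies_bouldin_score(X, labels)}
        else:
            rows[name] = {"CH": np.nan, "DB": np.nan}
    return pd.DataFrame(rows).T
```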

Table 3: Internal validity results for the candidate clustering algorithms.

| Dataset | Algorithm | C-index | CH | Dunn | Gamma | BH | DB | Tau | Connectivity |
|---|---|---|---|---|---|---|---|---|---|
| Logistics | K-means | 0.36 | 1.63 | 0.69 | 0.26 | 79.17 | 1.69 | 0.16 | 18.77 |
| Logistics | Hierarchical | 0.22 | 1.88 | 0.71 | 0.61 | 79.59 | 1.93 | 0.38 | 18.78 |
| Logistics | DBSCAN | 0.17 | 1.67 | 0.75 | 0.60 | 121.51 | 0.76 | 0.43 | 26.34 |
| Logistics | SOM | 0.15 | 1.67 | 0.80 | 0.67 | 68.28 | 2.66 | 0.45 | 14.98 |
| Customer's requirements | K-means | 0.19 | 7.80 | 0.29 | 0.60 | 13.61 | 1.67 | 0.39 | 23.62 |
| Customer's requirements | Hierarchical | 0.31 | 5.77 | 0.26 | 0.32 | 9.16 | 1.61 | 0.22 | 26.50 |
| Customer's requirements | DBSCAN | 0.16 | 8.33 | 0.35 | 0.60 | 18.42 | 1.45 | 0.42 | 16.93 |
| Customer's requirements | SOM | 0.16 | 8.20 | 0.35 | 0.60 | 11.12 | 1.42 | 0.49 | 16.24 |
| Automotive quality systems | K-means | 0.05 | 28.66 | 0.67 | 0.79 | 13.37 | 1.30 | 0.39 | 26.84 |
| Automotive quality systems | Hierarchical | 0.12 | 16.61 | 0.44 | 0.69 | 3.55 | 0.43 | 0.47 | 32.98 |
| Automotive quality systems | DBSCAN | 0.06 | 22.27 | 0.56 | 0.82 | 1.85 | 1.01 | 0.55 | 27.60 |
| Automotive quality systems | SOM | 0.05 | 38.77 | 0.43 | 0.87 | 2.10 | 0.69 | 0.56 | 25.30 |
| Aircraft | K-means | 0.14 | 15.81 | 0.47 | 0.87 | 24.29 | 0.98 | 0.58 | 10.22 |
| Aircraft | Hierarchical | 0.14 | 14.04 | 0.33 | 0.86 | 24.05 | 0.98 | 0.57 | 11.22 |
| Aircraft | DBSCAN | 0.18 | 18.46 | 0.35 | 0.63 | 26.19 | 1.34 | 0.38 | 18.09 |
| Aircraft | SOM | 0.14 | 13.58 | 0.25 | 0.70 | 25.81 | 1.34 | 0.47 | 10.78 |

First, it can be seen from the table that, in relation to the sizes of the datasets, the SOM output for the first three datasets shows good results
on almost all internal measures in comparison with the remaining clustering algorithms, while K-means performs well for the aircraft dataset
(k = 5). In other words, the quality of K-means becomes very good on larger datasets, which is consistent with Table 2 of the previous
section. Furthermore, as the number of instances becomes smaller, SOM assigns the objects to their suitable clusters more accurately than the
other algorithms, and hierarchical clustering and SOM tend to show similar results as the number of instances grows (aircraft dataset).
Second, with respect to the number of clusters, as q becomes larger the performance of SOM decreases while that of K-means increases; this
is clearly seen on the aircraft dataset, which requires the largest number of clusters of the four (k = 5). Third, DBSCAN gives better results
than hierarchical clustering and K-means on noisy data. Indeed, for the customer requirements dataset, which is generated from surveys and
studies, some responses are incomplete, some are inconsistent, and some come from responders who are not professionals in design for safety
and quality. Because K-means, hierarchical clustering and SOM are sensitive to noise, such responses make it difficult to assign an object to
its suitable cluster, and this certainly affects the result of the algorithm. In addition, hierarchical clustering is the most sensitive to noisy data,
more so than K-means and SOM.
As a general conclusion, SOM is recommended for small datasets, DBSCAN for noisy datasets and K-means for huge datasets. The SOM
algorithm indeed differs from the other clustering algorithms, and especially from other artificial neural networks: it is a popular nonlinear
technique that uses a neighborhood function to preserve the topological properties of the input space and to reduce and visualize the dataset.
In this respect, it is an efficient method able to produce clusters that are not only compact and connected but also well separated. For example,
on the first dataset, gathered from [60], the SOM algorithm indicates that the overall design for logistics can be accomplished in four
self-contained modules. Group 1 addresses design characteristics and consists of design criteria, packaging design features, packaging
materials, and materials. Group 2 addresses production issues and consists of three modules: design for manufacture, manufacturing process
and product lines. Group 3 addresses control parameters and consists of production planning and control, packaging testing, plant location
and design attributes. Finally, group 4 addresses transportation issues and consists of design for supportability, transport mode, transport
issues and functional packaging requirements. By designing the whole group instead of separate modules, the designer considers only the
necessary and relevant design factors, which reduces the degree of complexity and facilitates the manipulation and modification of the design
factors associated with each group.

5. Conclusion and future works

This paper provides a comprehensive survey and compares popular, flexible and applicable clustering algorithms in the industrial field. To the
best of our knowledge, no comprehensive survey has previously attempted to compare the four clustering algorithms under investigation in
this field. To reflect the profile of the field, related work from the past twenty years was first presented according to four categories, namely
reviews and surveys, comparative studies, clustering approaches and clustering applications. We then presented a simple categorization
framework that recommends the most suitable algorithms, which were subsequently evaluated on real industrial datasets. To do so, the most
representative clustering algorithm of each category (except the grid-based one) was empirically analyzed over a large number of internal
indices and industrial datasets. Following the empirical study, we can draw the following conclusions: (1) SOM shows excellent performance
and the best clusters on most of the datasets in comparison with the remaining clustering algorithms; (2) as the number of clusters q becomes
larger, the performance of the SOM algorithm decreases; (3) K-means and SOM are sensitive to noisy datasets, but hierarchical clustering is
the most sensitive, hence DBSCAN is the best choice in this case; (4) no clustering algorithm performs well on all the internal validity
measures; (5) K-means gives better results for huge data than SOM and hierarchical clustering.
Furthermore, we have tried to reveal future directions for researchers. First, factors other than those considered in this paper could be taken
into account to compare the algorithms, for example accuracy, stability, whether or not the dataset is normalized, and the volume of the
dataset; these considerations will probably affect the quality and performance of the algorithms. Second, the comparison can be extended to
other clustering algorithms and other industrial datasets. Third, an algorithm could be developed to directly and automatically compare
clustering algorithms based on different validity indices, such as internal, stability and biological indices. Fourth, with the data explosion in
many applications, issues such as clustering online data, imbalanced data and high-dimensional data have become popular research topics and
should be taken into consideration in comparison surveys to facilitate the selection of the appropriate clustering algorithm for a given dataset.
On the basis of the promising findings presented in this paper, work on the remaining issues is continuing and will be presented in future
papers.
References

[1] Davenport, T.H., and L.Prusak. 2000. Working knowledge: How organizations manage what they know. Boston, MA: Harvard Business
School Press.
[2] Barnes, S.2002. Introduction. In Knowledge management systems: Theory and practice, ed. S. Barnes, and S. Barnes, 1-11. Oxford, UK:
Thomson Learning.
[3] Holsapple, C.W. 2003 Handbook on knowledge management. Berlin: Springer.
[4] Witten, I. H., Frank, E., Trigg, L. E., Hall, M. A., Holmes, G., & Cunningham, S. J. (1999). Weka: Practical machine learning tools and
techniques with Java implementations.
[5] Zhang, B., Zhang, C., & Yi, X. (2004). Competitive EM algorithm for finite mixture models. Pattern recognition, 37(1), 131-144.
[6] McLachlan, G., & Peel, D. (2004). Finite mixture models. John Wiley & Sons.
[7] Biernacki, C., Celeux, G., & Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in
multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3-4), 561-575.
[8] Zhang, Z., Dai, B. T., & Tung, A. K. (2008, July). Estimating local optimums in EM algorithm over Gaussian mixture model. In Proceedings
of the 25th international conference on Machine learning (pp. 1240-1247). ACM.
[9] Hancer, E., & Karaboga, D. (2017). A comprehensive survey of traditional, merge-split and evolutionary approaches proposed for
determination of cluster number. Swarm and Evolutionary Computation, 32, 49-67.
[10] Newbert, S. L. (2007). Empirical research on the resource‐based view of the firm: an assessment and suggestions for future research.
Strategic management journal, 28(2), 121-146.
[11] He, L., WU, L. D., & CAI, Y. C. (2007). Survey of Clustering Algorithms in Data Mining [J]. Application Research of Computers, 1, 10-13.
[12] Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on neural networks, 16(3), 645-678.
[13] Treshansky, A., & McGraw, R. M. (2001, September). Overview of clustering algorithms. In Enabling Technology for Simulation Science
V (Vol. 4367, pp. 41-52). International Society for Optics and Photonics.
[14] Abbas, O. A. (2008). Comparisons between Data Clustering Algorithms. International Arab Journal of Information Technology (IAJIT), 5(3).
[15] Chen, G., Jaradat, S. A., Banerjee, N., Tanaka, T. S., Ko, M. S., & Zhang, M. Q. (2002). Evaluation and comparison of clustering algorithms
in analyzing ES cell gene expression data. Statistica Sinica, 241-262.
[16] Singhal, G., Panwar, S., Jain, K., & Banga, D. (2013). A comparative study of data clustering algorithms. International Journal of Computer
Applications, 83(15).
[17] Feizi-Derakhshi, M. R., & Zafarani, E. (2012). Review and comparison between clustering algorithms with duplicate entities detection
purpose. International Journal of Computer Science & Emerging Technologies, 3(3).
[18] Sajana, T., Rani, C. S., & Narayana, K. V. (2016). A survey on clustering techniques for big data mining. Indian Journal of Science and
Technology, 9(3).
[19] Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y. & Bouras, A. (2014). A survey of clustering algorithms for big data:
Taxonomy and empirical analysis. IEEE transactions on emerging topics in computing, 2(3), 267-279.
[20]Ayed, A. B., Halima, M. B., & Alimi, A. M. (2014, August). Survey on clustering methods: Towards fuzzy clustering for big data. In Soft
Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of (pp. 331-336). IEEE.
[21] Sherin, A., Uma, D. S., Saranya, K., & Vani, M. S. (2014). Survey On Big Data Mining Platforms, Algorithms And Challenges.
International Journal of Computer Science & Engineering Technology, 5.
[22] Nagpal, P. B., & Mann, P. A. (2011). Survey of Density Based Clustering Algorithms. International journal of Computer Science and its
Applications, 1(1), 313-317.
[23] Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on neural networks, 16(3), 645-678.
[24] Cai, F., Le-Khac, N. A., & Kechadi, T. (2016). Clustering approaches for financial data analysis: a survey. arXiv preprint arXiv:1609.08520.
[25] Zhao, C., Johnsson, M., & He, M. (2017). Data mining with clustering algorithms to reduce packaging costs: A case study. Packaging
Technology and Science, 30(5), 173-193.
[26] Craigen, D., Gerhart, S., & Ralston, T. (1993). An international survey of industrial applications of formal methods. In Z User Workshop,
London 1992 (pp. 1-5). Springer, London.
[27] Simula, O., Vasara, P., Vesanto, J., & Helminen, R. (1999). The self-organizing map in industry analysis. Industrial Applications of Neural
Networks, Washington.
[28] Milton, D. K., Hammond, S. K., & Spear, R. C. (1999). Hierarchical cluster analysis applied to workers exposures in fiberglass insulation
manufacturing. The Annals of occupational hygiene, 43(1), 43-55.
[29] Swan, S. H., Beaumont, J. J., Hammond, S. K., Vonbehren, J., Green, R. S., Hallock, M. F., ... & Schenker, M. B. (1995). Historical cohort
study of spontaneous abortion among fabrication workers in the semiconductor health study: Agent‐level analysis. American journal
of industrial medicine, 28(6), 751-769.
[30] Gao, S., Wang, Y., Cheng, J., Inazumi, Y., & Tang, Z. (2016). Ant colony optimization with clustering for solving the dynamic location
routing problem. Applied Mathematics and Computation, 285, 149-173.
[31] Rajagopal, D. (2011). Customer data clustering using data mining technique. arXiv preprint arXiv:1112.2663.
[32] Shi, W., & Zeng, W. (2014). Application of k-means clustering to environmental risk zoning of the chemical industrial area. Frontiers of
Environmental Science & Engineering, 8(1), 117-127.
[33] Yıldız, T., & Aykanat, Z. (2015). Clustering and innovation concepts and innovative clusters: an application on technoparks in
Turkey. Procedia-Social and Behavioral Sciences, 195, 1196-1205.
[34] G.J Zheng, W. Zhang, P. Hu & D.Y SHI, (2015), Optimization of hot forming process using DMT and Finite element method, International
Journal of Automative Technology, Vol 16, no.2, pp: 329-337.
[35] Saraee, M., Moghimi, M., & Bagheri, A. (2011, August). Modeling batch annealing process using data mining techniques for cold rolled
steel sheets. In Proceedings of the First International Workshop on Data Mining for Service and Maintenance (pp. 18-22). ACM.
[36] Umeshini, S., & Sumathi, C. P. A survey on data mining in steel industries.
[37] A.K. Jain,Data clustering:50 years beyond K-means,PatternRecogn.Lett.31 (2010) 651–666.
[38] Kriegel H, Kröger P, Sander J, Zimek A (2011) Densitybased clustering. Wiley Interdiscip Rev 1:231–240
[39] Benabdellah, A. C., Benghabrit, A., & Bouhaddou, I. (2016, November). Big data for supply chain management: opportunities and
challenges. In Computer Systems and Applications (AICCSA), 2016 IEEE/ACS 13th International Conference of (pp. 1-6). IEEE.
[40] MacQueen J (1967) some methods for classification and analysis of multivariate observations. ProcFifth Berkeley Symp Math Stat Probab
1:281–297
[41] Park H, Jun C (2009) A simple and fast algorithm for K-medoids clustering. Expert Syst Appl36:3336–3341
[42] Kaufman L, Rousseeuw P (1990) Partitioning around medoids (program pam). Finding group’s indata: an introduction to cluster analysis.
Wiley, Hoboken
[43] Kaufman L, Rousseeuw P (2008) Finding groups in data: an introduction to cluster analysis, vol 344.Wiley, Hoboken. Doi:
10.1002/9780470316801
[44] L. Kaufman and P.J. Rousseeuw. (1990) Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley & Sons
[45] Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32(3), 241-254.
[46] Zhang T, Ramakrishnan R, LivnyM(1996) BIRCH: an efficient data clustering method for very largedatabases. ACM SIGMOD Rec
25:103– 104
[47] Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. ACMSIGMOD Rec 27:73–84
[48] Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes.In: Proceedings of the 15th international
conference on data engineering, pp 512-521
[49] Karypis G, Han E, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling.Computer 32:68–75
[50] Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in largespatial databases with noise. In:
Proceedings of the second ACM SIGKDD international conferenceon knowledge discovery and data mining, pp 226–231
[51] AnkerstM, BreunigM, Kriegel H, Sander J (1999) OPTICS: ordering points to identify the clusteringstructure. In: Proceedings on 1999
ACMSIGMOD international conference on management of data,vol 28, pp 49–60
[52] Hinneburg A, Keim D (1998) An efficient approach to clustering in large multimedia databases withnoise. In Proceedings of the 4th ACM
SIGKDD international conference on knowledge discoveryand data mining 98: 58–65
[53] Hartuv E, Shamir R (2000) A clustering algorithm based on graph connectivity. Inf Process Lett76:175–181
[54] Wang W, Yang J, Muntz R (1997) STING: a statistical information grid approach to spatial datamining. In VLDB, pp 186–195.
[55] Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of highdimensional data for data mining applications.
In: Proceedings 1998 ACM sigmod internationalconference on management of data, vol 27, pp 94–105
[56] Hinneburg, A., & Keim, D. A. (1999). Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering.
[57] Wu, C. J. (1983). On the convergence properties of the EM algorithm. The Annals of statistics, 95-103.
[58] FisherD(1987) Knowledge acquisition via incremental conceptual clustering. Mach Learn 2:139–172
[59] Kohonen, T. (1998). The self-organizing map. Neurocomputing, 21(1-3), 1-6.
[60] Dowlatshahi, S. (1999). A modeling approach to logistics in concurrent engineering. European Journal of Operational Research, 115(1).
[61] Dowlatshahi, D., MacQueen, G. M., Wang, J. F., Reiach, J. S., & Young, L. T. (1999). G Protein‐Coupled Cyclic AMP Signaling in
Postmortem Brain of Subjects with Mood Disorders. Journal of neurochemistry, 73(3), 1121-1126.
[62] Zhu, A. Y., von Zedtwitz, M., Assimakopoulos, D., & Fernandes, K. (2016). The impact of organizational culture on Concurrent
Engineering, Design-for-Safety, and product safety performance. International Journal of Production Economics, 176, 69-81.
[63] Goh, Y. M., & Chua, S. (2016). Knowledge, attitude and practices for design for safety: A study on civil & structural engineers. Accident
Analysis & Prevention, 93.
[64] McCormick Jr, W. T., Schweitzer, P. J., & White, T. W. (1972). Problem decomposition and data reorganization by a clustering
technique. Operations Research, 20(5), 993-1009.
[65] Piatetsky-Shapiro, G. (1996). Advances in knowledge discovery and data mining (Vol. 21).
[66] U. M. Fayyad, P. Smyth, & R. Uthurusamy (Eds.). Menlo Park: AAAI press.
[67] Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2), 159-179.
[68] Halkidi, M., Vazirgiannis, M., & Batistakis, Y. (2000, September). Quality scheme assessment in the clustering process. In European
Conference on Principles Data Mining and Knowledge Discovery (pp. 265-276). Springer, Berlin, Heidelberg.
[69] Theodoridis, S., & Koutroubas, K. (1999). Feature generation II. Pattern recognition, 2, 269-320.
[70] Handl, J., Knowles, J., & Kell, D. B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15).
[71] Beecks, C., Hassani, M., Brenger, B., Hinnell, J., Schüller, D., Mittelberg, I., & Seidl, T. (2016). Efficient query processing in 3D motion
capture gesture databases. International Journal of Semantic Computing, 10(01), 5-25.
[72] Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1), 1-27.
[73] Hubert, L. J., & Levin, J. R. (1976). A general statistical framework for assessing categorical clustering in free recall. Psychological
bulletin, 83(6), 1072.
[74] Xing, G., Wang, X., Zhang, Y., Lu, C., Pless, R., & Gill, C. (2005). Integrated coverage and connectivity configuration for energy
conservation in sensor networks. ACM Transactions on Sensor Networks (TOSN), 1(1), 36-72.
[75] Baker, F. B., & Hubert, L. J. (1975). Measuring the power of hierarchical cluster analysis. Journal of the American Statistical
Association, 70(349), 31-38.
[76] Shieh, G. S. (1998). A weighted Kendall's tau statistic. Statistics & probability letters, 39(1), 17-24.
[77] Cleland, J. G., Erhardt, L., Murray, G., Hall, A. S., & Ball, S. G. (1997). Effect of ramipril on morbidity and mode of death among survivors
of acute myocardial infarction with clinical evidence of heart failure. A report from the AIRE Study Investigators. European heart journal,
18(1), 41-51.
[78] Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence.
[79] Pakhira, M. K., Bandyopadhyay, S., & Maulik, U. (2004). Validity index for crisp and fuzzy clusters. Pattern recognition, 37(3), 487-501.
[80] Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American
statistical Association, 62(318), 399-402.
[81] Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A., & Charrad, M. M. (2014). Package ‘NbClust’. Journal of Statistical Software, 61, 1-36.
