being able to process the user data is of course important. Facebook, for instance, has a 300 PB data warehouse. To process such vast amounts of data, the algorithms used need to be highly parallelizable.

II. HISTORY & GROWTH

The Mahout project was started by several people involved in the Apache Lucene (open source search) community.

Mahout supports several clustering-algorithm implementations, all written in Map-Reduce, each with its own set of goals and criteria:

● Canopy: A fast clustering algorithm often used to create initial seeds for other clustering algorithms.
● k-Means (and fuzzy k-Means): Clusters items into k clusters based on how far the items are from the centroid, or center, of the previous iteration.
● Mean-Shift: An algorithm that does not require any a priori knowledge about the number of clusters and can produce arbitrarily shaped clusters.
● Dirichlet: Clusters based on the mixing of many probabilistic models, giving it the advantage that it does not need to commit to a particular view of the clusters prematurely.
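As a rough single-machine illustration of the k-Means procedure outlined above (random initial centroids, assignment to the nearest centroid, recomputation of centroids until assignments stabilize), here is a minimal NumPy sketch. This is not Mahout's Map-Reduce implementation; the function and parameter names are invented for illustration.

```python
import numpy as np

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain k-Means on an (n, d) array of points."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k data points chosen at random,
    # or supplied explicitly (e.g. canopy centers).
    centroids = (np.array(init, dtype=float) if init is not None
                 else points[rng.choice(len(points), size=k, replace=False)].astype(float))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point changed cluster: converged
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

With two well-separated groups of points and explicit initial centroids, the loop converges after a single recomputation.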
Fig 2. Overview of Mahout

A) Distance Measures (Cosine Similarity)

In the classic vector space model of Information Retrieval, each data point is modeled as a vector in a vector space with each of the terms of the data set as a dimension. The similarity between two vectors is then determined by calculating the angle (or rather, the cosine of the angle) between them. The cosine similarity between two vectors ~u and ~v in the data set is given by

sim(~u, ~v) = (~u · ~v) / (||~u|| ||~v||)

Calculating the cosine similarity is especially effective for very sparse data (common in, for instance, natural language corpora) as only dimensions where both vectors have a component larger than zero must be considered.

C) Canopy Clustering

Canopy clustering tries to speed up the clustering of data sets that are both high dimensional and have a large cardinality by dividing the clustering process into two subprocesses. First, the data set is divided into overlapping subsets called canopies. This is done by choosing a distance metric and two thresholds, T1 and T2, where T1 > T2. All data points are then added to a list and one of the points in the list is picked at random. The remaining points in the list are iterated over and the distance to the initial point is calculated. If the distance is within T1, the point is added to the canopy. Further, if the distance is within T2, the point is removed from the list. The algorithm is iterated until the list is empty. The second step of the process is to run another clustering algorithm in these smaller canopies, often k-means with the canopies as initial centroids. Canopy clustering can also help the user to estimate the value of k for use in k-means. Given good threshold values for T1 and T2, canopy clustering will find a suitable number of canopies. These can, as mentioned, be used as the initial centroids in a k-means clustering.
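The canopy-construction step described above can be sketched in a few lines. This is an illustrative single-machine version, not Mahout's Map-Reduce job; the function name and return layout are invented for this example.

```python
import numpy as np

def canopies(points, t1, t2):
    """Greedy canopy construction with two thresholds T1 > T2.

    A randomly picked remaining point becomes a canopy center; every
    remaining point within T1 joins that canopy, and points within T2
    are removed from further consideration. Repeats until no points
    remain."""
    assert t1 > t2
    remaining = list(range(len(points)))
    result = []  # list of (center_index, member_indices)
    rng = np.random.default_rng(0)
    while remaining:
        center = remaining[rng.integers(len(remaining))]
        dists = {i: np.linalg.norm(points[i] - points[center]) for i in remaining}
        members = [i for i in remaining if dists[i] <= t1]
        # Points tightly bound to this canopy (within T2, including the
        # center itself) leave the candidate list, so the loop terminates.
        remaining = [i for i in remaining if dists[i] > t2]
        result.append((center, members))
    return result
```

Because the center is always within T2 of itself, at least one point is removed per iteration, so the loop always terminates; with good thresholds the number of canopies produced suggests a value of k for the subsequent k-means step.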
B) K-Means Clustering

K-means clustering aims to cluster all data points into one of k classes, for a fixed value of k. Initially, k data points are chosen at random to serve as the initial cluster centroids. All remaining data points are iterated over and assigned to their nearest centroid, as determined by a chosen distance metric (e.g. Euclidean distance). When all data points have been assigned to a cluster, the centroid is recomputed, and the recomputed centroid ~cp for a given cluster cp is given by

~cp = (1 / |cp|) Σ_{~dj ∈ cp} ~dj

where ~dj is a certain document in the cluster cp. The algorithm iterates until no data points change cluster assignment (or a given threshold has been achieved), at which point the algorithm has converged.

D) Latent Dirichlet Allocation

Latent Dirichlet allocation, LDA, works from the assumption that each document is generated by drawing words from a mixture of latent topics, where the mixture is individual for the document but the topics are a fixed set. The topics are in turn characterized by a distribution of the words in the corpus. Using this assumption, a document would be generated by choosing the number of words in the document from a Poisson distribution, N ∼ Po(ζ), and a
topic mixture from the fixed set of k topics, Θ ∼ Dirichlet(α), where α is a k-dimensional vector of real values representing the weight of each topic. Each of the N words, wn, is then chosen from a topic (in turn chosen from the topic mixture of the document): wn ∼ Multinomial(β), where β is a vector of word weights within that topic. Using Bayesian inference and the generative model described,
LDA backtracks to find the topics and mixtures that
could generate the corpus. Mahout uses collapsed
variational bayes inference, CVB, to implement
LDA. CVB is, according to, more performant and
better suited for parallelization. CVB uses
techniques from both Gibbs sampling, which
Mahout previously implemented, and variational
bayes, leading to a more efficient and accurate
algorithm.
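The generative story above (document length drawn from a Poisson, a per-document topic mixture drawn from a Dirichlet, each word drawn from its topic's word distribution) can be sketched as follows. This shows only the forward model, not the CVB inference Mahout performs, and the toy vocabulary, parameter values, and function names are invented for illustration.

```python
import numpy as np

def generate_document(topics, alpha, zeta, rng):
    """Forward-sample one document under the LDA generative model.

    topics: (k, V) array, each row a word distribution for one topic.
    alpha:  length-k Dirichlet weights for the topic mixture.
    zeta:   Poisson mean for the document length N.
    """
    n_words = max(1, rng.poisson(zeta))               # N ~ Po(zeta)
    theta = rng.dirichlet(alpha)                      # Theta ~ Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)           # topic drawn from the mixture
        w = rng.choice(topics.shape[1], p=topics[z])  # w_n ~ Multinomial(beta_z)
        words.append(int(w))
    return words

# Two toy topics over a 4-word vocabulary (assumed, for illustration).
rng = np.random.default_rng(42)
topics = np.array([[0.7, 0.3, 0.0, 0.0],   # topic 0 favors words 0-1
                   [0.0, 0.0, 0.4, 0.6]])  # topic 1 favors words 2-3
doc = generate_document(topics, alpha=[0.5, 0.5], zeta=20, rng=rng)
```

Inference then runs this story backwards: given only documents like `doc`, it recovers topic and mixture estimates that could have generated the corpus.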
IV. CONCLUSION

Several Cloud-based implementations of machine learning (ML) and data mining (DM) algorithms have emerged in the wake of Big Data. Such implementations aim to overcome the limitations of traditional ML and DM frameworks in handling Big Data. Mahout is one such Cloud-based implementation of ML and DM algorithms for dealing with Big Data efficiently. Among its interesting algorithms are the clustering algorithms, whose performance is affected by the number of entries in the data set.
REFERENCES

[1] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, "Big data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute, 2011.
[2] P. Vagata and K. Wilfong, "Scaling the Facebook data warehouse to 300 PB." https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/. Accessed: 2014-03-21.
[3] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2016.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.