
Apache Mahout

Abstract - High dimensional data concerns large-volume, complex, growing data sets with multiple, autonomous sources. As data is increasing very drastically day by day, it is a major issue to manage and organize it efficiently, and this has given rise to the need for machine learning techniques. With the fast development of networking, data storage, and data collection capacity, machine learning clustering algorithms are now rapidly expanding into all science and engineering domains, such as pattern recognition, data mining, bioinformatics, and recommendation systems. To support a scalable machine learning framework with MapReduce and Hadoop, we use Apache Mahout to manage highly voluminous data. Various clustering problems, such as cluster tendency, partitioning, cluster validity, and cluster performance, can be easily overcome by Mahout's clustering algorithms. Mahout manages data in four steps, i.e., fetching data, text mining, clustering, classification, and collaborative filtering.

Keywords: Distributed Stream Process, Hadoop, Mahout, Clustering, Classification

I. INTRODUCTION
In recent years, the volume of data being collected,
stored, and analyzed has exploded, in particular in
relation to the activity on the Web and mobile
devices, as well as data from the physical world
collected via sensor networks. When faced with this quantity of data, human-powered systems quickly become infeasible. This has led to a rise in
the so-called big data and machine learning
systems.
For social media companies (Facebook, Twitter, Tumblr, etc.), where user-generated content is key, being able to process the user data is of course important. Facebook, for instance, has a 300 PB data warehouse. To process such vast amounts of data, the algorithms used need to be highly parallelizable.
highly parallelizable.
II. HISTORY & GROWTH
The Mahout project was started by several people involved in the Apache Lucene (open source search) community with an active interest in machine learning and a desire for robust, well-documented, scalable implementations of common machine-learning algorithms for clustering and categorization. The community was initially driven by Ng et al.'s paper "Map-Reduce for Machine Learning on Multicore" but has since evolved to cover much broader machine-learning approaches.

III. TOOLS and INFRASTRUCTURE
Hadoop and the MapReduce paradigm are becoming the de facto standard for processing large amounts of data as corporate usage continues to increase. In 2003 and 2004 Google published two papers introducing the Google File System (GFS) and Google MapReduce, respectively. GFS is a distributed file system intended to run on commodity hardware, scaling to petabytes of data, and Google MapReduce is a framework for running computations on data stored in GFS.
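To make the MapReduce programming model concrete before turning to Mahout's algorithms, the following is the canonical word-count job from the standard Hadoop MapReduce tutorial, reproduced here only as an illustration (it is not part of Mahout itself): the map function emits a (word, 1) pair for every token, the framework groups the pairs by key, and the reduce function sums the counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Map: emit (word, 1) for every token in the input line.
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // Reduce: sum the counts collected for each word.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The clustering jobs discussed below are built on this same map/shuffle/reduce pattern, with vectors and cluster statistics in place of words and counts.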
Fig 1. Algorithms in Mahout.

Mahout supports several clustering-algorithm implementations, all written in MapReduce, each with its own set of goals and criteria:

● Canopy: A fast clustering algorithm often used to create initial seeds for other clustering algorithms.
● k-Means (and fuzzy k-Means): Clusters items into k clusters based on the distance the items are from the centroid, or center, of the previous iteration.
● Mean-Shift: An algorithm that does not require any a priori knowledge about the number of clusters and can produce arbitrarily shaped clusters.
● Dirichlet: Clusters based on the mixing of many probabilistic models, giving it the advantage that it does not need to commit to a particular view of the clusters prematurely.

Fig 2. Overview of Mahout

A) Distance Measures (Cosine Similarity)
In the classic vector space model of Information Retrieval, each data point is modeled as a vector in a vector space with each of the terms of the data set as a dimension. The similarity between two vectors is then determined by calculating the angle (or rather, the cosine of the angle) between them. The cosine similarity between two vectors ~u and ~v in the data set is given by

cos(~u, ~v) = (~u · ~v) / (||~u|| ||~v||)

Calculating the cosine similarity is especially effective for very sparse data (common in, for instance, natural language corpora) as only dimensions where both vectors have a component larger than zero must be considered.
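As a concrete illustration of the formula above, here is a minimal, self-contained Java sketch (not Mahout's API; the class and method names are invented for this example) that computes the cosine similarity of two sparse vectors stored as maps from term id to weight. Only dimensions present in both maps contribute to the dot product, which is exactly why the measure is cheap for sparse data.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch (not Mahout's API): cosine similarity for sparse term vectors. */
public class CosineSimilarity {

    /** Returns u.v / (||u|| * ||v||); only dimensions present in both maps contribute to the dot product. */
    static double cosine(Map<Integer, Double> u, Map<Integer, Double> v) {
        Map<Integer, Double> smaller = u.size() <= v.size() ? u : v;
        Map<Integer, Double> larger = smaller == u ? v : u;
        double dot = 0.0;
        for (Map.Entry<Integer, Double> entry : smaller.entrySet()) {
            Double other = larger.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normU = norm(u), normV = norm(v);
        return (normU == 0 || normV == 0) ? 0.0 : dot / (normU * normV);
    }

    static double norm(Map<Integer, Double> x) {
        double sum = 0.0;
        for (double value : x.values()) {
            sum += value * value;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two toy term-frequency vectors keyed by term id.
        Map<Integer, Double> doc1 = new HashMap<>();
        doc1.put(0, 2.0); doc1.put(3, 1.0);
        Map<Integer, Double> doc2 = new HashMap<>();
        doc2.put(0, 1.0); doc2.put(5, 4.0);
        System.out.println(cosine(doc1, doc2)); // ~0.2169
    }
}
```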
B) K-Means Clustering
K-means clustering aims to cluster all data points into one of k classes, for a fixed value of k. Initially, k data points are chosen at random to serve as the initial cluster centroids. All remaining data points are iterated over and assigned to their nearest centroid, as determined by a chosen distance metric (e.g. Euclidean distance). When all data points have been assigned to a cluster, the centroid is recomputed, and the recomputed centroid ~∆p for a given cluster cp is given by

~∆p = (1 / |cp|) Σ_{~dj ∈ cp} ~dj

where ~dj is a certain document in the cluster cp. The algorithm iterates until no data points change cluster assignment (or a given threshold has been achieved), at which point the algorithm has converged.
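The loop just described can be sketched in a few lines of plain Java. This is a single-machine illustration under the assumptions above (Euclidean distance, random initial centroids), not Mahout's MapReduce implementation; all identifiers are invented for the example.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative single-machine sketch of the k-means loop (not Mahout's MapReduce code). */
public class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the final cluster index of each point. */
    static int[] kMeans(double[][] points, int k, int maxIterations, long seed) {
        int dims = points[0].length;
        Random random = new Random(seed);
        // Initial centroids: k points chosen at random from the data set.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[random.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            boolean changed = false;
            // Assignment step: each point goes to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[p], centroids[c])
                            < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed && iteration > 0) {
                break; // converged: no point changed cluster
            }
            // Update step: each centroid becomes the mean of its assigned points.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dims; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}};
        // The two natural groups end up in separate clusters.
        System.out.println(Arrays.toString(kMeans(points, 2, 20, 42)));
    }
}
```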
C) Canopy Clustering
Canopy clustering tries to speed up the clustering of data sets that are both high dimensional and of large cardinality by dividing the clustering process into two subprocesses. First, the data set is divided into overlapping subsets called canopies. This is done by choosing a distance metric and two thresholds, T1 and T2, where T1 > T2. All data points are then added to a list and one of the points in the list is picked at random. The remaining points in the list are iterated over and the distance to the initial point is calculated. If the distance is within T1, the point is added to the canopy. Further, if the distance is within T2, the point is removed from the list. The algorithm is iterated until the list is empty. The second step of the process is to run another clustering algorithm in these smaller canopies, often k-means with the canopies as initial centroids. Canopy clustering can also help the user to estimate the value of k for use in k-means. Given good threshold values for T1 and T2, canopy clustering will find a suitable number of canopies. These can, as mentioned, be used as the initial centroids in a k-means clustering.
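The canopy-generation step translates almost directly into code. The sketch below is a single-machine Java illustration using the T1/T2 thresholds described above; it is not Mahout's distributed implementation, and the identifiers are invented for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Illustrative single-machine sketch of canopy generation (not Mahout's distributed job). */
public class CanopySketch {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Builds overlapping canopies using a loose threshold t1 and a tight threshold t2 (t1 > t2). */
    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2, long seed) {
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> result = new ArrayList<>();
        Random random = new Random(seed);
        while (!remaining.isEmpty()) {
            // Pick one of the remaining points at random as the canopy center.
            double[] center = remaining.remove(random.nextInt(remaining.size()));
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] candidate = it.next();
                double d = distance(center, candidate);
                if (d < t1) {
                    canopy.add(candidate); // within the loose threshold: joins this canopy
                }
                if (d < t2) {
                    it.remove();           // within the tight threshold: removed from the list
                }
            }
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[]{0, 0});
        points.add(new double[]{0.5, 0.5});
        points.add(new double[]{10, 10});
        points.add(new double[]{10.5, 9.5});
        // Two well-separated groups with t1 = 4 and t2 = 2 give two canopies,
        // whose centers could then seed a k-means run with k = 2.
        System.out.println(canopies(points, 4.0, 2.0, 1L).size()); // 2
    }
}
```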
D) Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) works from the assumption that each document is generated by drawing words from a mixture of latent topics, where the mixture is individual to the document but the topics are a fixed set. The topics are in turn characterized by a distribution over the words in the corpus. Using this assumption, a document would be generated by choosing the number of words in the document from a Poisson distribution, N ∼ Po(ζ), and a topic mixture from the fixed set of k topics, Θ ∼ Dirichlet(α), where α is a k-dimensional vector of real values representing the weight of each topic. Each of the N words, wn, is then chosen from a topic (in turn chosen from the topic mixture of the document): wn ∼ Multinomial(β), where β is a vector of word weights within that topic. Using Bayesian inference and the generative model described, LDA backtracks to find the topics and mixtures that could have generated the corpus. Mahout uses collapsed variational Bayes inference (CVB) to implement LDA. CVB is reported to be more performant and better suited for parallelization; it uses techniques from both Gibbs sampling, which Mahout previously implemented, and variational Bayes, leading to a more efficient and accurate algorithm.
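To make the generative story concrete, the toy Java sketch below generates one short document from fixed parameters. In full LDA the topic mixture Θ is itself drawn from Dirichlet(α) and the document length N from Po(ζ); both are hard-coded here to keep the example small, and none of this corresponds to Mahout's CVB implementation.

```java
import java.util.Random;

/** Toy sketch of LDA's generative story with fixed parameters (not Mahout's CVB code). */
public class LdaGenerativeSketch {

    /** Draws an index from a discrete probability distribution. */
    static int sampleDiscrete(double[] probabilities, Random random) {
        double r = random.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < probabilities.length; i++) {
            cumulative += probabilities[i];
            if (r < cumulative) {
                return i;
            }
        }
        return probabilities.length - 1;
    }

    public static void main(String[] args) {
        Random random = new Random(7);
        // In full LDA, theta ~ Dirichlet(alpha) per document and N ~ Po(zeta);
        // both are fixed here to keep the sketch short.
        double[] theta = {0.7, 0.3};            // the document's mixture over k = 2 topics
        double[][] beta = {                     // per-topic word distributions over a 4-word vocabulary
                {0.5, 0.4, 0.05, 0.05},         // topic 0 favours words 0 and 1
                {0.05, 0.05, 0.5, 0.4}          // topic 1 favours words 2 and 3
        };
        String[] vocabulary = {"hadoop", "mapreduce", "cluster", "centroid"};
        int n = 8;                              // number of words in the generated document

        StringBuilder document = new StringBuilder();
        for (int i = 0; i < n; i++) {
            int topic = sampleDiscrete(theta, random);      // choose a topic from the mixture
            int word = sampleDiscrete(beta[topic], random); // choose a word from that topic
            document.append(vocabulary[word]).append(' ');
        }
        System.out.println(document.toString().trim());
    }
}
```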

IV. CONCLUSION
Several cloud-based implementations of machine learning (ML) and data mining (DM) algorithms have emerged in the wake of Big Data. Such implementations aim to overcome the limitations of traditional ML and DM frameworks in handling Big Data. Mahout is one such cloud-based implementation of ML and DM algorithms for dealing efficiently with Big Data. Among its most interesting algorithms are the clustering algorithms, whose performance is affected by the number of entries in the data set.

REFERENCES
[1] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, "Big data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute, 2011.
[2] P. Vagata and K. Wilfong, "Scaling the Facebook data warehouse to 300 PB." https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/. Accessed: 2014-03-21.
[3] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2016.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
