
Big Data Analytics with Storm, Spark and GraphLab

Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus

Contents
Big Data Computations
Introduction to ML Characterization

Programming Abstractions

Berkeley data analytics stack


Spark

Hadoop 2.0 (Hadoop YARN)

Real-time Analytics with Storm

GraphLab

PMML Scoring for Naïve Bayes


PMML Primer
Naïve Bayes Primer

Introduction to Machine Learning


What is it? Learning patterns in data, improving prediction accuracy with experience.
Examples:
Speech recognition systems
Recommender systems
Medical decision aids
Robot navigation systems

Introduction to Machine Learning


Attributes and their values:
Outlook: Sunny, Overcast, Rain
Humidity: High, Normal
Wind: Strong, Weak
Temperature: Hot, Mild, Cool
Target prediction (Play Tennis): Yes, No

Introduction to Machine Learning


Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

Tom Mitchell, Machine Learning, Tata McGraw Hill Publications.

Introduction to Machine Learning: Decision Trees

Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes

Decision Trees to Random Forests


Decision trees
Pros: handle mixed data, are robust to outliers, and scale computationally.
Cons: low prediction accuracy, high variance, and a trade-off between tree size and goodness of fit.

Can we have an ensemble of trees? Random forests (see the sketch below):
Final prediction is the mean (regression) or the class with the most votes (classification).
Do not need tree pruning for generalization.
Greater accuracy across domains.
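To make the ensemble step concrete, here is a minimal, hypothetical Scala sketch (not from the slides): it aggregates the predictions of already-trained trees by majority vote for classification and by the mean for regression. The Tree type alias and the toy stumps are assumptions for illustration only.

object EnsembleSketch {
  // A trained tree is modelled abstractly as a prediction function.
  type Tree[A] = Vector[Double] => A

  // Classification: each tree votes for a label; the majority label wins.
  def majorityVote[L](trees: Seq[Tree[L]], features: Vector[Double]): L =
    trees.map(t => t(features))
         .groupBy(identity)
         .maxBy { case (_, votes) => votes.size }
         ._1

  // Regression: average the per-tree predictions.
  def meanPrediction(trees: Seq[Tree[Double]], features: Vector[Double]): Double = {
    val preds = trees.map(t => t(features))
    preds.sum / preds.size
  }

  def main(args: Array[String]): Unit = {
    // Three toy "trees" that threshold the first feature differently.
    val stumps: Seq[Tree[String]] = Seq(
      fs => if (fs(0) > 0.3) "Yes" else "No",
      fs => if (fs(0) > 0.5) "Yes" else "No",
      fs => if (fs(0) > 0.7) "Yes" else "No"
    )
    println(majorityVote(stumps, Vector(0.6)))  // "Yes" (2 of 3 votes)
  }
}

Swapping in real trained trees only changes how the prediction functions are produced; the aggregation logic stays the same.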

K-means Clustering

Support Vector Machines

Introduction to Machine Learning


Machine learning tasks
Learning associations: market basket analysis
Supervised learning (classification/regression): random forests, support vector machines (SVMs), logistic regression (LR), Naïve Bayes
Unsupervised learning (clustering): k-means, sentiment analysis
Prediction: random forests, SVMs, LR

Data Mining
Application of machine learning to large data
Knowledge Discovery in Databases (KDD)
Applications: credit scoring, fraud detection, market basket analysis, medical diagnosis, manufacturing optimization

Big Data Computations


Computations/Operations
Giant 1 (simple statistics): perfect for Hadoop 1.0.
Giants 2 (linear algebra), 3 (N-body), 4 (optimization): Spark from UC Berkeley is efficient.
Examples: logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares.
An example is the social group-first approach for consumer churn analysis [2].
Interactive/on-the-fly data processing: Storm.
OLAP data cube operations: Dremel/Drill.
Data sets that are not embarrassingly parallel: machine vision from Google [3], deep learning (artificial neural networks), speech analysis from Microsoft.
Giant 5 (graph processing): GraphLab, Pregel, Giraph.

[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 11

Iterative ML Algorithms
What are iterative algorithms? Those that need communication among the computing entities.
Examples: neural networks, PageRank, network traffic analysis.
Conjugate gradient (CG) descent
Commonly used to solve systems of linear equations.
[CB09] tried implementing CG on dense matrices using MapReduce primitives:
DAXPY: multiplies vector x by constant a and adds vector y.
DDOT: dot product of two vectors.
MatVec: multiplies a matrix by a vector, producing a vector.
Communication overhead: one MR job per primitive, about 6 MR jobs per CG iteration, hundreds of MR jobs per CG computation, leading to tens of GBs of communication even for small matrices.
Other iterative algorithms: fast Fourier transform, block tridiagonal (a sketch of one CG iteration built from these primitives follows after the reference below).

[CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing, Technical Report, University of California, Computer Science Department, 2009.
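As an illustration of why each CG iteration needs several primitive invocations, here is a minimal, self-contained Scala sketch of conjugate gradient on a small dense matrix, written in terms of the DAXPY/DDOT/MatVec primitives named above. It runs serially in memory; in the [CB09] mapping each primitive call would become a MapReduce round, which is where the roughly six jobs per iteration come from. The 2x2 example system is an assumption for illustration.

object CGSketch {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // The three primitives from the slide.
  def daxpy(a: Double, x: Vec, y: Vec): Vec =          // a*x + y
    x.zip(y).map { case (xi, yi) => a * xi + yi }
  def ddot(x: Vec, y: Vec): Double =                   // dot product of two vectors
    x.zip(y).map { case (xi, yi) => xi * yi }.sum
  def matVec(m: Mat, x: Vec): Vec =                    // matrix-vector product
    m.map(row => ddot(row, x))

  // Conjugate gradient for a symmetric positive-definite matrix A.
  def cg(a: Mat, b: Vec, iterations: Int): Vec = {
    var x = Array.fill(b.length)(0.0)
    var r = daxpy(-1.0, matVec(a, x), b)               // r = b - A*x
    var p = r.clone()
    var rsOld = ddot(r, r)
    for (_ <- 1 to iterations if rsOld > 1e-12) {
      val ap = matVec(a, p)                            // 1 MatVec
      val alpha = rsOld / ddot(p, ap)                  // 1 DDOT
      x = daxpy(alpha, p, x)                           // 1 DAXPY
      r = daxpy(-alpha, ap, r)                         // 1 DAXPY
      val rsNew = ddot(r, r)                           // 1 DDOT
      p = daxpy(rsNew / rsOld, p, r)                   // 1 DAXPY -> about 6 primitives/iteration
      rsOld = rsNew
    }
    x
  }

  def main(args: Array[String]): Unit = {
    val a: Mat = Array(Array(4.0, 1.0), Array(1.0, 3.0))
    val b: Vec = Array(1.0, 2.0)
    println(cg(a, b, 25).mkString(", "))               // approximately 0.0909, 0.6364
  }
}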

ML Realizations: A 3-Generation View

First Generation
Examples: SAS, R, Weka, SPSS in native form
Scalability: Vertical
Algorithms available: Huge collection of algorithms
Algorithms not available: Practically nothing
Fault tolerance: Single point of failure
Giants covered: All 7 giants, but for small data sets

Second Generation
Examples: Mahout, Pentaho, Revolution R, SAS In-memory Analytics (Hadoop)
Scalability: Horizontal (over Hadoop)
Algorithms available: Small subset - sequential logistic regression, linear SVMs, Stochastic Gradient Descent, k-means clustering, Random Forests, etc.
Algorithms not available: Vast number - kernel SVMs, multivariate logistic regression, Conjugate Gradient Descent, ALS, etc.
Fault tolerance: Most tools are fault tolerant (FT), as they are built on top of Hadoop
Giants covered: Giants 1 and 2

Third Generation
Examples: Spark, HaLoop, GraphLab, Pregel, SAS In-memory Analytics (Greenplum/Teradata), Giraph, GoldenOrb, Stanford GPS, ML over Storm
Scalability: Horizontal (beyond Hadoop)
Algorithms available: Much wider set, including Conjugate Gradient Descent (CGD), Alternating Least Squares (ALS), collaborative filtering, kernel SVM, belief propagation, matrix factorization, Gibbs sampling, etc.
Algorithms not available: Multivariate logistic regression in general form, k-means clustering, etc.; work in progress to expand the set of algorithms available
Fault tolerance: FT - HaLoop, Spark; not FT - Pregel, GraphLab, Giraph
Giants covered: Spark - giants 2, 3 and 4; GraphLab - giant 5

Vijay Srinivas Agneeswaran, Pranay Tonpay and Jayati Tiwari, Paradigms for Realizing Machine Learning Algorithms, Big Data Journal (Mary Ann Liebert), 1(4), 207-214.

Contents
Big Data Computations
Introduction to ML Characterization

Programming Abstractions

Berkeley data analytics stack


Spark

GraphLab

Real-time Analytics with Storm

PMML Scoring for Naïve Bayes


PMML Primer
Naïve Bayes Primer

Hadoop 2.0 (Hadoop YARN)


Data Flow in Spark and Hadoop


Berkeley Big-data Analytics Stack (BDAS)


BDAS: Use Cases


Ooyala
Uses Cassandra for video data personalization.
Pre-computed aggregates vs. on-the-fly queries.
Moved to Spark for ML and computing views.
Moved to Shark for on-the-fly queries: OLAP aggregate queries take 130 seconds on Cassandra (C*) vs. 60 ms in Spark.

Conviva
Uses Hive for repeatedly running ad-hoc queries on video data.
Optimized ad-hoc queries using Spark RDDs; found Spark to be 30 times faster than Hive.
ML for connection analysis and video streaming optimization.

Yahoo
Advertisement targeting: 30K nodes on Hadoop YARN.
Hadoop for batch processing, Spark for iterative processing, Storm for on-the-fly processing.
Content recommendation via collaborative filtering.


BDAS: Spark
Transformations/Actions and their descriptions:
map(function f1): pass each element of the RDD through f1 in parallel and return the resulting RDD.
filter(function f2): select the elements of the RDD that return true when passed through f2.
flatMap(function f3): similar to map, but f3 returns a sequence, to facilitate mapping a single input to multiple outputs.
union(RDD r1): returns the union of the RDD r1 with self.
sample(flag, p, seed): returns a randomly sampled (with seed) p percentage of the RDD.
groupByKey(noTasks): can only be invoked on key-value paired data; returns the data grouped by key. The number of parallel tasks is given as an argument (default is 8).
reduceByKey(function f4, noTasks): aggregates the result of applying f4 on elements with the same key. The number of parallel tasks is the second argument.
join(RDD r2, noTasks): joins RDD r2 with self; computes all possible pairs for a given key.
groupWith(RDD r3, noTasks): joins RDD r3 with self and groups by key.
sortByKey(flag): sorts the self RDD in ascending or descending order based on the flag.
reduce(function f5): aggregates the result of applying f5 on all elements of the self RDD.
collect(): returns all elements of the RDD as an array.
count(): counts the number of elements in the RDD.
take(n): gets the first n elements of the RDD.
first(): equivalent to take(1).
saveAsTextFile(path): persists the RDD in a file in HDFS or another Hadoop-supported file system at the given path.
saveAsSequenceFile(path): persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs whose elements implement the Hadoop Writable interface or equivalent.
foreach(function f6): runs f6 in parallel on the elements of the self RDD.
A small usage example combining several of these operations follows after the reference below.

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA.
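To show how a few of the listed operations compose, here is a minimal, hypothetical word-count style example (not taken from the slides). It assumes a Spark release where SparkContext lives in org.apache.spark (0.8 or later); the local master string and the input.txt path are placeholders.

import org.apache.spark.SparkContext

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Local master for illustration; a Mesos/YARN URL would be used on a cluster.
    val sc = new SparkContext("local[2]", "WordCountSketch")

    // "input.txt" is a placeholder path.
    val lines    = sc.textFile("input.txt")
    val words    = lines.flatMap(line => line.split("\\s+"))  // flatMap: one line -> many words
    val nonEmpty = words.filter(w => w.nonEmpty)               // filter: drop empty tokens
    val counts   = nonEmpty.map(w => (w, 1))                   // map: to key-value pairs
                           .reduceByKey((a, b) => a + b)       // reduceByKey: sum counts per word

    counts.take(10).foreach(println)                           // take/collect are actions
    sc.stop()
  }
}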

Representation of an RDD
Information captured per RDD, shown for three example RDDs:

Set of partitions
  HadoopRDD: one per HDFS block; FilteredRDD: same as parent; JoinedRDD: one per reduce task.
Set of dependencies
  HadoopRDD: none; FilteredRDD: one-to-one on parent; JoinedRDD: shuffle on each parent.
Function to compute the data set based on parents
  HadoopRDD: read the corresponding block; FilteredRDD: compute the parent and filter it; JoinedRDD: read and join the shuffled data.
Metadata on location (preferredLocations)
  HadoopRDD: HDFS block location from the namenode; FilteredRDD: none (ask parent); JoinedRDD: none.
Metadata on partitioning (partitioningScheme)
  HadoopRDD: none; FilteredRDD: none; JoinedRDD: HashPartitioner.


Some Spark(ling) examples


Scala code (serial):

var count = 0
for (i <- 1 to 100000) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)

Sample random points in the square [-1, 1] x [-1, 1] and count how many fall inside the unit circle (roughly pi/4 of them). This gives an approximate value for pi: points-in-square / points-in-circle = area-of-square / area-of-circle = 4/pi, so pi = 4 * (points-in-circle / points-in-square).

Some Spark(ling) examples


Spark code (parallel):

val spark = new SparkContext(<Mesos master>)
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count.value / 100000.0)

Notable points:
1. A Spark context is created; it talks to the Mesos master.
2. count becomes a shared variable: an accumulator.
3. The for loop iterates over an RDD: parallelize breaks the Scala range object (1 to 100000) into 12 slices.
4. The for loop over the parallelized collection invokes the foreach method of the RDD.

Mesos is an Apache-incubated clustering system: http://mesosproject.org

Logistic Regression in Spark: Serial Code


// Read data file and convert it into Point objects
val lines = scala.io.Source.fromFile("data.txt").getLines()
val points = lines.map(x => parsePoint(x))

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1 / (1 + Math.exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Result: " + w)

Logistic Regression in Spark


// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1 / (1 + Math.exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)

Logistic Regression: Spark vs. Hadoop

http://spark-project.org


Contents
Big Data Computations
Introduction to ML Characterization

Programming Abstractions

Berkeley data analytics stack


Spark

Hadoop 2.0 (Hadoop YARN)

Real-time Analytics with Storm

GraphLab

PMML Scoring for Naïve Bayes


PMML Primer
Naïve Bayes Primer


Real-time Analytics with Storm


Solution to Internet Traffic Analysis Use Case

Contents
Big Data Computations
Introduction to ML Characterization

Programming Abstractions

Berkeley data analytics stack


Spark

Hadoop 2.0 (Hadoop YARN)

Real-time Analytics with Storm

GraphLab

PMML Scoring for Naïve Bayes


PMML Primer
Naïve Bayes Primer


PMML Primer

Predictive Model Markup Language

Developed by DMG (Data Mining Group)

XML representation of a model.

PMML offers a standard to define a model, so that a model generated in tool-A can be directly used in tool-B.

May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.


Naïve Bayes Primer

A simple probabilistic classifier based on Bayes' theorem.

Given features X1, X2, ..., Xn, predict a label Y by calculating the probability for all possible values of Y:

P(Y | X1, ..., Xn) = P(X1, ..., Xn | Y) * P(Y) / P(X1, ..., Xn)

where P(X1, ..., Xn | Y) is the likelihood, P(Y) is the prior and P(X1, ..., Xn) is the normalization constant. Naïve Bayes assumes the features are conditionally independent given Y, so the likelihood factorizes into the product of the per-feature terms P(Xi | Y).

PMML Scoring for Naïve Bayes

Wrote a PMML-based scoring engine for the Naïve Bayes algorithm.
Deployed a Naïve Bayes PMML model generated from R into the Storm, Spark and Samza frameworks.
This can, in principle, be used in any data processing framework by invoking the API.
Real-time predictions with the above APIs.

Structure of a PMML document:
Header: version and timestamp; model development environment information.
Data Dictionary: variable types; missing, valid and invalid values.
Data Munging/Transformation: normalization, mapping, discretization.
Model: model-specific attributes; definition of the model architecture/parameters.
Mining Schema: treatment for missing and outlier values.
Targets: prior probabilities and defaults.
Outputs: list of computed output fields; post-processing.


PMML Scoring for Naïve Bayes

<DataDictionary numberOfFields="4">
  <DataField name="Class" optype="categorical" dataType="string">
    <Value value="democrat"/>
    <Value value="republican"/>
  </DataField>
  <DataField name="V1" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V2" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V3" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
</DataDictionary>

(continued on the next slide)

PMML Scoring for Naïve Bayes

<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
  <MiningSchema>
    <MiningField name="Class" usageType="predicted"/>
    <MiningField name="V1" usageType="active"/>
    <MiningField name="V2" usageType="active"/>
    <MiningField name="V3" usageType="active"/>
  </MiningSchema>
  <Output>
    <OutputField name="Predicted_Class" feature="predictedValue"/>
    <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
    <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
  </Output>
  <BayesInputs>

(continued on the next slide)

PMML Scoring for Naïve Bayes

  <BayesInputs>
    <BayesInput fieldName="V1">
      <PairCounts value="n">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="51"/>
          <TargetValueCount value="republican" count="85"/>
        </TargetValueCounts>
      </PairCounts>
      <PairCounts value="y">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="73"/>
          <TargetValueCount value="republican" count="23"/>
        </TargetValueCounts>
      </PairCounts>
    </BayesInput>
    <BayesInput fieldName="V2"> * </BayesInput>
    <BayesInput fieldName="V3"> * </BayesInput>
  </BayesInputs>
  <BayesOutput fieldName="Class">
    <TargetValueCounts>
      <TargetValueCount value="democrat" count="124"/>
      <TargetValueCount value="republican" count="108"/>
    </TargetValueCounts>
  </BayesOutput>

PMML Scoring for Naïve Bayes

Definition of elements:
DataDictionary: definitions for the fields used in mining models (Class, V1, V2, V3).
NaiveBayesModel: indicates that this is a Naïve Bayes PMML model.
MiningSchema: lists the fields used in that model. Class is the predicted field; V1, V2 and V3 are active predictor fields.
Output: describes the set of result values that can be returned from the model.

PMML Scoring for Naïve Bayes

Definition of elements (ctd.):
BayesInputs: for each input field, contains the counts pairing its values with the target values.
BayesOutput: contains the counts associated with the values of the target field.


PMML Scoring for Naïve Bayes

Sample input:
Eg1: n y y n y y n n n n n n y y y y
Eg2: n y n y y y n n n n n y y y n y

The 1st, 2nd and 3rd columns are the predictor variables (the attribute names in the MiningField elements). Using these, we predict whether the output is democrat or republican (PMML element BayesOutput). A minimal scoring sketch using the counts from the PMML above follows below.
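Here is a minimal, self-contained Scala sketch of the scoring step, meant only as an illustration rather than the actual engine described above. The hard-coded counts mirror the PairCounts/TargetValueCount entries for V1 and the BayesOutput counts shown earlier; since the V2/V3 counts are elided on the slides, only V1 is scored, and the model's threshold and missing-value handling are omitted.

object NaiveBayesScoringSketch {
  // Class priors from <BayesOutput>: democrat = 124, republican = 108.
  val classCounts: Map[String, Double] = Map("democrat" -> 124.0, "republican" -> 108.0)

  // Conditional counts from <BayesInput fieldName="V1">.
  // (The V2/V3 counts are elided on the slide, so only V1 is used here.)
  val v1Counts: Map[(String, String), Double] = Map(
    ("n", "democrat") -> 51.0, ("n", "republican") -> 85.0,
    ("y", "democrat") -> 73.0, ("y", "republican") -> 23.0
  )

  /** Unnormalized posterior P(class) * P(V1 = v1 | class) for each class. */
  def score(v1: String): Map[String, Double] = {
    val total = classCounts.values.sum
    classCounts.map { case (cls, cnt) =>
      val prior      = cnt / total
      val likelihood = v1Counts((v1, cls)) / cnt
      cls -> prior * likelihood
    }
  }

  def predict(v1: String): String = score(v1).maxBy(_._2)._1

  def main(args: Array[String]): Unit = {
    // The first column of sample input Eg1 is "n".
    println(score("n"))
    println(predict("n"))  // "republican"
  }
}

For the first column of Eg1 ("n"), the unnormalized posterior favours republican (roughly 0.37 vs. 0.22 for democrat), so the sketch predicts republican.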


PMML Scoring for Naïve Bayes

3-node Xeon machine Storm cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors):

Number of records (millions)    Time taken (seconds)
0.1                             4
0.4                             7
1.0                             12
2.0                             21
10                              129
25                              310


PMML Scoring for Naïve Bayes

3-node Xeon machine Spark cluster (8 quad-core CPUs, 32 GB RAM and 32 GB swap space):

Number of records (millions)    Time taken
0.1                             1 min 47 sec
0.2                             3 min 35 sec
0.4                             6 min 40 sec
1.0                             35 min 17 sec
10                              More than 3 hrs


Contents
Big Data Computations
Introduction to ML Characterization

Programming Abstractions

Berkeley data analytics stack


Spark

Hadoop 2.0 (Hadoop YARN)

Real-time Analytics with Storm

GraphLab

PMML Scoring for Naïve Bayes


PMML Primer
Naïve Bayes Primer


GraphLab: Ideal Engine for Processing Natural Graphs [YL12]


Goals: targeted at machine learning.
Models graph dependencies; asynchronous, iterative, dynamic computation.
Data is associated with edges (weights, for instance) and with vertices (user profile data, current interests, etc.).
An update function lives on each vertex:
Transforms data in the scope of the vertex.
Can choose to trigger neighbours (for example, only if the rank changes drastically).
Runs asynchronously till convergence; no global barrier.
Consistency is important in ML algorithms (some, such as collaborative filtering, do not even converge when there are inconsistent updates).
GraphLab provides varying levels of consistency: a parallelism vs. consistency trade-off.
Several algorithms implemented, including ALS, k-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM, etc.
The Co-EM (Expectation Maximization) algorithm is 15x faster than Hadoop MR on distributed GraphLab, taking only 0.3% of the Hadoop execution time.

[YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5(8), 716-727.

GraphLab 2: PowerGraph Modeling Natural Graphs [1]

GraphLab could not scale to the Altavista web graph (2002): 1.4B vertices, 6.7B edges.
Most graph-parallel abstractions assume small neighbourhoods (low-degree vertices), but natural graphs (LinkedIn, Facebook, Twitter) are power-law graphs. Power-law graphs are hard to partition, and high-degree vertices limit parallelism.

PowerGraph provides a new way of partitioning power-law graphs:
Edges are tied to machines; vertices (especially high-degree ones) span machines.
Execution is split into 3 phases: gather, apply and scatter (a conceptual sketch follows after the reference below).

Triangle counting on the Twitter graph:
Hadoop MR took 423 minutes on 1536 machines.
GraphLab 2 took 1.5 minutes on 1024 cores (64 machines).

[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12).
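Here is a conceptual Scala sketch of the gather-apply-scatter (GAS) pattern, using PageRank as the running example. It runs serially over an in-memory edge list and is only meant to illustrate the three phases; it is not GraphLab's actual (C++) API, and the scatter phase is reduced to a comment because this serial loop simply recomputes every vertex.

object GasPageRankSketch {
  // Directed edges as (source, destination) pairs; vertex ids are Ints.
  type Edge = (Int, Int)

  def pageRank(edges: Seq[Edge], iterations: Int, damping: Double = 0.85): Map[Int, Double] = {
    val vertices  = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    val inEdges   = edges.groupBy(_._2)               // destination -> incoming edges
    var rank      = vertices.map(v => v -> 1.0).toMap

    for (_ <- 1 to iterations) {
      rank = vertices.map { v =>
        // Gather: sum rank/out-degree over the in-neighbours of v.
        val gathered = inEdges.getOrElse(v, Seq.empty)
          .map { case (src, _) => rank(src) / outDegree(src) }
          .sum
        // Apply: compute the new vertex value from the gathered sum.
        v -> ((1.0 - damping) + damping * gathered)
        // Scatter (omitted here): signal out-neighbours whose rank changed enough.
      }.toMap
    }
    rank
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq((1, 2), (2, 3), (3, 1), (1, 3))
    pageRank(edges, 20).toSeq.sortBy(_._1).foreach(println)
  }
}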

Contents
Big Data Computations
Introduction to ML Characterization

Programming Abstractions

Berkeley data analytics stack


Spark

Hadoop 2.0 (Hadoop YARN)

Real-time Analytics with Storm

GraphLab

PMML Scoring for Naïve Bayes


PMML Primer
Naïve Bayes Primer


Hadoop YARN Requirements, or Hadoop 1.0 Shortcomings

R1: Scalability
Single-cluster size limitation.

R2: Multi-tenancy
Addressed by Hadoop-on-Demand; security, quotas.

R3: Locality awareness
Shuffle of records.

R4: Shared cluster utilization
Hogging by users; typed slots.

R5: Reliability/Availability
JobTracker bugs.

R6: Iterative Machine Learning

Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. Apache Hadoop YARN: Yet Another Resource Negotiator. ACM Symposium on Cloud Computing, Oct 2013, ACM Press.

Hadoop YARN Architecture


YARN Internals

Application Master (AM)
Sends ResourceRequests to the YARN ResourceManager (RM).
A request captures the containers, resources per container and locality preferences (see the sketch below).

YARN ResourceManager (RM)
Generates tokens and grants containers.
Has a global view of the cluster: monolithic scheduling.

Node Manager (NM)
Node health monitoring; advertises available resources to the RM through heartbeats.
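To connect these roles, here is a hedged Scala sketch of how an Application Master might express a ResourceRequest through the Hadoop 2.x AMRMClient API. The host names, resource sizes and priority are illustrative assumptions, and the register/heartbeat lifecycle is collapsed into a single allocate call; a real AM would loop on allocate and launch the returned containers via the NodeManager.

import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

object AmResourceRequestSketch {
  def main(args: Array[String]): Unit = {
    // Client used by the Application Master to talk to the ResourceManager.
    val amrm = AMRMClient.createAMRMClient[ContainerRequest]()
    amrm.init(new YarnConfiguration())
    amrm.start()

    // Register the AM with the RM ("am-host" and the empty tracking URL are placeholders).
    amrm.registerApplicationMaster("am-host", 0, "")

    // A ResourceRequest: resources per container plus locality preferences.
    val capability = Resource.newInstance(1024 /* MB */, 2 /* vcores */)
    val priority   = Priority.newInstance(1)
    val request    = new ContainerRequest(
      capability,
      Array("worker-node-1"),   // preferred nodes (placeholder host name)
      null,                     // no rack preference
      priority)
    amrm.addContainerRequest(request)

    // One heartbeat/allocate call; a real AM loops here until it receives containers.
    val response = amrm.allocate(0.1f)
    response.getAllocatedContainers.asScala.foreach(c => println("Got container: " + c.getId))

    amrm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
    amrm.stop()
  }
}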


Contents
Big Data Computations
Introduction to ML Characterization

Programming Abstractions

Berkeley data analytics stack


Spark

Hadoop 2.0 (Hadoop YARN)

Real-time Analytics with Storm

GraphLab

PMML Scoring for Naïve Bayes


PMML Primer
Naïve Bayes Primer


Programming Abstractions
PMML
XML-based representation of the analytical model

Spark
Scala collections over a distributed shared-memory system

GraphLab
Gather-Apply-Scatter

Forge
Domain-specific language


Forge: An Approach to Building High-Performance Domain-Specific Languages

Domain-specific language (DSL) approach from Stanford.

Forge [AKS13]: a meta-DSL for generating high-performance DSLs; reported to be 40x faster than Spark.
OptiML: a DSL for machine learning.

[AKS13] Arvind K. Sujeeth, Austin Gibbons, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Martin Odersky, and Kunle Olukotun. 2013. Forge: generating a high performance DSL implementation from a declarative specification. In Proceedings of the 12th International Conference on Generative Programming: Concepts & Experiences (GPCE '13). ACM, New York, NY, USA, 145-154.

Conclusions
Beyond the Hadoop MapReduce philosophy:
Optimization and other problems; real-time computation.
Processing specialized data structures.
Spark for batch computations; Spark Streaming and Storm for real-time processing.

PMML scoring:
Allows traditional analytical tools/algorithms to be re-used.

Thank You!

Mail: vijay.sa@impetus.co.in
LinkedIn: http://in.linkedin.com/in/vijaysrinivasagneeswaran
Blogs: blogs.impetus.com
Twitter: @a_vijaysrinivas
