
DSK-3572

Apache Spark:
What's in It for Your Business?
Adarsh Pannu
Senior Technical Staff Member
IBM Analytics Platform

© 2015 IBM Corporation


Outline

Spark Overview
Spark Stack
Spark in Action
Spark vs MapReduce
Spark @ IBM

Started in 2009 at UC Berkeley's AMP Lab
_______________________

6 years
7+ components
129+ third-party packages
500+ contributors
400,000+ lines of code
1 platform

Apache Spark
____________________________________________

The Analytics Operating System
1. A general-purpose data processing (compute) engine

2. A technology that interoperates well with Apache Hadoop

3. A big data ecosystem

Why Spark?

Expressiveness
+ Capabilities
+ Productivity (more function, less code)

Speed
+ 2-10x faster on disk
+ 100x faster in-memory
(vs. Hadoop MapReduce)
Spark Stack

  DataFrames                ML Pipelines
  SQL | Streaming | MLLib | GraphX
  Spark Core
Spark in the real world

Spark in Action

Spark is essentially a collection of APIs.

The best way to appreciate Spark's value is via examples. Let's do so using a
real-world dataset: the on-time arrival performance record of every US airline
flight since the 1980s.

Year, Month, DayofMonth
UniqueCarrier, FlightNum
DepTime, ArrTime, ActualElapsedTime
ArrDelay, DepDelay
Origin, Dest, Distance
Cancelled, ...

Where, when, how long? ...
Spark in Action (contd.)

A sample record:

2004,2,5,4,1521,1530,1837,1838,CO,65,N67158,376,368,326,-1,-9,EWR,LAX,2454,...

The full schema:

Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime,
UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime,
ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, Cancelled,
CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay,
SecurityDelay, LateAircraftDelay

(A parsing sketch follows.)
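To make the field positions concrete, here is a minimal Scala sketch; the
Flight case class and parse helper are illustrative additions, not part of the
original deck. They extract just the columns the later examples use.

case class Flight(year: Int, month: Int, dayOfMonth: Int,
                  carrier: String, origin: String, dest: String,
                  cancelled: Boolean)

def parse(line: String): Flight = {
  val f = line.split(",")
  // Indices follow the schema above: UniqueCarrier = 8, Origin = 16,
  // Dest = 17, Cancelled = 21
  Flight(f(0).toInt, f(1).toInt, f(2).toInt, f(8), f(16), f(17), f(21) == "1")
}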
Spark in Action (contd.)

Which airports had the most flight cancellations?

_________________________________________

Let's compute this using Spark Core.
sc.textFile("hdfs://localhost:9000/user/Adarsh/routes.csv").
2004,2,5...EWR...1 2004,2,5...HNL...0 2004,2,5...ORD...1 2004,2,5...ORD...1

map(row => (row.split(",")(16), row.split(",")(21).toInt)).

(EWR, 1) (HNL,0) (ORD, 1) (ORD, 1)

filter(row => row._2 == 1).


(EWR, 1) (ORD, 1) (ORD, 1)

reduceByKey((a,b) => a + b).

(EWR, 1) (ORD, 2)

sortBy(_._2, ascending=false).

(ORD, 2) (EWR, 1)

collect

[ (ORD, 2), (EWR, 1) ]

11
sc.textFile("hdfs://localhost:9000/user/Adarsh/routes.csv").

map(row => (row.split(",")(16), row.split(",")(21).toInt)).

filter(row => row._2 == 1).

reduceByKey((a,b) => a + b).

sortBy(_._2, ascending=false).

collect
Not bad, eh? Do try this at
home ... just not on
MapReduce!
12
Spark in Action (contd.)

Which airports had the most flight cancellations?

Answer: In 2008, it was:


ORD
ATL
DFW
LGA
EWR

We're not surprised, are we?
Spark builds Data Pipelines (DAGs)

Nodes in the pipeline are RDDs.
Leaf nodes represent base datasets.
Intermediate nodes represent computations.

Directed (arrows), Acyclic (no loops), Graph
Resilient Distributed Datasets (RDD)

Immutable collection of objects
Distributed across machines
Can be operated on in parallel
Can hold any kind of data:
  Hadoop datasets
  Parallelized Scala collections
  RDBMS or NoSQL
Can be cached; recovers from failures

Example: a flights RDD split into partitions across machines:

  CO780, IAH, MCI    |  DL282, ATL, CVG    |  UA620, SJC, ORD
  CO683, IAH, MSY    |  DL2128, CHS, ATL   |  UA675, ORD, SMF
  CO1707, TPA, IAH   |  DL2592, PBI, LGA   |  UA676, ORD, LGA
  ...                |  DL417, FLL, ATL    |  ...
                     |  ...                |
Spark Execution

Data pipelines are broken up into stages.
Each stage is executed in parallel (across partitions).
At stage boundaries, data is shuffled or returned to the client.

Of course, all of this is done under the covers for you!
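You can see those stage boundaries yourself: toDebugString prints an RDD's
lineage, and the indentation shifts at each shuffle. A minimal sketch on the
same flight data:

// reduceByKey forces a shuffle, i.e. a stage boundary
val counts = sc.textFile("hdfs://localhost:9000/user/Adarsh/routes.csv").
  map(row => (row.split(",")(16), 1)).
  reduceByKey(_ + _)

// The indented lineage shows where one stage ends and the next begins
println(counts.toDebugString)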
Wait ... this is just like MapReduce? What's the difference?
Spark vs MapReduce

MapReduce:  Map -> Reduce -> HDFS -> Map -> Reduce
- Simple pipelines, heavyweight JVMs, communication through HDFS,
  no explicit memory exploitation

Spark:  Map -> Filter -> Reduce -> Join -> Sort  (via local disk and HDFS)
- Complex pipelined DAGs, threads (vs. JVMs), memory and disk
  exploitation, caching, fast shuffle, and more...
RDD Operations
Spark has rich functional, relational & iterative APIs.

Transformations
- Create a new RDD (dataset) from an existing one
- Lazily evaluated (see the sketch after this list)
  map, flatMap, filter, reduceByKey, groupByKey, mapPartitions,
  join, cogroup, coalesce, union, distinct, intersection,
  sortByKey, sample, repartition, ...

Actions
- Run a computation (i.e. a job), optionally returning data to the client
  count, collect, cache, first, take, saveAsTextFile, ...
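A minimal sketch of that laziness on the same flight data: the transformations
below only build the DAG, and nothing executes until the action at the end.

val flights = sc.textFile("hdfs://localhost:9000/user/Adarsh/routes.csv")  // no job yet
val cancelled = flights.
  map(row => row.split(",")).
  filter(fields => fields(21) == "1")   // still no job: just DAG building

val n = cancelled.count()               // action: the whole pipeline runs now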
RDD Persistence and Caching

Spark can persist RDDs in several ways:
- In-memory (a.k.a. caching)
- On disk
- Both

This allows Spark to avoid re-computing portions of a DAG.
- Beneficial for repeating workloads such as machine learning

But you have to tell Spark which RDDs to persist, and how
(see the sketch below).

[Diagram: a DAG data -> flights -> cancelled -> airports, with 'airports'
cached; a count and a take(5) both reuse it, and the upstream stages are
skipped.]
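The mechanics, sketched on the same flight data: mark the shared RDD as
persistent, and the first action both computes and caches it for every later
one.

import org.apache.spark.storage.StorageLevel

val parsed = sc.textFile("hdfs://localhost:9000/user/Adarsh/routes.csv").
  map(row => row.split(","))

parsed.cache()                                    // shorthand for persist(MEMORY_ONLY)
// parsed.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: spill to disk when needed

val total = parsed.count()   // first action: computes and caches 'parsed'
val peek  = parsed.take(5)   // later actions reuse the cached partitions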
Common misconceptions around Spark

Spark requires you to load all your data into memory.

Spark is a replacement for Hadoop.

Spark is only for machine learning.

Spark solves global hunger and world peace.

Sorting 100 TB of data

Hadoop (2013): 2100 machines, 50400 cores, 72 mins
Spark (2014, current record): 206 virtual machines, 6952 cores, 23 mins

Spark also sorted 1 PB in 234 mins on similar hardware (it scales).

Source: sortbenchmark.org
OK! Let's get back to the Spark application.
Spark in Action (contd.)

Which airports had the most flight cancellations?


____________________________________________

Can we compute this using SQL?


Yes!

Can we do that without leaving my Spark environment?


Yes!

Spark SQL Example
(RDD <-> DataFrame interoperability)

// Specify table schema
val schema = StructType(...)

// Define a data frame using the schema
val rowRDD = flights.map(e => Row.fromSeq(e))
val flightsDF = sqlContext.createDataFrame(rowRDD, schema)

// Give the data frame a name
flightsDF.registerTempTable("flights")

// Use SQL!
val results = sqlContext.sql("""SELECT Origin, count(*) Cnt FROM flights
                                WHERE Cancelled = "1"
                                GROUP BY Origin ORDER BY Cnt DESC""")

Ma.. Look! I can use SQL seamlessly with RDDs.
Spark SQL and DataFrames

Spark SQL
- SQL engine with an optimizer
- Not ANSI compliant
- Operates on DataFrames (RDDs with schema)
- Primary goal: write (even) less code
- Secondary goal: provide a way around the performance penalty
  of non-JVM languages (e.g. Python or R)

DataFrame API
- Programmatic API to build SQL-ish constructs
- Shares machinery with the SQL engine

These allow you to create an RDD DAG using a higher-level language / API.
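For illustration, the cancellation query from the SQL example can also be
written with the DataFrame API; this sketch assumes the flightsDF defined
earlier, and both forms drive the same engine.

import org.apache.spark.sql.functions._

// Same question as before, without SQL text
val results = flightsDF.
  filter(col("Cancelled") === "1").
  groupBy("Origin").
  agg(count("*").as("Cnt")).
  orderBy(desc("Cnt"))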
Spark in Action (contd.)

Which flights are likely to be delayed? And by how much?

____________________________________________

Can we build a predictive model for this?


Yes!

Again, can we do that without leaving my Spark environment?


Yes!

Spark MLLib Example
(Tight SQL / DataFrame / RDD -> ML integration)

// Build a training set from SQL results
val trainRDD = sqlContext.sql("""SELECT ArrDelay, DepDelay, ArrTime,
                                 DepTime, Distance FROM flights""").rdd.map {
  case Row(label: String, features @ _*) =>
    LabeledPoint(toDouble(label),
                 Vectors.dense(features.map(toDouble(_)).toArray))
}

// Train a decision tree (regression) model to predict flight delays
val model = DecisionTree.train(trainRDD, Regression, Variance, maxDepth = 5)

// Evaluate the model on the training set
val valuesAndPreds = trainRDD.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
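To close the loop on that evaluation, a short sketch (not in the original deck)
that reduces valuesAndPreds to a single error number:

// Mean squared error over the training set, from the (label, prediction) pairs
val mse = valuesAndPreds.
  map { case (label, prediction) => math.pow(label - prediction, 2) }.
  mean()
println(s"Training MSE = $mse")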
Machine Learning on Spark

MLLib
- Goal: make machine learning scalable
- Growing collection of algorithms: regression, classification,
  clustering, recommendation, frequent itemsets, linear algebra
  and statistical primitives, ...

ML Pipelines
- Goal: move beyond a list of algorithms; make ML practical
- Support ML workflows,
  e.g.: load data -> extract features -> train model -> evaluate
  (see the sketch after this list)

IBM's SystemML
- Declarative machine learning using an R-like language
- Radically simplifies algorithm development
- Recently open-sourced, being integrated with Spark
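As an illustration of that workflow idea, here is a small sketch using the
spark.ml Pipeline API. The column names come from the flights table; the
particular stages (a VectorAssembler feeding a linear regression) and the
trainingDF DataFrame are assumptions for illustration, not the deck's method.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Extract features: assemble raw numeric columns into one feature vector
val assembler = new VectorAssembler().
  setInputCols(Array("DepDelay", "Distance")).
  setOutputCol("features")

// Train model: predict arrival delay from the assembled features
val lr = new LinearRegression().
  setLabelCol("ArrDelay").
  setFeaturesCol("features")

// The Pipeline object captures the whole workflow: load -> features -> train
val pipeline = new Pipeline().setStages(Array(assembler, lr))
val model = pipeline.fit(trainingDF)   // trainingDF: a DataFrame of flight rows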
Spark in Action (contd.)

What is the shortest path to travel from Maui to Ithaca?

____________________________________________

Can Spark help with this graph query too?


Yes!

Can we do that without leaving my Spark environment?


Yes!

Spark in Action (contd.)

What is the shortest path to travel from Maui to Ithaca?

____________________________________________
Using Spark, we can turn the flight data into a schedule graph:
- Vertices = cities, edges = flights

[Graph: airports OGG, LAX, SJC, DFW, ORD, EWR, ITH as vertices,
connected by flight edges]
Spark GraphX
(Tight RDD <-> Graph API interoperability)

GraphX is an API to:
- Represent property graphs: Graph[Vertex, Edge]
- Manipulate graphs to yield subgraphs or other data
- View data as both graphs and collections
- Write custom iterative graph algorithms using the Pregel API
  (a shortest-path sketch follows)
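As a sketch of how the Maui-to-Ithaca question could be answered, GraphX ships
a ShortestPaths library that computes hop counts to a set of landmark vertices.
The numeric vertex ids and the tiny edge list below are made up for
illustration:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.ShortestPaths

// Assign (hypothetical) numeric ids to a few airports
val vertices = sc.parallelize(Seq((1L, "OGG"), (2L, "LAX"), (3L, "ORD"), (4L, "ITH")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "flight"),
                               Edge(2L, 3L, "flight"),
                               Edge(3L, 4L, "flight")))
val graph = Graph(vertices, edges)

// For every vertex, compute hop counts to the landmark ITH (id 4)
val paths = ShortestPaths.run(graph, Seq(4L))
val hopsFromOGG = paths.vertices.filter(_._1 == 1L).first()._2   // Map(4 -> 3)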
Spark in Action (contd.)

At any given moment, there are hundreds of flights in the sky, all
generating streaming data. Sites such as flightaware.com make
this data available to consumers.
____________________________________________

Can we use Spark to analyze this data in motion and possibly


correlate it with historical trends?
Yes!

Can we do that all inside my Spark environment?


Yes!

Spark Streaming

Scalable and fault-tolerant
Micro-batching model:
- Input data is split into batches based on time intervals
- Batches are presented as RDDs
- RDD, DataFrame/SQL and GraphX APIs are available to streams
  (see the sketch below)

Input sources: Kafka, Flume, Kinesis, Twitter, ...
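A small sketch of the micro-batching model; the socket source, host, and port
are stand-ins for a real feed such as Kafka, and the field indices match the
flight schema used earlier.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))     // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999) // hypothetical feed of CSV flight events

// Each batch arrives as an RDD, so the familiar RDD operations apply
val cancellationsPerOrigin = lines.
  map(line => line.split(",")).
  filter(fields => fields(21) == "1").              // Cancelled flag
  map(fields => (fields(16), 1)).                   // (Origin, 1)
  reduceByKey(_ + _)

cancellationsPerOrigin.print()                      // emitted once per batch
ssc.start()
ssc.awaitTermination()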
Interoperability within the Spark Stack: a sample scenario

[Diagram: data at rest and data in motion both feed a core model built on Spark.]

1 platform

Apache Spark
____________________________________________

The Analytics Operating System
Where does Spark fit in a Hadoop World?

Storage (HDFS) + Cluster Management (YARN) + Compute (MapReduce, Spark)

Spark is an alternate (and faster) compute engine in your Hadoop stack.
- Consider Spark for new workloads
- Carefully evaluate porting existing workloads:
  if it ain't broken, don't fix it!
Spark @ IBM

Enhance it!     Spark Technology Center @ SF
Distribute it!  Shipping with BigInsights; Spark as a Service
Exploit it!     Inside our products (Next Gen Analytics, +++)
IBM has made a significant investment in Spark

- IBM Spark Technology Center, San Francisco
- Growing pool of contributors
- 300+ inventors
- Contributed SystemML
- Founding member of AMPLab
- Partnerships in the ecosystem

http://spark.tc
IBM ANALYTICS PLATFORM
Built on Spark. Hybrid. Trusted.

Data sources: RDBMS, object stores, NoSQL engines, document stores, ...

Capabilities: Discovery & Exploration, Predictive Analytics, Prescriptive
Analytics, Content Analytics, Business Intelligence, Data Management,
Content Management, Hadoop & NoSQL, Data Warehousing, Information
Integration & Governance, and Machine Learning, with Spark as the Analytics
Operating System, on premises and on cloud.

Business priorities: predict the future for the business; delight customers
by understanding them better; derive business value from unstructured content.

Data at rest & in motion. Inside & outside the firewall.
Structured & unstructured.
Data-centric priorities: logical data warehouse, data reservoir, fluid data
layer.
Want to learn more about Spark?
______________________________________________

Reference slides follow.

Notices and Disclaimers
Copyright 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form
without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for
accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to
update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO
EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO,
LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted
according to the terms and conditions of the agreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as
illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other
results in other operating environments may vary.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or
services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the
views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or
other guidance or advice to any individual participant or their specific situation.

It is the customer's responsibility to ensure its own compliance with legal requirements and to obtain the advice of competent legal counsel as to
the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer's business and any actions the
customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will
ensure that the customer is in compliance with any law.

Notices and Disclaimers (cont)

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM's products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com, Aspera, Bluemix, Blueworks Live, CICS, Clearcase, Cognos, DOORS, Emptoris, Enterprise Document
Management System, FASP, FileNet, Global Business Services, Global Technology Services, IBM ExperienceOne, IBM
SmartCloud, IBM Social Business, Information on Demand, ILOG, Maximo, MQIntegrator, MQSeries, Netcool, OMEGAMON,
OpenPower, PureAnalytics, PureApplication, pureCluster, PureCoverage, PureData, PureExperience, PureFlex, pureQuery,
pureScale, PureSystems, QRadar, Rational, Rhapsody, Smarter Commerce, SoDA, SPSS, Sterling Commerce, StoredIQ,
Tealeaf, Tivoli, Trusteer, Unica, urban{code}, Watson, WebSphere, Worklight, X-Force and System z Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.

Thank You

© 2015 IBM Corporation


We Value Your Feedback!

Don't forget to submit your Insight session and speaker feedback! Your
feedback is very important to us; we use it to continually improve the
conference.

Access your surveys at insight2015survey.com to quickly submit your surveys
from your smartphone, laptop or conference kiosk.
Reference slides

How deep do you want to go?

Intro         What is Spark? How does it relate to Hadoop?     1-2 hours
              When would you use it?

Basic         Understand the basic technology and write        1-2 days
              simple programs

Intermediate  Start writing complex Spark programs even as     5-15 days, to
              you understand operational aspects               weeks and months

Expert        Become a Spark Black Belt! Know Spark            Months to years
              inside out.
Intro Spark

Go through these additional presentations to understand the value of Spark.
These speakers also attempt to differentiate Spark from Hadoop, and enumerate
its comparative strengths. (Not much code here.)

# Turning Data into Value, Ion Stoica, Spark Summit 2013. Video & slides, 25 mins
  https://spark-summit.org/2013/talk/turning-data-into-value

# Spark: What's in it for your business?, Adarsh Pannu
  (This presentation itself)

# How Companies are Using Spark, and Where the Edge in Big Data Will Be,
  Matei Zaharia. Video & slides, 12 mins
  http://conferences.oreilly.com/strata/strata2014/public/schedule/detail/33057

# Spark Fundamentals I (Lesson 1 only), Big Data University. <20 mins
  https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
Basic Spark

# Pick up some Scala through this article co-authored by Scala's creator,
  Martin Odersky. Estimated time: 2 hours
  http://www.artima.com/scalazine/articles/steps.html

Basic Spark (contd.)

# Do these two courses. They cover Spark basics and include a certification.
  You can use the supplied Docker images for all other labs. 7 hours
Basic Spark (contd.)

# Go to spark.apache.org and study the Overview and the Spark Programming
  Guide. Many online courses borrow liberally from this material. Information
  on this site is updated with every new Spark release.
  Estimated time: 7-8 hours
Intermediate Spark

# Stay at spark.apache.org. Go through the component-specific Programming
  Guides as well as the sections on Deploying and More. Browse the Spark API
  as needed.
  Estimated time: 3-5 days and more
Intermediate Spark (contd.)

Learn about the operational aspects of Spark:

# Advanced Apache Spark (DevOps), 6 hours. EXCELLENT!
  Video: https://www.youtube.com/watch?v=7ooZ4S7Ay6Y

# Tuning and Debugging Spark, 48 mins
  Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ

Gain a high-level understanding of Spark architecture:

# Introduction to AmpLab Spark Internals, Matei Zaharia, 1 hr 15 mins
  Video: https://www.youtube.com/watch?v=49Hr5xZyTEA

# A Deeper Understanding of Spark Internals, Aaron Davidson, 44 mins
  Video: https://www.youtube.com/watch?v=dmL0N3qfSc8
  PDF: https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
Intermediate Spark (contd.)

Experiment, experiment, experiment ...

# Set up your personal 3-4 node cluster

# Download some open data, e.g. the airline data on
  stat-computing.org/dataexpo/2009/

# Write some code, make it run, see how it performs, tune it, troubleshoot it

# Experiment with different deployment modes (Standalone + YARN)

# Play with different configuration knobs, check out dashboards, etc.

# Explore all subcomponents (especially Core, SQL, MLLib)
Advanced Spark: Original Papers

Read the original academic papers:

# Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory
  Cluster Computing, Matei Zaharia, et al.

# Discretized Streams: An Efficient and Fault-Tolerant Model for Stream
  Processing on Large Clusters, Matei Zaharia, et al.

# GraphX: A Resilient Distributed Graph System on Spark, Reynold S. Xin, et al.

# Spark SQL: Relational Data Processing in Spark, Michael Armbrust, et al.
Advanced Spark: Enhance your Scala skills

# Use this book by Odersky as your primary Scala text. It is excellent, but it
  isn't meant to give you a quick start. It's deep stuff.

# Excellent MOOC by Odersky. Some of the material is meant for CS majors.
  Highly recommended for STC developers. 35+ hours
Advanced Spark: Browse Conference Proceedings

Spark Summits cover technology and use cases. Technology is also covered in
various other places, so you could consider skipping those tracks. Don't
forget to check out the customer stories. That is how we learn about
enablement opportunities and challenges, and in some cases, we can see through
the Spark hype.

100+ hours of FREE videos and associated PDFs are available on
spark-summit.org. You don't even have to pay the conference fee! Go back in
time and attend these conferences!
Advanced Spark: Browse YouTube Videos

YouTube is full of training videos, some good, others not so much. These are
the only channels you need to watch, though. There is a lot of repetition in
the material, and some of the videos are from the conferences mentioned
earlier.
Advanced Spark: Check out these books

# Provides a good overview of Spark; much of the material is also available
  through the other sources previously mentioned.

# Covers concrete statistical analysis / machine learning use cases, along
  with the Spark APIs and MLLib. Highly recommended for data scientists.
Advanced Spark: Yes ... read the code

Even if you don't intend to contribute to Spark, there are a ton of valuable
comments in the code that provide insights into Spark's design, and these will
help you write better Spark applications. Don't be shy! Go to
github.com/apache/spark and check it out.
