Apache Spark:
What's in It for Your Business?
Adarsh Pannu
Senior Technical Staff Member
IBM Analytics Platform
Spark Overview
Spark Stack
Spark in Action
Spark vs MapReduce
Spark @ IBM
Started in 2009 at UC Berkeley's AMPLab
6 years
7+ components
129+ third-party packages
500+ contributors
400,000+ lines of code
1 platform
Apache Spark
1. A general-purpose data processing (compute) engine
Why Spark?
Expressiveness Speed
Spark Stack
DataFrames ML Pipelines
Spark Core
Spark in the real world
Spark in Action
Sample record:
2004,2,5,4,1521,1530,1837,1838,CO,65,N67158,376,368,326,-1,-9,EWR,LAX,2454,...
Fields called out: Year, Month, DayofMonth, DepTime, UniqueCarrier, FlightNum, ActualElapsedTime, Origin, Dest, Distance
Columns:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
Spark in Action (contd.)
sc.textFile("hdfs://localhost:9000/user/Adarsh/routes.csv").
2004,2,5...EWR...1 2004,2,5...HNL...0 2004,2,5...ORD...1 2004,2,5...ORD...1
(EWR, 1) (ORD, 2)
sortBy(_._2, ascending=false).
(ORD, 2) (EWR, 1)
collect
Not bad, eh? Do try this at home ... just not on MapReduce!
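The pipeline on the slide lost its middle steps in extraction, but the intermediate values shown (raw records, then (EWR, 1) (ORD, 2), then the sorted pairs) suggest a count of cancelled flights per origin airport. The following is a minimal sketch under that assumption, not the author's exact code. The column positions are derived from the header listed earlier; `record` and `topCancellations` are hypothetical names introduced for illustration, and `sc.parallelize` stands in for `sc.textFile(...)` so the sketch runs without HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CancellationsByOrigin {
  // 0-based positions of Origin and Cancelled in the 29-column header.
  val OriginIdx = 16
  val CancelledIdx = 21

  // Hypothetical helper: builds a 29-field CSV record for the demo.
  def record(origin: String, cancelled: String): String =
    Vector.fill(29)("0")
      .updated(OriginIdx, origin)
      .updated(CancelledIdx, cancelled)
      .mkString(",")

  // Assumed reconstruction: keep cancelled flights, count them per origin,
  // then sort by count in descending order.
  def topCancellations(sc: SparkContext, lines: Seq[String]): Array[(String, Int)] =
    sc.parallelize(lines) // stands in for sc.textFile(...)
      .map(_.split(","))
      .filter(f => f.length > CancelledIdx && f(CancelledIdx) == "1")
      .map(f => (f(OriginIdx), 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cancellations").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sample = Seq(
        record("EWR", "1"), record("HNL", "0"),
        record("ORD", "1"), record("ORD", "1"))
      topCancellations(sc, sample).foreach(println)
    } finally sc.stop()
  }
}
```

On a real file, a header row would also be dropped by the filter, because its Cancelled field is the literal column name rather than "1".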
Spark builds Data Pipelines (DAGs)
Intermediate nodes represent computations
Directed (arrows)
Acyclic (no loops)
Graph
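One way to see the DAG is to ask an RDD for its lineage. The sketch below is illustrative only: it chains two transformations, prints the lineage with `toDebugString`, and only then runs an action. The object and method names are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageDemo {
  // Transformations only describe the pipeline; nothing is computed yet.
  def build(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq("EWR", "ORD", "ORD"))
      .map(origin => (origin, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lineage").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val counts = build(sc)
      // Prints the chain of RDDs (the DAG) that would produce `counts`.
      println(counts.toDebugString)
      // The action below is what actually triggers the computation.
      println(counts.collect().toMap)
    } finally sc.stop()
  }
}
```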
Resilient Distributed Datasets (RDD)
Spark vs MapReduce
[Figure: data flow with stages labeled HDFS, Local Disk, HDFS, flights data, Skipped]
But you have to tell Spark which RDDs to persist (and how)
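Telling Spark which RDDs to persist, and how, might look like the sketch below. The data is a stand-in range of numbers rather than the flights file, and `MEMORY_AND_DISK` is just one of several storage levels; the right choice depends on the workload.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  def run(sc: SparkContext): (Long, Long) = {
    val flights = sc.parallelize(1 to 1000).map(_ * 2)

    // "Which": this RDD. "How": in memory, spilling to disk if needed.
    flights.persist(StorageLevel.MEMORY_AND_DISK)

    val total = flights.count()                // first action computes and stores
    val large = flights.filter(_ > 1000).count() // later actions can reuse the stored data

    flights.unpersist()
    (total, large)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Without the `persist` call, each action would recompute `flights` from its lineage.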
Common misconceptions around Spark
Sorting 100 TB of data
OK! Let's get back to the Spark application.
Spark SQL Example
RDD <-> DataFrame interoperability
// Specify table schema
val schema = StructType(...)
// Use SQL!
val results = sqlContext.sql("""SELECT Origin, count(*) Cnt FROM flights
                                where Cancelled = "1"
                                group by Origin order by Cnt desc""")
Spark SQL and DataFrames
Spark SQL
- SQL engine with an optimizer
- Not ANSI compliant
- Operates on DataFrames (RDDs with schema)
- Primary goal: Write (even) less code
- Secondary goal: Provide a way around the performance penalty when using non-JVM languages (e.g. Python or R)
DataFrame API
- Programmatic API to build SQL-ish constructs
- Shares machinery with the SQL engine
These allow you to create an RDD DAG using a higher-level language / API
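To make the "shares machinery" point concrete, here is a minimal sketch that expresses the cancellations query once in SQL and once through the DataFrame API. It assumes Spark 2.0 or later, where `SparkSession` replaces the `sqlContext` used on the slides; the two-column sample data and the names `viaSql` and `viaApi` are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

object CancellationsDataFrame {
  // SQL route: register the DataFrame as a view and query it.
  def viaSql(spark: SparkSession, flights: DataFrame): DataFrame = {
    flights.createOrReplaceTempView("flights")
    spark.sql(
      """SELECT Origin, count(*) AS Cnt FROM flights
        |WHERE Cancelled = '1'
        |GROUP BY Origin ORDER BY Cnt DESC""".stripMargin)
  }

  // DataFrame API route: the same query built programmatically.
  def viaApi(flights: DataFrame): DataFrame =
    flights
      .filter(col("Cancelled") === "1")
      .groupBy("Origin")
      .count()
      .withColumnRenamed("count", "Cnt")
      .orderBy(desc("Cnt"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cancellations-df").master("local[*]").getOrCreate()
    import spark.implicits._
    try {
      // Made-up sample rows: (Origin, Cancelled)
      val flights = Seq(("EWR", "1"), ("HNL", "0"), ("ORD", "1"), ("ORD", "1"))
        .toDF("Origin", "Cancelled")
      viaSql(spark, flights).show()
      viaApi(flights).show()
    } finally spark.stop()
  }
}
```

Both routes go through the same optimizer, so the choice between them is largely a matter of taste and tooling.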
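Spark MLlib Example
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo.Regression
import org.apache.spark.mllib.tree.impurity.Variance

// toDouble is a helper (not shown on the slide) that converts a column value to Double
val trainRDD = sqlContext.sql("""SELECT ArrDelay, DepDelay, ArrTime,
    DepTime, Distance from flights""").rdd.map {
  case Row(label: String, features @ _*) => // Build training set
    LabeledPoint(toDouble(label),
      Vectors.dense(features.map(toDouble(_)).toArray))
}
// Train a decision tree (regression) model to predict flight delays
val model = DecisionTree.train(trainRDD, Regression, Variance, maxDepth = 5)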
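The slide calls `toDouble` without defining it and stops at training. The sketch below fills both gaps under stated assumptions: `toDouble` here is a hypothetical helper that maps unparseable values to 0.0 (a real pipeline would more likely drop or impute such rows), the rows are made-up (ArrDelay, DepDelay, Distance) values, and training uses MLlib's `DecisionTree.trainRegressor` variant rather than the `train` call on the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel

object DelayModelSketch {
  // Hypothetical helper: converts a column value to Double, 0.0 if unparseable.
  def toDouble(value: Any): Double = value match {
    case s: String => scala.util.Try(s.trim.toDouble).getOrElse(0.0)
    case n: Number => n.doubleValue()
    case _         => 0.0
  }

  // First field is the label (ArrDelay); the rest are features.
  def toPoint(fields: Seq[Any]): LabeledPoint =
    LabeledPoint(toDouble(fields.head),
      Vectors.dense(fields.tail.map(toDouble).toArray))

  def train(sc: SparkContext, rows: Seq[Seq[Any]]): DecisionTreeModel = {
    val points = sc.parallelize(rows.map(toPoint))
    // No categorical features, variance impurity, maxDepth = 5, maxBins = 32.
    DecisionTree.trainRegressor(points, Map[Int, Int](), "variance", 5, 32)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("delay-model").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up rows: (ArrDelay, DepDelay, Distance)
      val rows: Seq[Seq[Any]] = Seq(
        Seq("0", "0", "100"), Seq("2", "1", "100"),
        Seq("30", "28", "100"), Seq("32", "31", "100"))
      val model = train(sc, rows)
      // Predicted arrival delay for a flight that departed 31 minutes late.
      println(model.predict(Vectors.dense(31.0, 100.0)))
    } finally sc.stop()
  }
}
```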
Spark in Action (contd.)
[Figure: graph of airports ITH, ORD, SJC, EWR, OGG, DFW, LAX]
Spark GraphX
Tight RDD <-> Graph API interoperability
GraphX is an API to
- Represent property graphs Graph[Vertex, Edge]
- Manipulate graphs to yield subgraphs or other data
- View data as both graphs and collections
- Write custom iterative graph algorithms using the Pregel API
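The first three capabilities above can be sketched with a tiny route graph: airports as vertices, routes as edges carrying a distance. Apart from the EWR to LAX distance of 2454 taken from the sample record, the distances are illustrative values, and the object and method names are invented for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object RouteGraphSketch {
  // Property graph: vertex attribute = airport code, edge attribute = miles.
  def build(sc: SparkContext): Graph[String, Int] = {
    val airports = sc.parallelize(Seq[(VertexId, String)](
      (1L, "EWR"), (2L, "ORD"), (3L, "LAX"), (4L, "SJC")))
    val routes = sc.parallelize(Seq(
      Edge(1L, 2L, 719), Edge(1L, 3L, 2454),
      Edge(2L, 3L, 1744), Edge(4L, 3L, 308)))
    Graph(airports, routes)
  }

  // Graph view -> collection view: inbound route count per airport code.
  def inboundCounts(g: Graph[String, Int]): Map[String, Int] =
    g.vertices
      .innerJoin(g.inDegrees)((_, code, degree) => (code, degree))
      .map(_._2)
      .collect()
      .toMap

  // Subgraph containing only routes of at least `minMiles`.
  def longHaul(g: Graph[String, Int], minMiles: Int): Graph[String, Int] =
    g.subgraph(epred = triplet => triplet.attr >= minMiles)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("routes").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val g = build(sc)
      println(inboundCounts(g))
      println(longHaul(g, 1000).edges.count())
    } finally sc.stop()
  }
}
```

Because `vertices` and `edges` are themselves RDDs, the usual RDD operations apply to them directly, which is the interoperability the slide refers to.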
Spark in Action (contd.)
At any given moment, there are hundreds of flights in the sky, all
generating streaming data. Sites such as flightaware.com make
this data available to consumers.
Spark Streaming
Scalable and fault-tolerant
Micro-batching model
- Input data split into batches based on time intervals
- Batches presented as RDDs
- RDD, DataFrame/SQL and GraphX APIs available to streams
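The micro-batching model can be sketched without an external source by feeding a queue of pre-built RDDs into a `StreamingContext`. This is a test-style illustration under assumptions, not a production pattern: the one-second batch interval, the queue-based input and the name `countOrigins` are all choices made for this example, and a real job would read from a source such as Kafka.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object MicroBatchSketch {
  def countOrigins(sc: SparkContext, batches: Seq[Seq[String]],
                   timeoutMs: Long): Map[String, Int] = {
    // Each one-second interval becomes one batch.
    val ssc = new StreamingContext(sc, Seconds(1))
    val queue = mutable.Queue[RDD[String]]()
    batches.foreach(b => queue += sc.parallelize(b))
    val seen = mutable.ArrayBuffer.empty[(String, Int)]

    ssc.queueStream(queue)          // one queued RDD is consumed per batch
      .map(origin => (origin, 1))
      .reduceByKey(_ + _)           // ordinary RDD-style operators on the stream
      .foreachRDD { rdd =>          // each batch arrives as an RDD
        val batchCounts = rdd.collect()
        seen.synchronized { seen ++= batchCounts }
      }

    ssc.start()
    ssc.awaitTerminationOrTimeout(timeoutMs)
    ssc.stop(stopSparkContext = false, stopGracefully = false)

    // Combine the per-batch counts into overall totals.
    seen.synchronized {
      seen.groupBy(_._1).map { case (origin, hits) => (origin, hits.map(_._2).sum) }
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(countOrigins(sc, Seq(Seq("EWR", "ORD"), Seq("ORD")), 5000L))
    finally sc.stop()
  }
}
```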
[Figure: sources including Kafka, Flume, Kinesis, ...]
Interoperability within the Spark Stack: A sample scenario
[Figure: Data at rest, Core Model, Data in motion]
Where does Spark fit in a Hadoop World?
Storage (HDFS)
Cluster Management (YARN)
Spark @ IBM
Spark Technology Center @ SF
Shipping with BigInsights
Spark as a Service
Enhance it! Distribute it! Exploit it!
IBM ANALYTICS PLATFORM
Built on Spark. Hybrid. Trusted.
Business priorities
Data sources: Data at Rest & In-motion, Inside & Outside the Firewall, Structured & Unstructured
Data-centric priorities: Logical data warehouse, Data reservoir, Fluid data layer
Want to learn more about Spark?
Notices and Disclaimers
Copyright 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form
without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for
accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to
update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO
EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO,
LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted
according to the terms and conditions of the agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as
illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other
results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or
services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the
views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or
other guidance or advice to any individual participant or their specific situation.
It is the customer's responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the
identification and interpretation of any relevant laws and regulatory requirements that may affect the customer's business and any actions the
customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will
ensure that the customer is in compliance with any law.
Notices and Disclaimers (cont)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM's products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Aspera, Bluemix, Blueworks Live, CICS, Clearcase, Cognos, DOORS, Emptoris, Enterprise Document
Management System, FASP, FileNet, Global Business Services, Global Technology Services, IBM ExperienceOne, IBM
SmartCloud, IBM Social Business, Information on Demand, ILOG, Maximo, MQIntegrator, MQSeries, Netcool, OMEGAMON,
OpenPower, PureAnalytics, PureApplication, pureCluster, PureCoverage, PureData, PureExperience, PureFlex, pureQuery,
pureScale, PureSystems, QRadar, Rational, Rhapsody, Smarter Commerce, SoDA, SPSS, Sterling Commerce, StoredIQ,
Tealeaf, Tivoli, Trusteer, Unica, urban{code}, Watson, WebSphere, Worklight, X-Force and System z Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.
Thank You
Reference slides
How deep do you want to go?
- Turning Data into Value, Ion Stoica, Spark Summit 2013. Video & Slides, 25 mins
  https://spark-summit.org/2013/talk/turning-data-into-value
- How Companies are Using Spark, and Where the Edge in Big Data Will Be, Matei Zaharia. Video & Slides, 12 mins
  http://conferences.oreilly.com/strata/strata2014/public/schedule/detail/33057
http://www.artima.com/scalazine/articles/steps.html
7 hours
Basic Spark (contd.)
- Write some code, make it run, see how it performs, tune it, troubleshoot it
Spark Summits cover technology and use cases. Technology is also covered in various other places, so you could consider skipping those tracks. Don't forget to check out the customer stories. That is how we learn about enablement opportunities and challenges and, in some cases, see through the Spark hype.