Who am I?
Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks
What is Databricks?
Founded by the creators of Spark in 2013
Largest organization contributing to Spark
End-to-end hosted service, Databricks Cloud
What is Spark Streaming?
Spark Streaming
Scalable, fault-tolerant stream processing system
High-level API: joins, windows, and more, often in 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: reads from Kafka, Flume, Kinesis, Twitter, and file systems (HDFS/S3); writes to databases and dashboards
[Diagram: receivers divide live data streams into batches; Spark Streaming processes the batches to produce batches of results]
Word Count

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    // create DStream from data over a socket (feed it with `nc -lk 9999`)
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()            // start the streaming computation
    ssc.awaitTermination()
  }
}
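The flatMap / map / reduceByKey pipeline above is applied to every micro-batch. Its shape can be mimicked on plain Scala collections; a toy sketch (not the DStream API, and with made-up input):

```scala
// Toy sketch: the per-batch word-count transformation, applied here to a
// single in-memory "batch" of lines instead of a DStream micro-batch.
object BatchWordCount {
  def countWords(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))                             // lines -> words
      .map(w => (w, 1))                                  // word -> (word, 1)
      .groupBy(_._1)                                     // group pairs by word
      .map { case (w, ones) => (w, ones.map(_._2).sum) } // sum the 1s per word

  def main(args: Array[String]): Unit =
    println(countWords(Seq("to be or", "not to be")))
}
```

Each one-second batch of socket lines goes through this same shape of computation; Spark distributes the work across the cluster.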
Word Count: Spark Streaming vs Storm

Spark Streaming: the complete NetworkWordCount program shown above.

Storm: the equivalent program is much longer; excerpts of its boilerplate alone:

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
  declarer.declare(new Fields("word"));
}

@Override
public Map<String, Object> getComponentConfiguration() {
  return null;
}

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
  declarer.declare(new Fields("word", "count"));
}

cluster.shutdown();
Languages
Can natively use Scala, Java, and Python
[Spark stack: Spark SQL, Spark Streaming, MLlib, and GraphX, all built on Spark Core]
History
Late 2011: research idea at AMPLab, UC Berkeley
"We need to make Spark faster." "Okay... umm, how??!?!"
Q2 2012: prototype
Q3 2012
Current state of Spark Streaming
Development
Adoption
Roadmap
Python API
Core functionality in Spark 1.2,
with sockets and files as sources
Kafka support coming in Spark 1.3
Streaming machine learning: continuously train a model on one labeled stream and score another (a sketch using MLlib's StreamingLinearRegressionWithSGD; numFeatures is the feature-vector size):

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
model.trainOn(trainingDStream)
model.predictOnValues(testDStream.map { lp => (lp.label, lp.features) }).print()
System Infrastructure
Automated driver fault-tolerance [Spark 1.0]
Graceful shutdown [Spark 1.0]
Write Ahead Logs for zero data loss [Spark 1.2]
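A sketch of how these features surface in code, assuming the Spark 1.2-era API (the property name `spark.streaming.receiver.writeAheadLog.enable` and the graceful `stop` overload; the checkpoint path is a placeholder):

```scala
// Sketch (Spark 1.2-era): enable the receiver write-ahead log for zero data
// loss; received data is logged under the checkpoint directory.
val conf = new SparkConf()
  .setAppName("DurableApp")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs://...")   // must be reliable storage, e.g. HDFS/S3

// ... set up DStreams, then ssc.start() ...

// Graceful shutdown: stop receiving but finish processing buffered data.
ssc.stop(stopSparkContext = true, stopGracefully = true)
```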
Contributors to Streaming
[Bar chart: contributors per release for Spark 0.9, 1.0, 1.1, and 1.2; y-axis 0 to 40]
[Bar chart: contributors to Streaming alone vs Core + Streaming (w/o SQL, MLlib, ...) for Spark 0.9 through Spark 1.2; y-axis 0 to 90]
All contributions to core Spark directly improve Spark Streaming
Spark Packages
More contributions from the community on spark-packages:
Alternate Kafka receiver
Apache Camel receiver
Cassandra examples
http://spark-packages.org/
Who is using Spark Streaming?
[Survey: 9% in production, 31% prototyping, 21% not using]
80+ known deployments
Example deployment stacks:
Spark Streaming + Kafka + YARN + HBase
Spark Streaming + YARN + Cassandra + Apache Blur
Spark Streaming + Mesos+Marathon + MySQL + Redis + SQS
http://techblog.netflix.com/2015/02/whats-trending-on-netflix.html
http://goo.gl/mJNf8X
Takeaways
Spark Streaming is a scalable, fault-tolerant stream processing system with a high-level API and a rich set of libraries
More than 80 known deployments in industry
More libraries and greater operational ease on the roadmap
Backup slides
More use cases
Service Providers
Kafka → Spark Streaming
http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark
http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims