© EduPristine – www.edupristine.com
© EduPristine Spark Apache
Agenda
• Batch
– Ad-hoc queries on historical data
– Trend reporting and data analysis queries
• Interactive
– Queries on historical data
– Queries to get answers for a specific question
• Streaming
– Real time queries
– In-stream analytics-recommendations
– E.g., processing Facebook news feeds, tweets on Twitter, and log parsing
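The difference between batch and streaming can be sketched with the log parsing example. This is a minimal plain-Python illustration, not Spark code: the log lines, the "LEVEL message" format, and the function names are all assumptions made for the sketch.

```python
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical log lines in "LEVEL message" form, used only for illustration.
LOG_LINES = [
    "INFO user login",
    "ERROR disk full",
    "INFO user logout",
    "ERROR timeout",
    "WARN slow response",
]


def parse_level(line: str) -> str:
    """Return the first whitespace-separated token (the log level)."""
    return line.split(maxsplit=1)[0]


def batch_counts(lines: Iterable[str]) -> Counter:
    """Batch style: the whole historical data set is available up front."""
    return Counter(parse_level(line) for line in lines)


def streaming_counts(lines: Iterable[str]) -> Iterator[Counter]:
    """Streaming style: update running counts as each record arrives
    and emit a snapshot after every record."""
    running = Counter()
    for line in lines:
        running[parse_level(line)] += 1
        yield Counter(running)  # copy, so earlier snapshots stay unchanged


batch_result = batch_counts(LOG_LINES)
stream_snapshots = list(streaming_counts(LOG_LINES))
```

The batch function answers once, after reading everything. The streaming function has an answer available after every record, and its last snapshot matches the batch result.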
MapReduce
▪ Most machine learning algorithms are iterative, because each iteration can improve the results
▪ With a disk-based approach, each iteration's output is written to disk, which makes it slow
▪ MapReduce is slow due to replication, serialization, and disk I/O
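The cost of writing every iteration's output to disk can be sketched in plain Python. This is an illustration under assumptions, not MapReduce or Spark code: the iterative algorithm is a Newton square-root refinement chosen for brevity, and the file name and function names are hypothetical.

```python
import os
import pickle
import tempfile


def improve(estimate: float, target: float) -> float:
    """One Newton step toward sqrt(target); each step refines the estimate."""
    return 0.5 * (estimate + target / estimate)


def iterate_in_memory(target: float, steps: int) -> float:
    """Keep the intermediate result in memory between iterations."""
    estimate = target
    for _ in range(steps):
        estimate = improve(estimate, target)
    return estimate


def iterate_via_disk(target: float, steps: int, workdir: str) -> float:
    """Serialize the intermediate result to disk after every iteration
    and read it back before the next one."""
    path = os.path.join(workdir, "state.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(target, f)
    for _ in range(steps):
        with open(path, "rb") as f:
            estimate = pickle.load(f)        # deserialize + disk read
        estimate = improve(estimate, target)
        with open(path, "wb") as f:
            pickle.dump(estimate, f)         # serialize + disk write
    with open(path, "rb") as f:
        return pickle.load(f)


memory_result = iterate_in_memory(2.0, 6)
with tempfile.TemporaryDirectory() as workdir:
    disk_result = iterate_via_disk(2.0, 6, workdir)
```

Both versions compute the same value. The disk version adds a serialize, write, read, and deserialize round trip to every iteration, which is the overhead the bullets above describe.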
http://www.wiziq.com/blog/hype-around-apache-spark/
▪ Spark provides high-level APIs in Java, Scala, and Python, and an optimized engine that supports general execution graphs.
▪ Behind the scenes in Spark, the RDD performs the fault-tolerant, in-memory data processing.
▪ RDD stands for Resilient Distributed Dataset.
▪ It is a fault-tolerant abstraction for in-memory cluster computing.
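One way to picture a fault-tolerant in-memory dataset is a toy model in plain Python. The `ToyRDD` class below is hypothetical and is not Spark's API. It assumes the recovery idea described in the Spark documentation, where a lost partition is rebuilt by replaying its recorded transformations (its lineage) on the source data; the slides themselves do not spell out this mechanism.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus the lineage needed
    to rebuild any partition. Hypothetical class, not Spark's API."""

    def __init__(
        self,
        source: List[List[int]],
        transforms: Optional[List[Callable[[int], int]]] = None,
    ) -> None:
        self._source = source                  # durable input, one list per partition
        self._transforms = transforms or []    # lineage: ordered element-wise functions
        self._cache: Dict[int, List[int]] = {} # in-memory results per partition index

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        """Return a new dataset whose lineage has one more step."""
        return ToyRDD(self._source, self._transforms + [fn])

    def _compute(self, index: int) -> List[int]:
        """Rebuild one partition by replaying the lineage on its source."""
        data = self._source[index]
        for fn in self._transforms:
            data = [fn(x) for x in data]
        return data

    def partition(self, index: int) -> List[int]:
        """Serve a partition from memory, recomputing it if it is missing."""
        if index not in self._cache:
            self._cache[index] = self._compute(index)
        return self._cache[index]

    def lose_partition(self, index: int) -> None:
        """Simulate a node failure that drops one in-memory partition."""
        self._cache.pop(index, None)

    def collect(self) -> List[int]:
        """Gather all partitions, in order, into one list."""
        return [x for i in range(len(self._source)) for x in self.partition(i)]


rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
before = rdd.collect()
rdd.lose_partition(1)   # simulated failure
after = rdd.collect()   # partition 1 is recomputed from its lineage
```

In this sketch, `before` and `after` hold the same values even though partition 1 was dropped in between, because only the lost partition is recomputed.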
Support@edupristine.com
www.edupristine.com