
Apache Spark

The Next Gen Toolset for Big Data processing

© EduPristine – www.edupristine.com
© EduPristine Spark Apache
Agenda

▪ Big Data Analysis methodologies


▪ Drawbacks of MapReduce
▪ Overview of Spark
▪ RDD – A Restricted form of distributed shared memory
▪ Overview of a Spark application



Big Data

▪ Data too large for conventional systems to store and process
▪ Big Data has 3 dimensions:
• The 3Vs: Volume, Velocity and Variety
▪ Storage is a challenge because of the sheer size of the data
▪ Processing without failure is a challenge
▪ Processing may take hours, days, weeks or months
▪ A job may fail partway through because of the data size



Big Data Analysis methodologies

▪ Data is generally processed in one of three modes

• Batch
– Ad-hoc queries on historical data
– Trend reporting and data analysis queries

• Interactive
– Queries on historical data
– Queries that answer a specific question

• Streaming
– Real-time queries
– In-stream analytics and recommendations
– E.g. processing Facebook news feeds, tweets on Twitter, log parsing
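
The three modes above can be contrasted with a small plain-Python sketch. It does not use Spark; the event data and the function names (`batch_counts`, `interactive_query`, `streaming_counts`) are hypothetical and exist only to illustrate the difference in how results are produced.

```python
from collections import Counter

# Hypothetical event log: (user, action) pairs standing in for historical data.
EVENTS = [
    ("alice", "like"),
    ("bob", "share"),
    ("alice", "share"),
    ("carol", "like"),
    ("alice", "like"),
]


def batch_counts(events):
    """Batch: the full historical dataset is processed in one pass."""
    return Counter(action for _user, action in events)


def interactive_query(events, user):
    """Interactive: answer one specific question over the same historical data."""
    return [action for u, action in events if u == user]


def streaming_counts(event_stream):
    """Streaming: update a running result per event and yield a snapshot each time."""
    running = Counter()
    for _user, action in event_stream:
        running[action] += 1
        yield dict(running)


batch_result = batch_counts(EVENTS)
alice_actions = interactive_query(EVENTS, "alice")
snapshots = list(streaming_counts(iter(EVENTS)))
```

The last streaming snapshot matches the batch result; the difference is that the streaming version has an answer available after every event rather than only at the end.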



Drawbacks of MapReduce

▪ MapReduce writes intermediate results to disk.
▪ Writing to and reading from disk is much slower than in-memory processing.
▪ Iterative algorithms and data mining workloads run slowly on MapReduce because every iteration incurs disk I/O, serialization and replication.
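
To make the cost of disk-backed intermediate results concrete, here is a toy word count in plain Python. It is only a sketch of the idea, not Hadoop's implementation: the file name `map_output.json`, the function names and the sample lines are all assumptions made for illustration.

```python
import json
import os
import tempfile
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count_via_disk(lines, workdir):
    """Disk-based flow: map output is serialized to a file, then read back by reduce."""
    intermediate = os.path.join(workdir, "map_output.json")
    with open(intermediate, "w") as f:
        json.dump(map_phase(lines), f)   # serialization + disk write
    with open(intermediate) as f:
        pairs = json.load(f)             # disk read + deserialization
    return reduce_phase(pairs)


def word_count_in_memory(lines):
    """In-memory flow: map output is handed to reduce directly."""
    return reduce_phase(map_phase(lines))


LINES = ["spark is fast", "hadoop is reliable", "spark is general"]
with tempfile.TemporaryDirectory() as tmp:
    disk_result = word_count_via_disk(LINES, tmp)
memory_result = word_count_in_memory(LINES)
```

Both flows return the same counts. The disk-based one simply pays for an extra write, read and serialization round trip between the two phases, and an iterative job pays that price on every iteration.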

[Figure: an iterative MapReduce job. Each iteration (iter. 1, iter. 2, ...) reads its input from HDFS and writes its output back to HDFS; within a job, data flows Input → Map → Reduce → Output.]



Need for In-Memory Processing?

▪ Most machine learning algorithms are iterative: each iteration refines the result of the previous one
▪ With a disk-based approach, each iteration's output is written to disk, which slows the whole run
▪ MapReduce is slow because of replication, serialization and disk I/O
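
As a rough illustration of why iterative algorithms benefit from keeping data in memory, the sketch below fits the one-parameter model y = w * x with gradient descent in two ways: once with the dataset held in memory, and once with the dataset and the current state round-tripped through disk on every iteration. The data, learning rate, iteration count and function names are assumptions chosen for the example; it is plain Python, not Spark or Hadoop code.

```python
import os
import pickle
import tempfile

# Hypothetical training data following y = 2x exactly.
DATA = [(float(x), 2.0 * x) for x in range(1, 6)]
LEARNING_RATE = 0.01
ITERATIONS = 200


def gradient_step(w, points):
    """One gradient-descent step for the model y = w * x under squared error."""
    grad = sum(2.0 * (w * x - y) * x for x, y in points) / len(points)
    return w - LEARNING_RATE * grad


def fit_in_memory(points):
    """The dataset is loaded once and reused by every iteration."""
    w = 0.0
    for _ in range(ITERATIONS):
        w = gradient_step(w, points)
    return w


def fit_via_disk(path):
    """Every iteration re-reads the dataset and round-trips its state through disk."""
    state_path = path + ".state"
    with open(state_path, "wb") as f:
        pickle.dump(0.0, f)
    for _ in range(ITERATIONS):
        with open(path, "rb") as f:
            points = pickle.load(f)
        with open(state_path, "rb") as f:
            w = pickle.load(f)
        with open(state_path, "wb") as f:
            pickle.dump(gradient_step(w, points), f)
    with open(state_path, "rb") as f:
        return pickle.load(f)


w_memory = fit_in_memory(DATA)
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.pkl")
    with open(data_path, "wb") as f:
        pickle.dump(DATA, f)
    w_disk = fit_via_disk(data_path)
```

Both versions converge to the same weight. The disk-based version performs three file operations per iteration, which is the overhead the bullets above describe.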



Spark – The Ultimate Solution

▪ Hadoop execution flow

▪ Spark execution flow

http://www.wiziq.com/blog/hype-around-apache-spark/



Overview of Spark

▪ Apache Spark is a fast and general-purpose cluster computing system.

▪ It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports
general execution graphs.

▪ It also supports a rich set of higher-level tools, including Spark SQL, MLlib, GraphX and Spark Streaming.



About Apache Spark

▪ “A big data analytics cluster-computing framework written in Scala.”


▪ Open source; originally developed at the AMPLab at UC Berkeley
▪ Designed to work with data in memory
▪ Designed for running iterative algorithms and interactive analytics
▪ Highly compatible with Hadoop's storage APIs
▪ Can run on an existing Hadoop cluster
▪ Developers can write driver programs in multiple programming languages (Java, Scala, Python)
▪ Spark can be used programmatically as well as interactively



More on Spark

▪ Provides in-memory analytics that can be much faster than Hadoop MapReduce or Hive (up to 100x for some in-memory workloads)
▪ Spark offers scalability and fault tolerance similar to Hadoop MapReduce
▪ Spark is a generalization of the MapReduce model, built on the RDD (Resilient Distributed Dataset)
▪ Spark uses lineage (the recorded chain of transformations) to reconstruct lost data
▪ Spark is compatible with Hadoop
▪ Spark supports batch, streaming and interactive operations
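
The idea of rebuilding lost data from lineage can be sketched with a toy, single-process class. This is not Spark's implementation or API: `ToyRDD` and its methods are hypothetical names, and the "failure" is simulated by dropping a cached partition.

```python
class ToyRDD:
    """A toy stand-in for an RDD that remembers how each partition was derived."""

    def __init__(self, num_partitions, compute, parent=None, description="source"):
        self.num_partitions = num_partitions
        self._compute = compute      # function: partition index -> list of elements
        self.parent = parent
        self.description = description
        self._cache = {}             # partition index -> materialized data

    @classmethod
    def from_list(cls, data, num_partitions):
        # Distribute elements round-robin across partitions.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn, description="map"):
        # A transformation records how to derive each partition from the parent.
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in self.partition(i)],
            parent=self,
            description=description,
        )

    def partition(self, i):
        # Recompute from lineage whenever the in-memory copy is missing.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that drops one in-memory partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lineage(self):
        node, steps = self, []
        while node is not None:
            steps.append(node.description)
            node = node.parent
        return list(reversed(steps))


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=2)
squares = numbers.map(lambda x: x * x, description="square")
before = squares.collect()
squares.lose_partition(1)   # only this partition is rebuilt on the next access
after = squares.collect()
```

Because each dataset knows its parent and the function that produced it, a lost partition can be recomputed on demand rather than restored from a replicated copy.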



Spark Applications

▪ Spark applications are similar to MapReduce jobs.
▪ Each application is a self-contained computation that runs user-supplied code to compute a result.
▪ Like MapReduce jobs, Spark applications can use the resources available in the cluster.
▪ Spark revolves around the concept of a resilient distributed dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel.
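
A collection that is "operated on in parallel" can be pictured as a set of partitions, each processed by its own task, with a driver combining the partial results. The sketch below models that with Python's standard `concurrent.futures`; it is an illustration under stated assumptions, not Spark code. The helper names are hypothetical, the input is assumed to be non-empty, and threads stand in for cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(elements, num_partitions):
    """Divide a non-empty list into roughly equal contiguous partitions."""
    size = -(-len(elements) // num_partitions)   # ceiling division
    return [elements[i:i + size] for i in range(0, len(elements), size)]


def parallel_map_reduce(elements, map_fn, reduce_fn, num_partitions=4):
    """Apply map_fn to every element and combine results, one task per partition."""
    partitions = split_into_partitions(elements, num_partitions)

    def process(partition):
        # Each worker reduces its own partition independently of the others.
        return reduce(reduce_fn, (map_fn(x) for x in partition))

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(process, partitions))
    # The "driver" combines the per-partition partial results.
    return reduce(reduce_fn, partials)


total_of_squares = parallel_map_reduce(
    list(range(1, 11)), lambda x: x * x, lambda a, b: a + b
)
```

The user supplies only the per-element function and the combining function; how the data is split and scheduled stays inside the framework, which is the division of labour the bullets above describe.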



What makes Spark powerful?

▪ Behind the scenes, the RDD is what provides Spark's fault-tolerant, in-memory data processing.
▪ RDD stands for Resilient Distributed Dataset.
▪ It is a fault-tolerant abstraction for in-memory cluster computing.



Thank You!

Support@edupristine.com
www.edupristine.com

