
Exploratory Data Analysis

in Spark with Jupyter


https://github.com/phatak-dev/Statistical-Data-Exploration-Using-Spark-2.0
● Madhukara Phatak
● Team Lead at Tellius
● Works in Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
Agenda
● Introduction to EDA
● EDA on Big Data
● EDA with Notebooks
● Five Point Summary
● PySpark and EDA
● Histograms
● Outlier detection
● Correlation
Introduction to EDA
What’s EDA
● Exploratory data analysis (EDA) is an approach to
analyzing data sets to summarize their main
characteristics, often with visual methods
● Uses statistical methods to analyse different aspects of
the data
● Puts a lot of importance on visualisation
● Some of the EDA techniques are
○ Histograms
○ Correlations etc
Why EDA?
● EDA helps data scientists understand the distribution of the data before it is fed to downstream algorithms
● EDA also helps in understanding the correlation between the different variables collected as part of the data collection
● Visualising the data also helps us see patterns in the data which can inform later parts of the analysis
● The interactivity of EDA helps in exploring different assumptions
EDA in Hadoop ERA
EDA in Hadoop ERA
● EDA is typically an interactive and highly experimental process
● The first generation Hadoop systems were mostly built for batch processing and didn't offer many tools for interactivity
● So data scientists typically took a sample of the data and ran EDA using traditional tools like R / Python etc
Limitation of Sample EDA
● Running EDA on a sample requires sampling techniques that produce a sample representing the distribution of the full data
● That's hard to achieve for multi-dimensional data, which is what most real world data is
● Samples sometimes create issues for skewed distributions
Ex : Payment type in NYC taxi data
● So though sampling works in most cases, it's not the most accurate approach
EDA in Spark ERA
Interactive Analysis in Spark
● Spark was built for interactive data analysis from day one
● Below are some of the features that make it good for interactive analysis
○ Interactive spark-shell
○ Local mode for low latency
○ Caching for speed up
○ Dataframe abstraction to support structured data analysis
○ Support for Python
EDA on Notebooks
● The Spark shell is good for one-liners
● It's not that great an interface for writing long interactive queries
● It also doesn't support good visualisation options, which are important for EDA
● So notebook systems are an alternative to the Spark shell which keep the interactivity of the shell and add other advanced features
● So a notebook interface is good for EDA
Jupyter Notebook
Introduction to Notebook System
● A notebook is an interactive web interface primarily used for exploratory programming
● Notebooks are spiritual successors to the interactive shells found in languages like Python, Scala etc
● Notebook systems typically support multiple language backends using kernels or interpreters
● An interpreter is the language runtime which is responsible for the actual interpretation of the code
● Ex : IPython, Zeppelin, Jupyter
Introduction to Jupyter
● Jupyter is one of the notebook systems which evolved from the IPython shell and notebook system
● Primarily built for Python based analysis
● Now supports multiple languages like Python, R, Scala etc
● Also has good support for big data frameworks like Spark and Flink
● http://jupyter.org/
Five Point Summary
Five Point Summary
● The five number summary is one of the basic data exploration techniques, where we find how the values of a dataset column are distributed.
● It calculates the below values for a column, as illustrated in the sketch after this list
○ Min - Minimum value of the column
○ First Quartile - the value below which 25% of the data falls
○ Median - the middle value
○ Third Quartile - the value below which 75% of the data falls
○ Max - Maximum value
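As a quick, hedged illustration (not part of the talk's material), the five numbers for a small in-memory sample can be computed directly with NumPy percentiles; the values below are made up:

```python
import numpy as np

# A made-up sample of values, just to illustrate the five numbers
values = [48.4, 52.0, 59.6, 64.1, 66.8, 70.2, 72.5, 75.3, 79.1, 83.0]

# Min, Q1, median, Q3 and max are the 0th, 25th, 50th, 75th and 100th percentiles
mn, q1, median, q3, mx = np.percentile(values, [0, 25, 50, 75, 100])
print(mn, q1, median, q3, mx)
```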
Five Point Summary in Spark
● In Spark, we can use the describe method on a dataframe to get this summary for a given column
● In our example, we'll be using life expectancy data and generating the five point summary
● Ex : SummaryExample.scala
● From the results we can observe that
○ The quartiles and the median are missing
○ Spark gives stddev, which is not part of the original definition
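The talk's example is SummaryExample.scala; the snippet below is a rough PySpark equivalent, where the file path and the life_expectancy column name are assumptions made for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("summary-example").getOrCreate()

# Load the life expectancy data (path and column name are hypothetical)
df = spark.read.csv("life_expectancy.csv", header=True, inferSchema=True)

# describe() computes count, mean, stddev, min and max for the column
df.describe("life_expectancy").show()
```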
Approximate Quantiles
● Quantiles are costly to calculate on large data, as they require sorting and can result in skewed computation
● So by default Spark skips them in the describe method
● Spark 2.1 introduced a new method, approxQuantile, on the stat functions of dataframe
● This allows us to calculate these quantiles in reasonable time, with a configurable accuracy threshold
● Ex : SummaryExample.scala
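Again as a hedged PySpark sketch of what the Scala example demonstrates (the relative error value and the column name are assumptions):

```python
# Q1, median and Q3 with a relative error of 0.01;
# a larger error is cheaper to compute, 0.0 gives exact quantiles
q1, median, q3 = df.stat.approxQuantile("life_expectancy", [0.25, 0.5, 0.75], 0.01)
print(q1, median, q3)
```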
Visualizing Five Point Summary
● In the earlier examples, we calculated the five point summary
● By just looking at the numbers, it's difficult to understand how the data is distributed
● It's always good to visualize the numbers to understand the distribution
● A box plot is a good way to visualize these numbers
● But how do we visualize it in Scala?
Scala and Visualisation Libraries
● Scala is often the language of choice to develop Spark applications
● Scala gives rich language primitives to build robust, scalable systems
● But when it comes to EDA, the ecosystem support for visualization and other tooling is not great in Scala
● Even though there are efforts like plot.ly or Vegas, they are not as mature as pyplot or similar libraries
● So Scala may not be a great language of choice for EDA
EDA and PySpark
PySpark
● PySpark is a Python interface for the Spark APIs
● With the Dataframe and Dataset APIs, performance is on par with the Scala equivalent
● One of the advantages of PySpark over Scala is its seamless ability to convert between Spark & pandas dataframes
● Converting to pandas helps to use the myriad of Python ecosystem tools for visualization
● But what about the memory limitations of pandas?
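A minimal sketch of that conversion, assuming an existing SparkSession named spark and made-up data:

```python
import pandas as pd

# pandas -> Spark: distribute a local pandas dataframe
pdf = pd.DataFrame({"country": ["A", "B"], "life_expectancy": [71.2, 65.4]})
sdf = spark.createDataFrame(pdf)

# Spark -> pandas: collects the whole dataframe to the driver,
# so it should only be done on small data or small results
back_to_pandas = sdf.toPandas()
```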
EDA with PySpark
● If we directly use a pandas dataframe for EDA, we will be limited by the data size
● So the trick is to calculate all the values using the Spark APIs and then convert only the results to pandas
● Then use visualization libraries like pyplot, seaborn etc to visualize the results on Jupyter
● This combination of PySpark and Python libraries enables us to do interactive and high quality EDA on Spark
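A hedged sketch of that pattern: aggregate on the cluster, convert only the small result, then plot. The year column and the aggregation are assumptions about the dataset:

```python
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

# Heavy lifting stays in Spark; only the aggregated result goes to pandas
per_year = (df.groupBy("year")
              .agg(F.avg("life_expectancy").alias("avg_life_expectancy"))
              .orderBy("year"))
result_pdf = per_year.toPandas()

# The small pandas result is plotted with matplotlib inside Jupyter
result_pdf.plot(x="year", y="avg_life_expectancy", kind="line")
plt.show()
```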
PySpark Boxplot
● In our example, we will first calculate the five point summary using PySpark code
● Then convert the result to a pandas dataframe to extract the values
● Render the box plot with the matplotlib.pyplot library
● One of the challenges is that we need to draw using precomputed results rather than the actual data itself
● That needs an understanding of the lower level API
● Ex : EDA on Life Expectancy Data
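A hedged sketch of that lower level call: matplotlib's Axes.bxp draws a box plot from precomputed statistics instead of raw data. The column name and the choice of min/max as whisker ends are assumptions:

```python
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

# Five point summary computed by Spark, not by matplotlib
q1, median, q3 = df.stat.approxQuantile("life_expectancy", [0.25, 0.5, 0.75], 0.01)
row = df.agg(F.min("life_expectancy"), F.max("life_expectancy")).collect()[0]
mn, mx = row[0], row[1]

# bxp() takes a list of dicts of precomputed box plot statistics
stats = [{"med": median, "q1": q1, "q3": q3,
          "whislo": mn, "whishi": mx, "label": "life expectancy"}]
fig, ax = plt.subplots()
ax.bxp(stats, showfliers=False)
plt.show()
```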
Outlier Detection
Outlier Detection using IQR
● One of the use cases of the five point summary is to find outliers in the data
● The idea is that any values which fall significantly outside the interquartile range (IQR) are typically flagged as outliers
● IQR = Q3 - Q1
● A common rule is to flag as outliers the values which fall outside Q1 - 1.5*IQR to Q3 + 1.5*IQR
Ex : OutliersWithIQR.scala
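The talk's example is OutliersWithIQR.scala; the snippet below is a rough PySpark equivalent of the same rule (column name assumed):

```python
# Q1 and Q3 from approxQuantile, then filter values outside the 1.5 * IQR fences
q1, q3 = df.stat.approxQuantile("life_expectancy", [0.25, 0.75], 0.01)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

life_col = df["life_expectancy"]
outliers = df.filter((life_col < lower) | (life_col > upper))
outliers.show()
```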
Histogram
Histogram
● A histogram is an accurate representation of the
distribution of numerical data
● It is a kind of bar graph
● To construct a histogram, the first step is to "bin" the
range of values—that is, divide the entire range of
values into a series of intervals—and then count how
many values fall into each interval
Histogram API
● Dataframe doesn’t have a direct histogram method, but RDD does have one on DoubleRDD
● The histogram API takes the number of buckets and returns two things
○ The start values of each bucket
○ The number of elements in each bucket
● We can use the pyplot bar chart API to draw a histogram using these results
● Ex : EDA on Life Expectancy Data
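A hedged sketch of that flow in PySpark (column name assumed): drop down to the RDD of values, ask for the buckets and counts, and feed them to pyplot's bar chart:

```python
import matplotlib.pyplot as plt

# RDD of plain double values for the column
values = df.select("life_expectancy").rdd.map(lambda row: row[0])

# 10 equal-width buckets: 'boundaries' has 11 entries, 'counts' has 10
boundaries, counts = values.histogram(10)

# Draw the histogram as a bar chart from the precomputed buckets
widths = [boundaries[i + 1] - boundaries[i] for i in range(len(counts))]
plt.bar(boundaries[:-1], counts, width=widths, align="edge")
plt.show()
```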
