Best Practices Writing Production-Grade PySpark Jobs
How to Structure Your PySpark Job Repository and Code

Eran Kampf
Maker of things. Big data geek. Food Lover.
Jan 24, 2017 · 7 min read

Using PySpark to process large amounts of data in a distributed fashion is a great way to manage large-scale data-heavy tasks and gain business insights while not sacrificing on developer efficiency.

In short, PySpark is awesome.


However, while there are a lot of code examples out there, there isn’t a lot of information (that I could find) on how to build a PySpark codebase — writing modular jobs, building, packaging, handling dependencies, testing, etc. — that could scale to a larger development team.

So, following a year+ of working with PySpark, I decided to collect all the know-how and conventions we’ve gathered into this post (and an accompanying boilerplate project).

In this post we’ll cover:

• Structuring PySpark Jobs

• Handling 3rd-party dependencies

• Writing a PySpark Job


• Unit Testing

Structuring our Jobs Repository


First, let’s go over how submitting a job to PySpark works:
spark-submit --py-files pyfile.py,zipfile.zip main.py --arg1 val1

When we submit a job to PySpark we submit the main Python file to run — main.py — and we can also add a list of dependent files that will be located together with our main file during execution.
These dependency files can be .py code files we can import from, but can also be any other kind of files. For example, .zip packages.

One of the cool features in Python is that it can treat a zip file as a directory and import modules and functions from it just as from any other directory.
All that is needed is to add the zip file to its search path.

import sys
sys.path.insert(0, 'jobs.zip')

Now (assuming jobs.zip contains a Python module called jobs) we can import that module and whatever is in it. For example:

from jobs.wordcount import run_job

run_job()

This will allow us to build our PySpark job like we’d build any Python project — using multiple modules and files — rather than one bigass myjob.py (or several such files).

Armed with this knowledge, let’s structure our PySpark project…

Jobs as Modules
We’ll de ne each job as a Python module where it can de ne its code
and transformation in whatever way it likes (multiple les, multiple sub
modules…).


.
├── README.md
├── src
│   ├── main.py
│   ├── jobs
│   │   └── wordcount
│   │       └── __init__.py

The job itself has to expose an analyze function:

def analyze(sc, **kwargs):
    ...

and a main.py which is the entry point to our job — it parses command
line arguments and dynamically loads the requested job module and
runs it:

import argparse
import importlib
import os
import sys

import pyspark

if os.path.exists('jobs.zip'):
    sys.path.insert(0, 'jobs.zip')
else:
    sys.path.insert(0, './jobs')

parser = argparse.ArgumentParser()
parser.add_argument('--job', type=str, required=True)
parser.add_argument('--job-args', nargs='*')
args = parser.parse_args()

# Turn "key=value" strings from --job-args into a kwargs dict
job_args = dict(arg.split('=', 1) for arg in (args.job_args or []))

sc = pyspark.SparkContext(appName=args.job)  # reuse the job name as the app name
job_module = importlib.import_module('jobs.%s' % args.job)
job_module.analyze(sc, **job_args)
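The key=value convention for --job-args above is an assumption (the post doesn’t show how the extra arguments get parsed). With it, a job taking arguments could be invoked like:

spark-submit --py-files jobs.zip main.py --job wordcount --job-args date=2017-01-24 limit=100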

To run this job on Spark we’ll need to package it so we can submit it via
spark-submit  …

Packaging
As we previously showed, when we submit the job to Spark we want to submit main.py as our job file and the rest of the code as a --py-files extra dependency, the jobs.zip file.


So, our packaging script (we’ll add it as a command to our Makefile) is:


build:
mkdir ./dist
cp ./src/main.py ./dist
cd ./src && zip -x main.py -r ../dist/jobs.zip .

Now we can submit our job to Spark:

make build
cd dist && spark-submit --py-files jobs.zip main.py --job wordcount

If you noticed before, our main.py code runs

sys.path.insert(0, 'jobs.zip')

making all the modules inside it available for import.


Right now we only have one such module we need to import — jobs —
which contains our job logic.

We can also add a shared module for writing logic that is used by multiple jobs. That module will simply get zipped into jobs.zip too and become available for import:

.
├── Makefile
├── README.md
├── src
│   ├── main.py
│   ├── jobs
│   │   └── wordcount
│   │       └── __init__.py
│   └── shared
│       └── __init__.py
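A shared helper can then be imported by any job the same way. A minimal sketch (the normalize_word function is purely illustrative, not part of the original project):

# src/shared/__init__.py
def normalize_word(word):
    # Hypothetical helper reused across multiple jobs
    return word.strip().lower()

A job would then use it with from shared import normalize_word.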

Handling 3rd Party Dependencies


Anyone writing a job bigger than “hello world” will probably need to depend on some external Python pip packages.

To use external libraries, we’ll simply have to pack their code and ship it to Spark the same way we pack and ship our jobs code.
pip allows installing dependencies into a folder using its -t ./some_folder option.


The same way we defined the shared module, we can simply install all our dependencies into the src folder and they’ll be packaged and become available for import the same way our jobs and shared modules are:

pip install -r requirements.txt -t ./src

However, this will create an ugly folder structure where all our requirements’ code will sit in src, overshadowing the two modules we really care about: shared and jobs.

That’s why I find it useful to add a special folder — libs — where I install requirements to:

.
├── Makefile
├── README.md
├── requirements.txt
├── src
│   ├── main.py
│   ├── jobs
│   │   └── wordcount
│   │       └── __init__.py
│   ├── libs
│   │   └── requests
│   │       └── ...
│   └── shared
│       └── __init__.py

However, our current packaging system will break imports: import some_package will now have to be written as import libs.some_package.

To solve that we’ll simply package our libs folder into a separate zip package whose root folder is libs.

build: clean
	mkdir ./dist
	cp ./src/main.py ./dist
	cd ./src && zip -x main.py -x \*libs\* -r ../dist/jobs.zip .
	cd ./src/libs && zip -r ../../dist/libs.zip .
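The clean target this build step depends on isn’t shown in the post; a minimal sketch would simply remove the previous build output:

clean:
	rm -rf ./dist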


Now we can import our 3rd party dependencies without a libs. prefix, and run our job on PySpark using:

cd dist
spark-submit --py-files jobs.zip,libs.zip main.py --job wordcount
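For running locally, main.py can also add libs.zip to the search path the same way it handles jobs.zip. A minimal sketch mirroring the earlier snippet (an assumption; the post doesn’t show this part):

# In main.py, alongside the jobs.zip handling
if os.path.exists('libs.zip'):
    sys.path.insert(0, 'libs.zip')
else:
    sys.path.insert(0, './libs')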

The only caveat with this approach is that it only works for pure-Python dependencies. For libraries that require C++ compilation there’s no other choice but to make sure they’re pre-installed on all nodes before the job runs, which is a bit harder to manage. Fortunately, most libraries do not require compilation, which makes most dependencies easy to manage.

Writing a PySpark Job


This next section covers how to write a job’s code so that it’s nice, tidy, and easy to test.

Providing a Shared Context


When writing a job, there’s usually some sort of global context we want to make available to the different transformation functions.
Spark broadcast variables, counters, and misc configuration data coming from the command line are common examples of such job context data.

For this case we’ll define a JobContext class that handles all our broadcast variables and counters:

from collections import OrderedDict

from tabulate import tabulate


class JobContext(object):
    def __init__(self, sc):
        self.counters = OrderedDict()
        self._init_accumulators(sc)
        self._init_shared_data(sc)

    def _init_accumulators(self, sc):
        # Subclasses override this to set up their counters
        pass

    def _init_shared_data(self, sc):
        # Subclasses override this to set up broadcast variables
        pass

    def initialize_counter(self, sc, name):
        self.counters[name] = sc.accumulator(0)

    def inc_counter(self, name, value=1):
        if name not in self.counters:
            raise ValueError("%s counter was not initialized. (%s)" % (name, self.counters.keys()))
        self.counters[name] += value

    def print_accumulators(self):
        print(tabulate(self.counters.items(), self.counters.keys(), tablefmt="simple"))
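To illustrate the shared-data half, a subclass could broadcast a read-only lookup in _init_shared_data; the stop-words set below is purely illustrative:

class StopWordsJobContext(JobContext):
    def _init_shared_data(self, sc):
        # Broadcast once from the driver; transformations read it via self.stop_words.value
        self.stop_words = sc.broadcast({'a', 'an', 'the'})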

We’ll create an instance of it in our job’s code and pass it to our transformations.
For example, let’s say we want to count the number of words in our wordcount job:

class WordCountJobContext(JobContext):
    def _init_accumulators(self, sc):
        self.initialize_counter(sc, 'words')


def to_pairs(context, word):
    context.inc_counter('words')
    return word, 1


def analyze(sc):
    print("Running wordcount")
    context = WordCountJobContext(sc)
    text = " ... some text ..."

    words = sc.parallelize(text.split())
    pairs = words.map(lambda word: to_pairs(context, word))
    counts = pairs.reduceByKey(lambda a, b: a + b)  # aggregate the (word, 1) pairs into counts
    ordered = counts.sortBy(lambda pair: pair[1], ascending=False)
    print(ordered.collect())
    context.print_accumulators()

Besides sorting the words by occurrence, we’ll now also keep a distributed counter on our context that counts the total number of words we processed. We can then nicely print it at the end by calling context.print_accumulators() or access it via context.counters['words'].
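Note that the counter object is a Spark accumulator, so the actual number lives on its .value attribute (standard PySpark behavior, not shown in the post), readable on the driver once actions have run:

words_counter = context.counters['words']
print("Processed %d words in total" % words_counter.value)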

Writing Transformations


The code above is pretty cumbersome to write: instead of simple transformations that look like pairs = words.map(to_pairs), we now have this extra context parameter requiring us to write a lambda expression: pairs = words.map(lambda word: to_pairs(context, word)).

So we’ll use functools.partial to make our code nicer:

from functools import partial


def analyze(sc):
    print("Running wordcount")
    context = WordCountJobContext(sc)
    text = " ... some text ..."

    to_pairs_step = partial(to_pairs, context)

    words = sc.parallelize(text.split())
    pairs = words.map(to_pairs_step)
    counts = pairs.reduceByKey(lambda a, b: a + b)
    ordered = counts.sortBy(lambda pair: pair[1], ascending=False)
    print(ordered.collect())
    context.print_accumulators()

Unit Testing
When looking at PySpark code, there are a few ways we can (and should) test our code:

Transformation Tests — since transformations (like our to_pairs above) are just regular Python functions, we can simply test them the same way we’d test any other Python function:

from mock import MagicMock

from jobs.wordcount import to_pairs


def test_to_pairs():
    context_mock = MagicMock()
    result = to_pairs(context_mock, 'foo')
    assert result[0] == 'foo'
    assert result[1] == 1
    context_mock.inc_counter.assert_called_with('words')

These tests cover 99% of our code, so if we just test our transformations
we’re mostly covered.


Entire Flow Tests — testing the entire PySpark flow is a bit tricky because Spark runs in Java as a separate process.
The best way to test the flow is to fake the Spark functionality.
pysparkling is a pure-Python implementation of the PySpark RDD interface.
It acts like a real Spark cluster would, but is implemented in pure Python, so we can simply send our job’s analyze function a pysparkling.Context instead of the real SparkContext to make our job run the same way it would run in Spark.
Since we’re running in pure Python we can easily mock things like external HTTP requests, DB access, etc., which is necessary for writing good unit tests.

import pysparkling
from mock import patch
from jobs.wordcount import analyze


@patch('jobs.wordcount.get_text')
def test_wordcount(get_text_mock):
    get_text_mock.return_value = "foo bar foo"
    sc = pysparkling.Context()
    result = analyze(sc)
    assert result[0] == ('foo', 2)
    assert result[1] == ('bar', 1)

Testing the entire job flow requires refactoring the job’s code a bit, so that analyze returns a value to be tested and the input is configurable, allowing us to mock it.
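Here’s a sketch of what that refactoring might look like. The get_text function is the input hook the test above patches; its body and the overall shape are assumptions rather than the post’s code:

from functools import partial


def get_text():
    # A real job might read from HDFS, S3, a database, etc.
    return " ... some text ..."


def analyze(sc):
    context = WordCountJobContext(sc)

    words = sc.parallelize(get_text().split())
    pairs = words.map(partial(to_pairs, context))
    counts = pairs.reduceByKey(lambda a, b: a + b)
    ordered = counts.sortBy(lambda pair: pair[1], ascending=False)

    context.print_accumulators()
    return ordered.collect()  # return the result so the test can assert on it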

Where to go from here…


You can find the full source code for a PySpark starter boilerplate implementing the concepts described above at https://github.com/ekampf/PySpark-Boilerplate.
