So, following a year+ working with PySpark, I decided to collect all the know-hows and conventions we’ve gathered into this post (and accompanying boilerplate project).
import sys
sys.path.insert(0, 'jobs.zip')
This will allow us to build our PySpark job like we’d build any Python project, using multiple modules and files, rather than one bigass myjob.py (or several such files).
Jobs as Modules
We’ll define each job as a Python module where it can define its code and transformations in whatever way it likes (multiple files, multiple submodules…); a minimal example follows the directory layout below.
.
├── README.md
├── src
│   ├── main.py
│   ├── jobs
│   │   └── wordcount
│   │       └── __init__.py
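For illustration, a minimal jobs/wordcount/__init__.py could look like the sketch below; the only real contract is that the module exposes an analyze function for main.py to call (the word list here is a placeholder, not the article’s actual job):

def analyze(sc, job_args=None):
    # the entry point that main.py looks up on the job module and calls
    words = sc.parallelize(["hello", "world", "hello"])
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.collect())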
The main.py file is the entry point to our job: it parses the command-line arguments, dynamically loads the requested job module and runs it:
import argparse
import importlib
import os
import sys

import pyspark

# make the packaged job code importable, whether it ships as a zip or a plain folder
if os.path.exists('jobs.zip'):
    sys.path.insert(0, 'jobs.zip')
else:
    sys.path.insert(0, './jobs')

parser = argparse.ArgumentParser()
parser.add_argument('--job', type=str, required=True)
parser.add_argument('--job-args', nargs='*')
args = parser.parse_args()

sc = pyspark.SparkContext(appName=args.job)
job_module = importlib.import_module('jobs.%s' % args.job)
job_module.analyze(sc, args.job_args)
To run this job on Spark we’ll need to package it so we can submit it via
spark-submit …
Packaging
As we previously showed, when we submit the job to Spark we want to submit main.py as our job file and the rest of the code as a --py-files dependency. A simple Makefile build target handles the packaging:
build:
	mkdir ./dist
	cp ./src/main.py ./dist
	cd ./src && zip -x main.py -r ../dist/jobs.zip .
make build
cd dist && spark-submit --py-files jobs.zip main.py --job wordcount
We can also add a shared module for writing logic that is used by multiple jobs. That module will simply get zipped into jobs.zip too and become available for import (a small usage sketch follows the layout below).
.
├── Makefile
├── README.md
├── src
│   ├── main.py
│   ├── jobs
│   │   └── wordcount
│   │       └── __init__.py
│   └── shared
│       └── __init__.py
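As a sketch of how a job would consume it (shared.text_utils and clean_word are hypothetical names here, just to show the import path):

# jobs/wordcount/__init__.py
from shared.text_utils import clean_word  # hypothetical helper living under src/shared

def analyze(sc, job_args=None):
    words = sc.parallelize(["Hello!", "hello"])
    cleaned = words.map(clean_word)  # the same helper is reusable from any job
    print(cleaned.countByValue())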
To use external libraries, we’ll simply have to pack their code and ship it to Spark the same way we pack and ship our jobs’ code. pip allows installing dependencies into a folder using its -t ./some_folder option.
The same way we defined the shared module, we can simply install all our dependencies into the src folder (for example with pip install -r requirements.txt -t ./src) and they’ll be packaged and available for import the same way our jobs and shared modules are.
However, this will create an ugly folder structure where all our requirements’ code sits in our source tree, overshadowing the two modules we really care about, shared and jobs:
.
├── Makefile
├── README.md
├── requirements.txt
├── src
│   ├── main.py
│   ├── jobs
│   │   └── wordcount
│   │       └── __init__.py
│   ├── libs
│   │   └── requests
│   │       └── ...
│   └── shared
│       └── __init__.py
This also means the dependencies would have to be imported as libs.some_package. To solve that we’ll simply package our libs folder into a separate zip package whose root folder is libs:
build: clean
	mkdir ./dist
	cp ./src/main.py ./dist
	cd ./src && zip -x main.py -x \*libs\* -r ../dist/jobs.zip .
	cd ./src/libs && zip -r ../../dist/libs.zip .
cd dist
spark-submit --py-files jobs.zip,libs.zip main.py --job wordcount
The only caveat with this approach is that it only works for pure-Python dependencies. For libraries that require C++ compilation, there’s no other choice but to make sure they’re pre-installed on all nodes before the job runs, which is a bit harder to manage. Fortunately, most libraries do not require compilation, which keeps most dependencies easy to manage.
For this case we’ll define a JobContext class that handles all our broadcast variables and counters:
from collections import OrderedDict
from tabulate import tabulate

class JobContext(object):
    def __init__(self, sc):
        self.counters = OrderedDict()
        self._init_accumulators(sc)
        self._init_shared_data(sc)

    def _init_accumulators(self, sc):
        pass  # subclasses register their counters here

    def _init_shared_data(self, sc):
        pass  # subclasses create their broadcast variables here

    def initialize_counter(self, sc, name):
        self.counters[name] = sc.accumulator(0)  # reconstructed: a named Spark accumulator

    def inc_counter(self, name, value=1):
        self.counters[name] += value

    def print_accumulators(self):
        print(tabulate(self.counters.items(),
                       self.counters.keys(),
                       tablefmt="simple"))
class WordCountJobContext(JobContext):
    def _init_accumulators(self, sc):
        self.initialize_counter(sc, 'words')


def analyze(sc):
    print("Running wordcount")
    context = WordCountJobContext(sc)

    text = " ... some text ..."
    words = sc.parallelize(text.split())
    pairs = words.map(lambda word: to_pairs(context, word))
    counts = pairs.reduceByKey(lambda a, b: a + b)  # reconstructed: this line was lost in extraction
    ordered = counts.sortBy(lambda pair: pair[1], ascending=False)

    print(ordered.collect())
    context.print_accumulators()
Writing Transformations
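The explanation that opened this section was lost in extraction, but the idea can be reconstructed from the code below and from the unit tests further down: each transformation is a small, named Python function that receives the context explicitly, and the context is bound in once before handing the function to Spark. A sketch under those assumptions (to_counts and the partial-based binding are inferred, not the article’s verbatim code):

from functools import partial

def to_pairs(context, word):
    # bump the 'words' accumulator, then emit the classic (word, 1) pair
    context.inc_counter('words')
    return word, 1

def to_counts(a, b):
    # reduce step: sum the occurrences of the same word
    return a + b

# inside analyze(), the context is bound once so map() receives a one-argument step:
#     to_pairs_step = partial(to_pairs, context)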
def analyze(sc):
    print("Running wordcount")
    context = WordCountJobContext(sc)

    text = " ... some text ..."
    words = sc.parallelize(text.split())
    to_pairs_step = partial(to_pairs, context)  # reconstructed: bind the context once
    pairs = words.map(to_pairs_step)
    counts = pairs.reduceByKey(to_counts)  # reconstructed: sum counts per word
    ordered = counts.sortBy(lambda pair: pair[1], ascending=False)

    print(ordered.collect())
    context.print_accumulators()
Unit Testing
When looking at PySpark code, there are a few ways we can (and should) test it. Since the transformation functions (to_pairs above) are just regular Python functions, we can simply test them the same way we’d test any other Python function:
from mock import MagicMock

from jobs.wordcount import to_pairs


def test_to_pairs():
    context_mock = MagicMock()
    result = to_pairs(context_mock, 'foo')
    assert result[0] == 'foo'
    assert result[1] == 1
    context_mock.inc_counter.assert_called_with('words')
These tests cover 99% of our code, so if we just test our transformations we’re mostly covered. To test the entire job flow end to end without a real Spark cluster, we can use pysparkling, a pure-Python implementation of Spark’s RDD interface, as a drop-in replacement for the SparkContext:
import pysparkling
from mock import patch
from jobs.wordcount import analyze
@patch('jobs.wordcount.get_text')
def test_wordcount(get_text_mock):
get_text_mock.return_value = "foo bar foo"
sc = pysparkling.Context()
result = analyze(sc)
assert result[0] == ('foo', 2)
assert result[1] == ('bar', 1)
Testing the entire job flow requires refactoring the job’s code a bit, so that analyze returns a value we can assert on and the input is configurable so that we can mock it.
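A sketch of what that refactor could look like, inferred from the test above: it patches jobs.wordcount.get_text, so the input presumably comes from a small get_text() function, and analyze returns the collected result instead of printing it (the exact original code may differ):

def get_text():
    # indirection point for the input, so tests can patch it
    return " ... some text ..."

def analyze(sc):
    context = WordCountJobContext(sc)

    words = sc.parallelize(get_text().split())
    pairs = words.map(lambda word: to_pairs(context, word))
    counts = pairs.reduceByKey(lambda a, b: a + b)
    ordered = counts.sortBy(lambda pair: pair[1], ascending=False)

    context.print_accumulators()
    # return a value so the test can assert on it
    return ordered.collect()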