Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Spark applications
Mark Grover | @mark_grover | Software Engineer
Ted Malaska | @TedMalaska | Principal Solutions Architect
tiny.cloudera.com/spark-mistakes
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
Mistakes people make
when using Spark
Mistakes people we’ve made
when using Spark
Mistakes people make
when using Spark
Mistake # 1
# Executors, cores, memory !?!
• 6 Nodes
• 16 cores each
• 64 GB of RAM each
Decisions, decisions, decisions
• 6 nodes
• 16 cores each
• 64 GB of RAM
= 4 GB/executor Executor 3
= 4 GB/executor Executor 3
Executor 1
Answer #3 – with overhead
Worker node
• 6 executors – 1 executor/node
Overhead(1G,1 core)
• 63 GB memory each
• 15 cores each
Executor 1
Let’s assume…
• You are running Spark on YARN, from here on…
3 things
• 3 other things to keep in mind
#1 – Memory overhead
Single Thread
Single Thread
Single Thread
Distributed
Single Thread
Single Thread
Single Thread
Mistake - Skew
What about Skew, because that is a thing
Distributed
Mistake – Skew : Answers
• Salting
• Isolated Salting
• Isolated Map Joins
Mistake – Skew : Salting
• Normal Key: “Foo”
• Salted Key: “Foo” + random.nextInt(saltFactor)
Managing Parallelism
Mistake – Skew: Salting
Add Example Slide
10x
Shuffle Tmp 1
100x
Shuffle Tmp 2 ReduceTask 1000x
Map Task
Shuffle Tmp 3 10000x
Shuffle Tmp 4
100000x
ReduceTask
1000000x
Shuffle Tmp 1 Or more
Shuffle Tmp 2
Amount
Map Task
of Data Shuffle Tmp 3
ReduceTask
Shuffle Tmp 4
Shuffle Tmp 1
Shuffle Tmp 2
Map Task ReduceTask
Shuffle Tmp 3
Shuffle Tmp 4
Managing Parallelism
• How To fight Cartesian Join
Table X
– Nested Structures A, 1, 4
A, 2, 4
A, 3 A, 6 A, 3, 5 A, 3 A, 6
A, 1, 6
A, 2, 6
A, 3, 6
Managing Parallelism
• How To fight Cartesian Join
– Nested Structures
Partition Partition
Partition Partition
25%
Complex Types
• Top N List
• Multiple types of Aggregations
• Windowing operations