
SPARK TRANSFORMATIONS

MAP()

Return a new RDD by applying a function to each element of this RDD.

FILTER()
Applies a user function to each element: an item is kept if the function returns true.
The resultant RDD contains only the elements for which the condition is true.
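A minimal sketch of map() and filter(), assuming an existing SparkContext sc (the sample numbers are made up for illustration):
val nums = sc.parallelize(List(1, 2, 3, 4, 5))
val squared = nums.map(n => n * n)        // map(): one output element per input element -> 1, 4, 9, 16, 25
val evens = nums.filter(n => n % 2 == 0)  // filter(): keep only elements where the predicate is true -> 2, 4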
flatMap()
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
Similar to map, but each input item can be mapped to 0 or more
output items (so func should return a Seq rather than a single
item).
Difference between map() and flatMap()
map() and flatMap() are similar in that they take a line from the input RDD and apply a function to it. The key difference is that map() returns exactly one element per input element, while flatMap() can return a list of elements (zero or more).
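A hedged sketch of the difference, assuming an existing SparkContext sc (the sample lines are made up):
val lines = sc.parallelize(List("hello world", "hi"))
lines.map(line => line.split(" ")).collect()
// Array(Array(hello, world), Array(hi)) -- one output element per input line
lines.flatMap(line => line.split(" ")).collect()
// Array(hello, world, hi) -- the per-line arrays are flattened into individual words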
groupBy() & groupByKey() Example
groupByKey() operates on Pair RDDs and is used to group all the
values related to a given key. groupBy() can be used in both
unpaired & paired RDDs. When used with unpaired data, the key
for groupBy() is decided by the function literal passed to the
method
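A small sketch of groupBy() on unpaired data, assuming an existing SparkContext sc; the function literal that picks the key (here, the first character of each name) is an arbitrary choice for illustration:
val names = sc.parallelize(List("alice", "bob", "anna", "carl"))
val byFirstLetter = names.groupBy(name => name.charAt(0))
byFirstLetter.collect()
// e.g. Array((a,CompactBuffer(alice, anna)), (b,CompactBuffer(bob)), (c,CompactBuffer(carl)))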
reduceByKey will aggregate by key before shuffling, whereas groupByKey will shuffle all the key-value pairs. On large datasets the difference is significant.
groupByKey() is just to group your dataset based on a key. It will result in data shuffling when the RDD is not already partitioned.
reduceByKey() is something like grouping plus aggregation. We can say reduceByKey() is equivalent to dataset.group(...).reduce(...). It will shuffle less data than groupByKey().
aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result as a different type. In other words, it lets you have an input of type x and an aggregated result of type y. For example, (1,2) and (1,4) as input and (1,"six") as output. It also takes a zero value that is applied at the beginning of each key.
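A hedged sketch of this idea follows, assuming an existing SparkContext sc; the input values are Int and the result is a String, so the result type differs from the value type (the spelling of numbers is simplified to plain concatenation, and the exact ordering in the output depends on partitioning):
val typedPairs = sc.parallelize(Seq((1, 2), (1, 4), (2, 3)))
val asText = typedPairs.aggregateByKey("")(          // "" is the zero value, applied once per key
  (acc: String, v: Int) => acc + v.toString + " ",   // seqOp: fold an Int value into a String, within a partition
  (a: String, b: String) => a + b                    // combOp: merge the partial Strings from different partitions
)
asText.collect()  // e.g. Array((1,"2 4 "), (2,"3 "))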
reduceByKey - reduceByKey(func, [numTasks])
Data is combined first, so that each partition holds at most one combined value per key it contains. Only then does the shuffle happen, and the combined values are sent over the network to a particular executor for the final reduce.
groupByKey - groupByKey([numTasks])
It does not merge the values for each key; the shuffle happens directly, so a lot of data gets sent to each partition, almost as much as the initial data.
The merging of values for each key is done only after the shuffle. Because a lot of data ends up stored on the final worker nodes, this can result in out-of-memory issues.
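The classic word count illustrates the difference. A hedged sketch, assuming an existing SparkContext sc (the input words are made up):
val wordsRdd = sc.parallelize(Seq("spark", "hadoop", "spark", "hive", "spark"))
val wordPairs = wordsRdd.map(w => (w, 1))
// reduceByKey: counts are partially summed inside each partition before the shuffle
val countsReduce = wordPairs.reduceByKey(_ + _)
// groupByKey: every (word, 1) pair is shuffled first, the sum happens only afterwards
val countsGroup = wordPairs.groupByKey().mapValues(_.sum)
countsReduce.collect()  // e.g. Array((hive,1), (spark,3), (hadoop,1))
countsGroup.collect()   // same result, but more data is moved across the network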

aggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
It is similar to reduceByKey, but you can provide an initial value when performing the aggregation.
Use of reduceByKey: reduceByKey can be used when we run on a large dataset. Prefer reduceByKey over aggregateByKey when the input and output value types are of the same type.
aggregateByKey() is almost identical to reduceByKey() (both calling
combineByKey() behind the scenes), except you give a starting
value for aggregateByKey(). Most people are familiar with
reduceByKey(), so I will use that in the explanation.
The reason reduceByKey() is so much better is because it makes
use of a MapReduce feature called a combiner. Any function like +
or * can be used in this fashion because the order of the elements
it is called on doesn't matter. This allows Spark to start "reducing"
values with the same key even if they are not all in the same
partition yet.
On the flip side, groupByKey() gives you more versatility, since you write a function that takes an Iterable, meaning you could even pull all the elements into an array. However, it is inefficient because, for it to work, the full set of (K,V) pairs for a key has to be in one partition.
The step that moves the data around in a reduce-type operation is generally called the shuffle; at the very simplest level the data is partitioned to each node (often with a hash partitioner), and then sorted on each node.
aggregateByKey() is quite different from reduceByKey. What
happens is that reduceByKey is sort of a particular case of
aggregateByKey.
aggregateByKey() will combine the values for a particular key, and
the result of such combination can be any object that you specify.
You have to specify how the values are combined ("added") inside
one partition (that is executed in the same node) and how you
combine the result from different partitions (that may be in
different nodes). reduceByKey is a particular case, in the sense that the result of the combination (e.g. a sum) is of the same type as the values, and that the operation used to combine results from different partitions is the same as the operation used to combine values inside a partition.
An example: Imagine you have a list of pairs. You parallelize it:
val pairs = sc.parallelize(Array(("a", 3), ("a", 1), ("b", 7), ("a", 5)))
Now you want to "combine" them by key producing a sum. In this
case reduceByKey and aggregateByKey are the same:
val resReduce = pairs.reduceByKey(_ + _) // the same operation for everything
resReduce.collect
res3: Array[(String, Int)] = Array((b,7), (a,9))
// 0 is the initial value, _+_ inside a partition, _+_ between partitions
val resAgg = pairs.aggregateByKey(0)(_+_, _+_)
resAgg.collect
res4: Array[(String, Int)] = Array((b,7), (a,9))
Now, imagine that you want the aggregation to be a Set of the values, which is a different type from the values themselves (they are integers, and the sum of integers is also an integer):
import scala.collection.mutable.HashSet
// the initial value is an empty Set; adding an element to a set is the first op (_+_)
// joining two sets is the second op (_++_)
val sets = pairs.aggregateByKey(new HashSet[Int])(_+_, _++_)
sets.collect
res5: Array[(String, scala.collection.mutable.HashSet[Int])] = Array((b,Set(7)), (a,Set(1, 5, 3)))
Difference between ReduceByKey and CombineByKey in Spark
reduceByKey internally calls combineByKey, hence the basic way of task execution is the same for both.
Choose combineByKey over reduceByKey when the input type and the output type are not expected to be the same; combineByKey then carries the extra overhead of converting one type to the other.
If no type conversion is needed, there is no difference at all.
Difference between combineByKey and aggregateByKey
You can replace groupByKey with reduceByKey to improve performance in some cases. reduceByKey performs a map-side combine, which can reduce network IO and shuffle size, whereas groupByKey does not perform a map-side combine.
combineByKey is more general than aggregateByKey. In fact, aggregateByKey, reduceByKey and groupByKey are all implemented in terms of combineByKey. aggregateByKey is similar to reduceByKey, but you can provide initial values when performing the aggregation.
As the name suggests, aggregateByKey is suitable for computing aggregations per key, for example sum, avg, etc. The rule here is that the extra computation spent on the map-side combine should reduce the size of the data sent out to other nodes and the driver. If your function satisfies this rule, you should probably use aggregateByKey.
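For completeness, a hedged sketch of combineByKey itself, computing a per-key average (assumes an existing SparkContext sc; the scores data is made up). The value type is Int but the intermediate combiner type is a (sum, count) pair, which is exactly the type-changing case where combineByKey or aggregateByKey is needed over reduceByKey:
val scores = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 7), ("a", 5)))
val avgByKey = scores.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value seen for a key in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold a value into the partition-local (sum, count)
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge (sum, count) pairs from different partitions
).mapValues { case (sum, count) => sum.toDouble / count }
avgByKey.collect()  // e.g. Array((a,3.0), (b,7.0))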
map and mapPartitions
mapPartitions is a transformation that is similar to map.
In map, a function is applied to each and every element of an RDD, producing one element of the resultant RDD per input element. In the case of mapPartitions, the function is applied to each partition of the RDD instead of each element, and returns multiple elements of the resultant RDD. With the mapPartitions transformation, performance can improve because per-element setup work (such as object creation) is done once per partition rather than once per element, as it is in the map transformation.
Since the mapPartitions transformation works on each partition, it takes an iterator over the partition's values (for example strings or ints) as its input for a partition.
Consider the following example:
val data = sc.parallelize(List(1,2,3,4,5,6,7,8), 2)
Map:
def sumfuncmap(numbers: Int): Int = {
  val sum = 1
  sum + numbers
}
data.map(sumfuncmap).collect
// returns Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9) -- applied to each and every element
MapPartitions:
def sumfuncpartition(numbers: Iterator[Int]): Iterator[Int] = {
  var sum = 1
  while (numbers.hasNext) {
    sum = sum + numbers.next()
  }
  Iterator(sum)
}
data.mapPartitions(sumfuncpartition).collect
// returns Array[Int] = Array(11, 27) -- applied partition-wise, one result per partition
mapPartitionsWithIndex is similar to mapPartitions, except that it
takes one more argument as input, which is the index of the
partition.
mapPartitions vs. map() and foreach()
mapPartitions converts each partition of the source RDD into many elements of the result (possibly none). In mapPartitions(), the supplied function is applied to whole partitions, in parallel across partitions. mapPartitions is like map, but the difference is that it runs separately on each partition (block) of the RDD.
Now, let's discuss their differences:
We can use mapPartitions() as an alternative to map() and foreach(). mapPartitions() is called once for each partition, unlike map() and foreach(), which are called for each element in the RDD. The main advantage is that we can do initialization on a per-partition basis instead of a per-element basis (as done by map() and foreach()).
Consider initializing a database connection. If we use map() or foreach(), the number of initializations equals the number of elements in the RDD, whereas if we use mapPartitions(), it equals the number of partitions.
With mapPartitions we get an Iterator as an argument, through which we can iterate over all the elements in a partition.
Ex: if there are 1000 rows and 10 partitions, then each partition will contain 1000/10 = 100 rows.
Now, when we apply the map(func) method to the RDD, func() is applied to each and every row; in this particular case func() is called 1000 times, which can be costly in time-critical applications.
If we call mapPartitions(func) on the RDD instead, func() is called once per partition rather than once per row; in this particular case it is called 10 times (the number of partitions). In this way you can avoid some processing overhead in time-critical applications, as the sketch below illustrates.
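A hedged sketch of the per-partition initialization pattern, assuming an existing SparkContext sc. ExpensiveResource is a hypothetical stand-in for something costly to create, such as a database connection:
class ExpensiveResource {                       // hypothetical stand-in for e.g. a JDBC connection
  def lookup(id: Int): String = s"row-$id"
  def close(): Unit = ()
}
val ids = sc.parallelize(1 to 1000, 10)
val enriched = ids.mapPartitions { iter =>
  val resource = new ExpensiveResource()        // created once per partition (10 times), not once per element (1000 times)
  val out = iter.map(id => resource.lookup(id)).toList  // materialize before closing the resource
  resource.close()
  out.iterator
}
enriched.count()  // 1000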

sortByKey()
When we apply the sortByKey() function on a dataset of (K, V)
pairs, the data is sorted according to the key K in another RDD.
sortByKey() example:
val data = spark.sparkContext.parallelize(Seq(("maths",52),
("english",75), ("science",82), ("computer",65), ("maths",85)))
val sorted = data.sortByKey()
sorted.foreach(println)
Note – In the above code, the sortByKey() transformation sorts the data RDD in ascending order of the key (String).
join()
Join is a database term. It combines the fields from two tables using common values. The join() operation in Spark is defined on pair-wise RDDs. Pair-wise RDDs are RDDs in which each element is a tuple, where the first element is the key and the second element is the value.
The boon of using keyed data is that we can combine the data
together. The join() operation combines two data sets on the basis
of the key.
Join() example:
val data = spark.sparkContext.parallelize(Array(('A',1),('b',2),
('c',3)))
val data2 =spark.sparkContext.parallelize(Array(('A',4),('A',6),
('b',7),('c',3),('c',8)))
val result = data.join(data2)
println(result.collect().mkString(","))
Note – The join() transformation will join two different RDDs on
the basis of Key.
Controlling Partitions
Coalesce()
To avoid a full shuffle of the data we use the coalesce() function. In coalesce() we reuse existing partitions so that less data is shuffled. Using this we can cut down the number of partitions. Suppose we have data on four nodes but want only two partitions; the data from the extra nodes is then kept on the two nodes we retain.
Coalesce() example:
val rdd1 = spark.sparkContext.parallelize(Array("jan","feb","mar","april","may","jun"), 3)
val result = rdd1.coalesce(2)
result.foreach(println)
Note – coalesce will decrease the number of partitions of the source RDD to the numPartitions defined in the coalesce argument.
foreach() Example
foreach() is an action. Unlike other actions, foreach does not return any value. It simply operates on all the elements in the RDD. foreach() can be used in situations where we do not want to return any result but want to initiate a computation. A good example is inserting the elements of an RDD into a database. Let us look at an example for foreach():
rdd.foreach(func)
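A minimal hedged sketch, assuming an existing SparkContext sc: instead of returning a value, the side effect here is an accumulator update (in practice it could just as well be a database insert):
val nums = sc.parallelize(List(1, 2, 3, 4, 5))
val total = sc.longAccumulator("total")
nums.foreach(n => total.add(n))   // runs on the executors; nothing is returned
println(total.value)              // 15, read back on the driver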
collect()
· The simplest and most common operation that returns data to our driver program is collect(), which returns the entire RDD’s contents.
· collect() is commonly used in unit tests where the entire contents of the RDD are expected to fit in memory, as that makes it easy to compare the value of our RDD with our expected result.
· collect() suffers from the restriction that all of your data must fit on a single machine, as it all needs to be copied to the driver (see the minimal sketch below).
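A minimal sketch, assuming an existing SparkContext sc:
val letters = sc.parallelize(Seq("a", "b", "c"))
val local: Array[String] = letters.collect()   // the whole RDD is copied back to the driver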

Saving Files (Actions)
· Saving files means writing to plain-text files. With RDDs, you cannot actually “save” to a data source in the conventional sense.
· You must iterate over the partitions in order to save the contents of each partition to some external database.
· This is a low-level approach that reveals the underlying operation that is being performed in the higher-level APIs. Spark will take each partition, and write that out to the destination.
saveAsTextFile
To save to a text file, you just specify a path and optionally a
compression codec:
words.saveAsTextFile("file:/tmp/bookTitle")
To set a compression codec, we must import the proper codec
from Hadoop. You can find these in the
org.apache.hadoop.io.compress library:
// in Scala
import org.apache.hadoop.io.compress.BZip2Codec
words.saveAsTextFile("file:/tmp/bookTitleCompressed",
classOf[BZip2Codec])
SequenceFiles
A sequenceFile is a flat file consisting of binary key–value pairs. It
is extensively used in MapReduce as an input/output format.
Spark can write to sequenceFiles using the saveAsObjectFile
method or by explicitly writing key–value pairs
words.saveAsObjectFile("/tmp/my/sequenceFilePath")
There are a variety of different Hadoop file formats to which you
can save. These allow you to specify classes, output formats,
Hadoop configurations, and compression schemes. (For
information on these formats, read Hadoop: The Definitive Guide
[O’Reilly, 2015].) These formats are largely irrelevant except if
you’re working deeply in the Hadoop ecosystem or with some
legacy mapReduce jobs.
Checkpointing
· One feature not available in the DataFrame API is the concept of checkpointing.
· Checkpointing is the act of saving an RDD to disk so that future references to this RDD point to those intermediate partitions on disk rather than recomputing the RDD from its original source.
· This is similar to caching except that it’s not stored in memory, only disk. This can be helpful when performing iterative computation, similar to the use cases for caching:
spark.sparkContext.setCheckpointDir("/some/path/for/checkpointing")
words.checkpoint()
Now, when we reference this RDD, it will derive from the
checkpoint instead of the source data. This can be a helpful
optimization.
Pipe RDDs to System Commands (this needs further study)
words.pipe("wc -l").collect()
mapPartitions
· The return signature of a map function on an RDD is actually MapPartitionsRDD.
· This is because map is just a row-wise alias for mapPartitions, which makes it possible for you to map an individual partition (represented as an iterator).
· That’s because physically on the cluster we operate on each partition individually (and not a specific row).
· A simple example creates the value “1” for every partition in our data, and the sum of the following expression will count the number of partitions we have:
words.mapPartitions(part => Iterator[Int](1)).sum() // 2
· This means that we operate on a per-partition basis, which allows us to perform an operation on that entire partition.
· This is valuable for performing something on an entire subdataset of your RDD. You can gather all values of a partition class or group into one partition and then operate on that entire group using arbitrary functions and controls.
mapPartitionsWithIndex
· With this you specify a function that accepts an index (within the partition) and an iterator that goes through all items within the partition.
· The partition index is the partition number in your RDD, which identifies where each record in our dataset sits (and potentially allows you to debug). You might use this to test whether your map functions are behaving correctly:
// in Scala
def indexedFunc(partitionIndex: Int, withinPartIterator: Iterator[String]) = {
  withinPartIterator.toList.map(
    value => s"Partition: $partitionIndex => $value").iterator
}
words.mapPartitionsWithIndex(indexedFunc).collect()
foreachPartition
· Although mapPartitions needs a return value to work properly, this next function does not.
· foreachPartition simply iterates over all the partitions of the data. The difference is that the function has no return value.
· This makes it great for doing something with each partition, like writing it out to a database.
· In fact, this is how many data source connectors are written. You can create your own text file source if you want by specifying outputs to the temp directory with a random ID:
words.foreachPartition { iter =>
  import java.io._
  import scala.util.Random
  val randomFileName = new Random().nextInt()
  val pw = new PrintWriter(new File(s"/tmp/random-file-${randomFileName}.txt"))
  while (iter.hasNext) {
    pw.write(iter.next())
  }
  pw.close()
}
You’ll find these two files if you scan your /tmp directory.
glom
· glom takes every partition in your dataset and converts it to an array.
· This can be useful if you’re going to collect the data to the driver and want to have an array for each partition.
· However, this can cause serious stability issues: if you have large partitions or a large number of partitions, it’s simple to crash the driver.
· In the following example, you can see that we get two partitions and each word falls into one partition each:
// in Scala
spark.sparkContext.parallelize(Seq("Hello", "World"), 2).glom().collect()
// Array(Array(Hello), Array(World))
Passing Functions to Spark (need to know about this)
· Most of Spark’s transformations, and some of its actions, depend on passing in functions that are used by Spark to compute data.
· Each of the core languages (Python/Java/Scala) has a slightly different mechanism for passing functions to Spark.
Scala
· There are two recommended ways to do this:
1) Anonymous function syntax, which can be used for short pieces of code:
(x: Int) => x + 1
2) Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:
object MyFunctions {
  def func1(s: String): String = { ... }
}
myRdd.map(MyFunctions.func1)

Note
It is also possible to pass a reference to a method in a class instance (as opposed to a singleton object); this requires sending the object that contains that class along with the method:
class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
Here, if we create a new MyClass and call doStuff on it, the map
inside there references the func1 method of that MyClass
instance, so the whole object needs to be sent to the cluster. It is
similar to writing rdd.map(x => this.func1(x)).
In a similar way, accessing fields of the outer object will reference the whole object:
class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
This is equivalent to writing rdd.map(x => this.field + x), which references all of this. To avoid this issue, the simplest way is to copy field into a local variable instead of accessing it externally:
def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}
