map()
Apply a user function to each element of the RDD; the result is a new RDD with exactly one output element for each input element.
filter()
Apply a user function to each element and keep the item if the function returns true. The resulting RDD contains only the elements for which the condition is true.
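A minimal sketch of both transformations, using hypothetical data:
val nums = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
val doubled = nums.map(_ * 2)        // map: exactly one output per input
val evens = nums.filter(_ % 2 == 0)  // filter: keep elements where the predicate is true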
flatMap()
Return a new RDD by first applying a function to all elements of
this RDD, and then flattening the results
Similar to map, but each input item can be mapped to 0 or more
output items (so func should return a Seq rather than a single
item).
Difference between map() and flatMap()
map() and flatMap() are similar in that they take an element from the input RDD and apply a function to it. The key difference is that map() returns exactly one element per input, while flatMap() can return a list of zero or more elements.
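A minimal sketch of the difference, assuming a hypothetical RDD of text lines:
val lines = spark.sparkContext.parallelize(Seq("hello world", "hi"))
lines.map(_.split(" ")).collect()     // Array(Array(hello, world), Array(hi)): one element per line
lines.flatMap(_.split(" ")).collect() // Array(hello, world, hi): results flattened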
groupBy() & groupByKey() Example
groupByKey() operates on pair RDDs and is used to group all the values related to a given key. groupBy() can be used on both paired and unpaired RDDs; when used with unpaired data, the key for groupBy() is decided by the function literal passed to the method.
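A minimal sketch of both, using hypothetical data:
val nums = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
val byParity = nums.groupBy(n => n % 2)  // key decided by the function literal: (0, [2, 4]), (1, [1, 3, 5])
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.groupByKey()         // pair RDD: ("a", [1, 2]), ("b", [3])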
reduceByKey will aggregate values per key before shuffling, whereas groupByKey will shuffle all the key-value pairs. On large data the difference is significant.
groupByKey() simply groups your dataset based on a key; it will cause a data shuffle when the RDD is not already partitioned. reduceByKey() is grouping plus aggregation: we can say reduceByKey() is equivalent to dataset.group(...).reduce(...), and it shuffles less data than groupByKey().
aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result as a different type. In other words, it lets you have input of type X and an aggregated result of type Y, for example (1,2), (1,4) as input and (1,"six") as output. It also takes a zero value that is applied at the beginning of aggregation for each key.
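A minimal sketch of the type change, using hypothetical data (Int values in, String results out):
val pairs = spark.sparkContext.parallelize(Seq((1, 2), (1, 4)))
val concatenated = pairs.aggregateByKey("")(  // zero value of the result type String
  (acc: String, v: Int) => acc + v.toString,  // seqOp: fold an Int into the String accumulator
  (a: String, b: String) => a + b)            // combOp: merge partial results across partitions
// e.g. (1, "24"): input values are Ints, the aggregated result is a String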
reduceByKey - reduceByKey(func, [numTasks])
Values are combined within each partition first, so each partition holds a single combined value per key. Only then does the shuffle happen, and the combined values are sent over the network to a particular executor for the final action, such as reduce.
groupByKey - groupByKey([numTasks])
It does not merge the values for a key before the shuffle; the shuffle happens directly, so almost as much data as the initial dataset gets sent to each partition. The merging of values for each key is done only after the shuffle. Because a lot of data ends up stored on the final worker nodes, this can result in out-of-memory issues.
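A minimal word-count-style sketch of the contrast, using hypothetical pairs:
val wordPairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val reduced = wordPairs.reduceByKey(_ + _)            // combines per partition before the shuffle
val grouped = wordPairs.groupByKey().mapValues(_.sum) // shuffles every pair, then sums
// both yield ("a", 2), ("b", 1), but reduceByKey shuffles less data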
sortByKey()
When we apply the sortByKey() function to a dataset of (K, V) pairs, the data is sorted by the key K and returned as a new RDD.
sortByKey() example:
val data = spark.sparkContext.parallelize(Seq(("maths",52),
("english",75), ("science",82), ("computer",65), ("maths",85)))
val sorted = data.sortByKey()
sorted.foreach(println)
Note – In the above code, the sortByKey() transformation sorts the data RDD in ascending order of the key (String).
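For descending order, sortByKey() takes an ascending flag:
val sortedDesc = data.sortByKey(ascending = false) // sort keys in descending order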
join()
Join is a database term: it combines fields from two tables using common values. The join() operation in Spark is defined on pair RDDs, i.e., RDDs whose elements are tuples in which the first element is the key and the second is the value.
The benefit of keyed data is that we can combine datasets: the join() operation combines two datasets on the basis of the key.
join() example:
val data = spark.sparkContext.parallelize(Array(('A',1),('b',2),('c',3)))
val data2 = spark.sparkContext.parallelize(Array(('A',4),('A',6),('b',7),('c',3),('c',8)))
val result = data.join(data2)
println(result.collect().mkString(","))
Note – The join() transformation joins two different RDDs on the basis of the key.
Controlling Partitions
coalesce()
To avoid a full shuffle of data we use the coalesce() function. coalesce() reuses existing partitions so that less data is shuffled, and with it we can cut down the number of partitions. Suppose we have four partitions and we want only two: the data from the extra partitions is moved onto the two partitions we keep.
coalesce() example:
val rdd1 = spark.sparkContext.parallelize(
  Array("jan","feb","mar","april","may","jun"), 3)
val result = rdd1.coalesce(2)
result.foreach(println)
Note – coalesce() will decrease the number of partitions of the source RDD to the numPartitions given as the coalesce argument.
mapPartitions() and mapPartitionsWithIndex()
foreach() Example
foreach() is an action. Unlike other actions, foreach does not return any value; it simply operates on all the elements in the RDD. foreach() can be used in situations where we do not want to return any result but want to initiate a computation. A good example is inserting the elements of an RDD into a database. Let us look at an example of foreach():
rdd.foreach(func)
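A minimal concrete sketch: print each element on the executors (nothing is returned to the driver):
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
rdd.foreach(x => println(x)) // runs on the executors; output appears in executor logs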
collect()
· The simplest and most common operation that returns data to our driver program is collect(), which returns the entire RDD's contents.
· collect() is commonly used in unit tests where the entire contents of the RDD are expected to fit in memory, as that makes it easy to compare the value of our RDD with our expected result.
· collect() suffers from the restriction that all of your data must fit on a single machine, as it all needs to be copied to the driver.
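A minimal sketch, assuming a small RDD named rdd:
val all = rdd.collect()    // Array containing every element, copied to the driver
println(all.mkString(","))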
Saving Files (Actions)
· Saving files means writing to plain-text files. With RDDs, you cannot actually "save" to a data source in the conventional sense.
· You must iterate over the partitions in order to save the contents of each partition to some external database.
· This is a low-level approach that reveals the underlying operation that is being performed in the higher-level APIs. Spark will take each partition, and write that out to the destination.
saveAsTextFile
To save to a text file, you just specify a path and optionally a
compression codec:
words.saveAsTextFile("file:/tmp/bookTitle")
To set a compression codec, we must import the proper codec
from Hadoop. You can find these in the
org.apache.hadoop.io.compress library:
// in Scala
import org.apache.hadoop.io.compress.BZip2Codec
words.saveAsTextFile("file:/tmp/bookTitleCompressed",
classOf[BZip2Codec])
SequenceFiles
A sequenceFile is a flat file consisting of binary key–value pairs. It is extensively used in MapReduce as an input/output format.
Spark can write to sequenceFiles using the saveAsObjectFile method or by explicitly writing key–value pairs:
words.saveAsObjectFile("/tmp/my/sequenceFilePath")
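A minimal sketch of explicitly writing key–value pairs, assuming a hypothetical pair RDD whose types are convertible to Hadoop Writables:
val kvPairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
kvPairs.saveAsSequenceFile("/tmp/my/explicitSequenceFilePath") // keys/values converted to Writable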
There are a variety of different Hadoop file formats to which you
can save. These allow you to specify classes, output formats,
Hadoop configurations, and compression schemes. (For
information on these formats, read Hadoop: The Definitive Guide
[O’Reilly, 2015].) These formats are largely irrelevant except if
you’re working deeply in the Hadoop ecosystem or with some
legacy mapReduce jobs.
Checkpointing
· One feature not available in the DataFrame API is the concept of checkpointing.
· Checkpointing is the act of saving an RDD to disk so that future references to this RDD point to those intermediate partitions on disk rather than recomputing the RDD from its original source.
· This is similar to caching except that it's not stored in memory, only on disk. This can be helpful when performing iterative computation, similar to the use cases for caching:
spark.sparkContext.setCheckpointDir("/some/path/for/checkpointing")
words.checkpoint()
Now, when we reference this RDD, it will derive from the
checkpoint instead of the source data. This can be a helpful
optimization.
Pipe RDDs to System Commands (need to learn this)
pipe() returns an RDD created by piping the elements of each partition through the given shell command: each element is written to the process's stdin, and each line of its stdout becomes an element of the result RDD. Here each partition's line count comes back as a string:
words.pipe("wc -l").collect()
mapPartitions
· The return signature of a map function on an RDD is actually MapPartitionsRDD.
· This is because map is just a row-wise alias for mapPartitions, which makes it possible for you to map an individual partition (represented as an iterator).
· That's because physically on the cluster we operate on each partition individually (and not on a specific row).
· A simple example creates the value "1" for every partition in our data, and the sum of the following expression counts the number of partitions we have:
words.mapPartitions(part => Iterator[Int](1)).sum() // 2
· This means that we operate on a per-partition basis, which allows us to perform an operation on that entire partition.
· This is valuable for performing something on an entire subdataset of your RDD. You can gather all values of a partition class or group into one partition and then operate on that entire group using arbitrary functions and controls.
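A minimal sketch of a per-partition operation, assuming the hypothetical RDD words:
val perPartitionCounts = words.mapPartitions { iter =>
  Iterator(iter.size) // one output element per partition: how many items it holds
}
perPartitionCounts.collect() // e.g. Array(3, 2) for two partitions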
mapPartitionsWithIndex
· With this you specify a function that accepts an index (within the partition) and an iterator that goes through all items within the partition.
· The partition index is the partition number in your RDD, which identifies where each record in our dataset sits (and potentially allows you to debug). You might use this to test whether your map functions are behaving correctly:
// in Scala
def indexedFunc(partitionIndex: Int, withinPartIterator: Iterator[String]) = {
  withinPartIterator.toList.map(
    value => s"Partition: $partitionIndex => $value").iterator
}
words.mapPartitionsWithIndex(indexedFunc).collect()
foreachPartition
· Although mapPartitions needs a return value to work properly, this next function does not.
· foreachPartition simply iterates over all the partitions of the data. The difference is that the function has no return value.
· This makes it great for doing something with each partition, like writing it out to a database.
· In fact, this is how many data source connectors are written. You can create your own text file source if you want by specifying outputs to the temp directory with a random ID:
words.foreachPartition { iter =>
  import java.io._
  import scala.util.Random
  val randomFileName = new Random().nextInt()
  val pw = new PrintWriter(new File(s"/tmp/random-file-${randomFileName}.txt"))
  while (iter.hasNext) {
    pw.write(iter.next())
  }
  pw.close()
}
You’ll find these two files if you scan your /tmp directory.
glom
· glom takes every partition in your dataset and converts it to an array.
· This can be useful if you're going to collect the data to the driver and want to have an array for each partition.
· However, this can cause serious stability issues: if you have large partitions or a large number of partitions, it's simple to crash the driver.
· In the following example, you can see that we get two partitions and each word falls into one partition each:
// in Scala
spark.sparkContext.parallelize(Seq("Hello", "World"), 2).glom().collect()
// Array(Array(Hello), Array(World))
Passing Functions to Spark (need to know about this)
· Most of Spark's transformations, and some of its actions, depend on passing in functions that are used by Spark to compute data.
· Each of the core languages (Python/Java/Scala) has a slightly different mechanism for passing functions to Spark.
Scala
· There are two recommended ways to do this:
1) Anonymous function syntax, which can be used for short pieces of code:
(x: Int) => x + 1
2) Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:
object MyFunctions {
  def func1(s: String): String = { ... }
}
myRdd.map(MyFunctions.func1)
Note
It is also possible to pass a reference to a method in a class instance (as opposed to a singleton object); this requires sending the object that contains that class along with the method:
class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
Here, if we create a new MyClass and call doStuff on it, the map
inside there references the func1 method of that MyClass
instance, so the whole object needs to be sent to the cluster. It is
similar to writing rdd.map(x => this.func1(x)).
In a similar way, accessing fields of the outer object will
reference the whole object:
class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
This is equivalent to writing rdd.map(x => this.field + x), which references all of this. To avoid this issue, the simplest way is to copy field into a local variable instead of accessing it externally:
def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}