
/////////////////////

groupBy (Spark function)


...............
val a = Array("Venu","Satya","Venkat")
val x = sc.parallelize(a)
val y = x.groupBy(word => word.charAt(0))
y.collect
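
groupBy returns (key, Iterable-of-values) pairs; with the input above, collect should return roughly the following (element order may vary across runs):

// Array((V,CompactBuffer(Venu, Venkat)), (S,CompactBuffer(Satya)))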

..........

map (Scala and Spark)


val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"))
val y = x.map(x => (x, 1))
y.collect

val l = List(1,2,3,4,5)
l.map( x => x*2 )

.................
flatten (Scala function)
val x = sc.parallelize(List("Venu", "satya", "goutami", "sumati", "supraja"))
val y = x.map(x => x.toUpperCase())
y.flatten // error: flatten exists on Scala collections, not on RDDs

...................
flatMap (Scala and Spark)
flatMap = map followed by flatten
val flat = x.flatMap(x => x.toUpperCase()) // RDD[Char]: each uppercased string is flattened into its characters
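
The more typical use of flatMap is producing zero or more output elements per input, for example splitting lines into words; a minimal sketch (the sample lines below are made up for illustration):

val lines = sc.parallelize(List("spark makes rdds", "rdds are resilient"))
val words = lines.flatMap(line => line.split(" "))
words.collect // Array(spark, makes, rdds, rdds, are, resilient)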

..........
filter (Scala and Spark)
Keeps only the elements for which the predicate returns true.
val data = List("Venu", "satya", "goutami", "sumati", "supraja")

val x = sc.parallelize(data)
val y = x.filter(x => x.contains("venu")) // keeps nothing: contains is case-sensitive
val y = x.filter(x => x.contains("Venu")) // keeps Venu
val y = x.filter(x => x.startsWith("s")) // keeps satya, sumati, supraja
val y = x.filter(x => !x.startsWith("s")) // keeps Venu, goutami

................
foreach

Executes a function on each element for its side effects; it does not return a result.


val c = sc.parallelize(List("pongal", "idly", "dosa") )
c.foreach(x => println(x + "s are yummy"))
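
Note that on a cluster, the println inside foreach runs on the executors, so the output lands in executor logs rather than on the driver console. To print at the driver, collect first (safe only when the data is small):

c.collect.foreach(x => println(x + "s are yummy"))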

/////////////////////////////

groupBy (Scala and Spark)


.....................

val z = (1 to 9).toArray
z.groupBy(x => { if (x % 2 == 0) "even" else "odd" })

def myfunc(a: Int): Int = {
  a % 2
}
z.groupBy(myfunc) // groups by the remainder: 0 (even) and 1 (odd)
// or equivalently
z.groupBy(x => myfunc(x))
z.groupBy(myfunc(_))

val a = sc.parallelize(Array("dog", "tiger", "lion", "cat", "spider", "eagle"))


val b = a.groupBy(x=>x.charAt(0))

groupByKey (Spark function)
............... (works on pair RDDs, i.e. (key, value) tuples)
val a = sc.parallelize(Array("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.map(x => (x, x.length))

b.groupByKey().map(t => (t._1, t._2.sum)).collect

val words = Array("one", "two", "two", "three", "three", "three")


val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum)).collect()

reduceByKey(Spark function)
.....................
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD.reduceByKey((a,b)=> (a+b)).collect()
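
For aggregations like this word count, reduceByKey is usually preferred over groupByKey: it combines values locally on each partition before shuffling, while groupByKey ships every (word, 1) pair across the network. Both snippets above produce the same counts, e.g. Array((two,2), (one,1), (three,3)) (ordering may vary).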

distinct (Scala and Spark)
..............................
val x = Array(3,44,3,44,33,4,66)
x.distinct

val rdd = sc.parallelize(List((1,20), (1,21), (1,20), (2,20), (2,22), (2,20), (3,21), (3,22)))
rdd.distinct().collect()
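
distinct removes repeated elements, so the duplicated (1,20) and (2,20) pairs collapse; the collect above should return, in some order:

// Array((1,20), (1,21), (2,20), (2,22), (3,21), (3,22))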

union (Scala and Spark)


..........................
On Scala collections, union is also available as the ++ operator.

val x = Array(1,3,44,44,3,33,9)
val y = Array(22,3,8)
val z = x++y
val z = x.union(y)

val rdd1 = sc.parallelize(Array((1, "Aug", 30), (1, "Sep", 31), (2, "Aug", 15), (2, "Sep", 10)))
val rdd2 = sc.parallelize(Array((1, "Oct", 10), (1, "Nov", 12), (2, "Oct", 5), (2, "Nov", 15)))
rdd1.union(rdd2).collect
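
Unlike SQL's UNION, RDD union keeps duplicates; if duplicates should be dropped, chain distinct:

rdd1.union(rdd2).distinct().collect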

sortBy (Spark function)

It can sort by either key or value: the key function you pass decides what the ordering is based on.
..................

val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))
val y1 = sc.parallelize(Array('a', 'b', 'g', 'r', 'o', 'q'))
y1.sortBy(c => c, true).collect
y1.sortBy(c => c, false).collect
y.sortBy(c => c, true).collect
y.sortBy(c => c, false).collect

val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z", 1), ("L", 5)));


z.sortBy(c => c._1, true).collect;
z.sortBy(c => c._2, true).collect;

sortByKey (Spark function)


It sorts a pair RDD by key; the values play no part in the ordering.
....................
val a = sc.parallelize(List("Venu", "Goutami", "Venkat", "Guna", "Anu"))
//val b = sc.parallelize(1 to a.size.toInt)
val b = sc.parallelize(List(44,55,66,44,33))
val c = a.zip(b) // zip pairs the two RDDs element-wise into an RDD of (key, value) tuples

c.sortByKey(true).collect
c.sortByKey(false).collect

val rdd = sc.parallelize(Seq(("math", 55),("Telugu", 56),("english", 57),


("Computer", 58),("social", 59),("science", 54)))
rdd.collect()
val sorted1 = rdd.sortByKey()
sorted1.collect()
val sorted2 = rdd.sortByKey(false)
sorted2.collect()
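
The keys here are Strings, and String ordering is case-sensitive (uppercase sorts before lowercase), so the ascending sort should come out roughly as:

// Array((Computer,58), (Telugu,56), (english,57), (math,55), (science,54), (social,59))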

Join (Spark function)


..........................................
Joins two pair RDDs on their keys; it operates on (key, value) tuples.
val a = List(1,2,3)
val b = List(4,5,6)
val c = a.join(b) // error: join is not defined on Scala Lists

val a = Array(1,2,3)
val b = Array(4,5,6)
val c = a.join(b) // error: not on Arrays either; join is an RDD operation

val rdd1 = sc.parallelize(Seq( ("Tendulkar", 55), ("Dravid", 56), ("Gambeer",


57), ("Dhoni", 58), ("Yuvaraj", 59), ("Pathan", 54)))

val rdd2 = sc.parallelize(Seq( ("Tendulkar", 60), ("Dravid", 15), ("Dhoni",


31), ("Yusaf", 62), ("Zaheer", 63),("Parthiv", 64)))
val joined = rdd1.join(rdd2)

joined.collect()
val leftJoined = rdd1.leftOuterJoin(rdd2)
leftJoined.collect()
val rightJoined = rdd1.rightOuterJoin(rdd2)

rightJoined.collect()
val fullJoined = rdd1.fullOuterJoin(rdd2).collect
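
The result shapes differ: join keeps only keys present in both RDDs and yields (key, (v1, v2)), while the outer joins wrap the possibly-missing side in an Option. With the data above, for example:

// join: (Tendulkar,(55,60)) -- only keys present in both
// leftOuterJoin: (Pathan,(54,None)) -- right side is an Option
// rightOuterJoin: (Zaheer,(None,63)) -- left side is an Option
// fullOuterJoin: both sides are Options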

.........................
val x = (1 to 9).toArray
x.drop(2).take(2) // Array(3, 4): drop the first 2 elements, then take the next 2

cogroup (Spark function)
val rdd1 = sc.parallelize(Seq(("key1", 1),("key2", 2),("key1", 3)))
val rdd2 = sc.parallelize(Seq(("key1", 5),("key2", 4)))

val grouped = rdd1.cogroup(rdd2)
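
cogroup gathers the values from both RDDs per key into a pair of Iterables, so collecting should show roughly:

grouped.collect
// Array((key1,(CompactBuffer(1, 3),CompactBuffer(5))), (key2,(CompactBuffer(2),CompactBuffer(4))))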

Actions:
..............
take(n) (Scala and Spark)
Return an array with the first n elements of the dataset.
..............
val arr = Array(2,5,88,7,5,54,3) // a plain Scala array, not an RDD
arr.take(3)
val rdd1 = sc.parallelize(Seq( ("Tendulkar", 55), ("Dravid", 56), ("Gambeer",
57), ("Dhoni", 58), ("Yuvaraj", 59), ("Pathan", 54)))
rdd1.take(2)
.......................

first() (Spark function)


Return the first element of the dataset (similar to take(1)).
val arr = Array(2,5,88,7,5,54,3)
arr.head // Scala collections use head; first() is the Spark equivalent
val rdd1 = sc.parallelize(Seq( ("Tendulkar", 55), ("Dravid", 56), ("Gambeer",
57), ("Dhoni", 58), ("Yuvaraj", 59), ("Pathan", 54)))
rdd1.first()

..............

reduce(func) (Scala and Spark)


Aggregate the elements of the dataset using a function func (which takes two
arguments and returns one). The function should be commutative and associative so
that it can be computed correctly in parallel.

val x = 5 to 20
x.reduce(_+_)
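
Here _+_ folds the elements pairwise, so the result is 5 + 6 + ... + 20 = 200. The same action exists on RDDs:

sc.parallelize(5 to 20).reduce(_ + _) // 200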

collect() (Spark function)

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
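
A minimal sketch combining a transformation with collect (the data is made up for illustration):

val nums = sc.parallelize(1 to 100)
val small = nums.filter(_ % 10 == 0) // keep only multiples of 10
small.collect // Array(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)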
