
/////////////////////

groupBy (Spark function)


...............
val a = Array("Venu","Satya","Venkat")
val x = sc.parallelize(a)
val y = x.groupBy(word => word.charAt(0))
y.collect
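
groupBy returns (key, Iterable-of-values) pairs; with the input above, collect should return roughly the following (element order may vary across runs):

// Array((V,CompactBuffer(Venu, Venkat)), (S,CompactBuffer(Satya)))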

..........

map (Scala and Spark)


val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"))
val y = x.map(x => (x, 1))
y.collect

val l = List(1,2,3,4,5)
l.map( x => x*2 )

.................
flatten (Scala function)
val x = sc.parallelize(List("Venu", "satya", "goutami", "sumati", "supraja"))
val y = x.map(x => x.toUpperCase())
y.flatten // error: flatten exists on Scala collections, not on RDDs

...................
flatMap (Scala and Spark)
flatMap = map followed by flatten
val flat = x.flatMap(x => x.toUpperCase()) // RDD[Char]: each uppercased string is flattened into its characters
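
The more typical use of flatMap is producing zero or more output elements per input, for example splitting lines into words; a minimal sketch (the sample lines below are made up for illustration):

val lines = sc.parallelize(List("spark makes rdds", "rdds are resilient"))
val words = lines.flatMap(line => line.split(" "))
words.collect // Array(spark, makes, rdds, rdds, are, resilient)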

..........
filter (Scala and Spark)
Keeps only the elements for which the predicate returns true.
val data = List("Venu", "satya", "goutami", "sumati", "supraja")

val x = sc.parallelize(data)
val y = x.filter(x => x.contains("venu")) // keeps nothing: contains is case-sensitive
val y = x.filter(x => x.contains("Venu")) // keeps Venu
val y = x.filter(x => x.startsWith("s")) // keeps satya, sumati, supraja
val y = x.filter(x => !x.startsWith("s")) // keeps Venu, goutami

................
foreach

Executes a function on each element for its side effects; it does not return a result.


val c = sc.parallelize(List("pongal", "idly", "dosa") )
c.foreach(x => println(x + "s are yummy"))
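
Note that on a cluster, the println inside foreach runs on the executors, so the output lands in executor logs rather than on the driver console. To print at the driver, collect first (safe only when the data is small):

c.collect.foreach(x => println(x + "s are yummy"))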

/////////////////////////////

groupBy (Scala and Spark)


.....................

val z = (1 to 9).toArray
z.groupBy(x => { if (x % 2 == 0) "even" else "odd" })

def myfunc(a: Int): Int = {
  a % 2
}
z.groupBy(myfunc) // groups by the remainder: 0 (even) and 1 (odd)
// or equivalently
z.groupBy(x => myfunc(x))
z.groupBy(myfunc(_))

val a = sc.parallelize(Array("dog", "tiger", "lion", "cat", "spider", "eagle"))


val b = a.groupBy(x=>x.charAt(0))

groupByKey (Spark function)
............... (works on pair RDDs, i.e. (key, value) tuples)
val a = sc.parallelize(Array("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.map(x => (x, x.length))

b.groupByKey().map(t => (t._1, t._2.sum)).collect

val words = Array("one", "two", "two", "three", "three", "three")


val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum)).collect()

reduceByKey(Spark function)
.....................
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD.reduceByKey((a,b)=> (a+b)).collect()
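
For aggregations like this word count, reduceByKey is usually preferred over groupByKey: it combines values locally on each partition before shuffling, while groupByKey ships every (word, 1) pair across the network. Both snippets above produce the same counts, e.g. Array((two,2), (one,1), (three,3)) (ordering may vary).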

distinct (Scala and Spark)
..............................
val x = Array(3,44,3,44,33,4,66)
x.distinct

val rdd = sc.parallelize(List((1,20), (1,21), (1,20), (2,20), (2,22), (2,20), (3,21), (3,22)))
rdd.distinct().collect()
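
distinct removes repeated elements, so the duplicated (1,20) and (2,20) pairs collapse; the collect above should return, in some order:

// Array((1,20), (1,21), (2,20), (2,22), (3,21), (3,22))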

union (Scala and Spark)


..........................
On Scala collections, union is also available as the ++ operator.

val x = Array(1,3,44,44,3,33,9)
val y = Array(22,3,8)
val z = x++y
val z = x.union(y)

val rdd1 = sc.parallelize(Array((1, "Aug", 30), (1, "Sep", 31), (2, "Aug", 15), (2, "Sep", 10)))
val rdd2 = sc.parallelize(Array((1, "Oct", 10), (1, "Nov", 12), (2, "Oct", 5), (2, "Nov", 15)))
rdd1.union(rdd2).collect
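
Unlike SQL's UNION, RDD union keeps duplicates; if duplicates should be dropped, chain distinct:

rdd1.union(rdd2).distinct().collect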

sortBy (Spark function)

It can sort by either key or value: the key function you pass decides what the ordering is based on.
..................

val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))
val y1 = sc.parallelize(Array('a', 'b', 'g', 'r', 'o', 'q'))
y1.sortBy(c => c, true).collect
y1.sortBy(c => c, false).collect
y.sortBy(c => c, true).collect
y.sortBy(c => c, false).collect

val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z", 1), ("L", 5)));


z.sortBy(c => c._1, true).collect;
z.sortBy(c => c._2, true).collect;

sortByKey (Spark function)


It sorts a pair RDD by key; the values play no part in the ordering.
....................
val a = sc.parallelize(List("Venu", "Goutami", "Venkat", "Guna", "Anu"))
//val b = sc.parallelize(1 to a.size.toInt)
val b = sc.parallelize(List(44,55,66,44,33))
val c = a.zip(b) // zip pairs the two RDDs element-wise into an RDD of (key, value) tuples

c.sortByKey(true).collect
c.sortByKey(false).collect

val rdd = sc.parallelize(Seq(("math", 55),("Telugu", 56),("english", 57),


("Computer", 58),("social", 59),("science", 54)))
rdd.collect()
val sorted1 = rdd.sortByKey()
sorted1.collect()
val sorted2 = rdd.sortByKey(false)
sorted2.collect()
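
The keys here are Strings, and String ordering is case-sensitive (uppercase sorts before lowercase), so the ascending sort should come out roughly as:

// Array((Computer,58), (Telugu,56), (english,57), (math,55), (science,54), (social,59))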

Join (Spark function)


..........................................
Joins two pair RDDs on their keys; it operates on (key, value) tuples.
val a = List(1,2,3)
val b = List(4,5,6)
val c = a.join(b) // error: join is not defined on Scala Lists

val a = Array(1,2,3)
val b = Array(4,5,6)
val c = a.join(b) // error: not on Arrays either; join is an RDD operation

val rdd1 = sc.parallelize(Seq( ("Tendulkar", 55), ("Dravid", 56), ("Gambeer",


57), ("Dhoni", 58), ("Yuvaraj", 59), ("Pathan", 54)))

val rdd2 = sc.parallelize(Seq( ("Tendulkar", 60), ("Dravid", 15), ("Dhoni",


31), ("Yusaf", 62), ("Zaheer", 63),("Parthiv", 64)))
val joined = rdd1.join(rdd2)

joined.collect()
val leftJoined = rdd1.leftOuterJoin(rdd2)
leftJoined.collect()
val rightJoined = rdd1.rightOuterJoin(rdd2)

rightJoined.collect()
val fullJoined = rdd1.fullOuterJoin(rdd2).collect
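
The result shapes differ: join keeps only keys present in both RDDs and yields (key, (v1, v2)), while the outer joins wrap the possibly-missing side in an Option. With the data above, for example:

// join: (Tendulkar,(55,60)) -- only keys present in both
// leftOuterJoin: (Pathan,(54,None)) -- right side is an Option
// rightOuterJoin: (Zaheer,(None,63)) -- left side is an Option
// fullOuterJoin: both sides are Options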

.........................
val x = (1 to 9).toArray
x.drop(2).take(2) // Array(3, 4): drop the first 2 elements, then take the next 2

cogroup (Spark function)
val rdd1 = sc.parallelize(Seq(("key1", 1),("key2", 2),("key1", 3)))
val rdd2 = sc.parallelize(Seq(("key1", 5),("key2", 4)))

val grouped = rdd1.cogroup(rdd2)
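
cogroup gathers the values from both RDDs per key into a pair of Iterables, so collecting should show roughly:

grouped.collect
// Array((key1,(CompactBuffer(1, 3),CompactBuffer(5))), (key2,(CompactBuffer(2),CompactBuffer(4))))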

Actions:
..............
take(n) (Scala and Spark)
Return an array with the first n elements of the dataset.
..............
val arr = Array(2,5,88,7,5,54,3) // a plain Scala array, not an RDD
arr.take(3)
val rdd1 = sc.parallelize(Seq( ("Tendulkar", 55), ("Dravid", 56), ("Gambeer",
57), ("Dhoni", 58), ("Yuvaraj", 59), ("Pathan", 54)))
rdd1.take(2)
.......................

first() (Spark function)


Return the first element of the dataset (similar to take(1)).
val arr = Array(2,5,88,7,5,54,3)
arr.head // Scala collections use head; first() is the Spark equivalent
val rdd1 = sc.parallelize(Seq( ("Tendulkar", 55), ("Dravid", 56), ("Gambeer",
57), ("Dhoni", 58), ("Yuvaraj", 59), ("Pathan", 54)))
rdd1.first()

..............

reduce(func) (Scala and Spark)


Aggregate the elements of the dataset using a function func (which takes two
arguments and returns one). The function should be commutative and associative so
that it can be computed correctly in parallel.

val x = 5 to 20
x.reduce(_+_)
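
Here _+_ folds the elements pairwise, so the result is 5 + 6 + ... + 20 = 200. The same action exists on RDDs:

sc.parallelize(5 to 20).reduce(_ + _) // 200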

collect() (Spark function)

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
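
A minimal sketch combining a transformation with collect (the data is made up for illustration):

val nums = sc.parallelize(1 to 100)
val small = nums.filter(_ % 10 == 0) // keep only multiples of 10
small.collect // Array(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)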
