
NETS 212: Scalable and Cloud Computing
Graph algorithms in MapReduce

October 15, 2013

© 2013 A. Haeberlen, Z. Ives
University of Pennsylvania

Announcements

No class on October 22nd or 24th
  Andreas at IMC in Barcelona
  Please work on HW2 and HW3 (will be released this week)

Special 'catch-up' class on October 30th
  4:30-6:00pm, location TBA

Any questions about HW2?
  If you haven't started yet: Please start early!!!

What we have seen so far

In the first half of the semester, we saw how the map/reduce model could be used to filter, collect, and aggregate data values
  We could extract pieces of input data items and collect them to run various reduce operations
  We could join two different data sets on a common key

This is useful for data with limited structure

Beyond average/sum/count

Much of the world is a network of relationships and shared features
  Members of a social network can be friends, and may have shared interests / memberships / etc.
  Customers might view similar movies, and might even be clustered by interest groups
  The Web consists of documents with links
  Documents are also related by topics, words, authors, etc.

Goal: Develop a toolbox

We need a toolbox of algorithms useful for analyzing data that has both relationships and properties

For the next ~2 lectures we'll start to build this toolbox
  Some of the problems are studied in courses you may not have taken yet:
    CIS 320 (algorithms), CIS 391/520 (AI), CIS 455 (Web Systems)
  So we'll see both the traditional solution and the MapReduce one

Plan for today

Representing data in graphs   ← NEXT
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes

Thinking about related objects

[Figure: a small social graph - the people Alice, Sunita, Mikhail, and Jose, connected to each other by "friend-of" edges and to the pages Facebook and Magna Carta by "fan-of" edges. Images by Jojo Mendoza, Creative Commons license]

We can represent related objects as a labeled, directed graph
Entities are typically represented as nodes; relationships are typically edges
  Nodes all have IDs, and possibly other properties
  Edges typically have values, possibly IDs and other properties

Encoding the data in a graph


[Figure: the same graph, reduced to just its nodes - Facebook, Mikhail, Magna Carta, Alice, Sunita, Jose]

Recall the basic definition of a graph:
  G = (V, E), where V is a set of vertices and E is a set of edges of the form (v1, v2) with v1, v2 ∈ V

Assume we only care about connected vertices
  Then we can capture a graph simply as the set of its edges

Graph encodings: Set of edges

[Figure: the graph from the previous slide - Facebook, Mikhail, Magna Carta, Alice, Sunita, Jose]

(Alice, Facebook)
(Alice, Sunita)
(Jose, Magna Carta)
(Jose, Sunita)
(Mikhail, Facebook)
(Mikhail, Magna Carta)
(Sunita, Facebook)
(Sunita, Alice)
(Sunita, Jose)

Graph encodings: Adding edge types

[Figure: the same graph, with each edge labeled "fan-of" or "friend-of"]

(Alice, fan-of, Facebook)
(Alice, friend-of, Sunita)
(Jose, fan-of, Magna Carta)
(Jose, friend-of, Sunita)
(Mikhail, fan-of, Facebook)
(Mikhail, fan-of, Magna Carta)
(Sunita, fan-of, Facebook)
(Sunita, friend-of, Alice)
(Sunita, friend-of, Jose)

Graph encodings: Adding weights

[Figure: the same labeled graph, now with a weight on each edge]

(Alice, fan-of, 0.5, Facebook)
(Alice, friend-of, 0.9, Sunita)
(Jose, fan-of, 0.5, Magna Carta)
(Jose, friend-of, 0.3, Sunita)
(Mikhail, fan-of, 0.8, Facebook)
(Mikhail, fan-of, 0.7, Magna Carta)
(Sunita, fan-of, 0.7, Facebook)
(Sunita, friend-of, 0.9, Alice)
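As a concrete illustration (not part of the slides), the three encodings above can be written down directly as tuples; the node and relationship names below are just the ones from the running example, and the adjacency-list view at the end is the per-node layout the MapReduce formulations later in the lecture will assume.

# Hypothetical sketch: the three graph encodings as plain Python tuples.
# Set of edges: (source, target)
edges = [("Alice", "Facebook"), ("Alice", "Sunita"), ("Sunita", "Jose")]

# Typed edges: (source, relationship, target)
typed_edges = [("Alice", "fan-of", "Facebook"), ("Alice", "friend-of", "Sunita")]

# Weighted, typed edges: (source, relationship, weight, target)
weighted_edges = [("Alice", "fan-of", 0.5, "Facebook"),
                  ("Alice", "friend-of", 0.9, "Sunita")]

# An equivalent adjacency-list view: one record per node, listing its outgoing edges
adjacency = {}
for src, rel, w, dst in weighted_edges:
    adjacency.setdefault(src, []).append((dst, rel, w))
print(adjacency["Alice"])   # [('Facebook', 'fan-of', 0.5), ('Sunita', 'friend-of', 0.9)]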

Recap: Related objects

We can represent the relationships between related objects as a directed, labeled graph
  Vertices represent the objects
  Edges represent relationships

We can annotate this graph in various ways
  Add labels to edges to distinguish different types
  Add weights to edges
  ...

We can encode the graph in various ways (e.g., as a set of edges)

Plan for today

Representing data in graphs
Graph algorithms in MapReduce   ← NEXT
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes

A computation model for graphs

[Figure: the weighted, labeled social graph from the previous slides]

Once the data is encoded in this way, we can perform various computations on it
  Simple example: Which users are their friends' best friend?
  More complicated examples (later): PageRank, adsorption, ...

This is often done by annotating the vertices with additional information, and then pushing that information along the edges

A computation model for graphs

[Figure: the same graph; the parts that are irrelevant to the question are about to be discarded]

Example: Am I my friends' best friend?
  Slightly more technical: How many of my friends have me as their best friend?

Step #1: Discard irrelevant vertices and edges

A computation model for graphs

[Figure: the reduced graph - Alice, Sunita, Jose, and Mikhail, with friend-of edges Alice-Sunita (0.9) and Jose-Sunita (0.3)]

Vertex annotations: alice→sunita: 0.9, sunita→alice: 0.9, sunita→jose: 0.3, jose→sunita: 0.3

Example: Am I my friends' best friend?
  Step #1: Discard irrelevant vertices and edges
  Step #2: Annotate each vertex with list of friends
  Step #3: Push annotations along each edge

A computation model for graphs

[Figure: the reduced graph, with the annotations now pushed across each friend-of edge]

Vertex annotations: alice→sunita: 0.9, sunita→alice: 0.9, sunita→jose: 0.3, jose→sunita: 0.3
  After the push, each vertex also holds a copy of its neighbors' annotations (e.g., Alice now has alice→sunita: 0.9, sunita→alice: 0.9, and sunita→jose: 0.3)

Example: Am I my friends' best friend?
  Step #1: Discard irrelevant vertices and edges
  Step #2: Annotate each vertex with list of friends
  Step #3: Push annotations along each edge

A computation model for graphs

[Figure: the same annotated graph; each vertex now inspects the annotations it has received]

Example: Am I my friends' best friend?
  Step #1: Discard irrelevant vertices and edges
  Step #2: Annotate each vertex with list of friends
  Step #3: Push annotations along each edge
  Step #4: Determine result at each vertex

Can we do this in MapReduce?
Using the adjacency list representation?

map(key: node, value: [<otherNode, relType, strength>])
{
  ...
}
reduce(key: ________, values: list of _________)
{
  ...
}

Can we do this in MapReduce?
Using the single-edge data representation?

map(key: node, value: <otherNode, relType, strength>)
{
  ...
}
reduce(key: ________, values: list of _________)
{
  ...
}
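Since the slides leave the map and reduce bodies as a fill-in exercise, here is one possible (unofficial) way to complete the single-edge version, as a minimal Python sketch rather than a Hadoop job. The tiny simulator and the helper names (run, map1, reduce1, ...) are mine, and it assumes friendships appear in both directions; it answers the "how many of my friends have me as their best friend?" variant.

from collections import defaultdict

# Single-edge representation: (node, (otherNode, relType, strength))
records = [
    ("Alice",  ("Sunita", "friend-of", 0.9)),
    ("Sunita", ("Alice",  "friend-of", 0.9)),
    ("Sunita", ("Jose",   "friend-of", 0.3)),
    ("Jose",   ("Sunita", "friend-of", 0.3)),
]

def run(map_fn, reduce_fn, records):
    """Tiny single-machine MapReduce simulator: the 'shuffle' groups by key."""
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    out = []
    for k, vs in groups.items():
        out.extend(reduce_fn(k, vs))
    return out

# Pass 1: each node picks its best (highest-strength) friend.
def map1(node, edge):
    other, rel, strength = edge
    if rel == "friend-of":              # discard irrelevant edges (Step #1)
        yield node, (other, strength)

def reduce1(node, friends):
    best = max(friends, key=lambda f: f[1])[0]
    yield best, node                    # "node's best friend is 'best'"

# Pass 2: count, per person, how many friends chose them.
def map2(person, chooser):
    yield person, chooser

def reduce2(person, choosers):
    yield person, len(choosers)

votes = run(map1, reduce1, records)
print(run(map2, reduce2, votes))        # e.g. [('Sunita', 2), ('Alice', 1)]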

A real-world use case

A variant that is actually used in social networks today: "Who are the friends of multiple of my friends?"
  Where have you seen this before?
  Friend recommendation! Maybe these people should be my friends too!

Generalizing

Now suppose we want to go beyond direct friend relationships
  Example: How many of my friends' friends (distance-2 neighbors) have me as their best friend's best friend?
  What do we need to do?
  How about distance k > 2?

To compute the answer, we need to run multiple iterations of MapReduce!

Iterative MapReduce

The basic model:

  copy files from input dir → staging dir 1
  (optional: do some preprocessing)
  while (!terminating condition) {
    map from staging dir 1
    reduce into staging dir 2
    move files from staging dir 2 → staging dir 1
  }
  (optional: postprocessing)
  move files from staging dir 2 → output dir

Note that the reduce output must be compatible with the map input!
  What can happen if we filter out some information in the mapper or in the reducer?
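As a loose illustration (not from the slides), the loop above might look like this in a driver script; the directory names, run_job, and converged are placeholders, and on a real cluster run_job would submit a Hadoop job rather than call a local function.

import os, shutil

def iterative_mapreduce(input_dir, output_dir, run_job, converged, max_iters=50):
    """Sketch of the basic iterative-MapReduce driver loop."""
    staging1, staging2 = "staging_dir_1", "staging_dir_2"
    shutil.copytree(input_dir, staging1)           # copy files from input dir -> staging dir 1
    for _ in range(max_iters):                     # while (!terminating condition)
        run_job(src=staging1, dst=staging2)        #   map from staging dir 1, reduce into staging dir 2
        done = converged(staging1, staging2)       #   e.g. "no node's value changed in this round"
        shutil.rmtree(staging1)
        os.rename(staging2, staging1)              #   output of stage i becomes input of stage i+1
        if done:
            break
    shutil.move(staging1, output_dir)              # move result -> output dir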

Graph algorithms and MapReduce

A centralized algorithm typically traverses a tree or a graph one item at a time (there's only one cursor)
  You've learned breadth-first and depth-first traversals

Most algorithms that are based on graphs make use of multiple map/reduce stages, processing one "wave" at a time
  Sometimes iterative MapReduce, other times chains of map/reduce

Recap: MapReduce on graphs

Suppose we want to:
  compute a function for each vertex in a graph...
  ...using data from vertices at most k hops away

We can do this as follows:
  "Push" information along the edges
  "Think like a vertex"
  Finally, perform the computation at each vertex

May need more than one MapReduce phase
  Iterative MapReduce: outputs of stage i → inputs of stage i+1

Plan for today

Representing data in graphs
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms   ← NEXT
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes

Path-based algorithms

Sometimes our goal is to compute information about the paths (or sets of paths) between nodes
  Edges may be annotated with cost, distance, or similarity

Examples of such problems (see CIS 121+320):
  Shortest path from one node to another
  Minimum spanning tree (minimal-cost tree connecting all vertices in a graph)
  Steiner tree (minimal-cost tree connecting certain nodes)

Single-Source Shortest Path (SSSP)

Given a directed graph G = (V, E) in which each edge e has a cost c(e):
Compute the cost of reaching each node from the source node s in the most efficient way (potentially after multiple 'hops')

[Figure: example graph with source s and nodes a, b, c, d; edge costs include s→a = 10 and s→c = 5, and all node distances are still unknown ("?")]

SSSP: Intuition

We can formulate the problem using induction
  The shortest path follows the principle of optimality: the last step (u,v) makes use of the shortest path to u

We can express this as follows:

bestDistanceAndPath(v) {
  if (v == source) then {
    return <distance 0, path [v]>
  } else {
    find argmin_u (bestDistanceAndPath[u] + dist[u,v])
    return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
  }
}

SSSP: CIS 320-style solution

Traditional approach: Dijkstra's algorithm

V: vertices, E: edges, S: start node

foreach v in V                          // initialize length and last step
  dist_S_To[v] := infinity              // of path to default values
  predecessor[v] := nil
spSet := {}
Q := V

while (Q not empty) do
  u := Q.removeNodeClosestTo(S)
  spSet := spSet + {u}
  foreach v in V where (u,v) in E       // update length and path based on
    if (dist_S_To[v] > dist_S_To[u] + cost(u,v)) then    // edges radiating from u
      dist_S_To[v] := dist_S_To[u] + cost(u,v)
      predecessor[v] := u
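For reference, here is a small runnable version of the same idea (my own sketch, not from the slides), using a priority queue in place of removeNodeClosestTo. The edge costs below are hypothetical, chosen only to be consistent with the trace on the following slides.

import heapq

def dijkstra(graph, source):
    """graph: {node: {neighbor: cost}}; returns (dist, predecessor)."""
    dist = {v: float("inf") for v in graph}
    pred = {v: None for v in graph}
    dist[source] = 0
    queue = [(0, source)]
    visited = set()                       # corresponds to spSet
    while queue:
        d, u = heapq.heappop(queue)       # node in Q closest to the source
        if u in visited:
            continue
        visited.add(u)
        for v, cost in graph[u].items():  # relax edges radiating from u
            if dist[v] > d + cost:
                dist[v] = d + cost
                pred[v] = u
                heapq.heappush(queue, (dist[v], v))
    return dist, pred

# Hypothetical edge costs, consistent with the example trace on the next slides:
graph = {"s": {"a": 10, "c": 5},
         "a": {"b": 1, "c": 2},
         "b": {"d": 4},
         "c": {"a": 3, "b": 9, "d": 2},
         "d": {"b": 6, "s": 7}}
print(dijkstra(graph, "s")[0])   # {'s': 0, 'a': 8, 'b': 9, 'c': 5, 'd': 7}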

SSSP: Dijkstra in Action (example from CLR, 2nd ed., p. 528)

[Figure: the example graph; s is about to be selected first]

Q = {s,a,b,c,d}
spSet = {}
dist_S_To: {(a,∞), (b,∞), (c,∞), (d,∞)}
predecessor: {(a,nil), (b,nil), (c,nil), (d,nil)}

SSSP: Dijkstra in Action (cont.)

Q = {a,b,c,d}
spSet = {s}
dist_S_To: {(a,10), (b,∞), (c,5), (d,∞)}
predecessor: {(a,s), (b,nil), (c,s), (d,nil)}

SSSP: Dijkstra in Action (cont.)

Q = {a,b,d}
spSet = {c,s}
dist_S_To: {(a,8), (b,14), (c,5), (d,7)}
predecessor: {(a,c), (b,c), (c,s), (d,c)}

SSSP: Dijkstra in Action (cont.)

Q = {a,b}
spSet = {c,d,s}
dist_S_To: {(a,8), (b,13), (c,5), (d,7)}
predecessor: {(a,c), (b,d), (c,s), (d,c)}

SSSP: Dijkstra in Action (cont.)

Q = {b}
spSet = {a,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}

SSSP: Dijkstra in Action (cont.)

Q = {}
spSet = {a,b,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}

SSSP: How to parallelize?

Dijkstra traverses the graph along a single route at a time, prioritizing its traversal to the next step based on total path length (and avoiding cycles)
  No real parallelism to be had here!

Intuitively, we want something that radiates from the origin, one edge-hop distance at a time
  Each step outwards can be done in parallel, before another iteration occurs - or we are done
  Recall our earlier discussion: Scalability depends on the algorithm, not (just) on the problem!

SSSP: Revisiting the inductive definition

bestDistanceAndPath(v) {
  if (v == source) then {
    return <distance 0, path [v]>
  } else {
    find argmin_u (bestDistanceAndPath[u] + dist[u,v])
    return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
  }
}

Dijkstra's algorithm carefully considered each u in a way that allowed us to prune certain points
Instead, we can look at all potential u's for each v
  Compute iteratively, by keeping a "frontier" set of nodes i edge-hops from the source

SSSP: MapReduce formulation

init:
  For each node: node ID → <∞, -, {<succ-node-ID, edge-cost>}>

map:
  take node ID → <dist, next, {<succ-node-ID, edge-cost>}>
    (the shortest path we have found so far from the source to node ID has length dist,
     next is the next hop on that path, and {<succ-node-ID, edge-cost>} is the adjacency list for node ID)
  For each succ-node-ID:
    emit succ-node-ID → <node ID, dist + edge-cost>
      (a new path from the source to succ-node-ID that we just discovered - not necessarily the shortest)
  emit node ID → dist, {<succ-node-ID, edge-cost>}
      (Why is this necessary?)

reduce:
  distance := min cost from a predecessor; next := that predecessor
  emit node ID → <distance, next, {<succ-node-ID, edge-cost>}>
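As an unofficial illustration, the map and reduce steps above can be simulated in a few lines of Python. The record layout (distance, next hop, adjacency list) follows the slide; the function names and the edge costs (the same hypothetical ones as in the Dijkstra sketch earlier) are mine.

from collections import defaultdict

INF = float("inf")

# node ID -> (dist, next hop, adjacency list)
nodes = {"s": (0,   None, [("a", 10), ("c", 5)]),
         "a": (INF, None, [("b", 1),  ("c", 2)]),
         "b": (INF, None, [("d", 4)]),
         "c": (INF, None, [("a", 3),  ("b", 9), ("d", 2)]),
         "d": (INF, None, [("b", 6),  ("s", 7)])}

def sssp_map(node, value):
    dist, nxt, adj = value
    yield node, ("adj", adj)                            # pass the adjacency list through
    yield node, ("path", dist, nxt)                     # re-emit the best path found so far
    if dist < INF:
        for succ, cost in adj:
            yield succ, ("path", dist + cost, node)     # newly discovered path to succ

def sssp_reduce(node, values):
    adj = next(v[1] for v in values if v[0] == "adj")
    dist, nxt = min((v[1], v[2]) for v in values if v[0] == "path")
    return node, (dist, nxt, adj)

def one_iteration(nodes):
    groups = defaultdict(list)
    for node, value in nodes.items():
        for key, val in sssp_map(node, value):
            groups[key].append(val)
    return dict(sssp_reduce(k, vs) for k, vs in groups.items())

for _ in range(4):                                      # iterate until no distance changes
    nodes = one_iteration(nodes)
print({n: v[0] for n, v in nodes.items()})              # {'s': 0, 'a': 8, 'b': 9, 'c': 5, 'd': 7}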

Iteration 0: Base case

mapper: (a,<s,10>) (c,<s,5>) + edges
reducer: (a,<10, ...>) (c,<5, ...>)

[Figure: the "wave" has reached a and c; s has distance 0]

Iteration 1

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) + edges
reducer: (a,<8, ...>) (c,<5, ...>) (b,<11, ...>) (d,<7, ...>)

[Figure: the "wave" has reached b and d]

Iteration 2

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>) + edges
reducer: (a,<8>) (c,<5>) (b,<11>) (d,<7>)

[Figure: the current best distance is shown at each node]

Iteration 3

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>) + edges
reducer: (a,<8>) (c,<5>) (b,<11>) (d,<7>)

No change! Convergence!

Question: If a vertex's path cost is the same in two consecutive rounds, can we be sure that this vertex has converged?

Summary: SSSP

Path-based algorithms typically involve iterative map/reduce
They are typically formulated in a way that traverses in "waves" or stages, like breadth-first search
  This allows for parallelism
  They need a way to test for convergence

Example: Single-source shortest path (SSSP)
  The original Dijkstra formulation is hard to parallelize
  But we can make it work with the "wave" approach

Plan for today

Representing data in graphs
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering   ← NEXT
  Classification with Naïve Bayes

Learning (clustering /
classification)

Sometimes our goal is to take a set of entities, possibly related, and group them
  If the groups are based on similarity, we call this clustering
  If the groups are based on putting them into a semantically meaningful class, we call this classification

Both are instances of machine learning

The k-clustering Problem

[Figure: items plotted in a two-dimensional feature space (age vs. expenses), grouped into clusters]

Given: A set of items in an n-dimensional feature space
  Example: data points from a survey, people in a social network

Goal: Group the items into k clusters
  What would be a 'good' set of clusters?

Approach: k-Means

Let m1, m2, ..., mk be representative points for each of our k clusters
  Specifically: the centroid of the cluster

Initialize m1, m2, ..., mk to random values in the data
For t = 1, 2, ...:
  Map each observation to the closest mean:
    S_i^{(t)} = \{ x_j : \| x_j - m_i^{(t)} \| \le \| x_j - m_{i^*}^{(t)} \| \text{ for all } i^* = 1, \ldots, k \}
  Assign each m_i to be the new centroid of its set:
    m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j

A simple example (1/4)

[Figure: six points in the age/expenses plane]
Points: (10,10), (11,16), (15,12), (18,20), (20,21), (30,21)

A simple example (2/4)

[Figure: the same six points, with two of them marked as the randomly chosen initial centers]

A simple example (3/4)

[Figure: after the first iteration, the centroids move to (12.5, 11) and (19.75, 19.5)]
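To connect this figure back to the update formula from a few slides ago: each new centroid is just the mean of the points currently assigned to it. For this example (assuming the upper cluster holds the four points (11,16), (18,20), (20,21), (30,21) and the lower cluster the remaining two), the arithmetic works out to

m_{upper}^{(1)} = \frac{1}{4}\left[(11,16)+(18,20)+(20,21)+(30,21)\right] = (19.75,\ 19.5)

m_{lower}^{(1)} = \frac{1}{2}\left[(10,10)+(15,12)\right] = (12.5,\ 11)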

A simple example (4/4)

[Figure: after the second iteration, the centroids are (12, 12.67) and (22.67, 20.67)]
Stable!

k-Means in MapReduce

Map #1:
  Input: node ID → <position, centroid ID, [centroid IDs and positions]>
  Compute the nearest centroid; emit centroid ID → <node ID, position>

Reduce #1:
  Recompute the centroid position from the positions of the nodes in it
  Emit centroidID → <node IDs, positions>, and for all other centroid IDs, emit otherCentroidID → centroid(centroidID, X, Y)

Map #2:
  Pass through values to Reducer #2

Reduce #2:
  For each node in the current centroid, emit
    node ID → <position, centroid ID, [centroid IDs and positions]>
    (this is the input for the next map iteration; each node needs to know where all the centroids are)
  Also, emit <X, <centroid ID, position>>
    (this will be the 'result' - remember that we wanted the centroids!)

Repeat until no change
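An unofficial, single-machine sketch of one way to do this: it collapses the two jobs above into one simulated map/reduce pass per iteration (the driver redistributes the updated centroid list instead of Reduce #2), and uses the six example points with the initial centers that reproduce the centroid sequence on the example slides; those initial centers are my assumption.

from collections import defaultdict

points = [(10, 10), (11, 16), (15, 12), (18, 20), (20, 21), (30, 21)]
centroids = [(10, 10), (11, 16)]       # assumed initial centers (consistent with the example)

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_map(point, centroids):
    """Map #1: emit <nearest centroid ID, point>."""
    cid = min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))
    yield cid, point

def kmeans_reduce(cid, assigned):
    """Reduce #1: recompute the centroid from the points assigned to it."""
    n = len(assigned)
    return cid, (sum(x for x, _ in assigned) / n, sum(y for _, y in assigned) / n)

while True:
    groups = defaultdict(list)
    for p in points:
        for cid, point in kmeans_map(p, centroids):
            groups[cid].append(point)
    new = dict(kmeans_reduce(cid, pts) for cid, pts in groups.items())
    new_centroids = [new.get(i, centroids[i]) for i in range(len(centroids))]
    if new_centroids == centroids:     # "repeat until no change"
        break
    centroids = new_centroids

print(centroids)   # [(12.0, 12.666...), (22.666..., 20.666...)] - matches the "Stable!" slide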

Plan for today

Representing data in graphs
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes   ← NEXT

Classification

Suppose we want to learn what is spam (or interesting, or ...)
  Predefine a set of classes with semantic meaning
  Train an algorithm to look at data and assign a class
    Based on giving it some examples of data in each class and the sets of features they have

Many probabilistic techniques exist
  Each class has probabilistic relationships with others
    e.g., p(spam | isSentLocally), p(isSentLocally | fromBob), ...
    Typically represented as a graph(ical model)! See CIS 520
  But we'll focus on a simple, flat model: Naïve Bayes

A simple example

Suppose we just look at the keywords in the email's title:

  Message(1, "Won contract")
  Message(2, "Won award")
  Message(3, "Won the lottery")
  Message(4, "Unsubscribe")
  Message(5, "Millions of customers")
  Message(6, "Millions of dollars")

What is the probability that the message "Won millions" is spam?

Bayes' Theorem:
  p(spam | containsWon, containsMillions)
    = p(spam) p(containsWon, containsMillions | spam) / p(containsWon, containsMillions)

Classification using Naïve Bayes

Basic assumption: Probabilities of events are independent
  This is why it is called 'naïve'

Under this assumption:
  p(spam) p(containsWon, containsMillions | spam) / p(containsWon, containsMillions)
    = p(spam) p(containsWon | spam) p(containsMillions | spam) / ( p(containsWon) p(containsMillions) )
    = 0.5 * 0.67 * 0.33 / (0.5 * 0.33) = 0.67

So how do we train a learner (compute the above probabilities) using MapReduce?
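A small, unofficial sketch of this calculation over the six example messages; the spam/ham labels below are my assumption (the slides do not show them), chosen so that the counts reproduce the numbers above.

# Hypothetical training labels: messages 3, 5, 6 are spam; 1, 2, 4 are not.
messages = [
    ("Won contract",          False),
    ("Won award",             False),
    ("Won the lottery",       True),
    ("Unsubscribe",           False),
    ("Millions of customers", True),
    ("Millions of dollars",   True),
]

def p(word, spam_only=None):
    """Fraction of messages (optionally restricted by label) that contain the word."""
    pool = [m for m, s in messages if spam_only is None or s == spam_only]
    return sum(word in m.split() for m in pool) / len(pool)

p_spam = sum(s for _, s in messages) / len(messages)                       # 0.5
numerator   = p_spam * p("Won", spam_only=True) * p("Millions", spam_only=True)
denominator = p("Won") * p("Millions")
print(numerator / denominator)   # 0.666..., matching the 0.67 on the slide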

What do we need to train the learner?

p(spam)
  Count how many spam emails there are        (easy)
  Count the total number of emails            (easy)

p(containsXYZ | spam)
  Count how many spam emails contain XYZ      (1)

p(containsXYZ)
  Count how many emails contain XYZ overall   (2)
  Count the total number of emails            (easy)

Training a Naïve Bayes Learner

map 1:
  takes messageId → <class, {words}>
  emits <word, class> → 1
reduce 1:
  emits <word, class> → <count>
  (counts how many emails in the class contain the word - a modified WordCount; this is count (1) from the previous slide)

map 2:
  takes messageId → <class, {words}>
  emits word → 1
reduce 2:
  emits word → <totalCount>
  (counts how many emails contain the word overall - plain WordCount; this is count (2) from the previous slide)
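An unofficial end-to-end sketch of the two counting jobs, reusing the labelled message set from before (the labels remain my assumption). It runs the map and reduce steps as a sequential simulation; per-message duplicate words are collapsed with set() so each email is counted at most once per word, which is what the "contains" counts need.

from collections import Counter

# Hypothetical labelled training data: (messageId, (class, text))
training = [
    (1, ("ham",  "Won contract")),
    (2, ("ham",  "Won award")),
    (3, ("spam", "Won the lottery")),
    (4, ("ham",  "Unsubscribe")),
    (5, ("spam", "Millions of customers")),
    (6, ("spam", "Millions of dollars")),
]

per_class = Counter()   # job 1 (modified WordCount): emails of each class containing each word
overall = Counter()     # job 2 (WordCount): emails containing each word overall

for msg_id, (cls, text) in training:
    for word in set(text.split()):       # map: emit <word, class> -> 1 and word -> 1
        per_class[(word, cls)] += 1      # reduce 1: sum the 1s per <word, class>
        overall[word] += 1               # reduce 2: sum the 1s per word

n_total = len(training)
n_spam = sum(1 for _, (cls, _) in training if cls == "spam")

# The probabilities the learner needs:
p_spam = n_spam / n_total
p_won_given_spam = per_class[("Won", "spam")] / n_spam    # count (1) / #spam emails
p_won = overall["Won"] / n_total                          # count (2) / #emails
print(p_spam, p_won_given_spam, p_won)                    # 0.5 0.333... 0.5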

Summary: Learning and MapReduce

Clustering algorithms typically have multiple aggregation stages or iterations
  k-means clustering repeatedly computes centroids and maps items to them
  Fixpoint computation

Classification algorithms can be quite complex
  In general: we need to capture conditional probabilities
  Naïve Bayes assumes everything is independent
  Training is a matter of computing probability distributions
Stay tuned

Next time you will learn about:
  PageRank and Adsorption
