
NETS 212: Scalable and Cloud Computing
Graph algorithms in MapReduce

October 15, 2013

© 2013 A. Haeberlen, Z. Ives
University of Pennsylvania

Announcements

No class on October 22nd or 24th
  Andreas at IMC in Barcelona
  Please work on HW2 and HW3 (will be released this week)

Special 'catch-up' class on October 30th
  4:30-6:00pm, location TBA

Any questions about HW2?
  If you haven't started yet: Please start early!!!

What we have seen so far

In the first half of the semester, we saw how the map/reduce model could be used to filter, collect, and aggregate data values
  We could extract pieces of input data items and collect them to run various reduce operations
  We could join two different data sets on a common key

This is useful for data with limited structure

Beyond average/sum/count

Much of the world is a network of relationships and shared features
  Members of a social network can be friends, and may have shared interests / memberships / etc.
  Customers might view similar movies, and might even be clustered by interest groups
  The Web consists of documents with links
  Documents are also related by topics, words, authors, etc.

Goal: Develop a toolbox

We need a toolbox of algorithms useful for analyzing data that has both relationships and properties

For the next ~2 lectures we'll start to build this toolbox
  Some of the problems are studied in courses you may not have taken yet:
    CIS 320 (algorithms), CIS 391/520 (AI), CIS 455 (Web Systems)
  So we'll see both the traditional solution and the MapReduce one

Plan for today

Representing data in graphs   ← NEXT
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes

Thinking about related objects

[Figure: a small social graph - the people Alice, Sunita, Mikhail, and Jose, connected to each other by "friend-of" edges and to the pages Facebook and Magna Carta by "fan-of" edges. Images by Jojo Mendoza, Creative Commons license]

We can represent related objects as a labeled, directed graph
Entities are typically represented as nodes; relationships are typically edges
  Nodes all have IDs, and possibly other properties
  Edges typically have values, possibly IDs and other properties

Encoding the data in a graph


[Figure: the same graph, reduced to just its nodes - Facebook, Mikhail, Magna Carta, Alice, Sunita, Jose]

Recall the basic definition of a graph:
  G = (V, E), where V is a set of vertices and E is a set of edges of the form (v1, v2) with v1, v2 ∈ V

Assume we only care about connected vertices
  Then we can capture a graph simply as the set of its edges

Graph encodings: Set of edges

[Figure: the graph from the previous slide - Facebook, Mikhail, Magna Carta, Alice, Sunita, Jose]

(Alice, Facebook)
(Alice, Sunita)
(Jose, Magna Carta)
(Jose, Sunita)
(Mikhail, Facebook)
(Mikhail, Magna Carta)
(Sunita, Facebook)
(Sunita, Alice)
(Sunita, Jose)

Graph encodings: Adding edge types

[Figure: the same graph, with each edge labeled "fan-of" or "friend-of"]

(Alice, fan-of, Facebook)
(Alice, friend-of, Sunita)
(Jose, fan-of, Magna Carta)
(Jose, friend-of, Sunita)
(Mikhail, fan-of, Facebook)
(Mikhail, fan-of, Magna Carta)
(Sunita, fan-of, Facebook)
(Sunita, friend-of, Alice)
(Sunita, friend-of, Jose)

Graph encodings: Adding weights

[Figure: the same labeled graph, now with a weight on each edge]

(Alice, fan-of, 0.5, Facebook)
(Alice, friend-of, 0.9, Sunita)
(Jose, fan-of, 0.5, Magna Carta)
(Jose, friend-of, 0.3, Sunita)
(Mikhail, fan-of, 0.8, Facebook)
(Mikhail, fan-of, 0.7, Magna Carta)
(Sunita, fan-of, 0.7, Facebook)
(Sunita, friend-of, 0.9, Alice)
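As a concrete illustration (not part of the slides), the three encodings above can be written down directly as tuples; the node and relationship names below are just the ones from the running example, and the adjacency-list view at the end is the per-node layout the MapReduce formulations later in the lecture will assume.

# Hypothetical sketch: the three graph encodings as plain Python tuples.
# Set of edges: (source, target)
edges = [("Alice", "Facebook"), ("Alice", "Sunita"), ("Sunita", "Jose")]

# Typed edges: (source, relationship, target)
typed_edges = [("Alice", "fan-of", "Facebook"), ("Alice", "friend-of", "Sunita")]

# Weighted, typed edges: (source, relationship, weight, target)
weighted_edges = [("Alice", "fan-of", 0.5, "Facebook"),
                  ("Alice", "friend-of", 0.9, "Sunita")]

# An equivalent adjacency-list view: one record per node, listing its outgoing edges
adjacency = {}
for src, rel, w, dst in weighted_edges:
    adjacency.setdefault(src, []).append((dst, rel, w))
print(adjacency["Alice"])   # [('Facebook', 'fan-of', 0.5), ('Sunita', 'friend-of', 0.9)]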

Recap: Related objects

We can represent the relationships between related objects as a directed, labeled graph
  Vertices represent the objects
  Edges represent relationships

We can annotate this graph in various ways
  Add labels to edges to distinguish different types
  Add weights to edges
  ...

We can encode the graph in various ways (e.g., as a set of edges)

Plan for today

Representing data in graphs
Graph algorithms in MapReduce   ← NEXT
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes

A computation model for graphs

[Figure: the weighted, labeled social graph from the previous slides]

Once the data is encoded in this way, we can perform various computations on it
  Simple example: Which users are their friends' best friend?
  More complicated examples (later): PageRank, adsorption, ...

This is often done by annotating the vertices with additional information, and then pushing that information along the edges

A computation model for graphs

[Figure: the same graph; the parts that are irrelevant to the question are about to be discarded]

Example: Am I my friends' best friend?
  Slightly more technical: How many of my friends have me as their best friend?

Step #1: Discard irrelevant vertices and edges

A computation model for graphs

[Figure: the reduced graph - Alice, Sunita, Jose, and Mikhail, with friend-of edges Alice-Sunita (0.9) and Jose-Sunita (0.3)]

Vertex annotations: alice→sunita: 0.9, sunita→alice: 0.9, sunita→jose: 0.3, jose→sunita: 0.3

Example: Am I my friends' best friend?
  Step #1: Discard irrelevant vertices and edges
  Step #2: Annotate each vertex with list of friends
  Step #3: Push annotations along each edge

A computation model for graphs

[Figure: the reduced graph, with the annotations now pushed across each friend-of edge]

Vertex annotations: alice→sunita: 0.9, sunita→alice: 0.9, sunita→jose: 0.3, jose→sunita: 0.3
  After the push, each vertex also holds a copy of its neighbors' annotations (e.g., Alice now has alice→sunita: 0.9, sunita→alice: 0.9, and sunita→jose: 0.3)

Example: Am I my friends' best friend?
  Step #1: Discard irrelevant vertices and edges
  Step #2: Annotate each vertex with list of friends
  Step #3: Push annotations along each edge

A computation model for graphs

[Figure: the same annotated graph; each vertex now inspects the annotations it has received]

Example: Am I my friends' best friend?
  Step #1: Discard irrelevant vertices and edges
  Step #2: Annotate each vertex with list of friends
  Step #3: Push annotations along each edge
  Step #4: Determine result at each vertex

Can we do this in MapReduce?
Using the adjacency list representation?

map(key: node, value: [<otherNode, relType, strength>])
{
  ...
}
reduce(key: ________, values: list of _________)
{
  ...
}

Can we do this in MapReduce?
Using the single-edge data representation?

map(key: node, value: <otherNode, relType, strength>)
{
  ...
}
reduce(key: ________, values: list of _________)
{
  ...
}
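Since the slides leave the map and reduce bodies as a fill-in exercise, here is one possible (unofficial) way to complete the single-edge version, as a minimal Python sketch rather than a Hadoop job. The tiny simulator and the helper names (run, map1, reduce1, ...) are mine, and it assumes friendships appear in both directions; it answers the "how many of my friends have me as their best friend?" variant.

from collections import defaultdict

# Single-edge representation: (node, (otherNode, relType, strength))
records = [
    ("Alice",  ("Sunita", "friend-of", 0.9)),
    ("Sunita", ("Alice",  "friend-of", 0.9)),
    ("Sunita", ("Jose",   "friend-of", 0.3)),
    ("Jose",   ("Sunita", "friend-of", 0.3)),
]

def run(map_fn, reduce_fn, records):
    """Tiny single-machine MapReduce simulator: the 'shuffle' groups by key."""
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    out = []
    for k, vs in groups.items():
        out.extend(reduce_fn(k, vs))
    return out

# Pass 1: each node picks its best (highest-strength) friend.
def map1(node, edge):
    other, rel, strength = edge
    if rel == "friend-of":              # discard irrelevant edges (Step #1)
        yield node, (other, strength)

def reduce1(node, friends):
    best = max(friends, key=lambda f: f[1])[0]
    yield best, node                    # "node's best friend is 'best'"

# Pass 2: count, per person, how many friends chose them.
def map2(person, chooser):
    yield person, chooser

def reduce2(person, choosers):
    yield person, len(choosers)

votes = run(map1, reduce1, records)
print(run(map2, reduce2, votes))        # e.g. [('Sunita', 2), ('Alice', 1)]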

A real-world use case

A variant that is actually used in social networks today: "Who are the friends of multiple of my friends?"
  Where have you seen this before?
  Friend recommendation! Maybe these people should be my friends too!

Generalizing

Now suppose we want to go beyond direct friend relationships
  Example: How many of my friends' friends (distance-2 neighbors) have me as their best friend's best friend?
  What do we need to do?
  How about distance k > 2?

To compute the answer, we need to run multiple iterations of MapReduce!

Iterative MapReduce

The basic model:

  copy files from input dir → staging dir 1
  (optional: do some preprocessing)
  while (!terminating condition) {
    map from staging dir 1
    reduce into staging dir 2
    move files from staging dir 2 → staging dir 1
  }
  (optional: postprocessing)
  move files from staging dir 2 → output dir

Note that the reduce output must be compatible with the map input!
  What can happen if we filter out some information in the mapper or in the reducer?
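As a loose illustration (not from the slides), the loop above might look like this in a driver script; the directory names, run_job, and converged are placeholders, and on a real cluster run_job would submit a Hadoop job rather than call a local function.

import os, shutil

def iterative_mapreduce(input_dir, output_dir, run_job, converged, max_iters=50):
    """Sketch of the basic iterative-MapReduce driver loop."""
    staging1, staging2 = "staging_dir_1", "staging_dir_2"
    shutil.copytree(input_dir, staging1)           # copy files from input dir -> staging dir 1
    for _ in range(max_iters):                     # while (!terminating condition)
        run_job(src=staging1, dst=staging2)        #   map from staging dir 1, reduce into staging dir 2
        done = converged(staging1, staging2)       #   e.g. "no node's value changed in this round"
        shutil.rmtree(staging1)
        os.rename(staging2, staging1)              #   output of stage i becomes input of stage i+1
        if done:
            break
    shutil.move(staging1, output_dir)              # move result -> output dir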

Graph algorithms and MapReduce

A centralized algorithm typically traverses a tree or a graph one item at a time (there's only one cursor)
  You've learned breadth-first and depth-first traversals

Most algorithms that are based on graphs make use of multiple map/reduce stages, processing one "wave" at a time
  Sometimes iterative MapReduce, other times chains of map/reduce

Recap: MapReduce on graphs

Suppose we want to:
  compute a function for each vertex in a graph...
  ...using data from vertices at most k hops away

We can do this as follows:
  "Push" information along the edges
  "Think like a vertex"
  Finally, perform the computation at each vertex

May need more than one MapReduce phase
  Iterative MapReduce: outputs of stage i → inputs of stage i+1

Plan for today

Representing data in graphs
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms   ← NEXT
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes

Path-based algorithms

Sometimes our goal is to compute information about the paths (or sets of paths) between nodes
  Edges may be annotated with cost, distance, or similarity

Examples of such problems (see CIS 121+320):
  Shortest path from one node to another
  Minimum spanning tree (minimal-cost tree connecting all vertices in a graph)
  Steiner tree (minimal-cost tree connecting certain nodes)

Single-Source Shortest Path (SSSP)

Given a directed graph G = (V, E) in which each edge e has a cost c(e):
Compute the cost of reaching each node from the source node s in the most efficient way (potentially after multiple 'hops')

[Figure: example graph with source s and nodes a, b, c, d; edge costs include s→a = 10 and s→c = 5, and all node distances are still unknown ("?")]

SSSP: Intuition

We can formulate the problem using induction
  The shortest path follows the principle of optimality: the last step (u,v) makes use of the shortest path to u

We can express this as follows:

bestDistanceAndPath(v) {
  if (v == source) then {
    return <distance 0, path [v]>
  } else {
    find argmin_u (bestDistanceAndPath[u] + dist[u,v])
    return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
  }
}

SSSP: CIS 320-style solution

Traditional approach: Dijkstra's algorithm

V: vertices, E: edges, S: start node

foreach v in V                          // initialize length and last step
  dist_S_To[v] := infinity              // of path to default values
  predecessor[v] := nil
spSet := {}
Q := V

while (Q not empty) do
  u := Q.removeNodeClosestTo(S)
  spSet := spSet + {u}
  foreach v in V where (u,v) in E       // update length and path based on
    if (dist_S_To[v] > dist_S_To[u] + cost(u,v)) then    // edges radiating from u
      dist_S_To[v] := dist_S_To[u] + cost(u,v)
      predecessor[v] := u
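For reference, here is a small runnable version of the same idea (my own sketch, not from the slides), using a priority queue in place of removeNodeClosestTo. The edge costs below are hypothetical, chosen only to be consistent with the trace on the following slides.

import heapq

def dijkstra(graph, source):
    """graph: {node: {neighbor: cost}}; returns (dist, predecessor)."""
    dist = {v: float("inf") for v in graph}
    pred = {v: None for v in graph}
    dist[source] = 0
    queue = [(0, source)]
    visited = set()                       # corresponds to spSet
    while queue:
        d, u = heapq.heappop(queue)       # node in Q closest to the source
        if u in visited:
            continue
        visited.add(u)
        for v, cost in graph[u].items():  # relax edges radiating from u
            if dist[v] > d + cost:
                dist[v] = d + cost
                pred[v] = u
                heapq.heappush(queue, (dist[v], v))
    return dist, pred

# Hypothetical edge costs, consistent with the example trace on the next slides:
graph = {"s": {"a": 10, "c": 5},
         "a": {"b": 1, "c": 2},
         "b": {"d": 4},
         "c": {"a": 3, "b": 9, "d": 2},
         "d": {"b": 6, "s": 7}}
print(dijkstra(graph, "s")[0])   # {'s': 0, 'a': 8, 'b': 9, 'c': 5, 'd': 7}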

SSSP: Dijkstra in Action (example from CLR, 2nd ed., p. 528)

[Figure: the example graph; s is about to be selected first]

Q = {s,a,b,c,d}
spSet = {}
dist_S_To: {(a,∞), (b,∞), (c,∞), (d,∞)}
predecessor: {(a,nil), (b,nil), (c,nil), (d,nil)}

SSSP: Dijkstra in Action (cont.)

Q = {a,b,c,d}
spSet = {s}
dist_S_To: {(a,10), (b,∞), (c,5), (d,∞)}
predecessor: {(a,s), (b,nil), (c,s), (d,nil)}

SSSP: Dijkstra in Action (cont.)

Q = {a,b,d}
spSet = {c,s}
dist_S_To: {(a,8), (b,14), (c,5), (d,7)}
predecessor: {(a,c), (b,c), (c,s), (d,c)}

SSSP: Dijkstra in Action (cont.)

Q = {a,b}
spSet = {c,d,s}
dist_S_To: {(a,8), (b,13), (c,5), (d,7)}
predecessor: {(a,c), (b,d), (c,s), (d,c)}

SSSP: Dijkstra in Action (cont.)

Q = {b}
spSet = {a,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}

SSSP: Dijkstra in Action (cont.)

Q = {}
spSet = {a,b,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}

SSSP: How to parallelize?

Dijkstra traverses the graph along a single route at a time, prioritizing its traversal to the next step based on total path length (and avoiding cycles)
  No real parallelism to be had here!

Intuitively, we want something that radiates from the origin, one edge-hop distance at a time
  Each step outwards can be done in parallel, before another iteration occurs - or we are done
  Recall our earlier discussion: Scalability depends on the algorithm, not (just) on the problem!

SSSP: Revisiting the inductive definition

bestDistanceAndPath(v) {
  if (v == source) then {
    return <distance 0, path [v]>
  } else {
    find argmin_u (bestDistanceAndPath[u] + dist[u,v])
    return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
  }
}

Dijkstra's algorithm carefully considered each u in a way that allowed us to prune certain points
Instead, we can look at all potential u's for each v
  Compute iteratively, by keeping a "frontier" set of nodes i edge-hops from the source

SSSP: MapReduce formulation

init:
  For each node: node ID → <∞, -, {<succ-node-ID, edge-cost>}>

map:
  take node ID → <dist, next, {<succ-node-ID, edge-cost>}>
    (the shortest path we have found so far from the source to node ID has length dist,
     next is the next hop on that path, and {<succ-node-ID, edge-cost>} is the adjacency list for node ID)
  For each succ-node-ID:
    emit succ-node-ID → <node ID, dist + edge-cost>
      (a new path from the source to succ-node-ID that we just discovered - not necessarily the shortest)
  emit node ID → dist, {<succ-node-ID, edge-cost>}
      (Why is this necessary?)

reduce:
  distance := min cost from a predecessor; next := that predecessor
  emit node ID → <distance, next, {<succ-node-ID, edge-cost>}>
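As an unofficial illustration, the map and reduce steps above can be simulated in a few lines of Python. The record layout (distance, next hop, adjacency list) follows the slide; the function names and the edge costs (the same hypothetical ones as in the Dijkstra sketch earlier) are mine.

from collections import defaultdict

INF = float("inf")

# node ID -> (dist, next hop, adjacency list)
nodes = {"s": (0,   None, [("a", 10), ("c", 5)]),
         "a": (INF, None, [("b", 1),  ("c", 2)]),
         "b": (INF, None, [("d", 4)]),
         "c": (INF, None, [("a", 3),  ("b", 9), ("d", 2)]),
         "d": (INF, None, [("b", 6),  ("s", 7)])}

def sssp_map(node, value):
    dist, nxt, adj = value
    yield node, ("adj", adj)                            # pass the adjacency list through
    yield node, ("path", dist, nxt)                     # re-emit the best path found so far
    if dist < INF:
        for succ, cost in adj:
            yield succ, ("path", dist + cost, node)     # newly discovered path to succ

def sssp_reduce(node, values):
    adj = next(v[1] for v in values if v[0] == "adj")
    dist, nxt = min((v[1], v[2]) for v in values if v[0] == "path")
    return node, (dist, nxt, adj)

def one_iteration(nodes):
    groups = defaultdict(list)
    for node, value in nodes.items():
        for key, val in sssp_map(node, value):
            groups[key].append(val)
    return dict(sssp_reduce(k, vs) for k, vs in groups.items())

for _ in range(4):                                      # iterate until no distance changes
    nodes = one_iteration(nodes)
print({n: v[0] for n, v in nodes.items()})              # {'s': 0, 'a': 8, 'b': 9, 'c': 5, 'd': 7}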

Iteration 0: Base case

mapper: (a,<s,10>) (c,<s,5>) + edges
reducer: (a,<10, ...>) (c,<5, ...>)

[Figure: the "wave" has reached a and c; s has distance 0]

Iteration 1

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) + edges
reducer: (a,<8, ...>) (c,<5, ...>) (b,<11, ...>) (d,<7, ...>)

[Figure: the "wave" has reached b and d]

Iteration 2

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>) + edges
reducer: (a,<8>) (c,<5>) (b,<11>) (d,<7>)

[Figure: the current best distance is shown at each node]

Iteration 3

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>) + edges
reducer: (a,<8>) (c,<5>) (b,<11>) (d,<7>)

No change! Convergence!

Question: If a vertex's path cost is the same in two consecutive rounds, can we be sure that this vertex has converged?

Summary: SSSP

Path-based algorithms typically involve iterative map/reduce
They are typically formulated in a way that traverses in "waves" or stages, like breadth-first search
  This allows for parallelism
  They need a way to test for convergence

Example: Single-source shortest path (SSSP)
  The original Dijkstra formulation is hard to parallelize
  But we can make it work with the "wave" approach

Plan for today

Representing data in graphs
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering   ← NEXT
  Classification with Naïve Bayes

Learning (clustering /
classification)

Sometimes our goal is to take a set of entities, possibly related, and group them
  If the groups are based on similarity, we call this clustering
  If the groups are based on putting them into a semantically meaningful class, we call this classification

Both are instances of machine learning

The k-clustering Problem

[Figure: items plotted in a two-dimensional feature space (age vs. expenses), grouped into clusters]

Given: A set of items in an n-dimensional feature space
  Example: data points from a survey, people in a social network

Goal: Group the items into k clusters
  What would be a 'good' set of clusters?

Approach: k-Means

Let m1, m2, ..., mk be representative points for each of our k clusters
  Specifically: the centroid of the cluster

Initialize m1, m2, ..., mk to random values in the data
For t = 1, 2, ...:
  Map each observation to the closest mean:
    S_i^{(t)} = \{ x_j : \| x_j - m_i^{(t)} \| \le \| x_j - m_{i^*}^{(t)} \| \text{ for all } i^* = 1, \ldots, k \}
  Assign each m_i to be the new centroid of its set:
    m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j

A simple example (1/4)

[Figure: six points in the age/expenses plane]
Points: (10,10), (11,16), (15,12), (18,20), (20,21), (30,21)

A simple example (2/4)

[Figure: the same six points, with two of them marked as the randomly chosen initial centers]

A simple example (3/4)

[Figure: after the first iteration, the centroids move to (12.5, 11) and (19.75, 19.5)]
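To connect this figure back to the update formula from a few slides ago: each new centroid is just the mean of the points currently assigned to it. For this example (assuming the upper cluster holds the four points (11,16), (18,20), (20,21), (30,21) and the lower cluster the remaining two), the arithmetic works out to

m_{upper}^{(1)} = \frac{1}{4}\left[(11,16)+(18,20)+(20,21)+(30,21)\right] = (19.75,\ 19.5)

m_{lower}^{(1)} = \frac{1}{2}\left[(10,10)+(15,12)\right] = (12.5,\ 11)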

A simple example (4/4)

[Figure: after the second iteration, the centroids are (12, 12.67) and (22.67, 20.67)]
Stable!

k-Means in MapReduce

Map #1:
  Input: node ID → <position, centroid ID, [centroid IDs and positions]>
  Compute the nearest centroid; emit centroid ID → <node ID, position>

Reduce #1:
  Recompute the centroid position from the positions of the nodes in it
  Emit centroidID → <node IDs, positions>, and for all other centroid IDs, emit otherCentroidID → centroid(centroidID, X, Y)

Map #2:
  Pass through values to Reducer #2

Reduce #2:
  For each node in the current centroid, emit
    node ID → <position, centroid ID, [centroid IDs and positions]>
    (this is the input for the next map iteration; each node needs to know where all the centroids are)
  Also, emit <X, <centroid ID, position>>
    (this will be the 'result' - remember that we wanted the centroids!)

Repeat until no change
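An unofficial, single-machine sketch of one way to do this: it collapses the two jobs above into one simulated map/reduce pass per iteration (the driver redistributes the updated centroid list instead of Reduce #2), and uses the six example points with the initial centers that reproduce the centroid sequence on the example slides; those initial centers are my assumption.

from collections import defaultdict

points = [(10, 10), (11, 16), (15, 12), (18, 20), (20, 21), (30, 21)]
centroids = [(10, 10), (11, 16)]       # assumed initial centers (consistent with the example)

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_map(point, centroids):
    """Map #1: emit <nearest centroid ID, point>."""
    cid = min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))
    yield cid, point

def kmeans_reduce(cid, assigned):
    """Reduce #1: recompute the centroid from the points assigned to it."""
    n = len(assigned)
    return cid, (sum(x for x, _ in assigned) / n, sum(y for _, y in assigned) / n)

while True:
    groups = defaultdict(list)
    for p in points:
        for cid, point in kmeans_map(p, centroids):
            groups[cid].append(point)
    new = dict(kmeans_reduce(cid, pts) for cid, pts in groups.items())
    new_centroids = [new.get(i, centroids[i]) for i in range(len(centroids))]
    if new_centroids == centroids:     # "repeat until no change"
        break
    centroids = new_centroids

print(centroids)   # [(12.0, 12.666...), (22.666..., 20.666...)] - matches the "Stable!" slide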

Plan for today

Representing data in graphs
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes   ← NEXT

Classification

Suppose we want to learn what is spam (or interesting, or ...)
  Predefine a set of classes with semantic meaning
  Train an algorithm to look at data and assign a class
    Based on giving it some examples of data in each class and the sets of features they have

Many probabilistic techniques exist
  Each class has probabilistic relationships with others
    e.g., p(spam | isSentLocally), p(isSentLocally | fromBob), ...
    Typically represented as a graph(ical model)! See CIS 520
  But we'll focus on a simple, flat model: Naïve Bayes

A simple example

Suppose we just look at the keywords in the email's title:

  Message(1, "Won contract")
  Message(2, "Won award")
  Message(3, "Won the lottery")
  Message(4, "Unsubscribe")
  Message(5, "Millions of customers")
  Message(6, "Millions of dollars")

What is the probability that the message "Won millions" is spam?

Bayes' Theorem:
  p(spam | containsWon, containsMillions)
    = p(spam) p(containsWon, containsMillions | spam) / p(containsWon, containsMillions)

Classification using Naïve Bayes

Basic assumption: Probabilities of events are independent
  This is why it is called 'naïve'

Under this assumption:
  p(spam) p(containsWon, containsMillions | spam) / p(containsWon, containsMillions)
    = p(spam) p(containsWon | spam) p(containsMillions | spam) / ( p(containsWon) p(containsMillions) )
    = 0.5 * 0.67 * 0.33 / (0.5 * 0.33) = 0.67

So how do we train a learner (compute the above probabilities) using MapReduce?
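A small, unofficial sketch of this calculation over the six example messages; the spam/ham labels below are my assumption (the slides do not show them), chosen so that the counts reproduce the numbers above.

# Hypothetical training labels: messages 3, 5, 6 are spam; 1, 2, 4 are not.
messages = [
    ("Won contract",          False),
    ("Won award",             False),
    ("Won the lottery",       True),
    ("Unsubscribe",           False),
    ("Millions of customers", True),
    ("Millions of dollars",   True),
]

def p(word, spam_only=None):
    """Fraction of messages (optionally restricted by label) that contain the word."""
    pool = [m for m, s in messages if spam_only is None or s == spam_only]
    return sum(word in m.split() for m in pool) / len(pool)

p_spam = sum(s for _, s in messages) / len(messages)                       # 0.5
numerator   = p_spam * p("Won", spam_only=True) * p("Millions", spam_only=True)
denominator = p("Won") * p("Millions")
print(numerator / denominator)   # 0.666..., matching the 0.67 on the slide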

What do we need to train the learner?

p(spam)
  Count how many spam emails there are        (easy)
  Count the total number of emails            (easy)

p(containsXYZ | spam)
  Count how many spam emails contain XYZ      (1)

p(containsXYZ)
  Count how many emails contain XYZ overall   (2)
  Count the total number of emails            (easy)

Training a Naïve Bayes Learner

map 1:
  takes messageId → <class, {words}>
  emits <word, class> → 1
reduce 1:
  emits <word, class> → <count>
  (counts how many emails in the class contain the word - a modified WordCount; this is count (1) from the previous slide)

map 2:
  takes messageId → <class, {words}>
  emits word → 1
reduce 2:
  emits word → <totalCount>
  (counts how many emails contain the word overall - plain WordCount; this is count (2) from the previous slide)
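An unofficial end-to-end sketch of the two counting jobs, reusing the labelled message set from before (the labels remain my assumption). It runs the map and reduce steps as a sequential simulation; per-message duplicate words are collapsed with set() so each email is counted at most once per word, which is what the "contains" counts need.

from collections import Counter

# Hypothetical labelled training data: (messageId, (class, text))
training = [
    (1, ("ham",  "Won contract")),
    (2, ("ham",  "Won award")),
    (3, ("spam", "Won the lottery")),
    (4, ("ham",  "Unsubscribe")),
    (5, ("spam", "Millions of customers")),
    (6, ("spam", "Millions of dollars")),
]

per_class = Counter()   # job 1 (modified WordCount): emails of each class containing each word
overall = Counter()     # job 2 (WordCount): emails containing each word overall

for msg_id, (cls, text) in training:
    for word in set(text.split()):       # map: emit <word, class> -> 1 and word -> 1
        per_class[(word, cls)] += 1      # reduce 1: sum the 1s per <word, class>
        overall[word] += 1               # reduce 2: sum the 1s per word

n_total = len(training)
n_spam = sum(1 for _, (cls, _) in training if cls == "spam")

# The probabilities the learner needs:
p_spam = n_spam / n_total
p_won_given_spam = per_class[("Won", "spam")] / n_spam    # count (1) / #spam emails
p_won = overall["Won"] / n_total                          # count (2) / #emails
print(p_spam, p_won_given_spam, p_won)                    # 0.5 0.333... 0.5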

Summary: Learning and MapReduce

Clustering algorithms typically have multiple aggregation stages or iterations
  k-means clustering repeatedly computes centroids and maps items to them
  Fixpoint computation

Classification algorithms can be quite complex
  In general: we need to capture conditional probabilities
  Naïve Bayes assumes everything is independent
  Training is a matter of computing probability distributions
Stay tuned

Next time you will learn about:
  PageRank and Adsorption
