
All-Pairs Shortest Paths

CSc 8530, Dr. Prasad


Jon A Preston

March 17, 2004

Outline

Review of graph theory


Problem definition
Sequential algorithms
Properties of interest
Parallel algorithm
Analysis
Recent research
References

Graph Terminology
G = (V, E)
W = weight matrix
  wij = weight/length of edge (vi, vj)
  wij = ∞ if vi and vj are not connected by an edge
  wii = 0
Assume W has positive, zero, and negative values
For this problem, we cannot have a negative-sum cycle in G

Weighted Graph and Weight Matrix

[Figure: an undirected weighted graph on vertices v0-v4 and its symmetric 5 x 5 weight matrix W]

Directed Weighted Graph and Weight Matrix

[Figure: a directed weighted graph on vertices v0-v5 and its 6 x 6 weight matrix W]

All-Pairs Shortest Paths Problem Defined


For every pair of vertices vi and vj in V, it is
required to find the length of the shortest path
from vi to vj along edges in E.
Specifically, a matrix D is to be constructed such
that dij is the length of the shortest path from vi to
vj in G, for all i and j.
Length of a path (or cycle) is the sum of the
lengths (weights) of the edges forming it.

Sample Shortest Path

[Figure: the directed weighted graph from the previous slide, with the path v0 -> v1 -> v2 -> v4 highlighted]

Shortest path from v0 to v4 is along edges (v0, v1), (v1, v2), (v2, v4) and has length 6

Disallowing Negative-length Cycles

APSP does not allow the input to contain negative-length cycles
This is necessary because:
  If such a cycle were to exist within a path from vi to vj, then one could
  traverse this cycle indefinitely, producing paths of ever shorter lengths
  from vi to vj
  If a negative-length cycle exists, then all paths which contain this cycle
  would have a length of -∞

Recent Work on Sequential Algorithms

Floyd-Warshall algorithm is Θ(V³)
  Appropriate for dense graphs: |E| = O(|V|²)
Johnson's algorithm
  Appropriate for sparse graphs: |E| = O(|V|)
  O(V² log V + V·E) if using a Fibonacci heap
  O(V·E log V) if using a binary min-heap
Shoshan and Zwick (1999)
  Integer edge weights in {1, 2, ..., W}
  O(W·V^ω·p(V·W)), where ω ≈ 2.376 (Strassen's matrix multiplication
  exponent) and p is a polylog function
Pettie (2002)
  Allows real-weighted edges
  O(V² log log V + V·E)
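
For concreteness, here is a minimal sequential Floyd-Warshall sketch in Python (the standard textbook formulation; float('inf') marks absent edges, and the function name is our choice, not from the slides):

def floyd_warshall(w):
    # w: n x n weight matrix, w[i][i] = 0, float('inf') for missing edges.
    # Returns the matrix of shortest-path lengths in Theta(V^3) time.
    n = len(w)
    d = [row[:] for row in w]            # work on a copy of W
    for k in range(n):                   # allow v_k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d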


Properties of Interest

Let dij^k denote the length of the shortest path from vi to vj that goes
through at most k - 1 intermediate vertices (k hops)
dij^1 = wij (edge length from vi to vj)
  If i ≠ j and there is no edge from vi to vj, then dij^1 = wij = ∞
  Also, dii^1 = wii = 0
Given that there are no negative-weight cycles in G, there is no advantage
in visiting any vertex more than once in the shortest path from vi to vj
Since there are only n vertices in G, dij = dij^(n-1)

Guaranteeing Shortest Paths

If the shortest path from vi to vj contains vr and vs (where vr precedes vs),
the path from vr to vs must be minimal (or it wouldn't exist in the shortest
path)
Thus, to obtain the shortest path from vi to vj, we can compute all
combinations of optimal sub-paths (whose concatenation is a path from vi to
vj), and then select the shortest one

[Figure: a path vi ... vr ... vs ... vj, with each sub-path labeled MIN]

Iteratively Building Shortest Paths

[Figure: vi reaches each vertex v1, ..., vn by a path of length di1^(k-1), ..., din^(k-1); each vl then reaches vj by an edge of weight wlj]

dij^k = min( dij^(k-1), min over l of ( dil^(k-1) + wlj ) )
      = min over l of ( dil^(k-1) + wlj )     (since wjj = 0)
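
As a sketch, this recurrence translates directly into Python (the name extend and the list-of-lists representation are our choices; float('inf') stands in for ∞):

def extend(d_prev, w):
    # dij^k = min over l of (dil^(k-1) + wlj);
    # since w[j][j] = 0, the l = j term preserves the old d_prev[i][j].
    n = len(w)
    return [[min(d_prev[i][l] + w[l][j] for l in range(n))
             for j in range(n)]
            for i in range(n)]

# Starting from D^1 = W, repeated application yields D^2, D^3, ..., D^(n-1).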

Recurrence Definition

For k > 1, dij^k = min over l of ( dil^(k/2) + dlj^(k/2) )

[Figure: a path from vi to vj split at an intermediate vertex vl, with at most k/2 vertices on each half and MIN taken over both halves]

Guarantees O(log k) steps to calculate dij^k

Similarity

dij^k = min over l of ( dil^(k-1) + wlj )

Cij = sum over l of ( Ail x Blj )
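
The parallel between the two formulas is easy to see in numpy, where only the combiner (+ vs. x) and the reduction (min vs. sum) change (an illustrative sketch; the matrices reuse the A and B of the modified example that follows):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 2.0], [3.0, 4.0]])

# Ordinary product: multiply pairs, then sum over l  ->  [[7 10], [15 22]]
C_times = (A[:, :, None] * B[None, :, :]).sum(axis=1)   # same as A @ B

# Min-plus product: add pairs, then minimize over l   ->  [[2 3], [4 5]]
C_minplus = (A[:, :, None] + B[None, :, :]).min(axis=1)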

Computing D

Let D_k = the matrix with entries dij^k for 0 <= i, j <= n - 1
Given D_1, compute D_2, D_4, ..., D_m, where m = 2^⌈log(n-1)⌉
D = D_m
To calculate D_k from D_(k/2), use a special form of matrix multiplication
(x replaced by +, and + replaced by min)
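
A sequential sketch of the whole doubling scheme (the names minplus and all_pairs are ours; the parallel algorithm performs these same multiplications on the hypercube):

from math import ceil, log2

def minplus(A, B):
    # The special multiplication: x is replaced by +, + is replaced by min.
    n = len(A)
    return [[min(A[i][l] + B[l][j] for l in range(n))
             for j in range(n)]
            for i in range(n)]

def all_pairs(W):
    # D^1 = W; square ceil(log(n-1)) times to reach D^m with m >= n - 1.
    n = len(W)
    D = [row[:] for row in W]
    for _ in range(ceil(log2(n - 1)) if n > 2 else 0):
        D = minplus(D, D)
    return D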

Modified Matrix Multiplication

Step 2: for r = 0 to N - 1 dopar
            C_r = A_r + B_r
        end for
Step 3: for m = 2q to 3q - 1 do
            for all r ∈ N (r_m = 0) dopar
                C_r = min(C_r, C_(r(m)))
            end for
        end for
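
A sequential Python emulation may make the register movement concrete (a sketch: we assume the Section 9.2 data layout in which, after the unmodified routing steps, processor r = i·n² + j·n + k holds A_r = a[j][i] and B_r = b[i][k], so the top q bits of r index the dimension being reduced):

import math

def hypercube_minplus(a, b):
    # n must be a power of 2; N = n^3 emulated processors.
    n = len(a)
    q = int(math.log2(n))
    N = n ** 3
    A = [0] * N; B = [0] * N; C = [0] * N
    for i in range(n):                    # routing result (assumed layout)
        for j in range(n):
            for k in range(n):
                r = i * n * n + j * n + k
                A[r] = a[j][i]
                B[r] = b[i][k]
    for r in range(N):                    # step 2: + replaces x
        C[r] = A[r] + B[r]
    for m in range(2 * q, 3 * q):         # step 3: min replaces +
        for r in range(N):
            if (r >> m) & 1 == 0:         # r_m = 0 takes the min with r(m)
                C[r] = min(C[r], C[r | (1 << m)])
    # C(0, j, k) = min over i of (a[j][i] + b[i][k])
    return [[C[j * n + k] for k in range(n)] for j in range(n)]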

Modified Example

[Figure: the eight processors P000-P111 of an n = 2 example holding the entries of A = (1 2; 3 4) and B = (1 2; 3 4), from Section 9.2 after step (1.3); the ordinary product would be C = (7 10; 15 22)]

Modified Example (step 2)

[Figure: the same processors after modified step 2; each P_r now holds C_r = A_r + B_r]

Modified Example (step 3)

[Figure: after modified step 3, the MIN reductions leave the result C = (0 2; 1 0) in the front face of the cube]

Hypercube Setup

Begin with a hypercube of n³ processors
  Each has registers A, B, and C
  Arrange them in an n x n x n array (cube)
Set A(0, j, k) = wjk for 0 <= j, k <= n - 1
  i.e., the processors in positions (0, j, k) contain D_1 = W
When done, C(0, j, k) contains APSP = D_m

Setup Example

[Figure: the directed weighted graph on v0-v5 from earlier; its weight matrix is loaded as D_1 = wjk = A(0, j, k)]

APSP Parallel Algorithm

Algorithm HYPERCUBE SHORTEST PATH (A, C)
Step 1: for j = 0 to n - 1 dopar
            for k = 0 to n - 1 dopar
                B(0, j, k) = A(0, j, k)
            end for
        end for
Step 2: for i = 1 to ⌈log(n - 1)⌉ do
        (2.1) HYPERCUBE MATRIX MULTIPLICATION (A, B, C)
        (2.2) for j = 0 to n - 1 dopar
                  for k = 0 to n - 1 dopar
                      (i)  A(0, j, k) = C(0, j, k)
                      (ii) B(0, j, k) = C(0, j, k)
                  end for
              end for
        end for
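
Emulated sequentially (a sketch; a direct min-plus multiplication stands in for step (2.1), and matrix copies stand in for the register assignments of step (2.2)):

from math import ceil, log2

def hypercube_shortest_path(W):
    n = len(W)
    A = [row[:] for row in W]
    B = [row[:] for row in A]                            # Step 1: B = A = D^1
    for _ in range(ceil(log2(n - 1)) if n > 2 else 0):   # Step 2
        C = [[min(A[j][l] + B[l][k] for l in range(n))   # (2.1)
              for k in range(n)] for j in range(n)]
        A = [row[:] for row in C]                        # (2.2)(i)
        B = [row[:] for row in C]                        # (2.2)(ii)
    return A                                             # = D^m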

An Example

[Figure: for the example graph, the matrices D^1 = W, D^2, D^4, and D^8 = D produced by successive min-plus squarings]

Analysis

Steps 1 and (2.2) require constant time
There are ⌈log(n - 1)⌉ iterations of Step (2.1)
  Each requires O(log n) time
The overall running time is t(n) = O(log² n)
p(n) = n³
Cost is c(n) = p(n) t(n) = O(n³ log² n)
Efficiency is E = T1 / c(n) = O(n³) / O(n³ log² n) = O(1 / log² n)

Recent Research
Jenq and Sahni (1987) compared various parallel
algorithms for solving APSP empirically
Kumar and Singh (1991) used the isoefficiency
metric (developed by Kumar and Rao) to analyze
the scalability of parallel APSP algorithms
Hardware vs. scalability
Memory vs. scalability

Isoefficiency

For scalable algorithms (those whose efficiency increases monotonically when
p is held constant and the problem size increases), efficiency can be
maintained as processors are added, provided that the problem size also
increases
Relates the problem size to the number of processors necessary for the
speedup to increase in proportion to the number of processors used

Isoefficiency (cont)

Given an architecture, the isoefficiency function defines the degree of
scalability
  Tells us the required growth in problem size to efficiently utilize an
  increasing number of processors
Ex: Given an isoefficiency of kp³
  If p0 and w0 give speedup = 0.8 p0 (efficiency = 0.8)
  Then for p1 = 2 p0, to maintain an efficiency of 0.8 we need
  w1 = 2³ w0 = 8 w0
Indicates the superiority of one algorithm over another only when problem
sizes are increased in the range between the two isoefficiency functions
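
The example's arithmetic, as a throwaway Python check (the function and variable names are ours):

def scaled_work(iso, w0, p0, p1):
    # To hold efficiency fixed, w must grow in proportion to iso(p).
    return w0 * iso(p1) / iso(p0)

iso = lambda p: p ** 3                       # isoefficiency k p^3 (k cancels)
print(scaled_work(iso, w0=1.0, p0=4, p1=8))  # doubling p -> 8.0 x the work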


Memory Overhead Factor (MOF)

Ratio:
  (total memory required across all processors) /
  (memory required for the same problem size on a single processor)
We'd like this to be low!

Architectures Discussed

Shared Memory (CREW)
Hypercube (Cube)
Mesh
Mesh with Cut-Through Routing
Mesh with Cut-Through and Multicast Routing

Also examined fast and slow communication technologies

Parallel APSP Algorithms

Floyd Checkerboard
Floyd Pipelined Checkerboard
Floyd Striped
Dijkstra Source-Partition
Dijkstra Source-Parallel

General Parallel Algorithm (Floyd)

Repeat steps 1 through 4 for k := 1 to n
Step 1: If this processor has a segment of P_(k-1)[*,k], then transmit it to
        all processors that need it
Step 2: If this processor has a segment of P_(k-1)[k,*], then transmit it to
        all processors that need it
Step 3: Wait until the needed segments of P_(k-1)[*,k] and P_(k-1)[k,*] have
        been received
Step 4: For all i, j in this processor's partition, compute
        P_k[i,j] := min { P_(k-1)[i,j], P_(k-1)[i,k] + P_(k-1)[k,j] }
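
In array form, step 4 is an independent update for every cell once row k and column k have arrived, which is exactly what steps 1-3 distribute (a numpy sketch of the data dependence, not the message-passing code itself):

import numpy as np

def floyd_iteration(P, k):
    # Every cell needs only its own value, P[i][k] (column k), and
    # P[k][j] (row k) -- hence the dopar over the processor's partition.
    np.minimum(P, P[:, [k]] + P[[k], :], out=P)

def floyd_apsp(W):
    P = np.array(W, dtype=float)      # P_0 = W
    for k in range(len(P)):           # k = 1 .. n in the slide's numbering
        floyd_iteration(P, k)
    return P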

Floyd Checkerboard

[Figure: the n x n matrix partitioned into √p x √p blocks, each of size (n/√p) x (n/√p)]

Each cell (block) is assigned to a different processor, and this processor is
responsible for updating the cost matrix values at each iteration of the
Floyd algorithm
Steps 1 and 2 of the GPF involve each of the p processors sending their data
to the neighbor columns and rows

Floyd Pipelined Checkerboard

[Figure: the same (n/√p) x (n/√p) block partitioning]

Similar to the preceding
Steps 1 and 2 of the GPF involve each of the p processors sending their data
to the neighbor columns and rows
The difference is that the processors are not synchronized: each computes and
sends data as soon as possible (or sends as soon as it receives)

Floyd Striped

[Figure: the n x n matrix partitioned into p column stripes of width n/p]

Each column stripe is assigned to a different processor, and this processor
is responsible for updating the cost matrix values at each iteration of the
Floyd algorithm
Step 1 of the GPF involves each of the p processors sending their data to the
neighbor columns. Step 2 is not needed (since each column is contained
entirely within one processor)

Dijkstra Source-Partition

Assumes Dijkstra's single-source shortest path algorithm is equally
distributed over p processors and executed in parallel
Each processor finds the shortest paths from each vertex in its set to all
other vertices in the graph
Fortunately, this approach involves no inter-processor communication
Unfortunately, only n processors can be kept busy
Also, memory overhead is high since each processor has a copy of the weight
matrix

Dijkstra's Source-Parallel

Motivated by keeping more processors busy
Run n copies of Dijkstra's SSSP
Each copy runs on p/n processors (p > n)

[Figure: the p processors arranged as n groups of p/n processors each]

Calculating Isoefficiency

Example: Floyd Checkerboard
  At most n² processors can be kept busy
  n must grow as Θ(√p) due to the problem structure
  By Floyd (sequential), Te = Θ(n³)
  Thus isoefficiency is Θ((√p)³) = Θ(p^1.5)
  But what about communication?

Calculating Isoefficiency (cont)

ts = message startup time
tw = per-word communication time
tc = time to compute the next iteration value for one cell in the matrix
m = number of words sent
d = number of hops between nodes

Hypercube:
  (ts + tw m) log d = time to deliver m words
  2 (ts + tw m) log p = barrier synchronization time (up & down tree)
  d = √p

Step 1 = (ts + tw n/√p) log p
Step 2 = (ts + tw n/√p) log p
Step 3 (barrier synch) = 2 (ts + tw) log p
Step 4 = tc n²/p

Tp = n [ 2 (ts + tw n/√p) log p + 2 (ts + tw) log p + tc n²/p ]

Isoefficiency = Θ(p^1.5 (log p)³)

Mathematical Details

To = p Tp - Te

To = p n [ 2 (ts + tw n/√p) log p + 2 (ts + tw) log p + tc n²/p ] - tc n³

To ≈ (3 ts + 2 tw) n p log p + tw n² √p log p

How are n and p related?

Setting tc n³ = K [ (3 ts + 2 tw) n p log p + tw n² √p log p ], where
K = E / (1 - E):
  the first term gives n³ = Θ((p log p)^1.5)
  the second term gives n³ = Θ(p^1.5 (log p)³)

Isoefficiency = Θ(p^1.5 (log p)³)

Calculating Isoefficiency (cont)

ts = message startup time
tw = per-word communication time
tc = time to compute the next iteration value for one cell in the matrix
m = number of words sent
d = number of hops between nodes

Mesh (d = √p):
  Step 1 = (ts + tw n/√p) √p
  Step 2 = (ts + tw n/√p) √p
  Step 3 (barrier synch) = 2 (ts + tw) √p
  Step 4 = tc n²/p

Tp(comm/sync) = n [ 2 (ts + tw n/√p) √p + 2 (ts + tw) √p ]

Isoefficiency = Θ(p³ + p^2.25) = Θ(p³)

Isoefficiency and MOF for Algorithm & Architecture Combinations

Base Algorithm | Parallel Variant       | Architecture                        | Isoefficiency     | MOF
---------------|------------------------|-------------------------------------|-------------------|----
Dijkstra       | Source-Partitioned     | SM, Cube, Mesh, Mesh-CT, Mesh-CT-MC | p³                |
Dijkstra       | Source-Parallel        | SM, Cube                            | (p log p)^1.5     |
               |                        | Mesh, Mesh-CT, Mesh-CT-MC           | p^1.8             |
Floyd          | Striped                | SM                                  | p³                |
               |                        | Cube                                | (p log p)³        |
               |                        | Mesh                                | p^4.5             |
               |                        | Mesh-CT                             | (p log p)³        |
               |                        | Mesh-CT-MC                          | p³                |
Floyd          | Checkerboard           | SM                                  | p^1.5             |
               |                        | Cube                                | p^1.5 (log p)³    |
               |                        | Mesh                                | p³                |
               |                        | Mesh-CT                             | p^2.25            |
               |                        | Mesh-CT-MC                          | p^2.25            |
Floyd          | Pipelined Checkerboard | SM, Cube, Mesh, Mesh-CT, Mesh-CT-MC | p^1.5             |

Comparing Metrics

We've used cost previously this semester (cost = p · Tp)
But notice that the cost of all of the architecture-algorithm combinations
discussed here is Θ(n³)
Clearly some are more scalable than others
Thus isoefficiency is a useful metric when analyzing algorithms and
architectures

References

Akl, S. G. Parallel Computation: Models and Methods. Prentice Hall, Upper
Saddle River, NJ, pp. 381-384, 1997.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to
Algorithms (2nd Edition). The MIT Press, Cambridge, MA, pp. 620-642, 2001.

Jenq, J. and Sahni, S. "All Pairs Shortest Path on a Hypercube
Multiprocessor." In International Conference on Parallel Processing, pp.
713-716, 1987.

Kumar, V. and Singh, V. "Scalability of Parallel Algorithms for the All Pairs
Shortest Path Problem." Journal of Parallel and Distributed Computing, vol.
13, no. 2, Academic Press, San Diego, CA, pp. 124-138, 1991.

Pettie, S. "A Faster All-Pairs Shortest Path Algorithm for Real-Weighted
Sparse Graphs." In Proc. 29th Int'l Colloquium on Automata, Languages, and
Programming (ICALP '02), LNCS vol. 2380, pp. 85-97, 2002.
