13 Chapter 16

Futures, Scheduling, and Work
Distribution
Companion slides for
The Art of Multiprocessor
Programming
by Maurice Herlihy & Nir Shavit
Art of Multiprocessor Programming Herlihy
Shavit 2007
2
How to write Parallel Apps?
How to
split a program into parallel parts
In an effective way
Thread management
Matrix Multiplication
$C = A \cdot B$
$c_{ij} = \sum_{k=0}^{N-1} a_{ik} \cdot b_{kj}$
class Worker extends Thread {
int row, col;
Worker(int row, int col) {
this.row = row; this.col = col;
}
public void run() {
double dotProduct = 0.0;
for (int i = 0; i < n; i++)
dotProduct += a[row][i] * b[i][col];
c[row][col] = dotProduct;
}}}
Worker extends Thread: each worker is a thread
int row, col: which matrix entry to compute
run(): the actual computation
void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row ...)
    for (int col ...)
      worker[row][col] = new Worker(row,col);
  for (int row ...)
    for (int col ...)
      worker[row][col].start();
  for (int row ...)
    for (int col ...)
      worker[row][col].join();
}
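The thread-per-entry scheme can be sketched as one self-contained method. This is only an illustration under assumptions: the class name ThreadPerEntryMultiply is hypothetical, the matrices are passed as arguments instead of being fields, and each worker is a lambda rather than a Worker subclass.

```java
// Hypothetical wrapper class; not part of the original slides.
class ThreadPerEntryMultiply {
    // Returns the product of two square n x n matrices, using one thread per entry of c.
    static double[][] multiply(double[][] a, double[][] b) {
        final int n = a.length;
        final double[][] c = new double[n][n];
        Thread[][] worker = new Thread[n][n];
        // Create n x n threads, one per matrix entry.
        for (int row = 0; row < n; row++) {
            for (int col = 0; col < n; col++) {
                final int r = row, k = col;
                worker[row][col] = new Thread(() -> {
                    double dotProduct = 0.0;
                    for (int i = 0; i < n; i++)
                        dotProduct += a[r][i] * b[i][k];
                    c[r][k] = dotProduct;
                });
            }
        }
        // Start them.
        for (Thread[] rowOfWorkers : worker)
            for (Thread t : rowOfWorkers) t.start();
        // Wait for them to finish; join() also makes each worker's write to c visible here.
        try {
            for (Thread[] rowOfWorkers : worker)
                for (Thread t : rowOfWorkers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining workers", e);
        }
        return c;
    }
}
```

For a 1000 x 1000 product this would create a million threads, each doing very little work.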
The three loop nests in multiply():
Create n x n threads
Start them
Wait for them to finish
What's wrong with this picture?
Thread Overhead
Threads Require resources
Memory for stacks
Setup, teardown
Scheduler overhead
Worse for short-lived threads
Thread Pools
More sensible to keep a pool of long-lived threads
Threads assigned short-lived tasks
Runs the task
Rejoins pool
Waits for next assignment

Thread Pool = Abstraction
Insulate programmer from platform
Big machine, big pool
And vice-versa
Portable code
Runs well on any platform
No need to mix algorithm/platform
concerns

ExecutorService Interface
In java.util.concurrent
Task = Runnable object
If no result value expected
Calls run() method.
Task = Callable<T> object
If result value of type T expected
Calls T call() method.
Future<T>
Callable<T> task = ...;

Future<T> future = executor.submit(task);

T value = future.get();
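A minimal runnable sketch of the submit/get pattern, assuming a small fixed-size pool. The class name FutureDemo, the method sumTo, and the summation task are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; not part of the original slides.
class FutureDemo {
    // Computes 1 + 2 + ... + n on a pool thread and waits for the result.
    static int sumTo(int n) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            Callable<Integer> task = () -> {
                int s = 0;
                for (int i = 1; i <= n; i++) s += i;
                return s;
            };
            Future<Integer> future = executor.submit(task); // returns immediately
            return future.get();                            // blocks until the value is available
        } catch (Exception e) {
            // get() can throw InterruptedException or ExecutionException
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```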
Submitting a Callable<T> task returns a Future<T> object
The Future's get() method blocks until the value is available
Future<?>
Runnable task = ...;

Future<?> future = executor.submit(task);

future.get();
Submitting a Runnable task returns a Future<?> object
The Future's get() method blocks until the computation is complete
Note
Executor Service submissions
Like New England traffic signs
Are purely advisory in nature
The executor
Like the New England driver
Is free to ignore any such advice
And could execute tasks sequentially
Matrix Addition
$$\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix} = \begin{pmatrix} A_{00} + B_{00} & A_{01} + B_{01} \\ A_{10} + B_{10} & A_{11} + B_{11} \end{pmatrix}$$
4 parallel additions
Matrix Addition Task
class AddTask implements Runnable {
  Matrix a, b; // add these!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0]; // base case
    } else {
      (partition a, b into half-size matrices a_ij and b_ij)
      Future<?> f00 = exec.submit(add(a00, b00));
      ...
      Future<?> f11 = exec.submit(add(a11, b11));
      f00.get(); ...; f11.get();
      ...
    }}
Base case: add directly (a constant-time operation)
Otherwise: submit 4 tasks, then let them finish
Dependencies
Matrix example is not typical
Tasks are independent
Don't need results of one task
To complete another
Often tasks are not independent
Fibonacci
$F(n) = \begin{cases} 1 & \text{if } n = 0 \text{ or } 1 \\ F(n-1) + F(n-2) & \text{otherwise} \end{cases}$
Note
potential parallelism
Dependencies

Disclaimer
This Fibonacci implementation is
Egregiously inefficient
So don't deploy it!
But illustrates our point
How to deal with dependencies
Exercise:
Make this implementation efficient!
Multithreaded Fibonacci
class FibTask implements Callable<Integer> {
static ExecutorService exec =
Executors.newCachedThreadPool();
int arg;
public FibTask(int n) {
arg = n;
}
public Integer call() throws Exception { // get() throws checked exceptions
if (arg > 2) {
Future<Integer> left = exec.submit(new FibTask(arg-1));
Future<Integer> right = exec.submit(new FibTask(arg-2));
return left.get() + right.get();
} else {
return 1;
}}}
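A runnable sketch of how such a task might be driven. The wrapper class FibDemo and the method fib are hypothetical, and the executor is passed to each task (rather than held in a static field) so the pool can be shut down afterwards. As in the slide code, arguments 1 and 2 both return 1.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical driver class; not part of the original slides.
class FibDemo {
    static class FibTask implements Callable<Integer> {
        final ExecutorService exec;
        final int arg;
        FibTask(ExecutorService exec, int n) { this.exec = exec; this.arg = n; }
        public Integer call() throws Exception {
            if (arg > 2) {
                // Parallel calls: both subproblems are submitted before either is awaited.
                Future<Integer> left = exec.submit(new FibTask(exec, arg - 1));
                Future<Integer> right = exec.submit(new FibTask(exec, arg - 2));
                // Pick up and combine the results; each get() blocks this pool thread.
                return left.get() + right.get();
            } else {
                return 1;
            }
        }
    }

    static int fib(int n) {
        // A cached pool grows on demand. A fixed-size pool could deadlock here,
        // because parent tasks block in get() while their children wait for a thread.
        ExecutorService exec = Executors.newCachedThreadPool();
        try {
            return exec.submit(new FibTask(exec, n)).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            exec.shutdown();
        }
    }
}
```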
exec.submit(...): parallel calls
left.get() + right.get(): pick up & combine results
Dynamic Behavior
Multithreaded program is
A directed acyclic graph (DAG)
That unfolds dynamically
Each node is
A single unit of work
Fib DAG
[Figure: the DAG for fib(4). fib(4) submits fib(3) and fib(2), and each of those submits smaller calls down to fib(1). Edges are labelled "submit" from caller to callee and "get" from callee back to caller.]
Arrows Reflect Dependencies
How Parallel is That?
Define work:
Total time on one processor
Define critical-path length:
Longest dependency path
Can't beat that!
Fib Work
[Figure: the fib(4) DAG with its nodes numbered 1 to 17.]
Work is 17
Fib Critical Path
[Figure: the fib(4) DAG with the nodes on its longest dependency path numbered 1 to 8.]
Critical path length is 8
Notation Watch
$T_P$ = time on P processors
$T_1$ = work (time on 1 processor)
$T_\infty$ = critical path length (time on $\infty$ processors)
Simple Bounds
$T_P \ge T_1 / P$
In one step, can't do more than P work
$T_P \ge T_\infty$
Can't beat infinite resources
More Notation Watch
Speedup on P processors
Ratio $T_1 / T_P$
How much faster with P processors
Linear speedup
$T_1 / T_P = \Theta(P)$
Max speedup (average parallelism)
$T_1 / T_\infty$
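As a worked instance of this notation, take the fib(4) DAG from the earlier slides, where the work is 17 and the critical path length is 8. The choice of P = 4 below is only an illustrative assumption.

```latex
\begin{align*}
T_1 &= 17, \qquad T_\infty = 8 \\
\text{max speedup} &= T_1 / T_\infty = 17/8 \approx 2.1 \\
T_4 &\ge \max\left(T_1/4,\; T_\infty\right) = \max(4.25,\; 8) = 8
\end{align*}
```

With an average parallelism of about 2, adding processors beyond the first two or three cannot shorten this computation.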

Addition
Let $A_P(n)$ be running time
For n x n matrix
on P processors
For example
$A_1(n)$ is work
$A_\infty(n)$ is critical path length
Addition
Work is
$A_1(n) = 4 A_1(n/2) + \Theta(1) = \Theta(n^2)$
$4 A_1(n/2)$: 4 spawned additions
$\Theta(1)$: partition, synch, etc.
Same as double-loop summation
Addition
Critical path length is
$A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\log n)$
$A_\infty(n/2)$: spawned additions in parallel
$\Theta(1)$: partition, synch, etc.
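The two addition recurrences can be unrolled directly. The sketch below assumes n is a power of 2 and writes c for the constant hidden in the $\Theta(1)$ term; both are assumptions made for the derivation only.

```latex
\begin{align*}
A_1(n) &= 4 A_1(n/2) + c
        = c \sum_{i=0}^{\log_2 n - 1} 4^i + 4^{\log_2 n} A_1(1)
        = \Theta(n^2) \\
A_\infty(n) &= A_\infty(n/2) + c
        = c \log_2 n + A_\infty(1)
        = \Theta(\log n)
\end{align*}
```

The work recurrence has four subproblems per level, so the last level dominates; the critical-path recurrence has one, so every level contributes the same constant.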
Matrix Multiplication Redux
$C = A \cdot B$
$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \cdot \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$
First Phase
$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{pmatrix}$$
8 multiplications
Second Phase
4 additions
Multiplication
Work is
$M_1(n) = 8 M_1(n/2) + A_1(n) = 8 M_1(n/2) + \Theta(n^2) = \Theta(n^3)$
$8 M_1(n/2)$: 8 parallel multiplications
$A_1(n)$: final addition
Same as serial triple-nested loop
Multiplication
Critical path length is
$M_\infty(n) = M_\infty(n/2) + A_\infty(n) = M_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n)$
$M_\infty(n/2)$: half-size parallel multiplications
$A_\infty(n)$: final addition
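The $\Theta(\log^2 n)$ bound follows by unrolling the critical-path recurrence. The sketch assumes n is a power of 2 and writes c for the constant hidden in the $\Theta(\log n)$ addition term; both are assumptions made for the derivation only.

```latex
\begin{align*}
M_\infty(n) &= M_\infty(n/2) + c \log_2 n \\
            &= c \sum_{i=0}^{\log_2 n - 1} \log_2\!\left(n / 2^i\right) + M_\infty(1) \\
            &= c \sum_{j=1}^{\log_2 n} j + M_\infty(1)
             = c \, \frac{\log_2 n \,(\log_2 n + 1)}{2} + M_\infty(1)
             = \Theta(\log^2 n)
\end{align*}
```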
Parallelism
$M_1(n) / M_\infty(n) = \Theta(n^3 / \log^2 n)$
To multiply two 1000 x 1000 matrices
$1000^3 / 10^2 = 10^7$
Much more than number of processors on any real machine
Shared-Memory
Multiprocessors
Parallel applications
Do not have direct access to HW
processors
Mix of other jobs
All run together
Come & go dynamically
Ideal Scheduling Hierarchy
Tasks
Processors
User-level scheduler
Realistic Scheduling Hierarchy
Tasks
Threads
Processors
Kernel-level scheduler
For Example
Initially,
All P processors available for application
Serial computation
Takes over one processor
Leaving P-1 for us
Waits for I/O
We get that processor back .
Speedup
Map threads onto P processors
Cannot get P-fold speedup
What if the kernel doesn't cooperate?
Can try for speedup proportional to
time-averaged number of processors
the kernel gives us
Scheduling Hierarchy
Tells kernel which threads are ready
Kernel-level scheduler
Synchronous (for analysis, not correctness!)
Picks $p_i$ threads to schedule at step i
Processor average over T steps is:
$P_A = \frac{1}{T} \sum_{i=1}^{T} p_i$
Greed is Good
Greedy scheduler
Schedules as much as it can
At each time step
Optimal schedule is greedy (why?)
But not every greedy schedule is
optimal
Theorem
Greedy scheduler ensures that

$T \le \frac{T_1}{P_A} + T_\infty \frac{P-1}{P_A}$
Deconstructing
$T$: actual time
$T_1 / P_A$: work divided by processor average
$T_\infty$: cannot do better than critical path length
$P_A$: the higher the average, the better it is
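The bound can be checked on a small example with a toy simulation. The sketch below rests on simplifying assumptions that are not in the slides: every node takes exactly one step, the kernel grants the same P processors at every step (so the processor average equals P), and the DAG is a hypothetical fork-join shape built by the helper forkJoin. All class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation class; not part of the original slides.
class GreedySim {
    // children.get(u) lists the nodes that depend on u. Assumes the graph is acyclic.
    // Returns the number of steps a greedy schedule needs on p processors.
    static int greedySteps(List<List<Integer>> children, int p) {
        int n = children.size();
        int[] pending = new int[n]; // number of unfinished predecessors per node
        for (List<Integer> cs : children)
            for (int c : cs) pending[c]++;
        List<Integer> ready = new ArrayList<>();
        for (int u = 0; u < n; u++)
            if (pending[u] == 0) ready.add(u);
        int steps = 0, done = 0;
        while (done < n) {
            // Greedy: run as many ready nodes as there are processors.
            int k = Math.min(p, ready.size());
            List<Integer> running = new ArrayList<>(ready.subList(0, k));
            ready = new ArrayList<>(ready.subList(k, ready.size()));
            for (int u : running)
                for (int c : children.get(u))
                    if (--pending[c] == 0) ready.add(c);
            done += k;
            steps++;
        }
        return steps;
    }

    // Critical path length, counted in nodes on the longest dependency path.
    static int criticalPath(List<List<Integer>> children) {
        int[] memo = new int[children.size()];
        int best = 0;
        for (int u = 0; u < children.size(); u++)
            best = Math.max(best, depth(children, u, memo));
        return best;
    }

    private static int depth(List<List<Integer>> children, int u, int[] memo) {
        if (memo[u] != 0) return memo[u];
        int best = 0;
        for (int c : children.get(u))
            best = Math.max(best, depth(children, c, memo));
        return memo[u] = 1 + best;
    }

    // Builds a source node 0, `width` independent middle nodes, and one sink node.
    static List<List<Integer>> forkJoin(int width) {
        List<List<Integer>> g = new ArrayList<>();
        for (int u = 0; u < width + 2; u++) g.add(new ArrayList<>());
        for (int m = 1; m <= width; m++) {
            g.get(0).add(m);
            g.get(m).add(width + 1);
        }
        return g;
    }
}
```

For forkJoin(3) the work is 5 and the critical path length is 3, so with P = 2 the theorem allows at most 5/2 + 3 * (1/2) = 4 steps.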
Proof Strategy
$P_A = \frac{1}{T} \sum_{i=1}^{T} p_i \quad\Longrightarrow\quad T = \frac{1}{P_A} \sum_{i=1}^{T} p_i$
Bound this! (the sum $\sum_{i=1}^{T} p_i$)
Put Tokens in Buckets
Two buckets: work and idle
Work token: processor found work
Idle token: processor available but couldn't find work
At the end ...
Total #tokens = $\sum_{i=1}^{T} p_i$
Work bucket: $T_1$ tokens
Must Show
Idle bucket: $\le T_\infty (P-1)$ tokens
Idle Steps
An idle step is one where there is at
least one idle processor
Only time idle tokens are generated
Focus on idle steps
Every Move You Make
Scheduler is greedy
At least one node ready
Number of idle threads in one idle step
At most $p_i - 1 \le P - 1$
How many idle steps?
Unexecuted sub-DAG
[Figure: the fib(4) DAG with the executed nodes removed. The longest path in the unexecuted sub-DAG is highlighted; the last node on that path is ready to execute.]
Every Step You Take
Consider longest path in unexecuted
sub-DAG at step i
At least one node in path ready
Length of path shrinks by at least one
at each step
Initially, path is $T_\infty$
So there are at most $T_\infty$ idle steps
Counting Tokens
At most P-1 idle threads per step
At most $T_\infty$ idle steps
So idle bucket contains at most $T_\infty (P-1)$ tokens
Both buckets contain at most $T_1 + T_\infty (P-1)$ tokens
Recapitulating
$\sum_{i=1}^{T} p_i \le T_1 + T_\infty (P-1)$
$T = \frac{1}{P_A} \sum_{i=1}^{T} p_i$
$T \le \frac{1}{P_A} \left( T_1 + T_\infty (P-1) \right)$
Turns Out
This bound is within a factor of 2 of
optimal
Actual optimal is NP-complete
Work Distribution
zzz
Work Dealing
Yes!
The Problem with
Work Dealing
Doh!
Doh!
Doh!
Work Stealing
No
work
zzz Yes!
Lock-Free Work Stealing
Each thread has a pool of ready work
Remove work without synchronizing
If you run out of work, steal someone else's
Choose victim at random
Local Work Pools

Each work pool is a Double-Ended Queue
Work DEQueue
1
pushBottom
popBottom
work
1. Double-Ended Queue
Obtain Work
Obtain work
Run thread until
Blocks or terminates
popBottom
New Work
Unblock node
Spawn node
pushBottom
Whatcha Gonna do When the
Well Runs Dry?
@&%$!!
empty
Steal Work from Others

Pick a random thread's DEQueue
Steal this Thread!
popTop
Thread DEQueue
Methods: pushBottom, popBottom, popTop
pushBottom and popBottom never happen concurrently
These are the most common calls, so make them fast (minimize use of CAS)
Ideal
Wait-Free
Linearizable
Constant time
Fortune Cookie: It is better to be young,
rich and beautiful, than old, poor, and ugly
Compromise
Method popTop may fail if
Concurrent popTop succeeds, or a
Concurrent popBottom takes last work
Blame the
victim!
Dreaded ABA Problem
[Figure sequence: a thief reads top and prepares its CAS. Before the CAS runs, the queue changes and top returns to the same index. The thief's CAS on top then succeeds ("Yes!") although the queue is no longer in the state the thief observed ("Uh-Oh").]
Fix for Dreaded ABA
stamp
top
bottom
Bounded DEQueue
public class BDEQueue {
  AtomicStampedReference<Integer> top; // index & stamp (synchronized)
  volatile int bottom; // index of bottom thread (no need to synchronize; the effect of a write must be seen, so we need a memory barrier)
  Runnable[] tasks; // array holding tasks
  ...
}
pushBottom()
void pushBottom(Runnable r) {
  tasks[bottom] = r; // bottom is the index to store the new task in the array
  bottom++; // adjust the bottom index
}
Steal Work
public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1; // read top (value & stamp)
  int oldStamp = stamp[0], newStamp = oldStamp + 1; // compute new value & stamp
  if (bottom <= oldTop) // quit if queue is empty
    return null;
  Runnable r = tasks[oldTop];
  if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp)) // try to steal the task
    return r;
  return null; // give up if conflict occurs
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null; // make sure queue is non-empty
  bottom--; // prepare to grab bottom task
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0; // read top & prepare new values
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r; // top & bottom 1 or more apart: no conflict
  if (bottom == oldTop) { // at most one item left
    bottom = 0; // in any case reset bottom: the DEQueue will be empty even if unsuccessful (why?)
    if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
      return r; // I win CAS
  }
  // I lose CAS: thief must have won
  top.set(newTop, newStamp); // failed to get last item, must still reset top
  return null;
}
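The three DEQueue methods can be put together in one compilable class. This is a sketch under stated assumptions, not the authors' code: the class name BDEQueueSketch and the capacity parameter are hypothetical, there is no overflow check, and it departs from the slides in two labelled ways. It keeps the boxed Integer returned by top.get(), because AtomicStampedReference.compareAndSet compares references rather than values, and it resets bottom before touching top on every path where the queue ends up empty.

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Hypothetical class name; a sketch only, exercised here single-threaded.
class BDEQueueSketch {
    private final Runnable[] tasks;
    private volatile int bottom = 0; // written only by the owner thread
    private final AtomicStampedReference<Integer> top = new AtomicStampedReference<>(0, 0);

    BDEQueueSketch(int capacity) { tasks = new Runnable[capacity]; }

    // Owner only. No overflow check (assumption: capacity is never exceeded).
    void pushBottom(Runnable r) {
        tasks[bottom] = r;
        bottom++;
    }

    // Called by thieves. May return null on an empty queue or on a lost race.
    Runnable popTop() {
        int[] stamp = new int[1];
        Integer oldTopRef = top.get(stamp); // keep the reference for compareAndSet
        int oldTop = oldTopRef;
        if (bottom <= oldTop) return null;  // queue is empty
        Runnable r = tasks[oldTop];
        if (top.compareAndSet(oldTopRef, oldTop + 1, stamp[0], stamp[0] + 1))
            return r;
        return null;                        // conflict: give up
    }

    // Owner only.
    Runnable popBottom() {
        if (bottom == 0) return null;
        bottom--;
        Runnable r = tasks[bottom];
        int[] stamp = new int[1];
        Integer oldTopRef = top.get(stamp);
        int oldTop = oldTopRef;
        int newStamp = stamp[0] + 1;
        if (bottom > oldTop) return r;      // more than one task was left: no conflict
        boolean lastItem = (bottom == oldTop);
        bottom = 0;                         // the queue will be empty either way (my addition for the bottom < oldTop case)
        if (lastItem && top.compareAndSet(oldTopRef, 0, stamp[0], newStamp))
            return r;                       // owner won the race for the last task
        top.set(0, newStamp);               // a thief took it; still reset top
        return null;
    }
}
```

Resetting bottom before top matters: if top were reset to 0 first, a thief could briefly see a non-empty range and take a stale entry.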
Old English Proverb
May as well be hanged for stealing a
sheep as a goat
From which we conclude
Stealing was punished severely
Sheep were worth more than goats
Variations
Stealing is expensive
Pay CAS
Only one thread taken
What if
Randomly balance loads?

Work Balancing
[Figure: two queues holding 5 and 2 tasks are rebalanced.]
$\lfloor (2+5)/2 \rfloor = 3$
$\lceil (2+5)/2 \rceil = 4$
Work-Balancing Thread
public void run() {
  int me = ThreadID.get();
  while (true) { // keep running tasks
    Runnable task = queue[me].deq();
    if (task != null) task.run();
    int size = queue[me].size();
    if (random.nextInt(size+1) == size) { // with probability 1/(|queue| + 1)
      int victim = random.nextInt(queue.length); // choose random victim
      int min = ..., max = ...; // lock queues in canonical order
      synchronized (queue[min]) {
        synchronized (queue[max]) {
          balance(queue[min], queue[max]); // rebalance queues
}}}}}
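The slides call balance() without showing it. One possible shape for it is sketched below, assuming the caller already holds both queue locks as in the run() loop. The class name Balancer, the use of java.util.Deque, and the helper sizesAfterBalance are all hypothetical choices for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper class; not part of the original slides.
class Balancer {
    // Moves tasks from the larger queue to the smaller one until their sizes differ by at most 1.
    // The caller is assumed to hold the locks of both queues.
    static <T> void balance(Deque<T> q0, Deque<T> q1) {
        Deque<T> qMin = (q0.size() < q1.size()) ? q0 : q1;
        Deque<T> qMax = (qMin == q0) ? q1 : q0;
        while (qMax.size() > qMin.size() + 1)
            qMin.addLast(qMax.pollFirst());
    }

    // Illustration helper: builds two queues of the given sizes, balances them, returns the new sizes.
    static int[] sizesAfterBalance(int n0, int n1) {
        Deque<Integer> q0 = new ArrayDeque<>(), q1 = new ArrayDeque<>();
        for (int i = 0; i < n0; i++) q0.addLast(i);
        for (int i = 0; i < n1; i++) q1.addLast(i);
        balance(q0, q1);
        return new int[] { q0.size(), q1.size() };
    }
}
```

With queues of 5 and 2 tasks, the sizes end up as the floor and ceiling of 7/2, matching the 3 and 4 in the Work Balancing figure.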
Work Stealing & Balancing
Clean separation between app &
scheduling layer
Works well when number of
processors fluctuates.
Works on black-box operating
systems

This work is licensed under a Creative Commons Attribution-
ShareAlike 2.5 License.
You are free:
to Share: to copy, distribute and transmit the work
to Remix: to adapt the work
Under the following conditions:
Attribution. You must attribute the work to The Art of Multiprocessor
Programming (but not in any way that suggests that the authors endorse
you or your use of the work).
Share Alike. If you alter, transform, or build upon this work, you may
distribute the resulting work only under the same, similar or a compatible
license.
For any reuse or distribution, you must make clear to others the license terms of
this work. The best way to do this is with a link to
http://creativecommons.org/licenses/by-sa/3.0/.
Any of the above conditions can be waived if you get permission from the
copyright holder.
Nothing in this license impairs or restricts the author's moral rights.
