Chapter 16

Memory and Synchronization Latency

Scalability of a system is limited by its ability to handle memory latency & algorithmic synchronization delays
Overall solution is well known
– Do something else whilst waiting
Remote memory accesses
– Much slower than local
– Varying delay depending on
• Network traffic
• Memory traffic
David Abramson, 2004 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997
Processor Utilization
Utilization
– U = P / T
  • P: time spent processing
  • T: total time
– U = P / (P + I + S)
  • I: time spent waiting on other tasks
  • S: time spent switching tasks
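As a small worked example of the utilization formula above (the millisecond values are invented for illustration):

```python
def utilization(p, i, s):
    """U = P / (P + I + S): fraction of total time spent doing useful work.

    p: time spent processing
    i: time spent waiting on other tasks
    s: time spent switching tasks
    """
    return p / (p + i + s)

# Hypothetical numbers: 60 ms processing, 30 ms waiting, 10 ms switching.
print(utilization(60, 30, 10))  # 0.6
```

Reducing I (by doing something else whilst waiting) and keeping S small are exactly the goals of the multithreaded designs that follow.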
Basic ideas - Multithreading
Fine grain – task switch every cycle
Blocked (coarse grain) – task switch only when the running thread blocks
Design Space

[Figure: design space of multi-threaded architectures – computational model ranging from von Neumann (sequential control flow) through hybrid von Neumann/dataflow to dataflow; fine grain parallelism; physical shared memory; small scale (4–10 processors)]
Classification of multi-threaded architectures
Multi-threaded architectures
– HEP-style: HEP, Tera
– RISC-like: P-RISC
– Macro dataflow: MIT Hybrid Machine
– Decoupled architectures: USC
Computational Models
Sequential control flow (von Neumann)
Sequential Control Flow Model

Control flows from one instruction to the next; intermediate results pass through memory cells m1 and m2:

    L1: -  A   B   m1
    L2: +  B   1   m2
    L3: *  m1  m2  R

R = (A - B) * (B + 1)
Dataflow
Control tied to data
Instruction “fires” when data is available
– Otherwise it is suspended
Order of instructions in program has no effect on execution order
– cf. von Neumann, where program order fixes execution order
No shared rewritable memory
– Write once semantics
Code is stored as a dataflow graph
Data transported as tokens
Parallelism occurs if multiple instructions can fire at same time
– Needs a parallel processor
Nodes are self scheduling
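The firing rule above can be sketched as a minimal token-driven interpreter (an illustrative sketch, not any real machine's model; class and port names are ours). Each node fires as soon as both its operand tokens have arrived, so arrival order, not program order, drives execution:

```python
import operator

class Node:
    """A self-scheduling dataflow node: fires when both operands are present."""
    def __init__(self, op, dests):
        self.op, self.dests = op, dests
        self.operands = {}          # operand number -> token value

    def receive(self, port, value, nodes, results):
        self.operands[port] = value
        if len(self.operands) == 2:              # both operands present: fire
            out = self.op(self.operands[1], self.operands[2])
            for name, dest_port in self.dests:   # send result tokens onward
                if name == "R":
                    results["R"] = out
                else:
                    nodes[name].receive(dest_port, out, nodes, results)

# Dataflow graph for R = (A - B) * (B + 1)
nodes = {
    "sub": Node(operator.sub, [("mul", 1)]),
    "add": Node(operator.add, [("mul", 2)]),
    "mul": Node(operator.mul, [("R", 1)]),
}
results = {}
A, B = 5, 2
nodes["sub"].receive(1, A, nodes, results)   # tokens arrive in arbitrary order
nodes["add"].receive(1, B, nodes, results)
nodes["add"].receive(2, 1, nodes, results)
nodes["sub"].receive(2, B, nodes, results)
print(results["R"])  # (5 - 2) * (2 + 1) = 9
```

Note the write-once semantics: each result is produced exactly once and transported as a token, never rewritten in shared memory.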
Dataflow – Arbitrary and Parallel Execution Order

[Figure: dataflow graph for R = (A - B) * (B + 1) – the '-' and '+' nodes may fire in either order, or simultaneously on a parallel processor, before the '*' node fires]
Implementation

The dataflow model requires a very different execution engine
Data must be stored in a special matching store
Instructions must be triggered when both operands are available
Parallel operations must be scheduled to processors dynamically
– We don't know a priori when they become available.
Instruction operands are pointers
– To an instruction
– Plus an operand number
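A matching store can be sketched as a table keyed by the destination instruction: a token waits until its partner arrives, then the pair is released for execution (a minimal sketch; the class and method names are invented for illustration):

```python
class MatchingStore:
    """Holds a lone token per instruction until its partner token arrives."""
    def __init__(self):
        self.waiting = {}  # instruction address -> (operand number, value)

    def arrive(self, instr, operand_no, value):
        """Return the operand pair if this token completes it, else None."""
        if instr in self.waiting:
            other_no, other_val = self.waiting.pop(instr)
            pair = {operand_no: value, other_no: other_val}
            return pair[1], pair[2]          # both operands: ready to fire
        self.waiting[instr] = (operand_no, value)
        return None                          # suspended, waiting for partner

store = MatchingStore()
print(store.arrive("L3:*", 1, 3))   # None – first operand suspends
print(store.arrive("L3:*", 2, 4))   # (3, 4) – pair matched, instruction fires
```

Because every token must probe this store, it has to be very fast, which is one of the practical costs of pure dataflow noted below.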
Dataflow Model of Execution

Each instruction sends its result as a token to the operand slots of its consumers:

    L1: Compute B  → L2/2, L3/1
    L2: -  A  B    → L4/1
    L3: +  B  1    → L4/2
    L4: *          → L6/1
Parallel Control flow
Sometimes called macro dataflow
– Data flows between blocks of sequential code
– Has the advantages of both dataflow & von Neumann
• Context switch overhead reduced
• Compiler can schedule instructions statically
• No need for a fast matching store
Requires additional control instructions
– Fork/Join
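The fork/join control instructions can be sketched with ordinary threads standing in for sequential blocks (an illustrative sketch; the block and variable names are ours). Each block runs sequentially inside, and data flows between blocks only at the join:

```python
import threading

A, B = 5, 2
results = {}

def block1():            # sequential block: m1 = A - B
    results["m1"] = A - B

def block2():            # sequential block: m2 = B + 1
    results["m2"] = B + 1

t1 = threading.Thread(target=block1)   # FORK
t2 = threading.Thread(target=block2)   # FORK
t1.start(); t2.start()
t1.join(); t2.join()                   # JOIN: wait for both blocks
R = results["m1"] * results["m2"]
print(R)  # (5 - 2) * (2 + 1) = 9
```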
Macro Dataflow (Hybrid Control/Dataflow)

[Figure: hybrid code for R = (A - B) * (B + 1) – blocks of sequential instructions with control transfers (e.g. L3: GOTO L5), ending in L6: * m1 m2 R]
Issues for Hybrid Dataflow

Blocks of sequential instructions need to be large enough to absorb the overheads of context switching
Data memory is the same as in MIMD machines
– Can be partitioned or shared
– Synchronization instructions required
• Semaphores, test-and-set
Control tokens are required to synchronize threads.
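The test-and-set primitive mentioned above can be sketched as follows (a minimal sketch: Python's `threading.Lock` only emulates the hardware atomicity that a real test-and-set instruction provides; all names are ours):

```python
import threading

class TestAndSetLock:
    """Spin lock built on an emulated atomic test-and-set."""
    def __init__(self):
        self._flag = False
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._guard:
            old = self._flag
            self._flag = True
            return old

    def acquire(self):
        while self.test_and_set():       # busy-wait until the flag was clear
            pass

    def release(self):
        self._flag = False

lock = TestAndSetLock()
counter = 0

def bump():
    global counter
    for _ in range(1000):
        lock.acquire()
        counter += 1                     # critical section
        lock.release()

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 4000
```

The busy-waiting here is the same pattern the HEP uses for its register-level synchronization, except that the HEP can run other threads while one spins.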
Some examples
Denelcor HEP

Designed to tolerate memory latency
Fine grain interleaving of threads
The processor pipeline contains 8 stages
Each time step a new thread enters the pipeline
Threads are taken from the Process Status Word (PSW) queue
After a thread is taken from the PSW queue, its instruction and operands are fetched
When an instruction is executed, another thread is placed on the PSW queue
Threads are interleaved at the instruction level.
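This barrel-style interleaving can be sketched in a few lines (illustrative only; the thread names and cycle count are invented). Each cycle the head of the PSW queue issues one instruction and is requeued:

```python
from collections import deque

psw_queue = deque(["T0", "T1", "T2"])   # runnable threads
trace = []
for cycle in range(6):
    thread = psw_queue.popleft()        # next thread enters the pipeline
    trace.append(thread)                # one instruction issued for it
    psw_queue.append(thread)            # requeued for its next turn

print(trace)  # ['T0', 'T1', 'T2', 'T0', 'T1', 'T2']
```

With at least 8 runnable threads (one per pipeline stage), consecutive instructions in the pipeline never belong to the same thread, so memory latency and data hazards within a thread are hidden.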
Denelcor HEP
Memory latency toleration is solved with the Scheduler Function Unit (SFU)
Memory words are tagged as full or empty
Attempting to read an empty word suspends the current thread
– The current PSW entry is then moved to the SFU
When the data is written, the entry is taken from the SFU and placed back on the PSW queue.
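The full/empty mechanism can be sketched as follows (a minimal sketch of the idea, not of HEP hardware; class and field names are ours). A read of an empty word parks the thread; the write fills the word and returns suspended readers to the PSW queue:

```python
from collections import deque

class TaggedWord:
    """A memory word with a full/empty tag, HEP-style."""
    def __init__(self):
        self.full = False
        self.value = None
        self.sfu_waiters = deque()       # threads suspended on this word

    def read(self, thread, psw_queue):
        if not self.full:
            self.sfu_waiters.append(thread)  # suspend: parked in the SFU
            return None
        return self.value

    def write(self, value, psw_queue):
        self.value, self.full = value, True
        while self.sfu_waiters:              # release suspended readers
            psw_queue.append(self.sfu_waiters.popleft())

psw = deque()
word = TaggedWord()
print(word.read("T1", psw))  # None – T1 suspends on the empty word
word.write(42, psw)
print(list(psw))             # ['T1'] – T1 is back on the PSW queue
print(word.read("T1", psw))  # 42
```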
Synchronization on the HEP
All registers have a Full/Empty/Reserved bit
Reading an empty register causes the thread to be placed back on the PSW queue without updating its program counter
Thread synchronization is busy-wait
– But other threads can run
HEP Architecture
[Figure: HEP processor – the PSW queue and increment control feed the matching unit and program memory; operand fetch (operand hands 1 and 2) reads the registers; function units 1..N execute; the SFU connects to/from data memory]
HEP configuration
Up to 16 processors
Up to 128 data memories
Connected by high speed switch
Limitations
– Threads can have only 1 outstanding memory request
– Thread synchronization puts bubbles in the pipeline
– Maximum of 64 threads, causing problems for software
• Need to throttle loops
– If parallelism is lower than 8, full utilisation is not possible.
MIT Alewife Processor
512 Processors in 2-dim mesh
Sparcle Processor
Physically distributed memory
Logical shared memory
Hardware supported cache coherence
Hardware supported user level message passing
Multi-threading
Threading in Alewife
Coarse-grained multithreading
The pipeline works on a single thread as long as no remote memory access or synchronization is required
Can exploit register optimization in the pipeline
Integration of multi-threading with hardware-supported cache coherence
The Sparcle Processor
Extension of the Sun SPARC architecture
Tolerant of memory latency
Fine grained synchronisation
Efficient user level message passing
Fast context switching

The SPARC has 8 overlapping register windows
Used in Sparcle in pairs to represent 4 independent, non-overlapping contexts
– Three for user threads
– One for traps and message handlers
Each context contains 32 general purpose registers and
– PSR (Processor State Register)
– PC (Program Counter)
– nPC (next Program Counter)
Thread states
– Active
– Loaded
• State stored in registers – can become active
– Ready
• Not suspended and not loaded
– Suspended
Thread switching
– Is fast if one thread is active and the other is loaded
– Need to flush the pipeline (cf. HEP)
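The fast switch between loaded contexts can be sketched like this (illustrative only; Sparcle does this in hardware, and all names here are ours). Switching costs only a change of the context pointer, not a register save/restore:

```python
class Context:
    """One hardware context: a thread's registers plus PSR/PC/nPC."""
    def __init__(self, thread):
        self.thread = thread
        self.regs = [0] * 32          # 32 general purpose registers
        self.psr = self.pc = self.npc = 0

class Sparcle:
    def __init__(self, threads):
        # Contexts 0-2 hold user threads; context 3 is reserved
        # for traps and message handlers.
        self.contexts = [Context(t) for t in threads]
        self.cp = 0                   # context pointer: selects active thread

    def switch(self):
        """Fast switch: bump CP to the next loaded user context."""
        self.cp = (self.cp + 1) % 3   # rotate among the three user contexts
        return self.contexts[self.cp].thread

cpu = Sparcle(["user0", "user1", "user2", "trap_handler"])
print(cpu.switch())  # user1
print(cpu.switch())  # user2
print(cpu.switch())  # user0
```

Because each context keeps its own register frame, no state is saved or restored on a switch; only the pipeline must be flushed.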
Sparcle Architecture

[Figure: four register contexts (0:R0–0:R31 through 3:R0–3:R31), each with its own PSR, PC and nPC; the context pointer (CP) selects the active thread]
MIT Alewife and Sparcle
[Figure: Alewife node – Sparcle processor with FPU, 64-kbyte cache, CMMU and main memory; the NR connects the node to the mesh]

NR = Network router
CMMU = Communication & memory management unit
FPU = Floating point unit
From here figures are drawn by Tim
Figure 16.10: thread states in Sparcle

[Figure: process state (register frames up to 3:R31), global register frames (G0 …) and memory]
Figure 16.11: structure of a typical static dataflow PE

[Figure: fetch unit and update unit]
Figure 16.12: structure of a typical tagged-token dataflow PE

[Figure: matching unit with matching store, and update unit; tokens pass to other PEs]
Figure 16.13: organization of the I-structure storage

[Figure: data storage with a presence tag per element –
    k:   W  tag X
    k+1: A  tag Z  nil
    k+2: A  tag Y  nil
    k+3: P  datum
    k+4: W
(P = present, A = absent, W = waiting)]
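I-structure semantics can be sketched as write-once storage with deferred reads (a minimal sketch; the class and state names are ours, following the P/A/W tags in the figure). A read of an element that is not yet present is deferred; the single permitted write satisfies all deferred readers:

```python
from collections import deque

class IStructure:
    """Write-once storage: reads of absent elements are deferred."""
    def __init__(self, size):
        self.state = ["A"] * size     # A(bsent) / W(aiting) / P(resent)
        self.data = [None] * size
        self.deferred = [deque() for _ in range(size)]

    def read(self, i, consumer):
        if self.state[i] == "P":
            return self.data[i]
        self.state[i] = "W"           # defer the read until the write
        self.deferred[i].append(consumer)
        return None

    def write(self, i, value):
        assert self.state[i] != "P", "I-structure elements are write-once"
        self.data[i], self.state[i] = value, "P"
        return list(self.deferred[i]) # consumers to resume with the value

xs = IStructure(4)
print(xs.read(0, "nodeY"))   # None – read deferred, slot 0 now Waiting
print(xs.write(0, 7))        # ['nodeY'] – deferred reader resumed
print(xs.read(0, "nodeZ"))   # 7
```

The write-once rule is what makes this safe under arbitrary firing order: a producer and its consumers never race on a rewritable location.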
Figure 16.14: coding in explicit token-store architectures (a) and (b)

[Figure: when the '-' node fires it produces the tokens <23, <FP, IP+2>> and <23, <FP, IP+1>>, destined for the '*' and '+' nodes]
Figure 16.14: coding in explicit token-store architectures (c)

[Figure: instruction memory at IP holds SUB 2 +1,+2 / ADD 3 +2 / MUL 4 +7; frame memory at FP holds operand slots (values 35, 23, 23), each guarded by a presence bit]
Figure 16.15: structure of a typical explicit token-store dataflow PE

[Figure: tokens from other PEs enter the fetch unit; an effective address selects a slot in the frame memory, whose presence bits control the frame store operation; results pass through the form token unit to/from other PEs]
Figure 16.16: scale of von Neumann/dataflow architectures

Dataflow → Macro dataflow → RISC-like hybrid → von Neumann
Figure 16.17: structure of a typical macro dataflow PE

[Figure: matching unit; instruction and frame memory; fetch unit; form token unit]
Figure 16.18: organization of a PE in the MIT Hybrid Machine

[Figure: the PC and FBR drive instruction fetch from instruction memory; a decode unit and operand fetch read the frame memory; an enabled continuation queue (token queue) holds runnable continuations; the PE connects to/from global memory]
Figure 16.19: comparison of (a) SQ and (b) SCB macro nodes

[Figure: inputs a, b, c feed macro nodes SQ1/SQ2 and SCB1/SCB2, each enclosing instruction sequences l1–l3 and l4–l6]
Figure 16.20: structure of the USC Decoupled Architecture

[Figure: clusters attached to the network (graph virtual space); each cluster (e.g. Cluster 0) pairs a GC and DFGE with a CE and CC, linked through RQ and AQ queues]
Figure 16.21: structure of a node in the SAM

[Figure: main memory shared by the SEU, APU, ASU and LEU; 'fire' and 'done' signals pass between the units; the LEU connects to/from the network]
Figure 16.22: structure of the P-RISC processing element

[Figure: token queue and frame memory feed the functional unit; messages pass to/from other PE's memory]
Figure 16.23: transformation of dataflow graphs into control flow graphs – (a) dataflow graph, (b) control flow graph

[Figure: the two '+' nodes of the dataflow graph become control flow blocks connected by fork L1 and join, followed by the '*' and '-' operations]
Figure 16.24: structure of a *T node

[Figure: a network interface with message formatter and message queues connects the node to the network; a data processor (register sets sIP/sFP/sV1/sV2 and dIP/dFP/dV1/dV2) works alongside a synchronization coprocessor and a remote-memory-request coprocessor; a continuation queue holds <IP, FP> pairs; all share the local memory]