Chapter 16

Memory and Synchronization Latency

Scalability of a system is limited by its ability to handle memory latency & algorithmic synchronization delays
Overall solution is well known
– Do something else whilst waiting
Remote memory accesses
– Much slower than local
– Varying delay depending on
• Network traffic
• Memory traffic
David Abramson, 2004 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997
Processor Utilization
Utilization
– U = P / T
  • P: time spent processing
  • T: total time
– U = P / (P + I + S)
  • I: time spent waiting on other tasks
  • S: time spent switching tasks
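As a small worked example of the utilization formula above (the millisecond values are invented for illustration):

```python
def utilization(p, i, s):
    """U = P / (P + I + S): fraction of total time spent doing useful work.

    p: time spent processing
    i: time spent waiting on other tasks
    s: time spent switching tasks
    """
    return p / (p + i + s)

# Hypothetical numbers: 60 ms processing, 30 ms waiting, 10 ms switching.
print(utilization(60, 30, 10))  # 0.6
```

Reducing I (by doing something else whilst waiting) and keeping S small are exactly the goals of the multithreaded designs that follow.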
Basic ideas - Multithreading
Fine grain – task switch every cycle
Blocked (coarse grain) – task switch only when the running thread blocks
Design Space

[Figure: design space of multi-threaded architectures – computational model ranging from von Neumann (sequential control flow) through hybrid von Neumann/dataflow to dataflow; fine grain parallelism; physical shared memory; small scale (4–10 processors)]
Classification of multi-threaded architectures
Multi-threaded architectures
– HEP-style: HEP, Tera
– RISC-like: P-RISC
– Macro dataflow: MIT Hybrid Machine
– Decoupled architectures: USC
Computational Models
Sequential control flow (von Neumann)
Sequential Control Flow Model

Control flows from one instruction to the next; intermediate results pass through memory cells m1 and m2:

    L1: -  A   B   m1
    L2: +  B   1   m2
    L3: *  m1  m2  R

R = (A - B) * (B + 1)
Dataflow
Control tied to data
Instruction “fires” when data is available
– Otherwise it is suspended
Order of instructions in program has no effect on execution order
– cf. von Neumann, where program order fixes execution order
No shared rewritable memory
– Write once semantics
Code is stored as a dataflow graph
Data transported as tokens
Parallelism occurs if multiple instructions can fire at same time
– Needs a parallel processor
Nodes are self scheduling
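The firing rule above can be sketched as a minimal token-driven interpreter (an illustrative sketch, not any real machine's model; class and port names are ours). Each node fires as soon as both its operand tokens have arrived, so arrival order, not program order, drives execution:

```python
import operator

class Node:
    """A self-scheduling dataflow node: fires when both operands are present."""
    def __init__(self, op, dests):
        self.op, self.dests = op, dests
        self.operands = {}          # operand number -> token value

    def receive(self, port, value, nodes, results):
        self.operands[port] = value
        if len(self.operands) == 2:              # both operands present: fire
            out = self.op(self.operands[1], self.operands[2])
            for name, dest_port in self.dests:   # send result tokens onward
                if name == "R":
                    results["R"] = out
                else:
                    nodes[name].receive(dest_port, out, nodes, results)

# Dataflow graph for R = (A - B) * (B + 1)
nodes = {
    "sub": Node(operator.sub, [("mul", 1)]),
    "add": Node(operator.add, [("mul", 2)]),
    "mul": Node(operator.mul, [("R", 1)]),
}
results = {}
A, B = 5, 2
nodes["sub"].receive(1, A, nodes, results)   # tokens arrive in arbitrary order
nodes["add"].receive(1, B, nodes, results)
nodes["add"].receive(2, 1, nodes, results)
nodes["sub"].receive(2, B, nodes, results)
print(results["R"])  # (5 - 2) * (2 + 1) = 9
```

Note the write-once semantics: each result is produced exactly once and transported as a token, never rewritten in shared memory.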
Dataflow – Arbitrary and Parallel Execution Order

[Figure: dataflow graph for R = (A - B) * (B + 1) – the '-' and '+' nodes may fire in either order, or simultaneously on a parallel processor, before the '*' node fires]
Implementation

The dataflow model requires a very different execution engine
Data must be stored in a special matching store
Instructions must be triggered when both operands are available
Parallel operations must be scheduled to processors dynamically
– We don't know a priori when they become available.
Instruction operands are pointers
– To an instruction
– Plus an operand number
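A matching store can be sketched as a table keyed by the destination instruction: a token waits until its partner arrives, then the pair is released for execution (a minimal sketch; the class and method names are invented for illustration):

```python
class MatchingStore:
    """Holds a lone token per instruction until its partner token arrives."""
    def __init__(self):
        self.waiting = {}  # instruction address -> (operand number, value)

    def arrive(self, instr, operand_no, value):
        """Return the operand pair if this token completes it, else None."""
        if instr in self.waiting:
            other_no, other_val = self.waiting.pop(instr)
            pair = {operand_no: value, other_no: other_val}
            return pair[1], pair[2]          # both operands: ready to fire
        self.waiting[instr] = (operand_no, value)
        return None                          # suspended, waiting for partner

store = MatchingStore()
print(store.arrive("L3:*", 1, 3))   # None – first operand suspends
print(store.arrive("L3:*", 2, 4))   # (3, 4) – pair matched, instruction fires
```

Because every token must probe this store, it has to be very fast, which is one of the practical costs of pure dataflow noted below.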
Dataflow Model of Execution

Each instruction sends its result as a token to the operand slots of its consumers:

    L1: Compute B  → L2/2, L3/1
    L2: -  A  B    → L4/1
    L3: +  B  1    → L4/2
    L4: *          → L6/1
Parallel Control flow
Sometimes called macro dataflow
– Data flows between blocks of sequential code
– Has the advantages of both dataflow & von Neumann
• Context switch overhead reduced
• Compiler can schedule instructions statically
• No need for a fast matching store
Requires additional control instructions
– Fork/Join
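The fork/join control instructions can be sketched with ordinary threads standing in for sequential blocks (an illustrative sketch; the block and variable names are ours). Each block runs sequentially inside, and data flows between blocks only at the join:

```python
import threading

A, B = 5, 2
results = {}

def block1():            # sequential block: m1 = A - B
    results["m1"] = A - B

def block2():            # sequential block: m2 = B + 1
    results["m2"] = B + 1

t1 = threading.Thread(target=block1)   # FORK
t2 = threading.Thread(target=block2)   # FORK
t1.start(); t2.start()
t1.join(); t2.join()                   # JOIN: wait for both blocks
R = results["m1"] * results["m2"]
print(R)  # (5 - 2) * (2 + 1) = 9
```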
Macro Dataflow (Hybrid Control/Dataflow)

[Figure: hybrid code for R = (A - B) * (B + 1) – blocks of sequential instructions with control transfers (e.g. L3: GOTO L5), ending in L6: * m1 m2 R]
Issues for Hybrid Dataflow

Blocks of sequential instructions need to be large enough to absorb the overheads of context switching
Data memory is the same as in MIMD machines
– Can be partitioned or shared
– Synchronization instructions required
• Semaphores, test-and-set
Control tokens are required to synchronize threads.
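The test-and-set primitive mentioned above can be sketched as follows (a minimal sketch: Python's `threading.Lock` only emulates the hardware atomicity that a real test-and-set instruction provides; all names are ours):

```python
import threading

class TestAndSetLock:
    """Spin lock built on an emulated atomic test-and-set."""
    def __init__(self):
        self._flag = False
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._guard:
            old = self._flag
            self._flag = True
            return old

    def acquire(self):
        while self.test_and_set():       # busy-wait until the flag was clear
            pass

    def release(self):
        self._flag = False

lock = TestAndSetLock()
counter = 0

def bump():
    global counter
    for _ in range(1000):
        lock.acquire()
        counter += 1                     # critical section
        lock.release()

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 4000
```

The busy-waiting here is the same pattern the HEP uses for its register-level synchronization, except that the HEP can run other threads while one spins.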
Some examples
Denelcor HEP

Designed to tolerate memory latency
Fine grain interleaving of threads
The processor pipeline contains 8 stages
Each time step a new thread enters the pipeline
Threads are taken from the Process Status Word (PSW) queue
After a thread is taken from the PSW queue, its instruction and operands are fetched
When an instruction is executed, another thread is placed on the PSW queue
Threads are interleaved at the instruction level.
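This barrel-style interleaving can be sketched in a few lines (illustrative only; the thread names and cycle count are invented). Each cycle the head of the PSW queue issues one instruction and is requeued:

```python
from collections import deque

psw_queue = deque(["T0", "T1", "T2"])   # runnable threads
trace = []
for cycle in range(6):
    thread = psw_queue.popleft()        # next thread enters the pipeline
    trace.append(thread)                # one instruction issued for it
    psw_queue.append(thread)            # requeued for its next turn

print(trace)  # ['T0', 'T1', 'T2', 'T0', 'T1', 'T2']
```

With at least 8 runnable threads (one per pipeline stage), consecutive instructions in the pipeline never belong to the same thread, so memory latency and data hazards within a thread are hidden.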
Denelcor HEP
Memory latency toleration is solved with the Scheduler Function Unit (SFU)
Memory words are tagged as full or empty
Attempting to read an empty word suspends the current thread
– The current PSW entry is then moved to the SFU
When the data is written, the entry is taken from the SFU and placed back on the PSW queue.
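The full/empty mechanism can be sketched as follows (a minimal sketch of the idea, not of HEP hardware; class and field names are ours). A read of an empty word parks the thread; the write fills the word and returns suspended readers to the PSW queue:

```python
from collections import deque

class TaggedWord:
    """A memory word with a full/empty tag, HEP-style."""
    def __init__(self):
        self.full = False
        self.value = None
        self.sfu_waiters = deque()       # threads suspended on this word

    def read(self, thread, psw_queue):
        if not self.full:
            self.sfu_waiters.append(thread)  # suspend: parked in the SFU
            return None
        return self.value

    def write(self, value, psw_queue):
        self.value, self.full = value, True
        while self.sfu_waiters:              # release suspended readers
            psw_queue.append(self.sfu_waiters.popleft())

psw = deque()
word = TaggedWord()
print(word.read("T1", psw))  # None – T1 suspends on the empty word
word.write(42, psw)
print(list(psw))             # ['T1'] – T1 is back on the PSW queue
print(word.read("T1", psw))  # 42
```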
Synchronization on the HEP
All registers have a Full/Empty/Reserved bit
Reading an empty register causes the thread to be placed back on the PSW queue without updating its program counter
Thread synchronization is busy-wait
– But other threads can run
HEP Architecture
[Figure: HEP processor – the PSW queue and increment control feed the matching unit and program memory; operand fetch (operand hands 1 and 2) reads the registers; function units 1..N execute; the SFU connects to/from data memory]
HEP configuration
Up to 16 processors
Up to 128 data memories
Connected by high speed switch
Limitations
– Threads can have only 1 outstanding memory request
– Thread synchronization puts bubbles in the pipeline
– Maximum of 64 threads, causing problems for software
• Need to throttle loops
– If parallelism is lower than 8, full utilisation is not possible.
MIT Alewife Processor
512 Processors in 2-dim mesh
Sparcle Processor
Physically distributed memory
Logical shared memory
Hardware supported cache coherence
Hardware supported user level message passing
Multi-threading
Threading in Alewife
Coarse-grained multithreading
The pipeline works on a single thread as long as no remote memory access or synchronization is required
Can exploit register optimization in the pipeline
Integration of multi-threading with hardware-supported cache coherence
The Sparcle Processor
Extension of the Sun SPARC architecture
Tolerant of memory latency
Fine grained synchronisation
Efficient user level message passing
Fast context switching

The SPARC has 8 overlapping register windows
Used in Sparcle in pairs to represent 4 independent, non-overlapping contexts
– Three for user threads
– One for traps and message handlers
Each context contains 32 general purpose registers and
– PSR (Processor State Register)
– PC (Program Counter)
– nPC (next Program Counter)
Thread states
– Active
– Loaded
• State stored in registers – can become active
– Ready
• Not suspended and not loaded
– Suspended
Thread switching
– Is fast if one thread is active and the other is loaded
– Need to flush the pipeline (cf. HEP)
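The fast switch between loaded contexts can be sketched like this (illustrative only; Sparcle does this in hardware, and all names here are ours). Switching costs only a change of the context pointer, not a register save/restore:

```python
class Context:
    """One hardware context: a thread's registers plus PSR/PC/nPC."""
    def __init__(self, thread):
        self.thread = thread
        self.regs = [0] * 32          # 32 general purpose registers
        self.psr = self.pc = self.npc = 0

class Sparcle:
    def __init__(self, threads):
        # Contexts 0-2 hold user threads; context 3 is reserved
        # for traps and message handlers.
        self.contexts = [Context(t) for t in threads]
        self.cp = 0                   # context pointer: selects active thread

    def switch(self):
        """Fast switch: bump CP to the next loaded user context."""
        self.cp = (self.cp + 1) % 3   # rotate among the three user contexts
        return self.contexts[self.cp].thread

cpu = Sparcle(["user0", "user1", "user2", "trap_handler"])
print(cpu.switch())  # user1
print(cpu.switch())  # user2
print(cpu.switch())  # user0
```

Because each context keeps its own register frame, no state is saved or restored on a switch; only the pipeline must be flushed.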
Sparcle Architecture

[Figure: four register contexts (0:R0–0:R31 through 3:R0–3:R31), each with its own PSR, PC and nPC; the context pointer (CP) selects the active thread]
MIT Alewife and Sparcle
[Figure: Alewife node – Sparcle processor with FPU, 64-kbyte cache, CMMU and main memory; the NR connects the node to the mesh]

NR = Network router
CMMU = Communication & memory management unit
FPU = Floating point unit
From here figures are drawn by Tim
Figure 16.10: thread states in Sparcle

[Figure: process state (register frames up to 3:R31), global register frames (G0 …) and memory]
Figure 16.11: structure of a typical static dataflow PE

[Figure: fetch unit and update unit]
Figure 16.12: structure of a typical tagged-token dataflow PE

[Figure: matching unit with matching store, and update unit; tokens pass to other PEs]
Figure 16.13: organization of the I-structure storage

[Figure: data storage with a presence tag per element –
    k:   W  tag X
    k+1: A  tag Z  nil
    k+2: A  tag Y  nil
    k+3: P  datum
    k+4: W
(P = present, A = absent, W = waiting)]
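I-structure semantics can be sketched as write-once storage with deferred reads (a minimal sketch; the class and state names are ours, following the P/A/W tags in the figure). A read of an element that is not yet present is deferred; the single permitted write satisfies all deferred readers:

```python
from collections import deque

class IStructure:
    """Write-once storage: reads of absent elements are deferred."""
    def __init__(self, size):
        self.state = ["A"] * size     # A(bsent) / W(aiting) / P(resent)
        self.data = [None] * size
        self.deferred = [deque() for _ in range(size)]

    def read(self, i, consumer):
        if self.state[i] == "P":
            return self.data[i]
        self.state[i] = "W"           # defer the read until the write
        self.deferred[i].append(consumer)
        return None

    def write(self, i, value):
        assert self.state[i] != "P", "I-structure elements are write-once"
        self.data[i], self.state[i] = value, "P"
        return list(self.deferred[i]) # consumers to resume with the value

xs = IStructure(4)
print(xs.read(0, "nodeY"))   # None – read deferred, slot 0 now Waiting
print(xs.write(0, 7))        # ['nodeY'] – deferred reader resumed
print(xs.read(0, "nodeZ"))   # 7
```

The write-once rule is what makes this safe under arbitrary firing order: a producer and its consumers never race on a rewritable location.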
Figure 16.14: coding in explicit token-store architectures (a) and (b)

[Figure: when the '-' node fires it produces the tokens <23, <FP, IP+2>> and <23, <FP, IP+1>>, destined for the '*' and '+' nodes]
Figure 16.14: coding in explicit token-store architectures (c)

[Figure: instruction memory at IP holds SUB 2 +1,+2 / ADD 3 +2 / MUL 4 +7; frame memory at FP holds operand slots (values 35, 23, 23), each guarded by a presence bit]
Figure 16.15: structure of a typical explicit token-store dataflow PE

[Figure: tokens from other PEs enter the fetch unit; an effective address selects a slot in the frame memory, whose presence bits control the frame store operation; results pass through the form token unit to/from other PEs]
Figure 16.16: scale of von Neumann/dataflow architectures

Dataflow → Macro dataflow → RISC-like hybrid → von Neumann
Figure 16.17: structure of a typical macro dataflow PE

[Figure: matching unit; instruction and frame memory; fetch unit; form token unit]
Figure 16.18: organization of a PE in the MIT Hybrid Machine

[Figure: the PC and FBR drive instruction fetch from instruction memory; a decode unit and operand fetch read the frame memory; an enabled continuation queue (token queue) holds runnable continuations; the PE connects to/from global memory]
Figure 16.19: comparison of (a) SQ and (b) SCB macro nodes

[Figure: inputs a, b, c feed macro nodes SQ1/SQ2 and SCB1/SCB2, each enclosing instruction sequences l1–l3 and l4–l6]
Figure 16.20: structure of the USC Decoupled Architecture

[Figure: clusters attached to the network (graph virtual space); each cluster (e.g. Cluster 0) pairs a GC and DFGE with a CE and CC, linked through RQ and AQ queues]
Figure 16.21: structure of a node in the SAM

[Figure: main memory shared by the SEU, APU, ASU and LEU; 'fire' and 'done' signals pass between the units; the LEU connects to/from the network]
Figure 16.22: structure of the P-RISC processing element

[Figure: token queue and frame memory feed the functional unit; messages pass to/from other PE's memory]
Figure 16.23: transformation of dataflow graphs into control flow graphs – (a) dataflow graph, (b) control flow graph

[Figure: the two '+' nodes of the dataflow graph become control flow blocks connected by fork L1 and join, followed by the '*' and '-' operations]
Figure 16.24: structure of a *T node

[Figure: a network interface with message formatter and message queues connects the node to the network; a data processor (register sets sIP/sFP/sV1/sV2 and dIP/dFP/dV1/dV2) works alongside a synchronization coprocessor and a remote-memory-request coprocessor; a continuation queue holds <IP, FP> pairs; all share the local memory]