
Multi-Threaded Architectures

Chapter 16

© David Abramson, 2004. Material from Sima, Fountain and Kacsuk, Advanced Computer Architectures, Addison-Wesley, 1997.
Memory and Synchronization Latency
• The scalability of a system is limited by its ability to handle memory latency and algorithmic synchronization delays
• The overall solution is well known
  – Do something else whilst waiting
• Remote memory accesses
  – Much slower than local accesses
  – Varying delay, depending on
    • Network traffic
    • Memory traffic

Processor Utilization
• Utilization
  – U = P / T
    • P = time spent processing
    • T = total time
  – Equivalently, U = P / (P + I + S)
    • I = time spent waiting on other tasks
    • S = time spent switching tasks

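An illustrative worked example (the numbers are assumed, not from the slides): if over some interval a processor spends P = 60 ms processing, I = 30 ms waiting on other tasks and S = 10 ms switching tasks, then T = P + I + S = 100 ms and U = 60/100 = 0.6. The processor does useful work only 60% of the time; multithreading tries to convert the I term into useful work from other threads, at the cost of some extra S.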
Basic ideas – Multithreading
• Fine grain – task switch every cycle
[Figure: fine-grain timeline; a new thread issues each cycle, so blocked threads are simply skipped over]
• Coarse grain – task switch every n cycles
[Figure: coarse-grain timeline; each thread runs until it blocks, paying a task-switch overhead at every switch]

Design Space

[Figure: the design space of multi-threaded architectures, redrawn as a list]
• Computational model
  – Von Neumann (sequential control flow)
  – Hybrid von Neumann/dataflow
  – Parallel control flow based on parallel control operators
  – Parallel control flow based on control tokens
• Granularity
  – Fine grain
  – Coarse grain
• Memory organization
  – Physical shared memory
  – Distributed shared memory
  – Cache-coherent distributed shared memory
• Number of threads per processor
  – Small (4 – 10)
  – Middle (10 – 100)
  – Large (over 100)

Classification of multi-threaded architectures

[Figure: classification tree, redrawn as a list]
• Von Neumann based architectures
  – HEP
  – Tera
  – MIT Alewife & Sparcle
• Hybrid von Neumann/dataflow architectures
  – RISC-like: P-RISC, *T
  – Decoupled: USC
  – Macro dataflow: MIT Hybrid Machine, McGill MGDA & SAM, EM-4
Computational Models

Sequential control flow (von Neumann)

• Flow of control and data are separated
• Instructions execute sequentially (or at least with sequential semantics – see Chapter 7)
• Control flow is changed with JUMP/GOTO/CALL instructions
• Data is stored in rewritable memory
  – The flow of data does not affect the execution order

Sequential Control Flow Model

[Figure: three-address code executed under a single flow of control, redrawn as a listing]
L1: m1 ← A - B
L2: m2 ← B + 1
L3: R  ← m1 * m2

R = (A - B) * (B + 1)

Dataflow
• Control is tied to data
• An instruction “fires” when its data is available (see the sketch below)
  – Otherwise it is suspended
• The order of instructions in the program has no effect on execution order
  – Cf. von Neumann
• No shared rewritable memory
  – Write-once semantics
• Code is stored as a dataflow graph
• Data is transported as tokens
• Parallelism occurs when multiple instructions can fire at the same time
  – Needs a parallel processor
• Nodes are self-scheduling

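To make the firing rule concrete, here is a minimal sketch in C of self-scheduling two-input nodes computing R = (A - B) * (B + 1). The Node type, the slot numbering and the send_token helper are invented for illustration; a real dataflow machine uses tagged tokens and a matching store (see the figures later in this chapter).

    /* A minimal sketch of the dataflow firing rule: each two-operand
     * node fires as soon as both operand slots are full, in whatever
     * order tokens happen to arrive. Illustrative toy, not any real
     * machine's token format. */
    #include <stdio.h>

    typedef struct {
        char op;            /* '-', '+' or '*' */
        double operand[2];
        int present[2];     /* presence bits: has operand i arrived? */
        double result;
        int fired;
    } Node;

    /* Deliver a token to one operand slot; fire when both are present. */
    static void send_token(Node *n, int slot, double value) {
        n->operand[slot] = value;
        n->present[slot] = 1;
        if (n->present[0] && n->present[1] && !n->fired) {
            switch (n->op) {
            case '-': n->result = n->operand[0] - n->operand[1]; break;
            case '+': n->result = n->operand[0] + n->operand[1]; break;
            case '*': n->result = n->operand[0] * n->operand[1]; break;
            }
            n->fired = 1;
        }
    }

    int main(void) {
        double A = 5, B = 2;
        Node sub = {'-'}, add = {'+'}, mul = {'*'};

        /* Tokens may be sent in any order: the graph, not the program
         * order, determines when each node fires. */
        send_token(&add, 1, 1);           /* the constant 1 */
        send_token(&sub, 0, A);
        send_token(&add, 0, B);           /* '+' fires here */
        send_token(&sub, 1, B);           /* '-' fires here */
        send_token(&mul, 0, sub.result);
        send_token(&mul, 1, add.result);  /* '*' fires last */

        printf("R = %g\n", mul.result);   /* (5 - 2) * (2 + 1) = 9 */
        return 0;
    }

Reordering the send_token calls changes nothing: each node fires only when its second operand arrives, which is exactly the data-driven behaviour described above.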
Dataflow – execution order

[Figures: the dataflow graph for R = (A - B) * (B + 1); tokens for A and B flow into the “-” and “+” nodes, whose result tokens feed the “*” node. Successive slides show the “-” and “+” nodes firing in arbitrary order, and then in parallel.]
Implementation
• The dataflow model requires a very different execution engine
• Data must be stored in a special matching store
• Instructions must be triggered when both operands are available
• Parallel operations must be scheduled to processors dynamically
  – We don’t know a priori when they will be available
• Instruction operands are pointers
  – To the destination instruction
  – And its operand number

Dataflow model of execution

[Figure: dataflow code for R = (A - B) * (B + 1), redrawn as a listing; Ln/k denotes operand slot k of instruction Ln]
L1: compute B → L2/2, L3/1
L2: - (A, B)  → L4/1
L3: + (B, 1)  → L4/2
L4: *         → L6/1

Parallel Control flow
• Sometimes called macro dataflow
  – Data flows between blocks of sequential code
  – Has the advantages of both dataflow and von Neumann
    • Context switch overhead is reduced
    • The compiler can schedule instructions statically
    • No fast matching store is needed
• Requires additional control instructions
  – FORK/JOIN (see the listing and sketch below)

Macro Dataflow (Hybrid Control/Dataflow)

[Figure: two control-flow threads computing R = (A - B) * (B + 1), redrawn as a listing]
L1: FORK L4      ; spawn a second thread at L4
L2: m1 ← A - B   ; thread 1
L3: GOTO L5
L4: m2 ← B + 1   ; thread 2
L5: JOIN 2       ; wait for both threads to arrive
L6: R ← m1 * m2

R = (A - B) * (B + 1)
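
The same computation in the FORK/JOIN style, sketched with POSIX threads. This is a software analogy, not how a macro dataflow machine is built: pthread_create and pthread_join merely stand in for the FORK and JOIN instructions in the listing above.

    /* FORK/JOIN version of R = (A - B) * (B + 1). m1 and m2 are written
     * by different threads, so no synchronization beyond the join is
     * needed. Illustrative sketch only. */
    #include <pthread.h>
    #include <stdio.h>

    static double A = 5, B = 2;
    static double m1, m2;

    static void *sub_thread(void *arg) { (void)arg; m1 = A - B; return NULL; } /* L2 */
    static void *add_thread(void *arg) { (void)arg; m2 = B + 1; return NULL; } /* L4 */

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, sub_thread, NULL);  /* L1: FORK */
        pthread_create(&t2, NULL, add_thread, NULL);
        pthread_join(t1, NULL);                       /* L5: JOIN 2 */
        pthread_join(t2, NULL);
        printf("R = %g\n", m1 * m2);                  /* L6: prints 9 */
        return 0;
    }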
Issues for Hybrid dataflow
• Blocks of sequential instructions need to be large enough to absorb the overheads of context switching
• Data memory is the same as in MIMD machines
  – Can be partitioned or shared
  – Synchronization instructions are required
    • Semaphores, test-and-set
• Control tokens are required to synchronize threads

Some examples

Denelcor HEP
• Designed to tolerate memory latency
• Fine-grain interleaving of threads
• The processor pipeline contains 8 stages
• Each time step, a new thread enters the pipeline
• Threads are taken from the Process Status Word (PSW) queue
• After a thread is taken from the PSW queue, its instruction and operands are fetched
• When an instruction completes, its thread is placed back on the PSW queue
• Threads are interleaved at the instruction level

Denelcor HEP
• Memory latency toleration is handled by the Scheduler Function Unit (SFU)
• Memory words are tagged as full or empty
• Attempting to read an empty word suspends the current thread
  – The current PSW entry is then moved to the SFU
• When the data is written, the entry is taken from the SFU and placed back on the PSW queue
Synchronization on the HEP
• All registers have a Full/Empty/Reserved bit
• Reading an empty register causes the thread to be placed back on the PSW queue without updating its program counter
• Thread synchronization is therefore busy-waiting (a software sketch follows below)
  – But other threads can run in the meantime

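The full/empty discipline can be imitated in software. Below is a minimal sketch assuming a one-word cell guarded by a mutex and condition variable; the FEWord type and its operations are invented for illustration, and unlike the HEP, which re-queues the PSW entry in hardware, this simply blocks the calling thread.

    /* Software imitation of HEP-style full/empty bits: reading an empty
     * cell waits until a write fills it; the write tags it full and wakes
     * the waiting readers. Illustrative sketch only. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  filled;
        int             full;    /* the full/empty tag bit */
        double          value;
    } FEWord;

    void fe_init(FEWord *w) {
        pthread_mutex_init(&w->lock, NULL);
        pthread_cond_init(&w->filled, NULL);
        w->full = 0;
    }

    /* Suspend the caller until the word is full, then read it. */
    double fe_read(FEWord *w) {
        pthread_mutex_lock(&w->lock);
        while (!w->full)                       /* empty: reader suspends */
            pthread_cond_wait(&w->filled, &w->lock);
        double v = w->value;
        pthread_mutex_unlock(&w->lock);
        return v;
    }

    /* Write the word, tag it full, and wake any suspended readers. */
    void fe_write(FEWord *w, double v) {
        pthread_mutex_lock(&w->lock);
        w->value = v;
        w->full = 1;
        pthread_cond_broadcast(&w->filled);
        pthread_mutex_unlock(&w->lock);
    }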
HEP Architecture

[Figure: HEP processor organization; the PSW queue feeds the matching unit and program memory, increment control and operand fetch (operand hands 1 and 2) read the registers, function units 1..N execute, and the SFU handles traffic to and from data memory]
HEP configuration
• Up to 16 processors
• Up to 128 data memories
• Connected by a high-speed switch
• Limitations
  – Threads can have only one outstanding memory request
  – Thread synchronization puts bubbles in the pipeline
  – The maximum of 64 threads causes problems for software
    • Loops need to be throttled
  – If parallelism is lower than 8, full utilisation is not possible

MIT Alewife Processor
• 512 processors in a 2-D mesh
• Sparcle processor
• Physically distributed memory
• Logically shared memory
• Hardware-supported cache coherence
• Hardware-supported user-level message passing
• Multi-threading

Threading in Alewife
• Coarse-grained multithreading
• The pipeline works on a single thread as long as no remote memory access or synchronization is required (see the sketch below)
• Can exploit register optimization in the pipeline
• Integrates multi-threading with hardware-supported cache coherence

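A toy illustration of coarse-grained switching: a round-robin scheduler over a few loaded contexts runs each thread until a simulated remote access. All names and the miss pattern are invented; a real Alewife node switches hardware register contexts instead.

    /* Coarse-grained multithreading, simulated: run the current context
     * until a (simulated) remote memory access, then switch to the next
     * loaded context. Illustrative toy only. */
    #include <stdio.h>

    #define NCONTEXTS 4
    #define STEPS_PER_THREAD 9

    typedef struct {
        int pc;      /* progress of this thread */
        int done;
    } Context;

    int main(void) {
        Context ctx[NCONTEXTS] = {{0}};
        int finished = 0, cp = 0;         /* cp = context pointer */

        while (finished < NCONTEXTS) {
            Context *c = &ctx[cp];
            if (!c->done) {
                /* Run until a simulated remote access (every 3rd step). */
                do {
                    c->pc++;
                    printf("context %d executes step %d\n", cp, c->pc);
                } while (c->pc % 3 != 0 && c->pc < STEPS_PER_THREAD);
                if (c->pc >= STEPS_PER_THREAD) { c->done = 1; finished++; }
            }
            cp = (cp + 1) % NCONTEXTS;    /* switch to the next loaded context */
        }
        return 0;
    }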
The Sparcle Processor
• An extension of the Sun SPARC architecture
• Tolerant of memory latency
• Fine-grained synchronisation
• Efficient user-level message passing

Fast context switching
• SPARC has 8 overlapping register windows
• Sparcle uses them in pairs to represent 4 independent, non-overlapping contexts
  – Three for user threads
  – One for traps and message handlers
• Each context contains 32 general-purpose registers plus
  – PSR (Processor State Register)
  – PC (Program Counter)
  – nPC (next Program Counter)
• Thread states
  – Active
  – Loaded
    • State stored in registers – can become active
  – Ready
    • Not suspended and not loaded
  – Suspended
• Thread switching
  – Is fast if one thread is active and the other is loaded
  – Still needs to flush the pipeline (cf. the HEP, whose cycle-by-cycle interleaving avoids this)

Sparcle Architecture

[Figure: the register file partitioned into four contexts (0:R0–0:R31 … 3:R0–3:R31), each with its own PSR, PC and nPC; the CP (context pointer) selects the active thread]
MIT Alewife and Sparcle

[Figure: an Alewife node; the Sparcle processor and FPU with a 64-kbyte cache, connected through the CMMU to main memory and the network router]
NR = Network router
CMMU = Communication & memory management unit
FPU = Floating point unit
From here on, the figures are drawn by Tim.
Figure 16.10: Thread states in Sparcle

[Figure: process state comprises the global register frames (G0–G7) and four PC and PSR frames with their register contexts (0:R0–0:R31 … 3:R0–3:R31); the CP points at the active thread, the other register-resident threads are loaded, and unloaded threads wait in ready and suspended queues in memory]
Figure 16.11: Structure of a typical static dataflow PE

[Figure: a fetch unit takes entries from the instruction queue and dispatches them to function units 1..N; an update unit writes results to the activity store and exchanges tokens with other PEs]
Figure 16.12: Structure of a typical tagged-token dataflow PE

[Figure: tokens from the token queue pass through the matching unit and its matching store; matched pairs go to the fetch unit, which reads the instruction/data memory and dispatches to function units 1..N, and the update unit sends result tokens to other PEs]
Figure 16.13: Organization of the I-structure storage

[Figure: each data-storage slot carries presence bits (A = Absent, P = Present, W = Waiting); a present slot holds a datum, while absent/waiting slots hold a tagged chain of deferred read requests, ending in nil, to be satisfied when the datum is written (a software sketch follows below)]
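
A minimal software sketch of the I-structure discipline, assuming deferred reads are represented as callbacks queued on an absent slot and run by the single write. The types and names are invented for illustration, and the sketch is not thread-safe.

    /* I-structure slot: write-once storage with deferred reads. Reading
     * an absent slot queues the request; the single write satisfies all
     * queued readers and tags the slot present. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef void (*Reader)(double value, void *ctx);

    typedef struct Deferred {
        Reader fn;
        void *ctx;
        struct Deferred *next;
    } Deferred;

    typedef enum { ABSENT, PRESENT, WAITING } Presence;

    typedef struct {
        Presence tag;        /* the presence bits of Figure 16.13 */
        double datum;
        Deferred *waiters;   /* chain of deferred read requests */
    } ISlot;

    void islot_read(ISlot *s, Reader fn, void *ctx) {
        if (s->tag == PRESENT) {              /* datum already there */
            fn(s->datum, ctx);
        } else {                              /* defer the read */
            Deferred *d = malloc(sizeof *d);
            d->fn = fn; d->ctx = ctx; d->next = s->waiters;
            s->waiters = d;
            s->tag = WAITING;
        }
    }

    void islot_write(ISlot *s, double v) {
        if (s->tag == PRESENT) {              /* write-once violation */
            fprintf(stderr, "I-structure written twice\n");
            return;
        }
        s->datum = v;
        for (Deferred *d = s->waiters; d != NULL; ) {
            Deferred *next = d->next;
            d->fn(v, d->ctx);                 /* satisfy a deferred read */
            free(d);
            d = next;
        }
        s->waiters = NULL;
        s->tag = PRESENT;
    }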
Figure 16.14: Coding in explicit token-store architectures (a) and (b)

[Figure: the “-” node receives the tokens <35, <FP, IP>> and <12, <FP, IP>> and fires, producing the result tokens <23, <FP, IP+1>> and <23, <FP, IP+2>> for the “+” and “*” nodes]
Figure 16.14: Coding in explicit token-store architectures (c)

[Figure: instruction memory holds SUB 2 +1,+2 (at IP), ADD 3 +2 and MUL 4 +7; each instruction names the frame offset where its operands rendezvous and the offsets of its successors. An arriving token finds 35 waiting at FP+2 (presence bit set), SUB fires, and the result 23 is deposited in the frame slots FP+3 and FP+4 for ADD and MUL]
Figure 16.15: Structure of a typical explicit token-store dataflow PE

[Figure: incoming tokens drive an effective-address calculation into the frame memory; the presence bits decide whether a token waits or fires, fired instructions are fetched and executed in function units 1..N, and the form-token unit builds result tokens for this and other PEs]
Figure 16.16: The scale of von Neumann/dataflow architectures

[Figure: a spectrum from pure dataflow to pure von Neumann]
• Dataflow
• Macro dataflow
• Decoupled hybrid dataflow
• RISC-like hybrid
• von Neumann
Figure 16.17: Structure of a typical macro dataflow PE

[Figure: a matching unit feeds a fetch unit backed by instruction and frame memory; an internal control pipeline (program counter-based sequential execution) runs each block in the function unit, and a form-token unit exchanges tokens with the token queue and other PEs]
Figure 16.18: Organization of a PE in the MIT Hybrid Machine

[Figure: a PC and FBR drive instruction fetch from the instruction memory; the decode unit pulls enabled continuations from the continuation queue (token queue), operand fetch reads the frame memory, and the execution unit works on the registers and global memory]
Figure 16.19: Comparison of (a) SQ and (b) SCB macro nodes

[Figure: the same instructions l1–l6 with inputs a, b, c, grouped into macro nodes either as SQ1/SQ2 (a) or as SCB1/SCB2 (b)]
 David Abramson, 2004 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997
Figures 16.20 structure of the USC
Decoupled Architecture
To/from network (Graph virtual space)

Cluster graph memory

GC GC

DFGE DFGE

RQ AQ RQ AQ
Cluster 0

CE CE

CC CC

Cluster graph memory

To/from network (Computation virtual space)


43

 David Abramson, 2004 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997
Figures 16.21 structure of a node in
the SAM

fire
APU

Main done
SEU ASU
memory

LEU

To/from network

44

 David Abramson, 2004 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997
Figures 16.22 structure of the P-
RISC processing element

Local memory Internal control pipeline


(conventional RISC-
Instruction Instruction fetch processor)

Operand fetch Load/Store

Token queue
Frame
Messages to/from other
memory Func. unit
PE’s memory

Operand store Start

45

 David Abramson, 2004 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997
Figures 16.23 transformation of dataflow graphs into control
flow graphs (a) dataflow graph (b) control flow graph

join

+ +
fork L1

* - join L1: join

* -

46

 David Abramson, 2004 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997
Figures 16.24 structure of *T node

Network
From interface Message
network formatter To
network
Message
queues

Synchronization
Data processor
coprocessor
sIP dIP
Remote memory
request sFP dFP
sV1 Continuation dV1
coprocessor sV2 dV2
queue
<IP,FP>

Local memory

47

 David Abramson, 2004 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997
