
EE Department

Technion, Haifa, Israel

The Power of Priority: NoC based Distributed Cache Coherency
Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny

QNoC Research Group


Technion

1 E. Bolotin – The Power of Priority, NoCs 2007


Chip Multi-Processor (CMP)
[Figure: CMP with eight processors (P0 to P7) placed around a distributed L2 cache of 64 banks (0 to 63)]

• Dual-Core: monolithic shared cache
• Multi-Core: large cache, shared cache, distributed cache
• NoC-based: How?
Future Cache - Physics Perspective
• Large cache → large access time
• Global wire delay
• Distance reached in a single cycle
  - Today: ~25% of chip
  - In 10 years: ~1% of chip

[Figure: Global wire delay vs. gate delay across process generations 250, 180, 130, 90, 65, 45, 32 (log scale, 0.1 to 100). Source: ITRS 2003]
[Figure: Fraction of chip reachable in 1 clock cycle. Source: Keckler et al. ISSCC 2003]

Large monolithic cache is not scalable


NUCA - Non Uniform Cache Architecture

Banked cache over NoC

  - Smaller bank → smaller access time
  - Multiple banks → multiple ports
  - Closer bank → smaller access time

NUCA = Non-uniform access times

Cache-line placement policy


• Static NUCA (SNUCA)
• Dynamic NUCA (DNUCA)

Sources:
Kim et al. ASPLOS 2002
Beckmann et al. MICRO 2004
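
To make the idea of non-uniform access times concrete, here is a minimal sketch of static placement (in the spirit of SNUCA) on an 8x8 grid of banks, matching the 0 to 63 bank numbering in the CMP figure. The interleaving function, the line size, the processor coordinates, and the per-hop and per-bank cycle counts are all assumptions chosen for illustration; none of them come from the slides.

```python
GRID = 8          # 8x8 banks, numbered 0..63 row by row (assumed layout)
LINE_BYTES = 64   # assumed cache-line size
HOP_CYCLES = 1    # assumed cost of one router hop
BANK_CYCLES = 3   # assumed access time of one small bank


def snuca_bank(addr):
    """Static placement: low-order line-address bits select the bank."""
    return (addr // LINE_BYTES) % (GRID * GRID)


def bank_xy(bank):
    """Grid coordinates (column, row) of a bank number."""
    return bank % GRID, bank // GRID


def access_cycles(cpu_xy, addr):
    """Round-trip estimate: hops to the bank and back, plus the bank access."""
    bx, by = bank_xy(snuca_bank(addr))
    hops = abs(cpu_xy[0] - bx) + abs(cpu_xy[1] - by)
    return 2 * hops * HOP_CYCLES + BANK_CYCLES
```

Under these assumed numbers, a processor attached at the corner (0, 0) reaches bank 0 in 3 cycles and bank 63 in 31 cycles, which is the "closer bank, smaller access time" effect.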
Issues in NUCA-based CMP
• NoC performance → CMP performance
• Cache coherency and transaction order (correctness)
• Search (in DNUCA)
• Different traffic types (e.g. fetch vs. prefetch)
• Synchronization (locks)

NoC Services for CMP?



Cache Coherency over NoC
How do we maintain coherency over NoC?
• Snooping
• Distributed directory
• Central directory

[Figure: Cache bank with distributed directory. Each cache line in the bank is paired with a directory entry labeled "status vec. D".]
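
A directory entry of this kind could be sketched as follows. This assumes the entry holds a coherence status plus a bit vector of sharers, one bit per processor; the "D" field in the figure is not explained on the slide and is not modeled. The class, its method names, and the three-state set are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

NUM_CPUS = 8  # P0..P7 in the figure


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


@dataclass
class DirectoryEntry:
    state: State = State.INVALID
    sharers: int = 0  # bit i set means Pi holds a copy of the line

    def add_sharer(self, cpu):
        """Record that `cpu` obtained a shared copy."""
        self.sharers |= 1 << cpu
        self.state = State.SHARED

    def set_owner(self, cpu):
        """Record that `cpu` is now the only holder, in modified state."""
        self.sharers = 1 << cpu
        self.state = State.MODIFIED

    def holders(self):
        """List the processors whose bit is set in the sharer vector."""
        return [i for i in range(NUM_CPUS) if (self.sharers >> i) & 1]
```

Keeping one such entry next to each cache line in its home bank is what lets the directory be distributed: a request only has to reach the bank that owns the address.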



Distributed Cache Coherency

Cache access → Multiple NoC transactions

Example: Simple read transaction

1. READ REQ (ctrl. packet): P0 L1 → NoC → L2 directory
2. READ RESP (data packet, data transfer): L2 directory → NoC → P0 L1

Directory state: P0-Shared



Read Transaction of Modified Block

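[Figure, top: how the block became modified]
1. READ EXCL. REQ: P2 → directory
2. READ RESP (data transfer): directory → P2
Directory state: P2-MOD.

[Figure, bottom: P0 reads the modified block]
3. READ REQ: P0 → directory
4. WR BACK REQ: directory → P2
5. WR BACK RESP (data transfer): P2 → directory
6. READ RESP (data transfer): directory → P0
Directory state: P0-SHARED

Legend: Ctrl. packet, Data packet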



Read Exclusive of Shared Block

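[Figure, top: the block is shared]
1. READ REQ: P1 → directory and P2 → directory
Directory state: P1-Shared, P2-Shared

[Figure, bottom: P0 requests the block exclusively]
3. READ EXCL. REQ: P0 → directory
4. INVALID. REQ: directory → P1 and P2
5. INVALID. ACK: sent by P1 and by P2
6. READ EXCL. RESP (data transfer): directory → P0
Directory state: P0-MOD.

Legend: Ctrl. packet, Data packet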
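
The read-exclusive flow can be sketched as a function that lists the NoC messages one request generates. This is an illustrative model only: the function name, the message tuples, and the "DIR" label are hypothetical, and two points are assumptions rather than facts from the slides, namely that invalidation acks return to the directory and that acks travel as short ctrl. packets.

```python
def read_exclusive(sharers, requester):
    """Return (messages, new_holders) for a read-exclusive request.

    sharers: set of CPU ids currently holding the line in Shared state.
    Each message is (step, kind, src, dst, packet_class).
    """
    msgs = [("3", "READ_EXCL_REQ", requester, "DIR", "ctrl")]
    targets = sorted(sharers - {requester})
    for cpu in targets:
        msgs.append(("4", "INVALID_REQ", "DIR", cpu, "ctrl"))
    for cpu in targets:
        # Assumption: acks go back to the directory as ctrl. packets.
        msgs.append(("5", "INVALID_ACK", cpu, "DIR", "ctrl"))
    # The only long packet in the whole exchange carries the data.
    msgs.append(("6", "READ_EXCL_RESP", "DIR", requester, "data"))
    return msgs, {requester}
```

With sharers P1 and P2 and requester P0, the function returns six messages, five of them ctrl. and one data, which illustrates why a single cache access turns into multiple NoC transactions.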



Basic NoC to Support CMP
Off-the-shelf (Vanilla) NoC:
• Grid of wormhole routers
• Unicast only
• Ordering in network
  - Static routing
  - No virtual channels
• Smart interfaces

[Figure: Vanilla NoC carrying the read-exclusive transaction (3. READ EXCL. REQ, 4. INVALID. REQ, 5. INVALID. ACK, 6. READ EXCL. RESP with data transfer); directory state P0-MOD.]

Can We Do Better?



Observations: L2 Access
A) Delay = Queueing + NoC transactions
B) All NoC transactions are equally important
C) NoC transactions consist of:
• Short ctrl. packets
• Long data packets

Idea: Differentiate between Ctrl. and Data

Solution: Preemptive Priority NoC
  - Give priority to short ctrl. packets



Preemptive Priority NoC: QNoC
Service Levels (SL):
• Dedicated wormhole buffer
• Preemptive priority scheduling

[Figure: QNoC multiple-SL router. Input ports with one buffer of size BufSize per service level (SL 0 to SL 3), a cross-bar, and output ports with a scheduler and credit-based control.]
[Figure: Multiple-SL link. SL 0 to SL 3 of an output port share one physical link to the SL 0 to SL 3 buffers of the next input port; credits flow back.]
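
A rough sketch of how such an output port might pick the next flit: each service level has its own queue and its own credit counter, and the scheduler re-evaluates every cycle, so a newly arrived higher-priority flit goes ahead of a lower-priority worm that is already in progress. The class and method names are hypothetical, and treating SL 0 as the highest priority is an assumption.

```python
from collections import deque

NUM_SL = 4  # SL 0 .. SL 3 as in the figure; SL 0 assumed highest priority


class OutputPort:
    def __init__(self, credits_per_sl):
        self.queues = [deque() for _ in range(NUM_SL)]
        # One credit per free buffer slot at the downstream input port.
        self.credits = [credits_per_sl] * NUM_SL

    def enqueue(self, sl, flit):
        self.queues[sl].append(flit)

    def send_one(self):
        """Pick the flit to send this cycle, or None if nothing can go."""
        for sl in range(NUM_SL):
            if self.queues[sl] and self.credits[sl] > 0:
                self.credits[sl] -= 1
                return sl, self.queues[sl].popleft()
        return None

    def credit_return(self, sl):
        """Downstream freed one buffer slot of this service level."""
        self.credits[sl] += 1
```

Because buffers and credits are kept per service level, a blocked low-priority worm cannot occupy the buffer space that a ctrl. packet needs, and a service level that has run out of credits does not stall the levels below it.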



Example: Vanilla NoC
Transaction 1: Long Data
Transaction 2: Short Req., then Long Resp.
[Figure: both transactions between nodes A and B]

Without contention:
X: Delay of long packet
δ: Delay of short packet

Vanilla NoC example
Blue delay ~ X
Red delay ~ 2X+δ
Average delay ~ 1.5X



Example: Priority NoC
Transaction 1: Long Data
Transaction 2: Short Req., then Long Resp.
[Figure: both transactions between nodes A and B]

Without contention:
X: Delay of long packet
δ: Delay of short packet

Vanilla NoC example
Blue delay = X
Red delay = 2X+δ
Average delay ~ 1.5X

Priority NoC example
Blue delay = X+δ
Red delay = X+δ
Average delay ~ X

Potential delay reduction ~ 0.5X
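
The arithmetic above can be reproduced with a small cycle-by-cycle model of one shared link. It assumes that blue is Transaction 1 (the long data packet) and red is Transaction 2 (short request, then long response), that both packets are ready at cycle 0 with the long data packet queued first, and that the response returns without contention. The `Packet` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    name: str
    arrival: int    # cycle at which the packet is ready at the link
    length: int     # cycles needed to cross the link (must be > 0)
    is_ctrl: bool   # True for short control packets


def serve_link(packets, priority):
    """Serve one link a cycle at a time; return {name: finish_cycle}."""
    remaining = {p.name: p.length for p in packets}
    finish = {}
    t = 0
    while len(finish) < len(packets):
        ready = [(i, p) for i, p in enumerate(packets)
                 if p.arrival <= t and remaining[p.name] > 0]
        if ready:
            if priority:
                # Control packets first, then arrival order.
                key = lambda ip: (not ip[1].is_ctrl, ip[1].arrival, ip[0])
            else:
                # Vanilla: strict arrival order, no overtaking.
                key = lambda ip: (ip[1].arrival, ip[0])
            _, chosen = min(ready, key=key)
            remaining[chosen.name] -= 1
            if remaining[chosen.name] == 0:
                finish[chosen.name] = t + 1
        t += 1
    return finish


def example_delays(X, d, priority):
    """Return (blue, red) delays for the two-transaction example."""
    link = [Packet("data", 0, X, False), Packet("req", 0, d, True)]
    done = serve_link(link, priority)
    blue = done["data"]
    red = done["req"] + X  # assumed: response of length X is uncontended
    return blue, red
```

With X = 10 and δ = 1, `example_delays(10, 1, False)` gives (10, 21), that is X and 2X+δ, while `example_delays(10, 1, True)` gives (11, 11), that is X+δ for both. The average drops from 15.5 to 11 cycles, close to the 0.5X reduction stated above.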



Priority NoC: Different Destinations
Very important in wormhole
• When a ctrl. packet is blocked by other worms

[Figure: a Short Req. blocked behind a Long Data worm heading to a different destination]



Protocol Correctness
Need state-preserving serialization of transactions in the processor interface

1. Read Req.: P0 → directory
2. Read Resp.: directory → P0
3. Read Excl. Req.: P1 → directory
4. Invalidation Req.: directory → P0

Legend: High Priority (ctrl.), Low Priority (data)
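
One way to read this requirement: with priorities, the high-priority Invalidation Req. (step 4) can reach P0 before the low-priority Read Resp. (step 2) that the directory sent earlier. The sketch below shows a processor interface that defers such an invalidation until the pending data arrives, so the two events take effect in protocol order. This is an assumed interpretation, not the mechanism from the slides; the class, its fields, and the message names are hypothetical.

```python
class ProcessorInterface:
    def __init__(self):
        self.pending_reads = set()    # lines with an outstanding Read Req.
        self.deferred_inval = set()   # invalidations that overtook their data
        self.l1 = {}                  # line -> "S" (only Shared is modeled)
        self.sent = []                # messages sent back into the NoC

    def read(self, line):
        self.pending_reads.add(line)
        self.sent.append(("READ_REQ", line))

    def on_invalidation(self, line):
        if line in self.pending_reads:
            # The ctrl. packet passed the data packet in the network:
            # remember it and apply it after the data has arrived.
            self.deferred_inval.add(line)
            return
        self.l1.pop(line, None)
        self.sent.append(("INVALID_ACK", line))

    def on_read_response(self, line):
        self.pending_reads.discard(line)
        self.l1[line] = "S"
        if line in self.deferred_inval:
            self.deferred_inval.discard(line)
            self.on_invalidation(line)
```

Without the deferral, P0 would acknowledge the invalidation while holding nothing and then install the late data as Shared, leaving a stale copy that the directory no longer tracks.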



Numerical Evaluation
• CMP simulator (SIMICS)
  - Simulate parallel benchmarks
  - Obtain L2-cache access traces
• QNoC simulator (OPNET)
  - Simulate distributed coherence protocol over NoC
  - Measure total RD/RX L2-access delay
  - Measure total program throughput



Priority NoC: Results
Delay Reduction vs. Network Load
[Figure: Av. Delay of L2-Read in Apache. Delay [cycles] vs. Link Capacity [Gbps]]

Link Capacity [Gbps]   Vanilla NoC   Priority-based NoC
1                      1301          994
4                      286           234
16                     62            57

[Figure: Av. Delay Reduction of L2-Transaction in Apache. Delay Reduction [%] (0 to 30) for Read and Read Exclusive vs. Link Capacity [Gbps] (1, 4, 16)]

• Short ctrl. packet gets high priority


• Long data packet gets low priority



Priority NoC: Several Benchmarks

[Figure: L2 Access Delay Reduction by Priority-based NoC. Delay Reduction [%] for Read and Read Exclusive on apache, zeus, fft, ocean, radix; labeled bars range from 13.5% to 32.9%.]
[Figure: Total Program Speedup by Priority-based NoC. Speedup [%] on apache, zeus, fft, ocean, radix; labeled bars range from 5.0% to 9.4%.]



So Far: The Power of Priority
• Simplicity - Almost for Free
• Significant CMP Speed-up

Good For:
• Coherency
• Traffic differentiation (e.g. Fetch vs. Pre-Fetch)
• Search in DNUCA
• Synchronization (Locks)



Advanced Support Functions
• Special Broadcast for Short Messages
  - Broadcast service (e.g. search in DNUCA)
  - Wormhole broadcast is slow and expensive
  - Store-and-forward (S&F) broadcast embedded in wormhole
[Figure: broadcast from source S, with replicating and forwarding nodes]

• Virtual Ring
  - No Additional Cost
  - For Invalidation Multicast
  - Snooping or synchronization



Summary
NoC at CMP Service!
• Shared cache over NoC
• Priority is powerful
• Built-in support functions

