Dynamic Fine-Grain Leakage Reduction Using Leakage-Biased Bitlines

ISCA 2002
Seongmoo Heo, Kenneth Barr, Mark Hampton, and Krste Asanović
Computer Architecture Group, MIT LCS
Leakage Power
Growing impact of leakage power
Increase of leakage power due to scaling of transistor lengths and threshold voltages
Power budget limits use of fast leaky transistors
Challenge:
How to maintain performance scaling in face of increasing leakage power?
Leakage Reduction Techniques

Static: Design-time Selection of Slow Transistors (SSST) for non-critical paths
Replace fast transistors with slow ones on non-critical paths
Tradeoff between delay and leakage power
Dynamic: Run-time Deactivation of Fast Transistors (DDFT) for critical paths

DDFT switches critical path transistors between inactive and active modes
Observation:
Critical paths dominate leakage after applying SSST techniques
Example: PowerPC 750
5% of transistor width is low Vt, but these transistors account for >50% of total leakage.
DDFT could give large leakage savings
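The PowerPC 750 figures above imply a large gap in leakage per unit of transistor width. The short sketch below works through that arithmetic; the function name and the assumption that all remaining width is uniformly "slow" are illustrative and not taken from the slides.

```python
def per_width_leakage_ratio(width_fraction, leakage_fraction):
    """Leakage per unit width of the low-Vt transistors, relative to all others.

    width_fraction:   share of total transistor width that is low Vt
    leakage_fraction: share of total leakage drawn by those transistors
    """
    low_vt_density = leakage_fraction / width_fraction
    other_density = (1.0 - leakage_fraction) / (1.0 - width_fraction)
    return low_vt_density / other_density


# Slide figures: 5% of width, at least 50% of leakage.
ratio = per_width_leakage_ratio(width_fraction=0.05, leakage_fraction=0.50)
print(f"low-Vt transistors leak about {ratio:.0f}x more per unit width")
```

Because the slide states ">50%", the ratio of roughly 19 computed here is a lower bound under these assumptions.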
Existing DDFT Circuit Techniques

Body Biasing
Vt increase by reverse-biased body effect
Large transition time and wakeup latency due to well capacitance and resistance
[Figure: transistor with gate, drain, source, and body terminals, with Vbody > Vdd]
Power Gating
Sleep transistor between supply and virtual supply lines
Increased delay due to sleep transistor
[Figure: sleep transistor, driven by a sleep signal, between Vdd and the virtual Vdd line feeding the logic cells]
Sleep Vector
Input vector which minimizes leakage
Increased delay due to mux and active energy due to spurious toggles after applying sleep vector
Fine-Grain DDFT Techniques

Have to turn off small pieces of an active processor for short periods of time
Difficult to turn off large pieces for long periods
Fine-grain DDFT techniques
Requirements of Fine-grain DDFT techniques

Circuits with low active delay penalty, low energy moving in and out of sleep, and fast wakeup time
Micro-architectural scheduling to keep the sleep time as long and often as possible
Compare to coarse-grain DDFT techniques

O.S. puts whole processor to sleep for a long time
Doesn't save power when running code
Low steady-state leakage is the only concern.
Highlights of This Work

We introduce metrics for comparing fine-grain dynamic deactivation techniques
Steady-state leakage, Transition time, Fixed transition energy, Breakeven time
We present a new circuit-level leakage reduction technique, Leakage-Biased Bitlines (LBB)

Low deactivation energy and fast wakeup
We save leakage power of I-Cache and Multiported regfile by LBB

I-cache: idle subbank deactivation
Multiported regfile: idle read ports and dead register deactivation
Outline
1. Methodology and DDFT Metrics
2. Cache Leakage Saving
Idle subbank deactivation
3. Multiported Regfile Leakage Saving
Dead reg deactivation (Horizontal)
Idle read port deactivation (Vertical)
4. Conclusion
Methodology
Process Technology
180nm DVT process modeled after 0.18um TSMC LVT and MVT processes
Scaled to 130, 100, and 70nm processes based on SIA roadmap
Optimistic/pessimistic leakage prediction: 2x/4x increase of leakage current density (nA/um)
Evaluation with SimpleScalar

Modified to model unified physical register file
4 issue, 100 integer physical regs, 16KB/4-Way/32B block I-Cache and D-Cache, Unified L-2 Cache
SPECint95 refs
Energy measurements
Hspice simulation for 180nm process and scaled to other processes accordingly
Metrics for Fine-Grain DDFT Techniques

[Figure: leakage current and cumulative leakage energy versus time, for the original circuit and with DDFT applied. Annotations: transition time, wakeup latency, steady-state sleep leakage, fixed transition energy, break-even time, length of sleep, active delay and power]
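The break-even time metric can be read as the sleep length at which the leakage energy saved equals the fixed transition energy. A minimal sketch of that relationship follows, assuming constant leakage power in both the active and sleep states; the function and parameter names are hypothetical, and the numbers in the usage lines are placeholders rather than measured values.

```python
def break_even_time(p_original, p_sleep, e_transition):
    """Sleep length at which saved leakage energy equals the transition energy.

    p_original:   leakage power without the technique (W)
    p_sleep:      steady-state leakage power while asleep (W)
    e_transition: fixed energy to enter and leave sleep (J)
    """
    saved_power = p_original - p_sleep
    if saved_power <= 0:
        return float("inf")  # sleeping never pays off
    return e_transition / saved_power


def net_energy_saved(p_original, p_sleep, e_transition, sleep_time):
    """Energy saved by sleeping for sleep_time seconds (negative means a loss)."""
    return (p_original - p_sleep) * sleep_time - e_transition


# Placeholder values, for illustration only.
t_be = break_even_time(p_original=200e-6, p_sleep=2e-6, e_transition=1e-12)
print(f"break-even after {t_be * 1e9:.2f} ns")
```

Sleeping for less than the break-even time costs energy overall, which is why fine-grain techniques need both low transition energy and scheduling that lengthens sleep periods.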
L1 Cache and Multiported Regfile

Good targets for Fine-grain DDFT techniques
Timing-critical
Contrast: L2 cache is a better target for SSST (long channel or HVT transistors)
Large leakage current
Cache: Large number of fast transistors
Multiported Regfile: Ever increasing number of registers and ports
Alpha 21464 register file is 5x larger than 64KB data cache
LBB for Caches

Modern cache structure: Hierarchical Bitlines
To save active power
To reduce delay
To reduce bitline noise
[Figure: subbank whose local bitlines connect through a local-global switch to the global bitlines and senseamp]
Local bitlines (32-bit cells) are disconnected from the senseamp by the local-global switch.
LBB for Caches: If a subbank is not in use, turn off precharge transistors and delay precharging.
Cache: Dual Vt SRAM cell

[Figure: dual-Vt SRAM cell with wordline WL, local bitlines BIT and BIT_BAR, and global bitlines GLOBAL BIT and GLOBAL BIT_BAR. HVT transistors are green-colored.]
Bitline leakage depends on the stored value
[Figure: the same cell storing 0, with the targeted leakage path labeled "Our Target", shown with the bitline forced to 0, forced to 1, and "Forcing ?"]
Leakage-Biased Bitlines (LBB)

[Figure: floating bitlines in three cases: discharge to 0, stay at 1, and discharge to an intermediate value between 0 and 1]
LBB lets bitlines float by turning off the local HVT NMOS precharge transistors
No static current draw because the local bitline is isolated
LBB uses leakage itself to bias bitlines to the voltage which minimizes leakage!
A good fine-grain dynamic technique

Minimal transition energy: Same number of precharges (delayed precharge)
Minimal transition time: Wakeup latency is only that of precharge phase
LBB versus Sleep Vector

LBB finds the minimal leakage state.
Always better than sleep vectors
[Figure: leakage power (uW) of a 32x16B SRAM subbank versus zero percentage (%), for Original, Sleep Vector 1, Sleep Vector 0, and LBB]
Cumulative Leakage Energy

32-row x 32B SRAM subbank (optimistic leakage current used, 75% zero assumed)
[Figure: cumulative leakage energy (pJ) versus length of sleep (cycles, 0 to 500) at 180nm and at 70nm, for Original and LBB]
Dynamic energy cost: Need to replace the lost charge

- LBB curve increases fast in the beginning
Decrease of breakeven time

- 180nm: 200 cycles, 70nm: less than a cycle
- Active energy scales down faster than leakage energy
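The shape of the LBB curve (a fast initial rise that then flattens) can be illustrated with a toy model. The sketch below assumes the charge to be replaced at wakeup grows exponentially toward a maximum as the bitline discharges, plus a small constant sleep leakage. The exponential form, the parameter names, and the numbers are all assumptions made for illustration and are not taken from the slides or from the Hspice results.

```python
import math


def lbb_energy(t_cycles, e_recharge_max, tau_cycles, p_sleep):
    """Toy cumulative LBB energy: recharge cost that saturates, plus sleep leakage."""
    recharge = e_recharge_max * (1.0 - math.exp(-t_cycles / tau_cycles))
    return recharge + p_sleep * t_cycles


def original_energy(t_cycles, p_orig):
    """Cumulative leakage energy with precharge left on (linear in time)."""
    return p_orig * t_cycles


def break_even_cycles(p_orig, p_sleep, e_recharge_max, tau_cycles, max_cycles=100000):
    """First whole cycle at which sleeping costs no more than staying precharged."""
    for t in range(1, max_cycles + 1):
        if original_energy(t, p_orig) >= lbb_energy(t, e_recharge_max, tau_cycles, p_sleep):
            return t
    return None  # never breaks even within max_cycles


# Illustrative units: energy in pJ, power in pJ per cycle.
print(break_even_cycles(p_orig=0.1, p_sleep=0.01, e_recharge_max=10.0, tau_cycles=50.0))
```

In this model, shrinking `e_recharge_max` relative to `p_orig` pulls the break-even point toward zero, which is the qualitative trend the slide reports when moving from 180nm to 70nm.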
Performance Issues for LBB Caches

Subbank must be precharged before use
Case 1 (best): subbank decode and precharge happen before more complex word-line decode, therefore no penalty.
Case 2 (worst): add additional pipeline stage for precharge
One cycle increase in branch misprediction penalty
Focus on I-Cache because any latency increase can be partly hidden by branch prediction
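A first-order estimate of the Case 2 cost can be sketched by adding the extra misprediction cycle to the base CPI. This ignores any overlap an out-of-order core might provide, and the baseline IPC and misprediction rate used below are hypothetical inputs, not values from the evaluation.

```python
def ipc_with_extra_stage(base_ipc, mispredicts_per_instr, extra_penalty_cycles=1):
    """IPC after lengthening the branch misprediction penalty.

    Assumes every misprediction pays the full extra penalty (no overlap).
    """
    base_cpi = 1.0 / base_ipc
    new_cpi = base_cpi + mispredicts_per_instr * extra_penalty_cycles
    return 1.0 / new_cpi


# Hypothetical workload: IPC of 2.0 and one misprediction per 100 instructions.
base = 2.0
new = ipc_with_extra_stage(base, mispredicts_per_instr=0.01)
print(f"IPC drops by {(1 - new / base) * 100:.2f}%")
```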
I-Cache Subbank Deactivation

[Figure: leakage energy saving and total energy saving (%) at the 70nm process for perl, gcc, m88ksim, compress, ijpeg, vortex, go, li, and the average, under pessimistic and optimistic leakage predictions]
[Figure: leakage energy saving and total energy saving (%) across processes (180nm, 130nm, 100nm, 70nm)]
Case 2 (worst) assumption (adding additional pipeline stage)
2.5% IPC decrease on average
Multiported Regfile Cell

8R, 4W unbalanced DVT reg cell
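[Figure: register cell with read wordlines RWL[0:7] and read bitlines READ[0:7] (x8), and write wordlines WWL[0:3] with write bitlines WRITE[0:3] and WRITEB[0:3] (x4). HVT transistors are green-colored.]
Simplified but active/leakage power-aware baseline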
LBB for Multiported Regfiles

LBB for Multiported Regfiles: Turn off the precharge transistor on idle subbank read ports
Leakage current discharges bitlines to 0 if any bits are holding 1.
Dead Register Deactivation

Horizontal technique
Dead registers = Registers in free list
If all registers in a subbank are dead, all read ports in the subbank are turned off by LBB
No performance penalty since there is ample time to re-precharge between allocation and write.
[Figure: Subbank 1 with Readport 0, Readport 1, and Readport 2]
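The rule above (a subbank can be deactivated only when every register in it is on the free list) can be sketched as a simple check. The contiguous mapping of register numbers to subbanks and the function name are assumptions for illustration.

```python
def deactivatable_subbanks(free_list, num_regs, regs_per_bank):
    """Return indices of subbanks whose registers are all dead (on the free list).

    Assumes registers are numbered 0..num_regs-1 and mapped to subbanks in
    contiguous groups of regs_per_bank; the last subbank may be partial.
    """
    free = set(free_list)
    banks = []
    for bank, start in enumerate(range(0, num_regs, regs_per_bank)):
        end = min(start + regs_per_bank, num_regs)
        if all(reg in free for reg in range(start, end)):
            banks.append(bank)
    return banks


# Eight registers in two subbanks of four: only subbank 0 is fully dead.
print(deactivatable_subbanks(free_list=[0, 1, 2, 3, 5], num_regs=8, regs_per_bank=4))
```

A single live register keeps its whole subbank precharged, which is why the allocation policy matters for how many subbanks can actually sleep.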
NMOS Sleep Transistor (NST)

Alternative horizontal DDFT
To turn off dead registers using NMOS sleep transistors (NST)
Advantage: registers can be turned off individually
Disadvantage: increased read access time
Set delay penalty to 5% (tradeoff between delay and leakage)
[Figure: Register 1 with an NMOS sleep transistor]
Idle Readport Deactivation

Vertical technique
Idle read ports when fewer than max # of instructions are issued in a superscalar machine
Idle read ports deactivated by LBB
No performance penalty since it is known whether a read port is needed before it is known which register will be accessed in the pipeline.
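How much read-port time is idle depends on the issue pattern. The sketch below counts idle port-cycles for a trace of instructions issued per cycle, assuming the 8-read-port file from the regfile cell slide and a worst case of two source operands per instruction. The two-operand assumption and the example trace are hypothetical.

```python
def idle_port_fraction(issued_per_cycle, total_ports=8, ports_per_instr=2):
    """Fraction of read-port cycles that are idle for a non-empty issue trace.

    Assumes each issued instruction reserves ports_per_instr read ports.
    """
    idle = 0
    for issued in issued_per_cycle:
        needed = min(issued * ports_per_instr, total_ports)
        idle += total_ports - needed
    return idle / (total_ports * len(issued_per_cycle))


# Hypothetical four-cycle trace on a 4-issue machine.
print(idle_port_fraction([4, 2, 0, 1]))
```

Instructions with fewer than two register sources would leave additional ports idle beyond what this estimate counts.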
Comparison of DDFTs
32 x 32-b Regfile subbank
(75% zero assumed. Optimistic leakage current used.)
Process Tech. (nm)   Original (uW)   SV steady-state (uW)   LBB steady-state (uW)   NST steady-state (uW)
180                  177.9           2.0                    2.0                     1.8
130                  214.1           2.4                    2.4                     2.2
100                  263.6           3.0                    3.0                     2.7
70                   276.7           3.1                    3.1                     2.9
[Figure: cumulative energy (pJ) versus length of sleep (cycles, 0 to 1000) at 180nm and at 70nm, for Original, Sleep Vector, NMOS Sleep Transistor, and Leakage-Biased Bitlines]
Comparison of DDFTs Blowup: 70nm
[Figure: energy (pJ, 0 to 8) versus length of sleep (cycles, 0 to 50) at 70nm, for Original, Sleep Vector, NMOS Sleep Transistor, and Leakage-Biased Bitlines]
Dead Register/Subbank Deactivation Policies

Free list policies for NST (NMOS Sleep Transistor): queue and stack
queue: conventional
stack: keeps some regs dead for longer
2.4-10% greater savings than queue at 70nm
Benefit increases as feature sizes shrink
Subbank allocation policy for LBB: stack
Allocate a new subbank only when the previous bank is empty of dead registers
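The difference between the two free-list policies can be seen in a toy simulation. The sketch below keeps a fixed number of registers live, frees the oldest one before each new allocation, and counts how many registers are never touched during the run. The workload, the freeing order, and the function name are hypothetical and far simpler than the SimpleScalar model used in the evaluation.

```python
from collections import deque


def untouched_registers(policy, num_regs, live_target, steps):
    """Count registers that stay dead for the whole run under a free-list policy.

    policy: "stack" reuses the most recently freed register;
            "queue" reuses the register that has been free the longest.
    """
    free = deque(range(num_regs))
    live = deque()  # live registers, oldest first
    ever_used = set()
    for _ in range(steps):
        if len(live) == live_target:
            free.append(live.popleft())  # retire the oldest live register
        reg = free.pop() if policy == "stack" else free.popleft()
        live.append(reg)
        ever_used.add(reg)
    return num_regs - len(ever_used)


for policy in ("queue", "stack"):
    print(policy, untouched_registers(policy, num_regs=8, live_target=2, steps=20))
```

In this toy workload the stack keeps reusing the same few registers and leaves the rest dead, while the queue rotates through every register, which matches the slide's point that a stack keeps some registers dead for longer.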
Dead Reg Deactivation (Horizontal)

[Figure: leakage energy savings and total energy savings (%) at the 70nm process per benchmark and on average, for NST Queue, NST Stack, LBB 16 regs/bank, and LBB 8 regs/bank. Colored: optimistic. White: pessimistic.]
[Figure: leakage energy savings and total energy savings (%) versus process (180, 130, 100, 70 nm) for the same four configurations]
NST stack better than NST queue, LBB stack better than either NST
Read Port Deactivation (Vertical)

[Figure: leakage energy saving and total energy saving (%) at the 70nm process per benchmark and on average, under pessimistic and optimistic predictions]
[Figure: leakage energy saving and total energy saving (%) across processes (180nm, 130nm, 100nm, 70nm), under pessimistic and optimistic predictions]
More energy saving for wider issue processors
Readport deactivation can be combined with dead subbank deactivation.
Conclusion
Most leakage power is in critical paths
Dynamic leakage reduction (DDFT) desired
LBB allows fine-grain dynamic leakage reduction with zero or minimal performance penalty.
0% performance penalty for multiported regfiles
Sleep time can be improved by changing micro-architectural scheduling policies.

Stack better than queue for free list policy
Follow-on work:
Leakage-biased domino logic to save leakage power in critical ALUs [ VLSI Symposium 2002 ]
Acknowledgments
Thanks to Christopher Batten, Ronny Krashinsky, Rajesh Kumar, and anonymous reviewers
Funded by DARPA PAC/C award F3060200-2-0562, NSF CAREER award CCR0093354, and a donation from Infineon Technologies.
DDFT Examples
Body Biasing
Steady-state leakage power: Less than 5% (depends on Vbody)
Transition time, wakeup latency: 0.1~100us
Transition energy, breakeven time: Well cap switching energy
Delay impact: No

Power Gating
Steady-state leakage power: Less than 5% (depends on sleep transistor)
Transition time, wakeup latency: Less than a cycle
Transition energy, breakeven time: Sleep transistor gate cap switching energy
Delay impact: Yes. Due to sleep transistor
Etc: Area for sleep transistor and virtual supplies

Sleep Vector
Steady-state leakage power: Less than 50% (depends on the circuit)
Transition time, wakeup latency: Less than a cycle
Transition energy, breakeven time: Active energy consumed due to spurious toggling after sleep vector
Delay impact: Yes. Due to mux
Etc: Finding sleep vector is hard
