
ISCA 2002

Dynamic Fine-Grain Leakage Reduction Using Leakage-Biased Bitlines


Seongmoo Heo, Kenneth Barr, Mark Hampton, and Krste Asanović
Computer Architecture Group, MIT LCS

Leakage Power
Growing impact of leakage power
Increase of leakage power due to scaling of transistor lengths and threshold voltages
Power budget limits use of fast leaky transistors

Challenge:
How to maintain performance scaling in face of increasing leakage power?

Leakage Reduction Techniques


Static: Design-time Selection of Slow Transistors (SSST) for non-critical paths
Replace fast transistors with slow ones on non-critical paths
Tradeoff between delay and leakage power

Dynamic: Run-time Deactivation of Fast Transistors (DDFT) for critical paths


DDFT switches critical path transistors between inactive and active modes

Observation:
Critical paths dominate leakage after applying SSST techniques
Example: PowerPC 750
5% of transistor width is low Vt, but these account for >50% of total leakage.

DDFT could give large leakage savings

Existing DDFT Circuit Techniques


Body Biasing
Vt increase by reverse-biased body effect
Large transition time and wakeup latency due to well cap and resistance
[Figure: transistor with Gate, Drain, Source, and Body terminals; Vbody > Vdd]

Power Gating
Sleep transistor between supply and virtual supply lines
Increased delay due to sleep transistor
[Figure: Sleep signal, Vdd, Virtual Vdd, and logic cells]

Sleep Vector
Input vector which minimizes leakage
Increased delay due to mux and active energy due to spurious toggles after applying sleep vector
[Figure: sleep vector input, labeled 0]

Fine-Grain DDFT Techniques


Have to turn off small pieces of an active processor for short periods of time
Difficult to turn off large pieces for long periods
Hence the need for fine-grain DDFT techniques

Requirements of Fine-grain DDFT techniques


Circuits with low active delay penalty, low energy moving in and out of sleep, and fast wakeup time
Micro-architectural scheduling to keep the sleep time as long and often as possible

Compare to coarse-grain DDFT techniques


O.S. puts whole processor to sleep for a long time
Doesn't save power when running code
Low steady-state leakage is the only concern.

Highlights of This Work


We introduce metrics for comparing fine-grain dynamic deactivation techniques
Steady-state leakage, Transition time, Fixed transition energy, Breakeven time

We present a new circuit-level leakage reduction technique, Leakage-Biased Bitlines (LBB)


Low deactivation energy and fast wakeup

We save leakage power of I-Cache and Multiported regfile by LBB


I-cache: idle subbank deactivation
Multiported regfile: idle read ports and dead register deactivation

Outline
1. Methodology and DDFT Metrics
2. Cache Leakage Saving
Idle subbank deactivation

3. Multiported Regfile Leakage Saving


Dead reg deactivation (Horizontal)
Idle read port deactivation (Vertical)

4. Conclusion

Methodology
Process Technology
180nm DVT process modeled after 0.18um TSMC LVT and MVT processes
Scaled to 130, 100, and 70nm processes based on SIA roadmap
Optimistic/pessimistic leakage prediction: 2x/4x increase of leakage current density (nA/um)

Evaluation with SimpleScalar


Modified to model unified physical register file
4 issue, 100 integer physical regs, 16KB/4-Way/32B block I-Cache and D-Cache, Unified L-2 Cache
SPECint95 refs

Energy measurements
Hspice simulation for 180nm process and scaled to other processes accordingly

Metrics for Fine-Grain DDFT Techniques


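[Figure: leakage current versus time (original leakage, DDFT applied, transition time, steady-state leakage, sleep time, wakeup latency) and leakage energy versus length of sleep (original leakage, DDFT leakage, fixed transition energy, break-even time)]
Metrics: steady-state leakage, transition time, wakeup latency, fixed transition energy, break-even time, active delay and power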
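The break-even metric can be sketched with a simple linear energy model. This is an illustration, not the authors' Hspice methodology: it assumes the circuit reaches steady-state sleep leakage immediately and pays one fixed transition energy per sleep period. The function names and all numbers are hypothetical.

```python
def energy_original_fj(p_orig_uw, sleep_ns):
    """Cumulative leakage energy with no deactivation (uW * ns = fJ)."""
    return p_orig_uw * sleep_ns


def energy_ddft_fj(p_sleep_uw, e_transition_fj, sleep_ns):
    """Cumulative energy with deactivation: fixed transition cost plus sleep leakage.

    Assumption: steady-state sleep leakage applies for the whole sleep period.
    """
    return e_transition_fj + p_sleep_uw * sleep_ns


def breakeven_time_ns(p_orig_uw, p_sleep_uw, e_transition_fj):
    """Sleep length at which the two cumulative energy curves cross."""
    if p_sleep_uw >= p_orig_uw:
        raise ValueError("sleep leakage must be lower than original leakage")
    return e_transition_fj / (p_orig_uw - p_sleep_uw)


# Hypothetical values, not taken from the slides.
t_be = breakeven_time_ns(p_orig_uw=100.0, p_sleep_uw=2.0, e_transition_fj=4900.0)
print(f"break-even sleep length: {t_be:.1f} ns")
```

Sleeping for less than the break-even time costs more energy than staying active, which is why low transition energy matters for fine-grain techniques.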

L1 Cache and Multiported Regfile


Good targets for Fine-grain DDFT techniques
Timing-critical
Contrast: L2 cache is a better target for SSST (long channel or HVT transistors)
Large leakage current
Cache: Large number of fast transistors
Multiported Regfile: Ever increasing number of registers and ports
Alpha 21464 register file is 5x larger than 64KB data cache

LBB for Caches


Modern cache structure: Hierarchical Bitlines
To save active power
To reduce delay
To reduce bitline noise
[Figure: subbank with local bitline, local-global switch, global bitline, and senseamp]

Local bitlines (32-bit cells) disconnected from senseamp by local-global switch.
LBB for Caches: If a subbank is not in use, turn off precharge transistors and delay precharging.
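A minimal sketch of how delayed precharge turns idle gaps into sleep periods, assuming one cache access per cycle and a hypothetical trace format (a list of subbank indices, one per cycle). Under this assumption, every gap between two accesses to the same subbank is a period in which its local bitlines float.

```python
def subbank_sleep_intervals(access_trace, num_subbanks):
    """Return, per subbank, the idle gap lengths (in cycles) between accesses.

    access_trace[c] is the subbank accessed in cycle c. With delayed
    precharge, the bitlines float during each gap and are precharged
    just before the next access to that subbank.
    """
    last_access = [None] * num_subbanks
    gaps = [[] for _ in range(num_subbanks)]
    for cycle, bank in enumerate(access_trace):
        prev = last_access[bank]
        if prev is not None and cycle - prev > 1:
            gaps[bank].append(cycle - prev - 1)  # whole idle cycles in between
        last_access[bank] = cycle
    return gaps


# Hypothetical trace over 4 subbanks.
print(subbank_sleep_intervals([0, 0, 1, 1, 1, 0, 2, 0], num_subbanks=4))
```

The resulting gap lengths are what a break-even comparison would be applied to; the precharge count is unchanged because each precharge is only delayed, not added.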

Cache: Dual Vt SRAM cell


[Figure: dual-Vt SRAM cell with GLOBAL BIT, GLOBAL BIT_BAR, BIT, BIT_BAR, and WL. HVT transistors: green-colored. Annotation: "Our Target".]

Bitline leakage depends on the stored value
[Figure: three bitline cases labeled "Forcing 0", "Forcing 1", and "Forcing ?"]

Leakage-Biased Bitlines (LBB)


[Figure: floating bitline cases labeled "Discharge to 0", "Stay at 1", and "Discharge to an intermediate value between 0 and 1"]

LBB lets bitlines float by turning off the local HVT NMOS precharge transistors
No static current draw because local bitline isolated
LBB uses leakage itself to bias bitlines to the voltage which minimizes leakage!

A good fine-grain dynamic technique


Minimal transition energy: Same number of precharges (delayed precharge)
Minimal transition time: Wakeup latency is only that of precharge phase

LBB versus Sleep Vector


LBB finds the minimal leakage state.
Always better than sleep vectors
[Figure: leakage power (uW) of a 32x16B SRAM subbank versus zero percentage (%), comparing Original, Sleep Vector 1, Sleep Vector 0, and LBB]

Cumulative Leakage Energy


32-row x 32B SRAM subbank
(optimistic leakage current used. 75% zero assumed)

[Figure: cumulative leakage energy (pJ) versus length of sleep (0 to 500 cycles) at 180nm and 70nm, comparing Original and LBB]

Dynamic energy cost: Need to replace the lost charge
- LBB curve increases fast in the beginning

Decrease of breakeven time
- 180nm: 200 cycles, 70nm: less than a cycle
- Active energy scales down faster than leakage energy

Performance Issues for LBB Caches


Subbank must be precharged before use
Case 1 (best): subbank decode and precharge happen before more complex word-line decode, therefore no penalty.
Case 2 (worst): add additional pipeline stage for precharge
One cycle increase in branch misprediction penalty

Focus on I-Cache because any latency increase can be partly hidden by branch prediction


I-Cache Subbank Deactivation


[Figure: leakage energy saving and total energy saving at the 70nm process, in percent per SPECint95 benchmark and average, under pessimistic and optimistic prediction]
[Figure: leakage energy saving and total energy saving across processes (180nm, 130nm, 100nm, 70nm), in percent]

Case 2 (worst) assumption (adding additional pipeline stage)
2.5% IPC decrease on average

Multiported Regfile Cell


8R, 4W unbalanced DVT reg cell
[Figure: register cell schematic with READ[0:7], WRITE[0:3], WRITEB[0:3], WWL[0:3], and RWL[0:7]. HVT transistors: green-colored.]
Simplified but active/leakage power-aware baseline

LBB for Multiported Regfiles


LBB for Multiported Regfiles: Turn off the precharge transistor on idle subbank read ports
Leakage current discharges bitlines to 0 if any bits are holding 1.

Dead Register Deactivation


Horizontal technique
Dead registers = Registers in free list
If all registers in a subbank are dead, all read ports in the subbank are turned off by LBB
No performance penalty since there is ample time to re-precharge between allocation and write.
[Figure: Subbank 1 with Readport 0, Readport 1, Readport 2]
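The "all registers in a subbank are dead" condition can be sketched as a check against the free list. This is an illustration only: the function name, the contiguous register-to-subbank mapping, and the sizes in the example are assumptions, and the register count is assumed to be a multiple of the subbank size.

```python
def dead_subbanks(free_regs, num_regs, regs_per_subbank):
    """Return indices of subbanks whose registers are all in the free list.

    Assumes registers are mapped to subbanks in contiguous groups and
    that num_regs is a multiple of regs_per_subbank.
    """
    free = set(free_regs)
    dead = []
    for bank in range(num_regs // regs_per_subbank):
        start = bank * regs_per_subbank
        if all(r in free for r in range(start, start + regs_per_subbank)):
            dead.append(bank)  # every read port of this subbank can float
    return dead


# Hypothetical 32-register file with 8 registers per subbank.
free_list = [3, 20] + list(range(8, 16))
print(dead_subbanks(free_list, num_regs=32, regs_per_subbank=8))
```

Only subbank 1 qualifies here, because subbanks 0 and 2 each have just one dead register.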

NMOS Sleep Transistor (NST)


Alternative horizontal DDFT
To turn off dead registers using NMOS sleep transistors (NST)
Advantage: registers can be turned off individually
Disadvantage: increased read access time
Set delay penalty to 5% (tradeoff between delay and leakage)
[Figure: Register 1 with Readport 0, Readport 1, Readport 2]

Idle Readport Deactivation


Vertical technique
Idle read ports when fewer than max # of instructions are issued in a superscalar machine
Idle read ports deactivated by LBB
No performance penalty since it is known whether a read port is needed before it is known which register will be accessed in the pipeline.
[Figure: Readport 0, Readport 1, Readport 2]
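A minimal sketch of picking idle read ports in a given cycle. The 8-port default matches the 8R cell described earlier; the in-order port assignment starting from port 0 and the function name are assumptions made for illustration.

```python
def idle_read_ports(operands_per_instr, total_ports=8):
    """Return the read ports that can be deactivated this cycle.

    operands_per_instr lists the source-register count of each issued
    instruction. Assumption: ports are handed out in order from port 0.
    """
    needed = sum(operands_per_instr)
    if needed > total_ports:
        raise ValueError("more operands than read ports")
    return list(range(needed, total_ports))


# Two instructions issued, reading 2 and 1 registers (hypothetical cycle).
print(idle_read_ports([2, 1]))
```

The port demand depends only on which instructions issue, so it is available before the register indices are decoded, consistent with the no-penalty argument above.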

Comparison of DDFTs
32 x 32-b Regfile subbank
(75% zero assumed. Optimistic leakage current used.)
Process Tech. (nm)   Original (uW)   SV steady-state (uW)   LBB steady-state (uW)   NST steady-state (uW)
180                  177.9           2.0                    2.0                     1.8
130                  214.1           2.4                    2.4                     2.2
100                  263.6           3.0                    3.0                     2.7
70                   276.7           3.1                    3.1                     2.9
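As a quick worked check of the table, the steady-state reduction for each technique can be computed directly from the listed values. The dictionary layout and function name below are just one hypothetical way to hold the numbers.

```python
# Steady-state leakage power (uW) copied from the table above.
STEADY_STATE_UW = {
    180: {"original": 177.9, "SV": 2.0, "LBB": 2.0, "NST": 1.8},
    130: {"original": 214.1, "SV": 2.4, "LBB": 2.4, "NST": 2.2},
    100: {"original": 263.6, "SV": 3.0, "LBB": 3.0, "NST": 2.7},
    70: {"original": 276.7, "SV": 3.1, "LBB": 3.1, "NST": 2.9},
}


def leakage_reduction_percent(process_nm, technique):
    """Percent reduction in steady-state leakage relative to the original."""
    row = STEADY_STATE_UW[process_nm]
    return 100.0 * (1.0 - row[technique] / row["original"])


for nm in (180, 130, 100, 70):
    print(nm, {t: round(leakage_reduction_percent(nm, t), 2) for t in ("SV", "LBB", "NST")})
```

All three techniques remove roughly 99% of steady-state leakage at every node, so the table alone does not separate them; the differences show up in transition energy and delay.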

[Figure: cumulative leakage energy (pJ) versus length of sleep (0 to 1000 cycles) at 180nm and 70nm, comparing Original, Sleep Vector, NMOS Sleep Transistor, and Leakage-Biased Bitlines]

Comparison of DDFTs Blowup: 70nm
[Figure: energy (pJ) versus length of sleep (0 to 50 cycles) at 70nm, comparing Original, Sleep Vector, NMOS Sleep Transistor, and Leakage-Biased Bitlines]

Dead Register/Subbank Deactivation Policies


Free list policies for NST (NMOS Sleep Transistor): queue and stack
queue: conventional
stack: keeps some regs dead for longer
2.4-10% greater savings than queue at 70nm
Benefit increases as feature sizes shrink

Subbank allocation policy for LBB: stack
Allocate a new subbank only when the previous bank is empty of dead registers
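A toy simulation illustrating why a stack free list keeps some registers dead for longer than a queue. It is not the SimpleScalar setup used in the talk: the event format, the oldest-first freeing order, and the "never allocated" count used as a proxy for long dead time are all assumptions.

```python
from collections import deque


def regs_never_allocated(policy, num_regs, events):
    """Count registers that stay dead for the whole run under a free-list policy.

    policy is "queue" (FIFO) or "stack" (LIFO). events is a sequence of
    "alloc" / "free"; "free" releases the oldest live register (assumption).
    """
    free = deque(range(num_regs))
    live = deque()
    touched = set()
    for ev in events:
        if ev == "alloc":
            # Queue takes the longest-free register, stack the most recently freed.
            reg = free.popleft() if policy == "queue" else free.pop()
            live.append(reg)
            touched.add(reg)
        else:
            free.append(live.popleft())
    return num_regs - len(touched)


workload = ["alloc", "alloc", "free", "free"] * 10  # hypothetical pattern
print("queue:", regs_never_allocated("queue", 8, workload))
print("stack:", regs_never_allocated("stack", 8, workload))
```

With at most two live registers, the queue cycles through all eight registers while the stack keeps reusing the same two, leaving the other six asleep for the entire run.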

Dead Reg Deactivation (Horizontal)


[Figure: leakage energy savings and total energy savings at the 70nm process, in percent per benchmark and average. Colored: optimistic, White: pessimistic.]
[Figure: leakage energy savings and total energy savings across processes (180, 130, 100, 70 nm), in percent, comparing NST Queue, NST Stack, LBB 16 regs/bank, and LBB 8 regs/bank]

NST stack better than NST queue, LBB stack better than either NST

Read Port Deactivation (Vertical)


[Figure: leakage energy saving and total energy saving at the 70nm process, in percent per benchmark and average, under pessimistic and optimistic prediction]
[Figure: leakage energy saving and total energy saving across processes (180nm, 130nm, 100nm, 70nm), in percent]

More energy saving for wider issue processors
Readport deactivation can be combined with dead subbank deactivation.

Conclusion
Most leakage power is in critical paths
Dynamic leakage reduction (DDFT) desired

LBB allows fine-grain dynamic leakage reduction with zero or minimal performance penalty.
0% performance penalty for multiported regfiles

Sleep time can be improved by changing micro-architectural scheduling policies.


Stack better than queue for free list policy

Follow on work:
Leakage-biased domino logic to save leakage power in critical ALUs [ VLSI Symposium 2002 ]

Acknowledgments
Thanks to Christopher Batten, Ronny Krashinsky, Rajesh Kumar, and anonymous reviewers
Funded by DARPA PAC/C award F30602-00-2-0562, NSF CAREER award CCR-0093354, and a donation from Infineon Technologies.

DDFT Examples

Steady-state leakage power
Body Biasing: Less than 5% (depends on Vbody)
Power Gating: Less than 5% (depends on sleep transistor)
Sleep Vector: Less than 50% (depends on the circuit)

Transition time, Wakeup latency
Body Biasing: 0.1~100us
Power Gating: Less than a cycle
Sleep Vector: Less than a cycle

Transition energy, Breakeven time
Body Biasing: Well cap switching energy
Power Gating: Sleep transistor gate cap switching energy
Sleep Vector: Active energy consumed due to spurious toggling after sleep vector

Delay Impact
Body Biasing: No
Power Gating: Yes. Due to sleep transistor
Sleep Vector: Yes. Due to mux

Etc
Power Gating: Area for sleep transistor and virtual supplies
Sleep Vector: Finding sleep vector is hard
