
Multi-Core Parallelism for Low-Power Design

Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu

2/8/06 D&T Seminar 1


Power Consumption of VLSI Chips

Why is it a concern?



SIA Roadmap for Processors (1999)
Year                      1999   2002   2005   2008   2011   2014
Feature size (nm)          180    130    100     70     50     35
Logic transistors/cm²     6.2M    18M    39M    84M   180M   390M
Clock (GHz)               1.25    2.1    3.5    6.0   10.0   16.9
Chip size (mm²)            340    430    520    620    750    900
Power supply (V)           1.8    1.5    1.2    0.9    0.6    0.5
High-perf. power (W)        90    130    160    170    175    183

Source: http://www.semichips.org
ISSCC, Feb. 2001, Keynote
“Ten years from now, microprocessors will run at 10GHz to 30GHz and be capable of processing 1 trillion operations per second -- about the same number of calculations that the world's fastest supercomputer can perform now.

“Unfortunately, if nothing changes these chips will produce as much heat, for their proportional size, as a nuclear reactor. . . .”

Patrick P. Gelsinger
Senior Vice President
General Manager
Digital Enterprise Group
INTEL CORP.



VLSI Chip Power Density
Source: Intel
[Figure: power density (W/cm², log scale from 1 to 10000) versus year (1970 to 2010) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® and P6. Reference levels are marked for a hot plate, a nuclear reactor, a rocket nozzle and the Sun's surface.]
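Power Dissipation in CMOS Logic (0.25µ)

$P_{total}(0 \to 1) = C_L V_{DD}^2 + t_{sc} V_{DD} I_{peak} + V_{DD} I_{leakage}$

Share of the three terms, in order: 75%, 20%, 5%

[Figure: CMOS gate schematics with supply VDD and load capacitance CL]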
Low-Power Datapath Architecture
• Lower supply voltage
– This slows down circuit speed
– Use parallel computing to gain the speed back
• Works well when threshold voltage is also
lowered.
• About 60% reduction in power obtainable.
• Reference: A. P. Chandrakasan and R. W.
Brodersen, Low Power Digital CMOS Design,
Boston: Kluwer Academic Publishers (Now
Springer), 1995.
A Reference Datapath

[Figure: Input, register, combinational logic, register, Output. Both registers are clocked by CK; the capacitance switched is Cref.]
Supply voltage = Vref
Total capacitance switched per cycle = Cref
Clock frequency = f
Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f$
A Parallel Architecture
[Figure: the input, arriving at rate f, fans out to N copies of the datapath (Copy 1, Copy 2, ..., Copy N), each a register plus combinational logic clocked at f/N. An N to 1 multiplexer feeds the output register, clocked at f. A multiphase clock generator and mux control block is driven by CK.]

A copy processes every Nth input and operates at reduced voltage.
Supply voltage: $V_N \le V_1 = V_{ref}$
N = degree of parallelism
Control Signals, N = 4
[Figure: timing diagram of the clock CK and the four control signals, Phase 1 to Phase 4]



Power
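$P_N = P_{proc} + P_{overhead}$

$P_{proc} = N(C_{inreg} + C_{comb}) V_N^2 f/N + C_{outreg} V_N^2 f$
$\quad = (C_{inreg} + C_{comb} + C_{outreg}) V_N^2 f$
$\quad = C_{ref} V_N^2 f$

$P_{overhead} = C_{overhead} V_N^2 f \approx \delta C_{ref} (N - 1) V_N^2 f$

$P_N = [1 + \delta(N - 1)] C_{ref} V_N^2 f$

$\dfrac{P_N}{P_1} = [1 + \delta(N - 1)] \dfrac{V_N^2}{V_{ref}^2}$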
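As a quick numerical check of the ratio above, the Python sketch below evaluates PN/P1 and the break-even supply voltage at which N copies consume as much as one. The function names and the sample values (δ = 0.1, VN = 3 V, Vref = 5 V) are illustrative assumptions, not figures from the seminar.

```python
def power_ratio(n, v_n, v_ref, delta):
    """P_N / P_1 = [1 + delta*(N - 1)] * (V_N / V_ref)**2."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (1.0 + delta * (n - 1)) * (v_n / v_ref) ** 2


def break_even_voltage(n, v_ref, delta):
    """Supply voltage at which N copies use exactly the power of one copy."""
    return v_ref / (1.0 + delta * (n - 1)) ** 0.5


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    print(power_ratio(n=2, v_n=3.0, v_ref=5.0, delta=0.1))
    print(break_even_voltage(n=2, v_ref=5.0, delta=0.1))
```

Any VN below the break-even value yields a net power saving for that N and δ.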
Voltage vs. Speed
Delay of a gate, $T \approx \dfrac{C_L V_{ref}}{I} = \dfrac{C_L V_{ref}}{k (W/L) (V_{ref} - V_t)^2}$

where I is saturation current
k is a technology parameter
W/L is width to length ratio of transistor
Vt is threshold voltage
[Figure: normalized gate delay T (0.0 to 4.0) versus supply voltage for 1.2μ CMOS. Marked points: N=1 at Vref = 5V, N=2 at V2 = 2.9V, N=3 at V3. Annotation: voltage reduction slows down as we get closer to Vt.]
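The delay expression above can be inverted to estimate the supply voltage at which a gate runs N times slower than at Vref. The sketch below does this in closed form for the simple model T ∝ V/(V − Vt)²; the function names are hypothetical. The slide's V2 = 2.9V is read from a plotted 1.2μ CMOS curve, so this idealized model need not reproduce it exactly.

```python
import math


def gate_delay(v, v_t, scale=1.0):
    """Relative gate delay T ~ V / (V - Vt)**2; `scale` lumps C_L / (k * W/L)."""
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return scale * v / (v - v_t) ** 2


def supply_for_slowdown(n, v_ref, v_t):
    """Supply voltage V_N at which the gate delay is n times the delay at v_ref.

    Solves K*(V - Vt)**2 = V with K = n * v_ref / (v_ref - v_t)**2 and
    returns the root above Vt, where delay falls monotonically with voltage.
    """
    k = n * v_ref / (v_ref - v_t) ** 2
    return (2.0 * k * v_t + 1.0 + math.sqrt(4.0 * k * v_t + 1.0)) / (2.0 * k)


if __name__ == "__main__":
    # Vt = 0.8 V is an assumed threshold for illustration.
    for n in (1, 2, 3):
        print(n, round(supply_for_slowdown(n, v_ref=5.0, v_t=0.8), 2))
```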
Increasing Multiprocessing
[Figure: PN/P1 (0.0 to 1.0) versus N (1 to 12) for 1.2μ CMOS, Vref = 5V, with curves for Vt = 0.8V, Vt = 0.4V and Vt = 0V (extreme case).]



Extreme Cases: Vt = 0
Delay, $T \propto 1/V_{ref}$

For N processing elements, delay = NT → $V_N = V_{ref}/N$

$\dfrac{P_N}{P_1} = [1 + \delta(N - 1)] \dfrac{1}{N^2}$, which approaches $\delta/N$ for large N, that is, it falls off as 1/N.

For negligible overhead, δ → 0:

$\dfrac{P_N}{P_1} \approx \dfrac{1}{N^2}$

For Vt > 0, power reduction is less and there will be an optimum value of N.
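To see how an optimum N can arise when Vt > 0, the sketch below combines the delay model T ∝ V/(V − Vt)² with the power ratio [1 + δ(N − 1)](VN/Vref)² and searches over N. The overhead δ = 0.1 and the search limit of 64 are assumed values, not taken from the slides, and the function names are hypothetical.

```python
import math


def supply_voltage(n, v_ref, v_t):
    """V_N giving n times the gate delay at v_ref, for T ~ V / (V - Vt)**2."""
    k = n * v_ref / (v_ref - v_t) ** 2
    return (2.0 * k * v_t + 1.0 + math.sqrt(4.0 * k * v_t + 1.0)) / (2.0 * k)


def power_ratio(n, v_ref, v_t, delta):
    """P_N / P_1 = [1 + delta*(N - 1)] * (V_N / V_ref)**2."""
    v_n = supply_voltage(n, v_ref, v_t)
    return (1.0 + delta * (n - 1)) * (v_n / v_ref) ** 2


def best_n(v_ref, v_t, delta, n_max):
    """Degree of parallelism in 1..n_max with the lowest power ratio."""
    return min(range(1, n_max + 1),
               key=lambda n: power_ratio(n, v_ref, v_t, delta))


if __name__ == "__main__":
    # delta = 0.1 and n_max = 64 are assumptions for illustration only.
    for v_t in (0.0, 0.4, 0.8):
        n = best_n(v_ref=5.0, v_t=v_t, delta=0.1, n_max=64)
        ratio = power_ratio(n, 5.0, v_t, 0.1)
        print(f"Vt = {v_t} V: best N = {n}, PN/P1 = {ratio:.3f}")
```

With Vt = 0 the ratio keeps falling as N grows, so the search simply returns the upper limit; with Vt > 0 the supply voltage cannot drop below Vt, the overhead term eventually dominates, and the minimum lies at a finite N.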
Example: Multiplier Core
• Specification:
• 200MHz Clock
• 15W dissipation @ 5V
• Low voltage operation, VDD ≥ 1.5 volts

Relative clock rate = $\dfrac{(V_{DD} - 0.5)^2}{20.25}$
• Problem:
• Integrate multiplier core on a SOC
• Power budget for multiplier ~ 5W
A Multicore Design

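[Figure: the 200MHz input fans out to five multiplier cores (Core 1, Core 2, ..., Core 5), each with its own register and clocked at 40MHz. A 5 to 1 mux feeds the output register, clocked at 200MHz. A multiphase clock generator and mux control block is driven by the 200MHz clock CK.]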

Core clock frequency = 200/N MHz; N should divide 200.



How Many Cores?

• For N cores:
• clock frequency = 200/N MHz
• Supply voltage, $V_{DDN} = 0.5 + (20.25/N)^{1/2}$ volts
• Assuming 10% overhead per core,
Power dissipation = $15\,[1 + 0.1(N - 1)] \left(\dfrac{V_{DDN}}{5}\right)^2$ watts



Design Tradeoffs
Number of cores   Clock (MHz)   Core supply     Total power
N                               VDDN (volts)    (watts)

1                 200           5.00            15.0
2                 100           3.68            8.94
4                 50            2.75            5.90
5                 40            2.51            5.29
8                 25            2.10            4.50
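The rows of this table follow from the supply-voltage and power expressions of the multiplier example. The Python sketch below recomputes them; the constant and function names are hypothetical. Evaluated without intermediate rounding, the results agree with the table to within a few hundredths (the N = 8 row comes out near 2.09 V and 4.46 W).

```python
import math

P_REF_W = 15.0     # dissipation of one core at 5 V and 200 MHz
V_REF = 5.0        # reference supply voltage, volts
V_MIN = 1.5        # lowest supply voltage allowed by the specification
F_REF_MHZ = 200.0  # required throughput clock, MHz
OVERHEAD = 0.1     # 10% overhead per additional core


def core_supply(n):
    """V_DDN obtained from relative clock rate (V - 0.5)**2 / 20.25 = 1/n."""
    v = 0.5 + math.sqrt(20.25 / n)
    if v < V_MIN:
        raise ValueError(f"N = {n} needs {v:.2f} V, below the {V_MIN} V limit")
    return v


def total_power(n):
    """15 * [1 + 0.1*(N - 1)] * (V_DDN / 5)**2, in watts."""
    return P_REF_W * (1.0 + OVERHEAD * (n - 1)) * (core_supply(n) / V_REF) ** 2


if __name__ == "__main__":
    for n in (1, 2, 4, 5, 8):
        print(f"{n:2d}  {F_REF_MHZ / n:6.1f} MHz  "
              f"{core_supply(n):5.2f} V  {total_power(n):6.2f} W")
```

Against the power budget of about 5W, five cores land slightly above it and eight cores below it.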



Power Reduction in Processors
• Just about everything is used.
• Hardware methods:
• Voltage reduction for dynamic power
• Dual-threshold devices for leakage reduction
• Clock gating, frequency reduction
• Sleep mode
• Architecture:
• Instruction set
• Hardware organization
• Software methods
Parallel Architecture
Reference design: one processor, with input and output at rate f
Capacitance = C
Voltage = V
Frequency = f
Power = $CV^2 f$

Parallel design: two processors, each running at f/2, with input and output at rate f
Capacitance = 2.2C
Voltage = 0.6V
Frequency = 0.5f
Power = $0.396\,CV^2 f$
Pipeline Architecture

Reference design: one processor between registers, clocked at f
Capacitance = C
Voltage = V
Frequency = f
Power = $CV^2 f$

Pipelined design: the processor split into two half-size stages (½ Proc.) separated by a register, clocked at f
Capacitance = 1.2C
Voltage = 0.6V
Frequency = f
Power = $0.432\,CV^2 f$
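Both power figures, for the two-processor parallel design and for the two-stage pipeline, follow from scaling C, V and f in P = CV²f. A minimal Python check, with a hypothetical helper name:

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Power relative to the reference C*V**2*f for scaled C, V and f."""
    return cap_factor * volt_factor ** 2 * freq_factor


# Two parallel processors: 2.2C, 0.6V, each running at f/2.
parallel = relative_power(2.2, 0.6, 0.5)
# Two-stage pipeline: 1.2C, 0.6V, still clocked at f.
pipeline = relative_power(1.2, 0.6, 1.0)

print(f"parallel: {parallel:.3f} CV^2f")  # 0.396
print(f"pipeline: {pipeline:.3f} CV^2f")  # 0.432
```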



Approximate Trend
               n-parallel proc.    n-stage pipeline proc.
Capacitance    nC                  C
Voltage        V/n                 V/n
Frequency      f/n                 f
Power          CV²f/n²             CV²f/n²
Chip area      n times             10-20% increase

G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.
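The two power entries in the trend table can be checked by substituting each column into P = CV²f. The V/n voltage scaling is the idealized case in which delay is inversely proportional to supply voltage (Vt = 0), so this is an approximate trend rather than an exact result.

```latex
\begin{align*}
P_{\text{parallel}} &= (nC)\left(\frac{V}{n}\right)^{2}\frac{f}{n}
                     = \frac{CV^{2}f}{n^{2}} \\
P_{\text{pipeline}} &= C\left(\frac{V}{n}\right)^{2} f
                     = \frac{CV^{2}f}{n^{2}}
\end{align*}
```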
Multicore Processors
[Figure: performance based on SPECint2000 and SPECfp2000 benchmarks, 2000 to 2008, for single core and multicore processors. Source: Computer, May 2005, p. 12.]



Multicore Processors
• D. Geer, “Chip Makers Turn to Multicore Processors,” Computer, vol. 38, no. 5, pp. 11-13, May 2005.
• A. Jerraya, H. Tenhunen and W. Wolf, “Multiprocessor Systems-on-Chips,” Computer, vol. 38, no. 7, pp. 36-40, July 2005; this special issue contains three more articles on multicore processors.
• S. K. Moore, “Winner: Multimedia Monster - Cell’s Nine Processors Make It a Supercomputer on a Chip,” IEEE Spectrum, vol. 43, no. 1, pp. 20-23, January 2006.
Cell - Cell Broadband Engine Architecture
Nine-processor chip: 192 Gflops

[Photo, © IEEE Spectrum, January 2006. Left to right: Atsushi Kameyama, Toshiba; James Kahle, IBM; Masakazu Suzuoki, Sony]
Cell’s Nine-Processor Chip

[Figure, © IEEE Spectrum, January 2006]

Eight identical processors
f = 5.6GHz (max)
44.8 Gflops



?



Amdahl’s Law
[Figure: execution time normalized to the interval 0 to 1, split into a serial fraction S and a parallelizable fraction P = 1 - S]

$\text{Speedup} = \dfrac{1}{S + (1 - S)/N}$

where N = number of parallel processors

Example: S = 0.6, N = 10, Speedup = 1.56
         S = 0.6, N = ∞, Speedup = 1.67

Gene Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings, (30), pp. 483-485, 1967.
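The two example values can be reproduced with a few lines of Python. The function name is a hypothetical choice, and `math.inf` stands in for an unlimited number of processors.

```python
import math


def amdahl_speedup(serial_fraction, n):
    """Speedup = 1 / (S + (1 - S) / N); n may be math.inf."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


print(f"{amdahl_speedup(0.6, 10):.2f}")        # 1.56
print(f"{amdahl_speedup(0.6, math.inf):.2f}")  # 1.67
```

With 60% of the work serial, ten processors already deliver most of the speedup that any number of processors could.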
Question
• Can we find a multi-processing law
– for power reduction, or
– for performance per watt?

