
Performance, Energy and Thermal Considerations of SMT and CMP architectures

Yingmin Li, David Brooks, Zhigang Hu, Kevin Skadron

Dept. of Computer Science, University of Virginia
Division of Engineering and Applied Sciences, Harvard University
IBM T.J. Watson Research Center

Motivation

Future trend calls for multi-core and multi-thread architectures
Which is better: lots of tiny speed demons or fewer brainiacs?
Which is more valuable, more L2 or additional cores?
Performance, power, and thermal properties of multi-core vs. multi-thread architectures not well understood

[Figure (not to scale): design space plotted as single-thread performance vs. #threads per chip. Labeled points: out-of-order processor, CMP with out-of-order cores, CMP with out-of-order SMT cores, in-order processor, Sun Niagara. Annotation: "Equal performance curve?"]

2005, Yingmin Li

Scope of this Study

Equal-area comparison between SMT and CMP extensions of an Apple G5-like core
Note: 1 MB of L2 is roughly equal in area to one G5-like core

[Figure: floorplans of the three configurations, each with an L2 cache: single-threaded core, SMT core, and single-threaded dual-core CMP. Units shown: IFU, IDU, ISU, FXU, FPU, BXU, LSU, fx_reg, fp_reg, I_cache, D_cache.]

Outline

Modeling / model validation
SMT vs. CMP performance, power and thermal analysis (without DTM)
SMT vs. CMP performance, power and thermal analysis (with DTM)
Conclusions and future work

Performance sensitivity with different L2 size

[Figure: relative performance change compared to the ST baseline (y-axis, 0% to 80%) for SMT and CMP, plotted against SMT L2 size (x-axis, 1.5 MB to 3 MB in 0.25 MB steps). CMP L2 size = SMT L2 size - 1 MB.]
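The equal-area pairing behind this sweep can be sketched in a few lines. This is only an illustration of the slide's rule of thumb (1 MB of L2 costs about as much area as one core); the function name `cmp_l2_mb` and its parameters are hypothetical and are not taken from the study's tools.

```python
def cmp_l2_mb(smt_l2_mb, core_area_in_l2_mb=1.0, extra_cores=1):
    """L2 capacity left for the CMP after spending area on the extra core(s).

    Assumes the slide's rule of thumb: one core occupies roughly the
    area of 1 MB of L2, and SMT area overhead is ignored.
    """
    remaining = smt_l2_mb - extra_cores * core_area_in_l2_mb
    if remaining <= 0:
        raise ValueError("no area left for L2")
    return remaining


# SMT L2 sizes on the slide's x-axis: 1.5 MB to 3 MB in 0.25 MB steps.
smt_sizes = [1.5 + 0.25 * i for i in range(7)]

# Each SMT configuration is compared with the dual-core CMP of equal area.
pairs = [(size, cmp_l2_mb(size)) for size in smt_sizes]
```

Each tuple in `pairs` is one equal-area comparison point, for example an SMT core with 2 MB of L2 against a dual-core CMP with 1 MB of L2.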

Modeling and Validation

Performance: Turandot with SMT and CMP augmentations, validated against a Power4 pre-RTL model
Power: PowerTimer with SMT and CMP augmentations, validated against CPAM power data extracted from circuits
Temperature: Hotspot from UVA integrated with Turandot/PowerTimer, validated with test chips at UVA

Turandot/PowerTimer Simulation Framework

Supports SMT/CMP
Runs on AIX/PowerPC and Linux/Intel platforms
PowerTimer is based on CPAM data extracted from circuits
See the Micro02 tutorial by Zhigang Hu and David Brooks for details

Hotspot temperature model

Models all parts along both primary and secondary heat transfer paths
At arbitrary granularities
Fast and accurate
Essentially a lumped thermal R-C network

[Figure: thermal R-C network of the package. Layers: silicon die, thermal interface material, heat spreader, heat sink, ending in a fin-to-air convection thermal resistor; a separate branch leads to the interconnect-layer thermal model.]
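To make "lumped thermal R-C network" concrete, here is a minimal one-node sketch. It is not Hotspot's implementation, which uses many coupled nodes; the function names, the forward-Euler integrator, and all parameter values are assumptions chosen for illustration. The node obeys C * dT/dt = P - (T - T_ambient) / R.

```python
def simulate_rc(power_w, r_k_per_w, c_j_per_k, t_ambient_c, dt_s, steps):
    """Forward-Euler integration of C*dT/dt = P - (T - T_amb)/R for one node."""
    temp = t_ambient_c
    trace = []
    for _ in range(steps):
        # Net heat flow into the node divided by its thermal capacitance.
        d_temp = (power_w - (temp - t_ambient_c) / r_k_per_w) / c_j_per_k
        temp += d_temp * dt_s
        trace.append(temp)
    return trace


def steady_state(power_w, r_k_per_w, t_ambient_c):
    """Temperature at which heat in equals heat out: T = T_amb + P*R."""
    return t_ambient_c + power_w * r_k_per_w


# Hypothetical values: 10 W into a node with R = 2 K/W, C = 0.5 J/K.
trace = simulate_rc(power_w=10.0, r_k_per_w=2.0, c_j_per_k=0.5,
                    t_ambient_c=45.0, dt_s=0.01, steps=5000)
```

The time constant is R*C (1 s with these values), so the trace settles to the steady-state temperature well within the simulated 50 s. A full model chains such nodes vertically through the package and laterally between neighboring units.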

Peak Temperature of the Hottest Spot for SMT and CMP

3 heat-up mechanisms
Unit self-heating, determined by the power density of the unit
Global heating through TIM (thermal interface material) and spreader
Lateral thermal coupling between neighboring units

[Figure: peak temperature in Celsius (y-axis, 50 to 95) for ST, SMT, and CMP configurations. The rotated bar labels are garbled in the source; the legible variants are "only activity factor", "area enlarged", and "one core rotated".]

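Heat Flow of Global Heat-up

[Figure: package cross-section, top to bottom: heat sink, heat spreader, thermal interface material, silicon bulk, interconnect layers, C4 pads and underfill, ceramic substrate, CBGA joint, printed-circuit board. The primary path runs from the silicon toward the heat sink; the secondary path runs toward the printed-circuit board.]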

Illustration (global heat-up of CMP vs. local heat-up of SMT)

[Figure: two bar charts comparing CMP and SMT for the units IFU_B1, LSU_cache, and FXU_reg. One chart shows temperature (y-axis, 40 to 100), the other power density (y-axis, 0 to 250).]

Temperature Trend with technology evolution

Increased utilization of SMT becomes muted
L2 cache tends to be much cooler than the core
Exponential temperature dependence of leakage

[Figure: average temperature difference between CMP and SMT (y-axis, 0 to 6) at the 130, 90, and 70 nm technology nodes. Series: normal case; L2 leakage radically reduced; no temperature effect on leakage.]
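The exponential dependence of leakage on temperature creates a feedback loop: more leakage raises temperature, which raises leakage again. The sketch below illustrates that loop with a generic model P_leak(T) = P_ref * exp(beta * (T - T_ref)) and a single thermal resistance. The model form, the coefficient `beta_per_k`, and every number are assumptions for illustration, not the fitted values used in the study. Setting `beta_per_k=0` corresponds to the "no temperature effect on leakage" case.

```python
import math


def leakage_power(p_leak_ref_w, temp_c, temp_ref_c, beta_per_k):
    """Leakage power scaled exponentially from its value at temp_ref_c."""
    return p_leak_ref_w * math.exp(beta_per_k * (temp_c - temp_ref_c))


def settle_temperature(p_dyn_w, p_leak_ref_w, r_k_per_w, t_ambient_c,
                       temp_ref_c, beta_per_k, iters=200):
    """Fixed-point iteration on T = T_amb + R * (P_dyn + P_leak(T)).

    Converges only while the feedback gain R * beta * P_leak stays below 1;
    a larger gain would model thermal runaway, which this sketch ignores.
    """
    temp = t_ambient_c
    for _ in range(iters):
        total_w = p_dyn_w + leakage_power(p_leak_ref_w, temp,
                                          temp_ref_c, beta_per_k)
        temp = t_ambient_c + r_k_per_w * total_w
    return temp


# Hypothetical operating point: 20 W dynamic, 5 W leakage at 85 C, R = 1 K/W.
t_coupled = settle_temperature(20.0, 5.0, 1.0, 45.0, 85.0, beta_per_k=0.017)
t_no_feedback = settle_temperature(20.0, 5.0, 1.0, 45.0, 85.0, beta_per_k=0.0)
```

Comparing `t_coupled` with `t_no_feedback` shows how much of the settled temperature depends on the leakage model rather than on dynamic power alone.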

SMT vs. CMP performance and power efficiency analysis (without DTM)

SMT is superior for memory-bound (high L2 cache miss rate) benchmarks, while CMP is superior for non-memory-bound benchmarks

[Figure: two bar charts, compute-bound and memory-bound, comparing 2-way SMT with dual-core CMP. Y-axis: relative change compared to ST baseline (-80% to 200%). Metrics: IPC, POWER, ENERGY, ENERGY DELAY, ENERGY DELAY^2.]
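The five metrics on these charts are related by simple arithmetic, sketched below. For a fixed instruction count and clock frequency, delay is inversely proportional to IPC, energy is power times delay, and the energy-delay products multiply in further factors of delay. The function names and the input numbers are hypothetical; they are not results from the study.

```python
def efficiency_metrics(ipc, power_w, instructions=1.0, freq_hz=1.0):
    """Derive energy and energy-delay metrics from IPC and average power."""
    delay = instructions / (ipc * freq_hz)   # run time for a fixed workload
    energy = power_w * delay
    return {
        "IPC": ipc,
        "POWER": power_w,
        "ENERGY": energy,
        "ENERGY DELAY": energy * delay,
        "ENERGY DELAY^2": energy * delay ** 2,
    }


def relative_change(candidate, baseline):
    """Fractional change of each metric against the baseline (0.5 means +50%)."""
    return {name: candidate[name] / baseline[name] - 1.0 for name in baseline}


# Made-up example: a design with twice the IPC at 1.5x the power of the baseline.
baseline = efficiency_metrics(ipc=1.0, power_w=10.0)
candidate = efficiency_metrics(ipc=2.0, power_w=15.0)
change = relative_change(candidate, baseline)
```

In this made-up case power rises by 50%, yet energy falls by 25% and ENERGY DELAY^2 falls by about 81%, because the delay-weighted metrics reward the throughput gain more strongly. For IPC, higher is better; for the other four metrics, lower is better.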

The impact of changing L2 size: Examples

MCF+MCF: stays memory bound when L2 size changes
MCF+VPR: changes from memory bound to non memory bound when L2 size changes

[Figure: two bar charts, one per workload pair. Y-axis: relative change compared with baseline ST with 2MB L2 (-80% to 100%). Series: SMT with 2MB L2, CMP with 1MB L2, SMT with 3MB L2, CMP with 2MB L2. Metrics: IPC, POWER, ENERGY, ENERGY DELAY, ENERGY DELAY^2. Off-scale bars carry the values 2.54, 6.09, 1.35, 2.33, 3.72, and 13.2; the source does not preserve which bar each value belongs to.]

SMT vs. CMP performance with DTM

Global techniques: global DVS, fetch throttling
Local techniques: rename throttling, register file throttling (ideal)
Localized DTM methods favor SMT, while global DTM methods favor CMP

[Figure: two bar charts, compute-bound and memory-bound, for SMT, CMP, and ST. Y-axis: relative change compared to ST baseline without DTM (-20% to 100%). Series: No DTM, Local renaming throttling, Global fetch throttling, Register file throttling.]

SMT energy efficiency with DTM

Localized methods can yield a better energy-delay product than global methods in some cases.

[Figure: two bar charts, compute-bound and memory-bound. Y-axis: relative change compared with baseline without DTM (-80% to 100%). Series: No DTM, Rename throttling, Fetch throttling, Register file throttling. Metrics: POWER, ENERGY, ENERGY DELAY, ENERGY DELAY^2.]

CMP energy efficiency with DTM

Localized methods are inferior for CMP in terms of the energy and energy-delay product metrics

[Figure: two bar charts, compute-bound and memory-bound. Y-axis: relative change compared with baseline without DTM (-80% to 100%). Series: No DTM, Rename throttling, Fetch throttling, Register file throttling. Metrics: POWER, ENERGY, ENERGY DELAY, ENERGY DELAY^2. Off-scale bars carry the values 1.10, 1.21, 1.09, 2.17, 2.06, 2.07, 1.9, 2.37, and 1.94; the source does not preserve which bar each value belongs to.]

Conclusions

With the same chip area, SMT performs better than CMP for memory-bound benchmarks, while CMP performs better than SMT for non-memory-bound benchmarks, on an Apple G5-like architecture
The thermal heating effects are quite different for CMP and SMT
CMP machines are clearly hotter than SMT machines with leaky technology
Different DTM techniques favor different architectures



Future Work

Consider significantly larger amounts of thread-level parallelism and hybrids between CMP and SMT cores
Study the impact of varying core complexity on the performance of SMT and CMP, and explore a wider range of design options, such as SMT fetch policies
Explore server-oriented workloads
