SMT and CMP Architecture

Performance, Energy and Thermal Considerations of SMT and CMP architectures
Yingmin Li, David Brooks, Zhigang Hu, Kevin Skadron
Dept. of Computer Science, University of Virginia; Division of Engineering and Applied Sciences, Harvard University; IBM T.J. Watson Research Center
Motivation
Future trends call for multi-core and multi-thread architectures.
Which is better: lots of tiny speed demons or fewer brainiacs?
Which is more valuable: more L2 or additional cores?
The performance, power, and thermal properties of multi-core vs. multi-thread architectures are not well understood.
[Figure (not to scale): single-thread performance vs. #threads per chip, placing an out-of-order processor, an in-order processor, Sun Niagara, CMP with out-of-order cores, and CMP with out-of-order SMT cores; annotated "Equal performance curve?"]
2005, Yingmin Li
Scope of this Study

Equal-area comparison between SMT and CMP extensions of an Apple G5-like core.
Note: 1MB of L2 is roughly equal in area to one G5-like core.
[Figure: floorplans of the three configurations: single-threaded, SMT, and single-threaded CMP. Each core comprises the units IFU, IDU, ISU, FXU, FPU, BXU, LSU, fx_reg, fp_reg, I_cache and D_cache, alongside the L2 cache; the CMP floorplan repeats the core units.]
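The equal-area rule can be written as a small bookkeeping sketch. It relies on the rough equivalence stated in the note (1MB of L2 is about one G5-like core in area) and on one extra simplifying assumption that the slide does not state: that SMT support adds negligible area to a core. The function and constant names are illustrative only.

```python
# Area is measured in "core equivalents": one G5-like core = 1 unit,
# and 1MB of L2 = 1 unit (the slide's rough equivalence).
CORE_AREA = 1.0
L2_AREA_PER_MB = 1.0


def chip_area(cores: int, l2_mb: float) -> float:
    """Total area in core equivalents (SMT overhead assumed negligible)."""
    return cores * CORE_AREA + l2_mb * L2_AREA_PER_MB


def cmp_l2_for_equal_area(smt_l2_mb: float, cmp_cores: int = 2) -> float:
    """L2 size (MB) a CMP can afford in the area of a one-core SMT chip."""
    budget = chip_area(cores=1, l2_mb=smt_l2_mb)
    l2_mb = (budget - cmp_cores * CORE_AREA) / L2_AREA_PER_MB
    if l2_mb < 0:
        raise ValueError("area budget too small for the requested cores")
    return l2_mb


if __name__ == "__main__":
    for smt_l2 in (2.0, 3.0):
        print(f"SMT with {smt_l2}MB L2 <-> dual-core CMP with "
              f"{cmp_l2_for_equal_area(smt_l2)}MB L2")
```

Under these assumptions an SMT chip with 2MB of L2 pairs with a dual-core CMP with 1MB of L2, which matches the configurations compared later in the deck.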
Outline

Modeling / model validation
SMT vs. CMP performance, power and thermal analysis (without DTM)
SMT vs. CMP performance, power and thermal analysis (with DTM)
Conclusions and future work
Performance sensitivity with different L2 size

[Figure: relative performance change compared to the ST baseline (axis 0% to 80%) vs. L2 size of the SMT configuration (1.5MB to 3MB in 0.25MB steps), for SMT and CMP]
CMP L2 size = SMT L2 size - 1MB
Modeling and Validation

Performance: Turandot with SMT and CMP augmentations, validated against a Power4 pre-RTL model
Power: PowerTimer with SMT and CMP augmentations, validated against CPAM power data extracted from circuits
Temperature: HotSpot from UVA integrated with Turandot/PowerTimer, validated with test chips at UVA
Turandot/PowerTimer Simulation Framework
Supports SMT/CMP
Runs on AIX/PowerPC and Linux/Intel platforms
PowerTimer is based on CPAM data, extracted from circuits
See the Micro02 tutorial by Zhigang Hu and David Brooks for details
Hotspot temperature model

Models all parts along both primary and secondary heat transfer paths
At arbitrary granularities
Fast and accurate
Essentially a lumped thermal R-C network
[Figure: thermal model stack: heat sink (with fin-to-air convection thermal resistor), heat spreader, thermal interface material, and silicon die, with a connection to the interconnect-layer thermal model]
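To make "lumped thermal R-C network" concrete, here is a minimal single-node sketch: one block with a thermal capacitance, connected to ambient through one thermal resistance, integrated with forward Euler. This is not HotSpot's code or API, and all parameter values are illustrative assumptions; the real model has many nodes per layer.

```python
def simulate_rc_node(power_w, r_th, c_th, t_amb, dt, steps):
    """Forward-Euler integration of C * dT/dt = P - (T - T_amb) / R."""
    temp = t_amb
    trace = []
    for _ in range(steps):
        # Net heat flow into the node divided by its thermal capacitance.
        d_temp = (power_w - (temp - t_amb) / r_th) / c_th
        temp += d_temp * dt
        trace.append(temp)
    return trace


if __name__ == "__main__":
    # Illustrative values: 20 W, 1.5 K/W, 0.05 J/K, 45 C ambient.
    trace = simulate_rc_node(power_w=20.0, r_th=1.5, c_th=0.05,
                             t_amb=45.0, dt=0.001, steps=2000)
    print(f"final temperature: {trace[-1]:.2f} C")
    print(f"steady state     : {45.0 + 20.0 * 1.5:.2f} C")
```

The time step is kept well below the time constant R*C so that the explicit integration stays stable; the trace settles at T_amb + P*R.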
Peak Temperature of The Hottest Spot for SMT and CMP

3 heat-up mechanisms:
Unit self-heating, determined by the power density of the unit
Global heating through the TIM (thermal interface material) and spreader
Lateral thermal coupling between neighboring units
[Figure: peak temperature (Celsius, axis 50 to 95) of the hottest spot for ST, SMT, CMP, ST (area enlarged), SMT (only activity factor), and CMP (one core rotated)]
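The three mechanisms can be told apart in a toy steady-state network: two neighboring units, each with a vertical resistance to a shared spreader node, a lateral resistance between them, and one resistance from the spreader to ambient. This is a sketch under assumed topology and values, not the deck's thermal model; all names and numbers are hypothetical.

```python
def steady_state_temps(p1, p2, rv1, rv2, r_lat, r_global, t_amb):
    """Steady-state temperatures of two neighboring units.

    rv1, rv2 : vertical resistance of each unit to the shared spreader
               (higher power density shows up as a larger P * rv product)
    r_lat    : lateral resistance between the two units
    r_global : resistance from the shared spreader node to ambient
    """
    # Global heating: all power leaves through the spreader/sink path.
    t_spreader = t_amb + r_global * (p1 + p2)

    a1, a2, g = 1.0 / rv1, 1.0 / rv2, 1.0 / r_lat
    # Heat balance at each unit:
    #   a_i * (T_i - T_spreader) + g * (T_i - T_j) = P_i
    b1 = p1 + a1 * t_spreader
    b2 = p2 + a2 * t_spreader
    det = (a1 + g) * (a2 + g) - g * g
    t1 = (b1 * (a2 + g) + g * b2) / det
    t2 = ((a1 + g) * b2 + g * b1) / det
    return t1, t2, t_spreader


if __name__ == "__main__":
    t1, t2, ts = steady_state_temps(p1=15.0, p2=5.0, rv1=2.0, rv2=2.0,
                                    r_lat=5.0, r_global=0.5, t_amb=45.0)
    print(f"spreader {ts:.1f} C, hot unit {t1:.1f} C, cool unit {t2:.1f} C")
```

Raising the total power lifts both units through the shared spreader term, while the lateral term pulls the hot unit down and its neighbor up relative to what each would reach in isolation.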
Heat Flow of Global Heat-up

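[Figure: package cross-section, top to bottom: heat sink, heat spreader, thermal interface material, silicon bulk, interconnect layers, C4 pads and underfill, ceramic substrate, CBGA joint, printed-circuit board. The primary path carries heat from the silicon up to the heat sink; the secondary path carries it down through the package to the printed-circuit board.]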
Illustration (global heat-up of CMP vs. local heat-up of SMT)
[Figure: two panels comparing CMP and SMT for the units IFU_B1, LSU_cache and FXU_reg: power density (axis 0 to 250) and temperature (axis 40 to 100)]
Temperature Trend with technology evolution

The increased utilization of SMT becomes muted
The L2 cache tends to be much cooler than the core
Exponential temperature dependence of leakage
[Figure: average temperature difference between CMP and SMT (axis 0 to 6) vs. technology (130, 90, 70 nm), for three cases: normal case, L2 leakage radically reduced, and no temperature effect on leakage]
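The exponential temperature dependence of leakage creates a feedback loop: a hotter chip leaks more, and the extra leakage heats it further. The sketch below uses a common simplified form, P_leak(T) = P_ref * exp(beta * (T - T_ref)), with a single thermal resistance to ambient. The functional form, beta, and every number here are assumptions for illustration and are not taken from the slides.

```python
import math


def leakage_power(temp_c, p_ref_w, t_ref_c, beta):
    """Leakage power with an exponential temperature dependence."""
    return p_ref_w * math.exp(beta * (temp_c - t_ref_c))


def settle_temperature(p_dyn_w, p_ref_w, t_ref_c, beta, r_th, t_amb_c,
                       iterations=100):
    """Fixed-point iteration of T = T_amb + R * (P_dyn + P_leak(T))."""
    temp = t_amb_c
    for _ in range(iterations):
        total = p_dyn_w + leakage_power(temp, p_ref_w, t_ref_c, beta)
        temp = t_amb_c + r_th * total
    return temp


if __name__ == "__main__":
    args = dict(p_ref_w=5.0, t_ref_c=85.0, beta=0.02)
    with_feedback = settle_temperature(p_dyn_w=25.0, r_th=1.0,
                                       t_amb_c=45.0, **args)
    no_leakage = 45.0 + 1.0 * 25.0
    print(f"dynamic power only : {no_leakage:.2f} C")
    print(f"with leakage loop  : {with_feedback:.2f} C")
```

The iteration converges only while the loop gain R * beta * P_leak stays below one; with larger leakage or thermal resistance the same model describes thermal runaway instead.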
SMT vs. CMP performance and power efficiency analysis (without DTM)
SMT is superior for memory-bound (high L2-cache miss rate) benchmarks, while CMP is superior for non-memory-bound benchmarks.
[Figure: two panels (compute-bound, memory-bound) showing the relative change compared to the ST baseline (axis -80% to 200%) in IPC, power, energy, energy-delay and energy-delay^2, for 2-way SMT and dual-core CMP]
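The metrics plotted in these panels are related by simple arithmetic. The sketch below derives the relative changes from an IPC ratio and a power ratio, assuming the same instruction count and clock frequency as the baseline so that delay scales as 1/IPC. That assumption and the function name are illustrative; the slides do not spell out how the metrics were computed.

```python
def relative_metrics(ipc_ratio, power_ratio):
    """Relative changes vs. a baseline, given IPC and power ratios.

    Assumes the same instruction count and clock frequency as the baseline,
    so delay scales as 1 / IPC.
    """
    delay = 1.0 / ipc_ratio
    energy = power_ratio * delay
    ratios = {
        "IPC": ipc_ratio,
        "POWER": power_ratio,
        "ENERGY": energy,
        "ENERGY DELAY": energy * delay,
        "ENERGY DELAY^2": energy * delay ** 2,
    }
    # Express each ratio as a change relative to the baseline (0.0 = equal).
    return {name: value - 1.0 for name, value in ratios.items()}


if __name__ == "__main__":
    # Hypothetical design point: 1.6x the IPC at 1.8x the power.
    for name, change in relative_metrics(1.6, 1.8).items():
        print(f"{name:15s} {change:+.0%}")
```

Because delay enters energy-delay^2 cubed (once through energy and twice directly), a design that gains IPC can lose on energy yet still win on energy-delay^2.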
The impact of changing L2 size: Examples
MCF+MCF: stays memory bound when L2 size changes
MCF+VPR: changes from memory bound to non-memory bound when L2 size changes
[Figure: one panel per workload showing the relative change compared with the baseline ST with 2MB L2 (axis -80% to 100%) in IPC, power, energy, energy-delay and energy-delay^2, for SMT with 2MB L2, CMP with 1MB L2, SMT with 3MB L2 and CMP with 2MB L2. Value annotations in the figure: 2.54, 6.09, 1.35, 2.33, 3.72, 13.2]
SMT vs. CMP performance with DTM

Global techniques: global DVS, fetch throttling
Local techniques: rename throttling, register file throttling (ideal)
Localized DTM methods favor SMT, while global DTM methods favor CMP.
[Figure: two panels (compute-bound, memory-bound) showing the relative change compared to the ST baseline without DTM (axis -20% to 100%) for SMT, CMP and ST, under no DTM, local renaming throttling, global fetch throttling and register file throttling]
SMT energy efficiency with DTM

Localized methods can yield a better energy-delay product than global methods in some cases.
[Figure: two panels (compute-bound, memory-bound) showing the relative change compared with the baseline without DTM (axis -80% to 100%) in power, energy, energy-delay and energy-delay^2, for no DTM, rename throttling, fetch throttling and register file throttling]
CMP energy efficiency with DTM

Localized methods are inferior for CMP in terms of the energy and energy-delay product metrics.
[Figure: two panels (compute-bound, memory-bound) showing the relative change compared with the baseline without DTM (axis -80% to 100%) in power, energy, energy-delay and energy-delay^2, for no DTM, rename throttling, fetch throttling and register file throttling. Value annotations in the figure: 1.10, 1.21, 1.09, 2.17, 2.06, 2.07, 1.9, 2.37, 1.94]
Conclusions
With the same chip area and an Apple G5-like architecture, SMT performs better than CMP for memory-bound benchmarks, while CMP performs better than SMT for non-memory-bound benchmarks.
The thermal heating effects are quite different for CMP and SMT.
CMP machines are clearly hotter than SMT machines with leaky technology.
Different DTM techniques favor different architectures.
Future Work
Consider significantly larger amounts of thread-level parallelism and hybrids between CMP and SMT cores.
Study the impact of varying core complexity on the performance of SMT and CMP, and explore a wider range of design options, such as SMT fetch policies.
Explore server-oriented workloads.
