Dept. of Computer Science, University of Virginia
Division of Engineering and Applied Sciences, Harvard University
IBM T.J. Watson Research Center
Motivation
Future trends call for multi-core and multi-thread architectures.
Which is better: lots of tiny speed demons or fewer brainiacs?
Which is more valuable: more L2 or additional cores?
The performance, power, and thermal properties of multi-core vs. multi-thread architectures are not well understood.
[Figure: #threads per chip for out-of-order processors vs. in-order processors (e.g., Sun Niagara); note: not to scale]
2005, Yingmin Li
[Figure: floorplans of the single-threaded, SMT, and single-threaded CMP chips. Units shown: IFU, IDU, ISU, FXU, FPU, BXU, LSU, fx_reg, fp_reg, I_cache, D_cache, L2Cache]
Outline
Modeling / Model Validation
SMT vs. CMP performance, power and thermal analysis (without DTM)
SMT vs. CMP performance, power and thermal analysis (with DTM)
Conclusions and future work
[Figure: CMP compared against SMT; axis: L2 size (SMT), with points at 1.5M, 1.75M, 2M, 3M]
Performance: Turandot with SMT and CMP augmentations, validated against a Power4 pre-RTL model
Power: PowerTimer with SMT and CMP augmentations, validated against CPAM power data extracted from circuits
Temperature: HotSpot from UVA integrated with Turandot/PowerTimer, validated with test chips at UVA
Turandot/PowerTimer Simulation Framework
Supports SMT/CMP
Runs on AIX/PowerPC and Linux/Intel platforms
PowerTimer based on CPAM data, extracted from circuits
See Micro02 tutorial by Zhigang Hu and David Brooks for details
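To illustrate how a power model can feed a temperature model, here is a minimal sketch of a single-node thermal RC step. HotSpot itself solves a full RC network over the floorplan; the single node, the function names, and all numeric values below are assumptions made for illustration and are not taken from the slides.

```python
def step_temperature(temp_c, power_w, r_k_per_w, c_j_per_k, ambient_c, dt_s):
    """One explicit-Euler step of C * dT/dt = P - (T - T_amb) / R."""
    heat_out_w = (temp_c - ambient_c) / r_k_per_w  # heat leaving through R
    return temp_c + dt_s * (power_w - heat_out_w) / c_j_per_k


def simulate(power_trace_w, r_k_per_w, c_j_per_k, ambient_c=45.0, dt_s=1e-3):
    """Return the temperature after each power sample, starting at ambient."""
    temp = ambient_c
    temps = []
    for power in power_trace_w:
        temp = step_temperature(temp, power, r_k_per_w, c_j_per_k,
                                ambient_c, dt_s)
        temps.append(temp)
    return temps


# Hypothetical block: R = 2 K/W, C = 0.05 J/K, constant 20 W for 5 seconds.
temps = simulate([20.0] * 5000, r_k_per_w=2.0, c_j_per_k=0.05)
print(f"final temperature: {temps[-1]:.1f} C")
```

With a constant power input the node settles at ambient plus power times resistance, which is why power density, not just total power, drives hot spots.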
[Figure: temperature (Celsius, 50 to 95) for ST, ST (area enlarged), SMT, SMT (only activity factor), CMP, and CMP (one core rotated)]
[Figure: thermal package model showing the primary path, the secondary path, and the printed-circuit board]
[Figure: power density and temperature for SMT and CMP; units labeled: LSU_cache, FXU_reg]
SMT vs. CMP performance and power efficiency analysis (without DTM)
SMT is superior for memory-bound (high L2 cache miss rate) benchmarks, while CMP is superior for non-memory-bound benchmarks.
[Figure: relative change compared to ST baseline (-80% to 200%) for 2-way SMT and dual-core CMP; panels: compute-bound, memory-bound; metrics: IPC, POWER, ENERGY, ENERGY DELAY, ENERGY DELAY^2]
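The five metrics in this comparison are related, and a short sketch can make the relationship concrete. It assumes both configurations run the same instruction count at the same clock frequency, so that delay scales as 1/IPC (for SMT and CMP, IPC here means aggregate throughput). The function name and the sample numbers are hypothetical and are not taken from the slides.

```python
def relative_metrics(ipc_ratio, power_ratio):
    """Relative change of each metric versus a baseline run.

    ipc_ratio   = IPC(config) / IPC(baseline)
    power_ratio = power(config) / power(baseline)
    Assumes equal instruction count and clock frequency in both runs.
    """
    if ipc_ratio <= 0 or power_ratio <= 0:
        raise ValueError("ratios must be positive")
    delay = 1.0 / ipc_ratio          # execution time relative to baseline
    energy = power_ratio * delay     # energy = power * time
    ratios = {
        "IPC": ipc_ratio,
        "POWER": power_ratio,
        "ENERGY": energy,
        "ENERGY DELAY": energy * delay,
        "ENERGY DELAY^2": energy * delay ** 2,
    }
    # Report each metric as a change relative to the baseline (0.0 = no change).
    return {name: ratio - 1.0 for name, ratio in ratios.items()}


# Hypothetical configuration: 25% more throughput for 50% more power.
for name, change in relative_metrics(1.25, 1.5).items():
    print(f"{name}: {change:+.1%}")
```

In this made-up case energy rises while energy-delay and energy-delay^2 fall, which shows why the metrics can rank the same design differently.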
The impact of changing L2 size: Examples
[Figure: relative change compared with baseline ST with 2MB L2, for SMT with 2MB L2 and CMP with 1MB L2; metrics: IPC, POWER, ENERGY, ENERGY DELAY, ENERGY DELAY^2; off-scale bar labels: 2.54, 13.2]
MCF+MCF: stays memory bound when L2 size changes
MCF+VPR: changes from memory bound to non memory bound when L2 size changes
Global technique: fetch throttling
Local techniques: rename throttling, register file throttling (ideal)
Localized DTM methods favor SMT, while global DTM methods favor CMP.
[Figure: relative change compared to ST baseline without DTM (-20% to 100%) for ST, SMT, and CMP under no DTM, local renaming throttling, global fetch throttling, and register file throttling; panels: compute-bound, memory-bound]
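The difference between a global and a local DTM response can be sketched as follows. The slides do not define the mechanisms in detail, so everything here is an assumption for illustration: the trigger temperature, the duty-cycle values, and the function names are hypothetical, and only the unit names fx_reg and fp_reg come from the floorplan.

```python
from dataclasses import dataclass

TRIGGER_C = 85.0  # hypothetical trigger temperature, not from the slides


@dataclass
class ThrottleDecision:
    fetch_duty: float   # fraction of cycles the front end may fetch
    rename_duty: dict   # per-unit fraction of cycles renaming is allowed


def global_fetch_throttle(temps_c, duty=0.5):
    """Slow the whole pipeline if any unit is above the trigger."""
    hot = any(t > TRIGGER_C for t in temps_c.values())
    return ThrottleDecision(
        fetch_duty=duty if hot else 1.0,
        rename_duty={unit: 1.0 for unit in temps_c},
    )


def local_rename_throttle(temps_c, duty=0.5):
    """Slow only the units that are above the trigger."""
    return ThrottleDecision(
        fetch_duty=1.0,
        rename_duty={unit: (duty if t > TRIGGER_C else 1.0)
                     for unit, t in temps_c.items()},
    )


temps = {"fx_reg": 88.0, "fp_reg": 72.0}  # made-up sensor readings
print(global_fetch_throttle(temps))
print(local_rename_throttle(temps))
```

In this sketch the global response penalizes every instruction when one unit overheats, while the local response leaves work that avoids the hot unit untouched.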
[Figures (two slides): relative change (-80% to 100%) of POWER, ENERGY, ENERGY DELAY, and ENERGY DELAY^2 under no DTM vs. rename throttling; panels: compute-bound, memory-bound]
Conclusions
With the same chip area, SMT performs better than CMP for memory-bound benchmarks, while CMP performs better than SMT for non-memory-bound benchmarks, on an Apple G5-like architecture.
The thermal heating effects are quite different for CMP and SMT.
CMP machines are clearly hotter than SMT machines with leaky technology.
Different DTM techniques favor different architectures.
Future Work
Consider significantly larger amounts of thread-level parallelism and hybrids between CMP and SMT cores.
Study the impact of varying core complexity on the performance of SMT and CMP, and explore a wider range of design options, such as SMT fetch policies.
Explore server-oriented workloads.