[Figure: processor performance relative to the VAX-11/780, by year, for machines from the VAX, Sun, MIPS, IBM RS6000, HP PA-RISC, PowerPC, and Alpha through the Intel Pentium III, AMD Athlon, Intel Pentium 4, AMD Opteron, and Intel Xeon; growth rates are marked per year.]
Tuesday, April 24, 12
History of Processor Performance
[Figure: processor performance relative to the VAX-11/780, by year, with the overlay label "CSEE 3827".]
History of Processor Performance
[Figure: processor performance relative to the VAX-11/780, by year, with the overlay label "COMS 4824".]
Abstract Stages of Execution
Instruction Fetch
(Instructions fetched from memory into CPU)
Multiple Instruction Issue Processors
Multiple instructions fetched, executed, and committed in each cycle
In superscalar processors, instructions are scheduled by the HW
[Figure: Fetch (F), Execute (E), and Commit (C) stages]
Single Instruction, Single Data (SISD) | Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Single Data (MISD) | Multiple Instruction, Multiple Data (MIMD)
Exploits instr-level parallelism (ILP)
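As a rough illustration of how much instruction-level parallelism a superscalar machine could exploit, the sketch below measures a program's dependency chain. It assumes every instruction takes one step and that issue width is unlimited; the six-instruction program and the function names are hypothetical, not taken from the slides.

```python
def critical_path_length(deps):
    """Length of the longest dependency chain.

    deps[i] lists the earlier instructions (by index) whose results
    instruction i reads. Assumes unit latency for every instruction.
    """
    depth = []
    for producers in deps:
        # An instruction sits one step after its deepest producer.
        depth.append(1 + max((depth[p] for p in producers), default=0))
    return max(depth, default=0)


def available_ilp(deps):
    """Average instructions per step if only true dependencies limit issue."""
    n = len(deps)
    return n / critical_path_length(deps) if n else 0.0


# Hypothetical program, in compiler order.
program = [
    [],      # i0: a = load x
    [],      # i1: b = load y
    [0, 1],  # i2: c = a + b
    [],      # i3: d = load z
    [3],     # i4: e = d * 2
    [2, 4],  # i5: f = c + e
]

print(critical_path_length(program))  # 3
print(available_ilp(program))         # 2.0
```

Under these assumptions, a machine that issues two instructions per cycle could finish this program in three steps instead of six.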
Out-of-order execution
In in-order execution, instructions are fetched, executed, and committed in compiler order.
    One stalls, they all stall
    Relatively simple HW
In out-of-order execution (OOO), instructions are fetched and committed in compiler order; they may be executed in some other order.
    One stalls, independent instrs may proceed
    Additional hardware required for reordering
Mis-speculation executes excess instructions, costing time and power.
[Figure: Fetch (F), Execute (E), and Commit (C) timelines for each case]
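The stall behavior contrasted above can be sketched with a toy timing model. This is a simplification under stated assumptions, not a model of any real core: one instruction is fetched per cycle, the out-of-order machine has unlimited execution resources, and speculation is not modeled. `Instr`, `run`, and the five-instruction program are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    latency: int                               # cycles spent executing
    deps: list = field(default_factory=list)   # indices of producer instructions


def run(program, out_of_order):
    """Return the cycle at which the last instruction commits."""
    finish = []
    prev_start = -1
    last_commit = 0
    for i, ins in enumerate(program):
        operands_ready = max((finish[d] for d in ins.deps), default=0)
        if out_of_order:
            earliest = i               # limited only by fetch order
        else:
            earliest = prev_start + 1  # may not start before its predecessor has
        start = max(earliest, operands_ready)
        prev_start = start
        finish.append(start + ins.latency)
        # Commit happens in compiler order in both machines.
        last_commit = max(last_commit, finish[i])
    return last_commit


program = [
    Instr(latency=10),             # i0: a load that misses in the cache
    Instr(latency=1, deps=[0]),    # i1: needs the loaded value
    Instr(latency=1),              # i2: independent
    Instr(latency=1),              # i3: independent
    Instr(latency=1, deps=[2, 3])  # i4: needs i2 and i3
]

print(run(program, out_of_order=False))  # 14
print(run(program, out_of_order=True))   # 11
```

In the in-order run, i2 through i4 wait behind the stalled i1; in the out-of-order run they execute during the miss and only wait to commit.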
The Memory Wall
[Figure: Performance vs. Time]
CPU speeds increase approx 25-20% annually
DRAM speeds increase approx 2-11% annually
A result of this gap is that cache design has increased in
importance over the years. This has resulted in
innovations such as victim caches and trace caches.
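To get a feel for how quickly the gap compounds, the sketch below applies the annual growth rates quoted on the slide. The choice of time spans and the pairing of rates (narrowest and widest case) are assumptions made for illustration.

```python
def speed_after(annual_growth, years):
    """Relative speed after compounding annual_growth (0.20 means 20%) for years."""
    return (1.0 + annual_growth) ** years


def cpu_dram_gap(cpu_growth, dram_growth, years):
    """How many times faster the CPU has become relative to DRAM."""
    return speed_after(cpu_growth, years) / speed_after(dram_growth, years)


for years in (5, 10, 20):
    narrow = cpu_dram_gap(0.20, 0.11, years)  # slowest CPU growth, fastest DRAM growth
    wide = cpu_dram_gap(0.25, 0.02, years)    # fastest CPU growth, slowest DRAM growth
    print(f"after {years:2d} years the gap is between {narrow:.1f}x and {wide:.1f}x")
```

Even in the narrowest case the ratio grows every year, which is why a cache miss costs more and more relative to CPU cycle time.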
Modern Processor Performance
While single-threaded performance has leveled off, multithreaded performance potential is still scaling.
[Figure: processor performance relative to the VAX-11/780, by year, with the overlay label "COMS 6824 + COMS 4130 (Parallel Programming)".]
[Figure: processor performance relative to the VAX-11/780, by year.]
[Figure: "Increase" (axis from -1 to 3) for Pipelined, Superscalar, OOO-Speculation, and Deep Pipelined designs.]
[Figure: Clock Rate (MHz) and Power (Watts) for Intel processors, including the Pentium, Pentium Pro, Pentium 4 Willamette, Pentium 4 Prescott, and Core 2 Kentsfield.]
The Pentium 4 made a dramatic jump in clock rate and power but less so in performance. The Prescott thermal problems led to the abandonment of the Pentium 4 line. The Core 2 line reverts to a simpler pipeline with lower clock rates and multiple processors per chip. Copyright Elsevier Inc. All rights reserved.
Much of it goes back to the transistor
[Figure: transistor, with features labeled "individual atoms!"]
individual atoms! = leakage current + defects
Source: Intel press foils
A model of power
$P = P_{switch} + P_{leakage}$
$P_{switch} = E_{switch} \times F = (C \times V_{dd}^{2}) \times F$
$P_{leakage} = V_{dd} \times I$
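The power model above translates directly into a few lines of code. The function names and the operating point below (1 nF of switched capacitance, 1.0 V, 2 GHz, 0.5 A of leakage current) are hypothetical values chosen only to exercise the formulas.

```python
def switching_power(c_farads, vdd_volts, freq_hz):
    """P_switch = (C x Vdd^2) x F."""
    return c_farads * vdd_volts ** 2 * freq_hz


def leakage_power(vdd_volts, i_amps):
    """P_leakage = Vdd x I."""
    return vdd_volts * i_amps


def total_power(c_farads, vdd_volts, freq_hz, i_amps):
    """P = P_switch + P_leakage."""
    return (switching_power(c_farads, vdd_volts, freq_hz)
            + leakage_power(vdd_volts, i_amps))


# Hypothetical operating point, for illustration only.
C, VDD, F, I = 1e-9, 1.0, 2e9, 0.5

p_sw = switching_power(C, VDD, F)
p_lk = leakage_power(VDD, I)
print(f"switching: {p_sw:.2f} W, leakage: {p_lk:.2f} W, total: {p_sw + p_lk:.2f} W")
print(f"leakage share of total: {p_lk / (p_sw + p_lk):.0%}")
```

Because Vdd is squared in the switching term and appears again in the leakage term, supply voltage is the most effective knob in this model.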
Voltage Scaling: DVFS + Near-Threshold Computing
[Source: Dreslinski et al.: Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits]
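A minimal sketch of why voltage scaling is attractive, using only the switching term (C x Vdd^2) x F. It assumes, purely for illustration, that clock frequency is scaled by the same factor as the supply voltage unless stated otherwise; near the threshold voltage frequency falls off more steeply than that, and leakage is ignored here. The function name is hypothetical.

```python
def dvfs_ratios(voltage_scale, freq_scale=None):
    """Return (energy-per-switch ratio, switching-power ratio) after scaling.

    voltage_scale and freq_scale are multipliers relative to the nominal
    operating point. If freq_scale is omitted, frequency is assumed to scale
    by the same factor as voltage (an illustrative assumption).
    """
    if freq_scale is None:
        freq_scale = voltage_scale
    energy_ratio = voltage_scale ** 2        # E_switch = C x Vdd^2
    power_ratio = energy_ratio * freq_scale  # P_switch = E_switch x F
    return energy_ratio, power_ratio


for scale in (1.0, 0.8, 0.5):
    energy, power = dvfs_ratios(scale)
    print(f"Vdd x{scale}: energy per switch x{energy:.3f}, switching power x{power:.3f}")
```

Under this assumption, halving the voltage quarters the energy per operation and cuts switching power to one eighth, at the cost of running at half the frequency.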
Chip Area and Power Consumption
[Figure: Power Density (Watts/cm2), 0 to 1500, by process node (90nm, 65nm, 45nm, 32nm, 22nm, 16nm), split into Active Power and Leakage Power]
With leakage power dominating, power consumption roughly proportional to transistor count
Power envelope to remain constant
[Figure: Integer Performance (1 to 1000) vs. Processor Area (1 to 100000), log scale]
Pollack's Law: Processor performance grows with sqrt of area
Source: Shekhar Borkar (Intel)
The Resulting Shift to Multicore
Perf = 1, Power = 1
Perf = 1, Power = 1
Perf = 2, Power = 4
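The Perf and Power figures above follow from Pollack's Law if power is taken to be proportional to core area. The sketch below makes that arithmetic explicit; it assumes a perfectly parallel workload when summing the performance of several cores, and the function names are hypothetical.

```python
import math


def core_performance(area):
    """Pollack's Law: single-core performance grows with the sqrt of area."""
    return math.sqrt(area)


def core_power(area):
    """Assumption: power is roughly proportional to area (transistor count)."""
    return area


def chip(num_cores, area_per_core):
    """Return (performance, power) for a chip of identical cores.

    Performance is summed across cores, which assumes the workload
    parallelizes perfectly.
    """
    perf = num_cores * core_performance(area_per_core)
    power = num_cores * core_power(area_per_core)
    return perf, power


print(chip(1, 1))  # baseline core
print(chip(1, 4))  # one core with 4x the area
print(chip(4, 1))  # four baseline cores in the same area
```

With the same area and power budget, the single large core reaches Perf = 2 while the four small cores can reach Perf = 4, provided the software exposes enough parallelism.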
[Figure: multicore die floorplan showing the cores (FPU, load/store, execution, fetch/decode/branch, L1 data and instruction caches, L2 cache and control), a shared L3 cache, the Northbridge, the DDR PHY, and HyperTransport (HT) PHY links.]
x86 64-bit Architecture Evolution
Mfg. Process | 90nm SOI | 90nm SOI | 65nm SOI | 45nm SOI | 45nm SOI | 45nm SOI
CPU Core | K8 | K8 | Greyhound | Greyhound+ | Greyhound+ | Greyhound+
HyperTransport Technology | 3x 1.6GT/s | 3x 1.6GT/s | 3x 2GT/s | 3x 4.0GT/s | 3x 4.8GT/s | 4x 6.4GT/s
Memory | 2x DDR1 300 | 2x DDR1 400 | 2x DDR2 667 | 2x DDR2 800 | 2x DDR2 1066 | 4x DDR3 1333
Merom, Penryn, Nehalem, Westmere, Sandy Bridge
Forecast
Nehalem-EX Architecture
All products, dates, and figures are preliminary and are subject to change without notice.
Hot Chips 2009
[Figure: Nehalem block diagram with the cores (Core) and, in the Uncore, the L3 Cache, IMC to DRAM, QPI links, and Power & Clock]
Common core for client and server CPUs
http://www.intel.com/technology/architecture-silicon/next-gen/whitepaper.pdf
Some unique features only on NHM-EX
Uncore differentiates different segment specific CPUs
Scalable Core/Uncore gasket interface
Decouples core and uncore operation
Monolithic single die CPU
8 Nehalem cores, 16 threads
24MB shared L3 cache
2 integrated memory controllers
Scalable Memory Interconnect (SMI) with support for up to 8 DDR channels
4 Quick Path Interconnect (QPI) links with up to 6.4GT/s
Supports 2, 4 and 8 socket in glueless configs and larger systems using Node Controller (NC)
Intel 45nm process technology
2.3 Billion transistors
[Figure: die layout with eight 3MB LLC slices]
[Figure: IBM POWER processor roadmap, from POWER1 (AMERICAs), RSC, POWER2 (P2SC), and POWER3 (630) to POWER4 (Dual Core), POWER5 (SMT), POWER6 (Ultra High Frequency), and POWER7 (Multi-core), alongside the 601, 603, and 604e and the RS64 line (Muskie A35, Cobra A10, RS64I Apache, RS64II North Star, RS64III Pulsar, RS64IV Sstar), with process nodes from 1.0um down to 130nm.]
Major POWER Innovation
-1990 RISC Architecture
-1994 SMP
-1995 Out of Order Execution
-1996 64 Bit Enterprise Architecture
-1997 Hardware Multi-Threading
-2001 Dual Core Processors
-2001 Large System Scaling
-2001 Shared Caches
-2003 On Chip Memory Control
-2003 SMT
-2006 Ultra High Frequency
-2006 Dual Scope Coherence Mgmt
-2006 Decimal Float/VSX
-2006 Processor Recovery/Sparing
-2009 Balanced Multi-core Processor
-2009 On Chip EDRAM
Computer architecture is at the intersection
Technology:
    Logic gates
    SRAM
    DRAM
    Circuit technologies
    Packaging
    Magnetic storage
    Flash memory
    Biochips
    3D stacking
Application domains:
    PCs
    Servers
    PDAs
    Mobile phones
    Supercomputers
    Game consoles
    Embedded
Goals:
    Functional
    Performance
    Reliability
    Cost
    Energy efficiency
    Time to market
[Credit: Milo Martin, UPenn]