IBM POWER Processor

IBM’s Next Generation POWER
Processor
Hot Chips
August 18-20, 2019
Jeff Stuecheli
Scott Willenborg
William Starke
Proposed POWER Processor Technology and I/O Roadmap
Focus of 2018 talk
POWER7 Architecture POWER8 Architecture POWER9 Architecture POWER10
2010 2012 2014 2016 2017 2018 2020 2021

POWER7 POWER7+ POWER8 POWER8 P9 SO P9 SU P9 AIO P10
8 cores 8 cores 12 cores w/ NVLink 12/24 cores 12/24 cores 12/24 cores TBA cores
45nm 32nm 22nm 12 cores 14nm 14nm 14nm
22nm
New Micro- New Micro- Enhanced Enhanced
New Micro- Enhanced Enhanced Micro- New Micro-
Architecture Micro- Architecture Micro- Architecture Micro- Architecture
Architecture Architecture
With NVLink Direct attach
memory Buffered New
New Process Memory Memory
New Process New Process Subsystem New Process
Technology Technology Technology New Process Technology
Technology
Up To Up To Up To Up To Up To Up To Up To Up To
Sustained Memory Bandwidth 65 GB/s 65 GB/s 210 GB/s 210 GB/s 150 GB/s 210 GB/s 650 GB/s 800 GB/s
Standard I/O Interconnect PCIe Gen2 PCIe Gen2 PCIe Gen3 PCIe Gen3 PCIe Gen4 x48 PCIe Gen4 x48 PCIe Gen4 x48 PCIe Gen5
20 GT/s 25 GT/s 25 GT/s 25 GT/s 32 & 50 GT/s

Advanced I/O Signaling N/A N/A N/A
160GB/s 300GB/s 300GB/s 300GB/s
CAPI 2.0, CAPI 2.0, CAPI 2.0,

CAPI 1.0 , OpenCAPI3.0, OpenCAPI3.0, OpenCAPI4.0,
Advanced I/O Architecture N/A N/A CAPI 1.0 NVLink TBA
NVLink NVLink NVLink
© 2019 IBM Corporation

Statement of Direction, Subject to Change 2
Focus of today’s talk
2010 2012 2014 2016 2017 2018 2020 2021

22nm
memory Buffered New
Technology

160GB/s 300GB/s 300GB/s 300GB/s


Looking forward
2010 2012 2014 2016 2017 2018 2020 2021

22nm
memory Buffered New
Technology

160GB/s 300GB/s 300GB/s 300GB/s


Future Evolution of System Architecture
Yesterday’s Plumbing
Tomorrow’s Differentiation
B B B B B B B B B B B B B B B B B B Emerging JEDEC Buffers
OMI Memory
Cores & Cores & Cores &

Caches System Bus Caches Caches
768 GB/s
PowerAXON & PCI
Yesterday’s Plumbing PCI Gen N+1 SMP / OpenCAPI / NVLink
Tomorrow’s Differentiation
© 2019 IBM Corporation 5

POWER9 – Premier Acceleration Platform
• Extreme Processor / Accelerator Bandwidth and Reduced Latency
• Coherent Memory and Virtual Addressing Capability for all Accelerators POWER9
• OpenPOWER Community Enablement – Robust Accelerated Compute Options PowerAccel
• State of the Art I/O and Acceleration Attachment Signaling PCIe

Devices PCIe G4
– PCIe Gen 4 x 48 lanes – 192 GB/s duplex bandwidth PCIe
I/O
–
ASIC / G4
– 25 G Common Link x 96 lanes – 600 GB/s duplex bandwidth FPGA CAPI 2.0 CAPI
Devices
NVLink
• Robust Accelerated Compute Options with OPEN standards Nvidia
GPUs
NVLink
25G OpenCAPI
– On-Chip Acceleration – Gzip x1, 842 Compression x2, AES/SHA x2
–
ASIC / OpenCAPI Memory
FPGA
– CAPI 2.0 – 4x bandwidth of POWER8 using PCIe Gen 4 Devices
–
DRAM/
– NVLink – Next generation of GPU/CPU bandwidth SCM/
OMI
etc
– OpenCAPI – High bandwidth, low latency and open interface On Chip

Accel
– OMI – High bandwidth and/or differentiated for acceleration

THE WORLD’S TWO
MOST POWERFUL
SUPERCOMPUTERS
BUILT FOR THE AI ERA
WITH OPEN COLLABORATION

OpenCAPI Design Goals
• Designed to support range of devices
– Coherent Caching Accelerators
– Network Controllers
– Differentiated Memory
o High Bandwidth
o Low Latency
o Storage Class Memory
– Storage Controllers
• Asymmetric design, endpoint optimized for host and device attach

– ISA of Host Architecture: Need to hide difference in Coherence, Memory Model, Address Translation, etc.
– Design schedule: The design schedule of a high performance CPU host is typically on the order of multiple years,
conversely, accelerator devices have much shorter development cycles, typically less than a year.
– Timing Corner: ASIC and FPGA technologies run at lower frequencies and timing optimization as CPUs.
– Plurality of devices: Effort in the host, both IP and circuit resource, have a multiplicative effect.
– Trust: Attached devices are susceptible to both intentional and unintentional trust violations
– Cache coherence: Hosts have high variability in protocol. Host cannot trust attached device to obey rules.

OpenCAPI Design Goals
• Low Latency and High Bandwidth

– Fixed width DL CRC
– Aligned TL
– Aligned Data
– Separately pipelined control/tag vs data
o Compromise in switching capability

POWER9 Family Memory Architecture
Scale Up Superior RAS, High bandwidth, High Capacity
Buffered Memory Agnostic interface for alternate memory innovations
Scale Out Low latency access

Direct Attach Memory Commodity packaging form factor
OpenCAPI Agnostic Buffered Memory (OMI)

Near Tier Commodity Enterprise Storage Class
Extreme Bandwidth Low Latency RAS Extreme Capacity
Low Capacity Low Cost Capacity Persistence
Bandwidth
IBM Confidential
© 2019 IBM Corporation Same Open Memory Interface used for all Systems and Memory Technologies 10
Primary Tier Memory Options
72b ~2666 MHz bidi

8b
DDR4 RDIMM
Capacity ~256 GB
BW ~150 GB/sec
Same System
72b ~2666 MHz bidi DDR4 LRDIMM Only 5-10ns
8b
8b Capacity ~2 TB higher load-to-use
BW ~150 GB/sec than RDIMM
(< 5ns for LRDIMM)
8b ~25G diff
8b
DDR4 OMI DIMM
BUF
Capacity ~256GB→4 TB
OMI Strategy
8b
BW ~320 GB/sec
Same System
8b ~25G diff
BW Opt OMI DIMM
16b
Capacity ~128→512 GB
BUF
BW ~650 GB/sec
1024b
On module
On Module HBM
Si interposer Capacity ~16→32 GB Unique System
BW ~1 TB/sec
Final Addition to the POWER9 Processor Family
Processor Chip Details
• 728 mm2 ( 25.3 x 28.8 mm)
The Bandwidth Beast Open Memory Interface (OMI)
Advanced I/O (AIO) • 16 channels x8 at 25 GT/s
• 8 Billion Transistors
• 650 GB/s peak 1:1 r/w bandwidth
• Up to 24 SMT4 Cores PowerAXON (x48) Memory Signaling (8x8 OMI)
• Technology Agnostic
• Up to 120 MB eDRAM L3 cache
Core Core Core Core Core Core Core Core
• Offered w/ Microchip DDR4 buffer
SMP and Accelerator Interconnect

(410 GB/s peak bandwidth)
L2 L2 L2 L2
Semiconductor Technology
Local SMP Signaling (3x30)

PCIe Gen4 Signaling (x48)
10 MB
• 14nm finFET
L3 Region
L3 L3
PowerAXON 25 GT/s Attach
• Improved device performance L2 L2 L2 L2 • Up to 16 socket glue-less SMP
• Reduced energy (4x24 SMP added to 3x30 local)
• eDRAM • Up to x48 NVIDIA NVLINK GPU
• 17 layer metal stack attach
L3 L3 • Up to x48 OpenCAPI 4.0 coherent
L2 L2 L2 L2 accelerator / memory attach
High Bandwidth Signaling
• 25 GT/s low energy differential Core Core Core Core Core Core Core Core
Industry Standard I/O Attach
• PowerAXON, OMI memory
PowerAXON (x48) Memory Signaling (8x8 OMI) • x48 PCIe Gen 4 at 16 GT/s
• 16 GT/s low energy differential
• Up to x16 CAPI 2.0 coherent
• Local SMP accelerator / storage attach
2 TB/s Raw Signaling Bandwidth
• 16 GT/s PCIe Gen4
Shared by 6 Attach Protocols
Unleashing the Bandwidth: POWER9 AIO + NVIDIA GPU
Potential System
Offering NVLink
GPU GPU
GPU GPU
OMI Memory OMI Memory

NVLink
OpenCAPI OpenCAPI
IBM IBM
POWER9 AIO POWER9 AIO
PCIe Gen4 SMP PCIe Gen4
OMI Memory OMI Memory

OpenCAPI 4.0: Asymmetric Open Accelerator Attach
Roadmap of Capabilities and Host Silicon Delivery
Accelerator Protocol CAPI 1.0 CAPI 2.0 OpenCAPI 3.0 OpenCAPI 4.0 OpenCAPI 5.0
First Host Silicon POWER8 POWER9 SO POWER9 SO POWER9 AIO POWER10
(GA 2014) (GA 2017) (GA 2017) (GA 2020) (GA 2021)
Functional Partitioning Asymmetric Asymmetric Asymmetric Asymmetric Asymmetric
Host Architecture POWER POWER Any Any Any
Cache Line Size Supported 128B 128B 64/128/256B 64/128/256B 64/128/256B
Attach Vehicle PCIe Gen 3 PCIe Gen 4 25 G (open) 25 G (open) 32/50 G (open)
Tunneled Tunneled Native DL/TL Native DL/TL Native DL/TL
Address Translation On Accelerator Host Host (secure) Host (secure) Host (secure)
Native DMA to Host Mem No Yes Yes Yes Yes
Atomics to Host Mem No Yes Yes Yes Yes
Host Thread Wake-up No Yes Yes Yes Yes
Host Memory Attach Agent No No Yes Yes Yes
Low Latency Short Msg 4B/8B MMIO 4B/8B MMIO 4B/8B MMIO 128B push 128B push
Posted Writes to Host Mem No No No Yes Yes
Caching of Host Mem RA Cache RA Cache No VA Cache VA Cache
POWER9 Connectivity Variants
POWER9 Scale Out POWER9 Scale Up POWER9 Advanced I/O
Direct Attach Memory DMI Buffered (Centaur) Memory OMI Buffered Memory
2 Socket SMP 16 Socket SMP 16 Socket SMP
4 x DDR4 memory 4 x DMI buffered memory 8 x OMI buffered memory

x24 Accelerator x48 PowerAXON x48 PowerAXON
Interconnect Interconnect Advanced Interconnect
- OMI Bandwidth
- OpenCAPI function
- NVLINK function
3x bandwidth 6x bandwidth
per mm2 as per mm2 as
DDR signaling DDR signaling
x24 Accelerator 2 x Local SMP x48 PowerAXON 3 x Local SMP x48 PowerAXON 3 x Local SMP
4 x DDR4 memory 4 x DMI buffered memory 8 x OMI buffered memory
DRAM DIMM Comparison
IBM Centaur DIMM

OMI DDIMM
• Technology agnostic
• Low cost
JEDEC DDR • Ultra-scale system density
DIMM • Enterprise reliability
• Low-latency
• High bandwidth
© 2019 IBM Corporation Approximate Scale 16

Open Memory Interface (OMI)
• Signaling: 25.6GHz vs DDR4 @ 3200 MHz
– 4x raw bandwidth per I/O signal
– 1.3x mixed traffic utilization
• Idle load-to-use latency over traditional DDR:
– POWER8/9 Centaur design ~10 ns
– OMI target of ~5-10 ns (RDIMM)
– OMI target of < 5ns (LRDIMM)
• IBM Centaur: One proprietary DMI design

• Microchip SMC 1000:
– Open (OMI) design
– Emerging JEDEC Standard

OMI Memory Latency
SMC 1000
HSS HSS DDR4 Cmd/Addr

Lane Deskew Fast Activate Driver
Transmit Receiver Transmit PHY
DRAM
DRAM
DRAM
DRAM
HSS HSS Frame Read DDR4 Data
Receiver
Receive Transmitter Formatting Dataflow Receive PHY
IBM
POWER9
AIO
Microchip’s SMC 1000 8x25G features an innovative low latency design that delivers
less than four ns incremental latency over a traditional integrated DDR controller
with LRDIMM. This results in OMI-based DDIMM products having virtually identical
bandwidth and latency performance to comparable LRDIMM products.
6X bandwidth / PHY area advantage gives POWER9 AIO bandwidth of 16 DDR ports
18
Final Addition to POWER9: Stay Tuned for POWER10
PowerAXON The Bandwidth Beast OMI Memory

POWER9 with Advanced I/O (AIO)
PowerAXON (x48) Memory Signaling (8x8 OMI)
SMP and Accelerator Interconnect

GPU
L2 L2 L2 L2
Local SMP Signaling (3x30)

PCIe Gen4 Signaling (x48)
10 MB
NVLink L3
L3 Region
L3
GPU L2 L2 L2 L2
GPU
L3 L3
L2 L2 L2 L2
PowerAXON (x48) Memory Signaling (8x8 OMI)

Thank You!
19

IBM POWER Processor

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

IBM POWER Processor

Caricato da

Copyright:

Formati disponibili

IBM’s Next Generation POWER

2010 2012 2014 2016 2017 2018 2020 2021

20 GT/s 25 GT/s 25 GT/s 25 GT/s 32 & 50 GT/s

CAPI 2.0, CAPI 2.0, CAPI 2.0,

© 2019 IBM Corporation

2010 2012 2014 2016 2017 2018 2020 2021

20 GT/s 25 GT/s 25 GT/s 25 GT/s 32 & 50 GT/s

CAPI 2.0, CAPI 2.0, CAPI 2.0,

© 2019 IBM Corporation

2010 2012 2014 2016 2017 2018 2020 2021

20 GT/s 25 GT/s 25 GT/s 25 GT/s 32 & 50 GT/s

CAPI 2.0, CAPI 2.0, CAPI 2.0,

© 2019 IBM Corporation

Cores & Cores & Cores &

© 2019 IBM Corporation 5

• State of the Art I/O and Acceleration Attachment Signaling PCIe

– OpenCAPI – High bandwidth, low latency and open interface On Chip

© 2019 IBM Corporation 6

© 2019 IBM Corporation

• Asymmetric design, endpoint optimized for host and device attach

© 2019 IBM Corporation

• Low Latency and High Bandwidth

© 2019 IBM Corporation

Scale Out Low latency access

OpenCAPI Agnostic Buffered Memory (OMI)

72b ~2666 MHz bidi

SMP and Accelerator Interconnect

Local SMP Signaling (3x30)

OMI Memory OMI Memory

OMI Memory OMI Memory

4 x DDR4 memory 4 x DMI buffered memory 8 x OMI buffered memory

IBM Centaur DIMM

© 2019 IBM Corporation Approximate Scale 16

• IBM Centaur: One proprietary DMI design

© 2019 IBM Corporation

HSS HSS DDR4 Cmd/Addr

PowerAXON The Bandwidth Beast OMI Memory

Core Core Core Core Core Core Core Core

SMP and Accelerator Interconnect

Local SMP Signaling (3x30)

Core Core Core Core Core Core Core Core

Core Core Core Core Core Core Core Core

PowerAXON (x48) Memory Signaling (8x8 OMI)

© 2019 IBM Corporation

Potrebbero piacerti anche