
Niagara2: A Highly Threaded Server-on-a-Chip

Robert Golla
Principal Architect
Sun Microsystems
October 10, 2006
Contributors
Jama Barreh
Mark Luttrell
Jeff Brooks
Mark McPherson
William Bryg
Shimon Muller
Bruce Chang
Chris Olson
Robert Golla
Bikram Saha
Greg Grohoski
Manish Shah
Rick Hetherington
Michael Wong
Paul Jordan

Agenda
Chip Overview
Throughput Computing
Sparc core
Crossbar
L2 cache
Networking
PCI-Express
Power
Status
Summary
Niagara2 Chip Overview
[Die floorplan: 8 SPARC cores, 8 L2 data banks with their tag arrays (L2B0-L2B7, TAG0-TAG7), cache crossbar (CCX), memory controllers MCU0-MCU3, NCU, SII/SIO, CCU, EFU, network unit (MAC, RTX, TDS, RDP), PCI-Express unit (PEU, DMU), and FSR/PSR/ESR serdes blocks]
8 Sparc cores, 8 threads each
Shared 4MB L2, 8 banks, 16-way associative
Four dual-channel FBDIMM memory controllers
Two 10/1 Gb Ethernet ports
One PCI-Express x8 1.0A port
342 mm^2 die size in 65 nm
711 signal I/O, 1831 total
Niagara2 Chip Overview
[Block diagram: 8 Sparc cores and the NIU/System Interface Unit connect through the 8x9 cache crossbar to 8 L2 banks; the L2 banks feed 4 FBDIMM memory controllers; the NIU (2x10/1 GE Ethernet) and PCI-EX x8 @ 2.5 Gb/s attach through the System Interface Unit]
Full 8x9 crossbar switch
> Connects every core to every L2 bank and vice-versa
> Supports 8 byte writes from a core to a bank
> Supports 16 byte reads from a bank to a core
> One port for core to read/write I/O
System Interface Unit connects networking and I/O to memory
Throughput Computing
[Timeline: a single thread alternates compute (C) and memory latency (M) phases; with many threads, the compute time of one thread overlaps the memory latency of the others]

For a single thread
> Memory is THE bottleneck to improving performance
> Commercial server workloads exhibit poor memory locality
> Only a modest throughput speedup is possible by reducing compute time
> Conventional single-thread processors optimized for ILP have low utilization
With many threads
> It's possible to find something to execute every cycle
> Significant throughput speedups are possible
> Processor utilization is much higher
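A rough way to quantify the utilization argument above is sketched below. The compute and memory-stall cycle counts are made-up illustrative values, not Niagara2 measurements, and the saturating model is a simplification.

```python
# Illustrative model (not from the slides): estimate core utilization when
# N threads share one pipeline, each alternating `compute` busy cycles with
# `mem_stall` stall cycles. The cycle counts below are assumed examples.
def core_utilization(n_threads, compute=20, mem_stall=180):
    # Each thread can keep the pipeline busy for at most compute/(compute+mem_stall)
    # of the time; with n threads, busy fractions add until the pipeline saturates.
    per_thread = compute / (compute + mem_stall)
    return min(1.0, n_threads * per_thread)

for n in (1, 4, 8):
    print(f"{n} thread(s): utilization ~ {core_utilization(n):.0%}")
# 1 thread(s): utilization ~ 10%
# 4 thread(s): utilization ~ 40%
# 8 thread(s): utilization ~ 80%
```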
Engineering Solutions
Design Problem
> Double UltraSparc T1's throughput and throughput/watt
> Improve UltraSparc T1's FP single-thread and
throughput performance
> Minimize required area for these improvements
Considered doubling the number of UltraSparc T1 cores
> 16 cores of 4 threads each
> Takes too much die area
> No area left for improving FP performance
Engineering Solutions
Probabilistic Modelling
[Chart: multithreaded performance vs. threads (12/2002); relative throughput performance (total IPC) with UltraSparc T1 at 1.0 and Niagara2 at roughly 2.0]
> Generate synthetic traces for each thread with an instruction/miss profile that matches TPC-C
> Schedule ready threads to run on some number of execution units
> End simulation once simulated distributions are close to actual distributions
Works very well for simple scalar cores running lots of threads on transactional workloads
> Within 10 percent of a detailed cycle-accurate simulator
> Detailed cycle-accurate simulator not available at the beginning of the project
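A minimal sketch of this style of probabilistic model is shown below. The miss rate, miss latency, and pick-lowest-index scheduling policy are placeholder assumptions for illustration; the actual model used an instruction/miss profile derived from TPC-C.

```python
import random

# Minimal sketch of a probabilistic throughput model: threads issue
# synthetic instructions and occasionally stall on a modeled miss.
random.seed(0)

N_THREADS, N_EXUS, CYCLES = 8, 2, 100_000
MISS_RATE, MISS_LATENCY = 0.05, 120   # assumed per-instruction miss probability / penalty

ready_at = [0] * N_THREADS            # cycle at which each thread can issue again
issued = 0

for cycle in range(CYCLES):
    ready = [t for t in range(N_THREADS) if ready_at[t] <= cycle]
    for t in ready[:N_EXUS]:          # schedule up to N_EXUS ready threads per cycle
        issued += 1
        if random.random() < MISS_RATE:
            ready_at[t] = cycle + MISS_LATENCY   # thread stalls on a synthetic miss
        else:
            ready_at[t] = cycle + 1

print(f"simulated IPC per core: {issued / CYCLES:.2f}")
```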
Engineering Solutions
Decided to increase the number of threads per core and increase execution bandwidth
> 8 threads per core x 8 cores = 64 threads total
> 2 EXUs per core
> More than doubles UltraSparc T1's throughput
> Doubling threads is more area efficient than doubling cores
> Integrate FGU into core pipeline
  6 cycle FP latency
  Threads running FP are non-blocking
> Enhance Niagara2's cryptography
  Added more ciphers
  Enhanced existing public key support
Throughput Changes
Niagara2 throughput changes vs. UltraSparc T1
> Add instruction buffers after the L1 instruction cache for each thread
> Add a new pipe stage: pick
> Choose 2 threads out of 8 to execute each cycle
> Increase execution units from 1 to 2
> Increase set associativity of L1 instruction cache to 8
> Increase size of fully associative DTLB from 64 to 128 entries
> Increase L2 banks from 4 to 8
> 15 percent performance loss with only 4 banks and 64 threads
> Increase threads from 4 to 8
Sparc Core Block Diagram
[Block diagram: IFU feeding EXU0/EXU1, with shared SPU, FGU, and LSU, plus TLU and MMU/HWTW; a gasket connects the core to the crossbar/L2]
IFU Instruction Fetch Unit
> 16 KB I$, 32B lines, 8-way SA
> 64-entry fully-associative ITLB
EXU0/1 Integer Execution Units
> 4 threads share each unit
> 8 register windows/thread
> 160 IRF entries/thread
LSU Load/Store Unit
> 8 threads share LSU
> 8KB D$, 16B lines, 4-way SA
> 128-entry fully-associative DTLB
FGU Floating-Point/Graphics Unit
> 8 threads share FGU
> 32 FRF entries/thread
SPU Stream Processing Unit
> Cryptographic coprocessor
TLU Trap Logic Unit
> Updates machine state, handles exceptions and interrupts
MMU Memory Management Unit
> Hardware tablewalk (HWTW)
> 8KB, 64KB, 4MB, 256MB pages
Core Pipeline
8 stage integer pipeline
  Fetch Cache Pick Decode Execute Mem Bypass W
> 3-cycle load-use penalty
> Memory (data translation, access tag/data array)
> Bypass (late way select, data formatting, data forwarding)
12 stage floating-point pipeline
  Fetch Cache Pick Decode Execute Fx1 Fx2 Fx3 Fx4 Fx5 FB FW
> 6-cycle latency for dependent FP ops
> Longer pipeline for divide/sqrt
Integer/LSU Pipeline
[Pipeline diagram: shared fetch (F) and cache (C) stages feed per-thread instruction buffers IB0-3 and IB4-7; thread groups 0 and 1 each have their own pick (P), decode (D), and execute (E) stages, while the memory (M), bypass (B), and writeback (W) stages of the shared LSU serve both groups]
Instruction cache is shared by all 8 threads
> Least-recently-fetched algorithm used to select the next thread to fetch
Each thread is written into a thread-specific instruction buffer
> Decouples fetch from pick
Each thread is statically assigned to one of 2 thread groups
Pick chooses 1 ready thread each cycle within each thread group
> Picking within each thread group is independent of the other
> Least-recently-picked algorithm used to select the next thread to execute
Decode resolves resource hazards not handled during pick
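A minimal software sketch (not RTL) of the least-recently-picked policy within one thread group follows: among the threads that are ready, select the one picked longest ago. The recency-list representation and class/method names are assumptions for illustration; the slide does not describe the hardware implementation.

```python
# Sketch of least-recently-picked thread selection within one thread group.
class ThreadGroupPicker:
    def __init__(self, thread_ids):
        # Order from least recently picked (front) to most recently picked (back).
        self.lrp_order = list(thread_ids)

    def pick(self, ready):
        for tid in self.lrp_order:
            if tid in ready:
                # Move the picked thread to the back so it has lowest priority next cycle.
                self.lrp_order.remove(tid)
                self.lrp_order.append(tid)
                return tid
        return None   # no ready thread in this group this cycle

group0 = ThreadGroupPicker([0, 1, 2, 3])
print(group0.pick({1, 3}))   # 1
print(group0.pick({1, 3}))   # 3 (thread 1 is now most recently picked)
```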
Integer/LSU Pipeline
[Pipeline snapshot: a different thread can occupy each stage, e.g. thread 2 at fetch, thread 6 at cache, threads 0 and 5 at the two pick stages, and different threads at each downstream stage of the two thread groups and the shared LSU]
Threads are interleaved between pipeline stages with very few restrictions
> Any thread can be at fetch or cache stage
> Threads are split into 2 thread groups before pick stage
Load/store and floating-point units are shared between all 8 threads
> Up to 1 thread from either thread group can be scheduled on a shared unit
Stream Processing Unit
[Block diagram: MA scratchpad (160x64b, 2R/1W) and MA execution unit sourcing and returning multiplies to/from the FGU; hash engine and cipher engines; a DMA engine moving store data and addresses to/from the L2]
Cryptographic coprocessor
> One per core
> Runs in parallel with the core at the same frequency
Two independent sub-units
> Modular Arithmetic Unit
  RSA, binary and integer polynomial elliptic curve (ECC)
  Shares FGU multiplier
> Cipher/Hash Unit
  RC4, DES/3DES, AES-128/192/256
  MD5, SHA-1, SHA-256
Designed to achieve wire-speed on both 10Gb Ethernet ports
> Facilitates wire-speed encryption and decryption
DMA engine shares the core's crossbar port
Crossbar
[Block diagram: Sparc cores 0-7 connect through per-bank PCX muxes to L2 banks 0-7 (~90 GB/s write); returns flow back through CPX muxes to the cores (~180 GB/s read)]
Connects 8 cores to 8 L2 banks and I/O
Non-blocking, pipelined switch
> 8 load/store requests and 8 data returns can be done at the same time
Divided into 2 parts
> PCX: processor to cache
> CPX: cache to processor
Arbitration for a target is required
> Priority given to oldest requestor to maintain fairness and order
Three cycle arbitration protocol
> Request, arbitrate, and then grant
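The read/write bandwidth annotations are consistent with 8 ports moving 16 bytes and 8 bytes per cycle if one assumes a core clock of roughly 1.4 GHz; the clock rate is an assumption, not a number given on this slide.

```python
# Rough check of the crossbar bandwidth figures, assuming a ~1.4 GHz clock.
CLOCK_HZ = 1.4e9
PORTS = 8
read_bw  = PORTS * 16 * CLOCK_HZ / 1e9   # 16-byte returns per bank per cycle
write_bw = PORTS * 8  * CLOCK_HZ / 1e9   # 8-byte writes per core per cycle
print(f"read ~{read_bw:.0f} GB/s, write ~{write_bw:.0f} GB/s")
# read ~179 GB/s, write ~90 GB/s
```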
L2 Cache
[Bank block diagram: PCX and I/O requests enter an input queue and arbiter; the L2 tag, valid, and data arrays are accessed on lookup, with a directory lookup generating invalidation packets; hits return 16B packets through the output queue to the CPX; misses go through miss, fill, write-back, and I/O write buffers, with 64B line fills, evictions, and writes to/from memory]
4 MB L2 cache
> 16 way set associative
> 8 L2 banks
> 64 byte line size
L2 cache is write-back, write-allocate
> L1 data cache is write-through
> Support for partial stores
Coherency is managed by the L2 cache
> Directories maintained for all 16 L1 caches
Data transfers between the L2 and a core are done in 16 byte packets
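The slides do not say how physical addresses map to the 8 banks. Purely for illustration, the sketch below assumes the bank index comes from the low-order line-address bits, so consecutive 64-byte lines interleave round-robin across banks; the function name and mapping are hypothetical.

```python
# Hypothetical address-to-bank mapping for illustration only: with 64-byte
# lines and 8 banks, one simple scheme indexes the bank with the three
# address bits just above the line offset.
LINE_BYTES = 64
N_BANKS = 8

def l2_bank(phys_addr):
    return (phys_addr // LINE_BYTES) % N_BANKS

# Consecutive 64B lines spread round-robin across the 8 banks.
print([l2_bank(0x1000 + i * LINE_BYTES) for i in range(10)])
# [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```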
Integrated Networking
[Block diagram: FBDIMMs (42 GB/s read, 21 GB/s write) feed the memory controllers and 8 L2 banks; cores C0-C7 sit on the crossbar; the NIU (10 GE Ethernet) and PCI-Ex x8 @ 2.5 Gb/s attach through the System Interface Unit]
Integrate networking for better overall performance
> All network data is sourced from and destined to main memory
> Integration minimizes the impact of memory accesses
  Get networking closer to memory to reduce latency
  Pipelined memory accesses tolerate relaxed ordering
> Able to take full advantage of higher memory bandwidth
> Eliminates inherent inefficiencies of I/O protocol translation
Networking Features
Line Rate Packet Classification (~30M pkt/s)
> Based on Layer 1/2/3/4 of the protocol stack
Multiple DMA Engines
> Matches DMAs to threads
> Binding flexibility between DMAs and ports
> 16 transmit + 16 receive DMA channels
Virtualization Support
> Supports up to 8 partitions
> Interrupts may be bound to different hardware threads
Dual Ethernet ports
> 2 dual-speed MACs (10G/1G) with integrated serdes
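The classification algorithm itself is not described on this slide. The sketch below only illustrates the general idea of binding flows to receive DMA channels: hash layer-3/4 header fields so every packet of a flow lands on the same channel, and therefore the same thread. The hash choice and function name are assumptions.

```python
import hashlib

# Illustrative flow-to-DMA binding: hash a five-tuple to one of the
# 16 receive DMA channels (the actual classifier is not specified here).
N_RX_CHANNELS = 16

def rx_dma_channel(src_ip, dst_ip, src_port, dst_port, proto):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % N_RX_CHANNELS

print(rx_dma_channel("10.0.0.1", "10.0.0.2", 40000, 80, "tcp"))
```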
PCI-Express
[Block diagram: the 350 MHz data management unit (DMA/PIO, IOMMU, interrupt handling, TLP packets) connects over 128b transmit/receive and 32b CSR interfaces to the 250 MHz PCI Express core (transaction, data link, and physical layers), which drives the x8 serdes at 2.5 Gb/s]
PCI-Express operates at 2.5 Gb/s per lane per direction
> Point-to-point, dual-simplex chip interconnect
> Transfers are in packets with headers and max data payloads from 128B to 512B
IOMMU supports I/O virtualization and process/device isolation by using the PCI-E BDF#
MSI support
> Event queue accumulates MSIs
> Allows many MSIs to be serviced upon an interrupt
Total I/O bandwidth is 3-4 GB/s with max payload sizes of 128B to 512B
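The 3-4 GB/s figure is consistent with x8 lanes at 2.5 Gb/s once 8b/10b encoding and per-packet header overhead are accounted for. The ~24-byte per-TLP overhead below is a typical PCI-Express 1.0 assumption, not a value from the slide.

```python
# Rough sanity check of the 3-4 GB/s total I/O bandwidth figure.
lanes, lane_gbps, encoding = 8, 2.5, 8 / 10
raw_GBps_per_dir = lanes * lane_gbps * encoding / 8        # 2.0 GB/s per direction

overhead_bytes = 24                                         # assumed header + sequence + CRC + framing
for payload in (128, 512):
    eff = payload / (payload + overhead_bytes)
    total = 2 * raw_GBps_per_dir * eff                      # both directions combined
    print(f"{payload}B payloads: ~{total:.1f} GB/s total")
# 128B payloads: ~3.4 GB/s total
# 512B payloads: ~3.8 GB/s total
```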
Power Management
Limit speculation
> Sequential prefetch of instruction cache lines
> Predict conditional branches as not-taken
> Predict loads hit in the data cache
> Hardware tablewalk search control
Extensive clock gating
> Datapath
> Control blocks
> Arrays
Power throttling
> 3 external power throttle pins
> Inject stall cycles into the decode stage based on state of these pins
> If power_throttle_pins[2:0] == n, then n stall cycles are inserted in every window of 8 cycles (n = 0-7); see the sketch below
> Affects all threads
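A minimal sketch of the n-of-8 stall injection described above. The placement of the stalls within the 8-cycle window (front-loaded here) and the function name are assumptions; only the n-of-8 duty cycle comes from the slide.

```python
# Decode-stage throttling sketch: for throttle setting n (0-7),
# stall decode for n of every 8 cycles, for all threads.
def decode_may_issue(cycle, power_throttle_pins):
    n_stalls = power_throttle_pins & 0b111      # n = 0..7
    return (cycle % 8) >= n_stalls              # stall the first n cycles of each window

# With n = 3, decode issues on 5 of every 8 cycles.
print(sum(decode_may_issue(c, 0b011) for c in range(8)))   # 5
```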

Niagara2 System Status
First silicon arrived at the end of May
Booted Solaris in 5 days
Current systems are fully operational
Expect systems to ship in 2H2007
Summary
Niagara2 combines all major server functions on one chip
> Integrated networking
> Integrated PCI-Express
> Embedded wire-speed cryptography
Niagara2 has improved performance vs. UltraSparc T1
> Better integer throughput and throughput/watt (>2x)
> Improved integer single-thread performance (>1.4x)
> Better floating-point throughput (>10x)
> Better floating-point single-thread performance (>5x)
Enables a new generation of power-efficient, fully-secure datacenters
Thank you ...
robert.golla@sun.com

