Head's Up: Last Week's Material Week's Material

Heads Up
Last weeks material
Memory hierarchies, the basics of cache design Main memory architectures; advanced topics in computer architecture
- Reading assignment PH: 5.1-5.2, C.9
This weeks material
Next (last!) weeks material
Cumulative review
CSE331 W14&15.2
Irwin Fall 09 PSU
Review: Major Components of a Computer

Processor Control
Devices
Memory Input Output
Main Memory uses DRAM for density (size)
Datapath
Higher density (1 transistor cells), lower power, cheaper but slower (access times of 50 to 70 nsec) Dynamic, so needs to be refreshed regularly (~ every 8 ms)
Main Memory
Cache
Secondary Memory (Disk)
CSE331 W14&15.5
Irwin Fall 09 PSU
Review: The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology
Processor
4-8 bytes (word)
Increasing distance from the processor in access time
L1$ (SRAM)
8-32 bytes (block)
L2$ (SRAM)
1 to 4 blocks
Main Memory (DRAM)
Inclusive what is in L1$ is a subset of what is in L2$ is a subset of what is in MM that is a subset of is in SM
1,024+ bytes (disk sector = page)
Secondary Memory (Disks)
CSE331 W14&15.6
(Relative) size of the memory at each level

Irwin Fall 09 PSU
DRAM Performance Metrics
DRAM addresses are divided into 2 halves (row and column)
RAS or Row Access Strobe that triggers the row decoder
CAS or Column Access Strobe that triggers the column selector
Latency: Time to access one word
Access Time: time between request and when word is read or written
- read access and write access times can be different
Cycle Time: time between successive (read or write) requests
Usually cycle time > access time
Bandwidth: How much data can be supplied per unit time
width of the data channel * the rate at which it can be used

Irwin Fall 09 PSU
CSE331 W14&15.7
Review: A Typical Interconnect System

Interrupts
Processor
Cache
Memory - I/O Bus
Main Memory
I/O Controller
I/O Controller
Graphics
I/O Controller
Network
Disk
Disk
CSE331 W14&15.8
Irwin Fall 09 PSU
Main Memory (DRAM) System
Its important to match the cache characteristics
Remember, caches want information provided to them one block at a time (and a block is usually more than one word)
with the main memory characteristics
use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache
with the memory-bus characteristics
make sure the memory-bus can support the DRAM access rates and patterns with the goal of increasing the Memory-Bus_to_Cache bandwidth
CSE331 W14&15.9
Irwin Fall 09 PSU
DRAM Packaging - DIMMs
Dual In-line Memory Modules
Small printed circuit board that holds DRAMs with a 64-bit datapath
Each contain eight x4 (by 4) DRAM parts or x8 DRAM parts
Front Side Bus (1333MHz, 10.5GB/sec)

FB DDR2 667 (5.3GB/sec)
Intel Xeon 5300 processor
Main memory DIMMs
Memory Controller Hub (north bridge) 5000P
CSE331 W14&15.10
Irwin Fall 09 PSU
Classical DRAM Organization (~Square Planes)

bit (data) lines
r o w
Each intersection represents a 1-T DRAM cell

word (row) select line
d e c o d e r
DRAM Cell Array
m planes (where m is the # of bits in the part)

column Address (CAS)
row Address (RAS)
Column Selector & I/O Circuits

data bit
data bit data bit

CSE331 W14&15.11
The column address selects the requested bit from the row in each plane Irwin Fall 09 PSU
Classical DRAM Operation
DRAM Organization:
Column Address
N rows x N column x M-bit (planes) Reads or Writes M-bit at a time

N rows
N cols
DRAM
Each M-bit access requires a RAS / CAS cycle
Row Address
Cycle Time Access Time
M bits M-bit Output
1st M-bit
RAS
2nd M-bit
CAS
Row Address
CSE331 W14&15.12
Col Address
Row Address
Col Address
Irwin Fall 09 PSU
(Fast) Page Mode DRAM Operation

A row

is kept open by keeping the RAS asserted
Column Address
N cols
Pulse CAS to access other M-bit blocks on that row Successive reads or writes within the row are faster since dont have to precharge and (re)access that row
Cycle Time 2nd M-bit
DRAM N rows
Row Address
N x M row M bits M-bit Output 3rd M-bit 4th M-bit
1st M-bit
RAS
CAS Row Address
CSE331 W14&15.13
Col Address
Col Address
Col Address
Col Address
Irwin Fall 09 PSU
Synchronous DRAMs (SDRAMs)
Like page mode DRAMS, synchronous DRAMs have the ability to transfer a burst of data from a series of sequential addresses that are in the same row
For words in the same burst, dont have to provide the complete (row and column) addresses
Specify the starting (row+column) address and the burst length (burst must all be in the same DRAM row). The row is accessed from the DRAM and loaded into a row cache (SRAM). Data words in the burst are then accessed from that SRAM under control of a clock signal.
DDR SDRAMs (Double Data Rate SDRAMs)
Transfers burst data on both the rising and falling edge of the clock (so twice fast)
Now have DDR2 and DDR3 with even higher clock rates
Irwin Fall 09 PSU
CSE331 W14&15.14
Synchronous DRAM (SDRAM) Operation
After RAS loads a row into the SRAM cache
Column Address
+1 N cols
Input CAS as the starting burst address along with a burst length to read a burst of data from a series of sequential addresses within that row on the clock edge
DRAM N rows
Row Address
N x M SRAM M-bit Output
M bit planes
clock
RAS CAS
1st M-bit
2st M-bit
3st M-bit
4st M-bit
Row Address
CSE331 W14&15.15
Col Address
Row Address
Irwin Fall 09 PSU
http://en.wikipedia.org/wiki/DDR_SDRAM
DRAM Memory Latency & Bandwidth Milestones

DRAM Page DRAM Page DRAM Page DRAM SDRAM DDR SDRAM
Module Width
Year
16b
1980
16b
1983
32b
1986
64b
1993
64b
1997
64b
2000
Mb/chip
Die size (mm2) Pins/chip
0.06
35 16
0.25
45 16
1
70 18
16
130 20
64
170 54
256
204 66
BWidth (MB/s)
Latency (nsec)
13
225
40
170
160
125
267
75
640
62
1600
52
Patterson, CACM Vol 47, #10, 2004
In the time that the memory to processor bandwidth has more than doubled the memory latency has improved by a factor of only 1.2 to 1.4 To deliver such high bandwidth, the internal DRAM has to be organized as interleaved memory banks
CSE331 W14&15.16 Irwin Fall 09 PSU
Memory Systems that Support Caches
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways
on-chip
CPU
One word wide organization (one word wide bus and one word wide memory)
Assume
1.
Cache
bus
1 memory bus clock cycle to send the addr
2.
32-bit data & 32-bit addr per cycle
15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for 2nd, 3rd, 4th words (column access time) 1 memory bus clock cycle to return a word of data number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle
Irwin Fall 09 PSU
DRAM Memory
3.
Memory-Bus to Cache bandwidth
CSE331 W14&15.18
One Word Wide Bus, One Word Blocks
on-chip
CPU
Cache bus
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory
cycle to send address
cycles to read DRAM

cycle to return data
DRAM Memory
total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
bytes per memory bus clock cycle
CSE331 W14&15.19
Irwin Fall 09 PSU
One Word Wide Bus, Four Word Blocks
on-chip
What if the block size is four words and each word is in a different DRAM row?
cycle to send 1st address
CPU
cycles to read DRAM

Cache bus
cycles to return last data word total clock cycles miss penalty
DRAM Memory
CSE331 W14&15.21
Irwin Fall 09 PSU
One Word Wide Bus, Four Word Blocks
on-chip
What if the block size is four words and all words are in the same DRAM row?
CPU
cycles to read DRAM

Cache bus
cycles to return last data word total clock cycles miss penalty
DRAM Memory
CSE331 W14&15.23
Irwin Fall 09 PSU
Interleaved Memory, One Word Wide Bus
on-chip
For a block size of four words

cycles to read DRAM banks cycles to return last data word
CPU
Cache bus DRAM DRAM DRAM DRAM Memory Memory Memory Memory bank 0 bank 1 bank 2 bank 3
total clock cycles miss penalty
CSE331 W14&15.25
bytes per memory bus clock cycleFall 09 PSU Irwin
Extracting Yet More Performance
Superpipelining
Increase the depth of the pipeline to overlap more instructions
Dynamic multiple-issue (superscalar)
Execute multiple instructions at one time with the decisions on which instructions to execute simultaneously being made dynamically by the hardware E.g., Pentium 4
Static multiple-issue (VLIW or EPIC)
Execute multiple instructions at one time with the decisions on which instructions to execute simultaneously being made statically by the compiler
E.g., Intel Itanium and Itanium 2
Hyperthreading Multicore
Irwin Fall 09 PSU
CSE331 W14&15.28
Super Pipelined Processors
Increasing the depth of the pipeline leads to very short clock cycles, so very high clock rates (and more instructions in flight at one time)
The higher the degree of superpipelining

the more forwarding/hazard hardware needed

the more stall cycles (noop instructions) incurred
the more pipeline latch overhead (i.e., the pipeline latch accounts for a larger and larger percentage of the clock cycle time)
the bigger the clock skew issues (i.e., because of faster and faster clocks)
CSE331 W14&15.29
Irwin Fall 09 PSU
Super Scalar Processors
Execute multiple instructions at one time with the decisions on which instructions to execute simultaneously being made dynamically by the hardware
In-order fetch, in-order issue, out-of-order execution, and in-order commit
Pipelining creates true dependencies (read before write)

Out-of-order execution creates antidependencies (write before read) and output dependencies (write before write) In-order commit allows speculation and is required to implement precise interrupts
Register renaming (RUU) architecture structures are used to solve these storage dependencies in superscalar processors
CSE331 W14&15.30
Irwin Fall 09 PSU
A Super Scalar Example
Intel Pentium 4 (IA-32 ISA)
Decodes the IA-32 instructions into microoperations

Does register renaming with a RUU-like structure Has a 20 stage pipeline
op queue
RUU allocation 4 FU queues Instr dispatch 5 RegFile access 2 Execution
I$ access (Bpredict) # cycles 5
RUU queue
Commit
7 Functional Units: 2 integer ALUs, 1 FP ALU, 1FP move, load, store, complex Up to 126 instructions in flight, including 48 loads and 24 stores
4K entry branch predictor
CSE331 W14&15.31
Irwin Fall 09 PSU
VLIW Processors
Execute multiple instructions at one time with the decisions on which instructions to execute simultaneously being made statically at compile time by the compiler
Issue packet the set of instructions that are bundled together and issued in one clock cycle think of it as one large instruction with multiple operations
- The mix of instructions in the packet (bundle) is usually restricted a single instruction with several predefined fields
The compiler does static branch prediction and code scheduling to reduce (control) or eliminate (data) hazards
VLIWs have

Multiple functional units Multi-ported register files
Wide program bus

Irwin Fall 09 PSU
CSE331 W14&15.32
Hyperthreading (aka Multithreading, SMT)

Can hide true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions Hardware multithreading allows multiple processes (threads) to share the functional units of a single processor
Processor must duplicate the state hardware for each thread a separate register file, PC, instruction buffer, and store buffer for each thread
The caches, TLBs, BHT, BTB, RUU can be shared (although the miss rates may increase if they are not sized accordingly) The memory can be shared through virtual memory mechanisms
Hardware must support efficient thread context switching
CSE331 W14&15.33
Irwin Fall 09 PSU
Hyperthreading Example: Suns Niagara
Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient
Fetch Thrd Sel Decode RegFile x8 Execute Memory WB
I$ ITLB
Inst bufx8
Thrd Sel Mux
Decode
ALU Mul Shft Div
D$ DTLB Stbufx8
Crossbar Interface
Thread Select Logic
Thrd Sel Mux
PC logicx8
Instr type Cache misses Traps & interrupts Resource conflicts
From MPR, Vol. 18, #9, Sept. 2004

CSE331 W14&15.34 Irwin Fall 09 PSU

Head's Up: Last Week's Material Week's Material

Caricato da

Informazioni sul documento

Titolo originale

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Head's Up: Last Week's Material Week's Material

Caricato da

Copyright:

Formati disponibili

Heads Up

Last weeks material

This weeks material

Next (last!) weeks material

Irwin Fall 09 PSU

Review: Major Components of a Computer

Main Memory uses DRAM for density (size)

Secondary Memory (Disk)

Irwin Fall 09 PSU

Review: The Memory Hierarchy

Increasing distance from the processor in access time

Main Memory (DRAM)

1,024+ bytes (disk sector = page)

Secondary Memory (Disks)

(Relative) size of the memory at each level

DRAM Performance Metrics

DRAM addresses are divided into 2 halves (row and column)

RAS or Row Access Strobe that triggers the row decoder

CAS or Column Access Strobe that triggers the column selector

Latency: Time to access one word

Cycle Time: time between successive (read or write) requests

Usually cycle time > access time

Bandwidth: How much data can be supplied per unit time

width of the data channel * the rate at which it can be used

Review: A Typical Interconnect System

Memory - I/O Bus

Irwin Fall 09 PSU

Main Memory (DRAM) System

Its important to match the cache characteristics

with the main memory characteristics

with the memory-bus characteristics

Irwin Fall 09 PSU

DRAM Packaging - DIMMs

Dual In-line Memory Modules

Front Side Bus (1333MHz, 10.5GB/sec)

Intel Xeon 5300 processor

Main memory DIMMs

Memory Controller Hub (north bridge) 5000P

Irwin Fall 09 PSU

Classical DRAM Organization (~Square Planes)

Each intersection represents a 1-T DRAM cell

DRAM Cell Array

m planes (where m is the # of bits in the part)

row Address (RAS)

Column Selector & I/O Circuits

data bit data bit

Classical DRAM Operation

N rows x N column x M-bit (planes) Reads or Writes M-bit at a time

Each M-bit access requires a RAS / CAS cycle

Cycle Time Access Time

M bits M-bit Output

(Fast) Page Mode DRAM Operation

is kept open by keeping the RAS asserted

N x M row M bits M-bit Output 3rd M-bit 4th M-bit

Synchronous DRAMs (SDRAMs)

DDR SDRAMs (Double Data Rate SDRAMs)

Synchronous DRAM (SDRAM) Operation

After RAS loads a row into the SRAM cache

N x M SRAM M-bit Output

DRAM Memory Latency & Bandwidth Milestones

Patterson, CACM Vol 47, #10, 2004

Memory Systems that Support Caches

1 memory bus clock cycle to send the addr