
Heads Up

Last week's material
- Memory hierarchies, the basics of cache design
  - Reading assignment: PH 5.1-5.2, C.9

This week's material
- Main memory architectures; advanced topics in computer architecture

Next (last!) week's material
- Cumulative review


Review: Major Components of a Computer

[Figure: the processor (control + datapath) connected to memory (cache, main memory, secondary memory/disk) and to input and output devices]

Main Memory uses DRAM for density (size)
- Higher density (1-transistor cells), lower power, and cheaper, but slower (access times of 50 to 70 nsec)
- Dynamic, so it needs to be refreshed regularly (~ every 8 ms)


Review: The Memory Hierarchy

Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.

[Figure: the hierarchy from processor to disk; access time increases and the (relative) size of the memory grows at each level with distance from the processor]
- Processor <-> L1$ (SRAM): 4-8 bytes (word)
- L1$ <-> L2$ (SRAM): 8-32 bytes (block)
- L2$ <-> Main Memory (DRAM): 1 to 4 blocks
- Main Memory <-> Secondary Memory (Disks): 1,024+ bytes (disk sector = page)

Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM.

DRAM Performance Metrics

DRAM addresses are divided into two halves (row and column)
- RAS, or Row Access Strobe, triggers the row decoder
- CAS, or Column Access Strobe, triggers the column selector

Latency: time to access one word
- Access time: time between the request and when the word is read or written
  - Read access and write access times can be different
- Cycle time: time between successive (read or write) requests
  - Usually cycle time > access time

Bandwidth: how much data can be supplied per unit time
- Width of the data channel * the rate at which it can be used
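As a quick worked example of the bandwidth formula (the channel width and transfer rate below are assumed for illustration, not taken from the slide):

```python
# Bandwidth = width of the data channel * rate at which it can be used.
# Illustrative, assumed numbers: a 64-bit (8-byte) channel used at
# 200 million transfers per second.
width_bytes = 8
rate_transfers_per_sec = 200e6

bandwidth_bytes_per_sec = width_bytes * rate_transfers_per_sec
print(f"{bandwidth_bytes_per_sec / 1e9:.1f} GB/sec")   # 1.6 GB/sec
```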



Review: A Typical Interconnect System

[Figure: the processor and its cache connect over a memory - I/O bus to main memory and to several I/O controllers; one controller drives disks, one drives graphics, one drives the network; interrupts flow from the I/O controllers back to the processor]


Main Memory (DRAM) System

It's important to match the cache characteristics
- Remember, caches want information provided to them one block at a time (and a block is usually more than one word)

with the main memory characteristics
- Use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache

and with the memory-bus characteristics
- Make sure the memory bus can support the DRAM access rates and patterns, with the goal of increasing the memory-bus-to-cache bandwidth


DRAM Packaging - DIMMs

Dual In-line Memory Modules
- Small printed circuit board that holds DRAMs with a 64-bit datapath
- Each contains sixteen x4 ("by 4") DRAM parts or eight x8 parts, making up the 64-bit datapath

[Figure: an Intel Xeon 5300 processor connected over the Front Side Bus (1333 MHz, 10.5 GB/sec) to the 5000P Memory Controller Hub (north bridge), which drives the main memory DIMMs over FB DDR2 667 channels (5.3 GB/sec)]

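The bus bandwidths in the figure can be sanity-checked as width times transfer rate; a minimal sketch, assuming both buses are effectively 64 bits (8 bytes) wide:

```python
# Sanity-check the figure's bus bandwidths as width * transfer rate,
# assuming a 64-bit (8-byte) data width for both buses.
def gb_per_sec(mega_transfers_per_sec, width_bytes=8):
    return mega_transfers_per_sec * 1e6 * width_bytes / 1e9

print(gb_per_sec(667))    # ~5.3 GB/sec, matching the FB DDR2 667 channel
print(gb_per_sec(1333))   # ~10.7 GB/sec, close to the quoted 10.5 GB/sec FSB figure
```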

Classical DRAM Organization (~Square Planes)

[Figure: a square DRAM cell array; a row decoder, driven by the row address (RAS), asserts one word (row) select line; bit (data) lines run through the array to the column selector & I/O circuits, driven by the column address (CAS); each word-line/bit-line intersection is a 1-T DRAM cell. The part has m such planes, where m is the number of bits in the part, each plane producing one data bit.]

The column address selects the requested bit from the row in each plane.

Classical DRAM Operation

DRAM organization:
- N rows x N columns x M bits (planes)
- Reads or writes M bits at a time
- Each M-bit access requires a full RAS/CAS cycle: the row address is strobed in first, then the column address

[Timing diagram: for both the 1st and 2nd M-bit accesses, RAS latches the row address, then CAS latches the column address, and the M-bit output appears after the access time; the cycle time (from one row address to the next) is longer than the access time]

(Fast) Page Mode DRAM Operation

A row is kept open by keeping RAS asserted
- Pulse CAS to access other M-bit blocks on that row
- Successive reads or writes within the row are faster, since they don't have to precharge and (re)access that row

[Timing diagram: RAS latches the row address once, holding the open row as an N x M buffer; four CAS pulses with successive column addresses then deliver the 1st through 4th M-bit outputs, with a much shorter cycle time for the 2nd and later M-bits]
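A small sketch of the payoff, using the illustrative bus-cycle timings assumed later in these slides (15 cycles for a full RAS/CAS access, 5 for a CAS-only access on an open row):

```python
# Compare reading 4 words from the same row: classical vs. fast page mode.
ROW_ACCESS = 15   # full RAS/CAS cycle: precharge, row access, column select
COL_ACCESS = 5    # CAS only: the row is already open

classical = 4 * ROW_ACCESS               # every access repeats the RAS/CAS cycle
page_mode = ROW_ACCESS + 3 * COL_ACCESS  # open the row once, then pulse CAS
print(classical, page_mode)              # 60 vs 30 cycles
```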

Synchronous DRAMs (SDRAMs)

Like page mode DRAMs, synchronous DRAMs have the ability to transfer a burst of data from a series of sequential addresses that are in the same row
- For words in the same burst, the complete (row and column) addresses don't have to be provided
- Specify the starting (row + column) address and the burst length (the burst must all be in the same DRAM row). The row is accessed from the DRAM and loaded into a row cache (SRAM). Data words in the burst are then accessed from that SRAM under control of a clock signal.

DDR SDRAMs (Double Data Rate SDRAMs)
- Transfer burst data on both the rising and falling edge of the clock (so twice as fast)
- Now have DDR2 and DDR3 with even higher clock rates
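A minimal sketch of the double-data-rate arithmetic, assuming a 64-bit DIMM interface and a 100 MHz bus clock (illustrative numbers, not from the slide):

```python
# DDR moves data on both clock edges, doubling transfers per cycle.
clock_hz    = 100e6   # assumed 100 MHz bus clock
width_bytes = 8       # assumed 64-bit (8-byte) interface

sdr = clock_hz * width_bytes   # one transfer per clock: 0.8 GB/sec
ddr = 2 * sdr                  # rising + falling edge:  1.6 GB/sec
print(sdr / 1e9, ddr / 1e9)
```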

Synchronous DRAM (SDRAM) Operation

After RAS loads a row into the SRAM row cache, input CAS as the starting burst address, along with a burst length, to read a burst of data from a series of sequential addresses within that row on successive clock edges.

[Timing diagram: RAS latches the row address into the N x M SRAM row cache; a single CAS with the starting column address then streams out the 1st through 4th M-bit outputs on successive clock edges, the column address incrementing (+1) automatically]

http://en.wikipedia.org/wiki/DDR_SDRAM

DRAM Memory Latency & Bandwidth Milestones

                  DRAM   Page DRAM  Page DRAM  Page DRAM  SDRAM  DDR SDRAM
Year              1980   1983       1986       1993       1997   2000
Module width      16b    16b        32b        64b        64b    64b
Mb/chip           0.06   0.25       1          16         64     256
Die size (mm2)    35     45         70         130        170    204
Pins/chip         16     16         18         20         54     66
BWidth (MB/s)     13     40         160        267        640    1600
Latency (nsec)    225    170        125        75         62     52

Patterson, CACM Vol 47, #10, 2004

In the time that the memory-to-processor bandwidth has more than doubled, the memory latency has improved by a factor of only 1.2 to 1.4. To deliver such high bandwidth, the internal DRAM has to be organized as interleaved memory banks.
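A quick check of this claim against the table's numbers, generation over generation:

```python
# Table columns, 1980 -> 2000.
bandwidth_mb_s = [13, 40, 160, 267, 640, 1600]
latency_ns     = [225, 170, 125, 75, 62, 52]

for i in range(1, len(bandwidth_mb_s)):
    bw_gain  = bandwidth_mb_s[i] / bandwidth_mb_s[i - 1]
    lat_gain = latency_ns[i - 1] / latency_ns[i]
    print(f"{bw_gain:.1f}x bandwidth vs {lat_gain:.2f}x latency")
# bandwidth grows ~1.7-4x per generation; latency improves only ~1.2-1.7x
```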

Memory Systems that Support Caches

The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.

[Figure: an on-chip CPU and cache connected by a one-word-wide bus (32-bit data & 32-bit addr per cycle) to off-chip DRAM memory]

One word wide organization (one word wide bus and one word wide memory). Assume:
1. 1 memory bus clock cycle to send the addr
2. 15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for the 2nd, 3rd, and 4th words (column access time)
3. 1 memory bus clock cycle to return a word of data

Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle
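These assumptions can be packaged into a small miss-penalty model; a sketch in Python (the overlap rule, that a word's return cycle can overlap the next DRAM access so only the last return is charged, is what makes the totals on the following slides work out):

```python
# The slides' assumed timings for the one-word-wide organization.
SEND_ADDR   = 1    # bus cycles to send the address
FIRST_WORD  = 15   # row cycle time: first word in the block from DRAM
NEXT_WORD   = 5    # column access time: later words in the same row
RETURN_WORD = 1    # bus cycles to return a word of data

def miss_penalty(words, same_row):
    """Bus cycles to fill a block of `words` words over the one-word bus."""
    if same_row:
        dram = FIRST_WORD + (words - 1) * NEXT_WORD
    else:
        dram = words * FIRST_WORD  # every word pays the full row cycle time
    # word returns overlap the following DRAM access, so only the
    # last word's return cycle adds to the total
    return SEND_ADDR + dram + RETURN_WORD

print(miss_penalty(1, same_row=True))   # 17 cycles: the one-word-block case
```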

One Word Wide Bus, One Word Blocks

If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
- 1 cycle to send the address
- 15 cycles to read DRAM
- 1 cycle to return the data
- 17 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/17 = 0.235 bytes per memory bus clock cycle


One Word Wide Bus, Four Word Blocks

What if the block size is four words and each word is in a different DRAM row?
- 1 cycle to send the 1st address
- 4 x 15 = 60 cycles to read DRAM
- 1 cycle to return the last data word
- 62 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/62 = 0.258 bytes per memory bus clock cycle


One Word Wide Bus, Four Word Blocks

What if the block size is four words and all words are in the same DRAM row?
- 1 cycle to send the 1st address
- 15 + 3 x 5 = 30 cycles to read DRAM
- 1 cycle to return the last data word
- 32 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/32 = 0.5 bytes per memory bus clock cycle


Interleaved Memory, One Word Wide Bus

[Figure: the CPU and cache connect over the one-word bus to four DRAM memory banks, bank 0 through bank 3]

For a block size of four words spread across the four banks:
- 1 cycle to send the 1st address
- 15 cycles to read the DRAM banks (their row accesses overlap)
- 4 x 1 = 4 cycles to return the four data words
- 20 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/20 = 0.8 bytes per memory bus clock cycle
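Putting the four organizations side by side, under the same assumed timings (a minimal standalone sketch; the cycle totals are the ones derived on the preceding slides):

```python
# Miss penalty and bus bandwidth for the four organizations above,
# using the assumed timings (send addr + DRAM cycles + last return).
cases = {
    "one-word block":                  (4,  1 + 15 + 1),          # 17 cycles
    "4 words, different rows":         (16, 1 + 4 * 15 + 1),      # 62 cycles
    "4 words, same row (page mode)":   (16, 1 + 15 + 3 * 5 + 1),  # 32 cycles
    "4 words, 4 interleaved banks":    (16, 1 + 15 + 4 * 1),      # 20 cycles
}
for name, (block_bytes, cycles) in cases.items():
    print(f"{name}: {cycles} cycles, {block_bytes / cycles:.3f} bytes/cycle")
# 0.235, 0.258, 0.500, and 0.800 bytes per memory bus clock cycle
```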

Extracting Yet More Performance

Superpipelining
- Increase the depth of the pipeline to overlap more instructions

Dynamic multiple-issue (superscalar)
- Execute multiple instructions at one time, with the decisions on which instructions to execute simultaneously being made dynamically by the hardware
- E.g., Pentium 4

Static multiple-issue (VLIW or EPIC)
- Execute multiple instructions at one time, with the decisions on which instructions to execute simultaneously being made statically by the compiler
- E.g., Intel Itanium and Itanium 2

Hyperthreading

Multicore

Super Pipelined Processors

Increasing the depth of the pipeline leads to very short clock cycles, so very high clock rates (and more instructions in flight at one time).

The higher the degree of superpipelining:
- the more forwarding/hazard hardware needed
- the more stall cycles (noop instructions) incurred
- the more pipeline latch overhead (i.e., the pipeline latch accounts for a larger and larger percentage of the clock cycle time), as the sketch after this list illustrates
- the bigger the clock skew issues (i.e., because of faster and faster clocks)

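A toy model of the latch-overhead bullet, with assumed delays (not figures from the slide): the clock period is roughly the total logic delay divided by the pipeline depth, plus a fixed per-stage latch overhead, so the latch fraction grows with depth:

```python
# T_clock = T_logic / depth + T_latch  -- latch overhead dominates as
# the logic is sliced into thinner and thinner stages.
T_LOGIC_NS = 10.0   # total combinational logic delay (assumed)
T_LATCH_NS = 0.5    # pipeline latch overhead per stage (assumed)

for depth in (5, 10, 20, 40):
    t_clock = T_LOGIC_NS / depth + T_LATCH_NS
    print(f"depth {depth:2}: {t_clock:.2f} ns clock, "
          f"latch = {T_LATCH_NS / t_clock:.0%} of the cycle")
```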

Super Scalar Processors

Execute multiple instructions at one time, with the decisions on which instructions to execute simultaneously being made dynamically by the hardware
- In-order fetch, in-order issue, out-of-order execution, and in-order commit

Pipelining creates true dependencies (read before write)
- Out-of-order execution creates antidependencies (write before read) and output dependencies (write before write)
- In-order commit allows speculation and is required to implement precise interrupts

Register renaming (RUU) architecture structures are used to solve these storage dependencies in superscalar processors; a sketch follows below.

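A minimal sketch of the renaming idea (not any particular processor's implementation): giving every write a fresh physical register removes the write-before-read and write-before-write storage dependencies, leaving only true read-after-write dependences:

```python
# Rename architectural registers to fresh physical registers.
# Each instruction is (destination, [source registers]).
def rename(instrs):
    mapping, next_phys, out = {}, 0, []
    for dst, srcs in instrs:
        new_srcs = [mapping.get(s, s) for s in srcs]  # reads use the current mapping
        mapping[dst] = f"p{next_phys}"                # every write gets a fresh register
        next_phys += 1
        out.append((mapping[dst], new_srcs))
    return out

# r1 is written twice (output dependence) and read in between (true dependence)
prog = [("r1", ["r2", "r3"]), ("r4", ["r1"]), ("r1", ["r5", "r6"])]
print(rename(prog))
# [('p0', ['r2', 'r3']), ('p1', ['p0']), ('p2', ['r5', 'r6'])]
```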

A Super Scalar Example

Intel Pentium 4 (IA-32 ISA)
- Decodes the IA-32 instructions into microoperations (uops)
- Does register renaming with a RUU-like structure
- Has a 20 stage pipeline

[Figure: the pipeline flow, with cycle counts per stage, from I$ access (with branch prediction) through the uop queue, RUU allocation, FU queues, instruction dispatch, register file access, and execution, to the RUU queue and commit]

- 7 functional units: 2 integer ALUs, 1 FP ALU, 1 FP move, load, store, complex
- Up to 126 instructions in flight, including 48 loads and 24 stores
- 4K entry branch predictor

VLIW Processors

Execute multiple instructions at one time, with the decisions on which instructions to execute simultaneously being made statically at compile time by the compiler
- Issue packet: the set of instructions that are bundled together and issued in one clock cycle; think of it as one large instruction with multiple operations
- The mix of instructions in the packet (bundle) is usually restricted: a single instruction with several predefined fields

The compiler does static branch prediction and code scheduling to reduce (control) or eliminate (data) hazards

VLIWs have
- Multiple functional units
- Multi-ported register files
- Wide program bus

Hyperthreading (aka Multithreading, SMT)

Hardware multithreading allows multiple processes (threads) to share the functional units of a single processor
- Can hide true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of the stalling instructions

The processor must duplicate the state hardware for each thread: a separate register file, PC, instruction buffer, and store buffer per thread
- The caches, TLBs, BHT, BTB, and RUU can be shared (although the miss rates may increase if they are not sized accordingly)
- The memory can be shared through virtual memory mechanisms

Hardware must support efficient thread context switching

Hyperthreading Example: Sun's Niagara

Cores are simple (single-issue, 6-stage, no branch prediction), small, and power-efficient.

[Figure: the pipeline - Fetch, Thread Select, Decode, Execute, Memory, WB - with per-thread state replicated x8 (PC logic, instruction buffers, register file, store buffers) and shared resources (I$/ITLB, decode, ALU/Mul/Shft/Div, D$/DTLB, crossbar interface); the thread select logic and its muxes pick a thread each cycle based on instruction type, cache misses, traps & interrupts, and resource conflicts]

From MPR, Vol. 18, #9, Sept. 2004
