
Lecture 13 Main Memory

Computer Architecture COE 501

Who Cares About the Memory Hierarchy?


Processor-DRAM Memory Gap (latency)

[Figure: processor vs. DRAM performance, 1980-2000, log scale. Processor performance grows ~60%/yr (Moore's Law, 2x/1.5 yr) while DRAM latency improves only ~9%/yr (2x/10 yrs); the processor-memory performance gap grows ~50% per year.]

Suppose a processor executes at:
Clock rate = 200 MHz (5 ns per cycle), CPI = 1.1
50% arith/logic, 30% ld/st, 20% control

Impact on Performance

Suppose that 10% of memory operations incur a 50-cycle miss penalty:
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles) + (0.30 data mem ops/inst x 0.10 misses/data mem op x 50 cycles/miss)
    = 1.1 cycles + 1.5 cycles = 2.6
58% of the time the processor is stalled waiting for memory!
A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
[Pie chart: Ideal CPI (1.1) 35%, Inst Miss (0.5) 16%, Data Miss (1.5) 49%]
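The stall arithmetic above can be sketched directly; the constants are the slide's own numbers:

```python
# Sketch of the slide's CPI calculation (200 MHz, ideal CPI = 1.1,
# 30% of instructions are loads/stores, 10% of them miss, 50-cycle penalty).
IDEAL_CPI = 1.1
MEM_OP_FRAC = 0.30   # data memory ops per instruction
MISS_RATE = 0.10     # misses per data memory op
MISS_PENALTY = 50    # cycles per miss

stall_cpi = MEM_OP_FRAC * MISS_RATE * MISS_PENALTY   # 1.5 stall cycles/inst
cpi = IDEAL_CPI + stall_cpi                          # 2.6 cycles/inst
stalled_frac = stall_cpi / cpi                       # ~0.58

print(cpi, round(stalled_frac, 2))  # -> 2.6 0.58
```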

Main Memory Background


Performance of Main Memory:
Latency: cache miss penalty
    Access time: time between a request and when the desired word arrives
    Cycle time: minimum time between requests
Bandwidth: I/O & large-block miss penalty (L2)

Main Memory is DRAM: Dynamic Random Access Memory
Dynamic: needs to be refreshed periodically (~8 ms)
Addresses divided into 2 halves: RAS (Row Access Strobe) and CAS (Column Access Strobe)

Cache uses SRAM: Static Random Access Memory
No refresh (6 transistors/bit vs. 1 transistor/bit for DRAM)
Address not divided

Size ratio (DRAM/SRAM): 4-8x; cost & cycle time ratio (SRAM/DRAM): 8-16x

Typical SRAM Organization: 16-word x 4-bit


[Figure: 16-word x 4-bit SRAM array. A 4-bit address (A0-A3) feeds an address decoder that selects one of 16 word lines (Word 0 through Word 15). Each of the 4 columns has a write driver/precharger on its Din pin (Din3-Din0), a column of 16 SRAM cells, and a sense amp driving its Dout pin (Dout3-Dout0); WrEn gates the write drivers.]

Q: Which is longer: the word line or the bit line?

Logic Diagram of a Typical SRAM


[Diagram: a 2^N-word x M-bit SRAM with N address lines (A), write enable (WE_L), output enable (OE_L), and M bidirectional data lines (D).]

Write enable is usually active low (WE_L)
Bidirectional data:
A new control signal, output enable (OE_L), is needed
WE_L asserted (low), OE_L deasserted (high): D serves as the data input pin
WE_L deasserted (high), OE_L asserted (low): D serves as the data output pin
Both WE_L and OE_L asserted: the result is unknown. Don't do that!
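The WE_L/OE_L rules above can be modeled as a small truth table. A toy sketch (not from the slides; the function name is made up for illustration):

```python
def d_pin_role(we_l: bool, oe_l: bool) -> str:
    """Role of the bidirectional D pin given the active-low controls.

    we_l / oe_l are the electrical levels: False = asserted (low),
    True = deasserted (high).
    """
    if not we_l and oe_l:
        return "input"     # write: CPU drives D into the SRAM
    if we_l and not oe_l:
        return "output"    # read: SRAM drives D toward the CPU
    if not we_l and not oe_l:
        return "conflict"  # both drivers on: bus contention, unknown result
    return "high-Z"        # neither asserted: D floats

print(d_pin_role(False, True))  # -> input
```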

Classical DRAM Organization (square)


[Figure: square DRAM cell array. The row address feeds a row decoder that drives one word (row) select line; the column address feeds the column selector & I/O circuits on the bit (data) lines. Each intersection of a word line and a bit line represents a 1-T DRAM cell.]

Row and column address together select 1 bit at a time.

Logic Diagram of a Typical DRAM


[Diagram: a 256K x 8 DRAM with 9 multiplexed address pins (A), control signals RAS_L, CAS_L, WE_L, OE_L, and 8 bidirectional data pins (D).]

Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
Din and Dout are combined (D):
WE_L asserted (low), OE_L deasserted (high): D serves as the data input pin
WE_L deasserted (high), OE_L asserted (low): D serves as the data output pin
Row and column addresses share the same pins (A):
RAS_L goes low: pins A are latched in as the row address
CAS_L goes low: pins A are latched in as the column address
RAS/CAS are edge-sensitive
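The address multiplexing above can be sketched in a few lines. Assuming the 256K x 8 part is organized as a 512 x 512 array (9 row bits + 9 column bits, matching its 9 address pins):

```python
# Sketch: splitting an 18-bit linear address into the row half (latched
# when RAS_L falls) and the column half (latched when CAS_L falls).
ROW_BITS = COL_BITS = 9  # 2^9 x 2^9 = 256K locations

def split_address(addr: int) -> tuple:
    """Split an 18-bit address into (row, column) for the multiplexed pins."""
    assert 0 <= addr < 1 << (ROW_BITS + COL_BITS)
    row = addr >> COL_BITS               # high 9 bits: row address
    col = addr & ((1 << COL_BITS) - 1)   # low 9 bits: column address
    return row, col

print(split_address(0))  # -> (0, 0)
```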

Main Memory Performance

Simple: CPU, cache, bus, and memory all the same width (e.g., 32 bits)

Wide: CPU/mux 1 word; mux/cache, bus, memory N words (Alpha: 64 bits & 256 bits)

Interleaved: CPU, cache, bus 1 word; memory N modules (e.g., 4 modules); the example is word interleaved

Main Memory Performance


[Timing diagram: access time runs from the start of an access until the data is available; cycle time extends beyond it until the next access can begin.]

DRAM (read/write) cycle time >> DRAM (read/write) access time (roughly 2:1; why?)

DRAM (read/write) cycle time: how frequently can you initiate an access?
Analogy: a little kid can only ask his father for money on Saturday

DRAM (read/write) access time: how quickly do you get what you want once you initiate an access?
Analogy: as soon as he asks, his father gives him the money

DRAM bandwidth limitation analogy: what happens if he runs out of money on Wednesday?

Increasing Bandwidth - Interleaving


Access pattern without interleaving: the CPU starts the access for D1, waits until D1 is available, and only then starts the access for D2.

Access pattern with 4-way interleaving: the CPU accesses banks 0, 1, 2, and 3 in successive cycles; by the time bank 3 has been accessed, bank 0 can be accessed again.

Main Memory Performance


(Figure 5.32, pg. 432)

How long does it take to send 4 words? Timing model:
1 cycle to send the address, 6 cycles to access the data, 1 cycle to send the data
Cache block is 4 words

Simple memory = 4 x (1 + 6 + 1) = 32 cycles
Wide memory = 1 + 6 + 1 = 8 cycles
Interleaved memory = 1 + 6 + 4 x 1 = 11 cycles

Word-interleaved address mapping:
Bank 0: addresses 0, 4, 8, 12
Bank 1: addresses 1, 5, 9, 13
Bank 2: addresses 2, 6, 10, 14
Bank 3: addresses 3, 7, 11, 15
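The timing model above can be reproduced in a few lines, using the slide's parameters (1-cycle address, 6-cycle access, 1-cycle transfer, 4-word block):

```python
# Sketch of the slide's miss-penalty timing model for a 4-word cache block.
ADDR, ACCESS, XFER, WORDS = 1, 6, 1, 4

simple = WORDS * (ADDR + ACCESS + XFER)      # one narrow access per word
wide = ADDR + ACCESS + XFER                  # all 4 words in one wide access
interleaved = ADDR + ACCESS + WORDS * XFER   # accesses overlap; transfers serialize

print(simple, wide, interleaved)  # -> 32 8 11
```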

Wider vs. Interleaved Memory


Wider memory:
Cost for the wider connection
Need a multiplexor between cache and CPU
May lead to significantly larger/more expensive memory

Interleaved memory:
Send the address to several banks simultaneously
Want number of banks > number of clocks to access a bank
As the size of memory chips increases, it becomes difficult to interleave memory
Difficult to expand memory by small amounts
May not work well for non-sequential accesses

Avoiding Bank Conflicts


Lots of banks

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

Even with 128 banks there are conflicts, since 512 is an even multiple of 128
SW: loop interchange, or declaring the array size not a power of 2
HW: prime number of banks
bank number = address mod number of banks
address within bank = address / number of banks
Modulo & divide per memory access?
address within bank = address mod number of words in a bank (e.g., 3, 7, 31 banks)
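The conflict in the loop above can be demonstrated numerically. Walking down one column of `x[256][512]` touches word addresses with stride 512; a toy sketch (not from the slides), assuming word addressing and bank = address mod banks:

```python
# With 128 banks (a power of 2), a stride of 512 words hits one bank.
N_BANKS = 128
STRIDE = 512  # row length in words, so column walks use this stride

banks_pow2 = {(i * STRIDE) % N_BANKS for i in range(256)}
print(len(banks_pow2))  # -> 1  (every access lands in the same bank)

# With a prime number of banks, the same stride spreads across all banks.
PRIME_BANKS = 127
banks_prime = {(i * STRIDE) % PRIME_BANKS for i in range(256)}
print(len(banks_prime))  # -> 127
```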

Page Mode DRAM: Motivation


Regular DRAM organization:
N rows x N columns x M bits
Read & write M bits at a time
Each M-bit access requires a full RAS/CAS cycle
[Timing diagram: for each M-bit access, RAS_L and CAS_L toggle and a row address then a column address are latched from pins A; the 1st and 2nd M-bit accesses each repeat the full sequence.]

Fast Page Mode DRAM: adds an N x M register to save a row

Fast Page Mode Operation

Fast Page Mode DRAM: N x M SRAM to save a row

After a row is read into the register:
Only CAS is needed to access other M-bit blocks on that row
RAS_L remains asserted while CAS_L is toggled
[Timing diagram: RAS_L falls once with the row address; CAS_L then toggles four times with four column addresses, producing the 1st through 4th M-bit outputs.]

Fast Memory Systems: DRAM specific


Multiple column accesses: page mode
DRAM buffers a row of bits inside the chip for column access (e.g., a 16 Kbit row for a 64 Mbit DRAM)
Allows repeated column accesses without another row access
64 Mbit DRAM: cycle time = 90 ns, optimized (page mode) access = 25 ns
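A back-of-the-envelope sketch using the slide's 64 Mbit DRAM numbers shows the payoff: a full RAS/CAS cycle costs 90 ns, but a column access within an already-open row costs only 25 ns.

```python
# Sketch: time to read n sequential words from one DRAM row,
# with and without page mode (slide's 64 Mbit DRAM timings).
ROW_CYCLE_NS = 90   # full RAS + CAS access
PAGE_HIT_NS = 25    # CAS-only access within the open row

def read_time(n_words: int, page_mode: bool) -> int:
    """Nanoseconds to read n words that all fall in one row."""
    if page_mode:
        return ROW_CYCLE_NS + (n_words - 1) * PAGE_HIT_NS
    return n_words * ROW_CYCLE_NS

print(read_time(4, page_mode=False))  # -> 360
print(read_time(4, page_mode=True))   # -> 165
```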

New DRAMs: what will they cost, will they survive?


Synchronous DRAM: provide a clock signal to the DRAM; transfers are synchronous to the system clock

RAMBUS: startup company; reinvented the DRAM interface
Each chip is a module vs. a slice of memory
Short bus between CPU and chips
Does its own refresh
Variable amount of data returned
1 byte / 2 ns (500 MB/s per chip)

Niche memory or main memory?
e.g., Video RAM for frame buffers: DRAM + fast serial output

Main Memory Summary


Wider memory
Interleaved memory: for sequential or independent accesses
Avoiding bank conflicts: SW & HW
DRAM-specific optimizations: page mode & specialty DRAMs
