Today's Topics:
SRAM Memory Technology
DRAM Memory Technology
Memory Organization
Lec17.2
Technology Trends

DRAM
Year    Size     Cycle Time
1980    64 Kb    250 ns
1983    256 Kb   220 ns
1986    1 Mb     190 ns
1989    4 Mb     165 ns
1992    16 Mb    145 ns
1995    64 Mb    120 ns

Capacity grew 1000:1 over this period; cycle time improved only 2:1.
Lec17.3
Who Cares About the Memory Hierarchy?

[Chart: performance vs. time, 1980-2000. Processor performance ("Moore's
Law") grows 60%/yr (2X/1.5 yr); DRAM performance grows 9%/yr (2X/10 yrs).
The resulting processor-memory performance gap grows 50% per year.]
Lec17.4
Today's Situation: Microprocessor
Lec17.5
Impact on Performance

Suppose a processor executes with:
  Clock Rate = 200 MHz (5 ns per cycle)
  Ideal CPI = 1.1
  50% arith/logic, 30% ld/st, 20% control

[Pie chart: contributions to CPI, including Ideal CPI (1.1) at 35% and
Inst Miss (0.5) at 16%.]
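The figures above combine into a quick effective-CPI calculation. The ideal CPI (1.1) and clock rate come from the slide; the miss rates and penalty below are illustrative assumptions, not the lecture's numbers:

```python
# Effective CPI = ideal CPI + memory-stall cycles per instruction.
ideal_cpi = 1.1
clock_ns = 5.0              # 200 MHz => 5 ns per cycle (from the slide)

# Assumed memory behavior (hypothetical values for illustration):
inst_miss_rate = 0.005      # misses per instruction fetch
data_refs_per_inst = 0.3    # 30% of instructions are loads/stores
data_miss_rate = 0.05
miss_penalty = 50           # cycles per miss

stall_cpi = (1.0 * inst_miss_rate + data_refs_per_inst * data_miss_rate) * miss_penalty
effective_cpi = ideal_cpi + stall_cpi
print(effective_cpi)  # 1.1 ideal + 1.0 stall = 2.1
```

Even modest miss rates nearly double the CPI, which is why the memory hierarchy matters.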
Lec17.6
The Goal: illusion of large, fast, cheap memory
Lec17.7
An Expanded View of the Memory System

[Figure: Processor (Control + Datapath) connected to a chain of
progressively larger Memory units.]
Lec17.8
Why hierarchy works

The Principle of Locality:
Programs access a relatively small portion of the address space at
any instant of time.

[Figure: probability of reference vs. address, concentrated in a small
part of the address space.]
Lec17.9
Memory Hierarchy: How Does it Work?

[Figure: the Processor exchanges data with an Upper Level Memory holding
Blk X; blocks such as Blk Y are transferred between the Upper Level and
the Lower Level Memory.]
Lec17.10
Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level
(example: Block X)
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: time to access the upper level, which consists of
RAM access time + time to determine hit/miss
Lec17.11
Memory Hierarchy of a Modern Computer System

By taking advantage of the principle of locality:
  Present the user with as much memory as is available in the
  cheapest technology.
  Provide access at the speed offered by the fastest technology.

[Figure: Processor (Control, Datapath, Registers, On-Chip Cache) ->
Second Level Cache -> Main Memory -> Secondary Storage (Disk) ->
Tertiary Storage (Tape).]
Lec17.12
How is the hierarchy managed?
Lec17.13
Memory Hierarchy Technology
Random Access:
Random is good: access time is the same for all locations
DRAM: Dynamic Random Access Memory
- High density, low power, cheap, slow
- Dynamic: needs to be refreshed regularly
SRAM: Static Random Access Memory
- Low density, high power, expensive, fast
- Static: content lasts "forever" (until power is lost)
Lec17.15
Random Access Memory (RAM) Technology
Lec17.16
Static RAM Cell

[Figure: 6-Transistor SRAM cell. Cross-coupled inverters hold 0/1 and are
accessed via the word (row select) line and the bit / bit-bar column
lines; the pull-up transistors may be replaced with pullups to save area.]

Write:
1. Drive bit lines (bit = 1, bit-bar = 0)
2. Select row
Read:
1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure equal!
2. Select row
3. Cell pulls one line low
4. Sense amp on column detects difference between bit and bit-bar
Lec17.17
Typical SRAM Organization: 16-word x 4-bit

[Figure: a 16 x 4 array of SRAM cells. Address bits A0-A3 feed an address
decoder that drives word lines Word 0 through Word 15; data inputs
Din 3 - Din 0 enter through write circuitry (WrEn, Precharge); each
column's bit lines feed a sense amp producing Dout 3 - Dout 0.]

Q: Which is longer: the word line or the bit line?
Lec17.18
Logic Diagram of a Typical SRAM

[Figure: N address lines A into a 2^N-word x M-bit SRAM with WE_L and OE_L
controls and M data lines D. Timing: write setup and hold times around
WE_L; read access time measured from address and OE_L.]
Lec17.20
Problems with SRAM

Six transistors use up a lot of area.
Consider a zero stored in the cell (Select = 1, bit = 1, bit-bar = 0):
  Transistor N1 will try to pull bit to 0
  Transistor P2 will try to pull bit-bar to 1
[Figure: cell transistors P1/P2 (off/on) and N1/N2 (on/off) fighting the
bit lines during an access.]
1-Transistor Memory Cell (DRAM)

[Figure: a single transistor, gated by the row select line, connects the
bit line to a storage capacitor.]

Write:
1. Drive bit line
2. Select row
Read:
1. Precharge bit line to Vdd
2. Select row
3. Cell and bit line share charge
   - Very small voltage changes on the bit line
4. Sense (fancy sense amp)
   - Can detect changes of ~1 million electrons
5. Write: restore the value
Refresh:
1. Just do a dummy read to every cell.
Lec17.22
Classical DRAM Organization (square)

[Figure: a square RAM cell array. A row decoder drives the word (row)
select lines; the bit (data) lines run down the columns to a column
decoder and the sense amps & I/O (D). Each intersection represents a
1-T DRAM cell.]
Lec17.24
DRAM Physical Organization (4 Mbit)

[Figure: four blocks (Block 0 through Block 3), each with a 9:512 block
row decoder. The row address selects a row in every block; the column
address selects among the I/Os; each access transfers 8 I/Os (D/Q).]
Lec17.25
Memory Systems

[Figure: a Memory Controller and Memory Timing Controller drive an array
of 2^n x 1 DRAM chips over an n-bit (n/2 multiplexed) address bus and bus
drivers, delivering a w-bit word.]
Lec17.26
Logic Diagram of a Typical DRAM

[Figure: a 256K x 8 DRAM with a 9-bit multiplexed address bus A and an
8-bit data bus D.]
Lec17.28
DRAM Performance
Lec17.29
DRAM Write Timing

Every DRAM access begins at the assertion of RAS_L.
2 ways to write: early or late v. CAS.

[Timing diagram: 256K x 8 DRAM (9 address bits, 8 data bits), control
signals RAS_L, CAS_L, WE_L, OE_L. Within one DRAM write cycle, the row
address is latched on RAS_L and the column address on CAS_L.
Early write cycle: WE_L asserted before CAS_L.
Late write cycle: WE_L asserted after CAS_L.]
Lec17.30
DRAM Read Timing

Every DRAM access begins at the assertion of RAS_L.
2 ways to read: early or late v. CAS.

[Timing diagram: 256K x 8 DRAM, control signals RAS_L, CAS_L, WE_L, OE_L.
The row address is latched on RAS_L and the column address on CAS_L.
Early read cycle: OE_L asserted before CAS_L.
Late read cycle: OE_L asserted after CAS_L.]
Lec17.31
Main Memory Performance

Simple:
  CPU, Cache, Bus, Memory all the same width (32 bits)
Wide:
  CPU/Mux 1 word; Mux/Cache, Bus, Memory N words
  (Alpha: 64 bits & 256 bits)
Interleaved:
  CPU, Cache, Bus 1 word; Memory N modules
  (4 modules in this example of word-interleaved memory)
Lec17.32
Main Memory Performance
Cycle Time
Lec17.33
Increasing Bandwidth - Interleaving

Access Pattern without Interleaving:
[Timing: the CPU starts the access for D1 in a single memory and must wait
until D1 is available before starting the access for D2.]

Access Pattern with 4-way Interleaving:
[Timing: the CPU accesses Memory Bank 0, Bank 1, Bank 2, and Bank 3 in
successive cycles; by the time Bank 3 has been started, Bank 0 can be
accessed again.]
Lec17.34
Main Memory Performance

Timing model:
  1 cycle to send the address,
  4 cycles for access time, 10-cycle cycle time, 1 cycle to send data
Cache block is 4 words:
  Simple: Miss Penalty = 4 x (1 + 10 + 1) = 48 cycles
  Wide: Miss Penalty = 1 + 10 + 1 = 12 cycles
  Interleaved: Miss Penalty = 1 + 10 + 1 + 3 = 15 cycles
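The three miss penalties above can be reproduced with a tiny calculator, a sketch using only the slide's timing model (names are mine):

```python
# Slide's timing model: 1 cycle to send the address, 10-cycle DRAM cycle
# time per access, 1 cycle to return data, 4-word cache blocks.
ADDR, CYCLE, XFER, WORDS = 1, 10, 1, 4

simple      = WORDS * (ADDR + CYCLE + XFER)      # one full access per word
wide        = ADDR + CYCLE + XFER                # whole block in one access
interleaved = ADDR + CYCLE + XFER + (WORDS - 1)  # accesses overlap; one word
                                                 # returns per cycle afterwards
print(simple, wide, interleaved)  # 48 12 15
```

Interleaving gets most of the benefit of a wide memory while keeping a one-word bus.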
Lec17.35
Independent Memory Banks
Lec17.36
Fewer DRAMs/System over Time
(from Pete MacWilliams, Intel)

Minimum PC              DRAM Generation
Memory Size   86:1Mb  89:4Mb  92:16Mb  96:64Mb  99:256Mb  02:1Gb
  4 MB          32       8
  8 MB                  16        4
  16 MB                           8        2
  32 MB                           4        1
  64 MB                                    8        2
  128 MB                                   4        1
  256 MB                                            8        2

Memory per DRAM grows @ 60% / year;
memory per system grows @ 25%-30% / year.
Lec17.37
Page Mode DRAM: Motivation

Regular DRAM Organization:
  N rows x N columns x M-bit
  Read & write M bits at a time
  Each M-bit access requires a RAS / CAS cycle
Fast Page Mode DRAM:
  N x M register to save a row

[Figure: N-row x N-column DRAM array with row address and column address
inputs and an M-bit output. Timing: each access repeats Row Address /
Col Address on the multiplexed address bus, with CAS_L strobed per access.]
Lec17.38
Fast Page Mode Operation

Fast Page Mode DRAM:
  N x M SRAM to save a row
  Only CAS is needed to access other M-bit blocks on that row
  RAS_L remains asserted while CAS_L is toggled

[Figure: the selected row is latched into an N x M SRAM; successive column
addresses read different M-bit blocks. Timing: one Row Address followed by
a sequence of Col Address strobes on CAS_L.]
Lec17.39
DRAM vs. Desktop Microprocessor Cultures
Lec17.40
DRAM Design Goals
Lec17.41
DRAM History
Lec17.42
Today's Situation: DRAM
Lec17.43
Today's Situation: DRAM
[Chart: DRAM market revenue per quarter (millions of $), 1Q94 through
1Q97; annotations mark $16B and $7B.]
Lec17.45
Summary: Processor-Memory Performance Gap Tax
Lec17.46
Recall: Levels of the Memory Hierarchy

(Upper level: faster; lower level: larger. The unit transferred between
adjacent levels is shown between them.)

Processor
  | Instr. Operands
Cache
  | Blocks
Memory
  | Pages
Disk
  | Files
Tape
Lec17.47
Cache performance equations:
memory stalls/instruction =
  Average accesses/inst x Miss Rate x Miss Penalty =
  (Average IFETCH/inst x MissRate_Inst x MissPenalty_Inst) +
  (Average Data/inst x MissRate_Data x MissPenalty_Data)
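The equation above, written out as a function. The example rates and penalties passed in are assumptions for illustration, not the lecture's:

```python
# Memory stalls per instruction, split into instruction-fetch and data
# components, exactly as in the cache performance equation.
def memory_stalls_per_inst(ifetch_per_inst, mr_inst, mp_inst,
                           data_per_inst, mr_data, mp_data):
    return (ifetch_per_inst * mr_inst * mp_inst +
            data_per_inst * mr_data * mp_data)

# e.g. 1 fetch/inst at a 2% miss rate, 0.3 data refs/inst at a 5% miss
# rate, and a 40-cycle miss penalty for both (assumed numbers):
print(memory_stalls_per_inst(1.0, 0.02, 40, 0.3, 0.05, 40))  # 1.4
```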
Lec17.48
Impact on Performance
Processor
reference stream
<op,addr>, <op,addr>,<op,addr>,<op,addr>, . . .
Lec17.50
Example: 1 KB Direct Mapped Cache with 32 B Blocks

For a 2^N byte cache:
  The uppermost (32 - N) bits are always the Cache Tag
  The lowest M bits are the Byte Select (Block Size = 2^M)

[Figure: the 32-bit block address is split at bits 9 and 4 into the Cache
Tag (ex: 0x50, stored as part of the cache state), the Cache Index
(ex: 0x01), and the Byte Select (ex: 0x00). The 32-row data array holds
Byte 0 through Byte 1023; row 1 holds Byte 32 through Byte 63.]
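The address split described above can be sketched in code. The bit widths follow the slide (5 byte-select bits and 5 index bits for this 1 KB cache with 32 B blocks); the function name is mine:

```python
# Split a 32-bit address into (tag, index, byte offset) for a direct
# mapped cache with 2**index_bits lines of 2**offset_bits bytes each.
def split_address(addr, index_bits=5, offset_bits=5):
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Reconstruct the slide's example: tag 0x50, index 0x01, byte select 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print([hex(x) for x in split_address(addr)])  # ['0x50', '0x1', '0x0']
```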
Lec17.51
Block Size Tradeoff
Lec17.52
Extreme Example: single line

[Figure: a one-line cache: Valid Bit, Cache Tag, and one 4-byte block of
Cache Data (Byte 3 through Byte 0).]
Lec17.53
Another Extreme Example: Fully Associative

Fully Associative Cache:
  Forget about the Cache Index
  Compare the Cache Tags of all cache entries in parallel
  Example: with 32 B blocks, we need N 27-bit comparators

[Figure: each entry's 27-bit tag is compared (X) in parallel against the
address tag; the data array rows hold Byte 32 through Byte 63, and so on.]
Lec17.54
Set Associative Cache

N-way set associative: N entries for each Cache Index
  N direct mapped caches operate in parallel

[Figure: two direct-mapped banks indexed by the Cache Index. The address
tag (Adr Tag) is compared against both banks' tags; the compare results
(Sel1/Sel0) drive a mux that selects the Cache Block, and their OR
produces Hit.]
Lec17.55
Disadvantage of Set Associative Cache

N-way Set Associative Cache versus Direct Mapped Cache:
  N comparators vs. 1
  Extra MUX delay for the data
  Data comes AFTER the Hit/Miss decision and set selection
In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  Possible to assume a hit and continue; recover later if miss.

[Figure: 2-way set associative cache. Two banks of Valid / Cache Tag /
Cache Data are indexed by the Cache Index; parallel tag compares select
the Cache Block via a mux, and their OR forms Hit.]
Lec17.56
A Summary on Sources of Cache Misses

Compulsory (cold start or process migration, first reference):
  first access to a block
  Cold fact of life: not a whole lot you can do about it
  Note: if you are going to run billions of instructions,
  compulsory misses are insignificant
Conflict (collision):
  Multiple memory locations mapped to the same cache location
  Solution 1: increase cache size
  Solution 2: increase associativity
Capacity:
  Cache cannot contain all blocks accessed by the program
  Solution: increase cache size

Miss types: Compulsory, Conflict, Capacity (and Coherence misses in
multiprocessor systems).
Lec17.58
Sources of Cache Misses: Answer

Note:
If you are going to run billions of instructions, compulsory misses are insignificant.
Lec17.59
Four Questions for Caches and Memory
Hierarchy
Q1: Where can a block be placed in the upper level?
(Block placement)
Q2: How is a block found if it is in the upper level?
(Block identification)
Q3: Which block should be replaced on a miss?
(Block replacement)
Q4: What happens on a write?
(Write strategy)
Lec17.60
Q1: Where can a block be placed in the upper level?

Block 12 placed in an 8-block cache:
  Fully associative, direct mapped, 2-way set associative
  S.A. Mapping = Block Number modulo Number of Sets

[Figure: memory blocks 0-31. Block 12 may go anywhere in a fully
associative cache, only in block 4 (12 mod 8) when direct mapped, or
anywhere in set 0 (12 mod 4) when 2-way set associative.]
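A minimal sketch of the placement rule above; the helper name is mine, not the lecture's:

```python
# Which cache blocks are eligible to hold a given memory block, for an
# n_blocks cache of the given associativity (assoc=1: direct mapped,
# assoc=n_blocks: fully associative).
def placement(block_no, n_blocks=8, assoc=1):
    n_sets = n_blocks // assoc
    s = block_no % n_sets                          # set index
    return list(range(s * assoc, (s + 1) * assoc)) # blocks in that set

print(placement(12, assoc=1))  # direct mapped: [4]
print(placement(12, assoc=2))  # 2-way set associative: [0, 1] (set 0)
print(placement(12, assoc=8))  # fully associative: all 8 blocks
```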
Lec17.61
Q2: How is a block found if it is in the upper level?
Lec17.62
Q3: Which block should be replaced on a miss?
Lec17.63
Q4: What happens on a write?
Write through: the information is written to both the block in the cache
and the block in the lower-level memory.
Write back: the information is written only to the block in the cache.
The modified cache block is written to main memory only when it is
replaced.
  Is the block clean or dirty?
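A toy model contrasting the two policies: it counts lower-level memory writes for a run of stores that all hit the same cached block. An illustration under those assumptions, not the lecture's code:

```python
# Count writes reaching memory for `stores` hits to one cached block.
def mem_writes(stores, policy):
    dirty, writes = False, 0
    for _ in range(stores):
        if policy == "through":
            writes += 1          # write through: every store also goes to memory
        else:                    # write back: just mark the block dirty
            dirty = True
    if policy == "back" and dirty:
        writes += 1              # one write-back when the block is replaced
    return writes

print(mem_writes(10, "through"), mem_writes(10, "back"))  # 10 1
```

This is why write through needs a write buffer, while write back needs a dirty bit.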
Lec17.64
Write Buffer for Write Through

[Figure: Processor -> Cache, with a Write Buffer between the Processor and
DRAM absorbing the stores.]
Lec17.65
Write Buffer Saturation

[Figure: Processor -> Cache / Write Buffer -> DRAM.]

Store frequency (w.r.t. time) > 1 / DRAM write cycle:
If this condition exists for a long period of time (CPU cycle time too
fast and/or too many store instructions in a row):
- The store buffer will overflow no matter how big you make it
- Effectively, CPU cycle time <= DRAM write cycle time

Solution for write buffer saturation:
[Figure: insert a second-level (L2) cache between the write buffer and
DRAM.]
Lec17.66
Write-miss Policy: Write Allocate versus Not Allocate

Assume: a 16-bit write to memory location 0x0 causes a miss.
Do we read in the block?
- Yes: Write Allocate
- No: Write Not Allocate

[Figure: the 1 KB direct-mapped cache from before. The address is split at
bits 9 and 4 into Cache Tag (ex: 0x00), Cache Index (ex: 0x00), and Byte
Select (ex: 0x00) over the 32-row data array holding Byte 0 through
Byte 1023.]
Lec17.67
Impact of Memory Hierarchy on Algorithms

[Charts: instructions/key, time, and cache misses for radix sort vs.
quick sort as the number of keys grows.]
Lec17.72
Impact on Cycle Time

Cache Hit Time:
  directly tied to clock rate
  increases with cache size
  increases with associativity

Average Memory Access Time (AMAT) =
  Hit Time + Miss Rate x Miss Penalty

Compute Time = IC x CT x (ideal CPI + memory stalls)

[Figure: pipeline datapath (PC, I-Cache, IR, IRex, IRm, IRwb, D-Cache)
showing where I-cache and D-cache misses stall the machine.]
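The AMAT formula above as a one-liner; the hit time, miss rate, and penalty plugged in below are illustrative assumptions:

```python
# Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Assumed: 1-cycle hit, 5% miss rate, 40-cycle miss penalty.
print(amat(1, 0.05, 40))  # 3.0 cycles
```

Note how a 5% miss rate triples the average access time, even though 95% of accesses hit in one cycle.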
Lec17.73
What happens on a Cache miss?

For an in-order pipeline, 2 options:
  Freeze pipeline in Mem stage (popular early on: Sparc, R4000)
    IF ID EX Mem stall stall stall stall Mem Wr
    IF ID EX stall stall stall stall stall Ex Wr

memory stalls/instruction =
  Average memory accesses/inst x AMAT =
  (Average IFETCH/inst x AMAT_Inst) +
  (Average DMEM/inst x AMAT_Data)
Lec17.75
3Cs Absolute Miss Rate (SPEC92)

[Chart: miss rate per type vs. cache size (1-128 KB) for 1-way through
8-way associativity. Conflict misses shrink as associativity rises,
capacity misses dominate at small sizes, and compulsory misses are
vanishingly small.]
Lec17.76
2:1 Cache Rule

[Chart: the same miss-rate data (2-way through 8-way, capacity,
compulsory) vs. cache size (1-128 KB). It illustrates the 2:1 cache rule:
a direct-mapped cache of size N has about the same miss rate as a 2-way
set-associative cache of size N/2.]
Lec17.77
3Cs Relative Miss Rate

[Chart: miss rate per type as a percentage of all misses vs. cache size
(1-128 KB), for 1-way through 8-way associativity: the conflict share
shrinks with associativity, capacity dominates, and compulsory is a thin
sliver.]

Flaws: for fixed block size
Good: insight => invention
Lec17.78
1. Reduce Misses via Larger Block Size

[Chart: miss rate vs. block size (16 B to 256 B) for cache sizes 1K to
256K. Larger blocks first reduce the miss rate, then increase it again in
the smaller caches.]
Lec17.79
2. Reduce Misses via Higher Associativity
Lec17.80
Example: Avg. Memory Access Time vs. Miss Rate

Example: assume CCT (clock cycle time) = 1.10 for 2-way, 1.12 for 4-way,
1.14 for 8-way vs. the CCT of direct mapped.

Cache Size        Associativity
(KB)       1-way   2-way   4-way   8-way
1          2.33    2.15    2.07    2.01
2          1.98    1.86    1.76    1.68
4          1.72    1.67    1.61    1.53
8          1.46    1.48    1.47    1.43
16         1.29    1.32    1.32    1.32
32         1.20    1.24    1.25    1.27
64         1.14    1.20    1.21    1.23
128        1.10    1.17    1.18    1.20
Lec17.81
3. Reducing Misses via a Victim Cache
Lec17.82
4. Reducing Misses via Pseudo-Associativity

How to combine the fast hit time of direct mapped with the lower conflict
misses of a 2-way SA cache?
Divide the cache: on a miss, check the other half of the cache to see if
the block is there; if so, it is a pseudo-hit (slow hit).

[Figure: hit time split into a fast hit and a slower pseudo-hit.]
Lec17.83
5. Reducing Misses by Hardware
Prefetching
E.g., Instruction Prefetching
Alpha 21064 fetches 2 blocks on a miss
Extra block placed in stream buffer
On miss check stream buffer
Lec17.84
6. Reducing Misses by Software Prefetching Data
Data Prefetch
Load data into register (HP PA-RISC loads)
Cache Prefetch: load into cache
(MIPS IV, PowerPC, SPARC v. 9)
Special prefetching instructions cannot cause faults;
a form of speculative execution
Lec17.85
7. Reducing Misses by Compiler Optimizations

McFarling [1989] reduced cache misses by 75%
on an 8KB direct mapped cache with 4 byte blocks, in software

Instructions
  Reorder procedures in memory so as to reduce conflict misses
  Profiling to look at conflicts (using tools they developed)
Data
  Merging Arrays: improve spatial locality with a single array of compound
  elements vs. 2 arrays
  Loop Interchange: change nesting of loops to access data in the order it
  is stored in memory
  Loop Fusion: combine 2 independent loops that have the same looping and
  some variables in common
  Blocking: improve temporal locality by accessing blocks of data
  repeatedly vs. going down whole columns or rows
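Loop interchange from the list above, sketched in Python. The cache effect depends on the language and data layout; the point here is the access order over a row-major 2-D array (the array and size are my example, not the lecture's):

```python
# A small row-major 2-D array: a[i][j] is stored row after row.
N = 4
a = [[i * N + j for j in range(N)] for i in range(N)]

# Before interchange: j varies in the OUTER loop, so consecutive accesses
# stride down a column, touching a different row each time.
col_major = sum(a[i][j] for j in range(N) for i in range(N))

# After interchange: i outer, j inner, so accesses walk each row
# sequentially, matching the storage order (better spatial locality).
row_major = sum(a[i][j] for i in range(N) for j in range(N))

assert col_major == row_major   # the transformation preserves the result
print(row_major)                # 120 for N = 4 (sum of 0..15)
```

In a compiled language like C this reordering turns strided memory traffic into sequential traffic, which is where the miss-rate reduction comes from.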
Lec17.86
Summary #1 / 3:

The Principle of Locality:
Programs are likely to access a relatively small portion of the address
space at any instant of time.
- Temporal Locality: Locality in Time
- Spatial Locality: Locality in Space
Lec17.88
Summary #3 / 3: Cache Miss Optimization

Lots of techniques people use to improve the miss rate of caches:

Technique    MR    MP    HT    Complexity
Lec17.89