
Computer Architecture and Organization

Lecture 14: Cache Memory Organization

Majid Khabbazian mkhabbazian@ualberta.ca


Electrical and Computer Engineering University of Alberta

April 9, 2013

Recap

General Cache Organization (S, E, B)

[Figure: the cache as a two-dimensional array of sets and lines]

S = 2^s sets
E = 2^e lines per set
B = 2^b bytes per cache block (the data)
Each line holds a valid bit, a tag, and B data bytes (bytes 0, 1, ..., B-1)

Cache size: C = S x E x B data bytes
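For example (illustrative numbers, not from the slides): a cache with S = 128 sets, E = 8 lines per set, and B = 64-byte blocks holds C = 128 x 8 x 64 = 65536 bytes = 64 KB of data.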

Cache Read

Locate the set using the set index bits
Check if any line in the set has a matching tag
Yes + line valid: hit
Locate the data starting at the block offset

Address of word: t tag bits | s set index bits | b block offset bits

[Figure: S = 2^s sets, E = 2^e lines per set; each line holds a valid bit, a tag, and B = 2^b data bytes (bytes 0, 1, ..., B-1); the data begins at the block offset within the matching line]
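As an illustrative sketch (not from the lecture), the following C fragment shows how an address splits into tag, set index, and block offset; the parameter values s = 6 and b = 6 are assumptions chosen for the example:

#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters (assumed, not from the slides):
   s = 6 set-index bits, b = 6 block-offset bits. */
enum { S_BITS = 6, B_BITS = 6 };

int main(void) {
    uint64_t addr = 0x7ffe1234;

    uint64_t offset = addr & ((1ULL << B_BITS) - 1);             /* low b bits     */
    uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1); /* next s bits    */
    uint64_t tag    = addr >> (B_BITS + S_BITS);                 /* remaining bits */

    printf("tag=%#llx set=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);
    return 0;
}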

2-Way Set Associative Cache Simulation

M = 16 byte addresses (4 address bits), B = 2 bytes/block, S = 2 sets, E = 2 blocks/set
Address bits: t=2 (xx) | s=1 (x) | b=1 (x)

Address trace (reads, one byte per read):
0 [0000₂]  miss
1 [0001₂]  hit
7 [0111₂]  miss
8 [1000₂]  miss
0 [0000₂]  hit

Final cache state:

Set 0:  v=1  Tag=00  Block=M[0-1]
        v=1  Tag=10  Block=M[8-9]
Set 1:  v=1  Tag=01  Block=M[6-7]
        v=0  Tag=?   Block=?
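A minimal C simulator of this cache, as a sketch (not from the lecture); it assumes LRU replacement, which the slide does not state but which reproduces the trace above:

#include <stdio.h>
#include <stdbool.h>

#define S 2  /* sets           */
#define E 2  /* lines per set  */

struct line { bool valid; unsigned tag; unsigned last_used; };
struct line cache[S][E];
unsigned tick;

bool access_byte(unsigned addr) {
    unsigned set = (addr >> 1) & 0x1;  /* b = 1 offset bit, s = 1 set bit */
    unsigned tag = addr >> 2;          /* t = 2 tag bits                  */

    for (int i = 0; i < E; i++)        /* search all lines in the set     */
        if (cache[set][i].valid && cache[set][i].tag == tag) {
            cache[set][i].last_used = ++tick;
            return true;               /* hit */
        }

    int victim = 0;                    /* miss: pick an empty or LRU line */
    for (int i = 0; i < E; i++) {
        if (!cache[set][i].valid) { victim = i; break; }
        if (cache[set][i].last_used < cache[set][victim].last_used)
            victim = i;
    }
    cache[set][victim] = (struct line){ true, tag, ++tick };
    return false;                      /* miss */
}

int main(void) {
    unsigned trace[] = { 0, 1, 7, 8, 0 };
    for (int i = 0; i < 5; i++)
        printf("addr %u: %s\n", trace[i],
               access_byte(trace[i]) ? "hit" : "miss");
    return 0;
}

Running it prints miss, hit, miss, miss, hit, matching the trace: address 8 maps to set 0 but lands in the second line, so address 0 still hits at the end.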

Fully Associative Caches

Cache is a single set
No set index bits in the address

If the capacity is C bytes and the block size is B bytes, what would be the number of lines?
E = C/B

E = C/B lines in the one and only set

[Figure: one set of E lines, each holding a tag and a cache block]
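For example (illustrative numbers, not from the slides): a fully associative cache with C = 4 KB capacity and B = 64-byte blocks has E = 4096/64 = 64 lines.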

Fully Associative Caches

Works similarly to set associative caches, but there is only one set
Check all lines in the set (in parallel)

Retrieve the word if a line is valid and its tag matches
Otherwise:
Choose an empty line to place the block
Or, evict a block if there are no empty lines
Load the block from memory (assuming only 1 cache)
Use the offset to get the word in the cached block
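A software sketch of the lookup step (the struct and field names are assumptions for this example): hardware compares all E tags in parallel, while this loop is the sequential equivalent.

#include <stdbool.h>

struct line { bool valid; unsigned tag; };

/* Fully associative lookup: the whole cache is one set of E lines.
   Returns the index of the matching line, or -1 on a miss. */
int lookup(const struct line cache[], unsigned E, unsigned tag) {
    for (unsigned i = 0; i < E; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return (int)i;   /* hit */
    return -1;               /* miss */
}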

Fully Associative Caches

Logic for searching the tags is slow and expensive
Only an option in caches at the lower end of the hierarchy
Too slow for L1 and L2 caches

L1 and L2 caches use either:
Direct mapped caches
2-way caches
4-way caches
8-way caches

What about writes?

What about writing to memory?
Recall the read procedure:
CPU requests a word from the cache
If the block containing the word is cached, it's a hit
Else it's a miss, and the cache fetches the block from the next level
The word is returned from the block once the block is cached

Caches and Memory Writes

Writes are more complicated
Scenario: CPU writes a word to memory
Either the block containing the word is in the cache, or it is not
Case 1: the block is in the cache (cache hit):
The block in the cache is updated with the word
Eventually memory has to be updated with the word
What does the cache do about updating the copy of the word in the next lower level of the hierarchy?

Caches and Memory Writes

Two options:

Write-through: immediately write the block to memory
Advantage: simplest to implement
Disadvantage: increases the number of bus transactions

Write-back: defer the block write until the block is evicted
Advantage: significantly reduces the number of bus transactions
Disadvantage: additional complexity:
The cache must maintain a dirty bit to keep track of which blocks must be written back when evicted
Loading the cache may take longer because eviction is more complex
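To see the difference in bus transactions: with write-through, a loop that writes the same cached word N times generates N writes to the next level; with write-back it generates at most one, when the dirty block is eventually evicted.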

Caches and Memory Writes

Scenario: CPU writes a word to memory
Case 2: the block is not in the cache (cache miss)
Should the block be loaded?

Caches and Memory Writes

Two options:

Write-allocate: load the block into the cache and update the block there
Exploits spatial locality of writes
Reduces the number of bus transactions
Generally done by write-back caches
Requires more cache hardware

No-write-allocate: send the update directly to the next level
The individual update is faster
But if the block containing the word is written again, more bus transactions will occur
Generally done by write-through caches
Takes less hardware
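For example, with no-write-allocate, storing to each byte of a 64-byte block (an illustrative block size) sends 64 separate updates to the next level; with write-allocate, the block is fetched once and the remaining stores hit in the cache.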

What about writes? (Summary)

Multiple copies of data exist:
L1, L2, Main Memory, Disk

What to do on a write-hit?
Write-through (write immediately to memory)
Write-back (defer write to memory until replacement of line)
Needs a dirty bit (is the line different from memory or not?)

What to do on a write-miss?
Write-allocate (load into cache, update line in cache)
Good if more writes to the location follow
No-write-allocate (write immediately to memory)

Typical pairings (see the sketch below):
Write-through + No-write-allocate
Write-back + Write-allocate
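As an illustrative sketch (not from the lecture), the following C fragment models the typical write-back + write-allocate pairing for a direct-mapped cache; the sizes, field names, and memory helpers are assumptions for this example:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define S 64                       /* sets (illustrative)            */
#define B 64                       /* bytes per block (illustrative) */

struct line {
    bool     valid, dirty;         /* dirty: modified since load     */
    uint64_t tag;
    uint8_t  data[B];
};

struct line cache[S];              /* direct-mapped: one line per set */
uint8_t MEM[1 << 20];              /* toy backing memory              */

static void mem_read_block(uint64_t addr, uint8_t *buf) {
    memcpy(buf, &MEM[addr], B);
}
static void mem_write_block(uint64_t addr, const uint8_t *buf) {
    memcpy(&MEM[addr], buf, B);
}

/* Write-back + write-allocate: on a write miss, fetch the block into
   the cache (allocate), then update it there and set the dirty bit.
   Memory is written only when a dirty block is evicted. */
void cache_write_byte(uint64_t addr, uint8_t value) {
    uint64_t offset = addr % B;
    uint64_t set    = (addr / B) % S;
    uint64_t tag    = (addr / B) / S;
    struct line *ln = &cache[set];

    if (!(ln->valid && ln->tag == tag)) {           /* write miss      */
        if (ln->valid && ln->dirty)                 /* write back      */
            mem_write_block((ln->tag * S + set) * B, ln->data);
        mem_read_block(addr - offset, ln->data);    /* write-allocate  */
        ln->valid = true;
        ln->dirty = false;
        ln->tag   = tag;
    }
    ln->data[offset] = value;                       /* update in cache */
    ln->dirty = true;                               /* defer the write */
}

A write-through + no-write-allocate cache would instead call mem_write_block on every store and skip the allocation on a miss, trading extra bus traffic for simpler hardware with no dirty bit.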

Types of Caches

Modern CPUs use separate caches for data and instructions:

d-cache: for program data
Should handle a wide variety of access patterns
Handles reads and writes

i-cache: for program instructions
Mainly needs to handle simple sequential accesses
Does not need to handle writes
Can be made simpler and faster than a d-cache

Unified cache: a single cache is used for both instructions and data

i-cache and d-cache

Why use separate i-caches and d-caches?
The processor can read an instruction word and a data word at the same time
i-caches are typically read-only (simpler)
Each cache is often optimized for its own access patterns
Different block sizes, associativities, and capacities

Intel Core i7 Cache Hierarchy

[Figure: processor package with Core 0 through Core 3; each core has its own registers, L1 d-cache, L1 i-cache, and L2 unified cache; the L3 unified cache is shared by all cores and sits above main memory]

L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
L2 unified cache: 256 KB, 8-way, access: 11 cycles
L3 unified cache (shared by all cores): 8 MB, 16-way, access: 30-40 cycles
Block size: 64 bytes for all caches
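As a check on the earlier formula (illustrative arithmetic, not from the slides): the 32 KB, 8-way L1 with 64-byte blocks has S = C/(E x B) = 32768/(8 x 64) = 64 sets, so s = 6 set-index bits and b = 6 offset bits.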

Writing Cache Friendly Code

Make the common case go fast
Focus on the inner loops of the core functions

Minimize the misses in the inner loops
Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality); see the example below

Key idea: our qualitative notion of locality is quantified through our understanding of cache memories.
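A standard illustration of stride-1 versus strided access (a sketch, not taken from the lecture): C stores arrays in row-major order, so summing a matrix row by row follows the cache-friendly stride-1 pattern, while summing it column by column strides through memory and misses far more often.

#define N 1024

/* Stride-1 (cache friendly): consecutive accesses touch consecutive
   bytes, so each loaded block is fully used before it is evicted. */
long sum_rowwise(int a[N][N]) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-N (cache unfriendly): consecutive accesses are N*sizeof(int)
   bytes apart, so each access may land in a different block. */
long sum_colwise(int a[N][N]) {
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}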
