
Computer Architecture and Organization

Lecture 14: Cache Memory Organization

Majid Khabbazian mkhabbazian@ualberta.ca


Electrical and Computer Engineering University of Alberta

April 9, 2013

Recap

General Cache Organization (S, E, B)

[Figure: the cache as a two-dimensional array of sets and lines]

S = 2^s sets
E = 2^e lines per set
B = 2^b bytes per cache block (the data)
Each line holds a valid bit, a tag, and B data bytes (bytes 0, 1, ..., B-1)

Cache size: C = S x E x B data bytes
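For example (illustrative numbers, not from the slides): a cache with S = 128 sets, E = 8 lines per set, and B = 64-byte blocks holds C = 128 x 8 x 64 = 65536 bytes = 64 KB of data.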

Cache Read

Locate the set using the set index bits
Check if any line in the set has a matching tag
Yes + line valid: hit
Locate the data starting at the block offset

Address of word: t tag bits | s set index bits | b block offset bits

[Figure: S = 2^s sets, E = 2^e lines per set; each line holds a valid bit, a tag, and B = 2^b data bytes (bytes 0, 1, ..., B-1); the data begins at the block offset within the matching line]
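As an illustrative sketch (not from the lecture), the following C fragment shows how an address splits into tag, set index, and block offset; the parameter values s = 6 and b = 6 are assumptions chosen for the example:

#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters (assumed, not from the slides):
   s = 6 set-index bits, b = 6 block-offset bits. */
enum { S_BITS = 6, B_BITS = 6 };

int main(void) {
    uint64_t addr = 0x7ffe1234;

    uint64_t offset = addr & ((1ULL << B_BITS) - 1);             /* low b bits     */
    uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1); /* next s bits    */
    uint64_t tag    = addr >> (B_BITS + S_BITS);                 /* remaining bits */

    printf("tag=%#llx set=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);
    return 0;
}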

2-Way Set Associative Cache Simulation

M = 16 byte addresses (4 address bits), B = 2 bytes/block, S = 2 sets, E = 2 blocks/set
Address bits: t=2 (xx) | s=1 (x) | b=1 (x)

Address trace (reads, one byte per read):
0 [0000₂]  miss
1 [0001₂]  hit
7 [0111₂]  miss
8 [1000₂]  miss
0 [0000₂]  hit

Final cache state:

Set 0:  v=1  Tag=00  Block=M[0-1]
        v=1  Tag=10  Block=M[8-9]
Set 1:  v=1  Tag=01  Block=M[6-7]
        v=0  Tag=?   Block=?
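A minimal C simulator of this cache, as a sketch (not from the lecture); it assumes LRU replacement, which the slide does not state but which reproduces the trace above:

#include <stdio.h>
#include <stdbool.h>

#define S 2  /* sets           */
#define E 2  /* lines per set  */

struct line { bool valid; unsigned tag; unsigned last_used; };
struct line cache[S][E];
unsigned tick;

bool access_byte(unsigned addr) {
    unsigned set = (addr >> 1) & 0x1;  /* b = 1 offset bit, s = 1 set bit */
    unsigned tag = addr >> 2;          /* t = 2 tag bits                  */

    for (int i = 0; i < E; i++)        /* search all lines in the set     */
        if (cache[set][i].valid && cache[set][i].tag == tag) {
            cache[set][i].last_used = ++tick;
            return true;               /* hit */
        }

    int victim = 0;                    /* miss: pick an empty or LRU line */
    for (int i = 0; i < E; i++) {
        if (!cache[set][i].valid) { victim = i; break; }
        if (cache[set][i].last_used < cache[set][victim].last_used)
            victim = i;
    }
    cache[set][victim] = (struct line){ true, tag, ++tick };
    return false;                      /* miss */
}

int main(void) {
    unsigned trace[] = { 0, 1, 7, 8, 0 };
    for (int i = 0; i < 5; i++)
        printf("addr %u: %s\n", trace[i],
               access_byte(trace[i]) ? "hit" : "miss");
    return 0;
}

Running it prints miss, hit, miss, miss, hit, matching the trace: address 8 maps to set 0 but lands in the second line, so address 0 still hits at the end.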

Fully Associative Caches

Cache is a single set
No set index bits in the address

If the capacity is C bytes and the block size is B bytes, what would be the number of lines?
E = C/B

E = C/B lines in the one and only set

[Figure: one set of E lines, each holding a tag and a cache block]
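For example (illustrative numbers, not from the slides): a fully associative cache with C = 4 KB capacity and B = 64-byte blocks has E = 4096/64 = 64 lines.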

Fully Associative Caches

Works similarly to set associative caches, but there is only one set
Check all lines in the set (in parallel)

Retrieve the word if a line is valid and its tag matches
Otherwise:
Choose an empty line to place the block
Or, evict a block if there are no empty lines
Load the block from memory (assuming only 1 cache)
Use the offset to get the word in the cached block
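A software sketch of the lookup step (the struct and field names are assumptions for this example): hardware compares all E tags in parallel, while this loop is the sequential equivalent.

#include <stdbool.h>

struct line { bool valid; unsigned tag; };

/* Fully associative lookup: the whole cache is one set of E lines.
   Returns the index of the matching line, or -1 on a miss. */
int lookup(const struct line cache[], unsigned E, unsigned tag) {
    for (unsigned i = 0; i < E; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return (int)i;   /* hit */
    return -1;               /* miss */
}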

Fully Associative Caches

Logic for searching the tags is slow and expensive
Only an option in caches at the lower end of the hierarchy
Too slow for L1 and L2 caches

L1 and L2 caches use either:
Direct mapped caches
2-way caches
4-way caches
8-way caches

What about writes?

What about writing to memory?
Recall the read procedure:
CPU requests a word from the cache
If the block containing the word is cached, it's a hit
Else it's a miss, and the cache fetches the block from the next level
The word is returned from the block once the block is cached

Caches and Memory Writes

Writes are more complicated
Scenario: CPU writes a word to memory
Either the block containing the word is in the cache, or it is not
Case 1: the block is in the cache (cache hit):
The block in the cache is updated with the word
Eventually memory has to be updated with the word
What does the cache do about updating the copy of the word in the next lower level of the hierarchy?

Caches and Memory Writes

Two options:

Write-through: immediately write the block to memory
Advantage: simplest to implement
Disadvantage: increases the number of bus transactions

Write-back: defer the block write until the block is evicted
Advantage: significantly reduces the number of bus transactions
Disadvantage: additional complexity:
The cache must maintain a dirty bit to keep track of which blocks must be written back when evicted
Loading the cache may take longer because eviction is more complex
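To see the difference in bus transactions: with write-through, a loop that writes the same cached word N times generates N writes to the next level; with write-back it generates at most one, when the dirty block is eventually evicted.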

Caches and Memory Writes

Scenario: CPU writes a word to memory
Case 2: the block is not in the cache (cache miss)
Should the block be loaded?

Caches and Memory Writes

Two options:

Write-allocate: load the block into the cache and update the block there
Exploits spatial locality of writes
Reduces the number of bus transactions
Generally done by write-back caches
Requires more cache hardware

No-write-allocate: send the update directly to the next level
The individual update is faster
But if the block containing the word is written again, more bus transactions will occur
Generally done by write-through caches
Takes less hardware
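For example, with no-write-allocate, storing to each byte of a 64-byte block (an illustrative block size) sends 64 separate updates to the next level; with write-allocate, the block is fetched once and the remaining stores hit in the cache.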

What about writes? (Summary)

Multiple copies of data exist:
L1, L2, Main Memory, Disk

What to do on a write-hit?
Write-through (write immediately to memory)
Write-back (defer write to memory until replacement of line)
Needs a dirty bit (is the line different from memory or not?)

What to do on a write-miss?
Write-allocate (load into cache, update line in cache)
Good if more writes to the location follow
No-write-allocate (write immediately to memory)

Typical pairings (see the sketch below):
Write-through + No-write-allocate
Write-back + Write-allocate
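As an illustrative sketch (not from the lecture), the following C fragment models the typical write-back + write-allocate pairing for a direct-mapped cache; the sizes, field names, and memory helpers are assumptions for this example:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define S 64                       /* sets (illustrative)            */
#define B 64                       /* bytes per block (illustrative) */

struct line {
    bool     valid, dirty;         /* dirty: modified since load     */
    uint64_t tag;
    uint8_t  data[B];
};

struct line cache[S];              /* direct-mapped: one line per set */
uint8_t MEM[1 << 20];              /* toy backing memory              */

static void mem_read_block(uint64_t addr, uint8_t *buf) {
    memcpy(buf, &MEM[addr], B);
}
static void mem_write_block(uint64_t addr, const uint8_t *buf) {
    memcpy(&MEM[addr], buf, B);
}

/* Write-back + write-allocate: on a write miss, fetch the block into
   the cache (allocate), then update it there and set the dirty bit.
   Memory is written only when a dirty block is evicted. */
void cache_write_byte(uint64_t addr, uint8_t value) {
    uint64_t offset = addr % B;
    uint64_t set    = (addr / B) % S;
    uint64_t tag    = (addr / B) / S;
    struct line *ln = &cache[set];

    if (!(ln->valid && ln->tag == tag)) {           /* write miss      */
        if (ln->valid && ln->dirty)                 /* write back      */
            mem_write_block((ln->tag * S + set) * B, ln->data);
        mem_read_block(addr - offset, ln->data);    /* write-allocate  */
        ln->valid = true;
        ln->dirty = false;
        ln->tag   = tag;
    }
    ln->data[offset] = value;                       /* update in cache */
    ln->dirty = true;                               /* defer the write */
}

A write-through + no-write-allocate cache would instead call mem_write_block on every store and skip the allocation on a miss, trading extra bus traffic for simpler hardware with no dirty bit.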

Types of Caches

Modern CPUs use separate caches for data and instructions:

d-cache: for program data
Should handle a wide variety of access patterns
Handles reads and writes

i-cache: for program instructions
Mainly needs to handle simple sequential accesses
Does not need to handle writes
Can be made simpler and faster than a d-cache

Unified cache: a single cache is used for both instructions and data

i-cache and d-cache

Why use separate i-caches and d-caches?
The processor can read an instruction word and a data word at the same time
i-caches are typically read-only (simpler)
Each cache is often optimized for its own access patterns
Different block sizes, associativities, and capacities

Intel Core i7 Cache Hierarchy

[Figure: processor package with Core 0 through Core 3; each core has its own registers, L1 d-cache, L1 i-cache, and L2 unified cache; the L3 unified cache is shared by all cores and sits above main memory]

L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
L2 unified cache: 256 KB, 8-way, access: 11 cycles
L3 unified cache (shared by all cores): 8 MB, 16-way, access: 30-40 cycles
Block size: 64 bytes for all caches
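As a check on the earlier formula (illustrative arithmetic, not from the slides): the 32 KB, 8-way L1 with 64-byte blocks has S = C/(E x B) = 32768/(8 x 64) = 64 sets, so s = 6 set-index bits and b = 6 offset bits.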

Writing Cache Friendly Code

Make the common case go fast
Focus on the inner loops of the core functions

Minimize the misses in the inner loops
Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality); see the example below

Key idea: our qualitative notion of locality is quantified through our understanding of cache memories.
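A standard illustration of stride-1 versus strided access (a sketch, not taken from the lecture): C stores arrays in row-major order, so summing a matrix row by row follows the cache-friendly stride-1 pattern, while summing it column by column strides through memory and misses far more often.

#define N 1024

/* Stride-1 (cache friendly): consecutive accesses touch consecutive
   bytes, so each loaded block is fully used before it is evicted. */
long sum_rowwise(int a[N][N]) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-N (cache unfriendly): consecutive accesses are N*sizeof(int)
   bytes apart, so each access may land in a different block. */
long sum_colwise(int a[N][N]) {
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}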
