Lecture13 Cache

2019 EE 488
Memory Hierarchy Design

- Cache Memory -
Myoungsoo Jung
Computer Division
Computer Architecture and Memory systems Laboratory
CAMELab
Introduction of Cache
O3 Mem
Scheduling Management
Retirement & Execution L1I$
Execution
Units L2$
L1D$
Sandy Bridge die shot (18.5mm2)
CAMELab Source: wikichip.org

Why Do We Need the Cache Memory?
 The “Memory wall": Logic vs. DRAM speed gap continues to grow
1000
Clocks per DRAM access

Clocks per instruction
100
10 Core
Memory
1
0.1
0.01
VAX/1980 PPro/1996 2010+
CAMELab
The Memory Hierarchy
 Take advantage of the principle of locality to present the user with as much
memory as is available in the cheapest technology at the speed offered by the
fastest technology
Processor
Increasing distance from the processor

Managed by 4-8 bytes (word)
compiler
L1$
8-32 bytes (block)
Managed by the L2$
cache controller
1 to 4 blocks
HW
Main memory
Managed by
1) OS (VM) 1,024+ bytes (disk sector = page)
2) HW (TLB)
Secondary Memory
3) User (Files)
(Relative) size of the memory at each level
CAMELab
Key Idea of Cache: Locality!
Temporal Locality Spatial Locality
(Locality in Time) (Locality in Space)
The program is very likely to access the same The program is very likely to access data that
data again and again over time is close together
i = i+1;
sum = 0;
if (i < 20) {
for(i = 0; i < n; i++)
z = i*i + 3*i - 2;
sum += a[i];
}
return sum;
q = A[i];
[Insights for cache design]

Lower Level Memory
To Processor Upper Level Memory
i j k …
From Processor
a[i] a[i+1] a[i+2] a[i+3]
 Keep most recently accessed data items  Move blocks consisting of contiguous
closer to the processor words to the upper levels
CAMELab
Cache Behaviors
• Hit: data is in some block in the upper level (Block X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: Time to access the upper level which consists of
- RAM access time + Time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower

level (e.g., Block Y)
- Miss Rate = 1 – (Hit Rate)
- Miss Penalty: Time to replace a block in the upper level
+ Time to deliver the block to the processor
- Hit Time << Miss Penalty
Lower Level
To Processor Upper Level Memory
Memory
Blk X
From Processor Blk Y
CAMELab
Classification of Cache Misses (3Cs)
Assumption:
Write to here is the
0x1234 location of
0x1234
: empty
Assumption: there
: filled
are only 4 cachelines
Compulsory Miss Conflict Miss Capacity Miss

 First access to a block,  Multiple memory  Although we assume
“cold” fact of life, not a locations mapped to the 0x1234 can be put anywhere,
whole lot you can do about it same cache location cache cannot contain all data
0x1234 0x1234 0x5670 0x1234 0x5670

0x91B1 0x91B1
0x1112 0x1112
0x3113
CAMELab
Measuring Cache Performance
CPU time = Instruction Count x (CPIideal + Cyclesmemory-stall) x Clock Cycle
CPIstall
Assuming cache hit costs are included as part of the normal CPU
execution cycle
Read-stall cycles = reads / program x read miss rate x read miss penalty
Write-stall cycles = (write / program x write miss rate x write miss penalty)
+ write buffer stalls
We can simplify the cache metric like below
Average Memory Access Time (AMAT)

= Hit time + Miss rate x Miss penalty
CAMELab
Impacts of Cache Performance
• Relative cache penalty increases as processor performance
improves (e.g., faster clock rate and/or lower CPI)
- The memory speed is unlikely to improve as fast as processor cycle
time. When calculating CPIstall, the cache miss penalty is measured in
processor clock cycles needed to handle a miss
- In other words, while CPIideal decrases, CPIstall is dramatically increases
[Example]
Case2: what if the processor
Case1: what if the
clock rate is doubled (doubling
36% CPIideal is reduced to 1?
the miss penality)?
LD/ST
Instr’s
Penaltymiss =
100 cycle 2% x 200 + 36% x 4% x
Miss rate Cyclesmemory-stall =
I$ = 2% CPIstall = 1+ 3.44 200 = 6.88
D$= 4% = 4.44 ∴ CPIstall = 2 + 6.88 = 8.88
CPUideal = 2
The amount of execution
time spending on memory
Cyclesmemory-stall = 2% x 100 + 36% x stalls would increase
4% x 100 = 3.44
∴ CPIstall = 2 + 3.44 = 5.44 3.44 / 5.44 = 63 % 3.44 / 5.44 = 63 %
3.44 / 4.44 = 77 % 6.88 / 8.88 = 77 %
CAMELab
Basic Cache Cache
Design Partitioned
into “blocks”
Data is copied in
block-sized
There are several transfer units
questions about cache

memory design
Memory
CAMELab
Design#1: Block Placement & Identification
Cache
Where can a How is a block

block be found if it is in
placed in the the upper
upper level? level?
Memory
CAMELab
Method: Direct Mapped Cache
 For each item of data at the lower level, there is exactly one location in
the cache where it might be – so lots of items at the lower level must
share locations in the upper level
 A simple example
 Block size = one word (32b, 4B)
 # of blocks = 4
 Memory address (assuming 64B size thereby 6b)
Tag Index Byte offset
• Byte offset: 2 bits Size of block (Byte in the word) = 22

• Index: 2 bits Number of blocks = 22
• Tag: 2 bits = {address size} – 2 – 2
(address size = 6 bits)
CAMELab
Direct Mapped Cache
A Simple Example
Main memory
Q1: Does the data exist in the cache? 0000xx
0001xx
Compare the cache tag to the high 0010xx
order 2 memory address bits to tell if the
Cache memory block is in the cache
0011xx
0100xx
Index Valid Tag Data
0101xx
00 0110xx
01 0111xx
10 Q2: How do we find the data (block)? 1000xx
11 1001xx
Use next 2 low order memory address 1010xx
bits – the index – to determine which
1011xx
cache block (i.e., modulo the number of
blocks in the cache) 1100xx
1101xx
1110xx
1111xx
CAMELab
Direct Mapped Cache
A Simple Example
Main memory
0000xx
0001xx
0010xx
Cache 0011xx
0100xx
0101xx
00 0110xx
01 0111xx
10 1000xx
11 1001xx
1010xx
1011xx
1100xx
1101xx
1110xx
1111xx
CAMELab
Direct Mapped Cache
MIPS Direct Mapped Cache Example
 One word/block, cache size = 1 K words
Byte
31 30 ... 13 12 11 ... 2 1 0
offset
Tag 20 10 Data
Hit
Index
0
1
2
.
.
.
1021
1022
1023
20 32
CAMELab
Direct Mapped Cache
Challenge1: High miss rate
 Request sequence
0(0000) 1(0001) 2(0010) 3(0011) 4(0100) 3(0011) 4(0100) 15(1111) Main memory
Miss Miss Miss Miss Miss Hit Hit Miss 0000xx
0001xx
Fill the cache 0010xx
Cache InvalidTag
Tagmismatch
match Fill the cache 0011xx
Tag Hit!
Miss! Miss! Fill the cache 0100xx
Invalid
Index Valid Data
Miss! Fill the cache 0101xx
00 1Invalid
00
01 Tag Mem[0]
Mem[4]
mismatch
Tag match 0110xx
Miss!
Invalid
01 1 00 Miss! Hit!
Mem[1] 0111xx
Miss!
10 1 00 Mem[2] 1000xx
11 1 00 11 Mem[3]
Mem[15] 1001xx
 Start with an empty cache, all Fill the cache 1010xx
blocks initially marked as not valid 1011xx
1100xx
1101xx
8 request, 6 misses 1110xx
Any better idea? 1111xx
CAMELab
Solution: Multiword Block Direct Mapped
 Key idea: larger block sizes take advantage of spatial locality
 Four words per block, cache size = 1K words
31 30 . . . 13 12 11 ... 4 32 10
Byte
Hit offset Data
Tag 20 8 Block offset

Index

0
1
2
.
.
.
253
254
255
20
32
CAMELab
Block
Multiword Direct Mapped Tag Index offset Byte offset
Advantages: spatial locality
 Same example w/ direct mapped
0(0000) 1(0001) 2(0010) 3(0011) 4(0100) 3(0011) 4(0100) 15(1111) Main memory
Miss Hit Miss Hit Miss Hit Hit Miss 0000xx
0001xx
0011xx
Cache Tagmismatch
Tag
Invalid match Fill
Hit!
Miss! Fill the
the cache
cache 0100xx
Invalid
Miss!
Index Valid Tagmismatch
Tag
Tag match
Data Data 0101xx
Miss! Miss!
Hit!
0 1 01 Mem[1] Mem[4]
00 Mem[5] Mem[0] 0110xx
1 1 00 Mem[2]
Mem[3] Mem[14]
11 Mem[15] 0111xx
1000xx
 Start with an empty cache, all 1001xx
blocks initially marked as not valid Fill the cache 1010xx
1011xx
1100xx
1101xx
Reduces 2 misses than one-word direct mapped 1111xx
CAMELab
Multiword Direct Mapped
Disadvantages of multiword
 But, miss rate goes up if the block size becomes a significant fraction of
the cache size because the # of blocks that can be held in the same size
cache is smaller (increasing capacity misses)
10
Cache size
8 KB
Miss rate (%)
16 KB
5 64 KB
256 KB
0
8 16 32 64 128 256
Block size (bytes)
CAMELab
Tag Block index Block offset
Direct Mapped Cache
Challenge2: Ping pong effect
 Request sequence
0(0000) 4(0100) 0(0000) 4(0100) 0(0000) 4(0100) 0(0000) 4(0100)
Main memory
Miss Miss Miss Miss Miss Miss Miss Miss 0000xx
0001xx
Cache Tag mismatch 0011xx
Invalid
Fill the cache
Tag Miss!
Miss!
Index Valid Data 0100xx
0101xx
00 1 01
00 Mem[4]
Mem[0] 0110xx
01 0111xx
10 1000xx
11 1001xx
1100xx
1101xx
Any better idea? 1111xx
CAMELab
Solution: Set-Associative Cache
 Key idea: divide the cache into sets each of which consists of n “ways”
and allow a memory block to be mapped to any “ways”
 A simple example
 Block size = one word (32b, 4B)
 # of blocks = 4, # of sets = 2
 Memory address (assume 6b)
• Byte offset: 2 bits Size of block (Byte in the word) = 22

• Index: 1 bits Number of sets = 21
• Tag: 3 bits = {address size} – 2 – 1
(address size = 6 bits)
CAMELab
Set-Associative Cache
A Simple Example
Main memory
0000xx
Q1: Does the data exist in the cache? 0001xx
0010xx
Cache Compare the cache tag in the set to 0011xx
the high order 3 memory address bits to 0100xx
Way Set V Tag Data tell if the memory block is in the cache
0101xx
0
0 0110xx
1
0111xx
0 1000xx
1
1 1001xx
Q2: How do we find the data (block)?
1010xx
Use next 1 low order memory address 1011xx
bits – the set index – to determine which 1100xx
set (i.e., modulo the number of sets in
1101xx
the cache)
1110xx
1111xx
CAMELab
A Simple Example
 Same example w/ direct mapped
0(0000) 4(0100) 0(0000) 4(0100) 0(0000) 4(0100) 0(0000) 4(0100)
Main memory
Miss Miss Hit Hit Hit Hit Hit Hit 0000xx
0001xx
Cache Tag match 0011xx
Tag Hit!
Way Set V Miss! Data 0100xx
Tag match Fill the cache 0101xx
0
0 1 000 Mem[0]
Miss! Hit! 0110xx
1
0111xx
0 1 010 Mem[4] 1000xx
1
1 1001xx
1100xx
1101xx
Reduces 6 misses than one-word direct mapped 1111xx
CAMELab
MIPS Example: 4-Way SA
 One word/block, Cache size = 1K words
 28 = 256 sets each with four ways (each with one block)
Address of word:
Tag (22 bits) Set index (8 bits) Block offset (2 bits)
Index V Tag Data V Tag Data V Tag Data V Tag Data

0 0 0 0
1 1 1 1
. . . .
. . . .
255 255 255 255
4x1 select
Hit Data
CAMELab
Disadvantage: expensive cost
 One word/block, Cache size = 1K words
 28 = 256 sets each with four ways (each with one block)
Address of word:
Tag (22 bits) Set index (8 bits) Block offset (2 bits)
Mux delay (set
selection) before
data is available
Index V Tag
0
Data
0
V Tag Data
0
V Tag Data
N0comparators
V Tag
(N-
Data
1 1 1 1 way)
. . . . Delay & area
. . . .
255 255 255 255
Hit/Miss
decision delay
4x1 select
Hit Data
CAMELab
Advantage: can reduce miss rate with small associativity
 The choice of direct mapped or set associative depends on the
cost of a miss versus the cost of implementation
12 Largest gains are in

going from direct 4KB
10 mapped to 2-way 8KB
16KB
8
Miss Rate
32KB
64KB
6
128KB
4 256KB
512KB
2
0
1-way 2-way 4-way 8-way
Associativity
CAMELab Data from Hennessy & Patterson, Computer Architecture

Range of Set Associative Caches
 The For a fixed size cache, each increase by a factor of two in
associativity doubles the number of blocks per set (i.e., the
number or ways) and halves the number of sets – decreases the
size of the index by 1 bit and increases the size of the tag by 1 bit
Address decomposition
: Byte offset
: Block offset  selects the word in the block (multi-word support)
: Index  selects the set
: Tag  used for tag compare
Decreasing associativity Increasing associativity
Direct mapped Fully associative

(only one way) (only one set)
- Smaller tags (determined - Tag is all the bits except
by cache size) block and byte offset
CAMELab
Design#3: Block Replacement
Cache
Which block
should be
replaced on a Evict the
miss? cacheline first!
Memory
CAMELab
Cache Replacement Policy
• Static
- For direct mapped cache, there is only one choice
• Random
- Replace a randomly chosen cacheline
• FIFO
- Replace the oldest cacheline
• LRU (Least Recently Used)
- Replace the least recently used line
• NRU (Not Recently Used)
- Replace one of the lines that is not recently used
- In Itanium2, L1$, L2$ and L3$ use this policy
CAMELab
Least-Recently Used (LRU):
 Key idea: evict block with longest reuse distance
• For way = 2, LRU is equivalent to NMRU (Not Most

Recently Used)
- Single bit per set indicated LRU/MRU
- Set/clear on each access
• For way > 2, LRU is difficult / expensive
- Including timestamps? How many bits?
- Ideal implementation: find min timestamp on each eviction
- Sorted list? Re-sort on every access?
• Let’s remember the order in which all N cachelines
were last accessed
CAMELab
Practical Pseudo-LRU:𝑂𝑂(𝐿𝐿𝐿𝐿𝐿𝐿𝑁𝑁)
 Key idea: approximated LRU policy with a binary tree
 Example: PLUR for 4-way set associative cache
Pointer indicates
the older
All PLUR bits are 0: left is older
initialized as 0 AB/CD bit (L0) 0 1: right is older
A/B bit (L1) 0 C/D bit (L2) 0
Way A Way B Way C Way D
• Each node records which half is older/newer

- Cache ways are the leaves of the tree
CAMELab
 Update order: way A  way B  way C  way D
Way CD is older
Way B is older AB/CD bit (L0) 1
0 than Way AB
than Way A

• Update nodes on each reference
CAMELab
Way A is older AB/CD bit (L0) 1

than Way B

CAMELab
Way AB is older Way D is older

than Way CD AB/CD bit (L0) 0
1 than Way C

CAMELab
Way C is older
AB/CD bit (L0) 0 than Way D

CAMELab
Practical Pseudo-LRU: 𝑂𝑂(𝐿𝐿𝐿𝐿𝐿𝐿𝑁𝑁)
AB/CD bit (L0) 0
Victim block is A/B bit (L1) 0 C/D bit (L2) 0

in Way A

• Follow older pointers to find LRU victim
CAMELab
Clock algorithm
• Simpler implementation
“Clock” hand points to next page to replace A
H t=32
t=0 B
If R=0, replace page t=30 t=32
t=4
If R=1, set R=0 and advance the clock hand
G C
• Continue until page t=29 t=32
t=8
with R=0 is found
This may involve going all the way F DJ
around the clock… t=22 E t=15
t=32
t=21
referenced unreferenced
CAMELab Credit: U of pitts

Design#4: Write Strategy
Cache
What
happens on a
write?
Memory
CAMELab
Handling Cache Hits
Write through vs. Write back
Write through Write back

Allow cache and memory to be consistent Allow cache and memory to be inconsistent
 Always write the data into both the cache and next  Write the data only into the cache block
level memory  Need a dirty bit for each data cache block to tell if it
 Write-through is always combined with a write needs to be written back to memory when it is evicted
buffer; so we can eliminate the overhead
FIFO
(#entry = 4)
write buffer
CAMELab
Handling Cache Write Miss
Write allocate vs. No write allocate
Write allocate No write allocate

Just write the word into the cache updating Skip the cache write and just write the word
both the tag and data, no need to check for to the next memory level
cache hit, no need to stall the pipeline  Mostly use with write-through cache
When cache Step2: Write the

miss occurs.. new data into the
allocated
cacheline
Step1: Allocate Directly update

cacheline and the lower level
read data from memory
DRAM
CAMELab
Cache Optimizations
We can improve the cache performance by
reducing AMAT!

CAMELab
Summary of Cache Optimization
Technique Miss Rate Miss Pen. Hit time Hardware Complexity
Larger Block Size + - 0
Higher Associativity + - 1
Victim Caches + 2
Pseudo-associative + 2
Hardware Prefetching + 2
Compiler-controlled Pre + 3
Compiler Techniques + 0
Giving Read Misses Priority + 1
Subblock Placement + 1
Early Restart/Crit Wd First + 2
Nonblocking Caches + 3
Second-Level Caches + 2
Small and Simple Caches - + 0
Avoiding Address Trans. + 2
Pipelining Writes + 1
CAMELab
Goal#1:
Miss Penalty ↓
- Multi-level caches
- Critical Word First and early restart
- Combining writes
- Non-blocking caches

CAMELab
Technique1: Multi-level Caches
 Key idea: fill the gap between the central processing unit and main
memory by adding another level in the hierarchy
Processor Processor
L1$
L1-L2
SRAM
L3-L4
L2$ eDRAM
Main memory
Main memory
Storage-Class
Memory (SCM)
Secondary Memory Secondary Memory
Past Now
CAMELab
Kinds of Cache Hierarchies
 Today’s processors have multi-level cache hierarchies
 And multi-level caches can be designed in various ways depending on
whether the content of one cache in present in other level of caches
Inclusive Non-Inclusive Exclusive

hierarchy hierarchy hierarchy
 upper-level blocks always  May contain the upper-  upper-level blocks must
exist in the lower-level level blocks in the lower-level not exist in the lower-level
victim
fill Upper evict fill Upper fill Upper
Back-invalidation
fill fill Lower Lower

Lower
fill
victim victim victim
CAMELab Credit: Intel [MICRO’10]

Technique2: Early Restart & Critical Word First
 Key idea: do not wait for the entire block to be loaded before restarting
CPU – CPU needs only 1 word
Early restart Critical Word First

 As soon as the requested word of  Request the missed word first
the block arrives, send it to the CPU from memory and send it to the
and let the CPU continue execution CPU as soon as it arrives
Processor
Resume Resume
4-8 bytes (word)
L1$
8-32 bytes (block) 0 1 2 3 0 1 2 3
L2$
Critical
Block (ex: 32B, word
4 words) Critical word
CAMELab
Technique3: Combining Writes
 Background: MTRR (Memory type range register) in x86
 Indicates how accesses to memory ranges by the CPU are cached
 We can simply check with “cat /proc/mtrr”
 There are 5 memory types

Memory Types in x86 Explanation
UnCacheable (UC) Bypass all the cache subsystem
UnCacheable Speculative Write- Bypasses the caches, excluding write
combining (USWC or WC) combining buffer (WCB)
WriteBack (WB) Write the data only into the cache block
Always write the data into both the cache and
Write-Through (WT)
next level memory
Write-Protected (WP) Supervisor-mode access only
CAMELab Intel® 64 and IA-32 Architectures Software Developer’s Manual

Volume 3A: System Programming Guide, Part 1
Write Combining Buffer
 Key idea: coalescing writes before or during a long lasting operation like
a bus transaction or a cache coherency transaction
 Only for USWC (Uncacheable Speculative Write-Combining) memory
 Explicit method (instruction) to use USWC memory type: non-
temporal load (movntdqa) and non-temporal store (moventdq)
Processor Mem[900]
Mem[124]
Mem[116]
Mem[108]
Mem[400]
Mem[100]
WC buffer L1$
Mem[100] Mem[108] Mem[116] Mem[124]
L2$
Mem[400]
Mem[900] LLC
UC mode LowerUSWC
level memory
mode WB mode
CAMELab
Technique4: Non-Blocking Cache
Recall: pipeline stall by cache miss in MIPS
Cycle1 Cycle2 Cycle3 Cycle4 Cycle5 Cycle6 Cycle7 Cycle8 Cycle9
Program
Cache miss
ALU
IM Reg DM Reg
Instruction1 occurs
stall
Instruction2 IM
stall
stall
ALU
IM Reg DM Reg
CAMELab
Miss Status Holding Registers(MSHRs)
 Key idea: allow for more than one outstanding miss by tracking of cache
misses and pending load & stores that refer to that cache block
 Miss Handling Architecture (MHA) & Miss Status Holding Register (MSHR)
Banked cache system (L1$, L2$, LLC)
MSHR1 Valid Block addr LD/ST info

Cache Cache Cache
Bank Bank Bank MSHR2 Valid Block addr LD/ST info
MSHRN Valid Block addr LD/ST info
MSHR MSHR MSHR Address

Comparators
File File File
Miss Handling Architecture
Hit
CAMELab
Goal#2:
Miss Rate ↓
- Hardware prefetching
- Compiler techniques

CAMELab
Technique1: Hardware Prefetching
 Key idea: hardware monitors memory accesses
Processor
Registers Intel Core2
Prefetcher
I-TLB L1 I-Cache L1 D-Cache D-TLB Locations
L2 Cache
What’s inside today’s chips

• L1D$ L3 Cache (LLC)
• PC-localized stride predictors
• Short-stride predictors within block  prefetch next block
• L1I$
• predict
• Real CPUsfuture have
PC multiple prefetches
• L2$
- • Usually closer to the core (easier to detect patterns)
Stream buffers
Adjacent-line prefetch
- • Prefetching at LLC is hard (cache is banked and hashed)
CAMELab Source: CSE502 prefetching (stony brook univ.)

Stream Buffer
 The pattern of prefetching successive blocks is called sequential
prefetching. Stream buffer keeps next-N available in buffer
Processor
L1$
Stream buffer (FIFO Queue version)

tag One cache line of data
Tail entry
+1
Lower level memory
CAMELab
Technique2: Cache-friendly Compiler
 Key idea: trying to modify the layout of data structures so that they are
accessed in a more cache-friendly manner
Merging Array Loop Fusion Loop Interchange

 Merge 2 arrays into a single  Combine 2 independent  Change loops nesting to access
array of compound elements loops that have same looping data in order stored in memory
and some variables overlap
/*two sequential for (i=0; i<10K; i++) for (j=0; j<N-1; j++)
arrays*/ a[i] = 1/a[i]; for (i=0;i<N-1;i++)
int val[SIZE]; for (i=0; i<10K; i++) x[i][j] = 2*x[i][j]
int key[SIZE]; sum = sum + a[i];
Optimized
struct merge { for (i=0; i<10K; i++) for (i=0; i<N-1; i++)
int val; a[i] = 1/a[i]; for (j=0;j<N-1;j++)
int key; sum = sum + a[i]; x[i][j] = 2*x[i][j]
} merged[SIZE];
 For the second line: hit all  Sequential access, not striding
 Improves spatial locality  Improves temporal locality  Improves spatial locality
CAMELab
Goal#3:
Hit Time ↓
- Avoiding address translation during
indexing of the cache (virtual cache)
- Way prediction
- Trace cache

CAMELab
Technique1: Virtually Indexed Cache
 Key idea: The CPU uses virtual addresses that must be mapped to a
physical address. Let’s index cache with virtual address (not physical addr.)
 Challenge: process switches require cache purging
 Solution: PID tags
 Challenge: Aliasing (two diff virtual addrs may have the same physical addr.)
 Solution: anti-aliasing hardware / page coloring / using the page offset
VA Address PA Physical Primary

CPU
translation cache memory
VA Virtual Address PA Primary

CPU
cache translation memory
CAMELab http://ece-research.unm.edu/jimp/611/slides/chap5_4.html
Technique2: Way Prediction
 Key idea: make set-associative caches faster by keeping extra bits in cache
to predict the “way” or block within the set, of next cache access; branch
predictor can override the decision of the way predictor
 Example: Way prediction instruction cache (Alpha 21264-like)
Jump target
0x4
Jump Add
control
PC
addr Primary inst

Instruction
way Cache
Sequential Way
Branch Target Way
CAMELab
Technique3: Trace Cache
 Key idea: make an instruction cache faster by packing multiple non-
contiguous basic blocks into one contiguous trace cache line
 Example: Pentium 4 (NetBurst) Trace Cache
 Trace cache stores decoded and cracked instructions
 Micro-operations (uops): returns 6 uops every other cycle
Instruction Cache Trace Cache

E F G
H I J A B C D E F G H I J
A B Branch
Branch C
D
IC Fetch (5 cycles) TC Fetch (1 cycle)
CAMELab
Wrap Up
Different Cache Design and Policies could be
chosen based on the target performance
CAMELab
Cache Parameters in Real μProcessor
Intel P4 AMD Opteron
L1 organization Split I$ and D$ Split I$ and D$
L1 cache size 8KB for D$, 96KB for 64KB for each of I$ and D$
trace cache (~I$)
L1 block size 64 bytes 64 bytes
L1 associativity 4-way set assoc. 2-way set assoc.
L1 replacement ~ LRU LRU
L1 write policy write-through write-back
L2 organization Unified Unified
L2 cache size 512KB 1024KB (1MB)
L2 block size 128 bytes 64 bytes
L2 associativity 8-way set assoc. 16-way set assoc.
L2 replacement ~LRU ~LRU
L2 write policy write-back write-back
CAMELab
More about Cache
Cache partitioning
 How about cache fairness in multi-core/thread
environment?
- Cache Allocation Technology (CAT) Processor 1 Processor 2
- Code and Data Prioritization (CDP) L1 cache L1 cache
Cache coherence
 How to make synchronization among multiple
local caches (per-core L1$ or L2$); we will learn L2 cache (shared)
the protocols in later
- Various protocols: MSI, MESI, MOSI, MOESI, MERSI,
MESIF, write-once, Synapse, Berkeley, and Firefly and
Dragon protocol Memory
Cache side-channel attack

How about the security issue?
- LLC as covert channel
- Reference: Cache side channel attacks: CPU Design as
a security problem [HITBSecConf’16]
CAMELab
2019 EE 488
Memory Hierarchy Design

- Cache Memory -
Myoungsoo Jung
Computer Division
Computer Architecture and Memory systems Laboratory
CAMELab
Types of Caches
 There are three types based on associativity
Direct mapping Full-associative

Set-associative (SA)
(DM) (FA)
DM and FA can be
thought as special
cases of SA
- DM: 1-way SA
- FA: All-way SA Cache Cache Cache
Example: 2-way
Mapping of data Specific location in Any of a set of Any location in

from memory to $ the cache (fixed) locations in the cache the cache
Complexity of Fast indexing Slightly more involved Extensive hardware
search the cache mechanism search mechanism resources (CAM)
CAMELab
Summary of Caches
 Where can a block be placed / found?
# of sets Blocks per set

Direct mapped # of blocks in cache 1
Set associative (# of blocks in cache)/ Associativity (typically
associativity 2 to 16)
Fully associative 1 # of blocks in cache
Location method # of comparisons

Direct mapped Index 1
Set associative Index the set; compare Degree of
set’s tags associativity
Fully associative Compare all blocks tags # of blocks
CAMELab

Lecture13 Cache

Caricato da

Informazioni sul documento

Titolo originale

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Lecture13 Cache

Caricato da

Copyright:

Formati disponibili

2019 EE 488

Memory Hierarchy Design

Computer Architecture and Memory systems Laboratory

CAMELab Source: wikichip.org

Clocks per DRAM access

Increasing distance from the processor

(Relative) size of the memory at each level

[Insights for cache design]

• Miss: data needs to be retrieved from a block in the lower

Compulsory Miss Conflict Miss Capacity Miss

0x1234 0x1234 0x5670 0x1234 0x5670

We can simplify the cache metric like below

Average Memory Access Time (AMAT)

questions about cache

Where can a How is a block

• Byte offset: 2 bits Size of block (Byte in the word) = 22

Tag 20 8 Block offset

Index Valid Tag Data

Tag Index Byte offset

• Byte offset: 2 bits Size of block (Byte in the word) = 22

Index V Tag Data V Tag Data V Tag Data V Tag Data

12 Largest gains are in

CAMELab Data from Hennessy & Patterson, Computer Architecture

Decreasing associativity Increasing associativity

Direct mapped Fully associative

• For way = 2, LRU is equivalent to NMRU (Not Most

A/B bit (L1) 0 C/D bit (L2) 0

Way A Way B Way C Way D

• Each node records which half is older/newer

A/B bit (L1) 10 C/D bit (L2) 0

Way A Way B Way C Way D

• Each node records which half is older/newer

Way A is older AB/CD bit (L0) 1

A/B bit (L1) 01 C/D bit (L2) 0

Way A Way B Way C Way D

• Each node records which half is older/newer

Way AB is older Way D is older

A/B bit (L1) 0 C/D bit (L2) 01

Way A Way B Way C Way D

• Each node records which half is older/newer

A/B bit (L1) 0 C/D bit (L2) 10

Way A Way B Way C Way D

• Each node records which half is older/newer

AB/CD bit (L0) 0

Victim block is A/B bit (L1) 0 C/D bit (L2) 0

Way A Way B Way C Way D

• Each node records which half is older/newer

CAMELab Credit: U of pitts

Write through Write back

Write allocate No write allocate

When cache Step2: Write the

Step1: Allocate Directly update

Average Memory Access Time (AMAT)

Average Memory Access Time (AMAT)

Inclusive Non-Inclusive Exclusive

fill fill Lower Lower

victim victim victim

CAMELab Credit: Intel [MICRO’10]

Early restart Critical Word First

 There are 5 memory types

CAMELab Intel® 64 and IA-32 Architectures Software Developer’s Manual

Banked cache system (L1$, L2$, LLC)

MSHR1 Valid Block addr LD/ST info