Myoungsoo Jung
Computer Division
CAMELab
Introduction to Caches
[Figure: Sandy Bridge die shot (18.5 mm²), annotating the out-of-order (O3) scheduler, memory management, retirement, execution units, L1I$, L1D$, and L2$]
[Figure: processor vs. memory performance over time on a log scale (0.01 to 1000), from VAX/1980 through PPro/1996 to 2010+, showing the ever-widening processor-memory performance gap]
The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much
memory as is available in the cheapest technology at the speed offered by the
fastest technology
Processor
  ↕ 8-32 bytes (block), managed by the cache controller (HW)
L1$
  ↕ 1 to 4 blocks, managed by the cache controller (HW)
L2$
Main memory
  ↕ 1,024+ bytes (disk sector = page), managed by 1) the OS (VM), 2) HW (TLB), 3) the user (files)
Secondary memory
Key Idea of Cache: Locality!
Temporal Locality (Locality in Time)
The program is very likely to access the same data again and again over time.

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

Keep most recently accessed data items closer to the processor.

Spatial Locality (Locality in Space)
The program is very likely to access data that is close together.

    i = i + 1;
    if (i < 20) {
        z = i*i + 3*i - 2;
        q = A[i];
    }

Move blocks consisting of contiguous words to the upper levels.
Cache Behaviors
• Hit: data is found in some block in the upper level (Block X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data must be retrieved from a block in the lower level
- Miss Rate: 1 - Hit Rate
- Miss Penalty: time to replace a block in the upper level and deliver the data to the processor
Classification of Cache Misses (3Cs)
The 3Cs:
• Compulsory (cold): the first access to a block can never hit
• Capacity: the cache cannot contain all the blocks the program needs, so blocks are evicted and later re-fetched
• Conflict: multiple blocks map to the same cacheline and evict one another even though other lines are free

[Figure: a write to location 0x1234 against a cache with only 4 cachelines; empty vs. filled lines illustrate each miss type]
Measuring Cache Performance
CPU time = Instruction Count x CPI_stall x Clock cycle time, where
CPI_stall = CPI_ideal + Cycles_memory-stall

This assumes that cache hit costs are included as part of the normal CPU execution cycle.

Read-stall cycles = (reads / program) x read miss rate x read miss penalty
Write-stall cycles = (writes / program) x write miss rate x write miss penalty + write buffer stalls
Impacts of Cache Performance
• Relative cache penalty increases as processor performance improves (e.g., faster clock rate and/or lower CPI)
- Memory speed is unlikely to improve as fast as processor cycle time. When calculating CPI_stall, the cache miss penalty is measured in the processor clock cycles needed to handle a miss
- In other words, while CPI_ideal decreases, CPI_stall increases dramatically

[Example] CPI_ideal = 2, 36% of instructions are loads/stores, miss penalty = 100 cycles, I$ miss rate = 2%, D$ miss rate = 4%

Cycles_memory-stall = 2% x 100 + 36% x 4% x 100 = 3.44
∴ CPI_stall = 2 + 3.44 = 5.44, and 3.44 / 5.44 = 63% of execution time is spent on memory stalls

Case 1: what if CPI_ideal is reduced to 1?
CPI_stall = 1 + 3.44 = 4.44, and 3.44 / 4.44 = 77%

Case 2: what if the processor clock rate is doubled (doubling the miss penalty)?
Cycles_memory-stall = 2% x 200 + 36% x 4% x 200 = 6.88
∴ CPI_stall = 2 + 6.88 = 8.88, and 6.88 / 8.88 = 77%

Either way, the share of execution time spent on memory stalls grows.
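To make the arithmetic concrete, here is the example above as a tiny C calculator (a sketch; the numbers are the slide's own):

    #include <stdio.h>

    /* Reproduces the worked example: CPI_ideal = 2, 36% loads/stores,
     * I$ miss rate 2%, D$ miss rate 4%, 100-cycle miss penalty. */
    int main(void) {
        double cpi_ideal = 2.0, ldst = 0.36;
        double i_miss = 0.02, d_miss = 0.04;
        double penalty = 100.0;

        double stall = i_miss * penalty + ldst * d_miss * penalty; /* 3.44 */
        double cpi_stall = cpi_ideal + stall;                      /* 5.44 */
        printf("memory-stall cycles per instruction = %.2f\n", stall);
        printf("CPI_stall = %.2f (%.0f%% of time on stalls)\n",
               cpi_stall, 100.0 * stall / cpi_stall);              /* 63 */
        return 0;
    }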
Basic Cache Design

The cache is partitioned into "blocks"; data is copied between memory and the cache in block-sized transfer units.

There are several design decisions to make: block placement and identification, block replacement, and write handling.

[Figure: cache above memory, exchanging one block-sized unit]
Design#1: Block Placement & Identification
[Figure: where in the cache can a memory block be placed, and how is it found there?]
Method: Direct Mapped Cache
For each item of data at the lower level, there is exactly one location in
the cache where it might be – so lots of items at the lower level must
share locations in the upper level
A simple example
Block size = one word (32b, 4B)
# of blocks = 4
Memory address (assuming a 64 B memory, hence 6-bit addresses)
Tag Index Byte offset
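A minimal sketch of this address split in C (the constants follow the 6-bit example above; the snippet itself is illustrative, not from the slides):

    #include <stdint.h>
    #include <stdio.h>

    /* 6-bit addresses, four one-word (4 B) blocks:
     * 2 byte-offset bits, 2 index bits, 2 tag bits. */
    int main(void) {
        uint32_t addr = 0x2D;                       /* 101101b, an arbitrary example */
        uint32_t byte_offset =  addr       & 0x3;   /* bits [1:0] */
        uint32_t index       = (addr >> 2) & 0x3;   /* bits [3:2] */
        uint32_t tag         =  addr >> 4;          /* bits [5:4] */
        printf("tag=%u index=%u offset=%u\n", tag, index, byte_offset);
        return 0;
    }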
Direct Mapped Cache: A Simple Example

Address fields: Tag | Index | Byte offset

Cache:
Index | Valid | Tag | Data
  00  |       |     |
  01  |       |     |
  10  |       |     |
  11  |       |     |

Main memory blocks (6-bit addresses; xx = byte offset): 0000xx, 0001xx, 0010xx, 0011xx, 0100xx, 0101xx, 0110xx, 0111xx, 1000xx, 1001xx, 1010xx, 1011xx, 1100xx, 1101xx, 1110xx, 1111xx. Each block maps to the cache index given by its two middle address bits, so 0000xx, 0100xx, 1000xx, and 1100xx all compete for index 00.
Direct Mapped Cache
MIPS Direct Mapped Cache Example
One word/block, cache size = 1K words

Address fields: Tag = bits 31..12 (20 bits), Index = bits 11..2 (10 bits), Byte offset = bits 1..0

[Figure: 1024-entry direct mapped cache; the 10-bit index selects an entry, the stored 20-bit tag is compared against the address tag (together with the valid bit) to produce Hit, and the 32-bit data word is read out]
Direct Mapped Cache
Challenge 1: high miss rate

Address fields: Tag | Index | Byte offset

Request sequence (word addresses): 0 (0000), 1 (0001), 2 (0010), 3 (0011), 4 (0100), 3 (0011), 4 (0100), 15 (1111)
Results: Miss, Miss, Miss, Miss, Miss, Hit, Hit, Miss

Start with an empty cache, all blocks initially marked as not valid. The first four accesses miss on invalid entries and fill indices 00-11 with Mem[0]-Mem[3]. Access 4 (0100) maps to index 00 but its tag mismatches, so it misses and replaces Mem[0] with Mem[4]. The repeated accesses to 3 and 4 then tag-match and hit. Finally, access 15 (1111) maps to index 11, mismatches, and replaces Mem[3] with Mem[15].

Final cache contents: index 00 → tag 01, Mem[4]; index 01 → tag 00, Mem[1]; index 10 → tag 00, Mem[2]; index 11 → tag 11, Mem[15].

8 requests, 6 misses. Any better idea?
Solution: Multiword Block Direct Mapped
Key idea: larger block sizes take advantage of spatial locality
Four words per block, cache size = 1K words

Address fields: Tag = bits 31..12 (20 bits), Index = bits 11..4 (8 bits), Block (word) offset = bits 3..2, Byte offset = bits 1..0

[Figure: 256-entry cache holding four data words per entry; the block offset selects which of the four words is returned along with Hit]
Multiword Direct Mapped
Advantage: spatial locality

Address fields: Tag | Index | Block offset | Byte offset

Same request sequence as the one-word direct mapped example (here with two-word blocks and two indices):
0 (0000), 1 (0001), 2 (0010), 3 (0011), 4 (0100), 3 (0011), 4 (0100), 15 (1111)
Results: Miss, Hit, Miss, Hit, Miss, Hit, Hit, Miss

Start with an empty cache, all blocks initially marked as not valid. Access 0 misses and fills index 0 with Mem[0]-Mem[1], so access 1 hits on the same block. Access 2 misses and fills index 1 with Mem[2]-Mem[3], so access 3 hits. Access 4 mismatches at index 0 and replaces the block with Mem[4]-Mem[5]; the following accesses to 3 and 4 both hit. Access 15 mismatches at index 1 and replaces the block with Mem[14]-Mem[15].

8 requests, 4 misses: two fewer misses than the one-word direct mapped cache.
Multiword Direct Mapped
Disadvantages of multiword
But the miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same-sized cache becomes smaller (increasing capacity misses).

[Figure: miss rate (%) vs. block size (8 to 256 bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB; larger blocks first lower the miss rate, then raise it again in the smaller caches]
Direct Mapped Cache
Challenge 2: ping-pong effect

Address fields: Tag | Index | Byte offset

Request sequence (word addresses): 0 (0000), 4 (0100), 0 (0000), 4 (0100), 0 (0000), 4 (0100), 0 (0000), 4 (0100)
Results: Miss, Miss, Miss, Miss, Miss, Miss, Miss, Miss

Start with an empty cache, all blocks initially marked as not valid. Addresses 0 and 4 both map to index 00 but carry different tags (00 vs. 01), so every access mismatches and replaces the other's block: Mem[0] and Mem[4] ping-pong in and out of the same cacheline.

8 requests, 8 misses. Any better idea?
Solution: Set-Associative Cache
Key idea: divide the cache into sets, each consisting of n "ways", and allow a memory block to be mapped to any of the ways in its set

A simple example
Block size = one word (32 b, 4 B)
# of blocks = 4, # of sets = 2 (i.e., 2-way set associative)
Memory address (assume 6 bits): Tag | Index | Byte offset
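As a hedged sketch of this organization (the types and names are illustrative, not from the slides), a lookup computes the set index, then compares the tag against every way of that set:

    #include <stdbool.h>
    #include <stdint.h>

    /* The 2-set, 2-way example above: one-word (4 B) blocks, 6-bit
     * addresses, so 2 offset bits, 1 set-index bit, 3 tag bits. */
    #define NSETS 2
    #define NWAYS 2

    struct line { bool valid; uint32_t tag; uint32_t data; };
    static struct line cache[NSETS][NWAYS];

    bool lookup(uint32_t addr, uint32_t *data) {
        uint32_t set = (addr >> 2) % NSETS;   /* index bit above the offset */
        uint32_t tag =  addr >> 3;            /* remaining high bits        */
        for (int way = 0; way < NWAYS; way++) /* block may sit in ANY way   */
            if (cache[set][way].valid && cache[set][way].tag == tag) {
                *data = cache[set][way].data;
                return true;                  /* hit  */
            }
        return false;                         /* miss */
    }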
Set-Associative Cache: A Simple Example

Address fields: Tag | Index (set) | Byte offset

Same request sequence as the ping-pong example:
0 (0000), 4 (0100), 0 (0000), 4 (0100), 0 (0000), 4 (0100), 0 (0000), 4 (0100)
Results: Miss, Miss, Hit, Hit, Hit, Hit, Hit, Hit

Start with an empty cache, all blocks initially marked as not valid. Addresses 0 and 4 both map to set 0, but the set now has two ways: the first miss fills way 0 (tag 000, Mem[0]) and the second fills way 1 (tag 010, Mem[4]). Every later access tag-matches one of the two ways and hits.

8 requests, 2 misses: six fewer misses than the one-word direct mapped cache.
Set-Associative Cache
MIPS Example: 4-Way SA
One word/block, cache size = 1K words
2^8 = 256 sets, each with four ways (each way holding one block)

Address of word: Tag (22 bits) | Set index (8 bits) | Block offset (2 bits)

[Figure: the four ways of the selected set are compared in parallel; a 4-to-1 select then produces Hit and Data]
Set-Associative Cache
Disadvantage: higher cost
One word/block, cache size = 1K words
2^8 = 256 sets, each with four ways (each way holding one block)

Address of word: Tag (22 bits) | Set index (8 bits) | Block offset (2 bits)

An N-way set-associative cache needs N comparators (delay and area), plus a mux delay for selecting among the ways before the data is available, and the hit/miss decision arrives late.

[Figure: 256-set x 4-way array (valid, tag, data per way) with four tag comparators feeding a 4-to-1 select that produces Hit and Data]
Set-Associative Cache
Advantage: can reduce miss rate with small associativity
The choice of direct mapped or set associative depends on the
cost of a miss versus the cost of implementation
[Figure: miss rate vs. associativity (1-way to 8-way) for cache sizes from 32 KB to 512 KB; moving from direct mapped to 2- or 4-way cuts the miss rate noticeably, with diminishing returns beyond]
Design#3: Block Replacement
Which block should be replaced on a miss?

[Figure: on a miss, one cacheline must be evicted first before the new block is filled from memory]
Cache Replacement Policy
• Static
- For direct mapped cache, there is only one choice
• Random
- Replace a randomly chosen cacheline
• FIFO
- Replace the oldest cacheline
• LRU (Least Recently Used)
- Replace the least recently used line
• NRU (Not Recently Used)
- Replace one of the lines that is not recently used
- In the Itanium 2, the L1$, L2$, and L3$ use this policy
Least-Recently Used (LRU):
Key idea: evict the block that has gone the longest without being accessed (the longest reuse distance)
Practical Pseudo-LRU: O(log N)

Key idea: approximate LRU with a binary tree of N-1 bits per set.

Example: PLRU for a 4-way set-associative cache (ways A, B, C, D). Three bits are kept: an AB/CD bit (L0) at the root, plus one bit for the A/B pair and one for the C/D pair. Each bit points at the older side (0: left is older, 1: right is older), and all PLRU bits are initialized to 0.

On an access, the bits on the path to the touched way are set to point away from it. Walking the update order way A → way B → way C → way D:
- After A: L0 = 1 (way CD is older than way AB), and the A/B bit marks way B as older than way A
- After B: the A/B bit flips back (way A is older than way B)
- After C: L0 = 0 (way AB is older than way CD), and the C/D bit marks way D as older than way C
- After D: the C/D bit flips back (way C is older than way D), leaving L0 = 0

To pick a victim, follow the "older" pointers from the root: L0 = 0 leads to the A/B pair, whose bit points at way A, which is indeed the true LRU way.
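A compact C sketch of this 4-way tree-PLRU (the bit names and helper functions are illustrative assumptions, following the 0-means-left-is-older convention above):

    #include <stdio.h>

    /* Three bits per set: b0 = AB/CD (root), b1 = A/B pair, b2 = C/D pair.
     * Ways 0..3 stand for A..D; all bits start at 0. */
    typedef struct { unsigned b0, b1, b2; } plru_t;

    /* On an access, point the bits on the path AWAY from the touched way,
     * i.e., at the other, now-older, subtree. */
    static void plru_touch(plru_t *t, int way) {
        if (way < 2) {              /* A or B touched: CD half is older  */
            t->b0 = 1;
            t->b1 = (way == 0);     /* touched A -> B older (right = 1)  */
        } else {                    /* C or D touched: AB half is older  */
            t->b0 = 0;
            t->b2 = (way == 2);     /* touched C -> D older (right = 1)  */
        }
    }

    /* Victim selection follows the "older" pointers from the root. */
    static int plru_victim(const plru_t *t) {
        return (t->b0 == 0) ? (t->b1 ? 1 : 0) : (t->b2 ? 3 : 2);
    }

    int main(void) {
        plru_t t = {0, 0, 0};
        for (int way = 0; way < 4; way++)   /* update order A, B, C, D */
            plru_touch(&t, way);
        printf("victim = way %d\n", plru_victim(&t));  /* 0, i.e., way A */
        return 0;
    }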
Clock algorithm
• Simpler implementation
- A "clock" hand points to the next candidate to replace
- If R = 0 (unreferenced), replace that entry
- If R = 1 (referenced), set R = 0 and advance the clock hand
• Continue until an entry with R = 0 is found
- This may involve going all the way around the clock

[Figure: entries A-H arranged in a circle with their last-use times and R bits; the hand sweeps past referenced entries, clearing their bits, until it reaches an unreferenced one]
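A minimal C sketch of the clock sweep (the entry count and R-bit array are illustrative assumptions):

    #define NENTRIES 8

    /* referenced[i] is the R bit of entry i; hand is the clock hand. */
    static int referenced[NENTRIES];
    static int hand;

    /* Sweep until an entry with R = 0 is found, clearing R bits (giving
     * second chances) along the way; may go all the way around the clock. */
    static int clock_victim(void) {
        for (;;) {
            if (referenced[hand] == 0) {
                int victim = hand;
                hand = (hand + 1) % NENTRIES;
                return victim;
            }
            referenced[hand] = 0;          /* R = 1: reset and advance */
            hand = (hand + 1) % NENTRIES;
        }
    }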
Next design question: what happens on a write?

[Figure: cache above memory; a write must keep the two copies consistent]
Handling Cache Hits
Write-through vs. write-back
- Write-through: write the data to both the cache and the next level; a FIFO write buffer (#entries = 4 in the figure) holds pending writes so the processor does not stall waiting for memory
- Write-back: write only the cache copy and mark the block dirty; the modified block is written to the lower level when it is replaced
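As a hedged sketch (the names and sizes are assumptions), a write-through store path with the 4-entry FIFO write buffer might look like:

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4

    struct wb_entry { uint64_t addr; uint32_t data; };
    static struct wb_entry buf[WB_ENTRIES];
    static int head, count;              /* FIFO state; memory drains from head */

    /* The store updates the cache copy and is queued for memory; the CPU
     * only stalls (returns false) when the buffer is full. */
    bool write_through_store(uint64_t addr, uint32_t data) {
        if (count == WB_ENTRIES)
            return false;                /* buffer full: stall the store */
        /* update_cache(addr, data);     -- cache-side write, assumed hook */
        buf[(head + count) % WB_ENTRIES] = (struct wb_entry){ addr, data };
        count++;                         /* memory controller pops entries later */
        return true;
    }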
Handling Cache Write Miss
Write allocate vs. no write allocate
- Write allocate: on a write miss, fetch the block into the cache and then write it
- No write allocate: on a write miss, update only the lower level and do not bring the block into the cache
Cache Optimizations
We can improve cache performance by reducing the average memory access time (AMAT):

AMAT = Hit time + Miss rate x Miss penalty
Summary of Cache Optimization
Technique                     | Miss Rate | Miss Pen. | Hit Time | HW Complexity
Larger Block Size             |     +     |     -     |          |      0
Higher Associativity          |     +     |           |    -     |      1
Victim Caches                 |     +     |           |          |      2
Pseudo-associative            |     +     |           |          |      2
Hardware Prefetching          |     +     |           |          |      2
Compiler-controlled Prefetch  |     +     |           |          |      3
Compiler Techniques           |     +     |           |          |      0
Giving Read Misses Priority   |           |     +     |          |      1
Subblock Placement            |           |     +     |          |      1
Early Restart/Crit. Wd First  |           |     +     |          |      2
Nonblocking Caches            |           |     +     |          |      3
Second-Level Caches           |           |     +     |          |      2
Small and Simple Caches       |     -     |           |    +     |      0
Avoiding Address Translation  |           |           |    +     |      2
Pipelining Writes             |           |           |    +     |      1

(+ = improves, - = hurts)
Goal#1:
Miss Penalty ↓
- Multi-level caches
- Critical Word First and early restart
- Combining writes
- Non-blocking caches
Technique1: Multi-level Caches
Key idea: fill the gap between the processor and main memory by adding another level to the hierarchy

Past: Processor → L1$ → L2$ (SRAM) → Main memory → Secondary memory
Now: Processor → L1-L2 (SRAM) → L3-L4 (eDRAM) → Main memory → Storage-Class Memory (SCM) → Secondary memory
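The payoff of the extra level can be seen through AMAT; in this sketch every latency and miss rate is an assumed example number, not a figure from the slides:

    #include <stdio.h>

    /* AMAT = hit time + miss rate x miss penalty; with an L2 present, the
     * L1 miss penalty is itself the AMAT of the L2. */
    int main(void) {
        double l1_hit = 1.0,  l1_miss = 0.05;   /* 1 cycle, 5% miss rate     */
        double l2_hit = 10.0, l2_miss = 0.20;   /* 10 cycles, 20% local miss */
        double mem    = 100.0;                  /* main memory, cycles       */

        double amat_l1_only = l1_hit + l1_miss * mem;
        double amat_with_l2 = l1_hit + l1_miss * (l2_hit + l2_miss * mem);

        printf("AMAT, L1 only: %.2f cycles\n", amat_l1_only);  /* 6.00 */
        printf("AMAT, L1+L2:   %.2f cycles\n", amat_with_l2);  /* 2.50 */
        return 0;
    }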
Kinds of Cache Hierarchies
Today's processors have multi-level cache hierarchies, and multi-level caches can be designed in various ways depending on whether the content of one cache is present in the other levels (e.g., inclusive vs. exclusive designs).

[Figure: L1$/L2$ pairs illustrating the fill, evict, victim, and back-invalidation flows between the upper and lower caches for the different hierarchy kinds]

Technique2: Critical Word First and Early Restart

Key idea: don't make the processor wait for the whole block. With critical word first, the missed word of a block (e.g., one of 4 words in a 32 B block) is requested first and forwarded to the processor as soon as it arrives; with early restart, execution resumes as soon as the needed word arrives while the rest of the block keeps filling.
Technique3: Combining Writes
Background: MTRR (Memory type range register) in x86
Indicates how accesses to memory ranges by the CPU are cached
We can simply check with “cat /proc/mtrr”
[Figure: stores to Mem[100], Mem[108], Mem[116], and Mem[124] fall in a USWC-mode range and are merged in the WC buffer before going to lower-level memory; Mem[400] is in a WB-mode range and is cached through L1$/L2$/LLC; Mem[900] is in a UC-mode range and bypasses the caches uncombined]
Technique4: Non-Blocking Cache
Recall: pipeline stall by cache miss in MIPS
[Figure: five-stage MIPS pipeline (IM, Reg, ALU, DM, Reg) across Cycle1-Cycle9; a cache miss on Instruction1's DM access stalls Instruction2 and the following instructions until the miss is handled]
Miss Status Holding Registers (MSHRs)

Key idea: allow more than one outstanding miss by keeping track of cache misses and the pending loads and stores that refer to each missing cache block

The Miss Handling Architecture (MHA) is built around Miss Status Holding Registers (MSHRs): a miss allocates an MSHR entry (or joins an existing one for the same block) while the cache keeps servicing hits.
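A sketch of what one MSHR entry might track (the field names and sizes are illustrative assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_TARGETS 4

    /* One pending load/store waiting on the missing block. */
    struct mshr_target {
        bool    is_store;
        uint8_t offset_in_block;   /* which word/byte the instruction needs */
        uint8_t dest_reg;          /* destination register, for a load      */
    };

    /* One in-flight miss: later misses to the same block_addr merge here
     * instead of blocking the pipeline. */
    struct mshr_entry {
        bool     valid;
        uint64_t block_addr;       /* address of the missing cache block    */
        int      num_targets;
        struct mshr_target targets[MAX_TARGETS];
    };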
Goal#2:
Miss Rate ↓
- Hardware prefetching
- Compiler techniques
Technique1: Hardware Prefetching
Key idea: hardware monitors memory accesses and fetches the predicted next blocks into the cache ahead of demand

[Figure: Intel Core2 block diagram marking the prefetcher locations next to the I-TLB/L1 I-Cache, the L1 D-Cache/D-TLB, and the L2 cache]
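As one simple concrete scheme (an illustrative assumption, not the Core2's actual prefetcher), a next-line prefetcher issues a fetch for the following block on every demand access:

    #include <stdint.h>

    #define BLOCK_BYTES 64

    /* Assumed hook into the memory side of the cache. */
    static void cache_fetch(uint64_t block_addr) { (void)block_addr; }

    /* On each demand access, also request the next sequential block. */
    void on_demand_access(uint64_t addr) {
        uint64_t block = addr / BLOCK_BYTES;
        cache_fetch((block + 1) * BLOCK_BYTES);
    }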
Technique2: Cache-friendly Compiler
Key idea: trying to modify the layout of data structures so that they are
accessed in a more cache-friendly manner
Examples (optimized forms):

1) Array merging, which improves spatial locality:

    struct merge {
        int val;
        int key;
    } merged[SIZE];

2) Loop fusion, which improves temporal locality (every access in the second statement hits):

    for (i = 0; i < 10000; i++) {
        a[i] = 1 / a[i];
        sum = sum + a[i];
    }

3) Loop interchange, which improves spatial locality (sequential access, not striding):

    for (i = 0; i < N-1; i++)
        for (j = 0; j < N-1; j++)
            x[i][j] = 2 * x[i][j];
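To see why loop interchange helps, here is a hedged before/after sketch (illustrative code, not from the slides): both functions compute the same result, but only the second walks C's row-major layout with unit stride:

    #define N 1024
    static double x[N][N];

    /* Cache-unfriendly: consecutive iterations stride N*8 bytes apart,
     * touching a new cache block on almost every access. */
    void scale_column_major(void) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                x[i][j] = 2 * x[i][j];
    }

    /* Cache-friendly (interchanged): unit stride over row-major storage,
     * so one fetched block serves several consecutive iterations. */
    void scale_row_major(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 2 * x[i][j];
    }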
Goal#3:
Hit Time ↓
- Avoiding address translation during
indexing of the cache (virtual cache)
- Way prediction
- Trace cache
Technique1: Virtually Indexed Cache
Key idea: the CPU issues virtual addresses that must be mapped to physical addresses; let's index the cache with the virtual address (not the physical address)
Challenge: process switches require cache purging
Solution: PID tags
Challenge: aliasing (two different virtual addresses may map to the same physical address)
Solution: anti-aliasing hardware / page coloring / using the page offset
Source: http://ece-research.unm.edu/jimp/611/slides/chap5_4.html
Technique2: Way Prediction
Key idea: make set-associative caches faster by keeping extra bits in the cache to predict the "way" (the block within the set) of the next cache access; the branch predictor can override the decision of the way predictor
Example: Way prediction instruction cache (Alpha 21264-like)
[Figure: Alpha 21264-like fetch path; the PC selects between the sequential way and the branch-target way, with jump/add control (PC + 0x4 vs. jump target) choosing the next fetch address]
Technique3: Trace Cache
Key idea: make an instruction cache faster by packing multiple non-
contiguous basic blocks into one contiguous trace cache line
Example: Pentium 4 (NetBurst) Trace Cache
Trace cache stores decoded and cracked instructions
Micro-operations (uops): returns 6 uops every other cycle
[Figure: basic blocks A, B, C, and D separated by branches take an IC fetch of 5 cycles, while the packed trace is delivered by a TC fetch in 1 cycle]
Wrap Up
Different cache designs and policies can be chosen based on the target performance
Cache Parameters in Real μProcessor
Parameter        | Intel P4                               | AMD Opteron
L1 organization  | Split I$ and D$                        | Split I$ and D$
L1 cache size    | 8KB for D$, 96KB for trace cache (~I$) | 64KB for each of I$ and D$
L1 block size    | 64 bytes                               | 64 bytes
L1 associativity | 4-way set assoc.                       | 2-way set assoc.
L1 replacement   | ~LRU                                   | LRU
L1 write policy  | write-through                          | write-back
L2 organization  | Unified                                | Unified
L2 cache size    | 512KB                                  | 1024KB (1MB)
L2 block size    | 128 bytes                              | 64 bytes
L2 associativity | 8-way set assoc.                       | 16-way set assoc.
L2 replacement   | ~LRU                                   | ~LRU
L2 write policy  | write-back                             | write-back
More about Cache
Cache partitioning
How about cache fairness in a multi-core/multi-thread environment?
- Cache Allocation Technology (CAT)
- Code and Data Prioritization (CDP)

Cache coherence
How to keep multiple local caches (per-core L1$ or L2$) synchronized; we will learn the protocols later
- Various protocols: MSI, MESI, MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly, and Dragon

[Figure: two processors with private L1 caches over a shared L2 cache and memory]
2019 EE 488
Myoungsoo Jung
Computer Division
CAMELab
Types of Caches
There are three types based on associativity: direct mapped (DM), set associative (SA), and fully associative (FA).

DM and FA can be thought of as special cases of SA:
- DM: 1-way SA
- FA: all-way SA

[Figure: the three organizations side by side; the SA example shown is 2-way]
Summary of Caches
Where can a block be placed / found?