Christina Giannoula 3
Address Translation
[Figure: the application's virtual address space is mapped onto the physical address space at page granularity]

The page table stores the virtual-to-physical mappings:

Page Table
  Virtual     Physical
  Page 0  ->  Page 60
  Page 1  ->  Page 43
  Page 2  ->  Page 16
  Page 3  ->  Page 63
  Page 4  ->  Page 64
  Page 5  ->  Page 42
  Page 6  ->  Page 73
  Page 7  ->  Page 234
Page Table Walk
To look up a mapping, the hardware performs a page table walk, which takes tens to hundreds of cycles.

[Figure: walking the page table for virtual Page 4 returns physical Page 64: Found!]
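The walk above can be sketched as a lookup over the slide's example mapping (a toy model; real hardware walks a multi-level radix tree, which is why it costs tens to hundreds of cycles):

```python
# Toy page table from the slides: virtual page number -> physical page number.
PAGE_TABLE = {0: 60, 1: 43, 2: 16, 3: 63, 4: 64, 5: 42, 6: 73, 7: 234}
PAGE_SIZE = 4096  # 4KB small pages


def translate(vaddr):
    """Split a virtual address into (page number, offset), then walk."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    ppn = PAGE_TABLE[vpn]  # the walk itself: the slow part in hardware
    return ppn * PAGE_SIZE + offset


# An address inside virtual Page 4 lands in physical Page 64: Found!
print(translate(4 * PAGE_SIZE + 100) // PAGE_SIZE)  # prints 64
```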
Translation Lookaside Buffers (TLBs)
TLBs store recently used address translations: an address translation cache.

[Figure: a multi-level TLB hierarchy (Level 1, Level 2 TLB, ...) of limited size caches a few entries of the full page table, e.g., Page 3 -> Page 63, Page 4 -> Page 64, Page 5 -> Page 42]
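A minimal sketch of a TLB as a small LRU cache in front of the page table (toy model; the 3-entry capacity and the mapping are the slide's illustrative numbers):

```python
from collections import OrderedDict

PAGE_TABLE = {0: 60, 1: 43, 2: 16, 3: 63, 4: 64, 5: 42, 6: 73, 7: 234}


class TLB:
    """A tiny LRU address-translation cache in front of the page table."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page -> physical page
        self.hits = self.misses = 0

    def lookup(self, vpn):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark most recently used
            return self.entries[vpn]
        self.misses += 1                   # TLB miss: do the slow walk
        ppn = PAGE_TABLE[vpn]
        self.entries[vpn] = ppn
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return ppn


tlb = TLB()
for vpn in [3, 4, 5, 3, 4, 5]:  # working set fits in the 3 entries
    tlb.lookup(vpn)
print(tlb.hits, tlb.misses)  # prints "3 3": repeats hit after cold misses
```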
State-of-the-art Virtual Memory on GPUs
[Figure: multiple GPU cores share the address translation hardware; on a TLB miss, the page table and the data, both resident in GPU memory, must be accessed]
Address Translation Challenge
Small Pages (4KB) vs. Large Pages (2MB)

[Figure: with a fixed-size TLB of, e.g., 3 entries, large pages cover far more memory per entry than small pages]
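The reach gap is simple arithmetic over the slide's toy numbers:

```python
# TLB reach = number of entries x page size.
ENTRIES = 3
SMALL_PAGE = 4 * 1024          # 4KB
LARGE_PAGE = 2 * 1024 * 1024   # 2MB

print(ENTRIES * SMALL_PAGE)    # 12 KB of memory covered with small pages
print(ENTRIES * LARGE_PAGE)    # 6 MB covered with large pages: 512x the reach
```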
Demand Paging Challenge
An application requests data that is not currently resident in GPU memory; the page is transferred from CPU-side memory to GPU-side memory over the I/O bus.

• Small pages: GPU threads stall for a short time period (low transfer latency)
• Large pages: GPU threads stall for a long time period (high transfer latency)
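The stall-time asymmetry can be estimated with back-of-the-envelope numbers; the 16 GB/s I/O-bus bandwidth below is an assumption for illustration, not a figure from the slides:

```python
BUS_BYTES_PER_SEC = 16e9  # assumed I/O bus bandwidth (illustrative)


def transfer_time_us(page_bytes):
    """Time to move one page over the bus, in microseconds."""
    return page_bytes / BUS_BYTES_PER_SEC * 1e6


print(transfer_time_us(4 * 1024))         # small page: ~0.26 us stall
print(transfer_time_us(2 * 1024 * 1024))  # large page: ~131 us stall
```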
Page Size Trade-Off
Executive Summary

Problem
No single best page size for GPU virtual memory (large vs. small pages)

Goal
Transparently and efficiently enable both page sizes

Key Observation
An application's contiguously-allocated small pages can easily be coalesced into a large page; GPGPU applications typically allocate large chunks of memory at once

Key Idea
Preserve the virtual address contiguity of small pages when allocating physical memory, to simplify coalescing

Mosaic:
A hardware/software cooperative framework that enables the benefits of both small and large pages

Key Result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism
Key Ideas and Challenges

Key Ideas
• Translate using the large page size: high TLB reach
• Transfer using the small page size: low demand paging latency

[Figure: contiguous small pages in the virtual address space remain contiguous in the physical address space; data moves between GPU memory and CPU memory over the I/O bus]
Challenges with Multiple Page Sizes

[Figure: over time, App1 and App2 allocations interleave within the same large page frames (frames 1-5), with some pages left unallocated]

With the state-of-the-art allocator:
• Need to search which pages to coalesce
• Cannot coalesce without migrating multiple pages
Why page migration is bad
[Figure: page migration moves data within GPU memory and requires the GPU cores' page table walkers to update the page table, also resident in GPU memory]
Desirable Allocation
[Figure: each large page frame holds pages from only one application, so pages can be coalesced without moving data]
Key Ideas
Mechanism

Mosaic: 3 components
• Contiguity-Conserving Allocation (GPU runtime)
• In-Place Coalescer (hardware)
• Contiguity-Aware Compaction (GPU runtime)
Mosaic: Data Allocation

(1) The application demands data; the GPU runtime invokes Contiguity-Conserving Allocation.
(2) Memory is allocated inside a large page frame; the page table and the data reside in GPU memory.

Key observation: en masse memory allocation, i.e., applications allocate a large number of base pages at once.
Soft guarantee: a large page frame contains pages from only a single address space.

(4) When the data transfer is done, the runtime sends the In-Place Coalescer a list of the large page frame addresses that were allocated.
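A sketch of the contiguity-conserving allocation idea as I read it from the slides (illustrative structure, not the paper's exact algorithm): base pages are handed out within large page frames, and no frame ever mixes address spaces, so a full frame can later be coalesced in place.

```python
PAGES_PER_FRAME = 512  # 2MB large frame / 4KB base pages


class Allocator:
    def __init__(self, num_frames):
        self.owner = [None] * num_frames  # which app owns each large frame
        self.used = [0] * num_frames      # base pages used in each frame

    def alloc(self, app, num_pages):
        """Allocate base pages without ever mixing apps in one large frame."""
        placed = []
        for f in range(len(self.owner)):
            if self.owner[f] in (None, app) and self.used[f] < PAGES_PER_FRAME:
                self.owner[f] = app
                take = min(num_pages - len(placed),
                           PAGES_PER_FRAME - self.used[f])
                placed += [(f, self.used[f] + i) for i in range(take)]
                self.used[f] += take
                if len(placed) == num_pages:
                    return placed
        raise MemoryError("out of large page frames")


alloc = Allocator(num_frames=4)
alloc.alloc("App1", 512)       # App1 fills large frame 0 completely
print(alloc.alloc("App2", 1))  # App2 starts a fresh frame: prints [(1, 0)]
```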
Mosaic: Coalescing

The In-Place Coalescer receives the list of coalesceable pages and updates the page tables (6) in GPU memory; the data itself does not move.

Translation then checks a Coalesced Bit per large page frame:
• Coalesced Bit = 1: translate via the Large Page Table using [large page number | page offset]
• Coalesced Bit = 0: translate via the Small Page Table using [small page number | page offset]
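The two translation paths can be sketched as bit slicing (field and table names are illustrative; the slide only specifies the coalesced bit and the [page number | page offset] split):

```python
SMALL_OFFSET_BITS = 12  # 4KB small pages
LARGE_OFFSET_BITS = 21  # 2MB large pages


def translate(vaddr, coalesced, large_table, small_table):
    """Pick the large- or small-page path based on the coalesced bit."""
    if coalesced:
        lpn = vaddr >> LARGE_OFFSET_BITS
        off = vaddr & ((1 << LARGE_OFFSET_BITS) - 1)
        return (large_table[lpn] << LARGE_OFFSET_BITS) | off

    spn = vaddr >> SMALL_OFFSET_BITS
    off = vaddr & ((1 << SMALL_OFFSET_BITS) - 1)
    return (small_table[spn] << SMALL_OFFSET_BITS) | off


# Same virtual address, resolved through either table:
print(hex(translate(0x1234, True, {0: 3}, {})))   # prints 0x601234
print(hex(translate(0x1234, False, {}, {1: 7})))  # prints 0x7234
```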
Mosaic: Data Deallocation

(7) The application deallocates memory (gpu_free()). Contiguity-Aware Compaction checks whether the large page frame has a high degree of internal fragmentation, and frees up large page frames that are not fully used.

(8) Splinter pages: update the page table to splinter the large page back into its constituent base pages. Compacted frames are marked not coalescable.

(10) After compaction, send the GPU runtime the list of newly-freed pages.
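The fragmentation check can be sketched as a simple occupancy threshold; the 50% value is an assumption for illustration (the weaknesses slide later asks what threshold Mosaic actually uses):

```python
PAGES_PER_LARGE_FRAME = 512  # 2MB frame / 4KB base pages


def should_splinter(pages_in_use, threshold=0.5):
    """Splinter a mostly-empty large frame back into its base pages."""
    return pages_in_use / PAGES_PER_LARGE_FRAME < threshold


print(should_splinter(100))  # True: frame is only ~20% full
print(should_splinter(500))  # False: frame is nearly full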
Workloads
• Homogeneous workloads: multiple copies of the same application
• Heterogeneous workloads: randomly selected applications
• Multiple GPGPU applications execute concurrently
• Test suites: Parboil, SHOC, LULESH, Rodinia, CUDA SDK

Evaluation metric
Weighted Speedup = sum over i = 1..N of (IPC_shared_i / IPC_alone_i), for each application i
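The metric can be computed directly from the formula (the IPC values below are made up for illustration):

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of each application's shared IPC normalized to its alone IPC."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


# Two apps: one running at 80% of its alone IPC, one at 60%.
print(weighted_speedup([0.8, 1.2], [1.0, 2.0]))  # ~1.4
```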
Comparison Points
1. State-of-the-art CPU-GPU memory management: GPU-MMU, based on [Power et al., HPCA '14]
   • Utilizes parallel page walks, TLB request coalescing, and a page walk cache to improve performance
   • Limited TLB reach (4KB pages)
Evaluation

Homogeneous workloads
[Figure: performance results for homogeneous workloads]

Heterogeneous workloads
Mosaic:
• significantly improves performance for TLB-friendly workloads
• provides less benefit for TLB-sensitive workloads

TLB Hit Rate
[Figure: TLB hit rate results]
More Results in the Paper
Per-application IPC
Summary

Problem
No single best page size for GPU virtual memory (large vs. small pages)

Goal
Transparently and efficiently enable both page sizes

Key Observation
An application's contiguously-allocated small pages can easily be coalesced into a large page; GPGPU applications typically allocate large chunks of memory en masse

Key Idea
Preserve the virtual address contiguity of small pages when allocating physical memory, to simplify coalescing

Mosaic:
A hardware/software cooperative framework that enables the benefits of both small and large pages

Key Result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism
Strengths & Weaknesses

Strengths

Intuitive idea:
• Exploits the benefits of using both small and large pages
• Well-written, insightful paper

Mechanism:
• Avoids page migration when memory allocation is requested
• Application-transparent support for multiple page sizes
• Data can be accessed using either page size

Evaluation:
• Investigates virtual memory when multiple GPGPU applications run concurrently
• Explores performance in the case of highly fragmented pages
• Wide variety of workloads explored
• Available online
Weaknesses

Mechanism:
• Provides only a soft guarantee that a large page frame contains pages from a single address space
• What is the threshold after which Mosaic splinters a large page frame into small pages?
• Needs many changes in the system stack: software-hardware cooperative solutions are not always easy to adopt

Evaluation:
• No comparison with an approach that uses large page frames
• Mosaic mainly benefits TLB-friendly applications; less benefit for TLB-sensitive applications and highly fragmented pages
• Models sequential page walks in simulation
• Simulation-based evaluation
Takeaways
• A novel idea that enables the benefits of both small and large pages
• Hardware/software cooperative framework
• Application-transparent support for multiple page sizes
• No TLB flush when coalescing
• Available online
• Easy-to-read and easy-to-understand paper
Open Discussion
printf("Thank you");