Yuxi Liu (Ghent & Peking), Xia Zhao (Ghent), Magnus Jahre (NTNU), Zhenlin
Wang (MTU), Xiaolin Wang (Peking), Yingwei Luo (Peking),
and Lieven Eeckhout (Ghent)
GPU Memory Systems
[Figure: block diagram with a truncated "Streaming" label and three LLC slices, each attached to its own DRAM channel (Channel 1, Channel 2, Channel 3).]
Achieving high bandwidth requires effectively utilizing the parallel units in the
memory system
Entropy Valley
Bank and channel bits must be highly variable to ensure even distribution of memory requests across LLC slices, channels and banks
Memory address fields, from most significant bit to least significant bit: Row, Channel, Bank, Column, Block
Entropy is a measure of the information content of each address bit
[Figure: entropy per address bit for CPUs and for GPUs, with the "Entropy Valley" region annotated.]
Entropy valleys create significant resource imbalance in GPU memory systems - leading to
poor performance and low power-efficiency
Why Do Entropy Valleys Exist?
Column-major 1D thread block (TB) allocation: requests [0,0] to [7,0] walk along the Y-dimension.
Address bits per request (the last two bits shown are the channel bits):
[0,0]  … 0000 00 …  Channel 0
[1,0]  … 0010 00 …  Channel 0
[2,0]  … 0100 00 …  Channel 0
[3,0]  … 0110 00 …  Channel 0
[4,0]  … 1000 00 …  Channel 0
[5,0]  … 1010 00 …  Channel 0
[6,0]  … 1100 00 …  Channel 0
[7,0]  … 1110 00 …  Channel 0
All eight requests go to Channel 0; Channels 1, 2 and 3 receive none.
Entropy valleys are caused by dimension-related array indexing
Our solution: BIM-based address mapping
[Figure: 8 x 8 thread-block grid (both axes 0 to 7, Y-dimension labelled) next to the request addresses and the four DRAM channels.]
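The imbalance can be reproduced with a minimal sketch. It assumes the toy 6-bit layout suggested by the example values, where the y index sits in the upper bits and the two lowest bits select the channel. The names `toy_address` and `channel_of` are hypothetical and only illustrate the idea; they do not describe a real GPU address format.

```python
from collections import Counter

CHANNEL_BITS = 2  # four DRAM channels, as in the example above


def toy_address(y: int) -> int:
    """Toy address of request [y, 0]: bit pattern 'yyy0 cc' with cc = 00 (assumed layout)."""
    return y << 3


def channel_of(addr: int) -> int:
    """Baseline mapping: the channel is read directly from the two lowest bits."""
    return addr & ((1 << CHANNEL_BITS) - 1)


# Column-major walk along the Y-dimension for x = 0.
requests = [toy_address(y) for y in range(8)]

# Count how many requests each channel receives.
load = Counter(channel_of(a) for a in requests)

for y, a in enumerate(requests):
    print(f"[{y},0] {a:06b} -> channel {channel_of(a)}")
```

Because only the upper bits change between consecutive requests, the channel bits never vary and every request lands on the same channel.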
Getting Out of the Entropy Valley
BIM-based address mapping: Output Addr. = Binary Invertible Matrix (BIM) x Input Addr.
Column-major 1D thread block (TB) allocation, requests [0,0] to [7,0]:
Request  Input addr. bits  Output addr. bits  Channel
[0,0]    … 0000 00 …       … 0000 00 …        Channel 0
[1,0]    … 0010 00 …       … 0010 11 …        Channel 3
[2,0]    … 0100 00 …       … 0100 01 …        Channel 1
[3,0]    … 0110 00 …       … 0110 10 …        Channel 2
[4,0]    … 1000 00 …       … 1000 11 …        Channel 3
[5,0]    … 1010 00 …       … 1010 00 …        Channel 0
[6,0]    … 1100 00 …       … 1100 10 …        Channel 2
[7,0]    … 1110 00 …       … 1110 01 …        Channel 1
Perfect channel utilization! Each of the four channels receives two requests.
[Figure: three panels: 8 x 8 thread-block grid (X-dimension, Y-dimension), Memory Addresses and Requests, DRAM Channels.]
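The output addresses in this example are consistent with a simple XOR mapping, sketched below. The matrix `BIM_ROWS` is an assumption reverse-engineered from the eight example values on a toy 6-bit address; the actual matrix used in the talk is not recoverable from the slide text. Each row is a bitmask: the AND selects input bits and the XOR (parity) combines them into one output bit.

```python
from collections import Counter

# Rows of a hypothetical 6x6 binary matrix, most significant output bit first.
# The upper four bits pass through unchanged; the two channel bits are
# XOR combinations of the y bits and the original channel bits.
BIM_ROWS = [
    0b100000,
    0b010000,
    0b001000,
    0b000100,
    0b101010,  # new channel bit 1 = b5 ^ b3 ^ b1
    0b111001,  # new channel bit 0 = b5 ^ b4 ^ b3 ^ b0
]


def apply_bim(rows, addr):
    """Multiply a binary matrix by an address vector over GF(2)."""
    out = 0
    for row in rows:
        parity = bin(row & addr).count("1") & 1  # AND, then XOR-reduce
        out = (out << 1) | parity
    return out


inputs = [y << 3 for y in range(8)]          # toy addresses of [y, 0]
outputs = [apply_bim(BIM_ROWS, a) for a in inputs]
channels = [o & 0b11 for o in outputs]       # channel = two lowest bits
load = Counter(channels)

for y, (a, o) in enumerate(zip(inputs, outputs)):
    print(f"[{y},0] {a:06b} -> {o:06b} (channel {o & 0b11})")
```

With this matrix the eight requests spread evenly, two per channel, matching the distribution shown above.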
Outline
1. Introduction
4. Results
5. Conclusion
Window-based Entropy
With Greedy-Then-Oldest (GTO) warp scheduling, we heuristically set the window size
to the number of Streaming Multiprocessors (SMs)
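A minimal sketch of the idea follows. The exact definition of window-based entropy is not preserved in the slide text, so this version makes its own assumptions: it computes the Shannon entropy of one address bit inside consecutive, non-overlapping windows of requests and averages the result. The function names and the windowing policy are hypothetical.

```python
import math


def bit_entropy(p_one: float) -> float:
    """Shannon entropy, in bits, of a binary value that is 1 with probability p_one."""
    if p_one in (0.0, 1.0):
        return 0.0
    return -p_one * math.log2(p_one) - (1 - p_one) * math.log2(1 - p_one)


def window_entropy(addresses, bit, window):
    """Average entropy of one address bit over non-overlapping windows (assumed policy)."""
    total, count = 0.0, 0
    for start in range(0, len(addresses) - window + 1, window):
        chunk = addresses[start:start + window]
        ones = sum((a >> bit) & 1 for a in chunk)
        total += bit_entropy(ones / window)
        count += 1
    return total / count if count else 0.0


# Toy request stream: y index in the upper bits, channel bits fixed at 00.
addresses = [y << 3 for y in range(8)]

for bit in range(6):
    print(bit, window_entropy(addresses, bit, window=4))
```

In this toy stream, bit 5 takes both values over the whole trace but is constant inside each window of four requests, so its windowed entropy is zero. In the setting described above, `window` would be set to the number of SMs.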
Entropy Profile Examples
[Figure: per-bit entropy profiles for example workloads (y-axis: Entropy, up to 1.0); annotations: "Two channel bits and one bank bit" and "Three bank bits".]
GPU address mapping schemes must harvest entropy across broad address bit ranges
The Binary Invertible Matrix (BIM)
Output Addr. = Binary Invertible Matrix (BIM) x Input Addr.
The BIM can represent all possible address mapping schemes that consist of AND and XOR operations
• Matrix covers all possible transformations
• Invertibility criterion ensures that all possible one-to-one relations are considered
[Figure: "Example Memory Map" with a "Remap (RMP)" example.]
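The invertibility criterion can be checked with Gaussian elimination over GF(2), as in the sketch below. It assumes a square matrix stored as one integer bitmask per row; the helper names `is_invertible_gf2` and `apply_matrix` are hypothetical and are not taken from the talk.

```python
def is_invertible_gf2(rows, n):
    """Return True if the n x n binary matrix (rows as n-bit masks) has full rank over GF(2)."""
    rows = list(rows)
    rank = 0
    for col in reversed(range(n)):
        # Find a row at or below the current rank with a 1 in this column.
        pivot = next((i for i in range(rank, n) if (rows[i] >> col) & 1), None)
        if pivot is None:
            return False  # no pivot in this column: the matrix is singular
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(n):
            if i != rank and (rows[i] >> col) & 1:
                rows[i] ^= rows[rank]  # row addition over GF(2) is XOR
        rank += 1
    return True


def apply_matrix(rows, addr):
    """Matrix-vector product over GF(2): AND selects bits, parity XORs them."""
    out = 0
    for row in rows:
        out = (out << 1) | (bin(row & addr).count("1") & 1)
    return out


xor_map = [0b100, 0b010, 0b111]   # lowest output bit = XOR of all input bits
singular = [0b110, 0b110, 0b001]  # two identical rows, so not invertible

print(is_invertible_gf2(xor_map, 3), is_invertible_gf2(singular, 3))
print(len({apply_matrix(xor_map, a) for a in range(8)}))
```

An invertible matrix maps every input address to a distinct output address, which is what makes the mapping one-to-one; a singular matrix would make two addresses collide.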
Entropy Impact of Address Mapping Schemes for the MT Benchmark
[Figure: per-bit entropy profiles of MT, one panel per mapping scheme (y-axis: Entropy, ticks at 0.5 and 1.0); surviving panel titles: Baseline, Remap, PM.]
PAE, FAE, and All remove the entropy valleys – the other mapping schemes do not
Execution Time vs. DRAM Power
[Figure: average execution time (y-axis) against average DRAM power consumption (x-axis), both normalized to BASE, for BASE, PM, RMP, PAE, FAE and ALL; annotations: -1.51X and +1.30X.]
Performance
PAE improves performance by 1.31X on average compared to PM
[Figure: speed-up relative to BASE for BASE, PM, RMP, PAE, FAE and ALL on MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS and SP, plus the harmonic mean (HMEAN); surviving bar labels, in extraction order: +7.5X, +6.7X, +4.0X, +1.9X, +2.0X, +1.5X, +1.4X, +1.4X, +1.3X, +1.1X, +1.0X, +1.0X.]
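Performance per Watt
PAE improves performance per Watt by 1.25X on average compared to PM
[Figure: performance per Watt for BASE, PM, RMP, PAE, FAE and ALL on MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS and SP, plus the harmonic mean (HMEAN); one surviving bar label: +1.4X.]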
Why is PAE Most Power-Efficient?
[Figure: DRAM power breakdown in watts (background, activate, read, write) for BASE, PM, RMP, PAE, FAE and ALL on MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS and SP, plus the average (AVG).]
FAE and ALL tend to distribute requests with good DRAM page locality to different banks,
which increases the number of DRAM page activations
Conclusion
Window-Based Entropy
• A novel entropy metric tailored for the highly concurrent memory
behavior of GPU compute workloads
Binary Invertible Matrix (BIM) address mapping
• A unified representation of address mapping schemes that use
AND and XOR operations
Page Address Entropy (PAE) address mapping
• PAE improves performance by 1.31X and performance per Watt by
1.25X compared to the state-of-the-art permutation-based
address mapping scheme
Thank You!