Just follow steps 1, 2, and 3 below! (or click here for help)

1.)  Select Compute Capability (click):             3.5
1.b) Select Shared Memory Size Config (bytes):      49152

2.)  Enter your resource usage:
     Threads Per Block:                             256
     Registers Per Thread:                          32
     Shared Memory Per Block (bytes):               4096

3.)  GPU occupancy data:
     Active Threads per Multiprocessor:             2048
     Active Warps per Multiprocessor:               64
     Active Thread Blocks per Multiprocessor:       8
     Occupancy of each Multiprocessor:              100%

Physical limits for compute capability 3.5: 32 threads/warp, 64 warps/SM, 2048 threads/SM, 16 thread blocks/SM, 65536 registers/SM (allocated per warp in units of 256, max 255 per thread), 49152 bytes shared memory/SM (allocated in units of 256), warp allocation granularity 4, max thread block size 1024.

Allocatable blocks per SM for this configuration:
     Warps:          8 per block, limit 64 per SM              ->  8 blocks
     Registers:      limit 64 warps per SM                     ->  8 blocks
     Shared Memory:  4096 bytes per block, limit 49152 bytes   -> 12 blocks
Active thread blocks per SM = min(8, 8, 12) = 8

Physical Max Warps/SM = 64
Occupancy = 64 / 64 = 100%
CUDA Occupancy Calculator
Version: 5.1
(Copyright and License: see the notice at the end of this document.)
[Chart data removed: Warps/Multiprocessor tabulated against Threads Per Block, for block sizes 32 through 2048 in steps of 32.]
Click Here for detailed instructions on how to use this occupancy calculator.
For more information on NVIDIA CUDA, visit http://developer.nvidia.com/cuda
Your chosen resource usage is indicated by the red triangle on the graphs. The other data points represent the range of possible block sizes, register counts, and shared memory allocation.

[Charts removed: Multiprocessor Warp Occupancy (# warps) plotted against Threads Per Block, Registers Per Thread, and Shared Memory Per Block (bytes).]
[Chart data removed: Warps/Multiprocessor tabulated against Registers Per Thread (1 through 255) and against Shared Memory Per Block (0 through 49152 bytes, in steps of 512).]
IMPORTANT
This spreadsheet requires Excel macros for full functionality. When you load this file, make sure you enable macros, because Excel often disables them by default.
Overview
The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU.
Each multiprocessor on the device has a set of N registers available for use by CUDA program threads. These registers are a shared resource that is allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail.
The size of N on GPUs with compute capability 1.0-1.1 is 8192 32-bit registers per multiprocessor. On GPUs with compute capability 1.2-1.3, N = 16384. On GPUs with compute capability 2.0-2.1, N = 32768. On GPUs with compute capability 3.0-3.5, N = 65536.
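The launch-failure rule above can be sketched as a quick check. This is a minimal illustration, not part of the calculator itself; the function name and lookup table are our own, and the check ignores the allocation-unit rounding described later in this document:

```python
# Register file sizes (N) per compute capability, from the paragraph above.
REGISTER_FILE_SIZE = {
    "1.0": 8192, "1.1": 8192,
    "1.2": 16384, "1.3": 16384,
    "2.0": 32768, "2.1": 32768,
    "3.0": 65536, "3.5": 65536,
}

def launch_fits_registers(regs_per_thread, threads_per_block, compute_capability):
    """A launch fails if registers-per-thread times block size exceeds N."""
    n = REGISTER_FILE_SIZE[compute_capability]
    return regs_per_thread * threads_per_block <= n
```

For example, a 256-thread block using 32 registers per thread exactly fills the 8192-register file of a compute capability 1.0 device, so it still launches; 64 registers per thread would not.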
Maximizing the occupancy can help to cover latency during global memory loads that are followed by a __syncthreads(). The
occupancy is determined by the amount of shared memory and registers used by each thread block. Because of this,
programmers need to choose the size of thread blocks with care in order to maximize occupancy. This GPU Occupancy
Calculator can assist in choosing thread block size based on shared memory and register requirements.
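The calculation this spreadsheet performs can be sketched in a few lines. The following is a simplified version hard-coded for compute capability 3.5 only; it omits the warp allocation granularity step the real calculator applies, but it reproduces the example configuration on the calculator sheet (256 threads, 32 registers per thread, 4096 bytes of shared memory per block gives 100% occupancy):

```python
import math

# Physical limits for compute capability 3.5, from the GPU data in this sheet.
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64
MAX_BLOCKS_PER_SM = 16
REG_FILE_SIZE = 65536
REG_ALLOC_UNIT = 256      # registers are allocated per warp in units of 256
SMEM_PER_SM = 49152
SMEM_ALLOC_UNIT = 256     # shared memory is allocated in units of 256 bytes

def ceil_to(x, unit):
    """Round x up to the nearest multiple of unit."""
    return -(-x // unit) * unit

def occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Return (active blocks, active warps, occupancy) per multiprocessor."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)

    # Limit 1: warp and block count per multiprocessor.
    blocks_by_warps = min(MAX_BLOCKS_PER_SM,
                          MAX_WARPS_PER_SM // warps_per_block)

    # Limit 2: register file (registers are allocated per warp on sm_35).
    regs_per_warp = ceil_to(regs_per_thread * WARP_SIZE, REG_ALLOC_UNIT)
    blocks_by_regs = (REG_FILE_SIZE // regs_per_warp) // warps_per_block

    # Limit 3: shared memory per multiprocessor.
    if smem_per_block > 0:
        blocks_by_smem = SMEM_PER_SM // ceil_to(smem_per_block, SMEM_ALLOC_UNIT)
    else:
        blocks_by_smem = MAX_BLOCKS_PER_SM

    active_blocks = min(blocks_by_warps, blocks_by_regs, blocks_by_smem)
    active_warps = active_blocks * warps_per_block
    return active_blocks, active_warps, active_warps / MAX_WARPS_PER_SM
```

With the example inputs, the register and warp limits each allow 8 blocks and shared memory allows 12, so 8 blocks (64 warps) are active and occupancy is 64/64.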
Instructions
Using the CUDA Occupancy Calculator is as easy as 1-2-3. Change to the calculator sheet and follow these three steps.
1.) First select your device's compute capability in the green box.
1.b) If your compute capability supports it, you will be shown a second green box in which you can select the size in bytes of the shared memory (configurable at run time in CUDA).
2.) For the kernel you are profiling, enter the number of threads per thread block, the registers used per thread, and the total
shared memory used per thread block in bytes in the orange block. See below for how to find the registers used per thread.
3.) Examine the blue box and the graph to the right. This will tell you the occupancy, as well as the number of active threads, warps, and thread blocks per multiprocessor, and the maximum number of active blocks on the GPU. The graph will show you the occupancy for your chosen block size as a red triangle, and for all other possible block sizes as a line graph.
You can now experiment with how different thread block sizes, register counts, and shared memory usages can affect your GPU occupancy.
Determining Registers Per Thread and Shared Memory Per Thread Block
To determine the number of registers used per thread in your kernel, simply compile the kernel code using the option --ptxas-options=-v to nvcc. This will output information about register, local memory, shared memory, and constant memory usage for each kernel in the .cu file. However, if your kernel declares any external shared memory that is allocated dynamically, you will need to add the (statically allocated) shared memory reported by ptxas to the amount you dynamically allocate at run time to get the correct shared memory usage. An example of the verbose ptxas output is as follows:
    ptxas info : Compiling entry function 'my_kernel'
    ptxas info : Used 5 registers, 8+16 bytes smem
Let's say "my_kernel" contains an external shared memory array which is allocated to be 2048 bytes at run time. Then our total shared memory usage per block is 2048 + 8 + 16 = 2072 bytes. We enter this into the box labeled "shared memory per block (bytes)" in this occupancy calculator, and we also enter the number of registers used by my_kernel, 5, in the box labeled "registers per thread". We then enter our thread block size and the calculator will display the occupancy.
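The bookkeeping in this example is simple enough to state as code (the numbers are taken from the ptxas walk-through above):

```python
# Static shared memory reported by ptxas for my_kernel ("8+16 bytes smem").
static_smem = 8 + 16

# Dynamically allocated external shared memory, chosen at kernel launch time.
dynamic_smem = 2048

# Total to enter under "shared memory per block (bytes)" in the calculator.
total_smem = static_smem + dynamic_smem
print(total_smem)  # 2072
```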
Notes about Occupancy
Higher occupancy does not necessarily mean higher performance. If a kernel is not bandwidth-limited or latency-limited, then increasing occupancy will not necessarily increase performance. If a kernel grid is already running at least one thread block per multiprocessor in the GPU, and it is bottlenecked by computation and not by global memory accesses, then increasing occupancy may have no effect. In fact, making changes just to increase occupancy can have other effects, such as additional instructions, more register spills to local memory (which is off-chip), more divergent branches, etc. As with any optimization, you should experiment to see how changes affect the *wall clock time* of the kernel execution. For bandwidth-bound applications, on the other hand, increasing occupancy can help better hide the latency of memory accesses, and therefore improve performance.
GPU Data (one column per compute capability):

Compute Capability                          1.0    1.1    1.2    1.3    2.0    2.1    3.0    3.5
SM Version                                  sm_10  sm_11  sm_12  sm_13  sm_20  sm_21  sm_30  sm_35
Threads / Warp                              32     32     32     32     32     32     32     32
Warps / Multiprocessor                      24     24     32     32     48     48     64     64
Threads / Multiprocessor                    768    768    1024   1024   1536   1536   2048   2048
Thread Blocks / Multiprocessor              8      8      8      8      8      8      16     16
Max Shared Memory / Multiprocessor (bytes)  16384  16384  16384  16384  49152  49152  49152  49152
Register File Size                          8192   8192   16384  16384  32768  32768  65536  65536
Register Allocation Unit Size               256    256    512    512    64     64     256    256
Allocation Granularity                      block  block  block  block  warp   warp   warp   warp
Max Registers / Thread                      124    124    124    124    63     63     63     255
Shared Memory Allocation Unit Size          512    512    512    512    128    128    256    256
Warp Allocation Granularity                 2      2      2      2      2      2      4      4
Max Thread Block Size                       512    512    512    512    1024   1024   1024   1024

Shared Memory Size Configurations (bytes) [note: default at top of list]:
    1.0 - 1.3:  16384
    2.0 - 2.1:  49152, 16384
    3.0 - 3.5:  49152, 16384, 32768
NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SPREADSHEET AND DATA FOR ANY PURPOSE. IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND. NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SPREADSHEET AND DATA, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SPREADSHEET AND DATA.

U.S. Government End Users. This spreadsheet and data are a "commercial item" as that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting of "commercial computer software" and "commercial computer software documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995) and is provided to the U.S. Government only as a commercial end item. Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the spreadsheet and data with only those rights set forth herein. Any use of this spreadsheet and data in individual and commercial software must include, in the user documentation and internal comments to the code, the above Disclaimer and U.S. Government End Users Notice.