
Introduction to CUDA

CAP 4730
Spring 2012

Tushar Athawale
Resources

• CUDA Programming Guide

• Programming Massively Parallel Processors: A Hands-on Approach
  – David Kirk
Motivation

• Process independent tasks in parallel for a given application.
• How can we modify the line-drawing routine?
  a) Divide the line into parts and assign each part to a processor.
  b) Assign a processor per scanline (each processor knows its own y
     coordinate) – see the sketch below.
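A minimal sketch of approach (b), assuming a y-major line (|dy| ≥ |dx|, y1 > y0) and one thread per scanline; the kernel and parameter names are illustrative, not part of the original slides:

__global__ void draw_line_scanlines(unsigned char* framebuffer, int width,
                                    int x0, int y0, int x1, int y1)
{
    // Each thread rasterizes the single pixel of the line on its own scanline
    int y = y0 + threadIdx.x;
    if (y > y1) return;                               // no scanline left for this thread
    float t = (float)(y - y0) / (float)(y1 - y0);     // how far along the line we are
    int x = x0 + (int)(t * (x1 - x0) + 0.5f);         // interpolated x for this y
    framebuffer[y * width + x] = 255;                 // set the pixel
}

// Launched with one thread per scanline, e.g.
// draw_line_scanlines<<<1, y1 - y0 + 1>>>(fb, width, x0, y0, x1, y1);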
Squaring Array Elements

Consider an array of 8 elements sitting in the host (CPU) memory:

void host_square(float* h_A)
{
    for (int i = 0; i < 8; i++)
        h_A[i] = h_A[i] * h_A[i];
}
Squaring Array Elements

• Array sitting in the device (GPU) memory (also called global memory)
1) Spawn the threads
2) Each thread automatically gets a number
3) The programmer controls how much work each thread does, e.g. in our
   current example:
   4 threads – each thread squares 2 elements (see the sketch below)
   8 threads – each thread squares 1 element
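As an illustration of the 4-thread case (this kernel is a sketch, not from the original slides), each thread can square two consecutive elements:

__global__ void device_square_two_each(float* d_A)
{
    int base = threadIdx.x * 2;            // thread 0 -> elements 0,1; thread 1 -> 2,3; ...
    d_A[base]     = d_A[base]     * d_A[base];
    d_A[base + 1] = d_A[base + 1] * d_A[base + 1];
}
// Launched as one block of 4 threads: device_square_two_each<<<1, 4>>>(d_A);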
Squaring Array Elements

• Suppose the programmer spawns 8 threads, where each thread squares
  1 element.
• Each thread automatically gets a number (0,0,0), (1,0,0), (2,0,0), … (7,0,0)
  in built-in registers associated with that thread. These built-in variables
  are called threadIdx.x, threadIdx.y, threadIdx.z.
Squaring Array Elements

• Write the program for only 1 thread
• The following kernel is executed by every thread in parallel:

__global__ void device_square(float* d_A)
{
    int myid = threadIdx.x;
    d_A[myid] = d_A[myid] * d_A[myid];
}
Squaring Array Elements

• Block-level parallelism?
  Suppose the programmer decides to view the array of 8 elements as 2 blocks
  of 4 threads each. Block IDs are stored in the built-in variables blockIdx.x,
  blockIdx.y, blockIdx.z.

  Block (0,0,0): threads (0,0,0) … (3,0,0)
  Block (1,0,0): threads (0,0,0) … (3,0,0)
Squaring Array Elements

Here each thread knows its own block ID and thread ID:

#define BLOCK_SIZE 4

__global__ void device_square(float* d_A)
{
    int myid = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    d_A[myid] = d_A[myid] * d_A[myid];
}
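In practice the hard-coded block size is not needed: the built-in variable blockDim.x always holds the number of threads per block, so the same kernel can be written as (a generalized sketch, not from the original slides):

__global__ void device_square(float* d_A)
{
    // blockDim.x = threads per block, set by the launch configuration
    int myid = blockIdx.x * blockDim.x + threadIdx.x;
    d_A[myid] = d_A[myid] * d_A[myid];
}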
General flow of .cu file

• Allocate host memory (malloc) for array h_A
• Initialize that array
• Allocate device memory (cudaMalloc) for array d_A
• Transfer data from host to device memory (cudaMemcpy)
• Specify the kernel execution configuration (this is very important:
  depending on it, blocks and threads automatically get assigned numbers)
• Call the kernel
• Transfer the result from device to host memory (cudaMemcpy)
• Deallocate host (free) and device (cudaFree) memory
  (a host-side sketch of this flow follows below)
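A minimal host-side sketch of this flow for the 8-element squaring example (error checking omitted; assumes the device_square kernel defined earlier):

int main(void)
{
    const int N = 8;
    size_t size = N * sizeof(float);

    float* h_A = (float*)malloc(size);                    // allocate host memory
    for (int i = 0; i < N; i++) h_A[i] = (float)i;        // initialize the array

    float* d_A;
    cudaMalloc((void**)&d_A, size);                       // allocate device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // host -> device

    device_square<<<2, 4>>>(d_A);                         // 2 blocks of 4 threads

    cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);   // device -> host

    free(h_A); cudaFree(d_A);                             // cleanup
    return 0;
}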
GPGPU

• What is GPGPU?
– General purpose computing on GPUs

• Why GPGPU?
– Massively parallel computing power
– Hundreds of cores, thousands of concurrent
threads
– Inexpensive
CPU v/s GPU

© NVIDIA Corporation 2009


CPU v/s GPU

© NVIDIA Corporation 2009


GPGPU

• How?

– CUDA

– OpenCL

– DirectCompute
CUDA

• ‘Compute Unified Device Architecture’

• Heterogeneous serial-parallel computing

• Scalable programming model

• C for CUDA – extension to C


C for CUDA

• Both serial (CPU) and parallel (GPU) code

• Kernel – function that executes on GPU


__global__ void matrix_mul(…){}
matrix_mul<<<dimGrid, dimBlock>>>(…)
• Array squaring (our example of 8 elements):
device_square<<<2, 4>>>(d_A);
• Files have the extension ‘.cu’
C for CUDA
__global__ void matrix_mul(…){

[GPU (Parallel) code]

}

int main(){

[CPU (serial) code]

matrix_mul<<<dimGrid, dimBlock>>>(…)

[CPU (serial) code]

}
Compiling

• Use nvcc to compile .cu files


nvcc -o runme kernel.cu

• Use the -c option to generate object files

nvcc -c kernel.cu
g++ -c main.cpp
g++ -o runme *.o
Programming Model

• SIMT (Single Instruction Multiple Threads)

• Threads run in groups of 32 called warps

• Every thread in a warp executes the same instruction at a time
Programming Model

• A single kernel executed by several threads

• Threads are grouped into ‘blocks’

• Kernel launches a ‘grid’ of thread blocks


Programming Model

© NVIDIA Corporation
Programming Model

• All threads within a block can
  – Share data through ‘shared memory’
  – Synchronize using ‘__syncthreads()’ (see the sketch below)

• Threads and blocks have unique IDs
  – Available through built-in variables
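A minimal sketch combining both features (the kernel name and block size are illustrative, not from the original slides): each block reverses its chunk of an array through shared memory, with __syncthreads() guaranteeing every element is loaded before any thread writes back:

#define BLOCK_SIZE 256

__global__ void reverse_block(float* d_A)
{
    __shared__ float tile[BLOCK_SIZE];      // one tile per block, visible to its threads

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;     // start of this block's chunk

    tile[tid] = d_A[base + tid];            // each thread loads one element
    __syncthreads();                        // wait until the whole tile is loaded

    d_A[base + tid] = tile[blockDim.x - 1 - tid];   // write back reversed
}
// Assumes the kernel is launched with blockDim.x == BLOCK_SIZE.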
Programming Model

© NVIDIA Corporation
Transparent Scalability

• Hardware is free to schedule thread blocks on any processor

© NVIDIA Corporation 2009


Control Flow Divergence

Courtesy Fung et al. MICRO ‘07
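Divergence arises when threads of the same warp take different branches: the warp then executes both paths serially, masking off the inactive threads. A minimal illustrative sketch (not from the original slides):

__global__ void divergent_kernel(float* d_A)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Even and odd threads of the same warp take different branches,
    // so the warp runs both branches one after the other.
    if (idx % 2 == 0)
        d_A[idx] = d_A[idx] * 2.0f;
    else
        d_A[idx] = d_A[idx] + 1.0f;
}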


Memory Model

• Host (CPU) and device (GPU) have separate memory spaces

• Host manages memory on device
  – Use functions to allocate/set/copy/free memory on the device
  – Similar to C functions
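The device-side counterparts of malloc/memset/memcpy/free, as a minimal sketch (assumes h_A is an already-allocated host array):

float* d_A;
size_t size = 8 * sizeof(float);

cudaMalloc((void**)&d_A, size);                       // like malloc, but on the device
cudaMemset(d_A, 0, size);                             // like memset
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // like memcpy, with a direction flag
cudaFree(d_A);                                        // like free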
Memory Model

• Types of device memory
  – Registers – read/write, per-thread
  – Local memory – read/write, per-thread
  – Shared memory – read/write, per-block
  – Global memory – read/write, across grids
  – Constant memory – read-only, across grids
  – Texture memory – read-only, across grids
  (a sketch of how these appear in code follows below)
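A minimal sketch of how several of these spaces appear in code (kernel and variable names are illustrative, not from the original slides):

__constant__ float scale;                    // constant memory, set from the host

__global__ void memory_spaces(float* d_A)    // d_A points to global memory
{
    __shared__ float tile[256];              // shared memory, one copy per block

    int idx = threadIdx.x;                   // idx lives in a per-thread register
    tile[idx] = d_A[idx] * scale;
    __syncthreads();
    d_A[idx] = tile[idx];
}
// On the host, constant memory is filled with:
//   cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));   // h_scale is illustrative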
Memory Model

© NVIDIA Corporation
Memory Model
[Diagram: per-thread registers and local memory; per-block shared memory
 shared by the threads of that block; grid-wide global, constant, and texture
 memory, all accessible from the host]

© NVIDIA Corporation
#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device


__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host


int main(void)
{
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 10; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
}
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c


__global__ void vecAdd(double *a, double *b, double *c, int n)
{
// Get our global thread ID
int id = blockIdx.x*blockDim.x+threadIdx.x;

// Make sure we do not go out of bounds


if (id < n)
c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )


{
// Size of vectors
int n = 100000;

// Host input vectors


double *h_a;
double *h_b;
//Host output vector
double *h_c;

// Device input vectors


double *d_a;
double *d_b;
//Device output vector
double *d_c;

// Size, in bytes, of each vector


size_t bytes = n*sizeof(double);
// Allocate memory for each vector on host
h_a = (double*)malloc(bytes);
h_b = (double*)malloc(bytes);
h_c = (double*)malloc(bytes);

// Allocate memory for each vector on GPU


cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);

int i;
// Initialize vectors on host
for( i = 0; i < n; i++ ) {
h_a[i] = sin(i)*sin(i);
h_b[i] = cos(i)*cos(i);
}

// Copy host vectors to device


cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

int blockSize, gridSize;

// Number of threads in each thread block


blockSize = 1024;
// Number of thread blocks in grid
gridSize = (int)ceil((float)n/blockSize);

// Execute the kernel


vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

// Copy array back to host


cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

// Sum up vector c and print result divided by n, this should equal 1 within error
double sum = 0;
for(i=0; i<n; i++)
sum += h_c[i];
printf("final result: %f\n", sum/n);

// Release device memory


cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

// Release host memory


free(h_a);
free(h_b);
free(h_c);

return 0;
}
Thank you

• http://developer.download.nvidia.com/prese
ntations/2009/SIGGRAPH/Alternative_Ren
dering_Pipelines.mp4
