
Introduction to CUDA

CAP 4730
Spring 2012

Tushar Athawale
Resources

• CUDA Programming Guide

• Programming Massively Parallel Processors: A Hands-on Approach
  – David Kirk
Motivation

• Process independent tasks in parallel for a given application.
• How can we modify the line-drawing routine?
  a) Divide the line into parts and assign each part to a processor.
  b) Assign a processor per scanline (each processor knows its own y
     coordinate) – see the sketch below.
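A minimal sketch of approach (b), assuming a y-major line (|dy| ≥ |dx|, y1 > y0) and one thread per scanline; the kernel and parameter names are illustrative, not part of the original slides:

__global__ void draw_line_scanlines(unsigned char* framebuffer, int width,
                                    int x0, int y0, int x1, int y1)
{
    // Each thread rasterizes the single pixel of the line on its own scanline
    int y = y0 + threadIdx.x;
    if (y > y1) return;                               // no scanline left for this thread
    float t = (float)(y - y0) / (float)(y1 - y0);     // how far along the line we are
    int x = x0 + (int)(t * (x1 - x0) + 0.5f);         // interpolated x for this y
    framebuffer[y * width + x] = 255;                 // set the pixel
}

// Launched with one thread per scanline, e.g.
// draw_line_scanlines<<<1, y1 - y0 + 1>>>(fb, width, x0, y0, x1, y1);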
Squaring Array Elements

Consider an array of 8 elements sitting in the host (CPU) memory:

void host_square(float* h_A)
{
    for (int i = 0; i < 8; i++)
        h_A[i] = h_A[i] * h_A[i];
}
Squaring Array Elements

• Array sitting in the device (GPU) memory (also called global memory)
1) Spawn the threads
2) Each thread automatically gets a number
3) The programmer controls how much work each thread does, e.g. in our
   current example:
   4 threads – each thread squares 2 elements (see the sketch below)
   8 threads – each thread squares 1 element
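As an illustration of the 4-thread case (this kernel is a sketch, not from the original slides), each thread can square two consecutive elements:

__global__ void device_square_two_each(float* d_A)
{
    int base = threadIdx.x * 2;            // thread 0 -> elements 0,1; thread 1 -> 2,3; ...
    d_A[base]     = d_A[base]     * d_A[base];
    d_A[base + 1] = d_A[base + 1] * d_A[base + 1];
}
// Launched as one block of 4 threads: device_square_two_each<<<1, 4>>>(d_A);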
Squaring Array Elements

• Suppose the programmer spawns 8 threads, where each thread squares
  1 element.
• Each thread automatically gets a number (0,0,0), (1,0,0), (2,0,0), … (7,0,0)
  in built-in registers associated with that thread. These built-in variables
  are called threadIdx.x, threadIdx.y, threadIdx.z.
Squaring Array Elements

• Write the program for only 1 thread
• The following kernel is executed by every thread in parallel:

__global__ void device_square(float* d_A)
{
    int myid = threadIdx.x;
    d_A[myid] = d_A[myid] * d_A[myid];
}
Squaring Array Elements

• Block-level parallelism?
  Suppose the programmer decides to view the array of 8 elements as 2 blocks
  of 4 threads each. Block IDs are stored in the built-in variables blockIdx.x,
  blockIdx.y, blockIdx.z.

  Block (0,0,0): threads (0,0,0) … (3,0,0)
  Block (1,0,0): threads (0,0,0) … (3,0,0)
Squaring Array Elements

Here each thread knows its own block ID and thread ID:

#define BLOCK_SIZE 4

__global__ void device_square(float* d_A)
{
    int myid = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    d_A[myid] = d_A[myid] * d_A[myid];
}
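In practice the hard-coded block size is not needed: the built-in variable blockDim.x always holds the number of threads per block, so the same kernel can be written as (a generalized sketch, not from the original slides):

__global__ void device_square(float* d_A)
{
    // blockDim.x = threads per block, set by the launch configuration
    int myid = blockIdx.x * blockDim.x + threadIdx.x;
    d_A[myid] = d_A[myid] * d_A[myid];
}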
General flow of .cu file

• Allocate host memory (malloc) for array h_A
• Initialize that array
• Allocate device memory (cudaMalloc) for array d_A
• Transfer data from host to device memory (cudaMemcpy)
• Specify the kernel execution configuration (this is very important:
  depending on it, blocks and threads automatically get assigned numbers)
• Call the kernel
• Transfer the result from device to host memory (cudaMemcpy)
• Deallocate host (free) and device (cudaFree) memory
  (a host-side sketch of this flow follows below)
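A minimal host-side sketch of this flow for the 8-element squaring example (error checking omitted; assumes the device_square kernel defined earlier):

int main(void)
{
    const int N = 8;
    size_t size = N * sizeof(float);

    float* h_A = (float*)malloc(size);                    // allocate host memory
    for (int i = 0; i < N; i++) h_A[i] = (float)i;        // initialize the array

    float* d_A;
    cudaMalloc((void**)&d_A, size);                       // allocate device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // host -> device

    device_square<<<2, 4>>>(d_A);                         // 2 blocks of 4 threads

    cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);   // device -> host

    free(h_A); cudaFree(d_A);                             // cleanup
    return 0;
}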
GPGPU

• What is GPGPU?
– General purpose computing on GPUs

• Why GPGPU?
– Massively parallel computing power
– Hundreds of cores, thousands of concurrent
threads
– Inexpensive
CPU v/s GPU

© NVIDIA Corporation 2009


CPU v/s GPU

© NVIDIA Corporation 2009


GPGPU

• How?

– CUDA

– OpenCL

– DirectCompute
CUDA

• ‘Compute Unified Device Architecture’

• Heterogeneous serial-parallel computing

• Scalable programming model

• C for CUDA – extension to C


C for CUDA

• Both serial (CPU) and parallel (GPU) code

• Kernel – function that executes on GPU


__global__ void matrix_mul(…){}
matrix_mul<<<dimGrid, dimBlock>>>(…)
• Array squaring (our example of 8 elements):
device_square<<<2, 4>>>(d_A);
• Files have the extension ‘.cu’
C for CUDA
__global__ void matrix_mul(…){

[GPU (Parallel) code]

}

int main(){

[CPU (serial) code]

matrix_mul<<<dimGrid, dimBlock>>>(…)

[CPU (serial) code]

}
Compiling

• Use nvcc to compile .cu files


nvcc -o runme kernel.cu

• Use the -c option to generate object files

nvcc -c kernel.cu
g++ -c main.cpp
g++ -o runme *.o
Programming Model

• SIMT (Single Instruction Multiple Threads)

• Threads run in groups of 32 called warps

• Every thread in a warp executes the same instruction at a time
Programming Model

• A single kernel executed by several threads

• Threads are grouped into ‘blocks’

• Kernel launches a ‘grid’ of thread blocks


Programming Model

© NVIDIA Corporation
Programming Model

• All threads within a block can
  – Share data through ‘shared memory’
  – Synchronize using ‘__syncthreads()’ (see the sketch below)

• Threads and blocks have unique IDs
  – Available through built-in variables
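A minimal sketch combining both features (the kernel name and block size are illustrative, not from the original slides): each block reverses its chunk of an array through shared memory, with __syncthreads() guaranteeing every element is loaded before any thread writes back:

#define BLOCK_SIZE 256

__global__ void reverse_block(float* d_A)
{
    __shared__ float tile[BLOCK_SIZE];      // one tile per block, visible to its threads

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;     // start of this block's chunk

    tile[tid] = d_A[base + tid];            // each thread loads one element
    __syncthreads();                        // wait until the whole tile is loaded

    d_A[base + tid] = tile[blockDim.x - 1 - tid];   // write back reversed
}
// Assumes the kernel is launched with blockDim.x == BLOCK_SIZE.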
Programming Model

© NVIDIA Corporation
Transparent Scalability

• Hardware is free to schedule thread blocks on any processor

© NVIDIA Corporation 2009


Control Flow Divergence

Courtesy Fung et al. MICRO ‘07
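Divergence arises when threads of the same warp take different branches: the warp then executes both paths serially, masking off the inactive threads. A minimal illustrative sketch (not from the original slides):

__global__ void divergent_kernel(float* d_A)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Even and odd threads of the same warp take different branches,
    // so the warp runs both branches one after the other.
    if (idx % 2 == 0)
        d_A[idx] = d_A[idx] * 2.0f;
    else
        d_A[idx] = d_A[idx] + 1.0f;
}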


Memory Model

• Host (CPU) and device (GPU) have separate memory spaces

• Host manages memory on device
  – Use functions to allocate/set/copy/free memory on the device
  – Similar to C functions
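The device-side counterparts of malloc/memset/memcpy/free, as a minimal sketch (assumes h_A is an already-allocated host array):

float* d_A;
size_t size = 8 * sizeof(float);

cudaMalloc((void**)&d_A, size);                       // like malloc, but on the device
cudaMemset(d_A, 0, size);                             // like memset
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // like memcpy, with a direction flag
cudaFree(d_A);                                        // like free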
Memory Model

• Types of device memory
  – Registers – read/write, per-thread
  – Local memory – read/write, per-thread
  – Shared memory – read/write, per-block
  – Global memory – read/write, across grids
  – Constant memory – read-only, across grids
  – Texture memory – read-only, across grids
  (a sketch of how these appear in code follows below)
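A minimal sketch of how several of these spaces appear in code (kernel and variable names are illustrative, not from the original slides):

__constant__ float scale;                    // constant memory, set from the host

__global__ void memory_spaces(float* d_A)    // d_A points to global memory
{
    __shared__ float tile[256];              // shared memory, one copy per block

    int idx = threadIdx.x;                   // idx lives in a per-thread register
    tile[idx] = d_A[idx] * scale;
    __syncthreads();
    d_A[idx] = tile[idx];
}
// On the host, constant memory is filled with:
//   cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));   // h_scale is illustrative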
Memory Model

© NVIDIA Corporation
Memory Model
[Diagram: per-thread registers and local memory; per-block shared memory
 shared by the threads of that block; grid-wide global, constant, and texture
 memory, all accessible from the host]

© NVIDIA Corporation
#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device


__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host


int main(void)
{
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 10; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
}
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c


__global__ void vecAdd(double *a, double *b, double *c, int n)
{
// Get our global thread ID
int id = blockIdx.x*blockDim.x+threadIdx.x;

// Make sure we do not go out of bounds


if (id < n)
c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )


{
// Size of vectors
int n = 100000;

// Host input vectors


double *h_a;
double *h_b;
//Host output vector
double *h_c;

// Device input vectors


double *d_a;
double *d_b;
//Device output vector
double *d_c;

// Size, in bytes, of each vector


size_t bytes = n*sizeof(double);
// Allocate memory for each vector on host
h_a = (double*)malloc(bytes);
h_b = (double*)malloc(bytes);
h_c = (double*)malloc(bytes);

// Allocate memory for each vector on GPU


cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);

int i;
// Initialize vectors on host
for( i = 0; i < n; i++ ) {
h_a[i] = sin(i)*sin(i);
h_b[i] = cos(i)*cos(i);
}

// Copy host vectors to device


cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

int blockSize, gridSize;

// Number of threads in each thread block


blockSize = 1024;
// Number of thread blocks in grid
gridSize = (int)ceil((float)n/blockSize);

// Execute the kernel


vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

// Copy array back to host


cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

// Sum up vector c and print result divided by n, this should equal 1 within error
double sum = 0;
for(i=0; i<n; i++)
sum += h_c[i];
printf("final result: %f\n", sum/n);

// Release device memory


cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

// Release host memory


free(h_a);
free(h_b);
free(h_c);

return 0;
}
Thank you

• http://developer.download.nvidia.com/prese
ntations/2009/SIGGRAPH/Alternative_Ren
dering_Pipelines.mp4
