Derek Gerstmann
University of Western Australia
http://local.wasp.uwa.edu.au/~derek
OpenCL Event Model Usage
Outline
Execution Model
Usage Patterns
Synchronisation
Event Model
Examples
Execution Model
cl_int err;
cl_context context;
cl_uint num_devices;
cl_device_id devices[1];

err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &devices[0], &num_devices);
context = clCreateContext(0, 1, devices, NULL, NULL, &err);

cl_command_queue queue_cpu;
queue_cpu = clCreateCommandQueue(context, devices[0], 0 /* IN-ORDER */, &err);
Single Device In-Order Usage Model
[Figure: timeline with queue, execution, and memory rows over time for a WRITE, WRITE, READ command sequence]
Other Usage Patterns Which Require Synchronisation
Single Device Out-of-Order Usage Model
[Figure: a context containing an out-of-order queue, a CPU device, and memory (MEM)]
cl_uint num_devices;
cl_device_id devices[1];

err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &devices[0], &num_devices);
context = clCreateContext(0, 1, devices, NULL, NULL, &err);

cl_command_queue queue_cpu;
queue_cpu = clCreateCommandQueue(context, devices[0],
                                 CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
[Figure: timeline with queue, execution, and memory rows over time for a WRITE, WRITE, READ command sequence on an out-of-order queue]
Separate Multi-Device Usage Model
[Figure: in-order CPU queue, CPU device, and memory (MEM)]
cl_uint num_devices;
cl_device_id devices[2];

err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &devices[0], &num_devices);
err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &devices[1], &num_devices);
context_cpu = clCreateContext(0, 1, &devices[0], NULL, NULL, &err);
context_gpu = clCreateContext(0, 1, &devices[1], NULL, NULL, &err);
[Figure: two separate timelines, each with its own queue, execution, and memory rows over time]
Cooperative Multi-Device Usage Model
[Figure: an in-order CPU queue and an in-order GPU queue, each feeding its device (CPU, GPU), with one shared memory pool (MEM)]
cl_uint num_devices;
cl_device_id devices[2];

err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &devices[0], &num_devices);
err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &devices[1], &num_devices);
context = clCreateContext(0, 2, devices, NULL, NULL, &err);
[Figure: timeline of two queues, their execution, and one shared memory row over time]
Still wrong!
Both devices start executing commands as soon as they can!
Memory transactions overlap and clobber combined memory!
Synchronisation Mechanisms
clFlush()
Send all commands in the queue to the compute device
clFinish()
Flush, and then wait for all commands to finish
clEnqueueBarrier()
Enqueue a fence which ensures that all preceding commands
in the queue are complete before any commands enqueued
afterwards are processed
Event objects
Unique objects which can be used to determine command
status
clGetEventInfo()
Returns the command execution status for an event when
queried with CL_EVENT_COMMAND_EXECUTION_STATUS
cl_event returned_event;

err = clEnqueueReadBuffer(queue, buffer,
                          CL_FALSE /* non-blocking */,
                          0 /* offset */, size /* bytes to read */, ptr,
                          0, NULL,
                          &returned_event);
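The status of an event is a plain integer, so decoding it can be sketched without any OpenCL calls. The helper below is a minimal illustration, not part of the original slides: the `DEMO_` constants mirror the values of `CL_COMPLETE`, `CL_RUNNING`, `CL_SUBMITTED`, and `CL_QUEUED` from `CL/cl.h`, and the function names `status_name` and `command_is_done` are hypothetical.

```c
/* Values mirror the execution-status constants in CL/cl.h. They are
   redefined under hypothetical DEMO_ names so that this sketch builds
   without the OpenCL headers. */
enum {
    DEMO_CL_COMPLETE  = 0,
    DEMO_CL_RUNNING   = 1,
    DEMO_CL_SUBMITTED = 2,
    DEMO_CL_QUEUED    = 3
};

/* Map an execution status to a readable label.
   Negative values indicate that the command terminated with an error. */
static const char *status_name(int status)
{
    if (status < 0)
        return "ERROR";
    switch (status) {
    case DEMO_CL_COMPLETE:  return "COMPLETE";
    case DEMO_CL_RUNNING:   return "RUNNING";
    case DEMO_CL_SUBMITTED: return "SUBMITTED";
    case DEMO_CL_QUEUED:    return "QUEUED";
    default:                return "UNKNOWN";
    }
}

/* A command has stopped (finished or failed) once its status
   is less than or equal to the COMPLETE value. */
static int command_is_done(int status)
{
    return status <= DEMO_CL_COMPLETE;
}
```

In a real program the integer would come from querying `returned_event` for `CL_EVENT_COMMAND_EXECUTION_STATUS`; polling `command_is_done` on it is an alternative to blocking on the event.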
Event-Based Synchronisation
clWaitForEvents(num_events, event_list)
Waits and blocks until all commands identified by events in
the given event list are complete
cl_event read_event;

err = clEnqueueReadBuffer(queue, buffer,
                          CL_FALSE /* non-blocking */,
                          0 /* offset */, size /* bytes to read */, ptr,
                          0, NULL,
                          &read_event);

cl_uint num_events_in_waitlist = 2;
cl_event event_waitlist[2] = { event_one, event_two };

err = clEnqueueReadBuffer(queue, buffer,
                          CL_FALSE /* non-blocking */,
                          0 /* offset */, size /* bytes to read */, ptr,
                          num_events_in_waitlist,
                          event_waitlist, NULL);
cl_uint num_events_in_waitlist = 2;
cl_event event_waitlist[2];
cl_event read_event;
cl_ulong start, end;
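The `read_event` and `start`/`end` declarations above look like the setup for timing a command through its event; that reading is an assumption, since the rest of the slide did not survive. Under that assumption, the arithmetic can be sketched as follows. OpenCL profiling counters are 64-bit values in nanoseconds; here `uint64_t` stands in for `cl_ulong` and the name `elapsed_ms` is hypothetical.

```c
#include <stdint.h>

/* Convert a pair of profiling timestamps (nanoseconds) into elapsed
   milliseconds. uint64_t stands in for cl_ulong so that the sketch
   builds without the OpenCL headers. */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    if (end_ns < start_ns)
        return 0.0;  /* guard against inconsistent timestamps */
    return (double)(end_ns - start_ns) / 1.0e6;
}
```

In a real program the two values would typically be read with `clGetEventProfilingInfo` using `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`, after the event has completed and on a queue created with `CL_QUEUE_PROFILING_ENABLE`.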
clEnqueueWaitForEvents()
Enqueue a wait on a list of events such that all of them must
complete before any command enqueued afterwards in this queue
can be executed
clEnqueueMarker()
Enqueue a command which marks this location in the queue
with a unique event object that can be used for synchronisation
Cooperative Multi-Device Usage Model
[Figure: commands enter an in-order CPU queue and an in-order GPU queue; the CPU and GPU share one memory pool (MEM)]
clFlush(queue_cpu); clFlush(queue_gpu);
[Figure: timeline of both queues, their execution, and shared memory over time]
Event wait lists provide synchronised execution
Execution is dependent on prior command execution status
View of shared memory pool is consistent during execution
Could submit other work to fill in the time gaps!
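The effect of event wait lists on execution order can be illustrated with a toy model in plain C. This is only a sketch of the idea, not OpenCL code and not the scheduler any runtime actually uses: the `toy_command` type and the `toy_ready` and `toy_run` functions are hypothetical. A command runs only when every entry in its wait list has completed, regardless of the order in which commands were submitted.

```c
#define MAX_WAIT 4

/* Hypothetical stand-in for an enqueued command and its event. */
typedef struct {
    const char *name;     /* label for the command */
    int num_wait;         /* number of valid entries in wait[] */
    int wait[MAX_WAIT];   /* indices of commands that must complete first */
    int complete;         /* set to 1 once the command has "executed" */
} toy_command;

/* A command is ready when every event in its wait list is complete. */
static int toy_ready(const toy_command *cmds, int i)
{
    for (int k = 0; k < cmds[i].num_wait; ++k) {
        if (!cmds[cmds[i].wait[k]].complete)
            return 0;
    }
    return 1;
}

/* Repeatedly execute any ready command and record the execution order.
   Returns the number of commands executed, which is less than n if the
   wait lists contain a cycle. */
static int toy_run(toy_command *cmds, int n, int *order)
{
    int done = 0;
    int progressed = 1;
    while (done < n && progressed) {
        progressed = 0;
        for (int i = 0; i < n; ++i) {
            if (!cmds[i].complete && toy_ready(cmds, i)) {
                cmds[i].complete = 1;
                order[done++] = i;
                progressed = 1;
            }
        }
    }
    return done;
}
```

For example, if a GPU read is submitted first but waits on a GPU kernel, which in turn waits on a CPU write, `toy_run` executes the write, then the kernel, then the read. Commands with empty wait lists are free to run in the gaps.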
Event-Based Synchronisation
size_t origin[] = { 0, 0, 0 };
size_t region[] = { width, height, 1 };
cl_event convert_event;
Streaming Data
[Figure: pipeline alternating kernel and native stages]
On the device:
filter kernel reads from cl_mem A and writes to cl_mem B
Some caveats:
Size of data stream must be large enough to offset transfer costs
Complexity of filter must be high enough to offset launch costs
Both stream_write and stream_read methods must be timely
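The first two caveats can be read as a simple break-even check. The sketch below is an assumed, deliberately crude cost model that is not from the original slides: it ignores any overlap between transfers and compute, and every field of the hypothetical `stream_cost` struct is a value the caller would have to measure.

```c
/* Hypothetical cost inputs, measured by the caller.
   Times are in seconds, sizes in bytes, bandwidth in bytes per second. */
typedef struct {
    double bytes_in;         /* data written to the device */
    double bytes_out;        /* data read back from the device */
    double bandwidth;        /* host <-> device transfer rate */
    double launch_overhead;  /* fixed cost of launching the filter kernel */
    double device_compute;   /* filter run time on the device */
    double host_compute;     /* filter run time if kept on the host */
} stream_cost;

/* Total time for the offloaded path: transfers, launch, then compute.
   No overlap between the stages is assumed. */
static double offload_time(const stream_cost *c)
{
    double transfer = (c->bytes_in + c->bytes_out) / c->bandwidth;
    return transfer + c->launch_overhead + c->device_compute;
}

/* Offloading is worthwhile only if it beats running on the host. */
static int offload_pays_off(const stream_cost *c)
{
    return offload_time(c) < c->host_compute;
}
```

With a tiny stream the fixed launch overhead dominates and the check fails; with a large stream and an expensive filter the device compute savings outweigh the transfer cost.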
[Figure: stream_read and stream_write stages running alongside kernel and native stages]
Summary
Course Organisers
AMD / NVIDIA / Khronos / Apple