CUDA Image Registration
Richard Ansorge
29 Oct 2008
The problem
CT, MRI, PET and ultrasound all produce 3D volume images, typically 256 x 256 x 256 = 16,777,216 voxels. Combining different modalities (inter-modality registration) gives extra information, and repeated imaging over time with the same modality, e.g. MRI (intra-modality registration), is equally important. In both cases the images have to be spatially registered.
CUDA Image Registration 29 Oct 2008 Richard Ansorge
[Figure: example CT, MRI and PET images]
PET-MR Fusion
The PET image shows metabolic activity, which complements the MR structural information.
Registration Algorithm
[Flowchart: transform Im B to match Im A; compute the cost function against Im A; if not yet converged, update the transform parameters and repeat]
Transformations
General affine transform has 12 parameters:
    [ a11 a12 a13 a14 ]
    [ a21 a22 a23 a24 ]
    [ a31 a32 a33 a34 ]
    [  0   0   0   1  ]
Polynomial transformations can be useful, e.g. for pincushion-type distortions:

    x' = a11x + a12y + a13z + a14 + b1x^2 + b2xy + b3y^2 + b4z^2 + b5xz + b6yz
    y' = ...
    z' = ...

Local, non-linear transformations, e.g. using cubic B-splines, are increasingly popular but very computationally demanding.
[Figure: speedup factor vs number of processors (up to 64); series: SR2201, PC 333MHz, perfect scaling]
Architecture
GPU Speedup
[Figure: average GPU speedup vs image size N (0-6144); vertical axis 0-250]
GPU Speedup
[Figure: GPU speedup and average speedup vs N (0-6144); vertical axis 0-500]
[Figure: GPU speedup vs N (0-6144), also showing CPU mads/100 ns and GPU mads/ns; vertical axis 0-35]
Image Registration
CUDA Code
#include <cutil_math.h>

texture<float, 3, cudaReadModeElementType> tex1; // Target Image in texture
__constant__ float c_aff[16];                    // 4x4 Affine transform

// Function arguments are image dimensions and pointers to output buffer b
// and Source Image s. These buffers are in device memory
__global__ void d_costfun(int nx, int ny, int nz, float *b, float *s)
{
  int ix = blockIdx.x*blockDim.x + threadIdx.x; // Thread ID matches
  int iy = blockIdx.y*blockDim.y + threadIdx.y; // Source Image x-y

  float x = (float)ix;
  float y = (float)iy;
  float z = 0.0f;                               // start with slice zero
  float4 v  = make_float4(x, y, z, 1.0f);
  float4 r0 = make_float4(c_aff[ 0], c_aff[ 1], c_aff[ 2], c_aff[ 3]);
  float4 r1 = make_float4(c_aff[ 4], c_aff[ 5], c_aff[ 6], c_aff[ 7]);
  float4 r2 = make_float4(c_aff[ 8], c_aff[ 9], c_aff[10], c_aff[11]);
  float4 r3 = make_float4(c_aff[12], c_aff[13], c_aff[14], c_aff[15]); // 0,0,0,1?

  float tx = dot(r0, v);                        // Matrix multiply using dot products
  float ty = dot(r1, v);
  float tz = dot(r2, v);

  float source = 0.0f;
  float target = 0.0f;
  float cost   = 0.0f;
  uint is    = iy*nx + ix;
  uint istep = nx*ny;
  for (int iz = 0; iz < nz; iz++) {             // process all z's in same thread here
    source = s[is];
    target = tex3D(tex1, tx, ty, tz);
    is  += istep;
    v.z += 1.0f;
    tx = dot(r0, v);
    ty = dot(r1, v);
    tz = dot(r2, v);
    cost += fabs(source - target);              // other costfuns here as required
  }
  b[iy*nx + ix] = cost;                         // store thread sum for host
}
tex1: moving image, stored as a 3D texture
c_aff: affine transformation matrix, stored in constant memory
nx, ny & nz: image dimensions (assumed the same for both images)
b: output array for partial sums
s: reference image (mislabelled as the Source Image in the code comments)
int ix = blockIdx.x*blockDim.x + threadIdx.x; // Thread ID matches
int iy = blockIdx.y*blockDim.y + threadIdx.y; // Source Image x-y
float x = (float)ix;
float y = (float)iy;
float z = 0.0f;                               // start with slice zero

Which thread am I? (similar to MPI), but with one thread for each x-y pixel: 240 x 256 = 61,440 threads (cf. ~128 nodes for MPI).
Initialisations and the first matrix multiply: v is a 4-vector holding the current voxel (x, y, z) address, and tx, ty, tz hold the corresponding transformed position. cost accumulates the cost-function contributions; setting v.z to zero for the first slice is redundant here, as z is already zero. is = iy*nx+ix is the index of this thread's voxel in the first z-slice, and istep = nx*ny is the stride to the same voxel in subsequent slices.
for (int iz = 0; iz < nz; iz++) {   // process all z's in same thread here
  source = s[is];
  target = tex3D(tex1, tx, ty, tz); // NB very FAST trilinear interpolation!!
  is  += istep;
  v.z += 1.0f;                      // step to next z slice
  tx = dot(r0, v);
  ty = dot(r1, v);
  tz = dot(r2, v);
  cost += fabs(source - target);    // other costfuns here as required
}
b[iy*nx + ix] = cost;               // store thread sum for host
The loop sums contributions for all z values at a fixed x-y position. Each thread updates a different element of the 2D results array b.
e = make_float3((float)w2/2.0f, (float)h2/2.0f, (float)d2/2.0f); // fixed rotation origin
o = make_float3(0.0f);                                           // translations
r = make_float3(0.0f);                                           // rotations
s = make_float3(1.0f, 1.0f, 1.0f);                               // scale factors
t = make_float3(0.0f);                                           // tans of shears
...
Desktop 3D Registration
Registration with FLIRT 4.1: 8.5 minutes
Comments
This is actually already very useful: almost interactive (add visualisation), and further speedups are possible:
Faster card
Smarter optimiser
Overlap I/O and kernel execution
Tweak CUDA code
Intel Larrabee?
Thank you