
FINAL REPORT

(Updated, Web Version)

Multiprocessor Programming (521288S)

Kaushik Sundarajayaraman Venkat

Submitted on: 31.05.2018


Overview:
The Zero-Normalized Cross-Correlation (ZNCC) based stereo-disparity application was developed in C++11
under the Linux environment (although development began on Windows).
Development Environment:

Hardware:  Intel i5-3570 3.4 GHz
           2 × 8 GiB DIMM DDR3 RAM 1600 MHz
           NVIDIA GeForce GTX 1060 3GB

OS:        Ubuntu 18.04 LTS; Windows 10

S/W:       gcc version 7.3.0; Visual Studio
           NVIDIA Driver Version 390.48
           OpenCL 1.2; CUDA 9.1.84

Compilation and Running:


A very simple Makefile is included. On Linux, the application can be compiled with a plain make command:

# Use bash on cse-cn0011.oulu.fi


$ bash

# Clone Repository
$ git clone https://bitbucket.org/kaushiksv2/zncc.git
$ cd zncc

# Set paths or edit Makefile


$ export LIBRARY_PATH="$LIBRARY_PATH:/usr/local/cuda-8.0/targets/x86_64-linux/lib"
$ export CPLUS_INCLUDE_PATH="$CPLUS_INCLUDE_PATH:/usr/local/cuda-8.0/targets/x86_64-linux/include"

# CPU ONLY VERSION


$ make cpu
$ ./zncc --use-gpu
Recompile with GPU support :)

# WITH GPU SUPPORT


$ make
$ ./zncc -q
CL_DEVICE_LOCAL_MEM_TYPE ............... : CL_LOCAL
CL_DEVICE_LOCAL_MEM_SIZE ............... : 49152 Bytes
CL_DEVICE_MAX_COMPUTE_UNITS ............ : 9
CL_DEVICE_MAX_CLOCK_FREQUENCY .......... : 1784 MHz
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE ..... : 65536 Bytes
CL_DEVICE_MAX_WORK_GROUP_SIZE .......... : 1024 Bytes
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS ..... : 3
CL_DEVICE_MAX_WORK_ITEM_SIZES .......... : 1024x1024x64
$ ./zncc --use-gpu -w 9
la 26.5.2018 10.42.28 +0300 :: maxdisp = 64; winsize = 09; thres = 08; nhood = 08, t_sg = 0.0000 ms;
t_d0 = 39.0000 ms; t_d1 = 39.0000 ms; t_cc = 0.0000 ms; t_of = 0.0000 ms

The application writes the final image to outputs/depthmap.png, overwriting any existing file. The `outputs`
directory must exist, and is included in the repository. The output to stdout has the following information:

Timestamp
Maximum disparity: maxdisp
Window size: winsize
Threshold for Cross-Checking: thres
Neighborhood size for Occlusion Fill: nhood
Number of threads – CPU mode only: nthreads
Time taken to shrink and greyscale the original images: t_sg
Time taken to calculate prelim disparity maps (t_d0 and t_d1): t_d0 and t_d1
Time taken for cross check: t_cc
Time taken for occlusion filling: t_of
The same line is also appended to performance_log.txt for the record. Note that the nthreads value is
included in the output only when run in CPU mode.

In GPU mode, all time values displayed are the time taken for kernel execution, usually in multiples of 1000 ns, and
hence 0.0000 ms is seen often (low-level limitation). In GPU mode, the time measured by gettimeofday was around
7.85 ms for the shrink_and_grey kernel and less than 5 ms for the cross-checking and occlusion-filling kernels,
including the calls to clEnqueueNDRangeKernel and clWaitForEvents.
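
The wall-clock figures above can be obtained from two gettimeofday samples. The following is a minimal sketch, not the code from the repository; the helper name `elapsed_ms` is hypothetical.

```cpp
#include <cassert>
#include <sys/time.h>

// Milliseconds elapsed between two gettimeofday() samples.
// Seconds and microseconds are converted separately, so a microsecond
// field that wrapped around between the samples is handled correctly.
double elapsed_ms(const timeval& start, const timeval& end) {
    assert(end.tv_sec >= start.tv_sec);
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_usec - start.tv_usec) / 1000.0;
}
```

One sample would be taken with gettimeofday(&start, nullptr) before enqueueing the kernel and one after the wait returns, so the result includes the host-side call overhead as well as the kernel itself.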

List of command line options:

kaushik@kaushik-ubuntu:~/mp/zncc$ ./zncc --help


zncc 2.0

Multithreaded and OpenCL (GPU) implementation

Usage: zncc [OPTIONS]

Computes ZNCC based depthmap. Looks for im0.png and im1.png in working
directory. Outputs to outputs/depthmap.png relative to working directory. See
zncc.cpp and zncc_gpu.cpp for details.

  -h, --help                    Print help and exit
  -V, --version                 Print version and exit
  -w, --window-size=INT         Side of the window used for zncc. Must be odd.
                                  (Ex: 11, window has 121 elements)
                                  (default=`9')
  -t, --threshold=INT           The threshold used for cross-checking.
                                  (default=`8')
  -n, --neighbourhood-size=INT  The neighbourhood size for occlusion filling.
                                  (default=`8')
  -d, --maximum-disparity=INT   The maximum disparity between images.
                                  (default=`64')
  -g, --use-gpu                 Use GPU for computation. (default=off)
  -j, --nthreads=INT            Number of threads for zncc computation. Has no
                                  effect when using GPU. (default=`1')
  -q, --query-gpu-info          Query the GPU for:
                                  CL_DEVICE_LOCAL_MEM_TYPE
                                  CL_DEVICE_LOCAL_MEM_SIZE
                                  CL_DEVICE_MAX_COMPUTE_UNITS
                                  CL_DEVICE_MAX_CLOCK_FREQUENCY
                                  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
                                  CL_DEVICE_MAX_WORK_GROUP_SIZE
                                  CL_DEVICE_MAX_WORK_ITEM_SIZES
                                  (default=off)
      --show-status             Print status-update messages that describe the
                                  ongoing activity. (default=off)
      --platform-number=INT     The platform number (different from platform
                                  ID) varies from 0..N_PLATFORMS-1. Use a tool
                                  like clinfo to customize this. (default=`0')
      --device-number=INT       The device number (different from device ID)
                                  varies from 0..N_DEVICES-1. Use a tool like
                                  clinfo to customize this. (default=`0')
  -s, --skip-depthmapping       OBSELETE. Previously, this flag had been used
                                  to skip computation of preliminary depthmaps,
                                  and reuse previously output images. Has no
                                  effect when using GPU. This option would use
                                  images specified by --image-0 and --image-1
                                  options. if ommitted, it looks for previously
                                  output files at ./outputs/ directory, and use
                                  them to perform just cross-checking and
                                  occlusion-filling. Missing files would cause
                                  the program to terminate. `d0_filepath` and
                                  `d1_filepath` in zncc.cpp define the default
                                  files that would be looked for.
                                  (default=off)
      --image-0=STRING          Image 0 filepath (default=`im0.png')
      --image-1=STRING          Image 1 filepath (default=`im1.png')
      --shrink-by=INT           Shrink factor to downscale image. Typically set
                                  to 1 when skipping depthmapping step.
                                  (default=`4')

In CPU mode, set shell variable INTIMG=1 in order to output intermediary images
to 'outputs/' directory.

Author: Kaushik Sundarajayaraman Venkat


E-mail: speak2kaushik@gmail.com, kaushik.sv@student.oulu.fi
Many images and logs from different phases of development of this application can be found at:
http://www.ee.oulu.fi/~ksundara/mpo/.

Normal depthmap computed with maxdisp=64, window_size=19, threshold=8, neighbourhood_size=25:

This is the best result observed so far.

The CPU version of depth-mapping runs robustly under a wide range of parameters. The parameters are
configurable through command line arguments, as listed in the help output above.

1470 × 1008 pixels image example:

[ksundara@cse-cn0011 zncc]$ ./zncc -j 32 -d 128 -w 35 -t 15 -n 26 --shrink-by=2 --show-status


Reading png files...
Computing depthmap 1 of 2...
Computing depthmap 2 of 2...
Cross checking...
Occlusion fill...
Done
Wed May 30 18:42:14 EEST 2018 :: maxdisp = 128; thres = 15; winsize = 35; nhood = 26, nthreads
= 32; t_sg = 23.1820 ms; t_d0 = 72679.0490 ms; t_d1 = 72088.3940 ms; t_cc = 5.6570 ms; t_of =
953.0360 ms

URL: http://www.ee.oulu.fi/~ksundara/mpo/depthmap.hd.png
Application Development:
Initial Stage Problem:
The programming exercise was originally begun on 1st of February with Microsoft Visual C++, as a dialog
based MFC application. After some experimentation with the LodePNG functions, quite some time was spent
deciding the best way to perform the looping. The initial code involved plenty of boundary checking,
and shrank the window at the boundaries so that it fit within the image. These complications were then
removed as they were deemed unnecessary.
The wrong preliminary disparity map at that early stage looked something like this:

This is the portion of the code that caused the problem:

    zncc = 0; // <-- This was inside the `for` loop below and went unnoticed for a long time.

    for (d = minimum_disparity; d < maximum_disparity; d++){
        // Right window mean
        mean_r = 0;
        for (i = -((window_size - 1) / 2) ; i <= ((window_size - 1) / 2); i++){
            // . . .

The few non-white grey values seen in the pictures were a result of negative zncc.
Without this having been fixed, the multithreading part was implemented on Windows in the hope that changing
parameters would produce an image like the one shown in the course instructions document. Later in Linux,
after confirming that the problem was in the algorithm itself rather than an incorrect window_size, a
binge-debugging session finally helped spot the problem.
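
The effect of a misplaced reset can be reproduced in isolation. The sketch below is not the original code: it assumes that `zncc` held the best score found so far, and the function names and the `scores` vector (where `scores[d]` stands in for the ZNCC value at disparity `d`) are hypothetical. Under that assumption, resetting the running best inside the loop lets every positive score "win", so the last disparity with a positive score is reported. That would be consistent with a mostly white map in which only negative scores leave darker pixels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Buggy variant: the running best is reset on every iteration, so any
// score above 0 replaces the stored disparity and the last such d wins.
int best_disparity_buggy(const std::vector<float>& scores) {
    assert(!scores.empty());
    int best_d = 0;
    float best = 0.0f;
    for (std::size_t d = 0; d < scores.size(); ++d) {
        best = 0.0f;  // the misplaced reset
        if (scores[d] > best) {
            best = scores[d];
            best_d = static_cast<int>(d);
        }
    }
    return best_d;
}

// Fixed variant: the running best is initialised once, before the loop,
// so only a strictly better score can replace the stored disparity.
int best_disparity_fixed(const std::vector<float>& scores) {
    assert(!scores.empty());
    int best_d = 0;
    float best = 0.0f;
    for (std::size_t d = 0; d < scores.size(); ++d) {
        if (scores[d] > best) {
            best = scores[d];
            best_d = static_cast<int>(d);
        }
    }
    return best_d;
}
```

For the scores {0.2, 0.9, 0.4, -0.1} the buggy variant returns disparity 2 (the last positive score) while the fixed variant returns 1 (the true maximum).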
The correct output for window_size=9 is shown below:

(Moving the window at im1.png, keeping that of im0.png fixed.)
(Moving the window at im0.png, keeping that of im1.png fixed.)

Switch over to Linux:


The wrong output and the very long execution time (1-2 minutes) on the Windows 10 machine meant that trying
many combinations of window size would take many hours, even in release mode with optimizations turned on.
The 4-core i5-3570 CPU could do only so much. This brought about a need to run the application on a
high-end computer with plenty of parallel CPU cores. It was planned to use the cse-cn0011.oulu.fi server to
run these jobs. Hence the Windows application was abandoned, and development continued in the Linux
environment. In addition to having 40 cores, the server also has 8 x NVIDIA TESLA K80 devices, which were
expected to give a significant boost in performance later when switching to OpenCL/GPU.
It was my first time writing a Makefile, and I learnt quite a few valuable things about Makefiles and
build processes in general.
The GNU Gengetopt tool was used to auto-generate cmdline.h and cmdline.cpp from getopts.ggo to easily
parse command line arguments.
The example at http://svn.clifford.at/tools/trunk/examples/cldemo.c* helped in understanding OpenCL
usage.

Source Code Organization:


.c/.cpp files: main.cpp, zncc.cpp, zncc_gpu.cpp, util.cpp and cmdline.c

Generally, what each function or set of lines does is described in comments.

main.cpp parses the command line arguments and depending upon the mode of operation (default CPU; use -g
or --use-gpu flag for OpenCL/GPU), it invokes exec_project_cpu or exec_project_gpu functions which are defined in
zncc.cpp and zncc_gpu.cpp respectively.

util.cpp has many helper functions.

cmdline.c is used for command line parsing, and is auto-generated from getopts.ggo by GNU gengetopt 2.22.6.

* - The links to cldemo.c were erroneously pointing to a different version, and have been rectified in this document.

.h files: includes.h, zncc.h, util.h and cmdline.h

includes.h is for easy inclusion of header files.

The other headers declare functions, a few static helper functions, macros etc.

zncc.cl: All __kernel functions are kept in one .cl file, for simplicity.

getopts.ggo: Command line options for GNU gengetopt.

Salient Features of Code:


1. Optimization: Left Image Mean and Standard Deviation

In calculating the preliminary depth-maps using the ZNCC method, the mean and standard deviation of the window
on the left image are computed just once per pixel, and reused for calculating ZNCC values along the different
‘d’ (disparity) values. This saves many computations. To my surprise, many implementations on the internet
are oblivious to this basic and easy optimization opportunity. For a 735x504 image with maxdisp=64, this would
save up to 735 × 504 × 64 = 23708160 calculations if the optimizer fails to hoist the loop-invariant
computations, which is especially likely when the buffer is not marked as const, or in some other edge cases.
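
The idea can be sketched as follows. This is a simplified illustration rather than the code in zncc.cpp: `WindowStats`, `window_stats` and `best_disparity` are hypothetical names, the images are plain row-major vectors of double, and the caller is assumed to keep all windows inside the image.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct WindowStats {
    double mean;
    double dev;  // square root of the sum of squared deviations from the mean
};

// Statistics of the win x win window centred at (cx, cy).
WindowStats window_stats(const std::vector<double>& img, int width,
                         int cx, int cy, int win) {
    const int r = (win - 1) / 2;
    double sum = 0.0;
    for (int i = -r; i <= r; ++i)
        for (int j = -r; j <= r; ++j)
            sum += img[(cy + i) * width + (cx + j)];
    const double mean = sum / (win * win);
    double sq = 0.0;
    for (int i = -r; i <= r; ++i)
        for (int j = -r; j <= r; ++j) {
            const double v = img[(cy + i) * width + (cx + j)] - mean;
            sq += v * v;
        }
    return WindowStats{mean, std::sqrt(sq)};
}

// Disparity in [0, max_disp) with the highest ZNCC score at (cx, cy).
int best_disparity(const std::vector<double>& left,
                   const std::vector<double>& right, int width,
                   int cx, int cy, int win, int max_disp) {
    const int r = (win - 1) / 2;
    assert(win % 2 == 1 && max_disp > 0);
    assert(cx - r - (max_disp - 1) >= 0 && cx + r < width && cy - r >= 0);
    // Loop-invariant: depends only on the left window, so compute it once.
    const WindowStats ls = window_stats(left, width, cx, cy, win);
    int best_d = 0;
    double best = -2.0;  // below the ZNCC range [-1, 1]
    for (int d = 0; d < max_disp; ++d) {
        const WindowStats rs = window_stats(right, width, cx - d, cy, win);
        double cross = 0.0;
        for (int i = -r; i <= r; ++i)
            for (int j = -r; j <= r; ++j)
                cross += (left[(cy + i) * width + (cx + j)] - ls.mean) *
                         (right[(cy + i) * width + (cx + j - d)] - rs.mean);
        const double denom = ls.dev * rs.dev;
        const double zncc = denom > 0.0 ? cross / denom : 0.0;  // flat window scores 0
        if (zncc > best) {
            best = zncc;
            best_d = d;
        }
    }
    return best_d;
}
```

Calling `window_stats(left, ...)` inside the `d` loop instead would give the same result while recomputing identical left-window statistics max_disp times per pixel.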

2. Use of leading/trailing buffers:


This helps avoid bounds checking and out-of-bounds buffer reads/writes. About 64 bytes is a very small price to pay
compared with the speed gain and code simplicity obtained by avoiding the multiple bounds checks that would
otherwise be needed before accessing image_right[ <row_value> + (j+x - d) ].
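
A minimal sketch of such a padded buffer is shown below. The class `PaddedImage` and the helper `right_pixel` are hypothetical names used only for illustration, and the layout (padding before and after one contiguous greyscale buffer) is an assumption about how the leading/trailing bytes are arranged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Greyscale image stored with `pad` zeroed bytes before and after the pixel
// data, so that small negative or overshooting offsets stay inside the
// allocation.
class PaddedImage {
public:
    PaddedImage(int width, int height, int pad)
        : width_(width), height_(height), pad_(pad),
          storage_(static_cast<std::size_t>(width) * height +
                       2 * static_cast<std::size_t>(pad),
                   0) {
        assert(width > 0 && height > 0 && pad >= 0);
    }
    // Pointer to pixel (0, 0); data()[-k] is valid for 0 < k <= pad.
    std::uint8_t* data() { return storage_.data() + pad_; }
    const std::uint8_t* data() const { return storage_.data() + pad_; }
    int width() const { return width_; }
    int height() const { return height_; }
    int pad() const { return pad_; }

private:
    int width_;
    int height_;
    int pad_;
    std::vector<std::uint8_t> storage_;
};

// Reads the pixel at column (x - d) of the given row without a bounds check.
// On row 0 a negative column lands in the leading padding; on later rows it
// lands at the end of the previous row. Either way the read stays in bounds.
inline std::uint8_t right_pixel(const PaddedImage& img, int row, int x, int d) {
    assert(d >= 0 && d <= img.pad());
    return img.data()[row * img.width() + (x - d)];
}
```

With a padding of 64 bytes, any disparity up to 64 can be subtracted from a column index on the first row without leaving the allocation.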

3. Negative indices for better code readability:


Negative indices on a pointer are valid in C99 (and in C++), as long as the resulting address stays inside the
same array. This has been exploited by pointing a pointer at the middle of the window, and using negative
indices to access the left/top areas, as seen in zncc.cpp:zncc_worker and zncc.cl:compute_disparity.
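
As a small illustration of the technique (not the code from zncc_worker), the hypothetical function `window_sum` below sums a window through a pointer to its centre pixel, using signed row and column offsets.

```cpp
#include <cassert>
#include <vector>

// Sums the win x win window centred at (cx, cy) of a row-major image.
// `centre` points at the middle pixel; offsets i and j run over [-r, r],
// so centre[i * width + j] reaches the rows above and the columns to the left.
long window_sum(const std::vector<int>& img, int width, int cx, int cy, int win) {
    const int r = (win - 1) / 2;
    const int height = static_cast<int>(img.size()) / width;
    assert(win % 2 == 1);
    assert(cx - r >= 0 && cx + r < width && cy - r >= 0 && cy + r < height);
    const int* centre = img.data() + cy * width + cx;
    long sum = 0;
    for (int i = -r; i <= r; ++i)
        for (int j = -r; j <= r; ++j)
            sum += centre[i * width + j];
    return sum;
}
```

The loop body needs no extra index arithmetic for the window origin, which keeps it close to the mathematical definition of a centred window.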
4. Kernel Optimizations:
The kernel uses certain macros like IMAGE_HEIGHT and IMAGE_WIDTH, which are defined through the build options
given to clBuildProgram. The hypothesis is that this enables better loop unrolling and thereby reduces kernel
execution time. For a window size of 9, enabling this optimization along with the -cl-mad-enable build option
decreased the minimum kernel execution time from 48 ms to 27 ms, which is a very significant performance boost.
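
A build-options string of this kind could be assembled on the host as sketched below. The function name `make_build_options` is hypothetical, and only the two macros named above are included; the actual application may define more.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Builds an OpenCL compiler options string that defines the image
// dimensions as preprocessor macros and enables -cl-mad-enable.
std::string make_build_options(int image_width, int image_height) {
    assert(image_width > 0 && image_height > 0);
    std::ostringstream opts;
    opts << "-D IMAGE_WIDTH=" << image_width
         << " -D IMAGE_HEIGHT=" << image_height
         << " -cl-mad-enable";
    return opts.str();
}
```

The resulting string would be passed as the `options` argument of clBuildProgram. Because the dimensions become compile-time constants, the kernel has to be rebuilt whenever the image size changes.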

5. OpenCL Kernel: __local memory and work-group size:


A 3D range is used when enqueueing the `compute_disparity` kernel, with a work-group size of 1x1x64. This
means many work-items can execute at a time, and each work-item runs fewer loop iterations than it would
with a 2D range. Dimensions 0 and 1 represent x and y coordinates, and dimension 2 represents disparity.
All work-items of a work-group therefore belong to the same pixel, so __local memory, which is fast, can be
used to share left_mean and left_std among them. To deal with higher values of maxdisp, each work-group could
finally write its best disparity value to global memory using atomic operations. In that case, the work-group
size could be reduced to 1x1x<recommended_work_group_size>, as this should ideally free up the compute units
earlier and slightly improve performance. However, that is not implemented now. GPU maximum disparity
is restricted to 64, and the work-group size is fixed.
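
The range described above can be written down as a pair of size arrays. This is a sketch with hypothetical names (`NDRange`, `disparity_range`, `work_group_count`), not the host code from zncc_gpu.cpp. The `max_items_dim2` parameter corresponds to the third component of CL_DEVICE_MAX_WORK_ITEM_SIZES, which both queried devices report as 64.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct NDRange {
    std::array<std::size_t, 3> global;  // total work-items per dimension
    std::array<std::size_t, 3> local;   // work-group size per dimension
};

// One work-item per (x, y, d); one work-group of 1 x 1 x max_disp per pixel.
NDRange disparity_range(std::size_t width, std::size_t height,
                        std::size_t max_disp, std::size_t max_items_dim2) {
    assert(width > 0 && height > 0);
    assert(max_disp > 0 && max_disp <= max_items_dim2);
    NDRange r;
    r.global = std::array<std::size_t, 3>{{width, height, max_disp}};
    r.local = std::array<std::size_t, 3>{{1, 1, max_disp}};
    return r;
}

// Number of work-groups; each global size must be a multiple of the local size.
std::size_t work_group_count(const NDRange& r) {
    std::size_t count = 1;
    for (std::size_t k = 0; k < 3; ++k) {
        assert(r.global[k] % r.local[k] == 0);
        count *= r.global[k] / r.local[k];
    }
    return count;
}
```

The two arrays correspond to the global_work_size and local_work_size arguments of clEnqueueNDRangeKernel with work_dim = 3. For a 735x504 image and maxdisp=64 this gives one work-group per pixel, 370440 in total.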

6. Vectorization:
The Shrink/Grey kernel uses vectorized notation for demonstration purposes.

7. Memory Coalescing:
Given that there is no conditional branching between the two calls to barrier() in the ‘compute_disparity’
kernel, memory access during zncc calculation can be expected to be coalesced.
Also, certain parts of occlusion_fill had used column-major access for debugging purposes, and they were
promptly changed to row-major access.
(This commit: https://bitbucket.org/kaushiksv2/zncc/diff/zncc.cl?diff2=3b73400a0ab1&at=master)

CPU code is parallelised only during preliminary depth-mapping, and not in the other steps, because the
overhead of creating threads etc. would outweigh the benefits of parallelism for small tasks like shrinking,
greyscale conversion, cross-check, and occlusion fill. That might not hold with a thread-pool ready to execute
jobs, but that is not the focus here.
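
How the depth-mapping work is divided among the threads is not described here. One common scheme, shown purely as an assumption, is to give each thread a contiguous band of image rows. The helper `split_rows` below is a hypothetical name and only computes the bands; thread creation is left out.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Splits rows [0, height) into nthreads contiguous half-open ranges
// [begin, end) whose sizes differ by at most one row.
std::vector<std::pair<int, int> > split_rows(int height, int nthreads) {
    assert(height > 0 && nthreads > 0);
    std::vector<std::pair<int, int> > ranges;
    const int base = height / nthreads;
    const int extra = height % nthreads;  // the first `extra` bands get one more row
    int begin = 0;
    for (int t = 0; t < nthreads; ++t) {
        const int rows = base + (t < extra ? 1 : 0);
        ranges.push_back(std::make_pair(begin, begin + rows));
        begin += rows;
    }
    return ranges;
}
```

Each band could then be handed to one worker thread, for example via std::thread, with the number of bands taken from the --nthreads option.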

Performance Gain:

Timings for computation of one preliminary disparity map out of two small images:

                   window_size = 9    window_size = 19    window_size = 25
CPU (4 threads)    2008 ms            8593 ms             14548 ms
GPU                26 ms              104 ms              175 ms
Gain (CPU/GPU)     77.23              82.625              83.13
Device: Ubuntu 18.04 LTS; Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz; NVIDIA GeForce GTX 1060 3GB

                   window_size = 9    window_size = 19    window_size = 25
CPU (4 threads)    3927 ms            17218 ms            28773 ms
GPU                93 ms              335 ms              535 ms
Gain (CPU/GPU)     42                 51.39               53.78
Device: CentOS Linux 7 (Core); Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz; NVIDIA Tesla K80

It is interesting to note that the home PC is both faster and shows a better gain than the server. This can be
partly explained by the CPU/GPU clock frequencies; other factors, such as the number of users sharing the
server at a given time, may also play a part.
Device Specification Comparison:

NVIDIA TESLA K80 (cse-cn0011.oulu.fi)

[ksundara@cse-cn0011 zncc]$ ./zncc -q


CL_DEVICE_LOCAL_MEM_TYPE ............... : CL_LOCAL
CL_DEVICE_LOCAL_MEM_SIZE ............... : 49152 Bytes
CL_DEVICE_MAX_COMPUTE_UNITS ............ : 13
CL_DEVICE_MAX_CLOCK_FREQUENCY .......... : 823 MHz
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE ..... : 65536 Bytes
CL_DEVICE_MAX_WORK_GROUP_SIZE .......... : 1024 Bytes
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS ..... : 3
CL_DEVICE_MAX_WORK_ITEM_SIZES .......... : 1024x1024x64

NVIDIA GeForce GTX 1060 3GB (Ubuntu 18.04 LTS PC)

kaushik@kaushik-ubuntu:~/mp/zncc$ ./zncc -q
CL_DEVICE_LOCAL_MEM_TYPE ............... : CL_LOCAL
CL_DEVICE_LOCAL_MEM_SIZE ............... : 49152 Bytes
CL_DEVICE_MAX_COMPUTE_UNITS ............ : 9
CL_DEVICE_MAX_CLOCK_FREQUENCY .......... : 1784 MHz
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE ..... : 65536 Bytes
CL_DEVICE_MAX_WORK_GROUP_SIZE .......... : 1024 Bytes
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS ..... : 3
CL_DEVICE_MAX_WORK_ITEM_SIZES .......... : 1024x1024x64

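It can be seen that the Tesla K80 reports more compute units (13 against 9),
but the GTX 1060 in the Ubuntu PC runs at more than twice the clock frequency
(1784 MHz against 823 MHz). The higher clock is the likely reason why the
Ubuntu PC is faster than the Tesla K80, at least in this specific problem.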

Time-analysis on GPU:
The following analysis of execution time vs window size was done on the Ubuntu PC and later transferred to
cse-cn0011.oulu.fi for evaluation.

[Figure: Time (ms) vs. window size (side of square); series: w vs t_d0, w vs t_d1]

[Figure: w vs log10(t_d0); log(time_taken) vs. window size (side of square)]

It can be seen that the time taken for complete execution of the ‘compute_disparity’ kernel increases in a
predictable fashion with the window size in use. Since the work per pixel and disparity is proportional to the
number of window elements, roughly quadratic growth in the window side would be expected. Further investigation
is quite likely to throw more light on the relationship between the variables, and help fine tune real-world
applications to strike a balance between execution time and any other desired variable(s).
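
A quick way to check for quadratic growth is to divide the kernel time by the number of window elements. The helper below is a hypothetical sketch. Applied to the GTX 1060 timings from the Performance Gain table (26 ms, 104 ms and 175 ms for window sizes 9, 19 and 25), it gives about 0.32, 0.29 and 0.28 ms per window element, which is close to constant.

```cpp
#include <cassert>

// Kernel time divided by the number of window elements (w * w).
// A roughly constant value across window sizes suggests quadratic growth.
double ms_per_window_element(double kernel_ms, int window_size) {
    assert(window_size > 0);
    return kernel_ms / (static_cast<double>(window_size) * window_size);
}
```

Three data points are too few to draw a firm conclusion, so this should be read as a sanity check rather than a fitted model.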

Shell script used for the analysis: http://www.ee.oulu.fi/~ksundara/mpo/analysis/analysis.sh

The output depth maps for a wide range of parameters can be found at:
http://www.ee.oulu.fi/~ksundara/mpo/analysis/

Acknowledgements:
Thanks to the University for having provided me this opportunity to learn more about multiprocessor
programming.
Special thanks to Clifford Wolf (http://clifford.at) for having “unlicensed” his example program from
GNU GPL v2 to the public domain. The CL_CHECK and CL_CHECK_ERR macros are taken from his example
program available at http://svn.clifford.at/tools/trunk/examples/cldemo.c *
