
Internship Report (Depth Estimation From Monocular Images)
Prithvi Poddar
July 25, 2019

Introduction
My task in this internship was to develop a deep learning based algorithm that generates dense
depth maps from monocular images, to be ultimately deployed on an autonomous vehicle.

Part I - Working on DenseDepth

Work done
I started by working on the code developed by Ibraheem Alhashim and Peter Wonka
from KAUST. Their model, called DenseDepth, can be found at
https://github.com/ialhashim/DenseDepth. It is based on transfer learning and is built
upon DenseNet (https://github.com/liuzhuang13/DenseNet). DenseDepth is essentially an
autoencoder that uses a dense convolutional neural network and skip connections to produce
a dense depth map of the input image. I made a few necessary changes to the code, which can
be found at https://github.com/prithvi-poddar/DenseDepth. The changes adapt the model
to our requirement of real-time depth map output from a continuous camera video feed.
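For illustration, the sketch below shows the shape of such a real-time loop: OpenCV pulls
frames from the camera and each frame is pushed through a loaded Keras checkpoint. The
checkpoint path, input resolution, and custom-layer registration are placeholders here, not
DenseDepth's exact API.

    import cv2
    import numpy as np
    from tensorflow.keras.models import load_model

    # DenseDepth defines custom Keras layers; per its repository, they must
    # be registered here. The path and resolution below are placeholders.
    model = load_model("kitti.h5", custom_objects={}, compile=False)

    cap = cv2.VideoCapture(0)  # continuous feed from the on-board camera
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Resize and normalize to the resolution the checkpoint expects.
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        rgb = cv2.resize(rgb, (1280, 384)).astype(np.float32) / 255.0
        depth = model.predict(rgb[np.newaxis, ...])[0, :, :, 0]
        # Normalize for display only; the raw output is the depth estimate.
        disp = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        cv2.imshow("depth", cv2.applyColorMap(disp, cv2.COLORMAP_PLASMA))
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()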

Conclusion
The pre-trained KITTI model worked well, with depth measurements varying by ±2 m from the
ground truth. However, the model did not meet our need for real-time output: each frame
took a considerable amount of time to process on the computer that would be on board
the autonomous vehicle.

Part II - Building a recurrent neural net based model for depth from monocular images

Introduction
This is an attempt to use recurrent neural networks for depth estimation from monocular
images. The idea is to use information from the preceding frames to estimate depth
(known as depth from motion). Where models like DeepTAM
(https://github.com/lmb-freiburg/deeptam) use geometry, camera tracking, and key-frames
to generate depth maps, I take a purely deep learning based approach, using recurrent neural
networks to extract features from preceding frames. This is a supervised learning approach
and needs an immense amount of data, including continuous frames from a video sequence
and the corresponding ground truth depth.
This model closely follows the architecture used by John Mern et al. (Visual Depth
from Monocular Images using Recurrent Convolutional Neural Networks), with some
minor changes to the architecture.

Training Data
Since this is a supervised learning model, I need data from Indian college campuses, as
the final autonomous vehicle is to be deployed on Indian campuses. Because we currently
lack the means to collect our own data, I am testing the model on the KITTI dataset.
The raw ground truth depths from KITTI first need to be inpainted to generate dense
depth maps, since the raw data consists of sparse depth maps produced by the LiDAR. For
this purpose, I used the Python re-implementation of the Matlab code provided in the NYU
Depth V2 toolbox. This Python code was written by Ibraheem Alhashim and can be found
at https://gist.github.com/ialhashim/be6235489a9c43c6d240e8331836586a
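As a sketch of this preprocessing step, densifying one KITTI frame looks roughly as
follows. It assumes the gist above has been saved locally as fill_depth_colorization.py
(the function name and argument order are taken from that gist), and the file paths are
placeholders.

    import numpy as np
    from PIL import Image

    # Assumed local copy of the gist linked above.
    from fill_depth_colorization import fill_depth_colorization

    # Placeholder paths: one KITTI camera frame and its projected LiDAR depth.
    rgb = np.asarray(Image.open("image_02/0000000005.png"), np.float64) / 255.0
    sparse = np.asarray(Image.open("proj_depth/0000000005.png"), np.float64) / 256.0
    # KITTI stores depth as uint16 scaled by 256; zeros mark pixels without
    # a LiDAR return, which is exactly what the inpainting fills in.
    dense = fill_depth_colorization(rgb, sparse, alpha=1)
    np.save("dense_0000000005.npy", dense)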

Model
This model is an autoencoder built from convolutional gated recurrent units (ConvGRU).
This allows features to be extracted not only from the preceding frames but also
from the adjacent pixels within each frame. The architecture of the model is shown in the
table below.

Layer               Filter Size   Stride   Depth   Activation
ConvGRU(E0)         (3,3)         2        64      LReLU
ConvGRU(E1)         (3,3)         2        256     GRU
ConvGRU(E2)         (3,3)         2        512     GRU
ConvGRU(E3)         (3,3)         2        512     GRU
Convolutional(D0)   (1,1)         1        512     LReLU
Transpose Reshape
ConvGRU(D1)         (3,3)         1        512     GRU
Transpose Reshape
ConvGRU(D2)         (3,3)         1        256     GRU
Transpose Reshape
ConvGRU(D3)         (3,3)         1        256     GRU
Transpose Reshape
ConvGRU(D4)         (3,3)         1        128     GRU
Convolutional(D5)   (1,1)         1        1       GRU

Here E denotes an encoder layer and D a decoder layer; LReLU stands for Leaky
Rectified Linear Unit.
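To make the building block concrete, below is a minimal PyTorch sketch of a convolutional
GRU cell, implementing the standard GRU gating with convolutions in place of matrix
multiplications. This is my own illustrative implementation, not the report's exact code,
and it omits the stride-2 downsampling used in the encoder layers.

    import torch
    import torch.nn as nn

    class ConvGRUCell(nn.Module):
        """A single convolutional GRU cell operating on feature maps."""
        def __init__(self, in_ch, hidden_ch, kernel=3):
            super().__init__()
            pad = kernel // 2
            # One convolution produces both the update (z) and reset (r) gates.
            self.gates = nn.Conv2d(in_ch + hidden_ch, 2 * hidden_ch, kernel, padding=pad)
            # Candidate hidden state from the input and the reset-gated state.
            self.cand = nn.Conv2d(in_ch + hidden_ch, hidden_ch, kernel, padding=pad)
            self.hidden_ch = hidden_ch

        def forward(self, x, h=None):
            if h is None:
                h = x.new_zeros(x.size(0), self.hidden_ch, x.size(2), x.size(3))
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
            return (1 - z) * h + z * h_tilde

    # Quick shape check on a dummy four-frame sequence.
    cell = ConvGRUCell(in_ch=3, hidden_ch=64)
    h = None
    for t in range(4):
        frame = torch.rand(1, 3, 96, 320)
        h = cell(frame, h)          # hidden state carries inter-frame information
    print(h.shape)                  # torch.Size([1, 64, 96, 320])

Because the hidden state is itself a feature map, the cell mixes information across time
(previous frames) and across space (adjacent pixels) in a single operation.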
The model is trained using matched camera images and dense ground truth depth maps.
The loss function is the L1 norm of the difference between the generated depth map ŷ and
the ground truth depth map y, defined as L(y, ŷ) = (1/n) Σ_i |y_i − ŷ_i|. The optimizer
used is Adam, and each training iteration is fed a batch of 32 image sequences.
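A single training step can be sketched as follows. The network here is a stand-in for the
ConvGRU autoencoder above, and the tensors are random stand-ins for real KITTI sequences,
but the L1 loss, Adam optimizer, and batch size of 32 match the description; the learning
rate is a placeholder.

    import torch
    import torch.nn as nn

    # Stand-in network; the real model is the ConvGRU autoencoder above.
    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.LeakyReLU(),
                          nn.Conv2d(16, 1, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    l1 = nn.L1Loss()  # mean absolute difference between prediction and truth

    # Random stand-in data: a batch of 32 frames with matching depth maps.
    frames = torch.rand(32, 3, 96, 320)
    gt_depth = torch.rand(32, 1, 96, 320)

    optimizer.zero_grad()
    pred = model(frames)
    loss = l1(pred, gt_depth)
    loss.backward()
    optimizer.step()
    print(float(loss))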
This is the architecture of the model as it stood at the end of the internship period.
Further changes, along with parameter tuning and evaluation, are yet to be done.

Current status
Currently the model is still under progress. The dataset being used is kitti but the final train-
ing has to be done on Indian campus images. The architecture of the model isn’t final yet and
future changes will be made. Training, evaluation and parameter tuning are yet to be carried
out. The codes for the same can be found at https://github.com/prithvi-poddar/depth-
estimation-using-recurrent-neural-nets Kindly note that the codes have not been fi-
nalized yet.
