
A LOW POWER HARDWARE IMPLEMENTATION OF MULTI-OBJECT DPM DETECTOR

FOR AUTONOMOUS DRIVING

Alaa Ali Oladiran G. Olaleye Bappaditya Dey Kasem Khalil Magdy A. Bayoumi

The Center for Advanced Computer Studies (CACS)


University of Louisiana at Lafayette, LA, USA
{ama8030, ogo8842, bxd9836, kmk8148, mab0778}@louisiana.edu

ABSTRACT

Object detection is a fundamental process in traffic management systems and self-driving cars. The deformable part model (DPM) is a popular and competitive detector because of its high precision. This paper presents a programmable, low power hardware implementation of DPM-based object detection for real-time applications. Our approach employs a very fast object detection pipeline with complementary techniques, namely fast feature pyramids, the Fast Fourier Transform (FFT) and early classification, to accelerate DPM with a reasonable accuracy loss, and it achieves speed-ups of 50x and 6x over the original DPM and the cascade DPM, respectively, on a single-core CPU. The hardware circuit uses 65 nm CMOS technology and consumes only 36.5 mW (0.81 nJ/pixel) based on the post-layout simulation. The ASIC has an area of 3362 kgates and 295.5 KB of on-chip memory, and the design uses two simultaneous engines to process two independent object categories with 8 deformable parts per category.

Index Terms— Object Detection, DPM, FFT, HOG, Autonomous Driving

1. INTRODUCTION

The main challenges of autonomous cars relate to two dynamic problems. The first is a detailed investigation of the roads and terrains the car is passing through, to acquire knowledge about the dynamic environment (freeway, neighborhood). The second is accurate detection and tracking of distinct objects such as pedestrians, road-side light posts, traffic signals, etc. In addition, self-driving cars should be capable of drawing accurate inferences about their driving context (the perceptual capability to judge the immediate intention of other drivers, or of a pedestrian waiting at an intersection to cross). Autonomous vehicles pose grand challenges in understanding the physical world [1, 2] while achieving the transportation capabilities of a classic car. Current autonomous driving techniques depend mainly on computer vision, besides GPS, RADAR and LIDAR. Object detection in general, and accurate vehicle detection in particular, are essential problems for practical intelligent applications such as autonomous vehicles, smart traffic and driver assistance.

The deformable part model (DPM) [3] is a very popular and competitive object detector, as it is a mature and stable technique. DPM learns a multi-component mixture model based on Histograms of Oriented Gradients (HOG) [4] and has the merit of handling large appearance variations on challenging benchmarks. However, the original DPM takes more than 10 seconds (single thread) per image on Pascal VOC [5].

Multiple accelerated techniques have been proposed to speed up DPM while staying close to its high detection rates, such as CascadeDPM [6], CoarseDPM [7], FFT-DPM [8], Rapid-DPM [9], FastDPM [10] and R-DPM [11].

We propose a low power hardware implementation of a DPM object detection pipeline based on three complementary components to speed up DPM. We provide an extensive performance evaluation in terms of time and accuracy on detection benchmarks.

The remainder of the paper is organized as follows: Section 2 briefly surveys the related work with a focus on DPM. Section 3 presents the details of our proposed framework. Section 4 shows experimental results and performance evaluation on benchmark datasets. Finally, Section 5 concludes the paper.

2. RELATED WORK

DPMs [3] have the potential for much more competitive runtimes if their extensive computations and bottlenecks are squeezed, and several approaches have recently been introduced to speed up the DPM technique. Felzenszwalb et al. proposed CascadeDPM [6], which eliminates unpromising hypotheses efficiently and achieves about a 14x processing-time speed-up on the PASCAL 2007 dataset [5] with a reasonable precision loss compared to the original DPM [3]. Pedersoli et al. proposed a hierarchical part-based model with a coarse-to-fine procedure, CoarseDPM [7], to prune hypotheses at low cost. Kokkinos uses Dual-Tree Branch-and-Bound (DTBB) [9] to compute upper bounds on the cost function of the part-based DPM model, resulting in an efficient acceleration with almost identical detections.



Recently, Yan et al. proposed FastDPM [10] to improve time performance through alternative strategies to the star-cascade technique [6], based on lookup-table Histograms of Oriented Gradients (HOG) [4] and a neighborhood-aware cascade. Fast DP-DPM uses deep learning to improve the performance of DPM [12]. However, it is still too slow for real-time applications.

In terms of hardware architecture, Takagi et al. presented a simplified HOG strategy oriented toward a real-time multi-object detection VLSI processor [13]; a dual-core architecture was designed to process HDTV video at 30 Hz and was fabricated as a scalable, low power 65 nm chip [14]. Suleiman et al. presented an energy-efficient multi-scale HOG architecture [15] on a 45 nm chip that processes 1080HD video at 60 Hz while consuming 45.3 mW, and a programmable, energy-efficient DPM hardware architecture [16] on a 65 nm chip that processes HDTV video at 30 Hz while consuming 58.6 mW. Such architectures could also be utilized in fault prediction and self-healing hardware systems [17].

3. PROPOSED OBJECT DETECTION SYSTEM

Our work is complementary to recent approaches for faster DPMs, as we focus on a combination of techniques that improves the speed of the pyramid construction and classification steps: we couple the fast feature pyramid technique [18] and LUT-based HOG features [4, 10] with an optimized FFT classification scheme, and we then provide a SIMD-optimized, multi-threaded version for better speed-up while maintaining accuracy [19].

3.1. Proposed Hardware Architecture

The overall architecture of the proposed DPM hardware detector is composed of the following units: fast feature pyramid generation, a Fast Fourier Transform (FFT) unit, SVM classification, and an SVM early detection and rejection unit. These units provide on-the-fly data processing: all calculations are executed as soon as the needed data becomes available, which reduces the required on-chip memory. The proposed detector architecture is standalone hardware that accelerates the detection process; it reads an input image and automatically generates the detection locations. We provide an optimized implementation in order to avoid the need for external storage and to reduce the power consumption of the whole system.

3.2. Fast Feature Pyramid

We can represent any feature α as a weighted sum over a channel:

\alpha_\Omega(I) = \sum_{ijk} \omega_{ijk} C(i, j, k)    (1)

where I is any image, C = \Omega(I), and \Omega is a low-level shift-invariant function applied to the image to create the channel C. For simplicity, we can also rewrite α_Ω(I_s) as:

\alpha_\Omega(I_s) \equiv \frac{1}{h_s w_s} \sum_{ijk} \omega_{ijk} C_s(i, j, k)    (2)

where (h_s × w_s) is the dimensionality of the image I_s sampled at any scale s and k is the layer [18]. Now, if we decompose our initial image I into M smaller images such that I = [I^1, I^2, ..., I^M], we can rewrite Eq. 2 as:

\alpha_\Omega(I) \approx \frac{\sum_m \alpha_\Omega(I^m)}{M}    (3)

We can thus express α_Ω(I) ≈ E[α_Ω(I^m)] by considering [I^1, I^2, ..., I^M] as a small image ensemble, where E[·] denotes the expectation over that ensemble of the image I. Then, following Ruderman and Bialek's findings (detailed in [18]) on the functional representation of the statistics of any natural image I in terms of the sampling scale s at which the image ensemble was captured, we can substitute the power law:

\frac{\alpha_\Omega(I_{s_1})}{\alpha_\Omega(I_{s_2})} = (s_1/s_2)^{\lambda_\Omega} + \eta    (4)

where the variable η denotes a measure of the deviation from Eq. 4 for any given image. The above computation is significant because it depends on the ratio of the scales at which the image ensembles were captured (i.e., s_1/s_2) rather than on the individual scales themselves.
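To make the use of Eq. 4 concrete, the Python/NumPy sketch below is purely illustrative and is not part of the hardware pipeline: the gradient-energy channel and the value of lambda_omega are placeholder assumptions. It estimates a channel statistic at a nearby scale from a single computed scale via the power-law relation instead of recomputing the feature at every pyramid level.

```python
import numpy as np

def channel_energy(img):
    """Mean gradient magnitude of an image: a stand-in for the channel
    statistic alpha_Omega of Eq. 2 (placeholder channel choice)."""
    gy, gx = np.gradient(img.astype(np.float64))
    return float(np.mean(np.hypot(gx, gy)))

def extrapolate_channel(alpha_s1, s1, s2, lambda_omega=0.1):
    """Estimate alpha_Omega at scale s2 from its value at scale s1 using the
    power law of Eq. 4 (deviation term eta ignored). lambda_omega is a
    channel-specific constant fitted offline; 0.1 is only a placeholder."""
    return alpha_s1 * (s2 / s1) ** lambda_omega

# Usage sketch: measure once per octave, extrapolate the in-between scales.
frame = np.random.rand(240, 320)               # stand-in for a video frame
a_full = channel_energy(frame)                 # measured at scale s = 1
a_half_est = extrapolate_channel(a_full, s1=1.0, s2=0.5)
a_half_meas = channel_energy(frame[::2, ::2])  # crude 2x down-sample, for comparison
print(a_full, a_half_est, a_half_meas)
```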
3.3. Pyramid Generation

HOG-based detection is simple to understand: HOG is a "global" feature that describes an object rather than a collection of local features [4], so the entire object is represented as a single feature vector. To compute the HOG feature descriptors, an input image is divided into non-overlapping cells of 8 × 8 pixels, which are processed within detection windows. The cells are arranged into overlapping blocks. An orientation histogram is computed for each cell and then normalized with respect to its neighbors [4]. A scale generator is utilized to create the image pyramid with multiple scales, as shown in Fig. 1.

Fig. 1. Generating feature pyramid.
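As a rough software illustration of the cell-level processing described above (this is not the VLSI datapath of Fig. 3; the unsigned 9-bin layout and the L2 normalization constant follow common HOG practice and are assumptions here), the Python sketch below builds the orientation histogram of one 8 × 8 cell and a 36-D normalized block feature.

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """9-bin unsigned orientation histogram of one 8x8 pixel cell."""
    gy, gx = np.gradient(cell.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned gradients, 0..180
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())        # accumulate magnitudes per bin
    return hist

def block_feature(cells, eps=1e-6):
    """Concatenate the 4 cell histograms of a 2x2 block and L2-normalize,
    giving a 36-D block feature (as labeled in Fig. 3)."""
    v = np.concatenate([cell_histogram(c) for c in cells])
    return v / np.sqrt(np.sum(v ** 2) + eps)

# Usage sketch on one 16x16 block of a synthetic patch.
patch = np.random.rand(16, 16)
cells = [patch[r:r + 8, c:c + 8] for r in (0, 8) for c in (0, 8)]
print(block_feature(cells).shape)   # (36,)
```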
3.4. Multi-scale Generator Architecture

A block diagram of the scale generator module is shown in Fig. 2.

Fig. 2. Scale generator architecture (each octave provides the original image plus three bilinear-interpolated scaled images; the first octave is down-sampled by 2 in rows and columns, the second octave by 4).

Fig. 3. Block diagram of HOG feature extraction (cell histogram generation, cell feature accumulation, block accumulation, and histogram normalization into the 36-D HOG feature).

Fig. 4. 4-parallel radix-2^2 feedforward architecture for the computation of the 16-point FFT.

Fig. 5. 4-parallel radix-2^2 feedforward 64-point DIT FFT architecture.

In order to generate multiple scales, input pixels are streamed into the module from the frame buffer. We use low-pass filters to process the input pixels before down-sampling at the different levels, in order to prevent aliasing, and then create two separate octaves. All generated images are stored partially into line buffers; we generate mainly the exact scales that are used in further steps to extract HOG features for those scales. Our line buffers store 30 columns of pixels of each rescaled image, which is sufficient to generate the different scales. These line buffers are shared across the multiple scales created within the same octave [15].
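A minimal software analogue of this octave-based scale generation is sketched below in Python (whole-frame arrays instead of 30-column line buffers; the 3-tap anti-aliasing kernel and the intra-octave scale ratios are assumptions): each octave is produced by low-pass filtering and decimating by 2, and the intermediate scales by bilinear resampling, mirroring Fig. 2.

```python
import numpy as np

def lowpass_decimate(img):
    """Blur with a small separable kernel (anti-aliasing), then keep every
    other row and column: one octave step of the scale generator."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0   # placeholder 3-tap kernel
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)
    return blurred[::2, ::2]

def bilinear_resize(img, scale):
    """Bilinear resampling used for the intra-octave scales."""
    h, w = img.shape
    nh, nw = int(h * scale), int(w * scale)
    ys, xs = np.linspace(0, h - 1, nh), np.linspace(0, w - 1, nw)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def build_pyramid(frame, octaves=3, per_octave=(1.0, 0.84, 0.71, 0.59)):
    """Octaves by filter-and-decimate, intermediate scales by bilinear resize.
    The default ratios approximate 2**(-k/4) spacing (an assumption); three
    octaves of four scales give the 12 scales reported in Table 2."""
    pyramid, octave = [], frame.astype(np.float64)
    for _ in range(octaves):
        pyramid += [bilinear_resize(octave, s) for s in per_octave]
        octave = lowpass_decimate(octave)
    return pyramid

print([p.shape for p in build_pyramid(np.random.rand(64, 64), octaves=2)])
```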

3.5. Fast Fourier Transform (FFT)

The FFT is used to efficiently compute the discrete Fourier transform (DFT) and is one of the most important techniques in signal processing. The individual part detectors of part-based models are responsible for extracting low-level features from every scale of an image pyramid. The score of a filter evaluated on an image can be computed as:

s_{pq} = \sum_{f=0}^{F-1} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} a^f_{p+m,q+n} \, b^f_{mn}    (5)

Now, we can compute the responses of a filter over all possible locations as S = \sum_{f=0}^{F-1} a^f \star \bar{b}^f, where \bar{b} is the reversed filter. With the above image and filter sizes, the cost of standard convolution, of complexity O(LWMN), is replaced with O(LW \log LW) when applying the FFT.

We use the Fourier transform to speed up the different evaluations required by a linear classifier in a multi-scale detection framework. The standard process without the FFT is to convolve all HOG features with all HOG filters and then to sum all the resulting per-feature scores. Leveraging the FFT in our pipeline is inspired by [8]: convolutions are performed by computing the Fourier transforms of the feature pyramid, multiplying them in the Fourier domain, and then computing the inverse Fourier transform of the result. The convolution process is accelerated by at least an order of magnitude. If we need to convolve L HOG filters and sum them across K HOG features, the total cost per image is

C_{Fourier/img} = \underbrace{K C_{FFT}}_{\text{forward FFTs}} + \underbrace{L C_{FFT}}_{\text{inverse FFTs}} + \underbrace{K L C_{mul}}_{\text{multiplications}}    (6)

3.5.1. Radix-2^2 FFT Architecture

The DFT of an N-point input sequence is given by:

X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk}, \quad k = 0, 1, \ldots, N-1    (7)

where W_N^{nk} = e^{-j(2\pi/N)nk}. The FFT is based on the Cooley-Tukey algorithm, which is widely used for efficient DFT computation when N is a power of two [20]. The complexity of the Cooley-Tukey algorithm is O(N \log_2 N) [21]. Butterflies and rotations are computed within each stage of the flow graphs, where the lower edges of the butterflies are always multiplied by -1 for graph simplification.

The flow graph of a radix-2^2 DIF FFT can be obtained from the graph of a radix-2 DIF FFT: each stage is broken down so that, at the odd stages, the rotations are split into a trivial part and a non-trivial rotation \bar{\phi} = \phi \bmod N/4, and \bar{\phi} is moved to the next stage. Fig. 4 shows a 16-point, 4-parallel feedforward FFT architecture based on radix-2^2 butterflies.
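For reference, a compact recursive radix-2 decimation-in-time Cooley-Tukey FFT is sketched below in Python. It is only a software illustration of the O(N log2 N) recurrence behind Eq. 7; the chip instead uses the pipelined feedforward radix-2^2 structures of Figs. 4 and 5.

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 DIT Cooley-Tukey FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    N = x.size
    if N == 1:
        return x
    even = fft_radix2(x[0::2])        # N/2-point FFT of the even samples
    odd = fft_radix2(x[1::2])         # N/2-point FFT of the odd samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # W_N^k
    t = twiddle * odd
    # Butterfly: top = even + W*odd, bottom = even - W*odd (the -1 lower edge)
    return np.concatenate([even + t, even - t])

# Usage sketch: agrees with NumPy's FFT on a random 64-point sequence.
x = np.random.rand(64)
print(np.allclose(fft_radix2(x), np.fft.fft(x)))
```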

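The NumPy sketch below mirrors the cost structure of Eqs. 5 and 6 in software (whole-image transforms, circular-correlation padding details simplified, filter spectra computed inline rather than cached, toy sizes): each feature channel is transformed once, multiplied by the conjugate spectrum of the matching filter channel, summed across channels, and brought back with a single inverse transform.

```python
import numpy as np

def fft_filter_scores(features, filters):
    """Cross-correlate K feature channels with the matching K filter channels
    and sum the per-channel responses using 2-D FFTs (cf. Eqs. 5 and 6).

    features: list of K arrays, each H x W
    filters:  list of K arrays, each h x w (h <= H, w <= W)
    Returns an H x W score map (the valid region is its top-left part).
    """
    H, W = features[0].shape
    acc = np.zeros((H, W), dtype=complex)
    for feat, filt in zip(features, filters):       # K forward feature FFTs
        F = np.fft.fft2(feat)
        # conj(FFT(filter)) zero-padded to H x W gives cross-correlation
        G = np.conj(np.fft.fft2(filt, s=(H, W)))
        acc += F * G                                # the K*L C_mul term (L = 1 here)
    return np.real(np.fft.ifft2(acc))               # one inverse FFT per filter

# Usage sketch: K toy feature channels scored against one multi-channel filter.
K, H, W, h, w = 8, 48, 64, 6, 6
feats = [np.random.rand(H, W) for _ in range(K)]
filts = [np.random.rand(h, w) for _ in range(K)]
print(fft_filter_scores(feats, filts).shape)
```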
Fig. 6. SVM Classification (control unit, SVM weights SRAM, 36-D HOG feature SRAM, MAC array MAC0 to MAC31 with accumulation, and the position and scale of positively detected windows).

3.6. Early SVM Classification

After offline training of a linear SVM classifier for the required DPM model (root and part filters), the model coefficients are loaded into on-chip buffers. Our detector can be configured for different object categories. For example, in the case of the INRIA Person benchmark, the 4,608 SVM weights that represent the 128 × 64 pixel pedestrian root filter are quantized to 4-bit fixed-point, for a memory size equal to 0.018 Mbit [15]. The SVM classification is performed using a number of MACs, where each MAC has an adder and two multipliers that calculate the partial dot product between the extracted features of a window and the SVM coefficients, plus an extra adder to accumulate the partial dot products. The classification block diagram is shown in Fig. 6.

The technique for early classification is inspired by the simplified HOG algorithm proposed in [14]. In our work, we use a similar technique to detect and reject DPM part candidates early in each component (we use 8 parts per component): instead of waiting to calculate all part scores, we compare the partial part scores obtained so far with an early classification threshold, and the posterior computations can then be skipped.
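A software analogue of this early-rejection scheme is sketched below in Python (the threshold values, feature sizes, and fixed part ordering are assumptions; in hardware the partial dot products come from the MAC array of Fig. 6): part scores are accumulated one part at a time and a window is dropped as soon as the running score falls under the early-classification threshold for that stage.

```python
import numpy as np

def early_score(part_features, part_weights, early_thresholds, accept_threshold):
    """Accumulate per-part SVM partial scores and reject a window early.

    part_features: list of 8 feature vectors, one per deformable part
    part_weights:  list of 8 matching SVM weight vectors
    early_thresholds: running-score threshold checked after each part
    Returns (accepted, score_so_far).
    """
    score = 0.0
    for k, (f, w) in enumerate(zip(part_features, part_weights)):
        score += float(np.dot(f, w))        # partial dot product (one MAC pass)
        if score < early_thresholds[k]:     # too low to recover: skip remaining parts
            return False, score
    return score >= accept_threshold, score

# Usage sketch with 8 parts and toy 4-D features.
rng = np.random.default_rng(0)
feats = [rng.standard_normal(4) for _ in range(8)]
weights = [rng.standard_normal(4) for _ in range(8)]
early_t = np.linspace(-4.0, -0.5, 8)        # placeholder per-part thresholds
print(early_score(feats, weights, early_t, accept_threshold=0.0))
```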
Table 1. Comparison with ODROID-XU3 board processing full HD 1920 × 1080 frames.

Platform             Cortex-A7 [16]       Cortex-A15 [16]        DPM [16]           Our Work
                     1 core    4 cores    1 core     4 cores     0.77 V    1.11 V   0.77 V   1.11 V
Process technology   28 nm HKMG           28 nm HKMG             65 nm CMOS         65 nm CMOS
Throughput (fps)     0.04      0.10       0.11       0.24        30        60       42       74
Power (mW)           155.6     383.5      1,703.8    3,575.6     58.6      216.5    36.5     182.4
Energy (nJ/pixel)    1,881.8   1,849.0    7,301.2    7,165.7     0.94      1.74     0.81     1.48

4. EXPERIMENTAL RESULTS

Table 2 shows a comparison between our hardware architecture and previous hardware object detectors based on HOG. Both circuits in [14] and [16] can process full HD frames at 30 fps, with lower accuracy than the original DPM algorithm. Our proposed hardware circuit provides fast multi-scale feature map generation and runs the whole detection process on-chip while utilizing deformable parts in detection, which yields higher detection rates and more robust detection. Meanwhile, it consumes 34% less energy than the previous architecture [14].

Table 2. Performance comparison on PASCAL VOC 2007 dataset for multiple hardware implementations.

                     HOG [14]         DPM [16]         Our Work
Process              65 nm            65 nm            65 nm
Chip size            4.2 × 2.1 mm2    4.0 × 4.0 mm2    4.2 × 4.0 mm2
Input resolution     1920 × 1080      1920 × 1080      1920 × 1080
Multi-scale          one scale        12 scales        12 scales
Deformable parts     No               8                8
Object classes       2                2                2
Frame rate (fps)     30               30               42
Frequency            84.3 MHz         62.5 MHz         62.5 MHz
Power                84 mW            58.6 mW          36.5 mW
Energy/pixel         1.35 nJ          0.94 nJ          0.81 nJ
Mean AP              18.5%            26%              31.4%

A comparison between the proposed hardware circuit and a software implementation of the DPM algorithm running on an ODROID-XU3 board [16] is shown in Table 1. The required pre-processing steps and the post-processing step of non-maximum suppression for the DPM-based detection hardware architecture have a small power consumption and area overhead compared to the main compute-intensive steps.

The proposed ASIC architecture consumes about three orders of magnitude less energy than the Samsung embedded processor. These results report performance on full HD frames and show that a maximum throughput of 0.24 frames per second can be achieved on the ODROID-XU3 board with the Samsung processor when using the four cores of the Cortex-A15.

5. CONCLUSION

In this paper, a novel pipeline is proposed to accelerate deformable part models and achieve real-time object detection while keeping detection rates similar to those of DPM. We present a low power, real-time hardware implementation of a DPM-based object detector. Our circuit uses a 65 nm CMOS technology and processes 1920 × 1080 video at 36 fps while consuming only 36.5 mW, resulting in an energy efficiency of 0.81 nJ/pixel. Experimental results on different benchmarks show that our work is generic and effective in the context of vehicle detection and other object categories. This work brings DPM to real-time applications such as driver assistance and autonomous vehicles, and makes object detection as low power and energy efficient as video compression.

6. REFERENCES

[1] E. Guizzo, "How Google's self-driving car works," IEEE Spectrum, 2011.

[2] Hyunggi Cho, Young-Woo Seo, B. V. K. Vijaya Kumar, and R. R. Rajkumar, "A multi-sensor fusion system for moving object detection and tracking in urban driving environments," in ICRA, 2014, pp. 1836–1843.

[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," PAMI, vol. 32, no. 9, pp. 1627–1645, Sept. 2010.

[4] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005, vol. 1, pp. 886–893.

[5] M. Everingham, L. Van Gool, C. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) results," technical report, 2007.

[6] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, "Cascade object detection with deformable part models," in CVPR, 2010, pp. 2241–2248.

[7] M. Pedersoli, A. Vedaldi, and J. Gonzalez, "A coarse-to-fine approach for fast deformable object detection," in CVPR, 2011, pp. 1353–1360.

[8] Charles Dubout and François Fleuret, "Exact acceleration of linear object detectors," in ECCV, 2012, vol. 7574, pp. 301–311, Springer.

[9] Iasonas Kokkinos, "Bounding part scores for rapid detection with deformable part models," in ECCV, 2012, vol. 7585, pp. 41–50.

[10] Junjie Yan, Zhen Lei, Longyin Wen, and S. Z. Li, "The fastest deformable part model for object detection," in CVPR, 2014, pp. 2497–2504.

[11] A. Ali, O. G. Olaleye, and M. Bayoumi, "Fast region-based DPM object detection for autonomous vehicles," in MWSCAS, 2016.

[12] A. Ali, O. G. Olaleye, B. Dey, and M. Bayoumi, "Fast deep pyramid DPM object detection with region proposal networks," in ISSPIT, 2017.

[13] K. Takagi, K. Mizuno, S. Izumi, H. Kawaguchi, and M. Yoshimoto, "A sub-100-milliwatt dual-core HOG accelerator VLSI for real-time multiple object detection," in ICASSP, 2013, pp. 2533–2537.

[14] Kenta Takagi, Kotaro Tanaka, Shintaro Izumi, Hiroshi Kawaguchi, and Masahiko Yoshimoto, "A real-time scalable object detection system using low-power HOG accelerator VLSI," Journal of Signal Processing Systems (JSPS), vol. 76, no. 3, pp. 261–274, 2014.

[15] Amr Suleiman and Vivienne Sze, "An energy-efficient hardware implementation of HOG-based object detection at 1080HD 60 fps with multi-scale support," Journal of Signal Processing Systems (JSPS), vol. 84, no. 3, pp. 325–337, 2016.

[16] Amr Suleiman, Zhengdong Zhang, and Vivienne Sze, "A 58.6 mW 30 frames/s real-time programmable multi-object detection accelerator with deformable parts models on full HD 1920x1080 videos," IEEE Journal of Solid-State Circuits (JSSC), vol. 52, no. 3, pp. 844–855, 2017.

[17] K. Khalil, O. Eldash, and M. Bayoumi, "Self-healing router architecture for reliable network-on-chips," in ICECS, 2017.

[18] P. Dollar, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," PAMI, vol. 36, no. 8, pp. 1532–1545, 2014.

[19] A. Ali and M. A. Bayoumi, "Towards real-time DPM object detector for driver assistance," in ICIP, 2016, pp. 3842–3846.

[20] M. Garrido, J. Grajal, M. A. Sanchez, and O. Gustafsson, "Pipelined radix-2^k feedforward FFT architectures," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 1, pp. 23–32, 2013.

[21] James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

