Abstract—Deep Convolutional Neural Network (CNN) based methods have shown outstanding performance in a wide range of applications. Nowadays neural networks are becoming deeper, leading to demands for substantial computation and memory resources. Customized hardware is one option that maintains high performance at lower energy consumption than general-purpose CPUs or GPUs. When designing hardware, we need to address the problem of massive data transmission while ensuring high throughput at the same time. In fact, substantial data transfer consumes more energy than computation, and the larger neural networks become, the harder this problem is to solve. In this paper, we focus on the convolution operation, which occupies nearly 90% of the computation and runtime in deep CNNs. We propose a data-centric computation mode for convolution, which efficiently reduces the total amount of data transfer during convolution processing and utilizes data locality to achieve high throughput. Different from previous methods, which adopt an efficient on-chip memory hierarchy or focus on the movement of partial results, our proposed method concentrates on the operands of convolution themselves, minimizing data transfer right from the start. It can therefore be combined with other approaches to achieve higher energy efficiency. Furthermore, we simulate and analyse the hardware overhead of our data-centric convolution, corroborating its potential for high throughput at low energy consumption.

I. INTRODUCTION

Deep Convolutional Neural Networks (CNNs) are showing great power in a broad range of applications, such as face detection, object localization, scene classification, tracking, automatic driving, speech recognition and so on [1]–[5]. In recent years, 'going deeper' has become a trend for CNNs to improve accuracy. Large numbers of layers, millions of filter weights, different filter sizes, and more complex network structures are adopted in today's deep CNNs. For example, the GoogLeNet [6] model, one of the champions of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, proposes the concept of Inception, which uses filters of different sizes in parallel.

While achieving these advances, CNN-based methods demand substantial computation and memory resources. Generally, large and expensive servers are required, which becomes a crucial problem in situations requiring high throughput at low power, such as embedded systems. Customized hardware addresses these problems. Many researchers [7]–[12] have focused on massive off-chip memory access when designing accelerators, trying to decrease energy consumption by reducing off-chip data transmission. One popular method is to design an efficient on-chip memory hierarchy [13]–[15]. For instance, DaDianNao [13] lays out tens of eDRAM banks on chip in order to store the parameters of the whole network; when dealing with deep networks that have substantial parameters, this requires a huge on-chip storage capacity. Another method is to compress data [8], [10], [11], decreasing the pressure on bandwidth; one drawback is the extra hardware overhead or time required for compression and decompression. Furthermore, the energy consumed by on-chip data transfer also plays an important role. A single off-chip access costs hundreds of nanojoules [16], while an on-chip access costs tens of nanojoules; however, the number of on-chip accesses is ten to a hundred times that of off-chip accesses.

To address these problems, we propose a data-centric computation mode that reduces the total amount of on-chip data transfer during the entire processing. We focus on the convolution operation, which plays an important role in contemporary deep neural networks. By exploiting the data reuse patterns of convolution, operands in the data-centric computation mode are moved less, while our method still guarantees high throughput. In terms of hardware implementation, it can be accomplished with simple computational primitives. We simulate and analyse the hardware details, and then compare against the traditional convolution mode used in many previous works. The experimental results demonstrate that our method works at low energy consumption with a small footprint. Moreover, our method can be combined with previous approaches to achieve higher performance. On these grounds, our data-centric computation method uses data sufficiently, decreases data accesses efficiently, and successfully reduces energy consumption.

The paper is organized as follows. Related works are introduced in Section 2. In Section 3, we provide the background of CNNs. In Section 4, we investigate the data-centric computation mode for the convolution operation. The evaluation is presented in Section 5. We finally conclude this paper in Section 6.
IV. DATA-CENTRIC CONVOLUTION

A. Analysis of data reuse

According to (1) and (2), we can easily figure out two different grains of data reuse:
• Inter-output level: For CONV layers and FC layers, multiplying the same input data with different weights creates different outputs. Therefore, the input data is reused among all outputs.
• Operator level: For CONV layers, a lot of data is reused within one convolution operation because of the sliding-window pattern. As (1) shows, one input pixel can be reused K² times (where K × K is the size of one convolution filter), excluding the inter-output data reuse. In shared-weight convolution, the filter is reused for the whole computation of one input channel. FC layers, being multi-layer perceptron operations, have no such reuse pattern.

Considering data reuse at the inter-output level, some previous works [7], [8] exchange the order of the different processing dimensions to exploit this characteristic. Moreover, tiling strategies also benefit performance in both CONV and FC layers because of resource limitations. These approaches have shown great power in practice. Our work mainly focuses on how to exploit the operator-level data reuse pattern efficiently, as the sketch below quantifies.
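To make the two grains concrete, the following short Python sketch (our illustration, not code from the paper) counts how often a single input pixel is read when a K × K window slides over a feature map with stride 1; the sizes H, W and C_out are hypothetical values chosen only for this example.

```python
import numpy as np

# Count how often each input pixel is read by a sliding K x K window.
# H, W, K and C_out are hypothetical sizes for illustration only.
H, W, K, C_out = 7, 7, 3, 4

touches = np.zeros((H, W), dtype=int)
for I in range(H - K + 1):               # output row index
    for J in range(W - K + 1):           # output column index
        touches[I:I + K, J:J + K] += 1   # every pixel under the window is read

# Operator-level reuse: an interior pixel is read K * K times per output map.
print(touches[3, 3])             # -> 9 for K = 3
# Inter-output reuse: those reads repeat for every output channel.
print(touches[3, 3] * C_out)     # -> 36 naive reads of a single pixel
```

An interior pixel is thus read K² = 9 times for a 3 × 3 filter, and those reads repeat for every output channel; this is exactly the reuse the data-centric mode exploits.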
Fig. 3. A sample of using a 3 × 3 filter to convolve an input feature map.

Fig. 3 displays a sample of using a 3 × 3 filter to convolve an input feature map. Each 3 × 3 box stands for one convolution, producing one value in the output. It is obvious that the datum X22 is used in 9 convolution operations. In previous methods, convolution is mostly treated as an atomic operator, calculated as (3):

$$out_{IJ} = \sum_{i=I}^{I+K-1} \sum_{j=J}^{J+K-1} X_{ij} \times K_{(i-I)(j-J)} \tag{3}$$

$$out_{IJ} = \sum_{i=I}^{I+K-1} row_i, \quad \text{where} \quad row_i = \sum_{j=J}^{J+K-1} X_{ij} \times K_{(i-I)(j-J)} \tag{4}$$
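The decomposition in (4) only regroups the terms of (3) into K row fragments. The minimal check below (ours, with an arbitrary array F standing in for the filter the paper writes as K) confirms that summing the fragments reproduces the direct sum.

```python
import numpy as np

# Verify that the row-fragment form (4) equals the direct sum (3).
rng = np.random.default_rng(0)
K = 3
X = rng.integers(0, 10, size=(7, 7))    # input block
F = rng.integers(0, 10, size=(K, K))    # K x K filter weights

I, J = 2, 1                             # an arbitrary output position

# Equation (3): one atomic K x K multiply-accumulate.
out_direct = np.sum(X[I:I + K, J:J + K] * F)

# Equation (4): K row fragments row_i, summed afterwards.
fragments = [np.dot(X[I + r, J:J + K], F[r]) for r in range(K)]
assert out_direct == sum(fragments)
```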
As the filter window slides, it is apparent that Xij is used in all K row fragments. In the previous convolution computation method, it is natural to compute all fragments of one output pixel, sum them together, and then turn to the next output pixel. However, as analysed in Fig. 3, this method leads to K² loads of every input pixel. The key point of data-centric convolution is to calculate all fragments involving Xij as soon as Xij is loaded. This means that in each cycle we calculate fragments belonging to different output pixels. Each fragment is then sent to a delay flip-flop, where it waits for the other fragments calculated in subsequent cycles. When the calculation of the next fragment belonging to the same output pixel is completed, we add the result directly to the fragment saved in the delay flip-flop. After an initial time of K cycles, all fragments of one output pixel are completed and already summed. The entire process is executed in a pipeline, and one output pixel is streamed out every cycle.
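A behavioural sketch of this schedule, under our own simplifying assumptions (a single column group, weights preloaded into the PE rows, and a plain dictionary standing in for the delay flip-flops), is shown below; it loads each pixel row exactly once and streams one completed output per cycle after the K-cycle initial time.

```python
import numpy as np

# Behavioural sketch of the data-centric schedule for one column group:
# cycle t loads the K pixels X[t, J:J+K] exactly once, PE row r turns them
# into the fragment row_r of output O[t-r], and partial sums wait in
# accumulators that model the delay flip-flops.
K = 3
X = np.arange(49).reshape(7, 7)        # 7 x 7 input block
F = np.ones((K, K), dtype=int)         # filter weights, preloaded per PE row
J = 0                                  # leftmost column group

acc = {}                               # output row -> partial sum (delay FFs)
streamed = []
for t in range(X.shape[0]):            # one loaded pixel row per cycle
    pixels = X[t, J:J + K]             # the only data transfer this cycle
    for r in range(K):                 # PE row r works on output row t - r
        out_row = t - r
        if 0 <= out_row <= X.shape[0] - K:
            acc[out_row] = acc.get(out_row, 0) + np.dot(pixels, F[r])
    if t - (K - 1) in acc:             # after K cycles the output is complete
        streamed.append(acc.pop(t - (K - 1)))

# Matches the direct convolution of column group J:
expected = [np.sum(X[i:i + K, J:J + K] * F) for i in range(X.shape[0] - K + 1)]
assert streamed == expected
```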
For instance, we use a 3 × 3 filter to convolve a 7 × 7 input block. Fig. 4 demonstrates the dataflow of data-centric convolution. Due to the symmetry of rows and columns, we take column-major computation as an example. To simplify the situation, we roughly set the total time for finishing one multiplication and one 4-input addition as one unit of time, namely one cycle, regardless of the concrete implementation.

Fig. 4. A case of data-centric convolution. The input block size is 7 × 7; the filter size is 3 × 3.
Initially, the filter weights are stored in their respective PEs. In cycle #0, shown in Fig. 4, we load the first three pixels of the input and send them to every PE row: the PEs in the first column all receive X00, PEx1 receives X01, and PEx2 receives X02. Evidently, the sum of row 0 is the first fragment of output O00, expressed as O00−1, whereas the sums of rows 1 and 2 are useless because of the initialization of the pipeline. In cycle #2, input data X20 to X22 are loaded. Row 0 calculates the fragment O20−1, row 1 gets the result of fragment O10−2, and the sum of row 2, O00−3, is the last fragment of output O00, so the computation of O00 is completed. The rest proceeds in the same way.

C. Reference Mechanism

When we meet the bottom of the input block, the pipeline of data-centric convolution is interrupted: in our previous case there is no output in cycles #7 and #8. Input data are sent to the PEs row by row, in a width equal to the filter's, until the last row of the current column group is reached, and then loading turns back to the top of the next column group. In cycle #5 in Fig. 4, the result of row 0 is useless because of the boundary, and the next cycle is in the same circumstance. This leads to two bubbles in the pipeline, emerging in cycles #7 and #8. Therefore, we introduce a reference mechanism to address this problem.

We send two different groups of input pixels, Ref1 and Ref2, and the PEs choose one group between them as operands. In the ordinary situation Ref1 is valid; in other words, the group of data in Ref1 is sent to each PE row, and the dataflow is exactly the same as discussed before. When the convolution window slides near the bottom, the PE rows need to choose between these two reference groups; Fig. 5 elaborates the details. Continuing our previous case, while calculating O40, the last output pixel of the first column, we skip the useless computation by sending the operands of the next output pixel, O01, as Ref1, i.e. data X01 to X03. The original operands, X50 to X52, are treated as Ref2. As shown in Fig. 5, row 0 chooses Ref1 while rows 1 and 2 choose Ref2. Instead of convolving X50 to X52 and producing a useless result, row 0 calculates the first fragment of O01 in cycle #5. In cycle #6, row 1 skips the useless computation as well, processing the data of the next column. Adopting this mechanism, the pipeline keeps running. The reference mechanism eliminates the bubbles successfully, improving the performance of the pipeline at low overhead.
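A simplified model of this selection logic is sketched below. It is our reading of Fig. 5 rather than the paper's implementation; `select` is a hypothetical helper that decides, per PE row and cycle, which reference group to consume.

```python
# Simplified illustration of the reference mechanism: every cycle the
# sequencer broadcasts two candidate pixel groups, and each PE row
# independently picks the one it can use.
K, H = 3, 7                      # filter size, input-block height

def select(row, t):
    """Return which operand group PE row `row` consumes in cycle `t`.

    Ref1 carries the first rows of the *next* column group once the window
    nears the bottom; Ref2 keeps feeding the bottom rows of the current one.
    """
    out_row = t - row            # output row this PE row is working on
    if out_row <= H - K:         # still inside the current column group
        return "Ref2" if t > H - K else "Ref1"
    return "Ref1"                # past the boundary: start next column early

# Around the bottom of the 7 x 7 example (cycles #5 and #6):
for t in (5, 6):
    print(t, [select(r, t) for r in range(K)])
# cycle 5 -> row 0 takes Ref1 (first fragment of the next column),
#            rows 1 and 2 take Ref2 (finishing O30 and O40)
# cycle 6 -> rows 0 and 1 take Ref1, row 2 takes Ref2
```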
D. Performance evaluation

Fig. 6 demonstrates the dataflow of traditional convolution computation. It is apparent that the throughput of the two methods is similar, i.e. one output pixel is produced per cycle, and the initial time of data-centric convolution can be ignored when the input is large. Meanwhile, the advantages of the data-centric mode are obvious.

First, less data is transferred per cycle. During computation, we need to transfer data from the on-chip buffer to the PEs constantly to ensure high throughput. It is easy to figure out that for a K × K filter, data-centric convolution requires only K data transfers per cycle.
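As a back-of-envelope check, the snippet below counts buffer-to-PE pixel transfers for the running 7 × 7 / 3 × 3 example in both modes; it is our own arithmetic, assuming the traditional engine fetches a full K × K window per output pixel, so the data-centric schedule moves K times fewer pixels at the same one-output-per-cycle throughput.

```python
# Back-of-envelope count of buffer-to-PE traffic for one input channel,
# 7 x 7 block, 3 x 3 filter, one output pixel per cycle in both modes.
# Assumes the traditional engine fetches a full window per output pixel.
H = W = 7
K = 3
out_rows = out_cols = H - K + 1          # 5 x 5 output pixels

# Traditional mode: each output pixel needs its whole K x K window per cycle.
traditional = out_rows * out_cols * K * K            # 25 * 9 = 225 pixels

# Data-centric mode: each cycle loads only K fresh pixels of one row.
data_centric = out_rows * out_cols * K               # 25 * 3 = 75 pixels

print(traditional, data_centric, traditional // data_centric)   # 225 75 3
```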
Fig. 5. The reference mechanism in the 7 × 7 example: two operand groups, Ref1 and Ref2, are broadcast, and each PE row selects the group it can use near the bottom boundary.

Fig. 6. The dataflow of traditional convolution computation.
TABLE I
OVERHEAD OF ADDERS @700MHz

Area (μm²): 2002.8 | 4146.8 | 1548.8 | 2960.4

TABLE II
OVERHEAD OF PE ARRAYS @700MHz
[4] A.-M. Zou, K. D. Kumar, Z.-G. Hou, and X. Liu, “Finite-time attitude tracking control for spacecraft using terminal sliding mode and Chebyshev neural network,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 4, pp. 950–963, 2011.
[5] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[7] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM SIGPLAN Notices, vol. 49, no. 4. ACM, 2014, pp. 269–284.
[8] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 26–35.
[9] J. Sim, J.-S. Park, M. Kim, D. Bae, Y. Choi, and L.-S. Kim, “14.6 A 1.42TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems,” in 2016 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2016, pp. 264–265.
[10] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in 2016 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2016, pp. 262–263.
[11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” arXiv preprint arXiv:1602.01528, 2016.
[12] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[13] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., “DaDianNao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.
[14] P. Chi, S. Li, Z. Qi, P. Gu, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,” in Proceedings of ISCA, vol. 43, 2016.
[15] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proc. ISCA, 2016.
[16] D. Pandiyan, “Data movement energy characterization of emerging smartphone workloads for mobile platforms,” Ph.D. dissertation, Arizona State University, 2014.
[17] E. Säckinger, B. E. Boser, J. M. Bromley, Y. LeCun, and L. D. Jackel, “Application of the ANNA neural network chip to high-speed character recognition,” IEEE Transactions on Neural Networks, vol. 3, no. 3, pp. 498–505, 1992.
[18] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, “NeuFlow: A runtime reconfigurable dataflow processor for vision,” in CVPR 2011 Workshops. IEEE, 2011, pp. 109–116.
[19] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting vision processing closer to the sensor,” in ACM SIGARCH Computer Architecture News, vol. 43, no. 3. ACM, 2015, pp. 92–104.
[20] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynamically configurable coprocessor for convolutional neural networks,” in ACM SIGARCH Computer Architecture News, vol. 38, no. 3. ACM, 2010, pp. 247–257.
[21] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 2016, pp. 380–392.
[22] Q. V. Le, “Building high-level features using large scale unsupervised learning,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8595–8598.
[23] B. Catanzaro, “Deep learning with COTS HPC systems,” 2013.
[24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.