

Systolic-based 2D convolver for CNN in FPGA

Conference Paper · October 2017


DOI: 10.1109/ICETA.2017.8102485


Authors: Jakub Hrabovsky, Pavel Segeč, Marek Moravcik, Jozef Papan (all University of Žilina)



Systolic-based 2D convolver for CNN in FPGA
J. Hrabovsky, P. Segec, M. Moravcik, J. Papan
Faculty of Management Science and Informatics, University of Zilina, Univerzitna 8215/1, 010 26 Zilina, Slovakia
{jakub.hrabovsky, pavel.segec, marek.moravcik, jozef.papan}@fri.uniza.sk

Abstract - Convolution is a primary mathematical operation used in many signal processing and analysis algorithms. The high dependence of complex systems on the correct operation of the convolver demands its continual improvement, mostly related to the decrease of resource consumption. The paper proposes a model of 2D convolution, which is massively used in image processing algorithms. The paper provides a detailed description of the model structure with focus on the implementation aspect. The model is applied in particular to the convolutional layer of the Convolutional Neural Network, currently the best-known image-based deep learning method. The key difference of the proposed model compared with other common implementations lies in the placement of line buffers. The correctness of the model design is validated through the simulation discussed at the end of the paper.

Keywords - 2D convolution, FPGA, CNN, systolic array.

I. INTRODUCTION

Convolution is a primary mathematical operation used in many signal processing and analysis algorithms, e.g. FIR and IIR filters. Therefore, there will always be a need to improve the convolution implementation in order to meet increasing requirements on the consumed resources of hardware platforms - memory, processing units, and time.

An example of massive convolution application is the Convolutional Neural Network (CNN), in particular its convolutional layers. CNN is a type of deep neural network applied to image processing and thus it processes a huge amount of data. Furthermore, the method requires the algorithm implementation to run in real time and often on embedded devices with very restricted resources. Therefore, an optimal algorithm implementation for platforms like FPGA is crucial for the overall performance of the system.

The paper presents a particular model of 2D convolution with focus on the effectiveness of computation. The second section introduces the mathematical expression of 2D convolution, a way of decomposing it into a group of general one-dimensional convolutions, and our selected approach. The third and fourth sections provide a detailed description of the proposed model structure, which is composed of two parts - data path and control path. The processing of both parts and the related timing control are described in the fifth section. The last two sections contain the discussion about the correctness of the proposed model, its contribution, limits and future improvements.

II. 2D CONVOLUTION AND ITS DECOMPOSITION

The key operation of the convolutional layer in the CNN model is the 2D convolution. This operation is a variant of convolution that increases the dimension of data from a vector to a matrix while it preserves the basic mathematical way of output computation - the weighted sum. 2D convolution is applied in areas that process multivariate data, such as image processing. Because we selected pipelining as the primary processing approach related to systolic arrays, we expect pixels to flow through the system one by one in a stream. The streaming approach allows us to avoid data preprocessing related to the rearrangement of data in memory, which would increase the latency. However, the vector-based form does not fit the data structure required for the 2D convolution. Thus, the system requires modifications in comparison with the traditional 1D convolution design.

There are various approaches presented in [1]–[5] that deal with 2D convolution from the point of view of implementation optimization regarding the consumed resources. The common idea of streaming data through a system of processing elements is the same for all models described in the mentioned papers, but they differ in the structure and interconnection of the elementary processing units.

According to (1), which defines 2D convolution, we can split the computation of the final output into several steps, where each step returns only the partial result corresponding to one line of the weight matrix (the kernel), as illustrated in Fig. 1 and expressed by the inner sum in (1). This view enables us to decompose the operation into several 1D vector-based convolutions and apply common practical solutions. Before we describe the structure and behavior of our proposed architecture, we state the assumptions about the format of inputs and outputs of the system. These assumptions specify the overall design of the 2D convolver.

\[ y(i, j) = \sum_{m=0}^{K-1} \sum_{l=0}^{K-1} x(i+m,\, j+l) \times w(m, l) \tag{1} \]

[Figure 1 (K=3): each kernel row W[r,1], W[r,2], W[r,3] is paired with the window row X[i+r-1,j], X[i+r-1,j+1], X[i+r-1,j+2] and yields one inner product DPr, for r = 1, 2, 3.]
Figure 1. Visual form of 2D convolution

A. Input conditions of the proposed model

We assume that the input feature map is a square image with the same width and height and that this shape is a constant known in advance. Therefore, we describe the input image size via a single parameter N. The same
assumption is made about the shape of kernels defined via the parameter K.

The other set of parameters defines the dimension of inputs, weights and outputs. Because we use fixed-point arithmetic, we set the integer and fraction width for all these operands after consideration of their expected ranges and the whole computational model. Thus, we perform correct arithmetic computations and subsequent numeric adjustments, such as rounding and truncation.

Changes of these parameters lead to the modification of the original model and its components. On the other hand, these changes are straightforward due to the chosen design of systolic arrays. Usually they require only the insertion of additional elements while preserving the model core. For that reason, all examples displayed in figures assume K=3 as the elementary case. This elementary case is simple and easy to understand while it preserves the main principle.

B. Architectural approach - Systolic array

Systolic arrays represent an architectural approach that enables massively parallel computing and internal pipelined processing. Both benefits lead to an increase in performance. This design form is applicable to problems that exhibit simplicity and regularity hidden in repeated computational patterns. The modular design consists of a set of basic cells - processing elements (PE) - arranged in a systematic configuration, such as a chain, matrix, or tree. PE units are interconnected through a simple regular network of links. The network controls the correct flow of data through the system of PE units and so it ensures the synchronization. Because the modular design provides scalability and flexibility, the model created for the solution of a particular problem is also usable for a set of bigger problems. They require only minimal modifications without the need to start from scratch. The systolic array pays off when applied to computation-bound problems because its structure maximizes the usage of inputs required for many computational operations.

The principle of the systolic array provides various models that differ in the arrangement and interconnection of PEs. The paper [6] explains the foundations of the systolic array with its benefits and presents several examples of models appropriate for a particular task related to the FIR filter - the convolution. This digital signal processing task is an example that deals with the combination of two input data flows - the inputs and coefficients. Considering the subsequent implementation, which depends on the structure and capabilities of the DSP blocks available in the FPGA, we chose the design W2 depicted in the mentioned paper. This model offers permanent utilization of all computation units, pipelining, and a continuous output flow without the requirement for an adder tree to merge the products.

However, not all systems can be effectively designed as a systolic array. A suitable system has to meet particular requirements - multiple usage of each input signal in various operations, concurrent computation that allows parallel and pipelined processing, a simple definition of the PE, and a regular shape of the interconnect network. The task of convolution meets all of these requirements and therefore is an exemplary candidate.

III. THE STRUCTURE OF DATA PATH

The data path of the proposed 2D convolver design consists of two main parts, depicted as an abstract model in Fig. 2 and as an implementation model in Fig. 3. The first part computes all required partial sums of products. It performs a classical digital signal processing (DSP) task based on the multiplication and subsequent addition of signals (Multiply-&-Accumulate - MAC). As an output we get the partial inner products (DPi) that have to be properly combined into the final result (P) afterwards. Because the inner products available at a particular moment belong to different outputs, we need to provide the synchronization accordingly.

The second part addresses the issue of timing. It combines inner products and at the same time maintains the synchronized state. In other words, it controls the speed of data that flow through the system. The primary focus is on timing because the related inner products have to be combined into the same output.

[Figure 2 (block diagram, K=3): the input X feeds a chain of nine z^-1 registers whose taps are multiplied by W[1,1] ... W[3,3]; the row sums and BIAS are merged by adders separated by two z^-(N-3) delay buffers into the output P.]
Figure 2. Abstract model of 2D convolver

A. Part 1 - Systolic Elements

The first computation part contains a chain of systolic elements that are connected in sequence. Each systolic element (SEi) is represented by a K-tap FIR filter that computes one line of the window - a 1D convolution. That fact allows us to apply all known and recommended approaches to realize the FIR filter in the FPGA. Considering the implementation of the design in an FPGA circuit, we refer to the traditional and time-proven methods in [7], [8] that describe various FPGA models of FIR filters using DSP blocks. The abstract structure is straightforward and represents a systolic array [6]. On the other hand, the implementation design exhibits some differences in the synchronization associated with the structure of the DSP in current FPGAs (it uses 2K-1 registers instead of the original K registers, as shown for K=3 in the upper part of Fig. 3 compared to Fig. 2).

Each DSP block executes the multiplication of the current input and the corresponding weight. The product is then pushed to the adder and added to the cumulative output. The registers are inserted between the operators to enable pipelining and so they increase the maximal operating frequency. The DSP blocks are introduced in the next section.

1) The Structure and Application of DSP in SE

The DSP block provides features that lead to the optimal implementation of FIR. The internal multiplier and post-adder support symmetric rounding and quantization of the results to address the bit growth caused by the arithmetic
operations and thus eliminate the overflow effect. The cascade internal signals enable fast interconnection of adjacent DSP blocks within the same column and do not consume any FPGA resources - adders, multipliers, storage, and delay elements [8]. Registers placed in front of each operator directly in the DSP enable multilevel pipelining whereas the configurable multiplexers contribute to the flexibility. Temporary storage for the inputs during their use in the multipliers is realized through a line of registers - a data buffer - of the desired length. Using these components, we can model a cascade of DSP blocks to compute products and gradually combine them through an adder chain into the final sum without external logic.

Considering the requirements of a very high sample rate and a small number of coefficients, we chose a specific type of parallel FIR filter implementation - the Systolic FIR - because of its advantages stated in [8]. The systolic FIR is generally considered the optimal model for parallel processing on FPGA. The additional latency compared with the utilization of an adder tree does not have any noticeable impact on the performance. The cascade model significantly improves power consumption and speed. Furthermore, it is limited only by the total number of DSP blocks in one column inside the FPGA [9]. This design reflects the regularity of the arrangement and the direct connections between DSP, BRAM and CLB blocks. The exemplary design of the SE block is shown in Fig. 4.

[Figure 4 (block diagram, K=3): an SE block holding the weights W[i,1], W[i,2], W[i,3]; the input X(n) leaves the block as X(n-5) and the block outputs the inner product DPi.]
Figure 4. Exemplar model of SE block

The documents [7]–[9] provide lists of suggestions that can considerably improve the overall performance of the final design when thoroughly met. They also address the implementation including the DSP instantiation and configuration in HDL.

B. Part 2 - Delay Buffers and Adders

The 2D convolution is executed on a matrix, which does not correspond to the shape of incoming pixels (lines). After the first design part computes the inner products, the second part needs to rearrange the results to allow their combination in the desired way according to the mathematical formula of 2D convolution (1). The delay buffers (in Fig. 3 the blocks between adders) are used to regulate the data flow so that the right sets of inner products are combined together and the synchronized state is kept. We see in the visual representation of 2D convolution applied to the input image that the gap between partial results corresponding to the same window is equal to the width of the image. That is, the distance between adjacent partial results of the same output equals N time units. After we subtract the delay caused by the SE chain, we get the remaining required delay of N-2K+1 time units that the delay buffers are responsible for (Fig. 3 - Part 2).

[Figure 3 (block diagram, K=3): PART 1 - the input X and BIAS enter the systolic elements SE1, SE2, SE3, which hold the kernel W(9) split into W1(3), W2(3), W3(3) and output DP1, DP2, DP3. PART 2 - a switching block followed by the adders ADD1, ADD2, ADD3, separated by two Z^-|N-6+1| delay buffers, produces the output P.]
Figure 3. Implementation model of 2D convolver

The sequential arrangement of delay buffers in combination with adders substitutes the adder tree - a common choice used in current methods. The sequential approach brings the advantage of simple implementation and effective time utilization. The additions are executed while we wait for the remaining inner products of the computed output, and so no additional delay is present compared with the adder-tree-based alternatives.

IV. STRUCTURE OF CONTROL PATH

The data path is controlled via control signals that determine the validity of output features considering the validity of inputs. The control path is realized as a Finite State Machine (FSM) tailored to our needs. The interface of the FSM consists of one input and one output signal, which represent the validity of the current input and output features.

The current design of the FSM exploits the known width and height of the output feature map and the constant length of the invalid intervals to periodically set up and check binary counters in advance. The automaton counts valid inputs and generated outputs and jumps between the states according to the counter values. The states represent distinct cases of the window placement in the input feature map.

We differentiate three cases of window transition through the input feature map:
• the transition inside of the feature map,
• the transition through the vertical borders - the transition to the next line within the same feature map,
• the transition through the horizontal borders - the transition to the next feature map.

The results obtained during the second and third enumerated periods (illustrated in Fig. 6-9) are invalid and have to be ignored in the further processing. Thus, only the results obtained by the window inside of the input feature map are valid. The reason for the invalidity is described in the next section.
[Figure 5 (state diagram); the content recoverable from the extraction is listed below.]
States: INIT (default state, entered on active reset, RST='1'), START_UP, INSIDE_IMAGE, VERTICAL_BORDER, HORIZONTAL_BORDER.
Default values of each state: ready <= '0'; valid_out <= '0'; all signals for the timers (PIXEL_COUNTER and LINE_COUNTER) are set to 0.
Transition conditions: combinations of pixel_counter_alert and line_counter_alert ('0' keeps the current state, '1' triggers a transition).
Transition actions: pixel_counter_threshold is loaded with STARTUP_DELAY, NO_VALID_PIXELS_PER_LINE, NO_INVALID_PIXELS_PER_LINE or NO_INVALID_PIXELS_PER_TRANSITION together with pixel_counter_set <= '1'; line_counter_threshold is loaded with NO_VALID_LINES_PER_IMAGE together with line_counter_set <= '1'; line_counter_ce <= '1' and line_counter_clear <= '1' advance and reset the line counter; valid_out <= '1' is asserted for INSIDE_IMAGE.

Figure 5. State diagram of FSM representing the control path
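The counter-driven behavior of the control path can be illustrated with a minimal behavioral sketch in Python. It is not the authors' VHDL. The threshold values are assumptions derived for an idealized convolver without pipeline latency (the paper does not list them), the INIT state is omitted, and every input pixel is assumed valid. The function name `valid_out_flags` is hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    START_UP = auto()
    INSIDE_IMAGE = auto()
    VERTICAL_BORDER = auto()
    HORIZONTAL_BORDER = auto()


def valid_out_flags(n, k, num_pixels):
    """Return the valid_out flag for each of num_pixels streamed input pixels."""
    assert 2 <= k <= n
    # Assumed threshold values for a zero-latency model (not given in the paper).
    valid_pixels_per_line = n - k + 1      # NO_VALID_PIXELS_PER_LINE
    invalid_pixels_per_line = k - 1        # NO_INVALID_PIXELS_PER_LINE
    valid_lines_per_image = n - k + 1      # NO_VALID_LINES_PER_IMAGE
    startup_delay = (k - 1) * n + (k - 1)  # STARTUP_DELAY
    transition_delay = startup_delay       # NO_INVALID_PIXELS_PER_TRANSITION

    state = State.START_UP
    threshold = startup_delay
    pixel_counter = 0
    line_counter = 0
    flags = []
    for _ in range(num_pixels):
        flags.append(state is State.INSIDE_IMAGE)
        pixel_counter += 1
        if pixel_counter < threshold:
            continue  # pixel_counter_alert = '0': stay in the current state
        pixel_counter = 0  # pixel_counter_alert = '1': reload and change state
        if state is not State.INSIDE_IMAGE:
            state, threshold = State.INSIDE_IMAGE, valid_pixels_per_line
            continue
        line_counter += 1
        if line_counter == valid_lines_per_image:  # line_counter_alert = '1'
            line_counter = 0
            state, threshold = State.HORIZONTAL_BORDER, transition_delay
        else:
            state, threshold = State.VERTICAL_BORDER, invalid_pixels_per_line
    return flags
```

Under these assumptions, `valid_out_flags(4, 3, 16)` marks exactly four stream positions of the first image as valid (10, 11, 14 and 15), which are the positions where a complete 3×3 window lies inside the 4×4 image. A real implementation would shift these positions by the pipeline latency of the data path.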

The FSM is displayed in Fig. 5. It consists of five states that, besides the mentioned window situations, also map the initialization and start-up periods. The controller implements two counters with dynamic upper limits - PIXEL_COUNTER and LINE_COUNTER in the VHDL code. Each counter is responsible for a particular transition between states. The current form of the FSM operates only when the input is valid; otherwise it waits in the last state without changing any attribute, such as the current value of the counters. This chosen behavior simplifies its design and in the same way influences the operation of the data path. In other words, the data path waits without action during invalid inputs.

V. DESCRIPTION OF THE PROPOSED SYSTEM AND ITS TIMING CONTROL

A. Data processing by the SE chain

To interpret data processing by the SE chain, we use a simple visual model. Computation of 2D convolution consists of scanning the input feature map with a moving window that is gradually used for the computation of individual output features. Considering this process, the SE cluster can be visually expressed as a pattern built of K blocks, shown in Fig. 6. Pattern blocks in the figure are divided into K lines and each contains exactly one highlighted line. The form of the pattern depends on the assignment of weights from the original kernel to the individual pattern blocks. Every pattern block represents one SE that computes its highlighted line. In other words, a particular pattern block always colors one specific line of all windows in the input map. Therefore, we require a window to be eventually covered by each block in order to compute all its lines - partial inner products (the inner sum in equation (1)). We can imagine the task of the pattern blocks as coloring the assigned line of the windows. The processing of a window is complete when all its lines are colored by the corresponding pattern blocks. Because the visual shape of the pattern specifies different positions of the blocks, the inner products returned from the SE blocks at a particular time point correspond to different windows in the original feature map. Thus, they cannot be directly combined into the final output, but they have to be reorganized first. The rearrangement of these products introduces the synchronization performed by appropriately placed delay buffers. The adders finish the computation and merge all related inner products into the common output.

Regarding the form of the pattern (Fig. 6), blocks visit a window sequentially, starting with the left block. The gap between two adjacent blocks is exactly N-K time units based on the line-by-line movement of the pattern. Therefore, inner products of two adjacent blocks taken at time points t and t+N-K belong to the same result. By applying this principle to the chain of K blocks, the inner products taken from the blocks at t, t+N-K, ..., t+(N-K)×(K-1) constitute one output feature. This time shift is the reason for the deployment of delay buffers, which adjust the time points of all inner products. Thus, the length of each delay buffer in the abstract model is equal to N-K, as shown in Fig. 2.
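The timing argument above can be checked with a small behavioral simulation of the abstract model in Python. This is a sketch under stated assumptions, not the authors' design: the assignment of kernel rows to SE blocks and the direction in which partial sums travel through the delay buffers are chosen here so that the N-K delays line up, the bias is omitted, and the function names are hypothetical. Each SE is modeled as a K-tap FIR on a shared chain of K×K input registers, and K-1 delay buffers of length N-K sit between the adders.

```python
from collections import deque


def direct_conv2d(image, kernel):
    """Reference result computed directly from equation (1)."""
    n, k = len(image), len(kernel)
    size = n - k + 1
    return [[sum(image[i + m][j + l] * kernel[m][l]
                 for m in range(k) for l in range(k))
             for j in range(size)] for i in range(size)]


def streaming_conv2d(image, kernel):
    """Abstract streaming model: K FIR blocks plus K-1 delay buffers of length N-K."""
    n, k = len(image), len(kernel)
    assert n >= k
    # Input register chain: regs[d] holds the pixel that arrived d cycles ago.
    regs = deque([0] * (k * k), maxlen=k * k)
    # One delay buffer of N-K cycles between each pair of adjacent adders.
    delays = [deque([0] * (n - k)) for _ in range(k - 1)]
    out = [[0] * (n - k + 1) for _ in range(n - k + 1)]
    stream = (pixel for row in image for pixel in row)
    for t, pixel in enumerate(stream):
        regs.appendleft(pixel)
        # SE block r taps registers r*K .. r*K+K-1 and holds kernel row K-1-r
        # (assumed assignment); coefficients are reversed within the row.
        dp = [sum(kernel[k - 1 - r][k - 1 - c] * regs[r * k + c]
                  for c in range(k)) for r in range(k)]
        # Adder chain: the partial sum starts at the deepest block and is
        # delayed by N-K cycles before each following adder.
        acc = dp[k - 1]
        for r in range(k - 2, -1, -1):
            delays[r].append(acc)
            acc = delays[r].popleft() + dp[r]
        i, j = divmod(t, n)
        if i >= k - 1 and j >= k - 1:  # window lies completely inside the image
            out[i - (k - 1)][j - (k - 1)] = acc
    return out
```

In this model the K registers of one SE plus the N-K cycles of one delay buffer add up to exactly one image line (N cycles), which is why inner products of adjacent kernel rows meet at the same adder. Outputs computed while the window crosses a border are simply skipped here, which corresponds to the invalid periods handled by the control path.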
[Figure 6: pattern of K SE blocks, each with one highlighted line.]
Figure 6. SE blocks pattern

The pattern sequentially moves through the input feature map as a consequence of the inputs continuously streaming into the system data buffer. The regular line-by-line movement of the pattern in the predefined direction (which depends on the direction of data streaming) influences the validity of outputs based on the various situations that are described in the section related to the design of the control path.

During the pattern transition through the vertical border, the SE chain processes input features that form an incoherent and thus invalid window (Fig. 7 - (b) and (c)). A similar situation occurs during the pattern transition through the horizontal border to the next image. In that case, the processed windows include data from both adjacent images and therefore are incorrect (Fig. 8 - (b) and (c)). As a consequence, all results obtained through the combination of invalid inner products are also considered invalid.

[Figure 7: four panels A) to D).]
Figure 7. Transition through vertical border

[Figure 8: four panels A) to D).]
Figure 8. Transition through horizontal border

Given the specific values of the parameters N and K, we can derive the valid and invalid areas of outputs. The numerical expressions are used in the control path to clearly identify the validity of outputs in advance. Fig. 9 depicts the areas of valid and invalid outputs including their size based on the N and K parameters.

[Figure 9: regions labeled INVALID - VERTICAL BORDER, VALID - INSIDE IMAGE and INVALID - HORIZONTAL BORDER, with dimensions K-1 and N-K+1.]
Figure 9. Valid and invalid regions

B. Data processing by the SE

The principle of the introduced SE block design as a systolic FIR, which performs all numerical computations, lies in the different speed of the top and bottom data flows caused by the additional registers in the top path (Fig. 4). Therefore, the top flow is slowed down to half the speed of the bottom flow, which represents the partial sum of the products. Because the current input and its product are actually the last element of the corresponding output (currently on the left side of the cascade), this partial sum has to reach the products of several previous inputs (their number equals the number of coefficients). The increased speed of the sum flow compared with the input flow always provides the correct results on the right side of the model but with a latency (the streaming of the final sum to the output takes K cycles measured from the arrival of its last input).

C. Relationship between abstract and implementation model

The model of the SE in FPGA implemented through the systolic array requires additional registers in the input data path that were not included in the abstract model. As Fig. 4 shows, the total delay (the number of registers) in the data path of the SE chain is equal to 2K-1 instead of K. Therefore, the overall length of the delay buffers in the second part has to be recalculated to meet the original timing requirements set in the abstract model. We subtract the additional delay of the data path, (2K-1)-K = K-1, from the original buffer length N-K, so each buffer delays inner products by N-2K+1 time units.

D. The issue and solution of negative buffer length

When the buffer delay length is non-negative, i.e. N>=2K-1, the original design exhibits the desired behavior. But in the case of a small input feature map, i.e. when the inequality N<2K-1 holds, the delay buffer should have a negative length considering the operation requirements. A buffer of negative length cannot be practically realized and so constitutes a complication. We
need to change the perspective to overcome the stated problem. If we add a delay buffer of positive length to a signal path, we actually slow down the signal that propagates inside the system. That means the other signals will be faster than the delayed one. On the contrary, a delay buffer of negative length should have the inverse effect, hence it should speed up the corresponding signal in relation to the other signals of the system. We can accomplish the same behavior as the buffer of negative length by slowing down all signals but the particular one. The synchronization state of the system stays untouched. The switching block inserted between the first and second part (Fig. 3) provides the desired operation.

The particular structure of the second part enables us to exploit its symmetry (all delay buffers have the same length and are connected in a row) to design the switching block. In the case of negative delay, we reverse the connections of the inner products coming out of the SE blocks to the individual adders, i.e. the last inner product will be connected to the first adder and the first inner product will be connected to the last adder. In practice, we achieve this via a set of multiplexers (shown in Fig. 10) that are controlled via the selector - a signal defined by the sign of the expression N-2K+1.

[Figure 10 (K=3): multiplexers controlled by SEL = Sign(N-6+1) route either DP3 or DP1 to ADD3 and either DP1 or DP3 to ADD1; DP2 is connected directly to ADD2.]
Figure 10. Switching block model

VI. DISCUSSION

A. The desired operation of the proposed model

The correctness of the abstract design is grounded in the mentioned documents [6], [7], [9]. The right behavior of the proposed model including the control path and data path was practically validated through a behavioral simulation. The simulator integrated into Vivado [10] was used as the simulation tool. We carried out a controlled simulation with stimulus and evaluated the results.

Fig. 11 shows the waveform diagram of the proposed model simulation. Because of its length, the waveform is split into two diagrams belonging to the same simulation run. The important signals of the data path and control path are displayed with their values at different time points. The first diagram displays the transition of the window through the vertical border with emphasis on the pixel_counter_alert signal. This signal represents the activity of the PIXEL_COUNTER, which counts the processed input pixels. Its active state (value 1) indicates an upcoming transition to another state of the control path. In a similar way, the second diagram displays the transition of the window through the horizontal border with emphasis on the line_counter_alert. This signal represents the activity of the LINE_COUNTER, which counts the processed lines of the input image. It indicates the transition to another state of the control path in the same way as the previous alert signal. The progress of the signal valid_out corresponds to the areas in Fig. 9.

The simulation run of the system captured in the diagram matches the desired behavior of both parts of the proposed model. Simulation outputs were also compared with the results of experiments performed on the GPU under the same conditions - the kernel and input images. The results are almost the same, with sufficient accuracy, and so they serve as a proof of the desired behavior of the proposed model.

B. Contribution of the proposed model

The proposed model introduces a novel structure considering the position of the delay buffers. The other common models insert the delay buffers before the computation blocks to arrange the input pixels into matrix format at the entrance to the multipliers. That approach requires an adder tree placed after the sets of multipliers to merge the inner products, which increases the latency. Our approach inserts the delay buffers after the computation blocks. The buffers are interconnected with the adders in a scheme that does not require the adder tree at all. By applying this construction, we save the time signals need to propagate through the adder tree and so we decrease the total latency of the outputs.

VII. CONCLUSION AND FUTURE WORK

The paper emphasized the important role of convolution in various areas and the need for its continual improvement from the implementation aspect. Therefore, we proposed a model with increased effectiveness compared to the other common models mentioned in the second section. The chosen approach - systolic array - suits the architectural structure of FPGA well. The paper presented the detailed structure of the proposed model together with a description of its processing. In the discussion about the functionality of the model we demonstrated its correctness in the computation of 2D convolution.

However, the proposed design exhibits drawbacks that limit the format of supported input data via the static N and K parameters in the whole system. The potential improvements of the model consist in providing additional flexibility through the dynamic setup of N and K at run-time. If we assume the variability of these parameters with the purpose of enabling the reuse of 2D convolver blocks for inputs and kernels of distinct size, we need to adjust the model architecture. All components that depend on these parameters must be modified to preserve their correct dynamic behavior. These components are the SE chain and the delay buffers.

In the case of the SE chain, we could use the neutral element of summation - zero - and replace all extra inner products considering the current value of K. This functionality could be implemented as a part of the switching block.

The dynamic length of the delay buffer could be realized by application of the BRAM, as described in [11]. In that implementation of a FIFO as a delay buffer, the variable
length represents an offset between address pointers to the current write and read memory positions. If we change the addresses and their offsets appropriately, the buffer length will adapt accordingly.

Figure 11. Waveform diagram of simulation

VIII. ACKNOWLEDGMENT

This paper is supported by the Faculty of Management Science and Informatics of the University of Zilina, funded by research grant number FVG/27/2017.

REFERENCES

[1] J.-J. Lee and G.-Y. Song, “Super-Systolic Array for 2D Convolution,” in TENCON 2006 - 2006 IEEE Region 10 Conference, 2006, pp. 1–4.
[2] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “CNP: An FPGA-based processor for Convolutional Networks,” FPL 09 19th Int. Conf. F. Program. Log. Appl., vol. 1, no. 1, pp. 32–37, 2009.
[3] W. Qadeer et al., “Convolution engine,” in Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA ’13, 2013, vol. 41, no. 3, pp. 24–35.
[4] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’15, 2015, pp. 161–170.
[5] J. Qiu et al., “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’16, 2016, pp. 26–35.
[6] H. T. Kung, “Why systolic architectures?,” Computer, vol. 15, no. 1, pp. 37–46, Jan. 1982.
[7] Xilinx, DSP: Designing for Optimal Results, 1.0. 2005.
[8] Xilinx, “XtremeDSP for Virtex-4 FPGAs User Guide.” 2008.
[9] Xilinx, “7 Series DSP48E1 Slice User Guide.” 2016.
[10] “Xilinx Vivado.” [Online]. Available: https://www.xilinx.com/products/design-tools/vivado.html.
[11] D. G. Bailey, Design for Embedded Image Processing on FPGAs, 1st ed. Wiley-IEEE Press, 2011.

