

VHDL generator for a high performance convolutional neural network FPGA-based accelerator

Conference Paper · December 2017


DOI: 10.1109/RECONFIG.2017.8279827



VHDL Generator for A High Performance
Convolutional Neural Network FPGA-Based
Accelerator
Muhammad K. Hamdan and Diane T. Rover
Electrical and Computer Engineering Department
Iowa State University of Science and Technology
Ames, IA United States
{mhamdan, drover}@iastate.edu

Abstract: The Convolutional Neural Network (CNN) has proven to be a highly accurate and effective algorithm and has been used in a variety of applications such as handwritten digit recognition, visual recognition, and image classification. State-of-the-art CNNs are computationally intensive; however, their parallel and modular nature makes platforms like FPGAs well suited for accelerating them. Developing a typical CNN on an FPGA takes a very long time, hence in this paper we propose a tool that allows developers, through a configurable user interface, to automatically generate VHDL code for their desired CNN model. The generated architecture is modular, massively parallel, reconfigurable, scalable, fully pipelined, and adaptive to different CNN models. We demonstrate the automatic VHDL generator and its adaptability by implementing a small-scale CNN model, LeNet, and a large-scale one, AlexNet. The parameters of small-scale models are automatically hard-coded as constants (part of the programmable logic) to overcome the memory bottleneck. On a Xilinx Virtex-7 running at 200 MHz, the system is capable of processing up to 125K images/s of size 28×28 for LeNet and achieves a peak performance of 611.52 GOP/s and 414 FPS for AlexNet.

Keywords: VHDL generator; CNNs; AlexNet; parallelism; reconfigurable; adaptability; pipeline; scalable; FPGA.

I. INTRODUCTION
In the past years, machine learning has advanced like never before, and many algorithms have been proposed to solve problems like visual recognition and image classification. The Convolutional Neural Network, a popular type of neural network inspired by the visual cortex of the brain and by a mathematical operation called convolution, has gained popularity in applications such as image classification [1], data analysis, visual object recognition, and self-driving cars [2]. The interest in CNNs is driven by the high performance and accuracy they have shown. For example, the AlexNet model won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, achieving a top-5 accuracy of 84.7%. The popularity of CNNs would not have been possible without continually developed models such as LeNet [3], AlexNet [4], VGG, GoogleNet, and ResNet, as well as the availability of powerful computing platforms.

Large CNNs are computationally expensive, requiring over a billion operations per image, which makes general-purpose processors inefficient for implementing CNN models. Platforms like GPUs, ASICs, and FPGAs have therefore attracted a lot of attention because of their high performance. FPGAs in particular seem to fit the job well because they are reconfigurable, take advantage of the inherent parallelism in CNNs, and are power efficient. Indeed, many CNN accelerators have been proposed for different purposes and with different techniques and methodologies [5][6][7][8][9]. CNNs are known for their frequent data access, computational complexity, and very long development time, hence an efficient implementation is required. In this paper, we propose a GUI-based tool to significantly speed up CNN development; we also highly optimize the computation component and efficiently manage memory accesses.

The key contributions of this work are as follows:

A. A VHDL generator with the following features:
• Easy configuration, support for externally pre-configured models, and support for model checking and validation
• Flexibility, scalability, and adaptability with small and large-scale CNN models
• A test-bench, for testing and simulation purposes
• Compared to the HLS-based work in [10], our generated optimized implementation achieved a speedup of 6.1x
• With standard HLS tools such as Vivado HLS, users have to go through a lengthy development process by programming in a high-level language. By contrast, in our tool users only have to configure the model of their choice without doing any programming.

B. A scalable, reconfigurable, fully pipelined, and massively parallel accelerator

C. The VHDL generator was tested on two benchmarked models (LeNet and AlexNet) and other hand-tuned models. The system can process up to 125K images/s for LeNet and achieved a peak performance of 611.52 GOP/s for AlexNet

D. An executable of the VHDL generator will be available at: https://github.com/mhamdan91/cnn_vhdl_generator

978-1-5386-3797-5/17/$31.00 ©2017 IEEE

The rest of this paper is organized as follows. Section II briefly reviews convolutional neural networks. Section III describes the VHDL generator and its architecture. Section IV presents related work. Section V describes the hardware architecture. Section VI describes our implementation details. Section VII presents the conclusion and future work.

Fig. 1 A visualization of a CNN layer that arranges its neurons in three dimensions (width, height, depth). The 3D input volume is transformed into a 3D output volume of neuron activations in every layer. Redrawn [19]

Fig. 2 AlexNet architecture: ImageNet 2012 winning CNN model. Redrawn [17]

II. BACKGROUND
A Convolutional Neural Network consists of several kinds of layers: convolutional and fully-connected layers, where most of the operations are performed; pooling layers, which are used to avoid overfitting; and a classification layer, which classifies the final results into classes. A typical layer consists of 3D volumes of neurons with a width, a height, and a depth, as shown in Figure 1. Here the word depth refers to the number of feature maps (also called activation maps), not the number of layers in the CNN.

CNNs typically start with a convolutional layer, which takes the input image and decomposes it into different feature maps such as edges, lines, and curves. Multiple processes are applied to the extracted feature maps throughout the entire network. Feature maps extracted from the last layer (typically a fully connected layer) are classified into output classes using a classifier such as SoftMax. For example, the architecture of AlexNet [4], shown in Figure 2, classifies 224×224 colored images into 1000 different output classes.

A. Convolutional Layer
The convolutional layer essentially performs a mathematical operation called convolution that involves 3-dimensional multiply-accumulate (MACC) operations. As shown in Figure 3, a kernel of weights is multiplied by a set of inputs (the receptive region), and the weighted inputs are summed together. A bias, whose value is usually 1, is added to the summed weighted inputs to ensure that neurons fire. An activation function is applied to the accumulated sum to limit the output to a reasonable range. Results from the activation function are passed to the corresponding neurons in the next layer. The output size of a feature map is computed as shown in Equation 1.

$$\text{Output}_{size} = \frac{\text{Input}_{width} - \text{Filter}_{size} + 2 \times \text{Padding}}{\text{Stride}} + 1 \tag{1}$$

Fig. 3 Right: A mathematical representation of the convolution operation followed by a nonlinearity function. Left: An input of size 7×7×1 (including a padding of 1), with a stride of 2 and a receptive field of 3×3, is convolved with a filter (in red) of size 3×3×1, and the summed weighted inputs plus the bias are stored in the 3×3×1 output neurons (in green). Redrawn [19]

B. Activation Function
The activation function is used to ensure nonlinearity in the network as well as to discard unnecessary information. Among the various activation functions, Sigmoid, Tanh, and ReLU are the most commonly used. The Sigmoid and Tanh activation functions require longer training time in CNNs [4], unlike ReLU, which converges faster during training. ReLU is defined as a zero-thresholding operation: ReLU(x) = max(0, x).

C. Pooling layer
Spatial pooling is a form of nonlinear subsampling that is used to reduce the feature dimensions as we go deeper into the network. Max and average pooling are the most common pooling methods. In max pooling, as adopted in AlexNet, a set of neurons is subsampled based on the size of a pooling filter: the maximum neuron value in that filter is passed to the corresponding neuron in the next layer and the rest of the neurons are dropped, as shown in Equation 2 (filter size 2×2). In average pooling, the value forwarded to the corresponding neuron in the next layer is the average of all neurons in a filter, as shown in Equation 3.

$$\text{Passed}_{neuron} \rightarrow \max(2x, x, 0.5x, 3x) = 3x \tag{2}$$

$$\text{Passed}_{neuron} \rightarrow \operatorname{avg}(x, 2x, 3x, 4x, 5x) = 3x \tag{3}$$

D. Fully-Connected layer
The fully connected (FC) layer usually comes before the classification layer, and it comprises the highest number of parameters because every neuron in this layer is connected to all neurons in the previous layer, with parameters carried on the connections between those neurons. Inputs in this layer are multiplied by the corresponding weights, the biases are added, and then the nonlinearity is applied. The weighted sum is shown in Equation 4.

$$OUT_{neuron}^{i} = \sum_{j=1}^{K_{input}} INPUT^{j} \times weight^{ij} + Bias^{i} \tag{4}$$

The output of the nonlinearity in the last FC layer is passed to a classifier, such as SoftMax, that converts the output neurons to probabilities in the range (0, 1) for the classification layer. The classification layer (the final layer) compares the labels of the top probabilities from the SoftMax classifier with the actual labels of the available classes, which gives the accuracy of the model.
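The formulas of this section can be checked with a few lines of Python. The sketch below is illustrative only and is not part of the authors' tool: the function names are hypothetical, the Fig. 3 example assumes that the quoted 7×7 size already includes the padding of 1 (so the unpadded input is 5×5), and rejecting filters that do not tile the padded input evenly is a design choice made here, not something the paper specifies.

```python
def conv_output_size(input_width, filter_size, padding, stride):
    """Equation 1: spatial size of an output feature map."""
    span = input_width - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        # Assumption of this sketch: only evenly tiling configurations are valid.
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1


def relu(x):
    """Zero-thresholding activation: max(0, x)."""
    return max(0, x)


def max_pool(window):
    """Equation 2: forward the largest value in the pooling window."""
    return max(window)


def avg_pool(window):
    """Equation 3: forward the mean of the values in the pooling window."""
    return sum(window) / len(window)


# Fig. 3 setting: 5x5 input padded by 1 (7x7 with padding), 3x3 filter, stride 2.
fig3_size = conv_output_size(input_width=5, filter_size=3, padding=1, stride=2)

x = 2.0  # arbitrary positive value standing in for the symbolic x
pooled_max = max_pool([2 * x, x, 0.5 * x, 3 * x])       # equals 3x
pooled_avg = avg_pool([x, 2 * x, 3 * x, 4 * x, 5 * x])  # equals 3x
```

With these inputs `fig3_size` is 3, matching the 3×3 output of Fig. 3, and both pooling results equal 3x for the chosen x.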

III. VHDL GENERATION TOOL ARCHITECTURE
The tool produces an optimized, parameterized implementation of a desired CNN model through a series of processes. We developed a VHDL-based library to build the architecture of the specified model through a GUI. Figure 4 shows the top-level tool flow for generating VHDL code.

Fig. 4 VHDL generation tool flow (flowchart blocks: Start; import configuration from a text file or manual configuration via GUI; model verification and configuration validation; on fail, error message and fix incorrect configuration; on pass, automatic configuration storage; optionally save configuration to a file; parameters inclusion, importing parameters from a text file; process parameters; check whether the model meets small-scale constraints; match model configuration, with an error message on mismatch; optionally generate test-bench; generate VHDL code; store on disk; End)

The main building blocks of the tool are model configuration and validation, and parameters inclusion. Those blocks are described in detail as follows.

A. Model Configuration and Validation
In this block, developers can load a pre-configured model from a text file that abides by a particular configuration syntax, or they can choose to configure their model manually using the GUI. Once configuration is complete, the user is prompted to validate the configuration to ensure it meets a standard CNN configuration. On an unsuccessful validation check (incorrect configuration), a message is displayed informing the user of the changes needed to fix the errors. On a successful validation check, the user can proceed to the next stage, which is parameters inclusion. The current version of the tool supports the model configurations listed in Table I. The syntax of the configuration file is shown in Table II, and the parameters configuration is shown in Table III.

Table I Tool supported configurations
| Setting | Supported values |
| Image Size | User-defined |
| Output Classifier | SoftMax |
| Filter Size | User-defined |
| Feature maps | User-defined |
| No. of Classes | User-defined |
| Activation Functions | ReLU, Sigmoid, Tanh, Average and Max Pool |
| Layer type | Convolution, Pooling, FC, LRN |

Table II Example configuration syntax for a Conv→Pool→FC network
Model Specifications
Row_count,3
Image_Size,28
Image_type,Colored,24
No_Classes,10
Classifier,SoftMax
Convolution,2,2,0,2,ReLU,
Pooling,2,2,0,2,Max-Pool,
Fully-Connected,4,1,0,1,ReLU,

Row_count represents the number of layers; Image_Size is the input image dimension; Image_type specifies whether the image is colored or grayscale, and 24 represents the input data width, where 24 is for colored and 8 is for grayscale; No_Classes represents the number of output classes and Classifier is the classifier function. The fields of Convolution,2,2,0,2,ReLU represent, respectively, the layer name, number of output feature maps, filter size, padding, stride size, and activation function. The same syntax applies to pooling and fully connected layers.

Table III Parameters (weights and biases) for the configuration in Table II
Line format: Filter_fmap1_kernel1,weight,,,,bias,$

Convolution,1
Filter_1_1,0001,0010,0011,0010,1,$
Filter_1_2,0001,0010,0011,0010,1,$
Filter_1_3,0001,0010,0011,0010,1,$
Filter_2_1,0001,0010,0011,0010,0,$
Filter_2_2,0001,0010,0011,0010,0,$
Filter_2_3,0001,0010,0011,0010,0,$
Note: We have 2 feature maps in our example, and since the image is colored, we have 3 different kernels for each output feature map. The bias value is the same within a given feature map.

Pooling,1
Note: No parameters.

Fully-Connected,1
Filter_1_1,0101,1,$
Filter_1_2,0101,1,$
Filter_2_1,0111,0,$
Filter_2_2,0101,0,$
Filter_3_1,0101,1,$
Filter_3_2,0101,1,$
Filter_4_1,0111,0,$
Filter_4_2,0101,0,$
Note: Weights = 2x4x1, Biases = 4x2. Biases are optional, depending on whether the trained model uses them.

B. Parameters Inclusion and VHDL files Generation
Parameters are handled according to the specified CNN model. For small-scale models such as LeNet, which has about 43.6K parameters, parameters are consolidated within the generated VHDL code as part of the programmable logic (PL); otherwise, parameters are stored in an external memory.

Parameters must be formatted according to the model configuration for VHDL generation to succeed. The user should specify the layer name, list all kernels used in each feature map along with their weights, specify the bias values, and end each line with a dollar sign, as shown in Table III. The tool supports binary, decimal, and hexadecimal representations of parameters. The sizes of weights and biases are specified in the GUI, so for our example the tool expects a weight size of 4 bits and a bias size of 1 bit. If the parameters file does not correspond to the configuration, an error message highlighting the error is displayed to the user. Figure 5 illustrates the options given to incorporate parameters.

Fig. 5 Parameters inclusion and storage type selection

IV. RELATED WORK
The main drawback of accelerating CNNs on FPGAs is the long development time. A few implementations have tackled this issue. For example, in [11] the authors proposed an FPGA framework, based on the Caffe framework, to map CNN layers to an FPGA platform. The framework uses Xilinx SDAccel to map CNN layers and generate the bit-stream file. To optimize computations, they increase the number of hardware units used to process a task, which in turn increases hardware resources linearly, making it an inefficient optimization method.

HLS tools such as Vivado HLS [12] are a good escape from low-level programming; however, such tools are not highly optimized to take full advantage of the available parallelism in CNNs. In [10] the authors use Vivado HLS 2014 to implement a 5-layer accelerator for the MNIST dataset. Their system is capable of processing about 20.8K images/s, while our system is capable of processing up to 125K images/s.

HDL generation for CNNs was previously proposed in [13], where the authors use a high-level descriptive language to generate Verilog code for CNN models. They generate each layer independently by specifying its parameters, then combine all of the layers into a complete accelerator. They do not state anywhere that they store parameters on-chip or hard-code them, which suggests that they use an external memory even for small-scale models, and that is not an efficient way to handle parameters. Their accelerator can achieve 222.1 GOP/s for AlexNet, while ours can achieve 611.52 GOP/s for the same model.

In [14] the authors avoid loading parameters from an external memory by storing them in on-chip memory. In their implementation, they adopt a parallel-serial style to increase the throughput; however, this strategy does not take full advantage of the available parallelism in the CNN, and different layers do not work concurrently. They implemented a small-scale neural network that performs digit recognition on a Xilinx XC7Z045. At 172 MHz, their system is capable of processing about 70K 28×28 images per second. In our implementation, we hard-code parameters as part of the PL to maximize the utilization of hardware resources and overcome memory bandwidth limitations. In our highly parallelized implementation, the system is capable of processing up to 125K 28×28 images/s when running at 200 MHz.

Optimizing computation in CNNs can significantly improve the overall performance of a CNN model. Many attempts have been made to optimize computation through various parallelism approaches. The authors in [15][16] use parallelism only in convolution operations and output feature maps. This work implements three types of parallelism: parallelism in convolution operations, parallelism in input feature maps, and parallelism in output feature maps. In addition, the design in this work is implemented in a pipelined style, which helped increase the throughput of the system, achieving a peak performance of 611.52 GOP/s and 414 FPS (224×224) for AlexNet.

V. HARDWARE ARCHITECTURE
Figure 6 describes the top-level architecture of the proposed system. The same architecture is used in small and large-scale models, except that in small-scale models we do not use an external memory to store parameters.

Fig. 6 Top-Level architecture of the system

A. Convolutional Layer Architecture
The process in this layer begins by streaming input data to a sliding window that has the size of the weights kernel and is used to perform the convolution operation. The convolution operation is fully pipelined and parallelized: all multiplication operations are performed at once for a complete receptive region and for different feature maps. An adder tree is used to add up the results, followed by a bias-addition stage. The activation function (ReLU), a simple zero-thresholding operation, is applied directly to all extracted feature maps, and the output of the ReLU (intermediate values) is stored in buffers which feed the next layer. Figure 7 shows the processing element (PE) details. PEs are scalable to different filter sizes.

B. Pooling Layer Architecture
The pooling layer takes the values stored in buffers by the previous layer and applies a sliding window that has the size of the pooling filter, with a step size based on the specified stride value. This sliding window is similar to the one in the convolutional layer, except that the operation performed is max or average pooling and no weight multiplication is performed. Details of the pooling layer architecture are described in Figure 8.

Fig. 8 Max pooling operation architecture

C. Fully Connected Layer Architecture
The architecture of the FC layer is similar to the convolutional layer architecture, but convolution is replaced with a matrix multiplication operation. The first FC layer in AlexNet requires about 37.7 million multiplication operations (9216 × 4096). The input vector is of size (1 × 9216) and the weights matrix is of size ((6×6×256) × 4096). To perform such a massive matrix multiplication, we divide the input vector into small, equal $(1 \times X_n^{i})$ vectors and divide the weights matrix into matching $(X_n^{ij} \times 1)$ vectors. The multiplication operation is performed as shown in Equation 5. Results from each small vector multiplication are stored in a temporary output. When all multiplications for a complete input vector are done, the final results are generated and stored in the designated outputs $Y_1 \rightarrow Y_j$. The multiplication operation is illustrated in Figure 9.

$$\sum_{j=1}^{m=4096} \; \sum_{i=1}^{k=9216/n} \left(1 \times X_n^{i}\right) * \left(X_n^{ij} \times 1\right) = Y_i^{j} \tag{5}$$

Fig. 9 Small-scale matrix multiplication (input vector 1 × 9216, output vector 1 × 4096)

VI. IMPLEMENTATION
To demonstrate the functionality of the VHDL generation tool, we implemented two benchmarked models, LeNet and AlexNet. In the AlexNet implementation, 16-bit fixed-point precision is used to represent weights and intermediate values, and 8-bit fixed-point precision is used in LeNet. The tool supports different precisions, from a single bit up to 32 bits.

A. LeNet Model
The LeNet model comprises three convolutional layers, two pooling layers, and one fully connected layer. The number of parameters required for the entire model is only 3.75 times the number of parameters required for the first convolutional layer in AlexNet. Nevertheless, this small model is good enough to perform digit recognition with decent accuracy. Since the number of parameters in LeNet is relatively small compared to AlexNet, we managed to have them hard-coded as part of the PL. This strategy helped significantly improve the overall throughput of the system as well as reduce the number of DSPs used. The P&R synthesis report of the hardware resources used is shown in Table IV.

Table IV Resources utilization for LeNet Model
| Layer/Resources | Slice Registers | LUTs | DSPs |
| Available / Virtex VC709 | 866400 | 433200 | 3600 |
| CONV 1 | 1784 | 1857 | 25 |
| POOL 1 | 966 | 1137 | 0 |
| CONV 2 | 20848 | 21643 | 50 |
| POOL 2 | 1121 | 1304 | 0 |
| FCs | 209396 | 238541 | 0 |
| Total | 234115 | 264482 | 75 |
| Utilization | 27.02% | 61.05% | 2.08% |

B. AlexNet Model
Our implementation of AlexNet on the Virtex-7 uses 16-bit fixed-point precision to represent weights. For memory management, we adopt the strategy presented in [17] to manage memory requirements for the fully-connected layer, but we also balance the input transfer against the weight transfer to allow room for increasing the input batch size, thus improving the overall performance. Table V shows the hardware resource utilization of the AlexNet model.

Table V Resources utilization for AlexNet model
| Resources (Virtex VC709) | FFs | LUTs | DSPs | BRAMs |
| Available | 866400 | 433200 | 3600 | 2940 |
| Used | 269845 | 287461 | 2070 | 2023 |
| Utilization | 31.14% | 66.35% | 57.5% | 68.8% |

Fig. 7 Processing element details in a convolutional layer for a 3 × 3 filter (input stream feeding PE1 to PEn; each PE has a register and FIFO sliding window, a weights kernel, a weights multiplication stage, and an adder tree).
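The blocked matrix multiplication of Equation 5 in the fully connected layer can be illustrated in Python. This is a minimal sketch of the idea only, assuming the input length is a multiple of the sub-vector length n; the names `fc_blocked`, `fc_direct`, and `block` are hypothetical, the tiny 4-input, 2-output example is made up, and biases and the nonlinearity are left out because Equation 5 does not include them.

```python
def fc_blocked(inputs, weights, block):
    """Compute every output Y_j by accumulating small sub-vector products.

    inputs:  list of K input values (1 x K vector)
    weights: K rows by M columns, weights[i][j] connects input i to output j
    block:   sub-vector length n; must divide K in this sketch
    """
    k = len(inputs)
    m = len(weights[0])
    if k % block:
        raise ValueError("block size must divide the input length")
    outputs = []
    for j in range(m):
        temp = 0  # temporary output that accumulates partial results for Y_j
        for start in range(0, k, block):
            x_sub = inputs[start:start + block]                           # (1 x n)
            w_sub = [weights[i][j] for i in range(start, start + block)]  # (n x 1)
            temp += sum(x * w for x, w in zip(x_sub, w_sub))
        outputs.append(temp)  # final result once the whole input vector is consumed
    return outputs


def fc_direct(inputs, weights):
    """Reference: plain vector-matrix product without blocking."""
    return [
        sum(x * row[j] for x, row in zip(inputs, weights))
        for j in range(len(weights[0]))
    ]


inputs = [1, 2, 3, 4]
weights = [
    [1, 0],
    [0, 1],
    [2, 1],
    [1, 1],
]
blocked = fc_blocked(inputs, weights, block=2)
direct = fc_direct(inputs, weights)
```

Both computations give the same output vector, which is the property that lets the large 1 × 9216 by 9216 × 4096 product be processed in small pieces.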

Table VI Comparison with other implementations of AlexNet model
| Work | Platform | Frequency (MHz) | GOP/s | FPS | Processing/Image (ms) |
| | Intel® Core™ i7-6700HQ | 2600 | - | 2.4 | 417 |
| [8] | Altera Stratix-V | 120 | 136.5 | 50 | 20.1 |
| [17] | Virtex7-VX690T | 156 | 565.9 | 391 | 2.56 |
| [18] | Stratix-V GXA7 | 100 | 114.5 | - | >12.5 |
| [6] | Virtex7-VX485T | 100 | 61.62 | 47 | 21.61 |
| This work | Virtex7-VX690T | 200 | 611.5 | 414 | 2.41 |

Table VII Comparison with other automatic HDL generation implementations
| Work | Platform | Frequency (MHz) | GOP/s or GMAC/s | CNN model |
| [11] | Virtex7-VX690T | 200 | 45.8 GOP/s | AlexNet |
| [10] | Virtex7-VX485T | 150 | 16.42 GMAC/s | LeNet |
| [13] | Virtex7-VX690T | 100 | 222.1 GOP/s | AlexNet |
| This work | Virtex7-VX690T (VC709) | 200 | 611.5 GOP/s | AlexNet |

VII. CONCLUSION AND FUTURE WORK
In this work, we proposed a VHDL generation tool that is optimized to generate a modular, scalable, reconfigurable, and highly parallel implementation for CNN models. We demonstrated our VHDL generator by implementing a small-scale (LeNet) and a large-scale (AlexNet) CNN model on a Virtex-7 running at 200 MHz. Our system is capable of processing up to 125K images/s for the small-scale model and achieved 414 FPS and 611.52 GOP/s for the large-scale one. We aim to extend this work as follows: first, incorporate a design space exploration methodology for choosing an adequate FPGA platform for the desired CNN model; second, give developers the ability to choose the desired parallelism methodology to meet their own hardware resource constraints; third, support all CNN styles besides the Conv→Pool→FC style, and support other neural network algorithms such as recurrent neural networks (RNNs); lastly, visualize the configured model through the GUI.

REFERENCES
[1] J. Deng, W. Dong, R. Socher, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” 2009 IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2–9, 2009.
[2] J. L. F. Pereira and R. J. F. Rossetti, “An integrated architecture for autonomous vehicles simulation,” Proc. 27th Annu. ACM Symp. Appl. Comput. - SAC ’12, pp. 286–292, 2012.
[3] “http://yann.lecun.com/exdb/lenet/.”
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.
[5] J. Qiu et al., “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” Proc. 2016 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays - FPGA ’16, pp. 26–35, 2016.
[6] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks,” Proc. 2015 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays - FPGA ’15, pp. 161–170, 2015.
[7] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynamically configurable coprocessor for convolutional neural networks,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 247, 2010.
[8] “2016- Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale.”
[9] C. Poulet, J. Y. Han, and Y. Lecun, “CNP: An FPGA-Based Processor for Convolutional Networks,” vol. 1, no. 1.
[10] Y. Zhou and J. Jiang, “An FPGA-based accelerator implementation for deep convolutional neural networks,” Proc. 2015 4th Int. Conf. Comput. Sci. Netw. Technol. ICCSNT 2015, pp. 829–832, 2016.
[11] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi, “Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks,” arXiv, 2016.
[12] “Vivado High-Level Synthesis.” [Online]. Available: https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html. [Accessed: 07-Aug-2017].
[13] Z. Liu, Y. Dou, J. Jiang, and J. Xu, “Automatic Code Generation of Convolutional Neural Networks in FPGA Implementation,” in International Conference on Field-Programmable Technology (FPT), 2016, pp. 61–68.
[14] J. Park and W. Sung, “FPGA Based Implementation of Deep Neural Networks Using On-Chip Memory Only,” ICASSP 2016, pp. 1011–1015, 2016.
[15] M. Sankaradas et al., “A Massively Parallel Coprocessor for Convolutional Neural Networks,” Icasap, pp. 53–60, 2009.
[16] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf, “A programmable parallel accelerator for learning and classification,” Proc. 19th Int. Conf. Parallel Archit. Compil. Tech. - PACT ’10, p. 273, 2010.
[17] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance FPGA-based accelerator for large-scale convolutional neural networks,” 2016 26th Int. Conf. Field Program. Log. Appl., pp. 1–9, 2016.
[18] Y. Ma, N. Suda, Y. Cao, J. S. Seo, and S. Vrudhula, “Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA,” FPL 2016 - 26th Int. Conf. Field-Programmable Log. Appl., 2016.
[19] “CS231n Convolutional Neural Networks for Visual Recognition.” [Online]. Available: http://cs231n.github.io/convolutional-networks/. [Accessed: 01-Jan-2017].
