C. Medical

There is a strong clinical need to be able to monitor patients and to collect long-term data to help either detect/diagnose various diseases or monitor treatment. For instance, constant monitoring of ECG or EEG signals would be helpful in identifying cardiovascular diseases or detecting the onset of a seizure for epilepsy patients, respectively. In many cases, the devices are either wearable or implantable, and thus the energy consumption must be kept to a minimum. Using embedded machine learning to extract meaningful physiological signals and process them locally is explored in [10–12].

III. MACHINE LEARNING BASICS

Machine learning is a form of artificial intelligence (AI) that can perform a task without being specifically programmed. Instead, it learns from previous examples of the given task during a process called training. After learning, the task is performed on new data through a process called inference. Machine learning is particularly useful for applications where the data is difficult to model analytically.

Training involves learning a set of weights from a dataset. When the data is labelled, it is referred to as supervised learning, which is currently the most widely-used approach. Inference involves performing a given task using the learned weights (e.g., classify an object in an image).¹ In many cases, training is done in the cloud. Inference can also happen in the cloud; however, as previously discussed, for certain applications this is not desirable from the standpoint of communication, latency and privacy. Instead, it is preferred that the inference occur locally on a device near the sensor. In these cases, the trained weights are downloaded from the cloud and stored on the device. Thus, the device needs to be programmable in order to support a reasonable range of tasks.

¹ Machine learning can be used in a discriminative or generative manner. This paper focuses on the discriminative use.

A typical machine learning pipeline for inference can be broken down into two steps, as shown in Fig. 2: Feature Extraction and Classification. Approaches such as deep neural networks (DNN) blur the distinction between these steps.
A. Feature Extraction

Feature extraction is used to transform the raw data into meaningful inputs for the given task. Traditionally, feature extraction was designed through a hand-crafted process by experts in the field. For instance, for object recognition in computer vision, it was observed that humans are sensitive to edges (i.e., gradients) in an image. As a result, many well-known computer vision algorithms use image gradient-based features such as Histogram of Oriented Gradients (HOG) [13] and Scale Invariant Feature Transform (SIFT) [14]. The challenge in designing these features is to make them robust to variations in illumination and noise.

Fig. 2. Inference pipeline. [Diagram: image pixels → feature extraction (hand-crafted, e.g., HOG, or learned, e.g., DNN) → features (x) → classification (score w^T x using trained weights w) → scores per class; the class is selected based on the maximum score or a threshold.]
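To make the notion of gradient-based features concrete, here is a minimal Python/NumPy sketch of a magnitude-weighted histogram of gradient orientations; it is a simplified illustration only, not the full HOG pipeline of [13], which adds cell/block structure and local normalization (the function name and bin count are our own choices).

    import numpy as np

    def orientation_histogram(image, bins=9):
        # Image gradients via finite differences (rows -> gy, cols -> gx).
        gy, gx = np.gradient(image.astype(np.float64))
        magnitude = np.hypot(gx, gy)
        # Unsigned orientation in [0, 180) degrees, as is common for HOG.
        angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
        # Histogram of edge orientations, weighted by edge strength.
        hist, _ = np.histogram(angle, bins=bins, range=(0, 180), weights=magnitude)
        # L2 normalization adds some robustness to illumination changes.
        return hist / (np.linalg.norm(hist) + 1e-9)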
B. Classification

The output of feature extraction is represented by a vector, which is mapped to a score using a classifier. Depending on the application, the score is either compared to a threshold to determine if an object is present, or compared to the other scores to determine the object class.

Techniques often used for classification include linear methods such as support vector machine (SVM) [15] and Softmax, and non-linear methods such as kernel-SVM [15] and Adaboost [16]. In many of these classifiers, the computation of the score is effectively a dot product of the features (x) and the weights (w) (i.e., Σ_i w_i x_i). As a result, much of the hardware research has been focused on reducing the cost of a multiply and accumulate (MAC).
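As a concrete illustration of why the MAC dominates, the following minimal sketch computes per-class scores as dot products and applies either a threshold or an argmax; shapes and names are illustrative rather than taken from any particular system.

    import numpy as np

    def classify(x, W, b, threshold=None):
        # Per-class scores: each score is sum_i w_i * x_i plus a bias,
        # i.e., one MAC per weight.
        scores = W @ x + b
        if threshold is not None:
            # Detection-style use: is an object present (does any score clear the bar)?
            return scores > threshold
        # Classification-style use: select the class with the maximum score.
        return int(np.argmax(scores))

    x = np.random.randn(36)                       # e.g., a 36-dimension feature vector
    W, b = np.random.randn(10, 36), np.zeros(10)  # 10 classes
    print(classify(x, W, b))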
C. Deep Neural Networks (DNN)

Rather than using hand-crafted features, the features can be learned directly from the data, similar to the weights in the classifier, such that the entire system is trained end-to-end. These learned features are used in a popular form of machine learning called deep neural networks (DNN), also known as deep learning [17]. DNNs deliver higher accuracy than hand-crafted features on a variety of tasks [18] by mapping inputs to a high-dimensional space; however, this comes at the cost of high computational complexity.

There are many forms of DNN (e.g., convolutional neural networks, recurrent neural networks, etc.). For computer vision applications, DNNs are composed of multiple convolutional (CONV) layers [19], as shown in Fig. 3. With each layer, a higher-level abstraction of the input data, called a feature map, is extracted that preserves essential yet unique information. Modern DNNs are able to achieve superior performance by employing a very deep hierarchy of layers.

Fig. 3. Deep Neural Networks are composed of several convolutional layers followed by fully connected layers. [Diagram: a modern deep CNN has 5–1000 CONV layers extracting low-level to high-level features and 1–3 FC layers mapping them to classes; each CONV layer applies convolution, a non-linearity, and optional normalization and pooling.]

Fig. 4 shows an example of a convolution in DNNs. The 3-D inputs to each CONV layer are 2-D feature maps (W × H) with multiple channels (C). For the first layer, the input would be the 2-D image itself with three channels, typically the red, green and blue color channels. Multiple 3-D filters (M filters with dimension R × S × C) are then convolved with the input feature maps, and each filter generates one channel of the output 3-D feature map (E × F with M channels). The same set of M filters is applied to a batch of N input feature maps; thus there are N input feature maps and N output feature maps. In addition, a 1-D bias is added to the filtered result.

Fig. 4. Computation of a convolution in DNN. [Diagram: N input feature maps (W × H, C channels) are convolved with M filters (R × S × C) to produce N output feature maps (E × F, M channels).]
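The dimensions above can be made concrete with a deliberately naive, loop-based sketch of one CONV layer; unit stride and no padding are simplifying assumptions on our part.

    import numpy as np

    def conv_layer(inputs, filters, bias):
        # inputs:  (N, C, H, W)  N input feature maps with C channels each
        # filters: (M, C, R, S)  M filters shared across the whole batch
        # bias:    (M,)          one bias per output channel
        N, C, H, W = inputs.shape
        M, _, R, S = filters.shape
        E, F = H - R + 1, W - S + 1            # output size (stride 1, no padding)
        out = np.zeros((N, M, E, F))
        for n in range(N):
            for m in range(M):
                for e in range(E):
                    for f in range(F):
                        # Each output value costs R*S*C MACs plus one bias add.
                        window = inputs[n, :, e:e + R, f:f + S]
                        out[n, m, e, f] = np.sum(window * filters[m]) + bias[m]
        return out                              # (N, M, E, F) output feature maps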
The output of the final CONV layer is processed by fully-connected (FC) layers for classification purposes. In FC layers, the filter and input feature map are the same size, so that there is a different weight for each input pixel. The number of FC layers has been reduced from three to one in most recent DNNs [20, 21]. In between CONV and FC layers, additional layers can optionally be added, such as pooling and normalization layers [22]. Each of the CONV and FC layers is also immediately followed by an activation layer, such as a rectified linear unit (ReLU) [23]. Convolutions account for over 90% of the run-time and energy consumption in DNNs.

Table I compares modern DNNs with a popular neural net from the 1990s, LeNet-5 [24], in terms of number of layers (depth), number of filter weights, and number of operations (i.e., MACs). Today's DNNs are several orders of magnitude larger in terms of compute and storage. A more detailed discussion on DNNs can be found in [25].

TABLE I
SUMMARY OF POPULAR DNNS [20, 21, 24, 26, 27]. ACCURACY MEASURED AS TOP-5 ERROR ON IMAGENET [18].

    Metrics                   LeNet-5   AlexNet   VGG-16   GoogLeNet (v1)   ResNet-50
    Accuracy (Top-5 error)    n/a       16.4      7.4      6.7              5.3
    CONV Layers               2         5         16       21               49
    CONV Weights              26k       2.3M      14.7M    6.0M             23.5M
    CONV MACs                 1.9M      666M      15.3G    1.43G            3.86G
    FC Layers                 2         3         3        1                1
    FC Weights                405k      58.6M     124M     1M               2M
    FC MACs                   405k      58.6M     124M     1M               2M
    Total Weights             431k      61M       138M     7M               25.5M
    Total MACs                2.3M      724M      15.5G    1.43G            3.9G
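The weight and MAC counts in Table I follow directly from the dimensions defined in Fig. 4; the short sketch below shows the arithmetic for a single CONV layer (the example layer is hypothetical and not taken from any network in the table).

    def conv_layer_cost(C, R, S, M, E, F):
        # One filter holds R*S*C weights, and there are M filters (bias ignored).
        weights = R * S * C * M
        # Each of the E*F*M output values requires R*S*C multiply-accumulates.
        macs = E * F * M * R * S * C
        return weights, macs

    # Hypothetical 3x3 CONV layer: 64 input channels, 128 filters, 56x56 output.
    weights, macs = conv_layer_cost(C=64, R=3, S=3, M=128, E=56, F=56)
    print(weights, macs)    # 73728 weights, 231211008 (~231M) MACs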
D. Complexity versus Difficulty of Task

It is important to factor in the difficulty of the task when comparing different machine learning methods. For instance, the task of classifying handwritten digits from the MNIST dataset [28] is much simpler than classifying an object into one of 1000 classes, as is required for the ImageNet dataset [18] (Fig. 5). It is expected that the size of the classifier or network (i.e., number of weights) and the number of MACs will be larger for the more difficult task than for the simpler task, and thus require more energy. For instance, LeNet-5 [24] is designed for digit classification, while AlexNet [26], VGG-16 [27], GoogLeNet [20], and ResNet [21] are designed for the 1000-class image classification.

Fig. 5. MNIST (10 classes, 60k training, 10k testing) [28] vs. ImageNet (1000 classes, 1.3M training, 100k testing) [18] datasets.

IV. CHALLENGES

The key metrics for embedded machine learning are accuracy, energy consumption, throughput/latency, and cost.

The accuracy of the machine learning algorithm should be measured on a sufficiently large dataset. There are many widely-used publicly-available datasets that researchers can use (e.g., ImageNet).

Programmability is important since the weights need to be updated when the environment or application changes. In the case of DNNs, the processor must also be able to support different networks with varying numbers of layers, filters, channels and filter sizes.

The high dimensionality and need for programmability both result in an increase in computation and data movement. Higher dimensionality increases the amount of data generated, and programmability means that the weights also need to be read and stored. This poses a challenge for energy efficiency since data movement costs more than computation [29]. In this paper, we will discuss various methods that reduce data movement to minimize energy consumption.

The throughput is dictated by the amount of computation, which also increases with the dimensionality of the data. In this paper, we will discuss various methods by which the data can be transformed to reduce the number of required operations.

The cost is dictated by the amount of storage required on the chip. In this paper, we will discuss various methods to reduce storage costs such that the area of the chip is reduced, while maintaining low off-chip memory bandwidth.
[Diagram: highly-parallel compute paradigms: a temporal architecture (SIMD/SIMT) and a spatial architecture (dataflow processing), each pairing a memory hierarchy with an array of ALUs; in the spatial architecture, each processing engine (PE) is an ALU with a local register file.]

[Diagram: accelerator with off-chip DRAM, an on-chip global buffer, and a spatial array of PEs; data movement energy cost relative to an ALU operation: DRAM→ALU 200×, buffer→ALU 6×.]

Fig. 8. Dataflows for DNNs. [Panels include (b) Output Stationary.]

[Fig. 10 charts: normalized energy/MAC for the WS, OSA, OSB, OSC, NLR and RS dataflows, broken down by data type (pixels, weights, psums) and by storage level (ALU, RF, NoC, buffer, DRAM).]
• No local reuse (NLR): no local storage is allocated to the PE; instead, all of that area is allocated to the global buffer to increase its capacity (Fig. 7(c)). The trade-off is that there will be increased traffic on the spatial array and to the global buffer for all data types. Examples are found in [44–46].

• Row stationary (RS): In order to increase reuse of all types of data (weights, pixels, partial sums), a row stationary approach is proposed in [34]. A row of the filter convolution remains stationary within a PE to exploit 1-D convolutional reuse within the PE. Multiple 1-D rows are combined in the spatial array to exhaustively exploit all convolutional reuse (Fig. 9), which reduces accesses to the global buffer. Multiple 1-D rows from different channels and filters are mapped to each PE to reduce partial sum data movement and to exploit filter reuse, respectively. Finally, multiple passes across the spatial array allow for additional image and filter reuse using the global buffer. This dataflow is demonstrated in [47].

Fig. 9. Row Stationary Dataflow [34].

The dataflows are compared on a spatial array with the same number of PEs (256), the same area cost, and the same DNN (AlexNet). Fig. 10 shows the energy consumption of each approach. The row stationary approach is 1.4× to 2.5× more energy-efficient than the other dataflows.
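The shape of these comparisons can be approximated with back-of-the-envelope arithmetic using the relative access energies shown earlier (DRAM roughly 200× and the global buffer roughly 6× the cost of an ALU operation). The access counts below are illustrative placeholders rather than measured values, and treating a register-file access as roughly 1× is our assumption.

    # Energy per access, normalized to one ALU operation (MAC).
    COST = {"alu": 1.0, "rf": 1.0, "buffer": 6.0, "dram": 200.0}

    def energy_per_mac(rf, buf, dram):
        # rf/buf/dram: average number of accesses per MAC at each storage level.
        return COST["alu"] + rf * COST["rf"] + buf * COST["buffer"] + dram * COST["dram"]

    # Made-up access profiles: a dataflow with good local reuse vs. one that
    # spills to the global buffer and DRAM more often.
    print(energy_per_mac(rf=3.0, buf=0.1, dram=0.01))   # 6.6
    print(energy_per_mac(rf=0.0, buf=2.0, dram=0.05))   # 23.0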
VI. OPPORTUNITIES IN JOINT ALGORITHM AND HARDWARE DESIGN

There is on-going research on modifying the machine learning algorithms to make them more hardware-friendly while maintaining accuracy; specifically, the focus is on reducing computation, data movement and storage requirements.

A. Reduce Precision

The default size for programmable platforms such as CPUs and GPUs is often 32 or 64 bits with floating-point representation. While this remains the case for training, during inference it is possible to use a fixed-point representation and substantially reduce the bitwidth for energy and area savings and an increase in throughput. Retraining is typically required to maintain accuracy when pushing the weights and features to lower bitwidths.

In hand-crafted approaches, the bitwidth can be drastically reduced to below 16 bits without impacting accuracy. For instance, in object detection using HOG, each 36-dimension feature vector only requires 9 bits per dimension, and each weight of the SVM uses only 4 bits [48]; for object detection using deformable parts models (DPM) [49], only 11 bits are required per feature vector and only 5 bits per SVM weight [50].

Similarly, for DNN inference it is common to see accelerators support 16-bit fixed point [45, 47]. There has been significant research on exploring the impact of bitwidth on accuracy.
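A minimal sketch of one common choice, uniform symmetric quantization of trained weights to signed fixed point, is shown below; it is illustrative and not necessarily the scheme used by the accelerators cited above.

    import numpy as np

    def quantize(w, bits):
        # Map float weights onto signed integers in [-qmax, qmax] with a shared scale.
        qmax = 2 ** (bits - 1) - 1                     # e.g., 127 for 8 bits
        scale = np.max(np.abs(w)) / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
        return q, scale                                # approximate w as q * scale

    w = np.random.randn(1000).astype(np.float32)       # stand-in for trained weights
    q, scale = quantize(w, bits=8)
    print(np.max(np.abs(w - q * scale)))               # worst-case rounding error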
Various forms of lightweight compression have been explored to reduce data movement. Lossless compression can be used to reduce the transfer of data on and off chip [11, 53, 63]. Simple run-length coding of the activations in [64] provides up to 1.9× bandwidth reduction, which is within 5-10% of the theoretical entropy limit. Lossy compression such as vector quantization can also be used on feature vectors [50] and weights [8, 12, 65] such that they can be stored on-chip at low cost. Generally, the cost of the compression/decompression hardware is on the order of a few thousand kgates with minimal energy overhead. In the lossy compression case, it is also important to evaluate the impact on accuracy.
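For intuition, a toy run-length coder for an activation stream (which is often mostly zero after ReLU) is sketched below; it is a generic illustration, not the specific scheme of [64].

    def rle_encode(activations):
        # Encode a mostly-zero stream as (zero_run, value) pairs.
        pairs, run = [], 0
        for a in activations:
            if a == 0:
                run += 1
            else:
                pairs.append((run, a))    # 'run' zeros followed by one non-zero value
                run = 0
        if run:
            pairs.append((run, 0))        # any trailing zeros
        return pairs

    acts = [0, 0, 3, 0, 0, 0, 5, 1, 0, 0]
    print(rle_encode(acts))               # [(2, 3), (3, 5), (0, 1), (2, 0)]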