Deep Image: Scaling up Image Recognition
Ren Wu
Shengen Yan
Yi Shan
Qingqing Dang
Gang Sun
Abstract
We present a state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning. The key components are a custom-built supercomputer dedicated to deep learning, a highly optimized parallel algorithm using new strategies for data partitioning and communication, larger deep neural network models, novel data augmentation approaches, and the use of multi-scale high-resolution images. On one of the most challenging computer vision benchmarks, the ImageNet classification challenge, our system has achieved the best result to date, with a top-5 error rate of 4.58%, exceeding human recognition performance and representing a relative 31% improvement over the ILSVRC 2014 winner.
1. Introduction
On May 11th, 1997, IBM's Deep Blue achieved a historic victory by defeating world chess champion Garry Kasparov in a six-game match (Campbell et al., 2002). It came as a surprise to some, but with the correct algorithms, chess performance is a function of computational power (Condon & Thompson, 1982), (Hyatt et al., 1990), (Kuszmaul, 1995).
Today, history is repeating itself: simple, scalable algorithms, given enough data and computational resources, dominate many fields, including visual object recognition (Ciresan et al., 2010), (Krizhevsky et al., 2012), (Szegedy et al., 2014), speech recognition (Dahl et al., 2012), (Hannun et al., 2014), and natural language processing (Collobert & Weston, 2008), (Mnih & Hinton, 2007), (Mnih & Hinton, 2008).
Although neural networks have been studied for many decades, only recently have they come into their own, as larger datasets and far greater computational resources have become available.
All the authors are with Baidu Research, Baidu, Inc.
Ren Wu (wuren@baidu.com) is the corresponding author.
2. Hardware/Software Co-design
It is clear that different classes of algorithms perform differently on different computing architectures. Graphics processors, or GPUs, often perform extremely well for compute-intensive algorithms. Early work showed that for clustering algorithms, a single GPU offers 10x more performance than a top-of-the-line 8-core workstation, even on very large datasets with more than a billion data points (Wu et al., 2009). A more recent example shows that three GPU servers with 4 GPUs each rival the performance of the 1000-node (16000-core) CPU cluster used by the Google Brain project (Coates et al., 2013).
Modern deep neural networks are mostly trained by variants of the stochastic gradient descent (SGD) algorithm. Because SGD has high arithmetic density, GPUs are an excellent fit for this type of algorithm.
Furthermore, we would like to train very large deep neural networks without worrying about the capacity limitation of a single GPU or even a single machine, so scaling up is a requirement. Given the properties of stochastic gradient descent, the distributed version of the algorithm needs very high-bandwidth, ultra-low-latency interconnects to minimize communication costs.
The result is a custom-built supercomputer, which we call Minwa. It comprises 36 server nodes, each with two six-core Intel Xeon E5-2620 processors. Each server contains 4 Nvidia Tesla K40m GPUs and one FDR InfiniBand (56 Gb/s) adapter, a high-performance, low-latency interconnect that supports RDMA. The peak single-precision floating-point performance of each GPU is 4.29 TFlops, and each GPU has 12 GB of memory. Thanks to GPUDirect RDMA, the InfiniBand network interface can access remote GPU memory without involvement from the CPU. All the server nodes are connected to the InfiniBand switch. Figure 1 shows the system architecture. The system runs Linux with CUDA 6.0 and MVAPICH2 MPI, which also enables GPUDirect RDMA.
In total, Minwa has 6.9 TB of host memory, 1.7 TB of device memory, and about 0.6 PFlops of theoretical single-precision peak performance.
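These totals follow directly from the node count: 36 nodes × 4 GPUs = 144 GPUs, 144 × 4.29 TFlops ≈ 0.62 PFlops of peak single-precision compute, and 144 × 12 GB ≈ 1.7 TB of device memory (the host total likewise implies about 192 GB of memory per node).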
[Figure: data-parallel training across four GPUs (weight copies w1-w4). Gradient parts are sent to each GPU between the forward and backward passes, and weights are updated as w <- w - delta-w.]
3. Optimization
Our goal is to push for extreme performance from both hardware and software for given tasks. In modern deep convolutional neural networks, convolutional layers account for most of the computation and fully-connected layers account for most of the parameters. We implement two parallelism strategies for our parallel deep neural network framework, namely model-data parallelism and data parallelism. Similar strategies have been proposed in previous work (Krizhevsky, 2014), (Yadan et al., 2013). However, the previous work mainly focuses on a single server with multiple GPUs or on small GPU clusters, so it is hard to extend directly to a large GPU cluster because of communication bottlenecks. In our work, we focus on optimizing parallel strategies, minimizing data transfers, and overlapping computation with communication. This is needed to approach the peak performance of large supercomputers like Minwa.
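As a concrete illustration of overlapping communication with computation, the sketch below (Python with mpi4py, an assumed stack, since the paper does not specify its software) issues a non-blocking all-reduce for each layer's gradient as soon as the backward pass produces it, so gradient computation for earlier layers proceeds while later transfers are in flight.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Stand-in gradients for an 8-layer network, produced last layer first.
shapes = [(256, 256)] * 8
reqs, reduced = [], []
for shape in reversed(shapes):            # backward pass order
    g = np.random.randn(*shape).astype(np.float32)   # stand-in for dL/dw
    out = np.empty_like(g)
    # Non-blocking all-reduce: communication overlaps the next iteration's work.
    reqs.append(comm.Iallreduce(g, out, op=MPI.SUM))
    reduced.append(out)
MPI.Request.Waitall(reqs)                 # ensure every exchange has finished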
3.1. Data Parallelism
When all the parameters can be stored in the memory of one GPU, we exploit data parallelism at all layers. In this case, each GPU is responsible for 1/N of the mini-batch of input images, and all GPUs work together on the same mini-batch. In one forward-backward pass, the parameters and gradients of all layers are replicated on every GPU. All GPUs compute gradients based on local training data and a local copy of the weights. They then exchange gradients and update the local copy of the weights.
Two strategies have helped us achieve better parallelism. The first is the butterfly synchronization strategy, in which all gradients are partitioned into K parts and each GPU is responsible for its own part. At the end of the gradient computation, each GPU collects its assigned gradient partition from all of its peers, accumulates it, and broadcasts the reduced partition back, so that every GPU ends up with the fully synchronized gradient.
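This exchange is essentially a reduce-scatter followed by an all-gather. The sketch below is a minimal illustration in Python, assuming mpi4py as the communication library (an assumption; the paper does not name its stack), launched as, e.g., mpirun -n 4 python butterfly.py.

import numpy as np
from mpi4py import MPI

# Butterfly-style gradient synchronization: each of K workers owns one
# 1/K slice of the gradient vector. A reduce-scatter sums each slice onto
# its owner; an all-gather then redistributes the reduced slices.
comm = MPI.COMM_WORLD
k = comm.Get_size()                       # number of workers (GPUs)
n = 1 << 20                               # toy gradient size
assert n % k == 0, "gradient must divide evenly across workers"
part = n // k

local_grad = np.random.randn(n).astype(np.float32)   # this worker's gradient
my_slice = np.empty(part, dtype=np.float32)

# Step 1: reduce-scatter -- worker i ends up with the sum of slice i.
comm.Reduce_scatter(local_grad, my_slice, recvcounts=[part] * k, op=MPI.SUM)

# Step 2: all-gather -- every worker reassembles the fully reduced gradient.
summed = np.empty(n, dtype=np.float32)
comm.Allgather(my_slice, summed)

# Local SGD update with the averaged gradient (w <- w - eta * g).
eta, weights = 0.01, np.zeros(n, dtype=np.float32)
weights -= eta * (summed / k)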
[Figure: parallel efficiency (speedup divided by GPU number) for 16, 32, and 64 GPUs.]
3.2. Model-Data Parallelism
The parameters at convolutional layers are copied to every GPU, and all images within one mini-batch are partitioned and assigned to all GPUs, just as described in Section 3.1. The parameters at fully-connected layers, however, are evenly divided amongst all GPUs. All GPUs then work together to calculate the fully-connected layers and synchronize when necessary. This is similar to the approach presented in (Krizhevsky, 2014), though we apply it at a much larger scale.
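A toy, single-process illustration of this split (plain numpy, simulating four workers) follows: each worker holds an equal column shard of a fully-connected layer's weight matrix, computes its slice of the output from the replicated input, and a concatenation stands in for the synchronization step a real multi-GPU implementation would perform.

import numpy as np

n_workers, batch, d_in, d_out = 4, 32, 512, 1024
x = np.random.randn(batch, d_in).astype(np.float32)   # replicated activations
w = np.random.randn(d_in, d_out).astype(np.float32)   # full FC weight matrix

# Each worker owns d_out / n_workers output columns of the weights.
shards = np.split(w, n_workers, axis=1)

# Every worker computes its slice of the layer output from the same input.
partials = [x @ shard for shard in shards]

# An all-gather along the feature dimension reassembles the full output.
y = np.concatenate(partials, axis=1)
assert np.allclose(y, x @ w, atol=1e-3)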
[Figure: overall speedup versus number of GPUs (16, 32, 64).]
[Figure: accuracy versus training time (hours, from 0.25 to 32) for 1, 16, and 32 GPUs; at 80% accuracy, the 32-GPU configuration achieves a 24.7x speedup over a single GPU.]
[Figure 5: Examples of the augmentation effects: original photo, vignette, more vignette, pincushion distortion, barrel distortion, horizontal stretch, vertical stretch.]
4. Training Data
4.1. Data Augmentation
The phrase "the more you see, the more you know" is true for humans as well as neural networks, especially for modern deep neural networks.
We are now capable of building very large deep neural networks, up to hundreds of billions of parameters, thanks to dedicated supercomputers such as Minwa. The available training data is simply not sufficient to train a network of this size. Additionally, the collected examples are often just "good" data, a rather small and heavily biased subset of the possible space. It is desirable to show the network more data with broader coverage.
The authors of this paper believe that data augmentation is fundamentally important for improving the performance of the networks. We would like the network to learn the important features that are invariant for the object classes, rather than artifacts of the training images. We have explored many different ways of doing augmentation, some of which are discussed in this section.
It is clear that an object does not change its class if the ambient lighting changes, or if the observer is replaced by another. More specifically, this is to say that the neural network model should be less sensitive to colors that are driven by the illuminants of the scene, or to the optical systems of various observers. This observation has led us to focus on some of the key augmentations, such as color casting, vignetting, and lens distortion, as shown in Figure 5.
Different from the color shifting in (Krizhevsky et al., 2012), we perform color casting to alter the intensities of the RGB channels in training images. Specifically, for each image, we generate three Boolean values to determine whether the R, G and B channels should be altered, respectively. If a channel is to be altered, we add a random integer within a fixed range to all pixels of that channel.

Table 1. The number of possible changes for each augmentation.

Augmentation     | Possible changes
-----------------|-----------------------------------------------------
Color casting    | 68920
Vignetting       | 1960
Lens distortion  | 260
Rotation         | 20
Flipping         | 2
Cropping         | 82944 (crop size 224x224, input image size 512x512)
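The cropping count in Table 1 follows from the number of distinct crop positions: (512 - 224)^2 = 288^2 = 82944. As a minimal sketch of two of the listed augmentations (assuming HxWx3 uint8 RGB arrays; the +/-20 casting range and the vignette strength are illustrative assumptions, since the exact values are not preserved in the text):

import numpy as np

def color_cast(img, max_offset=20, rng=None):
    # Randomly shift the intensity of a subset of the RGB channels.
    # The +/-20 offset range is an assumption for illustration only.
    rng = rng or np.random.default_rng()
    out = img.astype(np.int16)
    for c in range(3):                    # decide per channel: R, G, B
        if rng.random() < 0.5:            # Boolean flag: alter this channel?
            out[..., c] += int(rng.integers(-max_offset, max_offset + 1))
    return np.clip(out, 0, 255).astype(np.uint8)

def vignette(img, strength=0.5):
    # Darken pixels radially toward the corners (a simple vignette model).
    h, w = img.shape[:2]
    y, x = np.mgrid[0:h, 0:w]
    r2 = (((x - w / 2) / (w / 2)) ** 2 + ((y - h / 2) / (h / 2)) ** 2) / 2.0
    mask = 1.0 - strength * r2            # 1 at the center, 1-strength at corners
    return np.clip(img * mask[..., None], 0, 255).astype(np.uint8)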
[Figure 7: An image of a dragonfly (original image shown) with the top-20 classes and scores predicted by the low-resolution and high-resolution models. The low-resolution model ranks ant first (score 0.2287) with dragonfly outside the top 5, while the high-resolution model gives close scores to lacewing (0.103), dragonfly (0.074), and damselfly (0.074).]
As shown in Table 1, using different augmentation approaches makes the number of training examples explode, which poses a greater challenge in terms of computational resources. However, the resulting model has better accuracy, which recoups the cost, as evidenced in Figure 6.
4.2. Multi-scale Training
We also notice that multi-scale training with high-resolution images works better than single-scale training, especially for recognizing small objects.
Previous work (Krizhevsky et al., 2012), (Zeiler & Fergus, 2014) usually downsizes the images to a fixed resolution, such as 256x256, then randomly crops out slightly smaller areas, such as 224x224, and uses those crops for training. While this method reduces computational costs, and the multiple convolutional layers can still capture multi-scale statistics, the downsampling may disturb the details of small objects, which are important features for distinguishing between classes. If higher-resolution images such as 512x512 are used, each 224x224 crop captures more detail, and the model may learn more from such crops. Different from the work (Simonyan & Zisserman, 2014), (Szegedy et al., 2014) that uses a scale-jittering method, we have trained separate models at different scales, including high-resolution ones (such as 512x512), and combined them by averaging their softmax class posteriors.
As shown in Figure 7, the dragonfly is small and occupies only a small portion of the image. If a model trained on low-resolution images is used, the true class falls outside the top-5 predictions. The model trained on high-resolution images captures more features of the dragonfly and ranks the true class in second place. The high-resolution model gives similar scores to lacewing, dragonfly, and damselfly due to the similarity of the species, whereas the low-resolution model separates them with a large score gap. It can also be seen that combining the two models in Figure 7 by simple averaging gives a good prediction. Figure 8 has more examples where the high-resolution model gives the correct answer while the lower-resolution one fails.
On the other hand, a small crop from a high-resolution image may not contain any object at all, which can mislead the model because the crop still carries the label of the whole image. A model trained on crops from low-resolution images is therefore also necessary.
[Figure 8: Examples (bathtub, isopod, Indian elephant, ice bear) where the high-resolution model predicts the correct class while the low-resolution model fails.]
We have trained several models with images at different scales, and they are complementary. The models trained with 256x256 images and 512x512 images give top-5 error rates of 7.96% and 7.42% respectively on the ImageNet validation dataset, but the model fused by simply averaging their softmax class posteriors performs better than either model alone.
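The fusion itself is just an average of per-class probabilities. A sketch in Python follows, where predict_256 and predict_512 are hypothetical stand-ins for the scale-specific models.

import numpy as np

def fuse_posteriors(image, models):
    # Average the softmax class posteriors of several single-scale models.
    probs = np.stack([predict(image) for predict in models])
    return probs.mean(axis=0)

def top5(posterior):
    # Indices of the five highest-scoring classes.
    return np.argsort(posterior)[::-1][:5]

# Dummy stand-ins for the 256x256- and 512x512-trained models.
rng = np.random.default_rng(0)
predict_256 = lambda img: rng.dirichlet(np.ones(1000))
predict_512 = lambda img: rng.dirichlet(np.ones(1000))
print(top5(fuse_posteriors(None, [predict_256, predict_512])))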
[Figure: the network architecture: groups of convolutional layers (128 to 512 filters) interleaved with max-pooling, fully-connected layers with 6144 units, and a 1000-way softmax.]
5. Experiments
One of the most challenging computer vision benchmarks is the classification task of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), which evaluates algorithms for large-scale image classification. There are more than 15 million images belonging to about 22,000 categories in the ImageNet dataset, and ILSVRC uses a subset of about 1.2 million images covering 1,000 categories. After the great success of convolutional networks (ConvNets) on this challenge, there has been increasing attention from both industry and academia on building more accurate ConvNet systems with the help of powerful computing platforms, such as GPU and CPU clusters.
Table 2. ILSVRC classification results.

Team             | Time   | Place | Top-5 error
-----------------|--------|-------|------------
SuperVision      | 2012   | 1     | 16.42%
ISI              | 2012   | 2     | 26.17%
VGG              | 2012   | 3     | 26.98%
Clarifai         | 2013   | 1     | 11.74%
NUS              | 2013   | 2     | 12.95%
ZF               | 2013   | 3     | 13.51%
GoogLeNet        | 2014   | 1     | 6.66%
VGG              | 2014   | 2     | 7.32%
MSRA             | 2014   | 3     | 8.06%
Andrew Howard    | 2014   | 4     | 8.11%
DeeperVision     | 2014   | 5     | 9.51%
MSRA PReLU-nets  | 2015.2 | -     | 4.94%
BN-Inception     | 2015.2 | -     | 4.82%
Deep Image       | 2015.5 | -     | 4.58%
6. Related Work
With the development of deep learning, it has become possible to recognize and classify visual objects end-to-end, without needing to create multi-stage pipelines of extracted features and discriminative classifiers. Due to the huge amount of computation involved, much effort has been made to build distributed systems that scale networks to very large models. Dean et al. developed DistBelief (2012) to train a deep network with billions of parameters using tens of thousands of CPU cores; within this framework, they developed an asynchronous stochastic gradient descent procedure. Coates et al. built a GPU cluster with high-speed interconnects and used it to train a locally-connected neural network (2013). Similar to this work, but at a much smaller scale, is the work of Paine et al. (2013), in which an asynchronous SGD algorithm based on a GPU cluster was implemented. Project Adam (Chilimbi et al., 2014) is another distributed system, built by Microsoft. They trained a large network for the ImageNet 22K-category object classification task using an asynchronous system of 62 machines in ten days and achieved a top-1 accuracy of 29.8%. They did not report results for the 1K classification challenge, so it is hard to compare this work with others. In (Krizhevsky, 2014) and (Yadan et al., 2013), the authors also employed data and model parallelism or hybrid parallelization, but their systems are not scaled up, being limited to a single server with multiple GPUs.
7. Conclusion
We have built a large supercomputer dedicated to training deep neural networks. We chose the ImageNet classification challenge as the first test case. Compared to previous work, our model is larger, uses higher-resolution images, and sees more examples. We have obtained the best results to date.
The success of this work is driven by tremendous computational power, and can also be described as a brute-force approach. Earlier, Baidu's Deep Speech used a similar approach for speech recognition and also achieved state-of-the-art results.
It is possible that other approaches will yield the same results with less demand on the computational side. The authors of this paper argue that, with more human effort applied, it is indeed possible to achieve such results.
Acknowledgments
The project started in October 2013, after an interesting
discussion on a high-speed train from Beijing to Suzhou.
Many thanks to the Baidu SYS group for their help with
hosting the Minwa supercomputer and to Zhiqian Wang
for helping with benchmarking the hardware components.
Thanks to Adam Coates and Andrew Ng for many insightful conversations. Thank you also to Bryan Catanzaro, Calisa Cole, and Tony Han for reviewing early drafts of this
paper.
References
Campbell, M., Hoane, A.J., and Hsu, F. Deep Blue. Artificial Intelligence, 134:57-83, 2002.
Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 571-582, Broomfield, CO, 2014. USENIX Association. ISBN 978-1-931971-16-4.
Ciresan, D.C., Meier, U., Gambardella, L.M., and Schmidhuber, J. Deep, big, simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010.
Coates, A., Huval, B., Wang, T., Wu, D.J., Catanzaro, B.C., and Ng, A.Y. Deep learning with COTS HPC systems. In ICML, pp. 1337-1345, 2013.
Collobert, R. and Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), Helsinki, Finland, June 5-9, 2008, pp. 160-167, 2008.
Condon, J. and Thompson, K. Belle chess hardware. Advances in Computer Chess, 3, 1982.
Dahl, G.E., Yu, D., Deng, L., and Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 20(1):30-42, January 2012.
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q.V., Mao, M.Z., Ranzato, M.A., Senior, A.W., Tucker, P.A., Yang, K., and Ng, A.Y. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems, pp. 1232-1240, 2012.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
Wu, R., Zhang, B., and Hsu, M. Clustering billions of data points using GPUs. In ACM UCHPC-MAW, 2009.
Yadan, O., Adams, K., Taigman, Y., and Ranzato, M.A. Multi-GPU training of ConvNets. CoRR, abs/1312.5853, 2013.
Zeiler, M.D. and Fergus, R. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014, 13th European Conference, pp. 818-833, Zurich, Switzerland, 2014.