
Deep Image: Scaling up Image Recognition

arXiv:1501.02876v3 [cs.CV] 11 May 2015

Ren Wu (wuren@baidu.com)
Shengen Yan (yanshengen@baidu.com)
Yi Shan (shanyi@baidu.com)
Qingqing Dang (dangqingqing@baidu.com)
Gang Sun (sungang01@baidu.com)

All authors are with Baidu Research, Baidu, Inc. Ren Wu is the corresponding author.

Abstract
We present a state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning. The key components are a custom-built supercomputer dedicated to deep learning, a highly optimized parallel algorithm using new strategies for data partitioning and communication, larger deep neural network models, novel data augmentation approaches, and the use of multi-scale high-resolution images. On one of the most challenging computer vision benchmarks, the ImageNet classification challenge, our system has achieved the best result to date, a top-5 error rate of 4.58%, exceeding human recognition performance and representing a relative 31% improvement over the ILSVRC 2014 winner.

1. Introduction
On May 11th, 1997, IBM's Deep Blue achieved a historic victory by defeating world chess champion Garry Kasparov in a six-game match (Campbell et al., 2002). It came as a surprise to some, but with the correct algorithms, chess performance is a function of computational power (Condon & Thompson, 1982), (Hyatt et al., 1990), (Kuszmaul, 1995). Today, history is repeating itself: simple, scalable algorithms, given enough data and computational resources, dominate many fields, including visual object recognition (Ciresan et al., 2010), (Krizhevsky et al., 2012), (Szegedy et al., 2014), speech recognition (Dahl et al., 2012), (Hannun et al., 2014), and natural language processing (Collobert & Weston, 2008), (Mnih & Hinton, 2007), (Mnih & Hinton, 2008).
Although neural networks have been studied for many decades, only recently have they come into their own, thanks to the availability of larger training data sets along with increased computational power through heterogeneous computing (Coates et al., 2013).
Because computational power is so important to progress in deep learning, we built a supercomputer designed for deep learning, along with a software stack that takes full advantage of such hardware. Through application-specific hardware-software co-design, we have a highly optimized system. This system enables larger models to be trained on more data, while also reducing turnaround time, allowing us to explore ideas more rapidly.
Still, there are two shortcomings in current deep learning practice. First, while we know that bigger models offer more potential capacity, in practice model size is often limited by either too little training data or too little time for running experiments, which can lead to overfitting or underfitting. Second, the data collected is often limited to a specific region of the potential example space. This is especially challenging for very large neural networks, which are subject to overfitting.
In this work, we show how aggressive data augmentation can prevent overfitting. We use data augmentation in novel ways, much more aggressively than previous work (Howard, 2013), (Krizhevsky et al., 2012). The augmented datasets are tens of thousands of times larger, allowing the network to become more robust to various transformations.

Additionally, we train on multi-scale images, including high-resolution ones. Most previous work (Krizhevsky et al., 2012), (Zeiler & Fergus, 2014) operates on downsized images with a resolution of approximately 256x256. While using downsized images reduces computational costs without losing too much accuracy, we find that using larger images improves recognition accuracy. As we demonstrate, there are many cases where the object size is small and downsizing simply loses too much information. Training on higher-resolution images improves classification accuracy. More importantly, models trained on high-resolution images complement models trained on low-resolution images: composing models trained at different scales produces better results than any model individually.


In this paper, we detail our custom-designed supercomputer for deep learning, as well as our optimized algorithms and software stack built to capitalize on this hardware. This system has enabled us to train bigger neural models, work on higher-resolution images, and use more aggressive data augmentation methods. On the widely studied 1k ImageNet classification challenge, we have obtained the current state-of-the-art result, with a top-5 error rate of 4.58%.

2. Hardware/Software Co-design

It is clear that different classes of algorithms perform differently on different computing architectures. Graphics processors, or GPUs, often perform extremely well on compute-intensive algorithms. Early work showed that for clustering algorithms, a single GPU offers 10x the performance of a top-of-the-line 8-core workstation, even on very large datasets with more than a billion data points (Wu et al., 2009). A more recent example showed that three GPU servers with 4 GPUs each rival the performance of the 1000-node (16000-core) CPU cluster used by the Google Brain project (Coates et al., 2013).

Modern deep neural networks are mostly trained by variants of stochastic gradient descent (SGD) algorithms. Because SGD has high arithmetic density, GPUs are excellent for this type of algorithm.

Furthermore, we would like to train very large deep neural networks without worrying about the capacity limits of a single GPU or even a single machine, so scaling up is a requirement. Given the properties of stochastic gradient descent, the distributed version of the algorithm needs very high-bandwidth, ultra-low-latency interconnects to minimize communication costs.

The result is our custom-built supercomputer, which we call Minwa. It comprises 36 server nodes, each with two six-core Intel Xeon E5-2620 processors. Each server contains 4 Nvidia Tesla K40m GPUs and one FDR InfiniBand (56 Gb/s) adapter, a high-performance, low-latency interconnect with RDMA support. The peak single-precision floating-point performance of each GPU is 4.29 TFlops, and each GPU has 12 GB of memory. Thanks to GPUDirect RDMA, the InfiniBand network interface can access remote GPU memory without any CPU involvement. All server nodes are connected to the InfiniBand switch. Figure 1 shows the system architecture. The system runs Linux with CUDA 6.0 and MVAPICH2 MPI, which also enables GPUDirect RDMA. In total, Minwa has 6.9 TB of host memory, 1.7 TB of device memory, and about 0.6 PFlops of theoretical single-precision peak performance.

[Figure 1. Example of communication strategies among 4 GPUs: each GPU holds a local weight copy w1..w4, gradient parts are exchanged during the forward and backward passes, and all transmission goes through MPI asynchronous mode.]
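The totals reported above follow directly from the per-node specification; as a quick sanity check, a minimal Python sketch (the per-node host memory is inferred from the 6.9 TB total rather than stated in the text):

# Back-of-the-envelope totals for Minwa from the per-node specification.
nodes = 36
gpus_per_node = 4
gpu_tflops = 4.29        # peak single-precision TFlops per K40m
gpu_mem_gb = 12          # device memory per GPU

total_gpus = nodes * gpus_per_node                # 144 GPUs
peak_pflops = total_gpus * gpu_tflops / 1000.0    # ~0.62 PFlops
device_mem_tb = total_gpus * gpu_mem_gb / 1024.0  # ~1.7 TB

# Host memory is reported as 6.9 TB in total, i.e. roughly 192 GB per node.
host_mem_per_node_gb = 6.9 * 1024 / nodes         # ~196 GB

print(total_gpus, round(peak_pflops, 2),
      round(device_mem_tb, 2), round(host_mem_per_node_gb))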

3. Optimization

Our goal is to push for extreme performance from both hardware and software for the given tasks. In modern deep convolutional neural networks, convolutional layers account for most of the computation and fully-connected layers account for most of the parameters. We implement two parallelism strategies in our parallel deep neural network framework, namely model-data parallelism and data parallelism. Similar strategies have been proposed in previous work (Krizhevsky, 2014), (Yadan et al., 2013). However, that work mainly focuses on a single server with multiple GPUs or on small GPU clusters, so it is hard to extend directly to a large GPU cluster because of communication bottlenecks. In our work, we focus on optimizing the parallel strategies, minimizing data transfers, and overlapping computation with communication. This is needed to approach peak performance on large supercomputers like Minwa.
3.1. Data Parallelism

When all the parameters can be stored in the memory of one GPU, we exploit data parallelism at all layers. In this case, each GPU is responsible for 1/N of the mini-batch of input images, and all GPUs work together on the same mini-batch. In one forward-backward pass, the parameters and gradients of all layers are replicated on every GPU. All GPUs compute gradients based on their local training data and a local copy of the weights; they then exchange gradients and update their local copies of the weights.
Two strategies have helped us achieve better parallelism. The first is the butterfly synchronization strategy, in which all gradients are partitioned into K parts and each GPU is responsible for its own part. At the end of the gradient computation, GPU k receives the k-th part from all other GPUs, accumulates them, and then broadcasts the result back to all GPUs.

The second is the lazy update strategy. Once the gradients are generated in the backward pass, each GPU sends them to the corresponding GPUs asynchronously. This transfer does not need to be synchronized until the corresponding weight parameters are used, which happens later, in the forward pass. This maximizes the overlap between computation and communication.

Figure 1 shows an example of the transmission among four GPUs. Theoretically, the communication overhead of our data-parallel strategy depends only on the size of the model and is independent of the number of GPUs. We also use device memory to cache the training data whenever free memory remains on the device after the model is loaded.
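To make the butterfly step concrete, the following sketch simulates it among a few workers, with NumPy arrays standing in for GPU gradient buffers; the partition bounds, function names, and in-process exchange are our own illustrative choices, not Minwa's actual MPI implementation:

import numpy as np

def butterfly_sync(grads, k_parts):
    """Simulate butterfly synchronization among len(grads) workers.

    grads: list of 1-D gradient arrays, one per worker (same length).
    Worker k owns partition k: it accumulates that slice from all
    workers, then 'broadcasts' the reduced slice back to everyone.
    """
    n = len(grads)
    size = grads[0].size
    bounds = np.linspace(0, size, k_parts + 1, dtype=int)

    reduced = []
    for k in range(k_parts):
        lo, hi = bounds[k], bounds[k + 1]
        # Worker k receives the k-th part from all workers and accumulates.
        reduced.append(sum(g[lo:hi] for g in grads))

    # Each worker broadcasts its reduced part; all end with the full sum.
    total = np.concatenate(reduced)
    return [total.copy() for _ in range(n)]

# Example: 4 workers, gradients of length 8, partitioned into 4 parts.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(8) for _ in range(4)]
synced = butterfly_sync(grads, k_parts=4)
assert np.allclose(synced[0], sum(grads))

Under the lazy update strategy, each reduced part would be sent asynchronously and only waited on when the corresponding weights are next needed in the forward pass.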

3.2. Model-Data Parallelism

While data parallelism works well for smaller models, it does not work when the model cannot fit into the memory of a single GPU. Model-data parallelism addresses this problem: data parallelism is still used at the convolutional layers, but the fully-connected layers are instead partitioned and distributed across multiple GPUs. This works because convolutional layers have fewer parameters but more computation.

The parameters of the convolutional layers are copied to every GPU, and all images within one mini-batch are partitioned and assigned to the GPUs, just as described in Section 3.1. The parameters of the fully-connected layers, however, are evenly divided among all GPUs, which then work together to calculate the fully-connected layers and synchronize when necessary. This is similar to the approach presented in (Krizhevsky, 2014), though at a larger scale. A sketch of the fully-connected partitioning follows.
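A minimal sketch of partitioning a fully-connected layer in this way, with each simulated GPU holding a column slice of the weight matrix; the slicing by output neurons and the names here are illustrative assumptions, not the exact Minwa kernel layout:

import numpy as np

def fc_model_parallel(x, weight_shards):
    """Fully-connected layer split across devices by output columns.

    x:             activations for the whole mini-batch, shape (batch, d_in);
                   under data parallelism each device holds only its slice of
                   the batch, so x would first be gathered from all devices.
    weight_shards: list of (d_in, d_out_k) arrays, one shard per device.
    """
    # Each device computes its slice of the output neurons...
    partial = [x @ w for w in weight_shards]
    # ...and the slices are concatenated (a synchronization point).
    return np.concatenate(partial, axis=1)

rng = np.random.default_rng(1)
x = rng.standard_normal((256, 512))   # one mini-batch of activations
w = rng.standard_normal((512, 1024))  # full FC weight matrix
shards = np.split(w, 4, axis=1)       # evenly divided over 4 GPUs
assert np.allclose(fc_model_parallel(x, shards), x @ w)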

3.3. Scaling Efficiency

We tested scaling efficiency by training a model for an image classification task. The network has 8 convolutional layers and 3 fully-connected layers followed by a 1000-way softmax. Measured by epoch time, the scaling efficiency and the speedup of going through images are shown in Figure 2 and Figure 3. To make the scalability of the different parallel strategies easy to observe, we fixed the number of images processed by each GPU at 64 (slices = 64). The time taken by model-data (hybrid) parallelism and data parallelism with different numbers of GPUs is shown in Figure 2. Data parallelism performs better once more than 16 GPUs are involved, because the communication overhead of the data-parallel strategy is constant when the model size is fixed.

[Figure 2. The scalability of different parallel approaches (slices = 64; data parallelism vs. model-data parallelism on 16 to 64 GPUs; the y-axis shows speedup divided by GPU count).]

The speedup is larger with larger batch sizes, as shown in Figure 3. Compared with a single GPU, a 47x speedup in going through images is achieved by using 64 GPUs with a mini-batch size of 1024. As the number of GPUs increases, the total device memory also increases, so more training data can be cached in device memory, which further improves parallel efficiency.

[Figure 3. The speedup of going through images (batch sizes 256, 512, and 1024 on 16 to 64 GPUs).]

The ultimate test for a parallel algorithm is convergence time, that is, the wall-clock time needed for the network to reach a certain accuracy level. Figure 4 compares the time needed to reach 80% validation accuracy using various numbers of GPUs; a 24.7x speedup is obtained by using 32 GPUs.
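Parallel efficiency here is simply the measured speedup divided by the number of GPUs; plugging in the reported numbers (including the wall-clock hours annotated in Figure 4):

# Parallel efficiency = speedup / number of GPUs.
print(47.0 / 64)   # throughput speedup at batch size 1024 -> ~0.73
print(24.7 / 32)   # time-to-80%-accuracy speedup          -> ~0.77
print(212 / 8.6)   # 1-GPU vs. 32-GPU wall-clock hours     -> ~24.7x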

[Figure 4. Validation set accuracy over wall-clock time for 1, 16, and 32 GPUs. Reaching 80% accuracy takes 212 hours on 1 GPU versus 8.6 hours on 32 GPUs, a 24.7x speedup.]

4. Training Data
4.1. Data Augmentation
The saying "the more you see, the more you know" is true for humans as well as neural networks, especially for modern deep neural networks.

Thanks to dedicated supercomputers such as Minwa, we are now capable of building very large deep neural networks with up to hundreds of billions of parameters. The available training data is simply not sufficient to train a network of this size. Additionally, the collected examples often cover only a small, heavily biased subset of the possible example space. It is desirable to show the network more data with broader coverage.

The authors of this paper believe that data augmentation is fundamentally important for improving the performance of these networks. We would like the network to learn the important features that are invariant for each object class, rather than the artifacts of the training images. We have explored many different ways of doing augmentation, some of which are discussed in this section.

Clearly, an object does not change its class if the ambient lighting changes, or if the observer is replaced by another. More specifically, the neural network model should be insensitive to colors driven by the illuminants of the scene, or to the optical systems of various observers. This observation has led us to focus on augmentations such as color casting, vignetting, and lens distortion, as shown in Figure 5.
Unlike the color shifting in (Krizhevsky et al., 2012), we perform color casting to alter the intensities of the RGB channels in the training images. Specifically, for each image, we generate three Boolean values to determine whether the R, G, and B channels should be altered, respectively. If a channel is to be altered, we add a random integer in the range -20 to +20 to that channel.

[Figure 5. Effects of data augmentation on one photo: red, green, and blue color casting; all RGB channels changed; vignette and stronger vignette; blue casting plus vignette; left and right rotation with crop; pincushion and barrel distortion; horizontal and vertical stretch.]
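A minimal sketch of this color-casting step, assuming uint8 images and clipping to the valid intensity range (both implementation details we have added):

import numpy as np

def color_cast(img, rng, max_shift=20):
    """Randomly shift the R, G, B channels of a uint8 image.

    For each channel, a Boolean decides whether to alter it; altered
    channels get a random integer offset in [-max_shift, +max_shift].
    """
    img = img.astype(np.int16)
    for c in range(3):
        if rng.random() < 0.5:  # alter this channel?
            img[..., c] += rng.integers(-max_shift, max_shift + 1)
    return np.clip(img, 0, 255).astype(np.uint8)

rng = np.random.default_rng(42)
photo = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = color_cast(photo, rng)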


Vignetting makes the periphery of an image dark compared to the image center. In our implementation, a vignette effect has two randomly chosen variables: the area over which the effect is applied, and how much the brightness is reduced.

Lens distortion is a deviation from rectilinear projection caused by the camera lens. The type and degree of distortion are chosen randomly. The horizontal and vertical stretching of images can also be viewed as a special kind of lens distortion.
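A vignette with these two random variables can be sketched as a radial brightness falloff; the smooth quadratic falloff and the parameter ranges below are our illustrative choices, as the text does not specify them:

import numpy as np

def vignette(img, rng, max_strength=0.8):
    """Darken the periphery of a uint8 image.

    Two random variables control the effect: the radius where the
    darkening starts (the affected area) and how much brightness is
    removed at the corners (the strength).
    """
    h, w = img.shape[:2]
    y, x = np.mgrid[0:h, 0:w]
    # Normalized distance of each pixel from the image center.
    r = np.hypot(x - w / 2, y - h / 2) / np.hypot(w / 2, h / 2)
    start = rng.uniform(0.2, 0.8)             # where the falloff begins
    strength = rng.uniform(0.2, max_strength) # brightness removed at corners
    falloff = np.clip((r - start) / (1 - start), 0, 1) ** 2
    mask = 1.0 - strength * falloff
    return (img * mask[..., None]).astype(np.uint8)

rng = np.random.default_rng(7)
photo = np.full((224, 224, 3), 200, dtype=np.uint8)
darkened = vignette(photo, rng)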
We also adopt augmentations proposed in previous work, such as flipping and cropping (Howard, 2013), (Krizhevsky et al., 2012). To ensure that the whole augmented image is fed into the training network, all augmentations (except cropping itself) are performed on the cropped image.
An interesting fact worth pointing out is that both professional photographers and amateurs often use color casting and vignetting to add an artistic touch to their work. In fact, most filters offered by popular apps like Instagram are no more than various combinations of the two.

Table 1. The number of possible changes for different augmentations.

Augmentation      Number of possible changes
Color casting     68920
Vignetting        1960
Lens distortion   260
Rotation          20
Flipping          2
Cropping          82944 (crop size 224x224, input image size 512x512)
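The counts in Table 1 can be checked against the parameters given in the text: color casting has 41 possible offsets per channel (including zero) across three channels, minus the single all-unchanged case, and cropping has one crop per (x, y) offset. A quick check (the parameter grids behind the vignetting, distortion, and rotation counts are not spelled out in the text):

# Color casting: offsets in [-20, +20] per channel, minus the all-zero case.
print(41 ** 3 - 1)       # 68920, matching Table 1

# Cropping: 224x224 crops from a 512x512 image, one per (x, y) offset.
print((512 - 224) ** 2)  # 82944, matching Table 1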

[Figure 7. Top-20 classification results (score and class) for one image, comparing the model trained on low-resolution images with the model trained on high-resolution images. The low-resolution model ranks ant first (0.2287) and places the true class, dragonfly, outside its top five; the high-resolution model ranks lacewing first (0.103) and dragonfly second (0.074).]

As shown in Table 1, these augmentation approaches make the number of training examples explode, posing a greater challenge in terms of computational resources. However, the resulting model has better accuracy, which recoups the cost, as evidenced in Figure 6.

4.2. Multi-scale Training
We also notice that multi-scale training with high-resolution images works better than single-scale training, especially for recognizing small objects.
Previous work (Krizhevsky et al., 2012), (Zeiler & Fergus, 2014) usually downsizes the images to a fixed resolution, such as 256x256, then randomly crops out slightly smaller areas, such as 224x224, and uses those crops for training. While this method reduces computational costs, and the stacked convolutional layers can still capture multi-scale statistics, the downsampling may destroy the details of small objects, which are important features for distinguishing between classes. If higher-resolution images such as 512x512 are used, each 224x224 crop captures more detail, and the model may learn more from such crops. Unlike the scale-jittering method of (Simonyan & Zisserman, 2014), (Szegedy et al., 2014), we have trained separate models at different scales, including high-resolution ones (such as 512x512), and combined them by averaging their softmax class posteriors.

[Figure 8. Some hard cases addressed by using higher-resolution images for training.]

As shown in Figure 7, the dragonfly is small and occupies only a small portion of the image. When a model trained on low-resolution images is used, the true class falls outside the top-5 predictions. A model trained on high-resolution images captures more of the dragonfly's features and ranks the true class second. The high-resolution model gives similar scores to lacewing, dragonfly, and damselfly, reflecting the similarity of these species, whereas the low-resolution model separates them with large score gaps. It can also be seen that combining the two models of Figure 7 by simple averaging gives a good prediction. Figure 8 has more examples where a high-resolution model gives the correct answer while the lower-resolution one fails.

On the other hand, a small crop from a high-resolution image may not contain the labeled object at all, yet it is still assigned the label of the whole image, which can mislead the model. A model trained on crops from low-resolution images is therefore still necessary.

[Figure 6. Some hard cases addressed by adding our data augmentation: bathtub, tricycle, isopod, Indian elephant, ice bear.]

We have trained several models on images of different scales, and they are complementary. The models trained on 256x256 and 512x512 images give top-5 error rates of 7.96% and 7.42%, respectively, on the ImageNet validation dataset, while the model fused by simple averaging gives an error rate of 6.97%, better than either individual model. It is worth noting that all the validation results are also achieved by testing and combining the results of multi-scale images resized from a single one. A sketch of the fusion step follows.
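The fusion itself is just an average of the per-model softmax class posteriors; a minimal sketch (the Dirichlet-sampled vectors stand in for real model outputs, which are not reproduced here):

import numpy as np

def fuse_predictions(posteriors):
    """Average softmax class posteriors from models trained at different scales.

    posteriors: list of (num_classes,) probability vectors for one image,
    e.g. one from the 256x256 model and one from the 512x512 model.
    """
    fused = np.mean(posteriors, axis=0)
    return fused / fused.sum()  # renormalize for numerical safety

# Hypothetical example with two 1000-way posterior vectors.
rng = np.random.default_rng(3)
p256 = rng.dirichlet(np.ones(1000))
p512 = rng.dirichlet(np.ones(1000))
top5 = np.argsort(fuse_predictions([p256, p512]))[-5:][::-1]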

Table 2. Basic configuration for one of our models.

Layers          # filters / neurons
Conv 1-2        64
Maxpool         -
Conv 3-4        128
Maxpool         -
Conv 5-6-7      256
Maxpool         -
Conv 8-9-10     512
Maxpool         -
Conv 11-12-13   512
Maxpool         -
FC 1-2          6144
FC 3            1000
Softmax         -

5. Experiments
One of the most challenging computer vision benchmarks is the classification task of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), which evaluates algorithms for large-scale image classification. The full ImageNet dataset has more than 15 million images in about 22,000 categories; ILSVRC uses a subset of about 1.2 million images in 1,000 categories. After the great success of convolutional networks (ConvNets) on this challenge, there has been increasing interest from both industry and academia in building more accurate ConvNet systems with the help of powerful computing platforms, such as GPU and CPU clusters.

Table 3. ILSVRC classification task top-5 performance (with provided training data only).

Team                 Time     Place   Top-5 error
SuperVision          2012     1       16.42%
ISI                  2012     2       26.17%
VGG                  2012     3       26.98%
Clarifai             2013     1       11.74%
NUS                  2013     2       12.95%
ZF                   2013     3       13.51%
GoogLeNet            2014     1       6.66%
VGG [3]              2014     2       7.32%
MSRA                 2014     3       8.06%
Andrew Howard        2014     4       8.11%
DeeperVision         2014     5       9.51%
MSRA PReLU-nets      2015.2   -       4.94%
BN-Inception         2015.2   -       4.82%
Deep Image           2015.5   -       4.58%

[3] Result obtained from (Russakovsky et al., 2014). The VGG team achieved a top-5 test set error of 6.8% using multiple models after the competition (Simonyan & Zisserman, 2014).

Table 4. Single model comparison (top-5 error).

Team                                     Top-5 error
VGG (Simonyan & Zisserman, 2014) [4]     8.0%
GoogLeNet (Szegedy et al., 2014)         7.89%
BN-Inception (Ioffe et al., 2015)        5.82%
MSRA PReLU-net (He et al., 2015)         5.71%
Deep Image                               5.40%

[4] The VGG team's single model achieves a top-1 error of 24.4% and a top-5 error of 7.1% on the validation set after the competition (Simonyan & Zisserman, 2014).

As shown in Table 3, accuracy has improved substantially over the last three years. The best result of ILSVRC 2014, a top-5 error rate of 6.66%, is not far from the human recognition performance of 5.1% (Russakovsky et al., 2014). After the competition, other works (He et al., 2015), (Ioffe et al., 2015) improved performance further and exceeded human performance. Our work marks yet another exciting milestone, with a top-5 error rate of 4.58%, setting a new record. This result is also better than the previous best result of 4.82% (Ioffe et al., 2015).

Inspired by (Sermanet et al., 2013), a transformed network containing only convolutional layers is used at test time, and the class scores from the feature map of the last layer are averaged to obtain the final score. We also test each image at multiple scales and combine the results; this covers portions of the image at different sizes and allows the network to recognize the object at a proper scale. Horizontal flips of the images are also tested and combined.
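The test-time reduction can be sketched as follows: once the fully-connected layers are recast as convolutions, the network emits a spatial grid of class scores per view, and all views (scales and their flips) are averaged into one final score vector. The grid shapes below are illustrative assumptions:

import numpy as np

def test_score(score_maps):
    """Average class scores over spatial positions, scales, and flips.

    score_maps: list of (h_i, w_i, num_classes) arrays, one per tested
    view (each scale and its horizontal flip) of a single image.
    """
    per_view = [m.reshape(-1, m.shape[-1]).mean(axis=0) for m in score_maps]
    return np.mean(per_view, axis=0)

# Hypothetical example: two scales, each with its horizontal flip.
rng = np.random.default_rng(5)
views = [rng.standard_normal((h, w, 1000))
         for h, w in [(2, 2), (2, 2), (6, 6), (6, 6)]]
final = test_score(views)  # shape (1000,)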

We only use the provided data from the dataset of ILSVRC


2014. Taking advantage of data augmentation and random
crops from high-resolution training images, we are able to
reduce the risk of overfitting, despite our larger models.
As listed in Table 2, one basic configuration has 16 layers and is similar to VGG's work (Simonyan & Zisserman, 2014). The number of weights in this configuration is 212.7M. For other models, we have varied the number of filters in the convolutional layers and the number of neurons in the fully-connected layers, up to 1024 and 8192, respectively. Six trained models with different configurations, each focusing on different image scales and data augmentation methods, are combined by simple averaging of their softmax class posteriors. The independent training procedures of these models make them complementary, and the combination gives better results than any individual model.

As shown in Table 3, the top-5 accuracy has improved greatly in the last three years. Deep Image has set a new record of 4.58% top-5 error rate on the test dataset, a 31% relative improvement over the best result of ILSVRC 2014. This is also significantly better than the best results obtained earlier this year: 4.94% from Microsoft (He et al., 2015) and 4.82% from Google (Ioffe et al., 2015).

The single-model accuracy comparison is listed in Table 4. Our best single model achieves a top-5 error rate of 5.40%.

In addition to achieving the best classification result to date, our system is also more robust in real-world scenarios, thanks to the aggressive data augmentation and the high-resolution images used in training. Figure 9 shows some examples: the first row is the original image from ImageNet, and the others were captured by cellphone under different conditions. Our system recognizes these images correctly despite the extreme transformations.

[Figure 9. Experiments to test robustness on four images: junco, plane, Siamese cat, and common iguana.]

6. Related Work

With the development of deep learning, it has become possible to recognize and classify visual objects end-to-end, without needing to create multi-stage pipelines of extracted features and discriminative classifiers. Because of the huge amount of computation required, much effort has gone into building distributed systems that scale the network to very large models. Dean et al. developed DistBelief (2012) to train a deep network with billions of parameters using tens of thousands of CPU cores; within this framework, they developed an asynchronous stochastic gradient descent procedure. Coates et al. built a GPU cluster with high-speed interconnects and used it to train a locally-connected neural network (2013). Similar to this work, but at a much smaller scale, is the work of Paine et al. (2013), which implemented an asynchronous SGD algorithm on a GPU cluster. Project Adam (Chilimbi et al., 2014) is another distributed system, built by Microsoft. They trained a large network for the ImageNet 22K-category object classification task with an asynchronous system of 62 machines in ten days and achieved a top-1 accuracy of 29.8%. They did not report results for the 1k classification challenge, so it is hard to compare their work to others. In (Krizhevsky, 2014) and (Yadan et al., 2013), the authors also employed data and model parallelism or hybrid parallelization, but their systems are not scaled up and are limited to a single server with multiple GPUs.

We have trained a large convolutional neural network for the ImageNet classification challenge. This work is related to much previous work around this most popular challenge (Russakovsky et al., 2014), which has become the standard benchmark for large-scale object classification. The SuperVision team made a significant breakthrough in ILSVRC 2012, training a deep convolutional neural network with 60 million parameters using an efficient GPU implementation (Krizhevsky et al., 2012). Following the success of SuperVision, the winner of the classification task in ILSVRC 2013 was Clarifai, designed using a visualization technique (Zeiler & Fergus, 2014) to guide the adjustment of the network architecture. In ILSVRC 2014, the winner GoogLeNet (Szegedy et al., 2014) and the runner-up VGG team (Simonyan & Zisserman, 2014) both increased the depth of the network significantly, achieving top-5 classification errors of 6.66% and 7.32%, respectively. Besides depth, GoogLeNet and VGG used multi-scale data to improve accuracy. None of these works used data augmentation and multi-scale training as aggressively as we have, and they mainly used lower-resolution training images with smaller networks than ours.

7. Conclusion

We have built a large supercomputer dedicated to training deep neural networks and have chosen the ImageNet classification challenge as its first test case. Compared to previous work, our models are larger, use higher-resolution images, and see more examples, and we have obtained the best results to date.

The success of this work is driven by tremendous computational power, and it can also be described as a brute-force approach. Earlier, Baidu's Deep Speech used a similar approach for speech recognition and also achieved state-of-the-art results.

It is possible that other approaches will yield the same results with less computational demand, and the authors believe that, with more human effort applied, such results are indeed achievable. However, human effort is precisely what we want to avoid.

Acknowledgments
The project started in October 2013, after an interesting
discussion on a high-speed train from Beijing to Suzhou.
Many thanks to the Baidu SYS group for their help with
hosting the Minwa supercomputer and to Zhiqian Wang
for helping with benchmarking the hardware components.
Thanks to Adam Coates and Andrew Ng for many insightful conversations. Thank you also to Bryan Catanzaro, Calisa Cole, and Tony Han for reviewing early drafts of this
paper.

References
Campbell, M., Hoane, A.J., and Hsu, F. Deep Blue. Artificial Intelligence, 134:57-83, 2002.

Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 571-582, Broomfield, CO, 2014. USENIX Association. ISBN 978-1-931971-16-4.

Ciresan, D.C., Meier, U., Gambardella, L.M., and Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010.

Coates, A., Huval, B., Wang, T., Wu, D.J., Catanzaro, B.C., and Ng, A.Y. Deep learning with COTS HPC systems. In ICML, pp. 1337-1345, 2013.

Collobert, R. and Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), Helsinki, Finland, pp. 160-167, 2008.

Condon, J. and Thompson, K. Belle chess hardware. Advances in Computer Chess, 3, 1982.

Dahl, G.E., Yu, D., Deng, L., and Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 20(1):30-42, January 2012.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q.V., Mao, M.Z., Ranzato, M.A., Senior, A.W., Tucker, P.A., Yang, K., and Ng, A.Y. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25, pp. 1232-1240, 2012.

Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., and Ng, A.Y. Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567, 2014.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv:1502.01852, 2015.

Howard, A.G. Some improvements on deep convolutional neural network based image classification. CoRR, abs/1312.5402, 2013.

Hyatt, R.M., Nelson, H.L., and Gower, A.E. Cray Blitz. In Computers, Chess, and Cognition, 1990.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.

Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014.

Krizhevsky, A., Sutskever, I., and Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106-1114, Lake Tahoe, Nevada, United States, 2012.

Kuszmaul, B.C. The StarTech massively parallel chess program. Journal of the International Computer Chess Association, 18(1), 1995.

Mnih, A. and Hinton, G.E. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pp. 641-648, Corvallis, Oregon, USA, 2007.

Mnih, A. and Hinton, G.E. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems 21, pp. 1081-1088, Vancouver, British Columbia, Canada, 2008.

Paine, T., Jin, H., Yang, J., Lin, Z., and Huang, T.S. GPU asynchronous stochastic gradient descent to speed up neural network training. CoRR, abs/1312.6186, 2013.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.H., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.

Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. arXiv:1409.4842, 2014.

Wu, R., Zhang, B., and Hsu, M. Clustering billions of data points using GPUs. In ACM UCHPC-MAW, 2009.

Yadan, O., Adams, K., Taigman, Y., and Ranzato, M.A. Multi-GPU training of ConvNets. CoRR, abs/1312.5853, 2013.

Zeiler, M.D. and Fergus, R. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 - 13th European Conference, pp. 818-833, Zurich, Switzerland, 2014.
