
Image-to-Image Translation with Conditional Adversarial Networks

Phillip Isola Jun-Yan Zhu Tinghui Zhou Alexei A. Efros

Berkeley AI Research (BAIR) Laboratory


University of California, Berkeley
{isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu
arXiv:1611.07004v1 [cs.CV] 21 Nov 2016

[Figure 1 panels: Labels to Street Scene, Labels to Facade, BW to Color, Aerial to Map, Day to Night, Edges to Photo; each panel shows an input and its output.]
Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image.
These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels.
Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show
results of the method on several. In each case we use the same architecture and objective, and simply train on different data.

Abstract

We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.

Many problems in image processing, computer graphics, and computer vision can be posed as "translating" an input image into a corresponding output image. Just as a concept may be expressed in either English or French, a scene may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. In analogy to automatic language translation, we define automatic image-to-image translation as the problem of translating one possible representation of a scene into another, given sufficient training data (see Figure 1). One reason language translation is difficult is because the mapping between languages is rarely one-to-one – any given concept is easier to express in one language than another. Similarly, most image-to-image translation problems are either many-to-one (computer vision) – mapping photographs to edges, segments, or semantic labels, or one-to-many (computer graphics) – mapping labels or sparse user inputs to realistic images. Traditionally, each of these tasks has been tackled with separate, special-purpose machinery (e.g., [7, 15, 11, 1, 3, 37, 21, 26, 9, 42, 46]), despite the fact that the setting is always the same: predict pixels from pixels. Our goal in this paper is to develop a common framework for all these problems.
The community has already taken significant steps in this direction, with convolutional neural nets (CNNs) becoming the common workhorse behind a wide variety of image prediction problems. CNNs learn to minimize a loss function – an objective that scores the quality of results – and although the learning process is automatic, a lot of manual effort still goes into designing effective losses. In other words, we still have to tell the CNN what we wish it to minimize. But, just like Midas, we must be careful what we wish for! If we take a naive approach, and ask the CNN to minimize Euclidean distance between predicted and ground truth pixels, it will tend to produce blurry results [29, 46]. This is because Euclidean distance is minimized by averaging all plausible outputs, which causes blurring. Coming up with loss functions that force the CNN to do what we really want – e.g., output sharp, realistic images – is an open problem and generally requires expert knowledge.

It would be highly desirable if we could instead specify only a high-level goal, like "make the output indistinguishable from reality", and then automatically learn a loss function appropriate for satisfying this goal. Fortunately, this is exactly what is done by the recently proposed Generative Adversarial Networks (GANs) [14, 5, 30, 36, 47]. GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss. Blurry images will not be tolerated since they look obviously fake. Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions.

In this paper, we explore GANs in the conditional setting. Just as GANs learn a generative model of data, conditional GANs (cGANs) learn a conditional generative model [14]. This makes cGANs suitable for image-to-image translation tasks, where we condition on an input image and generate a corresponding output image.

GANs have been vigorously studied in the last two years and many of the techniques we explore in this paper have been previously proposed. Nonetheless, earlier papers have focused on specific applications, and it has remained unclear how effective image-conditional GANs can be as a general-purpose solution for image-to-image translation. Our primary contribution is to demonstrate that on a wide variety of problems, conditional GANs produce reasonable results. Our second contribution is to present a simple framework sufficient to achieve good results, and to analyze the effects of several important architectural choices. Code is available at https://github.com/phillipi/pix2pix.

1. Related work

Structured losses for image modeling  Image-to-image translation problems are often formulated as per-pixel classification or regression [26, 42, 17, 23, 46]. These formulations treat the output space as "unstructured" in the sense that each output pixel is considered conditionally independent from all others given the input image. Conditional GANs instead learn a structured loss. Structured losses penalize the joint configuration of the output. A large body of literature has considered losses of this kind, with popular methods including conditional random fields [2], the SSIM metric [40], feature matching [6], nonparametric losses [24], the convolutional pseudo-prior [41], and losses based on matching covariance statistics [19]. Our conditional GAN is different in that the loss is learned, and can, in theory, penalize any possible structure that differs between output and target.

Conditional GANs  We are not the first to apply GANs in the conditional setting. Previous works have conditioned GANs on discrete labels [28], text [32], and, indeed, images. The image-conditional models have tackled inpainting [29], image prediction from a normal map [39], image manipulation guided by user constraints [49], future frame prediction [27], future state prediction [48], product photo generation [43], and style transfer [25]. Each of these methods was tailored for a specific application. Our framework differs in that nothing is application-specific. This makes our setup considerably simpler than most others.

Our method also differs from these prior works in several architectural choices for the generator and discriminator. Unlike past work, for our generator we use a "U-Net"-based architecture [34], and for our discriminator we use a convolutional "PatchGAN" classifier, which only penalizes structure at the scale of image patches. A similar PatchGAN architecture was previously proposed in [25], for the purpose of capturing local style statistics. Here we show that this approach is effective on a wider range of problems, and we investigate the effect of changing the patch size.

2. Method

GANs are generative models that learn a mapping from random noise vector z to output image y, G : z → y [14]. In contrast, conditional GANs learn a mapping from observed image x and random noise vector z to y, G : {x, z} → y. The generator G is trained to produce outputs that cannot be distinguished from "real" images by an adversarially trained discriminator, D, which is trained to do as well as possible at detecting the generator's "fakes". This training procedure is diagrammed in Figure 2.

Figure 2: Training a conditional GAN to predict aerial photos from maps. The discriminator, D, learns to classify between real and synthesized pairs. The generator learns to fool the discriminator. Unlike an unconditional GAN, both the generator and discriminator observe an input image.

2.1. Objective

The objective of a conditional GAN can be expressed as

L_cGAN(G, D) = E_{x,y∼p_data(x,y)}[log D(x, y)] + E_{x∼p_data(x), z∼p_z(z)}[log(1 − D(x, G(x, z)))],    (1)
where G tries to minimize this objective against an adversarial D that tries to maximize it, i.e. G* = arg min_G max_D L_cGAN(G, D).

To test the importance of conditioning the discriminator, we also compare to an unconditional variant in which the discriminator does not observe x:

L_GAN(G, D) = E_{y∼p_data(y)}[log D(y)] + E_{x∼p_data(x), z∼p_z(z)}[log(1 − D(G(x, z)))].    (2)
Previous approaches to conditional GANs have found it beneficial to mix the GAN objective with a more traditional loss, such as L2 distance [29]. The discriminator's job remains unchanged, but the generator is tasked to not only fool the discriminator but also to be near the ground truth output in an L2 sense. We also explore this option, using L1 distance rather than L2 as L1 encourages less blurring:

L_L1(G) = E_{x,y∼p_data(x,y), z∼p_z(z)}[‖y − G(x, z)‖_1].    (3)

Our final objective is

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G).    (4)

Without z, the net could still learn a mapping from x to y, but would produce deterministic outputs, and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise z as an input to the generator, in addition to x (e.g., [39]). In initial experiments, we did not find this strategy effective – the generator simply learned to ignore the noise – which is consistent with Mathieu et al. [27]. Instead, for our final models, we provide noise only in the form of dropout, applied on several layers of our generator at both training and test time. Despite the dropout noise, we observe very minor stochasticity in the output of our nets. Designing conditional GANs that produce stochastic output, and thereby capture the full entropy of the conditional distributions they model, is an important question left open by the present work.
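As an illustration of Eqns. (1), (3) and (4), the sketch below shows one way the combined objective could be computed for a single batch in a PyTorch setup. It is a sketch under stated assumptions, not the released implementation: the helper name cgan_losses is ours, the generator loss uses the common non-saturating form (maximizing log D of fakes) rather than literally minimizing log(1 − D), and noise z is assumed to come from dropout inside G, as described above. The weight λ = 100 matches the value reported in Section 3.2.

```python
import torch
import torch.nn.functional as F

def cgan_losses(G, D, x, y, lambda_l1=100.0):
    """Generator and discriminator losses of Eqns. (1), (3), (4).

    x: input image batch, y: target image batch. Assumes D(x, y) returns raw
    logits (e.g. a PatchGAN grid of scores), so logit-based BCE is used.
    """
    fake = G(x)                                   # G(x, z); z is supplied by dropout inside G

    # Discriminator: real pairs (x, y) -> 1, fake pairs (x, G(x)) -> 0   (Eqn. 1)
    d_real = D(x, y)
    d_fake = D(x, fake.detach())                  # do not backpropagate into G here
    loss_D = 0.5 * (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
                    + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    # (the 0.5 factor, which slows D relative to G, is a common choice, not required by Eqn. 1)

    # Generator: fool D and stay near the ground truth in an L1 sense   (Eqns. 1, 3, 4)
    d_fake_for_g = D(x, fake)
    loss_gan = F.binary_cross_entropy_with_logits(d_fake_for_g, torch.ones_like(d_fake_for_g))
    loss_l1 = F.l1_loss(fake, y)
    loss_G = loss_gan + lambda_l1 * loss_l1
    return loss_G, loss_D
```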
Figure 3: Two choices for the architecture of the generator. The "U-Net" [34] is an encoder-decoder with skip connections between mirrored layers in the encoder and decoder stacks.

2.2. Network architectures

We adapt our generator and discriminator architectures from those in [30]. Both generator and discriminator use modules of the form convolution-BatchNorm-ReLU [18]. Details of the architecture are provided in the appendix, with key features discussed below.

2.2.1 Generator with skips

A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid. In addition, for the problems we consider, the input and output differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output. We design the generator architecture around these considerations.

Many previous solutions [29, 39, 19, 48, 43] to problems in this area have used an encoder-decoder network [16]. In such a network, the input is passed through a series of layers that progressively downsample, until a bottleneck layer, at which point the process is reversed (Figure 3). Such a network requires that all information flow pass through all the layers, including the bottleneck. For many image translation problems, there is a great deal of low-level information shared between the input and output, and it would be desirable to shuttle this information directly across the net. For example, in the case of image colorization, the input and output share the location of prominent edges.

To give the generator a means to circumvent the bottleneck for information like this, we add skip connections, following the general shape of a "U-Net" [34] (Figure 3). Specifically, we add skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.
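A minimal sketch of such a skip-connected generator is given below, following the Ck/CDk recipe listed in the appendix (4 × 4 stride-2 convolutions, BatchNorm, leaky ReLUs in the encoder, dropout in the first decoder layers, Tanh output). Module names, the recursive construction, and the exact layer count are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn as nn

class UNetBlock(nn.Module):
    """One U-Net level: downsample, recurse into the inner levels, upsample,
    then concatenate the encoder (skip) features with the upsampled features."""
    def __init__(self, outer_ch, inner_ch, submodule=None, innermost=False, dropout=False):
        super().__init__()
        self.down = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(outer_ch, inner_ch, 4, stride=2, padding=1),
            nn.Identity() if innermost else nn.BatchNorm2d(inner_ch))  # no BN at the 1x1 bottleneck
        self.submodule = submodule if submodule is not None else nn.Identity()
        up_in = inner_ch if innermost else inner_ch * 2   # concat below doubles the channel count
        up = [nn.ReLU(),
              nn.ConvTranspose2d(up_in, outer_ch, 4, stride=2, padding=1),
              nn.BatchNorm2d(outer_ch)]
        if dropout:
            up.append(nn.Dropout(0.5))                    # dropout provides the noise z
        self.up = nn.Sequential(*up)

    def forward(self, x):
        mid = self.up(self.submodule(self.down(x)))
        return torch.cat([x, mid], dim=1)                 # skip: concat layer i with layer n - i

def build_unet_generator(in_ch=3, out_ch=3, base=64):
    block = UNetBlock(base * 8, base * 8, innermost=True)
    for _ in range(3):
        block = UNetBlock(base * 8, base * 8, submodule=block, dropout=True)
    block = UNetBlock(base * 4, base * 8, submodule=block)
    block = UNetBlock(base * 2, base * 4, submodule=block)
    block = UNetBlock(base * 1, base * 2, submodule=block)
    return nn.Sequential(
        nn.Conv2d(in_ch, base, 4, stride=2, padding=1),   # outermost encoder conv (no BatchNorm)
        block,
        nn.ReLU(),
        nn.ConvTranspose2d(base * 2, out_ch, 4, stride=2, padding=1),
        nn.Tanh())
```

Severing the torch.cat skip (returning mid alone, with up_in = inner_ch everywhere) turns this into the plain encoder-decoder compared against in Section 3.3.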
2.2.2 Markovian discriminator (PatchGAN)

It is well known that the L2 loss – and L1, see Figure 4 – produces blurry results on image generation problems [22]. Although these losses fail to encourage high-frequency crispness, in many cases they nonetheless accurately capture the low frequencies. For problems where this is the case, we do not need an entirely new framework to enforce correctness at the low frequencies. L1 will already do.

This motivates restricting the GAN discriminator to only model high-frequency structure, relying on an L1 term to force low-frequency correctness (Eqn. 4). In order to model high frequencies, it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture – which we term a PatchGAN – that only penalizes structure at the scale of patches. This discriminator tries to classify if each N × N patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D.

In Section 3.4, we demonstrate that N can be much smaller than the full size of the image and still produce high quality results. This is advantageous because a smaller PatchGAN has fewer parameters, runs faster, and can be applied on arbitrarily large images.

Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. This connection was previously explored in [25], and is also the common assumption in models of texture [8, 12] and style [7, 15, 13, 24]. Our PatchGAN can therefore be understood as a form of texture/style loss.
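A sketch of such a patch-level discriminator is given below, following the C64-C128-C256-C512 recipe listed in the appendix for the 70 × 70 setting. It is conditioned on the input by channel-wise concatenation; the stride-1 pattern of the last two convolutions is an assumption based on the released code, and the network outputs logits rather than sigmoid probabilities (matching the logit-based loss sketch in Section 2.1).

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Outputs a grid of real/fake logits, one per overlapping input patch;
    averaging these responses gives the overall decision of D."""
    def __init__(self, in_ch=6, base=64):           # 6 = input image + output image, concatenated
        super().__init__()
        def block(cin, cout, stride, norm=True):
            layers = [nn.Conv2d(cin, cout, 4, stride=stride, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2))
            return layers
        self.net = nn.Sequential(
            *block(in_ch, base, 2, norm=False),      # C64 (no BatchNorm on the first layer)
            *block(base, base * 2, 2),               # C128
            *block(base * 2, base * 4, 2),           # C256
            *block(base * 4, base * 8, 1),           # C512
            nn.Conv2d(base * 8, 1, 4, stride=1, padding=1))  # map to a 1-channel logit grid

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))    # condition D on the input image x
```

On a 256 × 256 input pair this produces a 30 × 30 grid of scores, each looking at one 70 × 70 patch.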
2.3. Optimization and inference

To optimize our networks, we follow the standard approach from [14]: we alternate between one gradient descent step on D, then one step on G. We use minibatch SGD and apply the Adam solver [20].

At inference time, we run the generator net in exactly the same manner as during the training phase. This differs from the usual protocol in that we apply dropout at test time, and we apply batch normalization [18] using the statistics of the test batch, rather than aggregated statistics of the training batch. This approach to batch normalization, when the batch size is set to 1, has been termed "instance normalization" and has been demonstrated to be effective at image generation tasks [38]. In our experiments, we use batch size 1 for certain experiments and 4 for others, noting little difference between these two conditions.
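The alternating update can be sketched as follows, reusing the hypothetical cgan_losses helper from Section 2.1. The Adam hyperparameters shown (learning rate 2e-4, β1 = 0.5) are an assumption based on the released implementation, not values stated in this text.

```python
import torch

def train(G, D, loader, epochs, device="cuda"):
    # One gradient step on D, then one on G, per minibatch (Section 2.3).
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
    for _ in range(epochs):
        for x, y in loader:                          # (input image, target image) pairs
            x, y = x.to(device), y.to(device)

            _, loss_D = cgan_losses(G, D, x, y)      # hypothetical helper from Section 2.1
            opt_D.zero_grad()
            loss_D.backward()
            opt_D.step()

            # Recompute so G is scored against the just-updated discriminator.
            loss_G, _ = cgan_losses(G, D, x, y)
            opt_G.zero_grad()
            loss_G.backward()
            opt_G.step()

    # At test time the paper keeps dropout active and uses test-batch BatchNorm
    # statistics, i.e. the generator is simply left in training mode:
    G.train()
    return G, D
```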
3. Experiments

To explore the generality of conditional GANs, we test the method on a variety of tasks and datasets, including both graphics tasks, like photo generation, and vision tasks, like semantic segmentation:

• Semantic labels↔photo, trained on the Cityscapes dataset [4].
• Architectural labels→photo, trained on the CMP Facades dataset [31].
• Map↔aerial photo, trained on data scraped from Google Maps.
• BW→color photos, trained on [35].
• Edges→photo, trained on data from [49] and [44]; binary edges generated using the HED edge detector [42] plus postprocessing.
• Sketch→photo: tests edges→photo models on human-drawn sketches from [10].
• Day→night, trained on [21].

Details of training on each of these datasets are provided in the Appendix. In all cases, the input and output are simply 1-3 channel images. Qualitative results are shown in Figures 8, 9, 10, 11, 12, 13, 14, 15, and 16. Several failure cases are highlighted in Figure 17. More comprehensive results are available at https://phillipi.github.io/pix2pix/.

Data requirements and speed  We note that decent results can often be obtained even on small datasets. Our facade training set consists of just 400 images (see results in Figure 12), and the day to night training set consists of only 91 unique webcams (see results in Figure 13). On datasets of this size, training can be very fast: for example, the results shown in Figure 12 took less than two hours of training on a single Pascal Titan X GPU. At test time, all models run in well under a second on this GPU.

3.1. Evaluation metrics

Evaluating the quality of synthesized images is an open and difficult problem [36]. Traditional metrics such as per-pixel mean-squared error do not assess joint statistics of the result, and therefore do not measure the very structure that structured losses aim to capture.

In order to more holistically evaluate the visual quality of our results, we employ two tactics.
First, we run "real vs. fake" perceptual studies on Amazon Mechanical Turk (AMT). For graphics problems like colorization and photo generation, plausibility to a human observer is often the ultimate goal. Therefore, we test our map generation, aerial photo generation, and image colorization using this approach.

Second, we measure whether or not our synthesized cityscapes are realistic enough that an off-the-shelf recognition system can recognize the objects in them. This metric is similar to the "inception score" from [36], the object detection evaluation in [39], and the "semantic interpretability" measure in [46].

AMT perceptual studies  For our AMT experiments, we followed the protocol from [46]: Turkers were presented with a series of trials that pitted a "real" image against a "fake" image generated by our algorithm. On each trial, each image appeared for 1 second, after which the images disappeared and Turkers were given unlimited time to respond as to which was fake. The first 10 images of each session were practice and Turkers were given feedback. No feedback was provided on the 40 trials of the main experiment. Each session tested just one algorithm at a time, and Turkers were not allowed to complete more than one session. ∼50 Turkers evaluated each algorithm. All images were presented at 256 × 256 resolution. Unlike [46], we did not include vigilance trials. For our colorization experiments, the real and fake images were generated from the same grayscale input. For map↔aerial photo, the real and fake images were not generated from the same input, in order to make the task more difficult and avoid floor-level results.

FCN-score  While quantitative evaluation of generative models is known to be challenging, recent works [36, 39, 46] have tried using pre-trained semantic classifiers to measure the discriminability of the generated images as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized image correctly as well. To this end, we adopt the popular FCN-8s [26] architecture for semantic segmentation, and train it on the cityscapes dataset. We then score synthesized photos by the classification accuracy against the labels these photos were synthesized from.
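The three numbers reported in Tables 1, 2 and 5 (per-pixel accuracy, per-class accuracy, and class IOU) can all be read off a confusion matrix between the predicted and ground-truth label maps. The sketch below shows one standard way to compute them; it illustrates the metric definitions and is not the authors' evaluation script.

```python
import numpy as np

def segmentation_scores(preds, gts, num_classes):
    """preds, gts: iterables of integer label maps (H x W numpy arrays)."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(preds, gts):
        idx = g.astype(np.int64) * num_classes + p.astype(np.int64)
        conf += np.bincount(idx.ravel(), minlength=num_classes ** 2).reshape(num_classes, num_classes)

    tp = np.diag(conf).astype(np.float64)
    per_pixel_acc = tp.sum() / conf.sum()
    per_class_acc = np.nanmean(tp / conf.sum(axis=1))        # recall averaged over classes
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)    # intersection over union per class
    class_iou = np.nanmean(iou)                              # classes absent from GT give NaN and are ignored
    return per_pixel_acc, per_class_acc, class_iou
```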
Table 1: FCN-scores for different losses, evaluated on Cityscapes labels↔photos.

Loss          Per-pixel acc.   Per-class acc.   Class IOU
L1            0.44             0.14             0.10
GAN           0.22             0.05             0.01
cGAN          0.61             0.21             0.16
L1+GAN        0.64             0.19             0.15
L1+cGAN       0.63             0.21             0.16
Ground truth  0.80             0.26             0.21

Table 2: FCN-scores for different receptive field sizes of the discriminator, evaluated on Cityscapes labels→photos.

Discriminator receptive field   Per-pixel acc.   Per-class acc.   Class IOU
1×1                             0.44             0.14             0.10
16×16                           0.62             0.20             0.16
70×70                           0.63             0.21             0.16
256×256                         0.47             0.18             0.13

Figure 4: Different losses induce different quality of results. Each column shows results trained under a different loss (input, ground truth, L1, cGAN, L1+cGAN). Please see https://phillipi.github.io/pix2pix/ for additional examples.

Figure 5: Adding skip connections to an encoder-decoder to create a "U-Net" results in much higher quality results.

Figure 6: Patch size variations. Uncertainty in the output manifests itself differently for different loss functions. Uncertain regions become blurry and desaturated under L1. The 1×1 PixelGAN encourages greater color diversity but has no effect on spatial statistics. The 16×16 PatchGAN creates locally sharp results, but also leads to tiling artifacts beyond the scale it can observe. The 70×70 PatchGAN forces outputs that are sharp, even if incorrect, in both the spatial and spectral (colorfulness) dimensions. The full 256×256 ImageGAN produces results that are visually similar to the 70×70 PatchGAN, but somewhat lower quality according to our FCN-score metric (Table 2). Please see https://phillipi.github.io/pix2pix/ for additional examples.

3.2. Analysis of the objective function

Which components of the objective in Eqn. 4 are important? We run ablation studies to isolate the effect of the L1 term, the GAN term, and to compare using a discriminator conditioned on the input (cGAN, Eqn. 1) against using an unconditional discriminator (GAN, Eqn. 2).

Figure 4 shows the qualitative effects of these variations on two labels→photo problems. L1 alone leads to reasonable but blurry results. The cGAN alone (setting λ = 0 in Eqn. 4) gives much sharper results, but results in some artifacts in facade synthesis. Adding both terms together (with λ = 100) reduces these artifacts.

We quantify these observations using the FCN-score on the cityscapes labels→photo task (Table 1): the GAN-based objectives achieve higher scores, indicating that the synthesized images include more recognizable structure. We also test the effect of removing conditioning from the discriminator (labeled as GAN). In this case, the loss does not penalize mismatch between the input and output; it only cares that the output look realistic. This variant results in very poor performance; examining the results reveals that the generator collapsed into producing nearly the exact same output regardless of input photograph. Clearly it is important, in this case, that the loss measure the quality of the match between input and output, and indeed cGAN performs much better than GAN. Note, however, that adding an L1 term also encourages that the output respect the input, since the L1 loss penalizes the distance between ground truth outputs, which match the input, and synthesized outputs, which may not. Correspondingly, L1+GAN is also effective at creating realistic renderings that respect the input label maps. Combining all terms, L1+cGAN, performs similarly well.

Colorfulness  A striking effect of conditional GANs is that they produce sharp images, hallucinating spatial structure even where it does not exist in the input label map. One might imagine cGANs have a similar effect on "sharpening" in the spectral dimension – i.e. making images more colorful. Just as L1 will incentivize a blur when it is uncertain where exactly to locate an edge, it will also incentivize an average, grayish color when it is uncertain which of several plausible color values a pixel should take on. Specifically, L1 will be minimized by choosing the median of the conditional probability density function over possible colors. An adversarial loss, on the other hand, can in principle become aware that grayish outputs are unrealistic, and encourage matching the true color distribution [14]. In Figure 7, we investigate if our cGANs actually achieve this effect on the Cityscapes dataset. The plots show the marginal distributions over output color values in Lab color space. The ground truth distributions are shown with a dotted line. It is apparent that L1 leads to a narrower distribution than the ground truth, confirming the hypothesis that L1 encourages average, grayish colors. Using a cGAN, on the other hand, pushes the output distribution closer to the ground truth.
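The agreement between these marginals and the ground truth is summarized with histogram intersection scores in Figure 7(d) below. A minimal sketch of that comparison follows; the bin count and value range are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def histogram_intersection(values_a, values_b, bins=100, value_range=(-128, 128)):
    """Intersection of two normalized 1-D histograms: the sum of per-bin minima (1.0 = identical)."""
    h_a, _ = np.histogram(values_a, bins=bins, range=value_range)
    h_b, _ = np.histogram(values_b, bins=bins, range=value_range)
    h_a = h_a / h_a.sum()
    h_b = h_b / h_b.sum()
    return np.minimum(h_a, h_b).sum()

# Example: compare the 'a' channel of generated vs. real images
# (fake_lab, real_lab: Lab arrays of shape N x H x W x 3)
# score_a = histogram_intersection(fake_lab[..., 1].ravel(), real_lab[..., 1].ravel())
```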
3.3. Analysis of the generator architecture

A U-Net architecture allows low-level information to shortcut across the network. Does this lead to better results? Figure 5 compares the U-Net against an encoder-decoder on cityscape generation. The encoder-decoder is created simply by severing the skip connections in the U-Net. The encoder-decoder is unable to learn to generate realistic images in our experiments, and indeed collapses to producing nearly identical results for each input label map. The advantages of the U-Net appear not to be specific to conditional GANs: when both U-Net and encoder-decoder are trained with an L1 loss, the U-Net again achieves the superior results (Figure 5).
Figure 7: Color distribution matching property of the cGAN, tested on Cityscapes (c.f. Figure 1 of the original GAN paper [14]). Panels (a)-(c) plot the marginal log-probability distributions of output L, a, and b color values for L1, cGAN, L1+cGAN, and L1+pixelcGAN against the ground truth; panel (d) reports histogram intersection against ground truth. Note that the histogram intersection scores are dominated by differences in the high probability region, which are imperceptible in the plots, which show log probability and therefore emphasize differences in the low probability regions.

Histogram intersection against ground truth:

Loss       L      a      b
L1         0.81   0.69   0.70
cGAN       0.87   0.74   0.84
L1+cGAN    0.86   0.84   0.82
PixelGAN   0.83   0.68   0.78

3.4. From PixelGANs to PatchGANs to ImageGANs

We test the effect of varying the patch size N of our discriminator receptive fields, from a 1 × 1 "PixelGAN" to a full 256 × 256 "ImageGAN"¹. Figure 6 shows qualitative results of this analysis and Table 2 quantifies the effects using the FCN-score. Note that elsewhere in this paper, unless specified, all experiments use 70 × 70 PatchGANs, and for this section all experiments use an L1+cGAN loss.

The PixelGAN has no effect on spatial sharpness, but does increase the colorfulness of the results (quantified in Figure 7). For example, the bus in Figure 6 is painted gray when the net is trained with an L1 loss, but becomes red with the PixelGAN loss. Color histogram matching is a common problem in image processing [33], and PixelGANs may be a promising lightweight solution.

Using a 16×16 PatchGAN is sufficient to promote sharp outputs, but also leads to tiling artifacts. The 70 × 70 PatchGAN alleviates these artifacts. Scaling beyond this, to the full 256 × 256 ImageGAN, does not appear to improve the visual quality of the results, and in fact gets a considerably lower FCN-score (Table 2). This may be because the ImageGAN has many more parameters and greater depth than the 70 × 70 PatchGAN, and may be harder to train.

Fully-convolutional translation  An advantage of the PatchGAN is that a fixed-size patch discriminator can be applied to arbitrarily large images. We may also apply the generator convolutionally, on larger images than those on which it was trained. We test this on the map↔aerial photo task. After training a generator on 256×256 images, we test it on 512×512 images. The results in Figure 8 demonstrate the effectiveness of this approach.

¹ We achieve this variation in patch size by adjusting the depth of the GAN discriminator. Details of this process, and the discriminator architectures, are provided in the appendix.
3.5. Perceptual validation

We validate the perceptual realism of our results on the tasks of map↔aerial photograph and grayscale→color. Results of our AMT experiment for map↔photo are given in Table 3. The aerial photos generated by our method fooled participants on 18.9% of trials, significantly above the L1 baseline, which produces blurry results and nearly never fooled participants. In contrast, in the photo→map direction, our method only fooled participants on 6.1% of trials, and this was not significantly different than the performance of the L1 baseline (based on bootstrap test). This may be because minor structural errors are more visible in maps, which have rigid geometry, than in aerial photographs, which are more chaotic.

Table 3: AMT "real vs fake" test on maps↔aerial photos.

Loss       Photo → Map (% Turkers labeled real)   Map → Photo (% Turkers labeled real)
L1         2.8% ± 1.0%                            0.8% ± 0.3%
L1+cGAN    6.1% ± 1.3%                            18.9% ± 2.5%

We trained colorization on ImageNet [35], and tested on the test split introduced by [46, 23]. Our method, with L1+cGAN loss, fooled participants on 22.5% of trials (Table 4). We also tested the results of [46] and a variant of their method that used an L2 loss (see [46] for details). The conditional GAN scored similarly to the L2 variant of [46] (difference insignificant by bootstrap test), but fell short of [46]'s full method, which fooled participants on 27.8% of trials in our experiment. We note that their method was specifically engineered to do well on colorization.

Table 4: AMT "real vs fake" test on colorization.

Method                      % Turkers labeled real
L2 regression from [46]     16.3% ± 2.4%
Zhang et al. 2016 [46]      27.8% ± 2.7%
Ours                        22.5% ± 1.6%

Figure 9: Colorization results of conditional GANs versus the L2 regression from [46] and the full method (classification with rebalancing) from [46]. The cGANs can produce compelling colorizations (first two rows), but have a common failure mode of producing a grayscale or desaturated result (last row).

3.6. Semantic segmentation

Conditional GANs appear to be effective on problems where the output is highly detailed or photographic, as is common in image processing and graphics tasks. What about vision problems, like semantic segmentation, where the output is instead less complex than the input?

To begin to test this, we train a cGAN (with/without L1 loss) on cityscape photo→labels. Figure 10 shows qualitative results, and quantitative classification accuracies are reported in Table 5. Interestingly, cGANs, trained without the L1 loss, are able to solve this problem at a reasonable degree of accuracy. To our knowledge, this is the first demonstration of GANs successfully generating "labels", which are nearly discrete, rather than "images", with their continuous-valued variation². Although cGANs achieve some success, they are far from the best available method for solving this problem: simply using L1 regression gets better scores than using a cGAN, as shown in Table 5. We argue that for vision problems, the goal (i.e. predicting output close to ground truth) may be less ambiguous than graphics tasks, and reconstruction losses like L1 are mostly sufficient.

² Note that the label maps we train on are not exactly discrete valued, as they are resized from the original maps using bilinear interpolation and saved as jpeg images, with some compression artifacts.

Table 5: Performance of photo→labels on cityscapes.

Loss      Per-pixel acc.   Per-class acc.   Class IOU
L1        0.86             0.42             0.35
cGAN      0.74             0.28             0.22
L1+cGAN   0.83             0.36             0.29

Figure 10: Applying a conditional GAN to semantic segmentation. The cGAN produces sharp images that look at a glance like the ground truth, but in fact include many small, hallucinated objects.

4. Conclusion

The results in this paper suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings.

Acknowledgments: We thank Richard Zhang and Deepak Pathak for helpful discussions. This work was supported in part by NSF SMA-1514512, NGA NURI, IARPA via Air Force Research Laboratory, Intel Corp, and hardware donations by nVIDIA. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL or the U.S. Government.
[Figure 8 panels: aerial photo to map (left) and map to aerial photo (right); each showing input and output.]
Figure 8: Example results on Google Maps at 512x512 resolution (model was trained on images at 256x256 resolution, and run convolu-
tionally on the larger images at test time). Contrast adjusted for clarity.
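As the caption notes, the generator trained on 256 × 256 crops is simply run convolutionally on the larger test images. A sketch of that usage, assuming a fully convolutional generator like the hypothetical build_unet_generator of Section 2.2.1 (the weight file path is illustrative):

```python
import torch

G = build_unet_generator()                     # hypothetical generator from Section 2.2.1
# G.load_state_dict(torch.load("g_256.pt"))    # weights trained on 256x256 crops (illustrative path)
G.train()                                      # keep dropout and test-batch BatchNorm, as in Section 2.3

with torch.no_grad():
    x = torch.randn(1, 3, 512, 512)            # stand-in for a 512x512 map or aerial input
    y = G(x)                                   # the convolutional layers adapt to the larger grid
print(y.shape)                                 # torch.Size([1, 3, 512, 512])
```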


Figure 11: Example results of our method on Cityscapes labels→photo, compared to ground truth.

Figure 12: Example results of our method on facades labels→photo, compared to ground truth.

Figure 13: Example results of our method on day→night, compared to ground truth.


Figure 14: Example results of our method on automatically detected edges→handbags, compared to ground truth.

Figure 15: Example results of our method on automatically detected edges→shoes, compared to ground truth.


Figure 16: Example results of the edges→photo models applied to human-drawn sketches from [10]. Note that the models were trained on automatically detected edges, but generalize to human drawings.
[Figure 17 panels: Day→Night, Labels→Facade, Labels→Street scene, Edges→Shoe, Edges→Handbag, Sketch→Shoe, Sketch→Handbag.]

Figure 17: Example failure cases. Each pair of images shows input on the left and output on the right. These examples are selected as some
of the worst results on our tasks. Common failures include artifacts in regions where the input image is sparse, and difficulty in handling
unusual inputs. Please see https://phillipi.github.io/pix2pix/ for more comprehensive results.
References

[1] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In CVPR, volume 2, pages 60–65. IEEE, 2005. 1
[2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015. 2
[3] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2photo: internet image montage. ACM Transactions on Graphics (TOG), 28(5):124, 2009. 1
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 4, 16
[5] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015. 2
[6] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016. 2
[7] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In SIGGRAPH, pages 341–346. ACM, 2001. 1, 4
[8] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, volume 2, pages 1033–1038. IEEE, 1999. 4
[9] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015. 1
[10] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? SIGGRAPH, 31(4):44–1, 2012. 4, 12
[11] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), volume 25, pages 787–794. ACM, 2006. 1
[12] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks. arXiv preprint arXiv:1505.07376, 2015. 4
[13] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. CVPR, 2016. 4
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. 2, 4, 6, 7
[15] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, pages 327–340. ACM, 2001. 1, 4
[16] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. 3
[17] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Transactions on Graphics (TOG), 35(4), 2016. 2
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015. 3, 4
[19] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. 2016. 2, 3
[20] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015. 4
[21] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149, 2014. 1, 4, 16
[22] A. B. L. Larsen, S. K. Sønderby, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015. 4
[23] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. ECCV, 2016. 2, 8, 16
[24] C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. CVPR, 2016. 2, 4
[25] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. ECCV, 2016. 2, 4
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015. 1, 2, 5
[27] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. ICLR, 2016. 2, 3
[28] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 2
[29] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. CVPR, 2016. 2, 3
[30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. 2, 3, 16
[31] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrucken, Germany, 2013. 4, 16
[32] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016. 2
[33] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21:34–41, 2001. 7
[34] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015. 2, 3, 4
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015. 4, 8, 16
[36] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016. 2, 4, 5
[37] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven
hallucination of different times of day from a single outdoor
photo. ACM Transactions on Graphics (TOG), 32(6):200,
2013. 1
[38] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normal-
ization: The missing ingredient for fast stylization. arXiv
preprint arXiv:1607.08022, 2016. 4
[39] X. Wang and A. Gupta. Generative image modeling using
style and structure adversarial networks. ECCV, 2016. 2, 3,
5
[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli.
Image quality assessment: from error visibility to struc-
tural similarity. IEEE Transactions on Image Processing,
13(4):600–612, 2004. 2
[41] S. Xie, X. Huang, and Z. Tu. Top-down learning for struc-
tured labeling with convolutional pseudoprior. 2015. 2
[42] S. Xie and Z. Tu. Holistically-nested edge detection. In
ICCV, 2015. 1, 2, 4
[43] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-
level domain transfer. ECCV, 2016. 2, 3
[44] A. Yu and K. Grauman. Fine-Grained Visual Comparisons
with Local Learning. In CVPR, 2014. 4
[45] A. Yu and K. Grauman. Fine-grained visual comparisons
with local learning. In CVPR, pages 192–199, 2014. 16
[46] R. Zhang, P. Isola, and A. A. Efros. Colorful image coloriza-
tion. ECCV, 2016. 1, 2, 5, 7, 8, 16
[47] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based genera-
tive adversarial network. arXiv preprint arXiv:1609.03126,
2016. 2
[48] Y. Zhou and T. L. Berg. Learning temporal transformations
from time-lapse videos. In ECCV, 2016. 2, 3, 7
[49] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros.
Generative visual manipulation on the natural image mani-
fold. In ECCV, 2016. 2, 4, 16
5. Appendix

5.1. Network architectures

We adapt our network architectures from those in [30]. Code for the models is available at https://github.com/phillipi/pix2pix.

Let Ck denote a Convolution-BatchNorm-ReLU layer with k filters. CDk denotes a Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 50%. All convolutions are 4 × 4 spatial filters applied with stride 2. Convolutions in the encoder, and in the discriminator, downsample by a factor of 2, whereas in the decoder they upsample by a factor of 2.
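The Ck/CDk shorthand translates almost directly into code. The helper below is a sketch of that translation (the transpose flag selects decoder-style upsampling; names and defaults are ours, not the released implementation's):

```python
import torch.nn as nn

def C(k_in, k_out, transpose=False, batchnorm=True, dropout=False, leaky=True):
    """Ck / CDk block: 4x4 stride-2 (de)convolution - BatchNorm - (Dropout) - ReLU."""
    conv = (nn.ConvTranspose2d if transpose else nn.Conv2d)(k_in, k_out, 4, stride=2, padding=1)
    layers = [conv]
    if batchnorm:
        layers.append(nn.BatchNorm2d(k_out))
    if dropout:
        layers.append(nn.Dropout(0.5))                         # the "D" in CDk (50% dropout)
    layers.append(nn.LeakyReLU(0.2) if leaky else nn.ReLU())   # leaky in encoder/discriminator, plain in decoder
    return nn.Sequential(*layers)

# e.g. the first two encoder layers, C64-C128 (no BatchNorm on the very first layer):
# enc = nn.Sequential(C(3, 64, batchnorm=False), C(64, 128))
```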
5.1.1 Generator architectures

The encoder-decoder architecture consists of:

encoder:
C64-C128-C256-C512-C512-C512-C512-C512

decoder:
CD512-CD512-CD512-C512-C512-C256-C128-C64

After the last layer in the decoder, a convolution is applied to map to the number of output channels (3 in general, except in colorization, where it is 2), followed by a Tanh function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer in the encoder. All ReLUs in the encoder are leaky, with slope 0.2, while ReLUs in the decoder are not leaky.

The U-Net architecture is identical except with skip connections between each layer i in the encoder and layer n − i in the decoder, where n is the total number of layers. The skip connections concatenate activations from layer i to layer n − i. This changes the number of channels in the decoder:

U-Net decoder:
CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128
5.1.2 Discriminator architectures

The 70 × 70 discriminator architecture is:

C64-C128-C256-C512

After the last layer, a convolution is applied to map to a 1 dimensional output, followed by a Sigmoid function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer. All ReLUs are leaky, with slope 0.2.

All other discriminators follow the same basic architecture, with depth varied to modify the receptive field size:

1 × 1 discriminator:
C64-C128 (note, in this special case, all convolutions are 1 × 1 spatial filters)

16 × 16 discriminator:
C64-C128

256 × 256 discriminator:
C64-C128-C256-C512-C512-C512

Note that the 256 × 256 discriminator has receptive fields that could cover up to 574 × 574 pixels, if they were available, but since the input images are only 256 × 256 pixels, only 256 × 256 pixels are seen, and so we refer to this setting as the 256 × 256 discriminator.
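The patch sizes quoted above follow from the usual receptive-field recursion for stacked convolutions, sketched below. The stride pattern shown for the 70 × 70 case (stride 1 in the last two layers) is an assumption based on the released code rather than something stated in this appendix; with it, the recursion reproduces the 70 × 70 figure.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered from input to output."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# 70x70 discriminator: C64-C128-C256-C512 plus the final 1-channel conv, all 4x4 kernels.
print(receptive_field([(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]))  # -> 70
```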
5.2. Training details

Random jitter was applied by resizing the 256 × 256 input images to 286 × 286 and then randomly cropping back to size 256 × 256.

All networks were trained from scratch. Weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02.
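A sketch of this jitter and the mirroring used below, written with torchvision-style transforms; applying the identical crop and flip to both images of a pair, and the choice of a 50% flip probability, are assumptions beyond the 256 → 286 → 256 resize-and-crop specified above.

```python
import random
import torchvision.transforms.functional as TF

def jitter_and_mirror(img, lbl):
    """Apply identical random jitter and mirroring to an (input, target) image pair."""
    img = TF.resize(img, [286, 286])
    lbl = TF.resize(lbl, [286, 286])
    top = random.randint(0, 286 - 256)
    left = random.randint(0, 286 - 256)
    img = TF.crop(img, top, left, 256, 256)     # random 256x256 crop of the enlarged image
    lbl = TF.crop(lbl, top, left, 256, 256)     # same crop so the pair stays aligned
    if random.random() < 0.5:                   # horizontal mirroring
        img, lbl = TF.hflip(img), TF.hflip(lbl)
    return img, lbl
```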
Semantic labels→photo  2975 training images from the Cityscapes training set [4], trained for 200 epochs, batch size 1, with random jitter and mirroring. We used the Cityscapes val set for testing.

Architectural labels→photo  400 training images from [31], trained for 200 epochs, batch size 1, with random jitter and mirroring. Data was split into train and test randomly.

Maps↔aerial photograph  1096 training images scraped from Google Maps, trained for 200 epochs, batch size 1, with random jitter and mirroring. Images were sampled from in and around New York City. Data was then split into train and test about the median latitude of the sampling region (with a buffer region added to ensure that no training pixel appeared in the test set).

BW→color  1.2 million training images (Imagenet training set [35]), trained for ∼6 epochs, batch size 4, with only mirroring, no random jitter. Tested on a subset of the Imagenet val set, following the protocol of [46] and [23].

Edges→shoes  50k training images from the UT Zappos50K dataset [45], trained for 15 epochs, batch size 4. Data was split into train and test randomly.

Edges→Handbag  137K Amazon Handbag images from [49], trained for 15 epochs, batch size 4. Data was split into train and test randomly.

Day→night  17823 training images extracted from 91 webcams, from [21], trained for 17 epochs, batch size 4, with random jitter and mirroring. We use 91 webcams as training, and 10 webcams for test.
