Deep Unsupervised Learning Using Nonequilibrium Thermodynamics

Deep Unsupervised Learning using
Nonequilibrium Thermodynamics
Jascha Sohl-Dickstein JASCHA @ STANFORD . EDU

Stanford University
Eric A. Weiss EAWEISS @ BERKELEY. EDU
University of California, Berkeley
arXiv:1503.03585v8 [cs.LG] 18 Nov 2015
Niru Maheswaranathan NIRUM @ STANFORD . EDU

Stanford University
Surya Ganguli SGANGULI @ STANFORD . EDU
Stanford University
Abstract these models are unable to aptly describe structure in rich

datasets. On the other hand, models that are flexible can be
A central problem in machine learning involves
molded to fit structure in arbitrary data. For example, we
modeling complex data-sets using highly flexi-
can define models in terms of any (non-negative) function
ble families of probability distributions in which
learning, sampling, inference, and evaluation (x) yielding the flexible distribution p (x) = (x)
Z , where
are still analytically or computationally tractable. Z is a normalization constant. However, computing this
Here, we develop an approach that simultane- normalization constant is generally intractable. Evaluating,
ously achieves both flexibility and tractability. training, or drawing samples from such flexible models typ-
The essential idea, inspired by non-equilibrium ically requires a very expensive Monte Carlo process.
statistical physics, is to systematically and slowly A variety of analytic approximations exist which amelio-
destroy structure in a data distribution through rate, but do not remove, this tradeofffor instance mean
an iterative forward diffusion process. We then field theory and its expansions (T, 1982; Tanaka, 1998),
learn a reverse diffusion process that restores variational Bayes (Jordan et al., 1999), contrastive diver-
structure in data, yielding a highly flexible and gence (Welling & Hinton, 2002; Hinton, 2002), minimum
tractable generative model of the data. This ap- probability flow (Sohl-Dickstein et al., 2011b;a), minimum
proach allows us to rapidly learn, sample from, KL contraction (Lyu, 2011), proper scoring rules (Gneit-
and evaluate probabilities in deep generative ing & Raftery, 2007; Parry et al., 2012), score matching
models with thousands of layers or time steps, (Hyvarinen, 2005), pseudolikelihood (Besag, 1975), loopy
as well as to compute conditional and posterior belief propagation (Murphy et al., 1999), and many, many
probabilities under the learned model. We addi- more. Non-parametric methods (Gershman & Blei, 2012)
tionally release an open source reference imple- can also be very effective1 .
mentation of the algorithm.
1.1. Diffusion probabilistic models
1. Introduction We present a novel way to define probabilistic models that

allows:
Historically, probabilistic models suffer from a tradeoff be-
tween two conflicting objectives: tractability and flexibil- 1. extreme flexibility in model structure,
ity. Models that are tractable can be analytically evaluated 2. exact sampling,
and easily fit to data (e.g. a Gaussian or Laplace). However, 1
Non-parametric methods can be seen as transitioning
nd smoothly between tractable and flexible models. For instance,
Proceedings of the 32 International Conference on Machine
a non-parametric Gaussian mixture model will represent a small
Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copy-
amount of data using a single Gaussian, but may represent infinite
right 2015 by the author(s).
data as a mixture of an infinite number of Gaussians.
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
3. easy multiplication with other distributions, e.g. in or- 2. We show how to easily multiply the learned distribu-
der to compute a posterior, and tion with another probability distribution (eg with a
4. the model log likelihood, and the probability of indi- conditional distribution in order to compute a poste-
vidual states, to be cheaply evaluated. rior)
3. We address the difficulty that training the inference
Our method uses a Markov chain to gradually convert one
model can prove particularly challenging in varia-
distribution into another, an idea used in non-equilibrium
tional inference methods, due to the asymmetry in the
statistical physics (Jarzynski, 1997) and sequential Monte
objective between the inference and generative mod-
Carlo (Neal, 2001). We build a generative Markov chain
els. We restrict the forward (inference) process to a
which converts a simple known distribution (e.g. a Gaus-
simple functional form, in such a way that the re-
sian) into a target (data) distribution using a diffusion pro-
verse (generative) process will have the same func-
cess. Rather than use this Markov chain to approximately
tional form.
evaluate a model which has been otherwise defined, we ex-
plicitly define the probabilistic model as the endpoint of the 4. We train models with thousands of layers (or time
Markov chain. Since each step in the diffusion chain has an steps), rather than only a handful of layers.
analytically evaluable probability, the full chain can also be 5. We provide upper and lower bounds on the entropy
analytically evaluated. production in each layer (or time step)
Learning in this framework involves estimating small per- There are a number of related techniques for training prob-
turbations to a diffusion process. Estimating small pertur- abilistic models (summarized below) that develop highly
bations is more tractable than explicitly describing the full flexible forms for generative models, train stochastic tra-
distribution with a single, non-analytically-normalizable, jectories, or learn the reversal of a Bayesian network.
potential function. Furthermore, since a diffusion process Reweighted wake-sleep (Bornschein & Bengio, 2015) de-
exists for any smooth target distribution, this method can velops extensions and improved learning rules for the orig-
capture data distributions of arbitrary form. inal wake-sleep algorithm. Generative stochastic networks
(Bengio & Thibodeau-Laufer, 2013; Yao et al., 2014) train
We demonstrate the utility of these diffusion probabilistic a Markov kernel to match its equilibrium distribution to
models by training high log likelihood models for a two- the data distribution. Neural autoregressive distribution
dimensional swiss roll, binary sequence, handwritten digit estimators (Larochelle & Murray, 2011) (and their recur-
(MNIST), and several natural image (CIFAR-10, bark, and rent (Uria et al., 2013a) and deep (Uria et al., 2013b) ex-
dead leaves) datasets. tensions) decompose a joint distribution into a sequence
of tractable conditional distributions over each dimension.
1.2. Relationship to other work Adversarial networks (Goodfellow et al., 2014) train a gen-
The wake-sleep algorithm (Hinton, 1995; Dayan et al., erative model against a classifier which attempts to dis-
1995) introduced the idea of training inference and gen- tinguish generated samples from true data. A similar ob-
erative probabilistic models against each other. This jective in (Schmidhuber, 1992) learns a two-way map-
approach remained largely unexplored for nearly two ping to a representation with marginally independent units.
decades, though with some exceptions (Sminchisescu et al., In (Rippel & Adams, 2013; Dinh et al., 2014) bijective
2006; Kavukcuoglu et al., 2010). There has been a re- deterministic maps are learned to a latent representation
cent explosion of work developing this idea. In (Kingma with a simple factorial density function. In (Stuhlmuller
& Welling, 2013; Gregor et al., 2013; Rezende et al., 2014; et al., 2013) stochastic inverses are learned for Bayesian
Ozair & Bengio, 2014) variational learning and inference networks. Mixtures of conditional Gaussian scale mix-
algorithms were developed which allow a flexible genera- tures (MCGSMs) (Theis et al., 2012) describe a dataset
tive model and posterior distribution over latent variables using Gaussian scale mixtures, with parameters which de-
to be directly trained against each other. pend on a sequence of causal neighborhoods. There is
additionally significant work learning flexible generative
The variational bound in these papers is similar to the one mappings from simple latent distributions to data distribu-
used in our training objective and in the earlier work of tions early examples including (MacKay, 1995) where
(Sminchisescu et al., 2006). However, our motivation and neural networks are introduced as generative models, and
model form are both quite different, and the present work (Bishop et al., 1998) where a stochastic manifold mapping
retains the following differences and advantages relative to is learned from a latent space to the data space. We will
these techniques: compare experimentally against adversarial networks and
MCGSMs.
1. We develop our framework using ideas from physics,
quasi-static processes, and annealed importance sam- Related ideas from physics include the Jarzynski equal-
pling rather than from variational Bayesian methods. ity (Jarzynski, 1997), known in machine learning as An-
T
t=0 t= 2 t=T
2 2 2
0 0 0
q x(0T )
2 2 2
2 0 2 2 0 2 2 0 2
2 2 2
0 0 0
p x(0T )
2 2 2
2 0 2 2 0 2 2 0 2

f x(t) , t x(t)
Figure
1. The proposed modeling framework trained on 2-d swiss roll data. The top row shows time slices from the forward trajectory
(0T )
q x . The data distribution (left) undergoes Gaussian diffusion, which gradually transforms it into an identity-covariance Gaus-

sian (right). The middle row shows the corresponding time slices from the trained reverse trajectory p x(0T ) . An identity-covariance
Gaussian (right) undergoes a Gaussian diffusion process with learned mean and covariance
functions, and is gradually transformed back
into the data distribution (left). The bottom row shows the drift term, f x(t) , t x(t) , for the same reverse diffusion process.
nealed Importance Sampling (AIS) (Neal, 2001), which how the reverse, generative diffusion process can be trained
uses a Markov chain which slowly converts one distribu- and used to evaluate probabilities. We also derive entropy
tion into another to compute a ratio of normalizing con- bounds for the reverse process, and show how the learned
stants. In (Burda et al., 2014) it is shown that AIS can also distributions can be multiplied by any second distribution
be performed using the reverse rather than forward trajec- (e.g. as would be done to compute a posterior when in-
tory. Langevin dynamics (Langevin, 1908), which are the painting or denoising an image).
stochastic realization of the Fokker-Planck equation, show
how to define a Gaussian diffusion process which has any 2.1. Forward Trajectory
target distribution as its equilibrium. In (Suykens & Vande-
walle, 1995) the Fokker-Planck equation is used to perform We label the data distribution q x(0) . The data distribu-
stochastic optimization. Finally, the Kolmogorov forward tion is gradually converted into a well behaved (analyti-
and backward equations (Feller, 1949) show that for many cally tractable) distribution (y) by repeated application
forward diffusion processes, the reverse diffusion processes of a Markov diffusion kernel T (y|y0 ; ) for (y), where
can be described using the same functional form. is the diffusion rate,
2. Algorithm
Our goal is to define a forward (or inference) diffusion pro-
cess which converts any complex data distribution into a Z
simple, tractable, distribution, and then learn a finite-time
(y) = dy0 T (y|y0 ; ) (y0 ) (1)
reversal of this diffusion process which defines our gener-
ative model distribution (See Figure 1). We first describe
q x(t) |x(t1) = T x(t) |x(t1) ; t . (2)
the forward, inference diffusion process. We then show
T
t=0 t= 2 t=T
0 0 0
5 5 5
10 10 10
Sample
Sample
Sample

p x(0T )
15 15 15
20 20 20
0 5 10 15 0 5 10 15 0 5 10 15
Bin Bin Bin
Figure 2. Binary sequence learning via binomial diffusion. A binomial diffusion model was trained on binary heartbeat data, where a
pulse occurs every 5th bin. Generated samples (left) are identical to the training data. The sampling procedure consists of initialization
at independent binomial noise (right), which is then transformed into the data distribution by a binomial diffusion process, with trained
bit flip probabilities. Each row contains an independent sample. For ease of visualization, all samples have been shifted so that a pulse
occurs in the first column. In the raw sequence data, the first pulse is uniformly distributed over the first five bins.
(a) (b)
(c) (d)
Figure 3. The proposed framework trained on the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset. (a) Example holdout data (similar
to training data). (b) Holdout data corrupted with Gaussian noise of variance 1 (SNR = 1). (c) Denoised images, generated by sampling
from the posterior distribution over denoised images conditioned on the images in (b). (d) Samples generated by the diffusion model.

The forward trajectory, corresponding to starting at the data For the experiments shown below, q x(t) |x(t1) corre-
distribution and performing T steps of diffusion, is thus sponds to either Gaussian diffusion into a Gaussian distri-
bution with identity-covariance, or binomial diffusion into
an independent binomial distribution. Table App.1 gives
T
Y the diffusion kernels for both Gaussian and binomial distri-
q x(0T ) = q x(0) q x(t) |x(t1) (3) butions.
t=1
2.2. Reverse Trajectory This can be evaluated rapidly by averaging over samples
from the forward trajectory q x(1T ) |x(0) . For infinites-
The generative distribution will be trained to describe the
imal the forward and reverse distribution over trajecto-
same trajectory, but in reverse,
ries can be made identical (see Section 2.2). If they are
identical then only a single sample from q x(1T ) |x(0)
p x(T ) = x(T ) (4) is required to exactly evaluate the above integral, as can
T
Y be seen by substitution. This corresponds to the case of a
p x(0T ) = p x(T ) p x(t1) |x(t) . (5) quasi-static process in statistical physics (Spinney & Ford,
t=1 2013; Jarzynski, 2011).
For both Gaussian and binomial diffusion, for continuous
2.4. Training
diffusion (limit of small step size ) the reversal of the
diffusion process has the identical functional form as the Training amounts to maximizing the model log likelihood,
forward process (Feller, 1949). Since q x(t) |x(t1) is a Z
Gaussian (binomial)
distribution, and if t is small, then L = dx(0) q x(0) log p x(0) (10)
q x(t1) |x(t) will also be a Gaussian (binomial) distribu- Z
tion. The longer the trajectory the smaller the diffusion rate = dx(0) q x(0)
can be made. R
dx(1T ) q x(1T ) |x(0)
During learning only the mean and covariance for a Gaus- log QT p(x(t1) |x(t) ) , (11)
sian diffusion kernel, or the bit flip probability for a bi- p x(T ) t=1 q (x(t) |x(t1) )
nomial kernel, need be estimated. As shown in Table
App.1, f x(t) , t and f x(t) , t are functions defining which has a lower bound provided by Jensens inequality,
the mean and covariance of the reverse Markov transitions Z
for a Gaussian, and fb x(t) , t is a function providing the

L dx(0T ) q x(0T )
bit flip probability for a binomial distribution. The compu-
tational cost of running this algorithm is the cost of these "
YT
p x(t1) |x(t)
#
(T )
functions, times the number of time-steps. For all results in log p x . (12)
this paper, multi-layer perceptrons are used to define these t=1
q x(t) |x(t1)
functions. A wide range of regression or function fitting
techniques would be applicable however, including nonpa- As described in Appendix B, for our diffusion trajectories
rameteric methods. this reduces to,
LK (13)
2.3. Model Probability
T Z
X
The probability the generative model assigns to the data is K = dx(0) dx(t) q x(0) , x(t)
t=2
Z
p x(0) = dx(1T ) p x(0T ) . (6) DKL q x(t1) |x(t) , x(0) p x(t1) |x(t)

Naively this integral is intractable but taking a cue from + Hq X(T ) |X(0) Hq X(1) |X(0) Hp X(T ) .
annealed importance sampling and the Jarzynski equality, (14)
we instead evaluate the relative probability of the forward
and reverse trajectories, averaged over forward trajectories, where the entropies and KL divergences can be analyt-
ically computed. The derivation of this bound parallels
(1T ) (0)

(0)
Z
(1T )

(0T ) q x
|x the derivation of the log likelihood bound in variational
p x = dx p x (7)
Bayesian methods.
q x(1T ) |x(0)
Z p x(0T ) As in Section 2.3 if the forward and reverse trajectories are
(1T ) (1T ) (0)
= dx q x |x identical, corresponding to a quasi-static process, then the
q x(1T ) |x(0)
(8) inequality in Equation 13 becomes an equality.
Training consists of finding the reverse Markov transitions
Z
= dx(1T ) q x(1T ) |x(0) which maximize this lower bound on the log likelihood,
T
p x(t1) |x(t)
Y
p x (T ) . (9) p x(t1) |x(t) = argmax K. (15)
t=1
q x(t) |x(t1) p(x(t1) |x(t) )
The specific targets of estimation for Gaussian and bino- to multiply distributions in the context of diffusion proba-
mial diffusion are given in Table App.1. bilistic models.
Thus, the task of estimating a probability distribution has
2.5.1. M ODIFIED M ARGINAL D ISTRIBUTIONS
been reduced to the task of performing regression on the
functions which set the mean and covariance of a sequence First, in order to compute p x(0) , we multiply each of
of Gaussians (or set the state flip probability for a sequence the intermediate
distributions by a corresponding function
of Bernoulli trials). r x(t) . We use a tilde above a distribution or Markov
transition to denote that it belongs to a trajectory that has
2.4.1. S ETTING THE D IFFUSION R ATE t been modified in this way. p x(0T ) is the modified re-

The choice of t in the forward trajectory is important for verse trajectory, which starts at the distribution p x(T ) =
1

the performance of the trained model. In AIS, the right ZT
p x(T ) r x(T ) and proceeds through the sequence of
schedule of intermediate distributions can greatly improve intermediate distributions
the accuracy of the log partition function estimate (Grosse
1 (t) (t)
p x(t) = p x r x , (16)
et al., 2013). In thermodynamics the schedule taken when Zt
moving between equilibrium distributions determines how
where Zt is the normalizing constant for the tth intermedi-
much free energy is lost (Spinney & Ford, 2013; Jarzynski,
ate distribution.
2011).
In the case of Gaussian diffusion, we learn2 the forward 2.5.2. M ODIFIED D IFFUSION S TEPS
diffusion schedule 2T by gradient ascent on K. The
variance 1 of the first step is fixed to a small constant The Markov kernel p x(t) | x(t+1) for the reverse diffu-
to prevent overfitting. The dependence of samples from sion process obeys the equilibrium condition
Z
q x(1T ) |x(0) on 1T is made explicit by using frozen

p x(t = dx(t+1) p xt) | x(t+1) p xt+1) . (17)
noise as in (Kingma & Welling, 2013) the noise is treated
as an additional auxiliary variable, and held constant while
We wish the perturbed Markov kernel p x(t) | x(t+1) to
computing partial derivatives of K with respect to the pa-
instead obey the equilibrium condition for the perturbed
rameters.
distribution,
For binomial diffusion, the discrete state space makes gra-
(t)
Z
dient ascent with frozen noise impossible. We instead p x = dx(t+1) p x(t) | x(t+1) p xt+1) ,
choose the forward diffusion schedule 1T to erase a con-
(18)
stant fraction T1 of the original signal per diffusion step,
(t) (t)
p x r x
Z
1
yielding a diffusion rate of t = (T t + 1) . dx(t+1) p x(t) | x(t+1)
=
Zt

2.5. Multiplying Distributions, and Computing p x(t+1) r x(t+1)
Posteriors , (19)
Zt+1
Z
Tasks such as computing a posterior in order to do signal

p x(t) = dx(t+1) p x(t) | x(t+1)
denoising or inference of missing values requires multipli-
cation of the model distribution p x(0) with a second dis-

Zt r x(t+1) (t+1)
p x .

tribution, or bounded positive function, r x(0) , producing
Zt+1 r x(t)
a new distribution p x(0) p x(0) r x(0) . (20)
Multiplying distributions is costly and difficult for many
techniques, including variational autoencoders, GSNs, Equation 20 will be satisfied if
NADEs, and most graphical models. However, under a dif- (t)

(t) (t+1) Zt+1 r x

(t) (t+1)
fusion model it is straightforward, since the second distri- p x |x = p x |x . (21)
Zt r x(t+1)
bution can be treated either as a small perturbation to each
step in the diffusion process, or often exactly multiplied Equation 21 may not correspond to a normalized proba-
into each diffusion step. Figures 3 and 5 demonstrate the bility distribution, so we choose p x(t) |x(t+1) to be the
use of a diffusion model to perform denoising and inpaint- corresponding normalized distribution
ing of natural images. The following sections describe how 1
p x(t) |x(t+1) = p x(t) |x(t+1) r x(t) ,
2
Recent experiments suggest that it is just as effective to in- Zt x(t+1)
stead use the same fixed t schedule as for binomial diffusion. (22)
0 0 0
50 50 50
100 100 100
150 150 150
200 200 200
250 250 250

(a) 0 50 100 150 200 250 (b) 0 50 100 150 200 250 (c) 0 50 100 150 200 250
Figure 4. The proposed framework trained on dead leaf images (Jeulin, 1997; Lee et al., 2001). (a) Example training image. (b) A sample
from the previous state of the art natural image model (Theis et al., 2012) trained on identical data, reproduced here with permission.
(c) A sample generated by the diffusion model. Note that it demonstrates fairly consistent occlusion relationships, displays a multiscale
distribution over object sizes, and produces circle-like objects, especially at smaller scales. As shown in Table 2, the diffusion model has
the highest log likelihood on the test set.
T t
where Zt x(t+1) is the normalization constant.

Another convenient choice is r x(t) = r x(0) T . Un-

For a Gaussian, each diffusion step is typically very sharply der this second choice r x(t) makes no contribution to the

peaked relative to r x(t) , due to its small variance. This starting distribution for the reverse trajectory. Thisguaran-
r (x(t) ) tees that drawing the initial sample from p x(T ) for the
means that r x(t+1) can be treated as a small perturbation reverse trajectory remains straightforward.
( )
to p x(t) |x(t+1) . A small perturbation to a Gaussian ef-
fects the mean, but not the normalization constant, so in 2.6. Entropy of Reverse Process
this case Equations 21 and 22 are equivalent (see Appendix
Since the forward process is known, we can derive upper
C).
and lower bounds on the conditional entropy of each step
in the reverse trajectory, and thus on the log likelihood,
2.5.3. A PPLYING r x(t)

If r x(t) is sufficiently smooth, then it can be treated Hq X(t) |X(t1) + Hq X(t1) |X(0) Hq X(t) |X(0)
as a small perturbation to the reverse diffusion kernel
Hq X(t1) |X(t) Hq X(t) |X(t1) ,
p x(t) |x(t+1) . In this case p x(t) |x(t+1) will have an

identical functional form to p x(t) |x(t+1) , but with per- (24)
turbed mean for the Gaussian kernel, or with perturbed flip where both the upper and lower bounds depend only on
rate for the binomial kernel. The perturbed diffusion ker- q x(1T ) |x(0) , and can be analytically computed. The
nels are given in Table App.1, and are derived for the Gaus- derivation is provided in Appendix A.
sian in Appendix C.

If r x(t) can be multiplied with a Gaussian (or binomial) 3. Experiments
distribution in closed form, then it can be directly multi-
We train diffusion probabilistic models on a variety of con-

plied with the reverse diffusion kernel p x(t) |x(t+1) in

closed form. This applies in the case where r x(t) continuous datasets, and a binary dataset. We then demonstrate
sists of a delta function for some subset of coordinates, as sampling from the trained model and inpainting of miss-
in the inpainting example in Figure 5. ing data, and compare model performance against other
techniques. In all cases the objective function and gradi-
2.5.4. C HOOSING r x(t)
ent were computed using Theano (Bergstra & Breuleux,
2010). Model training was with SFO (Sohl-Dickstein et al.,
Typically, r x(t) should be chosen to change slowly over 2014), except for CIFAR-10. CIFAR-10 results used the
the course of the trajectory. For the experiments in this 3
An earlier version of this paper reported higher log likeli-
paper we chose it to be constant,
hood bounds on CIFAR-10. These were the result of the model
learning the 8-bit quantization of pixel values in the CIFAR-10
dataset. The log likelihood bounds reported here are instead for
r x(t) = r x(0) . (23) data that has been pre-processed by adding uniform noise to re-
move pixel quantization, as recommended in (Theis et al., 2015).
0 0 0
50 50 50
100 100 100
150 150 150
200 200 200
250 250 250
(a) 3000 50 100 150 200 250 300 (b) 3000 50 100 150 200 250 300 (c) 3000 50 100 150 200 250 300
Figure 5. Inpainting. (a) A bark image from (Lazebnik et al.,

2005). (b) The same image with the central 100100 pixel region replaced
(T )
with isotropic Gaussian noise. This is the initialization p x for the reverse trajectory. (c) The central 100100 region has been
inpainted using a diffusion probabilistic model trained on images of bark, by sampling from the posterior distribution over the missing
region conditioned on the rest of the image. Note the long-range spatial structure, for instance in the crack
entering
on the left side of the
inpainted region. The sample from the posterior was generated as described in Section 2.5, where r x(0) was set to a delta function
for known data, and a constant for missing data.
Model Log Likelihood

Dataset K K Lnull Dead Leaves
MCGSM 1.244 bits/pixel
Swiss Roll 2.35 bits 6.45 bits
Binary Heartbeat -2.414 bits/seq. 12.024 bits/seq. Diffusion 1.489 bits/pixel
Bark -0.55 bits/pixel 1.5 bits/pixel MNIST
Dead Leaves 1.489 bits/pixel 3.536 bits/pixel Stacked CAE 174 2.3 bits
CIFAR-103 5.4 0.2 bits/pixel 11.5 0.2 bits/pixel DBN 199 2.9 bits
MNIST See table 2 Deep GSN 309 1.6 bits
Diffusion 317 2.7 bits
Table 1. The lower bound K on the log likelihood, computed on a Adversarial net 325 2.9 bits
holdout set, for each of the trained models. See Equation 12. The Perfect model 349 3.3 bits
right column is the improvement relative to an isotropic Gaussian
or independent binomial distribution. Lnull is the log likelihood Table 2. Log likelihood comparisons to other algorithms. Dead

of x (0)
. All datasets except for Binary Heartbeat were scaled leaves images were evaluated using identical training and test data
as in (Theis et al., 2012). MNIST log likelihoods were estimated
by a constant to give them variance 1 before computing log like-
using the Parzen-window code from (Goodfellow et al., 2014),
lihood.
with values given in bits, and show that our performance is com-
parable to other recent techniques. The perfect model entry was
computed by applying the Parzen code to samples from the train-
ing data.
open source implementation of the algorithm, and RM-

Sprop for optimization. The lower bound on the log like-
lihood provided by our model is reported for all datasets 3.1.2. B INARY H EARTBEAT D ISTRIBUTION
in Table 1. A reference implementation of the algorithm
A diffusion probabilistic model was trained on simple bi-
utilizing Blocks (van Merrienboer et al., 2015) is avail-
nary sequences of length 20, where a 1 occurs every 5th
able at https://github.com/Sohl-Dickstein/
time bin, and the remainder of the bins are 0, using a multi-
Diffusion-Probabilistic-Models.
layer perceptron to generate the Bernoulli rates fb x(t) , t
of the reverse trajectory.
The log likelihood under the true
3.1. Toy Problems
distribution is log2 15 = 2.322 bits per sequence. As
3.1.1. S WISS ROLL can be seen in Figure 2 and Table 1 learning was nearly
perfect. See Appendix Section D.1.2 for more details.
A diffusion probabilistic model was built of a two dimen-
sional swiss roll distribution, using a radial basis
function
3.2. Images
network to generate f x(t) , t and f x(t) , t . As illus-
trated in Figure 1, the swiss roll distribution was success- We trained Gaussian diffusion probabilistic models on sev-
fully learned. See Appendix Section D.1.1 for more details. eral image datasets. The multi-scale convolutional archi-
tecture shared by these experiments is described in Ap- the number of steps is made large, the reversal distribution
pendix Section D.2.1, and illustrated in Figure D.1. of each diffusion step becomes simple and easy to estimate.
The result is an algorithm that can learn a fit to any data dis-
3.2.1. DATASETS tribution, but which remains tractable to train, exactly sam-
ple from, and evaluate, and under which it is straightfor-
MNIST In order to allow a direct comparison against
ward to manipulate conditional and posterior distributions.
previous work on a simple dataset, we trained on MNIST
digits (LeCun & Cortes, 1998). Log likelihoods relative to
(Bengio et al., 2012; Bengio & Thibodeau-Laufer, 2013; Acknowledgements
Goodfellow et al., 2014) are given in Table 2. Samples
We thank Lucas Theis, Subhaneil Lahiri, Ben Poole,
from the MNIST model are given in Appendix Figure
Diederik P. Kingma, Taco Cohen, Philip Bachman, and
App.1. Our training algorithm provides an asymptotically
Aaron van den Oord for extremely helpful discussion, and
consistent lower bound on the log likelihood. However
Ian Goodfellow for Parzen-window code. We thank Khan
most previous reported results on continuous MNIST log
Academy and the Office of Naval Research for funding
likelihood rely on Parzen-window based estimates com-
Jascha Sohl-Dickstein, and we thank the Office of Naval
puted from model samples. For this comparison we there-
Research and the Burroughs-Wellcome, Sloan, and James
fore estimate MNIST log likelihood using the Parzen-
S. McDonnell foundations for funding Surya Ganguli.
window code released with (Goodfellow et al., 2014).
CIFAR-10 A probabilistic model was fit to the training References

images for the CIFAR-10 challenge dataset (Krizhevsky & Barron, J. T., Biggin, M. D., Arbelaez, P., Knowles, D. W.,
Hinton, 2009). Samples from the trained model are pro- Keranen, S. V., and Malik, J. Volumetric Semantic Seg-
vided in Figure 3. mentation Using Pyramid Context Features. In 2013
IEEE International Conference on Computer Vision, pp.
Dead Leaf Images Dead leaf images (Jeulin, 1997; Lee 34483455. IEEE, December 2013. ISBN 978-1-4799-
et al., 2001) consist of layered occluding circles, drawn 2840-8. doi: 10.1109/ICCV.2013.428.
from a power law distribution over scales. They have an an-
Bengio, Y. and Thibodeau-Laufer, E. Deep genera-
alytically tractable structure, but capture many of the statis-
tive stochastic networks trainable by backprop. arXiv
tical complexities of natural images, and therefore provide
preprint arXiv:1306.1091, 2013.
a compelling test case for natural image models. As illus-
trated in Table 2 and Figure 4, we achieve state of the art Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. Bet-
performance on the dead leaves dataset. ter Mixing via Deep Representations. arXiv preprint
arXiv:1207.4404, July 2012.
Bark Texture Images A probabilistic model was trained
on bark texture images (T01-T04) from (Lazebnik et al., Bergstra, J. and Breuleux, O. Theano: a CPU and GPU
2005). For this dataset we demonstrate that it is straightfor- math expression compiler. Proceedings of the Python
ward to evaluate or generate from a posterior distribution, for Scientific Computing Conference (SciPy), 2010.
by inpainting a large region of missing data using a sample Besag, J. Statistical Analysis of Non-Lattice Data. The
from the model posterior in Figure 5. Statistician, 24(3), 179-195, 1975.
Bishop, C., Svensen, M., and Williams, C. GTM: The gen-

4. Conclusion
erative topographic mapping. Neural computation, 1998.
We have introduced a novel algorithm for modeling proba-
bility distributions that enables exact sampling and evalua- Bornschein, J. and Bengio, Y. Reweighted Wake-Sleep.
tion of probabilities and demonstrated its effectiveness on a International Conference on Learning Representations,
variety of toy and real datasets, including challenging natu- June 2015.
ral image datasets. For each of these tests we used a similar Burda, Y., Grosse, R. B., and Salakhutdinov, R. Accu-
basic algorithm, showing that our method can accurately rate and Conservative Estimates of MRF Log-likelihood
model a wide variety of distributions. Most existing den- using Reverse Annealing. arXiv:1412.8566, December
sity estimation techniques must sacrifice modeling power 2014.
in order to stay tractable and efficient, and sampling or
evaluation are often extremely expensive. The core of our Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. The
algorithm consists of estimating the reversal of a Markov helmholtz machine. Neural computation, 7(5):889904,
diffusion chain which maps data to a noise distribution; as 1995.
Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear Kavukcuoglu, K., Ranzato, M., and LeCun, Y. Fast infer-
Independent Components Estimation. arXiv:1410.8516, ence in sparse coding algorithms with applications to ob-
pp. 11, October 2014. ject recognition. arXiv preprint arXiv:1010.3467, 2010.
Feller, W. On the theory of stochastic processes, with par- Kingma, D. P. and Welling, M. Auto-Encoding Variational
ticular reference to applications. In Proceedings of the Bayes. International Conference on Learning Represen-
[First] Berkeley Symposium on Mathematical Statistics tations, December 2013.
and Probability. The Regents of the University of Cali-
fornia, 1949. Krizhevsky, A. and Hinton, G. Learning multiple layers of
features from tiny images. Computer Science Depart-
Gershman, S. J. and Blei, D. M. A tutorial on Bayesian ment University of Toronto Tech. Rep., 2009.
nonparametric models. Journal of Mathematical Psy-
chology, 56(1):112, 2012. Langevin, P. Sur la theorie du mouvement brownien. CR
Acad. Sci. Paris, 146(530-533), 1908.
Gneiting, T. and Raftery, A. E. Strictly proper scoring rules,
prediction, and estimation. Journal of the American Sta- Larochelle, H. and Murray, I. The neural autoregressive
tistical Association, 102(477):359378, 2007. distribution estimator. Journal of Machine Learning Re-
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., search, 2011.
Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Lazebnik, S., Schmid, C., and Ponce, J. A sparse texture
Y. Generative Adversarial Nets. Advances in Neural representation using local affine regions. Pattern Analy-
Information Processing Systems, 2014. sis and Machine Intelligence, IEEE Transactions on, 27
Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wier- (8):12651278, 2005.
stra, D. Deep AutoRegressive Networks. arXiv preprint
LeCun, Y. and Cortes, C. The MNIST database of hand-
arXiv:1310.8499, October 2013.
written digits. 1998.
Grosse, R. B., Maddison, C. J., and Salakhutdinov, R. An-
nealing between distributions by averaging moments. In Lee, A., Mumford, D., and Huang, J. Occlusion models for
Advances in Neural Information Processing Systems, pp. natural images: A statistical study of a scale-invariant
27692777, 2013. dead leaves model. International Journal of Computer
Vision, 2001.
Hinton, G. E. Training products of experts by minimiz-
ing contrastive divergence. Neural Computation, 14(8): Lyu, S. Unifying Non-Maximum Likelihood Learning Ob-
17711800, 2002. jectives with Minimum KL Contraction. Advances in
Neural Information Processing Systems 24, pp. 6472,
Hinton, G. E. The wake-sleep algorithm for unsupervised 2011.
neural networks ). Science, 1995.
MacKay, D. Bayesian neural networks and density net-
Hyvarinen, A. Estimation of non-normalized statistical works. Nuclear Instruments and Methods in Physics Re-
models using score matching. Journal of Machine search Section A: Accelerators, Spectrometers, Detec-
Learning Research, 6:695709, 2005. tors and Associated Equipment, 1995.
Jarzynski, C. Equilibrium free-energy differences from
Murphy, K. P., Weiss, Y., and Jordan, M. I. Loopy be-
nonequilibrium measurements: A master-equation ap-
lief propagation for approximate inference: An empiri-
proach. Physical Review E, January 1997.
cal study. In Proceedings of the Fifteenth conference on
Jarzynski, C. Equalities and inequalities: irreversibility Uncertainty in artificial intelligence, pp. 467475. Mor-
and the second law of thermodynamics at the nanoscale. gan Kaufmann Publishers Inc., 1999.
Annu. Rev. Condens. Matter Phys., 2011.
Neal, R. Annealed importance sampling. Statistics and
Jeulin, D. Dead leaves models: from space tesselation to Computing, January 2001.
random functions. Proc. of the Symposium on the Ad-
vances in the Theory and Applications of Random Sets, Ozair, S. and Bengio, Y. Deep Directed Generative Au-
1997. toencoders. arXiv:1410.0630, October 2014.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, Parry, M., Dawid, A. P., Lauritzen, S., and Others. Proper
L. K. An introduction to variational methods for graphi- local scoring rules. The Annals of Statistics, 40(1):561
cal models. Machine learning, 37(2):183233, 1999. 592, 2012.
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochas- Theis, L., van den Oord, A., and Bethge, M. A note
tic Backpropagation and Approximate Inference in Deep on the evaluation of generative models. arXiv preprint
Generative Models. Proceedings of the 31st Inter- arXiv:1511.01844, 2015.
national Conference on Machine Learning (ICML-14),
January 2014. Uria, B., Murray, I., and Larochelle, H. RNADE:
The real-valued neural autoregressive density-estimator.
Rippel, O. and Adams, R. P. High-Dimensional Advances in Neural Information Processing Systems,
Probability Estimation with Deep Density Models. 2013a.
arXiv:1410.8516, pp. 12, February 2013.
Uria, B., Murray, I., and Larochelle, H. A Deep and
Schmidhuber, J. Learning factorial codes by predictability Tractable Density Estimator. arXiv:1310.1757, pp. 9,
minimization. Neural Computation, 1992. October 2013b.
Sminchisescu, C., Kanaujia, A., and Metaxas, D. Learning van Merrienboer, B., Chorowski, J., Serdyuk, D., Bengio,
joint top-down and bottom-up processes for 3D visual Y., Bogdanov, D., Dumoulin, V., and Warde-Farley, D.
inference. In Computer Vision and Pattern Recognition, Blocks and Fuel. Zenodo, May 2015. doi: 10.5281/
2006 IEEE Computer Society Conference on, volume 2, zenodo.17721.
pp. 17431752. IEEE, 2006.
Welling, M. and Hinton, G. A new learning algorithm for
Sohl-Dickstein, J., Battaglino, P., and DeWeese, M. New mean field Boltzmann machines. Lecture Notes in Com-
Method for Parameter Estimation in Probabilistic Mod- puter Science, January 2002.
els: Minimum Probability Flow. Physical Review Let-
ters, 107(22):1114, November 2011a. ISSN 0031- Yao, L., Ozair, S., Cho, K., and Bengio, Y. On the Equiv-
9007. doi: 10.1103/PhysRevLett.107.220601. alence Between Deep NADE and Generative Stochastic
Networks. In Machine Learning and Knowledge Discov-
Sohl-Dickstein, J., Battaglino, P. B., and DeWeese, ery in Databases, pp. 322336. Springer, 2014.
M. R. Minimum Probability Flow Learning. Interna-
tional Conference on Machine Learning, 107(22):11
14, November 2011b. ISSN 0031-9007. doi: 10.1103/
PhysRevLett.107.220601.
Sohl-Dickstein, J., Poole, B., and Ganguli, S. Fast large-
scale optimization by unifying stochastic gradient and
quasi-Newton methods. In Proceedings of the 31st Inter-
national Conference on Machine Learning (ICML-14),
pp. 604612, 2014.
Spinney, R. and Ford, I. Fluctuation Relations : A Peda-
gogical Overview. arXiv preprint arXiv:1201.6381, pp.
356, 2013.
Stuhlmuller, A., Taylor, J., and Goodman, N. Learning
stochastic inverses. Advances in Neural Information
Processing Systems, 2013.
Suykens, J. and Vandewalle, J. Nonconvex optimization
using a Fokker-Planck learning machine. In 12th Euro-
pean Conference on Circuit Theory and Design, 1995.
T, P. Convergence condition of the TAP equation for the
infinite-ranged Ising spin glass model. J. Phys. A: Math.
Gen. 15 1971, 1982.
Tanaka, T. Mean-field theory of Boltzmann machine learn-
ing. Physical Review Letters E, January 1998.
Theis, L., Hosseini, R., and Bethge, M. Mixtures of condi-
tional Gaussian scale mixtures applied to multiscale im-
age representations. PloS one, 7(7):e39857, 2012.
Appendix
A. Conditional Entropy Bounds Derivation

The conditional entropy Hq X(t1) |X(t) of a step in the reverse trajectory is

Hq X(t1) , X(t) = Hq X(t) , X(t1) (25)

Hq X(t1) |X(t) + Hq X(t) = Hq X(t) |X(t1) + Hq X(t1) (26)

Hq X(t1) |X(t) = Hq X(t) |X(t1) + Hq X(t1) Hq X(t) (27)
An upper bound on the entropy change can be constructed by observing that (y) is the maximum entropy distribution.
This holds without qualification for the binomial distribution, and holds for variance 1 training data for the Gaussian case.
For the Gaussian case, training data must therefore be scaled to have unit norm for the following equalities to hold. It need
not be whitened. The upper bound is derived as follows,

Hq X(t) Hq X(t1) (28)

Hq X(t1) Hq X(t) 0 (29)

Hq X(t1) |X(t) Hq X(t) |X(t1) . (30)
A lower bound on the entropy difference can be established by observing that additional steps in a Markov chain do not
increase the information available about the initial state in the chain, and thus do not decrease the conditional entropy of
the initial state,

Hq X(0) |X(t) Hq X(0) |X(t1) (31)

Hq X(t1) Hq X(t) Hq X(0) |X(t1) + Hq X(t1) Hq X(0) |X(t) Hq X(t) (32)

Hq X(t1) Hq X(t) Hq X(0) , X(t1) Hq X(0) , X(t) (33)

Hq X(t1) Hq X(t) Hq X(t1) |X(0) Hq X(t) |X(0) (34)

Hq X(t1) |X(t) Hq X(t) |X(t1) + Hq X(t1) |X(0) Hq X(t) |X(0) . (35)
Combining these expressions, we bound the conditional entropy for a single step,

Hq X(t) |X(t1) Hq X(t1) |X(t) Hq X(t) |X(t1) + Hq X(t1) |X(0) Hq X(t) |X(0) , (36)

where both the upper and lower bounds depend only on the conditional forward trajectory q x(1T ) |x(0) , and can be
analytically computed.
B. Log Likelihood Lower Bound

The lower bound on the log likelihood is
LK (37)
" T #
p x(t1) |x(t)
Z Y
(0T ) (0T ) (T )
K = dx q x log p x (38)
t=1
q x(t) |x(t1)
(39)

B.1. Entropy of p X(T )

We can peel off the contribution from p X(T ) , and rewrite it as an entropy,
T # Z "
(t1) (t)
p x |x
Z X
K = dx(0T ) q x(0T ) log + dx(T ) q x(T ) log p x(T ) (40)
t=1
q x(t) |x(t1)
T
" # Z
p x(t1) |x(t)
Z X
(0T ) (0T ) + dx(T ) q x(T ) log xT

= dx q x log (t) (t1)
(41)
t=1
q x |x
. (42)

By design, the cross entropy to x(t) is constant under our diffusion kernels, and equal to the entropy of p x(T ) .
Therefore,
T Z
" #
X
(0T )

(0T )
p x(t1) |x(t)
K= dx q x log Hp X(T ) . (43)
(t)
q x |x (t1)
t=1
B.2. Remove the edge effect at t = 0

In order to avoid edge effects, we set the final step of the reverse trajectory to be identical to the corresponding forward
diffusion step,
x(0)
(0) (1) (1) (0) = T x(0) |x(1) ; 1 .
p x |x =q x |x (44)
x(1)
We then use this equivalence to remove the contribution of the first time-step in the sum,
T Z
" # Z " #
X
(0T ) (0T )
p x(t1) |x(t) (0) (1)

(0) (1)
q x(1) |x(0) x(0)
K= dx q x log + dx dx q x , x log Hp X(T )
t=2
q x(t) |x(t1) q x(1) |x(0) x(1)
(45)
T
" #
p x(t1) |x(t)
X Z
= dx(0T ) q x(0T ) log Hp X(T ) , (46)
t=2
q x(t) |x(t1)

dx(t) q x(t) log x(t) = Hp X(T ) is a constant for all t.
R
where we again used the fact that by design

B.3. Rewrite in terms of posterior q x(t1) |x(0)
Because the forward trajectory is a Markov process,
T Z
" #
X
(0T )

(0T )
p x(t1) |x(t)
t=2
q x(t) |x(t1) , x(0)
Using Bayes rule we can rewrite this in terms of a posterior and marginals from the forward trajectory,
T Z
" #
X
(0T )

(0T )
p x(t1) |x(t) q x(t1) |x(0)
t=2
q x(t1) |x(t) , x(0) q x(t) |x(0)
Figure App.1. Samples from a diffusion probabilistic model trained on MNIST digits. Note that unlike many MNIST sample figures,
these are true samples rather than the mean of the Gaussian or binomial distribution from which samples would be drawn.
B.4. Rewrite in terms of KL divergences and entropies

We then recognize that several terms are conditional entropies,
T Z
" # T h
(t1) (t)
X
(0T )

(0T )
p x |x X
(t) (0)

(t1) (0)
i
(T )

K= dx q x log + Hq X |X H q X |X H p X
t=2
q x(t1) |x(t) , x(0) t=2
(49)
T
" #
p x(t1) |x(t)
X Z
= dx(0T ) q x(0T ) log + Hq X(T ) |X(0) Hq X(1) |X(0) Hp X(T ) .
t=2
q x(t1) |x(t) , x(0)
(50)
Finally we transform the log ratio of probability distributions into a KL divergence,

T Z
X
K= dx(0) dx(t) q x(0) , x(t) DKL q x(t1) |x(t) , x(0) p x(t1) |x(t) (51)

t=2

+ Hq X(T ) |X(0) Hq X(1) |X(0) Hp X(T ) .
Note that the entropies can be analytically computed, and the KL divergence can be analytically computed given x(0) and
x(t) .
Gaussian Binomial

Well behaved (analytically x(T ) = N x(T ) ; 0, I B x(T ) ; 0.5
tractable) distribution

Forward diffusion kernel q x(t) |x(t1) = N x(t) ; x(t1) 1 t , It B x(t) ; x(t1) (1 t ) + 0.5t

Reverse diffusion kernel p x(t1) |x(t) = N x(t1) ; f x(t) , t , f x(t) , t B x(t1) ; fb x(t) , t

Training targets f x(t) , t , f x(t) , t , 1T fb x(t) , t
QT
Forward distribution q x(0T ) = q x(0) t=1 q x(t) |x(t1)
QT
Reverse distribution p x(0T ) = x (T )
x(t1) |x(t)
t=1 p

dx(0) q x(0) log p x(0)
R
Log likelihood L=
PT
Lower bound on log likelihood K= t=2 Eq(x(0) ,x(t) ) DKL q x(t1) |x(t) , x(0) p x(t1) |x(t) + Hq X(T ) |X(0) Hq X(1) |X(0) Hp X(T )
!
log r x(t1)0

ct1 dt1

(t1)
Perturbed reverse diffusion kernel p x(t1) |x(t) = N x(t1) (t) (t)
; f x , t + f x , t x (t1)0

(t1)0 , f x (t)
, t B x i ; t1 t1
i i
t1 t1
xi di +(1ci )(1di )
x =f (x(t) ,t)
Table App.1. The key equations in this paper for the specific cases of Gaussian and binomial diffusion processes. N (u; , ) is a Gaussian distribution with mean and covariance
. B (u; r) is the distribution for a single Bernoulli
h with u = 1 occurring
trial,i probability r, and u = 0 occurring with probability 1 r. Finally, for the perturbed Bernoulli
with
(t)
trials bti = x(t1) (1 t ) + 0.5t , cti = fb x(t+1) , t , and dti = r xi = 1 , and the distribution is given for a single bit i.
i
C. Perturbed Gaussian Transition

We wish to compute p x(t1) | x(t) . For notational simplicity, let = f x(t) , t , = f x(t) , t , and y = x(t1) .
Using this notation,

p y | x(t) p y | x(t) r (y) (52)
= N (y; , ) r (y) . (53)
We can rewrite this in terms of energy functions, where Er (y) = log r (y),

p y | x(t) exp [E (y)] (54)
1 T
E (y) = (y ) 1 (y ) + Er (y) . (55)
2
T
If Er (y) is smooth relative to 21 (y ) 1 (y ), then we can approximate it using its Taylor expansion around .
One sufficient condition is that the eigenvalues of the Hessian of Er (y) are everywhere much smaller magnitude than the
eigenvalues of 1 . We then have
Er (y) Er () + (y ) g (56)

Er (y0 )
where g = y0

. Plugging this in to the full energy,
y0 =
1 T T
E (y) (y ) 1 (y ) + (y ) g + constant (57)
2
1 1 1 1 1
= yT 1 y yT 1 T 1 y + yT 1 g + gT 1 y + constant (58)
2 2 2 2 2
1 T 1
= (y + g) (y + g) + constant. (59)
2
This corresponds to a Gaussian,

p y | x(t) N (y; g, ) . (60)
Substituting back in the original formalism, this is,

0

log r x(t1)
p x(t1) | x(t) N x(t1) ; f x(t) , t + f x(t) , t , f x(t) , t . (61)

x(t1)0

x(t1)0 =f (x(t) ,t)

D. Experimental Details for each pixel i,

D.1. Toy Problems J
X
zi =
yij gj (t) . (62)
D.1.1. S WISS ROLL j=1
A probabilistic model was built of a two dimensionalswiss The bump functions consist of
roll distribution. The generative model p x(0T ) con-
sisted of 40 time steps of Gaussian diffusion initialized 2
exp 2w1 2 (t j )
at an identity-covariance Gaussian distribution. A (nor- gj (t) = P , (63)
J 1 2
malized) radial basis function network with a single hid- k=1 exp 2w 2 (t k )
den layer and 16 hidden units was trained to generate the
mean and covariance functions f x(t) , t and a diago- where j (0, T ) is the bump center, and w is the spacing
between bump centers. z is generated in an identical way,

nal f x(t) , t for the reverse trajectory. The top, read-
out, layer for each function was learned independently for but using y .
each time step, but for all other layers weights were shared For all image experiments a number of timesteps T = 1000
across all time steps
and both functions. The top layer out- was used, except for the bark dataset which used T = 500.
put of f x(t) , t was passed through a sigmoid to restrict
it between 0 and 1. As can be seen in Figure 1, the swiss Mean and Variance Finally, these outputs are combined
roll distribution was successfully learned. to produce a diffusion mean and variance prediction for
each pixel i,
D.1.2. B INARY H EARTBEAT D ISTRIBUTION
ii = zi + 1 (t ) ,

(64)
A probabilistic model was trained on simple binary se-
quences of length 20, where a 1 occurs every 5th time i = (xi zi ) (1 ii ) + zi . (65)
bin, and the remainder of the bins are 0. The generative
where both and are parameterized as a perturbation
model consisted of 2000 time steps of binomial diffusion
around the forward diffusion kernel T x(t) |x(t1) ; t ,
initialized at an independent binomial
distribution
with the
(T ) and zi is the mean of the equilibrium distribution that
same mean activity as the data (p xi = 1 = 0.2). A would result from applying p x(t1) |x(t) many times.
multilayer perceptron with sigmoid nonlinearities, 20 in- is restricted to be a diagonal matrix.
put units and three hidden layers with 50 units each was
trained to generate the Bernoulli rates fb x(t) , t of the re- Multi-Scale Convolution We wish to accomplish goals
verse trajectory. The top, readout, layer was learned inde- that are often achieved with pooling networks specif-
pendently for each time step, but for all other layers weights ically, we wish to discover and make use of long-range
were shared across all time steps. The top layer output was and multi-scale dependencies in the training data. How-
passed through a sigmoid to restrict it between 0 and 1. As ever, since the network output is a vector of coefficients
can be seen in Figure 2, the heartbeat distribution was suc- for every pixel it is important to generate a full resolution
cessfully learned. The log
likelihood under the true gener- rather than down-sampled feature map. We therefore define
ating process is log2 51 = 2.322 bits per sequence. As multi-scale-convolution layers that consist of the following
can be seen in Figure 2 and Table 1 learning was nearly steps:
perfect.
1. Perform mean pooling to downsample the image to
D.2. Images multiple scales. Downsampling is performed in pow-
ers of two.
D.2.1. A RCHITECTURE
2. Performing convolution at each scale.
Readout In all cases, a convolutional network was used 3. Upsample all scales to full resolution, and sum the re-
to produce a vector of outputs yi R2J for each image sulting images.
pixel i. The entries in yi are divided into two equal sized 4. Perform a pointwise nonlinear transformation, con-
subsets, y and y . sisting of a soft relu (log [1 + exp ()]).
The composition of the first three linear operations resem-

bles convolution by a multiscale convolution kernel, up to
Temporal Dependence The convolution output y is blocking artifacts introduced by upsampling. This method
used as per-pixel weighting coefficients in a sum over time- of achieving multiscale convolution was described in (Bar-
dependent bump functions, generating an output zi R ron et al., 2013).
Mean Covariance
image image
Temporal Temporal
coefficients coefficients
Convolution
1x1 kernel
Convolution
1x1 kernel
Multi-scale
Dense
convolution
Multi-scale
Dense
convolution
Input

Figure D.1. Network architecture for mean function f x(t) , t

and covariance function f x(t) , t , for experiments in Section
3.2. The input image x(t) passes through several layers of multi-
scale convolution (Section D.2.1). It then passes through several
convolutional layers with 1 1 kernels. This is equivalent to a
dense transformation performed on each pixel. A linear transfor-
mation generates coefficients for readout of both mean (t) and
covariance (t) for each pixel. Finally, a time dependent readout
function converts those coefficients into mean and covariance im-
ages, as described in Section D.2.1. For CIFAR-10 a dense (or
fully connected) pathway was used in parallel to the multi-scale
convolutional pathway. For MNIST, the dense pathway was used
to the exclusion of the multi-scale convolutional pathway.
Dense Layers Dense (acting on the full image vector)

and kernel-width-1 convolutional (acting separately on the
feature vector for each pixel) layers share the same form.
They consist of a linear transformation, followed by a tanh
nonlinearity.

Deep Unsupervised Learning Using Nonequilibrium Thermodynamics

Caricato da

Informazioni sul documento

Titolo originale

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Deep Unsupervised Learning Using Nonequilibrium Thermodynamics

Caricato da

Copyright:

Formati disponibili

Deep Unsupervised Learning using

Jascha Sohl-Dickstein JASCHA @ STANFORD . EDU

Niru Maheswaranathan NIRUM @ STANFORD . EDU

Abstract these models are unable to aptly describe structure in rich

1. Introduction We present a novel way to define probabilistic models that

100 100 100

150 150 150

200 200 200

250 250 250

100 100 100

150 150 150

200 200 200

250 250 250

Figure 5. Inpainting. (a) A bark image from (Lazebnik et al.,

Model Log Likelihood

open source implementation of the algorithm, and RM-

CIFAR-10 A probabilistic model was fit to the training References

Bishop, C., Svensen, M., and Williams, C. GTM: The gen-

B. Log Likelihood Lower Bound

B.2. Remove the edge effect at t = 0

B.4. Rewrite in terms of KL divergences and entropies

Finally we transform the log ratio of probability distributions into a KL divergence,

C. Perturbed Gaussian Transition

Substituting back in the original formalism, this is,

D. Experimental Details for each pixel i,

The composition of the first three linear operations resem-

Dense Layers Dense (acting on the full image vector)

Potrebbero piacerti anche