Automatic Differentiation Variational Inference

Philip Schulz and Wilker Aziz
https://github.com/philschulz/VITutorial

What we know so far

- DGMs: probabilistic models parameterised by neural networks
- Objective: a lower bound on the log-likelihood (the ELBO)
  - it cannot be computed exactly, so we resort to Monte Carlo estimation
- But the MC estimator is not differentiable
- Score function estimator: applicable to any model
- Reparameterised gradients: so far seemingly applicable only to Gaussian variables

Outline

- Multivariate calculus recap
- Reparameterised gradients revisited
- ADVI
- Example

Multivariate calculus recap

Let x ∈ R^K and let T : R^K → R^K be differentiable and invertible
- y = T(x)
- x = T^{-1}(y)

Jacobian

The Jacobian matrix J_T(x) of T evaluated at x is the matrix of partial derivatives

    J_ij = ∂y_i / ∂x_j

Inverse function theorem

    J_{T^{-1}}(y) = (J_T(x))^{-1}

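As a quick numerical sanity check of the inverse function theorem (a minimal sketch of my own; the map T below is an arbitrary illustrative choice, not from the slides):

```python
import numpy as np

# Illustrative invertible map T(x) = (exp(x1), x1 + x2^3) with inverse
# T^{-1}(y) = (log(y1), cbrt(y2 - log(y1)))
def T(x):
    return np.array([np.exp(x[0]), x[0] + x[1] ** 3])

def T_inv(y):
    return np.array([np.log(y[0]), np.cbrt(y[1] - np.log(y[0]))])

def jacobian(f, v, eps=1e-6):
    """Central finite-difference Jacobian J_ij = ∂f_i/∂v_j."""
    K = len(v)
    J = np.zeros((K, K))
    for j in range(K):
        dv = np.zeros(K)
        dv[j] = eps
        J[:, j] = (f(v + dv) - f(v - dv)) / (2 * eps)
    return J

x = np.array([0.3, -1.2])
y = T(x)
# J_{T^{-1}}(y) equals the matrix inverse of J_T(x)
print(np.allclose(jacobian(T_inv, y), np.linalg.inv(jacobian(T, x)), atol=1e-4))  # True
```
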
Differential (or infinitesimal)

The differential dx of x refers to an infinitely small change in x

We can relate the differential dy of y = T(x) to dx
- Scalar case

      dy = T'(x) dx = (dy/dx) dx = (d/dx T(x)) dx

  where dy/dx is the derivative of y with respect to x
- Multivariate case

      dy = |det J_T(x)| dx

  where the absolute value absorbs the orientation

Integration by substitution

We can integrate a function g(x) by substituting x = T^{-1}(y)

    ∫ g(x) dx = ∫ g(T^{-1}(y)) |det J_{T^{-1}}(y)| dy

and similarly for a function h(y)

    ∫ h(y) dy = ∫ h(T(x)) |det J_T(x)| dx

Change of density

Let X take on values in R^K with density p_X(x), and recall that y = T(x) and x = T^{-1}(y)

Then T induces a density p_Y(y) expressed as

    p_Y(y) = p_X(T^{-1}(y)) |det J_{T^{-1}}(y)|

and it then follows that

    p_X(x) = p_Y(T(x)) |det J_T(x)|

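As a small illustration (my own, not from the slides), take X ∼ N(0, 1) and T(x) = exp(x); the density induced by the change-of-density formula should match the standard log-normal:

```python
import numpy as np
from scipy.stats import norm, lognorm

# X ~ N(0, 1), Y = T(X) = exp(X), so T^{-1}(y) = log(y) and |det J_{T^{-1}}(y)| = 1/y
y = np.linspace(0.1, 5.0, 50)
p_Y = norm.pdf(np.log(y)) * (1.0 / y)   # p_X(T^{-1}(y)) |det J_{T^{-1}}(y)|

print(np.allclose(p_Y, lognorm(s=1.0).pdf(y)))  # True: Y is standard log-normal
```
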
Revisiting reparameterised gradients

Let Z take on values in R^K with pdf q(z|λ)

The idea is to count on a reparameterisation: a transformation S_λ : R^K → R^K such that

    S_λ(z) = ϵ ∼ π(ϵ)
    S_λ^{-1}(ϵ) = z ∼ q(z|λ)

- π(ϵ) does not depend on the parameters λ: we call it a base density
- S_λ(z) absorbs the dependency on λ

Reparameterised expectations

If we are interested in

    E_{q(z|λ)}[g(z)]
      = ∫ q(z|λ) g(z) dz
      = ∫ π(S_λ(z)) |det J_{S_λ}(z)| g(z) dz                                        (change of density)
      = ∫ π(ϵ) |det J_{S_λ^{-1}}(ϵ)|^{-1} g(S_λ^{-1}(ϵ)) |det J_{S_λ^{-1}}(ϵ)| dϵ   (inverse function theorem; change of variable z = S_λ^{-1}(ϵ))
      = ∫ π(ϵ) g(S_λ^{-1}(ϵ)) dϵ
      = E_{π(ϵ)}[g(S_λ^{-1}(ϵ))]

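A minimal numerical check (my own illustration) that the two expectations agree, using the Gaussian case S_λ^{-1}(ϵ) = µ + σϵ and g(z) = sin(z):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
g = np.sin
M = 1_000_000

# Left-hand side: sample z ~ q(z|λ) = N(µ, σ²) directly
lhs = g(rng.normal(mu, sigma, size=M)).mean()

# Right-hand side: sample ϵ ~ π(ϵ) = N(0, 1) and push through S_λ^{-1}(ϵ) = µ + σϵ
rhs = g(mu + sigma * rng.normal(0.0, 1.0, size=M)).mean()

print(lhs, rhs)  # both ≈ sin(µ) exp(-σ²/2) ≈ 0.78
```
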
Reparameterised gradients

For optimisation, we need tractable gradients

    ∂/∂λ E_{q(z|λ)}[g(z)] = ∂/∂λ E_{π(ϵ)}[g(S_λ^{-1}(ϵ))]

Since the base density π(ϵ) does not depend on λ, we can push the derivative inside the expectation and obtain an MC gradient estimate

    ∂/∂λ E_{q(z|λ)}[g(z)] = E_{π(ϵ)}[∂/∂λ g(S_λ^{-1}(ϵ))]
                          ≈ (1/M) Σ_{i=1}^M ∂/∂λ g(S_λ^{-1}(ϵ_i)),    ϵ_i ∼ π(ϵ)

Reparameterised gradients: Gaussian

We have seen one case, namely ϵ ∼ N(0, I) and Z ∼ N(µ, σ²)

Then

    Z = µ + σϵ

and

    ∂/∂λ E_{N(z|µ,σ²)}[g(z)] = E_{N(ϵ|0,I)}[∂/∂λ g(z = µ + σϵ)]
                             = E_{N(ϵ|0,I)}[∂/∂z g(z = µ + σϵ) · ∂z/∂λ]

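A sketch of this estimator in PyTorch (my own example; with g(z) = z² we have E[Z²] = µ² + σ², so the exact gradients are ∂/∂µ = 2µ and ∂/∂σ = 2σ):

```python
import torch

torch.manual_seed(0)
mu = torch.tensor(1.0, requires_grad=True)
sigma = torch.tensor(0.5, requires_grad=True)

g = lambda z: z ** 2
M = 100_000

eps = torch.randn(M)        # ϵ ~ N(0, I): does not depend on λ = (µ, σ)
z = mu + sigma * eps        # S_λ^{-1}(ϵ): all dependence on λ is explicit
g(z).mean().backward()      # MC estimate of E_{q(z|λ)}[g(z)], differentiated by autodiff

print(mu.grad, sigma.grad)  # ≈ 2µ = 2.0 and 2σ = 1.0
```
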
Beyond

Many interesting densities cannot easily be reparameterised
- Beta
- Gamma
- Weibull
- Dirichlet
- von Mises-Fisher

Automatic Differentiation VI

Motivation
- many models have intractable posteriors: their normalising constants (the evidence) lack analytic solutions
- but many models are differentiable: that is the main constraint for using NNs

Reparameterised gradients are a step towards automatising VI for differentiable models
- but not every model of interest employs random variables for which a reparameterisation is known

Example: Weibull-Poisson model

Suppose we have some count data which we assume to be Poisson-distributed

    X|z ∼ Poisson(z)          z ∈ R_{>0}

and suppose we want to impose a Weibull prior on the Poisson rate

    z|r, k ∼ Weibull(r, k)    r ∈ R_{>0}, k ∈ R_{>0}

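A generative sketch of this model with torch.distributions (my own illustration; I am assuming r plays the role of the Weibull scale and k the shape, matching torch's (scale, concentration) parameterisation):

```python
import torch
from torch.distributions import Weibull, Poisson

torch.manual_seed(0)
r, k = torch.tensor(2.0), torch.tensor(1.5)      # Weibull scale and shape (assumed roles)

z = Weibull(scale=r, concentration=k).sample()   # latent Poisson rate, z ∈ R_{>0}
x = Poisson(rate=z).sample()                     # observed count
print(z.item(), x.item())
```
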
VI for Weibull-Poisson model

Generative model

    p(x, z|r, k) = p(z|r, k) p(x|z)

Marginal

    p(x|r, k) = ∫_{R_{>0}} p(z|r, k) p(x|z) dz

ELBO

    E_{q(z|λ)}[log p(x, z|r, k)] + H(q(z|λ))

Can we make q(z|λ) Gaussian?
No! supp(N(z|µ, σ²)) = R

Strategy

Build a change of variable into the model: with z = exp(ζ),

    p(x, z|r, k) = p(z|r, k) p(x|z)
                 = Weibull(z|r, k) Poisson(x|z)
                 = Weibull(exp(ζ)|r, k) Poisson(x|exp(ζ)) |det J_exp(ζ)|

ELBO

    E_{q(ζ|λ)}[. . .] + H(q(ζ))

Can we use a Gaussian approximate posterior? Yes!

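A sketch of the transformed log-joint (my own code): for a scalar latent with z = exp(ζ), the Jacobian of ζ ↦ exp(ζ) is exp(ζ), so log |det J_exp(ζ)| = ζ.

```python
import torch
from torch.distributions import Weibull, Poisson

def log_joint_zeta(x, zeta, r, k):
    """log p(x, ζ | r, k) for the Weibull-Poisson model under z = exp(ζ)."""
    z = torch.exp(zeta)
    log_prior = Weibull(scale=r, concentration=k).log_prob(z)   # log Weibull(z | r, k)
    log_lik = Poisson(rate=z).log_prob(x)                       # log Poisson(x | z)
    return log_prior + log_lik + zeta                           # + log |det J_exp(ζ)| = ζ

print(log_joint_zeta(torch.tensor(4.0), torch.tensor(0.3),
                     torch.tensor(2.0), torch.tensor(1.5)))
```
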
Differentiable models

We focus on differentiable probability models

    p(x, z) = p(x|z) p(z)

- members of this class have continuous latent variables z
- and the gradient ∇_z log p(x, z) is valid within the support of the prior

      supp(p(z)) = {z ∈ R^K : p(z) > 0} ⊆ R^K

Why do we need differentiable models?

Recall the gradient of the ELBO

    ∂/∂λ E_{q(z;λ)}[log p(x, z)] + ∂/∂λ H(q(z; λ))

Reparameterisation requires gradients of the model with respect to z

    ∂/∂λ E_{q(z;λ)}[log p(x, z)] = E_{π(ϵ)}[∂/∂λ log p(x, z = S_λ^{-1}(ϵ))]
                                 = E_{π(ϵ)}[∂/∂z log p(x, z) · ∂/∂λ S_λ^{-1}(ϵ)]

VI optimisation problem

Let's focus on the design and optimisation of the variational approximation

    argmin_{q(z)} KL(q(z) || p(z|x))

To automatise the search for a variational approximation q(z) we must ensure that

    supp(q(z)) ⊆ supp(p(z|x))

- otherwise the KL is not a real number

      KL(q || p) := E_q[log q] − E_q[log p] = ∞

Support matching constraint

So let's constrain q(z) to a family Q whose support is included in the support of the posterior

    argmin_{q(z)∈Q} KL(q(z) || p(z|x))

where

    Q = {q(z) : supp(q(z)) ⊆ supp(p(z|x))}

But what is the support of p(z|x)?
- typically the same as the support of p(z), as long as p(x, z) > 0 whenever p(z) > 0

Parametric family

So let's constrain q(z) to a family Q whose support is included in the support of the prior

    argmin_{q(z)∈Q} KL(q(z) || p(z|x))

where

    Q = {q(z; λ) : λ ∈ Λ, supp(q(z; λ)) ⊆ supp(p(z))}

- a parameter vector λ picks out a member of the family

Constrained optimisation for the ELBO

We maximise the ELBO

    argmax_{λ∈Λ} E_{q(z;λ)}[log p(x, z)] + H(q(z; λ))

subject to

    Q = {q(z; λ) : λ ∈ Λ, supp(q(z; λ)) ⊆ supp(p(z))}

Often there can be two constraints here
- the support matching constraint
- Λ may be constrained to a subset of R^D, e.g. a univariate Gaussian location lives in R but its scale lives in R_{>0}

Parameters in real coordinate space

Consider the Gaussian case Z ∼ N(µ, σ): how can we obtain µ ∈ R^d and σ ∈ R^d_{>0} from λ_µ ∈ R^d and λ_σ ∈ R^d?
- µ = λ_µ
- σ = exp(λ_σ) or σ = softplus(λ_σ)

The vMF distribution is parameterised by a unit-norm vector v: how can we get v from λ_v ∈ R^d?
- v = λ_v / ‖λ_v‖₂

It is typically possible to work with unconstrained parameters; it only takes an appropriate activation

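A sketch of these mappings (my own code; softplus for the scale and L2 normalisation for the vMF direction, as stated above):

```python
import torch
import torch.nn.functional as F

d = 3
lam_mu, lam_sigma, lam_v = torch.randn(d), torch.randn(d), torch.randn(d)  # all unconstrained

mu = lam_mu                          # location: already lives in R^d
sigma = F.softplus(lam_sigma)        # scale: mapped into R^d_{>0}
v = lam_v / lam_v.norm(p=2)          # vMF mean direction: unit L2 norm

print(bool((sigma > 0).all()), bool(torch.isclose(v.norm(), torch.tensor(1.0))))  # True True
```
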
Constrained optimisation for the ELBO

We maximise the ELBO

    argmax_{λ∈R^D} E_{q(z;λ)}[log p(x, z)] + H(q(z; λ))

subject to

    Q = {q(z; λ) : λ ∈ R^D, supp(q(z; λ)) ⊆ supp(p(z))}

There is one constraint left
- the support of q(z; λ) depends on the choice of prior and thus may be a subset of R^K

ADVI

A gradient-based black-box VI procedure

1. Custom parameter space
   - appropriate transformations of unconstrained parameters!
2. Custom supp(p(z))
   - express z ∈ supp(p(z)) ⊆ R^K as a transformation of some unconstrained ζ ∈ R^K
   - pick a variational family over the entire real coordinate space
   - basically, pick a Gaussian!
3. Intractable expectations
   - reparameterised gradients!

Joint model in real coordinate space

Let's introduce an invertible and differentiable transformation

    T : supp(p(z)) → R^K

and define a transformed variable ζ ∈ R^K

    ζ = T(z)

Recall that we have a joint density p(x, z), which we can use to construct p(x, ζ)

    p(x, ζ) = p(x, T^{-1}(ζ)) |det J_{T^{-1}}(ζ)|

VI in real coordinate space

We can design a posterior approximation whose support is R^K

    q(ζ; λ) = ∏_{k=1}^K q(ζ_k; λ) = ∏_{k=1}^K N(ζ_k | µ_k, σ_k²)       (mean field)

where
- µ_k = λ_{µ_k} for λ_µ ∈ R^K
- σ_k = softplus(λ_{σ_k}) for λ_σ ∈ R^K

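A sketch of this mean-field Gaussian in PyTorch (my own code), with independent coordinates and reparameterised sampling via rsample:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

K = 4
lam_mu = torch.zeros(K, requires_grad=True)      # λ_µ ∈ R^K
lam_sigma = torch.zeros(K, requires_grad=True)   # λ_σ ∈ R^K

q = Normal(loc=lam_mu, scale=F.softplus(lam_sigma))  # q(ζ; λ) = ∏_k N(ζ_k | µ_k, σ_k²)
zeta = q.rsample()                                    # reparameterised sample ζ ∈ R^K
print(q.log_prob(zeta).sum())                         # mean-field log density sums over coordinates
```
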
ELBO in real coordinate space

    log p(x) = log ∫ p(x, z) dz
             = log ∫ p(x, T^{-1}(ζ)) |det J_{T^{-1}}(ζ)| dζ
             = log ∫ q(ζ) [p(x, T^{-1}(ζ)) |det J_{T^{-1}}(ζ)| / q(ζ)] dζ
             ≥ ∫ q(ζ) log [p(x, T^{-1}(ζ)) |det J_{T^{-1}}(ζ)| / q(ζ)] dζ       (Jensen's inequality)
             = E_{q(ζ)}[log p(x, T^{-1}(ζ)) + log |det J_{T^{-1}}(ζ)|] + H(q(ζ))

Reparameterised ELBO

Recall that for Gaussians we have a standardisation procedure S_λ(ζ) = ϵ ∼ N(ϵ|0, I)

    E_{q(ζ;λ)}[log p(x, T^{-1}(ζ)) + log |det J_{T^{-1}}(ζ)|] + H(q(ζ; λ))
      = E_{N(ϵ|0,I)}[log p(x, T^{-1}(S_λ^{-1}(ϵ))) + log |det J_{T^{-1}}(S_λ^{-1}(ϵ))|] + H(q(ζ; λ))

where ζ = S_λ^{-1}(ϵ) and z = T^{-1}(ζ)

Gradient estimate

For ϵ_i ∼ N(0, I)

    ∂/∂λ ELBO(λ) ≈ (1/M) Σ_{i=1}^M [ ∂/∂λ log p(x | T^{-1}(S_λ^{-1}(ϵ_i)))          (likelihood)
                                    + ∂/∂λ log p(T^{-1}(S_λ^{-1}(ϵ_i)))              (prior)
                                    + ∂/∂λ log |det J_{T^{-1}}(S_λ^{-1}(ϵ_i))| ]     (change of volume)
                   + ∂/∂λ H(q(ζ; λ))                                                 (analytic)

Practical tips

Many software packages know how to transform the support of various distributions
- Stan
- TensorFlow tf.probability
- PyTorch torch.distributions

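In PyTorch, for instance, a sketch like the following (my own example) uses biject_to to fetch a bijection from R onto a distribution's support together with its log-det-Jacobian term:

```python
import torch
from torch.distributions import Weibull, biject_to

prior = Weibull(scale=torch.tensor(2.0), concentration=torch.tensor(1.5))
T_inv = biject_to(prior.support)    # bijection from R onto supp(p(z)) = R_{>0}, i.e. T^{-1}

zeta = torch.tensor(0.3)                         # unconstrained ζ ∈ R
z = T_inv(zeta)                                  # constrained z ∈ R_{>0}
log_det = T_inv.log_abs_det_jacobian(zeta, z)    # log |det J_{T^{-1}}(ζ)|
print(z, prior.log_prob(z) + log_det)            # prior log density expressed in ζ-space
```
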
Weibull-Poisson model

Build a change of variable into the model: with z = log^{-1}(ζ) = exp(ζ),

    p(x, z|r, k) = p(z|r, k) p(x|z)
                 = Weibull(z|r, k) Poisson(x|z)
                 = Weibull(log^{-1}(ζ)|r, k) Poisson(x|log^{-1}(ζ)) |det J_{log^{-1}}(ζ)|
                 = p(x, z = log^{-1}(ζ)) |det J_{log^{-1}}(ζ)|

ELBO

    E_{q(ζ|λ)}[log (p(x, z = log^{-1}(ζ)) |det J_{log^{-1}}(ζ)|)] + H(q(ζ))
      = E_{N(ϵ|0,I)}[log (p(x, z = log^{-1}(S_λ^{-1}(ϵ))) |det J_{log^{-1}}(S_λ^{-1}(ϵ))|)] + H(q(ζ))

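Putting the pieces together, here is a minimal end-to-end sketch of ADVI for this model in PyTorch (my own code, not the authors' implementation): a mean-field Gaussian q(ζ; λ), the transform z = log^{-1}(ζ) = exp(ζ), reparameterised sampling via rsample, and an analytic entropy term.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, Weibull, Poisson

torch.manual_seed(0)
r, k = torch.tensor(2.0), torch.tensor(1.5)         # Weibull scale and shape (assumed roles)
x = Poisson(rate=Weibull(r, k).sample()).sample()   # one synthetic observation

# Variational parameters λ = (λ_µ, λ_σ), both unconstrained
lam_mu = torch.zeros(1, requires_grad=True)
lam_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([lam_mu, lam_sigma], lr=0.05)

M = 16  # Monte Carlo samples per gradient step
for step in range(2000):
    opt.zero_grad()
    q = Normal(loc=lam_mu, scale=F.softplus(lam_sigma))   # q(ζ; λ)
    zeta = q.rsample((M,))                                 # ζ = S_λ^{-1}(ϵ), ϵ ~ N(0, I)
    z = torch.exp(zeta)                                    # z = log^{-1}(ζ)
    log_joint = (Weibull(r, k).log_prob(z)                 # log p(z | r, k)
                 + Poisson(rate=z).log_prob(x)             # log p(x | z)
                 + zeta)                                   # log |det J_{log^{-1}}(ζ)| = ζ
    elbo = log_joint.mean() + q.entropy().sum()            # MC term + analytic H(q(ζ; λ))
    (-elbo).backward()
    opt.step()

print(lam_mu.item(), F.softplus(lam_sigma).item())  # fitted location and scale of q(ζ; λ)
```
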
Visualisation

[Figures from Kucukelbir et al. (2017)]

Wait... no deep learning?

Sure! Parameters may well be predicted by NNs
- approximate posterior location and scale
- Weibull rate and shape

Everything is now differentiable, reparameterisable, and the optimisation is unconstrained!

Summary

ADVI is a big step towards black-box VI
- we knew how to map parameters to the unconstrained real coordinate space
- now we also know how to map latent variables to the unconstrained real coordinate space
- it takes a change of variable built into the model

Think of ADVI as reparameterised gradients and autodiff expanded to many more models!

What's left? Our posteriors are still rather simple, aren't they?

References

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017. URL http://jmlr.org/papers/v18/16-107.html.
