Automatic Differentiation Variational Inference

Philip Schulz and Wilker Aziz
https://github.com/philschulz/VITutorial

What we know so far

- DGMs: probabilistic models parameterised by neural networks
- Objective: a lower bound on the log-likelihood (the ELBO)
  - it cannot be computed exactly, so we resort to Monte Carlo estimation
- But the MC estimator is not differentiable
- Score function estimator: applicable to any model
- Reparameterised gradients: so far seemingly applicable only to Gaussian variables

Outline

- Multivariate calculus recap
- Reparameterised gradients revisited
- ADVI
- Example

Multivariate calculus recap

Let x ∈ R^K and let T : R^K → R^K be differentiable and invertible
- y = T(x)
- x = T^{-1}(y)

Jacobian

The Jacobian matrix J_T(x) of T evaluated at x is the matrix of partial derivatives

    J_ij = ∂y_i / ∂x_j

Inverse function theorem

    J_{T^{-1}}(y) = (J_T(x))^{-1}

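As a quick numerical sanity check of the inverse function theorem (a minimal sketch of my own; the map T below is an arbitrary illustrative choice, not from the slides):

```python
import numpy as np

# Illustrative invertible map T(x) = (exp(x1), x1 + x2^3) with inverse
# T^{-1}(y) = (log(y1), cbrt(y2 - log(y1)))
def T(x):
    return np.array([np.exp(x[0]), x[0] + x[1] ** 3])

def T_inv(y):
    return np.array([np.log(y[0]), np.cbrt(y[1] - np.log(y[0]))])

def jacobian(f, v, eps=1e-6):
    """Central finite-difference Jacobian J_ij = ∂f_i/∂v_j."""
    K = len(v)
    J = np.zeros((K, K))
    for j in range(K):
        dv = np.zeros(K)
        dv[j] = eps
        J[:, j] = (f(v + dv) - f(v - dv)) / (2 * eps)
    return J

x = np.array([0.3, -1.2])
y = T(x)
# J_{T^{-1}}(y) equals the matrix inverse of J_T(x)
print(np.allclose(jacobian(T_inv, y), np.linalg.inv(jacobian(T, x)), atol=1e-4))  # True
```
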
Differential (or infinitesimal)

The differential dx of x refers to an infinitely small change in x

We can relate the differential dy of y = T(x) to dx
- Scalar case

      dy = T'(x) dx = (dy/dx) dx = (d/dx T(x)) dx

  where dy/dx is the derivative of y with respect to x
- Multivariate case

      dy = |det J_T(x)| dx

  where the absolute value absorbs the orientation

Integration by substitution

We can integrate a function g(x) by substituting x = T^{-1}(y)

    ∫ g(x) dx = ∫ g(T^{-1}(y)) |det J_{T^{-1}}(y)| dy

and similarly for a function h(y)

    ∫ h(y) dy = ∫ h(T(x)) |det J_T(x)| dx

Change of density

Let X take on values in R^K with density p_X(x), and recall that y = T(x) and x = T^{-1}(y)

Then T induces a density p_Y(y) expressed as

    p_Y(y) = p_X(T^{-1}(y)) |det J_{T^{-1}}(y)|

and it then follows that

    p_X(x) = p_Y(T(x)) |det J_T(x)|

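As a small illustration (my own, not from the slides), take X ∼ N(0, 1) and T(x) = exp(x); the density induced by the change-of-density formula should match the standard log-normal:

```python
import numpy as np
from scipy.stats import norm, lognorm

# X ~ N(0, 1), Y = T(X) = exp(X), so T^{-1}(y) = log(y) and |det J_{T^{-1}}(y)| = 1/y
y = np.linspace(0.1, 5.0, 50)
p_Y = norm.pdf(np.log(y)) * (1.0 / y)   # p_X(T^{-1}(y)) |det J_{T^{-1}}(y)|

print(np.allclose(p_Y, lognorm(s=1.0).pdf(y)))  # True: Y is standard log-normal
```
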
Revisiting reparameterised gradients

Let Z take on values in R^K with pdf q(z|λ)

The idea is to count on a reparameterisation: a transformation S_λ : R^K → R^K such that

    S_λ(z) = ϵ ∼ π(ϵ)
    S_λ^{-1}(ϵ) = z ∼ q(z|λ)

- π(ϵ) does not depend on the parameters λ: we call it a base density
- S_λ(z) absorbs the dependency on λ

Reparameterised expectations

If we are interested in

    E_{q(z|λ)}[g(z)]
      = ∫ q(z|λ) g(z) dz
      = ∫ π(S_λ(z)) |det J_{S_λ}(z)| g(z) dz                                        (change of density)
      = ∫ π(ϵ) |det J_{S_λ^{-1}}(ϵ)|^{-1} g(S_λ^{-1}(ϵ)) |det J_{S_λ^{-1}}(ϵ)| dϵ   (inverse function theorem; change of variable z = S_λ^{-1}(ϵ))
      = ∫ π(ϵ) g(S_λ^{-1}(ϵ)) dϵ
      = E_{π(ϵ)}[g(S_λ^{-1}(ϵ))]

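A minimal numerical check (my own illustration) that the two expectations agree, using the Gaussian case S_λ^{-1}(ϵ) = µ + σϵ and g(z) = sin(z):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
g = np.sin
M = 1_000_000

# Left-hand side: sample z ~ q(z|λ) = N(µ, σ²) directly
lhs = g(rng.normal(mu, sigma, size=M)).mean()

# Right-hand side: sample ϵ ~ π(ϵ) = N(0, 1) and push through S_λ^{-1}(ϵ) = µ + σϵ
rhs = g(mu + sigma * rng.normal(0.0, 1.0, size=M)).mean()

print(lhs, rhs)  # both ≈ sin(µ) exp(-σ²/2) ≈ 0.78
```
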
Reparameterised gradients

For optimisation, we need tractable gradients

    ∂/∂λ E_{q(z|λ)}[g(z)] = ∂/∂λ E_{π(ϵ)}[g(S_λ^{-1}(ϵ))]

Since the base density π(ϵ) does not depend on λ, we can push the derivative inside the expectation and obtain an MC gradient estimate

    ∂/∂λ E_{q(z|λ)}[g(z)] = E_{π(ϵ)}[∂/∂λ g(S_λ^{-1}(ϵ))]
                          ≈ (1/M) Σ_{i=1}^M ∂/∂λ g(S_λ^{-1}(ϵ_i)),    ϵ_i ∼ π(ϵ)

Reparameterised gradients: Gaussian

We have seen one case, namely ϵ ∼ N(0, I) and Z ∼ N(µ, σ²)

Then

    Z = µ + σϵ

and

    ∂/∂λ E_{N(z|µ,σ²)}[g(z)] = E_{N(ϵ|0,I)}[∂/∂λ g(z = µ + σϵ)]
                             = E_{N(ϵ|0,I)}[∂/∂z g(z = µ + σϵ) · ∂z/∂λ]

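A sketch of this estimator in PyTorch (my own example; with g(z) = z² we have E[Z²] = µ² + σ², so the exact gradients are ∂/∂µ = 2µ and ∂/∂σ = 2σ):

```python
import torch

torch.manual_seed(0)
mu = torch.tensor(1.0, requires_grad=True)
sigma = torch.tensor(0.5, requires_grad=True)

g = lambda z: z ** 2
M = 100_000

eps = torch.randn(M)        # ϵ ~ N(0, I): does not depend on λ = (µ, σ)
z = mu + sigma * eps        # S_λ^{-1}(ϵ): all dependence on λ is explicit
g(z).mean().backward()      # MC estimate of E_{q(z|λ)}[g(z)], differentiated by autodiff

print(mu.grad, sigma.grad)  # ≈ 2µ = 2.0 and 2σ = 1.0
```
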
Beyond

Many interesting densities cannot easily be reparameterised
- Beta
- Gamma
- Weibull
- Dirichlet
- von Mises-Fisher

Automatic Differentiation VI

Motivation
- many models have intractable posteriors: their normalising constants (the evidence) lack analytic solutions
- but many models are differentiable: that is the main constraint for using NNs

Reparameterised gradients are a step towards automatising VI for differentiable models
- but not every model of interest employs random variables for which a reparameterisation is known

Example: Weibull-Poisson model

Suppose we have some count data which we assume to be Poisson-distributed

    X|z ∼ Poisson(z)          z ∈ R_{>0}

and suppose we want to impose a Weibull prior on the Poisson rate

    z|r, k ∼ Weibull(r, k)    r ∈ R_{>0}, k ∈ R_{>0}

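A generative sketch of this model with torch.distributions (my own illustration; I am assuming r plays the role of the Weibull scale and k the shape, matching torch's (scale, concentration) parameterisation):

```python
import torch
from torch.distributions import Weibull, Poisson

torch.manual_seed(0)
r, k = torch.tensor(2.0), torch.tensor(1.5)      # Weibull scale and shape (assumed roles)

z = Weibull(scale=r, concentration=k).sample()   # latent Poisson rate, z ∈ R_{>0}
x = Poisson(rate=z).sample()                     # observed count
print(z.item(), x.item())
```
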
VI for Weibull-Poisson model

Generative model

    p(x, z|r, k) = p(z|r, k) p(x|z)

Marginal

    p(x|r, k) = ∫_{R_{>0}} p(z|r, k) p(x|z) dz

ELBO

    E_{q(z|λ)}[log p(x, z|r, k)] + H(q(z|λ))

Can we make q(z|λ) Gaussian?
No! supp(N(z|µ, σ²)) = R

Strategy

Build a change of variable into the model: with z = exp(ζ),

    p(x, z|r, k) = p(z|r, k) p(x|z)
                 = Weibull(z|r, k) Poisson(x|z)
                 = Weibull(exp(ζ)|r, k) Poisson(x|exp(ζ)) |det J_exp(ζ)|

ELBO

    E_{q(ζ|λ)}[. . .] + H(q(ζ))

Can we use a Gaussian approximate posterior? Yes!

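A sketch of the transformed log-joint (my own code): for a scalar latent with z = exp(ζ), the Jacobian of ζ ↦ exp(ζ) is exp(ζ), so log |det J_exp(ζ)| = ζ.

```python
import torch
from torch.distributions import Weibull, Poisson

def log_joint_zeta(x, zeta, r, k):
    """log p(x, ζ | r, k) for the Weibull-Poisson model under z = exp(ζ)."""
    z = torch.exp(zeta)
    log_prior = Weibull(scale=r, concentration=k).log_prob(z)   # log Weibull(z | r, k)
    log_lik = Poisson(rate=z).log_prob(x)                       # log Poisson(x | z)
    return log_prior + log_lik + zeta                           # + log |det J_exp(ζ)| = ζ

print(log_joint_zeta(torch.tensor(4.0), torch.tensor(0.3),
                     torch.tensor(2.0), torch.tensor(1.5)))
```
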
Differentiable models

We focus on differentiable probability models

    p(x, z) = p(x|z) p(z)

- members of this class have continuous latent variables z
- and the gradient ∇_z log p(x, z) is valid within the support of the prior

      supp(p(z)) = {z ∈ R^K : p(z) > 0} ⊆ R^K

Why do we need differentiable models?

Recall the gradient of the ELBO

    ∂/∂λ E_{q(z;λ)}[log p(x, z)] + ∂/∂λ H(q(z; λ))

Reparameterisation requires gradients of the model with respect to z

    ∂/∂λ E_{q(z;λ)}[log p(x, z)] = E_{π(ϵ)}[∂/∂λ log p(x, z = S_λ^{-1}(ϵ))]
                                 = E_{π(ϵ)}[∂/∂z log p(x, z) · ∂/∂λ S_λ^{-1}(ϵ)]

VI optimisation problem

Let's focus on the design and optimisation of the variational approximation

    argmin_{q(z)} KL(q(z) || p(z|x))

To automatise the search for a variational approximation q(z) we must ensure that

    supp(q(z)) ⊆ supp(p(z|x))

- otherwise the KL is not a real number

      KL(q || p) := E_q[log q] − E_q[log p] = ∞

Support matching constraint

So let's constrain q(z) to a family Q whose support is included in the support of the posterior

    argmin_{q(z)∈Q} KL(q(z) || p(z|x))

where

    Q = {q(z) : supp(q(z)) ⊆ supp(p(z|x))}

But what is the support of p(z|x)?
- typically the same as the support of p(z), as long as p(x, z) > 0 whenever p(z) > 0

Parametric family

So let's constrain q(z) to a family Q whose support is included in the support of the prior

    argmin_{q(z)∈Q} KL(q(z) || p(z|x))

where

    Q = {q(z; λ) : λ ∈ Λ, supp(q(z; λ)) ⊆ supp(p(z))}

- a parameter vector λ picks out a member of the family

Constrained optimisation for the ELBO

We maximise the ELBO

    argmax_{λ∈Λ} E_{q(z;λ)}[log p(x, z)] + H(q(z; λ))

subject to

    Q = {q(z; λ) : λ ∈ Λ, supp(q(z; λ)) ⊆ supp(p(z))}

Often there can be two constraints here
- the support matching constraint
- Λ may be constrained to a subset of R^D, e.g. a univariate Gaussian location lives in R but its scale lives in R_{>0}

Parameters in real coordinate space

Consider the Gaussian case Z ∼ N(µ, σ): how can we obtain µ ∈ R^d and σ ∈ R^d_{>0} from λ_µ ∈ R^d and λ_σ ∈ R^d?
- µ = λ_µ
- σ = exp(λ_σ) or σ = softplus(λ_σ)

The vMF distribution is parameterised by a unit-norm vector v: how can we get v from λ_v ∈ R^d?
- v = λ_v / ‖λ_v‖₂

It is typically possible to work with unconstrained parameters; it only takes an appropriate activation

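A sketch of these mappings (my own code; softplus for the scale and L2 normalisation for the vMF direction, as stated above):

```python
import torch
import torch.nn.functional as F

d = 3
lam_mu, lam_sigma, lam_v = torch.randn(d), torch.randn(d), torch.randn(d)  # all unconstrained

mu = lam_mu                          # location: already lives in R^d
sigma = F.softplus(lam_sigma)        # scale: mapped into R^d_{>0}
v = lam_v / lam_v.norm(p=2)          # vMF mean direction: unit L2 norm

print(bool((sigma > 0).all()), bool(torch.isclose(v.norm(), torch.tensor(1.0))))  # True True
```
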
Constrained optimisation for the ELBO

We maximise the ELBO

    argmax_{λ∈R^D} E_{q(z;λ)}[log p(x, z)] + H(q(z; λ))

subject to

    Q = {q(z; λ) : λ ∈ R^D, supp(q(z; λ)) ⊆ supp(p(z))}

There is one constraint left
- the support of q(z; λ) depends on the choice of prior and thus may be a subset of R^K

ADVI

A gradient-based black-box VI procedure

1. Custom parameter space
   - appropriate transformations of unconstrained parameters!
2. Custom supp(p(z))
   - express z ∈ supp(p(z)) ⊆ R^K as a transformation of some unconstrained ζ ∈ R^K
   - pick a variational family over the entire real coordinate space
   - basically, pick a Gaussian!
3. Intractable expectations
   - reparameterised gradients!

Joint model in real coordinate space

Let's introduce an invertible and differentiable transformation

    T : supp(p(z)) → R^K

and define a transformed variable ζ ∈ R^K

    ζ = T(z)

Recall that we have a joint density p(x, z), which we can use to construct p(x, ζ)

    p(x, ζ) = p(x, T^{-1}(ζ)) |det J_{T^{-1}}(ζ)|

VI in real coordinate space

We can design a posterior approximation whose support is R^K

    q(ζ; λ) = ∏_{k=1}^K q(ζ_k; λ) = ∏_{k=1}^K N(ζ_k | µ_k, σ_k²)       (mean field)

where
- µ_k = λ_{µ_k} for λ_µ ∈ R^K
- σ_k = softplus(λ_{σ_k}) for λ_σ ∈ R^K

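A sketch of this mean-field Gaussian in PyTorch (my own code), with independent coordinates and reparameterised sampling via rsample:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

K = 4
lam_mu = torch.zeros(K, requires_grad=True)      # λ_µ ∈ R^K
lam_sigma = torch.zeros(K, requires_grad=True)   # λ_σ ∈ R^K

q = Normal(loc=lam_mu, scale=F.softplus(lam_sigma))  # q(ζ; λ) = ∏_k N(ζ_k | µ_k, σ_k²)
zeta = q.rsample()                                    # reparameterised sample ζ ∈ R^K
print(q.log_prob(zeta).sum())                         # mean-field log density sums over coordinates
```
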
ELBO in real coordinate space

    log p(x) = log ∫ p(x, z) dz
             = log ∫ p(x, T^{-1}(ζ)) |det J_{T^{-1}}(ζ)| dζ
             = log ∫ q(ζ) [p(x, T^{-1}(ζ)) |det J_{T^{-1}}(ζ)| / q(ζ)] dζ
             ≥ ∫ q(ζ) log [p(x, T^{-1}(ζ)) |det J_{T^{-1}}(ζ)| / q(ζ)] dζ       (Jensen's inequality)
             = E_{q(ζ)}[log p(x, T^{-1}(ζ)) + log |det J_{T^{-1}}(ζ)|] + H(q(ζ))

Reparameterised ELBO

Recall that for Gaussians we have a standardisation procedure S_λ(ζ) = ϵ ∼ N(ϵ|0, I)

    E_{q(ζ;λ)}[log p(x, T^{-1}(ζ)) + log |det J_{T^{-1}}(ζ)|] + H(q(ζ; λ))
      = E_{N(ϵ|0,I)}[log p(x, T^{-1}(S_λ^{-1}(ϵ))) + log |det J_{T^{-1}}(S_λ^{-1}(ϵ))|] + H(q(ζ; λ))

where ζ = S_λ^{-1}(ϵ) and z = T^{-1}(ζ)

Gradient estimate

For ϵ_i ∼ N(0, I)

    ∂/∂λ ELBO(λ) ≈ (1/M) Σ_{i=1}^M [ ∂/∂λ log p(x | T^{-1}(S_λ^{-1}(ϵ_i)))          (likelihood)
                                    + ∂/∂λ log p(T^{-1}(S_λ^{-1}(ϵ_i)))              (prior)
                                    + ∂/∂λ log |det J_{T^{-1}}(S_λ^{-1}(ϵ_i))| ]     (change of volume)
                   + ∂/∂λ H(q(ζ; λ))                                                 (analytic)

Practical tips

Many software packages know how to transform the support of various distributions
- Stan
- TensorFlow tf.probability
- PyTorch torch.distributions

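In PyTorch, for instance, a sketch like the following (my own example) uses biject_to to fetch a bijection from R onto a distribution's support together with its log-det-Jacobian term:

```python
import torch
from torch.distributions import Weibull, biject_to

prior = Weibull(scale=torch.tensor(2.0), concentration=torch.tensor(1.5))
T_inv = biject_to(prior.support)    # bijection from R onto supp(p(z)) = R_{>0}, i.e. T^{-1}

zeta = torch.tensor(0.3)                         # unconstrained ζ ∈ R
z = T_inv(zeta)                                  # constrained z ∈ R_{>0}
log_det = T_inv.log_abs_det_jacobian(zeta, z)    # log |det J_{T^{-1}}(ζ)|
print(z, prior.log_prob(z) + log_det)            # prior log density expressed in ζ-space
```
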
Weibull-Poisson model

Build a change of variable into the model: with z = log^{-1}(ζ) = exp(ζ),

    p(x, z|r, k) = p(z|r, k) p(x|z)
                 = Weibull(z|r, k) Poisson(x|z)
                 = Weibull(log^{-1}(ζ)|r, k) Poisson(x|log^{-1}(ζ)) |det J_{log^{-1}}(ζ)|
                 = p(x, z = log^{-1}(ζ)) |det J_{log^{-1}}(ζ)|

ELBO

    E_{q(ζ|λ)}[log (p(x, z = log^{-1}(ζ)) |det J_{log^{-1}}(ζ)|)] + H(q(ζ))
      = E_{N(ϵ|0,I)}[log (p(x, z = log^{-1}(S_λ^{-1}(ϵ))) |det J_{log^{-1}}(S_λ^{-1}(ϵ))|)] + H(q(ζ))

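Putting the pieces together, here is a minimal end-to-end sketch of ADVI for this model in PyTorch (my own code, not the authors' implementation): a mean-field Gaussian q(ζ; λ), the transform z = log^{-1}(ζ) = exp(ζ), reparameterised sampling via rsample, and an analytic entropy term.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, Weibull, Poisson

torch.manual_seed(0)
r, k = torch.tensor(2.0), torch.tensor(1.5)         # Weibull scale and shape (assumed roles)
x = Poisson(rate=Weibull(r, k).sample()).sample()   # one synthetic observation

# Variational parameters λ = (λ_µ, λ_σ), both unconstrained
lam_mu = torch.zeros(1, requires_grad=True)
lam_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([lam_mu, lam_sigma], lr=0.05)

M = 16  # Monte Carlo samples per gradient step
for step in range(2000):
    opt.zero_grad()
    q = Normal(loc=lam_mu, scale=F.softplus(lam_sigma))   # q(ζ; λ)
    zeta = q.rsample((M,))                                 # ζ = S_λ^{-1}(ϵ), ϵ ~ N(0, I)
    z = torch.exp(zeta)                                    # z = log^{-1}(ζ)
    log_joint = (Weibull(r, k).log_prob(z)                 # log p(z | r, k)
                 + Poisson(rate=z).log_prob(x)             # log p(x | z)
                 + zeta)                                   # log |det J_{log^{-1}}(ζ)| = ζ
    elbo = log_joint.mean() + q.entropy().sum()            # MC term + analytic H(q(ζ; λ))
    (-elbo).backward()
    opt.step()

print(lam_mu.item(), F.softplus(lam_sigma).item())  # fitted location and scale of q(ζ; λ)
```
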
Visualisation

[Figures from Kucukelbir et al. (2017)]

Wait... no deep learning?

Sure! Parameters may well be predicted by NNs
- approximate posterior location and scale
- Weibull rate and shape

Everything is now differentiable, reparameterisable, and the optimisation is unconstrained!

Summary

ADVI is a big step towards black-box VI
- we knew how to map parameters to the unconstrained real coordinate space
- now we also know how to map latent variables to the unconstrained real coordinate space
- it takes a change of variable built into the model

Think of ADVI as reparameterised gradients and autodiff expanded to many more models!

What's left? Our posteriors are still rather simple, aren't they?

References

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017. URL http://jmlr.org/papers/v18/16-107.html.
