
Probabilistic Numerics

– Part I –
Integration and Differential Equations

Philipp Hennig

MLSS 2015
18 / 07 / 2015

Emmy Noether Group on Probabilistic Numerics


Department of Empirical Inference
Max Planck Institute for Intelligent Systems
Tübingen, Germany
Information Content of Partial Computations
division with remainder

2 3 7 3 6 ÷ 7 3 6 = 3 2 . 2 5
2 2 0 8 0
  1 6 5 6
  1 4 7 2
    1 8 4
    1 4 7 . 2
      3 6 . 8
      3 6 . 8
        0

Each partial step of the division already pins down another digit of the answer
(3 X . X X, then 3 2 . X X, then 3 2 . 2 X, …): an unfinished computation still
carries quantifiable information about the result.
What about ML computations?
Contemporary computational tasks are more challenging

What happens with

▸ a neural network if we stop "training" after four steps of SGD?
▸ … or train on only 1% of the data set?
▸ a GP regressor if we stop Gauss-Jordan elimination after three steps?
▸ a DP mixture model if we only run MCMC for ten samples?
▸ a robotic controller built using all of these methods?

As data sets become infinite, ML models increasingly complex,
and their applications permeate our lives,
we need to model the effects of approximations more explicitly
to achieve fast, reliable AI.
Machine learning methods are chains of numerical computations
▸ linear algebra (least-squares)
▸ optimization (training & fitting)
▸ integration (MCMC, marginalization)
▸ solving differential equations (RL, control)
Are these methods just black boxes on your shelf?

Numerical methods perform inference
an old observation [Poincaré 1896, Diaconis 1988, O’Hagan 1992]

A numerical method
estimates a function’s latent property
given the result of computations.

integration       estimates ∫_a^b f(x) dx           given {f(x_i)}
linear algebra    estimates x s.t. Ax = b            given {A s_i = y_i}
optimization      estimates x s.t. ∇f(x) = 0         given {∇f(x_i)}
analysis          estimates x(t) s.t. x′ = f(x, t)   given {f(x_i, t_i)}

▸ computations yield "data" / "observations"
▸ non-analytic quantities are "latent"
▸ even deterministic quantities can be uncertain.
If computation is inference, it should be possible to build
probabilistic numerical methods
that take in probability measures over inputs,
and return probability measures over outputs,
which quantify uncertainty arising from the uncertain input
and the finite information content of the computation.

[Diagram: input i → compute → output o]
Classic methods identified as maximum a-posteriori
probabilistic numerics is anchored in established theory

quadrature [Diaconis 1988]
    Gaussian quadrature ↔ Gaussian process regression

linear algebra [Hennig 2014]
    conjugate gradients ↔ Gaussian conditioning

nonlinear optimization [Hennig & Kiefel 2013]
    BFGS ↔ autoregressive filtering

ordinary differential equations [Schober, Duvenaud & Hennig 2014]
    Runge-Kutta ↔ Gauss-Markov extrapolation
Integration

F = ∫_a^b f(x) dx

[Diagram: f → ∫ → F]
Integration
a toy problem

[Figure: the integrand f(x) (left); integration error F − F̂ vs. # samples, on log scales (right)]

f(x) = exp(− sin²(3x) − x²)          F = ∫_{−3}^{3} f(x) dx = ?

f(x) ≤ exp(−x²),   and   ∫_{−∞}^{∞} exp(−x²) dx = √π
Monte Carlo
(almost) no assumptions, stochastic convergence

[Figure: the integrand with randomly sampled nodes (left); error F − F̂ vs. # samples (right)]

F = ∫ exp(− sin²(3x) − x²) dx
  = Z ∫ (f(x) / g(x)) · (g(x) / Z) dx,              Z = ∫ g(x) dx
  ≈ (Z / N) Σ_{i=1}^{N} f(x_i) / g(x_i) = F̂,        x_i ∼ g(x)/Z,        var(F̂) = var_g(f/g) / N

▸ adding randomness enforces stochastic convergence

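As a concrete illustration of the importance-sampling estimator above, here is a minimal numpy sketch for the toy integrand, using a standard normal as the proposal; the proposal, sample size, and seed are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # toy integrand from the slides
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

def g_density(x):
    # normalized proposal density g(x)/Z: a standard normal (arbitrary choice)
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

N = 10_000
x = rng.standard_normal(N)                    # x_i ~ g/Z
weights = f(x) / g_density(x)                 # f(x_i) / (g(x_i)/Z); Z absorbed into the density
F_hat = weights.mean()                        # Monte Carlo estimate of F
std_err = weights.std(ddof=1) / np.sqrt(N)    # estimate of sqrt(var_g(f/g) / N)

print(f"F_hat = {F_hat:.6f} +/- {std_err:.6f}")
```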

The probabilistic approach
integration as nonparametric inference [P. Diaconis, 1988, T. O’Hagan, 1991]

[Figure: GP posterior on f(x) from a few evaluations (left); error F − F̂ vs. # samples (right)]

p(f) = GP(f; 0, k),        k(x, x′) = min(x, x′) + c

p(z) = N(z; µ, Σ)   ⇒   p(Az) = N(Az; Aµ, AΣA⊺)

p(∫_a^b f(x) dx) = N [ ∫_a^b f(x) dx;  ∫_a^b m(x) dx,  ∬_a^b k(x, x′) dx dx′ ]

                 = N (F; 0, −1/6 (b³ − a³) + 1/2 [b³ − 2a²b + a³] + (b − a)² c)
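A rough numpy sketch of this construction: place a GP prior on f, condition on a few evaluations, and read off a Gaussian posterior over F. To keep the kernel positive semi-definite on [a, b] with a < 0, the sketch pins the Wiener process at the left endpoint, i.e. k(x, x′) = min(x − a, x′ − a) + c; that shift, the constant c, and the node locations are illustrative assumptions rather than the slide's exact parametrization.

```python
import numpy as np

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

a, b, c = -3.0, 3.0, 1.0
X = np.linspace(-2.0, 2.0, 8)            # evaluation nodes (arbitrary)
y = f(X)

def k(x1, x2):
    # Wiener-process-type kernel, pinned at the left endpoint a
    return np.minimum(x1 - a, x2 - a) + c

K = k(X[:, None], X[None, :]) + 1e-10 * np.eye(len(X))

# kernel integrals: z_i = int_a^b k(x, X_i) dx,  v = double integral of k over [a,b]^2
B = b - a
u = X - a
z = u * B - 0.5 * u ** 2 + c * B
v = B ** 3 / 3 + c * B ** 2

alpha = np.linalg.solve(K, y)
mean_F = z @ alpha                            # posterior mean of F
var_F = v - z @ np.linalg.solve(K, z)         # posterior variance of F
print(f"F ~ N({mean_F:.4f}, {var_F:.2e})")
```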
Active Collection of Information
choice of evaluation nodes [T. Minka, 2000]

[Figure: GP posterior on f(x) as nodes are added one by one (left); error F − F̂ vs. # samples (right)]

x_t = arg min_x [ var_{p(F | x_1, …, x_{t−1}, x)}(F) ]

Active node placement for maximum expected error reduction
gives a regular grid.
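A small sketch of this greedy design rule: at each step, pick the candidate node that minimizes the resulting posterior variance of F. Under a GP prior this variance depends only on the node locations, not on the function values. The squared-exponential kernel, length scale, and candidate grid used here are illustrative choices, not the exact setup behind the slide.

```python
import numpy as np
from scipy.special import erf

a, b, lam = -3.0, 3.0, 0.5

def k(x1, x2):
    return np.exp(-0.5 * np.subtract.outer(x1, x2) ** 2 / lam ** 2)

def z(x):
    # z_i = int_a^b k(x', x_i) dx'  (closed form via the error function)
    return lam * np.sqrt(np.pi / 2) * (
        erf((b - x) / (np.sqrt(2) * lam)) - erf((a - x) / (np.sqrt(2) * lam))
    )

# prior variance of F: double integral of k over [a,b]^2, approximated on a grid
grid = np.linspace(a, b, 400)
w = (b - a) / len(grid)
prior_var = w * w * k(grid, grid).sum()

def post_var(nodes):
    K = k(nodes, nodes) + 1e-10 * np.eye(len(nodes))
    zn = z(nodes)
    return prior_var - zn @ np.linalg.solve(K, zn)

candidates = np.linspace(a, b, 121)
nodes = []
for t in range(8):
    scores = [post_var(np.array(nodes + [c])) for c in candidates]
    nodes.append(candidates[int(np.argmin(scores))])
print("greedily chosen nodes:", np.round(sorted(nodes), 2))
```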
Posterior Mean is a Linear Spline …
A classic numerical method, derived as a learning machine [P. Diaconis, 1988, T. O'Hagan, 1991]

[Figure: piecewise-linear GP posterior mean through the evaluation nodes (left); error F − F̂ vs. # samples (right)]

p(f | y, x) = GP(f; m_y, k_y),        k(x, x_i) = min(x, x_i) + c

m_y(x) = Σ_i k(x, x_i) α_i,            α = k_XX⁻¹ y
We just re-discovered the Trapezoid Rule!
A classic numerical method, derived as a learning machine [P. Diaconis, 1988, T. O’Hagan, 1991]

[Figure: the posterior-mean interpolant and its integral (left); error F − F̂ vs. # samples (right)]

p(f | y, x) = GP(f; m_y, k_y),        k(x, x_i) = min(x, x_i) + c

m_y(x) = Σ_i k(x, x_i) α_i,            α = k_XX⁻¹ y

E_{p(f | y)}[F] = ∫ m_y(x) dx = Σ_{i=1}^{N−1} 1/2 (x_{i+1} − x_i) (f(x_{i+1}) + f(x_i))
[Figure: the trapezoid-rule interpolant (left); error F − F̂ vs. # samples (right)]

▸ The trapezoid rule is the maximum a-posteriori estimate for F under a
  Wiener process prior on f.
▸ Node placement by information maximization under this prior.
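A quick numerical check of the identity above: the posterior-mean integral, written out as the weighted sum, coincides with the classical trapezoid rule (here compared against numpy's trapezoid implementation on arbitrary nodes).

```python
import numpy as np

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

x = np.sort(np.random.default_rng(1).uniform(-3, 3, 12))   # arbitrary nodes
y = f(x)

# posterior-mean integral from the slide: a sum of trapezoid areas
bq_mean = 0.5 * np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]))

print(bq_mean, np.trapz(y, x))   # identical up to floating point
```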
▸ introducing random numbers into a deterministic problem may not be
  a good strategy
▸ recipe for a probabilistic numerical method estimating x from y(t):
  ▸ choose a model p(x, y(t))
  ▸ choose an action rule / policy / strategy
    [t_1, …, t_{i−1}, y(t_1), …, y(t_{i−1})] → t_i
▸ some classic numerical methods can be derived entirely from an
  inference perspective, using classic statistical methods
▸ the trapezoid rule is a MAP estimate under a Wiener process prior on
  the integrand; regular node placement arises from
  information-greediness
▸ the probabilistic interpretation as such does not ensure that the posterior
  distribution is well calibrated
Customized Numerics
machine learning providing new numerics [Hennig, Osborne, Girolami, RSPA 2015]

[Figure: GP posterior with a squared-exponential kernel (left); error F − F̂ vs. # samples (right)]

k(x, x′) = exp (− (x − x′)² / (2λ²))

Encodes
▸ smooth (infinitely differentiable) f
▸ exponentially decaying Fourier power spectrum
But ignores
▸ non-stationarity
▸ positivity
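A sketch of Bayesian quadrature with this squared-exponential kernel: its integral against an interval [a, b] has a closed form via the error function, so the quadrature weights can be computed exactly. The length scale and node placement are illustrative choices.

```python
import numpy as np
from scipy.special import erf

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

a, b, lam = -3.0, 3.0, 0.5
X = np.linspace(-3.0, 3.0, 15)
y = f(X)

K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / lam ** 2) + 1e-10 * np.eye(len(X))

# z_i = int_a^b exp(-(x - X_i)^2 / (2 lam^2)) dx
z = lam * np.sqrt(np.pi / 2) * (
    erf((b - X) / (np.sqrt(2) * lam)) - erf((a - X) / (np.sqrt(2) * lam))
)

weights = np.linalg.solve(K, z)           # quadrature weights K^{-1} z
F_hat = weights @ y                       # posterior mean of F
print(f"BQ estimate: {F_hat:.6f}")
```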
No Such Thing as a Free Lunch!
incorrect assumptions give arbitrarily bad performance [Hennig, Osborne, Girolami, RSPA 2015]

[Figure: posterior under a mismatched prior (left); error F − F̂ vs. # samples, now out to 10³ (right)]

Computations collect information about a latent quantity.

The more valid prior information is available, the cheaper the
computation – if the algorithm uses the prior information!
Model Mismatch can be Detected at Runtime
“a numerical conscience”

[Figure: two test cases, each showing the integrand f(x) (top), the error F − F̂ (middle), and the mismatch statistic r (bottom) as functions of # samples]

r = log ( E_{f̃}[p(f̃(x))] / p(f(x)) ) = (f(x) − µ(x))⊺ K⁻¹ (f(x) − µ(x)) − N
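A small sketch of such a runtime check: compute the Mahalanobis-type statistic from the slide for the observed function values under the GP prior. Values that scatter around zero are consistent with the prior; large values flag a mismatch. The kernel, nodes, and test functions below are illustrative.

```python
import numpy as np

# Runtime model-mismatch check: under the prior f(X) ~ N(mu, K), the quadratic
# form (f - mu)^T K^{-1} (f - mu) has expectation N, so r = quadratic form - N
# stays moderate when the prior fits and blows up when it does not.

def mismatch_statistic(fX, mu, K):
    d = fX - mu
    return d @ np.linalg.solve(K, d) - len(fX)

lam = 0.5
X = np.linspace(-3, 3, 20)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / lam ** 2) + 1e-8 * np.eye(len(X))
mu = np.zeros(len(X))

f_smooth = np.exp(-np.sin(3 * X) ** 2 - X ** 2)   # relatively compatible with the prior
f_rough = np.sign(np.sin(40 * X))                  # much rougher than the prior expects

print("r (smooth):", mismatch_statistic(f_smooth, mu, K))
print("r (rough): ", mismatch_statistic(f_rough, mu, K))
```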
▸ encoding tangible prior information in the prior can give tailored
  numerics that drastically improve computational efficiency
▸ in contrast to physical data sources, in numerical problems prior
  assumptions can be rigorously verified, because the problem is stated
  in a formal (programming) language!
▸ incorrect prior assumptions can catastrophically affect performance
▸ model assumptions can be adapted at runtime, using established
  statistical techniques, e.g. "type-II maximum likelihood"
So why is everyone (in ML) using MCMC?

Warped Sequential Active Bayesian Integration (WSABI)
[Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

[Figure: warped GP model of the integrand (left); error F − F̂ vs. # samples (right)]

f(x) · exp(−x²/2) ∼ GP [0, k = exp(−1/2 (x − x′)⊺ Λ⁻¹ (x − x′))]

▸ encodes positivity
▸ encodes non-stationarity
Warped Sequential Active Bayesian Integration (WSABI)
[Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

[Figure: actively selected evaluation nodes under the warped model (left); error F − F̂ vs. # samples (right)]

▸ select evaluation nodes at arg max_x var[f(x) · exp(−x²)]
▸ this scales to higher input dimensionality

more formal analysis in arXiv 1410.2392 [Oates et al.] & 1506.02681 [Briol et al.]
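A simplified sketch of the acquisition rule above: fit a GP to the evaluations seen so far and pick the next node where the posterior variance, weighted by the squared Gaussian factor, is largest. For brevity this uses a plain (unwarped) GP, whose posterior variance does not depend on the observed values; the published WSABI model uses a square-root warping of the integrand, so treat this purely as an illustration of the active-selection loop, with kernel, length scale, and candidate grid chosen arbitrarily.

```python
import numpy as np

lam = 0.5
def k(x1, x2):
    return np.exp(-0.5 * np.subtract.outer(x1, x2) ** 2 / lam ** 2)

candidates = np.linspace(-3, 3, 201)
X = [0.0]                                        # start with one node
for t in range(10):
    Xa = np.array(X)
    K = k(Xa, Xa) + 1e-9 * np.eye(len(Xa))
    Kc = k(Xa, candidates)                       # cross-covariances to candidates
    Kinv_Kc = np.linalg.solve(K, Kc)
    post_var = 1.0 - np.einsum("ij,ij->j", Kc, Kinv_Kc)
    acquisition = post_var * np.exp(-candidates ** 2)   # weighted posterior variance
    X.append(candidates[int(np.argmax(acquisition))])
print("selected nodes:", np.round(np.sort(X), 2))
```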
Probabilistic Numerics Need Not Be Expensive
WSABI is time-competitive with MCMC [Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

[Figure: integration error |F_est − F_true| vs. computation time [s] for Monte Carlo and WSABI on four problems: synthetic (moG), yacht hydrodynamics, GP classification (synthetic), and GP classification (graph)]
end of Integration part

▸ computation is inference
▸ there is a deep formal connection between basic numerical methods
and basic statistical models
▸ ignoring salient prior information causes a drastic increase in computational
  cost. Black boxes may be convenient, but they are not efficient
▸ machine learning can help numerics, and vice versa
Ordinary Differential Equations

x′(t) = f(x(t), t),        x(t_0) = x_0,        x: R → R^N

[Diagram: x_0 → ode → x]
Runge-Kutta Methods
iterative linear extrapolation

[Figure: iterated linear extrapolation of x(t) from t_0 through t_0 + c_1 and t_0 + c_2 to t_0 + h]

0    | 1                          y_1 = f(1·x_0, t_0)
c_1  | 1  w_11                    y_2 = f(1·x_0 + w_11 y_1, t_0 + c_1)
c_2  | 1  w_21  w_22              y_{s+1} = f(1·x_0 + Σ_{i=1}^{s} w_si y_i, t_0 + c_s)
h    | 1  b_1   b_2   b_3         x̂(t_0 + h) = 1·x_0 + Σ_i b_i y_i

▸ Runge-Kutta methods choose (c, w, b) such that ∥x̂(t_0 + h) − x(t_0 + h)∥ = O(h^p)
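To make the tableau concrete, here is a small sketch of a generic explicit Runge-Kutta step driven by coefficients (c, w, b), instantiated with the classical fourth-order tableau; the test ODE and step size are arbitrary choices.

```python
import numpy as np

def explicit_rk_step(f, x0, t0, h, c, w, b):
    """One explicit Runge-Kutta step x(t0) -> x_hat(t0 + h).

    c, b: stage nodes and final weights; w: strictly lower-triangular stage
    weights, mirroring the tableau above (here the weights are scaled by h).
    """
    stages = []
    for i in range(len(c)):
        xi = x0 + h * sum(w[i][j] * stages[j] for j in range(i))
        stages.append(f(xi, t0 + c[i] * h))
    return x0 + h * sum(bi * yi for bi, yi in zip(b, stages))

# classical RK4 tableau
c = [0.0, 0.5, 0.5, 1.0]
w = [[], [0.5], [0.0, 0.5], [0.0, 0.0, 1.0]]
b = [1 / 6, 1 / 3, 1 / 3, 1 / 6]

f = lambda x, t: -x                      # test ODE: x' = -x, x(0) = 1
x, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    x = explicit_rk_step(f, x, t, h, c, w, b)
    t += h
print(x, np.exp(-1.0))                   # numerical vs. exact solution at t = 1
```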
Gauss-Markov inference on ODEs
a probabilistic model matching Runge-Kutta [Schober, Duvenaud, Hennig, NIPS 2014]

[Figure: Gauss-Markov posterior on x(t), extrapolated from t_0 through c_1 and c_2 to h]

0    | 1                          y_1 = f(µ_{|x_0}(t_0), t_0)
c_1  | 1  w_11                    y_2 = f(µ_{|x_0, y_1}(c_1), t_0 + c_1)
c_2  | 1  w_21  w_22              y_{s+1} = f(µ_{|x_0, y_1, …, y_s}(c_s), t_0 + c_s)
h    | 1  b_1   b_2   b_3         x̂(t_0 + h) = µ_{|x_0, y_i}(t_0 + h)

µ_{|y}(t) := cov(x(t), y) cov(y, y)⁻¹ y = 1·x_0 + Σ_i w_si y_i
Gauss-Markov inference on ODEs
a probabilistic model matching Runge-Kutta [Schober, Duvenaud, Hennig, NIPS 2014]

[Figure: Gauss-Markov extrapolation of x(t) over one step]

▸ Linear extrapolation suggests a Gaussian process model
▸ polynomial form suggests a Wiener state-space model

    dz = F z dt + L dω,        z(t) = (x(t), x′(t), …, h^k/k! · x^(k)(t))⊺,
    F = 1/h · (matrix with superdiagonal 1, 2, …, k, zeros elsewhere),        L = σ · (0, …, 0, 1)⊺

▸ inference through filtering
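A minimal sketch of "inference through filtering": a Kalman filter on a once-integrated Wiener process prior (state x, x′), where each evaluation of f at the predicted mean is treated as a noise-free observation of x′. This is a simplified, generic ODE filter in the spirit of the construction above, not the exact parametrization of the cited paper; the test ODE and step count are arbitrary.

```python
import numpy as np

def ode_filter(f, x0, t0, t1, n_steps, sigma=1.0):
    h = (t1 - t0) / n_steps
    A = np.array([[1.0, h], [0.0, 1.0]])                    # state transition
    Q = sigma**2 * np.array([[h**3 / 3, h**2 / 2],          # process noise (IWP(1))
                             [h**2 / 2, h]])
    H = np.array([[0.0, 1.0]])                              # observe x'
    m = np.array([x0, f(x0, t0)])                           # initialize with f(x0)
    P = np.zeros((2, 2))
    t = t0
    for _ in range(n_steps):
        # predict
        m, P = A @ m, A @ P @ A.T + Q
        t += h
        # "measure" the derivative at the predicted mean
        y = f(m[0], t)
        S = H @ P @ H.T                                      # innovation covariance
        K = P @ H.T / S                                      # Kalman gain
        m = m + (K * (y - m[1])).ravel()
        P = P - K @ H @ P
    return m, P                                              # mean, covariance at t1

m, P = ode_filter(lambda x, t: -x, x0=1.0, t0=0.0, t1=1.0, n_steps=50)
print(m[0], np.exp(-1.0))   # filtered estimate of x(1) vs. exact solution
```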
Calibrating Uncertainty
within the parametrized class

[Figure: extrapolation of x(t) from t_0 through t_0 + c_1 and t_0 + c_2 to t_0 + h, with the predictive distribution p[x(t_0 + h)]]

▸ the posterior mean µ_{|y} = kK⁻¹y is invariant under the rescaling k → θ²k
▸ the posterior covariance k_{|y} = k − kK⁻¹k is scaled by θ²
▸ connection to local error estimation in existing methods
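One standard way to set the scale θ² at runtime, a generic type-II maximum-likelihood choice rather than necessarily the calibration used in the cited work, is θ̂² = yᵀK⁻¹y / N: it leaves posterior means untouched and rescales posterior covariances.

```python
import numpy as np

# Output-scale calibration for a GP model k -> theta^2 * k.
# The MLE of theta^2 under y ~ N(0, theta^2 K) is y^T K^{-1} y / N;
# posterior means are unchanged, posterior covariances scale by theta_hat^2.

def calibrate_scale(K, y):
    return float(y @ np.linalg.solve(K, y)) / len(y)

# illustrative example: data drawn with true output scale 4.0
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 30)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.2 ** 2) + 1e-6 * np.eye(30)
y = rng.multivariate_normal(np.zeros(30), 4.0 * K)

theta2_hat = calibrate_scale(K, y)
print("estimated theta^2:", theta2_hat)     # typically close to the true value 4
```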
▸ as in integration, classic families of ODE solvers can be interpreted as
  MAP inference
▸ for each classic solver there is a whole family of probabilistic solvers with
  the same point estimate but different uncertainties
▸ probabilistic solvers need not be expensive; they can even have exactly the
  same cost as classic methods
Visualizing Computational Uncertainty
Neural Pathways [M. Schober, N. Kasenburg, A. Feragen, P.H., S. Hauberg, MICCAI 2014]

Propagating Uncertainty
geodesics on an uncertain Riemannian metric [Hauberg, Schober, Liptrot, P.H., Feragen, MICCAI 2015]

[Figure: density and standard deviation of a single GP geodesic under the random metric, projected into axis-aligned slices; panels: Dijkstra in CST, GP in CST, Dijkstra in ILF, GP in ILF]

▸ Shortest paths (geodesics) on a Riemannian manifold with metric M obey

    x″(t) = −1/2 M⁻¹(x(t)) [∂ M⃗(x(t)) / ∂x(t)]⊺ (x′(t) ⊗ x′(t)) = f(x(t), x′(t), t)

▸ what if M ∼ N(m, V) is uncertain (inferred from data)?
Uncertainty Across Composite Computations
interacting information requirements [Hennig, Osborne, Girolami, Proc. Royal Society A 2015]

[Diagram: data D, variables x_t, parameters θ, predictions x_{t+δt}, and actions a, linked by inference (quadrature), estimation (optimization), prediction (analysis), and action (control), spanning environment and machine: learning / inference / pattern recognition / system identification as a chain of numerical computations]

▸ numerical methods able to deal with uncertain (probabilistic) inputs,
  and returning uncertain (probabilistic) outputs, allow control of
  computational effort
▸ (this is not the same as probabilistic programming!)
Summary
Probabilistic Numerics – Part I [Hennig, Osborne, Girolami, Proc. Royal Society A 2015]

The probabilistic view of computation


▸ computation is (active) inference
▸ several classic methods can be interpreted precisely as MAP inference

▸ Gaussian Quadrature—Gaussian process regression


▸ Runge-Kutta Methods—Autoregressive Filtering
▸ [more to come on Monday]
▸ correct prior information can reduce runtime
▸ prior assumptions can be tested at runtime
▸ probabilistic uncertainty can be propagated between different
numerical computations, allowing control of cost

“[PN is] a re-emerging area of very active research”


Z. Ghahramani, Nature 521, 452–459 (28 May 2015)

http://probabilistic-numerics.org

— Backup —

Monte Carlo is not all that Different
MC as a 'cautious' limit

[Figure: GP posterior under the OU kernel (left); error F − F̂ vs. # samples (right)]

p(f) = GP(f; 0, k_λ^OU + c),        k_λ^OU(x, x′) = exp (− |x − x′| / λ)

λ → ∞   ⇒   k(x, x′) = |x − x′|        → trapezoid rule
λ → 0   ⇒   k(x, x′) → δ(x − x′)        → averaging
Monte Carlo is not all that Different
MC as a 'cautious' limit

[Figure: posterior in the λ → 0 limit (left); error F − F̂ vs. # samples (right)]

MC is optimal for totally unstructured (but integrable) f. But

▸ random node placement is not necessary!
  Introducing randomness to a deterministic problem is generally a bad idea.
▸ no integrand is totally unstructured. Prior knowledge is available!
  Computations should use as much available prior information as possible.
What is a Random Number?
What is a sequence of random numbers?

▸ 662244111144334455553366666666 dice, doubled


▸ 169399375105820974944592307816 41-70th digits of π
▸ 712904263472610590208336044895 von Neumann method, seed 908344
▸ 100011111101111111100101000001 bits from G. Marsaglia’s ‘diehard CD’
▸ 01110000011100100110111101100011 deterministic sequence,
corrupted by horizontal coin drop from unknown height

▸ for use in Monte Carlo, the important property is freedom from


patterns (because it implies anytime unbiasedness)
▸ in fact, use for MC only really requires the right density, disorder is
just helpful for the argument
▸ for use in cryptography, the important property is unpredictability
▸ randomness is a philosophically dodgy concept
▸ uncertainty is a much clearer idea
http://www.stat.fsu.edu/pub/diehard/
34 ,
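The "von Neumann method" above presumably refers to his middle-square generator. Here is a toy sketch of that idea (purely illustrative, with arbitrary parameters, and not an attempt to reproduce the digits listed above): a completely deterministic rule whose output nonetheless looks pattern-free.

```python
# Toy sketch of von Neumann's middle-square generator: square the current
# state and take the middle digits as the next state. Deterministic, yet the
# digit stream looks disordered. (Illustration only; the method is known to
# fall into short cycles, so it is not a serious random-number generator.)

def middle_square(seed: int, n_digits: int = 6):
    state = seed
    while True:
        sq = str(state * state).zfill(2 * n_digits)
        mid = len(sq) // 2
        state = int(sq[mid - n_digits // 2: mid + n_digits // 2])
        yield state

gen = middle_square(908344)
print([next(gen) for _ in range(5)])
```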
How Expensive is a Computation?
Ex: Riemannian Statistics [Schober, Hauberg, Lawrence, Hennig; in prep.]

[Figure: two-dimensional data set on a learned Riemannian manifold, roughly [−2, 2] × [−0.5, 0.5]]

▸ Shortest paths (geodesics) on a Riemannian manifold with metric M obey

    x″(t) = −1/2 M⁻¹(x(t)) [∂ M⃗(x(t)) / ∂x(t)]⊺ (x′(t) ⊗ x′(t)) = f(x(t), x′(t), t)

▸ the Karcher mean µ of a data set {x_i}_{i=1,…,N} is the point minimizing
  Σ_{i=1}^{N} Distance(µ, x_i)
How Expensive is a Computation?
Ex: Riemannian Statistics [Schober, Hauberg, Lawrence, Hennig; in prep.]

[Figure: the same two-dimensional data set]

▸ to find the Karcher mean, do gradient descent from an initial guess µ_0

    µ_{k+1} = µ_k − α ∇_µ Σ_{i=1}^{N} Distance(µ, x_i)

▸ this requires solving N initial value problems, over and over.
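The slide's setting needs a numerical geodesic solver inside every gradient step. As a self-contained toy where the geodesic maps are available in closed form, here is Karcher-mean gradient descent on the unit sphere; the sphere, the step size, and the synthetic data are illustrative stand-ins for the general Riemannian case.

```python
import numpy as np

# Karcher (Frechet) mean on the unit sphere S^2 by Riemannian gradient
# descent: mu <- Exp_mu( alpha * mean_i Log_mu(x_i) ).
# On a general learned manifold, each Log/Exp would itself require solving
# an initial or boundary value problem numerically, which is the slide's point.

def log_map(mu, x):
    """Tangent vector at mu pointing towards x, with length = geodesic distance."""
    cos_t = np.clip(mu @ x, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(mu)
    v = x - cos_t * mu
    return theta * v / np.linalg.norm(v)

def exp_map(mu, v):
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return mu
    return np.cos(norm_v) * mu + np.sin(norm_v) * v / norm_v

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3)) + np.array([0.0, 0.0, 5.0])
data /= np.linalg.norm(data, axis=1, keepdims=True)      # points on the sphere

mu = data[0]
for _ in range(100):
    grad = np.mean([log_map(mu, x) for x in data], axis=0)
    mu = exp_map(mu, 0.5 * grad)                          # step size alpha = 0.5
print("Karcher mean:", np.round(mu, 3))
```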
How Much Information is Needed?
Ex: Riemannian Statistics [Schober, Hauberg, Lawrence, Hennig; in prep.]

[Figure: the same two-dimensional data set]
