
Probabilistic Numerics

– Part I –
Integration and Differential Equations

Philipp Hennig

MLSS 2015
18 / 07 / 2015

Emmy Noether Group on Probabilistic Numerics


Department of Empirical Inference
Max Planck Institute for Intelligent Systems
Tübingen, Germany
Information Content of Partial Computations
division with remainder

2 3 7 3 6 ÷ 7 3 6 = 3 2 . 2 5
2 2 0 8 0
  1 6 5 6
  1 4 7 2
    1 8 4
    1 4 7 . 2
      3 6 . 8
      3 6 . 8
        0

Each partial step of the division already pins down another digit of the answer
(3 X . X X, then 3 2 . X X, then 3 2 . 2 X, …): an unfinished computation still
carries quantifiable information about the result.
What about ML computations?
Contemporary computational tasks are more challenging

What happens with

▸ a neural network if we stop "training" after four steps of SGD?
▸ … or train on only 1% of the data set?
▸ a GP regressor if we stop Gauss-Jordan elimination after three steps?
▸ a DP mixture model if we only run MCMC for ten samples?
▸ a robotic controller built using all of these methods?

As data sets become infinite, ML models increasingly complex,
and their applications permeate our lives,
we need to model the effects of approximations more explicitly
to achieve fast, reliable AI.
Machine learning methods are chains of numerical computations
▸ linear algebra (least-squares)
▸ optimization (training & fitting)
▸ integration (MCMC, marginalization)
▸ solving differential equations (RL, control)
Are these methods just black boxes on your shelf?

Numerical methods perform inference
an old observation [Poincaré 1896, Diaconis 1988, O’Hagan 1992]

A numerical method
estimates a function’s latent property
given the result of computations.

integration       estimates ∫_a^b f(x) dx           given {f(x_i)}
linear algebra    estimates x s.t. Ax = b            given {A s_i = y_i}
optimization      estimates x s.t. ∇f(x) = 0         given {∇f(x_i)}
analysis          estimates x(t) s.t. x′ = f(x, t)   given {f(x_i, t_i)}

▸ computations yield "data" / "observations"
▸ non-analytic quantities are "latent"
▸ even deterministic quantities can be uncertain.
If computation is inference, it should be possible to build
probabilistic numerical methods
that take in probability measures over inputs,
and return probability measures over outputs,
which quantify uncertainty arising from the uncertain input
and the finite information content of the computation.

[Diagram: input i → compute → output o]
Classic methods identified as maximum a-posteriori
probabilistic numerics is anchored in established theory

quadrature [Diaconis 1988]
    Gaussian quadrature ↔ Gaussian process regression

linear algebra [Hennig 2014]
    conjugate gradients ↔ Gaussian conditioning

nonlinear optimization [Hennig & Kiefel 2013]
    BFGS ↔ autoregressive filtering

ordinary differential equations [Schober, Duvenaud & Hennig 2014]
    Runge-Kutta ↔ Gauss-Markov extrapolation
Integration

F = ∫_a^b f(x) dx

[Diagram: f → ∫ → F]
Integration
a toy problem

[Figure: the integrand f(x) (left); integration error F − F̂ vs. # samples, on log scales (right)]

f(x) = exp(− sin²(3x) − x²)          F = ∫_{−3}^{3} f(x) dx = ?

f(x) ≤ exp(−x²),   and   ∫_{−∞}^{∞} exp(−x²) dx = √π
Monte Carlo
(almost) no assumptions, stochastic convergence

[Figure: the integrand with randomly sampled nodes (left); error F − F̂ vs. # samples (right)]

F = ∫ exp(− sin²(3x) − x²) dx
  = Z ∫ (f(x) / g(x)) · (g(x) / Z) dx,              Z = ∫ g(x) dx
  ≈ (Z / N) Σ_{i=1}^{N} f(x_i) / g(x_i) = F̂,        x_i ∼ g(x)/Z,        var(F̂) = var_g(f/g) / N

▸ adding randomness enforces stochastic convergence

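As a concrete illustration of the importance-sampling estimator above, here is a minimal numpy sketch for the toy integrand, using a standard normal as the proposal; the proposal, sample size, and seed are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # toy integrand from the slides
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

def g_density(x):
    # normalized proposal density g(x)/Z: a standard normal (arbitrary choice)
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

N = 10_000
x = rng.standard_normal(N)                    # x_i ~ g/Z
weights = f(x) / g_density(x)                 # f(x_i) / (g(x_i)/Z); Z absorbed into the density
F_hat = weights.mean()                        # Monte Carlo estimate of F
std_err = weights.std(ddof=1) / np.sqrt(N)    # estimate of sqrt(var_g(f/g) / N)

print(f"F_hat = {F_hat:.6f} +/- {std_err:.6f}")
```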

The probabilistic approach
integration as nonparametric inference [P. Diaconis, 1988, T. O’Hagan, 1991]

[Figure: GP posterior on f(x) from a few evaluations (left); error F − F̂ vs. # samples (right)]

p(f) = GP(f; 0, k),        k(x, x′) = min(x, x′) + c

p(z) = N(z; µ, Σ)   ⇒   p(Az) = N(Az; Aµ, AΣA⊺)

p(∫_a^b f(x) dx) = N [ ∫_a^b f(x) dx;  ∫_a^b m(x) dx,  ∬_a^b k(x, x′) dx dx′ ]

                 = N (F; 0, −1/6 (b³ − a³) + 1/2 [b³ − 2a²b + a³] + (b − a)² c)
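A rough numpy sketch of this construction: place a GP prior on f, condition on a few evaluations, and read off a Gaussian posterior over F. To keep the kernel positive semi-definite on [a, b] with a < 0, the sketch pins the Wiener process at the left endpoint, i.e. k(x, x′) = min(x − a, x′ − a) + c; that shift, the constant c, and the node locations are illustrative assumptions rather than the slide's exact parametrization.

```python
import numpy as np

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

a, b, c = -3.0, 3.0, 1.0
X = np.linspace(-2.0, 2.0, 8)            # evaluation nodes (arbitrary)
y = f(X)

def k(x1, x2):
    # Wiener-process-type kernel, pinned at the left endpoint a
    return np.minimum(x1 - a, x2 - a) + c

K = k(X[:, None], X[None, :]) + 1e-10 * np.eye(len(X))

# kernel integrals: z_i = int_a^b k(x, X_i) dx,  v = double integral of k over [a,b]^2
B = b - a
u = X - a
z = u * B - 0.5 * u ** 2 + c * B
v = B ** 3 / 3 + c * B ** 2

alpha = np.linalg.solve(K, y)
mean_F = z @ alpha                            # posterior mean of F
var_F = v - z @ np.linalg.solve(K, z)         # posterior variance of F
print(f"F ~ N({mean_F:.4f}, {var_F:.2e})")
```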
Active Collection of Information
choice of evaluation nodes [T. Minka, 2000]

[Figure: GP posterior on f(x) as nodes are added one by one (left); error F − F̂ vs. # samples (right)]

x_t = arg min_x [ var_{p(F | x_1, …, x_{t−1}, x)}(F) ]

Active node placement for maximum expected error reduction
gives a regular grid.
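A small sketch of this greedy design rule: at each step, pick the candidate node that minimizes the resulting posterior variance of F. Under a GP prior this variance depends only on the node locations, not on the function values. The squared-exponential kernel, length scale, and candidate grid used here are illustrative choices, not the exact setup behind the slide.

```python
import numpy as np
from scipy.special import erf

a, b, lam = -3.0, 3.0, 0.5

def k(x1, x2):
    return np.exp(-0.5 * np.subtract.outer(x1, x2) ** 2 / lam ** 2)

def z(x):
    # z_i = int_a^b k(x', x_i) dx'  (closed form via the error function)
    return lam * np.sqrt(np.pi / 2) * (
        erf((b - x) / (np.sqrt(2) * lam)) - erf((a - x) / (np.sqrt(2) * lam))
    )

# prior variance of F: double integral of k over [a,b]^2, approximated on a grid
grid = np.linspace(a, b, 400)
w = (b - a) / len(grid)
prior_var = w * w * k(grid, grid).sum()

def post_var(nodes):
    K = k(nodes, nodes) + 1e-10 * np.eye(len(nodes))
    zn = z(nodes)
    return prior_var - zn @ np.linalg.solve(K, zn)

candidates = np.linspace(a, b, 121)
nodes = []
for t in range(8):
    scores = [post_var(np.array(nodes + [c])) for c in candidates]
    nodes.append(candidates[int(np.argmin(scores))])
print("greedily chosen nodes:", np.round(sorted(nodes), 2))
```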
Posterior Mean is a Linear Spline …
A classic numerical method, derived as a learning machine [P. Diaconis, 1988, T. O'Hagan, 1991]

[Figure: piecewise-linear GP posterior mean through the evaluation nodes (left); error F − F̂ vs. # samples (right)]

p(f | y, x) = GP(f; m_y, k_y),        k(x, x_i) = min(x, x_i) + c

m_y(x) = Σ_i k(x, x_i) α_i,            α = k_XX⁻¹ y
We just re-discovered the Trapezoid Rule!
A classic numerical method, derived as a learning machine [P. Diaconis, 1988, T. O’Hagan, 1991]

[Figure: the posterior-mean interpolant and its integral (left); error F − F̂ vs. # samples (right)]

p(f | y, x) = GP(f; m_y, k_y),        k(x, x_i) = min(x, x_i) + c

m_y(x) = Σ_i k(x, x_i) α_i,            α = k_XX⁻¹ y

E_{p(f | y)}[F] = ∫ m_y(x) dx = Σ_{i=1}^{N−1} 1/2 (x_{i+1} − x_i) (f(x_{i+1}) + f(x_i))
[Figure: the trapezoid-rule interpolant (left); error F − F̂ vs. # samples (right)]

▸ The trapezoid rule is the maximum a-posteriori estimate for F under a
  Wiener process prior on f.
▸ Node placement by information maximization under this prior.
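A quick numerical check of the identity above: the posterior-mean integral, written out as the weighted sum, coincides with the classical trapezoid rule (here compared against numpy's trapezoid implementation on arbitrary nodes).

```python
import numpy as np

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

x = np.sort(np.random.default_rng(1).uniform(-3, 3, 12))   # arbitrary nodes
y = f(x)

# posterior-mean integral from the slide: a sum of trapezoid areas
bq_mean = 0.5 * np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]))

print(bq_mean, np.trapz(y, x))   # identical up to floating point
```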
▸ introducing random numbers into a deterministic problem may not be
  a good strategy
▸ recipe for a probabilistic numerical method estimating x from y(t):
  ▸ choose a model p(x, y(t))
  ▸ choose an action rule / policy / strategy
    [t_1, …, t_{i−1}, y(t_1), …, y(t_{i−1})] → t_i
▸ some classic numerical methods can be derived entirely from an
  inference perspective, using classic statistical methods
▸ the trapezoid rule is a MAP estimate under a Wiener process prior on
  the integrand; regular node placement arises from
  information-greediness
▸ the probabilistic interpretation as such does not ensure that the posterior
  distribution is well calibrated
Customized Numerics
machine learning providing new numerics [Hennig, Osborne, Girolami, RSPA 2015]

[Figure: GP posterior with a squared-exponential kernel (left); error F − F̂ vs. # samples (right)]

k(x, x′) = exp (− (x − x′)² / (2λ²))

Encodes
▸ smooth (infinitely differentiable) f
▸ exponentially decaying Fourier power spectrum
But ignores
▸ non-stationarity
▸ positivity
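A sketch of Bayesian quadrature with this squared-exponential kernel: its integral against an interval [a, b] has a closed form via the error function, so the quadrature weights can be computed exactly. The length scale and node placement are illustrative choices.

```python
import numpy as np
from scipy.special import erf

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

a, b, lam = -3.0, 3.0, 0.5
X = np.linspace(-3.0, 3.0, 15)
y = f(X)

K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / lam ** 2) + 1e-10 * np.eye(len(X))

# z_i = int_a^b exp(-(x - X_i)^2 / (2 lam^2)) dx
z = lam * np.sqrt(np.pi / 2) * (
    erf((b - X) / (np.sqrt(2) * lam)) - erf((a - X) / (np.sqrt(2) * lam))
)

weights = np.linalg.solve(K, z)           # quadrature weights K^{-1} z
F_hat = weights @ y                       # posterior mean of F
print(f"BQ estimate: {F_hat:.6f}")
```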
No Such Thing as a Free Lunch!
incorrect assumptions give arbitrarily bad performance [Hennig, Osborne, Girolami, RSPA 2015]

[Figure: posterior under a mismatched prior (left); error F − F̂ vs. # samples, now out to 10³ (right)]

Computations collect information about a latent quantity.

The more valid prior information is available, the cheaper the
computation – if the algorithm uses the prior information!
Model Mismatch can be Detected at Runtime
“a numerical conscience”

[Figure: two test cases, each showing the integrand f(x) (top), the error F − F̂ (middle), and the mismatch statistic r (bottom) as functions of # samples]

r = log ( E_{f̃}[p(f̃(x))] / p(f(x)) ) = (f(x) − µ(x))⊺ K⁻¹ (f(x) − µ(x)) − N
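A small sketch of such a runtime check: compute the Mahalanobis-type statistic from the slide for the observed function values under the GP prior. Values that scatter around zero are consistent with the prior; large values flag a mismatch. The kernel, nodes, and test functions below are illustrative.

```python
import numpy as np

# Runtime model-mismatch check: under the prior f(X) ~ N(mu, K), the quadratic
# form (f - mu)^T K^{-1} (f - mu) has expectation N, so r = quadratic form - N
# stays moderate when the prior fits and blows up when it does not.

def mismatch_statistic(fX, mu, K):
    d = fX - mu
    return d @ np.linalg.solve(K, d) - len(fX)

lam = 0.5
X = np.linspace(-3, 3, 20)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / lam ** 2) + 1e-8 * np.eye(len(X))
mu = np.zeros(len(X))

f_smooth = np.exp(-np.sin(3 * X) ** 2 - X ** 2)   # relatively compatible with the prior
f_rough = np.sign(np.sin(40 * X))                  # much rougher than the prior expects

print("r (smooth):", mismatch_statistic(f_smooth, mu, K))
print("r (rough): ", mismatch_statistic(f_rough, mu, K))
```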
▸ encoding tangible prior information in the prior can give tailored
  numerics that drastically improve computational efficiency
▸ in contrast to physical data sources, in numerical problems prior
  assumptions can be rigorously verified, because the problem is stated
  in a formal (programming) language!
▸ incorrect prior assumptions can catastrophically affect performance
▸ model assumptions can be adapted at runtime, using established
  statistical techniques, e.g. "type-II maximum likelihood"
So why is everyone (in ML) using MCMC?

Warped Sequential Active Bayesian Integration (WSABI)
[Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

[Figure: warped GP model of the integrand (left); error F − F̂ vs. # samples (right)]

f(x) · exp(−x²/2) ∼ GP [0, k = exp(−1/2 (x − x′)⊺ Λ⁻¹ (x − x′))]

▸ encodes positivity
▸ encodes non-stationarity
Warped Sequential Active Bayesian Integration (WSABI)
[Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

[Figure: actively selected evaluation nodes under the warped model (left); error F − F̂ vs. # samples (right)]

▸ select evaluation nodes at arg max_x var[f(x) · exp(−x²)]
▸ this scales to higher input dimensionality

more formal analysis in arXiv 1410.2392 [Oates et al.] & 1506.02681 [Briol et al.]
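A simplified sketch of the acquisition rule above: fit a GP to the evaluations seen so far and pick the next node where the posterior variance, weighted by the squared Gaussian factor, is largest. For brevity this uses a plain (unwarped) GP, whose posterior variance does not depend on the observed values; the published WSABI model uses a square-root warping of the integrand, so treat this purely as an illustration of the active-selection loop, with kernel, length scale, and candidate grid chosen arbitrarily.

```python
import numpy as np

lam = 0.5
def k(x1, x2):
    return np.exp(-0.5 * np.subtract.outer(x1, x2) ** 2 / lam ** 2)

candidates = np.linspace(-3, 3, 201)
X = [0.0]                                        # start with one node
for t in range(10):
    Xa = np.array(X)
    K = k(Xa, Xa) + 1e-9 * np.eye(len(Xa))
    Kc = k(Xa, candidates)                       # cross-covariances to candidates
    Kinv_Kc = np.linalg.solve(K, Kc)
    post_var = 1.0 - np.einsum("ij,ij->j", Kc, Kinv_Kc)
    acquisition = post_var * np.exp(-candidates ** 2)   # weighted posterior variance
    X.append(candidates[int(np.argmax(acquisition))])
print("selected nodes:", np.round(np.sort(X), 2))
```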
Probabilistic Numerics Need Not Be Expensive
WSABI is time-competitive with MCMC [Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

[Figure: integration error |F_est − F_true| vs. computation time [s] for Monte Carlo and WSABI on four problems: synthetic (moG), yacht hydrodynamics, GP classification (synthetic), and GP classification (graph)]
end of Integration part

▸ computation is inference
▸ there is a deep formal connection between basic numerical methods
and basic statistical models
▸ ignoring salient prior information causes a drastic increase in computational
  cost. Black boxes may be convenient, but they are not efficient
▸ machine learning can help numerics, and vice versa
Ordinary Differential Equations

x′(t) = f(x(t), t),        x(t_0) = x_0,        x: R → R^N

[Diagram: x_0 → ode → x]
Runge-Kutta Methods
iterative linear extrapolation

[Figure: iterated linear extrapolation of x(t) from t_0 through t_0 + c_1 and t_0 + c_2 to t_0 + h]

0    | 1                          y_1 = f(1·x_0, t_0)
c_1  | 1  w_11                    y_2 = f(1·x_0 + w_11 y_1, t_0 + c_1)
c_2  | 1  w_21  w_22              y_{s+1} = f(1·x_0 + Σ_{i=1}^{s} w_si y_i, t_0 + c_s)
h    | 1  b_1   b_2   b_3         x̂(t_0 + h) = 1·x_0 + Σ_i b_i y_i

▸ Runge-Kutta methods choose (c, w, b) such that ∥x̂(t_0 + h) − x(t_0 + h)∥ = O(h^p)
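To make the tableau concrete, here is a small sketch of a generic explicit Runge-Kutta step driven by coefficients (c, w, b), instantiated with the classical fourth-order tableau; the test ODE and step size are arbitrary choices.

```python
import numpy as np

def explicit_rk_step(f, x0, t0, h, c, w, b):
    """One explicit Runge-Kutta step x(t0) -> x_hat(t0 + h).

    c, b: stage nodes and final weights; w: strictly lower-triangular stage
    weights, mirroring the tableau above (here the weights are scaled by h).
    """
    stages = []
    for i in range(len(c)):
        xi = x0 + h * sum(w[i][j] * stages[j] for j in range(i))
        stages.append(f(xi, t0 + c[i] * h))
    return x0 + h * sum(bi * yi for bi, yi in zip(b, stages))

# classical RK4 tableau
c = [0.0, 0.5, 0.5, 1.0]
w = [[], [0.5], [0.0, 0.5], [0.0, 0.0, 1.0]]
b = [1 / 6, 1 / 3, 1 / 3, 1 / 6]

f = lambda x, t: -x                      # test ODE: x' = -x, x(0) = 1
x, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    x = explicit_rk_step(f, x, t, h, c, w, b)
    t += h
print(x, np.exp(-1.0))                   # numerical vs. exact solution at t = 1
```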
Gauss-Markov inference on ODEs
a probabilistic model matching Runge-Kutta [Schober, Duvenaud, Hennig, NIPS 2014]

[Figure: Gauss-Markov posterior on x(t), extrapolated from t_0 through c_1 and c_2 to h]

0    | 1                          y_1 = f(µ_{|x_0}(t_0), t_0)
c_1  | 1  w_11                    y_2 = f(µ_{|x_0, y_1}(c_1), t_0 + c_1)
c_2  | 1  w_21  w_22              y_{s+1} = f(µ_{|x_0, y_1, …, y_s}(c_s), t_0 + c_s)
h    | 1  b_1   b_2   b_3         x̂(t_0 + h) = µ_{|x_0, y_i}(t_0 + h)

µ_{|y}(t) := cov(x(t), y) cov(y, y)⁻¹ y = 1·x_0 + Σ_i w_si y_i
Gauss-Markov inference on ODEs
a probabilistic model matching Runge-Kutta [Schober, Duvenaud, Hennig, NIPS 2014]

[Figure: Gauss-Markov extrapolation of x(t) over one step]

▸ Linear extrapolation suggests a Gaussian process model
▸ polynomial form suggests a Wiener state-space model

    dz = F z dt + L dω,        z(t) = (x(t), x′(t), …, h^k/k! · x^(k)(t))⊺,
    F = 1/h · (matrix with superdiagonal 1, 2, …, k, zeros elsewhere),        L = σ · (0, …, 0, 1)⊺

▸ inference through filtering
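A minimal sketch of "inference through filtering": a Kalman filter on a once-integrated Wiener process prior (state x, x′), where each evaluation of f at the predicted mean is treated as a noise-free observation of x′. This is a simplified, generic ODE filter in the spirit of the construction above, not the exact parametrization of the cited paper; the test ODE and step count are arbitrary.

```python
import numpy as np

def ode_filter(f, x0, t0, t1, n_steps, sigma=1.0):
    h = (t1 - t0) / n_steps
    A = np.array([[1.0, h], [0.0, 1.0]])                    # state transition
    Q = sigma**2 * np.array([[h**3 / 3, h**2 / 2],          # process noise (IWP(1))
                             [h**2 / 2, h]])
    H = np.array([[0.0, 1.0]])                              # observe x'
    m = np.array([x0, f(x0, t0)])                           # initialize with f(x0)
    P = np.zeros((2, 2))
    t = t0
    for _ in range(n_steps):
        # predict
        m, P = A @ m, A @ P @ A.T + Q
        t += h
        # "measure" the derivative at the predicted mean
        y = f(m[0], t)
        S = H @ P @ H.T                                      # innovation covariance
        K = P @ H.T / S                                      # Kalman gain
        m = m + (K * (y - m[1])).ravel()
        P = P - K @ H @ P
    return m, P                                              # mean, covariance at t1

m, P = ode_filter(lambda x, t: -x, x0=1.0, t0=0.0, t1=1.0, n_steps=50)
print(m[0], np.exp(-1.0))   # filtered estimate of x(1) vs. exact solution
```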
Calibrating Uncertainty
within the parametrized class

[Figure: extrapolation of x(t) from t_0 through t_0 + c_1 and t_0 + c_2 to t_0 + h, with the predictive distribution p[x(t_0 + h)]]

▸ the posterior mean µ_{|y} = kK⁻¹y is invariant under the rescaling k → θ²k
▸ the posterior covariance k_{|y} = k − kK⁻¹k is scaled by θ²
▸ connection to local error estimation in existing methods
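One standard way to set the scale θ² at runtime, a generic type-II maximum-likelihood choice rather than necessarily the calibration used in the cited work, is θ̂² = yᵀK⁻¹y / N: it leaves posterior means untouched and rescales posterior covariances.

```python
import numpy as np

# Output-scale calibration for a GP model k -> theta^2 * k.
# The MLE of theta^2 under y ~ N(0, theta^2 K) is y^T K^{-1} y / N;
# posterior means are unchanged, posterior covariances scale by theta_hat^2.

def calibrate_scale(K, y):
    return float(y @ np.linalg.solve(K, y)) / len(y)

# illustrative example: data drawn with true output scale 4.0
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 30)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.2 ** 2) + 1e-6 * np.eye(30)
y = rng.multivariate_normal(np.zeros(30), 4.0 * K)

theta2_hat = calibrate_scale(K, y)
print("estimated theta^2:", theta2_hat)     # typically close to the true value 4
```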
▸ as in integration, classic families of ODE solvers can be interpreted as
  MAP inference
▸ for each classic solver there is a whole family of probabilistic solvers with
  the same point estimate but different uncertainties
▸ probabilistic solvers need not be expensive; they can even have exactly the
  same cost as classic methods
Visualizing Computational Uncertainty
Neural Pathways [M. Schober, N. Kasenburg, A. Feragen, P.H., S. Hauberg, MICCAI 2014]

Propagating Uncertainty
geodesics on an uncertain Riemannian metric [Hauberg, Schober, Liptrot, P.H., Feragen, MICCAI 2015]

[Figure: density and standard deviation of a single GP geodesic under the random metric, projected into axis-aligned slices; panels: Dijkstra in CST, GP in CST, Dijkstra in ILF, GP in ILF]

▸ Shortest paths (geodesics) on a Riemannian manifold with metric M obey

    x″(t) = −1/2 M⁻¹(x(t)) [∂ M⃗(x(t)) / ∂x(t)]⊺ (x′(t) ⊗ x′(t)) = f(x(t), x′(t), t)

▸ what if M ∼ N(m, V) is uncertain (inferred from data)?
Uncertainty Across Composite Computations
interacting information requirements [Hennig, Osborne, Girolami, Proc. Royal Society A 2015]

[Diagram: data D, variables x_t, parameters θ, predictions x_{t+δt}, and actions a, linked by inference (quadrature), estimation (optimization), prediction (analysis), and action (control), spanning environment and machine: learning / inference / pattern recognition / system identification as a chain of numerical computations]

▸ numerical methods able to deal with uncertain (probabilistic) inputs,
  and returning uncertain (probabilistic) outputs, allow control of
  computational effort
▸ (this is not the same as probabilistic programming!)
Summary
Probabilistic Numerics – Part I [Hennig, Osborne, Girolami, Proc. Royal Society A 2015]

The probabilistic view of computation


▸ computation is (active) inference
▸ several classic methods can be interpreted precisely as MAP inference

▸ Gaussian Quadrature—Gaussian process regression


▸ Runge-Kutta Methods—Autoregressive Filtering
▸ [more to come on Monday]
▸ correct prior information can reduce runtime
▸ prior assumptions can be tested at runtime
▸ probabilistic uncertainty can be propagated between different
numerical computations, allowing control of cost

“[PN is] a re-emerging area of very active research”


Z. Ghahramani, Nature 521, 452–459 (28 May 2015)

http://probabilistic-numerics.org

— Backup —

Monte Carlo is not all that Different
MC as a 'cautious' limit

[Figure: GP posterior under the OU kernel (left); error F − F̂ vs. # samples (right)]

p(f) = GP(f; 0, k_λ^OU + c),        k_λ^OU(x, x′) = exp (− |x − x′| / λ)

λ → ∞   ⇒   k(x, x′) = |x − x′|        → trapezoid rule
λ → 0   ⇒   k(x, x′) → δ(x − x′)        → averaging
Monte Carlo is not all that Different
MC as a 'cautious' limit

[Figure: posterior in the λ → 0 limit (left); error F − F̂ vs. # samples (right)]

MC is optimal for totally unstructured (but integrable) f. But

▸ random node placement is not necessary!
  Introducing randomness to a deterministic problem is generally a bad idea.
▸ no integrand is totally unstructured. Prior knowledge is available!
  Computations should use as much available prior information as possible.
What is a Random Number?
What is a sequence of random numbers?

▸ 662244111144334455553366666666 dice, doubled


▸ 169399375105820974944592307816 41-70th digits of π
▸ 712904263472610590208336044895 von Neumann method, seed 908344
▸ 100011111101111111100101000001 bits from G. Marsaglia’s ‘diehard CD’
▸ 01110000011100100110111101100011 deterministic sequence,
corrupted by horizontal coin drop from unknown height

▸ for use in Monte Carlo, the important property is freedom from


patterns (because it implies anytime unbiasedness)
▸ in fact, use for MC only really requires the right density, disorder is
just helpful for the argument
▸ for use in cryptography, the important property is unpredictability
▸ randomness is a philosophically dodgy concept
▸ uncertainty is a much clearer idea
http://www.stat.fsu.edu/pub/diehard/
34 ,
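The "von Neumann method" above presumably refers to his middle-square generator. Here is a toy sketch of that idea (purely illustrative, with arbitrary parameters, and not an attempt to reproduce the digits listed above): a completely deterministic rule whose output nonetheless looks pattern-free.

```python
# Toy sketch of von Neumann's middle-square generator: square the current
# state and take the middle digits as the next state. Deterministic, yet the
# digit stream looks disordered. (Illustration only; the method is known to
# fall into short cycles, so it is not a serious random-number generator.)

def middle_square(seed: int, n_digits: int = 6):
    state = seed
    while True:
        sq = str(state * state).zfill(2 * n_digits)
        mid = len(sq) // 2
        state = int(sq[mid - n_digits // 2: mid + n_digits // 2])
        yield state

gen = middle_square(908344)
print([next(gen) for _ in range(5)])
```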
How Expensive is a Computation?
Ex: Riemannian Statistics [Schober, Hauberg, Lawrence, Hennig; in prep.]

[Figure: two-dimensional data set on a learned Riemannian manifold, roughly [−2, 2] × [−0.5, 0.5]]

▸ Shortest paths (geodesics) on a Riemannian manifold with metric M obey

    x″(t) = −1/2 M⁻¹(x(t)) [∂ M⃗(x(t)) / ∂x(t)]⊺ (x′(t) ⊗ x′(t)) = f(x(t), x′(t), t)

▸ the Karcher mean µ of a data set {x_i}_{i=1,…,N} is the point minimizing
  Σ_{i=1}^{N} Distance(µ, x_i)
How Expensive is a Computation?
Ex: Riemannian Statistics [Schober, Hauberg, Lawrence, Hennig; in prep.]

[Figure: the same two-dimensional data set]

▸ to find the Karcher mean, do gradient descent from an initial guess µ_0

    µ_{k+1} = µ_k − α ∇_µ Σ_{i=1}^{N} Distance(µ, x_i)

▸ this requires solving N initial value problems, over and over.
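The slide's setting needs a numerical geodesic solver inside every gradient step. As a self-contained toy where the geodesic maps are available in closed form, here is Karcher-mean gradient descent on the unit sphere; the sphere, the step size, and the synthetic data are illustrative stand-ins for the general Riemannian case.

```python
import numpy as np

# Karcher (Frechet) mean on the unit sphere S^2 by Riemannian gradient
# descent: mu <- Exp_mu( alpha * mean_i Log_mu(x_i) ).
# On a general learned manifold, each Log/Exp would itself require solving
# an initial or boundary value problem numerically, which is the slide's point.

def log_map(mu, x):
    """Tangent vector at mu pointing towards x, with length = geodesic distance."""
    cos_t = np.clip(mu @ x, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(mu)
    v = x - cos_t * mu
    return theta * v / np.linalg.norm(v)

def exp_map(mu, v):
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return mu
    return np.cos(norm_v) * mu + np.sin(norm_v) * v / norm_v

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3)) + np.array([0.0, 0.0, 5.0])
data /= np.linalg.norm(data, axis=1, keepdims=True)      # points on the sphere

mu = data[0]
for _ in range(100):
    grad = np.mean([log_map(mu, x) for x in data], axis=0)
    mu = exp_map(mu, 0.5 * grad)                          # step size alpha = 0.5
print("Karcher mean:", np.round(mu, 3))
```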
How Much Information is Needed?
Ex: Riemannian Statistics [Schober, Hauberg, Lawrence, Hennig; in prep.]

[Figure: the same two-dimensional data set]
