– Part I –
Integration and Differential Equations
Philipp Hennig
MLSS 2015
18 / 07 / 2015
Information Content of Partial Computations
division with remainder

  2 3 7 3 6 ÷ 7 3 6 = 3 2 . 2 5
  2 2 0 8 0
    1 6 5 6
    1 4 7 2
      1 8 4
      1 4 7 . 2
        3 6 . 8
        3 6 . 8
            0
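Each digit produced by the long division narrows the set of answers still consistent with the work done so far. A minimal sketch (not from the slides) making this explicit: it fixes one quotient digit per step, in exact integer arithmetic, and prints the interval the true quotient must lie in.

```python
# Minimal sketch (not from the slides): long division one digit at a time,
# reporting the interval of quotients consistent with the digits fixed so far.
def partial_division(numerator, denominator, decimals=2):
    n = numerator * 10**decimals          # work in integers, scaled by 10^decimals
    q = 0                                  # quotient digits fixed so far (scaled)
    num_digits = len(str(n // denominator))
    for i in range(num_digits):
        place = 10 ** (num_digits - 1 - i)                 # current place value
        d = (n - q * denominator) // (denominator * place)  # next quotient digit
        q += d * place
        lo = q / 10**decimals
        hi = (q + place) / 10**decimals    # remaining digits add less than one unit
        print(f"after {i + 1} digit(s): quotient lies in [{lo}, {hi})")
    return q / 10**decimals

print(partial_division(23736, 736))   # intervals shrink towards 32.25
```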
What about ML computations?
Contemporary computational tasks are more challenging
Machine learning methods are chains of numerical computations
▸ linear algebra (least-squares)
▸ optimization (training & fitting)
▸ integration (MCMC, marginalization)
▸ solving differential equations (RL, control)
Are these methods just black boxes on your shelf?
Numerical methods perform inference
an old observation [Poincaré 1896, Diaconis 1988, O’Hagan 1992]
A numerical method
estimates a function’s latent property
given the result of computations.
If computation is inference, it should be possible to build
probabilistic numerical methods
that take in probability measures over inputs,
and return probability measures over outputs,
which quantify uncertainty arising from the uncertain input
and the finite information content of the computation.
Classic methods identified as maximum a posteriori
probabilistic numerics is anchored in established theory
Integration

    F = ∫ₐᵇ f(x) dx

[diagram: f → ∫ → F]
Integration
a toy problem

[plot: integrand f(x) on x ∈ [−3, 3]; error ∣F − F̂∣ against # samples]

    f(x) ≤ exp(−x²),        ∫ exp(−x²) dx = √π
Monte Carlo
(almost) no assumptions, stochastic convergence

[plot: integrand f(x) with randomly sampled nodes; error ∣F − F̂∣ against # samples]
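A minimal sketch (not the estimator behind the slide's plots): a plain importance-sampling Monte Carlo estimate of the toy integral ∫ exp(−x²) dx = √π, whose error decays stochastically at the familiar O(N^(−1/2)) rate.

```python
# Minimal sketch (not from the slides): Monte Carlo estimate of ∫ exp(-x²) dx = √π.
import numpy as np

rng = np.random.default_rng(0)
F_true = np.sqrt(np.pi)

for N in [10, 100, 1000, 10000]:
    x = rng.normal(size=N)                            # samples from N(0, 1)
    q = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)      # sampling density
    F_hat = np.mean(np.exp(-x**2) / q)                # importance-weighted average
    print(N, abs(F_hat - F_true))                     # error shrinks ~ N^(-1/2)
```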
Active Collection of Information
choice of evaluation nodes [T. Minka, 2000]

[plot: integrand f(x) with sequentially chosen evaluation nodes; error ∣F − F̂∣ against # samples]
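A hedged sketch of the idea (not Minka's derivation): under a Wiener-process prior on the integrand, greedily choosing each new evaluation node to minimise the posterior variance of F produces nodes spread out roughly evenly over the domain. The kernel k(x, x′) = min(x, x′), its integral x − x²/2 and the double integral 1/3 are the standard Brownian-motion expressions on [0, 1]; the candidate grid is an illustrative choice.

```python
# Hedged sketch (not from the slides): greedy, variance-minimising node placement
# for Bayesian quadrature under a Wiener-process prior f ~ GP(0, min(x, x')) on [0, 1].
import numpy as np

def posterior_variance_of_F(nodes):
    X = np.array(nodes)
    K = np.minimum(X[:, None], X[None, :]) + 1e-10 * np.eye(len(X))
    z = X - X**2 / 2                                 # ∫ min(x_i, t) dt over [0, 1]
    return 1.0 / 3.0 - z @ np.linalg.solve(K, z)     # ∬ min(x, t) dx dt = 1/3

candidates = np.linspace(0.01, 1.0, 100)
nodes = []
for _ in range(5):
    remaining = [c for c in candidates if c not in nodes]
    scores = [posterior_variance_of_F(nodes + [c]) for c in remaining]
    nodes.append(remaining[int(np.argmin(scores))])

print(np.round(sorted(nodes), 2))   # nodes come out roughly evenly spaced
```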
Posterior Mean is a Linear Spline…
A classic numerical method, derived as a learning machine [P. Diaconis, 1988, T. O’Hagan, 1991]

[plot: piecewise-linear posterior mean through the evaluation nodes; error ∣F − F̂∣ against # samples]
We just re-discovered the Trapezoid Rule!
A classic numerical method, derived as a learning machine [P. Diaconis, 1988, T. O’Hagan, 1991]

[plot: piecewise-linear posterior mean through the nodes; error ∣F − F̂∣ against # samples]

    E_{p(f ∣ y)}[F] = ∫ m_y(x) dx = ∑_{i=1}^{N−1} ½ (x_{i+1} − x_i) (f(x_i) + f(x_{i+1}))
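The sum above is exactly the classical trapezoid rule. A two-line check (not from the slides) against SciPy's trapezoid implementation:

```python
# Minimal sketch (not from the slides): the posterior-mean integral equals the
# classical trapezoid rule, here verified against scipy.integrate.trapezoid.
import numpy as np
from scipy.integrate import trapezoid

x = np.sort(np.random.default_rng(1).uniform(-3, 3, size=20))   # evaluation nodes
f = np.exp(-x**2)                                                # integrand values

E_F = np.sum(0.5 * np.diff(x) * (f[:-1] + f[1:]))   # ∑ ½ (x_{i+1}-x_i)(f(x_i)+f(x_{i+1}))
print(E_F, trapezoid(f, x))                          # identical up to round-off
```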
▸ introducing random numbers into a deterministic problem may not be a good strategy
▸ recipe for a probabilistic numerical method estimating x from y(t):
  ▸ choose model p(x, y(t))
  ▸ choose action rule / policy / strategy [t₁, …, t_{i−1}, y(t₁), …, y(t_{i−1})] ↦ t_i
▸ some classic numerical methods can be derived entirely from an inference perspective, using classic statistical methods
▸ the trapezoid rule is a MAP estimate under a Wiener process prior on the integrand; regular node placement arises from information-greediness
▸ the probabilistic interpretation as such does not ensure the posterior distribution is well calibrated
Customized Numerics
machine learning providing new numerics [Hennig, Osborne, Girolami, RSPA 2015]

[plot: GP posterior on f(x) under a squared-exponential kernel; error ∣F − F̂∣ against # samples]

    k(x, x′) = exp (− (x − x′)² / 2λ² )

Encodes
▸ smooth (infinitely differentiable) f
▸ exponentially decaying Fourier power spectrum
But ignores
▸ non-stationarity
▸ positivity
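A hedged sketch (not the RSPA implementation) of Bayesian quadrature under this squared-exponential prior: the posterior mean of F = ∫ₐᵇ f(x) dx is a weighted sum of the observed values, with weights built from the kernel matrix and the kernel's integrals, the latter available in closed form via the error function. Nodes, interval and length-scale below are illustrative.

```python
# Hedged sketch (not from the slides): Bayesian quadrature with a squared-exponential
# kernel; posterior mean of F = ∫_a^b f(x) dx is w @ f(X) with w = z K^{-1}.
import numpy as np
from scipy.special import erf

def bq_sqexp(f_vals, X, a, b, lam):
    X = np.asarray(X, float)
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / lam**2)
    K += 1e-10 * np.eye(len(X))                         # jitter for stability
    # z_i = ∫_a^b k(x_i, t) dt, in closed form for the squared-exponential kernel
    z = lam * np.sqrt(np.pi / 2) * (erf((b - X) / (np.sqrt(2) * lam))
                                    - erf((a - X) / (np.sqrt(2) * lam)))
    return z @ np.linalg.solve(K, f_vals)               # posterior mean of F

# toy integrand from the slides: f(x) = exp(-x²), whose integral over the line is √π
X = np.linspace(-3, 3, 9)
print(bq_sqexp(np.exp(-X**2), X, a=-3.0, b=3.0, lam=0.8), np.sqrt(np.pi))
```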
No Such Thing as a Free Lunch!
incorrect assumptions give arbitrarily bad performance [Hennig, Osborne, Girolami, RSPA 2015]

[plot: an integrand violating the model assumptions; error ∣F − F̂∣ against # samples]
Model Mismatch can be Detected at Runtime
“a numerical conscience”

[plots: two test integrands f(x); error ∣F − F̂∣ and the diagnostic r against # samples]

    r = log ( E_{f̃}[p(f̃(x))] / p(f(x)) ) = (f(x) − µ(x))ᵀ K⁻¹ (f(x) − µ(x)) − N
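A minimal sketch (not from the slides) of the diagnostic: the quadratic form has expectation N when the model is well specified, so r hovers around zero for data drawn from the prior and grows dramatically for a function the prior considers implausible. The two example functions below are illustrative only.

```python
# Minimal sketch (not from the slides): the mismatch statistic
# r = (f - µ)ᵀ K⁻¹ (f - µ) - N for GP prior mean µ and kernel matrix K.
import numpy as np

def mismatch_statistic(f_vals, mu, K):
    resid = f_vals - mu
    return resid @ np.linalg.solve(K, resid) - len(f_vals)

rng = np.random.default_rng(2)
X = np.linspace(-3, 3, 30)
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2) + 1e-9 * np.eye(30)

f_ok = rng.multivariate_normal(np.zeros(30), K)   # drawn from the prior itself
f_bad = np.sin(20 * X)                            # far rougher than the prior allows

print(mismatch_statistic(f_ok, np.zeros(30), K))    # fluctuates around 0
print(mismatch_statistic(f_bad, np.zeros(30), K))   # very large -> model mismatch
```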
▸ encoding tangible prior information in the model can give tailored numerics that drastically improve computational efficiency
▸ in contrast to physical data sources, in numerical problems prior assumptions can be rigorously verified, because the problem is stated in a formal (programming) language!
▸ incorrect prior assumptions can catastrophically affect performance
▸ model assumptions can be adapted at runtime, using established statistical techniques, e.g. “type-II maximum likelihood” (see the sketch below)
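A minimal sketch of the last point (a hypothetical example, not from the slides): adapt the length-scale λ of the squared-exponential kernel at runtime by type-II maximum likelihood, i.e. by maximising the GP log marginal likelihood of the evaluations gathered so far.

```python
# Hedged sketch (not from the slides): type-II maximum likelihood for the
# kernel length-scale, via the GP log marginal likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_marginal_likelihood(log_lam, X, f_vals):
    lam = np.exp(log_lam)
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / lam**2) + 1e-9 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_vals))
    return 0.5 * f_vals @ alpha + np.log(np.diag(L)).sum()   # + constant

X = np.linspace(-3, 3, 15)
f_vals = np.exp(-X**2)
res = minimize_scalar(neg_log_marginal_likelihood, bounds=(-3, 2),
                      args=(X, f_vals), method="bounded")     # search over log λ
print("fitted length-scale:", np.exp(res.x))
```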
So why is everyone (in ML) using MCMC?
Warped Sequential Active Bayesian Integration (WSABI)
[Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

[plot: warped GP posterior on the integrand; error ∣F − F̂∣ against # samples]

    √(f(x) ⋅ exp(−x²/2)) ∼ GP [0, k = exp (−½ (x − x′)ᵀ Λ⁻¹ (x − x′))]

▸ encodes positivity
▸ encodes nonstationarity
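A hedged sketch of the square-root warping (a simplified stand-in, not the WSABI implementation or its closed-form integrals): model the non-negative integrand as ℓ(x) = α + ½ g(x)² with a GP on g, fit g to the square-rooted observations, and integrate the linearised posterior mean of ℓ numerically on a grid. The offset α, length-scale and grid below are illustrative choices.

```python
# Hedged sketch (not the WSABI code): square-root warping of a non-negative integrand.
import numpy as np
from scipy.integrate import trapezoid

def se_kernel(a, b, lam=0.6):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lam**2)

def warped_integral_estimate(ell, X, grid):
    alpha = 0.8 * ell.min()                        # offset slightly below the smallest value
    g_obs = np.sqrt(2.0 * (ell - alpha))           # observations of the warped function g
    K = se_kernel(X, X) + 1e-8 * np.eye(len(X))
    m_g = se_kernel(grid, X) @ np.linalg.solve(K, g_obs)   # posterior mean of g
    ell_mean = alpha + 0.5 * m_g**2                         # linearised mean of ℓ ≥ 0
    return trapezoid(ell_mean, grid)

# toy target: ℓ(x) = exp(-x²)·exp(-x²/2); the true integral is √(2π/3)
X = np.linspace(-3, 3, 12)
ell = np.exp(-X**2) * np.exp(-X**2 / 2)
grid = np.linspace(-5, 5, 2001)
print(warped_integral_estimate(ell, X, grid), np.sqrt(2 * np.pi / 3))
```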
Probabilistic Numerics Need Not Be Expensive
WSABI is time-competitive with MCMC [Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

[plots: ∣F_est − F_true∣ against wall-clock time [s], Monte Carlo vs. WSABI, on GP-classification benchmarks (synthetic data and graph data)]
end of Integration part
▸ computation is inference
▸ there is a deep formal connection between basic numerical methods and basic statistical models
▸ ignoring salient prior information causes a drastic increase in computational cost. Black boxes may be convenient, but they are not efficient
▸ machine learning can help numerics, and vice versa
Ordinary Differential Equations
solve x′(t) = f(x(t), t) for x(t), given the initial value x(t₀) = x₀

[diagram: x₀ → ODE solver → x(t)]
Runge-Kutta Methods
iterative linear extrapolation

[plot: extrapolation of x(t) through evaluation nodes t₀, t₀ + c₁, t₀ + c₂, t₀ + h]

 0   │ 1                    y₁ = f(1x₀, t₀)
 c₁  │ 1  w₁₁               y₂ = f(1x₀ + w₁₁ y₁, t₀ + c₁)
 c₂  │ 1  w₂₁  w₂₂          y_{s+1} = f(1x₀ + ∑_{i≤s} w_{si} y_i, t₀ + c_s)
 h   │ 1  b₁   b₂   b₃      x̂(t₀ + h) = 1x₀ + ∑_i b_i y_i

The same scheme, with each extrapolation read as a Gaussian-process posterior mean µ:

 0   │ 1                    y₁ = f(µ∣x₀(t₀), t₀)
 c₁  │ 1  w₁₁               y₂ = f(µ∣x₀,y₁(c₁), t₀ + c₁)
 c₂  │ 1  w₂₁  w₂₂          y_{s+1} = f(µ∣x₀,y₁,…,y_s(c_s), t₀ + c_s)
 h   │ 1  b₁   b₂   b₃      x̂(t₀ + h) = µ∣x₀,y(t₀ + h)

[plot: resulting posterior p[x(t₀ + h)] over the solution at t₀ + h]
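A minimal sketch (not from the slides) of the classic construction: a single explicit Runge-Kutta step driven by a Butcher tableau, written in the standard convention where the nodes c are fractions of the step h. The RK4 tableau and the test equation below are illustrative.

```python
# Minimal sketch (not from the slides): one explicit Runge-Kutta step from a
# Butcher tableau (nodes c, stage weights W, final weights b) for x'(t) = f(x, t).
import numpy as np

def rk_step(f, x0, t0, h, c, W, b):
    s = len(b)
    y = np.zeros(s)                                       # stage slopes y_i
    for i in range(s):
        xi = x0 + h * sum(W[i][j] * y[j] for j in range(i))
        y[i] = f(xi, t0 + c[i] * h)
    return x0 + h * sum(b[i] * y[i] for i in range(s))

# Example: the classic 4-stage "RK4" tableau, applied to x' = -x, x(0) = 1.
c = [0.0, 0.5, 0.5, 1.0]
W = [[0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
b = [1/6, 1/3, 1/3, 1/6]

x, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    x = rk_step(lambda x, t: -x, x, t, h, c, W, b)
    t += h
print(x, np.exp(-1.0))   # RK4 estimate of x(1) vs. the exact value exp(-1)
```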
▸ as in integration, classic families of ODE solvers can be interpreted as MAP inference
▸ for each classic solver, there is a whole family of probabilistic solvers with the same point estimate but different uncertainties
▸ probabilistic solvers need not be expensive. They can even have exactly the same cost as classic methods (see the sketch below)
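To make the cost claim concrete, here is a hedged sketch (not the specific solvers of the references): a Kalman-filter solver with a once-integrated Wiener-process prior on x(t). It spends one evaluation of f per step, like a classic explicit method, and returns a posterior standard deviation alongside the point estimate. The noise scale σ and step size are illustrative.

```python
# Hedged sketch (not from the slides): filtering-based probabilistic ODE solver.
# Prior: once-integrated Wiener process, state z = (x, x'); at each step the
# derivative is "observed" as x' = f(x_pred, t) at the predicted mean.
import numpy as np

def prob_ode_solve(f, x0, t0, t1, h, sigma=1.0):
    A = np.array([[1.0, h], [0.0, 1.0]])                    # state transition
    Q = sigma**2 * np.array([[h**3 / 3, h**2 / 2],
                             [h**2 / 2, h]])                 # process noise
    H = np.array([[0.0, 1.0]])                               # observe the derivative
    m = np.array([x0, f(x0, t0)])
    P = np.zeros((2, 2))
    t, means, stds = t0, [x0], [0.0]
    while t < t1 - 1e-12:
        m, P = A @ m, A @ P @ A.T + Q                        # predict
        z = f(m[0], t + h) - H @ m                           # residual on x'
        S = H @ P @ H.T                                      # innovation variance
        K = P @ H.T / S                                      # Kalman gain (no obs. noise)
        m = m + (K * z).ravel()                              # update mean
        P = P - K @ H @ P                                    # update covariance
        t += h
        means.append(m[0]); stds.append(np.sqrt(P[0, 0]))
    return np.array(means), np.array(stds)

# Example: x' = -x, x(0) = 1; the mean tracks exp(-t), stds quantify solver uncertainty.
means, stds = prob_ode_solve(lambda x, t: -x, 1.0, 0.0, 1.0, 0.1)
print(means[-1], stds[-1], np.exp(-1.0))
```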
Visualizing Computational Uncertainty
Neural Pathways [M. Schober, N. Kasenburg, A. Feragen, P.H., S. Hauberg, MICCAI 2014]
Propagating Uncertainty
geodesics on an uncertain Riemannian metric [Hauberg, Schober, Liptrot, P.H., Feragen, MICCAI 2015]

[figure: density of a single GP geodesic under the random metric, projected into axis-aligned slices; background: expected metric trace; outline: slices where at least one expert annotated the CST; panels: Dijkstra in CST, GP in CST, Dijkstra in ILF, GP in ILF]

Shortest paths (geodesics) on a Riemannian manifold with metric M obey

    x″(t) = −½ M⁻¹(x(t)) [∂M⃗(x(t)) / ∂x(t)]ᵀ (x′(t) ⊗ x′(t)) = f(x(t), x′(t), t)

▸ what if M ∼ N(m, V) is uncertain (inferred from data)?
Uncertainty Across Composite Computations
interacting information requirements [Hennig, Osborne, Girolami, Proc. Royal Society A 2015]

[diagram: environment ↔ machine loop: prediction by analysis, action a by control, predicted state x_{t+δt}]

http://probabilistic-numerics.org
— Backup —
Monte Carlo is not all that Different
MC as a ‘cautious’ limit

[plot: integrand f(x) with nodes; error ∣F − F̂∣ against # samples]

    p(f) = GP(f; 0, k_λ^OU + c),        k_λ^OU(x, x′) = exp (−∣x − x′∣ / λ)

    λ → ∞  ⇒  k(x, x′) ≜ ∣x − x′∣  →  trapezoid rule
    λ → 0   ⇒  k(x, x′) → δ(x − x′)  →  averaging
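A hedged numerical illustration of the two limits (not from the slides): the Bayesian-quadrature weights under k_OU + c on [0, 1] tend towards uniform averaging as λ → 0 and towards the trapezoid weights as λ grows. The particular λ, c and nodes below are illustrative.

```python
# Hedged sketch (not from the slides): quadrature weights w = z K^{-1} under the
# Ornstein-Uhlenbeck kernel plus a constant, for F = ∫_0^1 f(x) dx.
import numpy as np

def bq_weights_ou(X, lam, c):
    X = np.asarray(X, float)
    K = np.exp(-np.abs(X[:, None] - X[None, :]) / lam) + c
    # z_i = ∫_0^1 k(x_i, t) dt for the OU kernel, plus the constant part
    z = lam * (2 - np.exp(-X / lam) - np.exp(-(1 - X) / lam)) + c
    return np.linalg.solve(K, z)

X = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
trap = np.zeros_like(X)
trap[1:] += np.diff(X) / 2
trap[:-1] += np.diff(X) / 2

print("lam -> 0   :", bq_weights_ou(X, lam=1e-3, c=100.0))   # roughly 1/N each (averaging)
print("lam large  :", bq_weights_ou(X, lam=100.0, c=100.0))  # roughly trapezoid weights
print("trapezoid  :", trap)
```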
What is a Random Number?
What is a sequence of random numbers?

▸ 662244111144334455553366666666
▸ 169399375105820974944592307816
▸ 712904263472610590208336044895
▸ 100011111101111111100101000001
▸ 01110000011100100110111101100011

http://www.stat.fsu.edu/pub/diehard/
How Expensive is a Computation?
Ex: Riemannian Statistics [Schober, Hauberg, Lawrence, Hennig; in prep.]

[figure]

Geodesics obey    x″(t) = −½ M⁻¹(x(t)) [∂M⃗(x(t)) / ∂x(t)]ᵀ (x′(t) ⊗ x′(t)) = f(x(t), x′(t), t)

▸ the Karcher mean µ of a data set {x_i}_{i=1,…,N} is the point minimizing ∑_{i=1}^{N} Distance(µ, x_i)
How Much Information is Needed?
Ex: Riemannian Statistics [Schober, Hauberg, Lawrence, Hennig; in prep.]

[figure]