GRADUATE STUDIES IN MATHEMATICS 199
Applied Stochastic Analysis
Weinan E
Tiejun Li
Eric Vanden-Eijnden
EDITORIAL COMMITTEE
Daniel S. Freed (Chair)
Bjorn Poonen
Gigliola Staffilani
Jeff A. Viaclovsky
2010 Mathematics Subject Classification. Primary 60-01, 60J22, 60H10, 60H35, 62P10,
62P35, 65C05, 65C10, 82-01.
Copying and reprinting. Individual readers of this publication, and nonprofit libraries
acting for them, are permitted to make fair use of the material, such as to copy select pages for
use in teaching or research. Permission is granted to quote brief passages from this publication in
reviews, provided the customary acknowledgment of the source is given.
Republication, systematic copying, or multiple reproduction of any material in this publication
is permitted only under license from the American Mathematical Society. Requests for permission
to reuse portions of AMS publication content are handled by the Copyright Clearance Center. For
more information, please visit www.ams.org/publications/pubpermissions.
Send requests for translation rights and licensed reprints to reprint-permission@ams.org.
© 2019 by the authors. All rights reserved.
Printed in the United States of America.
∞ The paper used in this book is acid-free and falls within the guidelines
established to ensure permanence and durability.
Visit the AMS home page at https://www.ams.org/
10 9 8 7 6 5 4 3 2 1 24 23 22 21 20 19
To our families
machine learning has also been added to the list. In addition to the summer
school, these courses have also been taught from time to time during regular
terms at New York University, Peking University, and Princeton University.
The lecture notes from these courses are the main origin of this textbook
series. The early participants, including some of the younger people who
were students or teaching assistants early on, have also become contributors
to this book series.
After many years, this series of courses has finally taken shape. Obvi-
ously, this has not come easy and is the joint effort of many people. I would
like to express my sincere gratitude to Tiejun Li, Pingbing Ming, Shi Jin,
Eric Vanden-Eijnden, Pingwen Zhang, and other collaborators involved for
their commitment and dedication to this effort. I am also very grateful to
Mrs. Yanyun Liu, Baomei Li, Tian Tian, Yuan Tian for their tireless efforts
to help run the summer school. Above all, I would like to pay tribute to
someone who dedicated his life to this cause, David Cai. David had a pas-
sion for applied mathematics and its applications to science. He believed
strongly that one has to find ways to inspire talented young people to go into
applied mathematics and to teach them applied mathematics in the right
way. For many years, he taught a course on introducing physics to applied
mathematicians in the summer school. To create a platform for practicing
the philosophy embodied in this project, he cofounded the Institute of Nat-
ural Sciences at Shanghai Jiaotong University, which has become one of the
most active centers for applied mathematics in China. His passing away last
year was an immense loss, not just for all of us involved in this project, but
also for the applied mathematics community as a whole.
This book is the first in this series, covering probability theory and sto-
chastic processes in a style that we believe is most suited for applied math-
ematicians. Other subjects covered in this series will include numerical
algorithms, the calculus of variations and differential equations, a mathe-
matical introduction to machine learning, and a mathematical introduction
to physics and physical modeling. The stochastic analysis and differential
equations courses summarize the most relevant aspects of pure mathemat-
ics, the algorithms course presents the most important technical tool of ap-
plied mathematics, and the learning and physical modeling courses provide
a bridge to the real world and to other scientific disciplines. The selection
of topics represents a possibly biased view about the true fundamentals of
applied mathematics. In particular, we emphasize three themes through-
out this textbook series: learning, modeling, and algorithms. Learning is
about data and intelligent decision making. Modeling is concerned with
physics-based models. The study of algorithms provides the practical tool
for building and interrogating the models, whether machine learning–based or physics-based.
Weinan E
Tiejun Li
Eric Vanden-Eijnden
December 2018
Notation
(5) Functions:
(a) ⌈·⌉: The ceiling function. ⌈x⌉ = m + 1 if x ∈ [m, m + 1) for m ∈ Z.
(b) ⌊·⌋: The floor function. ⌊x⌋ = m if x ∈ [m, m + 1) for m ∈ Z.
(c) |x|: The ℓ²-modulus of a vector x ∈ R^d: |x| = (Σ_{i=1}^d x_i²)^{1/2}.
(d) ‖f‖: Norm of function f in some function space.
(e) a ∨ b: Maximum of a and b: a ∨ b = max(a, b).
(f) a ∧ b: Minimum of a and b: a ∧ b = min(a, b).
(g) ⟨f⟩: The average of f with respect to a measure μ: ⟨f⟩ = ∫ f(x) μ(dx).
(h) (x, y): Inner product for x, y ∈ R^d: (x, y) = xᵀy.
(i) (f, g): Inner product for L²-functions f, g: (f, g) = ∫ f(x)g(x) dx.
(j) ⟨f, g⟩: Dual product for f ∈ B* and g ∈ B.
(k) A : B: Twice contraction for second-order tensors, i.e., A : B = Σ_{ij} a_ij b_ji.
(l) |S|: The cardinality of set S.
(m) |Δ|: The subdivision size when Δ is a subdivision of a domain.
(n) χ_A(x): Indicator function, i.e., χ_A(x) = 1 if x ∈ A and 0 otherwise.
(o) δ(x − a): Dirac's delta-function at x = a.
(p) Range(P), Null(P): The range and null space of operator P, i.e., Range(P) = {y | y = Px}, Null(P) = {x | Px = 0}.
(q) ⊥C: Perpendicular subspace, i.e., ⊥C = {x | x ∈ B*, ⟨x, y⟩ = 0, ∀y ∈ C}, where C ⊂ B.
(6) Symbols:
(a) c_ε ≍ d_ε: Logarithmic equivalence, i.e., lim_{ε→0} log c_ε / log d_ε = 1.
(b) c_ε ∼ d_ε or c_n ∼ d_n: Equivalence, i.e., lim_{ε→0} c_ε/d_ε = 1 or lim_{n→∞} c_n/d_n = 1.
(c) 𝒟x: Formal infinitesimal element in path space, 𝒟x = ∏_{0≤t≤T} dx_t.
Part 1
Fundamentals
Chapter 1
Random Variables
Prob{ |S_{2n}/(2n) − 1/2| > ε } → 0,

as n → ∞. This can be seen as follows. Noting that the distribution Prob{S_{2n} = k} is unimodal and symmetric around the state k = n, we have

Prob{ |S_{2n}/(2n) − 1/2| > ε } ≤ 2 · (1/2^{2n}) Σ_{k > n+2nε} (2n)!/(k!(2n−k)!)
  ≤ 2(n − ⌈2nε⌉) · (1/2^{2n}) · (2n)!/(⌈n + 2nε⌉! ⌊n − 2nε⌋!)
  ∼ √(2(1 − 2ε)/(π(1 + 2ε))) · √n / ((1 − 2ε)^{n(1−2ε)} (1 + 2ε)^{n(1+2ε)}) → 0

for ε sufficiently small and n ≫ 1, where ⌈·⌉ and ⌊·⌋ are the ceiling and floor functions, respectively, defined by ⌈x⌉ = m + 1 and ⌊x⌋ = m if x ∈ [m, m + 1) for m ∈ Z. This is the weak law of large numbers for this particular example.
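This weak law can also be observed numerically. Below is a minimal Monte Carlo sketch (using numpy; the sample sizes and the value of ε are arbitrary choices, not from the text) that estimates Prob{|S_{2n}/(2n) − 1/2| > ε} for a few values of n; the estimated probability shrinks rapidly as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05
trials = 20000

# Monte Carlo estimate of Prob{ |S_{2n}/(2n) - 1/2| > eps } for several n.
probs = {}
for n in [50, 200, 800]:
    flips = rng.integers(0, 2, size=(trials, 2 * n))  # 1 = heads, 0 = tails
    freq = flips.mean(axis=1)                          # S_{2n} / (2n)
    probs[n] = float(np.mean(np.abs(freq - 0.5) > eps))

print(probs)  # the large-deviation probability decreases as n grows
```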
In the example of a fair coin, the number of outcomes in an experiment
is finite. In contrast, the second class of examples involves a continuous set
of possible outcomes. Consider the orientation of a unit vector τ . Denote by
S2 the unit sphere in R3 . Define ρ(n), n ∈ S2 , as the orientation distribution
density; i.e., for A ⊂ S2 ,
Prob(τ ∈ A) = ∫_A ρ(n) dS,

and

P(A) = |A|/2^n,

where |A| is the cardinality of the set A.
Proposition 1.6 (Bayes's theorem). If A₁, A₂, . . . are disjoint sets such that ⋃_{j=1}^∞ A_j = Ω, then we have

(1.6) P(A_j | B) = P(A_j)P(B | A_j) / Σ_{n=1}^∞ P(A_n)P(B | A_n)  for any j ∈ N.
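Formula (1.6) is easy to evaluate directly. The sketch below uses a hypothetical three-set partition; the priors and likelihoods are made-up numbers for illustration only.

```python
# Hypothetical partition A_1, A_2, A_3 of Omega with priors P(A_j)
# and likelihoods P(B | A_j); the numbers are illustrative only.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.4, 0.8]

# Denominator of (1.6): P(B) = sum_n P(A_n) P(B | A_n).
evidence = sum(p * l for p, l in zip(prior, likelihood))

# Posterior P(A_j | B) from Bayes's theorem (1.6).
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]
print(posterior)
```

As a sanity check, the posteriors always sum to 1, since the A_j partition Ω.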
The Poisson distribution P(λ) may be viewed as the limit of the binomial distribution in the sense that

C_n^k p^k q^{n−k} → (λ^k/k!) e^{−λ}  as n → ∞, p → 0, np = λ.
The proof is simple and is left as an exercise.
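The convergence can also be observed numerically. The sketch below (standard library only; the choices of λ and k are arbitrary) compares the binomial probability with its Poisson limit as n grows with np = λ fixed.

```python
from math import comb, exp, factorial

lam, k = 3.0, 2
poisson = lam**k * exp(-lam) / factorial(k)   # lambda^k e^{-lambda} / k!

errs = []
for n in [10, 100, 1000]:
    p = lam / n
    binom = comb(n, k) * p**k * (1 - p)**(n - k)   # C_n^k p^k q^{n-k}
    errs.append(abs(binom - poisson))

print(errs)  # the discrepancy shrinks as n grows with np = lambda fixed
```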
1.6. Independence
We now come to one of the most distinctive notions in probability theory,
the notion of independence. Let us start by defining the independence of
events. Two events A, B ∈ F are independent if
P(A ∩ B) = P(A)P(B).
Definition 1.21. Two random variables X and Y are said to be indepen-
dent if for any two Borel sets A and B, X −1 (A) and Y −1 (B) are independent;
i.e.,
(1.30) P(X^{−1}(A) ∩ Y^{−1}(B)) = P(X^{−1}(A)) P(Y^{−1}(B)).
Consequently, we have

(1.32) μ = μ₁ ⊗ μ₂;

i.e., the joint distribution of two independent random variables is the product distribution. If both μ₁ and μ₂ are absolutely continuous, with densities p₁ and p₂, respectively, then μ is also absolutely continuous, with density given by the product p(x, y) = p₁(x)p₂(y). Note that pairwise independence does not imply independence (see Exercise 1.9).
To see that this definition reduces to the previous one for the case of discrete random variables, let Ω_j = {ω : Y(ω) = j} and

Ω = ⋃_{j=1}^n Ω_j.

The σ-algebra G is simply the set of all possible unions of the Ω_j's. The measurability condition of E(X|Y) with respect to G means E(X|Y) is constant on each Ω_j, which is exactly E(X|Y = j) as we will see. By definition, we have

(1.38) ∫_{Ω_j} E(X|Y) P(dω) = ∫_{Ω_j} X(ω) P(dω),

which implies

(1.39) E(X|Y) = (1/P(Ω_j)) ∫_{Ω_j} X(ω) P(dω)  on Ω_j.
Proof. We have

E(X − g(Y))² = E(X − E(X|Y))² + E(E(X|Y) − g(Y))² + 2E[(X − E(X|Y))(E(X|Y) − g(Y))]

and

E[(X − E(X|Y))(E(X|Y) − g(Y))] = E[ E( (X − E(X|Y))(E(X|Y) − g(Y)) | Y ) ]
  = E[(E(X|Y) − E(X|Y))(E(X|Y) − g(Y))] = 0,

which implies (1.40).
Remark 1.31. Note that the power p does not need to be greater than 1, since it still gives a metric for the random variables; but only when p ≥ 1 do we get a norm. The statements in this section hold when 0 < p < ∞.
Proof. The proof of the first statements is straightforward. For the second
statement, we have
|f (ξ1 ) − f (ξ2 )| = |E(eiξ1 X − eiξ2 X )| = |E(eiξ1 X (1 − ei(ξ2 −ξ1 )X ))|
≤ E|1 − ei(ξ2 −ξ1 )X |.
Since the integrand |1 − ei(ξ2 −ξ1 )X | depends only on the difference between
ξ1 and ξ2 and since it tends to 0 almost surely as the difference goes to
0, uniform continuity follows immediately from the dominated convergence
theorem.
Example 1.34. Here are the characteristic functions of some typical distributions:
(1) Bernoulli distribution: f(ξ) = q + pe^{iξ}.
(2) Binomial distribution B(n, p): f(ξ) = (q + pe^{iξ})^n.
(3) Poisson distribution P(λ): f(ξ) = e^{λ(e^{iξ} − 1)}.
(4) Exponential distribution E(λ): f(ξ) = (1 − λ^{−1}iξ)^{−1}.
(5) Normal distribution N(μ, σ²):

(1.47) f(ξ) = exp(iμξ − σ²ξ²/2).

(6) Multivariate normal distribution N(μ, Σ):

(1.48) f(ξ) = exp(iμᵀξ − ξᵀΣξ/2).
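Formula (1.47) can be checked against a Monte Carlo average of e^{iξX} (a sketch using numpy; the parameters, sample size, and test points are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=400000)   # samples of X ~ N(mu, sigma^2)

errs = []
for xi in [0.3, 0.7]:
    empirical = np.exp(1j * xi * x).mean()                 # E e^{i xi X}
    exact = np.exp(1j * mu * xi - 0.5 * sigma**2 * xi**2)  # formula (1.47)
    errs.append(abs(empirical - exact))

print(errs)  # both errors are at the Monte Carlo noise level
```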
Proof. We only prove the necessity part. The other part is less trivial, and readers may consult [Chu01]. Assume that f is a characteristic function. Then

Σ_{j,k=1}^n f(ξ_j − ξ_k) z_j z̄_k = E | Σ_{j=1}^n z_j e^{iξ_j X} |² ≥ 0.

For a discrete random variable with generating function G(x), the probabilities can be recovered by differentiation:

P(X = x_k) = (1/k!) (dᵏ/dxᵏ) G(x) |_{x=0}.
(1.52) c_k = Σ_{j=0}^k a_j b_{k−j},

with {c_k} = {a_k} ∗ {b_k}; the corresponding generating functions satisfy C(x) = A(x)B(x). The following result is more or less obvious. Let X and Y be independent integer-valued random variables with

P(X = j) = a_j,  P(Y = k) = b_k.
This gives

(1.55) M(t) = Σ_{n=0}^∞ μ_n tⁿ/n!,

which explains why M(t) is called the moment generating function.
Theorem 1.40. Denote by M_X(t), M_Y(t), and M_{X+Y}(t) the moment generating functions of the random variables X, Y, and X + Y, respectively. If X, Y are independent, then

(1.56) M_{X+Y}(t) = M_X(t) M_Y(t).
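The product structure behind (1.52) and Theorem 1.40 can be checked numerically. The sketch below (numpy; the two pmfs are made-up illustrative numbers) convolves two pmfs and verifies that the generating functions multiply:

```python
import numpy as np

# Hypothetical pmfs of independent integer-valued X and Y (illustrative numbers).
a = np.array([0.2, 0.5, 0.3])   # P(X = 0), P(X = 1), P(X = 2)
b = np.array([0.6, 0.4])        # P(Y = 0), P(Y = 1)

c = np.convolve(a, b)           # {c_k} = {a_k} * {b_k}: pmf of X + Y

# Check C(x) = A(x) B(x) at a few points.
gaps = []
for x in [0.3, 1.0, 1.7]:
    A = sum(ak * x**k for k, ak in enumerate(a))
    B = sum(bk * x**k for k, bk in enumerate(b))
    C = sum(ck * x**k for k, ck in enumerate(c))
    gaps.append(abs(C - A * B))

print(c, max(gaps))
```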
Lemma 1.41.
(1) If Σ_{n=1}^∞ P(A_n) < ∞, then P({A_n i.o.}) = 0.
(2) If Σ_{n=1}^∞ P(A_n) = ∞ and the A_n's are mutually independent, then P({A_n i.o.}) = 1.

Proof. (1) We have

P({A_n i.o.}) ≤ P(⋃_{k=n}^∞ A_k) ≤ Σ_{k=n}^∞ P(A_k)

for any n. The last term goes to 0 as n → ∞ since Σ_{k=1}^∞ P(A_k) < ∞ by assumption.
(2) Using independence, one has

P(⋃_{k=n}^∞ A_k) = 1 − P(⋂_{k=n}^∞ A_k^c) = 1 − ∏_{k=n}^∞ P(A_k^c) = 1 − ∏_{k=n}^∞ (1 − P(A_k)).
lim_{n→∞} X_n/n = 0  a.s.

Then, with A_n = {|X_n| > nε} for fixed ε > 0,

Σ_{n=1}^∞ P(A_n) = Σ_{n=1}^∞ P(|X_n| > nε) = Σ_{n=1}^∞ P(|X_1| > nε)
  = Σ_{n=1}^∞ Σ_{k=n}^∞ P(kε < |X_1| ≤ (k+1)ε)
  = Σ_{k=1}^∞ k P(kε < |X_1| ≤ (k+1)ε)
  = Σ_{k=1}^∞ k ∫_{kε<|X_1|≤(k+1)ε} P(dω)
  ≤ Σ_{k=1}^∞ (1/ε) ∫_{kε<|X_1|≤(k+1)ε} |X_1| P(dω)
  = (1/ε) ∫_{ε<|X_1|} |X_1| P(dω) ≤ (1/ε) E|X_1| < ∞.

Therefore if we define

B_ε = {ω ∈ Ω : ω ∈ A_n i.o.},

then P(B_ε) = 0. Let B = ⋃_{n=1}^∞ B_{1/n}. Then P(B) = 0, and

lim_{n→∞} X_n(ω)/n = 0  if ω ∉ B.

The proof is complete.
With the Borel-Cantelli lemma, we can give the proof of (ii) in Theorem
1.32.
Σ_{k=1}^{k₀} P(|X_{n_k}| ≥ ε) + Σ_{k=k₀+1}^∞ 1/2^k < ∞.

By the Borel-Cantelli lemma,

P(|X_{n_k}| ≥ ε, i.o.) = 0.

Thus the subsequence X_{n_k} converges to 0 almost surely by the same argument as the example above.
Exercises
In the following, i.i.d. stands for independent, identically distributed.
1.1. Let X = Σ_{j=1}^n X_j, where the X_j are i.i.d. random variables with Bernoulli distribution with parameter p. Prove that X ∼ B(n, p).
1.2. Let S_n be the number of heads in the outcome of n independent tosses of a fair coin. Its distribution is given by

p_j = P(S_n = j) = (1/2ⁿ) C_n^j.

Show that the mean and the variance of this random variable are

ES_n = n/2,  Var(S_n) = n/4.

This implies that Var(S_n)/(ES_n)² → 0 as n → ∞.
1.3. Prove that

C_n^k p^k q^{n−k} → (λ^k/k!) e^{−λ}  for any fixed k

as n → ∞, p → 0, and np = λ. This is the classical Poisson approximation to the binomial distribution.
1.4. Numerically investigate the limit processes
1.14. (Variance identity) Suppose the random variable X has the par-
tition X = (X (1) , X (2) ). Prove the variance identity for any inte-
grable function f :
H(X) = − Σ_{i=1}^n p_i log p_i.

Prove that H(X) ≥ 0 and that the equality holds only when X is reduced to a deterministic variable; i.e., p_i = δ_{ik₀} for some fixed integer k₀ in {1, 2, . . . , n}. Here we take the convention 0 log 0 = 0.
1.17. Let X = (X₁, X₂, . . . , X_n) be i.i.d. continuous random variables with common distribution function F and density ρ(x) = F′(x). (X_{(1)}, X_{(2)}, . . . , X_{(n)}) are called the order statistics of X if X_{(k)} is the kth smallest of these random variables. Prove that the density of the order statistics is given by

p(x₁, x₂, . . . , xₙ) = n! ∏_{k=1}^n ρ(x_k),  x₁ < x₂ < · · · < xₙ.
a(X + Y − b)
where the notation means summing the products over all possible partitions of X₁, . . . , X_k into pairs; e.g., if (X, Y, Z) is jointly Gaussian, we have

(1.59) E(X²Y²Z²) = (EX²)(EY²)(EZ²) + 2(EYZ)²EX² + 2(EXY)²EZ² + 2(EXZ)²EY² + 8(EXY)(EYZ)(EXZ).

Each term in (1.59) can be schematically mapped to some graph.
Notes
The axiomatic approach to probability theory starts with Kolmogorov’s mas-
terpiece [Kol56]. There are already lots of excellent textbooks on the in-
troduction to elementary probability theory [Chu01, Cin11, Dur10, KS07,
Shi96]. To know more about the measure theory approach to probability
theory, please consult [Bil79]. More on the practical side of probability
theory can be found in the classic book by Feller [Fel68] or the books by
some applied mathematicians and scientists [CH13, CT06, Gar09, VK04].
Chapter 2
Limit Theorems
S_n/n → η  in probability.
P(|S_n/n| > ε) ≤ (1/ε²) E|S_n/n|²

for any ε > 0. Using independence, we have

E|S_n|² = Σ_{i,j=1}^n E(X_i X_j) = n E|X₁|².

Hence

P(|S_n/n| > ε) ≤ (1/(nε²)) E|X₁|² → 0

as n → ∞.
Theorem 2.2 (Strong law of large numbers (SLLN)). Let {X_j}_{j=1}^∞ be a sequence of i.i.d. random variables such that E|X_j| < ∞. Then

S_n/n → η  a.s.
Proof. We will only give a proof of the SLLN under the stronger assumption
that E|Xj |4 < ∞. The proof under the stated assumption can be found in
[Chu01].
Without loss of generality, we can assume η = 0. Using the Chebyshev inequality, we have

P(|S_n/n| > ε) ≤ (1/ε⁴) E|S_n/n|⁴.

Using independence, we get

E|S_n|⁴ = Σ_{i,j,k,l=1}^n E(X_i X_j X_k X_l) = n E|X_j|⁴ + 3n(n − 1)(E|X_j|²)².

Hence

P(|S_n/n| > ε) ≤ C/(n²ε⁴)

for some constant C depending only on the moments of X_j. Now we are in a situation to use the Borel-Cantelli lemma, since the series Σ 1/n² is summable. This gives us

P(|S_n/n| > ε, i.o.) = 0.

Hence S_n/n → 0 a.s.
The weak and strong laws of large numbers may not hold if the moment
condition is not satisfied.
f(ξ/√n) = f(0) + (ξ/√n) f′(0) + (ξ²/(2n)) f″(0) + o(1/n) = 1 − ξ²/(2n) + o(1/n).

Hence

(2.2) g_n(ξ) = (f(ξ/√n))ⁿ = (1 − ξ²/(2n) + o(1/n))ⁿ → e^{−ξ²/2}  as n → ∞

for every ξ ∈ R. Using Theorem 1.35, we obtain the stated result.
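The conclusion can be observed numerically: standardized sums of i.i.d. U[0,1] variables (η = 1/2, σ² = 1/12) behave like N(0,1) samples. A sketch with numpy (the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 200, 20000

x = rng.random((trials, n))                          # i.i.d. U[0, 1]
z = (x.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)    # sqrt(n)(S_n/n - eta)/sigma

# Mean, standard deviation, and central mass should match N(0, 1).
stats = (float(z.mean()), float(z.std()), float(np.mean(np.abs(z) <= 1.96)))
print(stats)
```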
The CLT gives an estimate for the rate of convergence in the law of large numbers. Since by the CLT we have

S_n/n − η ≈ (σ/√n) N(0, 1),

the rate of convergence of S_n/n to η is O(n^{−1/2}). This is the reason why most Monte Carlo methods converge at the rate O(n^{−1/2}), where n is the sample size.
Remark 2.6 (Stable laws). Theorem 2.4 requires that the variance of Xj be
finite. For random variables with unbounded variances, one can show that
the following generalized central limit theorem holds: If there exist {an } and
{bn } such that
P(an (Sn − bn ) ≤ x) → G(x) as n → ∞,
then the distribution G(x) must be a stable law. A probability distribution is
said to be a stable law if any linear combination of two independent random
variables with that distribution has again the same distribution, up to a
shift and rescaling. One simple instance of the stable law is the Cauchy
distribution considered in Example 2.3, for which we can take an = 1/n and
bn = 0 and the limit G(x) is exactly the distribution function of the Cauchy-
Lorentz distribution (the limit process is in fact an identity in this case).
For more details about the generalized central limit theorem and stable law
see [Bre92, Dur10].
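The Cauchy case is easy to observe numerically: the empirical average of Cauchy samples has the same spread no matter how many samples are averaged, in sharp contrast to the finite-variance case. A numpy sketch (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 10000

# For standard Cauchy X_j, S_n/n is again standard Cauchy for every n,
# so the median of |S_n/n| stays near 1 (the Cauchy quartile) as n grows.
meds = {}
for n in [10, 400]:
    avg = rng.standard_cauchy((trials, n)).mean(axis=1)
    meds[n] = float(np.median(np.abs(avg)))

print(meds)  # both medians are close to 1, independent of n
```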
Here the symbol ≍ means logarithmic equivalence; i.e., lim_{ε→0} log c_ε / log d_ε = 1 if c_ε ≍ d_ε.
This is the large deviation theorem (LDT) for i.i.d. random variables. It
states roughly that the probability of large deviation events converges to 0
exponentially fast. I(x) is called the rate function.
To study the properties of the rate function, we need some results on
the Legendre-Fenchel transform.
Lemma 2.8. Assume that f : Rd → R̄ = R ∪ {∞} is a lower semicontinu-
ous convex function. Let F be its conjugate function (the Legendre-Fenchel
transform):

F(y) = sup_x {(x, y) − f(x)}.
Then:
(i) F is also a lower semicontinuous convex function.
(ii) The Fenchel inequality holds:
(x, y) ≤ f (x) + F (y).
(iii) The conjugacy relation holds:
f(x) = sup_y {(x, y) − F(y)}.
Proof. (i) The convexity of Λ(λ) follows from Hölder's inequality. For any 0 ≤ θ ≤ 1,

Λ(θλ₁ + (1 − θ)λ₂) = log E[exp(θλ₁X_j) exp((1 − θ)λ₂X_j)]
  ≤ log[(E exp(λ₁X_j))^θ (E exp(λ₂X_j))^{1−θ}]
  = θΛ(λ₁) + (1 − θ)Λ(λ₂).

Thus Λ(λ) is a convex function. The rest is a direct application of Lemma 2.8.
= exp(−I(x)) < ∞.

Thus μ((x, ∞)) = 0 and

exp(−I(x)) = lim_{n→∞} ∫_x^∞ exp(λ_n(y − x)) μ(dy) = μ({x}).

Since

μ_n(G) ≥ μ_n({x}) ≥ (μ({x}))ⁿ = exp(−nI(x)),

we get

(1/n) log μ_n(G) ≥ −I(x).

A similar argument can be applied to the case when x < 0.
Case 2. Assume that the supremum in the definition of I(x) is attained at λ₀:

I(x) = λ₀x − Λ(λ₀).

Then x = Λ′(λ₀) = M′(λ₀)/M(λ₀). Define a new probability measure:

μ̃(dy) = (1/M(λ₀)) exp(λ₀y) μ(dy).

Its expectation is given by

∫_R y μ̃(dy) = (1/M(λ₀)) ∫_R y exp(λ₀y) μ(dy) = M′(λ₀)/M(λ₀) = x.
Thus

lim_{n→∞} (1/n) log μ_n(G) ≥ −λ₀(x + δ) + Λ(λ₀) = −I(x) − λ₀δ  for all 0 < δ ≪ 1.
A similar argument can be applied to the case when x < 0.
Example 2.10 (Cramér's theorem applied to the Bernoulli distribution with parameter p (0 < p < 1)). We have Λ(λ) = log(pe^λ + q), where q = 1 − p. The rate function is

(2.9) I(x) = x log(x/p) + (1 − x) log((1 − x)/q) for x ∈ [0, 1], and I(x) = ∞ otherwise.
It is obvious that I(x) ≥ 0, and I(x) achieves its global minimum 0 at
x∗ = p. The function I is a special case of the so-called relative entropy,
or Kullback-Leibler distance, defined between two (discrete) distributions μ
and ν:
(2.10) D(μ‖ν) = Σ_{i=1}^r μ_i log(μ_i/ν_i),
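For the Bernoulli case, the rate function (2.9) can be compared with a direct evaluation of the binomial tail: −(1/n) log P(S_n/n ≥ x) approaches I(x) as n grows. A sketch using only the standard library (the choices of p and x are arbitrary):

```python
from math import comb, log, ceil

p, x = 0.5, 0.7
q = 1 - p
I = x * log(x / p) + (1 - x) * log((1 - x) / q)   # rate function (2.9)

ests = []
for n in [50, 200, 800]:
    k0 = ceil(n * x)
    tail = sum(comb(n, k) * p**k * q**(n - k) for k in range(k0, n + 1))
    ests.append(-log(tail) / n)   # -(1/n) log P(S_n/n >= x)

print(I, ests)  # the estimates decrease toward I(x)
```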
Here Sn (E) is the Boltzmann entropy for the system with n spins. Denote
by S(E) the limit of Sn (E)/n as n → ∞. We get
(2.12) kB I(E) = kB log 2 − S(E).
So we obtain that the rate function is the negative entropy (with factor 1/kB )
up to an additive constant in this simplest setup. In fact this is a general
statement.
In the canonical ensemble in statistical mechanics (the number of spins
n and the temperature T are fixed in this setup), let us investigate the
physical meaning of Λ. The logarithmic moment generating function of
Hn (ω) = nhn (ω) with normalization 1/n is
Λ(λ) = lim_{n→∞} (1/n) log E e^{λH_n},
where we take Hn instead of nhn since it admits more general interpretation.
Take λ = −β = −(k_B T)^{−1}. We have

Λ(−β) = lim_{n→∞} (1/n) log Σ_ω e^{−βH_n(ω)} − log 2.
Define the partition function

Z_n(β) = Σ_ω e^{−βH_n(ω)}

and the Helmholtz free energy

F_n(β) = −β^{−1} log Z_n(β).

We have

Λ(−β) = −β lim_{n→∞} (1/n) F_n(β) − log 2 = −βF(β) − log 2.
Thus the free energy F (β) is the negative logarithmic moment generating
function up to a constant.
According to the large deviation theory we have

−βF(β) − log 2 = sup_E {−βE − log 2 + k_B^{−1} S(E)};

i.e.,

F(β) = inf_E {E − T S(E)}.
The infimum is achieved at the critical point E* such that

∂S(E)/∂E |_{E=E*} = 1/T,
which is exactly a thermodynamic relation between S and T. Here E* is called the internal energy U. Readers may refer to [Cha87, Ell85] for more details.
as n → ∞, or equivalently
P(n(Mn − 1) ≤ x) → e−|x| for x ≤ 0.
This implies that 1 − Mn = O (1/n) as n → ∞.
Exercises
2.1. Let X_j be i.i.d. U[0, 1] random variables. Prove that the limits

lim_{n→∞} n/(X₁^{−1} + · · · + X_n^{−1}),  lim_{n→∞} (X₁X₂ · · · X_n)^{1/n},  lim_{n→∞} (X₁² + · · · + X_n²)/n

exist almost surely, and identify them.
Prove that under the condition EX_j = μ < ∞, we have the service rate of the queue

lim_{t→∞} N_t/t = 1/μ  a.s.
2.3. The fact that the “central limit” of i.i.d. random variables must
be Gaussian can be understood from the following viewpoint. Let
X1 , X2 , . . . be i.i.d. random variables with mean 0. Let
(2.15) Z_n = (X₁ + · · · + X_n)/√n →ᵈ X  and  Z_{2n} = (X₁ + · · · + X_{2n})/√(2n) →ᵈ X.

Let the characteristic function of X be f(ξ).
(a) Prove that f(ξ) = f²(ξ/√2).
(b) Prove that f(ξ) is the characteristic function of a Gaussian random variable if f ∈ C²(R).
(c) Investigate the situation when the scaling 1/√n in (2.15) is replaced with 1/n. Prove that X corresponds to the Cauchy-Lorentz distribution under the symmetry condition f(ξ) = f(−ξ), or f(ξ) ≡ 1.
(d) If the scaling 1/√n is replaced by 1/n^α, what can we infer about the characteristic function f(ξ) if we assume f(ξ) = f(−ξ)? What is the correct range of α?
2.4. Prove the assertion in Example 2.3.
2.5. (Single-sided Laplace lemma) Suppose that h(x) attains its only maximum at x = 0, h ∈ C¹([0, +∞)), h′(0) < 0, and h(x) < h(0) for x > 0. Assume h(x) → −∞ as x → ∞ and ∫₀^∞ e^{h(x)} dx converges.
T_N = { D ∈ S^N : | |B_j|/N − p(j) | < δ(ε),  j = 1, . . . , r },

where |B_j| is the number of elements in set B_j. Prove that for any fixed δ > 0,

lim_{N→∞} P(T_N) = 1.

(c) For any ε > 0, there exists δ(ε) > 0 such that for sufficiently large N

e^{N(H−ε)} ≤ |T_N| ≤ e^{N(H+ε)},

and for any D ∈ T_N, we have

e^{−N(H+ε)} ≤ P(D) ≤ e^{−N(H−ε)}.

This property is called the asymptotic equipartition principle for the elements in a typical set.
2.8. Let {X_j} be i.i.d. random variables. Suppose that the variance of X_j is finite. Then by the CLT the distribution G(x) in Remark 2.6 is Gaussian with b_n = nEX₁ and a_n = (n Var(X₁))^{−1/2}. Show that one also recovers the WLLN for a different choice of {a_n} and {b_n}. What is the stable distribution G(x) in this case?
2.9. For the statistics of extrema, verify that

a_n = √(2 log n),  b_n = √(2 log n) − (log log n + log 4π)/(2√(2 log n))

for the normal distribution N(0, 1).
Notes
Rigorous mathematical work on the law of large numbers for general random
variables starts from that of Chebyshev in the 1860s. The strong law of
large numbers stated in this chapter is due to Kolmogorov [Shi92]. The
fact that the Gaussian distribution and stable laws are the scaling limit of
empirical averages of i.i.d. random variables can also be understood using
renormalization group theory [ZJ07]. A brief introduction to the LDT can
be found in [Var84].
Chapter 3
Markov Chains
Given X_n = i, we have

P(X_{n+1} = i ± 1 | X_n = i) = P(ξ_{n+1} = ±1) = 1/2
and P(Xn+1 = anything else|Xn = i) = 0. We see that, knowing Xn , the
distribution of Xn+1 is completely determined. In other words,
(3.1) P(Xn+1 = in+1 |{Xm = im }nm=0 ) = P(Xn+1 = in+1 |Xn = in );
i.e., the probability of Xn+1 conditional on the whole past history of the se-
quence, {Xm }nm=0 , is equal to the probability of Xn+1 conditional on the lat-
est value alone, Xn . The sequence (or process) {Xn }n∈N is called a Markov
process. In contrast, consider instead

Y_n = Σ_{j=1}^n (1/2)(ξ_j + 1)ξ_{j+1}.

{Y_n}_{n∈N} is also a kind of random walk on Z, but with a bias: the walk stops if the last move was to the left, and it remains there until some subsequent ξ_j = +1. It is easy to see that

P(Y_{n+1} = i_{n+1} | {Y_m = i_m}_{m=0}^n) ≠ P(Y_{n+1} = i_{n+1} | Y_n = i_n),

and the sequence {Y_n}_{n∈N} is not a Markov process.
Example 3.2 (Ehrenfest's diffusion model). Consider a container separated by a permeable membrane in the middle and filled with a total of K particles. At each time n = 1, 2, . . ., pick one particle at random among the K particles and move it to the other side of the container. Let X_n be the number of particles in the left part of the container. Then we have

P(X_{n+1} = i_{n+1} | {X_m = i_m}_{m=0}^n) = P(X_{n+1} = i_{n+1} | X_n = i_n)

for i_k ∈ {0, 1, . . . , K}, and this is also a Markov process.
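Ehrenfest's chain is a classical example whose invariant distribution is known to be binomial B(K, 1/2), and a long simulation reproduces it. A sketch (the values of K and the step count are arbitrary choices):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)
K, steps = 10, 200000
x = K // 2
counts = np.zeros(K + 1)

for _ in range(steps):
    # Pick one of the K particles at random; it switches sides, so the
    # left count decreases with probability x/K and increases otherwise.
    if rng.random() < x / K:
        x -= 1
    else:
        x += 1
    counts[x] += 1

empirical = counts / steps
binomial = np.array([comb(K, j) / 2**K for j in range(K + 1)])
print(float(np.abs(empirical - binomial).max()))  # small deviation
```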
Figure 3.1. Graphical representation of two Markov chains in (3.4) and (3.5).
The following questions are of special interest for a given Markov chain.
(i) Existence: Is there an invariant distribution?
This is equivalent to asking whether there exists a nonnegative
left eigenvector of P with eigenvalue equal to 1. Note that 1 is
always an eigenvalue of P since it always has the right eigenvector
1 = (1, . . . , 1)T by the second property in (3.2).
(ii) Uniqueness: When is the invariant distribution unique?
The answer to these questions is closely related to the spectral structure of the matrix P. We begin with the following preliminary result.
Lemma 3.5. The spectral radius of P is equal to 1:

ρ(P) = max_λ |λ| = 1,

where the maximum is taken over all eigenvalues λ of P. Indeed, if u is a left eigenvector with eigenvalue λ, i.e., Σ_j u_j p_ji = λu_i, then

|λ| Σ_{i∈S} |u_i| = Σ_{i∈S} | Σ_{j∈S} u_j p_ji | ≤ Σ_{i,j∈S} |u_j| p_ji = Σ_{j∈S} |u_j|.

Hence |λ| ≤ 1.
Definition 3.6 (Irreducibility). If there exists a permutation matrix R such that

(3.6) RᵀPR = ( A₁  B
              0   A₂ ),

then P is called reducible. Otherwise P is called irreducible.
Theorem 3.11. Assume that the Markov chain is primitive. Then for any
initial distribution μ0
μn = μ0 P n → π exponentially fast as n → ∞,
where π is the unique invariant distribution.
Proof. Given two distributions, μ₀ and μ̃₀, we define the total variation distance by

d(μ₀, μ̃₀) = (1/2) Σ_{i∈S} |μ_{0,i} − μ̃_{0,i}|.

Since

0 = Σ_{i∈S} (μ_{0,i} − μ̃_{0,i}) = Σ_{i∈S} (μ_{0,i} − μ̃_{0,i})₊ − Σ_{i∈S} (μ_{0,i} − μ̃_{0,i})₋,

we have

d(μ_s, μ̃_s) = Σ_{i∈B₊} Σ_{j∈S} (μ_{0,j} − μ̃_{0,j})(Pˢ)_{ji} ≤ Σ_{j∈S} (μ_{0,j} − μ̃_{0,j})₊ Σ_{i∈B₊} (Pˢ)_{ji},
where B₊ is the subset of indices i where Σ_{j∈S} (μ_{0,j} − μ̃_{0,j})(Pˢ)_{ji} > 0. We note that B₊ cannot contain all the elements of S; otherwise one must have (μ₀Pˢ)_i > (μ̃₀Pˢ)_i for all i and

Σ_{i∈S} (μ₀Pˢ)_i > Σ_{i∈S} (μ̃₀Pˢ)_i,

which is impossible since both sides sum to 1. Therefore at least one element is missing in B₊. By assumption, there exist an s > 0 and α ∈ (0, 1) such that (Pˢ)_{ij} ≥ α for all pairs (i, j). Hence Σ_{i∈B₊} (Pˢ)_{ji} ≤ 1 − α < 1.
Therefore
d(μs , μ̃s ) ≤ d(μ0 , μ̃0 )(1 − α);
i.e., the Markov chain is contractive after every s steps. Similarly for any
m≥0
d(μ_n, μ_{n+m}) ≤ d(μ_{n−sk}, μ_{n+m−sk})(1 − α)ᵏ ≤ (1 − α)ᵏ,

where k is the largest integer such that n − sk ≥ 0. If n is sufficiently large, the right-hand side can be made arbitrarily small. Therefore the sequence {μ_n}_{n=0}^∞ is a Cauchy sequence. Hence it has to converge to a limit π, which satisfies

π = lim_{n→∞} μ₀P^{n+1} = lim_{n→∞} (μ₀Pⁿ)P = πP.

The probability distribution satisfying such a property is also unique, for if there were two such distributions, π⁽¹⁾ and π⁽²⁾, then

d(π⁽¹⁾, π⁽²⁾) = d(π⁽¹⁾Pˢ, π⁽²⁾Pˢ) < d(π⁽¹⁾, π⁽²⁾).

This implies d(π⁽¹⁾, π⁽²⁾) = 0, i.e., π⁽¹⁾ = π⁽²⁾.
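The contraction argument can be watched in action: iterating μ_{n+1} = μ_n P for a primitive P converges to the unique invariant π regardless of μ₀. A small numpy sketch (the 3-state matrix is an arbitrary example):

```python
import numpy as np

# An arbitrary primitive (all-entries-positive) stochastic matrix.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])

def iterate(mu, n=200):
    for _ in range(n):
        mu = mu @ P          # mu_{n+1} = mu_n P
    return mu

pi1 = iterate(np.array([1.0, 0.0, 0.0]))
pi2 = iterate(np.array([0.0, 0.0, 1.0]))
print(pi1, float(np.abs(pi1 - pi2).max()))  # same limit from different mu_0
```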
Poisson processes are often used to model the number of telephone calls
received at a phone booth, or the number of customers waiting in a queue,
etc.
Theorem 3.14. For any fixed time t, the distribution of Nt is Poisson with
parameter λt.
3.5. Q-processes
Now let us turn to a type of continuous time Markov chains called Q-
processes. We assume that the trajectory {Xt }t∈R+ is right-continuous and
the state space S is finite, i.e., S = {1, 2, . . . , I}. Let the transition proba-
bility
pij (t) = P(Xt+s = j|Xs = i).
Here we have assumed that the Markov chain is stationary; i.e., the right-
hand side is independent of s. By definition we have

p_ij(t) ≥ 0,  Σ_{j∈S} p_ij(t) = 1.
Q is called the generator of the Markov chain. For simplicity, we also define
qi = −qii ≥ 0.
Since

(3.17) (P(t + h) − P(t))/h = P(t) (P(h) − I)/h,

letting h → 0+ we get the well-known forward equation for P(t):

(3.18) dP(t)/dt = P(t)Q.
Since P(h) and P(t) commute, we also get from (3.17) the backward equation

(3.19) dP(t)/dt = QP(t).
Both the solutions of the forward and the backward equations are given by
P (t) = eQt P (0) = eQt
since P (0) = I.
Next we discuss how the distribution of the Markov chain evolves in time. Let μ(t) be the distribution of X_t. Then

μ_j(t + dt) = Σ_{i≠j} μ_i(t) p_ij(dt) + μ_j(t) p_jj(dt)
  = Σ_{i≠j} μ_i(t) q_ij dt + μ_j(t)(1 + q_jj dt) + o(dt).
Q̃ is called the jump matrix, and the corresponding Markov chain is called
the embedded chain or jump chain of the original Q-process.
Define the jump times of {X_t}_{t≥0} by

J₀ = 0,  J_{n+1} = inf{t : t ≥ J_n, X_t ≠ X_{J_n}},  n ∈ N,

where we take the convention inf ∅ = ∞. Define the holding times by

(3.25) H_n = J_n − J_{n−1} if J_{n−1} < ∞, and H_n = ∞ otherwise,
for n = 1, 2, . . .. We let X∞ = XJn if Jn+1 = ∞. The jump chain induced
by Xt is defined to be
Yn = XJn , n ∈ N.
The following result provides a connection between the jump chain and
the original Q-process.
Figure 3.2. Schematics of the jump time and holding time of a Q-process.
Theorem 3.15. The jump chain {Yn } is a Markov chain with Q̃ as the
transition probability matrix, and the holding times H1 , H2 , . . . are indepen-
dent exponential random variables with parameters qY0 , qY1 , . . ., respectively.
The proof relies on the strong Markov property of the Q-process. See
Section E of the appendix and [Nor97].
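Theorem 3.15 is also the standard recipe for simulating a Q-process: draw an Exp(q_i) holding time, then jump according to Q̃. A sketch (the 3-state generator is an arbitrary example; numpy assumed), with the occupation frequencies compared against the invariant distribution solving πQ = 0:

```python
import numpy as np

rng = np.random.default_rng(5)

# An arbitrary generator on S = {0, 1, 2}: off-diagonal >= 0, rows sum to 0.
Q = np.array([[-1.0,  0.7,  0.3],
              [ 0.4, -0.9,  0.5],
              [ 0.2,  0.8, -1.0]])

def simulate(T, x0=0):
    """Simulate X_t on [0, T] via the jump chain: holding time ~ Exp(q_i),
    then a jump drawn from the jump matrix row Q[i, :]/q_i (Theorem 3.15)."""
    t, x = 0.0, x0
    occupation = np.zeros(len(Q))
    while t < T:
        qi = -Q[x, x]
        h = rng.exponential(1.0 / qi)            # holding time H_n
        occupation[x] += min(h, T - t)
        t += h
        if t < T:
            row = Q[x].copy()
            row[x] = 0.0
            x = rng.choice(len(Q), p=row / qi)   # embedded jump chain step
    return occupation / T

freq = simulate(20000.0)

# Invariant distribution: solve pi Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]
print(freq, pi)  # time-averaged occupation matches pi (ergodic theorem)
```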
A state j is said to be accessible from i if there exists t > 0 such that
pij (t) > 0. Similarly to the discrete time case, one can define communication
and irreducibility.
Theorem 3.16. The Q-process X is irreducible if and only if the embedded
chain Y is irreducible.
Proof. From irreducibility and Theorem 3.16, we know that p_ij(t) > 0 for any i ≠ j and t > 0. It is straightforward to see that
pii (t) ≥ Pi (J1 > t) = e−qi t > 0
for any i ∈ S and t > 0, where Pi is the probability distribution conditioned
on X₀ = i. From this, we conclude that there exist t₀ > 0 and α ∈ (0, 1) such that p_ij(t₀) ≥ α for any pair (i, j). By following a similar induction
procedure as in the proof of Theorem 3.11, we obtain, for any two initial
distributions μ(0), μ̃(0),
d(μ(t), μ̃(t)) ≤ d (μ(t − kt0 ), μ̃(t − kt0 )) (1 − α)k , t ≥ kt0 , k ∈ N.
This implies the existence and uniqueness of the invariant distribution π for
the Q process. Furthermore, we have
d(μ(t), π) ≤ d(μ(t − kt0 ), π)(1 − α)k , t ≥ kt0 , k ∈ N.
Letting t, k → ∞, we get the desired result.
Theorem 3.18 (Ergodic theorem). Assume that Q is irreducible. Then for any bounded function f we have

(1/T) ∫₀ᵀ f(X_s) ds → ⟨f⟩_π  a.s.

as T → ∞, where π is the unique invariant distribution.
Note that

P(X̂₀ = i₀, X̂₁ = i₁, . . . , X̂_N = i_N) = P(X_N = i₀, X_{N−1} = i₁, . . . , X₀ = i_N)
  = π_{i_N} p_{i_N i_{N−1}} · · · p_{i₁ i₀} = π_{i₀} p̂_{i₀ i₁} · · · p̂_{i_{N−1} i_N}.
In this case, we have p̂ij = pij or q̂ij = qij . We call the chain reversible. A
reversible chain can be equipped with a variational structure and has nice
spectral properties. In the discrete time setup, define the Laplacian matrix
L=P −I
and correspondingly its action on any function f
(Lf )(i) = pij (f (j) − f (i)).
j∈S
Let L²_π be the space of square summable functions f endowed with the π-weighted scalar product

(3.29) (f, g)_π = Σ_{i∈S} π_i f(i) g(i).
One can show that D(f ) = (f, −Lf )π . Similar constructions can be done
for Q-processes and more general stochastic processes. These formulations
are particularly useful in potential theory for Markov chains [Soa94].
Figure: a hidden Markov model with hidden states X₁, X₂, . . . , X_T and observations Y₁, Y₂, . . . , Y_T at times t = 1, 2, . . . , T.
In this setup, both P and R are time independent. The HMM is fully
specified by the parameter set:
θ = (P , R, μ).
The distribution P(Y |θ) can be obtained via the following forward algo-
rithm.
Algorithm 1 (Forward procedure). Finding the posterior probability under
each model.
(i) Initialization. Compute α_1(i) = μ_i r_{i y_1} for i ∈ S.
(ii) Induction. Compute {α_n(i)} via (3.36).
(iii) Termination. Compute the final a posteriori probability P(Y|θ) = ∑_{i∈S} α_N(i).
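As an illustration, here is a minimal forward procedure for a made-up two-state HMM. The induction formula (3.36) is not reproduced in this excerpt, so we use the standard recursion α_{n+1}(j) = (∑_i α_n(i) p_ij) r_{j y_{n+1}}, checked against brute-force summation over all hidden paths:

```python
import itertools

P  = [[0.7, 0.3], [0.4, 0.6]]        # transition matrix p_ij (toy example)
R  = [[0.9, 0.1], [0.2, 0.8]]        # emission probabilities r_ik = P(Y = k | X = i)
mu = [0.5, 0.5]                      # initial distribution
y  = [0, 1, 1]                       # observed sequence y_1..y_N

def forward(P, R, mu, y):
    S = range(len(mu))
    alpha = [mu[i] * R[i][y[0]] for i in S]          # (i) initialization
    for obs in y[1:]:                                # (ii) induction
        alpha = [sum(alpha[i] * P[i][j] for i in S) * R[j][obs] for j in S]
    return sum(alpha)                                # (iii) termination

def brute(P, R, mu, y):
    # sum P(X, Y | theta) over every hidden path in S^N
    total = 0.0
    for x in itertools.product(range(len(mu)), repeat=len(y)):
        p = mu[x[0]] * R[x[0]][y[0]]
        for n in range(1, len(y)):
            p *= P[x[n - 1]][x[n]] * R[x[n]][y[n]]
        total += p
    return total

assert abs(forward(P, R, mu, y) - brute(P, R, mu, y)) < 1e-12
```

The forward procedure costs O(N |S|²) operations, versus the O(|S|^N) cost of the naive enumeration it replaces.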
Optimal Prediction. From the joint distribution P(X, Y|θ), we have the
conditional probability
P(X|Y, θ) = P(X, Y|θ)/P(Y|θ) ∝ P(X, Y|θ)
since the parameter θ and the observation Y are already known. Thus
given θ and Y , the optimal state sequence X can be defined as the most
probable path that maximizes the probability P(X, Y |θ) in the path space
S N . Again, naive enumeration and maximization are not feasible. The
Viterbi algorithm is designed for this task.
The Viterbi algorithm also uses ideas from dynamic programming to
decompose the optimization problem into many subproblems. First, let us
define the partial probability δ_n(i):
δ_n(i) = max_{x_{1:n−1}} P(X_{1:n−1} = x_{1:n−1}, X_n = i, Y_{1:n} = y_{1:n} | θ).
To retrieve the state sequence, we need to keep track of the argument that
maximizes δn (j) for each n and j. This can be done by recording the maxi-
mizer in step n in (3.37) in an array, denoted by ψn . The complete Viterbi
algorithm is as follows.
Algorithm 2 (Viterbi algorithm). Finding the optimal state sequence X̂ =
(X̂_{1:N}) given θ and Y.
(i) Initialization. Compute δ_1(i) = μ_i r_{i y_1} for i ∈ S.
(ii) Induction. Compute
δ_n(j) = max_{i∈S} {δ_{n−1}(i) p_ij} r_{j y_n}, n = 2, . . . , N; j ∈ S,
ψ_n(j) = arg max_{i∈S} {δ_n(i) p_ij}, n = 1, . . . , N − 1; j ∈ S.
(iii) Termination. Obtain the maximum probability and record the
maximizer in the final step
P* = max_{i∈S} {δ_N(i)}, X̂_N = arg max_{i∈S} {δ_N(i)}.
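A compact sketch of the Viterbi recursion on a made-up two-state model (the parameters are ours, for illustration), checked against brute-force maximization of P(X, Y|θ) over the path space S^N:

```python
import itertools

P  = [[0.7, 0.3], [0.4, 0.6]]     # transition matrix p_ij (toy example)
R  = [[0.9, 0.1], [0.2, 0.8]]     # emission probabilities r_{i,y}
mu = [0.5, 0.5]                   # initial distribution
y  = [0, 1, 1, 0]                 # observations

def viterbi(P, R, mu, y):
    S, N = range(len(mu)), len(y)
    delta = [mu[i] * R[i][y[0]] for i in S]          # delta_1(i)
    psi = []                                          # back-pointer arrays psi_n
    for n in range(1, N):
        new, back = [], []
        for j in S:
            i_star = max(S, key=lambda i: delta[i] * P[i][j])
            back.append(i_star)
            new.append(delta[i_star] * P[i_star][j] * R[j][y[n]])
        delta, psi = new, psi + [back]
    # termination: most probable final state, then backtrack through psi
    x = [max(S, key=lambda i: delta[i])]
    for back in reversed(psi):
        x.append(back[x[-1]])
    return list(reversed(x)), max(delta)

path, p_star = viterbi(P, R, mu, y)

def joint(x):
    p = mu[x[0]] * R[x[0]][y[0]]
    for n in range(1, len(y)):
        p *= P[x[n - 1]][x[n]] * R[x[n]][y[n]]
    return p

best = max(itertools.product(range(2), repeat=len(y)), key=joint)
assert abs(p_star - joint(best)) < 1e-12
```

Like the forward procedure, the dynamic programming decomposition reduces the cost from O(|S|^N) to O(N |S|²).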
given the observation Y , where P(Y |θ) is defined in (3.33). This maxi-
mization is nontrivial since it is generally nonconvex and involves hidden
variables. Below we introduce a very simple iterative method that gives rise
to a local maximizer, the well-known Baum-Welch algorithm proposed in
the 1970s [BE67].
To do this let us first introduce the backward procedure. We define the
backward variable βn (i), which represents the probability of the observation
sequence from time n + 1 to N , given that the hidden state at time n is at
state i and the model θ
(3.39) β_n(i) = P(Y_{n+1:N} = y_{n+1:N} | X_n = i, θ), n = N − 1, . . . , 1.
The calculation of {β_n(i)} can be simplified using the dynamic programming
principle:
(3.40) β_n(i) = ∑_{j∈S} p_ij r_{j y_{n+1}} β_{n+1}(j), n = N − 1, . . . , 1.
∑_{n=1}^{N−1} ξ_n(i, j): expected number of transitions from state i to j given (Y, θ).
Note that the above equivalence is up to a constant, e.g., the number of
transitions starting from state i in the whole observation time series.
Algorithm 4 (Baum-Welch algorithm). Parameter estimation of the HMM.
(i) Set up an initial HMM θ = (P , R, μ).
(ii) Update the parameter θ through
(3.42) μ_i* = γ_1(i),
(3.43) p_ij* = ∑_{n=1}^{N−1} ξ_n(i, j) / ∑_{n=1}^{N−1} γ_n(i),
(3.44) r_jk* = ∑_{n=1, Y_n=k}^{N} γ_n(j) / ∑_{n=1}^{N} γ_n(j).
(iii) Repeat the above procedure until convergence.
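A single Baum-Welch update can be sketched as follows for a made-up two-state model. The quantities γ_n(i) = P(X_n = i | Y, θ) and ξ_n(i, j) = P(X_n = i, X_{n+1} = j | Y, θ) are defined earlier in the chapter (outside this excerpt); here they are computed from the forward and backward variables in the standard way. EM guarantees that the likelihood does not decrease:

```python
P  = [[0.6, 0.4], [0.5, 0.5]]     # initial guess (made up)
R  = [[0.7, 0.3], [0.3, 0.7]]
mu = [0.5, 0.5]
y  = [0, 0, 1, 0, 1, 1]
S, N = range(2), len(y)

def forward_backward(P, R, mu, y):
    alpha = [[mu[i] * R[i][y[0]] for i in S]]
    for n in range(1, N):
        alpha.append([sum(alpha[-1][i] * P[i][j] for i in S) * R[j][y[n]] for j in S])
    beta = [[1.0, 1.0]]                       # beta_N(i) = 1
    for n in range(N - 2, -1, -1):            # recursion (3.40)
        beta.insert(0, [sum(P[i][j] * R[j][y[n + 1]] * beta[0][j] for j in S) for i in S])
    return alpha, beta, sum(alpha[-1])        # last entry is P(Y | theta)

alpha, beta, like = forward_backward(P, R, mu, y)
gamma = [[alpha[n][i] * beta[n][i] / like for i in S] for n in range(N)]
xi = [[[alpha[n][i] * P[i][j] * R[j][y[n + 1]] * beta[n + 1][j] / like
        for j in S] for i in S] for n in range(N - 1)]

# Updates (3.42)-(3.44)
mu2 = [gamma[0][i] for i in S]
P2  = [[sum(xi[n][i][j] for n in range(N - 1)) / sum(gamma[n][i] for n in range(N - 1))
        for j in S] for i in S]
R2  = [[sum(gamma[n][j] for n in range(N) if y[n] == k) / sum(gamma[n][j] for n in range(N))
        for k in (0, 1)] for j in S]

like2 = forward_backward(P2, R2, mu2, y)[2]
assert like2 >= like - 1e-12                  # EM monotonicity
```

Step (iii) simply repeats this update with θ replaced by (P2, R2, mu2) until the likelihood stops improving.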
node i to node j. The simplest example of the weight matrix is given by the
adjacency matrix: eij = 0 or 1, depending on whether i and j are connected.
Below we will focus on the situation when the network is undirected; i.e.,
W is symmetric.
Given a network, one can define naturally a discrete time Markov chain,
with the transition probability matrix P = (p_ij)_{i,j∈S} given by
(3.45) p_ij = e_ij/d_i, d_i = ∑_{k∈S} e_ik.
Here d_i is the degree of the node i [Chu97, Lov96]. Let
(3.46) π_i = d_i / ∑_{k∈S} d_k.
Then it is easy to see that
(3.47) ∑_{i∈S} π_i p_ij = π_j;
i.e., π is an invariant distribution of this Markov chain. Furthermore, one
has the detailed balance relation:
(3.48) π_i p_ij = π_j p_ji.
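A quick numerical check of (3.45)-(3.48) on a small made-up weighted network:

```python
import math

# Symmetric weight matrix W = (e_ij) of an undirected 4-node network.
E = [[0.0, 1.0, 2.0, 0.0],
     [1.0, 0.0, 1.0, 1.0],
     [2.0, 1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0]]
I = 4
d = [sum(row) for row in E]                        # degrees d_i in (3.45)
P = [[E[i][j] / d[i] for j in range(I)] for i in range(I)]
pi = [d[i] / sum(d) for i in range(I)]             # (3.46)

# (3.47): pi is invariant.
for j in range(I):
    assert math.isclose(sum(pi[i] * P[i][j] for i in range(I)), pi[j])
# (3.48): detailed balance, since pi_i p_ij = e_ij / sum_k d_k is symmetric.
for i in range(I):
    for j in range(I):
        assert math.isclose(pi[i] * P[i][j], pi[j] * P[j][i])
```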
For u = (u_i)_{i∈S}, v = (v_i)_{i∈S} defined on S, introduce the inner product
(u, v)_π = ∑_{i∈S} u_i v_i π_i.
With this inner product, P is self-adjoint in the sense that
(P u, v)_π = (u, P v)_π.
Denote by {λ_k}_{k=0,...,I−1} the eigenvalues of P and by {ϕ_k}_{k=0}^{I−1} and {ψ_k}_{k=0}^{I−1} the
corresponding right and left eigenvectors. We have (ϕ_j, ϕ_k)_π = δ_jk, ψ_{k,i} =
ϕ_{k,i} π_i, and
(3.49) P ϕ_k = λ_k ϕ_k, ψ_k^T P = λ_k ψ_k^T, k = 0, 1, . . . , I − 1.
Moreover, one can easily see that 1 is an eigenvalue and all other eigenvalues
lie in the interval [−1, 1]. We will order them according to 1 = λ_0 ≥ |λ_1| ≥
· · · ≥ |λ_{I−1}|. Note that ψ_0 = π and ϕ_0 is a constant vector. The spectral
decomposition of P^t is then given by
(3.50) (P^t)_ij = ∑_{k=0}^{I−1} λ_k^t ϕ_{k,i} ψ_{k,j}.
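For a two-state chain the eigenvectors can be written down explicitly, and (3.50) can then be verified directly (the chain below and its rates are our illustrative choice):

```python
import math

a, b = 0.3, 0.2                     # two-state chain, made-up rates
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]     # invariant distribution

# Right eigenvectors, normalized so that (phi_j, phi_k)_pi = delta_jk;
# eigenvalues are 1 and 1 - a - b.
phi = [[1.0, 1.0],
       [a / math.sqrt(a * b), -b / math.sqrt(a * b)]]
lam = [1.0, 1.0 - a - b]
psi = [[phi[k][i] * pi[i] for i in range(2)] for k in range(2)]  # psi_{k,i} = phi_{k,i} pi_i

def matpow(P, t):
    M = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(t):
        M = [[sum(M[i][k] * P[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
    return M

# Verify the spectral decomposition (3.50): (P^t)_ij = sum_k lam_k^t phi_{k,i} psi_{k,j}
for t in range(5):
    Pt = matpow(P, t)
    for i in range(2):
        for j in range(2):
            s = sum(lam[k] ** t * phi[k][i] * psi[k][j] for k in range(2))
            assert math.isclose(Pt[i][j], s, abs_tol=1e-12)
```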
Equation (3.53) says that the probability of jumping from any state in Ck
is the same and the walker enters Cl according to the invariant distribution.
This is consistent with the idea of coarsening the original dynamics onto the
new state space C = {C1 , . . . , CK } and ignoring the details of the dynamics
within the sets Ck . Note that P̃ is a stochastic matrix on S with invariant
distribution π if P̂ is a stochastic matrix on C with invariant distribution
π̂.
One can ask further about the optimal partition, and this can be approached
by minimizing the approximation error with respect to all possible partitions.
Algebraically, the existence of community structure or clustering property of
Exercises
3.1. Write down the transition probability matrix P of Ehrenfest’s dif-
fusion model and analyze its invariant distribution.
3.2. Suppose {Xn }n∈N is a Markov chain on a finite state space S. Prove
that
P(A ∩ {X_{m+1} = i_{m+1}, . . . , X_{m+n} = i_{m+n}} | X_m = i_m)
= P(A | X_m = i_m) P_{i_m}({X_1 = i_{m+1}, . . . , X_n = i_{m+n}}),
where the set A is an arbitrary union of elementary events like
{(X0 , . . . , Xm−1 ) = (i0 , . . . , im−1 )}.
3.3. Prove Lemma 3.7.
3.4. Show that the exponent s in P s in Theorem 3.11 necessarily satis-
fies s ≤ I − 1, where I = |S| is the number of states of the chain.
3.5. Show that the recurrence relation μ_n = μ_{n−1}P can be written as
μ_{n,i} = μ_{n−1,i} (1 − ∑_{j∈S, j≠i} p_ij) + ∑_{j∈S, j≠i} μ_{n−1,j} p_ji.
The first term on the right-hand side gives the total probability
of not making a transition from state i, while the last term is the
probability of a transition from one of the states j ≠ i to state i.
3.6. Prove (3.8) by mathematical induction.
3.7. Derive (3.8) through the moment generating function
g(t, z) = ∑_{m=0}^∞ p_m(t) z^m.
3.8. Let f be a function defined on the state space S, and let
h_i(t) = E_i f(X_t), i ∈ S.
Prove that dh/dt = Qh for h = (h_1(t), . . . , h_I(t))^T.
3.9. Let {Nt1 } and {Nt2 } be two independent Poisson processes with
parameters λ1 and λ2 , respectively. Prove that Nt = Nt1 + Nt2 is a
Poisson process with parameter λ1 + λ2 .
3.10. Let {Nt } be a Poisson process with rate λ and let τ ∼ E(μ) be an
independent random variable. Prove that
P(Nτ = m) = (1 − p)m p, m ∈ N,
where p = μ/(λ + μ).
72 3. Markov Chains
3.11. Let {N_t} be a Poisson process with rate λ. Show that the distribution of the k-th arrival time τ_k (the time of the k-th jump) is a gamma distribution with probability density
p(t) = λ^k t^{k−1} e^{−λt}/Γ(k), t > 0,
where Γ(k) = (k − 1)! is the gamma function.
3.12. Consider the Poisson process {Nt } with rate λ. Given that Nt = n,
prove that the n arrival times J1 , J2 , . . . , Jn have the same distribu-
tion as the order statistics corresponding to n independent random
variables uniformly distributed in (0, t). That is, (J_1, . . . , J_n) has
the probability density
p(x_1, x_2, . . . , x_n | N_t = n) = n!/t^n, 0 < x_1 < · · · < x_n < t.
3.13. The compound Poisson process {X_t} (t ∈ R_+) is defined by
X_t = ∑_{k=1}^{N_t} Y_k,
where {N_t} is a Poisson process with rate λ and {Y_k} are i.i.d. random variables on R with probability density function q(y). Derive
the equation for the probability density function p(x, t) of X_t.
3.14. Suppose that {N_t^(k)} (k = 1, 2, . . .) are independent Poisson processes with rates λ_k, respectively, and λ = ∑_{k=1}^∞ λ_k < ∞. Prove that
{X_t = ∑_{k=1}^∞ k N_t^(k)} is a compound Poisson process.
3.15. Prove (3.36) and (3.40).
3.16. Consider an irreducible Markov chain {Xn }n∈N on a finite state
space S. Let H ⊂ S. Define the first passage time TH = inf{n|Xn ∈
H, n ∈ N} and
hi = Pi (TH < ∞), i ∈ S.
Prove that h = (hi )i∈S satisfies the equation
(I − P ) · h = 0 with boundary condition hi = 1, i ∈ H,
where P is the transition probability matrix.
3.17. Connection between resistor networks and Markov chains.
(a) Consider a random walk {X_n}_{n∈N} on a one-dimensional lattice S with states {1, 2, . . . , I}. Assume that the transition
probability p_ij > 0 for |i − j| = 1 and ∑_{j:|j−i|=1} p_ij = 1, where
i, j ∈ S. Let x_L = 1, x_R = I. Define the first passage times
T_L = inf{n | X_n = x_L, n ∈ N}, T_R = inf{n | X_n = x_R, n ∈ N}
Notes
The finite state Markov chain is one of the most commonly used stochastic
models in practice. More refined notions, such as recurrence and transience,
can be found in [Dur10, Nor97]. Ergodicity for Markov chains with count-
able or continuous states is discussed in [MT93]. Our discussion of the
hidden Markov model follows [Rab89]. The Baum-Welch algorithm is essentially the EM (expectation-maximization) algorithm applied to the
HMM [DLR77, Wel03]. However, it was proposed long before the EM
algorithm.
More theoretical studies about the random walk on graphs can be found
in [Chu97, Lov96]. One very interesting application is the PageRank algo-
rithm for ranking webpages [BP98].
Chapter 4

Monte Carlo Methods
U. Then as β = T^{−1} → ∞,
(4.2) Z_β^{−1} e^{−βU(x)} → δ(x − x*),
where x* is the global minimum of U. For this reason, sampling
can be viewed as a “finite temperature” version of optimization.
This analogy between sampling and optimization can be used in at
least two ways. The first is that ideas used in optimization algo-
rithms, such as the Gauss-Seidel and the multigrid methods, can be
adopted in the sampling setting. The second is a way of perform-
ing global optimization, by sampling the probability distribution
on the left-hand side of (4.2), in the “low temperature” limit when
β → ∞. This idea is the essence of one of the best known global
minimization techniques, the simulated annealing method.
76 4. Monte Carlo Methods
(4.4) I_N(f) = (1/N) ∑_{i=1}^N f(X_i).

(4.5) E|e_N|² = (1/N²) ∑_{i,j=1}^N E[(f(X_i) − I(f))(f(X_j) − I(f))] = (1/N) Var(f)

by using independence, where Var(f) = ∫_0^1 (f(x) − I(f))² dx. Therefore, we
conclude that
(4.6) I_N(f) − I(f) ≈ σ_f/√N,
where σ_f = √Var(f) is the standard deviation. We can compare this with
the error bound for the trapezoidal rule [HNW08] with N points
(4.7) e_N ≈ (1/12) h² max_{x∈[0,1]} |f''(x)|.
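A small experiment illustrating the O(N^{−1/2}) behavior in (4.6), using f(x) = x² on [0, 1], for which I(f) = 1/3 and Var(f) = 4/45 (sample sizes and tolerances are our choices):

```python
import random, math

random.seed(0)
f = lambda x: x * x
I_exact = 1.0 / 3.0
sigma_f = math.sqrt(4.0 / 45.0)     # Var(f) = int_0^1 (x^2 - 1/3)^2 dx = 4/45

def mc_error(N, reps=200):
    # root-mean-square error of I_N(f) over independent repetitions
    se = 0.0
    for _ in range(reps):
        IN = sum(f(random.random()) for _ in range(N)) / N
        se += (IN - I_exact) ** 2
    return math.sqrt(se / reps)

e100, e10000 = mc_error(100), mc_error(10000)
# Increasing N by a factor of 100 should shrink the RMS error by about 10.
ratio = e100 / e10000
assert 5.0 < ratio < 20.0                      # ~10 expected, allow statistical slack
assert abs(e100 - sigma_f / math.sqrt(100)) < 0.01
```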
One important fact about the LCG is that it shows very poor behav-
ior for the s-tuples, i.e., the vectors (Xn , Xn+1 , . . . , Xn+s−1 ). In [Mar68],
Marsaglia proved the following:
Theorem 4.2. Let {Y_n} be a sequence generated by (4.8) and let X_n =
Y_n/m. Then the s-tuples (X_n, X_{n+1}, . . . , X_{n+s−1}) lie on at most
(s!m)^{1/s} equidistant parallel hyperplanes within the s-dimensional hypercube
[0, 1]^s.
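Theorem 4.2 can be seen at work on the notorious generator RANDU, Y_{n+1} = 65539·Y_n mod 2³¹ (a multiplicative LCG of the form (4.8), which is not reproduced in this excerpt). Since 65539 = 2¹⁶ + 3, one has a² ≡ 6a − 9 (mod 2³¹), so every triple of consecutive outputs satisfies a fixed linear relation, and the normalized triples fall on at most 15 parallel planes in [0, 1]³:

```python
m, a = 2 ** 31, 65539     # RANDU parameters
Y = [1]
for _ in range(1000):
    Y.append(a * Y[-1] % m)

# a^2 = 6a - 9 (mod 2^31)  =>  Y_{n+2} - 6 Y_{n+1} + 9 Y_n = 0 (mod 2^31),
# which pins the triples (X_n, X_{n+1}, X_{n+2}) to a few parallel planes.
for n in range(len(Y) - 2):
    assert (Y[n + 2] - 6 * Y[n + 1] + 9 * Y[n]) % m == 0
```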
The deviation from the uniform distribution over [0, 1]^s is apparent when
s is large. In spite of this, the LCG is still one of the most widely used pseudo-random number generators. There are also nonlinear generators which aim
to overcome this problem [RC04]. Some very recent mathematical software
4.2. Generation of Random Variables
Figure 4.1. Left panel: The PDF of Y . Right panel: The distribution
function F (y).
The Box-Muller Method. The basic idea is to use the strategy presented
above in two dimensions instead of one. Consider a two-dimensional Gauss-
ian random vector with independent components. Using polar coordinates
x1 = r cos θ, x2 = r sin θ, we have
(1/(2π)) e^{−(x_1²+x_2²)/2} dx_1 dx_2 = (1/(2π)) dθ · (e^{−r²/2} r dr).
This means that θ and r² are independent: θ is uniformly distributed
over the interval [0, 2π], and r² is exponentially distributed. Hence if X, Y ∼
U[0, 1] i.i.d. and we let R = √(−2 log X), Θ = 2πY, then
Z_1 = R cos Θ, Z_2 = R sin Θ
are independent and have the desired distribution N(0, 1).
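A direct implementation of the Box-Muller method (sample sizes and tolerances are our choices):

```python
import math, random

def box_muller(u1, u2):
    # maps two independent U[0,1] samples to two independent N(0,1) samples
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

random.seed(1)
samples = []
for _ in range(50000):
    u1 = 1.0 - random.random()        # in (0, 1], keeps log well-defined
    z1, z2 = box_muller(u1, random.random())
    samples += [z1, z2]

n = len(samples)
mean = sum(samples) / n
var = sum((z - mean) ** 2 for z in samples) / n
# with n = 10^5 the sample mean and variance should be near 0 and 1
assert abs(mean) < 0.02 and abs(var - 1.0) < 0.05
```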
[Figure: rejection sampling — proposals are drawn uniformly in the box [a, b] × [0, d]; those below the graph of f(x) are accepted, those above are rejected.]
clearly limited by the computational cost we are willing to afford. But the
variance can be manipulated in order to reduce the size of the error.
∫_{−10}^{10} e^{−x²/2} dx ≈ (20/N) ∑_{i=1}^N e^{−X_i²/2},
where {X_i}_{i=1}^N are i.i.d. random variables that are uniformly distributed
on [−10, 10]. However, notice that the integrand e^{−x²/2} is an extremely
[Figure: an importance function p(x) compared with the integrand f(x).]
If the {X_i}'s are sampled according to a different distribution, say with
density function p(x), then since
∫ f(x) dx = ∫ (f(x)/p(x)) p(x) dx = E_p[(f/p)(X)] = lim_{N→∞} (1/N) ∑_{i=1}^N f(X_i)/p(X_i),
we should approximate ∫ f(x) dx by
I_N^p(f) = (1/N) ∑_{i=1}^N f(X_i)/p(X_i).
The error can be estimated in the same way as before, and we get
E(I(f) − I_N^p(f))² = (1/N) Var(f/p) = (1/N) [ ∫ f²(x)/p(x) dx − I²(f) ].
From this, we see that an ideal importance function is
p(x) = Z^{−1} f(x)
if f is nonnegative, where Z is the normalization factor Z = ∫ f(x) dx. In
this case
I(f) = I_N^p(f).
This is not a miracle since all the necessary work has gone into computing
Z, which was our original task.
Nevertheless, this suggests how the sample distribution should be constructed. For the example discussed earlier, we can pick a p(x) that behaves
like e^{−x²/2} and at the same time can be sampled with a reasonable cost.
This is the basic idea behind importance sampling. One should note that
importance sampling bears a lot of similarity with preconditioning in linear
algebra.
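The example above can be made concrete: for f(x) = e^{−x²/2} on [−10, 10], taking p to be the N(0, 1) density is essentially the ideal importance function, so the weights f(X_i)/p(X_i) are constant and the estimator is exact up to the negligible truncation of the Gaussian tails (sampler choice and sample sizes are ours):

```python
import math, random

random.seed(2)
f = lambda x: math.exp(-x * x / 2.0)
I_exact = math.sqrt(2.0 * math.pi)   # truncation error on [-10, 10] is ~1e-23

def plain_mc(N):
    # uniform sampling on [-10, 10]
    return 20.0 / N * sum(f(random.uniform(-10.0, 10.0)) for _ in range(N))

def importance(N):
    # p(x) = standard normal density; I_N^p(f) = (1/N) sum f(X_i)/p(X_i)
    s = 0.0
    for _ in range(N):
        x = random.gauss(0.0, 1.0)
        s += f(x) / (math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi))
    return s / N

# With p proportional to f, every weight equals sqrt(2*pi): the estimator is exact.
assert abs(importance(100) - I_exact) < 1e-9
# The plain estimator needs far more samples for comparable accuracy.
assert abs(plain_mc(100000) - I_exact) < 0.1
```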
A slight variant of these ideas is as follows [Liu04]. To compute
I = ∫ f(x)π(x) dx,
Notice that
(1 + r)^{−1} ≈ h(x) = { 1, x ≤ 0; 0, x > 0 }.
We have
I(f) = (1/√(2π)) ∫_{−∞}^{+∞} [ (1 + r)^{−1} − h(x) ] e^{−x²/2} dx + 1/2.
Here h(x) is the control variate.
Î = (1/N) ∑_{i=1}^N f(X_i).
Suppose that x can be decomposed into two parts (x^(1), x^(2)) and the conditional expectation E(f(X)|x^(2)) can be approximated with good accuracy
and low cost. We can define another unbiased estimator of I by
Ĩ = (1/N) ∑_{i=1}^N E(f(X)|X_i^(2)).
If the computational effort for obtaining the two estimates is similar, then
I˜ should be preferred. This is because [Dur10]
(4.10) Var(f (X)) = Var(E(f (X)|X (2) )) + E(Var(f (X)|X (2) )),
which implies
where Q = (Qij ) is a transition probability matrix with Qij being the proba-
bility of generating state j from state i and Aij is the probability of accepting
state j. We will call Q the proposal matrix and A the decision matrix. We
will require Q to be symmetric for the moment:
Qij = Qji .
Then the detailed balance condition is reduced to
A_ij/A_ji = e^{−βΔH_ij}.
• Glauber dynamics.
(4.14) A_ij = (1 + e^{βΔH_ij})^{−1}.
Both choices satisfy the detailed balance condition. Usually the Metropolis strategy results in faster convergence since its off-diagonal elements are
larger than those of the Glauber dynamics [Liu04]. Note that in the Metropolis strategy, we accept the new configuration if its energy is lower than
that of the current state. But if the energy of the new configuration is
higher, we still accept it with some positive probability.
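Both claims — detailed balance for (4.13) and (4.14), and the domination of the Glauber acceptance probabilities by the Metropolis ones — are easy to check numerically (the energies below are made up):

```python
import math

beta = 1.3
H = [0.0, 0.7, -0.4, 1.1]                 # made-up state energies
I = len(H)
Z = sum(math.exp(-beta * h) for h in H)
pi = [math.exp(-beta * h) / Z for h in H]  # Gibbs distribution

def metropolis(i, j):          # (4.13): A_ij = min(1, exp(-beta * dH_ij))
    return min(1.0, math.exp(-beta * (H[j] - H[i])))

def glauber(i, j):             # (4.14): A_ij = (1 + exp(beta * dH_ij))^{-1}
    return 1.0 / (1.0 + math.exp(beta * (H[j] - H[i])))

for A in (metropolis, glauber):
    for i in range(I):
        for j in range(I):
            # with a symmetric proposal Q, detailed balance is pi_i A_ij = pi_j A_ji
            assert math.isclose(pi[i] * A(i, j), pi[j] * A(j, i))

# Metropolis acceptance dominates Glauber's entry by entry.
assert all(metropolis(i, j) >= glauber(i, j) for i in range(I) for j in range(I))
```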
The choice of Q depends on the specific problem. To illustrate this, we
consider the one-dimensional Ising model as an example.
x = (x_1, x_2, . . . , x_M),
[Figure: a one-dimensional spin system with sites 1, 2, . . . , M.]
(4.17) ⟨f⟩ ≈ f̄_N := (1/N) ∑_{j=1}^N f(X_j),
where {Xj } is the time series generated according to the Markov chain P .
The ergodic Theorem 3.12 ensures the convergence of the empirical average
(4.17) if the chain P satisfies the irreducibility and primitivity conditions.
Coming back to the choice of Q, let N_t = 2^M, the total number of
microscopic configurations. The following are two simple options:
Proposal 1. Equiprobability proposal.
Q(x → y) = 1/(N_t − 1) if x ≠ y, and Q(x → y) = 0 if x = y.
We should remark that the symmetry condition for the proposal matrix
Q can also be relaxed, and this leads to the Metropolis-Hastings algorithm:
(4.19) A_ij = min(1, (Q_ji π_j)/(Q_ij π_i)).
There is an obvious trade-off between the proposal step and the decision
step. One would like the proposal step to be more aggressive since this will
allow us to sample larger parts of the configuration space. However, this
may also result in more proposals being rejected.
Regarding the issue of convergence, we refer the readers to [Liu04,
Win03].
where Z = ∫ L(θ|X)μ(θ) dθ is simply a normalization constant that only
depends on the data X. One can sample the posterior distribution of θ
according to (4.21) or extract the posterior mean
θ̄ = EΘ = ∫ θ π(θ|X) dθ
using the Metropolis algorithm. The Markov chain Monte Carlo method is
well suited here for computing θ̄ since it does not need the actual value of Z.
Table 4.1. Classification of sites according to the spins at the site and
neighboring sites.
For each given state x, let n_j be the number of sites in the jth class,
and define
Q_i = ∑_{j=1}^i n_j P_j, i = 1, . . . , 6,
of U (x). This issue becomes quite severe when we have multiple peaks in
π(x), which is very common in real applications (see Figure 4.5).
To improve the mixing rate of the MCMC algorithm in the low temperature regime, Marinari and Parisi [MP92] and Geyer and Thompson
[GT95] proposed the simulated tempering technique, by varying the tem-
perature during the simulation. The idea of simulated tempering is to utilize
the virtue of the fast exploration on the overall energy landscape at high
temperature and the fine resolution of local details at low temperature, thus
reducing the chances that the simulation gets trapped at local metastable
states.
Figure 4.5. The Gibbs distribution at low temperature (left panel) and
high temperature (right panel).
not feasible in real applications and one only obtains some guidance through
some pilot studies [GT95].
To perform the MCMC in the augmented space, here we present a
mixture-type conditional sampling instead of a deterministic alternating ap-
proach.
Algorithm 9 (Simulated tempering). Set mixture weight α0 ∈ (0, 1).
Step 1. Given (xn , in ) = (x, i), draw u ∼ U [0, 1].
Step 2. If u < α0 , let in+1 = i and sample xn+1 with the MCMC for the
Gibbs distribution πst (x|i) from the current state xn .
Step 3. If u ≥ α0 , let xn+1 = x and sample in+1 with the MCMC for
the distribution πst (i|x) from the current state in .
If x ∈ M, we have
π_β(x) = 1 / (|M| + ∑_{z∉M} e^{−β(H(z)−m)});
hence π_β(x) monotonically increases with β.
If x ∉ M, we have
∂π_β(x)/∂β = (1/Z̃_β²) e^{−β(H(x)−m)} [ (m − H(x))Z̃_β − ∑_{z∈X} e^{−β(H(z)−m)}(m − H(z)) ],
where Z̃_β := ∑_{z∈X} exp(−β(H(z) − m)). Note that
lim_{β→+∞} [ (m − H(x))Z̃_β − ∑_{z∈X} e^{−β(H(z)−m)}(m − H(z)) ] = |M|(m − H(x)) < 0.
Hence π_β(x) is monotonically decreasing in β.
Exercises
4.1. For different choices of N, apply the simple Monte Carlo method
and importance sampling to approximate the integral
I(f) = ∫_{−10}^{10} e^{−x²/2} dx.
I_{2N}(f) = (1/(2N)) ∑_{i=1}^N ( f(X_i) + f(1 − X_i) ), X_i ∼ i.i.d. U[0, 1].
I_N = (1/N) ∑_{k=1}^M ∑_{i=1}^{N_k} f(X_i^(k)).
Prove that the variance is reduced with this uniformization; i.e.,
Var(I_N) ≤ Var(I) = ∫_S (f(x) − I)² dx.
4.7. Prove that both the Metropolis strategy (4.13) and Glauber dy-
namics (4.14) satisfy the detailed balance condition.
4.8. Prove that when the proposal matrix Q is not symmetric, the
detailed balance condition can be preserved by choosing the
Metropolis-Hastings decision matrix (4.19).
4.9. Prove that the Markov chain set up by the Gibbs sampling (4.18)
and Metropolis strategy (4.13) is primitive for the one-dimensional
Ising model.
(b) Define the bond variable u = (u_ij) on each edge and the extended Gibbs distribution
π(x, u) ∝ ∏_{⟨i,j⟩} χ_{{0 ≤ u_ij ≤ exp{βJ(1 + x_i x_j)}}}(x, u).
Notes
Buffon’s needle problem is considered as the most primitive form of the
Monte Carlo methods [Ram69]. The modern Monte Carlo method begins
from the pioneering work of von Neumann, Ulam, Fermi, Metropolis, et al.
in the 1950s [And86, Eck87].
A more detailed discussion on variance reduction can be found in
[Gla04]. In practical applications, the choice of the proposal is essential
for the efficiency of the Markov chain Monte Carlo method. In this re-
gard, we mention Gibbs sampling which is easy to implement, but its ef-
ficiency for high-dimensional problems is questionable. More global moves
like the Swendsen-Wang algorithm have been proposed [Tan96]. Other
Stochastic Processes
Before going further, we discuss briefly the general features of the two most important classes of stochastic processes, the Markov process and the Gaussian
process.
There are two ways to think about stochastic processes:
Notice that for this process the number of all possible outputs is un-
countable. We cannot just define the probability of an event by summing
up the probabilities of every atom in the event, as for the case of discrete
random variables. In fact, if we define Ω = {H, T }N , the probability of an
atom ω = (ω1 , ω2 , . . . , ωn , . . .) ∈ {H, T }N is 0; i.e.,
P(X_1(ω) = k_1, X_2(ω) = k_2, . . . , X_n(ω) = k_n, . . .) = lim_{n→∞} (1/2)^n = 0, k_j ∈ {0, 1}, j = 1, 2, . . . ,
and events such as {X_n(ω) = 1} involve uncountably many elements.
To set up a probability space (Ω, F , P) for this process, it is natural to
take Ω = {H, T }N and the σ-algebra F as the smallest σ-algebra containing
all events of the form
(5.1) C = {ω ∈ Ω : ω_{i_k} = a_k ∈ {H, T}, k = 1, 2, . . . , m}
for any 1 ≤ i1 < i2 < · · · < im and m ∈ N. These are sets whose finite
time projections are specified. They are called cylinder sets, which form an
algebra in F . We denote them generically by C. The probability of an event
of the form (5.1) is given by
P(C) = 1/2^m.
It is straightforward to check that for any cylinder set F ∈ C, the probability
P(F ) coincides with the one defined in Example 5.1 for the independent coin
tossing process. From the extension theorem of measures, this probability
measure P is uniquely defined on F since it is well-defined on C [Bil79].
Another natural way is to take Ω = {0, 1}N , F being the smallest σ-
algebra containing all cylinder sets in Ω, and P is defined in the same way
as above. With this choice we have
Xn (ω) = ωn , ω ∈ Ω = {0, 1}N ,
which is called a coordinate process in the sense that Xn (ω) is just the nth
coordinate of ω.
In general, a stochastic process is a parameterized random variable
{Xt }t∈T defined on a probability space (Ω, F , P) taking values in R; the
parameter set T can be N, [0, +∞), or some finite interval. For any fixed
t ∈ T, we have a random variable
X_t : Ω → R, ω ↦ X_t(ω).
For any fixed ω ∈ Ω, we have a function of t, called a sample path of the process:
X_·(ω) : T → R, t ↦ X_t(ω).
One simple example of a stopping time for the coin tossing process is
T = inf{ n : there exist three consecutive 0's in {X_k}_{k≤n} }.
It is easy to show that the condition {T ≤ n} ∈ Fn is equivalent to
{T = n} ∈ Fn for discrete time processes.
= lim_{t→0+} (1/t) [ f(n)(e^{−λt} − 1) + f(n + 1)λt e^{−λt} + ∑_{k=n+2}^∞ f(k) ((λt)^{k−n}/(k − n)!) e^{−λt} ]
= λ(f(n + 1) − f(n)).
It is straightforward to see that
A∗ f (n) = λ(f (n − 1) − f (n)).
For u(t, n) = E_n f(X_t), we have
(5.7) du/dt = (d/dt) ∑_{k=n}^∞ f(k) ((λt)^{k−n}/(k − n)!) e^{−λt} = λ(u(n + 1) − u(n)) = Au(t, n).
The facts that the equation for the expectation of a given function has
the form of backward Kolmogorov equations (5.5) and (5.7) and the distri-
bution satisfies the form of forward Kolmogorov equations (5.6) and (5.8)
are general properties of Markov processes. This will be seen again when
we study the diffusion processes in Chapter 8.
These discussions can be easily generalized to the multidimensional set-
ting when Xt ∈ Rd .
Below we will show that a Gaussian process is determined once its mean
and covariance function
(5.9) m(t) = EXt , K(s, t) = E(Xs − m(s))(Xt − m(t))
are specified. Given these two functions, a probability space (Ω, F , P) can
be constructed and μt1 ,...,tk is exactly the finite-dimensional distribution of
(Xt1 , . . . , Xtk ) under P.
For the above to hold, there must exist numbers m and σ² such that
m = lim_k m_k, σ² = lim_k σ_k²,
and correspondingly Ee^{iξX} = e^{iξm − σ²ξ²/2}. This gives the desired result.
in the L∞ ([0, T ] × [0, T ]) sense, where {λk , φk (t)}k is the eigensystem of the
operator K.
Theorem 5.13 (Karhunen-Loève expansion). Let (Xt )t∈[0,1] be a Gaussian
process with mean 0 and covariance function K(s, t). Assume that K is
continuous. Let {λk }, {φk } be the sequence of eigenvalues and eigenfunctions
for the eigenvalue problem
∫_0^1 K(s, t)φ_k(t) dt = λ_k φ_k(s), k = 1, 2, . . . ,
with the normalization condition ∫_0^1 φ_k φ_j dt = δ_kj. Then X_t has the representation
(5.11) X_t = ∑_{k=1}^∞ α_k √λ_k φ_k(t),
in the sense that the series
(5.12) X_t^N = ∑_{k=1}^N α_k √λ_k φ_k(t) → X_t in L_t^∞ L_ω²;
k=1
i.e., we have
lim_{N→∞} sup_{t∈[0,1]} E|X_t^N − X_t|² = 0.
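As a concrete instance of Theorem 5.13 (our example; not part of this excerpt): for the covariance K(s, t) = s ∧ t, one can check numerically that λ_k = ((k − 1/2)π)^{−2} and φ_k(t) = √2 sin((k − 1/2)πt) solve the eigenvalue problem, and that the Mercer partial sums converge to K:

```python
import math

# Eigenpairs of K(s, t) = min(s, t) on [0, 1] (Wiener process covariance).
def lam(k):
    return 1.0 / ((k - 0.5) ** 2 * math.pi ** 2)

def phi(k, t):
    return math.sqrt(2.0) * math.sin((k - 0.5) * math.pi * t)

# Check  int_0^1 K(s, t) phi_k(t) dt = lambda_k phi_k(s)  by the midpoint rule.
M = 2000
h = 1.0 / M
for k in (1, 2, 5):
    for s in (0.3, 0.7):
        integral = sum(min(s, (m + 0.5) * h) * phi(k, (m + 0.5) * h)
                       for m in range(M)) * h
        assert abs(integral - lam(k) * phi(k, s)) < 1e-4

# Mercer: sum_k lambda_k phi_k(s) phi_k(t) -> K(s, t)
s, t = 0.4, 0.9
approx = sum(lam(k) * phi(k, s) * phi(k, t) for k in range(1, 2000))
assert abs(approx - min(s, t)) < 1e-3
```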
Proof. We need to show that the random series is well-defined and that it
is a Gaussian process with the desired mean and covariance function.
First consider the operator K : L2 ([0, 1]) → L2 ([0, 1]) defined as in (5.10).
Thanks to Theorem 5.10, we know that K is nonnegative and compact.
Therefore it has countably many real eigenvalues, for which 0 is the only
possible accumulation point [Lax02]. From Mercer's theorem, we have
∑_{k=1}^N λ_k φ_k(s)φ_k(t) → K(s, t), s, t ∈ [0, 1], N → ∞,
We denote the unique limit by Xt . For any t1 , t2 , . . . , tm ∈ [0, 1], the mean
square convergence of the Gaussian random vector (XtN1 , XtN2 , . . . , XtNm ) to
(Xt1 , Xt2 , . . . , Xtm ) implies the convergence in probability. From Theorem
5.11, we have that the limit X_t is indeed a Gaussian process. Since X^N
converges to X in L_t^∞ L_ω², we have
Exercises
5.1. Prove that in Definition 5.4 the condition {T ≤ n} ∈ Fn is equiva-
lent to {T = n} ∈ Fn for discrete time stochastic processes.
5.2. Prove Proposition 5.5.
5.3. Let T be a stopping time for the filtration {Fn }n∈N . Define the
σ-algebra generated by T as
FT = {A : A ∈ F , A ∩ {T ≤ n} ∈ Fn for any n ∈ N}.
Prove that FT is indeed a σ-algebra and T is FT -measurable. Gen-
eralize this result to the continuous time case.
5.4. Let S and T be two stopping times. Prove that:
(i) If S ≤ T , then FS ⊂ FT .
(ii) FS∧T = FS ∩ FT , and the events
{S < T }, {T < S}, {S ≤ T }, {T ≤ S}, {T = S}
all belong to FS ∩ FT .
5.5. Consider the compound Poisson process Xt defined in Exercise 3.13.
Find the generator of {Xt }.
5.6. Find the generator of the following Gaussian processes.
(i) Brownian bridge with the mean and covariance function
m(t) = 0, K(s, t) = s ∧ t − st, s, t ∈ [0, 1].
(ii) Ornstein-Uhlenbeck process with the mean and covariance func-
tion
m(t) = 0, K(s, t) = σ 2 exp(−|t − s|/η), η > 0, s, t ≥ 0.
5.7. Let p(t) and q(t) be positive functions on [a, b] and assume that p/q
is strictly increasing. Prove that the covariance function defined by
K(s, t) = p(s)q(t) if s ≤ t, and K(s, t) = p(t)q(s) if t < s,
is a strictly positive definite function in the sense that the matrix
K = (K(t_i, t_j))_{i,j=1}^n
Notes
We did not cover martingales in the main text. Instead, we give a very
brief introduction in Section D of the appendix. The interested readers may
consult [Chu01, KS91, Mey66] for more details.
Our presentation about Markov processes in this chapter is very brief.
But the three most important classes of Markov processes are discussed
throughout this book: Markov chains, diffusion processes, and jump processes used to describe chemical kinetics. One of the most important tools for
treating Markov processes is the master equation (also referred to as the Kolmogorov equation or Fokker-Planck equation), which is either a differential
or a difference equation.
Wiener Process
However as we will see, one can understand the Wiener process in a number
of different ways:
where x = ml and the factor 1/2 in the last equation appears since m
can take only even or odd values depending on whether N is even or odd.
Combining the results above we obtain
W(x, t)Δx = (1/√(2πt l²/τ)) exp(−x²/(2t l²/τ)) Δx.
(we say that the particle is on the positive side in the interval [n − 1, n] if
at least one of the values X_{n−1} and X_n is positive). One can prove that
(6.4) P_{2k,2n} = u_{2k} · u_{2n−2k},
where u_{2k} = P(X_{2k} = 0). Now let γ(2n) be the number of time units that
the particle spends on the positive axis in the interval [0, 2n]. Then when
x < 1,
P(1/2 < γ(2n)/(2n) ≤ x) = ∑_{k: 1/2 < 2k/(2n) ≤ x} P_{2k,2n}.
From the asymptotics (6.2), we have
u_{2k} ∼ 1/√(πk), P_{2k,2n} ∼ 1/(π√(k(n − k)))
as k, n − k → ∞. Therefore
∑_{k: 1/2 < 2k/(2n) ≤ x} [ P_{2k,2n} − (1/(πn)) (k/n)^{−1/2} (1 − k/n)^{−1/2} ] → 0 as n → ∞,
and thus
∑_{k: 1/2 < 2k/(2n) ≤ x} P_{2k,2n} → (1/π) ∫_{1/2}^x dt/√(t(1 − t)) as n → ∞.
From the symmetry of the random walk, we obtain the following theorem.
Theorem 6.3 (Arcsine law). The probability that the fraction of the time
spent by the particle on the positive side is at most x tends to (2/π) arcsin √x;
that is,
(6.5) ∑_{k: k/n ≤ x} P_{2k,2n} → (2/π) arcsin √x as n → ∞.
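The identity (6.4) underlying the arcsine law can be verified by exhaustive enumeration of all 4^n walks of length 2n, with u_{2k} = C(2k, k)/4^k:

```python
from itertools import product
from math import comb

# Enumeration check of (6.4): P(gamma(2n) = 2k) = u_{2k} u_{2n-2k}, where
# gamma(2n) counts the unit intervals [j-1, j] on which the walk is on the
# positive side (at least one endpoint positive) and u_{2k} = P(X_{2k} = 0).
n = 4
counts = {}
for steps in product((1, -1), repeat=2 * n):
    X = [0]
    for s in steps:
        X.append(X[-1] + s)
    gamma = sum(1 for j in range(1, 2 * n + 1) if X[j - 1] > 0 or X[j] > 0)
    counts[gamma] = counts.get(gamma, 0) + 1

assert sum(counts.values()) == 4 ** n
u = lambda two_k: comb(two_k, two_k // 2) / 4 ** (two_k // 2)
for k in range(n + 1):
    prob = counts.get(2 * k, 0) / 4 ** n
    assert abs(prob - u(2 * k) * u(2 * n - 2 * k)) < 1e-12
```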
We will not prove Theorem 6.4. Rather we will take for granted that the
limiting process W exists and we shall study its properties. It is instructive
to see that the spatial-temporal rescaling in (6.7) is exactly the same scaling
considered in the previous section.
EW_t^N W_s^N = ∑_{n,m=0}^N ∑_{k∈I_n, l∈I_m} E(α_k^(n) α_l^(m)) ∫_0^t H_k^(n)(τ) dτ ∫_0^s H_l^(m)(τ) dτ
= ∑_{n=0}^N ∑_{k∈I_n} ∫_0^t H_k^(n)(τ) dτ ∫_0^s H_k^(n)(τ) dτ
= ∑_{n=0}^N ∑_{k∈I_n} ∫_0^1 H_k^(n)(τ) χ_{[0,t]}(τ) dτ ∫_0^1 H_k^(n)(τ) χ_{[0,s]}(τ) dτ
(6.14) → ∫_0^1 χ_{[0,t]}(τ) χ_{[0,s]}(τ) dτ = t ∧ s,
0
where χ[0,t] (·) is the indicator function of the set [0, t]. In the last step, we
(n)
used Parserval’s identity since {Hk } is an orthonormal basis.
Proof. First we show that almost surely, WtN converges uniformly to some
continuous function. For this purpose, we note that for the Gaussian random
variable ξ ∼ N(0, 1),
P(|ξ| > x) = √(2/π) ∫_x^∞ e^{−y²/2} dy ≤ √(2/π) ∫_x^∞ (y/x) e^{−y²/2} dy = √(2/π) e^{−x²/2}/x, x > 0.
Define a_n = max_{k∈I_n} |α_k^(n)|. Then
P(a_n > n) = P(⋃_{k∈I_n} {|α_k^(n)| > n}) ≤ √(2/π) 2^{n−1} e^{−n²/2}/n, n ≥ 1.
Since ∑_{n=1}^∞ √(2/π) 2^{n−1} e^{−n²/2}/n < ∞, using the Borel-Cantelli lemma, we conclude that there exists a set Ω̃ with P(Ω̃) = 1 such that for any ω ∈ Ω̃ there
is an N(ω) such that a_m(ω) ≤ m for any m ≥ N(ω). For any such ω, we
have
∑_{m=N(ω)}^∞ | ∑_{k∈I_m} α_k^(m) ∫_0^t H_k^(m)(s) ds | ≤ ∑_{m=N(ω)}^∞ m ∑_{k∈I_m} ∫_0^t H_k^(m)(s) ds
≤ ∑_{m=N(ω)}^∞ m 2^{−(m+1)/2} < ∞,
Then
E exp(−(1/2) ∫_0^1 W_t² dt) = E exp(−(1/2) ∑_k λ_k α_k²) = ∏_k E exp(−(1/2) λ_k α_k²).
where
M^{−2} = ∏_{k=1}^∞ (1 + 4/((2k − 1)²π²)).
Using the identity
cosh(x) = ∏_{k=1}^∞ (1 + 4x²/((2k − 1)²π²)),
we get
M = (cosh(1))^{−1/2} = √(2e/(1 + e²)).
In other words,
p_n(w_1, w_2, . . . , w_n) = (1/Z_n) exp(−I_n(w)),
where
I_n(w) = (1/2) ∑_{j=1}^n ((w_j − w_{j−1})/(t_j − t_{j−1}))² (t_j − t_{j−1}), t_0 := 0, w_0 := 0,
Z_n = (2π)^{n/2} ( t_1(t_2 − t_1) · · · (t_n − t_{n−1}) )^{1/2}.
From this we see that the Wiener process is a Markov process with stationary
transition kernel p(x, t|y, s) given by
P(W_t ∈ B|W_s = y) = ∫_B (2π(t − s))^{−1/2} e^{−(x−y)²/(2(t−s))} dx = ∫_B p(x, t|y, s) dx
Proposition 6.8. Consider the Wiener process. For any t and subdivision
Δ of [0, t], we have
(6.18) E(Q_t^Δ − t)² = 2 ∑_k (t_{k+1} − t_k)².
Consequently, we have
Q_t^Δ → t in L_ω² as |Δ| → 0.
Q_{[p,q]}^{Δ_n} → q − p on Ω_0.
Therefore, we have
q − p ← ∑_k (W_{t_{k+1}} − W_{t_k})² ≤ sup_k |W_{t_{k+1}} − W_{t_k}| · V(W(ω); [p, q]), ω ∈ Ω_0.
From the uniform continuity of W on [p, q], sup_k |W_{t_{k+1}} − W_{t_k}| → 0. Hence,
as a function of t, the total variation of W_t(ω) is infinite, for ω ∈ Ω_0.
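The convergence Q_t^Δ → t is easy to observe in simulation; for a uniform subdivision of [0, t] into n intervals, (6.18) gives E(Q_t^Δ − t)² = 2t²/n (sample sizes below are our choice):

```python
import random, math

random.seed(3)
t, n, reps = 1.0, 400, 1000
dt = t / n
mse = 0.0
for _ in range(reps):
    # Q = sum of squared independent increments, each N(0, dt)
    Q = sum(random.gauss(0.0, math.sqrt(dt)) ** 2 for _ in range(n))
    mse += (Q - t) ** 2
mse /= reps
# (6.18) predicts E(Q - t)^2 = 2 t^2 / n = 0.005 here.
assert abs(mse - 2 * t * t / n) < 2e-3
```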
Theorem 6.10 (Smoothness of the Wiener path). Let Ω_α be the set of
functions that are Hölder continuous with exponent α (0 < α < 1):
Ω_α = { f ∈ C[0, 1] : sup_{0≤s,t≤1, s≠t} |f(t) − f(s)|/|t − s|^α < ∞ }.
The proof of Theorem 6.10 relies on the modification concept and the
following Kolmogorov continuity theorem, which can be found in [RY05].
Definition 6.11 (Modification). Two processes X and X̃ defined on the
same probability space Ω are said to be modifications of each other if for
each t,
Xt = X̃t a.s.
They are called indistinguishable if for almost all ω ∈ Ω
Xt (ω) = X̃t (ω) for every t.
for every α ∈ [0, ε/γ). In particular, the paths of X̃ are Hölder continuous
with exponent α.
which is a contradiction.
The critical case of α = 1/2 is covered by Lévy’s modulus of continuity
theorem. The reader may refer to [RY05] for details.
[Figure: the reflection principle for the random walk; paths hitting the level m₁ are reflected into paths ending at 2m₁ − m.]
$$W_r\big|_{t=0} = \delta(x), \qquad \frac{\partial W_r}{\partial x}\Big|_{x=x_1} = 0.$$
Case 2: Absorbing wall at m = m1 .
The absorbing wall at m = m1 means that once the particle hits the
state m1 at time n, it will be stuck there forever. We still assume m1 > 0
and obtain the probability $W_a(m, N; m_1)$ that the particle arrives at a site m < m₁ after N steps by the reflection principle
$$W_a(m, N; m_1) = W(m, N) - W(2m_1 - m, N).$$
With the same rescaling and asymptotics, we have
$$W_a(m, N; m_1) \approx \sqrt{\frac{2}{\pi N}}\Big[\exp\Big(-\frac{m^2}{2N}\Big) - \exp\Big(-\frac{(2m_1 - m)^2}{2N}\Big)\Big],$$
$$W_a\big|_{t=0} = \delta(x), \qquad W_a\big|_{x=x_1} = 0.$$
n! dx
in L2ω L2t , where {ξk } are i.i.d. Gaussian random variables with distribution
N (0, 1) and {mk (t)} is any orthonormal basis in L2 [0, T ]. This establishes
another correspondence between the Wiener paths ω ∈ Ω and an infinite
sequence of i.i.d. normal random variables ξ = (ξ1 , ξ2 , . . .).
Theorem 6.14 (Wiener chaos expansion). Let W be a standard Wiener
process on (Ω, F , P). Let F be a Wiener functional in L2 (Ω). Then F can
be represented as
$$(6.21)\qquad F[W] = \sum_{\alpha\in\mathcal{I}} f_\alpha H_\alpha(\xi).$$
This is an infinite-dimensional analog of the Fourier expansion with weight function $(\sqrt{2\pi})^{-1}e^{-x^2/2}$ in each dimension. Its proof can be found in [CM47].
To help appreciate this, let us consider the one-dimensional case F (ξ)
where ξ ∼ N (0, 1). Then the Wiener chaos expansion reduces to
$$(6.22)\qquad F(\xi) = \sum_{k=0}^{\infty} f_k H_k(\xi) \quad\text{in } L^2_\omega,$$
where $f_k = E\big(F(\xi)H_k(\xi)\big)$. This is simply the probabilistic form of the Fourier-Hermite expansion of F with the weight function $w(x) = (\sqrt{2\pi})^{-1}e^{-x^2/2}$. For the finite-dimensional case ξ = (ξ₁, ξ₂, . . . , ξ_K), we can
define
$$\mathcal{I}_K = \big\{\alpha \mid \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K),\ \alpha_k\in\{0,1,2,\ldots\},\ k = 1,2,\ldots,K\big\}$$
and the finite tensor product of Hermite polynomials
$$H_\alpha(\xi) = \prod_{k=1}^{K} H_{\alpha_k}(\xi_k).$$
Then we have
$$(6.23)\qquad F(\xi) = \sum_{\alpha\in\mathcal{I}_K} f_\alpha H_\alpha(\xi).$$
The Wiener chaos expansion can also be extended to the so-called gen-
eralized polynomial chaos (gPC) expansion when the random variables in-
volved are non-Gaussian [XK02]. It is commonly used in uncertainty quan-
tification (UQ) problems [GHO17].
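As a concrete illustration of the one-dimensional expansion (6.22), the sketch below (NumPy assumed; the test function F and all names are illustrative) computes the coefficients $f_k = E[F(\xi)H_k(\xi)]$ by Gauss-Hermite quadrature, using normalized probabilists' Hermite polynomials $H_k = \mathrm{He}_k/\sqrt{k!}$, and checks both the reconstruction and the Parseval identity $\sum_k f_k^2 = E F^2(\xi)$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial, sqrt, pi

def chaos_coeffs(F, K, nquad=60):
    """f_k = E[F(xi) H_k(xi)], xi ~ N(0,1), H_k normalized probabilists' Hermite."""
    x, w = hermegauss(nquad)       # Gauss quadrature for the weight exp(-x^2/2)
    w = w / sqrt(2.0 * pi)         # rescale so the weights integrate an N(0,1) expectation
    fk = np.empty(K + 1)
    for k in range(K + 1):
        ck = np.zeros(k + 1)
        ck[k] = 1.0                                # coefficient vector selecting He_k
        Hk = hermeval(x, ck) / sqrt(factorial(k))  # normalized H_k = He_k / sqrt(k!)
        fk[k] = np.sum(w * F(x) * Hk)
    return fk

def chaos_eval(fk, xi):
    """Evaluate the truncated expansion sum_k f_k H_k(xi)."""
    out = 0.0
    for k, f in enumerate(fk):
        ck = np.zeros(k + 1)
        ck[k] = 1.0
        out += f * hermeval(xi, ck) / sqrt(factorial(k))
    return out

F = lambda x: np.sin(x)      # an arbitrary smooth Wiener functional of one variable
fk = chaos_coeffs(F, K=15)
```

For smooth F the coefficients decay like $1/\sqrt{k!}$, so the truncation at K = 15 already reconstructs F to high accuracy.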
Exercises
6.1. Let $\{\xi_n\}_{n=1}^{\infty}$ be a sequence of i.i.d. random variables taking values +1 with probability 2/3 and −1 with probability 1/3. Consider the (asymmetric) random walk on Z defined by
$$S_n = \sum_{j=1}^{n} \xi_j.$$
$$U = \frac{1}{2\pi}\int_{\mathbb{R}}\int_{\mathbb{R}} \psi(\omega)\,e^{-i\omega t} f(t)\,dt\,d\omega.$$
Compute the mean detecting signal EU and compare this with the
case without random perturbation.
6.13. Prove that the Borel σ-algebra on $(C[0,1], \|\cdot\|_\infty)$ is the same as the product σ-algebra on C[0,1], i.e., the smallest σ-algebra generated by the cylinder sets
$$C = \{\omega : (\omega(t_1), \omega(t_2), \ldots, \omega(t_k)) \in B\}, \qquad B\in\mathcal{B}(\mathbb{R}^k),\ k\in\mathbb{N}.$$
Notes
Theoretical study of Brownian motion began with Einstein’s seminal pa-
per in 1905 [Ein05]. The mathematical framework was first established
by Wiener [Wie23]. More material on Brownian motion can be found in
[KS91, MP10].
Chapter 7
Stochastic Differential
Equations
We now turn to the most important object in stochastic analysis: the sto-
chastic differential equations (SDEs). Outside of mathematics, SDEs are
often written in the form
$$(7.1)\qquad \frac{dX_t}{dt} = b(X_t, t) + \sigma(X_t, t)\,\dot W_t.$$
The first term on the right-hand side is called the drift term. The second
term is called the diffusion term. Ẇt is formally the derivative of the Wiener
process, known as the white noise. One can think of it as a Gaussian process
with mean and covariance functions:
$$m(t) = E(\dot W_t) = \frac{d}{dt}E(W_t) = 0,$$
$$K(s,t) = E(\dot W_s\dot W_t) = \frac{\partial^2}{\partial s\,\partial t}E(W_s W_t) = \frac{\partial^2}{\partial s\,\partial t}(s\wedge t) = \delta(t-s).$$
The name white noise comes from its power spectral density S(ω), defined as the Fourier transform of its autocorrelation function $R(t) = E(\dot W_0\dot W_t) = \delta(t)$: $S(\omega) = \widehat{\delta}(\omega) = 1$. We call it "white" from the analogy with the spectrum of white light. Noises whose spectrum is not a constant are called colored noises.
From the regularity theory of the Wiener paths discussed in Section 6.5,
Ẇt is not an ordinary function since Wt is not differentiable. It has to be
understood in the sense of distribution [GS66]. Consequently, defining the
product in the last term in (7.1) can be a tricky task.
To make this precise, the only obvious difficulty is in the last term, which is
a stochastic integral. This reduces the problem of the SDEs to that of the
stochastic integrals.
It turns out that one cannot use the classical theory of the Lebesgue-
Stieltjes integral directly to define stochastic integrals since each Brownian
path has unbounded variations on any interval [t1 , t2 ]. Nevertheless, there
are several natural ways of defining stochastic integrals. If we think of
approximating the integral by quadrature, then depending on how we pick
the quadrature points, we get different limits. This may seem worrisome at
first sight. If the stochastic integrals have different meanings, which meaning
are we supposed to use in practice? Luckily so far this has not been an issue.
It is often quite clear from the context of the problem which version of the
stochastic integral one should use. In addition, it is also easy to convert the
result of one version to that for another version.
By far the most popular way of defining stochastic integrals is that of
Itô.
where Δ is a subdivision of [0, t], Xj = Xt∗j , and t∗j is chosen arbitrarily from
the interval [tj , tj+1 ].
For the ordinary Riemann-Stieltjes integral $\int_a^b f(t)\,dg(t)$, f and g are assumed to be continuous and g is assumed to have bounded variation. In that case the sum
$$(7.4)\qquad \sum_j f(t_j^*)\big(g(t_{j+1}) - g(t_j)\big)$$
where $e_j(\omega)$ is $\mathcal{F}_{t_j}$-measurable and $\chi_{[t_j,t_{j+1})}(t)$ is the indicator function of $[t_j, t_{j+1})$. For technical reasons, we always assume that $\{\mathcal{F}_t\}_{t\ge0}$ is the augmented filtration of $\{\mathcal{F}_t^W\}_{t\ge0}$; thus W is $\mathcal{F}_t$-adapted. Obviously in this case
the Itô integral has to be defined as
$$(7.6)\qquad \int_0^T f(\omega, t)\,dW_t = \sum_j e_j(\omega)\big(W_{t_{j+1}} - W_{t_j}\big).$$
Lemma 7.1. Assume that 0 ≤ S ≤ T . The stochastic integral for the simple
functions satisfies
$$(7.7)\qquad E\int_S^T f(\omega, t)\,dW_t = 0,$$
$$(7.8)\qquad \text{(Itô isometry)}\qquad E\Big(\int_S^T f(\omega, t)\,dW_t\Big)^2 = E\int_S^T f^2(\omega, t)\,dt.$$
where we have used the independence between δWk and ej ek δWj for j <
k.
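Both statements of Lemma 7.1 can be checked by direct Monte Carlo simulation for a simple adapted integrand. The sketch below (NumPy assumed; the choice $e_j = W_{t_j}$ is an arbitrary illustration of an $\mathcal{F}_{t_j}$-measurable coefficient) compares $E\big(\sum_j e_j\,\delta W_j\big)^2$ with $E\sum_j e_j^2\,\delta t_j$.

```python
import numpy as np

rng = np.random.default_rng(1)

def simple_ito(nsteps=64, T=1.0, npaths=50_000):
    """Monte Carlo samples of I = sum_j e_j (W_{t_{j+1}} - W_{t_j}) with e_j = W_{t_j}."""
    dt = T / nsteps
    dW = rng.normal(0.0, np.sqrt(dt), size=(npaths, nsteps))
    W_left = np.cumsum(dW, axis=1) - dW          # W at the left endpoint of each step
    I = np.sum(W_left * dW, axis=1)              # the stochastic integral, per path
    f2 = np.sum(W_left**2 * dt, axis=1)          # int f^2 dt, per path
    return I, f2

I, f2 = simple_ito()
mean_I = I.mean()          # (7.7): should vanish up to sampling error
lhs = (I**2).mean()        # left side of the Ito isometry (7.8)
rhs = f2.mean()            # right side of the Ito isometry (7.8)
```

Both sides come out close to $E\int_0^1 W_t^2\,dt = 1/2$, and the left-endpoint (adapted) choice of $e_j$ is what makes the mean vanish.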
From (7.9), the sequence {φn } is a Cauchy sequence in L2ω L2t . This implies
T
that { S φn dWt } is also a Cauchy sequence in L2ω . Therefore it has a unique
limit. We can also show that this limit is independent of the choice of the
approximating sequence {φn }. This is left as an exercise.
$$\Big|E\int_S^T f(\omega, t)\,dW_t\Big| = \Big|E\int_S^T f(\omega, t)\,dW_t - E\int_S^T \phi_n(\omega, t)\,dW_t\Big| \le \Big[E\Big(\int_S^T f(\omega, t)\,dW_t - \int_S^T \phi_n(\omega, t)\,dW_t\Big)^2\Big]^{\frac12} \to 0.$$
$X_n \to X$ and thus $\|X_n\|_2 \to \|X\|_2$, where $\|\cdot\|_2$ is the corresponding norm in $\mathcal{H}$.
Using this, we have
$$E\Big(\int_S^T \phi_n(\omega, t)\,dW_t\Big)^2 \to E\Big(\int_S^T f(\omega, t)\,dW_t\Big)^2$$
and
$$E\int_S^T \phi_n^2(\omega, t)\,dt \to E\int_S^T f^2(\omega, t)\,dt.$$
From the Itô isometry for simple functions, we obtain (7.13) immediately.
Proof. Take the dyadic subdivision with mesh size 2−n on [0, T ]. From the
definition of Itô integral
$$\begin{aligned}
\int_0^t W_s\,dW_s &\approx \sum_j W_{t_j}(W_{t_{j+1}} - W_{t_j}) = \sum_j \frac{2W_{t_j}W_{t_{j+1}} - 2W_{t_j}^2}{2}\\
&= \sum_j \frac{W_{t_{j+1}}^2 - W_{t_j}^2}{2} - \sum_j \frac{W_{t_{j+1}}^2 - 2W_{t_{j+1}}W_{t_j} + W_{t_j}^2}{2}\\
&= \frac{W_t^2}{2} - \frac12\sum_j (W_{t_{j+1}} - W_{t_j})^2 \to \frac{W_t^2}{2} - \frac t2,
\end{aligned}$$
in probability as the size of the subdivision goes to zero. Here t∗j ∈ [tj , tj+1 ]
but is otherwise arbitrary.
$$\Big|\sum_j \big(f(t_j^*) - f(t_j)\big)\,\delta W_{t_j}^2\Big| \le \sup_j |f(t_j^*) - f(t_j)|\cdot\sum_j \delta W_{t_j}^2.$$
The first term on the right-hand side goes to zero almost surely because
of the uniform continuity of f on [0, T ]. The second term converges to the
quadratic variation of Wt in probability.
□
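The computation above can be observed on a single sampled path: the left-endpoint sum converges to $W_t^2/2 - t/2$, while the midpoint sum (see Exercise 7.3) converges to $W_t^2/2$. A minimal sketch (NumPy assumed; the path is simulated on a half-grid so that midpoint values are available):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 2**18                                            # integration steps on [0, 1]
dw = rng.normal(0.0, np.sqrt(0.5 / n), size=2 * n)   # increments on the half-grid
W = np.concatenate([[0.0], np.cumsum(dw)])           # W on a grid of mesh 1/(2n)

W_left, W_mid, W_right = W[0:-1:2], W[1::2], W[2::2]
dW = W_right - W_left                                # increments over the full steps

ito = np.sum(W_left * dW)     # left endpoints  -> W_1^2/2 - 1/2
mid = np.sum(W_mid * dW)      # midpoints       -> W_1^2/2
W1 = W[-1]
```

The gap between the two sums is exactly half the quadratic variation of the path, which is why the choice of quadrature point matters here but not for Riemann-Stieltjes integrals.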
Remark 7.8. Equation (7.17) can be derived formally using Taylor expan-
sion and the calculation rules:
(7.18) dt2 = 0, dtdWt = dWt dt = 0, (dWt )2 = dt.
First, we have
$$(7.19)\qquad dY_t = f'(X_t)\,dX_t + \frac12 f''(X_t)(dX_t)^2$$
by Taylor expanding Yt = f (Xt ) to the second order. Using (7.18), we
obtain
(dXt )2 = (bdt + σdWt )2 = b2 dt2 + 2bσdtdWt + σ 2 (dWt )2 = σ 2 dt.
Substituting this into (7.19) yields (7.17).
are bounded and continuous. If b and σ are simple functions, we have the second-order term $\frac12\sum_j f''(X_{t_j})\,\delta X_{t_j}^2$. Note that
$$\Big|\sum_j f''(X_{t_j})\,b^2(t_j)\,\delta t_j^2\Big| \le K\sum_j \delta t_j^2 \le KT\,\sup_j \delta t_j \to 0.$$
From Proposition 7.6, we get
The readers are referred to [IW81, KS91] for the full proof. The above
result can be generalized to the multidimensional case.
Theorem 7.9 (Multidimensional Itô formula). Let X t be an Itô process
defined by dX t = b(ω, t)dt + σ(ω, t)dW t , where X t ∈ Rn , σ ∈ Rn×m ,
W ∈ Rm . Define Yt = f (X t ), where f is a twice differentiable function.
Then
$$(7.20)\qquad dY_t = \Big(b\cdot\nabla f + \frac12\sigma\sigma^T : \nabla^2 f\Big)dt + \nabla f\cdot\sigma\cdot dW_t.$$
Remark 7.10. Equation (7.20) can be derived formally using the calcula-
tion rules
$$(7.21)\qquad dt^2 = 0, \qquad dt\,dW_t^i = dW_t^i\,dt = dW_t^i\,dW_t^j = 0\ (i\neq j), \qquad (dW_t^i)^2 = dt.$$
Using Taylor expansion, we have
$$(7.22)\qquad dY_t = \nabla f(X_t)\cdot dX_t + \frac12 (dX_t)^T\cdot\nabla^2 f(X_t)\cdot(dX_t).$$
With (7.21), we get
$$(dX_t)^T\cdot\nabla^2 f(X_t)\cdot(dX_t) = \sum_{l,k,i,j} dW_t^l\,\sigma_{il}\,\partial_{ij}^2 f\,\sigma_{jk}\,dW_t^k$$
$$(7.23)\qquad = \sum_{k,i,j}\sigma_{ik}\sigma_{jk}\,\partial_{ij}^2 f\,dt = \sigma\sigma^T : \nabla^2 f\,dt,$$
where $A : B = \sum_{ij} a_{ij}b_{ji}$ is the contraction for second-order tensors. Equation (7.20) is obtained by substituting (7.23) into (7.22).
Example 7.11. An example of integration by parts is
$$(7.24)\qquad \int_0^t s\,dW_s = tW_t - \int_0^t W_s\,ds.$$
Finally let us mention a very useful inequality about the Itô integral. Define the martingale (see Section D of the appendix for the definition) [KS91]
$$M_t = \int_0^t \sigma(\omega, s)\,dW_s, \qquad t\in[0,T].$$
Then Doob's maximal inequality combined with the Itô isometry gives
$$E\sup_{t\in[0,T]}|M_t|^2 \le 4\,E\int_0^T \sigma^2(\omega, s)\,ds.$$
Thus
$$\begin{aligned}
P\Big(\sup_{t\in[0,T]}|X_t^{(k+1)} - X_t^{(k)}| > 2^{-k}\Big) &\le P\Big(\int_0^T |b(X_t^{(k)}, t) - b(X_t^{(k-1)}, t)|\,dt > 2^{-k-1}\Big)\\
&\quad + P\Big(\sup_{t\in[0,T]}\Big|\int_0^t \big(\sigma(X_s^{(k)}, s) - \sigma(X_s^{(k-1)}, s)\big)\,dW_s\Big| > 2^{-k-1}\Big) =: P_1 + P_2.
\end{aligned}$$
$$X_t^{(n)} = X_t^{(0)} + \sum_{k=0}^{n-1}\big(X_t^{(k+1)} - X_t^{(k)}\big)$$
in the almost sure sense. Denote its limit by Xt (ω). Then Xt is t-continuous
and $\mathcal{F}_t$-adapted since $X_t^{(k)}$ is t-continuous and $\mathcal{F}_t$-adapted. Using Fatou's
lemma (see Theorem C.12) and the bound (7.32), we get
E|Xt |2 ≤ L̃(1 + Eξ 2 ) exp(L̃t).
Now let us show that Xt satisfies the SDE. It is enough to take the limit
k → ∞ on both sides of (7.30). For the integral term with respect to t, we
have
$$\Big|\int_0^t \big(b(X_s^{(k)}, s) - b(X_s, s)\big)\,ds\Big| \le \int_0^t |b(X_s^{(k)}, s) - b(X_s, s)|\,ds \le K\int_0^T |X_s^{(k)} - X_s|\,ds \to 0 \quad\text{as } k\to\infty \text{ a.s.}$$
Thus
$$(7.36)\qquad \int_0^t b(X_s^{(k)}, s)\,ds \to \int_0^t b(X_s, s)\,ds \quad\text{a.s. for } t\in[0,T].$$
For the stochastic integral, we note that $X_t^{(k)}$ is a Cauchy sequence in $L^2_\omega$ for any fixed t. Based on the fact that $X_t^{(k)}\to X_t$ a.s., we obtain $X_t^{(k)}\to X_t$ in $L^2_\omega$. This gives
$$(7.37)\qquad E\Big(\int_0^t \big(\sigma(X_s^{(k)}, s) - \sigma(X_s, s)\big)\,dW_s\Big)^2 = \int_0^t E\big(\sigma(X_s^{(k)}, s) - \sigma(X_s, s)\big)^2\,ds \le K^2\int_0^t E|X_s^{(k)} - X_s|^2\,ds \to 0 \quad\text{as } k\to\infty.$$
Combining (7.36), (7.37), and the fact that $X_t^{(k)}\to X_t$ almost surely,
we conclude that Xt is the solution of the SDE (7.27).
Simple SDEs.
Example 7.15 (Ornstein-Uhlenbeck process). The Ornstein-Uhlenbeck
(OU) process is the solution of the following linear SDE with the additive
noise:
(7.38) dXt = −γXt dt + σdWt , Xt |t=0 = X0 .
Here we assume γ, σ > 0.
If we use this to model the dynamics of a particle, then the drift term
represents a restoring force that prevents the particle from getting too far
away from the origin. We will see that the distribution of the particle po-
sition has a limit as time goes to infinity. This is a very different behavior
from that of the Brownian motion.
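This limiting behavior is easy to see by simulation: under the Euler-Maruyama discretization of (7.38) (introduced later in this chapter), the empirical distribution at a large time approaches $N(0, \sigma^2/(2\gamma))$ regardless of the initial condition. A sketch with arbitrary illustrative parameters (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)

def ou_paths(gamma=1.0, sigma=0.5, x0=2.0, T=10.0, nsteps=1000, npaths=20_000):
    """Euler-Maruyama for dX = -gamma X dt + sigma dW, all paths in parallel."""
    dt = T / nsteps
    X = np.full(npaths, x0)
    for _ in range(nsteps):
        X = X - gamma * X * dt + sigma * np.sqrt(dt) * rng.normal(size=npaths)
    return X

X = ou_paths()
limit_var = 0.5**2 / (2 * 1.0)   # sigma^2 / (2 gamma) = 0.125
```

By time T = 10 the memory of $X_0 = 2$ has decayed like $e^{-\gamma T}$, and the sample mean and variance match the limiting Gaussian; Brownian motion, by contrast, has variance growing linearly in t.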
Solution. Dividing both sides by Nt , we have dNt /Nt = rdt + αdWt . Ap-
plying Itô’s formula to log Nt , we get
$$\begin{aligned}
d(\log N_t) &= \frac{1}{N_t}\,dN_t - \frac{1}{2N_t^2}(dN_t)^2\\
&= \frac{1}{N_t}\,dN_t - \frac{1}{2N_t^2}\,\alpha^2 N_t^2\,dt\\
&= \frac{1}{N_t}\,dN_t - \frac{\alpha^2}{2}\,dt.
\end{aligned}$$
Therefore, we have
$$d(\log N_t) = \Big(r - \frac{\alpha^2}{2}\Big)dt + \alpha\,dW_t.$$
Integrating from 0 to t, we get
$$N_t = N_0\exp\Big\{\Big(r - \frac{\alpha^2}{2}\Big)t + \alpha W_t\Big\}.$$
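Because the solution is explicit, this equation is a standard test problem for numerical schemes: driving a discretization with the same increments that build $W_t$ lets one measure the pathwise (strong) error directly. A sketch with illustrative parameters (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(4)

def em_vs_exact(n, r=0.05, alpha=0.2, T=1.0, npaths=5000):
    """Strong error of Euler-Maruyama for dN = r N dt + alpha N dW with N_0 = 1."""
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), size=(npaths, n))
    X = np.ones(npaths)
    for j in range(n):
        X = X + r * X * dt + alpha * X * dW[:, j]
    exact = np.exp((r - alpha**2 / 2) * T + alpha * dW.sum(axis=1))  # closed form above
    return np.mean(np.abs(X - exact))

err_coarse = em_vs_exact(n=64)
err_fine = em_vs_exact(n=256)    # 4x smaller step: error roughly halves
```

The error shrinking by about a factor of 2 when the step shrinks by 4 reflects the strong order 1/2 of the Euler-Maruyama scheme for multiplicative noise, discussed later in this chapter.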
Example 7.17 (Langevin equation). The simplest model for an inertial
particle in a force field subject to friction and noise is given by
$$(7.41)\qquad \begin{aligned}
dX_t &= V_t\,dt,\\
m\,dV_t &= -\big(\gamma V_t + \nabla U(X_t)\big)\,dt + \sigma\,dW_t,
\end{aligned}$$
where γ is the frictional coefficient, U(X) is the potential for the force field, $W_t$ is the standard Wiener process, and σ is the strength of the noise force. We will show $\sigma = \sqrt{2k_B T\gamma}$ in Chapter 11 to ensure the correct physics.
Diffusion Process. Diffusion processes are Markov processes with continuous paths. From what we have just established, Itô processes defined as
the solutions of stochastic differential equations are diffusion processes if the
coefficients satisfy the conditions presented above. It is interesting to note
that the converse is also true, namely that diffusion processes are nothing
but solutions of stochastic differential equations.
We will not prove this last statement. Instead, we will see formally
how a diffusion process can give rise to a stochastic differential equation.
Let p(x, t|y, s) (t ≥ s) be the transition probability density of the diffusion
process, and let
$$(7.42)\qquad \lim_{t\to s}\frac{1}{t-s}\int_{|x-y|<\epsilon}(x-y)\,p(x,t|y,s)\,dx = b(y,s) + O(\epsilon),$$
$$(7.43)\qquad \lim_{t\to s}\frac{1}{t-s}\int_{|x-y|<\epsilon}(x-y)(x-y)^T\,p(x,t|y,s)\,dx = A(y,s) + O(\epsilon).$$
Here b(y, s) is called the drift of the diffusion process and A(y, s) is called
the diffusion matrix at time s and position y. The conditions (7.42) and
(7.43) can also be represented as
$$(7.44)\qquad \lim_{t\to s}\frac{1}{t-s}\,E^{y,s}(X_t - y) = b(y,s),$$
$$(7.45)\qquad \lim_{t\to s}\frac{1}{t-s}\,E^{y,s}(X_t - y)(X_t - y)^T = A(y,s).$$
We use the special notation ◦ for the Stratonovich integral. One can show
that for a large class of the integrand, this limit is well-defined. Based on
this, one can also give a treatment of the stochastic differential equation:
(7.46) dXt = b(Xt , t)dt + σ(Xt , t) ◦ dWt
in the Stratonovich sense. However, Itô isometry is lost, and the integral is
no longer a martingale.
Integrating from $t_n$ to $t_{n+1}$ and taking f(x) = b(x) and f(x) = σ(x), we have
$$\begin{aligned}
X_{t_{n+1}} &= X_{t_n} + \int_{t_n}^{t_{n+1}} b(X_s)\,ds + \int_{t_n}^{t_{n+1}} \sigma(X_s)\,dW_s\\
(7.54)\quad &= X_{t_n} + b(X_{t_n})\,\delta t_n + \sigma(X_{t_n})(W_{t_{n+1}} - W_{t_n})\\
(7.55)\quad &\quad + \int_{t_n}^{t_{n+1}} dW_s\int_{t_n}^{s}(L_2\sigma)(X_\tau)\,dW_\tau\\
(7.56)\quad &\quad + \int_{t_n}^{t_{n+1}} dW_s\int_{t_n}^{s}(L_1\sigma)(X_\tau)\,d\tau + \int_{t_n}^{t_{n+1}} ds\int_{t_n}^{s}(L_2 b)(X_\tau)\,dW_\tau\\
(7.57)\quad &\quad + \int_{t_n}^{t_{n+1}} ds\int_{t_n}^{s}(L_1 b)(X_\tau)\,d\tau,
\end{aligned}$$
where δtn = tn+1 − tn is the step size. The above procedure can be iterated.
One just has to replace Li b(Xτ ), Li σ(Xτ ) by Li b(Xt ), Li σ(Xt ). This gives
us the so-called Itô-Taylor expansion for SDEs. It is easy to see that the terms in the Itô-Taylor expansion have the form
$$I_{\mathbf{i}}(g) = \int_{t_n}^{t_{n+1}} dW_{s_1}^{i_1}\int_{t_n}^{s_1} dW_{s_2}^{i_2}\cdots\int_{t_n}^{s_{k-1}} dW_{s_k}^{i_k}\; g(X_{s_k})$$
for some k ∈ {1, 2, . . .}. Here the index i = (i1 , i2 , . . . , ik ) and ij ∈ {0, 1}
for j = 1, 2, . . . , k. The integrand g is the action of some compositions of
operators L1 and L2 on the function b or σ. We use the convention that
$W_t^0 := t$ and $W_t^1 := W_t$.
By truncating the Itô-Taylor series at different orders, we obtain different schemes. For example, if we only keep the terms in (7.54), we get
(1) Euler-Maruyama scheme.
$$X_{n+1} = X_n + b(X_n)\,\delta t_n + \sigma(X_n)\,\delta W_n,$$
where $\delta W_n \sim N(0, \delta t_n)$. Due to its simplicity, the Euler-Maruyama scheme
is the most commonly used numerical scheme.
From the basic intuition $dW_t\sim\sqrt{dt}$, we have $(7.55)\sim O(\delta t)$, $(7.56)\sim O(\delta t^{3/2})$, and $(7.57)\sim O(\delta t^2)$. By extracting the leading order term (7.55), we obtain
$$\int_{t_n}^{t_{n+1}} dW_s\int_{t_n}^{s}(L_2\sigma)(X_\tau)\,dW_\tau \approx (L_2\sigma)(X_{t_n})\int_{t_n}^{t_{n+1}} dW_s\int_{t_n}^{s} dW_\tau = \frac12(L_2\sigma)(X_{t_n})\big[(\delta W_n)^2 - \delta t_n\big].$$
Substituting this into the Itô-Taylor expansion, we obtain the well-known Milstein scheme:
$$X_{n+1} = X_n + b(X_n)\,\delta t_n + \sigma(X_n)\,\delta W_n + \frac12(\sigma\sigma')(X_n)\big[(\delta W_n)^2 - \delta t_n\big].$$
where $W_t^i$, $W_t^j$ are independent Wiener processes. Terms of this kind are difficult to sample [KP92].
Another problem with the Milstein scheme is that it can be hard to compute $\sigma'(X_t)$ even in the one-dimensional case. To overcome this difficulty, one can borrow ideas from the Runge-Kutta method for solving ODEs.
(3) Runge-Kutta scheme.
$$\begin{aligned}
\hat X_n &= X_n + b(X_n)\,\delta t_n + \sigma(X_n)\sqrt{\delta t_n},\\
X_{n+1} &= X_n + b(X_n)\,\delta t_n + \sigma(X_n)\,\delta W_n\\
(7.60)\quad &\quad + \frac{1}{2\sqrt{\delta t_n}}\big[\sigma(\hat X_n) - \sigma(X_n)\big]\big[(\delta W_n)^2 - \delta t_n\big].
\end{aligned}$$
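Both corrections can be compared against plain Euler-Maruyama on the geometric model solved above, whose exact solution is known; driving all schemes and the exact solution with the same Brownian increments measures the strong error. A sketch (NumPy assumed; parameters are illustrative, and for this model $\sigma\sigma'(x) = \alpha^2 x$):

```python
import numpy as np

rng = np.random.default_rng(5)

def strong_error(scheme, n, r=0.05, alpha=0.2, T=1.0, npaths=4000):
    """Mean endpoint error vs the exact solution of dX = r X dt + alpha X dW, X_0 = 1."""
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), size=(npaths, n))
    X = np.ones(npaths)
    for j in range(n):
        dw = dW[:, j]
        incr = r * X * dt + alpha * X * dw
        if scheme == "milstein":    # + (1/2) sigma sigma' [(dW)^2 - dt]
            incr += 0.5 * alpha**2 * X * (dw**2 - dt)
        elif scheme == "rk":        # derivative-free correction of (7.60)
            Xhat = X + r * X * dt + alpha * X * np.sqrt(dt)
            incr += (alpha * Xhat - alpha * X) * (dw**2 - dt) / (2 * np.sqrt(dt))
        X = X + incr
    exact = np.exp((r - alpha**2 / 2) * T + alpha * dW.sum(axis=1))
    return np.mean(np.abs(X - exact))

errs = {s: strong_error(s, n=128) for s in ("em", "milstein", "rk")}
```

Both corrected schemes beat Euler-Maruyama by a wide margin at the same stepsize, and the Runge-Kutta variant does so without ever evaluating $\sigma'$.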
If we formally take a higher-order Itô-Taylor expansion, say applying
(7.53) to
(L1 b)(Xτ ), (L2 b)(Xτ ), (L1 σ)(Xτ ), (L2 σ)(Xτ )
and dropping higher-order terms, we get the following higher-order scheme.
(4) Higher-order scheme.
$$\begin{aligned}
(7.61)\quad X_{n+1} = X_n &+ b\,\delta t_n + \sigma\,\delta W_n + \frac12\sigma\sigma'\big[(\delta W_n)^2 - \delta t_n\big]\\
&+ \sigma b'\,\delta Z_n + \frac12\Big(bb' + \frac12\sigma^2 b''\Big)\delta t_n^2\\
&+ \Big(b\sigma' + \frac12\sigma^2\sigma''\Big)\big[\delta W_n\,\delta t_n - \delta Z_n\big]\\
&+ \frac12\sigma\big(\sigma\sigma'' + (\sigma')^2\big)\Big[\frac13(\delta W_n)^2 - \delta t_n\Big]\delta W_n,
\end{aligned}$$
where
$$\delta Z_n := \int_{t_n}^{t_{n+1}}\int_{t_n}^{s} dW_\tau\,ds.$$
If
$$\big|Ef(X_T^{\delta t}) - Ef(X_T)\big| \le C_f\,\delta t^{\beta}$$
for any $f\in C_b^\infty(\mathbb{R}^n)$, the space of bounded, infinitely differentiable functions (with bounded derivatives), where $C_f$ is a constant independent of δt but may depend on f, then we say $\{X_t^{\delta t}\}$ weakly converges to $X_t$ with order β.
Proposition 7.19. β ≥ α.
Strong Convergence.
Theorem 7.20 (Convergence order). Define the length of the multi-index
i = (i1 , i2 , . . . , ik ) as
l(i) := k, n(i) := {the number of zeros in i}
and the set of indices
$$S_\alpha = \Big\{\mathbf{i}\ \Big|\ l(\mathbf{i}) + n(\mathbf{i}) \le 2\alpha \text{ or } l(\mathbf{i}) = n(\mathbf{i}) = \alpha + \frac12\Big\} \quad\text{for } \alpha\in\Big\{\frac12, 1, \frac32, \ldots\Big\},$$
$$W_\beta = \{\mathbf{i} \mid l(\mathbf{i}) \le \beta\} \quad\text{for } \beta\in\{1,2,3,\ldots\}.$$
Then with mild smoothness conditions on b, σ, and the function f in the weak approximation, the scheme derived by truncating the Itô-Taylor expansion up to all indices $\mathbf{i}\in S_\alpha$ has strong order α; the scheme derived by truncating the Itô-Taylor expansion up to terms with $\mathbf{i}\in W_\beta$ has weak order β.
Table 7.1. The convergence order of some numerical schemes for the SDE.
Applying Itô's formula, $df(\bar X_t) = f'(\bar X_t)\,d\bar X_t + \frac12 f''(\bar X_t)(d\bar X_t)^2$; i.e.,
$$f(\bar X_t) = f(X_n) + \int_{t_n}^{t}\Big(f'(\bar X_s)\,b(X_n) + \frac12 f''(\bar X_s)\Big)\,ds + \int_{t_n}^{t} f'(\bar X_s)\,dW_s.$$
Since
$$X_t - X_{t_n} = \int_{t_n}^{t} b(X_s)\,ds + (W_t - W_{t_n}),$$
squaring both sides and taking the expectation, we get
$$E|X_t - X_{t_n}|^2 \le 2\,E\Big(\int_{t_n}^{t} b(X_s)\,ds\Big)^2 + 2\delta t.$$
Hence, we have
$$E|X_t - X_{t_n}|^2 \le 2L\,\delta t\int_{t_n}^{t}\big(1 + E|X_s|^2\big)\,ds + 2\delta t \le 2L\,\delta t^2\big(1 + K_1(T)\big) + 2\delta t, \qquad t\in[t_n, t_{n+1}).$$
□
$$(7.65)\qquad Y^{h,N} = \frac1N\sum_{k=1}^{N} f\big(X_n^{(k)}\big), \qquad n = T/h\in\mathbb{N},$$
$$Y_l = \frac{1}{N_l}\sum_{k=1}^{N_l}\big(F_l^{(k)} - F_{l-1}^{(k)}\big), \qquad l = 0, 1, \ldots, L.$$
k=1
The key point of multilevel Monte Carlo is that with the decomposition
(7.68), the term Fl − Fl−1 has smaller variance at higher levels provided
that the realizations of Fl − Fl−1 come from two discrete approximations
with different time stepsizes but the same Brownian paths. This property
suggests that we can use fewer Monte Carlo samples at higher levels, i.e.,
finer grids, but more samples for lower levels, i.e., coarser grids. This cost-
accuracy tradeoff is the origin of the improved efficiency of the multilevel
Monte Carlo method.
Now let us consider the minimization
$$\min_{\{N_l\}}\ \mathrm{Var}(\hat Y_L) = \sum_{l=0}^{L}\frac{V_l}{N_l} \qquad\text{subject to the cost } K = \sum_{l=0}^{L} N_l\,h_l^{-1}.$$
From the strong and weak convergence result of the Euler-Maruyama scheme,
we have
$$|E(F_l) - Y_E| = O(h_l), \qquad E|X_T - X_{l,M^l}|^2 = O(h_l).$$
By assuming the Lipschitz continuity of f , we obtain
$$\mathrm{Var}(F_l - f(X_T)) \le E|f(X_{l,M^l}) - f(X_T)|^2 \le C\,E|X_T - X_{l,M^l}|^2 = O(h_l)$$
and thus
$$V_l = \mathrm{Var}(F_l - F_{l-1}) \le 2\,\mathrm{Var}(F_l - f(X_T)) + 2\,\mathrm{Var}(F_{l-1} - f(X_T)) = O(h_l)$$
since $h_{l-1} = M h_l$ and $M\sim O(1)$. We remark that the terms $F_l(\omega)$ and $F_{l-1}(\omega)$ should be driven by the same Brownian path $W(\omega)$; otherwise $F_l - F_{l-1}$ would not be a small quantity. This point must be respected in the numerical implementation.
For a given tolerance $\epsilon \ll 1$, take
$$(7.72)\qquad N_l = O(\epsilon^{-2} L h_l);$$
according to the optimal choice (7.71), we get the variance estimate
(7.73) Var(ŶL ) = O(ε2 )
from (7.70). Further taking L = log ε−1 / log M , we have
hL = M −L = O(ε).
So the bias error
(7.74) |EFL − YE | = O(hL ) = O(ε).
Combining (7.73) and (7.74), we obtain the overall mean square error
MSE = E(YE − ŶL )2 = O(ε2 )
Exercises
7.1. Prove that the definition given in (7.10) is independent of the choice
of the approximating sequence {φn }.
7.2. Prove Proposition 7.3.
7.3. Prove that with the midpoint approximation
$$\int_0^t W_s\,dW_s \approx \sum_j W_{t_{j+\frac12}}\big(W_{t_{j+1}} - W_{t_j}\big) \to \frac{W_t^2}{2}$$
7.11. Prove that if one takes the rightmost endpoint to define the sto-
chastic integral (backward stochastic integral)
$$\int_0^T f(\omega, t) * dW_t \approx \sum_j f(t_{j+1})\big(W_{t_{j+1}} - W_{t_j}\big),$$
Notes
Since Itô’s groundbreaking work on the stochastic integral, the theory of
SDEs has been developed into a mature subject in stochastic analysis
[Oks98, IW81, KS91, RY05, RW94a, RW94b]. For people interested in
applications, we recommend the references [Gar09, Pav14, VK04].
The numerical solution of SDEs is still a developing field. Among
the topics that we did not cover, we mention strong and weak approxi-
mation of multiple Itô integrals [KP92], implicit methods for overcoming
the stiffness, stability analysis, and explicit methods for stiff SDEs [AC08,
AL08], high-order Runge-Kutta methods [Rö03], extrapolation methods
[TT90], structure-preserving methods and applications in molecular dy-
namics [LM15], etc. Interested readers can refer to [KP92, MT04] for
more details.
Chapter 8
Fokker-Planck
Equation
There are three different but related ways of studying diffusion processes.
The first is to take a pathwise viewpoint and represent the paths by solutions
of SDEs. This was done in the last chapter. The second approach is to study
the dynamics of the probability distribution and expectation values using
partial differential equations. This is the approach that we will take in this
chapter. The third is to study the probability distribution on path space
induced by the diffusion process. This is the approach of path integrals,
which will be taken up in the next chapter.
Let us first consider the situation of ordinary differential equations.
Imagine a point particle in Rd whose dynamics is described by
$$(8.1)\qquad \frac{dx}{dt} = b(x), \qquad x|_{t=0} = x_0.$$
Assuming that the initial distribution density of the particle is p0 (x), at
later times the distribution density of the particle p(x, t) then satisfies the
Liouville equation:
(8.2) ∂t p + ∇x · (bp) = 0.
Alternatively, one can also think of the solutions of the ODE (8.1) as being
the characteristics of this first-order PDE (8.2).
We will see that if (8.1) is replaced by an SDE, then the first-order PDE
above is replaced by a second-order parabolic equation, the so-called Fokker-
Planck equation. Alternatively, one can also think of the solutions of the
SDE as the characteristics of the corresponding Fokker-Planck equation.
where the diffusion matrix A(x, t) = σ(x, t)σ T (x, t) = (aij (x, t)). Taking
the expectation on both sides conditioned on X s = y, we have
$$(8.4)\qquad E^{y,s} f(X_t) - f(y) = E^{y,s}\int_s^t (Lf)(X_\tau, \tau)\,d\tau,$$
where the operator L is defined by
$$(8.5)\qquad (Lf)(x,t) = b(x,t)\cdot\nabla f(x) + \frac12\sum_{i,j} a_{ij}(x,t)\,\partial_{ij}^2 f(x).$$
Here (·, ·) denotes the standard inner product in L2 (Rd ). An easy calculation
gives
$$(8.7)\qquad (L^*f)(x,t) = -\nabla_x\cdot\big(b(x,t)f(x)\big) + \frac12\nabla_x^2 : \big(A(x,t)f(x)\big),$$
where $\nabla_x^2 : (Af) = \sum_{ij}\partial_{ij}^2(a_{ij}f)$.
and one can integrate both sides of the forward Kolmogorov equation against
ρ0 .
Example 8.1 (Brownian motion). The SDE reads
dX t = dW t , X 0 = 0.
Thus the Fokker-Planck equation is
$$(8.8)\qquad \partial_t p = \frac12\Delta p, \qquad p(x,0) = \delta(x).$$
It is well known that
$$p(x,t) = \frac{1}{(2\pi t)^{d/2}}\exp\Big(-\frac{|x|^2}{2t}\Big),$$
which is the probability distribution density of $N(0, tI)$.
Example 8.2 (Brownian dynamics). Consider the over-damped dynamics
of a stochastic particle described by
$$(8.9)\qquad dX_t = -\frac1\gamma\nabla U(X_t)\,dt + \sqrt{\frac{2k_B T}{\gamma}}\,dW_t.$$
The Fokker-Planck equation is
$$(8.10)\qquad \partial_t p - \frac1\gamma\nabla\cdot\big(\nabla U(x)\,p\big) = \frac{k_B T}{\gamma}\Delta p = D\Delta p,$$
where $D = k_B T/\gamma$ is the diffusion coefficient.
and with similar reasoning the probability transfer from region R2 to R1 has
the form
$$P_{2\to1} = \int_{R_1} dx\int_{R_2} dy\; p(x, t+\delta t; y, t).$$
we obtain
$$\begin{aligned}
J_{2\to1} &= \int_{R_1} dx\int_{R_2} dy\,\partial_t p(x,t;y,s=t) - \int_{R_2} dx\int_{R_1} dy\,\partial_t p(x,t;y,s=t)\\
&= \int_{R_2} dx\,\nabla_x\cdot j(x,t;R_1,t) - \int_{R_1} dx\,\nabla_x\cdot j(x,t;R_2,t)\\
&= \int_{S_{12}} dS\; n\cdot\big(j(x,t;R_1,t) + j(x,t;R_2,t)\big),
\end{aligned}$$
where $j(x,t;R_1,t) := \int_{R_1} dy\,j(x,t;y,t)$ and n is the normal pointing from $R_2$ to $R_1$. The last equality is obtained by the divergence theorem and the fact that $j(x,t;R_2,t) = 0$ when $x\in S_1$ and $j(x,t;R_1,t) = 0$ when $x\in S_2$.
From the fact that $x\in R_1\cup R_2$ we have $j(x,t) = \int_{\mathbb{R}^d} dy\,j(x,t;y,t) = j(x,t;R_1,t) + j(x,t;R_2,t)$ and thus
$$J_{2\to1} = \int_{S_{12}} dS\; n\cdot j(x,t).$$
Recall that in a discrete time Markov chain, the probability flux is de-
fined as
$$J_{ij}^n = \mu_{n,i}\,p_{ij} - \mu_{n,j}\,p_{ji}$$
from state i to state j at time n and for a continuous time Markov chain
we see that n · j(x, t) is exactly the continuous space version of Jij (t) along
a specific direction n.
The two most commonly used boundary conditions are as follows. It will
be instructive for the readers to compare them with the boundary conditions
for the Wiener process in Chapter 6.
The reflecting boundary condition. Particles are reflected at the bound-
ary ∂U . Thus there will be no probability flux across ∂U . Therefore we have
(8.20) n · j(x, t) = 0, x ∈ ∂U.
In this case the total probability is conserved since
$$\frac{d}{dt}\int_U p(x,t)\,dx = -\int_U \nabla_x\cdot j(x,t)\,dx = -\int_{\partial U} n\cdot j(x,t)\,dS = 0.$$
When all the variables are of even type, i.e., s = I, the condition (ii) is
trivial and (i) gives exactly the detailed balance condition (8.26). This
general detailed balance condition is useful when we consider the invariant
distribution of the Langevin process (7.41). We have
$$(8.33)\qquad y = \begin{pmatrix} x\\ v\end{pmatrix}, \qquad b = \begin{pmatrix} v\\ -\dfrac{\gamma}{m}v - \dfrac1m\nabla U\end{pmatrix}, \qquad A(y) = \begin{pmatrix} 0 & 0\\ 0 & \dfrac{2k_B T\gamma}{m^2}\,I\end{pmatrix}.$$
$$\lim_{t\to0^+}\|T_t f - f\|_\infty = 0 \quad\text{for any } f\in C_0(\mathbb{R}^d).$$
t→0+
Here Tt is called the Feller semigroup. With this setup, we can use tools
from semigroup theory to study Tt [Yos95]. We thus define the infinitesimal
generator A of Tt by
$$Af(x) = \lim_{t\to0^+}\frac{E^x f(X_t) - f(x)}{t},$$
where f ∈ D(A) := {f ∈ C0 (Rd ) such that the limit exists}. For f ∈
Cc2 (Rd ) ⊂ D(A) we have
$$Af(x) = Lf(x) = b(x)\cdot\nabla f(x) + \frac12(\sigma\sigma^T) : \nabla^2 f(x)$$
from Itô’s formula (8.4).
This means that the solution of the problem (8.37) in the case when q = 0 can be represented as averages over diffusion processes. The Feynman-Kac formula handles the situation when $q\neq0$.
To motivate the formula, note that in the absence of the diffusion term,
the solution of the PDE
$$\frac{dX_t}{dt} = b(X_t), \qquad X_0 = x,$$
$$(8.41)\qquad v(x,t) = E^x\Big[\exp\Big(\int_0^t q(X_s)\,ds\Big)\,f(X_t)\Big].$$
The proof goes as follows. Let $Y_t = f(X_t)$, $Z_t = \exp\big(\int_0^t q(X_s)\,ds\big)$ and define $v(x,t) = E^x(Y_t Z_t)$. Here $v(x,t)$ is differentiable with respect to t and
$$\begin{aligned}
\frac1s\big[E^x v(X_s, t) - v(x,t)\big]
&= \frac1s\Big[E^x E^{X_s}\big(Z_t f(X_t)\big) - E^x\big(Z_t f(X_t)\big)\Big]\\
&= \frac1s\Big[E^x\exp\Big(\int_0^t q(X_{r+s})\,dr\Big) f(X_{t+s}) - E^x Z_t f(X_t)\Big]\\
&= \frac1s\,E^x\Big[\exp\Big(-\int_0^s q(X_r)\,dr\Big) Z_{t+s} f(X_{t+s}) - Z_t f(X_t)\Big]\\
&= \frac1s\,E^x\big[Z_{t+s} f(X_{t+s}) - Z_t f(X_t)\big]\\
&\quad + \frac1s\,E^x\Big[Z_{t+s} f(X_{t+s})\Big(\exp\Big(-\int_0^s q(X_r)\,dr\Big) - 1\Big)\Big]\\
&\to \partial_t v - q(x)v(x,t) \quad\text{as } s\to0.
\end{aligned}$$
The left-hand side is Av(x, t) by definition.
[Figure: deterministic and stochastic characteristics through the point (x, t).]
$$(8.45)\qquad \tau_U := \inf\{t\ge0 : X_t\notin U\}.$$
Proof. The proof of the fact that τU is a stopping time and that Ex τU < ∞
is quite technical and can be found in [KS91, Fri75a].
Standard PDE theory tells us that $u\in C^2(U)\cap C(\bar U)$. Applying Dynkin's formula to $u(X_t)$, we get
$$(8.46)\qquad E^x u(X_{\tau_U}) - u(x) = E^x\int_0^{\tau_U} Au(X_t)\,dt.$$
Thus
$$u(x) = E^x u(X_{\tau_U}) - E^x\int_0^{\tau_U} Au(X_t)\,dt = E^x\Big[f(X_{\tau_U}) + \int_0^{\tau_U} g(X_t)\,dt\Big].$$
From the boundary conditions for the backward equation, we have $R(x,t)|_{x=b} = 0$ and $\partial_x R(x,t)|_{x=a} = 0$, which implies the boundary conditions for τ(x):
$$(8.49)\qquad \partial_x\tau(x)|_{x=a} = 0, \qquad \tau(x)|_{x=b} = 0.$$
The study of the first passage time problem is then turned into a simple
PDE problem.
In analogy with Section 3.10, we define a Hilbert space L2μ with the
invariant distribution μ = Z −1 exp(−U (x)) as the weight function. The
inner product and the norm are defined by
$$(f,g)_\mu = \int_{[0,1]^d} f(x)g(x)\mu(x)\,dx \quad\text{and}\quad \|f\|_\mu^2 = (f,f)_\mu.$$
The operator L is selfadjoint in L2μ . This can be seen as follows. First note
that
$$(8.51)\qquad L^*(g\mu) = \nabla\cdot(\nabla U\mu)\,g + \nabla U\cdot\nabla g\,\mu + \Delta g\,\mu + \Delta\mu\,g + 2\nabla g\cdot\nabla\mu = \mu Lg,$$
$$(8.52)\qquad (Lf, g)_\mu = \int Lf(x)\,g(x)\mu(x)\,dx = \int f(x)\,L^*(g\mu)\,dx = (f, Lg)_\mu.$$
Moreover, $-(Lf, f)_\mu = \|\nabla f\|_\mu^2 \ge 0$.
Hence we conclude that the eigenvalues of L and L∗ are real and nonpositive.
Denote by $\phi_\lambda$ the eigenfunctions of L with eigenvalue λ, with the normalization $(\phi_\lambda(x), \phi_{\lambda'}(x))_\mu = \delta_{\lambda\lambda'}$.
We immediately see that the eigenfunctions of L∗ can be chosen as
(8.53) ψλ (x) = μ(x)φλ (x)
because of (8.51). We also have as a consequence that $(\phi_\lambda, \psi_{\lambda'}) = \delta_{\lambda\lambda'}$.
In the case of the discrete Markov chain with detailed balance, there is a
simple way to symmetrize the generator. Let P be the transition probability
matrix, let μ = (μ1 , μ2 , · · · , μn ) be the invariant distribution, and let D =
diag(μ1 , μ2 , . . . , μn ). Then
(8.54) DP = P T D.
Thus we can define the symmetrization of P by
$$(8.55)\qquad D^{\frac12} P D^{-\frac12} = D^{-\frac12} P^T D^{\frac12}.$$
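Both (8.54) and (8.55) are immediate to verify numerically for any chain satisfying detailed balance; the three-state birth-death chain below is an arbitrary illustrative example (NumPy assumed).

```python
import numpy as np

# A reversible (detailed-balance) transition matrix: a birth-death chain on 3 states.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

# Invariant distribution: the left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()                     # for this chain, mu = (1/4, 1/2, 1/4)

D = np.diag(mu)
S = np.diag(np.sqrt(mu)) @ P @ np.diag(1.0 / np.sqrt(mu))

balance = np.allclose(D @ P, (D @ P).T)    # (8.54): D P = P^T D
symmetric = np.allclose(S, S.T)            # (8.55): D^{1/2} P D^{-1/2} is symmetric
```

Because S is symmetric it has a real spectrum, and P, being similar to S, inherits real eigenvalues; this is the discrete analog of the symmetrization of L above.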
Let $w = u_+ + u_-$. We have
$$\epsilon^2\frac{\partial^2 w}{\partial t^2} = \alpha^2\frac{\partial^2 w}{\partial x^2} - 2\beta\frac{\partial w}{\partial t}, \qquad w|_{t=0} = f_+ + f_-, \qquad \partial_t w|_{t=0} = \frac\alpha\epsilon\,\partial_x(f_+ - f_-).$$
Consider the case when f+ = f− = f . In this case the time derivative of w
vanishes at t = 0; hence we avoid the extra complication coming from the
initial layer. Following the standard approach in asymptotic analysis, we
make the ansatz
$$w = w_0 + \epsilon w_1 + \epsilon^2 w_2 + \cdots.$$
To leading order, this gives
$$(8.61)\qquad \frac{\partial w_0}{\partial t} = \frac{\alpha^2}{2\beta}\frac{\partial^2 w_0}{\partial x^2}, \qquad w_0|_{t=0} = 2f.$$
This means that to leading order, x behaves like Brownian motion with diffusion constant $D = \alpha^2/(2\beta)$. This is not surprising since it is what the central limit theorem tells us for the process
$$\tilde x^\epsilon(t) = \frac1\epsilon\int_0^{t} y^\epsilon(s)\,ds \quad\text{as } \epsilon\to0.$$
0
We turn now to the general case. Suppose that the backward equation
for the stochastic process {X t } has the form
$$(8.62)\qquad \frac{\partial u^\epsilon}{\partial t} = \frac{1}{\epsilon^2}L_1 u^\epsilon + \frac1\epsilon L_2 u^\epsilon + L_3 u^\epsilon, \qquad u^\epsilon(0) = f,$$
where L1 , L2 , and L3 are differential operators defined on some Banach space
B, whose properties will be specified below. We would like to study the
asymptotic behavior of $u^\epsilon$ when $\epsilon\to0$ for 0 ≤ t ≤ T, T < ∞. This discussion
follows the work of Khasminski, Kurtz, Papanicolaou, etc. [Pap77].
As a general framework we assume that the following conditions hold.
(a) L1 is an infinitesimal generator of a stationary Markov process, and
the semigroup exp(L1 t) generated by L1 converges to a projection
operator to the null space of L1 , which we will denote as P:
exp(L1 t) → P, t → ∞.
The reason and meaning for this will become clear later.
(b) Solvability condition: PL2 P = 0.
(c) Consistency condition for the initial value: Pf = f . This allows us
to avoid the initial layer problem.
Note that the following holds:
(8.63) Range(P) = Null(L1 ), Null(P) = Range(L1 ).
But P does not need to be an orthogonal projection since generally $P^*\neq P$
(see Exercise 8.13 for more details).
To see how this abstract framework actually works, we use it for the
simple model introduced at the beginning of this section. We have
$$L_1 = A, \qquad L_2 = \begin{pmatrix} +\alpha & 0\\ 0 & -\alpha\end{pmatrix}\frac{\partial}{\partial x}, \qquad L_3 = 0.$$
Thus the projection operator P is given by
$$P = \lim_{t\to\infty}\exp(L_1 t) = \lim_{t\to\infty}\frac12\begin{pmatrix} 1 + e^{-2\beta t} & 1 - e^{-2\beta t}\\ 1 - e^{-2\beta t} & 1 + e^{-2\beta t}\end{pmatrix} = \begin{pmatrix} \frac12 & \frac12\\ \frac12 & \frac12\end{pmatrix}.$$
We have
$$Lf(y) = b(y)f'(y) + \frac12 f''(y), \qquad L^2 f(y) = b(y)\,(Lf)'(y) + \frac12(Lf)''(y),$$
and for the scheme with coefficient frozen at the initial value x,
$$L^N f(y) = b(x)f'(y) + \frac12 f''(y), \qquad (L^N)^2 f(y) = b(x)\,(L^N f)'(y) + \frac12(L^N f)''(y).$$
Now it is obvious that $Lf(x) = L^N f(x)$.
Since the local truncation error is of second order, we expect the global
error to be of weak order 1. In fact if one can bound the expansions in (8.71)
and (8.72) up to a suitable order, the above formal derivations based on the
local error analysis can be made rigorous.
It will become clear soon that the weak convergence analysis boils down
to some estimates about the solution of the backward equation. Let us first
consider the PDE
$$(8.76)\qquad \partial_t u(x,t) = Lu(x,t) = b(x)\,\partial_x u + \frac12\partial_x^2 u, \qquad u(x,0) = f(x).$$
Let CPm (Rd , R) be the space of functions w ∈ C m (Rd , R) for which all partial
derivatives up to order m have polynomial growth. More precisely, there
exist a constant K > 0 and m, p ∈ N such that
$$|\partial_x^j w(x)| \le K(1 + |x|^{2p}), \qquad \forall\,|j| < m,$$
for any x ∈ Rd , where j is a d-multi-index. Here we have d = 1 and we will
simply denote CPm (Rd , R) as CPm .
The following important lemma can be found in [KP92, Theorem 4.8.6,
p. 153].
Lemma 8.9. Suppose that f ∈ CP2β for some β ∈ {2, 3, . . .}, X t is time-
homogeneous, and b ∈ CP2β with uniformly bounded derivatives. Then ∂u/∂t
is continuous and
u(·, t) ∈ CP2β , t ≤ T,
for any fixed T < ∞.
Hence
|e_n| = |Ef(X_n) − Ef(X_{t_n})| = |Ev(X_n, t_n) − Ev(X_0, 0)|
= | ∫_0^{t_n} E[ ∂_t v(X̄_s, s) + b(X_{n_s}) ∂_x v(X̄_s, s) + ½ ∂_{xx} v(X̄_s, s) − L̃v(X̄_s, s) ] ds |,
where n_s := m if t_m ≤ s < t_{m+1} and {X̄_s} is the continuous extension of {X_n} defined by
(8.78) dX̄_s = b(X_{n_s}) ds + dW_s,   s ∈ [t_m, t_{m+1}).
With this definition, we obtain
|e_n| = | ∫_0^{t_n} E[ b(X_{n_s}) ∂_x v(X̄_s, s) − b(X̄_s) ∂_x v(X̄_s, s) ] ds |
≤ | ∫_0^{t_n} E[ b(X_{n_s}) ∂_x v(X_{n_s}, t_{n_s}) − b(X̄_s) ∂_x v(X̄_s, s) ] ds |
+ | ∫_0^{t_n} E[ b(X_{n_s}) ( ∂_x v(X̄_s, s) − ∂_x v(X_{n_s}, t_{n_s}) ) ] ds |
= Σ_m | ∫_{t_m}^{t_{m+1}} E[ b(X_m) ∂_x v(X_m, t_m) − b(X̄_s) ∂_x v(X̄_s, s) ] ds |
(8.79) + Σ_m | ∫_{t_m}^{t_{m+1}} E[ b(X_m) ( ∂_x v(X̄_s, s) − ∂_x v(X_m, t_m) ) ] ds |.
Using this with g = b∂_x v and g = ∂_x v in (8.79), we have the highest derivatives ∂_{xxx} v, ∂_{xx} b ∈ C_P^{2β} as long as β ≥ 2. Notice that, conditional on X_m, b(X_m) is independent of ∫_{t_m}^t ∂_x g(X̄_s, s) dW_s. Together with the fact that E|X_m|^{2r} and E|X̄_t|^{2r} ≤ C for any r ∈ N (Exercise 8.14), we get
|e_n| ≤ C Σ_m Δt² ≤ C Δt.
u_{N,Δt} = (1/N) Σ_{k=1}^N (X_{M,k})².
[Figure: weak error versus stepsize for Δt ∈ [0.01, 0.1]; the error decreases roughly linearly with Δt, from about 0.025 down to about 0.005.]
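The weak order 1 predicted above is easy to observe in a small experiment. The following sketch uses the Ornstein-Uhlenbeck equation dX = −X dt + dW with X_0 = 0 and the observable f(x) = x² (illustrative choices, not necessarily those behind the figure, but ones for which Ef(X_T) is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(1)

def weak_error(dt, n_paths=400_000, T=1.0):
    # Euler-Maruyama for dX = -X dt + dW, X_0 = 0; returns |E X_T^2 - exact|
    n_steps = int(round(T / dt))
    X = np.zeros(n_paths)
    for _ in range(n_steps):
        X = X - X * dt + rng.normal(0.0, np.sqrt(dt), n_paths)
    exact = 0.5 * (1.0 - np.exp(-2.0 * T))   # E X_T^2 for the OU process
    return abs(np.mean(X**2) - exact)

errors = {dt: weak_error(dt) for dt in (0.1, 0.05, 0.025)}
print(errors)   # the error shrinks roughly in proportion to dt (weak order 1)
```

Halving Δt roughly halves the weak error, up to Monte Carlo noise.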
Exercises
8.1. Prove that X t ∈ Sn−1 if X 0 ∈ Sn−1 for the process defined by
(8.15).
8.2. Prove that the Fokker-Planck equation for the Brownian motion on
S2 is given by (8.16). Prove that for Brownian motion on S1 , the
Fokker-Planck operator reduces to d²/dθ² if we use the parameterization (x, y) = (cos θ, sin θ), θ ∈ [0, 2π).
8.3. Derive the Fokker-Planck equation (8.13) for the Stratonovich SDEs.
8.4. Let X be the solution of the SDE
dXt = a(t)Xt dt + σ(t)dWt .
Find the PDF of Xt .
8.5. Prove that for the multidimensional OU process
dX t = BX t dt + σdW t ,
if the invariant distribution is a nondegenerate Gaussian with density Z^{−1} exp(−x^T Σ^{−1} x/2), then the detailed balance condition j_s(x) = 0 implies
BΣ = (BΣ)^T.
8.6. Prove that the detailed balance condition (8.30) is equivalent to
(8.32) when the diffusion process has the form (8.31).
8.7. Consider the one-dimensional OU process (7.38).
(a) Find the eigenvalues and eigenfunctions of the Fokker-Planck
operator.
(b) Find the transition probability density.
8.8. Find the transition PDF of the linear Langevin equation
dX_t = V_t dt,   dV_t = −X_t dt − γV_t dt + √(2γ) dW_t
starting from the initial state (Xt , Vt )|t=0 = (x0 , v0 ).
8.9. Suppose that Xt satisfies the SDE
dXt = b(Xt )dt + σ(Xt )dWt ,
where b(x) = −a0 − a1 x, σ(x) = b0 + b1 x, and ai , bi ≥ 0 (i = 0, 1).
(a) Write down the partial differential equations for the PDF
p(x, t) of Xt when X0 = x0 and u(x, t) = Ex f (Xt ). Spec-
ify their initial conditions.
(b) Assume that X0 is a random variable with PDF p0 (x) that
has finite moments. Derive a system of differential equations
for the moments of Xt (i.e., the equations for Mn (t) = EXtn ).
194 8. Fokker-Planck Equation
8.10. Let Xt satisfy the SDE with drift b(x) = 0 and diffusion coefficient
σ(x) = 1. Assume Xt ∈ [0, 1] with reflecting boundary conditions
and the initial state Xt |t=0 = x0 ∈ [0, 1].
(a) Solve the initial-boundary value problem for the Fokker-Planck
equation to calculate the transition PDF p(x, t|x0 , 0).
(b) Find the stationary distribution of the process Xt .
8.11. Derive the Fokker-Planck equation for the SDEs defined using the
backward stochastic integral (see Exercise 7.11 for definition)
dX t = b(X, t)dt + σ(X t , t) ∗ dW t .
8.12. Prove that the operator L* defined in (8.50) is selfadjoint and nonpositive in the space L²_{μ^{−1}} with inner product
(f, g)_{μ^{−1}} = ∫_{[0,1]^d} f(x) g(x) μ^{−1}(x) dx.
Notes
The connection between PDE and SDE is one of the most exciting areas in
stochastic analysis, with connections to potential theory [Doo84, Szn98],
semigroup theory [Fri75a, Fri75b, Tai04], among other topics. Historically,
the study of the distributions of diffusion processes began even earlier, and this motivated the study of stochastic differential equations [Shi92].
Advanced Topics
Chapter 9
Path Integral
I_n(w) = ½ Σ_{j=1}^n ( (w_j − w_{j−1}) / (t_j − t_{j−1}) )² (t_j − t_{j−1}),   t_0 := 0, w_0 := 0.
We have
0 < μ(B_1) = μ(B_2) = ⋯ = μ(B_n) = ⋯ < ∞ and 0 < μ(B) < ∞,
which is a contradiction!
These facts explain why the path integral is often used as a formal
tool. Nevertheless, the path integral is a very useful approach and in many
important cases, things can be made rigorous.
Example 9.1. Compute the expectation
E exp( −½ ∫_0^1 W_t² dt ).
We have seen in Chapter 6 that the answer is √(2e/(1 + e²)) using the Karhunen-Loève expansion. Here we will use the path integral approach.
Step 1. Finite-dimensional approximation.
Assume that, as before, {t_j} defines a subdivision of [0, 1]:
exp( −½ ∫_0^1 W_t² dt ) ≈ exp( −½ Σ_{j=1}^n W_{t_j}² δt ) = exp( −½ δt X^T A X ),
where {λ_i^B} and {λ_i^{A+B}} are the eigenvalues of B and A + B, respectively.
We have formally
E exp( −½ ∫_0^1 W_t² dt ) = ∫ exp( −½ (Aw_t, w_t) ) · (1/Z) exp( −½ (Bw_t, w_t) ) δ(w_0) Dw,
where the operator Bu(t) := −d²u/dt² and
Z = ∫ exp( −½ (Bw_t, w_t) ) δ(w_0) Dw.
Now compute formally the infinite-dimensional Gaussian integrals. We obtain
E exp( −½ ∫_0^1 W_t² dt ) = ( det B / det(A + B) )^{1/2},
where det B, det(A + B) mean the products of all eigenvalues for the following boundary value problems:
Bu = λu, or (A + B)u = λu,   u(0) = 0, u′(1) = 0.
Both determinants are infinite. But their ratio is well-defined. It is easy to
check that by discretizing and taking the limit, one obtains the same result
as before. In fact the eigenvalue problem for the operator B is the same as
the eigenvalue problem considered in the Karhunen-Loève expansion (6.11).
Although it seems to be more complicated than before, this example
illustrates the main steps involved in computing expectation values using
the path integral approach.
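The claim that discretizing and taking the limit reproduces the same answer can be checked directly. For the finite-dimensional approximation one uses the Gaussian identity E exp(−½ X^T M X) = det(I + CM)^{−1/2} for X ~ N(0, C); a sketch:

```python
import numpy as np

n = 1000
dt = 1.0 / n
t = dt * np.arange(1, n + 1)              # grid t_1, ..., t_n
C = np.minimum.outer(t, t)                # Cov(W_{t_i}, W_{t_j}) = min(t_i, t_j)

# For X ~ N(0, C): E exp(-(dt/2) X^T X) = det(I + dt*C)^{-1/2}
sign, logdet = np.linalg.slogdet(np.eye(n) + dt * C)
approx = np.exp(-0.5 * logdet)
exact = np.sqrt(2 * np.e / (1 + np.e**2))   # = (cosh 1)^{-1/2}
print(approx, exact)
```

As n grows, the eigenvalues of dt·C converge to those of the covariance operator from the Karhunen-Loève expansion, and the determinant ratio converges to the answer above.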
9.2. Girsanov Transformation
can be viewed as a mapping Φ between the Wiener path {W_t} and {X_t}:
Φ : {W_t} → {X_t}.
(1) Is the new measure absolutely continuous with respect to the Wiener measure?
(2) If it is, what is the Radon-Nikodym derivative?
Φ_n : {W_{t_1}, W_{t_2}, …, W_{t_n}} → {X_{t_1}, X_{t_2}, …, X_{t_n}}.
With the notation defined earlier, one can write this mapping as
(9.7) ∂(w_1, …, w_n)/∂(x_1, …, x_n) = 1.
Ĩ_n(x) = ½ Σ_{j=1}^n ( (x_j − x_{j−1}) / (t_j − t_{j−1}) )² δt_{j−1} + ½ Σ_{j=1}^n b²(x_{j−1}, t_{j−1}) δt_{j−1} − Σ_{j=1}^n b(x_{j−1}, t_{j−1}) · (x_j − x_{j−1}).
Note that from (9.9) the stochastic integral is in the Itô sense. Since (9.10)
is valid for arbitrary F , we conclude that the distribution of {Xt }, μX , is
absolutely continuous with respect to the Wiener measure μW and
dμ_X/dμ_W = exp( −½ ∫_0^1 b²(W_t, t) dt + ∫_0^1 b(W_t, t) dW_t ).
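The Radon-Nikodym derivative is exactly what one needs for importance sampling: expectations under the law of X can be computed by reweighting Wiener paths. A minimal Monte Carlo sketch, assuming the constant drift b(x, t) = θ so that the answer is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 100_000, 50
dt = 1.0 / n_steps
theta = 0.5    # constant drift b(x, t) = theta -- an assumed choice with a closed-form answer

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))   # Brownian increments
W1 = dW.sum(axis=1)                                          # W_1 along each path
weight = np.exp(theta * W1 - 0.5 * theta**2)    # dmu_X/dmu_W: here ∫ b dW = theta*W_1

estimate = np.mean(W1**2 * weight)    # reweighted Wiener paths give E f(X_1), f(x) = x^2
exact = 1.0 + theta**2                # X_1 = W_1 + theta ~ N(theta, 1)
print(estimate, exact)
```

For a nonconstant drift the stochastic integral ∫ b(W_t, t) dW_t would be accumulated step by step along each path.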
(9.11) X_t = W_t + φ(t)
One can also see from the derivation that if the condition that σ = 1 is
violated, the measure induced by Xt fails to be absolutely continuous with
respect to the Wiener measure.
The following theorems are the rigorous statements of the formal deriva-
tions above.
Theorem 9.2 (Girsanov Theorem I). Consider the Itô process
(9.13) dW̃ t = φ(t, ω)dt + dW t , W̃ 0 = 0,
where W ∈ Rd is a d-dimensional standard Wiener process on (Ω, F , P).
Assume that φ satisfies the Novikov condition
(9.14) E exp( ½ ∫_0^T |φ|²(s, ω) ds ) < ∞,
where T ≤ ∞ is a fixed constant. Let
(9.15) Z_t(ω) = exp( −∫_0^t φ(s, ω) dW_s − ½ ∫_0^t φ²(s, ω) ds )
and
(9.16) P̃(dω) = ZT (ω)P(dω).
Then W̃ is a d-dimensional Wiener process on [0, T ] with respect to (Ω, FT , P̃).
To see how (9.15) and (9.12) are related to each other, we note that for any functional F
⟨F[W̃_t]⟩_P̃ = ⟨F[W̃_t] Z_T⟩_P
= ⟨ F[W̃_t] exp( −∫_0^T φ(s, ω) dW̃_s + ½ ∫_0^T φ²(s, ω) ds ) ⟩_P
= ⟨ F[W_t] exp( −∫_0^T φ(s, ω) dW_s + ½ ∫_0^T φ²(s, ω) ds ) dμ_W̃/dμ_W ⟩_P
= ⟨F[W_t]⟩_P.
where φ is defined by
φ(t, ω) = (σ(X t , t))−1 · γ(t, ω).
Theorem 9.3 (Girsanov Theorem II). Let X t , Y t , and φ be defined as
above. Assume that b and σ satisfy the same conditions as in Theorem 7.14,
σ is nonsingular, γ is an Ft -adapted process, and the Novikov condition
holds for φ:
(9.18) E exp( ½ ∫_0^T |φ|²(s, ω) ds ) < ∞.
Define W̃ t , Zt , and P̃ as in Theorem 9.2. Then W̃ is a standard Wiener
process with respect to (Ω, FT , P̃) and Y satisfies
dY t = b(Y t , t)dt + σ(Y t , t)dW̃ t , Y 0 = x, t ≤ T.
Thus the law of Y t under P̃ is the same as that of X t under P for t ≤ T .
equation
(9.19) iℏ ∂_t ψ = −(ℏ²/2m) Δψ + V(x)ψ,   ψ|_{t=0} = ψ_0(x)
can be expressed formally as
(9.20) ψ(x, t) = (1/Z) ∫ δ(w_0 − x) exp( (i/ℏ) I[w] ) ψ_0(w_t) Dw,
where I[·] is the action functional defined in classical mechanics
I[w] = ∫_0^t ( (m/2) ẇ_s² − V(w_s) ) ds
and ℏ is the reduced Planck constant. Formally, if we take
m = 1,   ℏ = −i
in (9.19) and (9.20), we obtain the Feynman-Kac formula. This was what
motivated Kac to propose the Feynman-Kac formula for the parabolic prob-
lem.
Feynman’s formal expression is yet to be made rigorous, even though
Kac’s reinterpretation for the parabolic equation is a rigorous mathematical
tool widely used in mathematical physics.
Exercises
9.1. Derive (9.17) via the path integral approach.
9.2. Derive the infinite-dimensional characteristic function for Wiener
process Wt
⟨ exp( i ∫_0^1 ϕ(t) dW_t ) ⟩ = exp( −½ ∫_0^1 |ϕ|² dt )
using the path integral approach.
Notes
The idea of the path integral originated in Feynman's thesis, in which solutions to the Schrödinger equation were expressed formally as averages over
measures on the space of classical paths. Feynman’s formal expressions have
not yet been put on a firm mathematical basis. Kac had the insight that
Feynman’s work can be made rigorous if instead one considers the parabolic
analog of the Schrödinger equation. This can be understood as taking the
time variable to be imaginary in the Schrödinger equation.
One example of the application of the path integral approach is the
asymptotic analysis of Wiener processes (see Chapter 12 and Freidlin’s book
[Fre85]). For application of path integrals to physics, we refer to [Kle06].
Chapter 10
Random Fields
Our study of random processes focused on two large classes: the
Gaussian processes and the Markov processes, and we devoted a large part
of this book to the study of diffusion processes, a particular class of Markov
processes with continuous paths. We will proceed similarly for random fields:
We will study Gaussian random fields and Markov random fields. However,
there are two important differences. The first is that for Markov processes,
a very powerful tool is the Fokker-Planck equation. This allowed us to use
differential equation methods. This has not been the case so far for Markov
random fields. The second is that constructing nontrivial continuous random
fields, a subject in constructive quantum field theory, is among the hardest
problems in mathematics. For this reason, there has not been much work
on continuous random fields, and for non-Gaussian random fields, we are
forced to limit ourselves to discrete fields.
Figure 10.2. The original PKU (Peking University) tower image and
the degraded image by adding Gaussian random noise.
some powerful tools. For random fields, perhaps the most general tool one
can think of is the functional integral representation, which is a generaliza-
tion of the path integrals.
We will skip the details of defining random fields, except for one terminology: We say that a random field is homogeneous if its statistics are translation invariant. For example, a point process is homogeneous if and only if its density ρ is a constant.
where ⟨i, j⟩ means that the sites i and j are nearest neighbors and λ > 0 is a penalty parameter. Define the probability measure
(10.14) P(x) = (1/Z) exp(−βU(x)).
Here β^{−1} is the parameter that plays the role of temperature. As β → 0, this converges to the uniform distribution on Ω, i.e., equal weights for all configurations. As β → ∞, P converges to the point mass centered at the global minimum of U.
This has been used as a model for image segmentation with u being the
image [MD10].
Exercise
10.1. Find the covariance operator K of the two-dimensional Brownian
sheet.
Notes
For a mathematical introduction to random fields, we refer to the classical
book by Gelfand and Shilov [GS66]. Practical issues are discussed in the
well-known book by Vanmarcke [Van83]. For an interesting perspective on
the application of random fields to pattern recognition, we refer to [MD10].
Chapter 11
Introduction to
Statistical Mechanics
from the premise stated above. The fact that the observed macrostates are
quite deterministic even though the microstates are very much random is a
consequence of the law of large numbers and large deviation theory.
In equilibrium statistical mechanics, we assume that the macrostate is
stationary. Our aim is to understand the connection between the micro-
and macrostates. Among other things, this will provide a foundation for
thermodynamics. For simplicity, we assume that the total volume V is
fixed. According to how the system exchanges particles and energy with its
environment, one can consider three kinds of situations:
(1) Isolated system: The system under consideration does not exchange
mass or energy with its environment. In this case, the particle
number N and total energy E are both fixed. We call this an
(N, V, E) ensemble, or microcanonical ensemble.
(2) Closed system: The system does not exchange mass with the en-
vironment but it does exchange energy with the environment. We
further assume that the system is at thermal equilibrium with the
environment; i.e., its temperature T is the same as the environment.
We call this an (N, V, T ) ensemble, or canonical ensemble.
(3) Open system: The system exchanges both mass and energy. We
further assume that it is at both thermal and chemical equilibrium
with the environment. The latter is equivalent to saying that the
chemical potential μ is the same as that of the environment. Phys-
ically one can imagine particles in a box but the boundary of the
box is both permeable and perfectly conducting for heat. We call
this a (μ, V, T ) ensemble, or grand canonical ensemble.
One can also consider the situation in which the volume is allowed to change.
For example, one can imagine that the position of the boundary of the box,
such as the piston inside a cylinder, is maintained at a constant pressure. In
this case, the system under consideration is at mechanical equilibrium with
the environment. This is saying that the pressure P is the same as that
of the environment. One can then speak about, for example, the (N, P, T )
ensemble, or isothermal-isobaric ensemble.
In nonequilibrium statistical mechanics, the systems are time-dependent
and generally driven by frictional force and random fluctuations. Since the
frictional and random forces are both induced from the same microscopic
origin, that is, the random impacts of surrounding particles, they must be
related to each other. A cornerstone in nonequilibrium statistical mechanics
is the fluctuation-dissipation theorem. It connects the response functions,
which are related to the frictional component of the system, with the corre-
sponding correlation functions, which are related to the fluctuation compo-
nent. When the applied force is weak, the linear response theory is adequate
for describing the relaxation of the system. From the reductionist viewpoint,
one can derive the nonequilibrium dynamical model for a system interacting
with a heat bath using the Mori-Zwanzig formalism. We will demonstrate
this reduction for the simplest case, the Kac-Zwanzig model, which gives
rise to a generalized Langevin equation for a distinguished particle coupled
with many small particles via harmonic springs.
We should remark that the theory of nonequilibrium statistical mechan-
ics is still an evolving field. For example, establishing thermodynamic laws
and using them to understand experiments for biological systems is an active
research field in biophysics. Even less is known for open quantum systems.
where the summation is with respect to all possible states x and p(x) is the
probability of finding the system at state x. For an isolated system, we have,
as a consequence of the basic premise stated earlier,
(11.3) p(x) = 1/g(E) if H(x) = E, and p(x) = 0 otherwise.
In this case, the Gibbs entropy reduces to the Boltzmann entropy (11.1).
Let E1∗ be the value of E1 such that the summand attains its largest
value. Then
(11.5) (d/dE_1) ( g_1(E_1) g_2(E − E_1) ) = 0 at E_1 = E_1*.
Let σ_1(E_1) = log g_1(E_1), σ_2(E_2) = log g_2(E_2), and E_2* = E − E_1*. Then the corresponding entropies are
S_1(E_1) = k_B σ_1(E_1),   S_2(E_2) = k_B σ_2(E_2)
by (11.1), and the condition (11.5) becomes
(11.6) ∂S_1/∂E_1 (E_1*) = ∂S_2/∂E_2 (E_2*).
Define the temperatures
(11.7) T_1 = ( ∂S_1/∂E_1 (E_1*) )^{−1},   T_2 = ( ∂S_2/∂E_2 (E_2*) )^{−1}.
Then (11.6) says that
(11.8) T1 = T2 .
This means that if two systems are at thermal equilibrium, then their tem-
peratures must be the same.
We postulate that the main contribution to the summation in the defi-
nition of g(E) above comes from the maximum term of the summand at E1∗ .
As we see later, this is supported by large deviation theory.
Assume that the initial energy of the two systems when they are brought
into contact are E10 and E20 , respectively. Then
(11.9) g1 (E10 )g2 (E20 ) ≤ g1 (E1∗ )g2 (E2∗ );
i.e., the number of states of the combined system can only increase. We will
see later that the number of states of a system is related to its entropy. So
this says that entropy can only increase. This gives an interpretation of the
second law of thermodynamics from the microscopic point of view.
The Mechanical Equilibrium States. In the same way, one can show
that if the two systems can only exchange volume, i.e., their temperature
and particle numbers remain fixed, then at equilibrium we must have
(11.29) (∂F_1/∂V_1)|_{N_1,T} = (∂F_2/∂V_2)|_{N_2,T}.
This is the mechanical equilibrium. This defines the pressure:
(11.30) P = −(∂F/∂V)|_{N,T}.
This is the translational part of the temperature. If the particle has a shape, say it is shaped like a rod, and can undergo other kinds of motion such as rotation, then one can speak of the corresponding temperature, such as the rotational temperature. The definition would be similar.
To get an expression for the pressure, assume that the force between the ith and jth particles is f_ij; then the pressure of the system is given by [E11]:
(11.35) P = (1/V) Σ_{i≠j} (x_i − x_j) · f_ij.
where |L| is the cardinality of the set L and yj is the jth element in L. Given
β > 0, define the discrete probability proportional to e^{−βy_j²/2}.
Then the canonical distribution on the N -particle state space ΩN has the
form
(11.51) PN,eq (dω) = λ(dx1 ) · · · λ(dxN )ρβ (dv1 ) · · · ρβ (dvN ).
For this problem, one can simply integrate out the degrees of freedom for
the position and consider the reduced canonical ensemble for the velocities
ω = (v1 , . . . , vN ). The probability distribution is then defined on ΩN =
{ω = (v1 , . . . , vN )}:
(11.52) P_{N,eq}(dω) = (1/Z_N(β)) e^{−βU_N(ω)} ρ(dv_1) ⋯ ρ(dv_N),
where U_N(ω) = Σ_{j=1}^N |v_j|²/2 and
(11.53) Z_N(β) = ∫_{Ω_N} e^{−βU_N(ω)} ρ(dv_1) ⋯ ρ(dv_N)
where J0 > 0 and the first part of H N indicates that only nearest neighbor
interactions are taken into account.
For such thermodynamic systems, we naturally define the Gibbs distribution
(11.56) P_{N,eq}(dx) = (1/Z_N(β, h)) exp( −H_N(x; β, h) ) P_N(dx)
in the canonical ensemble, where P_N is the product measure on B(S^N) with identical one-dimensional marginals equal to ½δ_1 + ½δ_{−1} and the partition function
(11.57) Z_N(β, h) = ∫_{S^N} exp( −H_N(x; β, h) ) P_N(dx).
Macroscopically, suppose that we are interested in a particular observable,
the average magnetization
Y = O_N(X) = (1/N) Σ_{i=1}^N X_i,   X ∈ S^N,
where P_N^Y(dy) is simply the Lebesgue measure in the current setup. The free energy per site F_N(y; β, h) for Y has the form
(11.59) F_N(y; β, h) = −(1/N) log( (1/Z_N(β, h)) ∫_{{x∈S^N : O_N(x)=y}} exp(−H_N(x; β, h)) P_N(dx) ).
Evaluating the Partition Function via Transfer Matrices. First we will calculate the log-asymptotics of the partition function Z_N(β, h) using a powerful technique known as transfer matrices. We write
(11.60) Z_N(β, h) = 2^{−N} Σ_{x∈S^N} exp( βJ_0 Σ_{i=1}^{N−1} x_i x_{i+1} + βh Σ_{i=1}^N x_i )
= (1/2) Σ_{x_1=±1} Σ_{x_2=±1} ⋯ Σ_{x_N=±1} a(x_1) B(x_1, x_2) ⋯ B(x_{N−1}, x_N) a(x_N).
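The transfer-matrix evaluation is easy to check against brute-force enumeration for small N. The sketch below assumes the standard choices a(x) = e^{βhx/2} and B(x, x′) = exp(βJ_0 xx′ + βh(x + x′)/2); the explicit definitions of a and B are not reproduced in this excerpt:

```python
import numpy as np
from itertools import product

def Z_brute(N, beta, J0, h):
    # direct enumeration of S^N for the first line of (11.60)
    total = 0.0
    for x in product((-1.0, 1.0), repeat=N):
        x = np.array(x)
        total += np.exp(beta * J0 * np.dot(x[:-1], x[1:]) + beta * h * x.sum())
    return total / 2.0**N

def Z_transfer(N, beta, J0, h):
    # the same sum written as 2^{-N} a^T B^{N-1} a
    s = np.array([1.0, -1.0])
    B = np.exp(beta * J0 * np.outer(s, s) + beta * h * (s[:, None] + s[None, :]) / 2)
    a = np.exp(beta * h * s / 2)
    v = a.copy()
    for _ in range(N - 1):
        v = B @ v
    return (a @ v) / 2.0**N

zb, zt = Z_brute(10, 0.7, 1.0, 0.3), Z_transfer(10, 0.7, 1.0, 0.3)
print(zb, zt)
```

The brute-force cost is 2^N while the matrix product is linear in N; the log-asymptotics of Z_N are then governed by the largest eigenvalue of B.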
Recall from (11.59) that the free energy function F(y; β, h) can be thought of as a rate function for a large deviation principle for the sequence of measures P^Y_{N,eq} as N → ∞. Hence, an application of Laplace's asymptotics gives the limiting free energy.
[Figure: plots of the free energy F(y; β, h) for J_0 = 1 and various (β, h), showing the transition from a single minimum at y = 0 to two symmetric minima.]
[Figure: the average magnetization Y as a function of β (with h → 0) and as a function of h.]
Ising model. Indeed, the nearest neighbor Ising model only exhibits phase
transitions in more than one dimension [Gol92, Ell85].
In this chapter we will adopt the notation commonly used in the physics
literature for SDEs.
Consequently, we have
(11.81) V_t = ∫_{−∞}^{∞} μ(t − s) ξ_s ds,
Ĉ(ω, ω′) = |μ̂(ω)|² · 2Re γ̂_+(ω) · 2πδ(ω + ω′) = Ĝ(ω) · 2πδ(ω + ω′).
This is exactly the Fourier transform of G(|t − s|). This proves (11.78).
Denote
(11.84) δA(t) = ∫_0^t χ_AB(t − s) F(s) ds,
where we used the weighted inner product (·, ·)eq with respect to the Gibbs
distribution peq (x) and the selfadjoint property (8.52) of L0 with respect to
(·, ·)eq .
On the other hand, the cross correlation satisfies
⟨A(t)B(0)⟩_eq = ∫ dx dy A(x) B(y) p_eq(y) exp(tL*_{0,x}) δ(x − y)
= ( exp(tL_0)A(x), B(x) )_eq
(11.87) = ( A(x), exp(tL_0)B(x) )_eq,
The relation (11.88) has another useful form in the frequency domain.
Define the so-called complex admittance μ̂(ω), the Fourier transform of the
response function
(11.89) μ̂(ω) = ∫_0^∞ χ_AB(t) e^{−iωt} dt.
Due to the stationarity of the equilibrium process, we formally have
(d/dt) C_AB(t) = ⟨Ȧ(t)B(0)⟩_eq = −⟨A(t)Ḃ(0)⟩_eq.
Thus we obtain the general form of the Green-Kubo relation
(11.90) μ̂(ω) = β ∫_0^∞ ⟨A(t)Ḃ(0)⟩_eq e^{−iωt} dt.
If we take B(x) = x and A(x) = u for the Langevin dynamics (11.72) in one dimension, we obtain
(11.91) μ̂(0) = β ∫_0^∞ ⟨u(t)u(0)⟩_eq dt.
Under the ergodicity and stationarity condition
lim_{t→∞} ⟨u(t)u(t + τ)⟩ = ⟨u(0)u(τ)⟩_eq,
we have
D = lim_{t→∞} ⟨x²(t)⟩/(2t) = lim_{t→∞} (1/2t) ∫_0^t ∫_0^t ⟨u(t_1)u(t_2)⟩ dt_1 dt_2 = ∫_0^∞ ⟨u(t)u(0)⟩_eq dt.
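The identity D = ∫_0^∞ ⟨u(t)u(0)⟩_eq dt can be tested on the one-dimensional Langevin velocity process du = −γu dt + √(2γ/β) dW, for which ⟨u(t)u(0)⟩_eq = e^{−γt}/β and hence D = 1/(βγ) exactly. A sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, beta = 2.0, 1.0
dt, n_steps, n_paths = 0.01, 1200, 20_000

# Euler-Maruyama for du = -gamma*u dt + sqrt(2*gamma/beta) dW, started in equilibrium
u = rng.normal(0.0, 1.0 / np.sqrt(beta), n_paths)
u0 = u.copy()
corr = [np.mean(u * u0)]
for _ in range(n_steps):
    u += -gamma * u * dt + np.sqrt(2 * gamma / beta * dt) * rng.normal(size=n_paths)
    corr.append(np.mean(u * u0))        # <u(t)u(0)> estimated over paths
corr = np.array(corr)

D = np.sum(0.5 * (corr[:-1] + corr[1:])) * dt   # trapezoid rule for the time integral
print(D, 1.0 / (beta * gamma))                  # exact D = 1/(beta*gamma) = 0.5
```

The estimate carries both Monte Carlo noise and an O(Δt) discretization bias, both small here.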
we obtain
(11.98) dϕ/dt = e^{tL} PLϕ_0 + ∫_0^t e^{(t−s)L} K(s) ds + R(t),
where
(11.99) R(t) = etQL QLϕ0 , K(t) = PLR(t).
Equation (11.98) is a generalized Langevin equation; see (11.76). The first
term on the right-hand side of (11.98) is a Markovian term. The second
term is a memory-dependent term. The third term depends on the full
knowledge of the initial condition x0 . One can regard this term as a noise
term by treating the initial data for the unresolved variables as noise. The
fact that the memory depends on the noise corresponds to the fluctuation-
dissipation relation, as will be exemplified in the next section. To eliminate the
noise, one can simply apply P to both sides and obtain
(d/dt) Pϕ = Pe^{tL} PLϕ_0 + ∫_0^t Pe^{(t−s)L} K(s) ds
(11.100) H(x, v) = ½ v² + U(x) + Σ_{i=1}^N ( ½ m_i v_i² + (k/2N)(x_i − x)² ),
(11.101) Ẍ^N = −U′(X^N) − (k/N) Σ_{i=1}^N (X^N − X_i^N),   X^N(0) = x_0, Ẋ^N(0) = v_0,
(11.105) γ_N(t) = (k/N) Σ_{i=1}^N cos ω_i t
and
(11.106) ξ_N(t) = √β (k/N) Σ_{i=1}^N ( (x_i − x_0) cos ω_i t + (v_i/ω_i) sin ω_i t ).
i=1
The initial condition for the bath variables is random, and its distribution
can be obtained as the marginal distribution of the Gibbs distribution asso-
ciated with H(x, v). This gives the density
(11.107) ρ(x_1, …, x_N, v_1, …, v_N) = (1/Z) exp(−βĤ(x, v)),
where Z is a normalization constant and Ĥ is the reduced Hamiltonian
Ĥ(x, v) = (k/2N) Σ_{i=1}^N ( v_i²/ω_i² + (x_i − x_0)² ).
(11.109) Eξ_N(t)ξ_N(s) = (k/N) Σ_{i=1}^N cos ω_i(t − s) = γ_N(t − s).
The noise term ξ_N(t) then converges to a Gaussian process ξ(t) with Eξ(t) = 0 and covariance function
Eξ(t)ξ(s) = lim_{N→∞} (k/N) Σ_{i=1}^N cos ω_i(t − s) = k ∫_R cos(ω(t − s)) ρ(ω) dω
(11.111) = γ(t − s).
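The convergence of γ_N to γ is just the law of large numbers in N. A sketch, assuming for illustration that the frequencies ω_i are sampled from the standard Gaussian density ρ, so that γ(t) = k e^{−t²/2}:

```python
import numpy as np

rng = np.random.default_rng(2)
k, N = 1.0, 200_000
omega = rng.normal(size=N)        # bath frequencies drawn from rho = N(0, 1) (an assumed choice)
t = np.linspace(0.0, 3.0, 31)

gamma_N = (k / N) * np.cos(np.outer(t, omega)).sum(axis=1)   # (11.105)
gamma_limit = k * np.exp(-t**2 / 2)    # k * integral of cos(omega*t) against the Gaussian density
print(np.max(np.abs(gamma_N - gamma_limit)))   # O(N^{-1/2}) by the central limit theorem
```

Different choices of ρ produce different memory kernels γ; an exponentially decaying kernel corresponds to a Lorentzian frequency density.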
Exercises
11.1. By putting the system in contact with a heat reservoir, prove that
the Helmholtz free energy F achieves minimum at equilibrium using
the fact that the entropy S = kB log g(E) achieves maximum at
equilibrium in the canonical ensemble. Generalize this argument
to other ensembles.
11.2. Derive the Gibbs distribution p(x) by the maximum entropy principle in the canonical ensemble: Maximize the Gibbs entropy
S = −k_B Σ_x p(x) log p(x)
11.4. Derive the zero mass limit of the Langevin equation in Exercise
11.3, i.e., the limit as m = ε → 0 when γ is a constant.
11.5. Prove the positivity of the energy spectrum of a stationary stochastic process X_t; i.e.,
Ĉ_X(ω) ≥ 0,
where C_X(|t − s|) = ⟨X_t X_s⟩ is the autocorrelation function and
Ĉ_X(ω) = ∫_{−∞}^{∞} C_X(|t|) e^{−iωt} dt.
11.6. Suppose that X_t is a stationary process and A(t) = A(X_t), B(t) = B(X_t) are two observables. Prove that the cross-correlation function ⟨A(t)B(s)⟩ satisfies the relation ⟨Ȧ(t)B(0)⟩ = −⟨A(t)Ḃ(0)⟩ if A(t), B(t) are differentiable.
11.7. Prove that under the ergodicity and stationarity condition
lim_{t→∞} ⟨u(t)u(t + τ)⟩ = ⟨u(0)u(τ)⟩_eq,
we have
lim_{t→∞} (1/2t) ∫_0^t ∫_0^t ⟨u(t_1)u(t_2)⟩ dt_1 dt_2 = ∫_0^∞ ⟨u(t)u(0)⟩_eq dt.
Notes
There are many good textbooks and monographs on the introduction of the
foundation of statistical mechanics. On the physics side we have chosen to
rely mainly on [Cha87, KK80, KTH95, Rei98]. On the mathematics side,
our main reference is [Ell85], a classic text on the mathematical foundation
of equilibrium statistical mechanics. It has been widely recognized that
the mathematical language behind statistical mechanics is ergodic theory
and large deviation theory. We discussed these topics from the viewpoint
of thermodynamic limit and phase transition. The interested readers are
referred to [Gal99, Sin89, Tou09] for further material.
The linear response theory is a general approach for studying different
kinds of dynamics ranging from the Hamiltonian dynamics, Langevin dy-
namics, Brownian dynamics to quantum dynamics. It provides the basis
for deriving response functions and sometimes equilibrium correlation func-
tions. We only considered Brownian dynamics as a simple illustration. See
[KTH95, MPRV08] for more details.
The fluctuation-dissipation theorem is a cornerstone in nonequilibrium
statistical mechanics. In recent years, interest in small systems in biology and chemistry has given further impetus to this topic, resulting in the birth of a new field called stochastic thermodynamics.
Representative results include Jarzynski’s equality and Crooks’s fluctuation
theorem. See [JQQ04, Sei12, Sek10] for the latest developments.
Model reduction is an important topic in applied mathematics. The
Mori-Zwanzig formalism gives a general framework for eliminating uninter-
esting variables from a given system. However, approximations have to be
made in practical applications. How to effectively perform such approx-
imation is still an unresolved problem in many real applications. Some
interesting attempts can be found in [DSK09, HEnVEDB10, LBL16].
Chapter 12
Rare Events
Many phenomena in nature are associated with the occurrence of some rare events. A typical example is chemical reaction. The vibration of chemical bonds occurs on the time scale of femto- to picoseconds (i.e., 10^{−15} to 10^{−12} seconds). But a typical reaction may take seconds or longer to accomplish.
This means that it takes on the order of 1012 to 1015 or more trials in order
to succeed in one reaction. Such an event is obviously a rare event. Other
examples of rare events include conformational changes of biomolecules, the
nucleation process, etc.
Rare events are caused by large deviations or large fluctuations. Chemi-
cal reactions succeed when the chemical bonds of the participating molecules
manage to get into some particular configurations so that a sufficient amount
of kinetic energy is converted to potential energy in order to bring the sys-
tem over some potential energy barrier. These particular configurations are
usually quite far from typical configurations for the molecules and are only
visited very rarely. Although these events are driven by noise and conse-
quently are stochastic in nature, some of their most important features, such
as the transition mechanism, are often deterministic and can be predicted
using a combination of theory and numerical algorithms.
The analysis of such rare events seems to have proceeded in separate
routes in the mathematics and science literature. Mathematicians have over
the years built a rigorous large deviation theory, starting from the work of
Cramér and Schilder and ultimately leading to the work of Donsker and
Varadhan and of Freidlin and Wentzell [FW98, Var84]. This theory gives
a rigorous way of computing the leading order behavior of the probability
of large deviations. It has now found applications in many branches of
mathematics and statistics.
[Figure: the double-well potential U(x) = (x² − 1)²/4 with wells B_− and B_+, and a sample trajectory X_s for ε = 0.03 showing rare transitions between the wells.]
In physical terms, the local minima or the basins of attraction are called metastable states. Obviously, when discussing metastability, the key issue is that of the time scale. In rare event studies, one is typically concerned
about the following three key questions:
(1) What is the most probable transition path? In high dimension, we
should also ask the following question: How should we compute it?
(2) Where is the transition state, i.e., the neighboring saddle point,
for a transition event starting from a metastable state? Presum-
ably saddle points can be identified from the eigenvalue analysis of
The readers are referred to [Gol80, Arn89] for more details about these
connections.
(12.11) Aτ(x) = −U′(x)τ′(x) + ετ″(x) = −1
for x ∈ (a, b) and τ|_{x=b} = 0, τ′|_{x=a} = 0.
The solution to this problem is given simply by
(12.12) τ(x) = (1/ε) ∫_x^b e^{U(y)/ε} ∫_a^y e^{−U(z)/ε} dz dy.
Applying the Laplace method to (12.12) yields
(12.14) τ(x) ≈ ( 2π / √(|U″(x_s)| U″(x_−)) ) exp(δU/ε)
for any x ≤ x_s − δ_0, where δU = U(x_s) − U(x_−) and δ_0 is a positive constant.
Equation (12.14) tells us that the transition time is exponentially large
in O(δU/ε). This is typical for rare events. The transition time is insensitive
to where the particle starts from. Even if the particle starts close to xs , most
likely it will first relax to the neighborhood of x− and then transit to x+ .
The choice of the first passage point after xs does not affect much the final
result.
In the case considered above, it is natural to define the transition rate
(12.15) k = 1/τ(x_−) = ( √(|U″(x_s)| U″(x_−)) / (2π) ) exp(−δU/ε).
This is the celebrated Kramers reaction rate formula, also referred to as Arrhenius's law of reaction rates. Here δU is called the activation energy. In the multidimensional case when x_s is an index-one saddle point, one can also derive the reaction rate formula using similar ideas:
(12.16) k = ( |λ_s| / (2π) ) √( det H_− / |det H_s| ) exp(−δU/ε).
Here λ_s < 0 is the unique negative eigenvalue of the Hessian at x_s, H_s = ∇²U(x_s), H_− = ∇²U(x_−). Interested readers are referred to [Sch80] for more details.
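Formula (12.15) can be checked against a direct quadrature of the exact expression (12.12) for the double well U(x) = (x² − 1)²/4, for which x_− = −1, x_s = 0, δU = 1/4, U″(x_−) = 2, |U″(x_s)| = 1. A sketch:

```python
import numpy as np

def mfpt(eps, a=-2.0, b=1.0, n=6000):
    # tau(-1) from (12.12) for U(x) = (x^2 - 1)^2/4, by nested trapezoid rules
    U = lambda x: 0.25 * (x**2 - 1.0)**2
    y = np.linspace(a, b, n)
    dy = y[1] - y[0]
    g = np.exp(-U(y) / eps)
    inner = np.concatenate(([0.0], np.cumsum(0.5 * (g[:-1] + g[1:]) * dy)))  # int_a^y e^{-U/eps}
    f = np.exp(U(y) / eps) * inner
    fm = f[y >= -1.0 - 1e-12]           # outer integral from x = -1 to b
    return np.sum(0.5 * (fm[:-1] + fm[1:])) * dy / eps

eps = 0.05
tau_num = mfpt(eps)
tau_kramers = 2.0 * np.pi / np.sqrt(2.0) * np.exp(0.25 / eps)   # (12.14) for this potential
print(tau_num / tau_kramers)
```

The ratio approaches 1 as ε → 0; at finite ε the Kramers formula carries O(ε) corrections.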
The proof of Theorem 12.1 is beyond the scope of this book. We will
give a formal derivation using path integrals.
It is obvious that the one-dimensional SDE
dX_t^ε = √ε dW_t,   X_0 = 0,
has the solution X_t^ε = √ε W_t. Using the path integral representation, we have that the probability distribution induced by {X_t^ε} on C[0, T] can be formally written as
(12.17) dP^ε[ϕ] = Z^{−1} exp( −(1/2ε) ∫_0^T |ϕ̇(s)|² ds ) Dϕ = Z^{−1} exp( −(1/ε) I_T[ϕ] ) Dϕ.
This gives the statement of the theorem for the simple case in (12.4). Now let us consider the general case:
dX_t^ε = b(X_t^ε) dt + √ε σ(X_t^ε) dW_t,   X_0 = y.
To understand how the action functional I_T has the form (12.8)–(12.9), we can reason formally as follows. From the SDE we have Ẇ_t = ε^{−1/2} σ^{−1}(X_t^ε)(Ẋ_t^ε − b(X_t^ε)). Hence
∫_0^T Ẇ_t² dt = ε^{−1} ∫_0^T |σ^{−1}(X_t^ε)(Ẋ_t^ε − b(X_t^ε))|² dt.
√
Using the formal expression for the distribution (12.17) induced by εWt ,
we obtain
1
dP ε [ϕ] = Z −1 exp − IT [ϕ] Dϕ.
ε
From Theorem 12.1, we have

−ε log P^ε(F) ∼ inf_{ϕ∈F} I_T[ϕ],  ε → 0.

For a set of paths F in C[0, T] we can define the optimal path in F as the path ϕ* that has the maximum probability, i.e., the minimal action:

I_T[ϕ*] = inf_{ϕ∈F} I_T[ϕ].
action of the noise, and ϕ2 simply follows the steepest descent dynamics
and therefore does not require help from the noise. Both satisfy
(12.23) ϕ̇(s) = ±∇U (ϕ(s)).
Paths that satisfy this equation are called minimum energy paths (MEPs).
One can write (12.23) equivalently as
(12.24)  (∇U(ϕ))^⊥ = 0,

where (∇U(ϕ))^⊥ denotes the component of ∇U(ϕ) normal to the curve described by ϕ.
An important feature of the minimum energy paths is that they pass
through saddle points, more precisely, index one saddle points, i.e., saddle
points at which the Hessian of U has one negative eigenvalue. One can imagine the following coarse-grained picture: the configuration space is mapped into a network in which the basins of the local minima are the nodes, neighboring basins are connected by edges, and each edge is associated with an index-one saddle point. On exponentially long time scales, the dynamics of the diffusion process is effectively described by a Markov chain on this network, with hopping rates computed by a formula similar to (12.15).
For a concrete application of this reduction in micromagnetics, the readers
are referred to [ERVE03].
To implement (12.26), one can proceed in the following two simple steps:

Step 1. Evolution of the steepest descent dynamics. For example, one can simply apply the forward Euler scheme

ϕ̃_i^{n+1} = ϕ_i^n − Δt ∇U(ϕ_i^n)

or Runge-Kutta type schemes for one or several steps.

Step 2. Reparameterization of the string. One can redistribute the points {ϕ̃_i^{n+1}} according to equal arclength (or some other parameterization). The positions of the new points along the string can be obtained by simple interpolation:

a) Define s_0 = 0, s_i = s_{i−1} + |ϕ̃_i^{n+1} − ϕ̃_{i−1}^{n+1}| for i = 1, …, N, and α̃_i = s_i/s_N.

b) Interpolate to obtain ϕ_i^{n+1} at α_i = i/N using the data {α̃_i, ϕ̃_i^{n+1}}. Here ϕ̃_i^{n+1} is the value of ϕ^{n+1} at α̃_i.

Here we have assumed that the arclength is normalized so that the total normalized arclength is 1.
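The two steps above can be sketched as follows (Python with NumPy; the two-dimensional potential U(x, y) = (x² − 1)² + 2y², the number of images, and the step size are our own illustrative assumptions). The middle image of the converged string should approach the saddle point (0, 0):

```python
import numpy as np

def grad_U(p):
    # U(x, y) = (x^2 - 1)^2 + 2 y^2: minima at (-1, 0) and (1, 0), saddle at (0, 0)
    x, y = p[:, 0], p[:, 1]
    return np.stack([4 * x * (x**2 - 1), 4 * y], axis=1)

def reparameterize(phi):
    """Step 2: redistribute the images to equal arclength by linear interpolation."""
    seg = np.linalg.norm(np.diff(phi, axis=0), axis=1)
    s = np.concatenate(([0.0], np.cumsum(seg)))
    alpha_tilde = s / s[-1]                      # normalized arclength of current images
    alpha = np.linspace(0.0, 1.0, len(phi))      # target uniform parameterization
    return np.stack([np.interp(alpha, alpha_tilde, phi[:, k])
                     for k in range(phi.shape[1])], axis=1)

N = 21
alpha = np.linspace(0.0, 1.0, N)
phi = np.stack([2 * alpha - 1, 0.5 * np.sin(np.pi * alpha)], axis=1)   # initial string

dt = 2e-3
for _ in range(2000):
    phi_tilde = phi - dt * grad_U(phi)             # Step 1: forward Euler
    phi_tilde[0], phi_tilde[-1] = phi[0], phi[-1]  # endpoints stay at the minima
    phi = reparameterize(phi_tilde)                # Step 2: equi-arclength redistribution

saddle = phi[N // 2]    # the middle image should approach the saddle (0, 0)
```

The converged string collapses onto the MEP y = 0 and, by symmetry, its middle image sits at the saddle.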
A simple illustration of the application of the string method is shown in
Figure 12.2.
M = sup_{x∈D} |b(x)|
where φ(x, t|y) is defined in (12.7). Since φ(x, t|y) satisfies (12.5) and is a
monotonically decreasing function of t and is bounded from below if y is a
stable steady state of the dynamics ẋ = b(x), we obtain a Hamilton-Jacobi
equation for V (x; y):
(12.30) H(x, ∇x V ) = 0,
which yields

0 = d/dθ [L(ϕ, θ^{−1}ϕ̇) θ]|_{θ=1} = L(ϕ, ϕ̇) − (∂L/∂ϕ̇)·ϕ̇ = −H(ϕ, p).
That is why the minimum action path starting from any critical point of
the dynamics ẋ = b(x) is also called the zero energy path.
For gradient systems, the potential energy barrier quantifies the difficulty
of transitions. Quasipotential plays a similar role for nongradient systems.
We show this by revisiting the exit problem. Let y = 0 be a stable fixed
point of the dynamical system ẋ = b(x) and let D be a smooth part of the
domain of attraction of 0 which also contains 0 in its interior. Consider
the diffusion process X^ε, (12.3), initiated at X_0^ε = 0. It is obvious that V(0; 0) = 0 by definition. We are interested in the first exit time T_ε of X^ε from D and in the most probable exit point X_{T_ε}^ε on ∂D, as ε → 0.
12.6. Quasipotential and Energy Landscape
Theorem 12.3. Assume that (b(x), n̂(x)) < 0 for x ∈ ∂D, where n̂ is the outward normal of ∂D. Then

lim_{ε→0} ε log E T_ε = min_{x∈∂D} V(x; 0),

and if there exists a unique minimizer x_0 of V on ∂D, i.e.,

V(x_0; 0) = min_{x∈∂D} V(x; 0),

then for any δ > 0,

P(|X_{T_ε}^ε − x_0| < δ) → 1 as ε → 0.
where the first equality is the minimization with respect to the path
(ϕ(t), p(t)) in R2d that connects y and x such that H(ϕ, p) = 0, and the
second equality is with respect to the path ϕ(t) in Rd that connects y and
x such that p = ∂L/∂ ϕ̇ and H(ϕ, p) = 0 [Arn89].
In this formulation, let us choose a parametrization such that α ∈ [0, 1]
and ϕ ∈ (C[0, 1])d . From the constraints, we obtain
(12.34)  H(ϕ, p) = b(ϕ)·p + |p|²/2 = 0,  p = ∂L/∂ϕ̇ = λϕ′(α) − b(ϕ),

where λ ≥ 0 is the result of the parameter change from t to α. Solving (12.34) we get

(12.35)  λ = |b(ϕ)| / |ϕ′|,  p = (|b(ϕ)| / |ϕ′|) ϕ′ − b(ϕ).
Thus
(12.36)  V(x; y) = inf_{ϕ(0)=y, ϕ(1)=x} Î[ϕ],

where

(12.37)  Î[ϕ] = ∫_0^1 (|b(ϕ)| |ϕ′| − b(ϕ)·ϕ′) dα.
The formulation (12.36) and (12.37) is preferable in numerical computations
since it removes the additional optimization with respect to time and can
be realized in an intrinsic way in the space of curves.
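A minimal discretization of (12.37) (Python with NumPy; the one-dimensional gradient drift b = −U′ with U(x) = x⁴/4 − x²/2 is our own example, not from the text) shows the characteristic asymmetry of the geometric action: the uphill path from the minimum −1 to the saddle 0 costs 2(U(0) − U(−1)) = 1/2, while the downhill path costs nothing:

```python
import numpy as np

def U_prime(x):
    return x**3 - x                 # U(x) = x^4/4 - x^2/2: minima at +-1, saddle at 0

def geometric_action(phi):
    """Midpoint discretization of (12.37) for the gradient drift b = -U'."""
    mid = 0.5 * (phi[1:] + phi[:-1])
    dphi = np.diff(phi)
    b = -U_prime(mid)
    return np.sum(np.abs(b) * np.abs(dphi) - b * dphi)

alpha = np.linspace(0.0, 1.0, 2001)
phi_up = -1.0 + alpha                    # uphill path from the minimum -1 to the saddle 0
v_up = geometric_action(phi_up)          # should approach 2 (U(0) - U(-1)) = 1/2
v_down = geometric_action(phi_up[::-1])  # downhill path costs zero action
```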
This result can also be obtained from a direct calculation.
where ψ(α) := ϕ(τ (α)) and τ : [0, 1] → [0, t] is any differentiable and
monotonically increasing parameterization. The equality holds if and only
if |b(ϕ)| = |ϕ̇|. This proves
inf_{t>0} inf_{ϕ(0)=y, ϕ(t)=x} I_t[ϕ] ≥ inf_{ψ(0)=y, ψ(1)=x} Î[ψ].
Consider again the exit problem from the basin of attraction of the metastable state y = 0. Write the quasipotential as V(x; 0) = 2U(x), where U satisfies the steady Hamilton-Jacobi equation (12.30). Define
(12.40) l(x) = b(x) + ∇U (x).
We have ∇U (x) · l(x) = 0 from the Hamilton-Jacobi equation (12.30). As-
sume U (0) = 0 and thus U ≥ 0 elsewhere. Using similar strategies to the
proof of Lemma 12.2, we get
I_T[ϕ] = (1/2) ∫_0^T |ϕ̇ − ∇U(ϕ) − l(ϕ)|² dt + 2 ∫_0^T ϕ̇·∇U(ϕ) dt
       ≥ 2(U(x) − U(0)),

where the equality holds only when

ϕ̇ = ∇U(ϕ) + l(ϕ).
This derivation shows that the quasipotential with respect to 0 has the form
V (x; 0) = 2(U (x) − U (0))
and the most probable path from 0 to x is not of a pure gradient type such
as ϕ̇ = ∇U (ϕ) but involves the nongradient term l(ϕ). The decomposi-
tion (12.40) suggests that the quasipotential plays the role of the Lyapunov
function for the deterministic ODE ẋ = b(x): since ∇U·l = 0,

dU(x(t))/dt = ∇U(x)·dx/dt = ∇U(x)·b(x) = −|∇U(x)|² ≤ 0.
dt dt
Quasipotential measures the action required to hop from one point to
another. It depends on the choice of the starting point y, which is usually
taken as a stable fixed point of b. In this sense, V(x; y) should be thought of as a local quasipotential of the system if it has more than one stable fixed point. On the other hand, we can introduce another potential by defining
(12.41)  U(x) = − lim_{ε→0} ε log p_∞^ε(x).

This definition of U does not depend on the starting point y; U is called the global quasipotential of the dynamics (12.3). Note that the two limits cannot be exchanged:

U(x) = − lim_{ε→0} lim_{t→∞} ε log p^ε(x, t|y) ≠ − lim_{t→∞} lim_{ε→0} ε log p^ε(x, t|y) = V(x; y).
However, it can be shown that the local functions V(x; y) can be pruned and glued together to form the global quasipotential U(x). The readers
may refer to [FW98, ZL16] for more details.
Exercises
12.1. Assume that x_s is an extremal point of the potential function U and that the Hessian H_s = ∇²U(x_s) has n distinct eigenvalues {λ_i}_{i=1,…,n} and orthonormal eigenvectors {v_i}_{i=1,…,n} such that H_s v_i = λ_i v_i. Prove that the (x_s, v_i) are steady states of the following dynamics:

ẋ = (I − 2 v̂ ⊗ v̂)·(−∇U(x)),
v̇ = (I − v̂ ⊗ v̂)·(−∇²U(x) v),

by imposing the normalization condition ‖v(0)‖ = 1 on the initial value v(0). Prove that (x_s, v_i) is linearly stable if and
Notes
The study of rare events is a classical but still fast-developing field. The
classical transition state theory has been a cornerstone in chemical physics
[HPM90]. The latest developments include the transition path theory
[EVE10], Markov state models [BPN14], and advanced sampling meth-
ods in molecular dynamics and Monte Carlo methods.
From a mathematical viewpoint, the large deviation theory is the start-
ing point for analyzing rare transition events. In this regard, the Freidlin-
Wentzell theory provides the theoretical basis for understanding optimal
transition paths [FW98, SW95, Var84]. In particular, the notion of quasi-
potential is useful for extending the concept of energy landscape, very im-
portant for gradient systems, to nongradient systems [Cam12, LLLL15].
From an algorithmic viewpoint, the string method is the most popular
path-based algorithm for computing the optimal transition paths. Another
closely related approach is the nudged elastic band (NEB) method [HJ00],
which is the most popular chain-of-state algorithm. The convergence anal-
ysis of the string method can be found in [CKVE11]. Besides GAD, other
ideas for finding saddle points include the dimer method [HJ99], shrinking
dimer method [ZD12], climbing string method [RVE13], etc.
Chapter 13
Introduction to
Chemical Reaction
Kinetics
In this chapter, we will study a class of jump processes that are often used
to model the dynamics of chemical reactions, particularly biochemical reac-
tions. We will consider only the homogeneous case; i.e., we will assume that
the reactants are well mixed in space and will neglect the spatial aspects of
the problem.
Chemical reactions can be modeled at several levels. At the most de-
tailed level, we are interested in the mechanism of the reactions and the
reaction rates. This is the main theme for the previous chapter, except that
to model the system accurately enough, one often has to resort to quantum
mechanics. At the crudest level, we are interested in the concentration of the
reactants, the products, and the intermediate states. In engineering applica-
tions such as combustion where one often has an abundance of participating
species, the dynamics of the concentrations is usually modeled by the re-
action rate equations, which is a system of ODEs for the concentrations
of different species. The rate equations model the result of averaging over
many reaction events. However, in situations when the discrete or stochastic
effects are important, e.g., when there are only a limited number of some
of the crucial species, then one has to model each individual molecule and
reaction event. This is often the case in cell biology where there are usually
a limited number of relevant proteins or RNAs. In this case, the preferred model is a Markov jump process. As we will see, in the large volume
limit, one recovers the reaction rate equations from the jump process.
where c1 , c2 , c3 , c4 are reaction constants and the symbol ∅ means some type
of species which we are not interested in. For example, for the reaction
2S1 → S2 , two S1 -molecules are consumed and one S2 molecule is generated
after each firing of the forward reaction; two S1 -molecules are generated and
one S2 molecule is consumed after each firing of the reverse reaction. This
is an example of a reversible reaction. The other reactions in this system are
irreversible.
Denote by xj the concentration of the species Sj . The law of mass
action states that the rate of reaction is proportional to the reaction constant
and the concentration of the participating molecules [KS09]. This gives us
the following reaction rate equations for the above decaying-dimerization
reaction:
(13.1)  dx_1/dt = −c_1 x_1 − 2c_2 x_1² + 2c_3 x_2,
(13.2)  dx_2/dt = c_2 x_1² − c_3 x_2 − c_4 x_2,
(13.3)  dx_3/dt = c_4 x_2.

The term −2c_2 x_1² + 2c_3 x_2 in (13.1) and the term c_2 x_1² − c_3 x_2 in (13.2) are the consequence of the reversible reaction discussed above. Notice that we have, as a consequence of the reversibility, for this particular reaction,

dx_2/dt + (1/2) dx_1/dt = 0.
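These rate equations can be integrated with any standard ODE solver. A sketch (Python with NumPy; the rate constants below are hypothetical, chosen only for illustration) using the classical fourth-order Runge-Kutta scheme; with c_1 = 0, every remaining reaction conserves the number of S_1 units, so x_1 + 2x_2 + 2x_3 is an invariant the solver should preserve:

```python
import numpy as np

def rhs(x, c):
    # Right-hand side of (13.1)-(13.3): S1 -> 0 (c1), 2 S1 <-> S2 (c2, c3), S2 -> S3 (c4)
    c1, c2, c3, c4 = c
    x1, x2, _ = x
    return np.array([-c1 * x1 - 2 * c2 * x1**2 + 2 * c3 * x2,
                     c2 * x1**2 - c3 * x2 - c4 * x2,
                     c4 * x2])

def rk4(x0, c, T, dt):
    """Classical fourth-order Runge-Kutta integration of the rate equations."""
    x = np.array(x0, dtype=float)
    for _ in range(round(T / dt)):
        k1 = rhs(x, c)
        k2 = rhs(x + 0.5 * dt * k1, c)
        k3 = rhs(x + 0.5 * dt * k2, c)
        k4 = rhs(x + dt * k3, c)
        x = x + (dt / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

# With c1 = 0 the combination x1 + 2 x2 + 2 x3 is conserved; Runge-Kutta
# methods preserve such linear invariants exactly.
xT = rk4([1.0, 0.0, 0.0], c=(0.0, 1.0, 0.5, 0.2), T=10.0, dt=1e-3)
mass = xT[0] + 2 * xT[1] + 2 * xT[2]
```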
Note that the constants kj in (13.6) are generally different from cj in the
reaction rate equations. We will come back to this in Section 13.5.
|ν_j^−| = ∑_{m=1}^N ν_j^{−,m},  ν_j^−! = ∏_{m=1}^N ν_j^{−,m}!,  C(x, ν_j^−) = ∏_{m=1}^N C(x_m, ν_j^{−,m}),

where C(·, ·) denotes the binomial coefficient.
P(x, t + dt | x_0, t_0) = ∑_{j=1}^M P(x − ν_j, t | x_0, t_0) a_j(x − ν_j) dt + (1 − ∑_{j=1}^M a_j(x) dt) P(x, t | x_0, t_0),
(13.9)  X_t = X_0 + ∑_{j=1}^M ν_j P_j(∫_0^t a_j(X_s) ds),

and hence

X_{t+dt} = X_t + ∑_{j=1}^M ν_j (P_j(∫_0^{t+dt} a_j(X_s) ds) − P_j(∫_0^t a_j(X_s) ds)).
From the independence and Markovianity of the P_j(·), we know that the firing probability of the jth reaction channel in [t, t + dt) is exactly a_j(X_t) dt, since X remains equal to X_t before the next jump when dt is an infinitesimal time. This explains why (13.9) makes sense as the SDE describing the chemical reaction kinetics. The readers are referred to [EK86] for rigorous statements.
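The jump process itself can be simulated exactly by Gillespie's stochastic simulation algorithm (SSA): sample the next firing time from an exponential distribution with rate a_0(x) = ∑_j a_j(x) and the channel j with probability a_j/a_0. A sketch for the decaying-dimerization network (Python with NumPy; the rate constants are hypothetical, and the S_1 decay is switched off so that the S_1-unit count n_1 + 2n_2 + 2n_3 is conserved exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Decaying-dimerization network: S1 -> 0, 2 S1 -> S2, S2 -> 2 S1, S2 -> S3.
nu = np.array([[-1, 0, 0],
               [-2, 1, 0],
               [ 2, -1, 0],
               [ 0, -1, 1]])

def propensities(x, k):
    x1, x2, _ = x
    return np.array([k[0] * x1,
                     k[1] * x1 * (x1 - 1) / 2,   # combinatorial count for 2 S1
                     k[2] * x2,
                     k[3] * x2])

def ssa(x0, k, T):
    """Gillespie's direct method: exponential waiting time with rate a0,
    then pick channel j with probability a_j / a0."""
    x, t = np.array(x0, dtype=int), 0.0
    while True:
        a = propensities(x, k)
        a0 = a.sum()
        if a0 == 0:          # absorbing state: no reaction can fire
            return x
        t += rng.exponential(1.0 / a0)
        if t > T:
            return x
        j = rng.choice(len(a), p=a / a0)
        x = x + nu[j]

# S1 decay switched off (k1 = 0), so n1 + 2 n2 + 2 n3 is conserved exactly.
x_final = ssa([100, 0, 0], k=[0.0, 0.02, 0.5, 0.1], T=5.0)
mass = x_final[0] + 2 * x_final[1] + 2 * x_final[2]
```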
Another less direct but more general SDE formulation uses the Poisson random measure:

(13.10)  dX_t = ∑_{j=1}^M ν_j ∫_0^A c_j(a; X_{t−}) λ(dt × da),  X_t|_{t=0} = X_0.

Here A := max_{X_t} {a_0(X_t)}, which is assumed to be finite for a closed system, and λ(dt × da) is called the reference Poisson random measure associated with a Poisson point process (q_t, t ≥ 0) taking values in (0, A]. That is,
λ((s, t] × B) = #{u ∈ (s, t] : q_u ∈ B}

for any Borel set B ⊂ (0, A] and 0 ≤ s < t. It is assumed that λ(dt × da) satisfies E λ(dt × da) = dt × da. The firing rates for the M reaction channels are
defined through the acceptance-rejection functions
(13.12)  c_j(a; X_t) = 1 if a ∈ (h_{j−1}(X_t), h_j(X_t)] and 0 otherwise,  j = 1, 2, …, M,

where h_j(x) = ∑_{i=1}^j a_i(x) for j = 0, 1, …, M (so h_0 = 0). One can see from the above
constructions that X t satisfies the same law for the chemical reaction kinet-
ics. For more on the Poisson random measures and SDEs for jump processes,
see [App04, Li07].
where
(13.14)  a_j^V(x) = ã_j(x) + O(V^{−1}),  ã_j(x) = (κ_j / ν_j^−!) x^{ν_j^−} = κ_j ∏_{m=1}^N x_m^{ν_j^{−,m}} / ν_j^{−,m}!,

and a_j^V(x), ã_j(x) ∼ O(1) if κ_j, x ∼ O(1). For simplicity, we will only consider the case when a_j^V(x) = ã_j(x). The general case can be analyzed similarly.
Consider the rescaled process X_t^V = X_t/V, i.e., the concentration variable, with the original jump intensity a_j(X_t). We have

(13.15)  X_t^V = X_0^V + ∫_0^t b(X_s^V) ds + M_t^V,

where

b(x) := ∑_{j=1}^M ν_j ã_j(x),  M_t^V := V^{−1} ∑_{j=1}^M ν_j P̃_j(V ∫_0^t ã_j(X_s^V) ds),
and P̃j (t) := Pj (t) − t is the centered Poisson process with mean 0. We will
assume that b(x) satisfies the Lipschitz condition: |b(x) − b(y)| ≤ L|x − y|.
Define the deterministic process x(t):

(13.16)  x(t) = x_0 + ∫_0^t b(x(s)) ds.

We have

|X_t^V − x(t)| ≤ |X_0^V − x_0| + ∫_0^t L |X_s^V − x(s)| ds + |M_t^V|.

Using Gronwall's inequality, we obtain

sup_{s∈[0,t]} |X_s^V − x(s)| ≤ (|X_0^V − x_0| + sup_{s∈[0,t]} |M_s^V|) e^{Lt}
for any t > 0. From the law of large numbers for Poisson processes (see Exercise 13.1), we immediately get sup_{s∈[0,t]} |X_s^V − x(s)| → 0 almost surely as V → ∞, provided the initial condition X_0^V → x_0. This is the celebrated large volume limit for chemical reaction kinetics, also known as the Kurtz limit in the literature [Kur72, EK86].
By comparing (13.16) with the reaction rate equations in Section 13.1, we also obtain the relation

(13.17)  c_j = κ_j / ν_j^−! = (k_j / ν_j^−!) V^{|ν_j^−|−1}

between k_j, κ_j, and c_j. This connects the definitions of the rate constants at different scales.
The limit from (13.15) to (13.16) can also be derived using asymptotic
analysis of the backward operator introduced in Section 8.9. To this end,
we first note that the backward operator for the original process X t has the
form
A f(x) = ∑_{j=1}^M a_j(x) (f(x + ν_j) − f(x)).
Under the same assumption as (13.13) and with the ansatz u(x, t) = u0 (x, t)
+ V −1 u1 (x, t) + o(V −1 ), we get, to leading order,
∂_t u_0(x, t) = A_0 u_0 = ∑_{j=1}^M ã_j(x) ν_j · ∇u_0(x, t).
This is the backward equation for the reaction rate equations (13.16).
Y_t^V = Y_0^V + ∑_{j=1}^M ∫_0^t ν_j ∇ã_j(x(s))·Y_s^V ds + ∑_{j=1}^M ν_j W_j(∫_0^t ã_j(x(s)) ds) + h.o.t.
This leads to a central limit theorem for the chemical jump process, Y_t^V ⇒ Y_t, where Y_t satisfies the following linear diffusion process:

(13.19)  dY_t/dt = ∑_{j=1}^M ν_j ∇ã_j(x(t))·Y_t + ∑_{j=1}^M ν_j √(ã_j(x(t))) Ẇ_j(t),  Y_0 = 0,
where we have used the definition Z^V = Z/V. This is the chemical Langevin equation in the chemistry literature [Gil00]. It can be shown that as V → ∞, Z_t^V → x(t) and V^{1/2}(Z_t^V − x(t)) → Y_t on any fixed time interval [0, T]. In this sense, Z^V approximates X^V to the same order as x(t) + V^{−1/2} Y_t.
(13.25)  X_{t+τ} = X_t + ∑_{j=1}^M ν_j P_j(a_j(X_t)τ),

where the P_j(a_j(X_t)τ) are independent Poisson random variables with parameters λ_j = a_j(X_t)τ, respectively. The idea behind the tau-leaping algorithm is that if the reaction rate a_j(X_t) is almost constant during [t, t + τ), then the number of reactions fired in the jth reaction channel will be approximately P_j(a_j(X_t)τ).
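A single tau-leaping update can be sketched as follows (Python with NumPy; the network and the rate constants are hypothetical illustrations, not from the text). Note the well-known caveat that plain tau-leaping can drive populations negative; here we only clip the propensities at zero and rely on the stoichiometry, which conserves the S_1-unit count exactly regardless of the step size:

```python
import numpy as np

rng = np.random.default_rng(1)

# 2 S1 -> S2, S2 -> 2 S1, S2 -> S3 (dimerization network without S1 decay;
# rate constants below are hypothetical).
nu = np.array([[-2, 1, 0],
               [ 2, -1, 0],
               [ 0, -1, 1]])

def prop(x):
    return np.array([0.005 * x[0] * (x[0] - 1) / 2,
                     0.5 * x[1],
                     0.1 * x[1]])

def tau_leap_step(x, tau):
    """One update of (13.25): channel j fires Poisson(a_j(x) * tau) times."""
    a = np.maximum(prop(x), 0.0)   # guard: plain tau-leaping can leave negative states
    return x + nu.T @ rng.poisson(a * tau)

x = np.array([1000, 0, 0])
for _ in range(100):
    x = tau_leap_step(x, tau=0.01)

# Every reaction above conserves the number of S1 units, whatever the step size.
mass = x[0] + 2 * x[1] + 2 * x[2]
```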
To make the tau-leaping idea a practical scheme, one needs a robust stepsize selection strategy. For this purpose, note that the state after tau-leaping is, on average,

X → X + ∑_{j=1}^M ν_j a_j(X)τ := X + τξ.
It can be shown that the invariant distribution is unique and given by the multinomial distribution

p(x) = (n! / (x_1! x_2! ⋯ x_N!)) c_1^{x_1} c_2^{x_2} ⋯ c_N^{x_N},

where x_i is the number of molecules of species S_i, n is the total number of molecules of all species, and c = (c_1, …, c_N) satisfies

c · Q = 0

with the constraint ∑_{i=1}^N c_i = 1. Here Q ∈ R^{N×N} is the generator

Q = [ −k_{1+}    k_{1+}               ⋯  0                             0
      k_{1−}    −(k_{1−} + k_{2+})    ⋯  0                             0
      ⋮          ⋮                    ⋱  ⋮                             ⋮
      0          0                    ⋯  −(k_{(N−2)−} + k_{(N−1)+})    k_{(N−1)+}
      0          0                    ⋯  k_{(N−1)−}                   −k_{(N−1)−} ].
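For a single molecule hopping along the chain (n = 1), the vector c can be computed directly as the left null vector of Q, and for such a birth-death chain it also satisfies detailed balance. A sketch (Python with NumPy; N = 4 and the rate values are hypothetical):

```python
import numpy as np

# Single-molecule chain S_1 <-> S_2 <-> ... <-> S_N with N = 4;
# forward rates kp[i] = k_{i+} and backward rates km[i] = k_{i-} (hypothetical values).
N = 4
kp = np.array([2.0, 1.0, 3.0])
km = np.array([1.0, 4.0, 2.0])

Q = np.zeros((N, N))
for i in range(N - 1):
    Q[i, i + 1] = kp[i]
    Q[i + 1, i] = km[i]
Q -= np.diag(Q.sum(axis=1))        # each row of the generator sums to zero

# Solve c Q = 0 together with sum(c) = 1 (left null vector of Q).
A = np.vstack([Q.T, np.ones(N)])
b = np.concatenate([np.zeros(N), [1.0]])
c, *_ = np.linalg.lstsq(A, b, rcond=None)

# For a birth-death chain the stationary vector satisfies detailed balance:
# c_{i+1} k_{i-} = c_i k_{i+}.
detailed_balance = np.allclose(c[1:] * km, c[:-1] * kp)
```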
k_1 = k_2 = 10^5,  k_3 = k_4 = 1,  k_5 = k_6 = 10^5.
13.9. Multiscale Analysis of a Chemical Kinetic System
Let us partition the reactions into two sets: the slow ones and the fast ones.
The fast reactions Rf = {(af , ν f )} are
(13.36)  ∂_t u(x, t) = ∑_{j=1}^M a_j(x)(u(x + ν_j, t) − u(x, t)) = Lu(x, t),
for all ν_j^f. The set of such vectors forms a linear subspace in R^N. Let {b_1, b_2, …, b_J} be a set of basis vectors of this subspace, and let
(13.41) zj = bj · x for j = 1, . . . , J.
Let z = (z1 , . . . , zJ ) and denote by Z the space over which z is defined.
The stoichiometric vectors associated with these slow variables are naturally
defined as
(13.42) ν̄ si = (b1 · ν si , . . . , bJ · ν si ), i = 1, . . . , Ns .
We now derive the effective dynamics on the slow time scale using singu-
lar perturbation theory. We assume, for the moment, that z is a complete
set of slow variables; i.e., for each fixed value of the slow variable z, the
virtual fast process admits a unique equilibrium distribution μz (x) on the
state space x ∈ X .
The backward Kolmogorov equation for the multiscale chemical kinetic
system reads
(13.43)  ∂_t u = L_0 u + (1/ε) L_1 u,
where L0 and L1 are the infinitesimal generators associated with the slow
and fast reactions, respectively. For any f : X → R,
(13.44)  (L_0 f)(x) = ∑_{j=1}^{M_s} a_j^s(x)(f(x + ν_j^s) − f(x)),
         (L_1 f)(x) = ∑_{j=1}^{M_f} a_j^f(x)(f(x + ν_j^f) − f(x)),
where
(13.53)  ā_i^s(z) = (P a_i^s)(z) = ∑_{x∈X} a_i^s(x) μ_z(x).
Equation (13.53) gives the effective rates on the slow time scale. The ef-
fective reaction kinetics on this time scale are given in terms of the slow
variables by
(13.55)  ∂_t u_0(x, t) = ∑_{i=1}^{M_s} ã_i^s(x)(u_0(x + ν_i^s, t) − u_0(x, t)),
where ãsi (x) = āsi (b · x), in the sense that if u0 (x, t = 0) = f (b · x), then
(13.56) u0 (x, t) = Ū (b · x, t)
where Ū (z, t) solves (13.52) with the initial condition Ū (z, 0) = f (z). Note
that in (13.55), all of the quantities can be formulated over the original state
space using the original variables. This fact is useful for developing numerical algorithms for this kind of problem [CGP05, ELVE05b, ELVE07], which will be introduced in the next section.
From a numerical viewpoint, parallel to the discussion in the last sec-
tion, one would like to have algorithms that automatically capture the slow
dynamics of the system, without the need to precompute the effective dy-
namics or make ad hoc approximations. We will discuss an algorithm pro-
posed in [ELVE05b] which is a simple modification of the original SSA, by
adding a nested structure according to the time scale of the rates. For sim-
ple systems with only two time scales, the nested SSA consists of two SSAs
organized with one nested inside the other: an outer SSA for the slow reac-
tions only, but with modified slow rates which are computed in an inner SSA
that models the fast reactions only. Averaging is performed automatically
“on-the-fly”.
where X_τ^k is the result of the kth replica of the virtual fast process at virtual time τ, whose initial value is X_{t=0}^k = X_n, and T_0 is a parameter chosen to minimize the effect of the transients to equilibrium in the virtual fast process.
- Outer SSA: Run one step of the SSA for the modified slow reactions
R̃s = (ãs , ν s ) to generate (tn+1 , X n+1 ) from (tn , X n ).
- Repeat the above nested procedures until the final time T .
Note that the algorithm is formulated in the original state space. Obvi-
ously it can be extended to the situation with more time scales. Error and
cost analysis can also be performed. See [ELVE05b, ELVE07].
Exercises
13.1. Prove the strong law of large numbers for the unit-rate Poisson process P(t) using Doob's martingale inequality; i.e., for any fixed T > 0,

lim_{n→∞} sup_{s≤T} |n^{−1} P̃(ns)| = 0,  a.s.,
Notes
Chemical reaction kinetics modeled by jump processes has become a fun-
damental language in systems biology. See [ARM98, ELSS02, GHP13,
LX11], etc. In this chapter, we only considered cases with algebraic propen-
sity functions. In more general situations such as the enzymatic reactions,
the propensity functions may take the form of rational functions. The most
common example is the Michaelis-Menten kinetics. Readers may consult
[KS09] for more details.
The tau-leaping algorithm was proposed in 2001 [Gil01]. Since then,
much work has been done to improve its performance and robustness. These
efforts included robust stepsize selection [CGP06], avoiding negative popu-
lations [CVK05, MTV16, TB04], multilevel Monte Carlo methods
[AH12], etc.
h′(0) = 0. We further assume that for any c > 0, there exists b > 0 such that h(x) − h(0) ≤ −b if |x| ≥ c. Assume also that h(x) → −∞ fast enough as |x| → ∞ to ensure the convergence of F for t = 1.
Lemma A.1 (Laplace method). We have the asymptotics

(A.1)  F(t) ∼ √(2π) (−t h″(0))^{−1/2} exp(t h(0))  as t → ∞,

where the equivalence f(t) ∼ g(t) means that lim_{t→∞} f(t)/g(t) = 1.
This formulation is what we will use in the large deviation theory. Its
abstract form in the infinite-dimensional setting is embodied in the so-called
Varadhan’s lemma to be discussed later [DZ98, DS84, Var84].
Proof. Without loss of generality we may assume h(0) = 0. For the Gaussian case h(x) = h″(0)x²/2 with h″(0) < 0, we have

∫_R e^{t h(x)} dx = √(2π) (−t h″(0))^{−1/2}.
R
In general, for any ϵ > 0, there exists δ > 0 such that if |x| ≤ δ,

|h(x) − h″(0)x²/2| ≤ ϵx².

It follows that

∫_{[−δ,δ]} e^{t h(x)} dx ≤ ∫_{[−δ,δ]} exp((tx²/2)(h″(0) + 2ϵ)) dx.

By assumption, for this δ > 0, there exists η > 0 such that h(x) ≤ −η if |x| ≥ δ. Thus

∫_{|x|≥δ} e^{t h(x)} dx ≤ e^{−(t−1)η} ∫_R e^{h(x)} dx ∼ O(e^{−αt}),  α > 0, for t > 1.

On the other hand,

∫_{[−δ,δ]} exp((tx²/2)(h″(0) + 2ϵ)) dx = ∫_R exp((tx²/2)(h″(0) + 2ϵ)) dx − ∫_{|x|≥δ} exp((tx²/2)(h″(0) + 2ϵ)) dx
  = √(2π) (t(−h″(0) − 2ϵ))^{−1/2} + O(e^{−βt}),  β > 0.

We need ϵ < −h″(0)/2 here. The proof of the lower bound is similar. By the arbitrary smallness of ϵ, we have

lim_{t→∞} F(t) / (√(2π)(−t h″(0))^{−1/2}) = 1,

which completes the proof.
Definition A.2 (Large deviation principle). Let X be a complete separable
metric space and let {Pε }ε≥0 be a family of probability measures on the Borel
subsets of X . We say that Pε satisfies the large deviation principle if there
exists a rate functional I : X → [0, ∞] such that:
(i) For any ℓ < ∞, the level set {x : I(x) ≤ ℓ} is compact.
(ii) Upper bound. For each closed set F ⊂ X ,
lim ε ln Pε (F ) ≤ − inf I(x).
ε→0 x∈F
B. Gronwall’s Inequality 281
Theorem A.3 (Varadhan’s lemma). Suppose that Pε satisfies the large de-
viation principle with rate functional I(·) and F ∈ Cb (X ). Then
1
(A.2) lim ε ln exp F (x) Pε (dx) = sup (F (x) − I(x)).
ε→0 X ε x∈X
B. Gronwall’s Inequality
Theorem B.1 (Gronwall’s inequality). Assume that the function f : [0, ∞)
→ R+ satisfies the inequality
t
f (t) ≤ a(t) + b(s)f (s)ds,
0
where a(t), b(t) ≥ 0. Then we have
t
t
(B.1) f (t) ≤ a(t) + a(s)b(s) exp b(u)du ds.
0 s
Proof. Let g(t) = ∫_0^t b(s) f(s) ds. We have

g′(t) ≤ a(t) b(t) + b(t) g(t).

Define h(t) = g(t) exp(−∫_0^t b(s) ds). We obtain
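The bound (B.1) can also be checked numerically. A sketch (Python with NumPy) with constant a(t) ≡ a_0 and b(t) ≡ b_0 (hypothetical values), in which case the bound reduces to f(t) ≤ a_0 e^{b_0 t}:

```python
import numpy as np

# Check (B.1) with constant a(t) = a0 and b(t) = b0: the bound is a0 * exp(b0 * t).
a0, b0, T, n = 1.0, 0.7, 2.0, 2001
t = np.linspace(0.0, T, n)
dt = t[1] - t[0]

# Construct an f satisfying the hypothesis with some slack:
# f(t) = a0 + int_0^t b0 f(s) ds - slack(t), slack(t) >= 0.
f = np.empty(n)
integral = 0.0
for i in range(n):
    f[i] = a0 + integral - 0.3 * np.sin(t[i]) ** 2
    integral += b0 * f[i] * dt    # left-endpoint rule keeps the hypothesis valid

bound = a0 * np.exp(b0 * t)
ok = np.all(f <= bound + 1e-9)
```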
where we assume the arithmetic rules (2.7) on the extended reals R̄+ .
When the set function μ takes values in R̄ = [−∞, ∞] and only the
countable additivity condition is assumed, μ is called a signed measure.
The readers may refer to [Bil79, Cin11, Hal50] for more details.
D. Martingales
Consider the probability space (Ω, F , P) and a filtration {Fn }n∈N or {Ft }t≥0 .
Definition D.1 (Martingale). A continuous time stochastic process Xt is
called an F_t-martingale if X_t ∈ L_ω^1 for any t, X_t is F_t-adapted, and

(D.1)  E(X_t | F_s) = X_s for all s ≤ t.

X_t is called a submartingale or a supermartingale if the above equality is replaced by ≥ or ≤, respectively. These concepts can be defined for discrete time stochastic processes if we replace (D.1) by E(X_n | F_m) = X_m for any m ≤ n; discrete submartingales and supermartingales are defined similarly.
F. Semigroup of Operators
Let B be a Banach space equipped with the norm ‖·‖.
Definition F.1 (Operator semigroup). A family of bounded linear opera-
tors {S(t)}t≥0 : B → B forms a strongly continuous semigroup if for any
f ∈ B:
(i) S(0)f = f ; i.e., S(0) = I.
(ii) S(t)S(s)f = S(s)S(t)f = S(t + s)f for any s, t ≥ 0.
(iii) ‖S(t)f − f‖ → 0 for any f ∈ B as t → 0+.
We will call {S(·)} a contraction semigroup if ‖S(t)‖ ≤ 1 for any t, where ‖S(t)‖ is the operator norm induced by ‖·‖.
The closed graph theorem ensures that (λI − A)−1 is a bounded linear
operator if λ ∈ ρ(A).
Theorem F.6 (Hille-Yosida theorem). Let A be a closed, densely defined
linear operator on B. Then A generates a contraction semigroup {S(t)}t≥0
if and only if
λ ∈ ρ(A) and ‖R_λ‖ ≤ 1/λ for all λ > 0.
Interested readers are referred to [Eva10, Paz83, Yos95] for more de-
tails on semigroup theory.
Bibliography
[AC08] A. Abdulle and S. Cirilli, S-ROCK: Chebyshev methods for stiff stochastic dif-
ferential equations, SIAM J. Sci. Comput. 30 (2008), no. 2, 997–1014, DOI
10.1137/070679375. MR2385896
[ACK10] D. F. Anderson, G. Craciun, and T. G. Kurtz, Product-form stationary distribu-
tions for deficiency zero chemical reaction networks, Bull. Math. Biol. 72 (2010),
no. 8, 1947–1970, DOI 10.1007/s11538-010-9517-4. MR2734052
[AGK11] D. F. Anderson, A. Ganguly, and T. G. Kurtz, Error analysis of tau-leap simula-
tion methods, Ann. Appl. Probab. 21 (2011), no. 6, 2226–2262, DOI 10.1214/10-
AAP756. MR2895415
[AH12] D. F. Anderson and D. J. Higham, Multilevel Monte Carlo for continuous time
Markov chains, with applications in biochemical kinetics, Multiscale Model. Simul.
10 (2012), no. 1, 146–179, DOI 10.1137/110840546. MR2902602
[AL08] A. Abdulle and T. Li, S-ROCK methods for stiff Itô SDEs, Commun. Math. Sci.
6 (2008), no. 4, 845–868. MR2511696
[And86] H. L. Anderson, Metropolis, Monte Carlo and the MANIAC, Los Alamos Science
14 (1986), 96–108.
[App04] D. Applebaum, Lévy processes and stochastic calculus, Cambridge Studies in
Advanced Mathematics, vol. 93, Cambridge University Press, Cambridge, 2004.
MR2072890
[ARLS11] M. Assaf, E. Roberts, and Z. Luthey-Schulten, Determining the stability of ge-
netic switches: Explicitly accounting for mrna noise, Phys. Rev. Lett. 106 (2011),
248102.
[ARM98] A. Arkin, J. Ross, and H. H. McAdams, Stochastic kinetic analysis of developmen-
tal pathway bifurcation in phage lambda-infected Escherichia coli cells, Genetics
149 (1998), 1633–1648.
[Arn89] V. I. Arnold, Mathematical methods of classical mechanics, 2nd ed., translated
from the Russian by K. Vogtmann and A. Weinstein, Graduate Texts in Mathe-
matics, vol. 60, Springer-Verlag, New York, 1989. MR997295
[BB01] P. Baldi and S. Brunak, Bioinformatics, 2nd ed., The machine learning approach;
A Bradford Book, Adaptive Computation and Machine Learning, MIT Press,
Cambridge, MA, 2001. MR1849633
[KS76] J. G. Kemeny and J. L. Snell, Finite Markov chains, reprinting of the 1960 origi-
nal, Undergraduate Texts in Mathematics, Springer-Verlag, New York-Heidelberg,
1976. MR0410929
[KS91] I. Karatzas and S. E. Shreve, Brownian motion and stochastic calculus, 2nd
ed., Graduate Texts in Mathematics, vol. 113, Springer-Verlag, New York, 1991.
MR1121940
[KS07] L. B. Koralov and Y. G. Sinai, Theory of probability and random processes, 2nd
ed., Universitext, Springer, Berlin, 2007. MR2343262
[KS09] J. Keener and J. Sneyd, Mathematical physiology, 2nd ed., Springer-Verlag, New
York, 2009.
[KTH95] R. Kubo, M. Toda, and N. Hashitsume, Statistical physics, vol. 2, Springer-Verlag,
Berlin, Heidelberg and New York, 1995.
[Kub66] R. Kubo, The fluctuation-dissipation theorem, Rep. Prog. Phys. 29 (1966), 255.
[Kur72] T. G. Kurtz, The relation between stochastic and deterministic models for chem-
ical reactions, J. Chem. Phys. 57 (1972), 2976–2978.
[KW08] M. H. Kalos and P. A. Whitlock, Monte Carlo methods, 2nd ed., Wiley-Blackwell,
Weinheim, 2008. MR2503174
[KZW06] S. C. Kou, Q. Zhou, and W. H. Wong, Equi-energy sampler with applica-
tions in statistical inference and statistical mechanics, with discussions and
a rejoinder by the authors, Ann. Statist. 34 (2006), no. 4, 1581–1652, DOI
10.1214/009053606000000515. MR2283711
[Lam77] J. Lamperti, Stochastic processes: A survey of the mathematical theory, Ap-
plied Mathematical Sciences, Vol. 23, Springer-Verlag, New York-Heidelberg, 1977.
MR0461600
[Lax02] P. D. Lax, Functional analysis, Pure and Applied Mathematics (New York),
Wiley-Interscience [John Wiley & Sons], New York, 2002. MR1892228
[LB05] D. P. Landau and K. Binder, A guide to Monte Carlo simulations in statistical
physics, 2nd ed., Cambridge University Press, Cambridge, 2005.
[LBL16] H. Lei, N. A. Baker, and X. Li, Data-driven parameterization of the generalized
Langevin equation, Proc. Natl. Acad. Sci. USA 113 (2016), no. 50, 14183–14188,
DOI 10.1073/pnas.1609587113. MR3600515
[Li07] T. Li, Analysis of explicit tau-leaping schemes for simulating chemically re-
acting systems, Multiscale Model. Simul. 6 (2007), no. 2, 417–436, DOI
10.1137/06066792X. MR2338489
[Liu04] J. S. Liu, Monte Carlo strategies in scientific computing, Springer-Verlag, New
York, 2004.
[LL17] T. Li and F. Lin, Large deviations for two-scale chemical kinetic processes, Com-
mun. Math. Sci. 15 (2017), no. 1, 123–163, DOI 10.4310/CMS.2017.v15.n1.a6.
MR3605551
[LLLL14] C. Lv, X. Li, F. Li, and T. Li, Constructing the energy landscape for genetic
switching system driven by intrinsic noise, PLoS One 9 (2014), e88167.
[LLLL15] C. Lv, X. Li, F. Li, and T. Li, Energy landscape reveals that the budding yeast cell
cycle is a robust and adaptive multi-stage process, PLoS Comp. Biol. 11 (2015),
e1004156.
[LM15] B. Leimkuhler and C. Matthews, Molecular dynamics: With deterministic and
stochastic numerical methods, Interdisciplinary Applied Mathematics, vol. 39,
Springer, Cham, 2015. MR3362507
[Loè77] M. Loève, Probability theory. I, 4th ed., Graduate Texts in Mathematics, Vol. 45,
Springer-Verlag, New York-Heidelberg, 1977. MR0651017
[Lov96] L. Lovász, Random walks on graphs: a survey, Combinatorics, Paul Erdős is eighty,
Vol. 2 (Keszthely, 1993), Bolyai Soc. Math. Stud., vol. 2, János Bolyai Math. Soc.,
Budapest, 1996, pp. 353–397. MR1395866
[LX11] G. Li and X. S. Xie, Central dogma at the single-molecule level in living cells,
Nature 475 (2011), 308–315.
[Mar68] G. Marsaglia, Random numbers fall mainly in the planes, Proc. Nat. Acad. Sci.
U.S.A. 61 (1968), 25–28, DOI 10.1073/pnas.61.1.25. MR0235695
[MD10] D. Mumford and A. Desolneux, Pattern theory: The stochastic analysis of real-
world signals, Applying Mathematics, A K Peters, Ltd., Natick, MA, 2010.
MR2723182
[Mey66] P.-A. Meyer, Probability and potentials, Blaisdell Publishing Co. Ginn and Co.,
Waltham, Mass.-Toronto, Ont.-London, 1966. MR0205288
[MN98] M. Matsumoto and T. Nishimura, Mersenne twister: A 623-dimensionally equidis-
tributed uniform pseudo-random number generator, ACM Trans. Model. Comput.
Simul. 8 (1998), 3–30.
[MP92] E. Marinari and G. Parisi, Simulated tempering: A new Monte Carlo scheme,
Europhys. Lett. 19 (1992), 451–458.
[MP10] P. Mörters and Y. Peres, Brownian motion, with an appendix by Oded Schramm
and Wendelin Werner, Cambridge Series in Statistical and Probabilistic Mathe-
matics, vol. 30, Cambridge University Press, Cambridge, 2010. MR2604525
[MPRV08] U. M. B. Marconi, A. Puglisi, L. Rondoni, and A. Vulpiani, Fluctuation-
dissipation: Response theory in statistical physics, Phys. Rep. 461 (2008), 111–
195.
[MS99] C. D. Manning and H. Schütze, Foundations of statistical natural language pro-
cessing, MIT Press, Cambridge, MA, 1999. MR1722790
[MS01] M. Meila and J. Shi, A random walks view of spectral segmentation, Proceedings
of the Eighth International Workshop on Artificial Intelligence and Statistics (San
Francisco), 2001, pp. 92–97.
[MT93] S. P. Meyn and R. L. Tweedie, Markov chains and stochastic stability, Commu-
nications and Control Engineering Series, Springer-Verlag London, Ltd., London,
1993. MR1287609
[MT04] G. N. Milstein and M. V. Tretyakov, Stochastic numerics for mathematical
physics, Scientific Computation, Springer-Verlag, Berlin, 2004. MR2069903
[MTV16] A. Moraes, R. Tempone, and P. Vilanova, Multilevel hybrid Chernoff tau-leap,
BIT 56 (2016), no. 1, 189–239, DOI 10.1007/s10543-015-0556-y. MR3486459
[Mul56] M. E. Muller, Some continuous Monte Carlo methods for the Dirichlet prob-
lem, Ann. Math. Statist. 27 (1956), 569–589, DOI 10.1214/aoms/1177728169.
MR0088786
[MY71] A. S. Monin and A. M. Yaglom, Statistical fluid mechanics: Mechanics of turbu-
lence, vol. 1, MIT Press, Cambridge, 1971.
[Nor97] J. R. Norris, Markov chains, Cambridge University Press, Cambridge, 1997.
[Oks98] B. Øksendal, Stochastic differential equations: An introduction with applications,
5th ed., Universitext, Springer-Verlag, Berlin, 1998. MR1619188
[Pap77] G. C. Papanicolaou, Introduction to the asymptotic analysis of stochastic equa-
tions, Modern modeling of continuum phenomena (Ninth Summer Sem. Appl.
Math., Rensselaer Polytech. Inst., Troy, N.Y., 1975), Lectures in Appl. Math.,
Vol. 16, Amer. Math. Soc., Providence, R.I., 1977, pp. 109–147. MR0458590
[Pav14] G. A. Pavliotis, Stochastic processes and applications: Diffusion processes, the
Fokker-Planck and Langevin equations, Texts in Applied Mathematics, vol. 60,
Springer, New York, 2014. MR3288096
[Shi92] A. N. Shiryaev (ed.), Selected works of A.N. Kolmogorov, vol. II, Kluwer Academic
Publishers, Dordrecht, 1992.
[Shi96] A. N. Shiryaev, Probability, 2nd ed., translated from the first (1980) Russian
edition by R. P. Boas, Graduate Texts in Mathematics, vol. 95, Springer-Verlag,
New York, 1996. MR1368405
[Sin89] Y. Sinai (ed.), Dynamical systems II: Ergodic theory with applications to dynam-
ical systems and statistical mechanics, Encyclopaedia of Mathematical Sciences,
vol. 2, Springer-Verlag, Berlin and Heidelberg, 1989.
[SM00] J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pat-
tern Anal. Mach. Intel. 22 (2000), 888–905.
[Soa94] P. M. Soardi, Potential theory on infinite networks, Lecture Notes in Mathematics,
vol. 1590, Springer-Verlag, Berlin, 1994. MR1324344
[Sus78] H. J. Sussmann, On the gap between deterministic and stochastic ordinary differ-
ential equations, Ann. Probability 6 (1978), no. 1, 19–41. MR0461664
[SW95] A. Shwartz and A. Weiss, Large deviations for performance analysis, Queues,
communications, and computing, with an appendix by Robert J. Vanderbei, Sto-
chastic Modeling Series, Chapman & Hall, London, 1995. MR1335456
[Szn98] A.-S. Sznitman, Brownian motion, obstacles and random media, Springer Mono-
graphs in Mathematics, Springer-Verlag, Berlin, 1998. MR1717054
[Tai04] K. Taira, Semigroups, boundary value problems and Markov processes, Springer
Monographs in Mathematics, Springer-Verlag, Berlin, 2004. MR2019537
[Tan96] M. A. Tanner, Tools for statistical inference: Methods for the exploration of pos-
terior distributions and likelihood functions, 3rd ed., Springer Series in Statistics,
Springer-Verlag, New York, 1996. MR1396311
[TB04] T. Tian and K. Burrage, Binomial leap methods for simulating stochastic chemical
kinetics, J. Chem. Phys. 121 (2004), 10356–10364.
[TKS95] M. Toda, R. Kubo, and N. Saitô, Statistical physics, vol. 1, Springer-Verlag, Berlin,
Heidelberg and New York, 1995.
[Tou09] H. Touchette, The large deviation approach to statistical mechanics, Phys. Rep.
478 (2009), no. 1-3, 1–69, DOI 10.1016/j.physrep.2009.05.002. MR2560411
[TT90] D. Talay and L. Tubaro, Expansion of the global error for numerical schemes
solving stochastic differential equations, Stochastic Anal. Appl. 8 (1990), no. 4,
483–509 (1991), DOI 10.1080/07362999008809220. MR1091544
[Van83] E. Vanmarcke, Random fields: Analysis and synthesis, MIT Press, Cambridge,
MA, 1983. MR761904
[Var84] S. R. S. Varadhan, Large deviations and applications, CBMS-NSF Regional Con-
ference Series in Applied Mathematics, vol. 46, Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA, 1984. MR758258
[VK04] N. G. Van Kampen, Stochastic processes in physics and chemistry, 2nd ed., Else-
vier, Amsterdam, 2004.
[Wal02] D. F. Walnut, An introduction to wavelet analysis, Applied and Numerical Har-
monic Analysis, Birkhäuser Boston, Inc., Boston, MA, 2002. MR1854350
[Wel03] L. R. Welch, Hidden Markov models and the Baum-Welch algorithm, IEEE Infor.
Theory Soc. Newslett. 53 (2003), 10–13.
[Wie23] N. Wiener, Differential space, J. Math. and Phys. 2 (1923), 131–174.
[Win03] G. Winkler, Image analysis, random fields and Markov chain Monte Carlo meth-
ods, 2nd ed., A mathematical introduction, with 1 CD-ROM (Windows), Stochas-
tic Modelling and Applied Probability, Applications of Mathematics (New York),
vol. 27, Springer-Verlag, Berlin, 2003. MR1950762
[WZ65] E. Wong and M. Zakai, On the convergence of ordinary integrals to stochastic inte-
grals, Ann. Math. Statist. 36 (1965), 1560–1564, DOI 10.1214/aoms/1177699916.
MR0195142
[XK02] D. Xiu and G. E. Karniadakis, The Wiener-Askey polynomial chaos for stochas-
tic differential equations, SIAM J. Sci. Comput. 24 (2002), no. 2, 619–644, DOI
10.1137/S1064827501387826. MR1951058
[YLAVE16] T. Yu, J. Lu, C. F. Abrams, and E. Vanden-Eijnden, Multiscale implementation
of infinite-swap replica exchange molecular dynamics, Proc. Nat. Acad. Sci. USA
113 (2016), 11744–11749.
[Yos95] K. Yosida, Functional analysis, reprint of the sixth (1980) edition, Classics in
Mathematics, Springer-Verlag, Berlin, 1995. MR1336382
[ZD12] J. Zhang and Q. Du, Shrinking dimer dynamics and its applications to sad-
dle point search, SIAM J. Numer. Anal. 50 (2012), no. 4, 1899–1921, DOI
10.1137/110843149. MR3022203
[ZJ07] J. Zinn-Justin, Phase transitions and renormalization group, Oxford Graduate
Texts, Oxford University Press, Oxford, 2007. MR2345069
[ZL16] P. Zhou and T. Li, Construction of the landscape for multi-stable systems: Poten-
tial landscape, quasi-potential, A-type integral and beyond, J. Chem. Phys. 144
(2016), 094109.