
Lecture Notes on Probability Theory

by

Dmitry Panchenko

AMS Open Math Notes: Works in Progress 2016-12-20 09:28:00


Contents

1 Probability spaces.

2 Random variables.

3 Kolmogorov's consistency theorem.

4 Inequalities for sums of independent random variables.

5 Laws of large numbers.

6 0-1 laws, convergence of random series, Kolmogorov's SLLN.

7 Stopping times, Wald's identity, Markov property. Another proof of the SLLN.

8 Convergence of laws. Selection theorem.

9 Characteristic functions.

10 The Central Limit Theorem.

11 Lindeberg's CLT. Three Series Theorem.

12 Conditional expectations and distributions.

13 Martingales. Uniform Integrability.

14 Stopping times.

15 Doob's inequalities and convergence of martingales.

16 Bounded Lipschitz functions.

17 Convergence of laws on metric spaces.

18 Strassen's Theorem. Relationships between metrics.

19 Kantorovich-Rubinstein Theorem.

20 Brunn-Minkowski and Prekopa-Leindler inequalities.

21 Stochastic Processes. Brownian Motion.

22 Donsker Invariance Principle.

23 Convergence of empirical process to Brownian bridge.

24 Reflection principles for Brownian motion.

25 Skorohod's Imbedding and Laws of the Iterated Logarithm.

26 Moment problem and de Finetti's theorem for coin flips.

27 The general de Finetti and Aldous-Hoover representations.

28 The Dovbysh-Sudakov representation.

29 Poisson processes.

Bibliography


Section 1

Probability spaces.

$(\Omega, \mathcal{A}, P)$ is a probability space if $(\Omega, \mathcal{A})$ is a measurable space and $P$ is a measure on $\mathcal{A}$ such that $P(\Omega) = 1$. Let us recall some definitions from measure theory. A pair $(\Omega, \mathcal{A})$ is a measurable space if $\mathcal{A}$ is a $\sigma$-algebra of subsets of $\Omega$. A collection $\mathcal{A}$ of subsets of $\Omega$ is called an algebra if:

(i) $\Omega \in \mathcal{A}$,
(ii) $C, B \in \mathcal{A} \;\Rightarrow\; C \cup B,\ C \cap B \in \mathcal{A}$,
(iii) $B \in \mathcal{A} \;\Rightarrow\; \Omega \setminus B \in \mathcal{A}$.

A collection $\mathcal{A}$ of subsets of $\Omega$ is called a $\sigma$-algebra if it is an algebra and

(iv) $C_i \in \mathcal{A}$ for all $i \geq 1 \;\Rightarrow\; \bigcup_{i \geq 1} C_i \in \mathcal{A}$.

Elements of the $\sigma$-algebra $\mathcal{A}$ are often called events. $P$ is a probability measure on the $\sigma$-algebra $\mathcal{A}$ if

(1) $P(\Omega) = 1$,
(2) $P(A) \geq 0$ for all events $A \in \mathcal{A}$,
(3) $P$ is countably additive: given any disjoint $A_i \in \mathcal{A}$ for $i \geq 1$, i.e. $A_i \cap A_j = \emptyset$ for all $i \neq j$,
\[
P\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} P(A_i).
\]

It is sometimes convenient to use an equivalent formulation of property (3):

(3') $P$ is finitely additive and continuous, i.e. for any decreasing sequence of events $B_n \in \mathcal{A}$, $B_n \supseteq B_{n+1}$,
\[
B = \bigcap_{n \geq 1} B_n \;\Longrightarrow\; P(B) = \lim_{n \to \infty} P(B_n).
\]

Lemma 1 Properties (3) and (3') are equivalent.

Proof. First, let us show that (3) implies (3'). If we denote $C_n = B_n \setminus B_{n+1}$ then $B_n$ is the disjoint union of $\bigcup_{k \geq n} C_k$ and $B$ and, by (3),
\[
P(B_n) = P(B) + \sum_{k \geq n} P(C_k).
\]
Since the last sum is the tail of a convergent series, $\lim_{n\to\infty} P(B_n) = P(B)$. Next, let us show that (3') implies (3). If, given disjoint sets $(A_n)$, we define $B_n = \bigcup_{i \geq n+1} A_i$, then
\[
\bigcup_{i \geq 1} A_i = A_1 \cup A_2 \cup \dots \cup A_n \cup B_n,
\]
and, by finite additivity,
\[
P\Bigl(\bigcup_{i \geq 1} A_i\Bigr) = \sum_{i=1}^{n} P(A_i) + P(B_n).
\]
Clearly, $B_n \supseteq B_{n+1}$ and, since $(A_n)$ are disjoint, $\bigcap_{n \geq 1} B_n = \emptyset$. By (3'), $\lim_{n\to\infty} P(B_n) = 0$ and (3) follows. $\square$

Let us give several examples of probability spaces. One most basic example of a probability space is $([0,1], \mathcal{B}([0,1]), \lambda)$, where $\mathcal{B}([0,1])$ is the Borel $\sigma$-algebra on $[0,1]$ and $\lambda$ is the Lebesgue measure. Let us quickly recall how this measure is constructed. More generally, let us consider the construction of the Lebesgue-Stieltjes measure on $\mathbb{R}$ corresponding to a non-decreasing right-continuous function $F(x)$. One considers an algebra of finite unions of disjoint intervals
\[
\mathcal{A} = \Bigl\{ \bigcup_{i \leq n} (a_i, b_i] \Bigm| n \geq 1, \text{ all } (a_i, b_i] \text{ are disjoint} \Bigr\}
\]
and defines the measure $F$ on the sets in this algebra by (we slightly abuse the notation here)
\[
F\Bigl(\bigcup_{i \leq n} (a_i, b_i]\Bigr) = \sum_{i=1}^{n} \bigl( F(b_i) - F(a_i) \bigr).
\]
It is not difficult to show that $F$ is countably additive on the algebra $\mathcal{A}$, i.e.,
\[
F\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} F(A_i)
\]
whenever all $A_i$ and $\bigcup_{i \geq 1} A_i$ are finite unions of disjoint intervals. The proof is exactly the same as in the case of $F(x) = x$ corresponding to the Lebesgue measure. Once countable additivity is proved on the algebra, it remains to appeal to the following key result. Recall that, given an algebra $\mathcal{A}$, the $\sigma$-algebra $\sigma(\mathcal{A})$ generated by $\mathcal{A}$ is the smallest $\sigma$-algebra that contains $\mathcal{A}$.

Theorem 1 (Carathéodory's extension theorem) If $\mathcal{A}$ is an algebra of sets and $\mu : \mathcal{A} \to \mathbb{R}$ is a non-negative countably additive function on $\mathcal{A}$, then $\mu$ can be extended to a measure on the $\sigma$-algebra $\sigma(\mathcal{A})$. If $\mu$ is $\sigma$-finite, then this extension is unique.

Therefore, $F$ above can be uniquely extended to a measure on the $\sigma$-algebra $\sigma(\mathcal{A})$ generated by the algebra of finite unions of disjoint intervals. This is the $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ of Borel sets on $\mathbb{R}$. Clearly, $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ will be a probability space if

(1) $F(x) = F((-\infty, x])$ is non-decreasing and right-continuous,
(2) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.

The reason we required $F$ to be right-continuous corresponds to our choice that the intervals $(a, b]$ in the algebra are closed on the right, so the two conventions agree and the measure $F$ is continuous, as it should be, e.g.
\[
F\bigl((a, b]\bigr) = F\Bigl(\bigcap_{n \geq 1} (a, b + \tfrac{1}{n}]\Bigr) = \lim_{n \to \infty} F\bigl((a, b + \tfrac{1}{n}]\bigr).
\]
In Probability Theory, functions satisfying properties (1) and (2) above are called cumulative distribution functions, or c.d.f. for short, and we will give an alternative construction of the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ in the next section.

Other basic ways to define a probability are through a probability function or a density function. If the measurable space is such that all singletons are measurable, we can simply assign some weights $p_i$ to a sequence of distinct points $\omega_i \in \Omega$, such that $\sum_{i \geq 1} p_i = 1$, and let
\[
P(A) = \sum_{i :\, \omega_i \in A} p_i.
\]
The function $\omega_i \mapsto p_i = P(\{\omega_i\})$ is called a probability function. Now, suppose that we already have a $\sigma$-finite measure $Q$ on $(\Omega, \mathcal{A})$, and consider any measurable function $f : \Omega \to \mathbb{R}_+$ such that
\[
\int_\Omega f(\omega)\, dQ(\omega) = 1.
\]
Then we can define a probability measure $P$ on $(\Omega, \mathcal{A})$ by
\[
P(A) = \int_A f(\omega)\, dQ(\omega).
\]
The function $f$ is called the density function of $P$ with respect to $Q$ and, in a typical setting when $\Omega = \mathbb{R}^k$ and $Q$ is the Lebesgue measure $\lambda$, $f$ is simply called the density function of $P$.

Examples. (1) The probability measure on $\mathbb{R}$ corresponding to the probability function
\[
p_i = P(\{i\}) = \frac{\lambda^i}{i!} e^{-\lambda}
\]
for integer $i \geq 0$ is called the Poisson distribution with the parameter $\lambda > 0$. (Notation: given a set $A$, we will denote by $I(x \in A)$ or $I_A(x)$ the indicator that $x$ belongs to $A$.) (2) A probability measure on $\mathbb{R}$ corresponding to the density function $f(x) = \alpha e^{-\alpha x} I(x \geq 0)$ is called the exponential distribution with the parameter $\alpha > 0$. (3) A probability measure on $\mathbb{R}$ corresponding to the density function
\[
f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}
\]
is called the standard normal, or standard Gaussian, distribution on $\mathbb{R}$. $\square$
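As a quick numerical sanity check of these three examples (an added sketch, not part of the original notes), the following Python snippet verifies that the Poisson probability function sums to $1$ and that the exponential and Gaussian densities integrate to $1$; the parameter values are arbitrary choices for illustration.

    import math
    import numpy as np

    lam, alpha = 2.5, 1.7  # arbitrary parameters for the Poisson and exponential examples

    # Poisson probability function: p_i = lam^i e^{-lam} / i!
    poisson_total = sum(lam**i * math.exp(-lam) / math.factorial(i) for i in range(200))

    # exponential density alpha * exp(-alpha x) on [0, infinity), truncated for integration
    x = np.linspace(0, 50, 200001)
    exp_total = np.trapz(alpha * np.exp(-alpha * x), x)

    # standard Gaussian density, truncated to [-10, 10]
    y = np.linspace(-10, 10, 200001)
    gauss_total = np.trapz(np.exp(-y**2 / 2) / math.sqrt(2 * math.pi), y)

    print(poisson_total, exp_total, gauss_total)  # all three should be close to 1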

Recall that a measure $P$ is called absolutely continuous with respect to another measure $Q$, $P \ll Q$, if for all $A \in \mathcal{A}$,
\[
Q(A) = 0 \;\Longrightarrow\; P(A) = 0,
\]
in which case the existence of the density is guaranteed by the following classical result from measure theory.

Theorem 2 (Radon-Nikodym) On a measurable space $(\Omega, \mathcal{A})$, let $\mu$ be a $\sigma$-finite measure and $\nu$ be a finite measure absolutely continuous with respect to $\mu$, $\nu \ll \mu$. Then there exists the Radon-Nikodym derivative $h \in L^1(\Omega, \mathcal{A}, \mu)$ such that
\[
\nu(A) = \int_A h(\omega)\, d\mu(\omega)
\]
for all $A \in \mathcal{A}$. Such $h$ is unique modulo $\mu$-a.e. equivalence.

Of course, the Radon-Nikodym theorem also applies to finite signed measures $\nu$, which can be decomposed as $\nu = \nu^+ - \nu^-$ for some finite measures $\nu^{\pm}$, the so-called Hahn-Jordan decomposition. Let us recall a proof of the Radon-Nikodym theorem for convenience.

Proof. Clearly, we can assume that $\mu$ is a finite measure. Consider the Hilbert space $H = L^2(\Omega, \mathcal{A}, \mu + \nu)$ and the linear functional $T : H \to \mathbb{R}$ given by $T(f) = \int f\, d\nu$. Since
\[
\Bigl| \int f\, d\nu \Bigr| \leq \int |f|\, d(\mu + \nu) \leq C \|f\|_H,
\]
$T$ is a continuous linear functional and, by the Riesz-Fréchet theorem, $\int f\, d\nu = \int f g\, d(\mu + \nu)$ for some $g \in H$. This implies $\int f\, d\mu = \int f(1 - g)\, d(\mu + \nu)$. Now $g(\omega) \geq 0$ for $(\mu + \nu)$-almost all $\omega$, which can be seen by taking $f(\omega) = I(g(\omega) < 0)$, and similarly $g(\omega) \leq 1$ for $(\mu + \nu)$-almost all $\omega$. Therefore, we can take $0 \leq g \leq 1$. Let $E = \{\omega : g(\omega) = 1\}$. Then
\[
\mu(E) = \int I(\omega \in E)\, d\mu(\omega) = \int I(\omega \in E)(1 - g(\omega))\, d(\mu + \nu)(\omega) = 0,
\]
and since $\nu \ll \mu$, $\nu(E) = 0$. Since $\int f g\, d\mu = \int f(1 - g)\, d\nu$, we can restrict both integrals to $E^c$, and replacing $f$ by $f/(1-g)$ and denoting $h := g/(1-g)$ (defined on $E^c$) we get $\int_{E^c} f\, d\nu = \int_{E^c} f h\, d\mu$ (more carefully, one can truncate $f/(1-g)$ first and then use the monotone convergence theorem). Therefore, $\nu(A \cap E^c) = \int_{A \cap E^c} h\, d\mu$ and this finishes the proof if we set $h = 0$ on $E$. To prove uniqueness, consider two such $h$ and $h'$ and let $A = \{\omega : h(\omega) > h'(\omega)\}$. Then $0 = \int_A (h - h')\, d\mu$ and, therefore, $\mu(A) = 0$. $\square$

Let us now write down some important properties of $\sigma$-algebras and probability measures.

Lemma 2 (Approximation property) If $\mathcal{A}$ is an algebra of sets then for any $B \in \sigma(\mathcal{A})$ there exists a sequence $B_n \in \mathcal{A}$ such that $\lim_{n\to\infty} P(B \,\triangle\, B_n) = 0$.

Proof. Here $B \,\triangle\, B_n$ denotes the symmetric difference $(B \cup B_n) \setminus (B \cap B_n)$. Let
\[
\mathcal{D} = \bigl\{ B \in \sigma(\mathcal{A}) \bigm| \lim_{n\to\infty} P(B \,\triangle\, B_n) = 0 \text{ for some } B_n \in \mathcal{A} \bigr\}.
\]
We will prove that $\mathcal{D}$ is a $\sigma$-algebra and, since $\mathcal{A} \subseteq \mathcal{D}$, this will imply that $\sigma(\mathcal{A}) \subseteq \mathcal{D}$. One can easily check that
\[
d(B, C) := P(B \,\triangle\, C) = \int |I_B(\omega) - I_C(\omega)|\, dP(\omega)
\]
is a semi-metric, which satisfies

(a) $d(B \cup C, D \cup E) \leq d(B, D) + d(C, E)$,
(b) $|P(B) - P(C)| \leq d(B, C)$,
(c) $d(B^c, C^c) = d(B, C)$.

Now, consider $D_1, D_2, \ldots \in \mathcal{D}$. If a sequence $C_i^n \in \mathcal{A}$ for $n \geq 1$ approximates $D_i$, i.e.
\[
\lim_{n\to\infty} P(C_i^n \,\triangle\, D_i) = 0,
\]
then, by the properties (a)-(c), $C^N_n = \bigcup_{i \leq N} C_i^n$ approximates $D^N = \bigcup_{i \leq N} D_i$, which means that $D^N \in \mathcal{D}$. Let $D = \bigcup_{i \geq 1} D_i$. Since $P(D \setminus D^N) \to 0$ as $N \to \infty$, it is clear that $D \in \mathcal{D}$, so $\mathcal{D}$ is a $\sigma$-algebra. $\square$

Dynkin's theorem. We will now describe a tool, the so-called Dynkin's theorem, or $\pi$-$\lambda$ theorem, which is often quite useful in checking various properties of probabilities.

$\pi$-systems: A collection of sets $\mathcal{P}$ is called a $\pi$-system if it is closed under taking intersections, i.e.

1. if $A, B \in \mathcal{P}$ then $A \cap B \in \mathcal{P}$.

$\lambda$-systems: A collection of sets $\mathcal{L}$ is called a $\lambda$-system if

1. $\Omega \in \mathcal{L}$,
2. if $A \in \mathcal{L}$ then $A^c \in \mathcal{L}$,
3. if $A_n \in \mathcal{L}$ are disjoint for $n \geq 1$ then $\bigcup_{n \geq 1} A_n \in \mathcal{L}$.

Given any collection of sets $\mathcal{C}$, by analogy with the $\sigma$-algebra $\sigma(\mathcal{C})$ generated by $\mathcal{C}$, we will denote by $\mathcal{L}(\mathcal{C})$ the smallest $\lambda$-system that contains $\mathcal{C}$. It is easy to see that the intersection of all $\lambda$-systems that contain $\mathcal{C}$ is again a $\lambda$-system that contains $\mathcal{C}$, so this intersection is precisely $\mathcal{L}(\mathcal{C})$.

Theorem 3 (Dynkin's theorem) If $\mathcal{P}$ is a $\pi$-system, $\mathcal{L}$ is a $\lambda$-system and $\mathcal{P} \subseteq \mathcal{L}$, then $\sigma(\mathcal{P}) \subseteq \mathcal{L}$.

We will give typical examples of application of this result below.

Proof. First of all, it should be obvious that a collection of sets which is both a $\pi$-system and a $\lambda$-system is a $\sigma$-algebra. Therefore, if we can show that $\mathcal{L}(\mathcal{P})$ is a $\pi$-system then it is a $\sigma$-algebra and
\[
\mathcal{P} \subseteq \sigma(\mathcal{P}) \subseteq \mathcal{L}(\mathcal{P}) \subseteq \mathcal{L},
\]
which proves the result. Let us prove that $\mathcal{L}(\mathcal{P})$ is a $\pi$-system. For a fixed set $A \subseteq \Omega$, let us define
\[
\mathcal{G}_A = \bigl\{ B \bigm| B \cap A \in \mathcal{L}(\mathcal{P}) \bigr\}.
\]

Step 1. Let us show that if $A \in \mathcal{L}(\mathcal{P})$ then $\mathcal{G}_A$ is a $\lambda$-system. Obviously, $\Omega \in \mathcal{G}_A$. If $B \in \mathcal{G}_A$ then $B \cap A \in \mathcal{L}(\mathcal{P})$ and, since $A^c \in \mathcal{L}(\mathcal{P})$ is disjoint from $B \cap A$,
\[
B^c \cap A = \bigl( (B \cap A) \cup A^c \bigr)^c \in \mathcal{L}(\mathcal{P}).
\]
This means that $B^c \in \mathcal{G}_A$. Finally, if $B_n \in \mathcal{G}_A$ are disjoint then $B_n \cap A \in \mathcal{L}(\mathcal{P})$ are disjoint and
\[
\Bigl( \bigcup_{n \geq 1} B_n \Bigr) \cap A = \bigcup_{n \geq 1} (B_n \cap A) \in \mathcal{L}(\mathcal{P}),
\]
so $\bigcup_{n \geq 1} B_n \in \mathcal{G}_A$. We showed that $\mathcal{G}_A$ is a $\lambda$-system.

Step 2. Next, let us show that if $A \in \mathcal{P}$ then $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_A$. Since $\mathcal{P} \subseteq \mathcal{L}(\mathcal{P})$, by Step 1, $\mathcal{G}_A$ is a $\lambda$-system. Also, since $\mathcal{P}$ is a $\pi$-system, closed under taking intersections, $\mathcal{P} \subseteq \mathcal{G}_A$. This implies that $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_A$. In other words, we showed that if $A \in \mathcal{P}$ and $B \in \mathcal{L}(\mathcal{P})$ then $A \cap B \in \mathcal{L}(\mathcal{P})$.

Step 3. Finally, let us show that if $B \in \mathcal{L}(\mathcal{P})$ then $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_B$. By Step 2, $\mathcal{G}_B$ contains $\mathcal{P}$ and, by Step 1, $\mathcal{G}_B$ is a $\lambda$-system. Therefore, $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_B$. We showed that if $B \in \mathcal{L}(\mathcal{P})$ and $A \in \mathcal{L}(\mathcal{P})$ then $A \cap B \in \mathcal{L}(\mathcal{P})$, so $\mathcal{L}(\mathcal{P})$ is a $\pi$-system. $\square$

Example. Suppose that $\Omega$ is a topological space with the Borel $\sigma$-algebra $\mathcal{B}$ generated by open sets. Given two probability measures $P_1$ and $P_2$ on $(\Omega, \mathcal{B})$, the collection of sets
\[
\mathcal{L} = \bigl\{ B \in \mathcal{B} \bigm| P_1(B) = P_2(B) \bigr\}
\]
is trivially a $\lambda$-system, by the properties of probability measures. On the other hand, the collection $\mathcal{P}$ of all open sets is a $\pi$-system and, therefore, if we know that $P_1(B) = P_2(B)$ for all open sets then, by Dynkin's theorem, this holds for all Borel sets $B \in \mathcal{B}$. Similarly, one can see that a probability measure on the Borel $\sigma$-algebra on the real line is determined by the probabilities of the sets $(-\infty, t]$ for all $t \in \mathbb{R}$. $\square$

Regularity of measures. Let us now consider the case of $\Omega = S$ where $(S, d)$ is a metric space, and let $\mathcal{A}$ be the Borel $\sigma$-algebra generated by open (or closed) sets. A probability measure $P$ on this space is called closed regular if
\[
P(A) = \sup\bigl\{ P(F) \bigm| F \subseteq A,\ F \text{ closed} \bigr\} \tag{1.0.1}
\]
for all $A \in \mathcal{A}$. Similarly, a probability measure $P$ is called regular if
\[
P(A) = \sup\bigl\{ P(K) \bigm| K \subseteq A,\ K \text{ compact} \bigr\} \tag{1.0.2}
\]
for all $A \in \mathcal{A}$. It is a standard result in measure theory that every finite measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is regular. In the setting of complete separable metric spaces, this is known as Ulam's theorem, which we will prove below.

Theorem 4 Every probability measure $P$ on a metric space $(S, d)$ is closed regular.

Proof. Let us consider the collection of sets
\[
\mathcal{L} = \bigl\{ A \in \mathcal{A} \bigm| \text{both } A \text{ and } A^c \text{ satisfy (1.0.1)} \bigr\}. \tag{1.0.3}
\]
First of all, let us show that each closed set $F \in \mathcal{L}$; we only need to show that the open set $U = F^c$ satisfies (1.0.1). Let us consider the sets
\[
F_n = \bigl\{ s \in S \bigm| d(s, F) \geq 1/n \bigr\}.
\]
It is obvious that all $F_n$ are closed, and $F_n \subseteq F_{n+1}$. One can also easily check that, since $F$ is closed, $\bigcup_{n \geq 1} F_n = U$ and, by the continuity of measure,
\[
P(U) = \lim_{n\to\infty} P(F_n) = \sup_{n \geq 1} P(F_n).
\]
This proves that $U = F^c$ satisfies (1.0.1) and $F \in \mathcal{L}$. Next, one can easily check that $\mathcal{L}$ is a $\lambda$-system, which we will leave as an exercise below. Since the collection of all closed sets is a $\pi$-system, and closed sets generate the Borel $\sigma$-algebra $\mathcal{A}$, by Dynkin's theorem, all measurable sets are in $\mathcal{L}$. This proves that $P$ is closed regular. $\square$

Theorem 5 (Ulam) If $(S, d)$ is a complete separable metric space then every probability measure $P$ on it is regular.

Proof. First, let us show that, for any $\varepsilon > 0$, there exists a compact set $K \subseteq S$ such that $P(S \setminus K) \leq \varepsilon$. Consider a sequence $\{s_1, s_2, \ldots\}$ that is dense in $S$. For any $m \geq 1$, $S = \bigcup_{i=1}^{\infty} B(s_i, \frac{1}{m})$, where $B(s_i, \frac{1}{m})$ is the closed ball of radius $1/m$ centered at $s_i$. By the continuity of measure, for large enough $n(m)$,
\[
P\Bigl( S \setminus \bigcup_{i=1}^{n(m)} B\bigl(s_i, \tfrac{1}{m}\bigr) \Bigr) \leq \frac{\varepsilon}{2^m}.
\]
If we take
\[
K = \bigcap_{m \geq 1} \bigcup_{i=1}^{n(m)} B\bigl(s_i, \tfrac{1}{m}\bigr)
\]
then
\[
P(S \setminus K) \leq \sum_{m \geq 1} \frac{\varepsilon}{2^m} = \varepsilon.
\]
Obviously, by construction, $K$ is closed and totally bounded. Since $S$ is complete, $K$ is compact. By the previous theorem, given $A \in \mathcal{A}$, we can find a closed subset $F \subseteq A$ such that $P(A \setminus F) \leq \varepsilon$. Therefore, $P(A \setminus (F \cap K)) \leq 2\varepsilon$, and since $F \cap K$ is compact, this finishes the proof. $\square$

Exercise. Let $\mathcal{F} = \{F \subseteq \mathbb{N} : \mathbb{N} \setminus F \text{ is finite}\}$ be the collection of all sets in $\mathbb{N}$ with finite complements. $\mathcal{F}$ is a filter, which means that (a) $\emptyset \notin \mathcal{F}$, (b) if $F_1, F_2 \in \mathcal{F}$ then $F_1 \cap F_2 \in \mathcal{F}$, and (c) if $F \in \mathcal{F}$ and $F \subseteq G$ then $G \in \mathcal{F}$. It is well known (by Zorn's lemma) that $\mathcal{F} \subseteq \mathcal{U}$ for some ultrafilter $\mathcal{U}$, which in addition to (a)-(c) also satisfies: (d) for any set $A \subseteq \mathbb{N}$, either $A$ or $\mathbb{N} \setminus A$ is in $\mathcal{U}$. If we define $P(A) = 1$ for $A \in \mathcal{U}$ and $P(A) = 0$ for $A \notin \mathcal{U}$, show that $P$ is finitely additive, but not countably additive, on $2^{\mathbb{N}}$. If $P$ could be interpreted as a probability, suppose that two people pick a number in $\mathbb{N}$ according to $P$, and whoever picks the bigger number wins. If one shows his number first, what is the probability that the other one wins?

Exercise. Suppose that $\mathcal{C}$ is a class of subsets of $\Omega$ and $B \in \sigma(\mathcal{C})$. Show that there exists a countable class $\mathcal{C}_B \subseteq \mathcal{C}$ such that $B \in \sigma(\mathcal{C}_B)$.

Exercise. Check that $\mathcal{L}$ in (1.0.3) is a $\lambda$-system. (In fact, one can check that this is a $\sigma$-algebra.)



Section 2

Random variables.

Let $(\Omega, \mathcal{A}, P)$ be a probability space and $(\mathcal{S}, \mathcal{B})$ be a measurable space, where $\mathcal{B}$ is a $\sigma$-algebra of subsets of $\mathcal{S}$. Recall that a function $X : \Omega \to \mathcal{S}$ is called measurable if, for all $B \in \mathcal{B}$,
\[
X^{-1}(B) = \bigl\{ \omega \bigm| X(\omega) \in B \bigr\} \in \mathcal{A}.
\]
In Probability Theory, such functions are called random variables, especially when $(\mathcal{S}, \mathcal{B}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Depending on the target space $\mathcal{S}$, $X$ may be called a random vector, sequence, or, more generally, a random element in $\mathcal{S}$. Recall that measurability can be checked on the sets that generate the $\sigma$-algebra $\mathcal{B}$ and, in particular, the following holds.

Lemma 3 $X : \Omega \to \mathbb{R}$ is a random variable if and only if, for all $t \in \mathbb{R}$,
\[
\{X \leq t\} := \bigl\{ \omega \bigm| X(\omega) \in (-\infty, t] \bigr\} \in \mathcal{A}.
\]

Proof. Only the "if" direction requires proof. We will prove that
\[
\mathcal{D} = \bigl\{ D \subseteq \mathbb{R} \bigm| X^{-1}(D) \in \mathcal{A} \bigr\}
\]
is a $\sigma$-algebra. Since the sets $(-\infty, t] \in \mathcal{D}$, this will imply that $\mathcal{B}(\mathbb{R}) \subseteq \mathcal{D}$. The fact that $\mathcal{D}$ is a $\sigma$-algebra follows simply because taking the pre-image preserves set operations. For example, if we consider a sequence $D_i \in \mathcal{D}$ for $i \geq 1$ then
\[
X^{-1}\Bigl( \bigcup_{i \geq 1} D_i \Bigr) = \bigcup_{i \geq 1} X^{-1}(D_i) \in \mathcal{A},
\]
because $X^{-1}(D_i) \in \mathcal{A}$ and $\mathcal{A}$ is a $\sigma$-algebra. Therefore, $\bigcup_{i \geq 1} D_i \in \mathcal{D}$. Other properties can be checked similarly, so $\mathcal{D}$ is a $\sigma$-algebra. $\square$
Given a random element $X$ on $(\Omega, \mathcal{A}, P)$ with values in $(\mathcal{S}, \mathcal{B})$, let us denote the image measure on $\mathcal{B}$ by $P_X = P \circ X^{-1}$, which means that, for $B \in \mathcal{B}$,
\[
P_X(B) = P(X \in B) = P(X^{-1}(B)) = P \circ X^{-1}(B).
\]
$(\mathcal{S}, \mathcal{B}, P_X)$ is called the sample space of the random element $X$, and $P_X$ is called the law of $X$, or the distribution of $X$. Clearly, on this space the random variable $\xi : \mathcal{S} \to \mathcal{S}$ defined by the identity $\xi(s) = s$ has the same law as $X$. When $\mathcal{S} = \mathbb{R}$, the function $F(t) = P(X \leq t)$ is called the cumulative distribution function (c.d.f.) of $X$. Clearly, this function satisfies the following properties that already appeared in the previous section:

(1) $F(x) = F((-\infty, x])$ is non-decreasing and right-continuous,
(2) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.

On the other hand, any such function is a c.d.f. of some random variable, for example, of the random variable $X(x) = x$ on the space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ constructed in the previous section, since
\[
P\bigl( x : X(x) \leq t \bigr) = F\bigl( (-\infty, t] \bigr) = F(t).
\]

[Figure 2.1: A random variable defined by the quantile transformation.]

Another construction can be given on the probability space $([0,1], \mathcal{B}([0,1]), \lambda)$ with the Lebesgue measure $\lambda$, using the so-called quantile transformation. Given a c.d.f. $F$, let us define a random variable $X : [0,1] \to \mathbb{R}$ by the quantile transformation (see Figure 2.1):
\[
X(x) = \inf\bigl\{ s \in \mathbb{R} \bigm| F(s) \geq x \bigr\}.
\]
What is the c.d.f. of $X$? Notice that, since $F$ is right-continuous,
\[
X(x) \leq t \;\Longleftrightarrow\; \inf\{ s \mid F(s) \geq x \} \leq t \;\Longleftrightarrow\; \lim_{s \downarrow t} F(s) \geq x \;\Longleftrightarrow\; F(t) \geq x.
\]
This implies that $F$ is the c.d.f. of $X$, since
\[
\lambda\bigl( x : X(x) \leq t \bigr) = \lambda\bigl( x : F(t) \geq x \bigr) = F(t).
\]
This means that, to define the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$, we can start with $([0,1], \mathcal{B}([0,1]), \lambda)$ and let $F = \lambda \circ X^{-1}$ be the image of the Lebesgue measure under the quantile transformation, or the law of $X$ on $\mathbb{R}$. A related inverse property is left as an exercise below.
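As a small added illustration (not in the original notes), the following Python sketch applies the quantile transformation to the exponential c.d.f. $F(s) = 1 - e^{-s}$, for which the quantile transform has the explicit form $X(x) = -\log(1-x)$, and compares the empirical c.d.f. of the resulting sample with $F$; the sample size and checkpoints are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    u = rng.uniform(0.0, 1.0, size=n)   # a sample from the Lebesgue measure on [0, 1]
    x = -np.log1p(-u)                   # quantile transform of F(s) = 1 - exp(-s)

    # compare the empirical c.d.f. of x with F at a few points
    for t in [0.5, 1.0, 2.0]:
        empirical = np.mean(x <= t)
        print(t, empirical, 1 - np.exp(-t))  # the two values should be close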
Given a random element $X : (\Omega, \mathcal{A}) \to (\mathcal{S}, \mathcal{B})$, the $\sigma$-algebra
\[
\sigma(X) = \bigl\{ X^{-1}(B) \bigm| B \in \mathcal{B} \bigr\}
\]
is called the $\sigma$-algebra generated by $X$. It is obvious that this collection of sets is, indeed, a $\sigma$-algebra.

Example. Consider a random variable $X$ on $([0,1], \mathcal{B}([0,1]), \lambda)$ defined by
\[
X(x) = \begin{cases} 0, & 0 \leq x \leq 1/2, \\ 1, & 1/2 < x \leq 1. \end{cases}
\]
Then the $\sigma$-algebra generated by $X$ consists of the sets
\[
\sigma(X) = \Bigl\{ \emptyset,\ \bigl[0, \tfrac{1}{2}\bigr],\ \bigl(\tfrac{1}{2}, 1\bigr],\ [0,1] \Bigr\},
\]
and $P(X = 0) = P(X = 1) = 1/2$. $\square$

Lemma 4 Consider a probability space $(\Omega, \mathcal{A}, P)$, a measurable space $(\mathcal{S}, \mathcal{B})$ and random elements $X : \Omega \to \mathcal{S}$ and $Y : \Omega \to \mathbb{R}$. Then the following are equivalent:

1. $Y = g(X)$ for some (Borel) measurable function $g : \mathcal{S} \to \mathbb{R}$,
2. $Y : \Omega \to \mathbb{R}$ is $\sigma(X)$-measurable.

It should be obvious from the proof that $\mathbb{R}$ can be replaced by any separable metric space.

Proof. The fact that 1 implies 2 is obvious, since for any Borel set $B \subseteq \mathbb{R}$ the set $B' = g^{-1}(B) \in \mathcal{B}$ and, therefore,
\[
\bigl\{ Y = g(X) \in B \bigr\} = \bigl\{ X \in g^{-1}(B) = B' \bigr\} = X^{-1}(B') \in \sigma(X).
\]
Let us show that 2 implies 1. For all integers $n$ and $k$, consider the sets
\[
A_{n,k} = \Bigl\{ \omega : Y(\omega) \in \Bigl[ \frac{k}{2^n}, \frac{k+1}{2^n} \Bigr) \Bigr\} = Y^{-1}\Bigl( \Bigl[ \frac{k}{2^n}, \frac{k+1}{2^n} \Bigr) \Bigr).
\]
By 2, $A_{n,k} \in \sigma(X) = \{ X^{-1}(B) \mid B \in \mathcal{B} \}$ and, therefore, $A_{n,k} = X^{-1}(B_{n,k})$ for some $B_{n,k} \in \mathcal{B}$. Let us consider the function
\[
g_n(x) = \sum_{k \in \mathbb{Z}} \frac{k}{2^n}\, I(x \in B_{n,k}).
\]
By construction, $|Y - g_n(X)| \leq 2^{-n}$, since
\[
Y(\omega) \in \Bigl[ \frac{k}{2^n}, \frac{k+1}{2^n} \Bigr) \;\Longrightarrow\; X(\omega) \in B_{n,k} \;\Longrightarrow\; g_n(X(\omega)) = \frac{k}{2^n}.
\]
It is easy to see that $g_n(x) \leq g_{n+1}(x)$ and, therefore, the limit $g(x) = \lim_{n\to\infty} g_n(x)$ exists and is measurable as a limit of measurable functions. Clearly, $Y = g(X)$. $\square$
Independence. Consider a probability space $(\Omega, \mathcal{A}, P)$. Then $\sigma$-algebras $\mathcal{A}_i \subseteq \mathcal{A}$, $i \leq n$, are independent if
\[
P(A_1 \cap \dots \cap A_n) = \prod_{i \leq n} P(A_i)
\]
for all $A_i \in \mathcal{A}_i$. Similarly, $\sigma$-algebras $\mathcal{A}_i \subseteq \mathcal{A}$ for $i \leq n$ are pairwise independent if
\[
P(A_i \cap A_j) = P(A_i) P(A_j)
\]
for all $A_i \in \mathcal{A}_i$, $A_j \in \mathcal{A}_j$, $i \neq j$. Random variables $X_i : (\Omega, \mathcal{A}) \to (\mathcal{S}, \mathcal{B})$ for $i \leq n$ are independent if the $\sigma$-algebras $\sigma(X_i)$ are independent, which is just another convenient way to state that
\[
P(X_1 \in B_1, \ldots, X_n \in B_n) = P(X_1 \in B_1) \cdots P(X_n \in B_n)
\]
for any events $B_1, \ldots, B_n \in \mathcal{B}$. Pairwise independence is defined similarly.

Example. Consider a regular tetrahedron die, Figure 2.2, with a red, a green and a blue side, and a red-green-blue base.

[Figure 2.2: Pairwise independent, but not independent, random variables.]

If we roll this die then the colors provide an example of pairwise independent random variables that are not independent, since
\[
P(r) = P(b) = P(g) = \frac{1}{2} \quad\text{and}\quad P(rb) = P(rg) = P(bg) = \frac{1}{4},
\]
while
\[
P(rbg) = \frac{1}{4}, \qquad P(r)P(b)P(g) = \Bigl(\frac{1}{2}\Bigr)^3. \qquad \square
\]
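A minimal simulation of this example (an added illustration, not part of the original notes): each roll picks one of the four faces uniformly, and we estimate the single, pairwise and triple probabilities.

    import numpy as np

    rng = np.random.default_rng(1)
    # the four faces: three single-color sides and the red-green-blue base
    faces = [{"r"}, {"g"}, {"b"}, {"r", "g", "b"}]
    rolls = rng.integers(0, 4, size=200_000)

    def prob(colors):
        # empirical probability that every color in `colors` appears on the rolled face
        return np.mean([colors <= faces[k] for k in rolls])

    print(prob({"r"}), prob({"r", "b"}), prob({"r", "g", "b"}))
    # expected: about 1/2, 1/4 and 1/4, while P(r)P(b)P(g) = 1/8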
First of all, independence can be checked on generating algebras.

Lemma 5 If algebras $\mathcal{A}_i$, $i \leq n$, are independent then the $\sigma$-algebras $\sigma(\mathcal{A}_i)$ are independent.

Proof. Obvious by the Approximation Lemma 2. $\square$

A more flexible criterion follows from Dynkin's theorem.

Lemma 6 If the collections of sets $\mathcal{C}_i$ for $i \leq n$ are $\pi$-systems (closed under finite intersections) then their independence implies the independence of the $\sigma$-algebras $\sigma(\mathcal{C}_i)$ they generate.

Proof. Let us consider the collection $\mathcal{C}$ of sets $C \in \mathcal{A}$ such that
\[
P(C \cap C_2 \cap \dots \cap C_n) = P(C) \prod_{i=2}^{n} P(C_i)
\]
for all $C_i \in \mathcal{C}_i$ for $2 \leq i \leq n$. It is obvious that $\mathcal{C}$ is a $\lambda$-system, and it contains $\mathcal{C}_1$ by assumption. Since $\mathcal{C}_1$ is a $\pi$-system, by Dynkin's theorem, $\mathcal{C}$ contains $\sigma(\mathcal{C}_1)$. This means that we can replace $\mathcal{C}_1$ by $\sigma(\mathcal{C}_1)$ in the statement of the lemma and, similarly, we can continue to replace each $\mathcal{C}_i$ by $\sigma(\mathcal{C}_i)$. $\square$

Lemma 7 Consider random variables $X_i : \Omega \to \mathbb{R}$ on a probability space $(\Omega, \mathcal{A}, P)$. (a) The random variables $(X_i)$ are independent if and only if, for all $t_i \in \mathbb{R}$,
\[
P(X_1 \leq t_1, \ldots, X_n \leq t_n) = \prod_{i=1}^{n} P(X_i \leq t_i). \tag{2.0.1}
\]
(b) If the laws of the $X_i$ have densities $f_i$ on $\mathbb{R}$ then these random variables are independent if and only if a joint density $f$ on $\mathbb{R}^n$ of the vector $(X_i)$ exists and
\[
f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i).
\]

Proof. (a) This is obvious by Lemma 6, because the collection of sets $(-\infty, t]$ for $t \in \mathbb{R}$ is a $\pi$-system that generates the Borel $\sigma$-algebra on $\mathbb{R}$.

(b) Let us start with the "if" part. If we denote $X = (X_1, \ldots, X_n)$ then, for any $A_i \in \mathcal{B}(\mathbb{R})$,
\[
P\Bigl( \bigcap_{i=1}^{n} \{X_i \in A_i\} \Bigr) = P(X \in A_1 \times \dots \times A_n) = \int_{A_1 \times \dots \times A_n} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n
= \prod_{i=1}^{n} \int_{A_i} f_i(x_i)\, dx_i = \prod_{i=1}^{n} P(X_i \in A_i).
\]
Next, we prove the "only if" part. First of all, by independence,
\[
P(X \in A_1 \times \dots \times A_n) = \prod_{i=1}^{n} P(X_i \in A_i) = \int_{A_1 \times \dots \times A_n} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n.
\]
We would like to show that this implies that
\[
P(X \in A) = \int_A \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n
\]
for all $A$ in the Borel $\sigma$-algebra on $\mathbb{R}^n$, which would mean that the joint density exists and is equal to the product of the individual densities. One can prove the above equality for all $A \in \mathcal{B}(\mathbb{R}^n)$ by appealing to the Monotone Class Theorem from measure theory, or the Carathéodory Extension Theorem 1, since the above equality, obviously, can be extended from the semi-algebra of measurable rectangles $A_1 \times \dots \times A_n$ to the algebra of disjoint unions of measurable rectangles, which generates the Borel $\sigma$-algebra. However, we can also appeal to Dynkin's theorem, since the family $\mathcal{L}$ of sets $A$ that satisfy the above equality is a $\lambda$-system by the properties of measures and integrals, and it contains the $\pi$-system $\mathcal{P}$ of measurable rectangles $A_1 \times \dots \times A_n$ that generates the Borel $\sigma$-algebra, $\mathcal{B}(\mathbb{R}^n) = \sigma(\mathcal{P})$. $\square$

More generally, a collection of $\sigma$-algebras $\mathcal{A}_t \subseteq \mathcal{A}$ indexed by $t \in T$ for some set $T$ is called independent if any finite subset of these $\sigma$-algebras is independent. Let $T = T_1 \cup \dots \cup T_n$ be a partition of $T$ into disjoint sets. In this case, the following holds.

Lemma 8 (Grouping lemma) The $\sigma$-algebras
\[
\mathcal{B}_i = \sigma\Bigl( \bigcup_{t \in T_i} \mathcal{A}_t \Bigr) = \bigvee_{t \in T_i} \mathcal{A}_t
\]
generated by the subsets of $\sigma$-algebras $(\mathcal{A}_t)_{t \in T_i}$ are independent.

Proof. For each $i \leq n$, consider the collection of sets
\[
\mathcal{C}_i = \Bigl\{ \bigcap_{t \in F} A_t \Bigm| \text{for all finite } F \subseteq T_i \text{ and } A_t \in \mathcal{A}_t \Bigr\}.
\]
It is obvious that $\mathcal{B}_i = \sigma(\mathcal{C}_i)$ since $\mathcal{A}_t \subseteq \mathcal{C}_i$ for all $t \in T_i$, each $\mathcal{C}_i$ is a $\pi$-system, and $\mathcal{C}_1, \ldots, \mathcal{C}_n$ are independent by the definition of independence of the $\sigma$-algebras $\mathcal{A}_t$ for $t \in T$. Using Lemma 6 finishes the proof. (Of course, one should recognize from measure theory that $\mathcal{C}_i$ is a semi-algebra that generates $\mathcal{B}_i$.) $\square$
If we would like to construct finitely many independent random variables $(X_i)_{i \leq n}$ with arbitrary distributions $(P_i)_{i \leq n}$ on $\mathbb{R}$, we can simply consider the space $\Omega = \mathbb{R}^n$ with the product measure
\[
P_1 \times \dots \times P_n
\]
and define a random variable $X_i$ by $X_i(x_1, \ldots, x_n) = x_i$. The main result in the next section will imply that one can construct an infinite sequence of independent random variables with arbitrary distributions on the same probability space, and here we will give a sketch of another construction on the space $([0,1], \mathcal{B}([0,1]), \lambda)$. We will write $P = \lambda$ to emphasize that we think of the Lebesgue measure as a probability.

Step 1. If we write the dyadic decomposition of $x \in [0,1]$,
\[
x = \sum_{n \geq 1} 2^{-n} \varepsilon_n(x),
\]
then it is easy to see that $(\varepsilon_n)_{n \geq 1}$ are independent random variables with the distribution $P(\varepsilon_n = 0) = P(\varepsilon_n = 1) = 1/2$, since for any $n \geq 1$ and any $a_i \in \{0,1\}$,
\[
P\bigl( x : \varepsilon_1(x) = a_1, \ldots, \varepsilon_n(x) = a_n \bigr) = 2^{-n},
\]
since fixing the first $n$ coefficients in the dyadic expansion places $x$ into an interval of length $2^{-n}$.

Step 2. Let us consider injections $k_m : \mathbb{N} \to \mathbb{N}$ for $m \geq 1$ such that their ranges $k_m(\mathbb{N})$ are all disjoint, and let us define
\[
X_m = X_m(x) = \sum_{n \geq 1} 2^{-n} \varepsilon_{k_m(n)}(x).
\]
It is an easy exercise to check that each $X_m$ is well defined and has the uniform distribution on $[0,1]$, which can be seen by looking at the dyadic intervals first. Moreover, by the Grouping Lemma above, the random variables $(X_m)_{m \geq 1}$ are all independent, since they are defined in terms of disjoint groups of independent random variables.

Step 3. Given a sequence of probability distributions $(P_m)_{m \geq 1}$ on $\mathbb{R}$, let $(F_m)_{m \geq 1}$ be the sequence of the corresponding c.d.f.s and let $(Q_m)_{m \geq 1}$ be their quantile transforms. We have seen above that each $Y_m = Q_m(X_m)$ has the distribution $P_m$ on $\mathbb{R}$, and they are obviously independent of each other. Therefore, we have constructed a sequence of independent random variables $Y_m$ on the space $([0,1], \mathcal{B}([0,1]), \lambda)$ with arbitrary distributions $P_m$. $\square$
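To illustrate Steps 1 and 2 (an added sketch, not part of the original notes), the Python code below extracts dyadic digits of a single uniform random variable and reassembles them, along disjoint arithmetic progressions of indices, into several new variables; their empirical means and correlations are consistent with independent uniform distributions on $[0,1]$. The choice $k_m(n) = m + M(n-1)$ with $M = 4$ streams and $13$ digits per stream (to stay within double precision) is arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    N, M, DEPTH = 50_000, 4, 13          # sample size, number of streams, digits per stream
    x = rng.uniform(0.0, 1.0, size=N)

    # dyadic digits eps_1(x), ..., eps_{M*DEPTH}(x)
    digits = np.zeros((N, M * DEPTH), dtype=np.int64)
    frac = x.copy()
    for j in range(M * DEPTH):
        frac = 2 * frac
        digits[:, j] = np.floor(frac).astype(np.int64)
        frac -= digits[:, j]

    # X_m built from the digits with indices k_m(n) = m + M*(n-1)
    weights = 0.5 ** np.arange(1, DEPTH + 1)
    X = np.array([digits[:, m::M] @ weights for m in range(M)])

    print(X.mean(axis=1))           # each should be close to 1/2
    print(np.corrcoef(X).round(3))  # off-diagonal entries should be close to 0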
Expectation. If $X : \Omega \to \mathbb{R}$ is a random variable on $(\Omega, \mathcal{A}, P)$ then the expectation of $X$ is defined as
\[
EX = \int_\Omega X(\omega)\, dP(\omega).
\]
In other words, expectation is just another term for the integral with respect to a probability measure and, as a result, expectation has all the usual properties of integrals in measure theory: convergence theorems, the change of variables formula, Fubini's theorem, etc. Let us write down some special cases of the change of variables formula.


Lemma 9 (1) If $F$ is the c.d.f. (and the law) of $X$ on $\mathbb{R}$ then, for any measurable function $g : \mathbb{R} \to \mathbb{R}$,
\[
E g(X) = \int_{\mathbb{R}} g(x)\, dF(x).
\]
(2) If the distribution of $X$ is discrete, i.e. $P(X \in \{x_i\}_{i \geq 1}) = 1$, then
\[
E g(X) = \sum_{i \geq 1} g(x_i)\, P(X = x_i).
\]
(3) If the distribution of $X : \Omega \to \mathbb{R}^n$ on $\mathbb{R}^n$ has the density function $f(x)$ then, for any measurable function $g : \mathbb{R}^n \to \mathbb{R}$,
\[
E g(X) = \int_{\mathbb{R}^n} g(x) f(x)\, dx.
\]

Proof. All these properties follow by making the change of variables $x = X(\omega)$,
\[
E g(X) = \int_\Omega g(X(\omega))\, dP(\omega) = \int g(x)\, d(P \circ X^{-1})(x) = \int g(x)\, dP_X(x),
\]
where $P_X = P \circ X^{-1}$ is the law of $X$ on $\mathbb{R}$ or $\mathbb{R}^n$. $\square$

Another simple fact is the following.

Lemma 10 If $X, Y : \Omega \to \mathbb{R}$ are independent and $E|X|, E|Y| < \infty$ then $EXY = EX\, EY$.

Proof. Independence implies that the distribution of $(X, Y)$ on $\mathbb{R}^2$ is the product measure $P \times Q$, where $P$ and $Q$ are the distributions of $X$ and $Y$ on $\mathbb{R}$, and, therefore,
\[
EXY = \int_{\mathbb{R}^2} xy\, d(P \times Q)(x, y) = \int_{\mathbb{R}} x\, dP(x) \int_{\mathbb{R}} y\, dQ(y) = EX\, EY,
\]
by the change of variables and Fubini theorems. $\square$

Exercise. If a random variable $X$ has a continuous c.d.f. $F(t)$, show that $F(X)$ is uniform on $[0,1]$, i.e. the law of $F(X)$ is the Lebesgue measure on $[0,1]$.

Exercise. If $F$ is a continuous distribution function, show that $\int F(x)\, dF(x) = 1/2$.

Exercise. $\mathrm{ch}(\lambda)$ is the moment generating function of a random variable $X$ with distribution $P(X = \pm 1) = 1/2$, since
\[
E e^{\lambda X} = \frac{e^{\lambda} + e^{-\lambda}}{2} = \mathrm{ch}(\lambda).
\]
Does there exist a (bounded) random variable $X$ such that $E e^{\lambda X} = \mathrm{ch}^m(\lambda)$ for $0 < m < 1$? (Hint: compute several derivatives at zero.)

Exercise. Consider a measurable function $f : X \times Y \to \mathbb{R}$ and a product probability measure $P \times Q$ on $X \times Y$. For $0 < p \leq q$, prove that
\[
\bigl\| \| f \|_{L^p(P)} \bigr\|_{L^q(Q)} \leq \bigl\| \| f \|_{L^q(Q)} \bigr\|_{L^p(P)},
\]
i.e.
\[
\Bigl( \int \Bigl( \int |f(x,y)|^p\, dP(x) \Bigr)^{q/p} dQ(y) \Bigr)^{1/q} \leq \Bigl( \int \Bigl( \int |f(x,y)|^q\, dQ(y) \Bigr)^{p/q} dP(x) \Bigr)^{1/p}.
\]
Assume that both sides are well-defined, for example, that $f$ is bounded.

Exercise. If the event $A \in \sigma(\mathcal{P})$ is independent of the $\pi$-system $\mathcal{P}$ then $P(A) = 0$ or $1$.

Exercise. Suppose $X$ is a random variable and $g : \mathbb{R} \to \mathbb{R}$ is measurable. Prove that if $X$ and $g(X)$ are independent then $P(g(X) = c) = 1$ for some constant $c$.

Exercise. Suppose that $(e_m)_{1 \leq m \leq n}$ are i.i.d. exponential random variables with the parameter $\alpha > 0$, and let
\[
e_{(1)} \leq e_{(2)} \leq \ldots \leq e_{(n)}
\]
be the order statistics (the random variables arranged in increasing order). Prove that the spacings
\[
e_{(1)},\ e_{(2)} - e_{(1)},\ \ldots,\ e_{(n)} - e_{(n-1)}
\]
are independent exponential random variables, and $e_{(k+1)} - e_{(k)}$ has the parameter $(n-k)\alpha$.

Exercise. Suppose that $(e_n)_{n \geq 1}$ are i.i.d. exponential random variables with the parameter $\alpha = 1$. Let $S_n = e_1 + \ldots + e_n$ and $R_n = S_{n+1}/S_n$ for $n \geq 1$. Prove that $(R_n)_{n \geq 1}$ are independent and $R_n$ has density $n x^{-(n+1)} I(x \geq 1)$. Hint: let $R_0 = e_1$ and compute the joint density of $(R_0, R_1, \ldots, R_n)$ first.

Exercise. Let $N$ be a Poisson random variable with mean $\lambda$, i.e. $P(N = j) = \lambda^j e^{-\lambda}/j!$ for integer $j \geq 0$. Then, consider $N$ i.i.d. random variables, independent of $N$, taking values $1, \ldots, k$ with probabilities $p_1, \ldots, p_k$. Let $N_j$ be the number of these random variables taking value $j$, so that $N_1 + \ldots + N_k = N$. Prove that $N_1, \ldots, N_k$ are independent Poisson random variables with means $\lambda p_1, \ldots, \lambda p_k$.

Additional exercise. Suppose that a measurable subset $P \subseteq [0,1]$ and the interval $I = [a,b] \subseteq [0,1]$ are such that $\lambda(P) = \lambda(I)$, where $\lambda$ is the Lebesgue measure on $[0,1]$. Show that there exists a measure-preserving transformation $T : [0,1] \to [0,1]$, i.e. $\lambda \circ T^{-1} = \lambda$, such that $T(I) \subseteq P$ and $T$ is one-to-one (injective) outside a set of measure zero. (Additional exercises never need to be turned in.)



Section 3

Kolmogorov's consistency theorem.

In this section we will describe a typical way to construct an infinite family of random variables $(X_t)_{t \in T}$ on the same probability space, namely, when we are given all their finite dimensional marginals. This means that for any finite subset $N \subseteq T$, we are given
\[
P_N(B) = P\bigl( (X_t)_{t \in N} \in B \bigr)
\]
for all $B$ in the Borel $\sigma$-algebra $\mathcal{B}_N = \mathcal{B}(\mathbb{R}^N)$. Clearly, these laws must satisfy a natural consistency condition,
\[
P_N(B) = P_M\bigl( B \times \mathbb{R}^{M \setminus N} \bigr), \tag{3.0.1}
\]
for any finite subsets $N \subseteq M$ and any Borel set $B \in \mathcal{B}_N$. (Of course, to be careful, we should define these probabilities for ordered subsets and also make sure they are consistent under rearrangements, but the notation for unordered sets is clear and should not cause any confusion.)

Our goal is to construct a probability space $(\Omega, \mathcal{A}, P)$ and random variables $X_t : \Omega \to \mathbb{R}$ that have $(P_N)$ as their finite dimensional distributions. We take
\[
\Omega = \mathbb{R}^T = \bigl\{ \omega \bigm| \omega : T \to \mathbb{R} \bigr\}
\]
to be the space of all real-valued functions on $T$, and let $X_t$ be the coordinate projection
\[
X_t = X_t(\omega) = \omega(t).
\]
For the coordinate projections to be measurable, the following collection of events,
\[
\mathrm{A} = \bigl\{ B \times \mathbb{R}^{T \setminus N} \bigm| N \subseteq T \text{ finite},\ B \in \mathcal{B}_N \bigr\},
\]
must be contained in the $\sigma$-algebra $\mathcal{A}$. It is easy to see that $\mathrm{A}$ is, in fact, an algebra of sets, and it is called the cylindrical algebra on $\mathbb{R}^T$. We will then take $\mathcal{A} = \sigma(\mathrm{A})$ to be the smallest $\sigma$-algebra on which all coordinate projections are measurable. This is the so-called cylindrical $\sigma$-algebra on $\mathbb{R}^T$. A set $B \times \mathbb{R}^{T \setminus N}$ is called a cylinder. As we already agreed, the probability $P$ on the sets in the algebra $\mathrm{A}$ is given by
\[
P\bigl( B \times \mathbb{R}^{T \setminus N} \bigr) = P_N(B).
\]
Given two finite subsets $N \subseteq M \subseteq T$ and $B \in \mathcal{B}_N$, the same set can be represented as two different cylinders,
\[
B \times \mathbb{R}^{T \setminus N} = \bigl( B \times \mathbb{R}^{M \setminus N} \bigr) \times \mathbb{R}^{T \setminus M}.
\]
However, by the consistency condition, the definition of $P$ does not depend on the choice of the representation. To finish the construction, we need to show that $P$ can be extended from the algebra $\mathrm{A}$ to a probability measure on the $\sigma$-algebra $\mathcal{A}$. By the Carathéodory Extension Theorem 1, we only need to show that the following holds.

Theorem 6 $P$ is countably additive on the cylindrical algebra $\mathrm{A}$.


Proof. Equivalently, it is enough to show that $P$ satisfies the continuity of measure property on $\mathrm{A}$, namely, given a sequence $B_n \in \mathrm{A}$,
\[
B_n \supseteq B_{n+1},\quad \bigcap_{n \geq 1} B_n = \emptyset \;\Longrightarrow\; \lim_{n\to\infty} P(B_n) = 0.
\]
We will prove that if there exists $\varepsilon > 0$ such that $P(B_n) > \varepsilon$ for all $n$ then $\bigcap_{n \geq 1} B_n \neq \emptyset$. We can represent the cylinders $B_n$ as
\[
B_n = C_n \times \mathbb{R}^{T \setminus N_n}
\]
for some finite subsets $N_n \subseteq T$ and $C_n \in \mathcal{B}_{N_n}$. Since $B_n \supseteq B_{n+1}$, we can assume that $N_n \subseteq N_{n+1}$. First of all, by the regularity of probability on a finite dimensional space, there exists a compact set $K_n \subseteq C_n$ such that
\[
P_{N_n}\bigl( C_n \setminus K_n \bigr) \leq \frac{\varepsilon}{2^{n+1}}.
\]
It is easy to see that
\[
\bigcap_{i \leq n} \bigl( C_i \times \mathbb{R}^{T \setminus N_i} \bigr) \setminus \bigcap_{i \leq n} \bigl( K_i \times \mathbb{R}^{T \setminus N_i} \bigr) \subseteq \bigcup_{i \leq n} (C_i \setminus K_i) \times \mathbb{R}^{T \setminus N_i}
\]
and, therefore,
\[
P\Bigl( \bigcap_{i \leq n} C_i \times \mathbb{R}^{T \setminus N_i} \setminus \bigcap_{i \leq n} K_i \times \mathbb{R}^{T \setminus N_i} \Bigr)
\leq P\Bigl( \bigcup_{i \leq n} (C_i \setminus K_i) \times \mathbb{R}^{T \setminus N_i} \Bigr)
\leq \sum_{i \leq n} P\bigl( (C_i \setminus K_i) \times \mathbb{R}^{T \setminus N_i} \bigr) \leq \sum_{i \leq n} \frac{\varepsilon}{2^{i+1}} \leq \frac{\varepsilon}{2}.
\]
Since we assumed that
\[
P(B_n) = P\Bigl( \bigcap_{i \leq n} C_i \times \mathbb{R}^{T \setminus N_i} \Bigr) > \varepsilon,
\]
this implies that
\[
P\Bigl( \bigcap_{i \leq n} K_i \times \mathbb{R}^{T \setminus N_i} \Bigr) \geq \frac{\varepsilon}{2} > 0.
\]
Let us rewrite this intersection as
\[
\bigcap_{i \leq n} K_i \times \mathbb{R}^{T \setminus N_i} = \bigcap_{i \leq n} \bigl( K_i \times \mathbb{R}^{N_n \setminus N_i} \bigr) \times \mathbb{R}^{T \setminus N_n} = \overline{K}_n \times \mathbb{R}^{T \setminus N_n},
\]
where
\[
\overline{K}_n = \bigcap_{i \leq n} \bigl( K_i \times \mathbb{R}^{N_n \setminus N_i} \bigr)
\]
is a compact set in $\mathbb{R}^{N_n}$, since $K_n$ is a compact set in $\mathbb{R}^{N_n}$. We proved that
\[
P_{N_n}(\overline{K}_n) = P\bigl( \overline{K}_n \times \mathbb{R}^{T \setminus N_n} \bigr) = P\Bigl( \bigcap_{i \leq n} K_i \times \mathbb{R}^{T \setminus N_i} \Bigr) \geq \frac{\varepsilon}{2} > 0
\]
and, therefore, there exists a point $\omega^n = (\omega^n(t))_{t \in N_n} \in \overline{K}_n$. By construction, we also have the following inclusion property: for $m > n$,
\[
\omega^m \in \overline{K}_m \subseteq \overline{K}_n \times \mathbb{R}^{N_m \setminus N_n}
\]
and, therefore, $(\omega^m(t))_{t \in N_n} \in \overline{K}_n$. Any sequence on a compact set has a convergent subsequence. Let $(n^1_k)_{k \geq 1}$ be such that
\[
\bigl( \omega^{n^1_k}(t) \bigr)_{t \in N_1} \to \bigl( \omega(t) \bigr)_{t \in N_1} \in \overline{K}_1
\]
as $k \to \infty$. Then we can take a subsequence $(n^2_k)_{k \geq 1}$ of the sequence $(n^1_k)_{k \geq 1}$ such that
\[
\bigl( \omega^{n^2_k}(t) \bigr)_{t \in N_2} \to \bigl( \omega(t) \bigr)_{t \in N_2} \in \overline{K}_2.
\]
Notice that the values of $\omega(t)$ must agree on $N_1$. Iteratively, we can find a subsequence $(n^m_k)_{k \geq 1}$ of $(n^{m-1}_k)_{k \geq 1}$ such that
\[
\bigl( \omega^{n^m_k}(t) \bigr)_{t \in N_m} \to \bigl( \omega(t) \bigr)_{t \in N_m} \in \overline{K}_m.
\]
This proves the existence of a point
\[
\omega \in \bigcap_{n \geq 1} \overline{K}_n \times \mathbb{R}^{T \setminus N_n} \subseteq \bigcap_{n \geq 1} B_n,
\]
so this last set is not empty. This finishes the proof. $\square$

This result gives us another way to construct a sequence $(X_n)_{n \geq 1}$ of independent random variables with given distributions $(P_n)_{n \geq 1}$: as the coordinate projections on the infinite product space $\mathbb{R}^{\mathbb{N}}$ with the cylindrical (product) $\sigma$-algebra $\mathcal{A} = \mathcal{B}^{\otimes \mathbb{N}}$ and the infinite product measure $P = \bigotimes_{n \geq 1} P_n$.
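As an added numerical illustration (not from the original notes), the consistency condition (3.0.1) can be checked by simulation for a concrete family of finite dimensional marginals; here $P_N$ is the law of the first $|N|$ coordinates of an i.i.d. standard Gaussian vector, $B$ is a rectangle in the coordinates $N = \{1,2\}$, and $M = \{1,2,3\}$. The specific sets and sample size are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(3)

    def indicator_B(x1, x2):
        # B = (-1, 0.5] x (-0.3, 2] in the coordinates indexed by N = {1, 2}
        return (x1 > -1) & (x1 <= 0.5) & (x2 > -0.3) & (x2 <= 2)

    # samples from P_N (dimension 2) and from P_M (dimension 3)
    xy = rng.standard_normal(size=(500_000, 2))
    xyz = rng.standard_normal(size=(500_000, 3))

    p_N = np.mean(indicator_B(xy[:, 0], xy[:, 1]))     # estimate of P_N(B)
    p_M = np.mean(indicator_B(xyz[:, 0], xyz[:, 1]))   # estimate of P_M(B x R^{M\N})
    print(p_N, p_M)   # the two estimates agree up to Monte Carlo error, as (3.0.1) requires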

Exercise. Does the set $C([0,1], \mathbb{R})$ of continuous functions on $[0,1]$ belong to the cylindrical $\sigma$-algebra $\mathcal{A}$ on $\mathbb{R}^{[0,1]}$? Hint: an exercise in Section 1 might be helpful.



Section 4

Inequalities for sums of independent random variables.

When we toss an unbiased coin many times, we expect the proportion of heads or tails to be close to one half, a phenomenon called the law of large numbers. More generally, if $(X_n)_{n \geq 1}$ is a sequence of independent identically distributed (i.i.d.) random variables, we expect that their average
\[
\overline{X}_n = \frac{S_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]
is, in some sense, close to the expectation $\mu = EX_1$, assuming that it exists. In the next section, we will prove a general qualitative result of this type, but we begin with more quantitative statements in special cases. Let us begin with the case of i.i.d. Rademacher random variables $(\varepsilon_n)_{n \geq 1}$ such that
\[
P(\varepsilon_n = 1) = P(\varepsilon_n = -1) = \frac{1}{2},
\]
which is, basically, equivalent to tossing a coin with heads and tails replaced by $\pm 1$. We will need the following.

Lemma 11 (Chebyshev's inequality) If a random variable $X \geq 0$ then, for $t > 0$,
\[
P(X \geq t) \leq \frac{EX}{t}.
\]

Proof. This follows from a simple sequence of inequalities,
\[
EX = EX I(X < t) + EX I(X \geq t) \geq EX I(X \geq t) \geq t\, EI(X \geq t) = t\, P(X \geq t),
\]
where we used that $X \geq 0$. $\square$

This inequality is often used in the exponential form,
\[
P(X \geq t) \leq e^{-\lambda t}\, E e^{\lambda X} \quad\text{for } \lambda \geq 0,
\]
which is obvious if we rewrite $\{X \geq t\} = \{e^{\lambda X} \geq e^{\lambda t}\}$. We will use it to prove the following.

Theorem 7 (Hoeffding's inequality) For any $a_1, \ldots, a_n \in \mathbb{R}$ and $t \geq 0$,
\[
P\Bigl( \sum_{i=1}^{n} \varepsilon_i a_i \geq t \Bigr) \leq \exp\Bigl( -\frac{t^2}{2 \sum_{i=1}^{n} a_i^2} \Bigr).
\]

Proof. We begin by writing, for $\lambda \geq 0$,
\[
P\Bigl( \sum_{i=1}^{n} \varepsilon_i a_i \geq t \Bigr) \leq e^{-\lambda t}\, E \exp\Bigl( \lambda \sum_{i=1}^{n} \varepsilon_i a_i \Bigr) = e^{-\lambda t} \prod_{i=1}^{n} E \exp( \lambda \varepsilon_i a_i ),
\]
where in the last step we used Lemma 10. One can easily check the inequality
\[
\frac{e^{x} + e^{-x}}{2} \leq e^{x^2/2},
\]
for example, using the Taylor expansions, and, therefore,
\[
E \exp( \lambda \varepsilon_i a_i ) = \frac{1}{2} e^{\lambda a_i} + \frac{1}{2} e^{-\lambda a_i} \leq e^{\lambda^2 a_i^2 / 2}
\]
and
\[
P\Bigl( \sum_{i=1}^{n} \varepsilon_i a_i \geq t \Bigr) \leq \exp\Bigl( -\lambda t + \frac{\lambda^2}{2} \sum_{i=1}^{n} a_i^2 \Bigr).
\]
Optimizing over $\lambda \geq 0$ finishes the proof. $\square$
We can also apply the same inequality to $(-\varepsilon_i)$ and, combining both cases,
\[
P\Bigl( \Bigl| \sum_{i=1}^{n} \varepsilon_i a_i \Bigr| \geq t \Bigr) \leq 2 \exp\Bigl( -\frac{t^2}{2 \sum_{i=1}^{n} a_i^2} \Bigr).
\]
This implies, for example, that
\[
P\Bigl( \Bigl| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \Bigr| \geq t \Bigr) \leq 2 \exp\Bigl( -\frac{n t^2}{2} \Bigr).
\]
This shows that, no matter how small $t > 0$ is, the probability that the average deviates from the expectation $E\varepsilon_1 = 0$ by more than $t$ decreases exponentially fast with $n$. A small numerical check of this bound is given below.
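A minimal Monte Carlo check of the last display (added here for illustration, not part of the original notes): for Rademacher averages, the empirical tail probability is compared with the bound $2\exp(-nt^2/2)$; the values of $n$ and $t$ are arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)
    n, t, trials = 200, 0.2, 100_000

    eps = rng.choice([-1.0, 1.0], size=(trials, n))   # i.i.d. Rademacher signs
    averages = eps.mean(axis=1)

    empirical = np.mean(np.abs(averages) >= t)
    bound = 2 * np.exp(-n * t**2 / 2)
    print(empirical, bound)   # the empirical frequency should not exceed the Hoeffding bound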
Let us now consider a more general case. For $p, q \in [0,1]$, we consider the function
\[
D(p, q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q},
\]
which is called the Kullback-Leibler divergence. To see that $D(p,q) \geq 0$, with equality only if $p = q$, just use that $\log x \leq x - 1$ with equality only if $x = 1$.

Theorem 8 (Hoeffding-Chernoff inequality) Suppose that the i.i.d. random variables $X_i$ satisfy $0 \leq X_i \leq 1$ and let $\mu = EX_1$. Then
\[
P\Bigl( \frac{1}{n} \sum_{i=1}^{n} X_i \geq \mu + t \Bigr) \leq e^{-n D(\mu + t,\, \mu)}
\]
for any $t \geq 0$ such that $\mu + t \leq 1$.

Proof. Notice that the probability is zero when $\mu + t > 1$, since the average cannot exceed $1$, so $\mu + t \leq 1$ is not really a constraint here. Using the convexity of the exponential function, we can write for $x \in [0,1]$ that
\[
e^{\lambda x} = e^{\lambda (x \cdot 1 + (1-x) \cdot 0)} \leq x e^{\lambda} + (1-x) e^{0} = 1 - x + x e^{\lambda},
\]
which implies that
\[
E e^{\lambda X} \leq 1 - EX + EX e^{\lambda} = 1 - \mu + \mu e^{\lambda}.
\]
Using this, we get the following bound for any $\lambda \geq 0$:
\[
P\Bigl( \sum_{i=1}^{n} X_i \geq n(\mu + t) \Bigr) \leq e^{-\lambda n (\mu + t)} \prod_{i=1}^{n} E e^{\lambda X_i} = e^{-\lambda n (\mu + t)} \bigl( E e^{\lambda X} \bigr)^n
\leq e^{-\lambda n (\mu + t)} \bigl( 1 - \mu + \mu e^{\lambda} \bigr)^n.
\]
The derivative of the right-hand side in $\lambda$ is equal to zero at:
\[
-n(\mu + t)\, e^{-\lambda n (\mu+t)} (1 - \mu + \mu e^{\lambda})^n + n (1 - \mu + \mu e^{\lambda})^{n-1} \mu e^{\lambda} e^{-\lambda n(\mu+t)} = 0,
\]
\[
-(\mu + t)(1 - \mu + \mu e^{\lambda}) + \mu e^{\lambda} = 0,
\]
\[
e^{\lambda} = \frac{(1-\mu)(\mu + t)}{\mu (1 - \mu - t)} \geq 1,
\]
so the critical point $\lambda \geq 0$, as required. Substituting this back into the bound,
\[
P\Bigl( \sum_{i=1}^{n} X_i \geq n(\mu + t) \Bigr)
\leq \Biggl( \Bigl( \frac{\mu(1-\mu-t)}{(1-\mu)(\mu+t)} \Bigr)^{\mu+t} \Bigl( 1 - \mu + \frac{(1-\mu)(\mu+t)}{1-\mu-t} \Bigr) \Biggr)^n
= \Biggl( \Bigl( \frac{\mu}{\mu+t} \Bigr)^{\mu+t} \Bigl( \frac{1-\mu}{1-\mu-t} \Bigr)^{1-\mu-t} \Biggr)^n
\]
\[
= \exp\Bigl( -n \Bigl[ (\mu+t) \log\frac{\mu+t}{\mu} + (1-\mu-t) \log\frac{1-\mu-t}{1-\mu} \Bigr] \Bigr),
\]
which finishes the proof. $\square$

We can also apply this bound to $Z_i = 1 - X_i$, with mean $\mu_Z = 1 - \mu$, to get
\[
P\Bigl( \frac{1}{n} \sum_{i=1}^{n} X_i \leq \mu - t \Bigr) = P\Bigl( \frac{1}{n} \sum_{i=1}^{n} Z_i \geq \mu_Z + t \Bigr) \leq e^{-n D(\mu_Z + t,\, \mu_Z)} = e^{-n D(1 - \mu + t,\, 1 - \mu)}.
\]
These inequalities show that, no matter how small $t > 0$ is, the probability that the average $\overline{X}_n$ deviates from the expectation $\mu$ by more than $t$ in either direction decreases exponentially fast with $n$. Of course, the same conclusion applies to any bounded random variables, $|X_i| \leq M$, by shifting and rescaling the interval $[-M, M]$ into the interval $[0,1]$.
Even though the Hoeffding-Chernoff inequality applies to all bounded random variables, in real-world applications in engineering, computer science, etc., one would like to improve the control of the probability by incorporating other measures of closeness of the random variable $X$ to the mean $\mu$, for example, the variance
\[
\sigma^2 = \mathrm{Var}(X) = E(X - \mu)^2 = EX^2 - (EX)^2.
\]
The following inequality is classical. Let us denote
\[
\phi(x) = (1 + x) \log(1 + x) - x.
\]
We will center the random variables $(X_n)$ and instead work with $Z_n = X_n - \mu$.

Theorem 9 (Bennett's inequality) Let us consider i.i.d. $(Z_n)_{n \geq 1}$ such that $EZ = 0$, $EZ^2 = \sigma^2$ and $|Z| \leq M$. Then, for all $t \geq 0$,
\[
P\Bigl( \frac{1}{n} \sum_{i=1}^{n} Z_i \geq t \Bigr) \leq \exp\Bigl( -\frac{n \sigma^2}{M^2}\, \phi\Bigl( \frac{tM}{\sigma^2} \Bigr) \Bigr).
\]

Proof. As above, using that $(Z_i)$ are i.i.d.,
\[
P\Bigl( \sum_{i=1}^{n} Z_i \geq nt \Bigr) \leq e^{-\lambda n t} \bigl( E e^{\lambda Z} \bigr)^n.
\]
Using the Taylor series expansion and the facts that $EZ = 0$, $EZ^2 = \sigma^2$ and $|Z| \leq M$, we can write
\[
E e^{\lambda Z} = E \sum_{k=0}^{\infty} \frac{(\lambda Z)^k}{k!} = \sum_{k=0}^{\infty} \frac{\lambda^k E Z^k}{k!}
= 1 + \sum_{k=2}^{\infty} \frac{\lambda^k}{k!}\, E Z^2 Z^{k-2}
\leq 1 + \sum_{k=2}^{\infty} \frac{\lambda^k}{k!}\, \sigma^2 M^{k-2}
\]
\[
= 1 + \frac{\sigma^2}{M^2} \sum_{k=2}^{\infty} \frac{\lambda^k M^k}{k!} = 1 + \frac{\sigma^2}{M^2} \bigl( e^{\lambda M} - 1 - \lambda M \bigr)
\leq \exp\Bigl( \frac{\sigma^2}{M^2} \bigl( e^{\lambda M} - 1 - \lambda M \bigr) \Bigr),
\]
where in the last inequality we used $1 + x \leq e^x$. Therefore,
\[
P\Bigl( \sum_{i=1}^{n} Z_i \geq nt \Bigr) \leq \exp\Bigl( n \Bigl( -\lambda t + \frac{\sigma^2}{M^2} \bigl( e^{\lambda M} - 1 - \lambda M \bigr) \Bigr) \Bigr).
\]
Now, we optimize this bound over $\lambda \geq 0$. We find the critical point:
\[
-t + \frac{\sigma^2}{M^2} \bigl( M e^{\lambda M} - M \bigr) = 0,
\qquad
e^{\lambda M} = \frac{tM}{\sigma^2} + 1,
\qquad
\lambda = \frac{1}{M} \log\Bigl( 1 + \frac{tM}{\sigma^2} \Bigr).
\]
Since this $\lambda \geq 0$, plugging it into the above bound,
\[
P\Bigl( \sum_{i=1}^{n} Z_i \geq nt \Bigr)
\leq \exp\Bigl( n \Bigl( -\frac{t}{M} \log\Bigl(1 + \frac{tM}{\sigma^2}\Bigr) + \frac{\sigma^2}{M^2} \Bigl( \frac{tM}{\sigma^2} - \log\Bigl(1 + \frac{tM}{\sigma^2}\Bigr) \Bigr) \Bigr) \Bigr)
\]
\[
= \exp\Bigl( -\frac{n \sigma^2}{M^2} \Bigl( \Bigl(1 + \frac{tM}{\sigma^2}\Bigr) \log\Bigl(1 + \frac{tM}{\sigma^2}\Bigr) - \frac{tM}{\sigma^2} \Bigr) \Bigr)
= \exp\Bigl( -\frac{n \sigma^2}{M^2}\, \phi\Bigl( \frac{tM}{\sigma^2} \Bigr) \Bigr),
\]
which finishes the proof. $\square$
To simplify the bound in Bennett's inequality, one can notice that (we leave it as an exercise)
\[
\phi(x) \geq \frac{x^2}{2(1 + x/3)},
\]
which implies that
\[
P\Bigl( \frac{1}{n} \sum_{i=1}^{n} Z_i \geq t \Bigr) \leq \exp\Bigl( -\frac{n t^2}{2(\sigma^2 + tM/3)} \Bigr).
\]
Combining with the same inequality for $(-Z_i)$, and recalling that $Z_i = X_i - \mu$, we obtain another classical inequality, Bernstein's inequality:
\[
P\Bigl( \Bigl| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \Bigr| \geq t \Bigr) \leq 2 \exp\Bigl( -\frac{n t^2}{2(\sigma^2 + tM/3)} \Bigr).
\]
For small $t$, the denominator is of order $2\sigma^2$, and we get better control of the probability when the variance $\sigma^2$ is small.
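As a small numerical illustration of this point (added here, not from the notes), the snippet below evaluates Bernstein's bound for a fixed deviation $t$ and sample size $n$ at two variance levels, together with the variance-independent bound $2\exp(-nt^2/2)$ valid for Rademacher averages; all numerical values are arbitrary choices.

    import numpy as np

    n, t, M = 1000, 0.05, 1.0

    def bernstein_bound(sigma2):
        # two-sided Bernstein bound for the average of centered variables bounded by M
        return 2 * np.exp(-n * t**2 / (2 * (sigma2 + t * M / 3)))

    print(bernstein_bound(1.0))       # variance 1: comparable to the Hoeffding-type rate
    print(bernstein_bound(0.01))      # small variance: dramatically smaller bound
    print(2 * np.exp(-n * t**2 / 2))  # the Rademacher-average Hoeffding bound, for reference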

Azuma's inequality. So far we have considered sums of independent random variables. As a digression, we will now give one example of a concentration inequality for general functions $f = f(X_1, \ldots, X_n)$ of $n$ independent random variables. We do not assume that these random variables are identically distributed, but we will assume the following stability condition on the function $f$:
\[
\bigl| f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n) \bigr| \leq a_i
\]
for some constants $a_1, \ldots, a_n$. This means that modifying the $i$th coordinate of $f$ can change its value by not more than $a_i$. Let us begin with the following observation.

Lemma 12 If a random variable $X$ satisfies $|X| \leq 1$ and $EX = 0$ then, for any $\lambda \geq 0$,
\[
E e^{\lambda X} \leq e^{\lambda^2 / 2}.
\]

Proof. Let us write $\lambda X$ as a convex combination
\[
\lambda X = \frac{1 + X}{2}\, \lambda + \frac{1 - X}{2}\, (-\lambda),
\]
where $(1+X)/2, (1-X)/2 \in [0,1]$ and their sum is equal to one. By the convexity of the exponential function,
\[
e^{\lambda X} \leq \frac{1 + X}{2}\, e^{\lambda} + \frac{1 - X}{2}\, e^{-\lambda},
\]
and taking expectations we get $E e^{\lambda X} \leq \mathrm{ch}(\lambda) \leq e^{\lambda^2/2}$. $\square$

Using this, we will now prove the following analogue of Hoeffding's inequality.

Theorem 10 (Azuma's inequality) Under the above stability condition, for any $t \geq 0$,
\[
P\bigl( f - Ef \geq t \bigr) \leq \exp\Bigl( -\frac{t^2}{2 \sum_{i=1}^{n} a_i^2} \Bigr).
\]

Proof. For $i = 1, \ldots, n$, let $E_i$ denote the expectation in $X_{i+1}, \ldots, X_n$ with the random variables $X_1, \ldots, X_i$ fixed. One can think of $(X_1, \ldots, X_n)$ as defined on a product space with the product measure, and $E_i$ denotes the integration over the last $n - i$ coordinates. Let us denote $Y_i = E_i f - E_{i-1} f$ and note that $E_n f = f$ and $E_0 f = Ef$. Then we can write $f - Ef = \sum_{i=1}^{n} Y_i$ (this is called a martingale-difference representation) and, as before, for $\lambda \geq 0$,
\[
P\bigl( f - Ef \geq t \bigr) = P\Bigl( \sum_{i=1}^{n} Y_i \geq t \Bigr) \leq e^{-\lambda t}\, E e^{\lambda(Y_1 + \ldots + Y_n)}.
\]
Notice that $E_{i-1} Y_i = E_{i-1} f - E_{i-1} f = 0$. Also, the stability condition implies that $|Y_i| \leq a_i$. Since $Y_1, \ldots, Y_{n-1}$ do not depend on $X_n$ (only $Y_n$ does), if we average in $X_n$ first, we can write
\[
E e^{\lambda(Y_1 + \ldots + Y_n)} = E\bigl( e^{\lambda(Y_1 + \ldots + Y_{n-1})}\, E_{n-1} e^{\lambda Y_n} \bigr).
\]
If we apply the previous lemma to $Y_n / a_n$ viewed as a function of $X_n$, we get
\[
E_{n-1} e^{\lambda Y_n} = E_{n-1} e^{\lambda a_n (Y_n / a_n)} \leq e^{\lambda^2 a_n^2 / 2}
\]
and, therefore, $E e^{\lambda(Y_1 + \ldots + Y_n)} \leq e^{\lambda^2 a_n^2 / 2}\, E e^{\lambda(Y_1 + \ldots + Y_{n-1})}$. Proceeding by induction on $n$, we get $E e^{\lambda(Y_1 + \ldots + Y_n)} \leq e^{\frac{\lambda^2}{2} \sum_{i=1}^{n} a_i^2}$ and
\[
P\Bigl( \sum_{i=1}^{n} Y_i \geq t \Bigr) \leq \exp\Bigl( -\lambda t + \frac{\lambda^2}{2} \sum_{i=1}^{n} a_i^2 \Bigr).
\]
Optimizing over $\lambda \geq 0$ finishes the proof. $\square$

Notice that in the above proof we did not use the fact that the $X_i$'s are real-valued random variables. They could be random vectors or arbitrary random elements taking values in some measurable spaces. We only used the assumption that they are independent. Keeping this in mind, let us give one example of an application of Azuma's inequality.

Example. Consider an Erdős-Rényi random graph $G(n, p)$ on $n$ vertices, where each edge is present with probability $p$ independently of the other edges. Let $f = \chi(G(n, p))$ be the chromatic number of this graph, which is the smallest number of colors needed to color the vertices so that no two adjacent vertices share the same color. Let us denote the vertices by $v_1, \ldots, v_n$ and let $X_i$ denote the randomness in the set of possible edges between the vertex $v_i$ and $v_1, \ldots, v_{i-1}$. In other words, $X_i = (X_1^i, \ldots, X_{i-1}^i)$ where $X_k^i$ is $1$ if the edge between $v_k$ and $v_i$ is present and $0$ otherwise. Notice that the vectors $X_1, \ldots, X_n$ are independent and the chromatic number is clearly a function $f = f(X_1, \ldots, X_n)$. To apply Azuma's inequality, we need to determine the stability constants $a_1, \ldots, a_n$. Observe that changing the set of edges connected to one vertex $v_i$ can only affect the chromatic number by at most $1$ because, in the worst case, we can assign a new color to this vertex. This means that $a_i = 1$ and Azuma's inequality implies that
\[
P\bigl( |\chi(G(n,p)) - E\chi(G(n,p))| \geq t \bigr) \leq 2 e^{-\frac{t^2}{2n}}
\]
(we simply apply the inequality to $f$ and $-f$ here). For example, if we take $t = \sqrt{2 n \log n}$, we get the bound $2/n$, which means that with high probability the chromatic number will be within $\sqrt{2 n \log n}$ of its expected value $E\chi(G(n,p))$. When $p$ is fixed, it is known (but non-trivial) that this expected value is close to $c_p\, n / \log n$ with $c_p = \frac{1}{2} \log\frac{1}{1-p}$, so the deviation $\sqrt{2 n \log n}$ is of much smaller order.
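The following added Python sketch (not part of the notes) illustrates the concentration phenomenon by simulation. Since computing the exact chromatic number is expensive, it uses the number of colors produced by a simple greedy coloring as a stand-in; this is only an upper bound on $\chi$, so the experiment merely shows that a related graph functional of $G(n,p)$ has a small spread around its mean, in the spirit of the Azuma-type bound above. All parameters are arbitrary.

    import numpy as np

    rng = np.random.default_rng(5)

    def greedy_colors(adj):
        # number of colors used by greedy coloring in the natural vertex order
        n = adj.shape[0]
        color = -np.ones(n, dtype=int)
        for v in range(n):
            used = set(color[u] for u in range(v) if adj[v, u])
            c = 0
            while c in used:
                c += 1
            color[v] = c
        return color.max() + 1

    def sample_gnp(n, p):
        # adjacency matrix of an Erdos-Renyi graph G(n, p)
        upper = rng.random((n, n)) < p
        adj = np.triu(upper, k=1)
        return adj | adj.T

    n, p, trials = 60, 0.5, 200
    values = np.array([greedy_colors(sample_gnp(n, p)) for _ in range(trials)])
    print(values.mean(), values.std())   # the spread is small compared to the mean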

Exercise. (Hoeffding-Chernoff's inequality) Prove that for $0 < \mu \leq 1/2$ and $0 \leq t < \mu$,
\[
D(1 - \mu + t,\, 1 - \mu) \geq \frac{t^2}{2\mu(1 - \mu)}.
\]
Hint: compare second derivatives.

Exercise. Let $X_1, \ldots, X_n$ be independent flips of a fair coin, i.e. $P(X_i = 0) = P(X_i = 1) = 1/2$. If $\overline{X}_n$ is their average, show that for $t \geq 0$,
\[
P\bigl( |\overline{X}_n - 1/2| > t \bigr) \leq 2 e^{-2 n t^2}.
\]
Hint: use the previous problem twice.

Exercise. Suppose that the random variables $X_1, \ldots, X_n, X_1', \ldots, X_n'$ are independent and, for all $i \leq n$, $X_i$ and $X_i'$ have the same distribution. Prove that
\[
P\Bigl( \sum_{i=1}^{n} (X_i - X_i') \geq \sqrt{2}\, t \Bigl( \sum_{i=1}^{n} (X_i - X_i')^2 \Bigr)^{1/2} \Bigr) \leq e^{-t^2}.
\]
Hint: think about a way to introduce Rademacher random variables $\varepsilon_i$ into the problem and then use Hoeffding's inequality.

Exercise. (Bernstein's inequality) Let $X_1, \ldots, X_n$ be i.i.d. random variables such that $|X_i| \leq M$, $EX = \mu$ and $\mathrm{Var}(X) = \sigma^2$. If $\overline{X}_n$ is their average, make a change of variables in Bernstein's inequality to show that, for $t > 0$,
\[
P\Bigl( \overline{X}_n \geq \mu + \sqrt{\frac{2 \sigma^2 t}{n}} + \frac{2Mt}{3n} \Bigr) \leq e^{-t}.
\]


Section 5

Laws of large numbers.

In this section, we will study two types of convergence of the average to the mean: in probability and almost surely. Consider a sequence of random variables $(Y_n)_{n \geq 1}$ on some probability space $(\Omega, \mathcal{A}, P)$. We say that $Y_n$ converges in probability to a random variable $Y$ (and write $Y_n \xrightarrow{\,p\,} Y$) if, for all $\varepsilon > 0$,
\[
\lim_{n\to\infty} P(|Y_n - Y| \geq \varepsilon) = 0.
\]
We say that $Y_n$ converges to $Y$ almost surely, or with probability $1$, if
\[
P\bigl( \omega : \lim_{n\to\infty} Y_n(\omega) = Y(\omega) \bigr) = 1.
\]
Let us begin with an easier case.

Theorem 11 (Weak law of large numbers) Consider a sequence of random variables $(X_n)_{n \geq 1}$ that are centered, $EX_n = 0$, have uniformly bounded second moments, $EX_n^2 \leq K < \infty$, and are uncorrelated, $EX_i X_j = 0$ for $i \neq j$. Then
\[
\overline{X}_n = \frac{1}{n} \sum_{i \leq n} X_i \to 0
\]
in probability.

Proof. By Chebyshev's inequality, also using that $EX_i X_j = 0$ for $i \neq j$,
\[
P\bigl( |\overline{X}_n - 0| \geq \varepsilon \bigr) = P\bigl( \overline{X}_n^2 \geq \varepsilon^2 \bigr) \leq \frac{E \overline{X}_n^2}{\varepsilon^2}
= \frac{1}{n^2 \varepsilon^2}\, E(X_1 + \dots + X_n)^2 = \frac{1}{n^2 \varepsilon^2} \sum_{i=1}^{n} EX_i^2 \leq \frac{nK}{n^2 \varepsilon^2} = \frac{K}{n \varepsilon^2} \to 0
\]
as $n \to \infty$, which finishes the proof. $\square$

Of course, if $(X_n)_{n \geq 1}$ are independent and centered then they are automatically uncorrelated, since
\[
EX_i X_j = EX_i\, EX_j = 0.
\]
Before we move on to almost sure convergence, let us give one more application of the above argument to the problem of approximation of continuous functions. Consider an i.i.d. sequence $(X_n)_{n \geq 1}$ with a distribution $P_\theta$ on $\mathbb{R}$ that depends on some parameter $\theta \in \mathbb{R}$, and suppose that
\[
EX_i = \theta, \qquad \mathrm{Var}(X_i) = \sigma^2(\theta).
\]
Then the following holds.

Theorem 12 If $u : \mathbb{R} \to \mathbb{R}$ is continuous and bounded and $\sigma^2(\theta)$ is bounded on compacts then
\[
E u(\overline{X}_n) \to u(\theta)
\]
uniformly on compacts.

Proof. For any $\varepsilon > 0$, we can write
\[
|E u(\overline{X}_n) - u(\theta)| \leq E|u(\overline{X}_n) - u(\theta)|
= E\bigl[ |u(\overline{X}_n) - u(\theta)| \bigl( I(|\overline{X}_n - \theta| \leq \varepsilon) + I(|\overline{X}_n - \theta| > \varepsilon) \bigr) \bigr]
\leq \max_{|x - \theta| \leq \varepsilon} |u(x) - u(\theta)| + 2 \|u\|_\infty\, P(|\overline{X}_n - \theta| > \varepsilon).
\]
The last probability can be bounded as in the previous theorem,
\[
P(|\overline{X}_n - \theta| > \varepsilon) \leq \frac{\sigma^2(\theta)}{n \varepsilon^2},
\]
and, therefore,
\[
|E u(\overline{X}_n) - u(\theta)| \leq \max_{|x - \theta| \leq \varepsilon} |u(x) - u(\theta)| + \frac{2 \|u\|_\infty\, \sigma^2(\theta)}{n \varepsilon^2}.
\]
The statement of the theorem should now be obvious. $\square$
Example. Let $(X_i)$ be i.i.d. with the Bernoulli distribution $B(\theta)$ with probability of success $\theta \in [0,1]$,
\[
P(X_i = 1) = \theta, \qquad P(X_i = 0) = 1 - \theta,
\]
and let $u : [0,1] \to \mathbb{R}$ be continuous. Then, by the above theorem, the following linear combination of the Bernstein polynomials,
\[
B_n(\theta) := E u(\overline{X}_n) = \sum_{k=0}^{n} u\Bigl( \frac{k}{n} \Bigr) P\Bigl( \sum_{i=1}^{n} X_i = k \Bigr) = \sum_{k=0}^{n} u\Bigl( \frac{k}{n} \Bigr) \binom{n}{k} \theta^k (1 - \theta)^{n-k},
\]
approximates $u(\theta)$ uniformly on $[0,1]$. This gives an explicit example of polynomials in the Weierstrass theorem that approximate a continuous function on $[0,1]$. $\square$
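An added numerical sketch (not from the original notes): the Python code below evaluates $B_n(\theta)$ for the test function $u(x) = |x - 1/2|$ and reports the maximum error over a grid of $\theta$, which shrinks as $n$ grows. The test function and grid are arbitrary choices.

    import numpy as np
    from math import comb

    def bernstein(u, n, thetas):
        # B_n(theta) = sum_k u(k/n) C(n,k) theta^k (1-theta)^(n-k)
        k = np.arange(n + 1)
        coeffs = np.array([comb(n, j) for j in k], dtype=float)
        vals = u(k / n)
        return np.array([np.sum(vals * coeffs * th**k * (1 - th)**(n - k)) for th in thetas])

    u = lambda x: np.abs(x - 0.5)
    thetas = np.linspace(0, 1, 201)
    for n in (10, 100, 1000):
        err = np.max(np.abs(bernstein(u, n, thetas) - u(thetas)))
        print(n, err)   # the maximum error decreases as n increases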
Example. Suppose $(X_i)$ have the Poisson distribution $\pi(\theta)$ with the parameter $\theta > 0$,
\[
P_\theta(X_i = k) = \frac{\theta^k}{k!}\, e^{-\theta} \quad\text{for integer } k \geq 0.
\]
Then it is well known (and easy to check) that $EX_i = \theta$, $\sigma^2(\theta) = \theta$, and the sum $X_1 + \ldots + X_n$ has the Poisson distribution $\pi(n\theta)$. Therefore, if $u$ is bounded and continuous on $[0, +\infty)$ then
\[
E u(\overline{X}_n) = \sum_{k=0}^{\infty} u\Bigl( \frac{k}{n} \Bigr) P\Bigl( \sum_{i=1}^{n} X_i = k \Bigr) = \sum_{k=0}^{\infty} u\Bigl( \frac{k}{n} \Bigr) \frac{(n\theta)^k}{k!}\, e^{-n\theta} \to u(\theta)
\]
uniformly on compact sets. $\square$
Before we turn to the almost sure convergence results, let us note that convergence in probability is weaker than a.s. convergence. For example, consider a probability space which is a circle of circumference $1$ with the uniform measure on it. Consider a sequence of random variables on this probability space defined by
\[
X_k(x) = I\Bigl( x \in \Bigl[ 1 + \frac{1}{2} + \dots + \frac{1}{k},\ 1 + \frac{1}{2} + \dots + \frac{1}{k} + \frac{1}{k+1} \Bigr] \bmod 1 \Bigr).
\]
Then $X_k \to 0$ in probability, since for $0 < \varepsilon < 1$,
\[
P(|X_k - 0| \geq \varepsilon) \leq \frac{1}{k+1} \to 0.
\]
However, $X_k$ does not have an almost sure limit, because the series $\sum_{k \geq 1} 1/k$ diverges and, as a result, each point $x$ on the circle will fall into the above intervals infinitely many times, i.e. it will satisfy $X_k(x) = 1$ for infinitely many $k$. After a short simulation sketch of this example below, we begin with the following lemma.
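Added illustration (not in the notes): the Python snippet below tracks, for a few fixed points $x$ on the circle, how often $X_k(x) = 1$ among $k \leq 10^5$, and also estimates $P(X_k = 1)$ for a moderately large $k$; the fixed points are arbitrary.

    import numpy as np

    K = 100_000
    partial = np.cumsum(1.0 / np.arange(1, K + 2))   # s_k = 1 + 1/2 + ... + 1/k, k = 1, ..., K+1

    def X(k, x):
        # indicator that x falls in the arc [s_k, s_{k+1}] taken mod 1, of length 1/(k+1)
        a, b = partial[k - 1] % 1.0, partial[k] % 1.0
        return (a <= x <= b) if a <= b else (x >= a or x <= b)

    points = [0.1, 0.37, 0.82]
    hits = [sum(X(k, x) for k in range(1, K + 1)) for x in points]
    print(hits)   # each fixed point keeps getting hit as k grows

    xs = np.random.default_rng(6).uniform(0, 1, 20_000)
    print(np.mean([X(1000, x) for x in xs]))   # estimates P(X_1000 = 1), about 1/1001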

Lemma 13 Consider a sequence $(p_i)_{i \geq 1}$ such that $p_i \in [0, 1)$. Then
\[
\prod_{i \geq 1} (1 - p_i) = 0 \;\Longleftrightarrow\; \sum_{i \geq 1} p_i = +\infty.
\]

Proof. "$\Leftarrow$". Using that $1 - p \leq e^{-p}$ we get
\[
\prod_{i \leq n} (1 - p_i) \leq \exp\Bigl( -\sum_{i \leq n} p_i \Bigr) \to 0 \quad\text{as } n \to \infty.
\]
"$\Rightarrow$". We can assume that $p_i \leq 1/2$ for $i \geq m$ for large enough $m$, because, otherwise, the series obviously diverges. Since $1 - p \geq e^{-2p}$ for $p \leq 1/2$, we have
\[
\prod_{m \leq i \leq n} (1 - p_i) \geq \exp\Bigl( -2 \sum_{m \leq i \leq n} p_i \Bigr)
\]
and the result follows. $\square$
The following result plays a key role in Probability Theory. Consider a sequence $(A_n)_{n \geq 1}$ of events $A_n \in \mathcal{A}$ on the probability space $(\Omega, \mathcal{A}, P)$. We will denote by
\[
\{A_n \text{ i.o.}\} := \bigcap_{n \geq 1} \bigcup_{m \geq n} A_m
\]
the event that the $A_n$ occur infinitely often, which consists of all $\omega$ that belong to infinitely many events in the sequence $(A_n)_{n \geq 1}$. Then the following holds.

Lemma 14 (Borel-Cantelli lemma)

(1) If $\sum_{n \geq 1} P(A_n) < \infty$ then $P(A_n \text{ i.o.}) = 0$.
(2) If the $A_n$ are independent and $\sum_{n \geq 1} P(A_n) = +\infty$ then $P(A_n \text{ i.o.}) = 1$.

Proof. (1) If $B_n = \bigcup_{m \geq n} A_m$ then $B_n \supseteq B_{n+1}$ and, by the continuity of measure,
\[
P(A_n \text{ i.o.}) = P\Bigl( \bigcap_{n \geq 1} B_n \Bigr) = \lim_{n\to\infty} P(B_n).
\]
On the other hand,
\[
P(B_n) = P\Bigl( \bigcup_{m \geq n} A_m \Bigr) \leq \sum_{m \geq n} P(A_m),
\]
which goes to zero as $n \to +\infty$, because the series $\sum_{m \geq 1} P(A_m) < \infty$.

(2) We can write
\[
P(\Omega \setminus B_n) = P\Bigl( \Omega \setminus \bigcup_{m \geq n} A_m \Bigr) = P\Bigl( \bigcap_{m \geq n} (\Omega \setminus A_m) \Bigr)
= \{\text{by independence}\} = \prod_{m \geq n} P(\Omega \setminus A_m) = \prod_{m \geq n} \bigl( 1 - P(A_m) \bigr) = 0,
\]
by Lemma 13, since $\sum_{m \geq n} P(A_m) = +\infty$. Therefore, $P(B_n) = 1$ and $P(A_n \text{ i.o.}) = P\bigl( \bigcap_{n \geq 1} B_n \bigr) = 1$. $\square$

Let us show how this implies the strong law of large numbers for bounded random variables. Recall that in the case when
\[
\mu = EX_1, \qquad \sigma^2 = \mathrm{Var}(X_1) \qquad\text{and}\qquad |X_1| \leq M,
\]
the average satisfies Bernstein's inequality proved in the last section,
\[
P\bigl( |\overline{X}_n - \mu| \geq t \bigr) \leq 2 \exp\Bigl( -\frac{n t^2}{2(\sigma^2 + tM/3)} \Bigr).
\]
If we take
\[
t = t_n = \Bigl( \frac{8 \sigma^2 \log n}{n} \Bigr)^{1/2}
\]
then, for $n$ large enough so that $t_n M / 3 \leq \sigma^2$, we have
\[
P\bigl( |\overline{X}_n - \mu| \geq t_n \bigr) \leq 2 \exp\Bigl( -\frac{n t_n^2}{4 \sigma^2} \Bigr) = 2 n^{-2} = \frac{2}{n^2}.
\]
Since the series $\sum_{n \geq 1} 2 n^{-2}$ converges, the Borel-Cantelli lemma implies that
\[
P\bigl( |\overline{X}_n - \mu| \geq t_n \text{ i.o.} \bigr) = 0.
\]
This means that for $P$-almost all $\omega$, the difference $|\overline{X}_n(\omega) - \mu|$ will become smaller than $t_n$ for all large enough $n \geq n_0(\omega)$. If we recall the definition of almost sure convergence, this means that $\overline{X}_n$ converges to $\mu$ almost surely: the so-called strong law of large numbers. Next, we will show that this holds even for unbounded random variables under the minimal assumption that the expectation $\mu = EX_1$ exists.

Strong law of large numbers. The following simple observation will be useful: if a random variable $X \geq 0$ then
\[
EX = \int_0^{\infty} P(X \geq x)\, dx.
\]
Indeed, if $F$ is the law of $X$ on $\mathbb{R}$ then
\[
EX = \int_0^{\infty} x\, dF(x) = \int_0^{\infty} \int_0^{x} 1\, ds\, dF(x) = \int_0^{\infty} \int_s^{\infty} 1\, dF(x)\, ds = \int_0^{\infty} P(X \geq s)\, ds.
\]
For $X \geq 0$ such that $EX < \infty$ this implies
\[
\sum_{i \geq 1} P(X \geq i) \leq \int_0^{\infty} P(X \geq s)\, ds = EX < \infty. \tag{5.0.1}
\]
As before, let $(X_n)_{n \geq 1}$ be i.i.d. random variables on the same probability space.

Theorem 13 (Strong law of large numbers) If $E|X_1| < \infty$ then
\[
\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \to EX_1 \quad\text{almost surely}.
\]

Proof. The proof will proceed in several steps.

Step 1. First, without loss of generality we can assume that $X_i \geq 0$. Indeed, for signed random variables we can decompose $X_i = X_i^+ - X_i^-$, where
\[
X_i^+ = X_i\, I(X_i \geq 0) \qquad\text{and}\qquad X_i^- = |X_i|\, I(X_i < 0),
\]
and the general result would follow, since
\[
\frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{n} \sum_{i=1}^{n} X_i^+ - \frac{1}{n} \sum_{i=1}^{n} X_i^- \to EX_1^+ - EX_1^- = EX_1.
\]
Thus, from now on we assume that $X_i \geq 0$.

Step 2. (Truncation) Next, we can replace $X_i$ by $Y_i = X_i\, I(X_i \leq i)$ using the Borel-Cantelli lemma. Since
\[
\sum_{i \geq 1} P(X_i \neq Y_i) = \sum_{i \geq 1} P(X_i > i) \leq EX_1 < \infty,
\]
the Borel-Cantelli lemma implies that $P(\{X_i \neq Y_i\} \text{ i.o.}) = 0$. This means that for some (random) $i_0$ and for $i \geq i_0$ we have $X_i = Y_i$ and, therefore,
\[
\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} Y_i.
\]
It remains to show that if $T_n = \sum_{i=1}^{n} Y_i$ then $T_n / n \to EX_1$ almost surely.

Step 3. (Limits over subsequences) We will first prove almost sure convergence along subsequences of the type $n(k) = \lfloor \alpha^k \rfloor$ for $\alpha > 1$. For any $\varepsilon > 0$,
\[
\sum_{k \geq 1} P\bigl( |T_{n(k)} - ET_{n(k)}| \geq \varepsilon\, n(k) \bigr)
\leq \frac{1}{\varepsilon^2} \sum_{k \geq 1} n(k)^{-2}\, \mathrm{Var}(T_{n(k)})
= \frac{1}{\varepsilon^2} \sum_{k \geq 1} n(k)^{-2} \sum_{i \leq n(k)} \mathrm{Var}(Y_i)
\leq \frac{1}{\varepsilon^2} \sum_{k \geq 1} n(k)^{-2} \sum_{i \leq n(k)} EY_i^2
= \frac{1}{\varepsilon^2} \sum_{i \geq 1} EY_i^2 \sum_{k:\, n(k) \geq i} n(k)^{-2}.
\]
To bound the last sum, let us note that
\[
n(k) = \lfloor \alpha^k \rfloor \geq \frac{\alpha^k}{2}
\]
and, if $k_0 = \min\{ k : \alpha^k \geq i \}$, then
\[
\sum_{n(k) \geq i} \frac{1}{n(k)^2} \leq 4 \sum_{k \geq k_0} \alpha^{-2k} = \frac{4\, \alpha^{-2 k_0}}{1 - \alpha^{-2}} \leq \frac{K}{i^2},
\]
where $K = K(\alpha)$ depends on $\alpha$ only. Therefore, we showed that
\[
\sum_{k \geq 1} P\bigl( |T_{n(k)} - ET_{n(k)}| \geq \varepsilon\, n(k) \bigr) \leq \frac{K}{\varepsilon^2} \sum_{i \geq 1} \frac{1}{i^2}\, EY_i^2 = \frac{K}{\varepsilon^2} \sum_{i \geq 1} \frac{1}{i^2} \int_0^{i} x^2\, dF(x),
\]
where $F$ is the law of $X_1$. We can bound the last sum as follows:
\[
\sum_{i \geq 1} \frac{1}{i^2} \int_0^{i} x^2\, dF(x)
= \sum_{i \geq 1} \frac{1}{i^2} \sum_{0 \leq m < i} \int_m^{m+1} x^2\, dF(x)
= \sum_{m \geq 0} \Bigl( \sum_{i > m} \frac{1}{i^2} \Bigr) \int_m^{m+1} x^2\, dF(x)
\leq \sum_{m \geq 0} \frac{2}{m+1} \int_m^{m+1} x^2\, dF(x)
\leq 2 \sum_{m \geq 0} \int_m^{m+1} x\, dF(x) = 2 EX_1 < \infty.
\]
We proved that
\[
\sum_{k \geq 1} P\bigl( |T_{n(k)} - ET_{n(k)}| \geq \varepsilon\, n(k) \bigr) < \infty
\]
and the Borel-Cantelli lemma implies that
\[
P\bigl( |T_{n(k)} - ET_{n(k)}| \geq \varepsilon\, n(k) \text{ i.o.} \bigr) = 0.
\]
If we take a sequence $\varepsilon_m = m^{-1}$, this implies that
\[
P\bigl( \exists\, m \geq 1 :\ |T_{n(k)} - ET_{n(k)}| \geq m^{-1} n(k) \text{ i.o.} \bigr) = 0,
\]
and this proves that
\[
\frac{T_{n(k)}}{n(k)} - \frac{ET_{n(k)}}{n(k)} \to 0
\]
with probability one. On the other hand, it is obvious that
\[
\frac{ET_{n(k)}}{n(k)} = \frac{1}{n(k)} \sum_{i \leq n(k)} EX_1 I(X_1 \leq i) \to EX_1
\]
as $k \to \infty$, so we proved that
\[
\frac{T_{n(k)}}{n(k)} \to EX_1 \quad\text{almost surely}.
\]

Step 4. Finally, for $j$ such that
\[
n(k) \leq j < n(k+1) \;\Longrightarrow\; \frac{n(k+1)}{n(k)} \leq \alpha^2,
\]
we can write (using that $T_j$ is non-decreasing, since $Y_i \geq 0$)
\[
\frac{1}{\alpha^2}\, \frac{T_{n(k)}}{n(k)} \leq \frac{T_j}{j} \leq \alpha^2\, \frac{T_{n(k+1)}}{n(k+1)}
\]
and, therefore, with probability one,
\[
\frac{1}{\alpha^2}\, EX_1 \leq \liminf_{j\to\infty} \frac{T_j}{j} \leq \limsup_{j\to\infty} \frac{T_j}{j} \leq \alpha^2\, EX_1.
\]
Taking $\alpha = 1 + m^{-1}$ and letting $m \to \infty$ proves that $\lim_{j\to\infty} T_j / j = EX_1$ almost surely. $\square$
u

Exercise. Suppose that random variables (Xn )n1 are i.i.d. such that E|X1 | p < for some p > 0. Show that maxin |Xi |/n1/p
goes to zero in probability.
Exercise. (Weak LLN for U-statistics) If (Xn )n1 are i.i.d. such that EX1 = and 2 = Var(X1 ) < , show that
 1
n
2 Xi X j 2
1i< jn

in probability as n .
Exercise. If u : [0, 1]k R is continuous then show that
j  
1 jk  n
u n , . . . , n ji xiji (1 xi )n ji u(x1 , . . . , xk )
0 j1 ,..., j n
k ik

as n , uniformly on [0, 1]k .


Exercise. If E|X| < and limn P(An ) = 0, show that limn EXIAn = 0. (Hint: use the Borel-Cantelli lemma over
some subsequence.)
Exercise. Suppose that An is a sequence of events such that

lim P(An ) = 0 and


n
P(An \ An+1 ) < .
n1

Prove that P(An i.o.) = 0.


Exercise. Let {Xn }n1 be i.i.d. and Sn = X1 + . . . + Xn . If Sn /n 0 a.s., show that E|X1 | < . (Hint: use the idea in
(5.0.1) and the Borel-Cantelli lemma.)
Exercise. Suppose that (Xn )n1 are independent random variables. Show that P(supn1 Xn < ) = 1 if and only if
n1 P(Xn > M) < for some M > 0.
Exercise. If (Xn )n1 are i.i.d., not constant with probability one, then P(Xn converges) = 0.
Exercise. Let {Xn }n1 be independent and exponentially distributed, i.e. with c.d.f. F(x) = 1 ex for x 0. Show
that  Xn 
P lim sup = 1 = 1.
n log n

Exercise. Suppose that (Xn )n1 are i.i.d. with EX1+ < and EX1 = . Show that Sn /n almost surely.

28

AMS Open Math Notes: Works in Progress 2016-12-20 09:28:00


Exercise. Suppose that (X_n)_{n≥1} are i.i.d. such that E|X_1| < ∞ and EX_1 = 0. If (c_n) is a bounded sequence of real numbers, prove that

$$\frac{1}{n}\sum_{i=1}^n c_i X_i \to 0 \quad\text{almost surely.}$$

Hint: either (a) group the close values of c_i or (b) examine the proof of the strong law of large numbers.

Additional exercise. Suppose (X_n)_{n≥1} are i.i.d. standard normal. Prove that

$$P\Big(\limsup_{n\to\infty}\frac{|X_n|}{\sqrt{2\log n}} = 1\Big) = 1.$$



Section 6

0-1 laws, convergence of random series,


Kolmogorov's SLLN.

Consider a sequence (X_i)_{i≥1} of random variables on the same probability space (Ω, F, P) and let σ((X_i)_{i≥1}) be the
σ-algebra generated by this sequence. An event A ∈ σ((X_i)_{i≥1}) is called a tail event if A ∈ σ((X_i)_{i≥n}) for all n ≥ 1.
In other words,

$$A \in \mathcal{T} = \bigcap_{n\ge 1}\sigma\big((X_i)_{i\ge n}\big),$$

where T is the so-called tail σ-algebra. For example, if A_i ∈ σ(X_i) then

$$\big\{A_i \text{ i.o.}\big\} = \bigcap_{n\ge 1}\bigcup_{i\ge n} A_i$$

is a tail event. It turns out that when (X_i)_{i≥1} are independent then all tail events have probability 0 or 1.

Theorem 14 (Kolmogorov's 0-1 law) If (X_i)_{i≥1} are independent then P(A) = 0 or 1 for all A ∈ T.

Proof. For a finite subset F = {i_1,...,i_n} ⊂ N, let us denote X_F = (X_{i_1},...,X_{i_n}). The σ-algebra σ((X_i)_{i≥1}) is generated by the algebra of sets

$$\mathcal{A} = \big\{\{X_F\in B\} : \text{finite } F\subset\mathbb{N},\ B\in\mathcal{B}(\mathbb{R}^{|F|})\big\}.$$

This is an algebra, because any set operations on a finite number of such sets can again be expressed in terms of finitely many random variables X_i. By the Approximation Lemma, we can approximate any set A ∈ σ((X_i)_{i≥1}) by sets in A. Therefore, for any ε > 0, there exists a set A' ∈ A such that P(A △ A') ≤ ε and, therefore,

$$|P(A) - P(A')| \le \varepsilon, \quad |P(A) - P(A\cap A')| \le \varepsilon.$$

By definition, A' ∈ σ(X_1,...,X_n) for large enough n and, since A is a tail event, A ∈ σ((X_i)_{i≥n+1}). The Grouping Lemma implies that A and A' are independent and P(A ∩ A') = P(A)P(A'). Finally, up to errors of order ε, we get

$$P(A) \approx P(A\cap A') = P(A)P(A') \approx P(A)P(A),$$

and letting ε → 0 proves that P(A) = P(A)². □

Example. The event A = {the series Σ_{i≥1} X_i converges} is a tail event, so it has probability 0 or 1 when the X_i's are independent. □
Example. Consider the series Σ_{i≥1} X_i z^i on the complex plane, for z ∈ C. Its radius of convergence is

$$r = \liminf_{i\to\infty} |X_i|^{-1/i}.$$



For any x ≥ 0, the event {r ≤ x} is, obviously, a tail event. This implies that r = const with probability 1. □
Next we will prove a stronger result under a more restrictive assumption that the random variables (X_i)_{i≥1} are not only independent, but also identically distributed. A set B ⊆ R^N is called symmetric if, for all n ≥ 1,

$$(x_1, x_2,\dots,x_n, x_{n+1},\dots) \in B \implies (x_n, x_2,\dots,x_{n-1}, x_1, x_{n+1},\dots)\in B.$$

In other words, the set B is symmetric under the permutations of finitely many coordinates. We will call an event A ∈ σ((X_i)_{i≥1}) symmetric if it is of the form {(X_i)_{i≥1} ∈ B} for some symmetric set B ⊆ R^N in the cylindrical σ-algebra B on R^N. For example, any event in the tail σ-algebra T is symmetric.
Theorem 15 (Hewitt-Savage 0-1 law) If (X_i)_{i≥1} are i.i.d. and A ∈ σ((X_i)_{i≥1}) is symmetric then P(A) = 0 or 1.

Proof. By the Approximation Lemma, for any ε > 0, there exist n ≥ 1 and A_n ∈ σ(X_1,...,X_n) such that P(A_n △ A) ≤ ε. Of course, we can write this set as

$$A_n = \big\{(X_1,\dots,X_n)\in B_n\big\} \in \sigma(X_1,\dots,X_n)$$

for some Borel set B_n on R^n. Let us denote

$$A'_n = \big\{(X_{n+1},\dots,X_{2n})\in B_n\big\} \in \sigma(X_{n+1},\dots,X_{2n}).$$

This set is independent of A_n, so P(A_n ∩ A'_n) = P(A_n)P(A'_n). Given x = (x_1, x_2,...) ∈ R^N, let us define an operator

$$\pi x = (x_{n+1},\dots,x_{2n}, x_1,\dots,x_n, x_{2n+1},\dots)$$

that switches the first n coordinates with the second n coordinates. Denote X = (X_i)_{i≥1} and recall that A = {X ∈ B} for some symmetric set B ⊆ R^N that, by definition, satisfies πB = {πx : x ∈ B} = B. Now we can write

$$P(A'_n\,\triangle\, A) = P\big(\{(X_{n+1},\dots,X_{2n})\in B_n\}\,\triangle\,\{X\in B\}\big) = P\big(\{(X_{n+1},\dots,X_{2n})\in B_n\}\,\triangle\,\{\pi X\in B\}\big).$$

The last event is obtained from {(X_1,...,X_n) ∈ B_n} △ {X ∈ B} by replacing the sequence X with πX and, using that the X_i's are i.i.d., πX has the same distribution as X, so

$$P(A'_n\,\triangle\, A) = P\big(\{(X_1,\dots,X_n)\in B_n\}\,\triangle\,\{X\in B\}\big) = P(A_n\,\triangle\, A) \le \varepsilon.$$

This implies that P((A_n ∩ A'_n) △ A) ≤ 2ε and we can conclude that, up to errors of order ε,

$$P(A) \approx P(A_n), \quad P(A) \approx P(A_n\cap A'_n) = P(A_n)P(A'_n) \approx P(A)^2.$$

Letting ε → 0 again implies that P(A) = P(A)². □
Example. Let S_n = X_1 + ... + X_n and let

$$r = \limsup_{n\to\infty}\frac{S_n - a_n}{b_n}.$$

The event {r ≤ x} is symmetric, since changing the order of a finite set of coordinates does not affect S_n for large enough n. As a result, P(r ≤ x) = 0 or 1, which implies that r = const with probability 1. □
Random series. We already saw above that, by Kolmogorov's 0-1 law, the series Σ_{i≥1} X_i for independent (X_i)_{i≥1} converges with probability 0 or 1. This means that either S_n = X_1 + ... + X_n converges to its limit S with probability one, or with probability one it does not converge. We know that almost sure convergence is stronger than convergence in probability, so in the case when with probability one S_n does not converge, is it still possible that it converges to some random variable in probability? The answer is no, because we will now prove that for random series convergence in probability implies almost sure convergence. We will need the following.

Theorem 16 (Kolmogorov's inequality) Suppose that (X_i)_{i≥1} are independent and S_n = X_1 + ... + X_n. If for all j ≤ n,

$$P(|S_n - S_j| \ge a) \le p < 1, \qquad (6.0.1)$$

then, for x > a,

$$P\Big(\max_{1\le j\le n}|S_j| \ge x\Big) \le \frac{1}{1-p}\, P(|S_n| > x - a).$$



Proof. First of all, let us notice that this inequality is obvious without the maximum, because (6.0.1) is equivalent to 1 − p ≤ P(|S_n − S_j| < a) and we can write

$$(1-p)P(|S_j|\ge x) \le P(|S_n-S_j|<a)\, P(|S_j|\ge x) = P(|S_n-S_j|<a,\ |S_j|\ge x) \le P(|S_n|>x-a).$$

The equality in the middle holds because the events {|S_j| ≥ x} and {|S_n − S_j| < a} are independent, since the first depends only on X_1,...,X_j and the second only on X_{j+1},...,X_n. The last inequality holds, because if |S_n − S_j| < a and |S_j| ≥ x then, by the triangle inequality, |S_n| > x − a.
To deal with the maximum, instead of looking at an arbitrary partial sum S_j, we will look at the first partial sum that crosses the level x. We define this first time by

$$\tau = \min\big\{ j\le n : |S_j|\ge x\big\}$$

and let τ = n + 1 if all |S_j| < x. Notice that the event {τ = j} also depends only on X_1,...,X_j, so we can again write

$$(1-p)P(\tau=j) \le P(|S_n-S_j|<a)\, P(\tau=j) = P(|S_n-S_j|<a,\ \tau=j) \le P(|S_n|>x-a,\ \tau=j).$$

The last inequality is true, because when τ = j we have |S_j| ≥ x and

$$\big\{|S_n-S_j|<a,\ \tau=j\big\} \subseteq \big\{|S_n|>x-a,\ \tau=j\big\}.$$

It remains to add up over j ≤ n to get

$$(1-p)P(\tau\le n) \le P(|S_n|>x-a,\ \tau\le n) \le P(|S_n|>x-a)$$

and notice that {τ ≤ n} = {max_{j≤n} |S_j| ≥ x}. □
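The inequality is easy to test numerically. The following Python sketch is not part of the notes; it assumes the numpy library, and the symmetric ±1 steps, n and the levels a, x are arbitrary illustrative choices. Chebyshev's inequality supplies the bound p in (6.0.1).

import numpy as np

# Monte Carlo check of Theorem 16 for i.i.d. symmetric +-1 steps.
# By Chebyshev, P(|S_n - S_j| >= a) <= (n - j)/a^2 <= n/a^2, so a = sqrt(2n) gives p = 1/2
# and the theorem claims P(max_j |S_j| >= x) <= 2 P(|S_n| > x - a) for x > a.
rng = np.random.default_rng(1)
n, trials = 100, 50_000
a = np.sqrt(2 * n)
x = 3 * np.sqrt(n)

steps = rng.choice([-1.0, 1.0], size=(trials, n))
S = np.cumsum(steps, axis=1)

lhs = np.mean(np.max(np.abs(S), axis=1) >= x)
rhs = 2 * np.mean(np.abs(S[:, -1]) > x - a)
print(f"P(max|S_j| >= x) ~ {lhs:.4f}   <=   2 P(|S_n| > x - a) ~ {rhs:.4f}")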

We will need one more simple lemma.

Lemma 15 A sequence Y_n → Y almost surely if and only if M_n = sup_{i≥n} |Y_i − Y| → 0 in probability.

Proof. The "only if" direction is obvious, so we only need to prove the "if" part. Since the sequence M_n is decreasing, it converges to some limit M_n ↓ M ≥ 0 everywhere. Since for all ε > 0,

$$P(M\ge\varepsilon) \le P(M_n\ge\varepsilon) \to 0 \quad\text{as } n\to\infty,$$

this means that P(M = 0) = 1 and M_n → 0 almost surely. Of course, this implies that Y_n → Y almost surely. □

We are now ready to prove the result mentioned above.

Theorem 17 If the series Σ_{i≥1} X_i converges in probability then it converges almost surely.

Proof. Suppose that the partial sums S_n converge to some random variable S in probability, i.e., for any ε > 0, for large enough n ≥ n_0(ε) we have P(|S_n − S| ≥ ε) ≤ ε. If k ≥ j ≥ n ≥ n_0(ε) then

$$P(|S_k-S_j|\ge 2\varepsilon) \le P(|S_k-S|\ge\varepsilon) + P(|S_j-S|\ge\varepsilon) \le 2\varepsilon.$$

Next, we apply Kolmogorov's inequality with x = 4ε, a = 2ε and p = 2ε to the partial sums X_{n+1} + ... + X_j to get

$$P\Big(\max_{n\le j\le k}|S_j-S_n|\ge 4\varepsilon\Big) \le \frac{1}{1-2\varepsilon}\, P(|S_k-S_n|>2\varepsilon) \le \frac{2\varepsilon}{1-2\varepsilon} \le 3\varepsilon,$$

for small ε. The events {max_{n≤j≤k} |S_j − S_n| ≥ 4ε} are increasing as k → ∞ and, by the continuity of measure,

$$P\Big(\max_{n\le j}|S_j-S_n|\ge 4\varepsilon\Big) \le 3\varepsilon.$$

Finally, since P(|S_n − S| ≥ ε) ≤ ε we get

$$P\Big(\max_{n\le j}|S_j-S|\ge 5\varepsilon\Big) \le 4\varepsilon.$$

This means that the maximum max_{n≤j} |S_j − S| → 0 in probability and, by the previous lemma, S_n → S almost surely. □
Let us give one easy-to-check criterion for convergence of random series. Again, we will need one auxiliary result.
Lemma 16 A random sequence (Y_n)_{n≥1} converges in probability to some limit Y if and only if it is Cauchy in probability, which means that

$$\lim_{n,m\to\infty} P(|Y_n - Y_m| \ge \varepsilon) = 0$$

for all ε > 0.

Proof. Again, the "only if" direction is obvious and we only need to prove the "if" part. Given ε = l^{-2}, we can find m(l) large enough such that, for n, m ≥ m(l),

$$P\big(|Y_n - Y_m| \ge l^{-2}\big) \le l^{-2}. \qquad (6.0.2)$$

Without loss of generality, we can assume that m(l+1) ≥ m(l) so that

$$P\big(|Y_{m(l+1)} - Y_{m(l)}| \ge l^{-2}\big) \le l^{-2}.$$

Then,

$$\sum_{l\ge 1} P\big(|Y_{m(l+1)} - Y_{m(l)}| \ge l^{-2}\big) \le \sum_{l\ge 1} l^{-2} < \infty$$

and, by the Borel-Cantelli lemma,

$$P\big(|Y_{m(l+1)} - Y_{m(l)}| \ge l^{-2} \text{ i.o.}\big) = 0.$$

As a result, for large enough (random) l and for k > l,

$$|Y_{m(k)} - Y_{m(l)}| \le \sum_{i\ge l}\frac{1}{i^2} \le \frac{1}{l-1}.$$

This means that, with probability one, (Y_{m(l)})_{l≥1} is a Cauchy sequence and there exists an almost sure limit Y = lim_{l→∞} Y_{m(l)}. Together with (6.0.2) this implies that Y_n → Y in probability. □

Theorem 18 If (X_i)_{i≥1} is a sequence of independent random variables such that

$$EX_i = 0 \quad\text{and}\quad \sum_{i\ge 1} EX_i^2 < \infty$$

then the series Σ_{i≥1} X_i converges almost surely.

Proof. It is enough to prove convergence in probability. For m < n,

$$P(|S_n - S_m| \ge \varepsilon) \le \frac{1}{\varepsilon^2} E(S_n - S_m)^2 = \frac{1}{\varepsilon^2}\sum_{m<i\le n} EX_i^2 \to 0$$

as n, m → ∞, since the series Σ_{i≥1} EX_i^2 converges. This means that (S_n)_{n≥1} is Cauchy in probability and, by the previous lemma, S_n converges to some limit S in probability. □
Example. Consider the random series Σ_{i≥1} ε_i/i^α where P(ε_i = ±1) = 1/2. We have

$$\sum_{i\ge 1} E\Big(\frac{\varepsilon_i}{i^\alpha}\Big)^2 = \sum_{i\ge 1}\frac{1}{i^{2\alpha}} < \infty \quad\text{if } \alpha > \frac12,$$

so the series converges almost surely for α > 1/2. □
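A short simulation makes this dichotomy visible. The sketch below is not part of the notes; it assumes numpy, and the exponents 0.6 and 0.4 are chosen only to contrast the convergent case α > 1/2 with a divergent one.

import numpy as np

# Partial sums of the random series sum_{i>=1} eps_i / i^alpha with P(eps_i = +-1) = 1/2.
# For alpha = 0.6 the partial sums settle down; for alpha = 0.4 they keep wandering.
rng = np.random.default_rng(2)
N = 10**6
eps = rng.choice([-1.0, 1.0], size=N)
i = np.arange(1, N + 1)

for alpha in [0.6, 0.4]:
    partial = np.cumsum(eps / i**alpha)
    vals = "  ".join(f"S_{n}={partial[n - 1]: .3f}" for n in [10**3, 10**4, 10**5, 10**6])
    print(f"alpha = {alpha}:  {vals}")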
One famous application of Theorem 18 is the following form of the strong law of large numbers.

33

AMS Open Math Notes: Works in Progress 2016-12-20 09:28:00


Theorem 19 (Kolmogorov's strong law of large numbers) Let (X_i)_{i≥1} be independent random variables such that EX_i = 0 and EX_i² < ∞. Suppose that b_i ≤ b_{i+1} and b_i → ∞. If Σ_{i≥1} EX_i²/b_i² < ∞ then lim_{n→∞} b_n^{-1} S_n = 0 almost surely.

This follows immediately from Theorem 18 and the following well-known calculus lemma.

Lemma 17 (Kronecker's lemma) Suppose that a sequence (b_i)_{i≥1} is such that all b_i > 0 and b_i ↑ ∞. Given another sequence (x_i)_{i≥1}, if the series Σ_{i≥1} x_i/b_i converges then lim_{n→∞} b_n^{-1} Σ_{i=1}^n x_i = 0.

Proof. Because the series converges, r_n := Σ_{i≥n+1} x_i/b_i → 0 as n → ∞. Notice that we can write x_n = b_n(r_{n−1} − r_n) and, therefore,

$$\sum_{i=1}^n x_i = \sum_{i=1}^n b_i(r_{i-1}-r_i) = \sum_{i=1}^{n-1}(b_{i+1}-b_i)r_i + b_1r_0 - b_nr_n.$$

Since r_i → 0, given ε > 0, we can find n_0 such that for i ≥ n_0 we have |r_i| ≤ ε and |Σ_{i=n_0+1}^{n-1}(b_{i+1}−b_i)r_i| ≤ ε b_n. Therefore,

$$\Big|\frac{1}{b_n}\sum_{i=1}^n x_i\Big| \le \frac{1}{b_n}\Big|\sum_{i=1}^{n_0}(b_{i+1}-b_i)r_i\Big| + \varepsilon + \frac{b_1}{b_n}|r_0| + |r_n|.$$

Letting n → ∞ and then ε → 0 finishes the proof. □
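Kronecker's lemma is also easy to see numerically. The following sketch is not from the notes; it assumes numpy, and the particular sequences b_i = i and x_i = (−1)^i i/log(i+1) are chosen only so that Σ x_i/b_i is a convergent alternating series.

import numpy as np

# Kronecker's lemma: if sum x_i/b_i converges then (1/b_n) sum_{i<=n} x_i -> 0.
# Here x_i/b_i = (-1)^i / log(i+1), an alternating series with terms decreasing to zero.
n = 10**6
i = np.arange(1, n + 1)
x = (-1.0)**i * i / np.log(i + 1.0)
b = i.astype(float)

ratio = np.cumsum(x) / b
for m in [10**2, 10**3, 10**4, 10**5, 10**6]:
    print(f"n = {m:>8d}   (1/b_n) sum_(i<=n) x_i = {ratio[m - 1]: .5f}")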

A couple of examples of application of Kolmogorovs strong law of large numbers will be given in the exercises below.

Exercise. Let {Sn : n 0} be a simple random walk which starts at zero, S0 = 0, and at each step moves to the right
with probability p and to the left with probability 1 p. Show that the event {Sn = 0 i.o.} has probability 0 or 1. (Hint:
Hewitt-Savage 0 1 law.)

Exercise. In the setting of the previous problem, show: (a) if p ≠ 1/2 then P(S_n = 0 i.o.) = 0; (b) if p = 1/2 then
P(Sn = 0 i.o.) = 1. Hint: use the fact that the events
n on o
lim inf Sn 1/2 , lim sup Sn 1/2
n n

are exchangeable.

Exercise. Suppose that (X_n)_{n≥1} are i.i.d. with EX_1 = 0 and EX_1² = 1. Prove that for ε > 0,

$$\frac{1}{n^{1/2}(\log n)^{1/2+\varepsilon}}\sum_{i=1}^n X_i \to 0$$

almost surely. Hint: use Kolmogorovs strong law of large numbers.

Exercise. Let (Xn ) be i.i.d. random variables with continuous distribution F. We say that Xn is a record value if Xn > Xi
for i < n. Let In be the indicator of the event that Xn is a record value.

(a) Show that the random variables (In )n1 are independent and P(In = 1) = 1/n. Hint: if Rn {1, . . . , n} is the rank
of Xn among the first n random variables (Xi )in , prove that (Rn ) are independent.
(b) If Sn = I1 + . . . + In is the number of records up to time n, prove that Sn / log n 1 almost surely. Hint: use
Kolmogorovs strong law of large numbers.

Exercise. Suppose that two sequences of random variables Xn : 1 R and Yn : 2 R for n 1 on two different
probability spaces have the same distributions in the sense that all their finite dimensional distributions are the same,
L ((Xi )in ) = L ((Yi )in ) for all n 1. If Xn converges almost surely to some random variable X on 1 as n ,
prove that Yn also converges almost surely on 2 .



Section 7

Stopping times, Wald's identity, Markov


property. Another proof of the SLLN.

In this section, we will have our first encounter with two concepts, stopping times and Markov property, in the setting of
the sums of independent random variables. Later, stopping times will play an important role in the study of martingales,
and Markov property will appear again in the setting of the Brownian motion.
Consider a sequence (X_i)_{i≥1} of independent random variables and an integer-valued random variable τ ∈ {1, 2, ...}. We say that τ is independent of the future if {τ ≤ n} is independent of σ((X_i)_{i≥n+1}). Suppose that τ is independent of the future and E|X_i| < ∞ for all i ≥ 1. We can formally write

$$ES_\tau = \sum_{k\ge 1} ES_\tau I(\tau=k) = \sum_{k\ge 1} ES_k I(\tau=k) \overset{(*)}{=} \sum_{k\ge 1}\sum_{n\le k} EX_n I(\tau=k) = \sum_{n\ge 1}\sum_{k\ge n} EX_n I(\tau=k) = \sum_{n\ge 1} EX_n I(\tau\ge n).$$

In (*) we can interchange the order of summation if, for example, the double sequence is absolutely summable, by the Fubini-Tonelli theorem. Since τ is independent of the future, the event {τ ≥ n} = {τ ≤ n−1}^c is independent of σ(X_n) and we get

$$ES_\tau = \sum_{n\ge 1} EX_n P(\tau\ge n). \qquad (7.0.1)$$

This implies the following.

Theorem 20 (Wald's identity) If (X_i)_{i≥1} are i.i.d., E|X_1| < ∞ and Eτ < ∞, then ES_τ = EX_1 Eτ.

Proof. By (7.0.1) we have

$$ES_\tau = \sum_{n\ge 1} EX_n P(\tau\ge n) = EX_1\sum_{n\ge 1} P(\tau\ge n) = EX_1 E\tau.$$

The reason we can interchange the order of summation in (*) is because under our assumptions the double sequence is absolutely summable, since

$$\sum_{n\ge 1}\sum_{k\ge n} E|X_n| I(\tau=k) = \sum_{n\ge 1} E|X_n| I(\tau\ge n) = E|X_1| E\tau < \infty,$$

so we can apply the Fubini-Tonelli theorem. □
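Wald's identity is straightforward to verify by simulation. The sketch below is not part of the notes; it assumes numpy, and the exponential step distribution, the threshold and the number of trials are illustrative choices. The first passage time over a fixed level is a stopping time with finite expectation.

import numpy as np

# Monte Carlo check of Wald's identity E S_tau = E X_1 * E tau.
# X_i are i.i.d. exponential with mean 2 and tau = min{n : S_n > 10}.
rng = np.random.default_rng(3)
mean_X, level, trials = 2.0, 10.0, 100_000

S_tau = np.empty(trials)
tau = np.empty(trials)
for t in range(trials):
    s, n = 0.0, 0
    while s <= level:
        s += rng.exponential(mean_X)
        n += 1
    S_tau[t], tau[t] = s, n

print(f"E S_tau        ~ {S_tau.mean():.4f}")
print(f"E X_1 * E tau  ~ {mean_X * tau.mean():.4f}")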
We say that τ is a stopping time if {τ ≤ n} ∈ σ(X_1,...,X_n) for all n. Clearly, a stopping time is independent of the future. One example of a stopping time is τ = min{k ≥ 1 : S_k ≥ 1}, since

$$\{\tau\le n\} = \bigcup_{k\le n}\{S_k\ge 1\} \in \sigma(X_1,\dots,X_n).$$



Given a stopping time τ, we would like to describe all events that depend on τ and the sequence X_1,...,X_τ up to this stopping time. A formal definition of the σ-algebra σ_τ generated by the sequence up to a stopping time τ is the following:

$$\sigma_\tau = \big\{ A\in\mathcal{A} : A\cap\{\tau\le n\}\in\sigma(X_1,\dots,X_n) \text{ for all } n\ge 1\big\}.$$

When τ is a stopping time, one can easily check that this is a σ-algebra, and the meaning of the definition is that, if we know that τ ≤ n then the corresponding part of the event A is expressed only in terms of X_1,...,X_n.

Theorem 21 (Markov property) Suppose that (X_i)_{i≥1} are i.i.d. and τ is a stopping time. Then the sequence T_τ = (X_{τ+1}, X_{τ+2}, ...) is independent of the σ-algebra σ_τ and

$$T_\tau \overset{d}{=} (X_1, X_2,\dots),$$

where =^d means equality in distribution.

In words, this means that the sequence T_τ = (X_{τ+1}, X_{τ+2}, ...) after the stopping time is an independent copy of the entire sequence, also independent of everything that happens before the stopping time.

Proof. Consider an event A ∈ σ_τ and B ∈ B, the cylindrical σ-algebra on R^N. We can write

$$P\big(A\cap\{T_\tau\in B\}\big) = \sum_{n\ge 1} P\big(A\cap\{\tau=n\}\cap\{T_\tau\in B\}\big) = \sum_{n\ge 1} P\big(A\cap\{\tau=n\}\cap\{T_n\in B\}\big),$$

where T_n = (X_{n+1}, X_{n+2}, ...). By the definition of the σ-algebra σ_τ,

$$A\cap\{\tau=n\} = A\cap\{\tau\le n\}\,\setminus\, A\cap\{\tau\le n-1\} \in \sigma(X_1,\dots,X_n).$$

On the other hand, {T_n ∈ B} ∈ σ(X_{n+1},...) and, therefore, is independent of A ∩ {τ = n}. Using this and the fact that (X_i)_{i≥1} are i.i.d.,

$$P\big(A\cap\{T_\tau\in B\}\big) = \sum_{n\ge 1} P\big(A\cap\{\tau=n\}\big)P(T_n\in B) = \sum_{n\ge 1} P\big(A\cap\{\tau=n\}\big)P(T_1\in B) = P(A)P(T_1\in B),$$

and this finishes the proof. □

Let us give one interesting application of the Markov property and Wald's identity that will yield another proof of the Strong Law of Large Numbers.

Theorem 22 Suppose that (X_i)_{i≥1} are i.i.d. such that EX_1 > 0. If Z = inf_{n≥1} S_n then P(Z > −∞) = 1.

This means that the partial sums cannot drift down to −∞ if the mean EX_1 > 0. Of course, this is obvious by the strong law of large numbers, but we want to prove this independently, since this will give another proof of the SLLN.

Proof. Let us define (see Fig. 7.1 below),

$$\tau_1 = \min\{k\ge 1 : S_k\ge 1\},\quad Z_1 = \min_{k\le\tau_1} S_k,\quad S_k^{(2)} = S_{\tau_1+k} - S_{\tau_1},$$

$$\tau_2 = \min\{k\ge 1 : S_k^{(2)}\ge 1\},\quad Z_2 = \min_{k\le\tau_2} S_k^{(2)},\quad S_k^{(3)} = S_{\tau_2+k}^{(2)} - S_{\tau_2}^{(2)},$$

and recursively,

$$\tau_n = \min\{k\ge 1 : S_k^{(n)}\ge 1\},\quad Z_n = \min_{k\le\tau_n} S_k^{(n)},\quad S_k^{(n+1)} = S_{\tau_n+k}^{(n)} - S_{\tau_n}^{(n)}.$$

We mentioned above that τ_1 is a stopping time. It is easy to check that Z_1 is σ_{τ_1}-measurable and, by the Markov property, it is independent of the sequence T_{τ_1} = (X_{τ_1+1}, X_{τ_1+2}, ...), which has the same distribution as the original



Figure 7.1: A sequence of stopping times.

sequence. Since τ_2 and Z_2 are defined exactly the same way as τ_1 and Z_1, only in terms of this new sequence T_{τ_1}, Z_2 is an independent copy of Z_1. Now, it should be obvious that (Z_n)_{n≥1} are i.i.d. random variables. Clearly,

$$Z = \inf_{k\ge 1} S_k = \inf\big\{ Z_1,\ S_{\tau_1}+Z_2,\ S_{\tau_1+\tau_2}+Z_3,\ \dots\big\},$$

and, since, by construction, S_{τ_1+...+τ_{k−1}} ≥ k − 1,

$$\{Z<-N\} \subseteq \bigcup_{k\ge 1}\big\{S_{\tau_1+\dots+\tau_{k-1}}+Z_k\le -N\big\} \subseteq \bigcup_{k\ge 1}\big\{k-1+Z_k\le -N\big\}.$$

Therefore,

$$P(Z<-N) \le \sum_{k\ge 1} P(k-1+Z_k\le -N) = \sum_{k\ge 1} P(Z_k\le -N-k+1) = \sum_{k\ge 1} P(Z_1\le -N-k+1) \le \sum_{j\ge N} P(|Z_1|\ge j) \to 0$$

as N → ∞, if we can show that E|Z_1| < ∞, since

$$\sum_{j\ge 1} P(|Z_1|\ge j) \le E|Z_1| < \infty.$$

By Wald's identity,

$$E|Z_1| \le E\sum_{i\le\tau_1}|X_i| = E|X_1|\, E\tau_1 < \infty$$

if we can show that Eτ_1 < ∞. This is left as an exercise below. We proved that P(Z < −N) → 0 as N → ∞, which, of course, implies that P(Z > −∞) = 1. □
This result gives another proof of the Strong Law of Large Numbers.

Theorem 23 If (X_i)_{i≥1} are i.i.d. and EX_1 = 0 then S_n/n → 0 almost surely.

Proof. Given ε > 0 we define X_i^ε = X_i + ε so that EX_1^ε = ε > 0. By the above result, inf_{n≥1}(S_n + εn) > −∞ with probability one. This means that for all n ≥ 1, S_n + εn ≥ M > −∞ for some random variable M. Dividing both sides by n and letting n → ∞ we get

$$\liminf_{n\to\infty}\frac{S_n}{n} \ge -\varepsilon$$

with probability one. We can then let ε → 0 over some sequence. Similarly, we prove that

$$\limsup_{n\to\infty}\frac{S_n}{n} \le 0$$

with probability one, which finishes the proof. □



Exercise. Let (X_i)_{i≥1} be i.i.d. and EX_1 > 0. Given a > 0, show that Eτ < ∞ for τ = inf{k ≥ 1 : S_k > a}. (Hint: truncate the X_i's and use Wald's identity.)

Exercise. Let S_0 = 0, S_n = Σ_{i=1}^n X_i be a random walk with i.i.d. (X_i), P(X_i = +1) = p and P(X_i = −1) = q = 1 − p, for p > 1/2. Consider an integer b ≥ 1 and let τ = min{n ≥ 1 : S_n = b}. Show that for 0 < s ≤ 1,

$$Es^\tau = \Big(\frac{1-(1-4pqs^2)^{1/2}}{2qs}\Big)^b$$

and compute Eτ.

Exercise. Suppose that we play a game with i.i.d. outcomes (X_n)_{n≥1} such that E|X_1| < ∞. If we play n rounds, we gain the largest of the first n outcomes. In addition, to play each round we have to pay an amount c > 0, so after n rounds our total profit (or loss) is

$$Y_n = \max_{1\le m\le n} X_m - cn.$$

In this problem we will find the best strategy to play the game, in some sense.

1. Given a ∈ R, let p = P(X_1 > a) > 0 and consider the stopping time T = inf{n : X_n > a}. Compute EY_T. (Hint: sum over the sets {T = n}.)
2. Consider γ such that E(X_1 − γ)^+ = c. For a = γ show that EY_T = γ.
3. Show that Y_n ≤ γ + Σ_{m=1}^n ((X_m − γ)^+ − c) (for any γ, actually).

4. Use Wald's identity to conclude that for any stopping time τ such that Eτ < ∞ we have EY_τ ≤ γ.

This means that stopping at time T results in the best expected profit γ.



Section 8

Convergence of laws. Selection theorem.

In this section we begin the discussion of weak convergence of distributions on metric spaces. Let (S, d) be a metric space with the metric d. Consider a measurable space (S, B) with the Borel σ-algebra B generated by open sets and let (P_n)_{n≥1} and P be some probability distributions on B. We define

$$C_b(S) = \big\{ f : S\to\mathbb{R} \text{ continuous and bounded}\big\}.$$

We say that P_n → P weakly if

$$\lim_{n\to\infty}\int_S f\, dP_n = \int_S f\, dP \quad\text{for all } f\in C_b(S). \qquad (8.0.1)$$

This is often denoted by P_n → P or P_n ⇒ P. Of course, the idea here is that the measure P on the Borel σ-algebra is determined uniquely by the integrals of f ∈ C_b(S), so we compare closeness of the measures by closeness of all these integrals. To see this, suppose that ∫ f dP = ∫ f dQ for all f ∈ C_b(S). Consider any open set U in S and let F = U^c. Using that d(x, F) = 0 if and only if x ∈ F (because F is closed), it is easy to see that

$$f_m(x) = \min\big(1, m\, d(x,F)\big) \uparrow I(x\in U) \quad\text{as } m\to\infty.$$

Since f_m ∈ C_b(S), by the monotone convergence theorem we get that P(U) = Q(U) and, by Dynkin's theorem, P = Q. We will also say that random variables X_n → X in distribution if their laws converge weakly. Notice that in this definition the random variables need not be defined on the same probability space, as long as they take values in the same metric space S. We will come back to the study of convergence on general metric spaces later in the course, and in this section we will prove only one general result, the Selection Theorem. Other results will be proved only on R or R^n to prepare us for the most famous example of convergence of laws, the Central Limit Theorem (CLT). First of all, let us notice that on the real line the convergence of probability measures can be expressed in terms of their c.d.f.s, as follows.
 
Theorem 24 If S = R then P_n → P weakly if and only if F_n(t) = P_n((−∞, t]) → F(t) = P((−∞, t]) for any point of continuity t of the c.d.f. F.

Proof. (⟹) Suppose that (8.0.1) holds. Let us approximate the indicator I(x ≤ t) by continuous functions so that

$$I(x\le t-\delta) \le \varphi_1(x) \le I(x\le t) \le \varphi_2(x) \le I(x\le t+\delta),$$

as in Fig. 8.1 (Figure 8.1: Approximating an indicator). Obviously, φ_1, φ_2 ∈ C_b(R). Then, using (8.0.1) for φ_1 and φ_2,


$$F(t-\delta) \le \int\varphi_1\, dF = \lim_n\int\varphi_1\, dF_n \le \lim_n F_n(t) \le \lim_n\int\varphi_2\, dF_n = \int\varphi_2\, dF \le F(t+\delta).$$

Therefore, for any δ > 0,

$$F(t-\delta) \le \lim_n F_n(t) \le F(t+\delta).$$

More carefully, we should write lim inf and lim sup but, since t is a point of continuity of F, letting δ → 0 proves that the limit lim_n F_n(t) exists and is equal to F(t).

(⟸) Let PC(F) be the set of points of continuity of F. Since F is monotone, the set PC(F) is dense in R. Take M large enough such that both −M, M ∈ PC(F) and P((−M, M]^c) ≤ ε. Clearly, for large enough n ≥ 1 we have P_n((−M, M]^c) ≤ 2ε. For any k > 1, consider a sequence of points −M = x_1^k ≤ x_2^k ≤ ... ≤ x_k^k = M such that all x_i^k ∈ PC(F) and max_i |x_{i+1}^k − x_i^k| → 0 as k → ∞. Given a function f ∈ C_b(R), consider an approximating function

$$f_k(x) = \sum_{1<i\le k} f(x_i^k)\, I\big(x\in(x_{i-1}^k, x_i^k]\big) + 0\cdot I\big(x\notin(-M,M]\big).$$

Since f is continuous,

$$\delta_k(M) = \sup_{|x|\le M}\big|f_k(x)-f(x)\big| \to 0,\quad k\to\infty.$$

Since all x_i^k ∈ PC(F), by assumption, we can write

$$\int f_k\, dF_n = \sum_{1<i\le k} f_k(x_i^k)\big(F_n(x_i^k)-F_n(x_{i-1}^k)\big) \overset{n\to\infty}{\longrightarrow} \sum_{1<i\le k} f_k(x_i^k)\big(F(x_i^k)-F(x_{i-1}^k)\big) = \int f_k\, dF.$$

On the other hand,

$$\Big|\int f\, dF - \int f_k\, dF\Big| \le \|f\|_\infty P\big((-M,M]^c\big) + \delta_k(M) \le \|f\|_\infty\,\varepsilon + \delta_k(M)$$

and, similarly, for large enough n ≥ 1,

$$\Big|\int f\, dF_n - \int f_k\, dF_n\Big| \le \|f\|_\infty P_n\big((-M,M]^c\big) + \delta_k(M) \le \|f\|_\infty\, 2\varepsilon + \delta_k(M).$$

Letting n → ∞, then k → ∞ and, finally, ε → 0 (or M → ∞), proves that ∫ f dF_n → ∫ f dF. □

Example. If P_n({n^{-1}}) = 1 and P({0}) = 1 then P_n → P weakly, but their c.d.f.s, obviously, do not converge at the point of discontinuity x = 0.

Our typical strategy for proving the convergence of probability measures will be based on the following elementary
observation.

Lemma 18 If for any sequence (n(k))_{k≥1} there exists a subsequence (n(k(r)))_{r≥1} such that P_{n(k(r))} → P weakly then P_n → P weakly.

Proof. Suppose not. Then for some f ∈ C_b(S) and for some ε > 0 there exists a subsequence (n(k)) such that

$$\Big|\int f\, dP_{n(k)} - \int f\, dP\Big| > \varepsilon.$$

But this contradicts the fact that for some subsequence P_{n(k(r))} → P weakly. □

As a result, to prove that Pn P weakly, it is enough to demonstrate two things:

1. First, that the sequence (Pn ) is such that one can always find converging subsequences for any sequence (Pn(k) ).
This describes a sort of relative compactness of the sequence (Pn ) and will be a consequence of uniform
tightness and the Selection Theorem proved below.
2. Second, we need to show that any subsequential limit is always the same, P, and this will be done by some ad
hoc methods, for example, using the method of characteristic functions in the case of the CLT.



We say that a sequence of distributions (P_n)_{n≥1} on a metric space (S, d) is uniformly tight if, for any ε > 0, there exists a compact K ⊆ S such that P_n(K) ≥ 1 − ε for all n. Checking this property can be difficult and, of course, one needs to understand what the compact sets look like for a particular metric space. In the case of the CLT on R^n, this will be a trivial task. The following fact is quite fundamental.

Theorem 25 (Selection Theorem) If (Pn )n1 is a uniformly tight sequence of laws on the metric space (S, d) then there
exists a subsequence (n(k)) such that Pn(k) converges weakly to some probability law P.

Let us recall the following well-known result.

Lemma 19 (Cantors diagonalization) Let A be a countable set and fn : A R, n 1. Then there exists a subsequence
(n(k)) such that fn(k) (a) converges for all a A, possibly to .

Proof. Let A = {a1 , a2 , . . .}. Take (n1 (k)) such that fn1 (k) (a1 ) converges. Take (n2 (k)) (n1 (k)) such that fn2 (k) (a2 )
converges. Recursively, take (nl (k)) (nl1 (k)) such that fnl (k) (al ) converges. Now consider the sequence (nk (k)).
Clearly, fnk (k) (al ) converges for any l because for k l, nk (k) {nl (k)} by construction. t
u
Proof of Theorem 25. We will prove the Selection Theorem for arbitrary metric spaces, since this result will be useful
to us later when we study the convergence of laws on general metric spaces. However, when S = R one can see this in
a much more intuitive way, as follows.
(The case S = R). Let A be a dense set of points in R. Given a sequence of probability measures P_n on R and their c.d.f.s F_n, by Cantor's diagonalization, there exists a subsequence (n(k)) such that F_{n(k)}(a) → F(a) for all a ∈ A. For x ∈ R \ A, we can extend F by

$$F(x) = \inf\big\{F(a) : x<a,\ a\in A\big\}.$$

Obviously, F(x) is non-decreasing but not necessarily right-continuous. However, at the points of discontinuity, we can redefine it to be right-continuous. Then F(x) will be a cumulative distribution function, since the fact that the P_n are uniformly tight ensures that F(x) → 0 or 1 if x → −∞ or +∞. In order to prove weak convergence of P_{n(k)} to the measure P with the c.d.f. F, let x be a point of continuity of F(x) and let a, b ∈ A be such that a < x < b. We have

$$F(a) = \lim_k F_{n(k)}(a) \le \liminf_k F_{n(k)}(x) \le \limsup_k F_{n(k)}(x) \le \lim_k F_{n(k)}(b) = F(b).$$

Since x is a point of continuity and A is dense,

$$\lim_{A\ni a\uparrow x} F(a) = F(x), \quad \lim_{A\ni b\downarrow x} F(b) = F(x),$$

and this proves that F_{n(k)}(x) → F(x) for all such x. By Theorem 24, this means that the laws P_{n(k)} converge to P. Let us now prove the general case on general metric spaces.
(The general case). If K is a compact then, obviously, C_b(K) = C(K). Later in these lectures, when we deal in more detail with convergence on general metric spaces, we will prove the following fact, which is well-known and is a consequence of the Stone-Weierstrass theorem.

C(K) is separable w.r.t. the ℓ_∞ norm ||f||_∞ = sup_{x∈K} |f(x)|.

Since the P_n are uniformly tight, for any r ≥ 1 we can find a compact K_r such that P_n(K_r) > 1 − 1/r for all n ≥ 1. Let C_r ⊆ C(K_r) be a countable dense subset of C(K_r). By Cantor's diagonalization, there exists a subsequence (n(k)) such that P_{n(k)}(f) converges for all f ∈ C_r for all r ≥ 1. Since C_r is dense in C(K_r), this implies that P_{n(k)}(f) converges for all f ∈ C(K_r) for all r ≥ 1. Next, for any f ∈ C_b(S),

$$\Big|\int f\, dP_{n(k)} - \int_{K_r} f\, dP_{n(k)}\Big| \le \int_{K_r^c}|f|\, dP_{n(k)} \le \|f\|_\infty P_{n(k)}(K_r^c) \le \frac{\|f\|_\infty}{r}.$$

This implies that the limit

$$I(f) := \lim_{k\to\infty}\int f\, dP_{n(k)} \qquad (8.0.2)$$



exists. The question is why this limit is an integral I(f) = ∫ f dP for some probability measure P? Basically, this is a consequence of the Riesz representation theorem for positive linear functionals on locally compact Hausdorff spaces. One can define the functional I_r(f) on each compact K_r, find a measure on it using the Riesz representation theorem, check that these measures agree on the intersections, and extend to a probability measure on their union. However, instead, we will use a more general version of the Riesz representation theorem, the Stone-Daniell theorem from measure theory, which states the following.
Given a set S, a family of functions L = {f : S → R} is called a vector lattice if

$$f, g\in\mathcal{L} \implies cf+g\in\mathcal{L} \text{ for } c\in\mathbb{R}, \quad\text{and}\quad f\vee g,\ f\wedge g\in\mathcal{L}.$$

A vector lattice is called a Stone vector lattice if f ∧ 1 ∈ L for any f ∈ L. For example, any vector lattice that contains constants is automatically a Stone vector lattice.
A functional I : L → R is called a pre-integral if

1. I(cf + g) = cI(f) + I(g),

2. f ≥ 0 ⟹ I(f) ≥ 0,
3. f_n ↓ 0, ||f_n||_∞ < ∞ ⟹ I(f_n) → 0.

See R. M. Dudley, "Real Analysis and Probability", for a proof of the following:

(Stone-Daniell theorem) If L is a Stone vector lattice and I is a pre-integral on L then I(f) = ∫ f dμ for some unique measure μ on the minimal σ-algebra on which all functions in L are measurable.

We will use this theorem with L = C_b(S) and I defined in (8.0.2). The first two properties are obvious. To prove the third one, let us consider a sequence such that

$$f_n\downarrow 0, \quad 0\le f_n(x)\le f_1(x)\le\|f_1\|_\infty.$$

On any compact K_r, f_n ↓ 0 uniformly, i.e.

$$\|f_n\|_{\infty, K_r} =: \delta_{n,r} \overset{n\to\infty}{\longrightarrow} 0.$$

Since

$$\int f_n\, dP_{n(k)} = \int_{K_r} f_n\, dP_{n(k)} + \int_{K_r^c} f_n\, dP_{n(k)} \le \delta_{n,r} + \frac{1}{r}\|f_1\|_\infty,$$

we get

$$I(f_n) = \lim_{k\to\infty}\int f_n\, dP_{n(k)} \le \delta_{n,r} + \frac{1}{r}\|f_1\|_\infty.$$

Letting n → ∞ and r → ∞, we get that I(f_n) → 0. By the Stone-Daniell theorem,

$$I(f) = \int f\, dP$$

for some unique measure P on σ(C_b(S)). The choice of f = 1 shows that I(f) = 1 = P(S), which means that P is a probability measure. Finally, let us show that σ(C_b(S)) is the Borel σ-algebra B generated by open sets. Since any f ∈ C_b(S) is measurable on B we get σ(C_b(S)) ⊆ B. On the other hand, let F ⊆ S be any closed set and take the function f(x) = min(1, d(x, F)). We have |f(x) − f(y)| ≤ d(x, y), so f ∈ C_b(S) and

$$f^{-1}(\{0\}) \in \sigma(C_b(S)).$$

However, since F is closed, f^{-1}({0}) = {x : d(x, F) = 0} = F, and this proves that B ⊆ σ(C_b(S)). □

Conversely, the following holds (we will prove this result later for any complete separable metric space).

Theorem 26 If Pn converges weakly to P on Rk then (Pn )n1 is uniformly tight.



Proof. For any ε > 0, there exists large enough M > 0 such that P(|x| > M) < ε. Consider the function

$$\varphi(s) = \begin{cases} 0, & s\le M,\\ 1, & s\ge 2M,\\ (s-M)/M, & M\le s\le 2M,\end{cases}$$

and let φ(x) := φ(|x|) for x ∈ R^k. Since P_n → P weakly,

$$\limsup_n P_n\big(|x|>2M\big) \le \limsup_n\int\varphi(x)\, dP_n(x) = \int\varphi(x)\, dP(x) \le P\big(|x|>M\big) \le \varepsilon.$$

For n large enough, n ≥ n_0, we get P_n(|x| > 2M) ≤ 2ε. For n < n_0 choose M_n so that P_n(|x| > M_n) ≤ 2ε. Take M' = max{M_1,...,M_{n_0-1}, 2M}. As a result, P_n(|x| > M') ≤ 2ε for all n ≥ 1. □

Finally, let us relate convergence in distribution to other forms of convergence. Consider random variables X and Xn
on some probability space (, A , P) with values in a metric space (S, d). Let P and Pn be their corresponding laws
on Borel sets B in S. Convergence of Xn to X in probability and almost surely is defined exactly the same way as for
S = R by replacing |Xn X| with d(Xn , X).

Lemma 20 X_n → X in probability if and only if for any sequence (n(k)) there exists a subsequence (n(k(r))) such that X_{n(k(r))} → X almost surely.

Proof. (⟸) Suppose X_n does not converge to X in probability. Then, for small ε > 0, there exists a subsequence (n(k))_{k≥1} such that P(d(X, X_{n(k)}) ≥ ε) ≥ ε. This contradicts the existence of a subsequence X_{n(k(r))} that converges to X almost surely.

(⟹) Given a subsequence (n(k)) let us choose (k(r)) so that

$$P\Big(d(X_{n(k(r))}, X) \ge \frac{1}{r}\Big) \le \frac{1}{r^2}.$$

By the Borel-Cantelli lemma, these events can occur i.o. with probability 0, which means that, with probability one, for large enough r,

$$d(X_{n(k(r))}, X) \le \frac{1}{r},$$

i.e. X_{n(k(r))} → X almost surely. □

Lemma 21 If X_n → X in probability then X_n → X in distribution.

Proof. By Lemma 20, for any subsequence (n(k)) there exists a subsequence (n(k(r))) such that X_{n(k(r))} → X almost surely. Given f ∈ C_b(R), by the dominated convergence theorem,

$$Ef(X_{n(k(r))}) \to Ef(X),$$

i.e. X_{n(k(r))} → X weakly. By Lemma 18, X_n → X in distribution. □

Exercise. Let Xn be random variables on the same probability space with values in a metric space S. If for some point
s S, Xn s in distribution, show that Xn s in probability.

Exercise. For the following sequences of laws P_n on R having densities f_n, which are uniformly tight? (a) f_n = I(0 ≤ x ≤ n)/n, (b) f_n = n e^{−nx} I(x ≥ 0), (c) f_n = e^{−x/n}/n · I(x ≥ 0).

Exercise. Suppose that Xn X in distribution on R and Yn c R in probability. Show that XnYn cX in distribution,
assuming that Xn ,Yn are defined on the same probability space.

Exercise. Suppose that random variables (Xn ) are independent and Xn X in probability. Show that X is almost surely
constant.



Section 9

Characteristic functions.

In the next section, we will prove one of the most classical results in Probability Theory, the central limit theorem, and one of the main tools will be the so-called characteristic functions. Let X = (X_1,...,X_k) be a random vector on R^k with the distribution P and let t = (t_1,...,t_k) ∈ R^k. The characteristic function of X is defined by

$$f(t) = Ee^{i(t,X)} = \int e^{i(t,x)}\, dP(x).$$

In this section, we will collect various important and useful facts about the characteristic functions. The standard
normal distribution N(0, 1) on R with the density

$$p(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

will play a central role, so let us start by computing its characteristic function. First of all, notice that this is indeed a density, since

$$\Big(\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-x^2/2}\, dx\Big)^2 = \frac{1}{2\pi}\int_{\mathbb{R}^2} e^{-(x^2+y^2)/2}\, dxdy = \frac{1}{2\pi}\int_0^{2\pi}\!\!\int_0^\infty e^{-r^2/2}\, r\, dr\, d\theta = 1.$$

If X has the standard normal distribution N(0, 1) then, obviously, EX = 0 and

$$\mathrm{Var}(X) = EX^2 = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} x^2 e^{-x^2/2}\, dx = -\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} x\, d\big(e^{-x^2/2}\big) = 1,$$

by integration by parts. To motivate the computation of the characteristic function, let us first notice that for λ ∈ R,

$$Ee^{\lambda X} = \frac{1}{\sqrt{2\pi}}\int e^{\lambda x-\frac{x^2}{2}}\, dx = e^{\frac{\lambda^2}{2}}\frac{1}{\sqrt{2\pi}}\int e^{-\frac{(x-\lambda)^2}{2}}\, dx = e^{\frac{\lambda^2}{2}}\frac{1}{\sqrt{2\pi}}\int e^{-\frac{x^2}{2}}\, dx = e^{\frac{\lambda^2}{2}}.$$

For complex λ = it, we begin similarly by completing the square,

$$Ee^{itX} = e^{-\frac{t^2}{2}}\frac{1}{\sqrt{2\pi}}\int e^{-\frac{(x-it)^2}{2}}\, dx = e^{-\frac{t^2}{2}}\int_{-it+\mathbb{R}}\varphi(z)\, dz,$$

where we denoted

$$\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}} \quad\text{for } z\in\mathbb{C}.$$

Since φ is analytic, by Cauchy's theorem, the integral over a closed path is equal to 0. Let us take the closed path −it + x for x from −M to +M, then M + iy for y from −t to 0, then x from M to −M and, finally, −M + iy for y from 0 to −t. For large M, the function φ(z) is very small on the intervals ±M + iy, so letting M → ∞ we get

$$\int_{-it+\mathbb{R}}\varphi(z)\, dz = \int_{\mathbb{R}}\varphi(z)\, dz = 1,$$



and we proved that

$$f(t) = Ee^{itX} = e^{-\frac{t^2}{2}}. \qquad (9.0.1)$$
It is also easy to check that Y = μ + σX has mean μ = EY, variance σ² = Var(Y) and the density

$$\frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big).$$

This is the so-called normal distribution N(μ, σ²) and its characteristic function is given by

$$Ee^{itY} = Ee^{it(\mu+\sigma X)} = e^{it\mu - t^2\sigma^2/2}. \qquad (9.0.2)$$
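Formula (9.0.2) can be checked against the empirical characteristic function of a normal sample. The following sketch is not part of the notes; it assumes numpy, and μ, σ, the sample size and the values of t are arbitrary illustrative choices.

import numpy as np

# Compare the empirical characteristic function of N(mu, sigma^2) samples
# with the closed form exp(i*t*mu - t^2*sigma^2/2) from (9.0.2).
rng = np.random.default_rng(4)
mu, sigma, n = 1.0, 2.0, 200_000
Y = mu + sigma * rng.standard_normal(n)

for t in [0.3, 1.0, 2.0]:
    empirical = np.mean(np.exp(1j * t * Y))
    exact = np.exp(1j * t * mu - t**2 * sigma**2 / 2)
    print(f"t = {t}:  empirical = {empirical:.4f}   exact = {exact:.4f}")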

Our next very important observation is that integrability of X is related to smoothness of its characteristic function.

Lemma 22 If X is a real-valued random variable such that E|X|^r < ∞ for integer r ≥ 1 then f(t) ∈ C^r(R) and

$$f^{(j)}(t) = E(iX)^j e^{itX}$$

for j ≤ r.

Proof. If r = 0, we use the fact that |e^{itX}| ≤ 1 to conclude that

$$f(t) = Ee^{itX} \to Ee^{isX} = f(s) \quad\text{if } t\to s,$$

by the dominated convergence theorem. This means that f ∈ C(R), i.e. characteristic functions are always continuous. If r = 1, E|X| < ∞, we can use

$$\Big|\frac{e^{itX}-e^{isX}}{t-s}\Big| \le |X|$$

and, therefore, by the dominated convergence theorem,

$$f'(t) = \lim_{s\to t} E\,\frac{e^{itX}-e^{isX}}{t-s} = E\, iXe^{itX}.$$

Also, by the dominated convergence theorem, E iXe^{itX} ∈ C(R), which means that f ∈ C^1(R). We proceed by induction. Suppose that we proved that

$$f^{(j)}(t) = E(iX)^j e^{itX}$$

and that r = j + 1, E|X|^{j+1} < ∞. Then, we can use that

$$\Big|\frac{(iX)^j e^{itX}-(iX)^j e^{isX}}{t-s}\Big| \le |X|^{j+1},$$

so that, by the dominated convergence theorem, f^{(j+1)}(t) = E(iX)^{j+1} e^{itX} ∈ C(R). □
Next, we want to show that the characteristic function uniquely determines the distribution. This is usually proved using convolutions. Let X and Y be two independent random vectors on R^k with the distributions P and Q. We denote by P * Q the convolution of P and Q, which is the distribution L(X + Y) of the sum X + Y. We have

$$P*Q(A) = EI(X+Y\in A) = \iint I(x+y\in A)\, dP(x)\, dQ(y) = \iint I(x\in A-y)\, dP(x)\, dQ(y) = \int P(A-y)\, dQ(y).$$

If P has density p then

$$P*Q(A) = \iint I(x+y\in A)\, p(x)\, dx\, dQ(y) = \iint I(z\in A)\, p(z-y)\, dz\, dQ(y) = \iint_A p(z-y)\, dz\, dQ(y) = \int_A\Big(\int p(z-y)\, dQ(y)\Big)\, dz,$$



which means that P * Q has density

$$f(x) = \int p(x-y)\, dQ(y). \qquad (9.0.3)$$

If, in addition, Q has density q then

$$f(x) = \int p(x-y)\, q(y)\, dy. \qquad (9.0.4)$$

Let us denote by N(0, σ²I) the distribution of the random vector X = (X_1,...,X_k) of i.i.d. random variables with the normal distribution N(0, σ²). The density of X is given by

$$\prod_{i=1}^k\frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{1}{2\sigma^2}x_i^2} = \Big(\frac{1}{\sqrt{2\pi}\sigma}\Big)^k e^{-\frac{1}{2\sigma^2}|x|^2}.$$

Given a distribution P on R^k, let us denote

$$P_\sigma = P * N(0,\sigma^2 I).$$

It turns out that P_σ is always absolutely continuous and its density can be written in terms of the characteristic function of P.

Lemma 23 The convolution P_σ = P * N(0, σ²I) has density

$$p_\sigma(x) = \Big(\frac{1}{2\pi}\Big)^k\int f(t)\, e^{-i(t,x)-\frac{\sigma^2}{2}|t|^2}\, dt,$$

where f(t) = ∫ e^{i(t,x)} dP(x).

Proof. By (9.0.3), P * N(0, σ²I) has density

$$p_\sigma(x) = \Big(\frac{1}{\sqrt{2\pi}\sigma}\Big)^k\int e^{-\frac{1}{2\sigma^2}|x-y|^2}\, dP(y).$$

Using (9.0.1), we can write

$$e^{-\frac{1}{2\sigma^2}(x_i-y_i)^2} = \frac{1}{\sqrt{2\pi}}\int e^{\frac{i}{\sigma}(x_i-y_i)z_i}\, e^{-\frac12 z_i^2}\, dz_i,$$

and taking the product over i ≤ k we get

$$e^{-\frac{1}{2\sigma^2}|x-y|^2} = \Big(\frac{1}{\sqrt{2\pi}}\Big)^k\int e^{\frac{i}{\sigma}(x-y,z)}\, e^{-\frac12|z|^2}\, dz.$$

Then we can continue

$$p_\sigma(x) = \Big(\frac{1}{2\pi\sigma}\Big)^k\iint e^{\frac{i}{\sigma}(x-y,z)-\frac12|z|^2}\, dz\, dP(y) = \Big(\frac{1}{2\pi\sigma}\Big)^k\iint e^{\frac{i}{\sigma}(x-y,z)-\frac12|z|^2}\, dP(y)\, dz = \Big(\frac{1}{2\pi\sigma}\Big)^k\int f\Big(-\frac{z}{\sigma}\Big)\, e^{\frac{i}{\sigma}(x,z)-\frac12|z|^2}\, dz.$$

Making the change of variables z = −σt finishes the proof. □

This immediately implies that the characteristic function uniquely determines the distribution.

Theorem 27 (Uniqueness) If

$$\int e^{i(t,x)}\, dP(x) = \int e^{i(t,x)}\, dQ(x) \quad\text{for all } t,$$

then P = Q.



Proof. By the above Lemma, P_σ = Q_σ. If X ~ P and g ~ N(0, I) then X + σg → X almost surely as σ → 0 and, therefore, P_σ → P weakly. Similarly, Q_σ → Q. □

This immediately implies the following stability property of the normal distribution.

Lemma 24 If X_j for j = 1, 2 are independent and have normal distributions N(μ_j, σ_j²) then X_1 + X_2 has normal distribution N(μ_1 + μ_2, σ_1² + σ_2²).

Proof. By independence and (9.0.2), the characteristic function of the sum is equal to

$$Ee^{it(X_1+X_2)} = Ee^{itX_1}\, Ee^{itX_2} = e^{it\mu_1-t^2\sigma_1^2/2}\, e^{it\mu_2-t^2\sigma_2^2/2} = e^{it(\mu_1+\mu_2)-t^2(\sigma_1^2+\sigma_2^2)/2}.$$

This is the characteristic function of the normal distribution N(μ_1 + μ_2, σ_1² + σ_2²), so the claim follows from the Uniqueness Theorem. One can also prove this by a straightforward computation using the formula (9.0.4) for the density of the convolution, which is left as an exercise below. □

Lemma 23 gives some additional information when the characteristic function is integrable.

Lemma 25 (Fourier inversion formula) If ∫ |f(t)| dt < ∞ then P itself has density

$$p(x) = \Big(\frac{1}{2\pi}\Big)^k\int f(t)\, e^{-i(t,x)}\, dt.$$

Proof. Since

$$f(t)\, e^{-i(t,x)-\frac{\sigma^2}{2}|t|^2} \to f(t)\, e^{-i(t,x)}$$

pointwise as σ → 0 and

$$\big|f(t)\, e^{-i(t,x)-\frac{\sigma^2}{2}|t|^2}\big| \le |f(t)|,$$

the integrability of f(t) implies, by the dominated convergence theorem, that p_σ(x) → p(x). Since P_σ → P weakly, for any g ∈ C_b(R^k),

$$\int g(x)\, p_\sigma(x)\, dx \to \int g(x)\, dP(x).$$

On the other hand, since

$$|p_\sigma(x)| \le \Big(\frac{1}{2\pi}\Big)^k\int|f(t)|\, dt < \infty,$$

by the dominated convergence theorem, for any compactly supported g ∈ C_c(R^k),

$$\int g(x)\, p_\sigma(x)\, dx \to \int g(x)\, p(x)\, dx.$$

Therefore, for any such g ∈ C_c(R^k),

$$\int g(x)\, dP(x) = \int g(x)\, p(x)\, dx.$$

Of course, this means that p(x) is the density of P. □
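The inversion formula can be illustrated numerically by truncating and discretizing the integral. The sketch below is not part of the notes; it assumes numpy, uses the characteristic function e^{−t²/2} of N(0,1), and the grid is an arbitrary illustrative choice.

import numpy as np

# Numerical Fourier inversion: p(x) = (1/2pi) * integral f(t) exp(-i t x) dt,
# approximated by a Riemann sum for f(t) = exp(-t^2/2), the c.f. of N(0,1).
t = np.linspace(-30, 30, 20001)
dt = t[1] - t[0]
f = np.exp(-t**2 / 2)

for x in [0.0, 1.0, 2.0]:
    p = np.real(np.sum(f * np.exp(-1j * t * x)) * dt) / (2 * np.pi)
    exact = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    print(f"x = {x}:  inverted density ~ {p:.6f}   exact = {exact:.6f}")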

Because of the uniqueness theorem, the convergence of characteristic functions implies convergence of distributions
under some mild additional assumptions.

Lemma 26 If (P_n) is uniformly tight on R^k and the characteristic functions converge to some function f,

$$f_n(t) = \int e^{i(t,x)}\, dP_n(x) \to f(t),$$

then f(t) = ∫ e^{i(t,x)} dP(x) is the characteristic function of some distribution P and P_n → P weakly.



Proof. For any sequence (n(k)), by the Selection Theorem, there exists a subsequence (n(k(r))) such that P_{n(k(r))} converges weakly to some distribution P. Since e^{i(t,x)} is bounded and continuous,

$$\int e^{i(t,x)}\, dP_{n(k(r))} \to \int e^{i(t,x)}\, dP(x)$$

as r → ∞ and, therefore, f is the characteristic function of P. By the uniqueness theorem, the distribution P does not depend on the sequence (n(k)). By Lemma 18, P_n → P weakly. □

Even though in many cases uniform tightness of (Pn ) is not difficult to check directly, it also follows automatically
from the continuity of the limit of characteristic functions at zero. To show this, we will need the following bound.

Lemma 27 If X is a real-valued random variable then

$$P\Big(|X| > \frac{1}{u}\Big) \le \frac{7}{u}\int_0^u\big(1-\mathrm{Re}\, f(t)\big)\, dt.$$

Proof. Since Re f(t) = ∫ cos(tx) dP(x), we can write

$$\frac{1}{u}\int_0^u\big(1-\mathrm{Re}\, f(t)\big)\, dt = \frac{1}{u}\int_0^u\!\!\int_{\mathbb{R}}\big(1-\cos tx\big)\, dP(x)\, dt = \int_{\mathbb{R}}\frac{1}{u}\int_0^u\big(1-\cos tx\big)\, dt\, dP(x) = \int_{\mathbb{R}}\Big(1-\frac{\sin xu}{xu}\Big)\, dP(x)$$

$$\ge \int_{|xu|\ge 1}\Big(1-\frac{\sin xu}{xu}\Big)\, dP(x) \ \Big\{\text{since } \frac{\sin y}{y}\le\frac{\sin 1}{1} \text{ if } y\ge 1\Big\} \ \ge (1-\sin 1)\int_{|xu|\ge 1} 1\, dP(x) \ge \frac{1}{7}\, P\Big(|X|\ge\frac{1}{u}\Big),$$

which finishes the proof. □

Theorem 28 (Levy's continuity theorem) Let (X_n) be a sequence of random variables on R^k. Suppose that

$$f_n(t) = Ee^{i(t,X_n)} \to f(t)$$

and f(t) is continuous at 0 along each axis. Then there exists a probability distribution P such that

$$f(t) = \int e^{i(t,x)}\, dP(x)$$

and P_n = L(X_n) → P weakly.

Proof. By Lemma 26, we only need to show that {L(X_n)} is uniformly tight. If we denote X_n = (X_{n,1},...,X_{n,k}), then for the characteristic functions of the i-th coordinates:

$$f_n^i(t_i) := f_n(0,\dots,t_i,0,\dots,0) = Ee^{it_iX_{n,i}} \to f(0,\dots,t_i,\dots,0) =: f^i(t_i).$$

Since f_n(0) = 1, we have f(0) = 1. By the continuity of f^i at 0, for any ε > 0 we can find δ > 0 such that for all i ≤ k, |f^i(t_i) − 1| ≤ ε if |t_i| ≤ δ. By the dominated convergence theorem,

$$\lim_{n\to\infty}\frac{1}{\delta}\int_0^\delta\big(1-\mathrm{Re}\, f_n^i(t_i)\big)\, dt_i = \frac{1}{\delta}\int_0^\delta\big(1-\mathrm{Re}\, f^i(t_i)\big)\, dt_i \le \frac{1}{\delta}\int_0^\delta\big|1-f^i(t_i)\big|\, dt_i \le \varepsilon.$$



Together with the previous lemma this implies that

$$P\Big(|X_{n,i}| > \frac{1}{\delta}\Big) \le \frac{7}{\delta}\int_0^\delta\big(1-\mathrm{Re}\, f_n^i(t_i)\big)\, dt_i \le 7\cdot 2\varepsilon,$$

for large enough n. Since |X_n|² = Σ_{i=1}^k |X_{n,i}|², the union bound implies that

$$P\Big(|X_n| > \frac{\sqrt{k}}{\delta}\Big) \le 14k\varepsilon,$$

which means that the sequence P_n = L(X_n) is uniformly tight. □
Using characteristic functions, we can prove a couple more basic properties of convergence in distributions.

Lemma 28 (Continuous Mapping) Suppose that P_n → P weakly on X and G : X → Y is a continuous map. Then P_n ∘ G^{-1} → P ∘ G^{-1} on Y. In other words, if Z_n → Z in distribution then G(Z_n) → G(Z) in distribution.

Proof. This is obvious, because for any f ∈ C_b(Y) we have f ∘ G ∈ C_b(X) and, therefore, Ef(G(Z_n)) → Ef(G(Z)), which is the definition of convergence G(Z_n) → G(Z) in distribution. □

Lemma 29 If P_n → P on R^k and Q_n → Q on R^m then P_n × Q_n → P × Q on R^{k+m}.

Proof. By Fubini's theorem, the characteristic function

$$\int e^{i(t,x)}\, dP_n\times Q_n(x) = \int e^{i(t_1,x_1)}\, dP_n\int e^{i(t_2,x_2)}\, dQ_n \to \int e^{i(t_1,x_1)}\, dP\int e^{i(t_2,x_2)}\, dQ = \int e^{i(t,x)}\, dP\times Q.$$

By Lemma 26, it remains to show that (P_n × Q_n) is uniformly tight. By Theorem 26, since P_n → P, (P_n) is uniformly tight. Therefore, there exists a compact K on R^k such that P_n(K) > 1 − ε. Similarly, for some compact K' on R^m, Q_n(K') > 1 − ε. We have P_n × Q_n(K × K') > 1 − 2ε, and K × K' is a compact on R^{k+m}. □

Corollary 1 If both P_n → P and Q_n → Q on R^k then P_n * Q_n → P * Q.

Proof. Since the function G : R^{k+k} → R^k given by G(x, y) = x + y is continuous, by the continuous mapping lemma,

$$P_n * Q_n = (P_n\times Q_n)\circ G^{-1} \to (P\times Q)\circ G^{-1} = P * Q,$$

which finishes the proof. □
Exercise. Prove Lemma 24 using the formula (9.0.4) for the density of the convolution.
Exercise. Does there exist a random variable X such that Ee^{itX} = e^{−t^4}?
Exercise. Prove that a probability distribution on R^k is uniquely determined by the probabilities of half-spaces.
Exercise. Prove that if P * Q = P on R^k then Q({0}) = 1. (Hint: use that a characteristic function is continuous, in particular, in the neighborhood of zero.)
Exercise. Find the characteristic function of the law on R with density max(1 − |x|, 0). Then use the Fourier inversion formula to conclude that max(1 − |t|, 0) is a characteristic function of some distribution.
Exercise. Let f be a continuous function on R with f(0) = 1, f(−t) = f(t) for all t, and such that f restricted to [0, ∞) is convex, with f(t) → 0 as t → ∞. Show that f is a characteristic function. (Hints: approximate the general case by piecewise linear functions. Then use that max(1 − |t|, 0) is a characteristic function and that characteristic functions are closed under convex combinations and scaling f(t) → f(at) for a ∈ R.)
Exercise. Does there exist a random variable X such that Ee^{itX} = e^{−|t|}?
Exercise. Show that there are distributions P_1, P_2 whose characteristic functions coincide on some interval (a, b). On the other hand, show that if P is compactly supported then its characteristic function on any open interval (a, b) uniquely determines P. (Hint: show that in this case f(z) = ∫ e^{zx} dP(x) is analytic on C.)



Section 10

The Central Limit Theorem.

We have seen in the Hoeffding-Chernoff inequality in Section 4 that if X_1,...,X_n are independent flips of a fair coin, P(X_i = 0) = P(X_i = 1) = 1/2, and X̄_n is their average, then

$$P\big(|\bar X_n - 1/2| > t\big) \le 2e^{-2nt^2}.$$

If we denote

$$Z_n = 2\sqrt{n}\,(\bar X_n - 1/2) = \frac{S_n - n\mu}{\sigma\sqrt{n}},$$

where μ = EX_1 = 1/2 and σ² = Var(X_1) = 1/4, then the Hoeffding-Chernoff inequality can be rewritten as

$$P(|Z_n| \ge t) \le 2e^{-t^2/2}.$$

We will see that this inequality is rather accurate for large sample size n, since we will show that

$$\lim_{n\to\infty} P(|Z_n|\ge t) = \frac{2}{\sqrt{2\pi}}\int_t^\infty e^{-x^2/2}\, dx \sim \frac{2}{t\sqrt{2\pi}}\, e^{-t^2/2}.$$

The asymptotic equivalence as t → ∞ is a simple exercise, and the large n limit is a consequence of a more general result,

$$\lim_{n\to\infty} P(Z_n\le z) = \Phi(z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^z e^{-x^2/2}\, dx, \qquad (10.0.1)$$

for all z ∈ R. In other words, Z_n converges in distribution to the standard normal distribution with the density e^{-x²/2}/√(2π) and the c.d.f. Φ(z). Moreover, this holds for any i.i.d. sequence (X_n) with μ = EX_1 and σ² = Var(X_1) < ∞, and

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{X_i-\mu}{\sigma}. \qquad (10.0.2)$$
Before we use characteristic functions to prove this result as well as its multivariate version, let us describe a more
intuitive approach via the so-called Lindebergs method, which emphasizes a little bit more explicitly the main reason
behind the central limit theorem, namely, the stability property of the normal distribution proved in Lemma 24.
Theorem 29 Consider an i.i.d. sequence (X_i)_{i≥1} such that EX_1 = μ, E(X_1 − μ)² = σ² and E|X_1|³ < ∞. Then the distribution of Z_n defined in (10.0.2) converges weakly to the standard normal distribution N(0, 1).

Remark. One can easily modify the proof we give below to get rid of the unnecessary assumption E|X_1|³ < ∞. This will appear as an exercise in the next section, where we will state and prove a more general version of Lindeberg's CLT for non-i.i.d. random variables.

Proof. First of all, notice that the random variables (X_i − μ)/σ have mean 0 and variance 1. Therefore, it is enough to prove the result for

$$Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$$



under the assumption that EX_1 = 0, EX_1² = 1. Let (g_i)_{i≥1} be independent standard normal random variables. Then, by the stability property,

$$Z = \frac{1}{\sqrt{n}}\sum_{i=1}^n g_i$$

also has standard normal distribution N(0, 1). If, for 1 ≤ m ≤ n + 1, we define

$$T_m = \frac{1}{\sqrt{n}}\big(g_1+\dots+g_{m-1}+X_m+\dots+X_n\big)$$

then, for any bounded function f : R → R, we can write

$$Ef(Z_n) - Ef(Z) = \sum_{m=1}^n\big(Ef(T_m)-Ef(T_{m+1})\big) \le \sum_{m=1}^n\big|Ef(T_m)-Ef(T_{m+1})\big|.$$

If we denote

$$S_m = \frac{1}{\sqrt{n}}\big(g_1+\dots+g_{m-1}+X_{m+1}+\dots+X_n\big)$$

then T_m = S_m + X_m/√n and T_{m+1} = S_m + g_m/√n. Suppose now that f has uniformly bounded third derivative. Then, by Taylor's formula,

$$\Big|f(T_m) - f(S_m) - \frac{f'(S_m)X_m}{\sqrt{n}} - \frac{f''(S_m)X_m^2}{2n}\Big| \le \frac{\|f'''\|_\infty |X_m|^3}{6n^{3/2}}$$

and

$$\Big|f(T_{m+1}) - f(S_m) - \frac{f'(S_m)g_m}{\sqrt{n}} - \frac{f''(S_m)g_m^2}{2n}\Big| \le \frac{\|f'''\|_\infty |g_m|^3}{6n^{3/2}}.$$

Notice that S_m is independent of X_m and g_m and, therefore,

$$Ef'(S_m)X_m = Ef'(S_m)\, EX_m = 0 = Ef'(S_m)\, Eg_m = Ef'(S_m)g_m$$

and

$$Ef''(S_m)X_m^2 = Ef''(S_m)\, EX_m^2 = Ef''(S_m) = Ef''(S_m)\, Eg_m^2 = Ef''(S_m)g_m^2.$$

As a result, taking expectations and subtracting the above inequalities, we get

$$\big|Ef(T_m) - Ef(T_{m+1})\big| \le \frac{\|f'''\|_\infty\big(E|X_1|^3 + E|g_1|^3\big)}{6n^{3/2}}.$$

Adding up over 1 ≤ m ≤ n, we proved that

$$\big|Ef(Z_n) - Ef(Z)\big| \le \frac{\|f'''\|_\infty\big(E|X_1|^3 + E|g_1|^3\big)}{6\sqrt{n}}$$

and lim_{n→∞} Ef(Z_n) = Ef(Z). We proved in Section 8 that this convergence for f ∈ C_b(R) implies convergence of the c.d.f.s in (10.0.1), but the same proof also works using functions with bounded third derivative. We will leave it as an exercise at the end of this section. □
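The error bound from the swapping argument can be probed numerically. The following sketch is not part of the notes; it assumes numpy, takes the smooth test function f(x) = cos(x) (so ||f'''||_∞ ≤ 1) and centered exponential steps, and all parameters are illustrative. The observed difference stays below the bound ||f'''||_∞ (E|X_1|³ + E|g_1|³)/(6√n).

import numpy as np

# Estimate |E f(Z_n) - E f(Z)| for f(x) = cos(x) with steps X_i = Exp(1) - 1
# (so EX_1 = 0, EX_1^2 = 1) and compare with the bound from the proof above.
rng = np.random.default_rng(8)
trials = 200_000
EX3 = 12 / np.e - 2            # E|X_1|^3 for X_1 = Exp(1) - 1
Eg3 = np.sqrt(8 / np.pi)       # E|g_1|^3 for standard normal g_1

for n in [10, 100, 1000]:
    Zn = (rng.gamma(n, 1.0, size=trials) - n) / np.sqrt(n)   # centered, scaled sum of n Exp(1)
    Z = rng.standard_normal(trials)
    diff = abs(np.mean(np.cos(Zn)) - np.mean(np.cos(Z)))
    bound = (EX3 + Eg3) / (6 * np.sqrt(n))
    print(f"n = {n:>5d}   |E cos(Z_n) - E cos(Z)| ~ {diff:.5f}   bound ~ {bound:.5f}")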
u
Now, let us prove the CLT using the method of characteristic functions.
Theorem 30 (Central Limit Theorem) Consider an i.i.d. sequence (X_i)_{i≥1} such that EX_1 = 0, EX_1² = 1. Then the distribution of Z_n = S_n/√n converges weakly to the standard normal distribution N(0, 1).

Proof. We begin by showing that the characteristic function of S_n/√n converges to the characteristic function of the standard normal distribution N(0, 1),

$$\lim_{n\to\infty} Ee^{it\frac{S_n}{\sqrt{n}}} = e^{-\frac12 t^2},$$

for all t ∈ R. By independence,

$$Ee^{it\frac{S_n}{\sqrt{n}}} = \prod_{i=1}^n Ee^{i\frac{tX_i}{\sqrt{n}}} = \Big(Ee^{i\frac{tX_1}{\sqrt{n}}}\Big)^n.$$



Since EX_1² < ∞, Lemma 22 implies that the characteristic function f(t) = Ee^{itX_1} ∈ C²(R) and, therefore,

$$f(t) = f(0) + f'(0)t + \frac12 f''(0)t^2 + o(t^2) \quad\text{as } t\to 0.$$

Since

$$f(0) = 1, \quad f'(0) = E\, iXe^{i0X} = iEX = 0, \quad f''(0) = E(iX)^2 = -EX^2 = -1,$$

we get

$$f(t) = 1 - \frac{t^2}{2} + o(t^2).$$

Finally,

$$Ee^{it\frac{S_n}{\sqrt{n}}} = f\Big(\frac{t}{\sqrt{n}}\Big)^n = \Big(1 - \frac{t^2}{2n} + o\Big(\frac{t^2}{n}\Big)\Big)^n \to e^{-\frac12 t^2}$$

as n → ∞. The result then follows from Levy's continuity theorem in the previous section. Alternatively, we could also use Lemma 26 since it is easy to check that the sequence of laws

$$\Big\{\mathcal{L}\Big(\frac{S_n}{\sqrt{n}}\Big)\Big\}_{n\ge 1}$$

is uniformly tight. Indeed, by Chebyshev's inequality,

$$P\Big(\Big|\frac{S_n}{\sqrt{n}}\Big| > M\Big) \le \frac{1}{M^2}\mathrm{Var}\Big(\frac{S_n}{\sqrt{n}}\Big) = \frac{1}{M^2} \le \varepsilon$$

for large enough M. □
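A direct Monte Carlo comparison with Φ illustrates the theorem. The sketch below is not part of the notes; it assumes numpy, and the uniform step distribution (rescaled to have variance one), n and the number of trials are arbitrary illustrative choices.

import numpy as np
from math import erf, sqrt

# Monte Carlo illustration of the CLT: Z_n = S_n/sqrt(n) for centered, variance-one X_i
# (uniform on [-sqrt(3), sqrt(3)]), compared with the standard normal c.d.f. Phi.
rng = np.random.default_rng(5)
n, trials = 50, 100_000
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))
Z = X.sum(axis=1) / np.sqrt(n)

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

for z in [-1.5, 0.0, 1.0, 2.0]:
    print(f"z = {z:4.1f}   P(Z_n <= z) ~ {np.mean(Z <= z):.4f}   Phi(z) = {Phi(z):.4f}")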

Next, we will prove a multivariate version of the central limit theorem. For a random vector X = (X_1,...,X_k) ∈ R^k, let EX = (EX_1,...,EX_k) denote its expectation and

$$\mathrm{Cov}(X) = \big(EX_iX_j\big)_{1\le i,j\le k}$$

its covariance matrix. Again, the main step is to prove convergence of characteristic functions.

Theorem 31 Let (X_i)_{i≥1} be a sequence of i.i.d. random vectors on R^k such that EX_1 = 0 and E|X_1|² < ∞. Then the distribution of S_n/√n converges weakly to the distribution P with the characteristic function

$$f_P(t) = e^{-\frac12(Ct,t)}, \qquad (10.0.3)$$

where C = Cov(X_1).

Proof. Consider any t ∈ R^k. Then Z_i = (t, X_i) are i.i.d. real-valued random variables and, by the central limit theorem on the real line,

$$E\exp\Big(i\Big(t,\frac{S_n}{\sqrt{n}}\Big)\Big) = E\exp\Big(i\,\frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i\Big) \to \exp\Big(-\frac{\mathrm{Var}(Z_1)}{2}\Big) = \exp\Big(-\frac12(Ct,t)\Big)$$

as n → ∞, since

$$\mathrm{Var}\big(t_1X_{1,1}+\dots+t_kX_{1,k}\big) = \sum_{i,j} t_it_j\, EX_{1,i}X_{1,j} = (Ct,t).$$

The sequence {L(S_n/√n)} is uniformly tight on R^k since

$$P\Big(\Big|\frac{S_n}{\sqrt{n}}\Big| \ge M\Big) \le \frac{1}{M^2}\, E\Big|\frac{S_n}{\sqrt{n}}\Big|^2 = \frac{1}{nM^2}\, E\big|(S_{n,1},\dots,S_{n,k})\big|^2 = \frac{1}{nM^2}\sum_{i\le k} ES_{n,i}^2 = \frac{1}{M^2}\, E|X_1|^2 \to 0,$$

so we can use Lemma 26 to finish the proof. Alternatively, we can just apply Levy's continuity theorem. □



The covariance matrix C = Cov(X) is symmetric and non-negative definite, since

$$(Ct,t) = E(t,X)^2 \ge 0.$$

The unique distribution with the characteristic function in (10.0.3) is called a multivariate normal distribution with the covariance C and is denoted by N(0, C). It can also be defined more constructively as follows. Consider an i.i.d. sequence g_1,...,g_n of standard normal N(0, 1) random variables and let g = (g_1,...,g_n)^T. Given a k × n matrix A, the covariance matrix of Ag ∈ R^k is

$$C = \mathrm{Cov}(Ag) = E\, Ag(Ag)^T = A\, Egg^T A^T = AA^T.$$

Since we know the characteristic function of the standard normal random variables g_l, the characteristic function of Ag is

$$E\exp\big(i(t,Ag)\big) = E\exp\big(i(A^Tt,g)\big) = E\exp\Big(i\sum_{l\le n}(A^Tt)_l\, g_l\Big) = \exp\Big(-\frac12\sum_{l\le n}\big((A^Tt)_l\big)^2\Big) = \exp\Big(-\frac12|A^Tt|^2\Big) = \exp\Big(-\frac12 t^TAA^Tt\Big) = \exp\Big(-\frac12(Ct,t)\Big),$$

which means that Ag has the distribution N(0, C). On the other hand, given a symmetric non-negative definite matrix C one can always find A such that C = AA^T. For example, let C = QDQ^T be its eigenvalue decomposition, for an orthogonal matrix Q and a diagonal matrix D. Since C is nonnegative definite, the elements of D are nonnegative. Then, one can take n = k and A = C^{1/2} := QD^{1/2}Q^T or A = QD^{1/2}. This means that any normal distribution can be generated as a linear transformation of the vector of i.i.d. standard normal random variables.
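The constructive description above translates directly into a sampling recipe. The following sketch is not part of the notes; it assumes numpy, and the covariance matrix C below is an arbitrary positive definite example.

import numpy as np

# Generate N(0, C) as Ag with C = A A^T, using A = Q D^{1/2} from the eigenvalue
# decomposition C = Q D Q^T, and check the empirical covariance of the samples.
rng = np.random.default_rng(6)
C = np.array([[2.0, 1.0, 0.5],
              [1.0, 2.0, 1.0],
              [0.5, 1.0, 2.0]])

lam, Q = np.linalg.eigh(C)          # eigenvalues lam >= 0 and orthogonal Q
A = Q @ np.diag(np.sqrt(lam))       # then A A^T = C

n = 200_000
g = rng.standard_normal((3, n))     # i.i.d. standard normal coordinates
X = A @ g                           # each column is a sample from N(0, C)

print("target C:\n", C)
print("empirical covariance:\n", np.round(X @ X.T / n, 3))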
Density in the invertible case. Suppose det(C) ≠ 0. Take any invertible A such that C = AA^T, so that Ag ~ N(0, C). Since the density of g is

$$\prod_{l\le k}\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac12 x_l^2\Big) = \Big(\frac{1}{\sqrt{2\pi}}\Big)^k\exp\Big(-\frac12|x|^2\Big),$$

for any Borel set Ω ⊆ R^k we can write

$$P(Ag\in\Omega) = P(g\in A^{-1}\Omega) = \int_{A^{-1}\Omega}\Big(\frac{1}{\sqrt{2\pi}}\Big)^k\exp\Big(-\frac12|x|^2\Big)\, dx.$$

Let us now make the change of variables y = Ax or x = A^{-1}y. Then

$$P(Ag\in\Omega) = \int_\Omega\Big(\frac{1}{\sqrt{2\pi}}\Big)^k\exp\Big(-\frac12|A^{-1}y|^2\Big)\frac{1}{|\det(A)|}\, dy.$$

But since

$$\det(C) = \det(AA^T) = \det(A)\det(A^T) = \det(A)^2,$$

we have |det(A)| = √det(C). Also

$$|A^{-1}y|^2 = (A^{-1}y)^T(A^{-1}y) = y^T(A^T)^{-1}A^{-1}y = y^T(AA^T)^{-1}y = y^TC^{-1}y.$$

Therefore, we get

$$P(Ag\in\Omega) = \int_\Omega\Big(\frac{1}{\sqrt{2\pi}}\Big)^k\frac{1}{\sqrt{\det(C)}}\exp\Big(-\frac12 y^TC^{-1}y\Big)\, dy.$$

This means that

$$\Big(\frac{1}{\sqrt{2\pi}}\Big)^k\frac{1}{\sqrt{\det(C)}}\exp\Big(-\frac12 y^TC^{-1}y\Big)$$

is the density of the distribution N(0, C) when the covariance matrix C is invertible. □
General case. If C = QDQ^T is the eigenvalue decomposition of C, let us take, for example, X = QD^{1/2}g for an i.i.d. standard normal vector g, so that X ~ N(0, C). If q_1,...,q_k are the column vectors of Q then

$$X = QD^{1/2}g = (\lambda_1^{1/2}g_1)q_1 + \dots + (\lambda_k^{1/2}g_k)q_k.$$



Therefore, in the orthonormal coordinate basis q_1,...,q_k, the random vector X has coordinates λ_1^{1/2}g_1,...,λ_k^{1/2}g_k. These coordinates are independent and have normal distributions with variances λ_1,...,λ_k correspondingly. When det(C) = 0, i.e. C is not invertible, some of its eigenvalues will be zero, say, λ_{n+1} = ... = λ_k = 0. Then the distribution will be concentrated on the subspace spanned by the vectors q_1,...,q_n, and it will not have a density on the entire space R^k. On the subspace spanned by the vectors q_1,...,q_n, it will have the density

$$f(x_1,\dots,x_n) = \prod_{l\le n}\frac{1}{\sqrt{2\pi\lambda_l}}\exp\Big(-\frac{x_l^2}{2\lambda_l}\Big). \qquad\square$$

Let us look at some properties of normal distributions.

Lemma 30 If X ~ N(0, C) on R^k and A : R^k → R^m is linear then AX ~ N(0, ACA^T) on R^m.

Proof. The characteristic function of AX is

$$E\exp\big(i(t,AX)\big) = E\exp\big(i(A^Tt,X)\big) = \exp\Big(-\frac12(CA^Tt,A^Tt)\Big) = \exp\Big(-\frac12(ACA^Tt,t)\Big),$$

and the statement follows by definition. □

Lemma 31 X is normal on R^k if and only if (t, X) is normal on R for all t ∈ R^k.

Proof. (⟹) The c.f. of the real-valued random variable (t, X) is

$$f(\lambda) = E\exp\big(i\lambda(t,X)\big) = E\exp\big(i(\lambda t,X)\big) = \exp\Big(-\frac12(C\lambda t,\lambda t)\Big) = \exp\Big(-\frac12\lambda^2(Ct,t)\Big),$$

which implies that (t, X) has normal distribution N(0, (Ct,t)) with variance (Ct,t).

(⟸) If (t, X) is normal then

$$E\exp\big(i(t,X)\big) = \exp\Big(-\frac12(Ct,t)\Big),$$

because the variance of (t, X) is (Ct,t). This means that X is normal N(0, C). □

Lemma 32 Let Z = (X, Y), where X = (X_1,...,X_i) and Y = (Y_1,...,Y_j), and suppose that Z is normal on R^{i+j}. Then X and Y are independent if and only if Cov(X_m, Y_n) = 0 for all m, n.

Proof. One way is obvious. The other way around, suppose that

$$C = \mathrm{Cov}(Z) = \begin{pmatrix} D & 0\\ 0 & F\end{pmatrix}.$$

If t = (t_1, t_2) then the characteristic function of Z is

$$E\exp\big(i(t,Z)\big) = \exp\Big(-\frac12(Ct,t)\Big) = \exp\Big(-\frac12(Dt_1,t_1)-\frac12(Ft_2,t_2)\Big) = E\exp\big(i(t_1,X)\big)\, E\exp\big(i(t_2,Y)\big),$$

which is precisely the characteristic function of independent X and Y. By the uniqueness theorem, X and Y are independent. □
Exercise. Finish the proof of Theorem 29. Namely, show that (10.0.1) holds if limn E f (Zn ) = E f (Z) for functions
f with bounded third derivative.

Exercise. 24% of the residents in a community are members of a minority group but among the 96 people called for
jury duty only 13 are. Does this data indicate that minorities are less likely to be called for jury duty?



Exercise. Given 0 < t_1 < ... < t_k, show that the normal distribution N(0, (min(t_i, t_j))_{1≤i,j≤k}) has density on R^k. (Hint: either check that the covariance is non-degenerate, or look up the definition of the Brownian motion.)

Exercise. Given a vector of i.i.d. standard normal random variables g = (g1 , . . . , gk )T and orthogonal k k matrix V ,
show that Y = V g also has i.i.d. standard normal coordinates.

Exercise. If g is standard normal, prove that EgF(g) = EF'(g) for nice enough functions F : R → R.

Exercise. If g = (g_1,...,g_k) has arbitrary normal distribution N(0, C) on R^k, prove that

$$Eg_1F(g) = \sum_{i=1}^k (Eg_1g_i)\, E\frac{\partial F}{\partial x_i}(g),$$

where C_{1i} = Eg_1g_i, for nice enough functions F : R^k → R. Hint: what can you say about g_1 and g_i − (C_{1i}/C_{11})g_1 for 1 ≤ i ≤ k?



Section 11

Lindeberg's CLT. Three Series Theorem.

Instead of considering i.i.d. sequences, for each n ≥ 1 we will now consider a vector (X_1^n,...,X_n^n) of independent random variables, not necessarily identically distributed. This setting is called triangular arrays for obvious reasons.

Theorem 32 For each n ≥ 1, consider a vector (X_i^n)_{1≤i≤n} of independent random variables such that

$$EX_i^n = 0, \quad \mathrm{Var}(S_n) = \sum_{i=1}^n E(X_i^n)^2 = 1.$$

Suppose that the following Lindeberg condition is satisfied:

$$\sum_{i=1}^n E(X_i^n)^2\, I(|X_i^n| > \varepsilon) \to 0 \quad\text{as } n\to\infty \quad\text{for all } \varepsilon>0. \qquad (11.0.1)$$

Then the distribution L(Σ_{i≤n} X_i^n) → N(0, 1) weakly.




Remark. We will prove this result again using characteristic functions, but one can also use the same argument as in
Theorem 29 in the previous section. We will leave it as an exercise below. Also, notice that for i.i.d. random variables
this implies the CLT without the assumption E|X1 |3 < ∞, since in that case Xin = Xi /√n.

Proof. First of all, the sequence $\mathcal{L}\bigl(\sum_{i\le n} X_i^n\bigr)$ is uniformly tight, because by Chebyshev's inequality
$$
P\Bigl(\Bigl|\sum_{i\le n} X_i^n\Bigr| > M\Bigr) \le \frac{1}{M^2}
$$
for large enough M. It remains to show that the characteristic function of Sn converges to $e^{-\lambda^2/2}$. For simplicity of
notation, let us omit the upper index n and write Xi instead of Xin . Since
$$
\operatorname{E} e^{i\lambda S_n} = \prod_{i=1}^{n} \operatorname{E} e^{i\lambda X_i},
$$
it is enough to show that
$$
\log \operatorname{E} e^{i\lambda S_n} = \sum_{i=1}^{n} \log\bigl(1 + \operatorname{E} e^{i\lambda X_i} - 1\bigr) \to -\frac{\lambda^2}{2}. \tag{11.0.2}
$$
It is not difficult to check, by induction on m, that for any a ∈ R,
$$
\Bigl| e^{ia} - \sum_{k=0}^{m} \frac{(ia)^k}{k!} \Bigr| \le \frac{|a|^{m+1}}{(m+1)!}. \tag{11.0.3}
$$
We will leave it as an exercise at the end of the section. Using this for m = 1,
$$
\bigl|\operatorname{E} e^{i\lambda X_i} - 1\bigr| = \bigl|\operatorname{E} e^{i\lambda X_i} - 1 - i\lambda \operatorname{E} X_i\bigr| \le \frac{\lambda^2}{2} \operatorname{E} X_i^2 \le \frac{\lambda^2}{2}\bigl(\varepsilon^2 + \operatorname{E} X_i^2 I(|X_i| > \varepsilon)\bigr) \le \lambda^2 \varepsilon^2 \le \frac{1}{2} \tag{11.0.4}
$$
for large n by (11.0.1) and for small enough ε. Using Taylor's expansion of log(1 + z) it is easy to check that
$$
\bigl|\log(1 + z) - z\bigr| \le |z|^2 \ \text{ for } |z| \le \frac{1}{2}
$$
and, therefore, using this and (11.0.4),
$$
\Bigl| \sum_{i=1}^{n} \log\bigl(1 + \operatorname{E} e^{i\lambda X_i} - 1\bigr) - \sum_{i=1}^{n} \bigl(\operatorname{E} e^{i\lambda X_i} - 1\bigr) \Bigr| \le \sum_{i=1}^{n} \bigl|\operatorname{E} e^{i\lambda X_i} - 1\bigr|^2 \le \frac{\lambda^4}{4} \sum_{i=1}^{n} \bigl(\operatorname{E} X_i^2\bigr)^2
\le \frac{\lambda^4}{4} \max_{1\le i\le n} \operatorname{E} X_i^2 \sum_{i=1}^{n} \operatorname{E} X_i^2 = \frac{\lambda^4}{4} \max_{1\le i\le n} \operatorname{E} X_i^2.
$$
This goes to zero, because
$$
\max_{1\le i\le n} \operatorname{E} X_i^2 \le \varepsilon^2 + \max_{1\le i\le n} \operatorname{E} X_i^2 I(|X_i| > \varepsilon) \to \varepsilon^2
$$
as n → ∞, by (11.0.1), and then we let ε → 0. As a result, to prove (11.0.2) it remains to show that
$$
\sum_{i=1}^{n} \bigl(\operatorname{E} e^{i\lambda X_i} - 1\bigr) \to -\frac{\lambda^2}{2}.
$$
Using (11.0.3) for m = 1, on the event |Xi | > ε,
$$
\bigl| e^{i\lambda X_i} - 1 - i\lambda X_i \bigr| I\bigl(|X_i| > \varepsilon\bigr) \le \frac{\lambda^2}{2} X_i^2 I\bigl(|X_i| > \varepsilon\bigr)
$$
and, therefore,
$$
\Bigl| e^{i\lambda X_i} - 1 - i\lambda X_i + \frac{\lambda^2}{2} X_i^2 \Bigr| I\bigl(|X_i| > \varepsilon\bigr) \le \lambda^2 X_i^2 I\bigl(|X_i| > \varepsilon\bigr). \tag{11.0.5}
$$
Using (11.0.3) for m = 2, on the event |Xi | ≤ ε,
$$
\Bigl| e^{i\lambda X_i} - 1 - i\lambda X_i + \frac{\lambda^2}{2} X_i^2 \Bigr| I\bigl(|X_i| \le \varepsilon\bigr) \le \frac{|\lambda|^3}{6} |X_i|^3 I\bigl(|X_i| \le \varepsilon\bigr) \le \frac{|\lambda|^3 \varepsilon}{6} X_i^2. \tag{11.0.6}
$$
Combining the last two equations and using that EXi = 0,
$$
\Bigl| \operatorname{E} e^{i\lambda X_i} - 1 + \frac{\lambda^2}{2} \operatorname{E} X_i^2 \Bigr| \le \lambda^2 \operatorname{E} X_i^2 I\bigl(|X_i| > \varepsilon\bigr) + \frac{|\lambda|^3 \varepsilon}{6} \operatorname{E} X_i^2.
$$
Finally, this implies
$$
\Bigl| \sum_{i=1}^{n} \bigl(\operatorname{E} e^{i\lambda X_i} - 1\bigr) + \frac{\lambda^2}{2} \Bigr| = \Bigl| \sum_{i=1}^{n} \Bigl(\operatorname{E} e^{i\lambda X_i} - 1 + \frac{\lambda^2}{2} \operatorname{E} X_i^2\Bigr) \Bigr| \le \lambda^2 \sum_{i=1}^{n} \operatorname{E} X_i^2 I(|X_i| > \varepsilon) + \frac{|\lambda|^3 \varepsilon}{6} \to \frac{|\lambda|^3 \varepsilon}{6}
$$
as n → ∞, using Lindeberg's condition (11.0.1). It remains to let ε → 0. □
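A small simulation sketch of Theorem 32 (not from the notes; the particular array below is an arbitrary choice): take X_i^n = w_i ε_i with bounded centered signs ε_i and weights w_i ∝ √i normalized so that the variances sum to one. Lindeberg's condition (11.0.1) holds because max_i w_i → 0, so the row sums should be approximately standard normal.

```python
# Simulation sketch of Lindeberg's CLT for a triangular array (illustration only).
# X_i^n = w_i * eps_i with eps_i = +/-1 and weights normalized so that sum_i Var(X_i^n) = 1;
# since max_i |X_i^n| = max_i w_i -> 0, the Lindeberg condition (11.0.1) holds trivially.
import math
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 20000
w = np.sqrt(np.arange(1.0, n + 1))
w = w / np.sqrt(np.sum(w**2))                   # normalize: the variances sum to 1
eps = rng.choice([-1.0, 1.0], size=(reps, n))   # bounded, mean 0, variance 1
S = (eps * w).sum(axis=1)                       # one sample of the row sum per repetition

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
for x in (-1.0, 0.0, 1.0):
    print(x, np.mean(S <= x), Phi(x))           # empirical CDF of S vs standard normal CDF
```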

We will now use this version of the CLT to prove the three series theorem for random series. Let us begin with the
following observation.

Lemma 33 If P, Q are distributions on R such that P ∗ Q = P then Q({0}) = 1.



Proof. Let us consider the characteristic functions
$$
f_P(t) = \int e^{itx}\, dP(x), \qquad f_Q(t) = \int e^{itx}\, dQ(x).
$$
The condition P ∗ Q = P implies that fP (t) fQ (t) = fP (t). Since fP (0) = 1 and fP (t) is continuous, for small enough
|t| ≤ δ we have | fP (t)| > 0 and, as a result, fQ (t) = 1. Since
$$
f_Q(t) = \int \cos(tx)\, dQ(x) + i \int \sin(tx)\, dQ(x),
$$
for |t| ≤ δ this implies that $\int \cos(tx)\, dQ(x) = 1$ and, since cos(s) ≤ 1, this can happen only if
$$
Q\bigl(\{x : xt = 0 \ \mathrm{mod}\ 2\pi\}\bigr) = 1 \ \text{ for all } |t| \le \delta.
$$
Take s,t such that |s|, |t| ≤ δ and s/t is irrational. For x to be in the support of Q we must have xs = 2πk and xt = 2πm
for some integers k, m. This can happen only if x = 0. □

We proved in Section 6 that convergence of random series in probability implies almost sure convergence. It turns
out that this result can be strengthened, and convergence in distribution also implies convergence in probability and,
therefore, almost sure convergence.

Theorem 33 (Levys equivalence theorem) If (Xi ) is a sequence of independent random variables then the series
i1 Xi converges almost surely, in probability, or in distribution, at the same time.

Proof. Of course, we only need to show that convergence in distribution implies convergence in probability. Suppose
that L (Sn ) → P. Convergence in law implies that {L (Sn )} is uniformly tight, which implies that {L (Sn − Sk )}n,k≥1
is uniformly tight. We will now show that this implies that, for any ε > 0,
$$
P(|S_n - S_k| > \varepsilon) < \varepsilon \tag{11.0.7}
$$
for n ≥ k ≥ N for large enough N. Suppose not. Then there exist ε > 0 and sequences (n(l)) and (n′(l)) such that
n(l) ≤ n′(l) and
$$
P\bigl(|S_{n'(l)} - S_{n(l)}| > \varepsilon\bigr) \ge \varepsilon.
$$
Let us denote Yl = Sn′(l) − Sn(l) . Since {L (Yl )} is uniformly tight, by the selection theorem, there exists a subsequence
(l(r)) such that L (Yl(r) ) → Q. Since
$$
S_{n'(l(r))} = S_{n(l(r))} + Y_{l(r)} \quad\text{and}\quad \mathcal{L}(S_{n'(l(r))}) = \mathcal{L}(S_{n(l(r))}) * \mathcal{L}(Y_{l(r)}),
$$
letting r → ∞ we get that P = P ∗ Q. By the above Lemma, Q({0}) = 1, which implies that P(|Yl(r) | > ε) < ε for large
r, a contradiction. By Lemma 16, (11.0.7) implies that Sn converges in probability. □

Theorem 34 (Three series theorem) Let (Xi )i1 be a sequence of independent random variables and let

Zi = Xi I(|Xi | 1).

Then the series i1 Xi converges almost surely if and only if the following three conditions hold:

(1) i1 P(|Xi | > 1) < ,


(2) i1 EZi converges,
(3) i1 Var(Zi ) < .

Proof. (⇐=) Suppose (1) – (3) hold. Since
$$
\sum_{i\ge 1} P(|X_i| > 1) = \sum_{i\ge 1} P(X_i \ne Z_i) < \infty,
$$
by the Borel-Cantelli lemma, P(Xi ≠ Zi i.o.) = 0, which means that ∑i≥1 Xi converges if and only if ∑i≥1 Zi converges.
By (2), it is enough to show that ∑i≥1 (Zi − EZi ) converges, but this follows from Theorem 18 by (3).

(=⇒) If ∑i≥1 Xi converges almost surely, P(|Xi | > 1 i.o.) = 0, and, since (Xi ) are independent, by the Borel-Cantelli
lemma,
$$
\sum_{i\ge 1} P(|X_i| > 1) < \infty.
$$
This proves (1). We will prove (3) by contradiction. If ∑i≥1 Xi converges then, obviously, ∑i≥1 Zi converges. Let
$$
S_{mn} = \sum_{m \le k \le n} Z_k.
$$
Since Smn → 0 as m, n → ∞, P(|Smn | > ε) ≤ ε for any ε > 0 for m, n large enough. Suppose that ∑i≥1 Var(Zi ) = ∞.
Then
$$
\sigma_{mn}^2 = \operatorname{Var}(S_{mn}) = \sum_{m\le k\le n} \operatorname{Var}(Z_k) \to \infty
$$
as n → ∞ for any fixed m. Intuitively, this should not happen, since Smn → 0 in probability but their variance goes
to infinity. In principle, one can construct such sequence of random variables, but in our case it will be ruled out by
Lindeberg's CLT, as follows. Let us show that, by Lindeberg's theorem,
$$
T_{mn} = \frac{S_{mn} - \operatorname{E} S_{mn}}{\sigma_{mn}} = \sum_{m\le k\le n} \frac{Z_k - \operatorname{E} Z_k}{\sigma_{mn}}
$$
converges in distribution to N(0, 1) if m, n → ∞ in such a way that σmn² → ∞. We only need to check that
$$
\sum_{m\le k\le n} \operatorname{E} \Bigl(\frac{Z_k - \operatorname{E} Z_k}{\sigma_{mn}}\Bigr)^2 I\Bigl(\Bigl|\frac{Z_k - \operatorname{E} Z_k}{\sigma_{mn}}\Bigr| > \varepsilon\Bigr) \to 0
$$
if m, n → ∞ in such a way that σmn² → ∞. This is obvious, because |Zk − EZk | ≤ 2 and σmn → ∞, so the event in the
indicator does not occur and the sum is, in fact, equal to 0 for large m, n. Next, since P(|Smn | > ε) ≤ ε, we get
$$
P\bigl(|S_{mn}| \le \varepsilon\bigr) = P\Bigl(\Bigl|T_{mn} + \frac{\operatorname{E} S_{mn}}{\sigma_{mn}}\Bigr| \le \frac{\varepsilon}{\sigma_{mn}}\Bigr) \ge 1 - \varepsilon.
$$
But this is impossible, since Tmn is approximately standard normal, which can not concentrate near any constant.
Therefore, we proved that ∑i≥1 Var(Zi ) < ∞. By Kolmogorov's SLLN this implies that ∑i≥1 (Zi − EZi ) converges
almost surely and, since ∑i≥1 Zi converges almost surely, ∑i≥1 EZi also converges. □
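The following sketch (not part of the notes) illustrates the theorem for X_i = ε_i a_i with independent random signs: since |X_i| ≤ 1 here, Z_i = X_i and EZ_i = 0, so conditions (1) and (2) are automatic and everything is decided by condition (3), the convergence of ∑ a_i².

```python
# Sketch illustrating the three series theorem for X_i = eps_i * a_i with i.i.d. signs eps_i.
# With |X_i| <= 1 we have Z_i = X_i and EZ_i = 0, so conditions (1), (2) hold trivially and
# almost sure convergence is equivalent to (3): sum_i a_i^2 < infinity.
import numpy as np

rng = np.random.default_rng(2)
n = 10**5
eps = rng.choice([-1.0, 1.0], size=n)
i = np.arange(1.0, n + 1)

S_conv = np.cumsum(eps / i)           # a_i = 1/i:       sum a_i^2 < infinity, converges a.s.
S_div = np.cumsum(eps / np.sqrt(i))   # a_i = 1/sqrt(i): sum a_i^2 = infinity, diverges a.s.

for k in (10**3 - 1, 10**4 - 1, 10**5 - 1):
    print(k + 1, S_conv[k], S_div[k])  # the first column settles down, the second keeps fluctuating
```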

The central limit theorem describes when the distribution of the sum of independent random variables can be approx-
imated by a normal distribution. For a change, we will now give a simple example of a different approximation.
Consider a triangular array of independent Bernoulli random variables Xin for i ≤ n such that P(Xin = 1) = pni and
P(Xin = 0) = 1 − pni . If pni = p > 0 then, by the central limit theorem,
$$
\mathcal{L}\Bigl(\frac{S_n - np}{\sqrt{np(1-p)}}\Bigr) \to N(0, 1)
$$
weakly. However, if p = pn = pni → 0 fast enough then one can check that Lindeberg's conditions will be violated. We
will now show that, under certain conditions, the sum can be approximated in distribution by some Poisson distribution.
Recall that, for λ > 0, the Poisson distribution πλ has probability function
$$
\pi_\lambda(\{k\}) = \frac{\lambda^k}{k!} e^{-\lambda} \ \text{ for } k = 0, 1, 2, \ldots
$$
The following holds.



Theorem 35 Consider independent Bernoulli random variables Xi with parameters pi for i ≤ n and let
$$
S_n = X_1 + \ldots + X_n \quad\text{and}\quad \lambda = p_1 + \ldots + p_n.
$$
Then, for any subset of integers B ⊆ Z,
$$
|P(S_n \in B) - \pi_\lambda(B)| \le \sum_{i=1}^{n} p_i^2.
$$

Proof. The proof is based on a construction on the same probability space. Let us construct a Bernoulli r.v. Xi ∼ B(pi )
and a Poisson r.v. X̃i ∼ πpi on the same probability space as follows. Let us consider the standard probability space
([0, 1], B, P) with the Lebesgue measure P. Let us fix i for a moment and define a random variable
$$
X_i = X_i(x) = \begin{cases} 0, & 0 \le x \le 1 - p_i, \\ 1, & 1 - p_i < x \le 1. \end{cases}
$$
Clearly, Xi ∼ B(pi ). Let us construct X̃i as follows. For k ≥ 0, let us denote
$$
c_k = \sum_{l=0}^{k} \frac{(p_i)^l}{l!} e^{-p_i}
$$
and let
$$
\tilde X_i = \tilde X_i(x) = \begin{cases} 0, & 0 \le x \le c_0, \\ 1, & c_0 < x \le c_1, \\ 2, & c_1 < x \le c_2, \\ \ \ldots \end{cases}
$$
Clearly, X̃i has the Poisson distribution πpi . When do we have Xi ≠ X̃i ? Since 1 − pi ≤ e−pi = c0 , this can only happen
for 1 − pi < x ≤ c0 and c1 < x ≤ 1, which implies that
$$
P(X_i \ne \tilde X_i) = \bigl(e^{-p_i} - (1 - p_i)\bigr) + \bigl(1 - e^{-p_i} - p_i e^{-p_i}\bigr) = p_i(1 - e^{-p_i}) \le p_i^2.
$$
Then we construct pairs (Xi , X̃i ) for i ≤ n on separate coordinates of the product space [0, 1]n with the product Lebesgue
measure, thus, making them independent for i ≤ n. It is well-known (and easy to check) that a sum of independent
Poisson random variables is again Poisson with the parameter given by the sum of individual parameters and, therefore,
S̃n = ∑i≤n X̃i has the Poisson distribution πλ , where λ = p1 + . . . + pn . Finally, we use the union bound to conclude that
$$
P(S_n \ne \tilde S_n) \le \sum_{i=1}^{n} P(X_i \ne \tilde X_i) \le \sum_{i=1}^{n} p_i^2,
$$
which finishes the proof. □
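A simulation sketch of this bound (not from the notes; the probabilities p_i below are an arbitrary choice): the total variation distance between the law of S_n and the Poisson distribution with parameter λ = ∑ p_i should not exceed ∑ p_i².

```python
# Sketch checking the bound of Theorem 35 by simulation: the total variation distance between
# the law of S_n = sum of independent Bernoulli(p_i) and Poisson(lambda), lambda = sum p_i,
# should not exceed sum p_i^2. The choice of p_i is arbitrary.
import numpy as np

rng = np.random.default_rng(3)
p = 1.0 / np.arange(10, 110)                       # small, unequal success probabilities
lam, bound = p.sum(), np.sum(p**2)

reps = 100000
S = (rng.random((reps, p.size)) < p).sum(axis=1)   # samples of S_n
emp = np.bincount(S, minlength=40) / reps          # empirical pmf of S_n

pois = np.empty(emp.size)                          # Poisson pmf via the recursion pi_k = pi_{k-1} * lam / k
pois[0] = np.exp(-lam)
for k in range(1, emp.size):
    pois[k] = pois[k - 1] * lam / k

print(0.5 * np.abs(emp - pois).sum(), bound)       # approximate TV distance vs the bound
```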

Exercise. Adapt the argument of Theorem 29 in the previous section to prove Theorem 32. Hint: when using the Taylor
expansion in that proof, mimic equations (11.0.5) and (11.0.6).

Exercise. Prove (11.0.3). (Hint: integrate the inequality to make the induction step.)

Exercise. Consider the random series ∑n≥1 εn αn where (εn ) are i.i.d. random signs, P(εn = ±1) = 1/2. Give the
necessary and sufficient condition on (αn ) for this series to converge.

Exercise. Let (Xn )n≥1 be i.i.d. random variables such that EX1 = 0 and 0 < EX12 < ∞. Show that ∑n≥1 Xn /an converges
almost surely if and only if ∑n≥1 1/a2n < ∞.

Exercise. Gamma distribution with parameters k, θ > 0 has density
$$
\frac{x^{k-1} e^{-x/\theta}}{\theta^k \Gamma(k)}\, I(x > 0).
$$
Suppose Xn has Gamma(kn , θn ) distribution with kn = 1 + sin2 n and θn = 1/kn , and Xn are independent for n ≥ 1.
Either show that
$$
\frac{X_1 + \ldots + X_n - n}{\sqrt{n}}
$$
converges in distribution to a Gaussian, or reduce this to a question about the existence of a certain limit.

Exercise. Let (Xn )n≥1 be i.i.d. random variables with the Poisson distribution with mean λ = 1. Can you find an and
bn such that $a_n \sum_{\ell=1}^{n} X_\ell \log \ell - b_n$ converges in distribution to standard Gaussian?

Exercise. Suppose that the chances of winning the jackpot in a lottery are 1 : 139, 000, 000. Assuming that 100, 000, 000
people played independently of each other, estimate the probability that 3 of them will have to share the jackpot. Give
a bound on the quality of your estimate.



Section 12

Conditional expectations and distributions.

The topic of this section can be viewed as a vast abstract generalization of the Fubini theorem for product spaces.
For example, given a pair of random variables (X,Y ), how can we average a function f (X,Y ) by fixing X = x and
averaging with respect to Y first? Can we define some distribution of Y given X = x that can be used to compute this
average? We will begin by defining these conditional averages, or conditional expectations, and then we will use them
to define the corresponding conditional distributions.
Let (, A , P) be a probability space and X : R be a random variable such that E|X| < . Let B be a
-subalgebra of A , B A . Random variable Y : R is called the conditional expectation of X given B if

1. Y is measurable on B, i.e. for every Borel set B on R, Y 1 (B) B.


2. For any set B B, we have EXIB = EY IB .

This conditional expectation is denoted by Y = E(X|B). If X, Z are random variables then the conditional expectation
of X given Z is defined by
Y = E(X|Z) = E(X| (Z)).
Since Y is measurable on (Z), Y = f (Z) for some measurable function f . Let us go over a number of properties of
conditional expectation, most of which follow easily from the definition.

1. (Existence) Let us define
$$
\nu(B) = \int_B X\, dP \ \text{ for } B \in \mathcal{B}.
$$
Since E|X| < ∞, ν(B) is a σ-additive signed measure on B. Moreover, if P(B) = 0 then ν(B) = 0, which means that
ν is absolutely continuous with respect to P. By the Radon-Nikodym theorem, there exists a function Y = dν/dP
measurable on B such that, for B ∈ B,
$$
\nu(B) = \int_B X\, dP = \int_B Y\, dP.
$$

By definition, Y = E(X|B). Another way to define conditional expectations is as follows. Notice that L2 (Ω, B, P)
is a linear subspace of L 2 (, A , P) for B A . If f L 2 (, A , P) then the conditional expectation g = E( f |B)
is simply an orthogonal projection of f onto that subspace, since such projection is B-measurable and the orthogonal
projection is defined by E f h = Egh for all h L 2 (, B, P). In particular, one can take h = IB for any B B. When
f L 1 (, A , P), the conditional expectation can be defined by approximating f by functions in L 2 (, A , P).

2. (Uniqueness) Suppose there exists another Y ′ = E(X|B) such that P(Y ≠ Y ′ ) > 0, i.e. either P(Y > Y ′ ) > 0 or
P(Y < Y ′ ) > 0. Since both Y, Y ′ are measurable on B, the set B = {Y > Y ′ } ∈ B. If P(B) > 0 then E(Y − Y ′ )IB > 0. On
the other hand,
$$
\operatorname{E}(Y - Y')I_B = \operatorname{E} X I_B - \operatorname{E} X I_B = 0,
$$
a contradiction. Similarly, P(Y < Y ′ ) = 0. Of course, the conditional expectation can be redefined on a set of measure
zero, so uniqueness is understood in this sense.



3. (Linearity) Conditional expectation is linear,

E(cX +Y |B) = cE(X|B) + E(Y |B),

just like the usual integral. It is obvious that the right hand side satisfies the definition of the conditional expectation
E(cX +Y |B) and the equality follows by uniqueness. Again, the equality holds almost surely.
4. (Smoothing) If -algebras C B A then

E(X|C ) = E(E(X|B)|C ).

To see this, consider a set C C . Since C is also in B,

EIC (E(E(X|B)|C )) = EIC E(X|B) = EIC X,

so the right hand side satisfies the definition of E(X|C ).


5. (Extreme cases) When B is either the entire σ-algebra A or the trivial σ-algebra {∅, Ω},
$$
\operatorname{E}(X | \mathcal{A}) = X, \qquad \operatorname{E}(X | \{\emptyset, \Omega\}) = \operatorname{E} X.
$$

This is obvious by definition.


6. (Independent -algebra) If X is independent of B then

E(X|B) = EX.

This is also obvious by independence, since for B B,

EXIB = EXP(B) = E(EX)IB

and EX is, of course, B-measurable.


6. (Monotonicity) If X Z then
E(X|B) E(Z|B)
almost surely. The proof is by contradiction, similar to the proof of uniqueness above.
7. (Monotone convergence) If E|Xn | < , E|X| < and Xn X then

E(Xn |B) E(X|B).

To prove this, let us first note that, since E(Xn |B) E(Xn+1 |B) E(X|B), there exists a limit

g = lim E(Xn |B) E(X|B).


n

Since E(Xn |B) are measurable on B, so is the limit g. It remains to check that for any set B B, EgIB = EXIB . Since
Xn IB XIB and E(Xn |B)IB gIB , the usual monotone convergence theorem implies that

EXn IB EXIB and EIB E(Xn |B) EgIB .

Since EIB E(Xn |B) = EXn IB , this implies that EgIB = EXIB and, therefore, g = E(X|B) almost surely.
8. (Dominated convergence) If |Xn | Y, EY < , and Xn X then

lim E(Xn |B) = E(X|B).


n

The proof of the same as for usual integrals. We write

Y gn = inf Xm Xn hn = sup Xm Y
mn mn



and observe that gn X, hn X, |gn | Y and |hn | Y . Therefore, by the monotone convergence theorem,

E(gn |B) E(X|B), E(hn |B) E(X|B)

and the monotonicity of conditional expectation implies the claim.

9. (Measurable factor) If E|X| < , E|XY | < and Y is measurable on B then

E(XY |B) = Y E(X|B).

First of all, it is enough to prove this for non-negative X,Y 0 by decomposing X = X + X and Y = Y + Y .
Since Y is B-measurable, we can find a sequence of simple functions Yn of the form k wk ICk , where Ck B, such
that 0 Yn Y. By the monotone convergence theorem, it is enough to prove that

E(XICk |B) = ICk E(X|B).

Take any B B. Since B Ck B,

EIB ICk E(X|B) = EIBCk E(X|B) = EXIBCk = E(XICk )IB

and this finishes the proof.

10. (Jensens inequality) If f : R R is convex and E| f (X)| < then

f (E(X|B)) E( f (X)|B).

Let h(x) be some monotone (thus, measurable) function such that h(x) is in the subgradient f (x) of the convex
function f for all x. Then, by convexity,

f (X) f (E(X|B)) h(E(X|B))(X E(X|B)).

If we ignore any integrability issues then, taking conditional expectations of both sides,

E( f (X)|B) f (E(X|B)) h(E(X|B))(E(X|B) E(X|B)) = 0,

where we used the previous property, since h(E(X|B)) is B-measurable. Let us now make this simple idea rigorous.
Let Bn = { : |E(X|B)| n} and Xn = XIBn . As above,

f (Xn ) f (E(Xn |B)) h(E(Xn |B))(Xn E(Xn |B)).

Notice that Bn B and, by property 9, E(Xn |B) = IBn E(X|B) [n, n]. Since both f and h are bounded on [n, n],
now we dont have any integrability issues and, taking conditional expectations of both sides, we get

f (E(Xn |B)) E( f (Xn )|B).

Now, let n . Since E(Xn |B) = IBn E(X|B) E(X|B) and f is continuous, f (E(Xn |B)) converges to f (E(X|B)).
On the other hand,

E( f (Xn )|B) = E( f (X)IBn + f (0)IBcn |B) = E( f (X)|B)IBn + f (0)IBcn E( f (X)|B),

and this finishes the proof. t


u
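Before moving on to conditional distributions, here is a small numerical sketch (not from the notes) of the defining property of conditional expectation in the simplest case B = σ(Y) for a discrete Y: E(X|Y) is the within-group average, and EXI_B = E[E(X|Y)I_B] for the generating events B = {Y = y}.

```python
# Sketch: conditional expectation given a discrete random variable Y as within-group averaging,
# checking the defining property E[X I(Y=y)] = E[E(X|Y) I(Y=y)] on simulated data.
import numpy as np

rng = np.random.default_rng(4)
n = 10**6
Y = rng.integers(0, 3, size=n)          # Y takes the values 0, 1, 2
X = Y + rng.standard_normal(n)          # X depends on Y plus independent noise

EX_given_Y = np.zeros(n)
for y in range(3):
    mask = (Y == y)
    EX_given_Y[mask] = X[mask].mean()   # a version of E(X|Y) on the event {Y = y}

for y in range(3):
    mask = (Y == y)
    print(y, np.mean(X * mask), np.mean(EX_given_Y * mask))   # the two columns should agree
```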

Conditional distributions. Again, let (, A , P) be a probability space and let B A be a sub- -algebra of A . Let
(Y, Y ) be a measurable space and f be a measurable function from (, A ) into (Y, Y ). A conditional distribution of
f given B is a function P f |B : Y [0, 1] such that

(i) for all , P f |B (, ) is a probability measure on Y ;


(ii) for each set C Y , P f |B (C, ) is B-measurable and P f |B (C, ) = E(I( f C)|B)() almost surely.



The meaning of this definition will become more transparent in the examples below, but we will study the existence of
conditional distribution in this abstract setting first. For any given C, the conditional expectation in (ii) is defined up
to a P|B -null set so there is some freedom in defining P f |B (C, ) for a given C, but it is not obvious that we can define
P f |B (C, ) simultaneously for all C in such a way that (i) is also satisfied. However, when the space (Y, Y ) is regular,
i.e. (Y, d) is a complete separable metric space and Y is its Borel -algebra, then a conditional distribution always
exists.

Theorem 36 If (Y, Y ) is regular then a conditional distribution P f |B exists. Moreover, if P0f |B is another conditional
distribution then
P f |B (, ) = P0f |B (, )
for P|B -almost all , i.e. the conditional distribution is unique modulo P|B -almost everywhere equivalence.

Proof. The Borel -algebra Y is generated by a countable collection Y 0 of sets in Y , for example, by open balls
with rational radii and centers in a countable dense subset of (Y, d). The algebra C generated by Y 0 is countable since
it can be written as a countable union of increasing finite algebras corresponding to finite subsets of Y 0 . Since (Y, d)
is complete and separable, by Ulams theorem (Theorem 5), for any set C Y one can find a sequence of compact
subsets K1 K2 . . . C such that P( f Kn ) P( f C). If K = n1 Kn then, by conditional monotone convergence
theorem,
E(I( f Kn )|B) E(I( f K)|B) = E(I( f C)|B) (12.0.1)
almost surely, where the last equality holds because P( f C \ K) = 0. Let D be a countable algebra generated by the
union of C and the set of all Kn for all C C . For each D D, let P f |B (D, ) be any fixed version of the conditional
expectation E(I( f D)|B)(). We have

(a) for each D D, P f |B (D, ) 0 a.s.,


(b) P f |B (Y, ) = 1 and P f |B (0,
/ ) = 0 a.s.,

(c) for any k 1 and any disjoint D1 , . . . , Dk D, P f |B



ik Di , = ik P f |B (Di , ) a.s.,
S

(d) for any C C and the specific sequence Kn as chosen, (12.0.1) holds a.s.

Since there are countably many conditions in (a) - (d), there exists a set N B such that P(N ) = 0 and conditions (a)
- (d) hold for all < N . We will now show that for all such , P f |B (, ) can be extended to a probability measure on
Y and this extension is such that for any C Y , P f |B (C, ) is a version of the conditional expectation, i.e. (ii) holds.
First, we need to show that for all < N , P f |B (, ) is countably additive on C or, equivalently, for any
sequence Cn C such that Cn 0/ we have P f |B (Cn , ) 0. If not, then there exists a sequence Cn 0/ such that
P f |B (Cn , ) > 0. Using (d), take compact Kn D such that P f |B (Cn \ Kn , ) /3n . Then


P f |B (K1 . . . Kn , ) P f |B (Cn , ) i
>
1in 3 2

so that each set K n = K1 . . . Kn is not empty. Since Cn 0/ and, therefore, K n 0,


/ the complements of K n form an
open cover of the compact set K1 (and the entire space) and there exists a finite open sub-cover of K1 . Since K n are
decreasing, it is easy to see that this can only happen if K n is empty for some n, which is a contradiction.
By Caratheodorys extension theorem, for all < N , P f |B (, ) can be uniquely extended to a probability
measure on Y so that (i) holds (for N , we can define P f |B (, ) to be any fixed probability measure on Y ). To
prove (ii), notice that the collection of sets C Y that satisfy (ii) is, obviously, a -system that contains algebra C
(which is a -system) and, by Dynkins theorem, Theorem 3, it contains the -algebra Y .
To prove uniqueness, by the property (ii), we have
P f |B (C, ) = P0f |B (C, )
for all C C on the set of of measure one, since C is countable, and we know that the probability measure on the
-algebra Y is uniquely determined by its values on the generating algebra C . t
u
When conditional distribution exists, it can be used to compute the conditional expectations with respect to B.



Theorem 37 Let g : Y → R be a measurable function such that E|g( f )| < ∞. Then for P|B -almost all ω, g is integrable
with respect to P f |B (·, ω) and
$$
\operatorname{E}\bigl(g(f)\,\big|\,\mathcal{B}\bigr)(\omega) = \int g(y)\, P_{f|\mathcal{B}}(dy, \omega). \tag{12.0.2}
$$

Proof. If C Y and g(y) = I(y C) then, by (ii),


Z
E(I( f C)|B)() = P f |B (C, ) = I(y C) P f |B (dy, )

for P|B -almost all . Therefore, (12.0.2) holds for all simple functions. For g 0, take a sequence of simple functions
0 gn g. Since Eg( f ) < , by monotone convergence theorem for conditional expectations,

E(gn ( f )|B)() E(g( f )|B)() <

for all < N for some N B with P(N ) = 0. Assume also that for < N , (12.0.2) holds for all functions in the
sequence (gn ). By the usual monotone convergence theorem,
Z Z
gn (y) P f |B (dy, ) g(y) P f |B (dy, )

for all and, therefore, (12.0.2) holds for all < N . This prove the claim for g 0 and the general case follows. t
u

Product space case. Consider measurable spaces (X, X ) and (Y, Y ) and let (, A ) be the product space, = X Y
and A = X Y , with some probability measure P on it. Let

h() = h(x, y) = x : X Y X and f () = f (x, y) = y : X Y Y

be the coordinate projections. Let B be the -algebra generated by the first coordinate projection h, i.e.

B = h1 (X ) = X Y.

Suppose that a conditional distribution P f |B of f given B exists, for example, if (Y, Y ) is regular. For a fixed C Y ,
P f |B (C, ) is B-measurable, by definition, and since B = h1 (X ),

P f |B (C, (x, y)) = Px (C) (12.0.3)

for some X -measurable function Px (C). In the product space setting, Px is called a conditional distribution of y given
x. Notice that for any set D X , { : h() D} B and

I(x D)Px (C) = I(h D)E(I( f C)|B)


= E(I(h D)I( f C)|B) = E(I( D C)|B) a.s.

Integrating both sides over Ω we get
$$
P(D \times C) = \int I(x \in D)\, P_x(C)\, dP(x, y) = \int_D P_x(C)\, d\mu(x),
$$

where = P h1 is the marginal of P on X. Therefore, conditional distribution of y given x satisfies:

(1) for all x X, Px () is a probability measure on Y ;


(2) for each set C Y , Px (C) is X -measurable;
(3) for any D X and C Y , P(D C) =
RR
dPx (y)d(x).
DC

Of course, (2) and (3) imply that Px (C) defines a version of the conditional expectation of I(y C) given the first
coordinate x. Moreover, (3) implies the following more general statement.



Theorem 38 If g : X × Y → R is P-integrable then
$$
\int g(x, y)\, dP(x, y) = \iint g(x, y)\, dP_x(y)\, d\mu(x). \tag{12.0.4}
$$
Proof. This coincides with (3) for the indicator of the rectangle D C and as a result holds for indicators of disjoint
unions of rectangles. Then, using the monotone class theorem (or Dynkins theorem) as in the proof of Fubinis
theorem, (12.0.4) can be extended to indicators of all measurable sets in the product -algebra A = X Y and then
to simple functions, positive measurable functions and all integrable functions. t
u

This result is a generalization of Fubinis theorem in the case when the measure is not a product measure. One can also
view this as a way to construct more general measures on product spaces than product measures. One can start with
any function Px (C) that satisfies properties (1) and (2) and define a measure P on the product space as in property (3).
Such functions satisfying properties (1) and (2) are called probability kernels or transition functions. In the case when
Px (C) = (C), we recover the product measure .

A special version of the product space case is the so called disintegration theorem. Let (Y, Y ) be regular and let
ν be a probability measure on Y . Consider a measurable function π : (Y, Y ) → (X, X ) and let μ = ν ∘ π−1 be the
push-forward measure on X . Let P be the push-forward measure on the product σ-algebra X ⊗ Y by the map
y → (π(y), y), so that P has marginals μ and ν.

Theorem 39 (Disintegration Theorem) For any bounded measurable functions h : X → R, f : Y → R,
$$
\int f(y)\, h(\pi(y))\, d\nu(y) = \int \Bigl(\int f(y)\, dP_x(y)\Bigr) h(x)\, d\mu(x). \tag{12.0.5}
$$
Moreover, for μ-almost all x, h(π(y)) = h(x) for Px -almost all y.

Proof. To prove (12.0.5) just take g(x, y) = h(x) f (y) in (12.0.4). If we replace h(x) with h(x)g(x) and use (12.0.5)
twice we get
Z Z  Z
f (y)h((y)) dPx (y) g(x) d(x) = f (y)h((y))g((y)) d(y)
Z Z 
= f (y)dPx (y) h(x)g(x) d(x)
Z Z 
= f (y)h(x) dPx (y) g(x) d(x).

Since g is arbitrary, this means that for -almost all x,


Z Z
f (y)h((y)) dPx (y) = f (y)h(x) dPx (y).

This also holds for -almost all x simultaneously for countably many functions f . We can choose these f to be the
indicators of sets in the algebra C in the proof of Theorem 36, and this choice obviously implies that h((y)) = h(x)
for Px -almost all y. t
u

Example. (Same space case) Suppose that now X = Y, X Y and : Y X is the identity map, (y) = y. Then
= X is just the restriction of to a smaller -algebra X . In this case, (12.0.5) becomes
Z Z Z 
f (y)h(y) d(y) = f (y) dPx (y) h(x) dX (x),

for any bounded X -measurable function h and Y -measurable function f . t


u

Exercise. Let (X,Y ) be a random variable on R2 with the density f (x, y). Show that if E|X| < ∞ then
$$
\operatorname{E}(X | Y) = \int_{\mathbb{R}} x f(x, Y)\, dx \Big/ \int_{\mathbb{R}} f(x, Y)\, dx,
$$



where the right hand side is defined almost surely.

Exercise. In the setting of the Disintegration Theorem, if X = Y = [0, 1], is the Lebesgue measure on Y , (y) =
2yI(y 1/2) + 2(1 y)I(y > 1/2), what is and Px ?

Exercise. In the setting of the Disintegration Theorem, show that if X is a separable metric space then, for -almost
every x, Px ( 1 (x)) = 1.

Exercise. Suppose that X,Y and U are random variables on some probability space such that U is independent of
(X,Y ) and has the uniform distribution on [0, 1]. Prove that there exists a measurable function f : R [0, 1] R such
that (X, f (X,U)) has the same distribution as (X,Y ) on R2 .



Section 13

Martingales. Uniform Integrability.

Let (, B, P) be a probability space and let (T, ) be a linearly ordered set. We will mostly work with countable sets
T , such as Z or subsets of Z. Consider a family of random variables Xt : R and a family of -algebras (Bt )tT
such that Bt Bu B for t u. A family (Xt , Bt )tT is called a martingale if the following hold:

1. Xt is Bt -measurable for all t T (Xt is adapted to Bt ),


2. E|Xt | < for all t T ,
3. E(Xu |Bt ) = Xt for all t u.

If the last equality is replaced by E(Xu |Bt ) Xt then the process is called a supermartingale and if E(Xu |Bt ) Xt
then it is called a submartingale. If (Xt , Bt ) is a martingale and Xt = E(X|Bt ) for some random variables X then the
martingale is called right-closable. If some t0 T is an upper bound of T (i.e. t t0 for all t T ) then the martingale
is called right-closed. Of course, in this case Xt = E(Xt0 |Bt ). Clearly, a right-closable martingale can be made into
a right-closed martingale by adding an additional point, say t0 , to the set T so that t t0 for all t T , and defining
Xt0 := X. Let us give several examples of martingales.
Example 1. Consider a sequence (Xn )n1 of independent random variables such that EXi = 0 and let Sn = in Xi . If
Bn = (X1 , . . . , Xn ) then (Sn , Bn )n1 is a martingale since, for m n,

E(Sm |Bn ) = E(Sn + Xn+1 + . . . + Xm |Bn ) = Sn + EXn+1 + . . . + EXm = Sn ,

since Xn+1 , . . . , Xm are independent of Bn and Sn is Bn -measurable.


Example 2. Consider a sequence of -algebras . . . Bn Bn+1 . . . B and a random variable X such that
E|X| < . If we let Xn = E(X|Bn ) then (Xn , Bn ) is a right-closable martingale, since for n < m,

E(Xm |Bn ) = E(E(X|Bm )|Bn ) = E(X|Bn ) = Xn ,

by the smoothing property of conditional expectation.


Example 3. As a typical example, one can consider an integrable function X = f (ξ1 , . . . , ξn ) of some random variables
ξ1 , . . . , ξn , let B0 be the trivial σ-algebra and Bm = σ(ξ1 , . . . , ξm ) for m ≤ n. If we let Xm = E(X|Bm ) then X0 = EX,
Xn = X and the representation
$$
X = \operatorname{E} X + \sum_{m=1}^{n} (X_m - X_{m-1})
$$
is called the martingale-difference representation of X. This representation is particularly useful when ξ1 , . . . , ξn are
independent, because in this case Xm−1 is just the expectation of Xm with respect to ξm .
Example 4. Let (Xi )i≥1 be i.i.d. and let Sn = ∑i≤n Xi . Let T = {. . . , −2, −1} be the set of negative integers and, for
n ≥ 1, let us define
$$
\mathcal{B}_{-n} = \sigma(S_n, S_{n+1}, \ldots) = \sigma(S_n, X_{n+1}, X_{n+2}, \ldots).
$$
Clearly, B−(n+1) ⊆ B−n . For 1 ≤ k ≤ n, by symmetry (we leave the details for the exercise at the end of the section),
$$
\operatorname{E}(X_1 | \mathcal{B}_{-n}) = \operatorname{E}(X_k | \mathcal{B}_{-n}).
$$
Therefore,
$$
S_n = \operatorname{E}(S_n | \mathcal{B}_{-n}) = \sum_{1\le k\le n} \operatorname{E}(X_k | \mathcal{B}_{-n}) = n \operatorname{E}(X_1 | \mathcal{B}_{-n}) \quad\text{and}\quad Z_{-n} := \frac{S_n}{n} = \operatorname{E}(X_1 | \mathcal{B}_{-n}).
$$
Thus, (Zn , Bn )n≤−1 is a right-closed martingale. □

Theorem 40 (Doobs decomposition) If (Xn , Bn )n0 is a submartingale then it can be uniquely decomposed Xn =
Zn +Yn , where (Yn , Bn ) is a martingale, Z0 = 0, Zn Zn+1 almost surely and Zn is Bn1 -measurable.

The sequence (Zn ) is called predictable, since it is a function of X1 , . . . , Xn1 , so we know it at time n 1. Since an
increasing sequence is always convergent, the question of convergence for submartingales is reduced to the question
of convergence for martingales.

Proof. Let Dn = Xn Xn1 and

Gn = E(Dn |Bn1 ) = E(Xn |Bn1 ) Xn1 0

by the definition of submartingale. Let,

Hn = Dn Gn , Yn = H1 + . . . + Hn , Zn = G1 + + Gn .

Since Gn 0 almost surely, Zn Zn+1 and, by construction, Zn is Bn1 -measurable. We have,

E(Hn |Bn1 ) = E(Dn |Bn1 ) Gn = 0

and, therefore, E(Yn |Bn1 ) = Yn1 . Uniqueness follows by construction. Suppose that Xn = Zn + Yn with all stated
properties. First, since Z0 = 0, Y0 = X0 . By induction, given a unique decomposition up to n 1, we can write

Zn = E(Zn |Bn1 ) = E(Xn Yn |Bn1 ) = E(Xn |Bn1 ) Yn1

and Yn = Xn Zn . This finishes the proof. t


u
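A concrete instance of the decomposition (a sketch, not from the notes): for a ±1 random walk Sn , the submartingale Xn = Sn² decomposes as Yn + Zn with predictable part Zn = n, since Gn = E(Sn² | Bn−1 ) − Sn−1² = 1.

```python
# Sketch: Doob decomposition of the submartingale X_n = S_n^2 for a +/-1 random walk.
# Here G_n = E(X_n | B_{n-1}) - X_{n-1} = 1, so Z_n = n is the predictable increasing part
# and Y_n = S_n^2 - n should be a martingale; in particular E[Y_n] = 0 for every n.
import numpy as np

rng = np.random.default_rng(5)
reps, n = 100000, 50
eps = rng.choice([-1.0, 1.0], size=(reps, n))
S = np.cumsum(eps, axis=1)
Y = S**2 - np.arange(1, n + 1)          # candidate martingale part

print(Y.mean(axis=0)[[0, 9, 24, 49]])   # all entries should be close to 0
```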

When we study convergence properties of martingales, an important role will be played by their uniform integrability
properties. We say that a collection of random variables (Xt )tT is uniformly integrable if

sup E|Xt |I(|Xt | > M) 0 as M . (13.0.1)


t

For example, when |Xt | Y for all t T and EY < then, clearly, (Xt )tT is uniformly integrable. Other basic criteria
of uniform integrability will be given in the exercises below. The following criterion of the L1 -convergence, which can
be viewed as a strengthening of the dominated convergence theorem, will be useful.

Lemma 34 Consider random variables (Xn ) and X such that E|Xn | < , E|X| < . The following are equivalent:

1. E|Xn X| 0 and n .
2. (Xn )n1 is uniformly integrable and Xn X in probability.

Proof. 2=1. For any > 0 and K > 0, we can write,

E|Xn X| + E|Xn X|I(|Xn X| > )


+ 2KP(|Xn X| > ) + 2E|Xn |I(|Xn | > K) + 2E|X|I(|X| > K)
+ 2KP(|Xn X| > ) + 2 sup E|Xn |I(|Xn | > K) + 2E|X|I(|X| > K).
n



Since Xn X in probability, letting n ,

lim sup E|Xn X| + 2 sup E|Xn |I(|Xn | > K) + 2E|X|I(|X| > K).
n n

Now, first letting 0, and then letting K and using that (Xn ) is uniformly integrable proves the result.

1=2. By Chebyshevs inequality,


1
P(|Xn X| > ) E|Xn X| 0

as n , so Xn X in probability. To prove uniform integrability, let us recall that for any > 0 there exists > 0
such that P(A) implies that E|X|IA . This is because

E|X|IA KP(A) + E|X|I(|X| K),

and if we take K such that E|X|I(|X| K) /2 then it is enough to take = /(2K). Now, given > 0, take as
above and take M > 0 large enough, so that for all n 1,

E|Xn |
P(|Xn | > M) .
M
We showed that E|X|I(|Xn | > M) for such and, therefore,

E|Xn |I(|Xn | > M) E|Xn X| + E|X|I(|Xn | > M) E|Xn X| + .

For large enough n n0 , E|Xn X| and, therefore,

E|Xn |I(|Xn | > M) 2.

We can also choose M large enough so that E|Xn |I(|Xn | > M) 2 for n n0 and this finishes the proof. t
u
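A standard example (a sketch, not from the notes) of why uniform integrability cannot be dropped in Lemma 34: Xn = n I(U < 1/n) converges to 0 in probability, but E|Xn | = 1 for every n, so there is no L1 convergence; the family fails (13.0.1) since E|Xn |I(|Xn | > M) = 1 whenever n > M.

```python
# Sketch: X_n = n * I(U < 1/n) -> 0 in probability but E|X_n| = 1 for all n (no L^1 convergence),
# because (X_n) is not uniformly integrable: E|X_n| I(|X_n| > M) = 1 for every n > M.
import numpy as np

rng = np.random.default_rng(6)
U = rng.random(10**6)
for n in (10, 100, 1000):
    Xn = n * (U < 1.0 / n)
    print(n, np.mean(Xn > 0.5), np.mean(np.abs(Xn)))   # P(|X_n| > 0.5) -> 0, while E|X_n| stays near 1
```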

Next, we will prove some uniform integrability properties for martingales and submartingales, but first let us make the
following simple observation.

Lemma 35 Let f : R R be a convex function such that E| f (Xt )| < for all t. Suppose that either of the following
two conditions holds:

1. (Xt , Bt ) is a martingale;
2. (Xt , Bt ) is a submartingale and f is increasing.

Then ( f (Xt ), Bt ) is a submartingale.

Proof. Under the first condition, for t u, by conditional Jensens inequality,

f (Xt ) = f (E(Xu |Bt )) E( f (Xu )|Bt ).

Under the second condition, since Xt E(Xu |Bt ) for t u and f is increasing,

f (Xt ) f (E(Xu |Bt )) E( f (Xu )|Bt ),

where we again used conditional Jensens inequality. t


u

Lemma 36 The following holds.

1. If (Xt , Bt )tT is a right-closable martingale then (Xt ) is uniformly integrable.


2. If (Xt , Bt )tT is a right-closable submartingale then, for any a R, (max(Xt , a)) is uniformly integrable.



The fact that submartingale (Xt , Bt )tT is right-closable means that Xt E(X|Bt ) for some X such that E|X| < .

Proof. 1. Since there exists an integrable random variable X such that Xt = E(X|Bt ),

|Xt | = |E(X|Bt )| E |X| Bt and E|Xt | E|X| < .




Since {|Xt | > M} Bt ,

Xt I(|Xt | > M) = I(|Xt | > M)E(X|Bt ) = E(XI(|Xt | > M)|Bt )

and, therefore,

E|Xt |I(|Xt | > M) E|X|I(|Xt | > M) KP(|Xt | > M) + E|X|I(|X| > K)


E|Xt | E|X|
K + E|X|I(|X| > K) K + E|X|I(|X| > K).
M M
Letting M , K proves that supt E|Xt |I(|Xt | > M) 0 as M .

2. Since Xt E(X|Bt ) and {Xt > M} Bt , as above this implies that EXt I(Xt > M) EXI(Xt > M). Also, since the
function max(a, x) is convex and increasing in x, by the part (b) of the previous lemma,

E max(a, Xt ) E max(a, X). (13.0.2)

Finally, if we take M > |a| then the inequality | max(Xt , a)| > M can hold only if max(Xt , a) = Xt > M. Combining all
these observations,

E| max(Xt , a)|I(| max(Xt , a)| > M) = EXt I(Xt > M) EXI(Xt > M)
KP(Xt > M) + E|X|I(|X| > K)
E max(Xt , 0)
K + E|X|I(|X| > K)
M
E max(X, 0)
{by (13.0.2)} K + E|X|I(|X| > K).
M
Letting M and then K proves that (max(Xt , a))tT is uniformly integrable. t
u

Exercise. In the setting of the Example 3 above, prove carefully that E(X1 |Bn ) = E(Xk |Bn ) for 1 k n.

Exercise. Let (Xn )n1 be i.i.d. random variables and, given a bounded measurable function f : R2 R such that
f (x, y) = f (y, x), consider
2
Sn = 0 f (Xl , Xl0 )
n(n 1) 1l<l n

(called the U-statistics) and the -algebras Fn = X(1) , . . . , X(n) , (Xl )l>n , where X(1) , . . . , X(n) are the order statistics


of X1 , . . . , Xn . Prove that F(n+1) Fn and Sn = E(S2 |Fn ), i.e. (Sn , Fn )n1 is a right-closed martingale.

Exercise. Suppose that (Xn )n1 are i.i.d. and E|X1 | < . If Sn = X1 + . . . + Xn , show that the sequence (Sn /n)n1 is
uniformly integrable.

Exercise. Show that if suptT E|Xn |1+ < for some > 0 then (Xt )tT is uniformly integrable.

Exercise. Show that (Xt )tT is uniformly integrable if and only if (a) supt E|Xt | < and (b) for any > 0 there exists
> 0 such that suptT E|Xt | IA if P(A) .

Exercise. Suppose that random variables Xn 0 for n 0, Xn X0 in probability and EXn EX0 . Prove that the limit
limn E|Xn X0 | = 0. (Hint: consider (X0 Xn )+ .)



Section 14

Stopping times.

Consider a probability space (, B, P) and a sequence of -algebras Bn B for n 0 such that Bn Bn+1 . An
integer valued random variable {0, 1, 2, . . .} is called a stopping time if { n} Bn for all n. Of course, in this
case we also have
{ = n} = { n} \ { n 1} Bn .
Given a stopping time , let us consider the -algebra

B = B B { n} B Bn for all n


consisting of all events that depend on the data up to a stopping time τ. If a sequence of random variables (Xn )
is adapted to (Bn ), i.e. Xn is Bn -measurable, then random variables such as Xτ or ∑τk=1 Xk are Bτ -measurable. For
example,
$$
\{\tau \le n\} \cap \{X_\tau \in A\} = \bigcup_{k\le n} \bigl(\{\tau = k\} \cap \{X_k \in A\}\bigr) = \bigcup_{k\le n} \Bigl(\bigl(\{\tau \le k\} \setminus \{\tau \le k-1\}\bigr) \cap \{X_k \in A\}\Bigr) \in \mathcal{B}_n.
$$

Let us mention several basic properties of stopping times.

1. Constant = n is a stopping time and B = Bn .

2. Stopping time is B -measurable. To see this, we need to show that { k} B for all integer k 0. This is
true, because
{ k} { n} = { k n} Bkn Bn ,
by the definition of stopping time.

3. is a stopping time if and only if { > n} Bn for all n. This is obvious, because { > n} = { n}c and
-algebras as closed under taking complements.

4. If (τk )k≥1 are all stopping times then their minimum ∧k≥1 τk and maximum ∨k≥1 τk are also stopping times. This is
true, because
$$
\Bigl\{\bigwedge_{k\ge 1} \tau_k > n\Bigr\} = \bigcap_{k\ge 1} \{\tau_k > n\} \in \mathcal{B}_n, \qquad \Bigl\{\bigvee_{k\ge 1} \tau_k \le n\Bigr\} = \bigcap_{k\ge 1} \{\tau_k \le n\} \in \mathcal{B}_n.
$$

5. If τ1 and τ2 are stopping times then the sum τ1 + τ2 is a stopping time. This is true, because
$$
\{\tau_1 + \tau_2 \le n\} = \bigcup_{k=0}^{n} \bigl(\{\tau_1 = k\} \cap \{\tau_2 \le n - k\}\bigr) \in \mathcal{B}_n.
$$

We will leave other properties, concerning two stopping times 1 and 2 , as an exercise.

6. The events {1 < 2 }, {1 = 2 } and {1 > 2 } belong to B1 and B2 .



7. If B B1 then B {1 2 } and B {1 < 2 } are in B2 .

8. If 1 2 then B1 B2 .
A martingale (Xn , Bn ) often represents a fair game, so that the average value E(Xm |Bn ) at some future time m > n
given the data Bn at time n is equal to the value Xn at time n. A stopping time τ represents a strategy to stop the game
based only on the information at any given time, so that the event {τ ≤ n} that we stop the game before time n depends
only on the data Bn up to that time. A classical result that we will now prove states that, under some mild integrability
conditions, the average value under any strategy is the same, so there is no winning strategy on average.

Theorem 41 (Optional Stopping) Let (Xn , Bn ) be a martingale and τ1 , τ2 < ∞ be stopping times such that
$$
\operatorname{E}|X_{\tau_2}| < \infty, \qquad \lim_{n\to\infty} \operatorname{E}|X_n|\, I(n \le \tau_2) = 0. \tag{14.0.1}
$$
Then, for any set A ∈ Bτ1 ,
$$
\operatorname{E} X_{\tau_2} I_A I(\tau_1 \le \tau_2) = \operatorname{E} X_{\tau_1} I_A I(\tau_1 \le \tau_2).
$$

For example, if τ1 = 0 and A = Ω then EXτ2 = EX0 if the condition (14.0.1) is satisfied. For example, if the stopping
time τ2 is bounded then (14.0.1) is obviously satisfied. Let us note that without some kind of integrability condition,
Theorem 41 can not hold, as the following example shows.
Example. Consider a sequence (Xn ) of independent random variables such that P(Xn = ±2n ) = 1/2. If Bn = σ(X1 , . . . , Xn ) then (Sn , Bn ) is a
martingale. Let τ1 = 1 and τ2 = min{k ≥ 1 : Sk > 0}. Clearly, Sτ2 = 2 because if τ2 = k then
$$
S_{\tau_2} = S_k = -2 - 2^2 - \ldots - 2^{k-1} + 2^k = 2.
$$
However, 2 = ESτ2 ≠ ESτ1 = 0. Notice that the second condition in (14.0.1) is violated, since
$$
P(\tau_2 = n) = 2^{-n}, \qquad P(\tau_2 \ge n+1) = \sum_{k\ge n+1} 2^{-k} = 2^{-n}
$$
and, therefore,
$$
\operatorname{E}|S_n|\, I(n \le \tau_2) = 2 P(\tau_2 = n) + (2^{n+1} - 2) P(n + 1 \le \tau_2) = 2,
$$
which does not go to zero. □
Proof of Theorem 41. Consider a set A B1 . The result is based on the following formal computation with the
middle step (*) proved below,

EX2 IA I(1 2 ) = EX2 I A {1 = n} I(n 2 )
n1
() 
= EXn I A {1 = n} I(n 2 ) = EX1 IA I(1 2 ).
n1

To prove (*), it is enough to show that for An = A {1 = n} Bn ,

EX2 IAn I(n 2 ) = EXn IAn I(n 2 ). (14.0.2)

We begin by writing

EXn IAn I(n 2 ) = EXn IAn I(2 = n) + EXn IAn I(n < 2 ) = EX2 IAn I(2 = n) + EXn IAn I(n < 2 ).

Since {n < 2 } = {2 n}c Bn and (Xn , Bn ) is a martingale, the last term

EXn IAn I(n < 2 ) = EXn+1 IAn I(n < 2 ) = EXn+1 IAn I(n + 1 2 )

and, therefore,
EXn IAn I(n 2 ) = EX2 IAn I(2 = n) + EXn+1 IAn I(n + 1 2 ).



We can continue the same computation and, by induction on m, we get

EXn IAn I(n 2 ) = EX2 IAn I(n 2 < m) + EXm IAn I(m 2 ). (14.0.3)

It remains to let m and use the assumptions in (14.0.1). By the second assumption, the last term

EXm IA I(m 2 ) E|Xm |I(m 2 ) 0
n

as m . Since
X2 IAn I(n 2 m) X2 IAn I(n 2 )
almost surely as m and E|X2 | < , by the dominated convergence theorem,

EX2 IAn I(n 2 < m) EX2 IAn I(n 2 ).

This proves (14.0.2) and finishes the proof of the theorem. t


u
Example. (Computing EX ) If we take 1 = 0, n = 0 and A = then An = and the equation (14.0.3) becomes

EX0 = EX2 I(2 < m) + EXm I(m 2 ).

This implies that, for any stopping time ,

EX0 = lim EX I( n) lim EXn I( n) = 0, (14.0.4)


n n

which gives a clean condition to check if we would like to show that EX = EX0 . t
u
Example. (Hitting times of simple random walk) Given p ∈ (0, 1), consider i.i.d. random variables (Xi )i≥1 such that
$$
P(X_i = 1) = p, \qquad P(X_i = -1) = 1 - p,
$$
and consider a random walk S0 = 0, Sn+1 = Sn + Xn+1 . Consider two integers a ≥ 1 and b ≥ 1 and the stopping time
$$
\tau = \min\bigl\{k \ge 1 : S_k = -a \ \text{or} \ b\bigr\}.
$$
If we denote q = 1 − p then Yn = (q/p)^{Sn} is a martingale, since
$$
\operatorname{E}(Y_{n+1} | \mathcal{B}_n) = p \Bigl(\frac{q}{p}\Bigr)^{S_n + 1} + q \Bigl(\frac{q}{p}\Bigr)^{S_n - 1} = \Bigl(\frac{q}{p}\Bigr)^{S_n}.
$$
It is easy to show that P(τ ≥ n) → 0, which we leave as an exercise below. Since Sn ∈ [−a, b] for n ≤ τ and Sτ = −a or b,
the equation (14.0.4) implies that
$$
1 = \operatorname{E} Y_0 = \operatorname{E} Y_\tau = \Bigl(\frac{q}{p}\Bigr)^{b} P(S_\tau = b) + \Bigl(\frac{q}{p}\Bigr)^{-a} \bigl(1 - P(S_\tau = b)\bigr).
$$
If p ≠ 1/2, we can solve this:
$$
P(S_\tau = b) = \frac{(q/p)^{-a} - 1}{(q/p)^{-a} - (q/p)^{b}}.
$$
If q = p = 1/2 then Sn itself is a martingale and the equation (14.0.4) implies that
$$
0 = b\, P(S_\tau = b) - a \bigl(1 - P(S_\tau = b)\bigr) \quad\text{and}\quad P(S_\tau = b) = \frac{a}{a + b}.
$$
One can also compute Eτ, which we will also leave as an exercise. □
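A simulation sketch of the last formula (not from the notes; the parameters p, a, b are arbitrary): with r = q/p, optional stopping gives P(Sτ = b) = (1 − r^{−a})/(r^{b} − r^{−a}), which can be compared with the empirical frequency of hitting b before −a.

```python
# Sketch: hitting probabilities for the biased +/-1 walk stopped at -a or b, compared with
# the optional-stopping formula P(S_tau = b) = (1 - r**(-a)) / (r**b - r**(-a)), r = q/p.
import numpy as np

rng = np.random.default_rng(7)
p, a, b, reps = 0.45, 3, 5, 20000
r = (1 - p) / p
formula = (1 - r**(-a)) / (r**b - r**(-a))

hits_b = 0
for _ in range(reps):
    S = 0
    while -a < S < b:
        S += 1 if rng.random() < p else -1
    hits_b += (S == b)
print(hits_b / reps, formula)   # the empirical frequency should be close to the formula
```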
Example. (Fundamental Walds identity) Let (Xn )n1 be a sequence of i.i.d. random variables, S0 = 0 and Sn =
X1 + . . . + Xn . Suppose that the Laplace transform ( ) = Ee X1 is defined on some nontrivial interval ( , + )
containing 0. It is obvious that
e Sn
Yn =
( )n



is a martingale for ( , + ). Let be a stopping time. Since Y 0,

e S e S
lim E I( n) = E I( < )
n ( ) ( )
by the monotone convergence theorem. Since Y0 = 1, (14.0.4) implies in this case that

e S e Sn
1=E I( < ) lim E I( n) = 0. (14.0.5)
( ) n ( )n

The left hand side is called the fundamental Walds identity. In some cases one can use this to compute the Laplace
transform of a stopping time and, thus, its distribution. t
u
Example. (Symmetric random walk) Let X0 = 0, P(Xi = 1) = 1/2, Sn = kn Xk . Given integer z 1, let

= min{k : Sk = z or z}.

Since ( ) = ch( ) 1,
e Sn e| |z
E I( n) P( n)
( )n ch( )n
and the right hand side of (14.0.5) holds. Therefore,

e S e z + e z
1=E = e z Ech( ) I(S = z) + e z Ech( ) I(S = z) = Ech( )
( ) 2
by symmetry. Therefore,
1 1
Ech( ) = and Ee = 1
ch( z) ch(ch (e )z)
by the change of variables e = 1/ch . t
u
For more general stopping times the condition on the right hand side of (14.0.5) might not be easy to check. We will
now show another approach that is helpful to verify fundamental Walds identity. If P is the distribution of X1 , let P
be the distribution with the Radon-Nikodym derivative with respect to P given by

dP e x
= .
dP ( )
This is, indeed, a density with respect to P, since

e x ( )
Z
dP = = 1.
R (x) ( )
For convenience of notation, we will think of (Xn ) as the coordinates on the product space (R , B , P ) with the
cylindrical -algebra. Therefore, { = n} (X1 , . . . , Xn ) B n is a Borel set on Rn . We can write,

e S
e Sn
e (x1 ++xn )
Z
E I( < ) = E n
I( = n) = dP(x1 ) . . . dP(xn )
( ) n=1 ( ) n=1 ( )n
{=n}
Z
= dP (x1 ) . . . dP (xn ) = P
( < ).
n=1
{=n}

This means that we can think of the random variables Xn as having the distribution P and, to prove Walds identity,
we need to show that < with probability one.
Example. (Crossing a growing boundary) Suppose that we have a boundary given by a sequence ( f (k)) that changes
with time and a stopping time (crossing time):

= min k Sk f (k) .



To verify Walds identity in (14.0.5), we need to show that < with probability one, assuming now that random
variables Xn have the distribution P . Under this distribution, the random variables Xn have the expectation

e x 0 ( )
Z
E Xi = x dP(x) = .
( ) ( )

By the strong law of large numbers


Sn 0 ( )
lim =
n n ( )
and, therefore, if the growth of the crossing boundary satisfies

f (n) 0 ( )
lim sup <
n n ( )

then, obviously, the random walk Sn will cross it with probability one. By Holders inequality, ( ) is log-convex and,
therefore, 0 ( )/( ) is increasing, which means that if this condition holds for 0 then it holds for > 0 . t
u

Exercise. Prove the properties 6 8 of the stopping times above.

Exercise. Let (Xn , Bn ) be a martingale and 1 2 . . . be a non-decreasing sequence of bounded stopping times.
Show that (Xn , Bn ) is a matringale.

Exercise. Let (Xn , Bn )n0 be a martingale and a stopping time such that P( < ) = 1. Suppose that E|Xn |1+ c
for some c, > 0 and for all n. Prove that EX = EX0 .

Exercise. Given 0 < p < 1, consider i.i.d. random variables (Xi )i1 such that P(Xi = 1) = p, P(Xi = 1) = 1 p and
consider a random walk S0 = 0, Sn+1 = Sn + Xn+1 . Consider two integers a 1 and b 1 and define a stopping time

= min{k 1, Sn = a or b}.

(a) Show that P( n) 0.


(b) Compute E. Hint: for p , 1/2, use that Sn nEX1 is a martingale; for p = 1/2, use that Sn2 n is a martingale.

Exercise. Let X0 = 0, P(Xi = 1) = 1/2 and Sn = kn Xk . Let = min k Sk 0.95k + 100 . Prove that

e S
E I( < ) = 1
( )

for large enough.



Section 15

Doobs inequalities and convergence of


martingales.

In this section, we will study convergence of martingales, and we begin by proving two classical inequalities.

Theorem 42 (Doob's inequality) If (Xn , Bn )n≥1 is a submartingale and Yn = max1≤k≤n Xk then, for any M > 0,
$$
P\bigl(Y_n \ge M\bigr) \le \frac{1}{M} \operatorname{E} X_n I(Y_n \ge M) \le \frac{1}{M} \operatorname{E} X_n^{+}. \tag{15.0.1}
$$
Proof. Define a stopping time

min{k | Xk M, k n}, if such k exists,
1 =
n, otherwise.

Let 2 = n so that 1 2 . By the optional stopping theorem, Theorem 41,

E(Xn |B1 ) = E(X2 |B1 ) X1 .

Let us average this inequality over the set A = {Yn = max1kn Xn M}, which belongs to B1 , because
n o
A {1 k} = max Xi M Bkn Bk .
1ikn

On the event A, X1 M and, therefore,

EXn IA EX1 IA MEIA = MP(A).

This is precisely the first inequality in (15.0.1), and the second inequality is obvious. t
u
Example. (Second Kolmogorov's inequality) If (Xi ) are independent and EXi = 0 then Sn = ∑1≤i≤n Xi is a martingale
and Sn2 is a submartingale. Therefore, by Doob's inequality,
$$
P\Bigl(\max_{1\le k\le n} |S_k| \ge M\Bigr) = P\Bigl(\max_{1\le k\le n} S_k^2 \ge M^2\Bigr) \le \frac{1}{M^2} \operatorname{E} S_n^2 = \frac{1}{M^2} \sum_{1\le k\le n} \operatorname{Var}(X_k).
$$
This is a big improvement on Chebyshev's inequality, since we control the maximum max1≤k≤n |Sk | instead of one
sum |Sn |. □
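A quick empirical check of this maximal inequality (a sketch, not from the notes; the parameters are arbitrary):

```python
# Sketch: empirical check of the second Kolmogorov inequality
# P(max_{k<=n} |S_k| >= M) <= Var(S_n) / M^2 for sums of independent centered +/-1 variables.
import numpy as np

rng = np.random.default_rng(8)
reps, n, M = 100000, 200, 30.0
S = np.cumsum(rng.choice([-1.0, 1.0], size=(reps, n)), axis=1)
print(np.mean(np.abs(S).max(axis=1) >= M), n / M**2)   # empirical probability vs the bound n/M^2
```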
Doobs upcrossing inequality. Let (Xn , Bn )n1 be a submartingale. Given two real numbers a < b, we will define an
increasing sequence of stopping times (n )n1 when Xn is crossing a downward and b upward, as in figure 15.1. More
specifically, we define these stopping times by
 
1 = min n 1 Xn a , 2 = min n > 2 Xn b



b x x x

a x x x x

Figure 15.1: Stopping times of level crossings.

and, by induction, for k 2,


 
2k1 = min n > 2k2 Xn a , 2k = min n > 2k1 Xn b .

Then, we consider the random variable
$$
\nu(a, b, n) = \max\bigl\{k : \tau_{2k} \le n\bigr\},
$$
which is the number of upward crossings of the interval [a, b] before time n.

Theorem 43 (Doob's upcrossing inequality) If (Xn , Bn ) is a submartingale then
$$
\operatorname{E}\, \nu(a, b, n) \le \frac{\operatorname{E}(X_n - a)^{+}}{b - a}. \tag{15.0.2}
$$

Proof. Since x (x a)+ is increasing convex function, Zn = (Xn a)+ is also a submartingale. Clearly, the number
of upcrossings of the interval [a, b] by (Xn ) can be expressed in terms of the number of upcrossings of [0, b a] by
(Zn ),
X (a, b, n) = Z (0, b a, n),
which means that it is enough to prove (15.0.2) for nonnegative submartingales. From now on we can assume that
0 Xn and we would like to show that
EXn
E(0, b, n) .
b
Let us define a sequence of random variables j for j 1 by

1, 2k1 < j 2k for some k
j =
0, otherwise,

i.e. j is the indicator of the event that at time j the process is crossing [0, b] upward. Define X0 = 0. Then, clearly,
n n
b(0, b, n) j (X j X j1 ) = I( j = 1)(X j X j1 ).
j=1 j=1

Since (k ) are stopping times, the event


c
2k1 j 1 \ 2k j 1 B j1 ,
[ [ 
{ j = 1} = 2k1 < j 2k =
k1 k1

i.e. the fact that at time j we are crossing upward is determined completely by the sequence up to time j 1. Then
n n
X j X j1 B j1

bE(0, b, n) EI( j = 1)(X j X j1 ) = EI( j = 1)E
j=1 j=1
n 
= EI( j = 1) E(X j |B j1 ) X j1 .
j=1



Since (X j , B j ) is a submartingale, E(X j |B j1 ) X j1 and

I( j = 1)(E(X j |B j1 ) X j1 ) E(X j |B j1 ) X j1 .

Therefore,
n  n
bE(0, b, n) E E(X j |B j1 ) X j1 = E(X j X j1 ) = EXn .
j=1 j=1

This finishes the proof for nonnegative submartingales, and also the general case. t
u
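The following sketch (not from the notes) counts upcrossings of an interval along simple random walk paths (a martingale, hence a submartingale) and compares the average count with the bound (15.0.2).

```python
# Sketch: counting upcrossings of [a, b] along a path and comparing the average number
# with Doob's bound E nu(a, b, n) <= E(X_n - a)^+ / (b - a), for a simple random walk.
import numpy as np

def upcrossings(path, a, b):
    """Number of completed upward crossings of [a, b]: go below a, then back above b."""
    count, below = 0, False
    for x in path:
        if not below and x <= a:
            below = True
        elif below and x >= b:
            count, below = count + 1, False
    return count

rng = np.random.default_rng(9)
reps, n, a, b = 20000, 200, -2.0, 2.0
S = np.cumsum(rng.choice([-1.0, 1.0], size=(reps, n)), axis=1)
nu = np.array([upcrossings(path, a, b) for path in S])
bound = np.mean(np.maximum(S[:, -1] - a, 0.0)) / (b - a)
print(nu.mean(), bound)   # the average upcrossing count should not exceed the bound
```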
Finally, we will now use Doobs upcrossing inequality to prove our main result about the convergence of submartin-
gales, from which the convergence of martingales will follow.

Theorem 44 Suppose that (Xn , Bn )<n< is a submartingale.

1. Almost sure limit X = limn Xn always exists and EX +


< +. If B = n Bn then (Xn , Bn )n< is
a submartingale (i.e. we can add n = to our index set).
+
2. If supn EXn+ < , then almost sure limit X+ = limn+ Xn exists and EX+ < .
3. Moreover, if (Xn+ )<n< is uniformly integrable then (Xn , Bn )<n+ is a right-closed submartingale, where
B+ = n Bn = n Bn .

Proof. Let us note that Xn converges as n (here, = + or ) if and only if lim sup Xn = lim inf Xn . Therefore,
Xn diverges if and only if the following event occurs
 [
lim sup Xn > lim inf Xn = lim sup Xn b > a lim inf Xn ,
a<b

where the union is taken over all rational numbers a < b. This means that P(Xn diverges) > 0 if and only if there exist
rational a < b such that 
P lim sup Xn b > a lim inf Xn > 0.
If we recall that (a, b, n) denotes the number of upcrossings of [a, b], this is equivalent to

P lim (a, b, n) = > 0.
n

Proof of 1. For each n 1, let us define a sequence

Y1 = Xn ,Y2 = Xn+1 , . . . ,Yn = X1 .

By Doobs inequality,
E(Yn a)+ E(X1 a)+
EY (a, b, n) = < .
ba ba
Since 0 Y (a, b, n) (a, b) almost surely as n +, by the monotone convergence theorem,

E(a, b) = lim EY (a, b, n) < .


n+

Therefore, P((a, b) = ) = 0, which means that the sequence (Xn )n0 can not have infinitely many upcrossings of
[a, b] and, therefore, the limit
X = lim Xn
n
exists with probability one. By Fatous lemma and the submartingale property,
+ + +
EX lim inf EXn EX1 < .
n

It remains to show that (Xn , Bn )n< is a submartingale. First of all, X = limn Xn is measurable on Bn
for all n and, therefore, measurable on B = Bn . Let us take a set A Bn . We would like to show that, for



any m, EX IA EXm IA . Since (Xn , Bn )n0 is a right-closed submartingale, by part 2 of Lemma 36 in Lecture 13,
Xn a <n0 is uniformly integrable for any a R. Since X a = limn Xn a almost surely, we have L1
convergence by Lemma 34,
E(X a)IA = lim E(Xn a)IA .
n

Finally, since the function x x a is convex and non-decreasing, (Xn a, Bn )n< is a submartingale by Lemma
35 in Lecture 13 and, therefore, for any n m, we have E(Xn a)IA E(Xm a)IA since A Bn . This proves that
E(X a)IA E(Xm a)IA . Letting a , by the monotone convergence theorem, EX IA EXm IA .
Proof of 2. If (a, b, n) is the number of upcrossings of [a, b] by the sequence X1 , . . . , Xn then, by Doobs inequality,

(b a)E(a, b, n) E(Xn a)+ |a| + EXn+ K < .

Therefore, E(a, b) < for (a, b) = limn (a, b, n) and, as above, the limit X+ = limn+ Xn exists. Since all Xn
are measurable on B+ , so is X+ . Finally, by Fatous lemma,
+
EX+ lim inf EXn+ < .
n

Notice that a different assumption, supn E|Xn | < , would similarly imply that E|X+ | lim infn E|Xn | < .
Proof of 3. By 2, the limit X+ exists. We want to show that for any m and for any set A Bm ,

EXm IA EX+ IA .

We already mentioned above that (Xn a) is a submartingale and, therefore, E(Xm a)IA E(Xn a)IA for m n.
Clearly, if (Xn+ )<n< is uniformly integrable then (Xn a)<n< is uniformly integrable for all a R. Since
Xn a X+ a almost surely, by Lemma 34,

lim E(Xn a)IA = E(X+ a)IA


n

and this shows that E(Xm a)IA E(X+ a)IA . Letting a and using the monotone convergence theorem we
get that EXm IA EX+ IA . t
u
Convergence of martingales. If (Xn , Bn ) is a martingale then both (Xn , Bn ) and (Xn , Bn ) are submartingales and
one can apply the above theorem to both of them. For example, this implies the following.

1. If supn E|Xn | < then almost sure limit X+ = limn+ Xn exists and E|X+ | < .
2. If (Xn )<n< is uniformly integrable then (Xn , Bn )<n+ is a right-closed martingale.

In other words, a martingale is right-closable if and only if it is uniformly integrable. Of course, in this case we also
conclude that limn+ E|Xn X+ | = 0. For a martingale of the type E(X|Bn ) we can identify the limit as follows.

Theorem 45 (Levys convergence theorem) Let (, B, P) be a probability space and X : R be a random variable
such that E|X| < . Given a sequence of -algebras

B1 . . . Bn . . . B+ B

where B+ = 1n< Bn , we have Xn = E(X|Bn ) E(X|B+ ) almost surely.


S 

Proof. (Xn , Bn )1n< is a right-closable martingale since Xn = E(X|Bn ). Therefore, it is uniformly integrable and the
limit X+ := limn+ E(X|Bn ) exists. It remains to show that X+ = E(X|B+ ). Consider a set A n Bn . Since
S

A Bm for some m and (Xn ) is a martingale, EXn IA = EXm IA for n m. Therefore,

EX+ IA = lim EXn IA = EXm IA = E(E(X|Bm )IA ) = EXIA .


n

Since n Bn is an algebra, we get EX+ IA = EXIA for all A B+ = n Bn (by Dynkins theorem or monotone
S 

class theorem), which proves that X+ = E(X|B+ ). t


u
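A small illustration (a sketch, not from the notes) of this convergence on Ω = [0, 1]: with Bn the σ-algebra generated by the dyadic intervals of length 2−n, E(X|Bn ) averages X = f(U) over the dyadic interval containing U, and these averages converge to f(U) as n grows, since B+∞ here is the whole Borel σ-algebra.

```python
# Sketch of Levy's convergence theorem on [0,1]: B_n generated by dyadic intervals of length 2^-n.
# E(X|B_n) at a point u is the average of f over the dyadic interval containing u, and it
# converges to f(u) as n -> infinity (here B_infinity is the whole Borel sigma-algebra).
import numpy as np

f = lambda u: np.sin(2 * np.pi * u) + u**2
u0 = 0.3141592                                  # an arbitrary fixed point of [0, 1]

for n in (2, 5, 10, 15):
    k = int(u0 * 2**n)                          # index of the dyadic interval containing u0
    grid = np.linspace(k * 2.0**-n, (k + 1) * 2.0**-n, 1001)
    print(n, f(grid).mean(), f(u0))             # interval average of f vs the value f(u0)
```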



Example. (Strong law of large numbers) Let (Xn )n1 be i.i.d. such that E|X1 | < , and let

Bn = (Sn , Xn+1 , . . .), Sn = X1 + + Xn .

We showed before that


Sn
= E(X1 |Bn ),
Zn =
n
so (Zn , Bn ) is a reverse martingale and, by the above theorem, the almost sure limit Z = limn+ Zn exists and
is measurable on the -algebra B = n1 Bn . Since each event in B is obviously symmetric, i.e. invariant
T

under the permutations of finitely many (Xn ), by the Savage-Hewitt 0 1 law, the probability of each event is 0 or 1,
i.e. B consists of 0/ and up to sets of measure zero. Therefore, Z is a constant a.s. and, since (Zn )n1 is
also a martingale, EZ = EX1 . Therefore, we proved that Sn /n EX1 almost surely.
Example. (Improved Kolmogorovs law of large numbers) Consider a sequence (Xn )n1 such that

E(Xn+1 |X1 , . . . , Xn ) = 0.

We do not assume independence here. This assumption, obviously, implies that EXk Xl = 0 for k , l. Let us show that
if a sequence (bn ) is such that

EX 2
bn bn+1 , lim bn = and 2n <
n=1 bn
n

then Sn /bn 0 almost surely. Indeed, Yn = kn (Xk /bk ) is a martingale and (Yn ) is uniformly integrable, since

1 1 n EXk2
E|Yn |I(Yn | > M) E|Yn |2 = b2
M M k=1 k

(we used here that EXk Xl = 0 for k , l) and, therefore,

1 EXk2
sup E|Yn |I(Yn | > M) b2 0
n M k=1 k

as M . By the martingale convergence theorem, the limit Y = limn Yn exists almost surely, and Kroneckers
lemma implies that Sn /bn 0 almost surely. t
u
Exercise. Let (Xn )n≥1 be a martingale with EXn = 0 and EXn2 < ∞ for all n. Show that
$$
P\Bigl(\max_{1\le k\le n} X_k \ge M\Bigr) \le \frac{\operatorname{E} X_n^2}{\operatorname{E} X_n^2 + M^2}.
$$
Hint: (Xn + c)2 is a submartingale for any c > 0.

Exercise. Let (gi ) be i.i.d. standard normal random variables and let Xn = ∑ni=1 λi2 (g2i − 1) for some numbers (λi ).
Prove that for t > 0,
$$
P\Bigl(\max_{1\le i\le n} |X_i| \ge t\Bigr) \le 2 t^{-2} \sum_{i=1}^{n} \lambda_i^4.
$$

R |Y |
Exercise. Show that for any random variable Y, E|Y | p = pt p1 P(|Y | t)dt. Hint: represent |Y | p = pt p1 dt and
R
0 0
switch the order of integration.
Exercise. Let X,Y be two non-negative random variables such that for every t > 0,
Z
P(Y t) t 1 XI(Y t)dP.

For any p > 1, k f k p = ( | f | p dP)1/p and 1/p+1/q = 1, show that kY k p qkXk p . Hint: Use previous exercise, switch
R

the order of integration to integrate in t first, then use Holders inequality and solve for kY k p .



[Figure 15.2: Polya urn model: pick a ball at random and return it together with c balls of the same color.]

Exercise. Given a non-negative submartingale (X_n, B_n), let X_n* := max_{j≤n} X_j. Prove that for any p > 1 and 1/p + 1/q = 1,
||X_n*||_p ≤ q ||X_n||_p. Hint: use the previous exercise and Doob's maximal inequality.

Exercise. (Polya's urn model) Suppose we have b blue and r red balls in the urn. We pick a ball at random and return
it together with c balls of the same color. Let us consider the sequence

    Y_n = #(blue balls after n iterations) / #(total balls after n iterations).

Prove that the limit Y = lim_{n→∞} Y_n exists almost surely.
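A short simulation sketch of this exercise (the particular values of b, r, c below are arbitrary): along each sample path the proportion Y_n settles down, which is the almost sure convergence to be proved.

```python
import numpy as np

# Simulate one path of Polya's urn: Y_n is the proportion of blue balls.
rng = np.random.default_rng(2)
b, r, c = 3, 2, 1                       # arbitrary choices for the illustration
blue, total = b, b + r
path = []
for n in range(10_000):
    if rng.random() < blue / total:     # picked a blue ball
        blue += c
    total += c
    path.append(blue / total)
print(path[9], path[99], path[999], path[9999])   # Y_n stabilizes along the path
```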

Exercise. (Improved Kolmogorov's 0–1 law) Consider arbitrary random variables (X_i)_{i≥1} and consider the σ-algebras
B_n = σ(X_1, ..., X_n) and B_{+∞} = σ((X_n)_{n≥1}). Consider any set A ∈ B_{+∞} that is independent of B_n for all n (also called
a tail event). By considering the conditional expectations E(I_A | B_n), prove that P(A) = 0 or 1.

Exercise. Suppose that a random variable ξ takes values 0, 1, 2, 3, .... Suppose that Eξ = μ > 1 and Var(ξ) = σ² < ∞.
Let (ξ_{nk}) be independent copies of ξ. Let X_1 = 1 and define recursively X_{n+1} = Σ_{k=1}^{X_n} ξ_{nk}. Prove that S_n = X_n / μ^n
converges almost surely.
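A quick simulation sketch of this exercise (a Poisson offspring law with mean μ = 1.5 is an arbitrary choice satisfying the assumptions): along a path the normalized population X_n/μ^n fluctuates less and less.

```python
import numpy as np

# Galton-Watson population X_n with Poisson(mu) offspring (arbitrary choice with
# E xi = mu > 1 and finite variance); S_n = X_n / mu^n is the martingale in question.
rng = np.random.default_rng(3)
mu = 1.5
X = 1
for n in range(1, 31):
    X = int(rng.poisson(mu, size=X).sum()) if X > 0 else 0
    if n % 5 == 0:
        print(n, X / mu**n)   # S_n stabilizes as n grows (or is 0 upon extinction)
```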

Exercise. Let (X_n)_{n≥1} be i.i.d. random variables and, given a bounded measurable function f : R² → R such that
f(x, y) = f(y, x), consider the U-statistics

    S_n = (2 / (n(n−1))) Σ_{1≤l<l'≤n} f(X_l, X_{l'})

and the σ-algebras

    F_n = σ( X_(1), ..., X_(n), (X_l)_{l>n} ),

where X_(1), ..., X_(n) are the order statistics of X_1, ..., X_n. Prove that lim_{n→∞} S_n = E f(X_1, X_2) almost surely. Hint:
notice that ∩_{n≥1} F_n consists of symmetric events.



Section 16

Bounded Lipschitz functions.

We will now study the class of the so-called bounded Lipschitz functions on a metric space (S, d), which will play an
important role in the next section, when we deal with convergence of laws on metric spaces. For a function f : S → R,
let us define the Lipschitz semi-norm by

    ||f||_L = sup_{x≠y} |f(x) − f(y)| / d(x, y).

Clearly, ||f||_L = 0 if and only if f is constant, so ||f||_L is not a norm, even though it satisfies the triangle inequality.
Let us define the bounded Lipschitz norm by

    ||f||_BL = ||f||_L + ||f||_∞,

where ||f||_∞ = sup_{s∈S} |f(s)|. Let

    BL(S, d) = { f : S → R : ||f||_BL < ∞ }

be the set of all bounded Lipschitz functions on (S, d). We will now prove several facts about these functions.

Lemma 37 If f, g ∈ BL(S, d) then f g ∈ BL(S, d) and ||f g||_BL ≤ ||f||_BL ||g||_BL.

Proof. First of all, ||f g||_∞ ≤ ||f||_∞ ||g||_∞. We can write

    |f(x)g(x) − f(y)g(y)| ≤ |f(x)(g(x) − g(y))| + |g(y)(f(x) − f(y))|
                          ≤ ||f||_∞ ||g||_L d(x, y) + ||g||_∞ ||f||_L d(x, y)

and, therefore,

    ||f g||_BL ≤ ||f||_∞ ||g||_∞ + ||f||_∞ ||g||_L + ||g||_∞ ||f||_L ≤ ||f||_BL ||g||_BL,

which finishes the proof. □

Let us recall the notations a ∨ b = max(a, b) and a ∧ b = min(a, b), and let ◦ stand for either ∨ or ∧.

Lemma 38 The following inequalities hold:

(1) ||f_1 ◦ ... ◦ f_k||_L ≤ max_{1≤i≤k} ||f_i||_L,

(2) ||f_1 ◦ ... ◦ f_k||_BL ≤ 2 max_{1≤i≤k} ||f_i||_BL.

Proof. (1) It is enough to consider k = 2. For specificity, take ◦ = ∨. Given x, y ∈ S, suppose that

    f_1 ∨ f_2 (x) ≥ f_1 ∨ f_2 (y) = f_1(y).

Then

    |f_1 ∨ f_2 (y) − f_1 ∨ f_2 (x)| = f_1 ∨ f_2 (x) − f_1 ∨ f_2 (y)
        ≤ f_1(x) − f_1(y) if f_1(x) ≥ f_2(x),  and  ≤ f_2(x) − f_2(y) otherwise,

so in either case it is at most ||f_1||_L ∨ ||f_2||_L d(x, y). The case of ◦ = ∧ can be considered similarly.

(2) First of all, obviously, ||f_1 ◦ ... ◦ f_k||_∞ ≤ max_{1≤i≤k} ||f_i||_∞. Therefore, using (1),

    ||f_1 ◦ ... ◦ f_k||_BL ≤ max_i ||f_i||_∞ + max_i ||f_i||_L ≤ 2 max_i ||f_i||_BL

and this finishes the proof. □

Another important fact is the following.

Theorem 46 (Extension theorem) Given a set A ⊆ S and a bounded Lipschitz function f ∈ BL(A, d) on A, there exists
an extension h ∈ BL(S, d) such that f = h on A and ||h||_BL = ||f||_BL.

Proof. Let us first find an extension such that ||h||_L = ||f||_L. We will start by extending f to one point x ∈ S \ A. The
value y = h(x) must satisfy

    |y − f(s)| ≤ ||f||_L d(x, s) for all s ∈ A,

or, equivalently,

    sup_{s∈A} ( f(s) − ||f||_L d(x, s) ) ≤ y ≤ inf_{s∈A} ( f(s) + ||f||_L d(x, s) ).

Such y exists if and only if, for all s_1, s_2 ∈ A,

    f(s_1) + ||f||_L d(x, s_1) ≥ f(s_2) − ||f||_L d(x, s_2).

This inequality is satisfied because, by the triangle inequality,

    f(s_2) − f(s_1) ≤ ||f||_L d(s_1, s_2) ≤ ||f||_L (d(s_1, x) + d(s_2, x)).

It remains to apply Zorn's lemma to show that f can be extended to the entire S. Define the order by inclusion:

    f_1 ≼ f_2 if f_1 is defined on A_1, f_2 on A_2, A_1 ⊆ A_2, f_1 = f_2 on A_1 and ||f_1||_L = ||f_2||_L.

For any chain {f_α}, f = ∪_α f_α ≽ f_α. By Zorn's lemma, there exists a maximal element h. It is defined on the entire S
because, otherwise, we could extend it to one more point. To extend f preserving the bounded Lipschitz norm, take

    h' = (h ∧ ||f||_∞) ∨ (−||f||_∞).

By part (1) of the previous lemma, it is easy to see that ||h'||_BL = ||f||_BL. □
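The one-point step in this proof also yields a well-known explicit formula: h(x) = inf_{s∈A}( f(s) + ||f||_L d(x, s) ) always lies in the admissible interval above, so it defines an extension with the same Lipschitz constant, which can then be clipped at ±||f||_∞ as in the proof. Here is a small numerical sketch of this construction (the finite set A, the values of f and the test points are arbitrary choices).

```python
import numpy as np

# Sketch of the extension on R with d(x, y) = |x - y|: f is given on a finite set A.
# h(x) = min_{a in A} (f(a) + L*|x - a|) extends f with the same Lipschitz constant L;
# clipping at +-||f||_inf then preserves the full bounded Lipschitz norm.
A = np.array([0.0, 1.0, 2.5, 4.0])           # arbitrary finite set
fA = np.array([0.0, 0.7, 0.2, 1.0])          # arbitrary values of f on A
L = max(abs(fA[i] - fA[j]) / abs(A[i] - A[j])
        for i in range(len(A)) for j in range(len(A)) if i != j)
sup_norm = np.max(np.abs(fA))

def h(x):
    ext = np.min(fA + L * np.abs(x - A))     # Lipschitz extension with constant L
    return np.clip(ext, -sup_norm, sup_norm) # clip as in the proof

print([float(h(a)) for a in A])              # agrees with f on A
print(float(h(3.2)), float(h(-1.0)))         # values at new points
```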

Let (C(S), d_∞) denote the space of continuous real-valued functions on S with d_∞(f, g) = sup_{x∈S} |f(x) − g(x)|. To
prove the next property of bounded Lipschitz functions, let us first recall the following famous generalization of the
Weierstrass theorem. We will give the proof for convenience.

Theorem 47 (Stone-Weierstrass theorem) Let (S, d) be a compact metric space and let F ⊆ C(S) be such that

1. F is an algebra, i.e. for all f, g ∈ F, c ∈ R, we have c f + g ∈ F, f g ∈ F.

2. F separates points, i.e. if x ≠ y ∈ S then there exists f ∈ F such that f(x) ≠ f(y).

3. F contains constants.

Then F is dense in (C(S), d_∞).



Proof. Consider bounded f ∈ F, i.e. |f(x)| ≤ M. The function x → |x| defined on the interval [−M, M] can be uniformly
approximated by polynomials of x by the Weierstrass theorem on the real line or, for example, using Bernstein's
polynomials. Therefore, |f(x)| can be uniformly approximated by polynomials of f(x) and, by properties 1 and 3,
by functions in F. Therefore, if F̄ is the closure of F in the d_∞ norm, then for any f ∈ F̄ its absolute value |f| ∈ F̄.
Therefore, for any f, g ∈ F̄ we have

    min(f, g) = (1/2)(f + g) − (1/2)|f − g| ∈ F̄,   max(f, g) = (1/2)(f + g) + (1/2)|f − g| ∈ F̄.   (16.0.1)

Given any points x ≠ y and c, d ∈ R, one can always find f ∈ F such that f(x) = c and f(y) = d. Indeed, by property
2 we can find g ∈ F such that g(x) ≠ g(y) and, as a result, the system of equations

    a g(x) + b = c,   a g(y) + b = d

has a solution a, b. Then the function f = a g + b satisfies the above and it is in F by properties 1 and 3. Take h ∈ C(S),
ε > 0, and fix x. For any y let f_y ∈ F be such that

    f_y(x) = h(x),   f_y(y) = h(y).

By continuity of f_y, for any y ∈ S there exists an open neighborhood U_y of y such that

    f_y(s) ≥ h(s) − ε for s ∈ U_y.

Since (U_y) is an open cover of the compact S, there exists a finite subcover U_{y_1}, ..., U_{y_N}. Let us define a function

    f^x(s) = max( f_{y_1}(s), ..., f_{y_N}(s) ) ∈ F̄ by (16.0.1).

By construction, it has the following properties:

    f^x(x) = h(x),   f^x(s) ≥ h(s) − ε for all s ∈ S.

Again, by continuity of f^x(s) there exists an open neighborhood U_x of x such that

    f^x(s) ≤ h(s) + ε for s ∈ U_x.

Take a finite subcover U_{x_1}, ..., U_{x_M} and define

    h'(s) = min( f^{x_1}(s), ..., f^{x_M}(s) ) ∈ F̄ by (16.0.1).

By construction, h'(s) ≤ h(s) + ε and h'(s) ≥ h(s) − ε for all s ∈ S, which means that d_∞(h', h) ≤ ε. Since h' ∈ F̄, this
proves that F is dense in (C(S), d_∞). □

The above results can be now combined to prove the following property of bounded Lipschitz functions.

Corollary 2 If (S, d) is a compact space then the bounded Lipschitz functions BL(S, d) are dense in (C(S), d_∞).

Proof. We apply the Stone-Weierstrass theorem with F = BL(S, d). Property 3 is obvious, property 1 follows from
Lemma 37, and property 2 follows from the extension Theorem 46, since a function defined on two points x ≠ y such
that f(x) ≠ f(y) can be extended to a bounded Lipschitz function on the entire S. □

We will also need another well-known result from analysis. A set A ⊆ S is totally bounded if for any ε > 0 there exists
a finite ε-cover of A, i.e. a set of points a_1, ..., a_N such that A ⊆ ∪_{i≤N} B(a_i, ε), where B(a, ε) = {y ∈ S : d(a, y) ≤ ε} is
the ball of radius ε centered at a.

Theorem 48 (Arzela-Ascoli) If (S, d) is a compact metric space then a subset F ⊆ C(S) is totally bounded in the d_∞
metric if and only if F is equicontinuous and uniformly bounded.

Equicontinuous means that for any ε > 0 there exists δ > 0 such that if d(x, y) ≤ δ then, for all f ∈ F, |f(x) − f(y)| ≤ ε.
The following fact was used in the proof of the Selection Theorem, which was proved for general metric spaces.



Corollary 3 If (S, d) is a compact space then C(S) is separable in d_∞.

Proof. By the above theorem, BL(S, d) is dense in C(S). For any integer n ≥ 1, the set {f : ||f||_BL ≤ n} is, obviously,
uniformly bounded and equicontinuous. By the Arzela-Ascoli theorem, it is totally bounded and, therefore, separable,
which can be seen by taking the union of finite 1/m-covers for all m ≥ 1. The union

    ∪_{n≥1} { f : ||f||_BL ≤ n } = BL(S, d)

is also separable and, since it is dense in C(S), C(S) is separable. □

Exercise. If F is a finite set {x_1, ..., x_n} and P is a law with P(F) = 1, show that for any law Q the supremum
sup{ ∫ f d(P − Q) : ||f||_L ≤ 1 } can be restricted to functions of the form f(x) = min_{1≤i≤n} (c_i + d(x, x_i)).



Section 17

Convergence of laws on metric spaces.

Let (S, d) be a metric space and B the Borel σ-algebra generated by open sets. Let us recall that P_n → P weakly if

    ∫ f dP_n → ∫ f dP

for all f ∈ C_b(S), the real-valued bounded continuous functions on S. For a set A ⊆ S, we denote by Ā the closure of A,
by int A the interior of A, and by ∂A = Ā \ int A the boundary of A. The set A is called a continuity set of P if P(∂A) = 0.

Theorem 49 (Portmanteau theorem) The following are equivalent.

(1) P_n → P weakly.
(2) For any open set U ⊆ S, lim inf_{n→∞} P_n(U) ≥ P(U).
(3) For any closed set F ⊆ S, lim sup_{n→∞} P_n(F) ≤ P(F).
(4) For any continuity set A of P, lim_{n→∞} P_n(A) = P(A).

Proof. (1) ⇒ (2). Let U be an open set and F = U^c. Consider a sequence of functions in C_b(S),

    f_m(s) = min(1, m d(s, F)).

Since F is closed, d(s, F) = 0 only if s ∈ F and, therefore, f_m(s) ↑ I(s ∈ U) as m → ∞. Since, for each m ≥ 1,

    P_n(U) ≥ ∫ f_m dP_n,

taking the limit n → ∞ and using that P_n → P,

    lim inf_{n→∞} P_n(U) ≥ ∫ f_m dP.

Now, letting m → ∞ and using the monotone convergence theorem implies that

    lim inf_{n→∞} P_n(U) ≥ ∫ I(s ∈ U) dP(s) = P(U),

which proves (2).

(2) ⇔ (3). Obvious by taking complements.

(2), (3) ⇒ (4). Since int A is open and Ā is closed, and int A ⊆ A ⊆ Ā, by (2) and (3),

    P(int A) ≤ lim inf_{n→∞} P_n(int A) ≤ lim sup_{n→∞} P_n(Ā) ≤ P(Ā).

If P(∂A) = 0 then P(A) = P(int A) = P(Ā) and, therefore, lim_{n→∞} P_n(A) = P(A).

(4) ⇒ (1). Consider f ∈ C_b(S) and let F_y = {s ∈ S : f(s) = y} be a level set of f. There exist at most countably many
y such that P(F_y) > 0. Therefore, for any ε > 0 we can find a sequence a_1 ≤ ... ≤ a_N such that

    max_k (a_{k+1} − a_k) ≤ ε,   P(F_{a_k}) = 0 for all k,

and the range of f is inside the interval (a_1, a_N). Let

    B_k = { s ∈ S : a_k ≤ f(s) < a_{k+1} }   and   f_ε(s) = Σ_k a_k I(s ∈ B_k).

Since f is continuous, ∂B_k ⊆ F_{a_k} ∪ F_{a_{k+1}} and P(∂B_k) = 0. By (4),

    ∫ f_ε dP_n = Σ_k a_k P_n(B_k) → Σ_k a_k P(B_k) = ∫ f_ε dP.

Since, by construction, |f(s) − f_ε(s)| ≤ ε, letting ε → 0 proves that ∫ f dP_n → ∫ f dP, so P_n → P weakly. □

Next, we will introduce two metrics on the set of all probability measures on (S, d) with the Borel σ-algebra B and,
under some mild conditions, prove that they metrize weak convergence. For a set A ⊆ S, let us denote by

    A^ε = { y ∈ S : d(x, y) < ε for some x ∈ A }

its open ε-neighborhood. If P and Q are probability distributions on S then

    ρ(P, Q) = inf{ ε > 0 : P(A) ≤ Q(A^ε) + ε for all A ∈ B }

is called the Levy-Prohorov distance between P and Q.

Lemma 39 ρ is a metric on the set of probability laws on B.

Proof. (1) First, let us show that ρ(Q, P) = ρ(P, Q). Suppose that ρ(P, Q) > ε. Then there exists a set A such that
P(A) > Q(A^ε) + ε. Taking complements gives

    Q((A^ε)^c) > P(A^c) + ε ≥ P(((A^ε)^c)^ε) + ε,

where the last inequality follows from the fact that ((A^ε)^c)^ε ⊆ A^c:

    a ∈ ((A^ε)^c)^ε  ⇒  d(a, b) < ε for some b ∈ (A^ε)^c
                     ⇒  (since b ∉ A^ε, d(b, A) ≥ ε)  d(a, A) > 0  ⇒  a ∉ A  ⇒  a ∈ A^c.

Therefore, for the set B = (A^ε)^c, Q(B) > P(B^ε) + ε. This means that ρ(Q, P) ≥ ε and, therefore, ρ(Q, P) ≥ ρ(P, Q). By
symmetry, ρ(Q, P) ≤ ρ(P, Q), so ρ(Q, P) = ρ(P, Q).

(2) Next, let us show that if ρ(P, Q) = 0 then P = Q. For any set F and any n ≥ 1,

    P(F) ≤ Q(F^{1/n}) + 1/n.

If F is closed then F^{1/n} ↓ F as n → ∞ and, by the continuity of measure,

    P(F) ≤ lim_{n→∞} Q(F^{1/n}) = Q(F).

Similarly, Q(F) ≤ P(F) and, therefore, P(F) = Q(F) for all closed sets and, therefore, for all Borel sets.

(3) Finally, let us prove the triangle inequality

    ρ(P, R) ≤ ρ(P, Q) + ρ(Q, R).

If ρ(P, Q) < x and ρ(Q, R) < y then for any set A,

    P(A) ≤ Q(A^x) + x ≤ R((A^x)^y) + y + x ≤ R(A^{x+y}) + x + y,

which means that ρ(P, R) ≤ x + y, and minimizing over x, y proves the triangle inequality. □
Given probability distributions P, Q on the metric space (S, d), we define the bounded Lipschitz distance between them
by

    β(P, Q) = sup{ ∫ f dP − ∫ f dQ : ||f||_BL ≤ 1 }.

In this case, it is much easier to check that this is a metric.

Lemma 40 β is a metric on the set of probability laws on B.

Proof. It is obvious that β(P, Q) = β(Q, P) and the triangle inequality holds. It remains to prove that if β(P, Q) = 0
then P = Q. Given a closed set F and U = F^c, it is easy to see that

    f_m(x) = m d(x, F) ∧ 1 ↑ I_U

as m → ∞. Obviously, ||f_m||_BL ≤ m + 1 and, therefore, β(P, Q) = 0 implies that ∫ f_m dP = ∫ f_m dQ. By the monotone
convergence theorem, letting m → ∞ proves that P(U) = Q(U), so P = Q. □
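For two laws supported on finitely many points, β(P, Q) can be evaluated as a small linear program over the values of f on the support, together with its Lipschitz constant and sup norm; by the extension theorem of the previous section it is enough to optimize over functions defined on the support. The sketch below (using scipy.optimize.linprog; the supports and weights are arbitrary choices) is one way to do this numerically.

```python
import numpy as np
from scipy.optimize import linprog

# beta(P, Q) for laws on a common finite support: variables are f_i = f(x_i),
# plus L >= ||f||_L and M >= ||f||_inf, with the constraint L + M <= 1.
x = np.array([0.0, 1.0, 3.0])          # arbitrary support
p = np.array([0.5, 0.3, 0.2])          # law P
q = np.array([0.2, 0.2, 0.6])          # law Q
n = len(x)
D = np.abs(x[:, None] - x[None, :])    # pairwise distances

A_ub, b_ub = [], []
for i in range(n):
    for j in range(n):
        if i != j:                     # f_i - f_j <= L * d(x_i, x_j)
            row = np.zeros(n + 2); row[i], row[j], row[n] = 1, -1, -D[i, j]
            A_ub.append(row); b_ub.append(0.0)
for i in range(n):                     # |f_i| <= M
    for s in (1, -1):
        row = np.zeros(n + 2); row[i], row[n + 1] = s, -1
        A_ub.append(row); b_ub.append(0.0)
row = np.zeros(n + 2); row[n], row[n + 1] = 1, 1   # L + M <= 1
A_ub.append(row); b_ub.append(1.0)

c = np.concatenate([-(p - q), [0.0, 0.0]])         # maximize sum (p_i - q_i) f_i
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None), (0, None)])
print("beta(P, Q) approx", -res.fun)
```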
Let us now show that on separable metric spaces the metrics ρ and β metrize weak convergence. Before we prove this,
let us recall the statement of Ulam's theorem proved in Theorem 5 in Section 1. Namely, every probability law P on a
complete separable metric space (S, d) is tight, which means that for any ε > 0 there exists a compact K ⊆ S such that
P(S \ K) ≤ ε.

Theorem 50 If (S, d) is separable or P is tight then the following are equivalent:

(1) P_n → P.
(2) For all f ∈ BL(S, d), ∫ f dP_n → ∫ f dP.
(3) β(P_n, P) → 0.
(4) ρ(P_n, P) → 0.

Remark. We will prove the implications (1) ⇒ (2) ⇒ (3) ⇒ (4) ⇒ (1), and the assumption that (S, d) is
separable or P is tight will be used only in one step, to prove (2) ⇒ (3).
Proof. (1) ⇒ (2). This follows by definition, since BL(S, d) ⊆ C_b(S).

(3) ⇒ (4). In fact, we will prove that

    ρ(P_n, P) ≤ 2 √(β(P_n, P)).    (17.0.1)

Given a Borel set A ⊆ S, consider the function

    f(x) = 0 ∨ (1 − ε^{−1} d(x, A)),  so that  I_A ≤ f ≤ I_{A^ε}.

Obviously, ||f||_BL ≤ 1 + ε^{−1} and we can write

    P_n(A) ≤ ∫ f dP_n = ∫ f dP + ( ∫ f dP_n − ∫ f dP )
           ≤ P(A^ε) + (1 + ε^{−1}) sup{ ∫ f dP_n − ∫ f dP : ||f||_BL ≤ 1 }
           = P(A^ε) + (1 + ε^{−1}) β(P_n, P) ≤ P(A^ε) + δ,

where δ = max(ε, (1 + ε^{−1}) β(P_n, P)). This implies that ρ(P_n, P) ≤ δ. Since ε is arbitrary, we can minimize δ = δ(ε)
over ε. If we take ε = √β, where β = β(P_n, P), then δ = max(√β, √β + β) = √β + β. If β ≤ 1 then ρ(P_n, P) ≤ √β + β ≤ 2√β,
and if β ≥ 1 then ρ(P_n, P) ≤ 1 ≤ 2√β, which proves (17.0.1).

(4) ⇒ (1). Suppose that ρ(P_n, P) → 0, which means that there exists a sequence ε_n ↓ 0 such that

    P_n(A) ≤ P(A^{ε_n}) + ε_n for all measurable A ⊆ S.

If A is closed, then ∩_{n≥1} A^{ε_n} = A and, by the continuity of measure,

    lim sup_{n→∞} P_n(A) ≤ lim sup_{n→∞} ( P(A^{ε_n}) + ε_n ) = P(A).

By the Portmanteau Theorem, P_n → P.

(2) ⇒ (3). First, let us consider the case when (S, d) is complete. Then, by Ulam's theorem, again, we can find a
compact K such that P(S \ K) ≤ ε. If (S, d) is not separable then we assume that P is tight and, by definition, we can
again find a compact K such that P(S \ K) ≤ ε. If we consider the function

    f(x) = 0 ∨ (1 − ε^{−1} d(x, K)),  so that  ||f||_BL ≤ 1 + ε^{−1},

then

    P_n(K^ε) ≥ ∫ f dP_n → ∫ f dP ≥ P(K) ≥ 1 − ε,

which implies that, for n large enough, P_n(K^ε) ≥ 1 − 2ε. Let

    B = { f : ||f||_{BL(S,d)} ≤ 1 },   B_K = { f|_K : f ∈ B } ⊆ C(K),

where f|_K denotes the restriction of f to K. Since K is compact, by the Arzela-Ascoli theorem, B_K is totally bounded
with respect to d_∞ and, given ε > 0, we can find k ≥ 1 and f_1, ..., f_k ∈ B such that, for all f ∈ B,

    sup_{x∈K} |f(x) − f_j(x)| ≤ ε for some j ≤ k.

This uniform approximation can also be extended to K^ε. Namely, for any x ∈ K^ε take y ∈ K such that d(x, y) ≤ ε. Then

    |f(x) − f_j(x)| ≤ |f(x) − f(y)| + |f(y) − f_j(y)| + |f_j(y) − f_j(x)|
                   ≤ ||f||_L d(x, y) + ε + ||f_j||_L d(x, y) ≤ 3ε.

Therefore, for large enough n, for any f ∈ B,

    | ∫ f dP_n − ∫ f dP | ≤ | ∫_{K^ε} f dP_n − ∫_{K^ε} f dP | + ||f||_∞ ( P_n((K^ε)^c) + P((K^ε)^c) )
                          ≤ | ∫_{K^ε} f dP_n − ∫_{K^ε} f dP | + 2ε + ε
                          ≤ | ∫_{K^ε} f_j dP_n − ∫_{K^ε} f_j dP | + 3ε + 3ε + 2ε + ε
                          ≤ | ∫ f_j dP_n − ∫ f_j dP | + 3ε + 3ε + 2(2ε + ε)
                          ≤ max_{1≤j≤k} | ∫ f_j dP_n − ∫ f_j dP | + 12ε.

Finally,

    β(P_n, P) = sup_{f∈B} | ∫ f dP_n − ∫ f dP | ≤ max_{1≤j≤k} | ∫ f_j dP_n − ∫ f_j dP | + 12ε

and, using the assumption (2), lim sup_{n→∞} β(P_n, P) ≤ 12ε. Letting ε → 0 finishes the proof.

Now, suppose that (S, d) is separable but not complete. Let (T, d) be the completion of (S, d). We showed in
the previous section that every function f ∈ BL(S, d) can be extended to a function f* on (T, d) preserving the norm ||f||_BL. This
extension is obviously unique, since S is dense in T, so there is a one-to-one correspondence f ↔ f* between the unit
balls of BL(S, d) and BL(T, d). Next, we can also view the measure P as a probability measure on (T, d) as follows. If
for any Borel set A in (T, d) we let

    P*(A) = P(A ∩ S)

(we define P_n* similarly) then P* is a probability measure on the Borel σ-algebra on (T, d). Note that S might not be
a measurable set in its completion (because of this, there is no natural correspondence between measures on S and
measures on its completion T), but one can easily check that the sets A ∩ S are Borel in (S, d) (this is because open sets
in S are intersections of open sets in T with S). One can also easily see that, with this definition,

    ∫_S f dP = ∫_T f* dP*,

which means that the statements in (2) and (3) hold simultaneously on (S, d) and (T, d). This proves that (2) ⇒ (3) on
all separable metric spaces. □

The focus of the above theorem was to show that the metrics ρ and β metrize weak convergence. Another important aspect
of this result is that on separable spaces weak convergence can be checked on countably many functions f ∈ C_b(S, d),
which will be demonstrated in the following example.

Convergence of empirical measures. Let (Ω, A, P) be a probability space and X_1, X_2, ... : Ω → S be an i.i.d. sequence
of random variables with values in a metric space (S, d). Let μ be the law of X_i on S. Let us define the random empirical
measures μ_n on the Borel σ-algebra B on S by

    μ_n(A)(ω) = (1/n) Σ_{i=1}^n I(X_i(ω) ∈ A),  A ∈ B.

By the strong law of large numbers, for any f ∈ C_b(S),

    ∫ f dμ_n = (1/n) Σ_{i=1}^n f(X_i) → E f(X_1) = ∫ f dμ  a.s.

However, the set of measure zero where this convergence is violated depends on f, and it is not right away obvious that
the convergence holds for all f ∈ C_b(S) with probability one. We will need the following lemma.

Lemma 41 If (S, d) is separable then there exists a metric e on S such that (S, e) is totally bounded, and e and d define
the same topology, i.e. e(s_n, s) → 0 if and only if d(s_n, s) → 0.

Proof. First of all, it is easy to check that c = d/(1 + d) is a metric (that defines the same topology) taking values in
[0, 1]. If {s_n} is a dense subset of S, consider the map

    T(s) = ( c(s, s_n) )_{n≥1} : S → [0, 1]^N.

The key point here is that t_m → t in S, or lim_{m→∞} c(t_m, t) = 0, if and only if lim_{m→∞} c(t_m, s_n) = c(t, s_n)
for each n ≥ 1. In other words, t_m → t if and only if T(t_m) → T(t) in the Cartesian product [0, 1]^N equipped with the
product topology. This topology is compact and metrizable by the metric

    r( (u_n)_{n≥1}, (v_n)_{n≥1} ) = Σ_{n≥1} 2^{−n} |u_n − v_n|,

so we can now define the metric e on S by e(s, s') = r(T(s), T(s')). Because of what we said above, this metric defines
the same topology as d. Moreover, it is totally bounded, because the image T(S) of our metric space S under the map T
is totally bounded in ([0, 1]^N, r), since this space is compact. □



Theorem 51 (Varadarajan) Let (S, d) be a separable metric space. Then μ_n converges to μ weakly almost surely,

    P( ω : μ_n(·)(ω) → μ weakly ) = 1.

Proof. Let e be the metric in the preceding lemma. Clearly, C_b(S, d) = C_b(S, e) and weak convergence of measures is
not affected by this change of metric. If (T, e) is the completion of (S, e) then (T, e) is compact. By the Arzela-Ascoli
theorem, BL(T, e) is separable with respect to the d_∞ norm and, therefore, BL(S, e) is also separable. Let (f_m) be a
dense subset of BL(S, e). Then, by the strong law of large numbers,

    ∫ f_m dμ_n = (1/n) Σ_{i=1}^n f_m(X_i) → E f_m(X_1) = ∫ f_m dμ  a.s.

Therefore, on a set of probability one, ∫ f_m dμ_n → ∫ f_m dμ for all m ≥ 1 and, since (f_m) is dense in BL(S, e), on the
same set of probability one, this convergence holds for all f ∈ BL(S, e). Since (S, e) is separable, the previous theorem
implies that μ_n → μ weakly. □
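A small numerical illustration of this theorem on S = R (a standard normal law is an arbitrary choice): along a single sample path, the empirical integrals of a few fixed bounded Lipschitz test functions approach the true ones, uniformly over the small dictionary used here as a crude stand-in for the BL unit ball.

```python
import numpy as np

# Empirical measure mu_n of an i.i.d. N(0,1) sample versus the true law mu,
# tested on a few bounded Lipschitz functions.
rng = np.random.default_rng(4)
X = rng.standard_normal(200_000)
tests = [lambda t: np.minimum(1.0, np.abs(t)),   # ||f||_BL <= 2
         lambda t: np.sin(t),
         lambda t: 1.0 / (1.0 + t**2)]
# Monte Carlo proxies for the true integrals of f d(mu)
true_vals = [np.mean(f(rng.standard_normal(2_000_000))) for f in tests]
for n in (100, 10_000, 200_000):
    errs = [abs(np.mean(f(X[:n])) - v) for f, v in zip(tests, true_vals)]
    print(n, max(errs))   # discrepancy over the dictionary shrinks with n
```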
Convergence and uniform tightness. Next, we will make several connections between convergence of measures and
uniform tightness on general metric spaces, which are similar to the results in the Euclidean setting. The sequence of
laws (Pn ) is uniformly tight if, for each > 0, there exists a compact set K S such that Pn (S \ K) for all n 1.
First, we will show that, in some sense, uniform tightness is necessary for convergence of laws.

Theorem 52 If Pn P0 weakly and each Pn is tight for n 0, then (Pn )n0 is uniformly tight.

In particular, by Ulams theorem, any convergent sequence of laws on a complete separable metric space is uniformly
tight.
Proof. Since P_n → P_0 and P_0 is tight, by Theorem 50, the Levy-Prohorov metric ρ(P_n, P_0) → 0. Given ε > 0, let us
take a compact K such that P_0(K) > 1 − ε. By the definition of ρ,

    1 − ε < P_0(K) ≤ P_n( K^{ρ(P_n,P_0) + 1/n} ) + ρ(P_n, P_0) + 1/n

and, therefore,

    a(n) = inf{ δ > 0 : P_n(K^δ) > 1 − 2ε } → 0.

A probability measure is always closed regular, so any measurable set A can be approximated from inside by a closed subset F.
Since P_n is tight, we can choose a compact of measure close to one and, intersecting it with the closed subset F, we can
approximate any set A from inside by compact subsets. Therefore, there exists a compact K_n ⊆ K^{2a(n)} such that P_n(K_n) > 1 − 2ε.
If we take L = K ∪ (∪_{n≥1} K_n) then

    P_n(L) ≥ P_n(K_n) > 1 − 2ε

for all n ≥ 0. It remains to show that L is compact. Consider a sequence (x_n) in L. There are two possibilities.
First, if there exists an infinite subsequence (xn(k) ) that belongs to one of the compacts K j then it has a converging
subsubsequence in K j and as a result in L. If not, then there exists a subsequence (xn(k) ) such that xn(k) Km(k) and
m(k) as k . Since
Km(k) K 2a(m(k))
there exists yk K such that
d(xn(k) , yk ) 2a(m(k)).
Since K is compact, the sequence yk K has a converging subsequence yk(r) y K which implies that d(xn(k(r)) , y)
0, i.e. xn(k(r)) y L. Therefore, L is compact. t
u
We already know by the Selection Theorem that any uniformly tight sequence of laws on any metric space has a
converging subsequence. Under additional assumptions on (S, d) we can complement the Selection Theorem and
make some connections to the metrics defined above.

Theorem 53 Let (S, d) be a complete separable metric space and P be a subset of probability laws on S. Then the
following are equivalent.



(1) P is uniformly tight.

(2) For any sequence P_n ∈ P there exists a converging subsequence P_{n(k)} → P where P is a law on S.
(3) P has compact closure in the space of probability laws equipped with the Levy-Prohorov or bounded
Lipschitz metrics ρ or β.
(4) P is totally bounded with respect to ρ or β.

Remark. In other words, we are showing that, on complete separable metric spaces, total boundedness on the space
of probability measures is equivalent to uniform tightness. The rest is just basic properties of metric spaces. Also, the
implications (1) ⇒ (2) ⇒ (3) ⇒ (4) hold without the completeness assumption, and the only implication where
completeness will be used is (4) ⇒ (1).

Proof. (1) ⇒ (2). Any sequence P_n ∈ P is uniformly tight and, by the Selection Theorem, there exists a converging
subsequence.

(2) ⇒ (3). Since (S, d) is separable, by Theorem 50, P_n → P if and only if ρ(P_n, P) → 0 or β(P_n, P) → 0. Every sequence in
the closure of P can be approximated by a sequence in P. That sequence has a converging subsequence that, obviously,
converges to an element of the closure of P, which means that the closure of P is compact.

(3) ⇒ (4). Compact sets are totally bounded and, therefore, if the closure of P is compact, the set P is totally bounded.

(4) ⇒ (1). Since ρ ≤ 2√β, we will only deal with ρ. For any ε > 0, there exists a finite subset P_0 ⊆ P such that
P ⊆ P_0^ε in the metric ρ. Since (S, d) is complete and separable, by Ulam's theorem, for each Q ∈ P_0 there exists a compact K_Q
such that Q(K_Q) > 1 − ε. Therefore,

    K = ∪_{Q∈P_0} K_Q is a compact and Q(K) > 1 − ε for all Q ∈ P_0.

Let F be a finite set such that K ⊆ F^ε (here we will denote by F^ε the closed ε-neighborhood of F). Since P ⊆ P_0^ε,
for any P ∈ P there exists Q ∈ P_0 such that ρ(P, Q) < ε and, therefore,

    1 − ε ≤ Q(K) ≤ Q(F^ε) ≤ P(F^{2ε}) + ε.

Thus, 1 − 2ε ≤ P(F^{2ε}) for all P ∈ P. Given ε > 0, take ε_m = ε/2^{m+1} and find F_m as above, i.e.

    P( F_m^{ε/2^m} ) ≥ 1 − ε/2^m.

Then

    P( ∩_{m≥1} F_m^{ε/2^m} ) ≥ 1 − Σ_{m≥1} ε/2^m = 1 − ε.

Finally, L = ∩_{m≥1} F_m^{ε/2^m} is compact because it is closed and totally bounded by construction, and S is complete. □

Theorem 54 (Prokhorov's theorem) The set of probability laws on a complete separable metric space is complete
with respect to the metrics ρ and β.

Proof. If a sequence of laws is Cauchy with respect to ρ or β then it is totally bounded and, by the previous theorem,
it has a converging subsequence. Obviously, a Cauchy sequence will then converge to the same limit. □

Exercise. Prove that the space of probability laws Pr(S) on a compact metric space (S, d) is compact with respect to
the metrics ρ and β.

Exercise. If (S, d) is separable, prove that Pr(S) is also separable with respect to the metrics ρ and β. Hint: think about
probability measures concentrated on finite subsets of a dense countable set in S, and use the metric β.

Exercise. Let (S, d) be a metric space and B its Borel σ-algebra. Let Pr(S) be the set of all probability measures on
(S, B) equipped with the topology of weak convergence. Let F be the Borel σ-algebra on Pr(S) generated by open
sets. Consider the map T : Pr(S) × B → [0, 1] defined by T(P, A) = P(A). Prove that, for any fixed A ∈ B, the function
T(·, A) : Pr(S) → [0, 1] is F-measurable. Hint: first, prove this for open sets A ⊆ S.

Exercise. Let P, Q be two laws on R with distribution functions F, G respectively. Let

    λ(P, Q) = inf{ ε > 0 : F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε for all x }

be Levy's metric. (a) Show that λ is a metric metrizing convergence of laws on R. (b) Show that λ ≤ ρ, but that there
exist laws P_n, Q_n with λ(P_n, Q_n) → 0 while ρ(P_n, Q_n) ↛ 0.

Exercise. Define a specific finite set of laws F on [0, 1] such that for every law P on [0, 1] there exists Q ∈ F with
ρ(P, Q) < 0.1, where ρ is Prokhorov's metric. Hint: reduce the problem to the metric β.

Exercise. Let X_j be i.i.d. N(0, 1) random variables. Let H be a Hilbert space with orthonormal basis {e_j}_{j≥1}. Let
X = Σ_{j≥1} X_j e_j / j. For any ε > 0, find a compact K such that P(X ∈ K) > 1 − ε.



Section 18

Strassen's Theorem. Relationships between metrics.

Metric for convergence in probability. Let (Ω, B, P) be a probability space, (S, d) a metric space and X, Y : Ω → S
random variables with values in S. The quantity

    α(X, Y) = inf{ ε ≥ 0 : P(d(X, Y) > ε) ≤ ε }

is called the Ky Fan metric on the set L^0(Ω, S) of classes of equivalence of such random variables, where two random
variables are equivalent if they are equal almost surely. If we take a sequence

    ε_k ↓ ε = α(X, Y)

then P(d(X, Y) > ε_k) ≤ ε_k and, since I(d(X, Y) > ε_k) ↑ I(d(X, Y) > ε), by the monotone convergence theorem,
P(d(X, Y) > ε) ≤ ε. Thus, the infimum in the definition of α(X, Y) is attained.

Lemma 42 The Ky Fan metric α on L^0(Ω, S) metrizes convergence in probability.

Proof. First of all, clearly, α(X, Y) = 0 if and only if X = Y almost surely. To prove the triangle inequality,

    P( d(X, Z) > α(X, Y) + α(Y, Z) ) ≤ P( d(X, Y) > α(X, Y) ) + P( d(Y, Z) > α(Y, Z) )
                                     ≤ α(X, Y) + α(Y, Z),

so that α(X, Z) ≤ α(X, Y) + α(Y, Z). This proves that α is a metric. Next, if ε_n = α(X_n, X) → 0 then, for any ε > 0
and large enough n such that ε_n < ε,

    P(d(X_n, X) > ε) ≤ P(d(X_n, X) > ε_n) ≤ ε_n → 0.

Conversely, if X_n → X in probability then, for any m ≥ 1 and large enough n ≥ n(m),

    P( d(X_n, X) > 1/m ) ≤ 1/m,

which means that ε_n ≤ 1/m and ε_n → 0. □
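Here is a small sketch (not from the notes) of how one might estimate α(X, Y) from paired samples: replace P by the empirical distribution of the distances and take, approximately, the smallest ε with empirical P(d(X, Y) > ε) ≤ ε. The helper name ky_fan and the sample sizes below are arbitrary.

```python
import numpy as np

# Approximate empirical Ky Fan distance between paired real samples x_i, y_i:
# the first sorted distance d_(k) for which (fraction of |x_i - y_i| > d_(k)) <= d_(k).
def ky_fan(x, y):
    dist = np.sort(np.abs(x - y))
    n = len(dist)
    for k in range(n):
        # exceedance fraction at epsilon = dist[k] is at most (n - k - 1)/n
        if (n - k - 1) / n <= dist[k]:
            return dist[k]
    return dist[-1]

rng = np.random.default_rng(5)
x = rng.standard_normal(10_000)
y = x + 0.05 * rng.standard_normal(10_000)   # Y is a small perturbation of X
print(ky_fan(x, y))                          # small, reflecting closeness in probability
```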

Lemma 43 For X, Y ∈ L^0(Ω, S), the Levy-Prohorov metric satisfies

    ρ( L(X), L(Y) ) ≤ α(X, Y).

Proof. Take ε > α(X, Y), so that P(d(X, Y) ≥ ε) ≤ ε. For any measurable set A ⊆ S,

    P(X ∈ A) = P(X ∈ A, d(X, Y) < ε) + P(X ∈ A, d(X, Y) ≥ ε) ≤ P(Y ∈ A^ε) + ε,

which means that ρ(L(X), L(Y)) ≤ ε. Letting ε ↓ α(X, Y) proves the result. □
We will now prove that, in some sense, the opposite is also true. Let (S, d) be a metric space and P, Q be
probability laws on S. Suppose that these laws are close in the Levy-Prohorov metric ρ. Can we construct random
variables s_1 and s_2, with laws P and Q, that are defined on the same probability space and are close to each other in the
Ky Fan metric α? We will construct a distribution on the product space S × S such that the coordinates s_1 and s_2 have
marginal distributions P and Q and the distribution is concentrated on a neighborhood of the diagonal s_1 = s_2, where
s_1 and s_2 are close in the metric d, and the size of the neighborhood is controlled by ρ(P, Q).

Consider two sets X and Y. Given a subset K ⊆ X × Y and A ⊆ X, we define the K-image of A by

    A^K = { y ∈ Y : (x, y) ∈ K for some x ∈ A }.

A K-matching f of X into Y is a one-to-one function (injection) f : X → Y such that (x, f(x)) ∈ K. We will need the
following well-known matching theorem.

Theorem 55 (Hall's marriage theorem) If X, Y are finite and for all A ⊆ X,

    card(A^K) ≥ card(A)    (18.0.1)

then there exists a K-matching f of X into Y.

Proof. We will prove the result by induction on m = card(X). The case of m = 1 is obvious. For each x ∈ X there
exists y ∈ Y such that (x, y) ∈ K. If there is a matching f of X \ {x} into Y \ {y} then defining f(x) = y extends f to
X. If not, then since card(X \ {x}) < m, by the induction assumption, condition (18.0.1) is violated, i.e. there exists a set
A ⊆ X \ {x} such that card(A^K \ {y}) < card(A). But because we also know that card(A^K) ≥ card(A), this implies that
card(A^K) = card(A). Since card(A) < m, by induction there exists a matching of A onto A^K. If there is a matching of
X \ A into Y \ A^K, we can combine it with the matching of A and A^K. If not, then again, by the induction assumption, there
exists D ⊆ X \ A such that card(D^K \ A^K) < card(D). But then

    card( (A ∪ D)^K ) = card(D^K \ A^K) + card(A^K) < card(D) + card(A) = card(D ∪ A),

which contradicts the assumption (18.0.1). □
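The matching guaranteed by this theorem can be computed by the standard augmenting path algorithm for bipartite graphs. The sketch below (Kuhn's algorithm; not part of the notes, and the helper name max_matching and the tiny example are arbitrary) finds a maximum K-matching, hence a full matching whenever Hall's condition holds.

```python
# Maximum bipartite matching via augmenting paths (Kuhn's algorithm).
# K is given as adjacency lists: K[x] = list of y's with (x, y) in K.
def max_matching(K, n_x, n_y):
    match_y = [-1] * n_y            # match_y[y] = x currently matched to y, or -1

    def try_augment(x, seen):
        for y in K[x]:
            if not seen[y]:
                seen[y] = True
                if match_y[y] == -1 or try_augment(match_y[y], seen):
                    match_y[y] = x
                    return True
        return False

    size = sum(try_augment(x, [False] * n_y) for x in range(n_x))
    return size, {match_y[y]: y for y in range(n_y) if match_y[y] != -1}

# Tiny example: X = {0,1,2}, Y = {0,1,2}; the allowed pairs satisfy Hall's condition.
K = {0: [0, 1], 1: [1], 2: [1, 2]}
print(max_matching(K, 3, 3))        # size 3 and a K-matching x -> f(x)
```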
The matching theorem will be the key tool in the proof of the main result of this section. Here, given a set F ⊆ S, we
will denote F^ε = {s ∈ S : d(s, F) ≤ ε}.

Theorem 56 (Strassen) Suppose that (S, d) is a separable metric space and α, β > 0. Suppose that laws P and Q are
such that, for all measurable sets F ⊆ S,

    P(F) ≤ Q(F^α) + β.    (18.0.2)

Then for any ε > 0 there exist two non-negative measures μ, ν on S × S such that

1. λ = μ + ν is a law on S × S with marginals P and Q.

2. μ( d(x, y) > α + ε ) = 0.

3. ν(S × S) ≤ β + ε.

4. μ is a finite sum of product measures.

Remark. Condition (18.0.2) is a relaxation of the definition of the Levy-Prohorov metric, since one can take different
α, β > ρ(P, Q). Conditions 1–3 mean that we can construct a measure λ on S × S such that the coordinates x, y have
marginal distributions P, Q and are within distance α + ε of each other (condition 2) except for a set of measure
at most β + ε (condition 3).
Proof. Case A. The proof will proceed in several steps. We will start with the simplest case which, however, contains
the main idea. Given small ε > 0, take n ≥ 1 such that εn > 1. Suppose that the laws P, Q are uniform on finite subsets
M, N ⊆ S of equal cardinality,

    card(M) = card(N) = n,  P({x}) = Q({y}) = 1/n < ε,  x ∈ M, y ∈ N.

Using the condition (18.0.2), we would like to match as many points from M and N as possible, but only points that are
within distance α from each other. To use the matching theorem, we will introduce some auxiliary sets U and V that
are not too big, with size controlled by the parameter β, and the union of these sets with M and N satisfies a certain
matching condition.

Take an integer k such that βn ≤ k < (β + ε)n. Let us take sets U and V such that k = card(U) = card(V) and U, V
are disjoint from M, N. Define

    X = M ∪ U,  Y = N ∪ V.

Let us define a subset K ⊆ X × Y such that (x, y) ∈ K if and only if one of the following holds:

(i) x ∈ U,
(ii) y ∈ V,
(iii) d(x, y) ≤ α if x ∈ M, y ∈ N.

This means that the small auxiliary sets can be matched with any points, but only close points, d(x, y) ≤ α, can be
matched in the main sets M and N. Consider a set A ⊆ X with cardinality card(A) = r. If A ⊄ M then, by (i), A^K = Y
and card(A^K) ≥ r. Suppose now that A ⊆ M and, again, we would like to show that card(A^K) ≥ r. Using (18.0.2) and
the fact that, by (iii), A^α ∩ N = A^K ∩ N, we can write

    r/n = P(A) ≤ Q(A^α) + β = (1/n) card(A^α ∩ N) + β = (1/n) card(A^K ∩ N) + β.

Therefore,

    r = card(A) ≤ βn + card(A^K ∩ N) ≤ k + card(A^K ∩ N) = card(A^K),

since k = card(V) and A^K = V ∪ (A^K ∩ N). By the matching theorem, there exists a K-matching f of X into Y. Let

    T = { x ∈ M : f(x) ∈ N },

i.e. the points in M that are matched with points in N at distance d(x, y) ≤ α. Clearly, card(T) ≥ n − k, since at most k
points can be matched with a point in V, and for x ∈ T, by (iii), d(x, f(x)) ≤ α. For x ∈ M \ T, redefine f(x) to match
each x with a different point in N that is not matched with points in T. This defines a matching of M onto N. We
define the measures μ and ν by

    μ = (1/n) Σ_{x∈T} δ_{(x, f(x))},  ν = (1/n) Σ_{x∈M\T} δ_{(x, f(x))},

and let λ = μ + ν. First of all, obviously, λ has marginals P and Q because each point in M or N appears in the sum
μ + ν only once with weight 1/n. Also,

    μ( d(x, f(x)) > α ) = 0,  ν(S × S) = card(M \ T)/n ≤ k/n < β + ε.    (18.0.3)

Finally, both μ and ν are finite sums of point masses, which are product measures.
Case B. Suppose now that P and Q are concentrated on finitely many points with rational probabilities. Then we can
artificially split all points into smaller points of equal probabilities as follows. Let n be such that n > 1 and

nP(x), nQ(x) J = {1, 2, . . . , n}.

Define a discrete metric on J by f (i, j) = I(i , j) and define a metric on S J by



e (x, i), (y, j) = d(x, y) + f (i, j).
j
Define a measure P0 on S J as follows. If P(x) = n then
 1
P0 (x, i) = for i = 1, . . . , j.
n



Define Q0 similarly. Let us check that the laws P0 , Q0 satisfy the assumptions of Case A with + instead of . Given
a set F S J, define 
F1 = x S (x, j) F for some j .
Using (18.0.2),
P0 (F) P(F1 ) Q(F1 ) + Q0 (F + ) + ,
because f (i, j) . By Case A in (18.0.3), we can construct 0 = 0 + 0 with marginals P0 and Q0 such that

0 e((x, i), (y, j)) > + 2 = 0, 0 (S J) (S J) < + .


 

Let , , be the projections of 0 , 0 , 0 back onto S S by the map ((x, i), (y, j)) (x, y). Then, clearly, = + ,
has marginals P and Q and (S S) < + . Since

e (x, i), (y, j) = d(x, y) + f (i, j) d(x, y),

we get
d(x, y) > + 2 0 e((x, i), (y, j)) > + 2 = 0.
 

Finally, is obviously a finite sum of product measures. Of course, we can replace by /2.
Case C. (General case) Let P, Q be laws on a separable metric space (S, d). Let A be a maximal set such that for
all x, y A the distance d(x, y) . Such set A is called an -packing and it is countable, because S is separable, so
A = {xi }i1 . Since A is maximal, for each x S there exists y A such that d(x, y) < , otherwise, we could add x to
the set A. Let us create a partition of S using -balls around the points xi :

B1 = {x S : d(x, x1 ) < }, B2 = {d(x, x2 ) < } \ B1

and, iteratively for k 2,


Bk = {d(x, xk ) < } \ (B1 Bk1 ).
{Bk }k1 is a partition of S. Let us discretize measures P and Q by projecting them onto {xi }i1 :

P0 (xk ) = P(Bk ), Q0 (xk ) = Q(Bk ).

Consider any set F S. For any point x F, if x Bk then d(x, xk ) < , i.e. xk F and, therefore,

P(F) P0 (F ).

Also, if xk F then Bk F and, therefore,


P0 (F) P(F ).
To apply Case B, we need to approximate P0 by a measure on a finite number of points with rationals probabilities.
For large enough n 1, let
bnP0 (xk )c
P00 (xk ) = .
n
Clearly, as n , P00 (xk ) P0 (xk ). Notice that only a finite number of points carry non-zero weights P00 (xk ) > 0. Let
x0 be some auxiliary point outside of the sequence {xk } and let us assign to it the remaining probability

P00 (x0 ) = 1 P00 (xk ).


k1

If we take n large enough so that P00 (x0 ) < /2 then

|P00 (xk ) P0 (xk )| .


k0

All the relations above also hold true for Q, Q0 and Q00 that are defined similarly, and we can assume that the same
point x0 plays a role of the auxiliary point for both P00 and Q00 . Given F S, we can write

P00 (F) P0 (F) + P(F ) + Q(F + ) + + Q0 (F +2 ) + + Q00 (F +2 ) + + 2.



By Case B, there exists a decomposition 00 = 00 + 00 on S S with marginals P00 and Q00 such that

00 d(x, y) > + 3 = 0, 00 (S S) + 3.


Let us also move the points (x0 , xi ) and (xi , x0 ) for i 0 into in the support of 00 : since the total weight of these
points is at most , the total weight of 00 does not increase much:

00 (S S) + 5.

It remains to redistribute these measures from the sequence {xi }i0 to the entire space S in a way that recovers
marginal distributions P and Q and so that not much accuracy is lost. Define a sequence of measures on S by
P(C Bi )
Pi (C) = if P(Bi ) > 0, and Pi (C) = 0 otherwise,
P(Bi )
and define Qi similarly. The measures Pi and Qi are concentrated on Bi . Define

= 00 (xi , x j )(Pi Q j ).
i, j1

Since 00 (xi , x j ) = 0 unless d(xi , x j ) + 3, the measure is concentrated on the set {d(x, y) + 5} because
for x Bi , y B j ,
d(x, y) d(x, xi ) + d(xi , x j ) + d(x j , y) + ( + 3) + = + 5.
The marginals u and v of satisfy

u(C) := (C S) 00 (xi , x j )Pi (C) = 00 (xi , S)Pi (C)


i, j1 i1
00 0
P (xi )Pi (C) P (xi )Pi (C) = P(Bi )Pi (C) = P(C)
i1 i1 i1

and, similarly,
v(C) := (S C) Q(C).
If u(S) = v(S) = 1 then (S S) = 1 and is a probability measure with marginals P and Q, so we can take = 0.
Otherwise, take t = 1 u(S) = 1 v(S) and define
1
= (P u) (Q v).
t
It is easy to check that = + has marginals P and Q. For example,
1  
(C S) = (C S) + (C S) = u(C) + P(C) u(C) Q(S) v(S)
t
1   
= u(C) + P(C) u(C) 1 v(S) = u(C) + P(C) u(C) = P(C).
t
Also,
(S S) = t = 1 (S S) = 1 00 (S S) = 00 (S S) + 5.
Finally, by construction, is a finite sum of product measures. t
u
The following relationship between the Ky Fan and Levy-Prohorov metrics is an immediate consequence of Strassen's
theorem. We already saw that ρ(L(X), L(Y)) ≤ α(X, Y).

Theorem 57 If (S, d) is a separable metric space and P, Q are laws on S then, for any ε > 0, there exist random
variables X and Y on the same probability space with the distributions L(X) = P and L(Y) = Q such that

    α(X, Y) ≤ ρ(P, Q) + ε.

If P and Q are tight, one can take ε = 0.



Proof. Let us write r = ρ(P, Q) and take α = β = r in Strassen's theorem. By the definition of the Levy-Prohorov metric, for any ε > 0 and any set A,

    P(A) ≤ Q(A^{r+ε}) + r + ε.

By Strassen's theorem, there exists a measure λ on S × S with the marginals P, Q such that

    λ( d(x, y) > r + 2ε ) ≤ r + 2ε.    (18.0.4)

Therefore, if X and Y are the coordinate projections on S × S, i.e.

    X, Y : S × S → S,  X(x, y) = x,  Y(x, y) = y,

then, by the definition of the Ky Fan metric, α(X, Y) ≤ r + 2ε. If P and Q are tight then there exists a compact K such that
P(K), Q(K) ≥ 1 − ε. For ε = 1/n find λ_n as in (18.0.4). Since λ_n has marginals P and Q, λ_n(K × K) ≥ 1 − 2ε, which
means that the sequence (λ_n)_{n≥1} is uniformly tight. By the Selection Theorem, there exists a converging subsequence
λ_{n(k)} → λ. Obviously, λ again has marginals P and Q. Since, by construction,

    λ_n( d(x, y) > r + 2/n ) ≤ r + 2/n

and {d(x, y) > r + ε} is an open set in S × S, by the Portmanteau Theorem,

    λ( d(x, y) > r + ε ) ≤ lim inf_{k→∞} λ_{n(k)}( d(x, y) > r + ε ) ≤ r + ε.

Letting ε → 0 we get λ(d(x, y) > r) ≤ r. As above, α(X, Y) ≤ r = ρ(P, Q) for this measure λ. □
This also implies the following relationship between the bounded Lipschitz metric β and the Levy-Prohorov metric ρ.

Lemma 44 If (S, d) is a separable metric space then

    (1/2) β(P, Q) ≤ ρ(P, Q) ≤ 2 √(β(P, Q)).

Proof. We already proved the second inequality. To prove the first one, by the previous theorem, given ε > 0, we
can find random variables X and Y with the distributions P and Q such that α(X, Y) ≤ ρ(P, Q) + ε. Consider a bounded
Lipschitz function f, ||f||_BL < ∞. Then

    | ∫ f dP − ∫ f dQ | = |E f(X) − E f(Y)| ≤ E|f(X) − f(Y)|
                        ≤ ||f||_L (ρ(P, Q) + ε) + 2 ||f||_∞ P( d(X, Y) > ρ(P, Q) + ε )
                        ≤ ||f||_L (ρ(P, Q) + ε) + 2 ||f||_∞ (ρ(P, Q) + ε) ≤ 2 ||f||_BL (ρ(P, Q) + ε).

Thus, β(P, Q) ≤ 2(ρ(P, Q) + ε) and letting ε → 0 finishes the proof. □
Exercise. Show that the inequality β(P, Q) ≤ 2ρ(P, Q) is sharp in the following sense: for any t ∈ (0, 1) and ε > 0,
one can find laws P and Q on R such that ρ(P, Q) = t and β(P, Q) ≥ 2t − ε. Hint: let P, Q have atoms of size t at points
far apart, with P, Q otherwise the same.

Exercise. Suppose that X, Y and U are random variables on some probability space such that

    P(X ∈ A) ≤ P(Y ∈ A^ε) + ε

for all measurable sets A ⊆ R, and U is independent of (X, Y) and has the uniform distribution on [0, 1]. Prove that
there exists a measurable function f : R × [0, 1] → R such that

    P( |X − f(X, U)| > ε ) ≤ ε

and f(X, U) has the same distribution as Y.



Section 19

Kantorovich-Rubinstein Theorem.

Let (S, d) be a separable metric space. Denote by P_1(S) the set of all laws on S such that for some z ∈ S (equivalently,
for all z ∈ S),

    ∫_S d(x, z) dP(x) < ∞.

Let us denote by

    M(P, Q) = { λ : λ is a law on S × S with marginals P and Q }.

For P, Q ∈ P_1(S), the quantity

    W(P, Q) = inf{ ∫ d(x, y) dλ(x, y) : λ ∈ M(P, Q) }

is called the Wasserstein distance between P and Q. A measure λ ∈ M(P, Q) represents a transportation between
the measures P and Q. We can think of the conditional distribution λ(y|x) as a way to redistribute the mass in the
neighborhood of a point x so that the distribution P will be redistributed to the distribution Q. If the distance d(x, y)
represents the cost of moving x to y then the Wasserstein distance gives the optimal total cost of transporting P to Q.
Given any two laws P and Q on S, let us define

    γ(P, Q) = sup{ | ∫ f dP − ∫ f dQ | : ||f||_L ≤ 1 }

and

    m_d(P, Q) = sup{ ∫ f dP + ∫ g dQ : f, g ∈ C(S), f(x) + g(y) < d(x, y) }.

Notice that, obviously, for P, Q ∈ P_1(S), both γ(P, Q), m_d(P, Q) < ∞. Let us show that these two quantities are equal.

Lemma 45 We have γ(P, Q) = m_d(P, Q).

Proof. Given a function f such that ||f||_L ≤ 1, let us take a small ε > 0 and set g(y) = −f(y) − ε. Then

    f(x) + g(y) = f(x) − f(y) − ε ≤ d(x, y) − ε < d(x, y)

and

    ∫ f dP + ∫ g dQ = ∫ f dP − ∫ f dQ − ε.

Combining this with the same construction applied to −f(x), i.e. g(y) = f(y) − ε, we get

    | ∫ f dP − ∫ f dQ | ≤ sup{ ∫ f dP + ∫ g dQ : f, g ∈ C(S), f(x) + g(y) < d(x, y) } + ε,

which, of course, proves that γ(P, Q) ≤ m_d(P, Q). Let us now consider functions f, g such that f(x) + g(y) < d(x, y).
Define

    e(x) = inf_y ( d(x, y) − g(y) ) = − sup_y ( g(y) − d(x, y) ).

Clearly,

    f(x) ≤ e(x) ≤ d(x, x) − g(x) = −g(x)

and, therefore,

    ∫ f dP + ∫ g dQ ≤ ∫ e dP − ∫ e dQ.

The function e satisfies

    e(x) − e(x') = sup_y ( g(y) − d(x', y) ) − sup_y ( g(y) − d(x, y) ) ≤ sup_y ( d(x, y) − d(x', y) ) ≤ d(x, x'),

which means that ||e||_L ≤ 1 and, therefore, m_d(P, Q) ≤ γ(P, Q). This finishes the proof. □
Below, we will need the following version of the Hahn-Banach theorem.

Theorem 58 (Hahn-Banach) Let V be a normed vector space, E a linear subspace of V and U an open convex
set in V such that U ∩ E ≠ ∅. If r : E → R is a linear non-zero functional on E then there exists a linear functional
φ : V → R such that φ|_E = r and sup_U φ(x) = sup_{U∩E} r(x).

Proof. Let t = sup{r(x) : x ∈ U ∩ E} and let B = {x ∈ E : r(x) > t}. Since B is convex and U ∩ B = ∅, the Hahn-
Banach separation theorem implies that there exists a linear functional q : V → R that separates U and B. For any
x_0 ∈ U ∩ E let F = {x ∈ E : q(x) = q(x_0)}. Since q(x_0) < inf_B q(x), F ∩ B = ∅. This means that the hyperplanes
{x ∈ E : q(x) = q(x_0)} and {x ∈ E : r(x) = t} in the subspace E are parallel, and this implies that q(x) = c r(x) on E
for some c ≠ 0. Let φ = q/c. Then r = φ|_E and

    sup_U φ(x) = (1/c) sup_U q(x) ≤ (1/c) inf_B q(x) = inf_B r(x) = t = sup_{U∩E} r(x) = sup_{U∩E} φ(x).

Since U ⊇ U ∩ E, the inequality is actually an equality, which finishes the proof. □
Using this, we will prove the following Kantorovich-Rubinstein theorem for compact metric spaces.

Theorem 59 If S is a compact metric space then W(P, Q) = m_d(P, Q) = γ(P, Q) for P, Q ∈ P_1(S).

Proof. We only need to show the first equality. Consider the vector space V = C(S × S) equipped with the ||·||_∞ norm and
let

    U = { f ∈ V : f(x, y) < d(x, y) }.

Obviously, U is convex and open, because S × S is compact and any continuous function on a compact set achieves its
maximum. Consider a linear subspace E of V defined by

    E = { φ ∈ V : φ(x, y) = f(x) + g(y), f, g ∈ C(S) },

so that

    U ∩ E = { φ ∈ V : φ(x, y) = f(x) + g(y) < d(x, y), f, g ∈ C(S) }.

Define a linear functional r on E by

    r(φ) = ∫ f dP + ∫ g dQ if φ = f(x) + g(y).

By the above Hahn-Banach theorem, r can be extended to Φ : V → R such that Φ|_E = r and

    sup_U Φ(φ) = sup_{U∩E} r(φ) = m_d(P, Q) < ∞.

Let us look at the properties of this functional. First of all, if a(x, y) ≥ 0 then Φ(a) ≥ 0. Indeed, for any c ≥ 0,

    U ∋ d(x, y) − ε − c a(x, y) < d(x, y)

and, therefore, for all c ≥ 0,

    Φ(d − ε − c a) = Φ(d − ε) − c Φ(a) ≤ sup_U Φ < ∞.

This can hold only if Φ(a) ≥ 0. This implies that if ψ_1 ≤ ψ_2 then Φ(ψ_1) ≤ Φ(ψ_2). For any function ψ, both ±ψ ≤
||ψ||_∞ and, by monotonicity of Φ,

    |Φ(ψ)| ≤ ||ψ||_∞ Φ(1) = ||ψ||_∞.

Since S × S is compact and Φ is a continuous functional on (C(S × S), ||·||_∞), by the Riesz representation theorem,
there exists a unique measure λ on the Borel σ-algebra on S × S such that

    Φ(f) = ∫ f(x, y) dλ(x, y).

Since Φ|_E = r,

    ∫ ( f(x) + g(y) ) dλ(x, y) = ∫ f dP + ∫ g dQ,

which implies that λ has marginals P and Q, i.e. λ ∈ M(P, Q). This proves that

    m_d(P, Q) = sup_U Φ(φ) = sup{ ∫ f(x, y) dλ(x, y) : f(x, y) < d(x, y) } = ∫ d(x, y) dλ(x, y) ≥ W(P, Q).

The opposite inequality is easy, because for any f, g such that f(x) + g(y) < d(x, y) and any λ ∈ M(P, Q),

    ∫ f dP + ∫ g dQ = ∫ ( f(x) + g(y) ) dλ(x, y) ≤ ∫ d(x, y) dλ(x, y).    (19.0.1)

This finishes the proof and, moreover, it shows that the infimum in the definition of W is achieved on some λ. □

Remark. Notice that in the proof of this theorem we never used the fact that d is a metric. The theorem holds for any
d ∈ C(S × S) under the corresponding integrability assumptions. For example, one can consider loss functions of the
type d(x, y)^p for p > 1, which are not necessarily metrics. However, in Lemma 45, the fact that d is a metric was
essential. □
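For laws supported on finitely many points, both the infimum in the definition of W and the dual supremum over 1-Lipschitz functions are finite linear programs, so the equality of Theorem 59 can be checked numerically. The sketch below (using scipy.optimize.linprog; the support points and weights are arbitrary) solves the primal transport problem and the Lipschitz dual and prints the two equal values.

```python
import numpy as np
from scipy.optimize import linprog

# Both sides of W(P, Q) = gamma(P, Q) for laws on a common finite support in R.
x = np.array([0.0, 1.0, 2.0, 4.0])     # arbitrary support points
p = np.array([0.4, 0.1, 0.3, 0.2])     # law P
q = np.array([0.1, 0.4, 0.2, 0.3])     # law Q
n = len(x)
D = np.abs(x[:, None] - x[None, :])    # cost d(x_i, x_j)

# Primal: minimize sum_ij D_ij * pi_ij over couplings pi with marginals p and q.
A_eq = np.zeros((2 * n, n * n)); b_eq = np.concatenate([p, q])
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1         # row sums = p
    A_eq[n + i, i::n] = 1                  # column sums = q
primal = linprog(D.flatten(), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (n * n))

# Dual: maximize sum_i f_i (p_i - q_i) subject to |f_i - f_j| <= D_ij.
A_ub, b_ub = [], []
for i in range(n):
    for j in range(n):
        if i != j:
            row = np.zeros(n); row[i], row[j] = 1, -1
            A_ub.append(row); b_ub.append(D[i, j])
dual = linprog(-(p - q), A_ub=np.array(A_ub), b_ub=b_ub, bounds=[(None, None)] * n)

print("W(P, Q)    ", primal.fun)
print("gamma(P, Q)", -dual.fun)            # the two values agree (Theorem 59)
```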
Our next goal will be to show that W = γ on separable and not necessarily compact metric spaces. We start with the
following.

Lemma 46 If (S, d) is a separable metric space then W and γ are metrics on P_1(S).

Proof. Since for the bounded Lipschitz metric we have β(P, Q) ≤ γ(P, Q), γ is also a metric, because if γ(P, Q) = 0
then β(P, Q) = 0 and, therefore, P = Q. As in (19.0.5), it should be obvious that γ(P, Q) = m_d(P, Q) ≤ W(P, Q) and, if
W(P, Q) = 0, then γ(P, Q) = 0 and P = Q. The symmetry of W is obvious. It remains to show that W(P, Q) satisfies the
triangle inequality. The idea here is very simple, and let us first explain it in the case when (S, d) is complete. Consider
three laws P, Q, T on S and let M(P, Q) and M(Q, T) be such that
Z Z
d(x, y) d(x, y) W (P, Q) + and d(y, z) d(y, z) W (Q, T) + .

Let us generate a distribution on S S S with marginals P, Q and T and marginals on pairs of coordinates (x, y)
and (y, z) given by and by gluing and in the following way. We know that when (S, d) is complete and
separable, there exist regular conditional distributions (dx | y) and (dz | y) of the coordinates x and z given y. Then,
we define a distribution on S S S by first generating y from the distribution Q and, given y, generating the pair
x and z according to the conditional distributions (dx | y) and (dz | y) independently of each other, i.e. according to
the product measure on S S
(dx dz | y) = (dx | y) (dz | y).
This is called conditionally independent coupling of x and z given y. Obviously, by construction, (x, y) has the
distribution and (y, z) has the distribution . Therefore, the marginals of x and z are P and T, which means that the
pair (x, z) has the distribution M(P, T). Finally,
Z Z Z Z
W (P, T) d(x, z) d(x, z) = d(x, z) d(x, y, z) d(x, y) d + d(y, z) d
Z Z
= d(x, y) d + d(y, z) d W (P, Q) +W (Q, T) + 2.



Letting 0 proves the triangle inequality for W . In the case when (S, d) is not complete, we will apply the same
basic idea after we discretize the space without losing much in the transportation cost integral. This can be done as in
the proof of Strassens theorem, Case C. Given > 0, consider a partition (Sn )n1 of S such that diameter(Sn ) < for
all n. On each box Sn Sm let

1 ((C Sn ) Sm ) 2 (Sn (C Sm ))
nm (C) = , nm (C) =
(Sn Sm ) (Sn Sm )
be the marginal distributions of the conditional distribution of on Sn Sm . Define
0 = (Sn Sm ) nm
1 2
nm .
n,m

In this construction, locally on each small box Sn Sm , measure is replaced by the product measure with the same
marginals. Let us compute the marginals of 0 . Given a set C S,
0 (C S) = 1
(Sn Sm ) nm 2
(C) nm (S)
n,m

= ((C Sn ) Sm ) = ((C Sn ) S) = P(C Sn ) = P(C).


n,m n n

Similarly, 0 (S C) = Q(C), so 0 has the same marginals as , 0 M(P, Q). It should be obvious that transportation
cost integral does not change much by replacing with 0 . One can visualize this by looking at what happens locally
on each small box Sn Sm . Let (Xn ,Ym ) be a random pair with distribution restricted to Sn Sm so that
1
Z
Ed(Xn ,Ym ) = d(x, y)d(x, y).
(Sn Sm ) Sn Sm

Let Ym0 be an independent copy of Ym , also independent of Xn . Then the joint distribution of (Xn ,Ym0 ) is nm
1 2 and
nm
Z
Ed(Xn ,Ym0 ) = 1
d(x, y) d(nm 2
nm )(x, y).
Sn Sm

Then Z
d(x, y) d(x, y) = (Sn Sm )Ed(Xn ,Ym ),
n,m
Z
d(x, y) d 0 (x, y) = (Sn Sm )Ed(Xn ,Ym0 ).
n,m

Finally, d(Ym ,Ym0 ) diam(Sm ) and these two integrals differ by at most . Therefore,
Z
d(x, y) d 0 (x, y) W (P, Q) + 2.

Similarly, we can define


0 = (Sn Sm ) nm
1 2
nm
n,m

such that Z
d(x, y) d 0 (x, y) W (Q, T) + 2.

We will now show that this special simple form of the distributions 0 (x, y), 0 (y, z) ensures that the conditional distri-
butions of x and z given y are well defined. Let Qm be the restriction of Q to Sm ,
2
Qm (C) = Q(C Sm ) = (Sn Sm ) nm (C).
n
2 (C) = 0 for all n, which means that 2 are absolutely continuous with respect to
Obviously, if Qm (C) = 0 then nm nm
Qm and the Radon-Nikodym derivatives
2
dnm
fnm (y) =
dQm
(y) exist and (Sn Sm ) fnm (y) = 1 a.s. for y Sm .
n



Of course, we can set fnm (y) = 0 for y outside of Sm . Let us define a conditional distribution of x given y by

0 (A|y) = (Sn Sm ) fnm (y)nm


1
(A).
n,m

Notice that for any A B, 0 (A|y) is measurable in y and 0 (A|y) is a probability distribution on B, for Q-almost all
y, because
0 (S|y) = (Sn Sm ) fnm (y) = 1 a.s.
n,m

Let us check that for Borel sets A, B B,


Z
0 (A B) = 0 (A|y) dQ(y).
B

Indeed, since fnm (y) = 0 for y < Sm ,


Z Z
0 (A|y) dQ(y) = 1
(Sn Sm )nm (A) fnm (y) dQ(y)
B n,m B
Z
1
= (Sn Sm )nm (A)
B
fnm (y) dQm (y)
n,m

= 1
(Sn Sm )nm 2
(A)nm (B) = 0 (A B).
n,m

Conditional distribution 0 (|y) can be defined similarly, and we finish the proof as in the case of the complete space
above. t
u

The next lemma shows that, on a separable metric space, any law with a first moment, i.e. P ∈ P_1(S), can be approxi-
mated in the metrics W and γ by laws concentrated on finite sets.

Lemma 47 If (S, d) is separable and P ∈ P_1(S) then there exists a sequence of laws P_n such that P_n(F_n) = 1 for some
finite sets F_n and W(P_n, P), γ(P_n, P) → 0.

Proof. For each n 1, let (Sn j ) j1 be a partition of S such that diam(Sn j ) 1/n. Take a point xn j Sn j in each set Sn j
and for k 1 define a function 
xn j , if x Sn j for j k,
fnk (x) =
xn1 , if x Sn j for j > k.
We have,
1 2
Z Z Z
d(x, fnk (x)) dP(x) = d(x, fnk (x)) dP(x) P(Sn j ) + d(x, xn1 ) dP(x)
j1 Sn j n jk S\(Sn1 Snk ) n

for k large enough, because P P1 (S), i.e. d(x, xn1 ) dP(x) < , and the set S \ (Sn1 Snk ) 0.
R
/
Let n be the image on S S of the measure P under the map x ( fnk (x), x) so that n M(Pn , P) for some Pn
concentrated on the set of points {xn1 , . . . , xnk }. Finally,

2
Z Z
W (Pn , P) d(x, y) dn (x, y) = d( fnk (x), x) dP(x) .
n
Since (Pn , P) W (Pn , P), this finishes the proof. t
u

We are finally ready to extend Theorem 59 to separable metric spaces.

Theorem 60 (Kantorovich-Rubinstein theorem) If (S, d) is a separable metric space then W(P, Q) = γ(P, Q) for any
distributions P, Q ∈ P_1(S).

Proof. By the previous lemma, we can approximate P and Q by P_n and Q_n concentrated on finite (hence, compact)
sets. By Theorem 59, W(P_n, Q_n) = γ(P_n, Q_n). Finally, since both W and γ are metrics,

    W(P, Q) ≤ W(P, P_n) + W(P_n, Q_n) + W(Q_n, Q)
            = W(P, P_n) + γ(P_n, Q_n) + W(Q_n, Q)
            ≤ W(P, P_n) + W(Q_n, Q) + γ(P_n, P) + γ(Q_n, Q) + γ(P, Q).

Letting n → ∞ proves that W(P, Q) ≤ γ(P, Q). We saw above that the opposite inequality always holds. □
Wasserstein's distance W_p(P, Q) on R^n. We will now prove a version of the Kantorovich-Rubinstein theorem on R^n
in some cases when d(x, y) is not a metric. Given p ≥ 1, let us define the Wasserstein distance W_p(P, Q) on

    P_p(R^n) = { P - law on R^n : ∫ |x|^p dP(x) < ∞ }

corresponding to the cost function d(x, y) = |x − y|^p by

    W_p(P, Q)^p = inf{ ∫ |x − y|^p dλ(x, y) : λ ∈ M(P, Q) }.    (19.0.2)

Even though d(x, y) is not a metric for p > 1, W_p is still a metric on P_p(R^n), which can be shown the same way as in
Lemma 46. Namely, given nearly optimal λ ∈ M(P, Q) and ν ∈ M(Q, T), we can construct (X, Y, Z) ∈ M(P, Q, T) such
that (X, Y) ∼ λ and (Y, Z) ∼ ν and, therefore,

    W_p(P, T) ≤ (E|X − Z|^p)^{1/p} ≤ (E|X − Y|^p)^{1/p} + (E|Y − Z|^p)^{1/p} ≤ (W_p^p(P, Q) + ε)^{1/p} + (W_p^p(Q, T) + ε)^{1/p}.

Then, we let ε → 0. The following dual representation holds.

Theorem 61 For any P, Q ∈ P_p(R^n),

    W_p(P, Q)^p = sup{ ∫ f dP + ∫ g dQ : f, g ∈ C(R^n), f(x) + g(y) ≤ |x − y|^p }.    (19.0.3)

Proof We will show below that for any continuous uniformly bounded function d(x, y) on Rn Rn ,
nZ o
inf d(x, y) d(x, y) : M(P, Q)
nZ Z o
= sup f dP + g dQ : f , g C(Rn ), f (x) + g(y) d(x, y) . (19.0.4)

This will imply (19.0.3) as follows. Let us take R 1 large enough so that for K = {|x| R},
Z Z
|x| p dP 2p1 , |x| p dQ 2p1 .
Kc Kc

We can find such R, because P, Q P p (Rn ). Let d(x, y) = |x y| p (2R) p . Then for any x, y K, we have d(x, y) =
|x y| p and for any M(P, Q),
Z Z Z
p
|x y| d(x, y) d(x, y) d(x, y) + |x y| p d(x, y).
(KK)c

Let us break the second integral into two integrals over disjoint sets {|x| |y|} and {|x| < |y|}. On the first set,
|x y| p 2 p |x| p and, moreover, x can not belong to K, since in that case y must be in K c . This means that the part of
the second integral over this set in bounded by
Z Z
2 p |x| p d(x, y) = 2 p |x| p dP(x) /2.
K c Rn Kc

The second integral over the set {|x| < |y|} can be similarly bounded by /2 and, therefore,
Z Z
|x y| p d(x, y) d(x, y) d(x, y) + .



Together with (19.0.4) this implies that
nZ o
Wp (P, Q) p = inf |x y| p d(x, y) : M(P, Q)
nZ o
inf d(x, y) d(x, y) : M(P, Q) +
nZ Z o
= sup f dP + g dQ : f , g C(Rn ), f (x) + g(y) d(x, y) +
nZ Z o
sup f dP + g dQ : f , g C(Rn ), f (x) + g(y) |x y| p + .

Letting 0 proves inequality in one direction, and the opposite inequality is always true, as we have seen above,
since for any f , g C(Rn ) such that f (x) + g(y) |x y| p and any M(P, Q),
Z Z Z Z
f dP + g dQ = ( f (x) + g(y)) d(x, y) |x y| p d(x, y). (19.0.5)

Therefore, it remains to prove (19.0.4). Notice that, by adding a constant, we can assume that d ≥ 0.
Again, the inequality ≥ is obvious, so we only need to prove ≤. We will reduce this to the compact case proved in Theorem 59. Let us take R ≥ 1 large enough and let K = {|x| ≤ R}. Let us consider measures

P_K(C) = P(C ∩ K)/P(K),   Q_K(C) = Q(C ∩ K)/Q(K)

and let π_K ∈ M(P_K, Q_K) be the measure on K × K that achieves the infimum

∫ d(x, y) dπ_K(x, y) = inf{ ∫ d(x, y) dπ(x, y) : π ∈ M(P_K, Q_K) }.

If we define the measure π̃ on R^n × R^n by

π̃ = P(K)Q(K) π_K + (P ⊗ Q)|_{(K×K)^c},

it is easy to check that π̃ ∈ M(P, Q). Therefore,

inf{ ∫ d(x, y) dπ(x, y) : π ∈ M(P, Q) } ≤ ∫ d(x, y) dπ̃(x, y)
≤ ∫ d(x, y) dπ_K(x, y) + ‖d‖_∞ P ⊗ Q((K × K)^c).

If we take R large enough so that the last term is smaller than ε then, using the Kantorovich-Rubinstein theorem on compacts, we get

inf{ ∫ d(x, y) dπ(x, y) : π ∈ M(P, Q) } ≤ ∫ d(x, y) dπ_K(x, y) + ε
= inf{ ∫ d(x, y) dπ(x, y) : π ∈ M(P_K, Q_K) } + ε
= sup{ ∫ f dP_K + ∫ g dQ_K : f, g ∈ C(K), f(x) + g(y) ≤ d(x, y) } + ε
≤ ∫ f dP_K + ∫ g dQ_K + 2ε,   (19.0.6)

for some f, g ∈ C(K) such that f(x) + g(y) ≤ d(x, y). To finish the proof, we would like to estimate the above sum of integrals by

∫ φ dP + ∫ ψ dQ

for some functions φ, ψ ∈ C(R^n) such that φ(x) + ψ(y) ≤ d(x, y). This can be achieved by improving our choice of functions f and g using infimum-convolutions, as we will now explain.



However, before we do this, let us make the following useful observation. Since d ≥ 0, we can assume that the supremum in (19.0.6) is strictly positive (otherwise, there is nothing to prove) and

∫ f dP_K + ∫ g dQ_K ≥ 0.

Since we can replace f and g by f − c and g + c without changing the sum of integrals, we can assume that each integral is nonnegative. This implies that there exist points x_0, y_0 ∈ K such that f(x_0) ≥ 0, g(y_0) ≥ 0. Since f(x) + g(y) ≤ d(x, y),

f(x) ≤ d(x, y_0),   g(y) ≤ d(x_0, y)   for x, y ∈ K.   (19.0.7)

Let us now define the function φ(x) for x ∈ R^n by

φ(x) = inf_{y∈K} ( d(x, y) − g(y) ).

This is called the infimum-convolution. Clearly, since f(x) ≤ d(x, y) − g(y) for all y ∈ K, we have f(x) ≤ φ(x) for x ∈ K, which implies that

∫ f dP_K + ∫ g dQ_K ≤ ∫ φ dP_K + ∫ g dQ_K.

Notice that, by definition, φ(x) + g(y) ≤ d(x, y) for all x ∈ R^n, y ∈ K. Also, using (19.0.7) and the fact that g(y_0) ≥ 0,

inf_{y∈K} ( d(x, y) − d(x_0, y) ) ≤ φ(x) ≤ d(x, y_0) − g(y_0) ≤ d(x, y_0).   (19.0.8)

Next, we define the function ψ(y) for y ∈ R^n by

ψ(y) = inf_{x∈R^n} ( d(x, y) − φ(x) ).

Since g(y) ≤ d(x, y) − φ(x) for all x ∈ R^n, we have g(y) ≤ ψ(y) for y ∈ K, which implies that

∫ f dP_K + ∫ g dQ_K ≤ ∫ φ dP_K + ∫ g dQ_K ≤ ∫ φ dP_K + ∫ ψ dQ_K.

By definition, φ(x) + ψ(y) ≤ d(x, y) for all x, y ∈ R^n. Also, using (19.0.8) and the fact that φ(x_0) ≥ f(x_0) ≥ 0,

inf_{x∈R^n} ( d(x, y) − d(x, y_0) ) ≤ ψ(y) ≤ d(x_0, y) − φ(x_0) ≤ d(x_0, y).   (19.0.9)

The estimates in (19.0.8) and (19.0.9) imply that ‖φ‖_∞, ‖ψ‖_∞ ≤ ‖d‖_∞. Therefore, if we write

∫ φ dP = ∫_K φ dP + ∫_{K^c} φ dP = P(K) ∫_K φ dP_K + ∫_{K^c} φ dP
= ∫_K φ dP_K − P(K^c) ∫_K φ dP_K + ∫_{K^c} φ dP,

each of the last two terms can be bounded in absolute value by ‖d‖_∞ P(K^c), which shows that, for large enough R ≥ 1,

∫_K φ dP_K ≤ ∫ φ dP + ε.

The same estimate holds for ψ and we showed that

∫ f dP_K + ∫ g dQ_K ≤ ∫ φ dP + ∫ ψ dQ + 2ε.

The equation (19.0.6) implies that

inf{ ∫ d(x, y) dπ(x, y) : π ∈ M(P, Q) } ≤ ∫ φ dP + ∫ ψ dQ + 4ε

and, since φ(x) + ψ(y) ≤ d(x, y), this finishes the proof. □
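On the real line the infimum in (19.0.2) is attained by the monotone (quantile) coupling, so for two empirical measures based on samples of equal size it is computed by matching order statistics. The following Python sketch is not part of the notes; the sample distributions and sizes are arbitrary illustration choices. It approximates W_p this way and checks the triangle inequality discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)


def wasserstein_p_1d(x, y, p=2):
    """W_p between empirical measures of two equal-size samples on R.

    In dimension one the optimal coupling pairs order statistics, so
    W_p^p = (1/n) * sum_i |x_(i) - y_(i)|^p.
    """
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)


n = 20000
X = rng.normal(0.0, 1.0, n)        # sample from P
Y = rng.normal(1.0, 2.0, n)        # sample from Q
Z = rng.exponential(1.0, n)        # sample from T

for p in (1, 2, 3):
    wPQ = wasserstein_p_1d(X, Y, p)
    wQT = wasserstein_p_1d(Y, Z, p)
    wPT = wasserstein_p_1d(X, Z, p)
    # triangle inequality W_p(P,T) <= W_p(P,Q) + W_p(Q,T)
    print(f"p={p}: W(P,T)={wPT:.3f} <= W(P,Q)+W(Q,T)={wPQ + wQT:.3f}")
```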



Exercise. Let P_1(R) be the set of laws on R such that ∫ |x| dP(x) < ∞. Let us define a map Φ : P_1(R) → P_1(R) as follows. Consider a random variable ν with values in N such that Eν < ∞ and an independent random variable ξ such that E|ξ| < ∞. Given a sequence (X_i) of i.i.d. random variables with the distribution P ∈ P_1(R), independent of ν and ξ, let Φ(P) be the distribution of the sum ξ + Σ_{i=1}^{ν} X_i, where the sum Σ_{i=1}^{ν} X_i is zero if ν = 0. If Eν ∈ [0, 1), prove that Φ has a unique fixed point, Φ(P) = P. Hint: use the Banach Fixed Point Theorem for the Wasserstein metric W_1. (You need to prove that the metric space (P_1(R), W_1) is complete.)
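A minimal simulation sketch of the contraction behind this exercise, not from the notes: the particular laws of the integer-valued variable and of the offset variable, here Bernoulli(1/2) and N(0, 1), are assumptions made only for illustration. The map is iterated on an empirical sample and the W_1 distance between successive iterates, computed by sorting, should shrink roughly geometrically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000                       # sample size representing a law in P_1(R)


def apply_map(sample):
    """One application of the map: xi + sum_{i<=nu} X_i with X_i drawn from `sample`."""
    nu = rng.binomial(1, 0.5, n)          # E nu = 1/2 < 1 (assumed choice)
    xi = rng.normal(0.0, 1.0, n)          # assumed integrable offset variable
    # since nu is 0 or 1 here, the random sum is simply nu * X with X ~ P
    X = rng.choice(sample, size=n, replace=True)
    return xi + nu * X


def w1(a, b):
    # W_1 between empirical measures via sorted samples (1d optimal coupling)
    return np.mean(np.abs(np.sort(a) - np.sort(b)))


P = rng.uniform(-5.0, 5.0, n)             # arbitrary starting law
for k in range(8):
    P_next = apply_map(P)
    print(f"iteration {k}: W_1(P_k, P_(k+1)) ~ {w1(P, P_next):.4f}")
    P = P_next
```

The printed distances stabilize at the Monte Carlo noise level of order n^{-1/2} once the iterates are close to the fixed point.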

Exercise. Let P be the set of probability laws on [−1, 1] and define a map Φ : P → P as follows. Consider a random variable ν with values in N and an independent random variable ξ. Given a sequence (X_i) of i.i.d. random variables on [−1, 1] with the distribution P ∈ P, independent of ν and ξ, let Φ(P) be the distribution of cos(ξ + Σ_{i=1}^{ν} X_i), where the sum Σ_{i=1}^{ν} X_i is zero if ν = 0. Prove that Φ has a fixed point, Φ(P) = P. Hint: use the Schauder Fixed Point Theorem for P equipped with the topology of weak convergence. (Notice that now a fixed point is possibly not unique.)

Exercise. (a) Consider a countable set A and two probability measures P and Q on A. Prove that

inf{ ∫ I(x ≠ y) dπ(x, y) : π ∈ M(P, Q) } = (1/2) Σ_{a∈A} |P({a}) − Q({a})|.

Hint: use the Kantorovich-Rubinstein theorem with the metric d(x, y) = I(x ≠ y).

(b) Let (S, d) be a separable metric space and B be the Borel σ-algebra. Given two probability measures P and Q on B, construct a measure π ∈ M(P, Q) that witnesses the equality

inf{ ∫ I(x ≠ y) dπ(x, y) : π ∈ M(P, Q) } = sup_{A∈B} |P(A) − Q(A)|.

Hint: use the Hahn-Jordan measure decomposition.

Exercise. Consider a finite set A and nonnegative real-valued functions

f_1, . . . , f_n : A → R

such that for any convex combination g = λ_1 f_1 + . . . + λ_n f_n there is a point a ∈ A such that g(a) < 1. Prove that there is a probability measure P on A such that ∫ f_i dP < 1 for all i ≤ n. Hint: use the Hahn-Banach separation theorem.



Section 20

Brunn-Minkowski and Prekopa-Leindler inequalities.

In this section we will make several connections between the Kantorovich-Rubinstein theorem and other classical
objects. Let us start with the following classical inequality.

Theorem 62 (Brunn-Minkowski inequality on the real line) If λ is the Lebesgue measure and A, B are two bounded Borel sets on R then λ(A + B) ≥ λ(A) + λ(B), where A + B is the set addition, i.e. A + B = {a + b : a ∈ A, b ∈ B}.

Proof. First, suppose that A and B are open. Since the Lebesgue measure is invariant under translations, let us translate A and B so that sup A = inf B = 0. Let us check that, in this case, A ∪ B ⊂ A + B. Since A is open, for each a ∈ A there exists ε > 0 such that (a − ε, a + ε) ⊂ A. Since inf B = 0, there exists b ∈ B such that 0 ≤ b < ε/2. Then a ∈ (a + b − ε, a + b + ε) ⊂ A + B, which proves that A ⊂ A + B. One can prove similarly that B is also a subset of A + B. Since A and B are disjoint, we proved that λ(A) + λ(B) = λ(A ∪ B) ≤ λ(A + B).
Now, suppose that A and B are compact. Then, obviously, A + B is also compact. For ε > 0, let us denote by C^ε an open ε-neighborhood of the set C. Since A^ε + B^ε ⊂ (A + B)^{2ε}, using the previous case of the open sets, we get λ(A^ε) + λ(B^ε) ≤ λ(A^ε + B^ε) ≤ λ((A + B)^{2ε}). Since A is closed, A^ε ↓ A as ε ↓ 0 and, by the continuity of measure, λ(A^ε) → λ(A). The same holds for B and A + B, and we proved the inequality for two compact sets.
Finally, consider arbitrary bounded measurable sets A and B. By the regularity of measure, we can find compacts C ⊂ A and D ⊂ B such that λ(A \ C) ≤ ε and λ(B \ D) ≤ ε. Using the previous case of the compact sets, we can write λ(A) + λ(B) − 2ε ≤ λ(C) + λ(D) ≤ λ(C + D) ≤ λ(A + B), and letting ε → 0 finishes the proof. □
Using this, we will prove another classical inequality.

Theorem 63 (Prekopa-Leindler inequality) Consider nonnegative integrable functions w, u, v : R^n → [0, ∞) such that for some α ∈ [0, 1],

w(αx + (1 − α)y) ≥ u(x)^α v(y)^{1−α}   for all x, y ∈ R^n.

Then,

∫ w dx ≥ ( ∫ u dx )^α ( ∫ v dx )^{1−α}.

Using the Prekopa-Leindler inequality one can prove that the Lebesgue measure λ on R^n satisfies the Brunn-Minkowski inequality

λ(A)^{1/n} + λ(B)^{1/n} ≤ λ(A + B)^{1/n}.   (20.0.1)

We will leave this as an exercise.
Proof. The proof will proceed by induction on n. Let us first show the induction step. Suppose the statement holds for n and we would like to show it for n + 1. By the assumption, for any x, y ∈ R^n and a, b ∈ R,

w(αx + (1 − α)y, αa + (1 − α)b) ≥ u(x, a)^α v(y, b)^{1−α}.



Let us fix a and b and consider the functions w_1(x) = w(x, αa + (1 − α)b), u_1(x) = u(x, a) and v_1(x) = v(x, b) on R^n, which satisfy w_1(αx + (1 − α)y) ≥ u_1(x)^α v_1(y)^{1−α}. By the induction assumption,

∫_{R^n} w_1 dx ≥ ( ∫_{R^n} u_1 dx )^α ( ∫_{R^n} v_1 dx )^{1−α}.

These integrals still depend on a and b and we can define

w_2(αa + (1 − α)b) = ∫_{R^n} w_1 dx = ∫_{R^n} w(x, αa + (1 − α)b) dx

and, similarly,

u_2(a) = ∫_{R^n} u(x, a) dx   and   v_2(b) = ∫_{R^n} v(x, b) dx.

Then the above inequality can be rewritten as

w_2(αa + (1 − α)b) ≥ u_2(a)^α v_2(b)^{1−α}.

These functions are defined on R and, by the case n = 1 proved below,

∫_R w_2 ds ≥ ( ∫_R u_2 ds )^α ( ∫_R v_2 ds )^{1−α},   that is,   ∫_{R^{n+1}} w dz ≥ ( ∫_{R^{n+1}} u dz )^α ( ∫_{R^{n+1}} v dz )^{1−α},

which finishes the proof of the induction step. It remains to prove the case n = 1. We can assume that u, v, w : R → [0, 1], because both inequalities in the statement of the theorem are homogeneous with respect to truncation and scaling. Also, we can assume that u and v are not identically zero, since there is nothing to prove in that case, and we can scale them by their ‖·‖_∞ norms and assume that ‖u‖_∞ = ‖v‖_∞ = 1. We have the following set inclusion,

{w ≥ a} ⊃ α{u ≥ a} + (1 − α){v ≥ a},

because if u(x) ≥ a and v(y) ≥ a then, by assumption,

w(αx + (1 − α)y) ≥ u(x)^α v(y)^{1−α} ≥ a^α a^{1−α} = a.

When a ∈ (0, 1), the sets {u ≥ a} and {v ≥ a} are not empty and the Brunn-Minkowski inequality implies that

λ(w ≥ a) ≥ α λ(u ≥ a) + (1 − α) λ(v ≥ a).

Finally,

∫_R w(z) dz = ∫_R ∫_0^1 I(a ≤ w(z)) da dz = ∫_0^1 λ(w ≥ a) da
≥ α ∫_0^1 λ(u ≥ a) da + (1 − α) ∫_0^1 λ(v ≥ a) da
= α ∫_R u(z) dz + (1 − α) ∫_R v(z) dz ≥ ( ∫_R u(z) dz )^α ( ∫_R v(z) dz )^{1−α},

where in the last step we used the arithmetic-geometric mean inequality. This finishes the proof. □

Remark. Another common proof of the last step n = 1 uses transportation of measure, as follows. We can assume that ∫ u = ∫ v = 1 by rescaling

u → u / ∫ u,   v → v / ∫ v,   w → w / ( ( ∫ u )^α ( ∫ v )^{1−α} ).

Then we need to show that ∫ w ≥ 1. Consider the cumulative distribution functions

F(x) = ∫_{−∞}^x u(y) dy   and   G(x) = ∫_{−∞}^x v(y) dy,

and let x(t) and y(t) be their quantile transforms. Then, F(x(t)) = t and G(y(t)) = t for 0 ≤ t ≤ 1. By the Lebesgue differentiation theorem, F′(x) = u(x) for all x ∉ N for some set N of Lebesgue measure zero. Since x(t) is (strictly) monotone, the derivative x′(t) exists for all t ∉ N′ for some set N′ of Lebesgue measure zero. Also, notice that

x(t) ∈ N ⟹ t ∈ F(N),

and, since F is absolutely continuous, F(N) also has Lebesgue measure zero (Lusin N property). Therefore, for t ∉ F(N) ∪ N′, the derivative x′(t) exists and F′(x(t)) = u(x(t)). Using the chain rule in the equation F(x(t)) = t, for all such t, u(x(t)) x′(t) = 1. Similarly, outside of some set of measure zero, v(y(t)) y′(t) = 1. Now, consider the function z(t) = αx(t) + (1 − α)y(t). This function is strictly increasing and differentiable almost everywhere. Therefore,

∫_{−∞}^{+∞} w(z) dz ≥ ∫_0^1 w(z(t)) z′(t) dt = ∫_0^1 w( αx(t) + (1 − α)y(t) ) z′(t) dt.

By the arithmetic-geometric mean inequality,

z′(t) = α x′(t) + (1 − α) y′(t) ≥ x′(t)^α y′(t)^{1−α}

and, by assumption,

w(αx(t) + (1 − α)y(t)) ≥ u(x(t))^α v(y(t))^{1−α}.

Therefore,

∫ w(z) dz ≥ ∫_0^1 ( u(x(t)) x′(t) )^α ( v(y(t)) y′(t) )^{1−α} dt = ∫_0^1 1 dt = 1.

This finishes the proof. □

Entropy and the Kullback-Leibler divergence. Consider a probability measure P on R^n and a nonnegative measurable function u : R^n → [0, ∞). We define the entropy of u with respect to P by

Ent_P(u) = ∫ u log u dP − ∫ u dP log ∫ u dP.

Notice that Ent_P(u) ≥ 0, by Jensen's inequality, since u log u is a convex function of u. Entropy has the following variational representation.

Lemma 48 The entropy can be written as

Ent_P(u) = sup{ ∫ uv dP : ∫ e^v dP ≤ 1 }.   (20.0.2)

Proof. Take any measurable function v such that ∫ e^v dP ≤ 1. Then, for any λ ≥ 0,

∫ uv dP ≤ ∫ uv dP + λ( 1 − ∫ e^v dP ) = λ + ∫ (uv − λe^v) dP.

The integrand uv − λe^v is concave in v (since λ ≥ 0) and can be maximized point-wise by taking v such that u = λe^v, which implies that

∫ uv dP ≤ λ + ∫ u log(u/λ) dP − ∫ u dP.

This bound holds for all v such that ∫ e^v dP ≤ 1 and, therefore,

sup{ ∫ uv dP : ∫ e^v dP ≤ 1 } ≤ λ + ∫ u log(u/λ) dP − ∫ u dP.

The right hand side is convex in λ for λ ≥ 0 (since u is nonnegative) and the minimum is achieved at λ = ∫ u dP. With this choice of λ the right hand side is equal to Ent_P(u), and we proved that

sup{ ∫ uv dP : ∫ e^v dP ≤ 1 } ≤ Ent_P(u).

On the other hand, it is easy to see that this upper bound is achieved on the function v = log( u / ∫ u dP ), which satisfies ∫ e^v dP = 1. □
Suppose now that a law Q is absolutely continuous with respect to P and denote its Radon-Nikodym derivative by

u = dQ/dP.

Then the Kullback-Leibler divergence of Q from P is defined by

D(Q‖P) := ∫ log u dQ = ∫ log (dQ/dP) dQ.   (20.0.3)

Notice that this quantity is not symmetric in P and Q, so the order is important. Clearly, D(Q‖P) = Ent_P(u), since

Ent_P(u) = ∫ (dQ/dP) log (dQ/dP) dP − ∫ (dQ/dP) dP log ∫ (dQ/dP) dP = ∫ log (dQ/dP) dQ.

The variational characterization (20.0.2) implies that

if ∫ e^v dP ≤ 1 then ∫ v dQ = ∫ uv dP ≤ D(Q‖P).   (20.0.4)
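On a finite sample space both Ent_P(u) and D(Q‖P) are finite sums, so the identity D(Q‖P) = Ent_P(dQ/dP) and the bound (20.0.4) can be verified directly. The following small sketch is not part of the notes; the distributions are generated at random purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 6
P = rng.random(k); P /= P.sum()           # reference law P on {0,...,k-1}
Q = rng.random(k); Q /= Q.sum()           # Q, absolutely continuous w.r.t. P

u = Q / P                                 # Radon-Nikodym derivative dQ/dP

# Ent_P(u) = int u log u dP - (int u dP) log(int u dP), here int u dP = 1
ent_P_u = np.sum(u * np.log(u) * P) - np.sum(u * P) * np.log(np.sum(u * P))
kl_QP = np.sum(Q * np.log(Q / P))
print("Ent_P(u) =", ent_P_u, "  D(Q||P) =", kl_QP)   # the two agree

# check (20.0.4): if int e^v dP <= 1 then int v dQ <= D(Q||P)
for _ in range(5):
    v = rng.normal(size=k)
    v -= np.log(np.sum(np.exp(v) * P))    # normalize so that int e^v dP = 1
    assert np.sum(v * Q) <= kl_QP + 1e-12
print("variational bound (20.0.4) holds for the sampled v's")
```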

Transportation inequality for Gaussian measures. Let us consider a non-degenerate normal distribution N(0, C) with the covariance matrix C such that det C ≠ 0. We know that this distribution has density e^{−V(x)}, where

V(x) = (1/2)(C^{−1}x, x) + const.

If we denote A = C^{−1}/2 then, for any t ∈ [0, 1],

tV(x) + (1 − t)V(y) − V(tx + (1 − t)y) = t(Ax, x) + (1 − t)(Ay, y) − ( A(tx + (1 − t)y), tx + (1 − t)y )
= t(1 − t)( A(x − y), x − y ) ≥ ( t(1 − t) / (2λ_max(C)) ) |x − y|²,   (20.0.5)

where λ_max(C) is the largest eigenvalue of C. We will use this to prove the following useful inequality for the Wasserstein distance W_2 defined at the end of the previous section.

Theorem 64 If P = N(0, C) and Q is absolutely continuous with respect to P then

W_2(Q, P)² ≤ 2λ_max(C) D(Q‖P).   (20.0.6)

Proof. Let us denote C2 = 1/(2max (C)) and, using the Kantorovich-Rubinstein theorem (see (19.0.3)), write
1 n
Z o
W2 (Q, P)2 = inf C2 |x y|2 d(x, y) : M(P, Q)
C2
1 nZ Z o
= sup f dP + g dQ : f (x) + g(y) C2 |x y|2 .
C2

Consider functions f , g C(Rn ) such that f (x) + g(y) C2 |x y|2 . Then, by (20.0.5), for any t (0, 1),
1  
f (x) + g(y) tV (x) + (1 t)V (y) V (tx + (1 t)y)
t(1 t)
and
t(1 t) f (x) tV (x) + t(1 t)g(y) (1 t)V (y) V (tx + (1 t)y).
If we introduce the function

u(x) = e(1t) f (x)V (x) , v(y) = etg(y)V (y) and w(z) = eV (z)



then w(tx + (1 t)y) u(x)t v(y)1t and, by the Prekopa-Leindler inequality,
Z t Z 1t Z
e(1t) f (x)V (x) dx etg(x)V (x) dx eV (x) dx.

Since eV is the density of P, we showed that


Z t Z 1t Z  1 Z 1
1t t
e(1t) f dP etg dP 1 and e(1t) f dP etg dP 1.

It is a simple calculus exercise to show that


Z 1
s
R
lim es f dP = e f dP ,
s0

and, therefore, letting t 1 proves that Z R


eg dP e f dP
1.

This inequality can be written as ev dP 1 for v = g + f dP and (20.0.4) implies that


R R

Z Z Z
v dQ = f dP + g dQ D(Q||P).

Together with the Kantorovich-Rubinstein representation above this implies that


1
W2 (Q, P)2 D(Q||P) = 2max (C) D(Q||P),
C2
which finishes the proof. t
u
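For one-dimensional Gaussians both sides of (20.0.6) have closed forms: if P = N(0, σ²) and Q = N(m, s²), then W_2(Q, P)² = m² + (s − σ)² and D(Q‖P) = log(σ/s) + (s² + m²)/(2σ²) − 1/2, while λ_max(C) = σ². The following check is only an illustration added here, not part of the notes; the parameter grid is arbitrary.

```python
import numpy as np

sigma = 1.3                # P = N(0, sigma^2), so lambda_max(C) = sigma^2


def w2_sq(m, s):
    # squared W_2 distance between N(m, s^2) and N(0, sigma^2) (1d closed form)
    return m ** 2 + (s - sigma) ** 2


def kl(m, s):
    # D(Q || P) for Q = N(m, s^2), P = N(0, sigma^2)
    return np.log(sigma / s) + (s ** 2 + m ** 2) / (2 * sigma ** 2) - 0.5


for m in (0.0, 0.5, 2.0):
    for s in (0.5, 1.3, 3.0):
        lhs = w2_sq(m, s)
        rhs = 2 * sigma ** 2 * kl(m, s)
        print(f"m={m}, s={s}: W2^2={lhs:.3f} <= 2*lmax*D={rhs:.3f}", lhs <= rhs + 1e-12)
```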
Concentration of Gaussian measure. Given a measurable set A Rn with P(A) > 0, define the distribution PA by

P(C A)
PA (C) = .
P(A)
Then, obviously, the Radon-Nikodym derivative
dPA 1
= IA
dP P(A)
and the Kullback-Leibler divergence
1 1
Z
D(PA ||P) = log dPA = log .
A P(A) P(A)
Since W2 is a metric, for any two Borel sets A and B,
p  1 1 
W2 (PA , PB ) W2 (PA , P) +W2 (PB , P) 2max (C) log1/2 + log1/2 ,
P(A) P(B)

using (20.0.6). Suppose that the sets A and B are apart from each other by a distance t, i.e. d(A, B) t > 0. Then any
two points in the support of measures PA and PB are at a distance at least t from each other and the transportation
distance W2 (PA , PB ) t. Therefore,
p  1 1  p 1
t W2 (PA , PB ) 2max (C) log1/2 + log1/2 2max (C) log1/2 .
P(A) P(B) P(A)P(B)
Therefore,
1  t2 
P(B) exp .
P(A) 4max (C)



In particular, if B = {x : d(x, A) t} then

 1  t2 
P d(x, A) t exp .
P(A) 4max (C)

If the set A is not too small, e.g. P(A) 1/2, this implies that

  t2 
P d(x, A) t 2 exp .
4max (C)

This shows that the Gaussian measure is exponentially concentrated near any large enough set. The constant 1/4 in
the exponent is not optimal and can be replaced by 1/2; this is just an example of application of the above ideas. The
optimal result is the famous Gaussian isoperimetry,

if P(A) = P(B) for some half-space B then P(At ) P(Bt ).

Gaussian concentration via infimum-convolution. If we denote c = 1/max (C) then setting t = 1/2 in (20.0.5),
x+y c
V (x) +V (y) 2V |x y|2 .
2 4
Given a function f on Rn , let us define its infimum-convolution by
 c 
g(y) = inf f (x) + |x y|2 .
x 4
Then, for all x and y, x+y
c
g(y) f (x) |x y|2 V (x) +V (y) 2V . (20.0.7)
4 2
If we define
u(x) = e f (x)V (x) , v(y) = eg(y)V (y) , w(z) = eV (z)
then (20.0.7) implies that x+y
w u(x)1/2 v(y)1/2 .
2
The Prekopa-Leindler inequality with = 1/2 implies that
Z Z
eg dP e f dP 1. (20.0.8)

Given a measurable set A, let f be equal to 0 on A and + on the complement of A. Then


c
g(y) = d(x, A)2
4
and (20.0.8) implies
c 1
Z
exp d(x, A)2 dP(x) .
4 P(A)
By Chebyshevs inequality,

 1  ct 2  1  t2 
P d(x, A) t exp = exp ,
P(A) 4 P(A) 4max (C)

which is the same Gaussian concentration inequality we proved above. t


u

Discrete metric and total variation. The total variation distance between probability measures P and Q on a mea-
surable space (S, B) is defined by
TV(P, Q) = sup |P(A) Q(A)|.
AB



Using the Hahn-Jordan decomposition, we can represent a signed measure = P Q as = + such that, for
some set D B and for any set E B,

+ (E) = (ED) 0 and (E) = (EDc ) 0.

Therefore, for any A B,

P(A) Q(A) = + (A) (A) = + (AD) (ADc ),

which makes it obvious that


sup |P(A) Q(A)| = + (D).
AB

Let us describe some connections of the total variation distance to the Kullback-Leibler divergence and the Kantorovich-
Rubinstein theorem. Let us start with the following simple observation.
Lemma 49 If f is a measurable function on S such that |f| ≤ 1 and ∫ f dP = 0 then for any λ ∈ R,

∫ e^{λ f} dP ≤ e^{λ²/2}.

Proof. Since (1 + f)/2, (1 − f)/2 ∈ [0, 1] and

λ f = ((1 + f)/2) λ + ((1 − f)/2) (−λ),

by the convexity of e^x, we get

e^{λ f} ≤ ((1 + f)/2) e^{λ} + ((1 − f)/2) e^{−λ} = ch(λ) + f sh(λ).

Therefore,

∫ e^{λ f} dP ≤ ch(λ) ≤ e^{λ²/2},

where the last inequality is easy to see by Taylor's expansion. □

Let us now consider a discrete metric on S given by

d(x, y) = I(x ≠ y).   (20.0.9)

Then a 1-Lipschitz function f with respect to the metric d, k f kL 1, is defined by the condition that for all x, y S,

|f(x) − f(y)| ≤ 1.   (20.0.10)

Formally, the Kantorovich-Rubinstein theorem in this case would state that


nZ o
W (P, Q) := inf I(x , y) d(x, y) : M(P, Q)
n Z Z o
= sup f dQ f dP : k f kL 1 =: (P, Q).

However, since any uncountable set S is not separable w.r.t. the discrete metric d, we can not apply the Kantorovich-
Rubinstein theorem directly. In this case, one can use the Hahn-Jordan decomposition to show that W coincides with
the total variation distance, W (P, Q) = TV(P, Q) and it is easy to construct a measure M(P, Q) explicitly that
witnesses the above equality. This was one of the exercises at the end of the previous section. One can also easily
check directly that γ(P, Q) = TV(P, Q). Thus, for the discrete metric d,

W(P, Q) = TV(P, Q) = γ(P, Q).

We have the following analogue of the Kullback-Leibler divergence bound in Theorem 64, which now holds for any
measure P, not only Gaussian.



Theorem 65 If Q is absolutely continuous with respect to P then

TV(P, Q) ≤ √(2 D(Q‖P)).

Proof. Take f such that (20.0.10) holds. If we define g(x) = f(x) − ∫ f dP then, clearly, |g| ≤ 1 and ∫ g dP = 0. The above lemma implies that for any λ ∈ R,

∫ e^{λ f − λ ∫ f dP − λ²/2} dP ≤ 1.

The variational characterization of entropy (20.0.4) implies that

λ ∫ f dQ − λ ∫ f dP − λ²/2 ≤ D(Q‖P)

and for λ > 0 we get

∫ f dQ − ∫ f dP ≤ λ/2 + (1/λ) D(Q‖P).

Minimizing the right hand side over λ > 0, we get

∫ f dQ − ∫ f dP ≤ √(2 D(Q‖P)).

Applying this to f and −f yields the result. □
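On a finite set the total variation distance equals half the ℓ_1 distance between the probability vectors, so the inequality of Theorem 65 is easy to probe numerically. A sketch, not from the notes, with randomly generated P and Q:

```python
import numpy as np

rng = np.random.default_rng(3)

for _ in range(5):
    k = rng.integers(2, 10)
    P = rng.random(k); P /= P.sum()
    Q = rng.random(k); Q /= Q.sum()
    tv = 0.5 * np.sum(np.abs(P - Q))          # TV(P, Q) on a finite set
    kl = np.sum(Q * np.log(Q / P))            # D(Q || P)
    print(f"TV={tv:.4f} <= sqrt(2 D)={np.sqrt(2 * kl):.4f}", tv <= np.sqrt(2 * kl) + 1e-12)
```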

Exercise. Prove the Brunn-Minkowski inequality (20.0.1) on R^n using the Prekopa-Leindler inequality. Hint: Apply the Prekopa-Leindler inequality to the sets A/α and B/(1 − α) and optimize over α ∈ (0, 1).

Exercise. Using the Prekopa-Leindler inequality prove that if γ_N is the standard Gaussian measure on R^N, A and B are Borel sets and α ∈ [0, 1] then

log γ_N(αA + (1 − α)B) ≥ α log γ_N(A) + (1 − α) log γ_N(B).

Exercise. (Anderson's inequality) If C is convex and symmetric around 0, i.e. C = −C, then for any z ∈ R^N,

γ_N(C) ≥ γ_N(C + z).

Hint: use the previous problem.

Exercise. If C is a convex and symmetric set, X and Y are Gaussian on R^N and

Cov(X) ≤ Cov(Y)

(i.e. their difference is nonnegative definite) then P(X ∈ C) ≥ P(Y ∈ C). Hint: use the previous problem.

Exercise. Suppose that X = (X_1, . . . , X_N) is a vector of i.i.d. standard normal random variables. Suppose that f : R^N → R is a Lipschitz function with ‖f‖_L = L. If M is a median of f(X), prove that

P( f(X) ≥ M + Lt ) ≤ 2e^{−t²/4}.

Hint: use the Gaussian concentration inequality. Median means that P(f(X) ≥ M) ≥ 1/2 and P(f(X) ≤ M) ≥ 1/2.



Section 21

Stochastic Processes. Brownian Motion.

We have developed a general theory of convergence of laws on (separable) metric spaces and in the following two
sections we will look at some specific examples of convergence on the spaces of continuous functions (C[0, 1], k k )
and (C(R+ ), d), where d is a metric metrizing uniform convergence on compacts. These examples will describe a
certain central limit theorem type results on these spaces and in this section we will define the corresponding limiting
Gaussian laws, namely, the Brownian motion and Brownian bridge. We will start with basic definitions and basic
regularity results in the presence of continuity. Given a set T and a probability space (Ω, F, P), a stochastic process is a function

X_t(ω) = X(t, ω) : T × Ω → R

such that for each t ∈ T, X_t : Ω → R is a random variable, i.e. a measurable function. In other words, a stochastic process is a collection of random variables X_t indexed by a set T. A stochastic process is often defined by specifying finite dimensional (f.d.) distributions P_F = L({X_t}_{t∈F}) for all finite subsets F ⊂ T. Kolmogorov's theorem then guarantees the existence of a probability space on which the process is defined, under the natural consistency condition

F_1 ⊂ F_2 ⟹ P_{F_1} = P_{F_2}|_{F_1}.


One can also think of a process as a function on with values in RT = { f : T R}, because for a fixed , Xt ()
RT is a (random) function of t. In Kolmogorovs theorem, given a family of consistent f.d. distributions, a process was
defined on the probability space (RT , BT ), where BT is the cylindrical -algebra generated by the algebra of cylinders
B RT \F for Borel sets B in RF and all finite F. When T is uncountable, some very natural sets such as
n o [
: sup Xt > 1 = {Xt > 1}
tT tT

might not be measurable on BT . However, in our examples we will deal with continuous processes that possess
additional regularity properties. If (T, d) is a metric space then a process Xt is called sample continuous if for all ,
Xt () C(T, d) - the space of continuous function on (T, d). The process Xt is called continuous in probability if
Xt Xt0 in probability whenever t t0 . Let us note that sample continuity is not a property that follows automatically
if we know all finite dimensional distributions of the process.
Example. Let T = [0, 1], (, P) = ([0, 1], ) where is the Lebesgue measure. Let Xt () = I(t = ) and Xt0 () = 0.
Finite dimensional distributions of these processes are the same, because for any fixed t [0, 1],

P(Xt = 0) = P(Xt0 = 0) = 1.

However, P(Xt is continuous) = 0, but for Xt0 this probability is 1. t


u
Let (T, d) be a metric space. The process Xt is measurable if

Xt () : T R

is jointly measurable on the product space (T, B) (, F ), where B is the Borel -algebra on T.



Lemma 50 If (T, d) is a separable metric space and Xt is sample continuous then Xt is measurable.

Proof. Let (S j ) j1 be a measurable partition of T such that diam(S j ) n1 . For each non-empty S j , let us take a point
t j S j and define
Xtn () = Xt j () for t S j .
Xtn () is, obviously, measurable on T , because for any Borel set A on R,
[
(,t) : Xtn () A =
 
S j : Xt j () A .
j1

Since Xt (w) is sample continuous, Xtn () Xt () as n for all (,t). Hence, Xt is also measurable. t
u

If Xt is a sample continuous process indexed by T then we can think of Xt as an element of the metric space of
continuous functions (C(T, d), || || ), rather then simply an element of RT . We can define measurable events on this
space in two different ways. On the one hand, we have the natural Borel -algebra B on C(T, d) generated by the
open (or closed) balls 
Bg () = f C(T, d) : || f g|| < .
On the other hand, if we think of C(T ) as a subspace of RT , we can consider a -algebra

ST = B C(T, d) : B BT ,


which is the intersection of the cylindrical -algebra BT with C(T, d). It turns out that these two definitions coincide
if (T, d) is separable. An important implication of this is that the law of any random element with values in (C(T, d), ||
|| ) is completely determined by its finite dimensional distributions.

Lemma 51 If (T, d) is a separable metric space then B = ST .

Proof. Let us first show that ST B. Any element of the cylindrical algebra that generates the cylindrical -algebra
BT is given by
B RT \F for a finite F T and for some Borel set B RF .
Then
B RT \F
\  
C(T, d) = x C(T, d) : (xt )tF B = F (x) B

where F : C(T, d) RF is the finite dimensional projection such that F (x) = (xt )tF . Projection F is, obviously,
continuous in the || || norm and, therefore, measurable on the Borel -algebra B generated by the open sets in
the k k norm. This implies that F (x) B B and, thus, ST B. Let us now show that B ST . Let T 0 be a
countable dense subset of T . Then, by continuity, any closed -ball in C(T, d) can be written as
 \
f C(T, d) : || f g|| = f C(T, d) : | f (t) g(t)| ST .
tT 0

This finishes the proof. t


u

In the remainder of the section we will define two specific sample continuous stochastic processes.

Brownian motion. Brownian motion is defined as a sample continuous process Xt on T = R+ such that

(a) the distribution of X_t is a centered Gaussian for each t ≥ 0;
(b) X_0 = 0 and EX_1² = 1;
(c) if t < s then L(X_s − X_t) = L(X_{s−t});
(d) for any t_1 < . . . < t_n, the increments X_{t_1} − X_0, . . . , X_{t_n} − X_{t_{n−1}} are independent.



If we denote σ²(t) = Var(X_t) then these properties imply

σ²(nt) = n σ²(t),   σ²(t/m) = σ²(t)/m   and   σ²(qt) = q σ²(t)

for all rational q. Since σ²(1) = 1, σ²(q) = q for all rational q and, by the sample continuity, σ²(t) = t for all t ≥ 0. For any t_1 < . . . < t_n, the increments X_{t_{j+1}} − X_{t_j} are independent Gaussian N(0, t_{j+1} − t_j) and, therefore, jointly Gaussian. This implies that the vector (X_{t_1}, . . . , X_{t_n}) also has a jointly Gaussian distribution, since it can be written as a linear map of the increments. The covariance matrix can be easily computed, since for s < t,

EX_s X_t = EX_s( X_s + (X_t − X_s) ) = s = min(t, s).

As a result, we can give an equivalent definition: Brownian motion is a sample continuous centered Gaussian process Xt
for t [0, ) with the covariance Cov(Xt , Xs ) = min(t, s). A process is Gaussian if all its finite dimensional distributions
are Gaussian.
Without the requirement of sample continuity, the existence of such process follows from Kolmogorovs theorem,
since all finite dimensional distributions are consistent by construction. However, we still need to prove that there exists
a sample continuous version with these prescribed finite dimensional distributions. We start with a simple estimate.

Lemma 52 The tail probability of the standard normal distribution satisfies

Φ̄(c) = (1/√(2π)) ∫_c^∞ e^{−x²/2} dx ≤ e^{−c²/2}

for all c ≥ 0.

Proof. We can write

Φ̄(c) = (1/√(2π)) ∫_c^∞ e^{−x²/2} dx ≤ (1/√(2π)) ∫_c^∞ (x/c) e^{−x²/2} dx = (1/(c√(2π))) e^{−c²/2}.

If c > 1/√(2π) then Φ̄(c) ≤ exp(−c²/2). If c ≤ 1/√(2π) then a simpler estimate gives the result,

Φ̄(c) ≤ Φ̄(0) = 1/2 ≤ exp( −(1/2)(1/√(2π))² ) ≤ e^{−c²/2},

which finishes the proof. □
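The bound of Lemma 52 is crude but sufficient for the Borel-Cantelli argument used below; a quick numerical comparison (a sketch using SciPy's normal survival function, not part of the notes) shows how conservative it is.

```python
import numpy as np
from scipy.stats import norm

for c in (0.0, 0.5, 1.0, 2.0, 4.0):
    tail = norm.sf(c)                  # \bar{Phi}(c) = P(N(0,1) >= c)
    bound = np.exp(-c ** 2 / 2)        # the bound of Lemma 52
    print(f"c={c}: tail={tail:.3e} <= exp(-c^2/2)={bound:.3e}", tail <= bound)
```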

Theorem 66 (Existence of Brownian motion) There exists a sample continuous Gaussian process with the covariance
Cov(Xt , Xs ) = min(t, s).

Proof. It is enough to construct Xt on the interval [0, 1]. Given a process Xt that has the finite dimensional distributions
of the Brownian process, but is not necessarily continuous, let us define for n 1,

Vk = X k+1
n
X kn for k = 0, . . . , 2n 1.
2 2

The variance Var(Vk ) = 1/2n and, by the above lemma,


 1  1  2n1 
P max |Vk | 2 2n P |V1 | 2 2n+1 exp 4 .
k n n n
The right hand side is summable over n 1 and, by the Borel-Cantelli lemma,
n 1o 
P max |Vk | 2 i.o. = 0. (21.0.1)
k n
t t
Given t [0, 1] and its dyadic expansion t = j=1 2jj for t j {0, 1}, let us define t(n) = nj=1 2jj so that

Xt(n) Xt(n1) {0} {Vk : k = 0, . . . , 2n 1}.



Then, the sequence
Xt(n) = 0 + (Xt( j) Xt( j1) )
1 jn

converges almost surely to some limit Zt , because, by (21.0.1), with probability one,

|Xt(n) Xt(n1) | n2

for large enough (random) n n0 (). By construction, Zt = Xt on the dense subset of all dyadic t [0, 1]. If we
can prove that Zt is sample continuous then all f.d. distributions of Zt and Xt will coincide, which means that Zt is a
continuous version of the Brownian motion. Take any t, s [0, 1] such that |t s| 2n . If t(n) = 2kn and s(n) = 2mn ,
then |k m| {0, 1}. As a result, |Xt(n) Xs(n) | is either equal to 0 or one of the increments |Vk | and, by (21.0.1),
|Xt(n) Xs(n) | n2 for large enough n. Finally,

|Zt Zs | |Zt Xt(n) | + |Xt(n) Xs(n) | + |Xs(n) Zs |


1 1 1 c
2 + 2 + 2 ,
ln l n ln l n

which proves the continuity of Zt . On the event in (21.0.1) of probability zero we set Zt = 0. t
u

Definition. A sample continuous centered Gaussian process B_t for t ∈ [0, 1] is called a Brownian bridge if

EB_t B_s = s(1 − t)   for s < t.

Such a process exists because if X_t is a Brownian motion then B_t = X_t − tX_1 is a Brownian bridge, since for s < t,

EB_s B_t = E(X_t − tX_1)(X_s − sX_1) = s − st − ts + st = s(1 − t).

Notice that B_0 = B_1 = 0. □
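A standard way to simulate these processes on a grid, given here only as an illustrative sketch (the grid size and the number of paths are arbitrary choices): generate independent Gaussian increments for the Brownian motion and set B_t = W_t − tW_1 for the bridge.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000                                   # number of grid points on [0, 1]
t = np.linspace(0.0, 1.0, n + 1)

# Brownian motion: cumulative sum of independent N(0, dt) increments
dW = rng.normal(0.0, np.sqrt(1.0 / n), n)
W = np.concatenate(([0.0], np.cumsum(dW)))

# Brownian bridge: B_t = W_t - t * W_1, so B_0 = B_1 = 0
B = W - t * W[-1]

print("W_1 =", W[-1], "  B_0 =", B[0], "  B_1 =", B[-1])

# sanity check on the covariance: Var(W_{1/2}) should be close to 1/2 over many paths
many = rng.normal(0.0, np.sqrt(1.0 / n), (5000, n)).cumsum(axis=1)
print("simulated Var(W_{1/2}):", many[:, n // 2 - 1].var())
```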

Exercise. (Ciesielskis construction of Brownian motion) Let ( fk2n ) be the Haar basis on (L2 [0, 1], dx), i.e. f0 = 0 and
for n 1 and k In = {k - odd, 1 k 2n 1},

, (k 1)2n x < k2n


 (n1)/2
2
fk2n (x) =
2 (n1)/2 , k2n x < (k + 1)2n

and 0 otherwise. If (gk2n ) are i.i.d. standard Gaussian random variables, prove that
Z t
Wt = g0t + gk2n fk2n (x) dx
n1 kIn 0

is a Brownian motion on [0, 1]. Hint: To show that EWt Ws = min(t, s), use Parsevals identity. To prove continuity,
show that Z t
fk2n (x) dx = 2(n+1)/2 max gk2n

en = gk2n

kIn 0 kIn

satisfies P en 2 2n ln 2n 8n and use the Borel-Cantelli lemma.


Exercise. If W_t is a Brownian motion, using Doob's inequality prove that P(max_{0≤t≤T} W_t ≥ x) ≤ e^{−x²/(2T)}.



Section 22

Donsker Invariance Principle.

In this section we will show how Brownian motion Wt arises in a classical central limit theorem on the space of
continuous functions on R+ . When working with continuous processes defined on R+ , such as the Brownian motion,
the metric ‖·‖_∞ on C(R_+) is too strong. A more appropriate metric d can be defined by

d(f, g) = Σ_{n≥1} 2^{−n} d_n(f, g)/(1 + d_n(f, g)),   where   d_n(f, g) = sup_{0≤t≤n} |f(t) − g(t)|.

It is obvious that d( f j , f ) 0 if and only if dn ( f j , f ) 0 for all n 1, i.e. d metrizes uniform convergence on
compacts. (C(R+ ), d) is also a complete separable space, since any sequence is Cauchy in d if and only if it is Cauchy
for each dn . When proving uniform tightness of laws on (C(R+ ), d), we will need a characterization of compacts via
the Arzela-Ascoli theorem, which in this case can be formulated as follows. For a subset K C(R+ ), let us define by

Kn = f [0,n] : f K

the restriction of K to the interval [0, n]. Then, it is clear that K is compact with respect to d if and only if each Kn is
compact with respect to dn . Therefore, using the Arzela-Ascoli theorem to characterize compacts in C[0, n] we get the
following. For a function x C[0, T ], let us denote its modulus of continuity by
n o
mT (x, ) = sup |xa xb | : |a b| , a, b [0, T ] .

Theorem 67 (Arzela-Ascoli) A set K is compact in (C(R+ ), d) if and only if K is closed, uniformly bounded and
equicontinuous on each interval [0, n]. In other words,
sup |x0 | < and lim sup mT (x, ) = 0 for all T > 0.
xK 0 xK

This implies the following criterion of the uniform tightness of laws on the metric space (C(R+ ), d), which is simply
a translation of the Arzela-Ascoli theorem into probabilistic language.
Theorem 68 A sequence of laws (P_n)_{n≥1} on (C(R_+), d) is uniformly tight if and only if

lim_{λ→+∞} sup_{n≥1} P_n( |x_0| > λ ) = 0   (22.0.1)

and

lim_{δ→0} sup_{n≥1} P_n( m_T(x, δ) > ε ) = 0   (22.0.2)

for any T > 0 and any ε > 0.


Proof. (=) Suppose that for any > 0, there exists a compact K such that Pn (K) > 1 for all n 1. By the
Arzela-Ascoli theorem, |x0 | for some > 0 and for all x K and, therefore,
sup Pn (|x0 | > ) sup Pn (K c ) .
n n



Also, by equicontinuity, for any > 0 there exists 0 > 0 such that for < 0 and for all x K we have mT (x, ) < .
Therefore,
sup Pn mT (x, ) > sup Pn (K c ) .

n n

(=) Fix any > 0. For each integer T 1, find T > 0 such that

sup Pn |x0 | > T .
n 2T +1

For each integer T 1 and k 1, find T,k > 0 such that


 1
sup Pn mT (x, T,k ) > T +k+1 .
n k 2

Consider the set n 1 o


AT = x C(R+ ) : |x0 | T , mT (x, T,k ) for all k 1 .
k
Then,

sup Pn (AT )c

+ =
n 2T +1 k1 2T +k+1 2T
T
and their intersection A = T 1 AT satisfies
[ 
sup Pn (Ac ) = sup Pn (AT )c T
= .
n n T 1 T 1 2

By construction, each set AT is closed, uniformly bounded and equicontinuous on [0, T ]. Therefore, by the Arzela-
Ascoli theorem, their intersection A is compact on (C(R+ ), d). Since > 0 was arbitrary, this proves that the sequence
(Pn ) is uniformly tight, . t
u

Remarks. Of course, for the uniform tightness on (C[0, 1], k k ) we only need the second condition (22.0.2) for
T = 1. Also, it will be convenient to slightly relax (22.0.2) and replace it with asymptotic equicontinuity condition

lim lim sup Pn mT (x, ) > = 0.



(22.0.3)
0 n

If this holds then, given > 0, we can find 0 > 0 and n0 1 such that for all n > n0 ,

Pn mT (x, 0 ) > < .




Since mT (x, ) 0 as 0 for all x C(R+ ), for each n n0 , we can find n > 0 such that

Pn mT (x, n ) > < ,




Since mT (x, ) is decreasing in , this implies that

if < min(0 , 1 , . . . , n0 ) then Pn mT (x, ) > <




for all n 1. We showed that (22.0.3) implies (22.0.2). t


u

Using this characterization of uniform tightness, we will now give our first example of convergence on (C(R+ ), d) to
the Brownian motion W_t. Consider a sequence (X_i)_{i≥1} of i.i.d. random variables such that EX_i = 0 and σ² = EX_i² < ∞. Let us consider a continuous partial sum process on [0, ∞) defined by

W_t^n = (1/(σ√n)) Σ_{i≤⌊nt⌋} X_i + (nt − ⌊nt⌋) X_{⌊nt⌋+1}/(σ√n),   (22.0.4)

where ⌊nt⌋ is the integer part of nt, ⌊nt⌋ ≤ nt < ⌊nt⌋ + 1.



Theorem 69 (Donsker Invariance Principle) The processes Wtn converge in distribution to the Brownian motion Wt
on the space (C(R+ ), d).

This implies, for example, that any continuous functional of these processes on (C(R_+), d) converges in distribution; in particular,

sup_{t≤1} W_t^n → sup_{t≤1} W_t

in distribution.
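The word invariance refers to the fact that the limit does not depend on the distribution of the increments beyond the first two moments. The following simulation sketch is not part of the notes; the two increment distributions and the functional sup_{t≤1} W_t^n are arbitrary choices used only to illustrate this.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 2000, 4000


def sup_partial_sum(draw_centered):
    """Simulate sup_{t<=1} W_t^n for the partial sum process with given increments."""
    sups = np.empty(reps)
    for r in range(reps):
        X = draw_centered(n)                      # mean 0, variance 1 increments
        W = np.cumsum(X) / np.sqrt(n)
        sups[r] = max(0.0, W.max())               # include t = 0, where W_0^n = 0
    return sups


sup_rademacher = sup_partial_sum(lambda m: rng.choice([-1.0, 1.0], m))
sup_exponential = sup_partial_sum(lambda m: rng.exponential(1.0, m) - 1.0)

# the two empirical distributions should be close: both approximate sup_{t<=1} W_t
qs = [0.25, 0.5, 0.75, 0.9]
print("quantiles (Rademacher): ", np.quantile(sup_rademacher, qs).round(3))
print("quantiles (exponential):", np.quantile(sup_exponential, qs).round(3))
```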
Proof. Since the last term in Wtn is of order n1/2 , for simplicity of notations, we will simply write
1
Wtn = Xi
n int

and treat nt as an integer. By the central limit theorem,


1 1
Xi = t Xi
n int nt int

converges in distribution to normal distribution N(0,t). Given t < s, we can represent


1
Wsn = Wtn + Xi
n nt<ins

and since Wtn and Wsn Wtn are independent, it should be obvious that finite dimensional distributions of Wtn converge
to finite dimensional distributions of the Brownian motion Wt . By Lemma 51 in the previous section, this identifies
Wt as the unique n (L (Wtn ))n1 is uniformly tight
 possible limit of Wt and, if we can shown that the sequence of laws n
on C[0, ), d , the Selection Theorem will imply that Wt Wt weakly. Since W0 = 0, we only need to prove the
asymptotic equicontinuity (22.0.3). First,
1 1
mT (W n , ) = sup Xi max Xi .

n 0knT,0< jn n
t,s[0,T ],|ts| ns<int k<ik+ j

If instead of maximizing over all 0 k nT, we maximize in increments of n , i.e. over indices k of the type k = ln
for 0 l m 1, where m := T / , then it is easy to check that the maximum will decrease by at most a factor of 3,
1 1
max Xi 3 max Xi ,

0knT,0< jn

n k<ik+ j 0lm1,0< jn

n ln <iln + j

because the second maximum over 0 < j n is taken over intervals of the same size n . As a consequence, if
mT (W n , ) > then one of the events
n 1 o
max Xi >

0< jn

n ln <iln + j 3

must occur for some 0 l m 1. Since the number of these events is m = T / ,


   1 
P mT (W n , ) > m P max > 3 .
X (22.0.5)

i
0< jn n 0<i j

Kolmogorovs inequality, Theorem 16, implies that if Sn = X1 + + Xn and max0< jn P(|Sn S j | > ) p < 1 then
  1
P max |S j | > 2 P(|Sn | > ).
0< jn 1 p

If we take = n /6 then, by Chebyshevs inequality,
  62 n 2
P Xi > n 2 2 = 36 2
j<in
6 n



and, therefore, if 36 2 < 1,
   
2 1

P max X > n 1 36 i > 6 n .
X

0< jn
i 3 P
0<i j 0<in

Finally, using (22.0.5) and the central limit theorem,


  1  
lim sup P mT (Wtn , ) > m 1 36 2 lim sup P i > 6 n
X

n n 0<in
 
2 1

= m 1 36 2 N(0, 1) ,
6
 1 2 
1
2T 1 1 36 2

exp 2 .
26
This, obviously, goes to zero as 0, which proves that

lim lim sup P mT (W n , ) > = 0,



0 n

for all T > 0 and > 0. This finishes the proof that Wtn Wt weakly in (C[0, ), d). t
u



Section 23

Convergence of empirical process to Brownian bridge.

Empirical process and the Kolmogorov-Smirnov test. In this section we show how the Brownian bridge B_t arises in another central limit theorem on the space of continuous functions on [0, 1]. Let us start with a motivating example from statistics. Suppose that x_1, . . . , x_n are i.i.d. uniform random variables on [0, 1]. By the law of large numbers, for any t ∈ [0, 1], the empirical c.d.f. n^{−1} Σ_{i=1}^n I(x_i ≤ t) converges to the true c.d.f. P(x_1 ≤ t) = t almost surely and, moreover, by the CLT,

X_t^n := √n ( (1/n) Σ_{i=1}^n I(x_i ≤ t) − t ) → N(0, t(1 − t))   in distribution.

The stochastic process X_t^n is called the empirical process. The covariance of this process, for s ≤ t,

EX_t^n X_s^n = E( I(x_1 ≤ t) − t )( I(x_1 ≤ s) − s ) = s − ts − ts + ts = s(1 − t),

is the same as the covariance of the Brownian bridge and, by the multivariate CLT, all finite dimensional distributions
of the empirical process converge to the f.d. distributions of the Brownian bridge,

L (Xtn )tF L (Bt )tF .


 
(23.0.1)

However, we would like to show the convergence of Xtn to Bt in some stronger sense that would imply weak conver-
gence of continuous functions of the process on the space (C[0, 1], k k ).
The Kolmogorov-Smirnov test in statistics provides some motivation. Suppose that i.i.d. (Xi )i1 have continuous
distribution with c.d.f. F(t) = P(X1 t). Let Fn (t) = n1 ni=1 I(Xi t) be the empirical c.d.f.. Since F is continuous
and F(R) = [0, 1],
1 n 1 n 
sup n|Fn (t) F(t)| = sup n I(Xi t) F(t) = sup n I F(Xi ) F(t) F(t)

tR tR n i=1 tR n i=1
1 n
d
= sup n I(F(Xi ) t) t = sup |Xtn |,
t[0,1] n i=1 t[0,1]

because (F(Xi )) are i.i.d. and have the uniform distribution on [0, 1]. This means that the distribution of the left hand
side does not depend on F and, in order to infer whether the sample (Xi )1in comes from the distribution with the c.d.f.
F, statisticians need to know only the distribution of the supremum of the empirical process or, as an approximation,
the distribution of its limit. Equation (23.0.1) suggests that

L sup |Xtn | L sup |Bt | ,


 
(23.0.2)
t t

and the right hand side is called the Kolmogorov-Smirnov distribution that will be computed in the next section.
Since Bt is sample continuous, its distribution is the law on the metric space (C[0, 1], k k ). Even though Xtn is



not continuous, its jumps are equal to n1/2 , so it can be approximated by a continuous process Ytn uniformly within
n1/2 . Since k k is a continuous functional on (C[0, 1], k k ), (23.0.2) would hold if we can prove weak convergence
L (Ytn )L (Bt ). We only need to prove uniform tightness of L (Ytn ) because, by Lemma 51, (23.0.1) already identifies
the law of the Brownian motion as the unique possible limit. Thus, we need to address the question of uniform tightness
of (L (Ytn )) on the complete separable space (C[0, 1], || || ) or, equivalently, by the result of the previous section, the
asymptotic equicontinuity of Ytn ,
lim lim sup P m(Y n , ) > = 0.

0 n
Obviously, this is again equivalent to
lim lim sup P m(X n , ) > = 0,

0 n
so we can deal with the process Xtn , even though it is not continuous. By Chebyshevs inequality,
 1
P m(X n , ) > Em(X n , )

and we need to learn how to control Em(X n , ). The modulus of continuity of X n can be written as
1 n
m(X n , ) = sup |Xtn Xsn | = n sup I(s < xi t) (t s)

|ts| |ts| n i=1
n
1

= n sup f (xi ) E f , (23.0.3)

f F n i=1

where we introduced the class of functions

F = f (x) = I(s < x t) : t, s [0, 1], |t s| < .



(23.0.4)

We will now develop an approach to control the expectation of (23.0.3) for general classes of functions F and we will
only use the specific definition (23.0.4) at the very end. This will be done in several steps.
Symmetrization. As the first step, we will replace the empirical process (23.0.3) by a symmetrized version, called the
Rademacher process, that will be easier to control. Let x10 , . . . , xn0 be independent copies of x1 , . . . , xn and let 1 , . . . , n
be i.i.d. Rademacher random variables, such that P(i = 1) = P(i = 1) = 1/2. Let us define

1 n 1 n
Pn f = f (xi ) and P0n f = f (xi0 ).
n i=1 n i=1

Notice that EP0n f = E f . Consider the random variables


1 n
Z = sup Pn f E f and R = sup i f (xi ) .

f F f F n i=1

Then, using Jensens inequality, the symmetry, and then triangle inequality, we can write

EZ = E sup Pn f E f = E sup Pn f EP0n f

f F f F
1 n 1 n
E sup ( f (xi ) f (xi0 )) = E sup i ( f (xi ) f (xi0 ))

f F n i=1 f F n i=1
1 n 1 n
E sup i f (xi ) + E sup i f (xi0 ) = 2ER.

f F n i=1 f F n i=1

Equality in the second line holds because switching xi xi0 arbitrarily does not change the expectation, so the equality
holds for any fixed (i ) and, therefore, for any random (i ). t
u
Covering numbers, Kolmogorovs chaining and Dudleys entropy integral. To control ER for general classes
of functions F , we will need to use some measures of complexity of F . First, we will show how to control the



Rademacher process R conditionally on x_1, . . . , x_n. Suppose that (F, d) is a totally bounded metric space. For any u > 0, a u-packing number of F with respect to d is defined by

D(F, u, d) = max{ card F_u ⊂ F : d(f, g) > u for all f ≠ g ∈ F_u }

and a u-covering number is defined by

N(F, u, d) = min{ card F_u ⊂ F : ∀ f ∈ F ∃ g ∈ F_u such that d(f, g) ≤ u }.

Both packing and covering numbers measure how many points are needed to approximate any element in the set F within distance u. It is a simple exercise (at the end of the section) to show that

N(F, u, d) ≤ D(F, u, d) ≤ N(F, u/2, d)
and, in this sense, packing and covering numbers are closely related. Let F be a subset of the cube [1, 1]n equipped
with a rescaled Euclidean metric 1 n 1/2
d( f , g) = ( fi gi )2 .
n i=1
Consider the following Rademacher process on F,

1 n
R( f ) = i fi .
n i=1

Then we have the following version of the classical Kolmogorov chaining lemma.

Theorem 70 (Kolmogorov's chaining) For any u > 0,

P( ∀ f ∈ F, R(f) ≤ 2^{9/2} ∫_0^{d(0,f)} log^{1/2} D(F, ε, d) dε + 2^{7/2} d(0, f) √u ) ≥ 1 − e^{−u}.

Proof. Without loss of generality, assume that 0 F. Define a sequence of subsets

{0} = F0 F1 . . . Fj . . . F

such that Fj satisfies

1. f , g Fj , d( f , g) > 2 j ,
2. f F we can find g Fj such that d( f , g) 2 j .

F0 obviously satisfies these properties for j = 0. To construct Fj+1 given Fj :

Start with Fj+1 := Fj .

If possible, find f F such that d( f , g) > 2( j+1) for all g Fj+1 .


Let Fj+1 := Fj+1 { f } and repeat until you cannot find such f .

Define projection j : F Fj as follows:

for f F find g Fj with d( f , g) 2 j and set j ( f ) = g.

Any f F can be decomposed into the telescopic series

f = 0 ( f ) + (1 ( f ) 0 ( f )) + (2 ( f ) 1 ( f )) + . . .

= ( j ( f ) j1 ( f )).
j=1



Moreover,
d( j1 ( f ), j ( f )) d( j1 ( f ), f ) + d( f , j ( f ))
2( j1) + 2 j = 3 2 j 2 j+2 .
As a result, the jth term in the telescopic series for any f F belongs to a finite set of possible links
L j1, j = f g : f Fj , g Fj1 , d( f , g) 2 j+2 .


Since R( f ) is linear,

R( f ) = R

j ( f ) j1 ( f ) .
j=1

We first show how to control R on the set of all links. Assume that ` L j1, j . By Hoeffdings inequality,
 1 n   t2   t2 
P R(`) = i `i t exp 1 n 2 exp .
n i=1 2n i=1 `i 2 22 j+4

If |F| denotes the cardinality of the set F then |L j1, j | |Fj1 | |Fj | |Fj |2 and, therefore, by the union bound,
 t2  1 u
P ` L j1, j , R(`) t 1 |Fj |2 exp

= 1 e
22 j+5 |Fj |2
after making the change of variables
 1/2
t = 22 j+5 (4 log |Fj | + u) 27/2 2 j log1/2 |Fj | + 25/2 2 j u.

Hence,   1 u
P ` L j1, j , R(`) 27/2 2 j log1/2 |Fj | + 25/2 2 j u 1 e .
|Fj |2
If Fj1 = Fj then we can chose j1 ( f ) = j ( f ) and, since in this case L j1, j = {0}, there is no need to control these
links. Therefore, we can assume that |Fj1 | < |Fj | and taking a union bound for all steps,
 
P j 1 ` L j1, j , R(`) 27/2 2 j log1/2 |Fj | + 25/2 2 j u

1 u 1
1 2
e 1 2
eu = 1 ( 2 /6 1)eu 1 eu .
j=1 |F j | j=1 ( j + 1)

Given f F, let integer k be such that 2(k+1) < d(0, f ) 2k . Then in the above construction we can assume that
0 ( f ) = . . . = k ( f ) = 0, i.e. we will project f on 0 if possible. Then, with probability at least 1 eu ,

R( f ) = R j ( f ) j1 ( f )


j=k+1
 
27/2 2 j log1/2 |Fj | + 25/2 2 j u
j=k+1

27/2 2 j log1/2 D(F, 2 j , d) + 25/2 2k u.
j=k+1

Note that 2k < 2d( f , 0) and 25/2 2k < 27/2 d( f , 0). Finally, since the packing numbers D(F, , d) are decreasing in ,
we can write
Z 2(k+1)
29/2 2( j+1) log1/2 D(F, 2 j , d) 29/2 log1/2 D(F, , d) d
j=k+1 0
Z d(0, f )
29/2 log1/2 D(F, , d) d, (23.0.5)
0



since 2(k+1) < d(0, f ). This finishes the proof. t
u

The integral in (23.0.5) is called Dudleys entropy integral. We would like to apply the bound of the above theorem to

1 n
nR = sup i f (xi )

f F n i=1

for a class of functions F in (23.0.4). Suppose that x1 , . . . , xn [0, 1] are fixed and let
n    o
F = fi 1in = I s < xi t : t, s [0, 1], |t s| {0, 1}n .
1in

Then we can control the covering numbers of this set as follows.

Lemma 53 N(F, u, d) ≤ K u^{−4} for some absolute K > 0 independent of the points x_1, . . . , x_n.

This kind of estimate is called uniform covering numbers, because the bound does not depend on the points x1 , . . . , xn
that generate the class F from F .

Proof. We can assume that x1 . . . xn . Then the class F consists of all vectors of the type

(0 . . . 1 . . . 1 . . . 0),

i.e. the coordinates equal to 1 come in blocks. Given u, let Fu be a subset of such vectors with blocks of 1s starting
and ending at the coordinates kbnuc. Given any vector f F, let us approximate it by a vector in f 0 Fu by choosing
the closest starting and ending coordinates for the blocks of 1s. The number of different coordinates will be bounded
by 2bnuc and, therefore, the distance between f and f 0 will be bounded by
q
d( f , f 0 ) 2n1 bnuc 2u.

The cardinality
of Fu is, obviously, of order u2 and this proves that N(F, 2u, d) Ku2 . Making the change of
variables 2u u proves the result. t
u
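The bound of Lemma 53 can be probed numerically: after discretizing s and t, a greedy maximal u-packing of the class F is also a u-covering, so its cardinality dominates the covering number of the discretized class. This is only a rough sketch, not part of the notes; the sample size, the (s, t) grid and the values of u are arbitrary, and the constraint |t − s| ≤ δ from (23.0.4) is dropped since it does not change the form of the bound.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
x = np.sort(rng.uniform(0.0, 1.0, n))

# discretized version of the class F: indicators of intervals (s, t]
grid = np.linspace(0.0, 1.0, 61)
F = np.array([((s < x) & (x <= t)).astype(float)
              for i, s in enumerate(grid) for t in grid[i + 1:]])


def greedy_packing(vectors, u):
    """Greedily build a maximal u-packing w.r.t. d(f,g) = sqrt(mean((f-g)^2))."""
    centers = []
    for f in vectors:
        if all(np.sqrt(np.mean((f - c) ** 2)) > u for c in centers):
            centers.append(f)
    return len(centers)


for u in (0.5, 0.35, 0.25):
    D = greedy_packing(F, u)
    print(f"u={u}: greedy packing size {D},  crude bound 1/u^4 = {1 / u ** 4:.1f}")
```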

To apply Kolmogorovs chaining bound to this class F, let us make a simple observation that if a random variable
2
X 0 satisfies P(X a + bt) Ket for all t 0 then
Z Z Z t2
2
EX = P(X t) dt a + P(X a + t) dt a + K e
b dt a + Kb K(a + b).
0 0 0

Theorem 70 then implies that


1 n Z Dn r K 
E sup i fi K log du + Dn (23.0.6)

F n i=1 0 u

where E is the expectation with respect to (i ) only and

1 n 1 n
D2n = sup d(0, f )2 = sup f (xi )2 = sup I(s < xi t)
F F n i=1 |ts| n i=1

1 n 1 n
= sup I(xi t) I(xi s) .

|ts| n i=1 n i=1

Since the integral on the right hand side of (23.0.6) is concave in Dn , by Jensens inequality,
1 n Z EDn r K 
E sup i f (xi ) K log du + EDn .

F n i=1 0 u



By the symmetrization inequality, this finally proves that
r
n
Z EDn K 
Em(X , ) K log du + EDn .
0 u
The strong law of large numbers easily implies (exercise) that
1 n
sup I(xi t) t 0 a.s.

t[0,1] n i=1

and, therefore, D2n a.s. and EDn . This implies that


r
Z K
lim sup Em(Xtn , ) K log du + .
n 0 u

The right-hand side goes to zero as 0 and this finishes the proof of asymptotic equicontinuity of X n . As a result,
for any continuous function on (C[0, 1], k k ) the distibution of (Xtn ) converges to the distribution of (Bt ). For
example,
1
n sup I(xi t) t sup |Bt |

0t1 n 0t1

in distribution. We will find the distribution of the right hand side in the next section. Notice that the methods we used
to prove equicontinuity were quite general and the main step where we used the specific class of function F was to
control the covering numbers.

Exercise. Show that N(F, u, d) D(F, u, d) N(F, u/2, d).

Exercise. If (xi )i1 are i.i.d. uniform on [0, 1], prove that
1 n
sup I(xi t) t 0 a.s.

t[0,1] n i=1

Exercise. Consider a set = { = (1 , . . . , d ) : di=1 |i | 1} Rd and the distance d( 1 , 2 ) = ni=1 |i1 i2 |.


Prove that D(, , d) (4/)d .

Exercise. Suppose that a family F of measurable functions f : [0, 1] is such that for some V 1 and for any
1 , . . . , n ,
2 1 n 2 1/2
log D(F , , d) V log where d( f , g) = f (i ) g(i ) .
n i=1
If (Xn )n1 are i.i.d. random elements with values in , prove that
1 n  V 1/2
E sup f (Xi ) E f (X1 ) K

f F n i=1 n

for some absolute constant K. (Assume, for example, that F is countable to avoid any measureability issues.)



Section 24

Reflection principles for Brownian motion.

We showed that the empirical process converges to the Brownian bridge on (C([0, 1]), k k ). As a result, the distribu-
tion of a continuous function of the process will also converge, for example,

sup |Xtn | sup |Bt |


0t1 0t1

in distribution. We will compute the distribution of this supremum in Theorem 73 below, but first we will prove the
so called strong Markov property
 of the Brownian motion. Given a Brownian motion Wt on some probability space
(, B, P) let Bt = (Ws )st B be the -algebra generated by the process up to time t. Let us consider a family
of -algebras Ft for t 0 such that

(i) Ft Fs for all t s;


(ii) Bt Ft for all t 0;
(iii) the future increments (Wt+s Wt )s0 are independent of Ft for all t 0.

For example, if we consider some -algebra A independent of the process Wt then Ft = (Bt A ) satisfy these
properties. A random variable 0 is called a stopping time if

{ t} Ft for all t 0.

For example, a hitting time τ_c = inf{t > 0 : W_t = c} for c > 0 is a stopping time because, by the sample continuity,

{τ_c ≤ t} = ⋂_{q<c} ⋃_{r<t} {W_r > q},

where the intersection and union are over rational numbers q, r. The following is very similar to the property of
stopping times for sums of i.i.d. random variables in Section 7.

Theorem 71 (Strong Markov Property) On the event { < }, the increments Wt0 := W+t W of the process after
the stopping time are independent of the -algebra

F = B B : B { t} Ft for all t 0


generated by the data up to the stopping time and, moreover, the process Wt0 is again a Brownian motion.

Proof. The main tool in the proof is the approximation of an arbitrary stopping time by the dyadic stopping time

b2n c + 1
n = .
2n



First of all, let us check that n is a stopping time. Indeed, if
k k+1 k+1
< n then n = n
2n 2 2
and, therefore, for any t 0, if l/2n t < (l + 1)/2n then
 
n l o l
{ q} Ft .
[
{n t} = n n = < n =
2 2 n q<l/2

By construction, n and, by the continuity of the process, Wn W almost surely. The process Wt0 is, obviously,
continuous and we only need to check the statement of the theorem on its finite dimensional distributions. Consider
an integer d 1, 0 t1 < . . . < td and f Cb (Rd ), and let

e(t) := f (W+t1 W , . . . ,W+td W ).

Take a set B F . For any integer k 0, the event

B {n = k2n } = B (k 1)2n < k2n Fk2n




is independent of e(k2n ) and, therefore,

Ee(k2n )I B {n = k2n } = Ee(k2n )EI B {n = k2n }


 
Ee(n )IB I( < ) =
k0 k0
n

= Ee(0) EI B {n = k2 } = Ee(0)EIB I( < ).
k0

In the second line we used the homogeneity of the Brownian motion to write Ee(k2n ) = Ee(0). Since e(n ) e(),

Ee()IB I( < ) = lim Ee(n )IB I( < ) = Ee(0)EIB I( < ).


n

This proves that the increments (W+t1 W , . . . ,W+td W ) are independent of the -algebra F on the event <
and have the same distribution as (Wt1 , . . . ,Wtd ), i.e. the original Brownian motion. This finishes the proof. t
u
Remark. A random variable 0 is called a Markov time if

{ < t} Ft for all t 0.

The -algebra generated by the process up to a Markov time is defined by

F + = B B : B { < t} Ft for all t 0 .



(24.0.1)

One can show that a stopping time is always Markov time and F F + . With a little bit more work, one can
generalize the above Strong Markov property to Markov times and -algebras F + . t
u
Example. As a first example of application of the SMP, let us compute the following probability,

P( sup_{t≤b} W_t ≥ c ) = P(τ_c ≤ b),

for any c > 0 and the hitting time τ_c. Since the event {τ_c ≤ b} ∈ F_{τ_c} is independent of the process W_t′ = W_{τ_c+t} − W_{τ_c}, which is a Brownian motion, we can write (see also Remark below)

P(W_b ≥ c) = P( τ_c ≤ b, W_b − W_{τ_c} ≥ 0 ) = P(τ_c ≤ b) P( W′_{b−τ_c} ≥ 0 ) = (1/2) P(τ_c ≤ b).   (24.0.2)

Therefore,

P( sup_{t≤b} W_t ≥ c ) = P(τ_c ≤ b) = 2P(W_b ≥ c) = 2 ∫_{c/√b}^∞ (1/√(2π)) e^{−x²/2} dx.   (24.0.3)



This implies that the density of τ_c is given by

f_{τ_c}(b) = (1/√(2π)) e^{−c²/(2b)} c b^{−3/2},

and since it is of order O(b^{−3/2}) as b → +∞, Eτ_c = ∞. □

Remark. In the equation (24.0.2), we used that the event {c b} Fc is independent of Wb Wc = Wb


0
c
. However,
even though the future increments Wt are independent of Fc , the new time b c is not. This is not a problem,
0

because one can show as in the proof of the SMP that the process Wt W for t is independent of F . Indeed, for
dyadic approximation n of , conditionally on n = k2n , the increments Wt Wk2n are independent of Fk2n , so the
proof does not change. For example, we can write
in Fk/2n indep. of Fk/2n
z }| { z }| { 
P(n b,Wb Wn 0) = P n = k/2n b, Wb Wk/2n 0
k0
1  k  1
= P n = b = P(n b),
2 k0 2n 2

and letting n we recover (24.0.2). t


u
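A quick simulation sketch of the identity (24.0.3), not part of the notes (the horizon, level, step size and number of paths are arbitrary): the running maximum of a discretized Brownian motion on [0, b] should exceed c with probability close to 2P(W_b ≥ c) = 2Φ̄(c/√b).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
b, c = 2.0, 1.5
n_steps, n_paths = 1000, 5000

dW = rng.normal(0.0, np.sqrt(b / n_steps), (n_paths, n_steps))
W = dW.cumsum(axis=1)
running_max = W.max(axis=1)

simulated = np.mean(running_max >= c)
exact = 2 * norm.sf(c / np.sqrt(b))          # 2 * P(W_b >= c), formula (24.0.3)
print(f"P(sup_{{t<=b}} W_t >= c): simulated {simulated:.4f}, reflection formula {exact:.4f}")
```

The discretization slightly underestimates the running maximum, so the simulated value is a bit below the exact one for coarse grids.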

Reflection principles. If Wt is a Brownian motion then Bt = Wt tW1 is the Brownian bridge for t [0, 1]. The next
lemma shows that we can think of the Brownian bridge as Brownian motion conditioned to be equal to zero at time
t = 1 (pinned down Brownian motion).

Lemma 54 Conditional distribution of Wt given |W1 | < converges to the law of Bt ,

L Wt |W1 | < L (Bt ) as 0.




Proof. Notice that Bt = Wt tW1 is independent of W1 , because their covariance

EBt W1 = EWt W1 tEx12 = t t = 0.

Therefore, the Brownian motion can be written as a sum Wt = Bt + tW1 of the Brownian bridge and independent
process tW1 . Therefore, if we define a random variable with distribution L ( ) = L W1 |W1 | < independent
of Bt then
L Wt |W1 | < = L (Bt + t ) L (Bt ).


as 0. t
u

Theorem 72 If B_t is a Brownian bridge then, for all b > 0,

P( sup_{t∈[0,1]} B_t ≥ b ) = e^{−2b²}.

Proof. Since Bt = Wt tW1 and W1 are independent, we can write

P(t : Wt tW1 = b, |W1 | < ) P(t : Wt = b + tW1 , |W1 | < )


P(t : Bt = b) = = .
P(|W1 | < ) P(|W1 | < )

We can estimate the numerator from below and above by


  
P t : Wt > b + , |W1 | < P t : Wt = b + tW1 , |W1 | < P t : Wt b , |W1 | < .

Let us first analyze the upper bound. If we define a hitting time = inf{t : Wt = b } then W = b and
  
P t : Wt b , |W1 | < = P 1, |W1 | < = P 1,W1 W (b, b + 2) .



2b
A
b!"

B
!"

Figure 24.1: Reflecting the Brownian motion.

By the strong Markov property and symmetry of the Brownian motion,


 
P 1,W1 W (b, b + 2) = P 1, (W1 W ) (b, b + 2)

= P 1,W1 W (b 2, b)

= P W1 (2b 3, 2b ) ,

because the fact that W1 (2b 3, 2b ) automatically implies that 1 for b > 0 and small enough. We
reflected the Brownian motion after stopping time as in Figure 24.1. Therefore, we proved that
P(W1 (2b 3, 2b )) 2
P(t : Bt = b) e2b
P(W1 (, ))
as 0. The lower bound can be analyzed similarly. t
u

Theorem 73 (Kolmogorov-Smirnov) If B_t is the Brownian bridge then, for all b > 0,

P( sup_{0≤t≤1} |B_t| ≥ b ) = 2 Σ_{n≥1} (−1)^{n−1} e^{−2n²b²}.

Proof. For n 1, consider an event

An = t1 < < tn 1 : Bt j = (1) j1 b




and let b and b be the hitting times of b and b. By symmetry of the distribution of the process Bt ,
  
P sup |Bt | b = P b or b 1 = 2P(A1 , b < b ).
0t1

Again, by symmetry,

P(An , b < b ) = P(An ) P(An , b < b ) = P(An ) P(An+1 , b < b )

and, by induction,
P(A1 , b < b ) = P(A1 ) P(A2 ) + . . . + (1)n1 P(An , b < b ).
As in Theorem 72, reflecting the Brownian motion each time we hit b or b, one can show that

P W1 (2nb , 2nb + ) 1 2 2 2
P(An ) = lim  = e 2 (2nb) = e2n b
0 P W1 (, )
and this finishes the proof. t
u
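The series in Theorem 73 converges very fast, so it is easy to tabulate; the following sketch, not part of the notes, compares it with a simulation of sup_t |X_t^n| = √n sup_t |F_n(t) − t| for the uniform empirical process of Section 23. The sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)


def ks_survival(b, terms=100):
    """P(sup_t |B_t| >= b) = 2 * sum_{n>=1} (-1)^{n-1} exp(-2 n^2 b^2)."""
    n = np.arange(1, terms + 1)
    return 2.0 * np.sum((-1.0) ** (n - 1) * np.exp(-2.0 * n ** 2 * b ** 2))


# simulate sup_t sqrt(n) |F_n(t) - t| for uniform samples on [0, 1]
n, reps = 2000, 5000
sups = np.empty(reps)
k = np.arange(1, n + 1)
for r in range(reps):
    x = np.sort(rng.uniform(0.0, 1.0, n))
    # sup_t |F_n(t) - t| for the empirical c.d.f. of the sorted sample x
    sups[r] = np.sqrt(n) * np.maximum(k / n - x, x - (k - 1) / n).max()

for b in (0.5, 1.0, 1.36, 1.63):
    print(f"b={b}: simulated {np.mean(sups >= b):.4f}, series {ks_survival(b):.4f}")
```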
Given a, b > 0, let us compute the probability that a Brownian bridge crosses one of the levels −a or b.



Theorem 74 (Two-sided boundary) If $a, b > 0$ then
$$P\big(\exists t: B_t = -a \text{ or } b\big) = \sum_{n\ge 0}\Big(e^{-2(na+(n+1)b)^2} + e^{-2((n+1)a+nb)^2}\Big) - 2\sum_{n\ge 1}e^{-2n^2(a+b)^2}. \qquad (24.0.4)$$

Proof. We have
$$P(\exists t: B_t = -a \text{ or } b) = P(\exists t: B_t = -a,\ \tau_{-a} < \tau_b) + P(\exists t: B_t = b,\ \tau_b < \tau_{-a}).$$
If we introduce the events
$$C_n = \big\{\exists\, t_1 < \ldots < t_n : B_{t_1} = b,\ B_{t_2} = -a,\ \ldots\big\}$$
and
$$A_n = \big\{\exists\, t_1 < \ldots < t_n : B_{t_1} = -a,\ B_{t_2} = b,\ \ldots\big\}$$
then, as in the previous theorem,
$$P(C_n, \tau_b < \tau_{-a}) = P(C_n) - P(C_n, \tau_{-a} < \tau_b) = P(C_n) - P(A_{n+1}, \tau_{-a} < \tau_b)$$
and, similarly,
$$P(A_n, \tau_{-a} < \tau_b) = P(A_n) - P(C_{n+1}, \tau_b < \tau_{-a}).$$
By induction,
$$P(\exists t: B_t = -a \text{ or } b) = \sum_{n=1}^\infty (-1)^{n-1}\big(P(A_n) + P(C_n)\big).$$
Probabilities of the events $A_n$ and $C_n$ can be computed using the reflection principle as above,
$$P(A_{2n}) = P(C_{2n}) = e^{-2n^2(a+b)^2},\quad P(C_{2n+1}) = e^{-2(na+(n+1)b)^2},\quad P(A_{2n+1}) = e^{-2((n+1)a+nb)^2}$$
and this finishes the proof. □

If $X = -\inf_{0\le t\le 1} B_t$ and $Y = \sup_{0\le t\le 1} B_t$ then the spread of the process $B_t$ is $X + Y$.

Theorem 75 (Distribution of the spread) For any $t > 0$,
$$P\big(X + Y \le t\big) = 1 - \sum_{n\ge 1}\big(8n^2t^2 - 2\big)e^{-2n^2t^2}.$$

Proof. First of all, (24.0.4) gives the joint c.d.f. of $(X,Y)$ because
$$F(a,b) = P(X < a,\, Y < b) = P\big(-a < \inf B_t,\ \sup B_t < b\big) = 1 - P(\exists t: B_t = -a \text{ or } b).$$
If $f(a,b) = \partial^2 F/\partial a\,\partial b$ is the joint p.d.f. of $(X,Y)$ then the c.d.f. of the spread $X + Y$ is
$$P\big(Y + X \le t\big) = \int_0^t\!\!\int_0^{t-a} f(a,b)\,db\,da.$$
The inner integral is
$$\int_0^{t-a} f(a,b)\,db = \frac{\partial F}{\partial a}(a, t-a) - \frac{\partial F}{\partial a}(a, 0).$$
Since
\begin{align*}
\frac{\partial F}{\partial a}(a,b) &= \sum_{n\ge 0} 4n\big(na + (n+1)b\big)e^{-2(na+(n+1)b)^2}\\
&\quad + \sum_{n\ge 0} 4(n+1)\big((n+1)a + nb\big)e^{-2((n+1)a+nb)^2}\\
&\quad - \sum_{n\ge 1} 8n^2(a+b)\,e^{-2n^2(a+b)^2},
\end{align*}
plugging in the values $b = t - a$ and $b = 0$ gives
$$\int_0^{t-a} f(a,b)\,db = \sum_{n\ge 0}4n\big((n+1)t - a\big)e^{-2((n+1)t-a)^2} + \sum_{n\ge 0}4(n+1)(nt + a)\,e^{-2(nt+a)^2} - \sum_{n\ge 1}8n^2 t\,e^{-2n^2t^2}.$$
Integrating over $a \in [0,t]$,
\begin{align*}
P\big(Y + X \le t\big) &= \sum_{n\ge 0}(2n+1)\Big(e^{-2n^2t^2} - e^{-2(n+1)^2t^2}\Big) - \sum_{n\ge 1} 8n^2t^2\,e^{-2n^2t^2}\\
&= 1 + 2\sum_{n\ge 1}e^{-2n^2t^2} - \sum_{n\ge 1}8n^2t^2\,e^{-2n^2t^2},
\end{align*}
and this finishes the proof. □

Exercise. Prove the Strong Markov Property for a Markov time $\tau$ and the $\sigma$-algebra $\mathcal{F}_{\tau^+}$ in (24.0.1).

In the next four exercises, $(\mathcal{F}_t)$ is a nested family of $\sigma$-algebras, $\mathcal{F}_t \subseteq \mathcal{F}_s$ for $t \le s$.

Exercise. Show that for $\tau = t$, $\mathcal{F}_{\tau^+} = \mathcal{F}_{t^+} = \bigcap_{s>t}\mathcal{F}_s$.

Exercise. A family $(\mathcal{F}_t)$ is called right-continuous if $\bigcap_{s>t}\mathcal{F}_s = \mathcal{F}_t$ for all $t \ge 0$. If $(\mathcal{F}_t)$ is right-continuous, show that a Markov time $\tau$ is a stopping time and $\mathcal{F}_{\tau^+} = \mathcal{F}_\tau$.

Exercise. If we define $\overline{\mathcal{F}}_t := \bigcap_{s>t}\mathcal{F}_s$, show that $(\overline{\mathcal{F}}_t)$ is right-continuous.

Exercise. If $(\mathcal{F}_t)$ satisfies conditions (i)-(iii) at the beginning of this section, prove that $(\overline{\mathcal{F}}_t)$ satisfies them too.



Section 25

Skorohod's Imbedding and Laws of the Iterated Logarithm.

In this section we will prove another classical limit theorem in Probability, called the laws of the iterated logarithm, using the method of Skorohod's imbedding, which we will describe first. Let $W_t$ be the Brownian motion.

Theorem 76 If $\tau$ is a stopping time such that $E\tau < \infty$ then $EW_\tau = 0$ and $EW_\tau^2 = E\tau$.

Proof. Let us start with the case when a stopping time takes a finite number of values,

{t1 , . . . ,tn }.

If Ft j = ((Wt )tt j ) then (Wt j , Ft j ) is a martingale, since

E(Wt j |Ft j1 ) = E(Wt j Wt j1 +Wt j1 |Ft j1 ) = Wt j1 .

By optional stopping theorem for martingales, EW = EWt1 = 0. Next, let us prove that EW2 = E by induction on n.
If n = 1 then = t1 and
EW2 = EWt21 = t1 = E.
To make an induction step from n 1 to n, define a stopping time = tn1 and write

EW2 = E(W +W W )2 = EW2 + E(W W )2 + 2EW (W W ).

First of all, by the induction assumption, EW2 = E. Moreover, , only if = tn , in which case = tn1 . The
event
{ = tn } = { tn1 }c Ftn1
and, therefore,
EW (W W ) = EWtn1 (Wtn Wtn1 )I( = tn ) = 0.
Similarly,
E(W W )2 = EE(I( = tn )(Wtn Wtn1 )2 |Ftn1 ) = (tn tn1 )P( = tn ).
Therefore,
EW2 = E + (tn tn1 )P( = tn ) = E
and this finishes the proof of the induction step. Next, let us consider the case of a uniformly bounded stopping time $\tau \le M < \infty$. In the previous lecture we defined the dyadic approximation
$$\tau_n = \frac{\lfloor 2^n \tau\rfloor + 1}{2^n},$$


which is also a stopping time, n , and by the sample continuity Wn W almost surely. Since (n ) are uniformly
bounded, En E. To prove that EW2n EW2 , we need to show that the sequence (W2n ) is uniformly integrable.
Notice that n < 2M and, therefore, n takes possible values of the type k/2n for k k0 = b2n (2M)c. Since the sequence

W1/2n , . . . ,Wk0 /2n ,W2M

is a martingale, adapted to a corresponding sequence of Ft , and n and 2M are two stopping times such that n < 2M,
by the optional stopping theorem, Theorem 41, Wn = E(W2M |Fn ). By Jensens inequality,

W4n E(W2M
4
|Fn ), EW4n EW2M
4
= 6M,

and the uniform integrability follows by Holders and Chebyshevs inequalities,


6M
EW2n I(|Wn | > N) (EW4n )1/2 (P(|Wn | > N))1/2 0
N2
as N , uniformly over n. This proves that EW2n EW2 . Since n takes finite number of values, by the previous
case, EW2n = En and letting n proves
EW2 = E. (25.0.1)
Before we consider the general case, let us notice that for two bounded stopping times M one can similarly
show that
E(W W )W = 0. (25.0.2)
Namely, one can approximate the stopping times by the dyadic stopping times and using that, by the optional stopping
theorem, (Wn , Fn ), (Wn , Fn ) is a martingale and

E(Wn Wn )Wn = EWn E(Wn |Fn ) Wn ) = 0.

Finally, we consider the general case. Let us define (n) = min(, n). For m n, (m) (n) and

E(W(n) W(m) )2 = EW(n)


2 2
EW(m) 2EW(m) (W(n) W(m) ) = E(n) E(m)

using (25.0.1), (25.0.2) and the fact that (n), (m) are bounded stopping times. Since (n) , Fatous lemma and
the monotone convergence theorem imply

E(W W(m) )2 lim inf E(n) E(m) = E E(m).



n

Letting m shows that


lim E(W W(m) )2 = 0
m
2
which means that EW(m) EW2 . Since EW(m)
2 = E(m) by the previous case and E(m) E by the monotone
2
convergence theorem, this implies that EW = E. t
u

We will now prove Skorohod's imbedding theorem.

Theorem 77 (Skorohod's imbedding) Let $Y$ be a random variable such that $EY = 0$ and $EY^2 < \infty$. There exists a stopping time $\tau < \infty$ such that $\mathcal{L}(W_\tau) = \mathcal{L}(Y)$.

Proof. Let us start with the simplest case when $Y$ takes only two values, $Y \in \{-a, b\}$ for $a, b > 0$. The condition $EY = 0$ determines the distribution of $Y$,
$$pb + (1-p)(-a) = 0 \quad\text{and}\quad p = \frac{a}{a+b}. \qquad (25.0.3)$$
Let $\tau = \inf\{t > 0 : W_t = -a \text{ or } b\}$ be the hitting time of the two-sided boundary $-a, b$. The tail probability of $\tau$ can be bounded by
$$P(\tau > n) \le P\big(|W_{j+1} - W_j| < a + b,\ 0 \le j \le n-1\big) = P(|W_1| < a + b)^n = \gamma^n,$$
where $\gamma = P(|W_1| < a + b) < 1$. Therefore, $E\tau < \infty$ and, by the previous theorem, $EW_\tau = 0$. Since $W_\tau \in \{-a, b\}$, we must have $\mathcal{L}(W_\tau) = \mathcal{L}(Y)$.
Let us now consider the general case. If is the law of Y, let us define Y by the identity Y = Y (x) = x on its sample
probability space (R, B, ). Let us construct a sequence of -algebras
B1 B2 . . . B
as follows. Let B1 be generated by the set (, 0), i.e.
B1 = 0,

/ R, (, 0), [0, +) .
Given B j , let us define B j+1 by splitting each finite interval [c, d) B j into two intervals [c, (c + d)/2) and [(c +
d)/2, d), and splitting infinite interval (, j) into (, ( j + 1)) and [( j + 1), j) and, similarly, splitting
[ j, +) into [ j, j + 1) and [ j + 1, ). Consider a right-closed martingale
Y j = E(Y |B j ).
It is almost obvious that the Borel -algebra B on R is generated by all these -algebras, B = j1 B j . Then, by
Levys martingale convergence, Lemma 45, Y j E(Y |B) = Y almost surely. Since Y j is measurable on B j , it must
be constant on each simple set [c, d) B j . If Y j (x) = y for x [c, d) then, since Y j = E(Y |B j ),
Z
y([c, d)) = EY j I[c,d) = EY I[c,d) = x d(x)
[c,d)

and
1
Z
y= x d(x). (25.0.4)
([c, d)) [c,d)

If ([c, d)) = 0 we pick any y (c, d) as a value of Y j on [c, d). Since in the -algebra B j+1 the interval [c, d) is split
into two intervals, the random variable Y j+1 can take only two values on the interval [c, d), say c y1 < y < y2 < d,
and, since (Y j , B j ) is a martingale,
E(Y j+1 |B j ) Y j = 0. (25.0.5)
We will define stopping times n such that L (Wn ) = L (Yn ) iteratively as follows. Since Y1 takes only two values a
and b, let 1 = inf{t > 0,Wt = a or b} and we proved above that L (W1 ) = L (Y1 ). Given j define j+1 as follows:
if W j = y for y in (25.0.4) then j+1 = inf{t > j ,Wt = y1 or y2 }.
Let us explain why L (W j ) = L (Y j ). First of all, by construction, W j takes the same values as Y j . If C j is the -
algebra generated by the disjoint sets {W j = y} for y as in (25.0.4), i.e. for possible values of Y j , then W j is C j
measurable, C j C j+1 , C j F j and at each step simple sets in C j are split in two,
{W j = y} = {W j+1 = y1 } {W j+1 = y2 }.
By Markovs property of the Brownian motion and Theorem 76, E(W j+1 W j |F j ) = 0 and, therefore,
E(W j+1 |C j ) W j = 0.
Since on each simple set {W j = y} in C j , the random variable W j+1 takes only two values y1 and y2 , this equation
allows us to compute the probabilities of these simple sets recursively as in (25.0.3),
y2 y
P(W j+1 = y2 ) = P(W j = y).
y2 y1
By (25.0.5), Y j s satisfy the same recursive equations and this proves that L (Wn ) = L (Yn ). The sequence n is
monotone, so it converges n to some stopping time . Since
En = EW2n = EYn2 EY 2 < ,

we have E = lim En EY 2 < and, therefore, < almost surely. Then Wn W almost surely by sample
continuity and, since L (Wn ) = L (Yn ) L (Y ), this proves that L (W ) = L (Y ). t
u
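As a quick numerical illustration of the two-point case in the proof above (a rough sketch with arbitrary step size and sample size; the time discretization introduces a small bias), one can stop a simulated Brownian path at the first exit from $(-a, b)$ and check that $P(W_\tau = b)$ is close to $a/(a+b)$ and that $E\tau$ is close to $EW_\tau^2 = ab$.

import numpy as np

def two_point_embedding(a, b, n_paths=2000, dt=1e-3, seed=0):
    # Stop a discretized Brownian motion at the first exit from (-a, b).
    rng = np.random.default_rng(seed)
    sqrt_dt = np.sqrt(dt)
    hits_b, taus = [], []
    for _ in range(n_paths):
        w, t = 0.0, 0.0
        while -a < w < b:
            w += sqrt_dt * rng.standard_normal()
            t += dt
        hits_b.append(w >= b)
        taus.append(t)
    return np.mean(hits_b), np.mean(taus)

a, b = 1.0, 2.0
p_hat, mean_tau = two_point_embedding(a, b)
print(p_hat, a / (a + b))    # P(W_tau = b) is approximately a/(a+b)
print(mean_tau, a * b)       # E tau is approximately E W_tau^2 = a*b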



Skorohod's imbedding will be used to reduce the case of sums of i.i.d. random variables to the following law of
the iterated logarithm for Brownian motion.
Theorem 78 Let $W_t$ be the Brownian motion and $u(t) = \sqrt{2t\,\ell(t)}$, where $\ell(t) = \log\log t$. Then
$$\limsup_{t\to\infty} \frac{W_t}{u(t)} = 1$$
almost surely.

Let us briefly describe the main idea that gives origin to the function $u(t)$. For $a > 1$, consider a geometric sequence $t = a^k$ and take a look at the probabilities of the following events,
$$P\big(W_{a^k} \ge L\,u(a^k)\big) = P\Big(\frac{W_{a^k}}{\sqrt{a^k}} \ge \frac{L\,u(a^k)}{\sqrt{a^k}}\Big) \approx \frac{1}{\sqrt{2\pi}}\,\frac{1}{L\sqrt{2\ell(a^k)}}\exp\Big(-\frac{1}{2}\,\frac{L^2\, 2a^k \ell(a^k)}{a^k}\Big) = \frac{1}{\sqrt{2\pi}}\,\frac{1}{L\sqrt{2\ell(a^k)}}\Big(\frac{1}{k\log a}\Big)^{L^2}. \qquad (25.0.6)$$
This series will converge or diverge depending on whether $L > 1$ or $L < 1$. Even though these events are not independent, in some sense they are almost independent, and the Borel-Cantelli lemma would imply that the upper limit of $W_{a^k}$ behaves like $u(a^k)$. Some technical work will complete this main idea. Let us start with the following.

Lemma 55 For any $\delta > 0$,
$$\limsup_{s\to\infty}\ \sup\Big\{\frac{|W_t - W_s|}{u(s)} : s \le t \le (1+\delta)s\Big\} \le 4\sqrt{\delta}$$
almost surely.

Proof. Let , > 0, tk = (1 + )k and Mk = u(tk ). By symmetry, the equation (24.0.3) and the Gaussian tail estimate
in Lemma 52,
   
P sup |Wt Wtk | Mk 2P sup Wt Mk
tk ttk+1 0ttk+1 tk
 1 Mk2 
= 4N(0,tk+1 tk )(Mk , ) 4 exp
2 (tk+1 tk )
 2 2t `(t )   1  2
k k
4 exp =4 .
2tk k log(1 + )

If 2 > , the sum of these probabilities converges and, by the Borel-Cantelli lemma, for large enough k,

sup |Wt Wtk | u(tk ).


tk ttk+1

It is easy to see that for small enough , u(tk+1 )/u(tk ) < 1 + 2. If k is such that tk s tk+1 and s t (1 + )s
then, clearly, tk s t tk+2 and, therefore, for large enough k,

|Wt Ws | 2u(tk ) + u(tk+1 ) (2 + (1 + ))u(s) 4u(s).



Letting over some sequence finishes the proof. t
u
Proof of Theorem 78. For L = 1 + > 1, the equation (25.0.6) and the Borel-Cantelli lemma imply that

Wtk (1 + )u(tk )

for large enough k. If tk = (1 + )k then Lemma 55 implies that, with probability one, for large enough t, if tk t < tk+1
then
Wt Wtk u(tk ) Wt Wtk u(tk )
= + (1 + ) + 4 .
u(t) u(tk ) u(t) u(tk ) u(t)



Letting , 0 over some sequences proves that, with probability one,
Wt
lim sup 1.
t u(t)

To prove that the upper limit is equal to one, we will use the Borel-Cantelli lemma for the independent increments
Wak Wak1 for large values of the parameter a > 1. If 0 < < 1 then, similarly to (25.0.6),

1 1  1 (1)2
P Wak Wak1 (1 )u(ak ak1 )

.
2 (1 ) 2`(ak ak1 ) log(ak ak1 )
p

The series diverges and, since these events are independent, they occur infinitely often with probability one. We
already proved (by (25.0.6)) that, for > 0, for large enough k 1, Wak /u(ak ) 1 + and, therefore, by symmetry,
Wak /u(ak ) (1 + ). This gives

Wak u(ak ak1 ) Wak1


k
(1 ) +
u(a ) u(ak ) u(ak )
u(ak ak1 ) u(ak1 )
(1 ) (1 + )
u(ak ) u(ak )
s s
(ak ak1 )`(ak ak1 ) ak1 `(ak1 )
= (1 ) (1 + )
ak `(ak ) ak `(ak )

and r r
Wt Wk 1 1
lim sup lim sup ak (1 ) 1 (1 + ) .
t u(t) k u(a ) a a
Letting 0 and a over some sequences proves that the upper limit is equal to one. t
u

The LIL for Brownian motion implies the LIL for sums of independent random variables via Skorohod's imbedding.

Theorem 79 Suppose that $Y_1, Y_2, \ldots$ are i.i.d. and $EY_i = 0$, $EY_i^2 = 1$. If $S_n = Y_1 + \ldots + Y_n$ then
$$\limsup_{n\to\infty} \frac{S_n}{\sqrt{2n\log\log n}} = 1$$
almost surely.
L
Proof. Let us define a stopping time (1) such that W(1) = Y1 . By the strong Markov property, the increment of
the process after stopping time is independent of the process before stopping time and has the law of the Brownian
L L
motion. Therefore, we can define (2) such that W(1)+(2) W(1) = Y2 and, by independence, W(1)+(2) = Y1 + Y2
L
and (1), (2) are i.i.d. By induction, we can define i.i.d. (1), . . . , (n) such that Sn = WT (n) where T (n) = (1) +
. . . + (n). We have
Sn d WT (n) Wn WT (n) Wn
= = + .
u(n) u(n) u(n) u(n)
By the LIL for Brownian motion,
Wn
lim sup = 1.
n u(n)
By the strong law of large numbers, T (n)/n E(1) = EY12 = 1 almost surely. For any > 0, Lemma 55 implies that,
for large n,
|WT (n) Wn |
4 ,
u(n)
and letting 0 finishes the proof. t
u
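A simulation can only hint at this statement, since the lim sup is approached extremely slowly; the following rough sketch (arbitrary path length, starting index and seeds) tracks the running maximum of $S_n/\sqrt{2n\log\log n}$ for coin flips and typically produces values in the rough vicinity of $1$, without of course exhibiting the limit itself.

import numpy as np

def lil_running_max(n_max=10**6, n_min=1000, seed=0):
    # Running maximum of S_n / sqrt(2 n log log n) along one path of +/- 1 coin flips,
    # starting from n_min to avoid small n where log log n is degenerate.
    rng = np.random.default_rng(seed)
    S = np.cumsum(rng.choice([-1.0, 1.0], size=n_max))
    n = np.arange(1, n_max + 1)
    mask = n >= n_min
    ratio = S[mask] / np.sqrt(2 * n[mask] * np.log(np.log(n[mask])))
    return ratio.max()

print([round(lil_running_max(seed=s), 3) for s in range(5)])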



The LIL for Brownian motion also implies the local LIL:
$$\limsup_{t\downarrow 0} \frac{W_t}{\sqrt{2t\,\ell(1/t)}} = 1.$$
It is easy to check that if $W_t$ is a Brownian motion then $tW_{1/t}$ is also a Brownian motion and the result follows by the change of variable $t \to 1/t$. To check that $tW_{1/t}$ is a Brownian motion, notice that for $t < s$,
$$E\, tW_{1/t}\big(sW_{1/s} - tW_{1/t}\big) = st\,\frac{1}{s} - t^2\,\frac{1}{t} = t - t = 0$$
and $E\big(tW_{1/t} - sW_{1/s}\big)^2 = t + s - 2t = s - t$.


Section 26

Moment problem and de Finetti's theorem for coin flips.

Let us start by recalling the example in Section 5. Let $(X_i)$ be i.i.d. with the Bernoulli distribution with probability of success $\theta \in [0,1]$, i.e. $P_\theta(X_i = 1) = \theta$ and $P_\theta(X_i = 0) = 1 - \theta$, and let $u : [0,1] \to \mathbb{R}$ be some continuous function on $[0,1]$. Then, by Theorem ?? in Section 5, the Bernstein polynomials
$$B_n(\theta) := \sum_{k=0}^n u\Big(\frac{k}{n}\Big)\, P_\theta\Big(\sum_{i=1}^n X_i = k\Big) = \sum_{k=0}^n u\Big(\frac{k}{n}\Big)\binom{n}{k}\theta^k(1-\theta)^{n-k}$$
approximate $u(\theta)$ uniformly on $[0,1]$.

Moment problem. Consider a random variable $X$ taking values in $[0,1]$ and let $\mu_k = EX^k$ be its moments. Given a sequence $(c_0, c_1, c_2, \ldots)$, let us define the sequence of increments by $\Delta c_k = c_{k+1} - c_k$. Then
$$-\Delta\mu_k = \mu_k - \mu_{k+1} = E(X^k - X^{k+1}) = EX^k(1-X),$$
$$(-\Delta)(-\Delta\mu_k) = (-1)^2\Delta^2\mu_k = EX^k(1-X) - EX^{k+1}(1-X) = EX^k(1-X)^2$$
and, by induction,
$$(-1)^r\Delta^r\mu_k = EX^k(1-X)^r.$$
Clearly, $(-1)^r\Delta^r\mu_k \ge 0$ since $X \in [0,1]$. If $u$ is a continuous function on $[0,1]$ and $B_n$ is the corresponding Bernstein polynomial defined above then
$$EB_n(X) = \sum_{k=0}^n u\Big(\frac{k}{n}\Big)\binom{n}{k}EX^k(1-X)^{n-k} = \sum_{k=0}^n u\Big(\frac{k}{n}\Big)\binom{n}{k}(-1)^{n-k}\Delta^{n-k}\mu_k.$$
Since $B_n$ converges uniformly to $u$, $EB_n(X)$ converges to $Eu(X)$. Let us define
$$p_k^{(n)} = \binom{n}{k}(-1)^{n-k}\Delta^{n-k}\mu_k \ge 0.$$
By taking $u \equiv 1$, it is easy to see that $\sum_{k=0}^n p_k^{(n)} = 1$, so we can think of $p_k^{(n)}$ as the distribution
$$P\Big(X^{(n)} = \frac{k}{n}\Big) = p_k^{(n)} \qquad (26.0.1)$$
of some random variable $X^{(n)}$. We showed that $EB_n(X) = Eu\big(X^{(n)}\big) \to Eu(X)$ for any continuous function $u$, which means that $X^{(n)}$ converges to $X$ in distribution. In other words, given finitely many moments of $X$, this construction gives an explicit approximation of the distribution of $X$.
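As a concrete illustration (a minimal sketch; the choice of the uniform distribution and of $n$ is arbitrary), the weights $p_k^{(n)}$ in (26.0.1) can be computed from finitely many moments by iterated differencing. For the moments $\mu_k = 1/(k+1)$ of the uniform distribution on $[0,1]$ one recovers equal weights $1/(n+1)$ on the grid $\{0, 1/n, \ldots, 1\}$.

import numpy as np
from math import comb

def moment_weights(mu, n):
    # p_k^(n) = C(n, k) * (-1)^(n-k) * Delta^(n-k) mu_k, computed from mu_0, ..., mu_n.
    table = [np.array(mu[: n + 1], dtype=float)]   # table[r][k] = Delta^r mu_k
    for r in range(1, n + 1):
        table.append(np.diff(table[-1]))
    return np.array([comb(n, k) * (-1) ** (n - k) * table[n - k][k] for k in range(n + 1)])

n = 6
mu = [1.0 / (k + 1) for k in range(n + 1)]   # moments of the uniform distribution on [0,1]
p = moment_weights(mu, n)
print(np.round(p, 4), p.sum())               # weights 1/(n+1), summing to 1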



Next, given a sequence $(\mu_k)$, we consider the following question: when is $(\mu_k)$ the sequence of moments of some $[0,1]$-valued random variable $X$? By the above, it is necessary that
$$\mu_k \ge 0,\quad \mu_0 = 1 \quad\text{and}\quad (-1)^r\Delta^r\mu_k \ge 0 \ \text{ for all } k, r. \qquad (26.0.2)$$
It turns out that this is also sufficient.

Theorem 80 (Hausdorff) There exists a r.v. $X \in [0,1]$ such that $\mu_k = EX^k$ if and only if (26.0.2) holds.

Proof. The idea of the proof is as follows. If k are the moments of some r.v. X, then the discrete distributions defined
(n)
in (26.0.1) should approximate it. Therefore, our goal will be to show that the condition (26.0.2) ensures that (pk )
is indeed a distribution and then show that the moments of (26.0.1) converge to k . As a result, any limit of these
(n)
distributions will be a candidate for the distribution of X. First of all, let us express k in terms of (pk ). Since
k = k+1 k , we have the following inversion formula:

k = k+1 k = (k+2 k+1 ) + (k+1 + 2 k )


r  
r
= k+2 2k+1 + 2 k = (1)r j r j k+ j ,
j=0 j

(n)
by induction on r. Take r = n k and recall the definition of pk above. Then
nk   1
nk n (n)
k = pk+ j .
j=0 j k+ j

Since
n 1 k + j n 1
     
nk (n k)! (k + j)!(n k j)!
= = ,
j k+ j j!(n k j)! n! k k
we can rewrite  1
nk  n   1
k+ j n (n) m n (n)
k = pk+ j = pm .
j=0 k k m=k k k
(n) (n)
For k = 0, this gives mn pm = 0 = 1 and, by the assumption (26.0.2), pm 0. Therefore, we can consider a
random variable X (n) such that  m (n)
P X (n) = = pm for 0 m n.
n
Notice that, for any fixed k,
n   1 n n  k
m n (n) m(m 1) (m k + 1) (n) m (k) k
k = pm = pm pm = E X (n)
m=k k k m=k n(n 1) (n k + 1) m=0 n

as n . By the Selection Theorem, one can choose a subsequence X (ni ) that converges to some r.v. X in distribution
and, as a result,
k
E X (ni ) EX k = k ,
which means that k are the moments of X. t
u
de Finettis theorem. As a consequence of the Hausdorff theorem above, we will now prove the classical de Finetti
representation for coin flips. The general case will be considered in the next section. A sequence (Xn )n1 of {0, 1}-
valued random variables is called exchangeable if, for any n 1 and any x1 , . . . , xn {0, 1}, the probability

P(X1 = x1 , . . . , Xn = xn )

depends only on the number of successes x1 + . . . + xn and does not depend on the order of 1s or 0s. Another way
to say this is that, for any n 1 and any permutation of {1, . . . , n}, the distribution of (X(1) , . . . , X(n) ) does not
depend on . Then the following holds.



Theorem 81 (de Finetti) There exists a distribution $F$ on $[0,1]$ such that
$$p_k := P(X_1 + \ldots + X_n = k) = \int_0^1 \binom{n}{k} p^k (1-p)^{n-k}\, dF(p).$$
In other words, in order to generate such an exchangeable sequence of 0s and 1s, we first pick $p \in [0,1]$ from some distribution $F$ and then generate a sequence of i.i.d. Bernoulli random variables with probability of success $p$.

Proof. Let 0 = 1 and, for k 1, define

k = P(X1 = 1, . . . , Xk = 1). (26.0.3)

We have

P(X1 = 1, . . . , Xk = 1, Xk+1 = 0) = P(X1 = 1, . . . , Xk = 1) P(X1 = 1, . . . , Xk = 1, Xk+1 = 1) = k k+1 = k .

Next, using exchangeability,

P(X1 = 1, . . . , Xk = 1, Xk+1 = 0, Xk+2 = 0) = P(X1 = 1, . . . , Xk = 1, Xk+1 = 0)


P(X1 = 1, . . . , Xk = 1, Xk+1 = 0, Xk+2 = 1)
= k (k+1 ) = 2 k .

Similarly, by induction,

P(X1 = 1, . . . , Xk = 1, Xk+1 = 0, . . . , Xn = 0) = (1)nk nk k 0.

By the Hausdorff theorem, k = EX k for some r.v. X [0, 1] and, therefore,


Z 1
P(X1 = 1, . . . , Xk = 1, Xk+1 = 0, . . . , Xn = 0) = (1)nk nk k = EX k (1 X)nk = pk (1 p)nk dF(p).
0

Since, by exchangeability, changing the order of 1s and 0s does not affect the probability, we get
Z 1 
n k
P(X1 + . . . + Xn = k) = p (1 p)nk dF(p),
0 k

and this finishes the proof. t


u

Example (Polyas urn model). Suppose we have b blue and r red balls in the urn. We pick a ball randomly and return

Figure 26.1: Polya urn model.

it with c balls of the same color. Consider the random variable



1, if the ith ball picked is blue
Xi =
0, otherwise.

$X_i$'s are not independent but it is easy to check that they are exchangeable. For example,
$$P(bbr) = \frac{b}{b+r}\cdot\frac{b+c}{b+r+c}\cdot\frac{r}{b+r+2c} = P(brb) = \frac{b}{b+r}\cdot\frac{r}{b+r+c}\cdot\frac{b+c}{b+r+2c}.$$



To identify the distribution $F$ in de Finetti's theorem, let us look at its moments $\mu_k$ in (26.0.3),
$$\mu_k = P(\underbrace{b \ldots b}_{k \text{ times}}) = \frac{b}{b+r}\cdot\frac{b+c}{b+r+c}\cdots\frac{b+(k-1)c}{b+r+(k-1)c}.$$
One can recognize, or easily check, that $(\mu_k)$ are the moments of the Beta$(\alpha, \beta)$ distribution with the density
$$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\, I(0 \le x \le 1)$$
with the parameters $\alpha = b/c$, $\beta = r/c$. By de Finetti's theorem, we can generate the $X_i$'s by first picking $p$ from the distribution Beta$(b/c, r/c)$ and then generating i.i.d. Bernoulli $(X_i)$'s with the probability of success $p$. By the strong law of large numbers, the proportion of blue balls in the first $n$ repetitions will converge to this probability of success $p$, i.e. in the limit it will be random with the Beta distribution. Recall that this example came up in Section 15 on the convergence of martingales. □
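A small simulation makes this picture concrete (a rough sketch; the urn parameters, number of draws and number of runs are arbitrary choices): the long-run fraction of blue draws varies from run to run, and across many runs its first two moments match those of Beta$(b/c, r/c)$.

import numpy as np

def polya_fraction(b, r, c, n_draws=2000, seed=None):
    # One run of the Polya urn; returns the fraction of blue picks among n_draws.
    rng = np.random.default_rng(seed)
    blue, red, picked_blue = float(b), float(r), 0
    for _ in range(n_draws):
        if rng.random() < blue / (blue + red):
            picked_blue += 1
            blue += c
        else:
            red += c
    return picked_blue / n_draws

b, r, c = 2, 3, 1
samples = np.array([polya_fraction(b, r, c, seed=i) for i in range(2000)])
# Beta(b/c, r/c) = Beta(2, 3) has mean 2/5 and variance 1/25.
print(samples.mean(), samples.var())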



Section 27

The general de Finetti and Aldous-Hoover representations.

We will begin by proving the classical de Finetti representation for exchangeable sequences. A sequence $(s_\ell)_{\ell\ge 1}$ is called exchangeable if for any permutation $\pi$ of finitely many indices we have equality in distribution
$$\big(s_{\pi(\ell)}\big)_{\ell\ge 1} \stackrel{d}{=} \big(s_\ell\big)_{\ell\ge 1}. \qquad (27.0.1)$$
The following holds.

Theorem 82 (de Finetti) If the sequence $(s_\ell)_{\ell\ge 1}$ is exchangeable then there exists a measurable function $g : [0,1]^2 \to \mathbb{R}$ such that
$$\big(s_\ell\big)_{\ell\ge 1} \stackrel{d}{=} \big(g(w, u_\ell)\big)_{\ell\ge 1}, \qquad (27.0.2)$$
where $w$ and $(u_\ell)$ are i.i.d. random variables with the uniform distribution on $[0,1]$.

Before we proceed to prove the above results, let us recall the following definition.

Definition. A measurable space (, B) is called a Borel space if there exists a one-to-one function from onto a
Borel subset A [0, 1] such that both and 1 are measurable. t
u

Perhaps, the most important example of Borel spaces are complete separable metric spaces and their Borel subsets (see
e.g. Section 13.1 in R.M. Dudley Real Analysis and Probability). The existence of the isomorphism automatically
implies that if we can prove the above results in the case when the elements of the sequence (s` ) or array (s`,`0 ) take
values in [0, 1] then the same representation results hold when the elements take values in a Borel space. Similarly, other
standard results for real-valued or [0, 1]-valued random variables are often automatically extended to Borel spaces. For
example, one can generate any real-valued random variable as a measurable function of a uniform random variable
on [0, 1] using the quantile transform and, therefore, any random element on a Borel space can also be generated by a
function of a uniform random variable on [0, 1].
Let us describe another typical measure theoretic argument that will be used many times below.

Lemma 56 (Coding Lemma) Suppose that a random pair $(X,Y)$ takes values in the product of a measurable space $(\Omega_1, \mathcal{B}_1)$ and a Borel space $(\Omega_2, \mathcal{B}_2)$. Then, there exists a measurable function $f : \Omega_1 \times [0,1] \to \Omega_2$ such that
$$(X, Y) \stackrel{d}{=} \big(X, f(X, u)\big), \qquad (27.0.3)$$
where $u$ is a uniform random variable on $[0,1]$ independent of $X$.

Another way to state this is to say that, conditionally on X, Y can be generated as a function Y = f (X, u) of X
and an independent uniform random variable u on [0, 1]. Rather than using the Coding Lemma itself we will often use



the general ideas of conditioning and generating random variables as functions of uniform random variables on [0, 1]
that are used in its proof, without repeating the same argument.

Proof. By the definition of a Borel space, one can easily reduce the general case to the case when $\Omega_2 = [0,1]$ equipped with the Borel $\sigma$-algebra. Since in this case the regular conditional distribution $\mathrm{Pr}(x, B)$ of $Y$ given $X$ exists, for a fixed $x \in \Omega_1$, we can define by $F(x,y) = \mathrm{Pr}(x, [0,y])$ the conditional distribution function of $Y$ given $x$ and by
$$f(x,u) = F^{-1}(x,u) = \inf\big\{s \in [0,1] : u \le F(x,s)\big\}$$
its quantile transformation. It is easy to see that $f$ is measurable on the product space $\Omega_1 \times [0,1]$ because, for any $t \in \mathbb{R}$,
$$\big\{(x,u) \,\big|\, f(x,u) \le t\big\} = \big\{(x,u) \,\big|\, u \le F(x,t)\big\}$$
and, by the definition of the regular conditional probability, $F(x,t) = \mathrm{Pr}(x, [0,t])$ is measurable in $x$ for a fixed $t$. If $u$ is uniform on $[0,1]$ then, for a fixed $x \in \Omega_1$, $f(x,u)$ has the distribution $\mathrm{Pr}(x, \cdot)$ and this finishes the proof. □
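As a small illustration of the quantile construction in the proof (a sketch with an arbitrarily chosen toy kernel, not a claim about any particular model), we can generate $Y = f(X, u)$ by inverting the conditional c.d.f. of $Y$ given $X$.

import numpy as np

def f(x, u):
    # Quantile transform F^{-1}(x, u) for the toy conditional law
    # Y | X = x ~ Exponential(rate 1 + x), whose c.d.f. is F(x, y) = 1 - exp(-(1 + x) y).
    return -np.log(1.0 - u) / (1.0 + x)

rng = np.random.default_rng(0)
X = rng.uniform(size=100000)
U = rng.uniform(size=100000)          # uniform on [0,1], independent of X
Y = f(X, U)
# Sanity check: E[Y | X close to 0.5] should be close to 1 / 1.5.
mask = (X > 0.49) & (X < 0.51)
print(Y[mask].mean(), 1.0 / 1.5)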

The most important way in which the exchangeability condition (27.0.1) will be used is to say that for any infinite
subset I N,
 d 
s` `I = s` `1 . (27.0.4)
Let us describe one immediate consequence of this simple observation. If

FI = s` : ` I

(27.0.5)

is the -algebra generated by the random variables s` for ` I then the following holds.

Lemma 57 For any infinite subset $I \subseteq \mathbb{N}$ and any $j \notin I$, the conditional expectations
$$E\big(f(s_j)\,\big|\,\mathcal{F}_I\big) = E\big(f(s_j)\,\big|\,\mathcal{F}_{\mathbb{N}\setminus\{j\}}\big)$$
almost surely, for any bounded measurable function $f: \mathbb{R} \to \mathbb{R}$.

Proof. First, using the property (27.0.4) for I { j} instead of I implies the equality in distribution,
 d
E f (s j ) FI = E f (s j ) FN\{ j} ,


and, therefore, the equality of the L2 -norms,


E f (s j ) FI = E f (s j ) FN\{ j} .
 
2 2
(27.0.6)

On the other hand, I N \ { j} and FI FN\{ j} , which implies that


2   2
E E f (s j ) FI E f (s j ) FN\{ j} = E E f (s j ) FI = E f (s j ) FI 2 .
 

Combining this with (27.0.6) yields that

E f (s j ) FI E f (s j ) FN\{ j} 2 = 0,
 
2
(27.0.7)

which finishes the proof. t


u

Proof of Theorem 82. Let us take any infinite subset I N such that its complement N \ I is also infinite. By (27.0.4),
we only need to prove the representation (27.0.23) for (s` )`I . First, we will show that, conditionally on (s` )`N\I , the
random variables (s` )`I are independent. This means that, given n 1, any distinct `1 , . . . , `n I, and any bounded
measurable functions f1 , . . . , fn : R R,
 
E f j (s` j ) FN\I = E f j (s` j ) FN\I ,

(27.0.8)

jn jn



or, in other words, for any A FN\I ,

EIA f j (s` j ) = EIA E f j (s` j ) FN\I .



jn jn

Since IA and f1 (s`1 ), . . . , fn1 (s`n1 ) are FN\{`n } -measurable, we can write

f j (s` j )E fn (s`n ) FN\{`n }



EIA f j (s` j ) = EIA
jn jn1

f j (s` j )E fn (s`n ) FN\I ,



= EIA
jn1

where the second equality follows from Lemma 57. This implies that
   
E f j (s` j ) FN\I = E f j (s` j ) FN\I E fn (s`n ) FN\I

jn jn1

and (27.0.8) follows by induction on n. Let us also observe that, because of (27.0.4), the distribution of the array
(s` , (s j ) jN\I ) does not depend on l I. This implies that, conditionally on (s j ) jN\I , the random variables (s` )`I
are identically distributed in addition to being independent. The product space [0, 1] with the Borel -algebra corre-
sponding to the product topology is a Borel space (recall that, equipped with the usual metric, it becomes a complete
separable metric space). Therefore, in order to conclude the proof, it remains to generate X = (s` )`N\I as a function
of a uniform random variable w on [0, 1], X = X(w), and then use the argument in the Coding Lemma 56 to generate
the sequence (s` )`I as ( f (X(w), u` ))`I , where (u` )`I are i.i.d. random variables uniform on [0, 1]. This finishes the
proof. t
u
Comments about de Finettis theorem. Consider an exchangeable sequence (s` )`1 taking values in some com-
plete separable space . It is equal in distribution to (g(w, u` ))`1 . For a fixed w, this is an i.i.d. sequence from the
distribution
w = g(w, )1 ,
where is the Lebesgue measure on [0, 1]. By the strong law of large numbers for empirical measures (Varadarajans
Theorem in Section 17),
1 n
w = lim g(w,u` ) .
n n
`=1

The limit on the right hand side is taken on the complete separable metric space P(, B) of probability measures
on equipped, for example, with the bounded Lipschitz metric or Levy-Prokhorov metric , and this limit exists
almost surely over (u` )`1 . This implies that

(i) the limit = limn 1n n`=1 s` exists almost surely;


(ii) is a (random) probability measure on , i.e. a random element in P(, B);
(iii) given , the sequence (s` )`1 is i.i.d. from with the distribution .

The measure is called the empirical measure of the sequence (s` )`1 . One can now interpret de Finettis represen-
tation as follows. First, we generate as a function of a uniform random variable w on [0,1] and then generate s` as
a function of and i.i.d. random variables u` uniform on [0, 1], using the Coding Lemma. Combining two steps, we
generate s` as a function of w and u` .
Remark. There are two equivalent definitions of a random probability measure on a complete separable metric space
with the Borel -algebra B. On the one hand, as above, these are just random elements taking values in the space
P(, B) of probability measures on (, B) equipped with the topology of weak convergence or a metric that metrizes
weak convergence. On the other hand, we can think of a random measure as a probability kernel, i.e. as a function
= (x, A) of a generic point x X for some probability space (X, F , Pr) and a measurable set A B such that for a
fixed x, (x, ) is a probability measure and for a fixed A, (, A) is a measurable function on (X, F ). It is well known
that these two definitions coincide (see, e.g. Lemma 1.37 and Theorem A2.3 in Foundations of Modern Probability).



Example. (de Finettis representation for {0, 1}-valued sequences) Consider an exchangeable sequence (s` )`1 of
random variables that take values {0, 1}. Then the empirical measure is a random probability measure on {0, 1}. It is
encoded by one (random) parameter p = ({1}) [0, 1]. To fix means to fix p and, given p, the sequence (s` )`1
is i.i.d. Bernoulli with the probability of success equal to p. If is the distribution of p then, for any n 1 and any
1 , . . . , n {0, 1},  
n k
Z
P(s1 = 1 , . . . , sn = n ) = p (1 p)k d(p),
[0,1] k

where k = 1 + . . . + n . t
u

Another piece of information that can be deduced from de Finettis theorem and is often useful is the following.
Suppose that in addition to the sequence (s` )`1 we are given another random element Z taking values in a complete
separable metric space, and their joint distribution is not affected by a permutation of the sequence,
 d 
Z, (s` )`1 = Z, (s(`) `1 ). (27.0.9)

Suppose that is the empirical measure of (s` )`1 .

Theorem 83 Conditionally on , the sequence (s` )`1 is i.i.d. with the distribution and independent of Z.

For example, one can deduce Theorem 86 in the exercise below from this statement rather directly.

Proof of Theorem 83. If we define t` = (Z, s` ) then (27.0.9) implies that (t` )`1 is exchangeable. Let

1 n 1 n
= lim t` = lim (Z,s` )
n n n n
`=1 `=1

be the empirical measure of this sequence. Obviously, = Z . Since, given , the sequence (Z, s` ) is i.i.d. from
, this means that, given , the sequence (s` ) is i.i.d. from . In other words, given Z and , the sequence (s` ) is i.i.d.
from . The statement then follows from the following simple lemma, which we will leave as an exercise. t
u

Lemma 58 Suppose that three random elements X,Y and Z take values on some complete separable metric spaces.
The conditional distribution of X given the pair (Y, Z) depends only on Y ( (Y )-measurable) if and only if X and Z
are independent conditionally on Y .

The Aldous-Hoover representation. We will now prove two analogues of de Finetti's theorem for two-dimensional arrays. Let us consider an infinite random array $s = (s_{\ell,\ell'})_{\ell,\ell'\ge 1}$. The array $s$ is called an exchangeable array if for any permutations $\pi$ and $\rho$ of finitely many indices we have equality in distribution,
$$\big(s_{\pi(\ell),\rho(\ell')}\big)_{\ell,\ell'\ge 1} \stackrel{d}{=} \big(s_{\ell,\ell'}\big)_{\ell,\ell'\ge 1}, \qquad (27.0.10)$$
in the sense that their finite-dimensional distributions are equal. Here is one natural example of an exchangeable array. Given a measurable function $\sigma: [0,1]^4 \to \mathbb{R}$ and sequences of i.i.d. random variables $w, (u_\ell), (v_{\ell'}), (x_{\ell,\ell'})$ that have the uniform distribution on $[0,1]$, the array
$$\big(\sigma(w, u_\ell, v_{\ell'}, x_{\ell,\ell'})\big)_{\ell,\ell'\ge 1} \qquad (27.0.11)$$
is, obviously, exchangeable. It turns out that all exchangeable arrays are of this form.

Theorem 84 (Aldous-Hoover) Any infinite exchangeable array $(s_{\ell,\ell'})_{\ell,\ell'\ge 1}$ is equal in distribution to (27.0.11) for some function $\sigma$.
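A minimal sketch of the natural example (27.0.11) (the particular function $\sigma$ below is an arbitrary illustrative choice): any measurable $\sigma$ applied to the four independent uniforms produces an exchangeable array, and the law of a finite block is invariant under permuting its rows and columns by construction.

import numpy as np

def sample_array_block(n, sigma, seed=0):
    # Top-left n x n block of the array sigma(w, u_l, v_l', x_{l,l'}).
    rng = np.random.default_rng(seed)
    w = rng.uniform()
    u = rng.uniform(size=n)          # row variables u_l
    v = rng.uniform(size=n)          # column variables v_l'
    x = rng.uniform(size=(n, n))     # individual noise x_{l,l'}
    return sigma(w, u[:, None], v[None, :], x)

# An arbitrary choice of sigma; exchangeability does not depend on this choice.
sigma = lambda w, u, v, x: np.sin(2 * np.pi * (w + u * v)) + 0.1 * x

print(np.round(sample_array_block(4, sigma), 3))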

Another version of the Aldous-Hoover representation holds in the symmetric case. A symmetric array s is called
weakly exchangeable if for any permutation of finitely many indices we have equality in distribution
 d 
s(`),(`0 ) `,`0 1
= s`,`0 `,`0 1
. (27.0.12)



Because of the symmetry, we can consider only half of the array indexed by (`, `0 ) such that ` `0 or, equivalently,
consider the array indexed by sets {`, `0 } and rewrite (27.0.12) as
 d 
s{(`),(`0 )} `,`0 1
= s{`,`0 } `,`0 1
. (27.0.13)

Notice that, compared to (27.0.10), the diagonal elements now play somewhat different roles from the rest of the array.
One natural example of a weakly exchangeable array is given by

s{`,`} = g(w, u` ) and s{`,`0 } = f w, u` , u`0 , x{`,`0 } for ` , `0 ,



(27.0.14)

for any measurable functions g : [0, 1]2 R and f : [0, 1]4 R, which is symmetric in its middle two coordinates
u` , u`0 , and i.i.d. random variables w, (u` ), (x{`,`0 } ) with the uniform distribution on [0, 1]. Again, it turns out that such
examples cover all possible weakly exchangeable arrays.

Theorem 85 (Aldous-Hoover) Any infinite weakly exchangeable array is equal in distribution to the array (27.0.14)
for some functions g and f , which is symmetric in its two middle coordinates.

After we give a proof of Theorem 85, we will leave a similar proof of Theorem 84 as an exercise (with some hints).
We will then give another, quite different, proof of Theorem 84. The proof of the Dovbysh-Sudakov representation in
the next section will be based on Theorem 84.
The most important way in which the exchangeability condition (27.0.12) will be used is to say that, for any
infinite subset I N,
 d 
s{`,`0 } `,`0 I = s{`,`0 } `,`0 1 . (27.0.15)
Again, one important consequence of this observation will be the following. Given j, j0 I such that j , j0 , let us now
define the -algebra
FI ( j, j0 ) = s{`,`0 } : `, `0 I, {`, `0 } , { j, j0 } .

(27.0.16)
In other words, this -algebra is generated by all elements s{`,`0 } with both indices ` and `0 in I, excluding s{ j, j0 } . The
following analogue of Lemma 57 holds.

Lemma 59 For any infinite subset I N and any j, j0 I such that j , j0 , the conditional expectations

E f (s{ j, j0 } ) FI ( j, j0 ) = E f (s{ j, j0 } ) FN ( j, j0 )
 

almost surely, for any bounded measurable function f : R R.

Proof. The proof is almost identical to the proof of Lemma 57. The property (27.0.15) implies the equality in distri-
bution,
 d
E f (s{ j, j0 } ) FI ( j, j0 ) = E f (s{ j, j0 } ) FN ( j, j0 ) ,


and, therefore, the equality of the L2 -norms,


E f (s{ j, j0 } ) FI ( j, j0 ) = E f (s{ j, j0 } ) FN ( j, j0 ) .
 
2 2

On the other hand, since FI ( j, j0 ) FN ( j, j0 ), as in (27.0.7), this implies that

E f (s{ j, j0 } ) FI ( j, j0 ) E f (s{ j, j0 } ) FN ( j, j0 ) 2 = 0,
 
2

which finishes the proof. t


u
Proof of Theorem 85. Let us take an infinite subset I N such that its complement N \ I is also infinite. By (27.0.15),
we only need to prove the representation (27.0.14) for (s{`,`0 } )`,`0 I . For each j I, let us consider the array
    
S j = s{`,`0 } `,`0 (N\I){ j} = s{ j, j} , s{ j,`} `N\I , s{`,`0 } `,`0 N\I . (27.0.17)

It is obvious that the weak exchangeability (27.0.13) implies that the sequence (S j ) jI is exchangeable, since any
permutation of finitely many indices from I in the array (27.0.13) results in the corresponding permutation of the



sequence (S j ) jI . We can view each array S j as an element of the Borel space [0, 1] and de Finettis Theorem 82
implies that
 d 
S j jI = X(w, u j ) jI (27.0.18)
for some measurable function X on [0, 1]2 taking values in the space of such arrays. Next, we will show that, condi-
tionally on the sequence (S j ) jI , the off-diagonal elements s{ j, j0 } for j, j0 I, j , j0 , are independent and, moreover,
the conditional distribution of s{ j, j0 } depends only on S j and S j0 . This means that if we consider the -algebras

F = (S j ) jI , F j, j0 = (S j , S j0 ) for j, j0 I, j , j0 ,

(27.0.19)
then we would like to show that, for any finite set C of indices {`, `0 } for `, `0 I such that ` , `0 and any bounded
measurable functions f`,`0 corresponding to the indices {`, `0 } C , we have
 
E f`,`0 (s{`,`0 } ) F = E f`,`0 (s{`,`0 } ) F`,`0 .

(27.0.20)

{`,`0 }C {`,`0 }C

Notice that the definitions (27.0.16) and (27.0.19) imply that F j, j0 = F(N\I){ j, j0 } ( j, j0 ), since all the elements s{`,`0 }
with both indices in (N\I){ j, j0 }, except for s{ j, j0 } , appear as one of the coordinates in the arrays S j or S j0 . Therefore,
by Lemma 59,
E f (s{ j, j0 } ) F j, j0 = E f (s{ j, j0 } ) FN ( j, j0 ) .
 
(27.0.21)
Let us fix any { j, j0 } C , let C 0 = C \ {{ j, j0 }} and consider an arbitrary set A F . Since IA and f`,`0 (s{`,`0 } ) for
{`, `0 } C 0 are FN ( j, j0 )-measurable,
EIA f`,`0 (s{`,`0 } ) f j, j0 (s{ j, j0 } )
{`,`0 }C 0
 
f`,`0 (s{`,`0 } )E f j, j0 (s{ j, j0 } ) FN ( j, j0 )

= E IA
{`,` }C 0
0
 
f`,`0 (s{`,`0 } )E f j, j0 (s{ j, j0 } ) F j, j0 ,

= E IA
{`,`0 }C 0

where the second equality follows from (27.0.21). Since $\mathcal{F}_{j,j'} \subseteq \mathcal{F}$, this implies that
   
f`,`0 (s{`,`0 } ) F E f j, j0 (s{ j, j0 } ) F j, j0

E f`,`0 (s{`,`0 } ) F = E

{`,`0 }C {`,`0 }C 0

and (27.0.20) follows by induction on the cardinality of C . By the argument in the Coding Lemma 56, (27.0.20)
implies that, conditionally on F , we can generate
 d 
s{ j, j0 } j, j0 I = h(S j , S j0 , x{ j, j0 } ) j, j0 I , (27.0.22)

for some measurable function h and i.i.d. uniform random variables x{ j, j0 } on [0, 1]. The reason why the function
h can be chosen to be the same for all { j, j0 } is because, by symmetry, the distribution of (S j , S j0 , s{ j, j0 } ) does not
depend on { j, j0 } and the conditional distribution of s{ j, j0 } given S j and S j0 does not depend on { j, j0 }. Also, the
arrays (S j , S j0 , s{ j, j0 } ) and (S j0 , S j , s{ j, j0 } ) are equal in distribution and, therefore, the function h is symmetric in the
coordinates S j and S j0 . Finally, let us recall (27.0.18) and define the function
 
f w, u j , u j0 , x{ j, j0 } = h X(w, u j ), X(w, u j0 ), x{ j, j0 } ,
which is, obviously, symmetric in u j and u j0 . Then, the equations (27.0.18) and (27.0.22) imply that
    
 d  
S j jI , s{ j, j0 } j, j0 I = X(w, u j ) jI , f w, u j , u j0 , x{ j, j0 } j, j0 I .

In particular, if we denote by g the first coordinate of the map X corresponding to the element s{ j, j} in the array S j in
(27.0.17), this proves that
   
  d  
s{ j, j} jI , s{ j, j0 } j, j0 I = g(w, u j ) jI , f w, u j , u j0 , x{ j, j0 } j, j0 I .



Recalling (27.0.15) finishes the proof of the representation (27.0.14). t
u

Proof of Theorem 84. One proof of Theorem 84 similar to the proof of Theorem 85 is sketched in an exercise below.
We will now give a different proof, and the main part of the proof will be based on the following observation. Suppose
that we have an exchangeable sequence of pairs (t` , s` )`1 with coordinates in complete separable metric spaces.
Given the sequence (t` )`1 , how can we generate the sequence of second coordinates (s` )`1 ? Consider the empirical
measures
1 n 1 n
= lim (t` ,s` ) , 1 = lim t` .
n n n n
`=1 `=1

Obviously, 1 is the marginal of on the first coordinate and, moreover, 1 is a measurable function of the sequence
(t` )`1 , so if we are given this sequence we automatically know 1 . The following holds.

Lemma 60 Given (t` )`1 , we can generate the sequence (s` )`1 in distribution as
 d 
s` `1
= f (1 ,t` , v, x` ) `1

for some measurable function f and i.i.d. uniform random variables v and (x` )`1 on [0, 1].

Proof. First, let us note how to generate the empirical measure given the sequence t = (t` )`1 . Given , the sequence
(t` , s` )`1 is i.i.d. from and, since 1 is the first marginal of , (t` )`1 are i.i.d. from 1 . This means that, if we
consider the triple (, 1 ,t) then the conditional distribution of t given (, 1 ) depends only on 1 ,

P t , 1 ) = P t 1 ).

By Lemma 58, t and are independent given 1 . Therefore, by Lemma 58,



P t, 1 ) = P 1 ).

On the other hand, 1 is a function of t, so P t, 1 ) = P t) and, thus,

P t) = P 1 ).

In other words, to generate given t, we can simply compute 1 and generate given 1 . By the Coding Lemma, we
can generate = g(1 , v) as a function of 1 and independent uniform random variable v on [0, 1].
Now, recall that, given , the sequence (t` , s` )`1 is i.i.d. with the distribution so, given t` and , we can
simply generate s` from the conditional distribution (s` |t` ) independently from each other. Again, using the
Coding Lemma, we can generate s` = h(,t` , x` ) as a function of i.i.d. x` uniform random variables on [0, 1]. Finally,
recalling that = g(1 , v), we can write

s` = h(g(1 , v),t` , x` ) = f (1 ,t` , v, x` ).

This finishes the proof. t


u

Proof of Theorem 84. Let us for convenience index the array s`,`0 by ` 1 and `0 Z instead of `0 1. Let us denote
by  
X`0 = s`,`0 `1 and X = X`0 `0 0
the `0 -th column and left half of this array. Since the sequence of columns (X`0 )`0 Z is exchangeable, we showed in
the proof of de Finettis theorem that, conditionally on X, the columns (X`0 )`0 1 in the right half of the array are i.i.d.
If we describe the distribution of one column X1 given X then we can generate all columns (X`0 )`0 1 independently
from this distribution. Therefore, our strategy will be to describe the distribution of X1 given X, and then combine it
with the structure of the distribution of X. Both steps will use exchangeability with respect to permutations of rows,
because so far we have only used exchangeability with respect to permutations of columns. Let

Y` = s`,`0 `0 0



be the elements in the `-th row of the left half X of the array. We want to describe the distribution of X1 = (s`,1 )`1
given X = (Y` )`1 and we will use the fact that the sequence (Y` , s`,1 )`1 is exchangeable. By Lemma 60, conditionally
on X = (Y` )`1 , (s`,1 )`1 can be generated as

s`,1 = f (1 ,Y` , v1 , x`,1 ),

where 1 is the empirical measure of (Y` ) and where instead of v and (x` ) we wrote v1 and (x`,1 ) to emphasize the
first column index 1. Since, conditionally on X, the columns (X`0 )`0 1 in the right half of the array are i.i.d., we can
generate
s`,`0 = f (1 ,Y` , v`0 , x`,`0 ),
where v`0 and x`,`0 are i.i.d. uniform random variables on [0, 1]. Finally, since (Y` )`1 are i.i.d. given the empirical
distribution 1 , we can generate 1 = h(w) as a function of a uniform random variable w on [0, 1] and then, using the
Coding Lemma, generate Y` = Y (1 , u` ) = Y (h(w), u` ) as a function of 1 and i.i.d. uniform random variables u` on
[0, 1]. Plugging these into f above gives s`,`0 = (w, u` , v`0 , x`,`0 ) for some function , which finishes the proof. t
u

Exercise. Prove Lemma 58.

Exercise. Give a similar proof of Theorem 84, using only global symmetry considerations. Hints: Since the row
and column indices play a different role in this case, the first step will be slightly different (the second step will be
essentially the same). One has to consider two sequences indexed by j I,
   
S1j = s`, j `N\I , s`,`0 `,`0 N\I and S2j = s j,` `N\I , s`,`0 `,`0 N\I .
   

These two sequences are separately exchangeable, which means that


   d  1  
1 2 2

S( j) jI , S( j) jI = S j jI , S j jI

for any permutations and of finitely many indices. Notice that these sequences are not independent. In this case
one needs to prove (as a part of the exercise) the following modification of de Finettis representation.

Theorem 86 If the sequences (s1` )`1 and (s2` )`1 are separately exchangeable then there exist measurable functions
g1 , g2 : [0, 1]2 R such that
    d   
s1` `1 , s2` `1 = g1 (w, u` ) `1 , g2 (w, v` ) `1

(27.0.23)

where w, (u` ) and (v` ) are i.i.d. random variables uniform on [0, 1].

Exercise. Consider a pair of random arrays

s1`,`0 and s2`,`0


 
`,`0 1 `,`0 1

that are separately exchangeable in the first coordinate and jointly exchangeable in the second coordinate, that is,
   
d
s11 (`),(`0 ) `,`0 1 , s22 (`),(`0 ) `,`0 1 = s1`,`0 `,`0 1 , s2`,`0 `,`0 1
   

for any permutations 1 , 2 , of finitely many coordinates. Show that there exist two functions 1 , 2 such that these
arrays can be generated in distribution by

s1`,`0 = 1 (w, u1` , v`0 , x`,`


1 2 2 2
0 ), s`,`0 = 2 (w, u` , v`0 , x`,`0 ),

where all the arguments are i.i.d. uniform random variables on [0, 1].



Section 28

The Dovbysh-Sudakov representation.

Let us consider an infinite symmetric random array $R = (R_{\ell,\ell'})_{\ell,\ell'\ge 1}$, which is weakly exchangeable in the sense defined in (27.0.12), i.e. for any permutation $\pi$ of finitely many indices we have equality in distribution
$$\big(R_{\pi(\ell),\pi(\ell')}\big)_{\ell,\ell'\ge 1} \stackrel{d}{=} \big(R_{\ell,\ell'}\big)_{\ell,\ell'\ge 1}. \qquad (28.0.1)$$
In addition, suppose that $R$ is positive definite with probability one, where by positive definite we will always mean non-negative definite. Such weakly exchangeable positive definite arrays are called Gram-de Finetti arrays. It turns out that all such arrays are generated essentially as the covariance matrix of an i.i.d. sample from a random measure on a Hilbert space. Let $H$ be the Hilbert space $L^2([0,1], dv)$, where $dv$ denotes the Lebesgue measure on $[0,1]$.

Theorem 87 (Dovbysh-Sudakov) There exists a random probability measure $\eta$ on $H \times \mathbb{R}_+$ such that the array $R = (R_{\ell,\ell'})_{\ell,\ell'\ge 1}$ is equal in distribution to
$$\big(h_\ell \cdot h_{\ell'} + a_\ell\, \delta_{\ell,\ell'}\big)_{\ell,\ell'\ge 1}, \qquad (28.0.2)$$
where, conditionally on $\eta$, $(h_\ell, a_\ell)_{\ell\ge 1}$ is a sequence of i.i.d. random variables with the distribution $\eta$ and $h \cdot h'$ denotes the scalar product on $H$.

In particular, the marginal $G$ of $\eta$ on $H$ can be used to generate the sequence $(h_\ell)$ and the off-diagonal elements of the array $R$.
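The representation (28.0.2) also suggests how to simulate a Gram-de Finetti array: draw a random measure, then an i.i.d. sample $(h_\ell, a_\ell)$ from it, and form the Gram matrix plus a diagonal. Below is a rough sketch in which the random measure is an arbitrary toy choice (a Gaussian cloud in $\mathbb{R}^3$ standing in for $H$, with $a_\ell = \|h_\ell\|$); the resulting finite block is weakly exchangeable and non-negative definite by construction.

import numpy as np

def gram_de_finetti_block(n, d=3, seed=0):
    # Conditionally on a random center m, sample h_l ~ N(m, I_d) and set a_l = |h_l|.
    rng = np.random.default_rng(seed)
    m = rng.normal(size=d)                 # the randomness of the measure
    h = m + rng.normal(size=(n, d))        # i.i.d. sample h_1, ..., h_n given m
    a = np.linalg.norm(h, axis=1)          # nonnegative diagonal terms a_l
    return h @ h.T + np.diag(a)

R = gram_de_finetti_block(6)
print(np.min(np.linalg.eigvalsh(R)) >= -1e-10)   # non-negative definite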
Proof of Theorem 87. Since the array R is positive definite, conditionally on R, we can generate a Gaussian vector g
in RN with the covariance equal to R. Now, also conditionally on R, let (gi )i1 be independent copies of g. If, for each
i 1, we denote the coordinates of gi by g`,i for ` 1 then, since the array R = (R`,`0 )`,`0 1 is weakly exchangeable,
it should be obvious that the array (g`,i )`,i1 is exchangeable in the sense of (27.0.10). By Theorem 84, there exists a
measurable function : [0, 1]4 R such that
 d 
g`,i `,i1 = (w, u` , vi , x`,i ) `,i1 , (28.0.3)

where w, (u` ), (vi ), (x`,i ) are i.i.d. random variables with the uniform distribution on [0, 1]. By the strong law of large
numbers (applied conditionally on R), for any `, `0 1,
1 n
g`,i g`0 ,i R`,`0
n i=1
almost surely as n . Similarly, by the strong law of large numbers (now applied conditionally on w and (u` )`1 ),
1 n
(w, u` , vi , x`,i ) (w, u`0 , vi , x`0 ,i ) E0 (w, u` , v1 , x`,1 ) (w, u`0 , v1 , x`0 ,1 )
n i=1

almost surely, where E0 denotes the expectation with respect to the random variables v1 and (x`,1 )`1 . Therefore,
(28.0.3) implies that
d
R`,`0 `,`0 1 = E0 (w, u` , v1 , x`,1 ) (w, u`0 , v1 , x`0 ,1 ) `,`0 1 .
 
(28.0.4)



If we denote Z Z
(1) (2)
(w, u, v) = (w, u, v, x) dx, (w, u, v) = (w, u, v, x)2 dx,

then the off-diagonal and diagonal elements on the right hand side of (28.0.4) are given by
Z Z
(1) (w, u` , v) (1) (w, u`0 , v) dv and (2) (w, u` , v) dv

correspondingly. Notice that, for almost all w and u, the function v (1) (w, u, v) is in H = L2 ([0, 1], dv), since
Z Z
(1)
(w, u, v) dv 2
(2) (w, u, v) dv

and, by (28.0.4), the right hand side is equal in distribution to R1,1 . Therefore, if we denote
Z
h` = (1) (w, u` , ), a` = (2) (w, u` , v) dv h` h` ,

then (28.0.4) becomes


 d 
R`,`0 `,`0 1
= h` h`0 + a` `,`0 `,`0 1
. (28.0.5)
It remains to observe that (h` , a` )`1 is an i.i.d. sequence from the random measure on H R+ given by the image
of the Lebesgue measure du on [0, 1] by the map
 Z Z 
u (w, u, ), (w, u, v) dv (1) (w, u, v) (1) (w, u, v) dv .
(1) (2)

This finishes the proof. t


u
Let us now suppose that there is equality instead of equality in distribution in (28.0.2), i.e. the array R is generated
by an i.i.d. sample (h` , a` ) from the random measure . To complete the picture, let us show that one can reconstruct
, up to an orthogonal transformation of its marginal on H, as a measurable function of the array R itself, with values
in the set P(H R+ ) of all probability measures on H R+ equipped with the topology of weak convergence.

Lemma 61 There exists a measurable function 0 = 0 (R) of the array (R`,`0 )`,`0 1 with values in P(H R+ ) such
that 0 = (U, id)1 almost surely for some orthogonal operator U on H that depends on the sequence (h` )`1 .

Proof. Let us begin by showing that the norms kh` k can be reconstructed almost surely from the array R. Consider a
sequence (g` ) on H such that g` g`0 = R`,`0 for all `, `0 1. In other words, kg` k2 = kh` k2 + a` and g` g`0 = h` h`0 for

all ` , `0 . Without loss of generality, let us assume that g` = h` + a` e` , where (e` )`1 is an orthonormal sequence
orthogonal to the closed span of (h` ). If necessary, we identify H with H H to choose such a sequence (e` ). Since
(h` ) is an i.i.d. sequence from the marginal G of the measure on H, with probability one, there are elements in the
sequence (h` )`2 arbitrarily close to h1 and, therefore, the length of the orthogonal projection of h1 onto the closed
span of (h` )`2 is equal to kh1 k. As a result, the length of the orthogonal projection of g1 onto the closed span of
(g` )`2 is also equal to kh1 k, and it is obvious that this length is a measurable function of the array g` g`0 = R`,`0 .
Similarly, we can reconstruct all the norms kh` k as measurable functions of the array R and, thus, all a` = R`,` kh` k2 .
Therefore, 
h` h`0 `,`0 1 and (a` )`1 (28.0.6)
are both measurable functions of the array R. Given the matrix (h` h`0 ), we can find a sequence (x` ) in H isometric to
(h` ), for example, by choosing x` to be in the span of the first ` elements of some fixed orthonormal basis. This means
that all x` are measurable functions of R and that there exists an orthogonal operator U = U((h` )`1 ) on H such that

x` = Uh` for ` 1. (28.0.7)

Since (h` , a` )`1 is an i.i.d. sequence from distribution , by the strong law of large numbers for empirical measures
(Varadarajans theorem in Section 17),
1
(h` ,a` ) weakly
n 1`n



almost surely and, therefore, (28.0.7) implies that
1
(x` ,a` ) (U, id)1 weakly
n 1`n
(28.0.8)

almost surely. The left hand side is, obviously, a measurable function of the array R in the space of all probability
measures on H R+ equipped with the topology of weak convergence and, therefore, as a limit, 0 = (U, id)1 is
also a measurable function of R. This finishes the proof. t
u

Exercise. Consider a sequence of pairs (G1N , G2N )N1 of random probability measures on {1, +1}N that are not
necessarily independent. Let ( ` )`0 be an i.i.d. sample of replicas from G1N and ( ` )`1 be an i.i.d. sample of replicas
from G2N (for convenience of notation, we denote both by ` but use different sets of indices for `). Let RN =
(RN`,`0 )`,`0 Z be the array of the so-called overlaps

1 N ` `0
RN`,`0 = i i .
N i=1

Suppose that RN converges in distributions to an array R = (R`,`0 )`,`0 Z . Notice that this array is weakly exchangeable,
 d 
R(`),(`0 ) `,`0 Z
= R`,`0 `,`0 Z
,

but not under all permutations of integers Z, but only those permutations that map positive integers into positive and
non-positive into non-positive. Prove that there exists a pair of random measures G1 and G2 on a separable Hilbert
space H (not necessarily independent) such that
 d 
R`,`0 `,`0 Z
= h` h`0 `,`0 Z
,

where (h` )`0 is an i.i.d. sample from G1 and (h` )`1 is an i.i.d. sample from G2 .



Section 29

Poisson processes.

In this section we will introduce and review several important properties of Poisson processes on a measurable space
(S, S ). In applications, (S, S ) is usually some nice space, such as a Euclidean space Rn with the Borel -algebra.
However, in general, it is enough to require that the diagonal {(s, s0 ) : s = s0 } is measurable on the product space S S
which, in particular, implies that every singleton set {s} in S is measurable as a section of the diagonal. This condition
is needed to be able to write P(X = Y ) for a pair (X,Y ) of random variables defined on the product space. From now
on, every time we consider a measurable space we will assume that it satisfies this condition. Let us also notice that
a product S S0 of two such spaces will also satisfy this condition since {(s1 , s01 ) = (s2 , s02 )} = {s1 = s2 } {s01 = s02 }.
Let $\mu$ and $\mu_n$ for $n \ge 1$ be some non-atomic (not having any atoms) measures on $S$ such that
$$\mu = \sum_{n\ge 1}\mu_n, \qquad \mu_n(S) < \infty. \qquad (29.0.1)$$
For each $n \ge 1$, let $N_n$ be a random variable with the Poisson distribution $\pi(\mu_n(S))$ with the mean $\mu_n(S)$ and let $(X_{nl})_{l\ge 1}$ be i.i.d. random variables, also independent of $N_n$, with the distribution
$$p_n(B) = \frac{\mu_n(B)}{\mu_n(S)}. \qquad (29.0.2)$$

We assume that all these random variables are independent for different $n \ge 1$. The condition that $\mu$ is non-atomic implies that $P(X_{nl} = X_{mj}) = 0$ if $n \ne m$ or $l \ne j$. Let us consider the random sets
$$\Pi_n = \big\{X_{n1}, \ldots, X_{nN_n}\big\} \quad\text{and}\quad \Pi = \bigcup_{n\ge 1}\Pi_n. \qquad (29.0.3)$$
The set $\Pi$ will be called a Poisson process on $S$ with the mean measure $\mu$. Let us point out a simple observation that will be used many times below: if we are given the means $\mu_n(S)$ and the distributions (29.0.2) that were used to generate the set (29.0.3), then the measure $\mu$ in (29.0.1) can be written as
$$\mu = \sum_{n\ge 1}\mu_n(S)\, p_n. \qquad (29.0.4)$$

We will show in Theorem 91 below that when the measure is -finite then, in some sense, this definition of a Poisson
process is independent of the particular representation (29.0.1). However, for several reasons it is convenient to think
of the above construction as the definition of a Poisson process. First of all, it allows us to avoid any discussion about
what a random set means and, moreover, many important properties of Poisson processes follow from it rather
directly. In any case, we will show in Theorem 92 below that all such processes satisfy the traditional definition of a
Poisson process.
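For a finite mean measure the construction (29.0.3) is literally an algorithm: draw a Poisson number of points, then place them i.i.d. from the normalized measure. The sketch below (an arbitrary toy example with the mean measure equal to a constant multiple of Lebesgue measure on the unit square) also checks that the number of points landing in a fixed rectangle has mean and variance close to the measure of that rectangle, in line with property (i) below.

import numpy as np

def poisson_process_unit_square(lam, rng):
    # Mean measure mu = lam * Lebesgue on [0,1]^2, so mu(S) = lam and p is uniform.
    n_points = rng.poisson(lam)
    return rng.uniform(size=(n_points, 2))

rng = np.random.default_rng(0)
lam = 50.0
x0, x1, y0, y1 = 0.2, 0.5, 0.1, 0.7          # a fixed rectangle A
counts = []
for _ in range(2000):
    pts = poisson_process_unit_square(lam, rng)
    in_A = (pts[:, 0] > x0) & (pts[:, 0] < x1) & (pts[:, 1] > y0) & (pts[:, 1] < y1)
    counts.append(in_A.sum())
counts = np.array(counts)
mu_A = lam * (x1 - x0) * (y1 - y0)
print(counts.mean(), counts.var(), mu_A)     # mean and variance both close to mu(A)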
One important immediate consequence of the definition in (29.0.3) is the Mapping Theorem. Given a Poisson
process on S with the mean measure and a measurable map f : S S0 into another measurable space (S0 , S 0 ), let
us consider the image set [ 
f () = f (Xn1 ), . . . , f (XnNn ) .
n1



Since random variables f (Xnl ) have the distribution pn f 1 on S 0 , the set f () resembles the definition (29.0.3)
corresponding to the measure
n (S) pn f 1 = n f 1 = f 1 .
n1 n1
Therefore, to conclude that f () is a Poisson process on S0 we only need to require that this image measure is non-
atomic.
Theorem 88 (Mapping Theorem) If is a Poisson process on S with the mean measure and f : S S0 is such
that f 1 is non-atomic then f () is a Poisson process on S0 with the mean measure f 1 .
Next, let us consider a sequence (m ) of measures that satisfy (29.0.1), m = n1 mn and mn (S) < . Let m =
n1 mn be a Poisson process with the mean measure m defined as in (29.0.3) and suppose that all these Poisson
processes are generated independently over m 1. Since
:= m = mn = m(l)n(l)
m1 m1 n1 l1

and := m1 m = l1 m(l)n(l) for any enumeration of the pairs (m, n) by the indices l N, we immediately get
the following.
Theorem 89 (Superposition Theorem) If m for m 1 are independent Poisson processes with the mean measures
m then their superposition = m1 m is a Poisson process with the mean measure = m1 m .
Another important property of Poisson processes is the Marking Theorem. Consider another measurable space (S0 , S 0 )
and let K : S S 0 [0, 1] be a transition function (a probability kernel), which means that for each s S, K(s, ) is
a probability measure on S 0 and for each A S 0 , K(, A) is a measurable function on (S, S ). For each point s ,
let us generate a point m(s) S0 from the distribution K(s, ), independently for different points s. The point m(s) is
called a marking of s and a random subset of S S0 ,
= (s, m(s)) : s ,

(29.0.5)
is called a marked Poisson process. In other words, = n1 n , where
n = (Xn1 , m(Xn1 )), . . . , (XnNn , m(XnNn )) ,


and each point (Xnl , m(Xnl )) S S0 is generated according to the distribution


"
pn (C) = K(s, ds0 ) pn (ds).
C

Since $p_n^*$ is obviously non-atomic, by definition, this means that $\Pi^*$ is a Poisson process on $S \times S'$ with the mean measure
\[
\sum_{n \geq 1} \mu_n(S)\, p_n^*(C) = \sum_{n \geq 1} \iint_C K(s, ds')\, \mu_n(ds) = \iint_C K(s, ds')\, \mu(ds).
\]
Therefore, the following holds.
Theorem 90 (Marking Theorem) The random subset $\Pi^*$ in (29.0.5) is a Poisson process on $S \times S'$ with the mean measure
\[
\mu^*(C) = \iint_C K(s, ds')\, \mu(ds). \tag{29.0.6}
\]
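A sketch of the marking construction in the same illustrative setting, with an assumed Gaussian kernel $K(s, \cdot) = N(s, 0.1^2)$ on $S' = \mathbb{R}$:

```python
# Each point s of Pi receives a mark m(s) ~ K(s, .) = N(s, 0.1^2), independently across points.
marks = rng.normal(loc=Pi, scale=0.1)
# The marked process: points (s, m(s)) in S x S'; by the Marking Theorem it is Poisson
# with the mean measure (29.0.6).
Pi_marked = np.column_stack([Pi, marks])
```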

The above results give us several ways to generate a new Poisson process from an old one. However, in each case, the new process is generated in a way that depends on a particular representation of its mean measure. We will now show that, when the measure $\mu$ is $\sigma$-finite, the random set $\Pi$ can be generated in a way that, in some sense, does not depend on the particular representation (29.0.1). This point is very important because, often, given a Poisson process with one representation of its mean measure, we study its properties using another, more convenient, representation. Suppose that $S$ is equal to a disjoint union $\bigcup_{m \geq 1} S_m$ of sets such that $0 < \mu(S_m) < \infty$, in which case
\[
\mu = \sum_{m \geq 1} \mu|_{S_m} \tag{29.0.7}
\]
is another representation of the type (29.0.1), where $\mu|_{S_m}$ is the restriction of $\mu$ to the set $S_m$. The following holds.


Theorem 91 (Equivalence Theorem) The process $\Pi$ in (29.0.3) generated according to the representation (29.0.1) is statistically indistinguishable from a process generated using the representation (29.0.7).

More precisely, we will show the following. Let us denote
\[
p_{S_m}(B) = \frac{\mu|_{S_m}(B)}{\mu|_{S_m}(S)} = \frac{\mu(B \cap S_m)}{\mu(S_m)}. \tag{29.0.8}
\]
We will show that, given the Poisson process $\Pi$ in (29.0.3) generated according to the representation (29.0.1), the cardinalities $N(S_m) = |\Pi \cap S_m|$ are independent random variables with the distributions $\pi(\mu(S_m))$ and, conditionally on $(N(S_m))_{m \geq 1}$, each set $\Pi \cap S_m$ looks like an i.i.d. sample of size $N(S_m)$ from the distribution $p_{S_m}$. This means that, if we arrange the points in $\Pi \cap S_m$ in a random order, the resulting vector has the same distribution as an i.i.d. sample from the measure $p_{S_m}$. Moreover, conditionally on $(N(S_m))_{m \geq 1}$, these vectors are independent over $m \geq 1$. This is exactly how one would generate a Poisson process using the representation (29.0.7). The proof of Theorem 91 will be based on the following two properties that usually appear as the definition of a Poisson process on $S$ with the mean measure $\mu$:

(i) for any $A \in \mathcal{S}$, the cardinality $N(A) = |\Pi \cap A|$ has the Poisson distribution $\pi(\mu(A))$ with the mean $\mu(A)$;
(ii) the cardinalities $N(A_1), \ldots, N(A_k)$ are independent, for any $k \geq 1$ and any disjoint sets $A_1, \ldots, A_k \in \mathcal{S}$.

When $\mu(A) = \infty$, it is understood that $\Pi \cap A$ is countably infinite. Usually, one starts with (i) and (ii) as the definition of a Poisson process, and the set $\Pi$ constructed in (29.0.3) is used to demonstrate the existence of such processes, which explains the name of the following theorem.

Theorem 92 (Existence Theorem) The process $\Pi$ in (29.0.3) satisfies properties (i) and (ii).

Proof. Given $x \geq 0$, let us denote the weights of the Poisson distribution $\pi(x)$ with the mean $x$ by
\[
\pi_j(x) = \pi(x)\bigl(\{j\}\bigr) = e^{-x} \frac{x^j}{j!}.
\]

Consider disjoint sets $A_1, \ldots, A_k$ and let $A_0 = (\bigcup_{i \leq k} A_i)^c$ be the complement of their union. Fix $m_1, \ldots, m_k \geq 0$ and let $m = m_1 + \cdots + m_k$. Given any set $A$, denote $N_n(A) = |\Pi_n \cap A|$. With this notation, let us compute the probability of the event
\[
\Gamma = \bigl\{ N_n(A_1) = m_1, \ldots, N_n(A_k) = m_k \bigr\}.
\]
Recall that the random variables $(X_{nl})_{l \geq 1}$ are i.i.d. with the distribution $p_n$ defined in (29.0.2) and, therefore, conditionally on $N_n$ in (29.0.3), the cardinalities $(N_n(A_l))_{0 \leq l \leq k}$ have a multinomial distribution and we can write
\[
\begin{aligned}
P(\Gamma) &= \sum_{j \geq 0} P\bigl(\Gamma \,\big|\, N_n = m + j\bigr)\, P\bigl(N_n = m + j\bigr) \\
&= \sum_{j \geq 0} P\bigl(N_n(A_0) = j, \Gamma \,\big|\, N_n = m + j\bigr)\, \pi_{m+j}\bigl(\mu_n(S)\bigr) \\
&= \sum_{j \geq 0} \frac{(m+j)!}{j!\, m_1! \cdots m_k!}\, p_n(A_0)^j p_n(A_1)^{m_1} \cdots p_n(A_k)^{m_k}\, e^{-\mu_n(S)} \frac{\mu_n(S)^{m+j}}{(m+j)!} \\
&= \sum_{j \geq 0} \frac{1}{j!\, m_1! \cdots m_k!}\, \mu_n(A_0)^j \mu_n(A_1)^{m_1} \cdots \mu_n(A_k)^{m_k}\, e^{-\mu_n(S)} \\
&= \frac{\mu_n(A_1)^{m_1}}{m_1!} e^{-\mu_n(A_1)} \cdots \frac{\mu_n(A_k)^{m_k}}{m_k!} e^{-\mu_n(A_k)}.
\end{aligned}
\]
This means that the cardinalities $N_n(A_l)$ for $1 \leq l \leq k$ are independent random variables with the distributions $\pi(\mu_n(A_l))$. Since all measures $p_n$ are non-atomic, $P(X_{nj} = X_{mk}) = 0$ for any $(n, j) \neq (m, k)$. Therefore, the sets $\Pi_n$ are all disjoint with probability one and
\[
N(A) = |\Pi \cap A| = \sum_{n \geq 1} |\Pi_n \cap A| = \sum_{n \geq 1} N_n(A).
\]


First of all, the cardinalities $N(A_1), \ldots, N(A_k)$ are independent, since we showed that $N_n(A_1), \ldots, N_n(A_k)$ are independent for each $n \geq 1$, and it remains to show that $N(A)$ has the Poisson distribution with the mean $\mu(A) = \sum_{n \geq 1} \mu_n(A)$. The partial sum $S_m = \sum_{n \leq m} N_n(A)$ has the Poisson distribution with the mean $\sum_{n \leq m} \mu_n(A)$ and, since $S_m \uparrow N(A)$, for any integer $r \geq 0$, the probability $P(N(A) \leq r)$ equals
\[
\lim_{m \to \infty} P(S_m \leq r) = \lim_{m \to \infty} \sum_{j \leq r} \pi_j\Bigl( \sum_{n \leq m} \mu_n(A) \Bigr) = \sum_{j \leq r} \pi_j\bigl( \mu(A) \bigr).
\]
If $\mu(A) < \infty$, this shows that $N(A)$ has the distribution $\pi(\mu(A))$. If $\mu(A) = \infty$ then $P(N(A) \leq r) = 0$ for all $r \geq 0$, which implies that $N(A)$ is countably infinite. $\square$
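As a sanity check of properties (i) and (ii), one can estimate the counts on disjoint sets by Monte Carlo in the illustrative setting introduced after (29.0.4) (assumed names; here $A_1 = [0, 0.3)$ and $A_2 = [0.3, 0.6)$):

```python
# Repeatedly generate the process and record the counts N(A_1), N(A_2).
counts = np.array([
    [np.sum((P >= 0.0) & (P < 0.3)), np.sum((P >= 0.3) & (P < 0.6))]
    for P in (sample_poisson_process(masses, samplers, rng) for _ in range(10000))
])
print(counts.mean(axis=0))     # approximately (mu(A_1), mu(A_2))
print(counts.var(axis=0))      # approximately the same values (Poisson: mean = variance)
print(np.cov(counts.T)[0, 1])  # approximately 0, consistent with independence in (ii)
```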
Proof of Theorem 91. First, suppose that $\mu(S) < \infty$ and that the Poisson process $\Pi$ was generated as in (29.0.3) using some sequence of measures $(\mu_n)$. By Theorem 92, the cardinality $N = N(S)$ has the distribution $\pi(\mu(S))$. Let us show that, conditionally on the event $\{N = n\}$, the set $\Pi$ looks like an i.i.d. sample of size $n$ from the distribution $p(\cdot) = \mu(\cdot)/\mu(S)$ or, more precisely, if we randomly assign labels $\{1, \ldots, n\}$ to the points in $\Pi$, the resulting vector $(X_1, \ldots, X_n)$ has the same distribution as an i.i.d. sample from $p$. Let us consider some measurable sets $B_1, \ldots, B_n \in \mathcal{S}$ and compute
\[
P_n\bigl( X_1 \in B_1, \ldots, X_n \in B_n \bigr) = P\bigl( X_1 \in B_1, \ldots, X_n \in B_n \,\big|\, N = n \bigr), \tag{29.0.9}
\]

where we denoted by Pn the conditional probability given the event {N = n}. To use Theorem 92, we will reduce this
case to the case of disjoint sets as follows. Let A1 , . . . , Ak be the partition of the space S generated by the sets B1 , . . . , Bn ,
obtained by taking intersections of the sets Bl or their complements over l n. Then, for each l n, we can write

I(x Bl ) = l j I(x A j ),
jk

where $\delta_{lj} = 1$ if $B_l \supseteq A_j$ and $\delta_{lj} = 0$ otherwise. In terms of these sets, (29.0.9) can be rewritten as
\[
\begin{aligned}
P_n\bigl( X_1 \in B_1, \ldots, X_n \in B_n \bigr) &= E_n \prod_{l \leq n} \sum_{j \leq k} \delta_{lj}\, I(X_l \in A_j) \\
&= \sum_{j_1, \ldots, j_n \leq k} \delta_{1 j_1} \cdots \delta_{n j_n}\, P_n\bigl( X_1 \in A_{j_1}, \ldots, X_n \in A_{j_n} \bigr).
\end{aligned} \tag{29.0.10}
\]

If we fix $j_1, \ldots, j_n$ and for $j \leq k$ let $I_j = \{ l \leq n : j_l = j \}$ then
\[
P_n\bigl( X_1 \in A_{j_1}, \ldots, X_n \in A_{j_n} \bigr) = P_n\bigl( X_l \in A_j \text{ for all } j \leq k,\ l \in I_j \bigr).
\]
If $n_j = |I_j|$ then the last event can be expressed in words by saying that, for each $j \leq k$, we observe $n_j$ points of the random set $\Pi$ in the set $A_j$ and then assign the labels in $I_j$ to the points in $\Pi \cap A_j$. By Theorem 92, the probability to observe $n_j$ points in each set $A_j$, given that $N = n$, equals

\[
P_n\bigl( N(A_j) = n_j,\ j \leq k \bigr) = \frac{P\bigl( N(A_j) = n_j,\ j \leq k \bigr)}{P(N = n)}
= \prod_{j \leq k} \frac{\mu(A_j)^{n_j}}{n_j!} e^{-\mu(A_j)} \Big/ \frac{\mu(S)^n}{n!} e^{-\mu(S)},
\]
while the probability to randomly assign the labels in $I_j$ to the points in $\Pi \cap A_j$ for all $j \leq k$ is equal to $\prod_{j \leq k} n_j! / n!$.
Therefore,
\[
P_n\bigl( X_1 \in A_{j_1}, \ldots, X_n \in A_{j_n} \bigr) = \prod_{j \leq k} \Bigl( \frac{\mu(A_j)}{\mu(S)} \Bigr)^{n_j} = \prod_{l \leq n} p(A_{j_l}). \tag{29.0.11}
\]

If we plug this into (29.0.10), we get
\[
P_n\bigl( X_1 \in B_1, \ldots, X_n \in B_n \bigr) = \sum_{j_1 \leq k} \delta_{1 j_1} p(A_{j_1}) \cdots \sum_{j_n \leq k} \delta_{n j_n} p(A_{j_n}) = p(B_1) \cdots p(B_n).
\]


Therefore, conditionally on $\{N = n\}$, the random variables $X_1, \ldots, X_n$ are i.i.d. with the distribution $p$.

Now, suppose that $\mu$ is $\sigma$-finite and (29.0.7) holds. Consider the random sets in (29.0.3) and define $N(S_m) = |\Pi \cap S_m|$. By Theorem 92, the random variables $(N(S_m))_{m \geq 1}$ are independent and have the Poisson distributions $\pi(\mu(S_m))$, and we would like to show that, conditionally on $(N(S_m))_{m \geq 1}$, each set $\Pi \cap S_m$ can be generated as a sample of size $N(S_m)$ from the distribution $p_{S_m} = \mu|_{S_m}(\cdot)/\mu(S_m)$, independently over $m$. This can be done by considering finitely many $m$ at a time and using exactly the same computation leading to (29.0.11) based on the properties proved in Theorem 92, with each subset $A \subseteq S_m$ producing a factor $p_{S_m}(A)$. $\square$


Bibliography

These lecture notes were mostly influenced by [2] and [1].

[1] Borovkov, A. A.: Probability Theory. Universitext, Springer-Verlag, London (2013)


[2] Dudley, R. M.: Real Analysis and Probability. Cambridge Studies in Advanced Mathematics, 74. Cambridge
University Press, Cambridge (2002)

[3] Feller, W.: An Introduction to Probability Theory and Its Applications. John Wiley & Sons, Inc., New York-
London-Sydney (1966)
[4] Kallenberg, O.: Foundations of Modern Probability. Probability and its Applications, Springer-Verlag, New York
(1997)

[5] Kingman, J. F. C.: Poisson Processes. Oxford University Press, New York (1993)
[6] Ledoux, M.: The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs, 89. American
Mathematical Society, Providence, RI (2001)
[7] Panchenko, D.: The Sherrington-Kirkpatrick Model. Springer Monographs in Mathematics. Springer-Verlag,
New York (2014)
[8] Resnick, S. I.: A Probability Path. Modern Birkhäuser Classics, Springer Science & Business Media (2014)
[9] Villani, C.: Topics in Optimal Transportation. Graduate Studies in Mathematics, 58. American Mathematical Society, Providence, RI (2003)
