
Mathematics of Operations Research:

Foundations of probability
Jim Dai
October 21, 2009, December 7, 2010, October 13, 2011
Ton Dieker
November 27, 2012, November 14, 2013
1 Introduction
As an undergraduate student, you summarized probabilistic information through functions on $\mathbb{R}$ such as probability mass functions or probability density functions. This machinery is insufficient for advanced topics such as stochastic processes. For instance, since a stochastic process is a random function, it is not possible to define a notion of probability density function the way you have seen it done for random variables. The key idea of measure theory is to interpret a probability function $P$ as a function on sets, i.e., a measure. Note the similarity in notation between $P(A)$ (the probability of the event $A$) and $f(x)$ (the value of $f$ evaluated at $x$). Measure theory is the study of functions on sets (events).

You can think of measure theory as a language. It's often easier to say something in this language, so you may encounter it in talks, research papers, etc. Knowledge of the language basics will help you understand and parse research in related areas, including (stochastic) operations research.
2 σ-fields

Let $\Omega$ be a set, called the sample space. A collection of subsets of $\Omega$ is called a class.

Definition 2.1. Let $\Omega$ be a set and $\mathcal{F}$ be a class of subsets of $\Omega$. The class $\mathcal{F}$ is said to be a σ-field (or σ-algebra) if

1. $\Omega \in \mathcal{F}$;
2. for each $A \in \mathcal{F}$, one has $A^c \in \mathcal{F}$;
3. for each (countable) sequence $\{A_i\}$ with $A_i \in \mathcal{F}$ for $i = 1, 2, \ldots$, one has $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.

Note that, as a consequence of this definition, the third requirement must also hold for countable intersections instead of unions.

Example 2.1. Suppose that $\Omega = \{\omega_1, \omega_2, \omega_3\}$. Refer to Figure 1. Consider the value of a stock. Currently, i.e., at time $t = 0$, it has some known value. Suppose that by time $t = 1$, the stock has either moved up or down. If it moves up, then there are two possible outcomes $\omega_1$ or $\omega_2$ at time $t = 2$. If it moves down, it stays down at time $t = 2$ (outcome $\omega_3$). Note that these $\omega_i$ are defined as outcomes with respect to time $t = 2$.

Figure 1: Illustration of the connection between the notion of information and σ-fields.

The following three collections are σ-fields on $\Omega$:

$\mathcal{F}_0 = \{\emptyset, \Omega\}$; this corresponds to the information that is known at time $t = 0$ (we cannot distinguish any of the $\omega_i$).

$\mathcal{F}_1 = \{\emptyset, \Omega, \{\omega_1, \omega_2\}, \{\omega_3\}\}$; this corresponds to the information available at time $t = 1$ (we cannot distinguish $\omega_1$ and $\omega_2$).

$\mathcal{F}_2 = \{\emptyset, \Omega, \{\omega_1\}, \{\omega_2\}, \{\omega_3\}, \{\omega_1, \omega_2\}, \{\omega_1, \omega_3\}, \{\omega_2, \omega_3\}\}$; this corresponds to the information available at time $t = 2$ (we can distinguish all outcomes).

Note that $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \mathcal{F}_2$. This is a simple example of a filtration, which is closely related to history or information, as can be seen through the above example. The concept of filtrations is often used in finance.
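To make the example concrete, here is a small Python sketch (ours, not part of the original notes) that represents each $\mathcal{F}_t$ as a set of frozensets, checks the axioms of Definition 2.1, and verifies the filtration property. On a finite sample space, countable unions reduce to finite ones, so checking pairwise unions suffices.

```python
from itertools import chain, combinations

OMEGA = frozenset({"w1", "w2", "w3"})

def is_sigma_field(F, omega=OMEGA):
    """Check Definition 2.1 on a finite sample space."""
    if omega not in F:
        return False
    if any(omega - A not in F for A in F):        # closed under complements
        return False
    return all(A | B in F for A in F for B in F)  # closed under (pairwise) unions

F0 = {frozenset(), OMEGA}
F1 = F0 | {frozenset({"w1", "w2"}), frozenset({"w3"})}
F2 = {frozenset(s) for s in chain.from_iterable(   # the power set of OMEGA
    combinations(OMEGA, r) for r in range(len(OMEGA) + 1))}

assert all(map(is_sigma_field, (F0, F1, F2)))
assert F0 <= F1 <= F2   # the filtration property
print("F0, F1, F2 are sigma-fields with F0 <= F1 <= F2")
```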
Exercise 1. Show that the intersection of a family of σ-fields on $\Omega$ is a σ-field on $\Omega$.

We use $\mathbb{R} = (-\infty, \infty)$ to denote the real line.

Definition 2.2. The Borel σ-field $\mathcal{B} = \mathcal{B}(\mathbb{R})$ on $\mathbb{R}$ is defined to be the smallest σ-field that contains all open intervals $(a, b)$, $a, b \in \mathbb{R}$ and $a < b$.

Let $\mathcal{T}$ be a class of subsets of $\Omega$. We use $\sigma(\mathcal{T})$ to denote the smallest σ-field that contains $\mathcal{T}$, which is defined as the intersection of all σ-fields containing $\mathcal{T}$ (it is readily seen that this must be a σ-field). We say that $\sigma(\mathcal{T})$ is the σ-field generated by $\mathcal{T}$. Clearly, when $\mathcal{T}$ itself is a σ-field, $\sigma(\mathcal{T}) = \mathcal{T}$. According to the definition, $\mathcal{B} = \sigma(\mathcal{I}_1)$, where
\[ \mathcal{I}_1 = \{(a, b) : a, b \in \mathbb{R},\ a \le b\}. \]
Here we interpret intervals of the type $(a, a)$ as $\emptyset$, and similarly for half-open intervals.

Exercise 2. Prove that the Borel σ-field is also the smallest σ-field that contains all intervals $(a, b]$ of $\mathbb{R}$ with $a < b$, i.e., $\mathcal{B} = \sigma(\mathcal{I}_2)$, where $\mathcal{I}_2 = \{(a, b] : a, b \in \mathbb{R},\ a \le b\}$.

Sets $B$ with $B \in \mathcal{B}$ are called Borel measurable.

2.1 The π-λ theorem

Definition 2.3. A class $\mathcal{P}$ of subsets of $\Omega$ is a π-system if it is closed under the formation of finite intersections:
\[ A, B \in \mathcal{P} \text{ implies } A \cap B \in \mathcal{P}. \]

Definition 2.4. A class $\mathcal{L}$ is a λ-system if it contains $\Omega$ and is closed under the formation of complements and of countable (including finite) disjoint unions:

1. $\Omega \in \mathcal{L}$;
2. $A \in \mathcal{L}$ implies $A^c \in \mathcal{L}$;
3. $A_1, A_2, \ldots \in \mathcal{L}$ and $A_n \cap A_m = \emptyset$ for $m \neq n$ imply $\bigcup_n A_n \in \mathcal{L}$.

Example 2.2. Suppose we toss two coins and we choose $\Omega = \{HH, HT, TH, TT\}$ for a sample space. The class $\mathcal{L} = \{\{HH, HT\}, \{HH, TH\}, \{HT, TT\}, \{TH, TT\}, \emptyset, \Omega\}$ is a λ-system. However, it is not a σ-field since it is not closed under intersections.

Exercise 3. We toss two coins and choose $\Omega = \{HH, HT, TH, TT\}$ for a sample space. Define
\[ \mathcal{L} = \{\Omega, \{HH, HT\}, \{HT, TH\}, \{TH, HH\}, \{HH, HT, TH\}, \{TT\}\}. \]
1. Is $\mathcal{L}$ a λ-system on $\Omega$? Explain.
2. Is $\mathcal{L}$ a σ-field on $\Omega$? Explain.

Exercise 4. Show that if a collection of sets is both a π-system and a λ-system, it is a σ-field.

The following theorem is due to Dynkin.

Theorem 2.1 (π-λ theorem). If $\mathcal{P}$ is a π-system and $\mathcal{L}$ is a λ-system, then $\mathcal{P} \subset \mathcal{L}$ implies $\sigma(\mathcal{P}) \subset \mathcal{L}$.

Note that if $\mathcal{L}$ is assumed to be a σ-field instead of a λ-system, the conclusion follows trivially from the definition of $\sigma(\mathcal{P})$ as the smallest σ-field containing $\mathcal{P}$.

Exercise 5. (a) Prove that $\mathcal{I}_1$ is a π-system. (b) Prove that $\mathcal{I}_2$ is a π-system.

Exercise 6. Prove the π-λ theorem if you know that $\lambda(\mathcal{P})$, which is the intersection of all λ-systems containing $\mathcal{P}$, is a π-system.

Exercise 7. For $A \in \lambda(\mathcal{P})$, define $\mathcal{G}_A = \{B \in \lambda(\mathcal{P}) : A \cap B \in \lambda(\mathcal{P})\}$. Show that $\mathcal{G}_A$ is a λ-system.
3 Measures on a general space

Let $\Omega$ be a set and $\mathcal{F}$ be a σ-field on $\Omega$.

Definition 3.1. A map $\mu$ from $\mathcal{F}$ to $\overline{\mathbb{R}}_+$ is said to be a measure if it satisfies
\[ \mu\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} \mu(A_i) \tag{3.1} \]
for any sequence $\{A_i\}$ with $A_i \in \mathcal{F}$ and $A_i \cap A_j = \emptyset$ for each pair $i \neq j$. It is said to be a finite measure if $\mu(\Omega) < \infty$, and it is said to be a probability measure if $\mu(\Omega) = 1$. Property (3.1) is known as the σ-additivity property.

If we want to emphasize that $\mu$ is a measure on $\mathcal{F}$ consisting of subsets of $\Omega$, we say that $\mu$ is a measure on $(\Omega, \mathcal{F})$.

Theorem 3.1. Suppose that $\mu_1$ and $\mu_2$ are two probability measures on $\sigma(\mathcal{P})$, where $\mathcal{P}$ is a π-system. If $\mu_1$ and $\mu_2$ agree on $\mathcal{P}$, then they agree on $\sigma(\mathcal{P})$.

Proof. Let $\mathcal{L} \subset \sigma(\mathcal{P})$ be the class of sets $A$ in $\sigma(\mathcal{P})$ such that $\mu_1(A) = \mu_2(A)$. Clearly, $\Omega \in \mathcal{L}$ because both are probability measures. If $A \in \mathcal{L}$, then $\mu_1(A^c) = 1 - \mu_1(A) = 1 - \mu_2(A) = \mu_2(A^c)$, and hence $A^c \in \mathcal{L}$. If $\{A_n\}$ are disjoint sets in $\mathcal{L}$, then $\mu_1(\bigcup_n A_n) = \sum_n \mu_1(A_n) = \sum_n \mu_2(A_n) = \mu_2(\bigcup_n A_n)$, and hence $\bigcup_n A_n \in \mathcal{L}$. Therefore, $\mathcal{L}$ is a λ-system. Since by hypothesis $\mathcal{P} \subset \mathcal{L}$ and $\mathcal{P}$ is a π-system, the π-λ theorem gives $\sigma(\mathcal{P}) \subset \mathcal{L}$, as required. (In fact, from this inclusion we deduce that $\mathcal{L} = \sigma(\mathcal{P})$ and therefore $\mathcal{L}$ is a σ-field.)

To avoid clutter, for measures $\mu$ on (subsets of) $\mathbb{R}$, it is customary to write $\mu(a, b)$, $\mu[a, b)$, $\mu(a, b]$, and $\mu[a, b]$ for $\mu((a, b))$, $\mu([a, b))$, $\mu((a, b])$, and $\mu([a, b])$, respectively.

You have probably seen the following consequence of the next corollary of Theorem 3.1: a probability measure on $(\mathbb{R}, \mathcal{B})$ is completely determined by its cumulative distribution function (cdf).

Corollary 3.1. Two probability measures $\mu_1$ and $\mu_2$ on $(\mathbb{R}, \mathcal{B})$ are equal if and only if
\[ \mu_1(a, b] = \mu_2(a, b] \tag{3.2} \]
for any $a, b \in \mathbb{R}$ with $a < b$.

Definition 3.2. A sequence $\{A_i\}$ is said to be a partition of $(\Omega, \mathcal{F})$ if $A_i \in \mathcal{F}$ for each $i$, $A_i \cap A_j = \emptyset$ for $i \neq j$, and $\bigcup_{i=1}^{\infty} A_i = \Omega$.

Definition 3.3. A measure $\mu$ on $(\Omega, \mathcal{F})$ is said to be σ-finite if there exists a partition $\{A_i\}$ such that $\mu(A_i) < \infty$ for each $i$.

Exercise 8. Let $\mu$ be a probability measure on $(\Omega, \mathcal{F})$. (a) Prove that the following holds
\[ \mu\Big(\bigcup_i A_i\Big) = \lim_{i \to \infty} \mu(A_i) \]
for any sequence $\{A_i\} \subset \mathcal{F}$ with $A_1 \subset A_2 \subset A_3 \subset \cdots$. (b) Prove that the following holds
\[ \mu\Big(\bigcap_i A_i\Big) = \lim_{i \to \infty} \mu(A_i) \]
for any sequence $\{A_i\} \subset \mathcal{F}$ with $A_1 \supset A_2 \supset A_3 \supset \cdots$.

Exercise 9. Let $\mu$ be a measure on $(\mathbb{R}, \mathcal{B})$. Prove that
\[ \mu(a, b) = \lim_{n \to \infty} \mu(a, b - 1/n], \tag{3.3} \]
\[ \mu[a, b] = \lim_{n \to \infty} \mu(a - 1/n, b], \tag{3.4} \]
for any $a < b$.
4 Lebesgue-Stieltjes measures on $\mathbb{R}$

Let $h$ be a nondecreasing function on $\mathbb{R}$ that is right-continuous. We now define a measure $\mu = \mu_h$ on $\mathbb{R}$.

Theorem 4.1. Given the function $h$, there is a unique (σ-finite) measure $\mu = \mu_h$ on $(\mathbb{R}, \mathcal{B})$ such that
\[ \mu(a, b] = h(b) - h(a) \tag{4.1} \]
for each $a, b \in \mathbb{R}$ with $a < b$.

Definition 4.1. When $h(t) = t$ for $t \in \mathbb{R}$, the corresponding measure $\mu = \mu_h$ on $(\mathbb{R}, \mathcal{B})$ is said to be the Borel measure.

Example 4.1. Suppose $h$ is a cumulative distribution function of a discrete random variable with corresponding probability mass function $p(x_i)$ for $i = 1, 2, \ldots$. The measure identified in Theorem 4.1 is a probability measure given by
\[ \mu_h(A) = \sum_{x_i \in A} p(x_i) \]
for each $A \subset \mathbb{R}$.

In this example we can verify the statement of Theorem 4.1 directly. Indeed, define $\mu(A) = \sum_{x_i \in A} p(x_i)$ for each $A \subset \mathbb{R}$. One readily verifies that $\mu$ is a measure on $(\mathbb{R}, \mathcal{B})$. Moreover, it agrees with $\mu_h$ on $\mathcal{I}_2$. By Corollary 3.1, they must also agree on $(\mathbb{R}, \mathcal{B})$.

Example 4.2. Let $h$ be the function on $\mathbb{R}$ given by
\[ h(t) = \begin{cases} t/2 & \text{if } 0 \le t < 1 \\ 3/4 & \text{if } 1 \le t < 2 \\ 1 & \text{if } t \ge 2 \\ 0 & \text{otherwise.} \end{cases} \]
It is readily verified that this function is right-continuous and nondecreasing. This is again a distribution function, and the theorem says that there is a unique probability measure with this distribution function. This measure $\mu_h$ has mass 1/4 at 1 and 2, and it has mass 1/2 uniformly distributed over the interval $(0, 1]$. This distribution is neither discrete nor does it have a density; it is a mixture of the two.
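The measure of Example 4.2 is easy to probe numerically. In the sketch below (ours; the helper names mu_h and mass_at are our own), intervals are measured via (4.1), and the mass at a point is approximated by the jump $h(x) - h(x-)$.

```python
def h(t):
    """The distribution function of Example 4.2."""
    if t < 0:
        return 0.0
    if t < 1:
        return t / 2
    if t < 2:
        return 3 / 4
    return 1.0

def mu_h(a, b):
    """mu_h of the half-open interval (a, b], via (4.1)."""
    return h(b) - h(a)

def mass_at(x, eps=1e-9):
    """Approximate the jump h(x) - h(x-), the mass mu_h puts on {x}."""
    return h(x) - h(x - eps)

print(mu_h(0, 1))      # 0.75 = 1/2 (uniform part) + 1/4 (atom at 1)
print(mass_at(1.0))    # ~0.25
print(mass_at(2.0))    # ~0.25
print(mu_h(0, 0.5))    # 0.25, half of the uniform mass
```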
Readers can skip the proof of Theorem 4.1 upon first reading of this material. We split the proof of Theorem 4.1 into a series of lemmas.

For a half-open interval $(a, b]$, one uses (4.1) to define $\mu(a, b]$. For an open interval $(a, b)$, one uses (3.3) to define $\mu(a, b)$. For each $(-\infty, b)$, we define $\mu(-\infty, b) = \lim_{n\to\infty} \mu(-n, b)$. For each $(a, \infty)$, we define $\mu(a, \infty) = \lim_{n\to\infty} \mu(a, n)$.

Each open set $O \subset \mathbb{R}$ is a countable union of disjoint open intervals. Assume $O = \bigcup_i (a_i, b_i)$ and $(a_i, b_i) \cap (a_j, b_j) = \emptyset$ for $i \neq j$. One defines $\mu(O) = \sum_i \mu(a_i, b_i)$ via (3.1).

Lemma 4.1. Let $O_1$ and $O_2$ be two open sets with $O_1 \cap O_2 = \emptyset$. We have
\[ \mu(O_1 \cup O_2) = \mu(O_1) + \mu(O_2). \]

Exercise 10. Prove that
\[ \mu\Big(\bigcup_{i=1}^{\infty} O_i\Big) = \sum_{i=1}^{\infty} \mu(O_i) \]
when $\{O_i\}$ is a sequence of open sets with $O_i \cap O_j = \emptyset$ for $i \neq j$.

Lemma 4.2. Let $O_1$ and $O_2$ be two open sets. We have
\[ \mu(O_1 \cup O_2) \le \mu(O_1) + \mu(O_2). \]

Exercise 11. Prove that
\[ \mu\Big(\bigcup_{i=1}^{\infty} O_i\Big) \le \sum_{i=1}^{\infty} \mu(O_i) \]
when $\{O_i\}$ is a sequence of open sets.

We wish to define $\mu(A)$ for an arbitrary Borel set $A \subset \mathbb{R}$. The idea is to approximate the set $A$ by a sequence of open sets that are at least as big as $A$. For each $A \subset \mathbb{R}$, we define the outer measure
\[ \mu^*(A) = \inf_{O \supset A} \mu(O), \tag{4.2} \]
where $O$ is restricted to open sets.

Lemma 4.3. For any sequence $\{A_i\}$ with $A_i \subset \mathbb{R}$ for each $i$,
\[ \mu^*\Big(\bigcup_{i=1}^{\infty} A_i\Big) \le \sum_{i=1}^{\infty} \mu^*(A_i). \]

Definition 4.2. Let $\mathcal{C}$ be the set of $A \subset \mathbb{R}$ such that
\[ \mu^*(E) = \mu^*(A \cap E) + \mu^*(A^c \cap E) \quad \text{for each } E \subset \mathbb{R}. \]
Each set $A \in \mathcal{C}$ is said to be a $\mu^*$-measurable set.

In light of Lemma 4.3, a set $A$ is measurable if and only if
\[ \mu^*(E) \ge \mu^*(A \cap E) + \mu^*(A^c \cap E) \quad \text{for each } E \subset \mathbb{R}. \tag{4.3} \]
Lemma 4.4. Each open interval $(a, b)$ is a measurable set.

Proof. Given an open interval $(c, d)$, assume that $a < c < b < d$. (The other relative positions of $(a,b)$ and $(c,d)$, nested or disjoint, are treated in the same way and are simpler.) Then $(a, b) \cap (c, d) = (c, b)$ and $(a, b)^c \cap (c, d) = [b, d)$. Because
\[ \mu(c, d) = \mu((c, b)) + \mu([b, d)), \]
we have
\[ \mu(c, d) = \mu^*((a, b) \cap (c, d)) + \mu^*((a, b)^c \cap (c, d)). \]
For each open set $O = \bigcup_i (c_i, d_i)$, where the $(c_i, d_i)$ are disjoint, we have
\begin{align*}
\mu(O) &= \sum_i \mu((c_i, d_i)) = \sum_i \Big( \mu^*((a, b) \cap (c_i, d_i)) + \mu^*((a, b)^c \cap (c_i, d_i)) \Big) \\
&= \sum_i \mu^*((a, b) \cap (c_i, d_i)) + \sum_i \mu^*((a, b)^c \cap (c_i, d_i)) \\
&\ge \mu^*\Big(\bigcup_i (a, b) \cap (c_i, d_i)\Big) + \mu^*\Big(\bigcup_i (a, b)^c \cap (c_i, d_i)\Big) \\
&= \mu^*((a, b) \cap O) + \mu^*((a, b)^c \cap O),
\end{align*}
where the inequality follows from Lemma 4.3.
Exercise 12. If $A$ is measurable, then $A^c$ is measurable.

Exercise 13. If $A$ and $B$ are measurable, then $A \cup B$ is measurable.

The next two lemmas imply that the restriction of $\mu^*$ to $\mathcal{C}$ is a measure. This measure on $(\mathbb{R}, \mathcal{C})$ is called the Lebesgue-Stieltjes measure corresponding to $h$. In the important special case $h(t) = t$, one calls it the Lebesgue measure.

Lemma 4.5. Assume that $\{A_i\}$ is a sequence of measurable disjoint sets. Then
\[ \mu^*\Big(\bigcup_i A_i\Big) = \sum_{i=1}^{\infty} \mu^*(A_i). \]

Lemma 4.6. $\mathcal{C}$ is a σ-field.

We have proved that $\mathcal{C}$ contains each open interval $(a, b)$. Since $\mathcal{B}$ is the smallest σ-field that contains all open intervals, $\mathcal{B} \subset \mathcal{C}$. Thus, $\mu^*$ can also be restricted to the σ-field $\mathcal{B}$ and the restriction $\mu = \mu_h$ is a measure. Therefore, $\mu = \mu_h$ is a well-defined measure on $(\mathbb{R}, \mathcal{B})$. This is sometimes called the Stieltjes measure, although Lebesgue-Stieltjes measure is also used; we use the latter in these notes. By Corollary 3.1, the measure $\mu$ is the unique one that satisfies (4.1). This establishes Theorem 4.1.

5 Integrals with respect to a measure

Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, i.e., $\Omega$ is a set, $\mathcal{F}$ is a σ-field and $\mu$ is a measure on $(\Omega, \mathcal{F})$.

In this section we need to work with functions taking values in the set of extended real numbers $\overline{\mathbb{R}}$, and we therefore need a σ-field on $\overline{\mathbb{R}}$. We let $\overline{\mathcal{B}}$ be the σ-field generated by $\mathcal{B} \cup \{\{\infty\}, \{-\infty\}\}$. It can be verified that any set $A \in \overline{\mathcal{B}}$ has the form $B$, $B \cup \{\infty\}$, $B \cup \{-\infty\}$, or $B \cup \{-\infty, \infty\}$ for some $B \in \mathcal{B}$.

Definition 5.1. A function $f : \Omega \to \overline{\mathbb{R}}$ is said to be $(\mathcal{F}, \overline{\mathcal{B}})$-measurable if $\{\omega : f(\omega) \in B\} \in \mathcal{F}$ for every $B \in \overline{\mathcal{B}}$.

A function $g : \Omega \to \mathbb{R}$ is said to be $(\mathcal{F}, \mathcal{B})$-measurable if $\{\omega : g(\omega) \in B\} \in \mathcal{F}$ for every $B \in \mathcal{B}$.

When it is clear from the context what the underlying σ-fields are, we simply call the function measurable.

The set $\{\omega : f(\omega) \in B\}$ in the above definition is sometimes abbreviated by $f^{-1}(B)$, the pre-image of $B$ under $f$.

In this section, whenever we use the symbol $g$ for a function (or $g_n$, $g_1$, etc.), it is implicit that the range is $\overline{\mathbb{R}}$ (as opposed to $\mathbb{R}$), and that measurability is defined with respect to $\overline{\mathcal{B}}$.

Exercise 14. Prove that $f : \Omega \to \mathbb{R}$ is measurable if and only if for every $(a, b)$ with $a < b$, $\{\omega : f(\omega) \in (a, b)\} \in \mathcal{F}$. Hint: first show that $\mathcal{A} = \{B \subset \mathbb{R} : f^{-1}(B) \in \mathcal{F}\}$ is a σ-field.

Exercise 15. Let the collection $\mathcal{O}$ consist of all open subsets of $\mathbb{R}$. Suppose $f : \mathbb{R} \to \mathbb{R}$ is a function.
1. Show that $\sigma(\mathcal{O}) = \mathcal{B}$. You may use that every non-empty open set can be expressed as a countable union of pairwise disjoint open intervals.
2. Show that $f$ is measurable if and only if $f^{-1}(O)$ lies in $\mathcal{B}$ for every $O \in \mathcal{O}$.
3. Using the definition of a continuous function, show that any continuous function must be measurable.

It is our aim to define the integral $\int_\Omega g(x)\, d\mu$. Alternative notations for the integral are
\[ \int_\Omega g(x)\, \mu(dx) \quad \text{and} \quad \int_\Omega g(x)\, d\mu(x). \]

Definition 5.2. A function $f$ is said to be a simple function if
\[ f(x) = \sum_{i=1}^{m} f_i 1_{A_i}(x), \quad x \in \Omega \tag{5.1} \]
for some integer $m$, some constants $f_i \in \mathbb{R}$, and some sets $A_i \in \mathcal{F}$, $i = 1, \ldots, m$. Here $1_A(x)$ is the indicator function of $A$.

For the simple function $f$ in (5.1), define
\[ \int_\Omega f(x)\, d\mu = \sum_{i=1}^{m} f_i \mu(A_i). \]

Definition 5.3. For a nonnegative measurable function $g : \Omega \to \overline{\mathbb{R}}$, we define
\[ \int_\Omega g(x)\, d\mu = \sup_{\substack{f \le g \\ f \text{ is simple}}} \int_\Omega f(x)\, d\mu. \]

Exercise 16. For two nonnegative measurable functions $g_1$ and $g_2$, assume that $g_1(x) \le g_2(x)$ for $x \in \Omega$. Prove that
\[ \int_\Omega g_1(x)\, d\mu \le \int_\Omega g_2(x)\, d\mu. \]

Exercise 17. Let $\mathcal{F}$ be a σ-field on some set $\Omega$. For a nonnegative measurable function $g : \Omega \to \overline{\mathbb{R}}$, show that
\[ \int_\Omega \alpha g(x)\, d\mu = \alpha \int_\Omega g(x)\, d\mu \quad \text{for a scalar } \alpha \ge 0. \]

Definition 5.4. For a measurable function $g$, we define
\[ \int_\Omega g(x)\, d\mu = \int_\Omega g^+(x)\, d\mu - \int_\Omega g^-(x)\, d\mu \]
as long as $\int_\Omega g^+(x)\, d\mu$ and $\int_\Omega g^-(x)\, d\mu$ are not both equal to $\infty$.

Exercise 18. Assume that $\{g_n\}$ is a sequence of measurable nonnegative functions that satisfy $g_n(x) \le g_{n+1}(x)$ for $x \in \Omega$. Prove that the function $g : \Omega \to \overline{\mathbb{R}}$ defined through $g(x) = \lim_{n\to\infty} g_n(x)$ for $x \in \Omega$, the so-called pointwise limit, is well-defined for each $x$ and that $g$ is a measurable nonnegative function.

The following theorem is useful in many situations.

Theorem 5.1 (Monotone convergence theorem (MCT)). Assume that $\{g_n\}$ is a sequence of measurable nonnegative functions satisfying
\[ g_n(x) \le g_{n+1}(x) \quad \text{for } x \in \Omega \]
for each $n$. Then we have
\[ \lim_{n\to\infty} \int_\Omega g_n(x)\, d\mu = \int_\Omega \lim_{n\to\infty} g_n(x)\, d\mu. \]

Before we discuss the proof, let us see this theorem in action.

Example 5.1. Suppose $\Omega = \mathbb{R}$, and equip $\Omega$ with the σ-algebra of Borel sets and Lebesgue measure $\mu$. The functions $g_n$ defined by $g_n(x) = 1_{(0,1/2-1/n)}(x)$ for $x \in \Omega$ satisfy the requirements of the MCT, so
\[ \lim_{n\to\infty} \int_\Omega g_n(x)\, d\mu = \int_\Omega 1_{(0,1/2)}(x)\, d\mu = 1/2. \]
Proof of Theorem 5.1. For each $x \in \Omega$, let
\[ g(x) = \lim_{n\to\infty} g_n(x). \]
Note that the limit exists (it may equal $\infty$) for each $x$ by the monotonicity assumption. Clearly, for each $n$, $g_n(x) \le g(x)$ for all $x \in \Omega$. Therefore, by Exercise 16,
\[ \int_\Omega g_n(x)\, d\mu \le \int_\Omega g(x)\, d\mu \quad \text{for each } n, \]
which implies by Exercises 18 and 16 that
\[ \lim_{n\to\infty} \int_\Omega g_n(x)\, d\mu \le \int_\Omega g(x)\, d\mu. \]
It remains to prove that
\[ \lim_{n\to\infty} \int_\Omega g_n(x)\, d\mu \ge \int_\Omega g(x)\, d\mu. \]
For this, let $f \le g$ be a simple function
\[ f(x) = \sum_{i=1}^{m} f_i 1_{A_i}(x), \]
where $f_i > 0$ and $A_i \cap A_j = \emptyset$ for $i \neq j$. Then,
\[ \int_\Omega f(x)\, d\mu = \sum_{i=1}^{m} f_i \mu(A_i). \]
It suffices to prove
\[ \lim_{n\to\infty} \int_\Omega g_n(x)\, d\mu \ge \sum_{i=1}^{m} f_i \mu(A_i). \tag{5.2} \]
We consider two cases: (a) $\mu(A_i) < \infty$ for all $i = 1, \ldots, m$; (b) $\mu(A_i) = \infty$ for some $i = 1, \ldots, m$.

(a) Fix $\epsilon > 0$ and, for each $i$, let
\[ A_{n,i} = \{x \in A_i : g_n(x) > f_i - \epsilon\}. \]
Then we have
\[ g_n(x) \ge \sum_{i=1}^{m} (f_i - \epsilon) 1_{A_{n,i}}(x), \]
$A_{n,i} \subset A_{n+1,i}$, and $\bigcup_{n=1}^{\infty} A_{n,i} = A_i$. Then we have
\[ \int_\Omega g_n(x)\, d\mu \ge \sum_{i=1}^{m} (f_i - \epsilon) \mu(A_{n,i}). \]
Thus we obtain
\[ \lim_{n\to\infty} \int_\Omega g_n(x)\, d\mu \ge \sum_{i=1}^{m} (f_i - \epsilon) \mu(A_i) \tag{5.3} \]
\[ = \sum_{i=1}^{m} f_i \mu(A_i) - \epsilon\, \mu\Big(\bigcup_{i=1}^{m} A_i\Big). \]
Next note that $\mu(\bigcup_{i=1}^{m} A_i) = \sum_{i=1}^{m} \mu(A_i) < \infty$ since we assumed to be in case (a). Therefore, we can choose $\epsilon > 0$ arbitrarily small to obtain (5.2).

(b) Without loss of generality, assume that $\mu(A_1) = \infty$. Choose $\epsilon > 0$ small enough so that $f_i - \epsilon > 0$ for each $i = 1, \ldots, m$. Then (5.3) implies that
\[ \lim_{n\to\infty} \int_\Omega g_n(x)\, d\mu \ge \sum_{i=1}^{m} (f_i - \epsilon) \mu(A_i) \ge (f_1 - \epsilon) \mu(A_1) = \infty, \]
proving (5.2).
Exercise 19. Let $\mu$ be a measure on $\mathcal{F}$, which is a σ-field on some set $\Omega$. Given a sequence of nonnegative measurable functions $\{g_k\}$, where $g_k : \Omega \to \overline{\mathbb{R}}$, we are interested in the identity
\[ \sum_{k=1}^{\infty} \int_\Omega g_k(x)\, d\mu = \int_\Omega \sum_{k=1}^{\infty} g_k(x)\, d\mu. \]
1. Explain why both sides of this equation are well-defined.
2. Prove this identity.

It is useful to give an alternative characterization of the integral, and to compare this with the definition of the Riemann integral (see for instance Wikipedia). To get the alternative characterization, we are going to approximate a nonnegative measurable function $g$ (that might take the value $\infty$) by a sequence of simple functions $g_n$. For each positive integer $n$, define
\[ g_n(x) = \begin{cases} n & \text{if } g(x) \ge n, \\ i/2^n & \text{if } i/2^n \le g(x) < (i+1)/2^n \text{ for } i = 0, 1, \ldots, n 2^n - 1. \end{cases} \tag{5.4} \]
See Figure 2 for an illustration.
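Formula (5.4) is easy to implement. The following Python sketch (ours, not from the notes) builds $g_n$ for $g(x) = x^2$ on $\Omega = [0,1]$ with Lebesgue measure, approximating the measure of each level set of $g_n$ by the fraction of grid points it contains. The resulting integrals increase towards $\int_0^1 x^2\,dx = 1/3$, illustrating Exercise 21.

```python
import numpy as np

def g(x):
    return x ** 2  # example integrand on [0, 1]; the true integral is 1/3

def g_n(x, n):
    """The dyadic approximation (5.4): round g down to the grid i/2^n, cap at n."""
    return np.where(g(x) >= n, n, np.floor(g(x) * 2**n) / 2**n)

# Approximate the Lebesgue measure of each level set {g_n = i/2^n} by the
# fraction of a fine grid falling in it, then sum value * measure.
x = np.linspace(0, 1, 1_000_001)
for n in (2, 4, 8, 16):
    levels, counts = np.unique(g_n(x, n), return_counts=True)
    integral = np.sum(levels * counts / x.size)
    print(n, integral)   # increases towards 1/3
```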
Now we are ready to compare the ideas underlying Riemann and Lebesgue integration. For this, it is important to realize that the notion of Riemann integration inherently only considers integration of functions on (subsets of) Euclidean space, such as $\mathbb{R}$. Therefore, any comparison of Lebesgue and Riemann integrability can only be done for such functions, even though the notion of Lebesgue integral is much more general. For the Riemann integral, the idea is to define the integral using a grid on the horizontal axis. For the Lebesgue integral, as explained through the next exercises, the idea is to define the integral using a grid on the vertical axis. To see why it is useful to consider a grid on the vertical axis, consider the indicator function of the rational numbers. The Riemann integral does not exist, yet the Lebesgue integral is well-defined and equal to zero (recall that the set of rational numbers is countable). In typical cases, the Riemann integral and Lebesgue integral are well-defined and they are equal.

Figure 2: For a function to be Riemann integrable, we need the upper and lower Riemann sums to converge to the same value as the partition on the x-axis gets finer and finer (blue). For the Lebesgue integral, we partition the y-axis and always approximate the integral from below (red). Picture taken from Wikipedia.
Exercise 20. (a) Prove that for each $n$ and $x \in \mathbb{R}$, $g_n(x) \le g_{n+1}(x)$. (b) Prove that $\lim_{n\to\infty} g_n(x) = g(x)$ for each $x \in \Omega$. (c) Prove that for each $n$, $g_n$ is a simple function.

Exercise 21. For a nonnegative measurable function $g$, let $g_n$ be defined in (5.4). Show that
\[ \int_\Omega g(x)\, d\mu = \lim_{n\to\infty} \int_\Omega g_n(x)\, d\mu. \]

Exercise 22. For two nonnegative measurable functions $g_1$ and $g_2$, prove that
\[ \int_\Omega (g_1(x) + g_2(x))\, d\mu = \int_\Omega g_1(x)\, d\mu + \int_\Omega g_2(x)\, d\mu. \]
Use the monotone convergence theorem.

Exercise 23. Let $\mu$ be the Lebesgue-Stieltjes measure corresponding to the function $h : \mathbb{R} \to \mathbb{R}_+$ given by, for $x \in \mathbb{R}$,
\[ h(x) = \begin{cases} \lfloor x \rfloor & x \ge 0 \\ 0 & \text{otherwise,} \end{cases} \]
where $\lfloor x \rfloor$ is the largest integer smaller than (or equal to) $x$, i.e., it rounds down $x$. For instance, $\lfloor 1.34 \rfloor = \lfloor 1.92 \rfloor = 1$, $\lfloor 2 \rfloor = 2$.
Define the functions $g_n$ through $g_n(x) = x\, 1_{(0,n]}(x)$ for $x \in \mathbb{R}$. Prove that $\lim_{n\to\infty} \int_{\mathbb{R}} g_n(x)\, d\mu / n^2 = 1/2$.

For a measurable set $A \in \mathcal{F}$, one defines
\[ \int_A g(x)\, d\mu = \int_\Omega g(x) 1_A(x)\, d\mu. \tag{5.5} \]

The following lemma gives a class of integrands for which the Riemann integrals and Lebesgue integrals are equal.

Lemma 5.1. Given $a, b \in \mathbb{R}$ with $a < b$, suppose that $f : \mathbb{R} \to \mathbb{R}$ is continuous on $[a, b]$. Then we have
\[ \int_{[a,b]} f\, d\mu = \mathcal{R}\text{-}\!\int_a^b f(x)\, dx, \]
where $\mathcal{R}\text{-}\!\int$ denotes the Riemann integral and $\mu$ denotes Lebesgue measure.

Assume that $g \ge 0$ and $\int_\Omega g(x)\, d\mu < \infty$. For each $A \in \mathcal{F}$, define
\[ \nu(A) = \int_A g(x)\, d\mu. \tag{5.6} \]

Exercise 24. Prove that (a) $\nu(A) < \infty$ for each $A \in \mathcal{F}$; (b) prove using Exercise 22 that $\nu(A_1 \cup A_2) = \nu(A_1) + \nu(A_2)$ when $A_1 \cap A_2 = \emptyset$; (c) $\nu$ is a measure on $(\Omega, \mathcal{F})$.

Exercise 25. Assume that $g \ge 0$ and $\int_\Omega g(x)\, d\mu < \infty$. Define the measure $\nu$ by (5.6). Let $f \ge 0$ be a measurable function. Prove that
\[ \int_\Omega f(x)\, d\nu = \int_\Omega f(x) g(x)\, d\mu. \tag{5.7} \]
When $\mu$ and $\nu$ satisfy (5.6), we say that $g$ is the density of $\nu$ with respect to $\mu$ (you're very familiar with probability densities with respect to Lebesgue measure!). The measures $\mu$ and $\nu$ are related through a change of measure, a concept that is critical in much of probability as well as in mathematical finance (the so-called Girsanov theorem is a prime example). The function $g$ is also known as the Radon-Nikodym derivative, and $g(x)$ is sometimes denoted by $d\nu/d\mu(x)$. This notation is motivated by the formula in the previous exercise, which can then be interpreted as justifying the formal relationship $d\nu(x) = (d\nu/d\mu)(x)\, d\mu(x)$. There is a deep theorem, the Radon-Nikodym theorem, which gives a precise characterization of when Radon-Nikodym derivatives exist (for finite measures). This is beyond the scope of this course.
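A quick Monte Carlo experiment (ours) illustrates the change-of-measure formula (5.7). We take $\mu$ to be Lebesgue measure on $[0,1]$, the density $g(x) = 2x$, and the test integrand $f(x) = x^2$, so both sides of (5.7) equal $\int_0^1 2x^3\,dx = 1/2$; sampling from $\nu$ uses the inverse of its cdf $x^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# mu = Lebesgue measure on [0,1]; g(x) = 2x is a density w.r.t. mu,
# so nu(A) = integral over A of 2x dx.  Take f(x) = x^2 as a test integrand.
f = lambda x: x ** 2
g = lambda x: 2 * x

u = rng.random(10**6)
lhs = np.mean(f(np.sqrt(u)))     # int f dnu: nu has cdf x^2, inverse sqrt(u)
rhs = np.mean(f(u) * g(u))       # int f g dmu, sampling u ~ mu
print(lhs, rhs)                  # both close to 1/2
```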
6 Lebesgue-Stieltjes Integrals

Let $h : \mathbb{R} \to \mathbb{R}$ be a function that is nondecreasing and right-continuous. Our aim is to define the integral $\int_{\mathbb{R}} g(t)\, dh(t)$ for a class of good functions $g$.

Recall that $\mu_h$ is the Lebesgue-Stieltjes measure associated with the nondecreasing function $h$; see Section 4. Assume that $g : \mathbb{R} \to \mathbb{R}$ is Borel measurable. Then the integral
\[ \int_{\mathbb{R}} g(t)\, d\mu_h \tag{6.1} \]
is well-defined as long as $\int_{\mathbb{R}} g^+(t)\, d\mu_h$ and $\int_{\mathbb{R}} g^-(t)\, d\mu_h$ are not both infinite. We use (6.1) as the definition of $\int_{\mathbb{R}} g(t)\, dh(t)$.
About the notation: in addition to $\int_{\mathbb{R}} g(t)\, d\mu_h$, the integral $\int_{\mathbb{R}} g(t)\, dh(t)$ is sometimes written as $\int_{\mathbb{R}} g(t)\, \mu_h(dt)$ or $\int_{\mathbb{R}} g(t)\, h(dt)$. In the special case $h(t) = t$, we simply write $\int_{\mathbb{R}} g(t)\, dt$. Note that this same notation is used for the Riemann integral, and they are often equal.

Recall the integral in (5.5). For a Borel set $A \subset \mathbb{R}$, one defines
\[ \int_A g(t)\, dh(t) = \int_{\mathbb{R}} g(t) 1_A(t)\, dh(t). \]
This convention requires that $h$ is defined on $\mathbb{R}$, or at least on an open interval that contains $A$. For such an $h$, both $\int_{(a,b]} g(t)\, dh(t)$ and $\int_{[a,b]} g(t)\, dh(t)$ are well-defined. But they are not equal in general.

Exercise 26. Show that
\[ \int_{[a,b]} g(t)\, dh(t) = g(a)\big(h(a) - h(a-)\big) + \int_{(a,b]} g(t)\, dh(t). \]
Here we use the notation $h(a-) = \lim_{t \uparrow a} h(t)$ (and you noticed that this limit always exists, right?).

Often the notation $\int_a^b g(t)\, dh(t)$ does not make much sense, because we do not know whether it means $\int_{[a,b]} g(t)\, dh(t)$, $\int_{(a,b]} g(t)\, dh(t)$, $\int_{[a,b)} g(t)\, dh(t)$, or perhaps $\int_{(a,b)} g(t)\, dh(t)$, unless $h$ does not have jumps at $a$ and $b$.

Example 6.1. If $h$ is a right-continuous, nondecreasing piecewise constant function with jumps at points $a_1, a_2, \ldots$, then we have
\[ \int_{\mathbb{R}} g(t)\, dh(t) = \sum_i g(a_i)\big(h(a_i) - h(a_i-)\big). \]
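For a pure-jump $h$ the integral is literally a sum, which the following sketch (ours; the function name is our own) implements. Taking $h$ to be the cdf of a fair die recovers familiar discrete expectations.

```python
def stieltjes_jump_integral(g, jumps):
    """Integral of g dh for a piecewise-constant, right-continuous h, where
    `jumps` maps each jump location a_i to the jump size h(a_i) - h(a_i-)."""
    return sum(g(a) * size for a, size in jumps.items())

# h = cdf of a fair die: jumps of 1/6 at 1, ..., 6.
jumps = {k: 1 / 6 for k in range(1, 7)}
print(stieltjes_jump_integral(lambda t: t, jumps))        # 3.5  (the mean)
print(stieltjes_jump_integral(lambda t: t ** 2, jumps))   # 91/6 (second moment)
```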
Example 6.2. If $h$ satisfies $h(t) = \int_{-\infty}^{t} f(s)\, ds$ for $t \in \mathbb{R}$ for some function $f \ge 0$ with $\int_{\mathbb{R}} f(s)\, ds < \infty$, then we have
\[ \int_{\mathbb{R}} g(t)\, dh(t) = \int_{\mathbb{R}} g(t) f(t)\, dt. \tag{6.2} \]
To see this, define the measure $\nu$ on $\mathcal{B}$ through $\nu(A) = \int_A f(s)\, ds$, cf. (5.6), and then conclude that $\mu_h = \nu$ by Theorem 3.1. Exercise 25 then yields (6.2).

Exercise 27. Let the function $h : \mathbb{R} \to \mathbb{R}$ be given by
\[ h(x) = \begin{cases} 0 & \text{if } x < -1 \\ 1/2 + (x+1)^2 & \text{if } -1 \le x < 1/2 \\ 13/4 + x/2 & \text{if } 1/2 \le x < 3/2 \\ 5 & \text{otherwise,} \end{cases} \]
and define the function $g$ through
\[ g(x) = \begin{cases} x + 1 & \text{if } -1 \le x \le 0 \\ 1 - x/2 & \text{if } 0 < x < 2 \\ 0 & \text{otherwise.} \end{cases} \]
(a) Prove that $g$ is measurable. (b) Calculate $\int_{[1/2,1]} g(x)\, dh(x)$. (c) Calculate $\int_{\mathbb{R}} g(x)\, dh(x)$.
7 Random variables and expectation

Let $\Omega$ be a space and $\mathcal{F}$ be a σ-field on $\Omega$.

Definition 7.1. A map $P : \mathcal{F} \to [0, 1]$ is said to be a probability measure if

1. $P(\Omega) = 1$;
2. $P\big(\bigcup_i A_i\big) = \sum_i P(A_i)$ for each sequence $\{A_i\}$ with $A_i \cap A_j = \emptyset$, $i \neq j$.

Each $A \in \mathcal{F}$ is said to be an event. For every event $A$, $P(A)$ is said to be the probability of $A$. The triple $(\Omega, \mathcal{F}, P)$ is said to be a probability space. Each $\omega \in \Omega$ is said to be a sample from the probability space.

Definition 7.2. A function $X : \Omega \to \mathbb{R}$ is said to be a random variable if it is $(\mathcal{F}, \mathcal{B})$-measurable.

Thus, a random variable is nothing else than an $\mathbb{R}$-valued measurable function on the sample space $\Omega$. Exercise 14 gives us a way of verifying that $X$ is a random variable. There are also alternative characterizations, which can be proved along the same lines. The following characterization is often useful: $X$ is a random variable if and only if for every $a \in \mathbb{R}$, we have $\{\omega : X(\omega) \le a\} \in \mathcal{F}$.

You may be used to thinking of a random variable as something that can be described in words, which depends on the outcome of some experiment. The function definition is much more flexible, and that's the reason for using it. It is instructive to see an example where we discuss both these different ways of thinking about a random variable: as a function and as a description in words.
Example 7.1. We toss a fair coin three times, and choose $\Omega = \{H, T\}^3$ and $\mathcal{F} = \mathcal{P}(\Omega)$, the set of all subsets of $\Omega$ (power set). We let $P$ be the probability measure on $\mathcal{F}$ determined by $P(\{\omega\}) = 1/8$ for each $\omega \in \Omega$ (this captures the fair coin setting).

Define a random variable $X$ as follows:
\[ X((H,H,H)) = 3, \quad X((T,H,H)) = X((H,T,H)) = X((H,H,T)) = 2, \]
\[ X((T,T,H)) = X((T,H,T)) = X((H,T,T)) = 1, \quad X((T,T,T)) = 0. \]
In words, this corresponds to the number of heads in the three tosses.
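Because $\Omega$ is finite, everything in this example can be enumerated. The sketch below (ours) builds $\Omega$, $P$ and $X$ explicitly and computes $E(X) = \int_\Omega X\, dP$ as a finite sum; compare with Example 7.2 below.

```python
from itertools import product

omega_space = list(product("HT", repeat=3))   # the 8 outcomes of Example 7.1
P = {w: 1 / 8 for w in omega_space}           # fair coin: uniform measure

X = {w: w.count("H") for w in omega_space}    # X(w) = number of heads

EX = sum(X[w] * P[w] for w in omega_space)    # expectation as a finite sum
print(EX)                                     # 1.5
```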
Exercise 28. Let $\Omega = [0, 1]$ be equipped with the restriction (called trace) of the Borel σ-field, i.e., the σ-field $\mathcal{F} = \mathcal{B} \cap [0, 1]$ comprising the sets $[0, 1] \cap B$ for $B \in \mathcal{B}$. Let $X$ be given by $X(\omega) = \omega$. (a) Verify that $X$ is a random variable. (b) What is the name of the distribution of $X$ if $P$ is the restriction of Lebesgue measure to $\mathcal{F}$?

Exercise 29. The idea behind this exercise is often used in computer simulation. Consider the setting of the previous exercise. Suppose that we are given some strictly positive probability density function $f$. (The idea can be formulated in a much more general setting, but it becomes a bit messier.) Write $F(x) = \int_{-\infty}^{x} f(s)\, ds$ for $x \in \mathbb{R}$ and let $F^{-1}$ be its inverse. (a) Verify that $Y(\omega) = F^{-1}(\omega)$ defines a random variable. (b) Calculate the cdf of $Y$. (c) Assuming that one can obtain realizations $\omega_1, \omega_2, \ldots$ from $P$ by simulation, explain how one gets realizations from a random variable with cdf $F$.
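Part (c) is the inverse transform method. A minimal sketch (ours), using the exponential distribution with rate $\lambda$, for which $F^{-1}(u) = -\log(1-u)/\lambda$ is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

# F(x) = 1 - exp(-lam * x)  =>  F^{-1}(u) = -log(1 - u) / lam
lam = 2.0
u = rng.random(10**6)            # realizations from P = Lebesgue on (0, 1)
y = -np.log(1 - u) / lam         # Y = F^{-1}(U) then has cdf F

print(y.mean(), 1 / lam)                     # sample mean close to 1/lam
print(np.mean(y <= 1.0), 1 - np.exp(-lam))   # empirical vs true cdf at x = 1
```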
For a nonnegative random variable $X : \Omega \to \overline{\mathbb{R}}_+$, the integral $\int_\Omega X\, dP$ is defined as in Section 5. We call the integral the expectation of the random variable $X$. It is denoted by $E(X)$. Recall that
\[ E(X) = \int_\Omega X\, dP = \sup_{\substack{Y \le X \\ Y \text{ is simple}}} E(Y). \]
When $Y$ is simple, $Y$ has the following representation
\[ Y(\omega) = \sum_{i=1}^{n} a_i 1_{A_i}(\omega), \]
where $A_i \in \mathcal{F}$ and $A_i \cap A_j = \emptyset$ for $i \neq j$. For such a simple random variable, by definition,
\[ E(Y) = \sum_{i=1}^{n} a_i P(A_i). \]
Example 7.2. This is a continuation of Example 7.1. In this case, we have
\begin{align*}
\int_\Omega X\, dP &= \int_\Omega \Big( 3 \cdot 1_{\{(H,H,H)\}}(\omega) + 2 \cdot 1_{\{(H,H,T),(H,T,H),(T,H,H)\}}(\omega) + 1_{\{(T,T,H),(T,H,T),(H,T,T)\}}(\omega) \Big)\, dP \\
&= 3\, P(\{(H,H,H)\}) + 2\, P(\{(H,H,T),(H,T,H),(T,H,H)\}) + P(\{(T,T,H),(T,H,T),(H,T,T)\}) = E(X),
\end{align*}
and the latter coincides with the undergraduate definition of expectation.
When $X$ is a random variable that is not necessarily nonnegative, we define
\[ \int_\Omega X\, dP = \int_\Omega X^+\, dP - \int_\Omega X^-\, dP, \]
as long as $\infty - \infty$ is avoided. When $\infty - \infty$ occurs, we say that $\int_\Omega X\, dP$ is undefined.

Theorem 7.1. Let $X$ be a random variable. Let $F(x) = P\{X \le x\}$ be its cumulative distribution function (c.d.f.). For any measurable $g : \mathbb{R} \to \mathbb{R}$, we have defined
\[ \int_{\mathbb{R}} g(x)\, dF(x) = \int_{\mathbb{R}} g(x)\, d\mu_F \]
as the Lebesgue-Stieltjes integral. Then we have
\[ E[g(X)] = \int_{\mathbb{R}} g(x)\, dF(x) \tag{7.1} \]
as long as one side of (7.1) is well-defined.

Proof. It suffices to prove (7.1) for $g \ge 0$. By the construction in (5.4) and the monotone convergence theorem, it suffices to prove (7.1) for simple functions $g$. By linearity of integrals, it suffices to prove (7.1) for $g(x) = 1_B(x)$ when $B$ is a Borel set in $\mathcal{B}(\mathbb{R})$. When $g(x) = 1_B(x)$, (7.1) reduces to
\[ P(X \in B) = \mu_F(B). \tag{7.2} \]
For each $A \in \mathcal{B}(\mathbb{R})$, define
\[ \nu(A) = P\{\omega : X(\omega) \in A\}. \]
One can check that $\nu : \mathcal{B}(\mathbb{R}) \to [0, 1]$ is a measure on $\mathcal{B}(\mathbb{R})$. We know that $\mu_F$ is also a measure on $\mathcal{B}(\mathbb{R})$. Two measures are identical on $\mathcal{B}(\mathbb{R})$ if they are identical on intervals $(a, b]$ for $a, b \in \mathbb{R}$ with $a < b$. Now, $\nu(a, b] = P\{a < X \le b\} = F(b) - F(a)$. Also, $\mu_F(a, b] = F(b) - F(a)$. Thus, $\nu(a, b] = \mu_F(a, b]$. Thus, we have proved (7.2), completing the proof of the theorem.

The measure $\mu_F$ on $\mathcal{B}$ from this theorem is called the distribution of $X$. It is sometimes denoted by $P \circ X^{-1}$.

Exercise 30. Explain the notation $P \circ X^{-1}$.

Exercise 31. Consider the probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = (0, 1)$, $\mathcal{F} = \{B \cap (0, 1) : B \in \mathcal{B}(\mathbb{R})\}$, and $P$ the restriction of Lebesgue measure to $\mathcal{F}$. Define $X : \Omega \to \mathbb{R}$ by $X(\omega) = -\log \omega$ for $\omega \in \Omega$. (a) Prove that $X$ is a random variable. (b) Calculate the distribution function $F$ of $X$, and explain how this function is related to the measure $P \circ X^{-1}$.
8 Independence

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\mathcal{F}_1 \subset \mathcal{F}$ and $\mathcal{F}_2 \subset \mathcal{F}$ be two σ-fields on $\Omega$.

Definition 8.1. Two σ-fields $\mathcal{F}_1$ and $\mathcal{F}_2$ are said to be independent if for each $A \in \mathcal{F}_1$ and each $B \in \mathcal{F}_2$, we have
\[ P(A \cap B) = P(A) P(B). \tag{8.1} \]

Exercise 32. Let $\mathcal{F}$ be an arbitrary σ-field on a sample space $\Omega$. Prove that $\mathcal{F}$ is independent of $\mathcal{G} = \{\emptyset, \Omega\}$.

Definition 8.2. Let $X$ be a real-valued function on $\Omega$. We use $\sigma(X)$ to denote the smallest σ-field $\mathcal{G}$ on $\Omega$ such that $X$ is a random variable on $(\Omega, \mathcal{G})$, and say that $\sigma(X)$ is the σ-field generated by $X$.

Exercise 33. Given $X$ as in the above definition, prove that $\sigma(X) = \{X^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}$.

Example 8.1. Suppose $\Omega = \{1, 2, 3\}$ and let $X : \Omega \to \mathbb{R}$ and $Y : \Omega \to \mathbb{R}$ be defined through $X(1) = 1$, $X(2) = 1$, $X(3) = \pi$ and $Y(1) = 1$, $Y(2) = \sqrt{2}$, $Y(3) = \sqrt{3}$, respectively. We then have $\sigma(X) = \{\emptyset, \Omega, \{1, 2\}, \{3\}\}$ and $\sigma(Y) = \{\emptyset, \Omega, \{1\}, \{2\}, \{3\}, \{1, 2\}, \{1, 3\}, \{2, 3\}\}$.

Definition 8.3. Two random variables $X$ and $Y$ are said to be independent if $\sigma(X)$ and $\sigma(Y)$ are independent.

In view of Exercise 33, this definition shows that $X$ and $Y$ are independent if, for any Borel sets $A$ and $B$,
\[ P\big(X^{-1}(A) \cap Y^{-1}(B)\big) = P\big(X^{-1}(A)\big) P\big(Y^{-1}(B)\big), \]
which can also be written as
\[ P(X \in A, Y \in B) = P(X \in A) P(Y \in B). \]

Theorem 8.1. Assume that random variables $X$ and $Y$ are independent. Then for any two functions $f, g : \mathbb{R} \to \mathbb{R}$ that are Borel measurable,
\[ E(f(X) g(Y)) = E(f(X)) E(g(Y)), \tag{8.2} \]
as long as all expectations are well-defined.
Proof. It is sufficient to prove the case $f \ge 0$ and $g \ge 0$. Otherwise, we work with $f^+$, $f^-$, $g^+$ and $g^-$. We first prove (8.2) when $g(x) = 1_B(x)$ for some Borel measurable set $B \subset \mathbb{R}$, namely,
\[ E(f(X) 1_B(Y)) = E(f(X)) E(1_B(Y)) \tag{8.3} \]
for all Borel measurable $f \ge 0$. To prove (8.3), we first note that this identity holds when $f(x) = 1_A(x)$ for each Borel set $A$. This implies that (8.3) holds when $f$ is a simple function. Construct $f_n$ as in (5.4). We have that $f_n$ is monotonically increasing, and $f_n(x) \to f(x)$ for each $x \in \mathbb{R}$. Therefore, for each $\omega$, $f_n(X(\omega))$ is monotonically increasing in $n$, and $f_n(X(\omega)) \to f(X(\omega))$. Since $f_n$ is a simple function, we have
\[ E(f_n(X) 1_B(Y)) = E(f_n(X)) E(1_B(Y)). \]
Taking $n \to \infty$, we obtain (8.3) using the monotone convergence theorem.

Now we prove that (8.2) holds. Since (8.3) holds for each Borel set $B$, (8.2) holds when $g$ is a simple function. For general Borel measurable $g \ge 0$, we construct $g_n$ as in (5.4), and (8.2) follows from the monotone convergence theorem.
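The factorization (8.2) is easy to check by simulation. In the sketch below (ours), $X$ and $Y$ are independent by construction and $f$, $g$ are bounded continuous functions; the two sides of (8.2) agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10**6

X = rng.normal(size=n)           # X and Y independent by construction
Y = rng.exponential(size=n)

f = lambda x: np.cos(x)          # bounded Borel-measurable test functions
g = lambda y: np.exp(-y)

lhs = np.mean(f(X) * g(Y))             # E[f(X) g(Y)]
rhs = np.mean(f(X)) * np.mean(g(Y))    # E[f(X)] E[g(Y)]
print(lhs, rhs)                        # agree up to Monte Carlo error
```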
9 Inequalities

Lemma 9.1. (a) For any random variable $X$, any $a > 0$ and any $k \ge 1$,
\[ P\{|X| \ge a\} \le \frac{E(|X|^k)}{a^k}. \tag{9.1} \]
(b) For any $\theta > 0$ and $a \in \mathbb{R}$,
\[ P\{X \ge a\} \le \frac{E(e^{\theta X})}{e^{\theta a}}. \tag{9.2} \]

Remark 9.1. When $k = 1$ and $X \ge 0$, then (9.1) is known as the Markov inequality. When $k = 2$ and $E(X) = 0$, then (9.1) becomes the Chebyshev inequality.

Proof. To prove (a), we note that
\[ |X|^k \ge a^k 1_{\{|X| \ge a\}}. \]
Thus, $E(|X|^k) \ge E\big(a^k 1_{\{|X| \ge a\}}\big) = a^k P\{|X| \ge a\}$, from which (9.1) follows. To prove (b), note that
\[ e^{\theta X} \ge e^{\theta a} 1_{\{X \ge a\}}, \]
from which (9.2) follows.
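As a sanity check (ours, not in the notes), the following Monte Carlo experiment compares the two sides of (9.1) for an exponential random variable; the empirical tail probability always sits below the moment bound.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.exponential(size=10**6)   # E X = 1, E X^2 = 2

for a in (2.0, 4.0, 8.0):
    empirical = np.mean(np.abs(X) >= a)
    for k in (1, 2):
        bound = np.mean(np.abs(X) ** k) / a ** k   # right side of (9.1)
        print(a, k, empirical, bound)              # empirical <= bound
```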
We next recall the notion of variance.

Definition 9.1. For a random variable $X$, define $\mathrm{Var}(X) = E[(X - E(X))^2]$. We call $\mathrm{Var}(X)$ the variance of $X$.

Exercise 34. (a) Prove that $\mathrm{Var}(X) = E(X^2) - (E(X))^2$. (b) Prove that
\[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \]
when $X$ and $Y$ are two independent random variables.

Exercise 35. (a) For any random variable $X$,
\[ (E(|X|))^2 \le E(X^2). \]
(b) (Jensen's inequality) Suppose that $f : \mathbb{R} \to \mathbb{R}$ is convex and $X$ is any random variable. Prove that
\[ f(E(X)) \le E(f(X)). \tag{9.3} \]
Solutions to selected exercises

Solution to Exercise 1. Suppose that $\mathcal{F}_\alpha$ are σ-fields for $\alpha \in J$. The set $J$ is the index set and can be arbitrary. Define
\[ \mathcal{F} = \bigcap_{\alpha \in J} \mathcal{F}_\alpha. \]
We next use Definition 2.1 to verify that $\mathcal{F}$ is a σ-field.
1. $\Omega \in \mathcal{F}_\alpha$ for all $\alpha$, in particular $\Omega \in \mathcal{F}$.
2. If $A \in \mathcal{F}$, then $A \in \mathcal{F}_\alpha$ for all $\alpha \in J$. Since each $\mathcal{F}_\alpha$ is a σ-field, we must also have $A^c \in \mathcal{F}_\alpha$ for all $\alpha \in J$ and therefore $A^c \in \mathcal{F}$.
3. Given $\{A_i\}$ with $A_i \in \mathcal{F}$ for $i = 1, 2, \ldots$, we must have $A_i \in \mathcal{F}_\alpha$ for every $i$ and every $\alpha$. In particular, $\bigcup_i A_i \in \mathcal{F}_\alpha$ for every $\alpha$. This yields $\bigcup_i A_i \in \mathcal{F}$.

Solution to Exercise 2. It is sufficient to show that $\mathcal{I}_1 \subset \sigma(\mathcal{I}_2)$ and $\mathcal{I}_2 \subset \sigma(\mathcal{I}_1)$. Indeed, this would imply $\sigma(\mathcal{I}_1) = \sigma(\mathcal{I}_2)$. The first follows from $(a, b) = \bigcup_n (a, b - 1/n]$, the latter follows from $(a, b] = \bigcap_n (a, b + 1/n)$.

Solution to Exercise 3. (a) Since $\{HH, HT\}^c = \{TT, TH\} \notin \mathcal{L}$, $\mathcal{L}$ cannot be a λ-system. (b) Since $\mathcal{L}$ is not a λ-system, it cannot be a σ-field.

Solution to Exercise 4. Suppose the π-system $\mathcal{L}$ is also a λ-system. To check that it is a σ-field, only the third requirement (that $\mathcal{L}$ is closed under countable unions) requires a proof. Suppose $A_1, A_2, \ldots \in \mathcal{L}$, and write $A = \bigcup_i A_i$. Since $\mathcal{L}$ is closed under countable disjoint unions, we are done if we can write $A$ as a disjoint union of sets in $\mathcal{L}$. To do so, set $E_i = A_i \setminus \big( \bigcup_{j=1}^{i-1} A_j \big)$; clearly the sets $E_i$ are disjoint and $A = \bigcup_i E_i$. We next argue that $E_i \in \mathcal{L}$ by writing $E_i = A_i \cap A_{i-1}^c \cap \cdots \cap A_1^c$. Since $\mathcal{L}$ is a λ-system, each of the sets $A_{i-1}^c, \ldots, A_1^c$ lies in $\mathcal{L}$. We assume that $\mathcal{L}$ is also a π-system, so that $A, B \in \mathcal{L}$ implies $A \cap B \in \mathcal{L}$. As a result, each of the sets $E_i$ lies in $\mathcal{L}$.

Solution to Exercise 5. We focus on $\mathcal{I}_1$; the argument for $\mathcal{I}_2$ is identical. For $a < b < c < d$, we have
\[ (a, b) \cap (c, d) = \emptyset, \quad (a, c) \cap (b, d) = (b, c), \quad \text{and} \quad (a, d) \cap (b, c) = (b, c). \]

Solution to Exercise 6. It is straightforward to verify that the intersection of λ-systems is again a λ-system; thus we can interpret $\lambda(\mathcal{P})$ as the smallest λ-system containing $\mathcal{P}$. Since $\mathcal{P} \subset \mathcal{L}$, we thus obtain that $\lambda(\mathcal{P}) \subset \mathcal{L}$. If we know that $\lambda(\mathcal{P})$ is a π-system, then it is a σ-field by Exercise 4. Therefore $\lambda(\mathcal{P})$ contains $\sigma(\mathcal{P})$, the smallest σ-field containing $\mathcal{P}$. Putting things together, we obtain $\sigma(\mathcal{P}) \subset \lambda(\mathcal{P}) \subset \mathcal{L}$.

Solution to Exercise 8. For part (a), we write $B_i = A_i \cap A_{i-1}^c$ with $A_0 = \emptyset$, so that $\bigcup_i B_i = \bigcup_i A_i$ and $B_i \in \mathcal{F}$. Since the $B_i$ are mutually disjoint, we have
\[ \mu\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \mu\Big(\bigcup_{i=1}^{\infty} B_i\Big) = \sum_{i=1}^{\infty} \mu(B_i) = \lim_{L\to\infty} \sum_{i=1}^{L} \mu(B_i) = \lim_{L\to\infty} \mu\Big(\bigcup_{i=1}^{L} B_i\Big) = \lim_{L\to\infty} \mu(A_L). \]
For part (b), we take the complements of the sets $A_i$ and use part (a) together with De Morgan's law.

Solution to Exercise 9. Let $n_0$ satisfy $b - a > 1/n_0$. The first claim follows from Exercise 8 and
\[ \bigcup_{n=n_0}^{\infty} (a, b - 1/n] = (a, b). \]
Similarly, the second claim follows from
\[ \bigcap_{n=1}^{\infty} (a - 1/n, b] = [a, b]. \]

Solution to Exercise 14. The 'only if' part is trivial, so we focus on the 'if' part. For a set $B \subset \mathbb{R}$, we use $f^{-1}(B)$ to denote the set $\{\omega \in \Omega : f(\omega) \in B\}$. Let $\mathcal{A}$ be the set of subsets $B \subset \mathbb{R}$ such that $f^{-1}(B) \in \mathcal{F}$. We first show that $\mathcal{A}$ is a σ-field on $\mathbb{R}$. (i) Note that $f^{-1}(\mathbb{R}) = \Omega \in \mathcal{F}$. Thus, we have $\mathbb{R} \in \mathcal{A}$. (ii) Given an $A \in \mathcal{A}$, we have $f^{-1}(A^c) = (f^{-1}(A))^c$ because $\omega \in f^{-1}(A^c)$ is equivalent to $f(\omega) \notin A$, and $\omega \in (f^{-1}(A))^c$ is equivalent to $f(\omega) \notin A$. Assume $A \in \mathcal{A}$. Then $f^{-1}(A) \in \mathcal{F}$, which implies that $(f^{-1}(A))^c \in \mathcal{F}$. Thus, $f^{-1}(A^c) \in \mathcal{F}$, which implies that $A^c \in \mathcal{A}$. (iii) Assume that $A_i \in \mathcal{A}$; we would like to show that $\bigcup_i A_i \in \mathcal{A}$. It is sufficient to prove that
\[ f^{-1}\Big(\bigcup_i A_i\Big) = \bigcup_i f^{-1}(A_i). \]
We first prove that $f^{-1}(\bigcup_i A_i) \subset \bigcup_i f^{-1}(A_i)$. To see this, let $\omega \in f^{-1}(\bigcup_i A_i)$. Then, $f(\omega) \in \bigcup_i A_i$. Therefore, there exists a $j$ such that $f(\omega) \in A_j$, which implies that $\omega \in f^{-1}(A_j)$ and thus $\omega \in \bigcup_i f^{-1}(A_i)$, proving $f^{-1}(\bigcup_i A_i) \subset \bigcup_i f^{-1}(A_i)$. Now we prove $\bigcup_i f^{-1}(A_i) \subset f^{-1}(\bigcup_i A_i)$. Let $\omega \in \bigcup_i f^{-1}(A_i)$. There exists a $j$ such that $\omega \in f^{-1}(A_j)$. Thus, $f(\omega) \in A_j \subset \bigcup_i A_i$. Therefore, $\omega \in f^{-1}(\bigcup_i A_i)$. Hence we have $\bigcup_i f^{-1}(A_i) \subset f^{-1}(\bigcup_i A_i)$. Thus, we have proved $\mathcal{A}$ is a σ-field on $\mathbb{R}$. We know that $f^{-1}((a, b)) \in \mathcal{F}$ for $a < b$, so $(a, b) \in \mathcal{A}$ for $a < b$. Since $\mathcal{B}$ is the smallest σ-field containing the intervals $(a, b)$ for $a < b$, we must have $\mathcal{B} \subset \mathcal{A}$. In particular, for each $B \in \mathcal{B}$, we have $B \in \mathcal{A}$, which means that $f^{-1}(B) \in \mathcal{F}$.

Solution to Exercise 15.
1. Since any $O \in \mathcal{O}$ can be written as $O = \bigcup_i I_i$ for open intervals $I_i$, and each of the $I_i$ lies in $\mathcal{I}_1 \subset \mathcal{B}$, we must have $O \in \mathcal{B}$ and therefore $\sigma(\mathcal{O}) \subset \mathcal{B}$. Conversely, $\mathcal{I}_1 \subset \mathcal{O}$ implies $\mathcal{B} = \sigma(\mathcal{I}_1) \subset \sigma(\mathcal{O})$.
2. A function $f$ is measurable if and only if $f^{-1}(B)$ lies in $\mathcal{B}$ for every $B \in \mathcal{B}$. Thus, the 'only if' part is an immediate consequence of $\mathcal{O} \subset \mathcal{B}$. For 'if', note that $\mathcal{A} = \{B \subset \mathbb{R} : f^{-1}(B) \in \mathcal{B}\}$ is a σ-field (Exercise 14). It is given that $\mathcal{O} \subset \mathcal{A}$, so part (a) yields $\mathcal{B} = \sigma(\mathcal{O}) \subset \mathcal{A}$.
3. $f$ is continuous if and only if $f^{-1}(O) \in \mathcal{O}$ for every $O \in \mathcal{O}$. Since $\mathcal{O} \subset \mathcal{B}$, this implies that $f^{-1}(O) \in \mathcal{B}$ for every $O \in \mathcal{O}$, i.e., measurability of $f$.

Solution to Exercise 16. Any nonnegative simple function $f$ for which $f(x) \le g_1(x)$ for $x \in \Omega$ also satisfies $f(x) \le g_2(x)$ for $x \in \Omega$ and therefore
\[ \sup_{\substack{f \le g_1 \\ f \text{ is simple}}} \int_\Omega f(x)\, d\mu \le \sup_{\substack{f \le g_2 \\ f \text{ is simple}}} \int_\Omega f(x)\, d\mu, \]
as required.
Solution to Exercise 17. By definition of the integral, we have
\[ \int_\Omega \alpha g(x)\, d\mu = \sup_{f \le \alpha g} \int_\Omega f(x)\, d\mu, \]
omitting the subscript '$f$ simple' for notational convenience. For $\alpha > 0$ (the case $\alpha = 0$ is trivial), the function $f$ is simple if and only if the function $f/\alpha$ is simple, so using the substitution $f = \alpha h$ for simple $h$, we get $\int_\Omega \alpha g(x)\, d\mu = \sup_{h \le g} \int_\Omega \alpha h(x)\, d\mu$. Since $h$ is simple, we readily find that $\int_\Omega \alpha h(x)\, d\mu = \alpha \int_\Omega h(x)\, d\mu$. Therefore,
\[ \int_\Omega \alpha g(x)\, d\mu = \alpha \sup_{h \le g} \int_\Omega h(x)\, d\mu = \alpha \int_\Omega g(x)\, d\mu. \]
Solution to Exercise 19. Write $h_n(x) = \sum_{k=1}^{n} g_k(x)$, and note that $h_n \le h_{n+1}$ for every $n$. (a) On the right-hand side, $\sum_{k=1}^{\infty} g_k(x) = \lim_{n\to\infty} h_n(x)$, and this limit exists in $\overline{\mathbb{R}}$ as the limit of a monotone sequence. The left-hand side is defined as $\lim_{n\to\infty} \sum_{k=1}^{n} \int_\Omega g_k(x)\, d\mu = \lim_{n\to\infty} \int_\Omega h_n(x)\, d\mu$ by Exercise 22. This limit is well-defined in $\overline{\mathbb{R}}$ since $\int_\Omega h_n(x)\, d\mu$ is also monotone by Exercise 16. (b) The MCT applied to $h_n$ yields
\[ \sum_{k=1}^{\infty} \int_\Omega g_k(x)\, d\mu = \lim_{n\to\infty} \int_\Omega h_n(x)\, d\mu = \int_\Omega \lim_{n\to\infty} h_n(x)\, d\mu = \int_\Omega \sum_{k=1}^{\infty} g_k(x)\, d\mu. \]

Solution to Exercise 20. Throughout, write $\lfloor a \rfloor$ for the largest integer that does not exceed $a$ (rounding down). Part (a). Select some $n$ and $x \in \mathbb{R}$. If $g(x) \ge n + 1$, then $n = g_n(x) \le g_{n+1}(x) = n + 1$. If $n \le g(x) < n + 1$, then $g_{n+1}(x) \in [n, n+1)$ and therefore $n = g_n(x) \le g_{n+1}(x)$. If $g(x) < n$, then $g_n(x) = 2^{-n} \lfloor 2^n g(x) \rfloor$ and $g_{n+1}(x) = 2^{-n-1} \lfloor 2^{n+1} g(x) \rfloor$. Thus, the statement follows from $2 \lfloor 2^n g(x) \rfloor \le \lfloor 2 \cdot 2^n g(x) \rfloor$.

Part (b). Fix some $x \in \mathbb{R}$. Suppose $g(x) < \infty$, and choose $n \ge g(x)$. Writing $g_n(x) = g(x) + [g_n(x) - g(x)]$, note that $0 \le g(x) - g_n(x) \le 2^{-n} \to 0$. Next suppose $g(x) = \infty$; then $g_n(x) = n \to \infty = g(x)$.

Part (c). Write $\mathcal{T}_n = \{[i 2^{-n}, (i+1) 2^{-n}) : i = 0, 1, \ldots, n 2^n - 1\} \cup \{[n, \infty)\}$. Since $g$ is measurable, $g^{-1}(A)$ is measurable for each $A \in \mathcal{T}_n$; see also Exercise 14. Therefore, we have
\[ g_n(x) = \sum_{i=0}^{n 2^n - 1} \frac{i}{2^n}\, 1_{g^{-1}([i 2^{-n}, (i+1) 2^{-n}))}(x) + n\, 1_{g^{-1}([n, \infty))}(x), \]
so $g_n$ is a simple function.
Solution to Exercise 21. We immediately get this from the previous exercise and the MCT.
Solution to Exercise 22. Let $g_1^n$ and $g_2^n$ be defined as in (5.4) but with $g_1$ and $g_2$ replacing $g$, respectively. Since $g_1^n(x) + g_2^n(x)$ is nondecreasing in $n$ for every $x$, we have by the MCT,
\[ \int_\Omega (g_1(x) + g_2(x))\, d\mu = \lim_{n\to\infty} \int_\Omega (g_1^n(x) + g_2^n(x))\, d\mu. \]
Since $g_1^n$ and $g_2^n$ are simple functions by Exercise 20(c), we have $\int_\Omega (g_1^n(x) + g_2^n(x))\, d\mu = \int_\Omega g_1^n(x)\, d\mu + \int_\Omega g_2^n(x)\, d\mu$. This identity can be verified upon noting that for any two simple functions $f(x) = \sum_{i=1}^{m} f_i 1_{A_i}(x)$ and $h(x) = \sum_{i=1}^{n} h_i 1_{B_i}(x)$, we have, since $f + h$ is simple,
\[ \int (f + h)\, d\mu = \sum_{i=1}^{m} f_i \mu(A_i) + \sum_{i=1}^{n} h_i \mu(B_i) = \int f\, d\mu + \int h\, d\mu. \]
We conclude that (again applying the MCT, twice!)
\[ \lim_{n\to\infty} \int_\Omega \big(g_1^n(x) + g_2^n(x)\big)\, d\mu = \int_\Omega g_1(x)\, d\mu + \int_\Omega g_2(x)\, d\mu, \]
as required.

Solution to Exercise 23. It may be tempting to apply the monotone convergence theorem since the $g_n$ are a monotone sequence of functions, but the MCT only says that
\[ \lim_{n\to\infty} \int g_n(x)\, d\mu = \int g(x)\, d\mu, \]
where $g(x) = x$ for $x \ge 0$ and $g(x) = 0$ otherwise. The latter integral is $\sum_{i=1}^{\infty} i = \infty$, so the MCT doesn't give us what we want.

Instead we simply note that
\[ \int g_n(x)\, d\mu = \sum_{i=1}^{n} i = \frac{1}{2} n(n+1), \]
where the first equality follows for instance by noting that (by the MCT)
\begin{align*}
\int g_n(x)\, d\mu &= \lim_{m\to\infty} \int g_{n,m}\, d\mu = \lim_{m\to\infty} \sum_{k=1}^{n 2^m - 1} 2^{-m} \mu[k/2^m, n] \\
&= \lim_{m\to\infty} 2^{-m} \big[ n \cdot 2^m + (n-1) \cdot 2^m + \ldots + (2^m - 1) \big] = \sum_{i=1}^{n} i,
\end{align*}
where the simple function $g_{n,m}$ is defined as
\[ g_{n,m}(x) = \sum_{k=1}^{n 2^m - 1} 2^{-m}\, 1_{[k/2^m, n]}(x) = \begin{cases} 2^{-m} \lfloor 2^m x \rfloor & 0 \le x \le n \\ 0 & \text{otherwise.} \end{cases} \]
To conclude, the integral $\int g_n(x)\, d\mu$ converges to $1/2$ upon division by $n^2$, as $n \to \infty$.

Solution to Exercise 24. Part (a). Since $g(x) 1_A(x) \le g(x)$, we get the claim immediately from Exercise 16 and the assumption $\int_\Omega g(x)\, d\mu < \infty$.

Part (b). Using Exercise 22 as required, we obtain
\begin{align*}
\nu(A_1 \cup A_2) &= \int_\Omega g(x) 1_{A_1 \cup A_2}(x)\, d\mu = \int_\Omega g(x) [1_{A_1}(x) + 1_{A_2}(x)]\, d\mu \\
&= \int_\Omega g(x) 1_{A_1}(x)\, d\mu + \int_\Omega g(x) 1_{A_2}(x)\, d\mu = \nu(A_1) + \nu(A_2).
\end{align*}

Part (c). Suppose $\{A_i\} \subset \mathcal{F}$ is a sequence of mutually disjoint sets. The argument in (b) extends to any finite number of sets, so $\nu(A_1 \cup \cdots \cup A_k) = \sum_{i=1}^{k} \nu(A_i)$. Write $B_n = \bigcup_{i=1}^{n} A_i$ and $B_\infty = \bigcup_{i=1}^{\infty} A_i$. Then, by the MCT,
\begin{align*}
\nu(B_\infty) &= \int_\Omega g(x) 1_{B_\infty}(x)\, d\mu = \int_\Omega \lim_{n\to\infty} g(x) 1_{B_n}(x)\, d\mu \\
&= \lim_{n\to\infty} \int_\Omega g(x) 1_{B_n}(x)\, d\mu = \lim_{n\to\infty} \nu(B_n) = \lim_{n\to\infty} \sum_{i=1}^{n} \nu(A_i) = \sum_{i=1}^{\infty} \nu(A_i),
\end{align*}
as required.

Solution to Exercise 25. Write $f_n$ for the approximation of $f$ that is constructed as in (5.4). We know from Exercise 21 that $\int_\Omega f(x)\, d\nu = \lim_{n\to\infty} \int_\Omega f_n(x)\, d\nu$. For any simple function $h$ we have, in self-evident notation,
\[ \int_\Omega h(x)\, d\nu = \sum_{i=1}^{m} h_i \nu(A_i) = \sum_{i=1}^{m} h_i \int_{A_i} g(x)\, d\mu = \int_\Omega h(x) g(x)\, d\mu. \]
Thus, since the $f_n$ are simple functions, we have
\[ \int_\Omega f(x)\, d\nu = \lim_{n\to\infty} \int_\Omega f_n(x)\, d\nu = \lim_{n\to\infty} \int_\Omega f_n(x) g(x)\, d\mu. \]
Since $g$ is nonnegative and $f_n \le f_{n+1}$ by Exercise 20, we also have $g f_n \le g f_{n+1}$. Moreover, the pointwise limit of $f_n g$ is $f g$, so the claim follows from the MCT.

Solution to Exercise 26. It is sufficient to prove this for nonnegative $g$. First note that
\[ \int_{[a,b]} g(t)\, dh(t) = \int_{\{a\}} g(t)\, dh(t) + \int_{(a,b]} g(t)\, dh(t). \]
Using the definition of the first integral on the right-hand side, we find that
\[ \int_{\{a\}} g(t)\, dh(t) = \int_{\mathbb{R}} g(t) 1_{\{a\}}(t)\, dh(t) = \int_{\mathbb{R}} g(a) 1_{\{a\}}(t)\, dh(t) = g(a) \int_{\mathbb{R}} 1_{\{a\}}(t)\, dh(t). \]
Since $\lim_{n\to\infty} 1_{(a-1/n, a]}(t) = 1_{\{a\}}(t)$, the MCT shows that
\begin{align*}
\int_{\mathbb{R}} 1_{\{a\}}(t)\, dh(t) &= \int_{\mathbb{R}} \lim_{n\to\infty} 1_{(a-1/n, a]}(t)\, dh(t) = \lim_{n\to\infty} \int_{\mathbb{R}} 1_{(a-1/n, a]}(t)\, dh(t) \\
&= \lim_{n\to\infty} [h(a) - h(a - 1/n)] = h(a) - h(a-),
\end{align*}
and this was to be shown.

Solution to Exercise 27.
(a) We know that $g$ is measurable if and only if $g^{-1}((-\infty, a]) = \{\omega : g(\omega) \le a\} \in \mathcal{B}$. (You can alternatively check this for open intervals instead of sets of the form $(-\infty, a]$, but this one requires less writing.) If $a > 1$, then $g^{-1}((-\infty, a]) = \mathbb{R} \in \mathcal{B}$. If $a < 0$, then $g^{-1}((-\infty, a]) = \emptyset \in \mathcal{B}$. If $0 \le a \le 1$, then $g^{-1}((-\infty, a]) = (-\infty, a - 1] \cup [2(1 - a), \infty) \in \mathcal{B}$.

(b) We have
\begin{align*}
\int_{[1/2,1]} g(x)\, dh(x) &= g(1/2)\big(h(1/2) - h(1/2-)\big) + \int_{(1/2,1]} g(x)\, dh(x) \\
&= (3/4)(14/4 - 11/4) + \int_{(1/2,1]} (1 - x/2)(1/2)\, dx \\
&= \frac{9}{16} + \frac{1}{2}\Big[x - \frac{x^2}{4}\Big]_{x=1/2}^{1} = \frac{9}{16} + \frac{5}{32} = \frac{23}{32}.
\end{align*}

(c) We have
\begin{align*}
\int_{\mathbb{R}} g(x)\, dh(x) &= \int_{(-\infty,-1)} 0\, dh(x) + g(-1)\big(h(-1) - h(-1-)\big) + \int_{(-1,0]} g(x)\, dh(x) \\
&\quad + \int_{(0,\frac{1}{2})} g(x)\, dh(x) + g(1/2)\big(h(1/2) - h(1/2-)\big) + \int_{(\frac{1}{2},\frac{3}{2})} g(x)\, dh(x) \\
&\quad + g(3/2)\big(h(3/2) - h(3/2-)\big) + \int_{(\frac{3}{2},2)} g(x)\, dh(x) + \int_{[2,\infty)} 0\, dh(x) \\
&= \int_{(-1,0]} 2(x+1)^2\, dx + \int_{(0,\frac{1}{2})} 2(1 - x/2)(x+1)\, dx + (3/4)(3/4) \\
&\quad + \int_{(\frac{1}{2},\frac{3}{2})} (1/2)(1 - x/2)\, dx + (1/4)(5 - 16/4) \\
&= \frac{2}{3}(x+1)^3\Big|_{x=-1}^{0} + \Big[-\frac{1}{3}x^3 + \frac{1}{2}x^2 + 2x\Big]_{x=0}^{\frac{1}{2}} + \frac{9}{16} + \frac{1}{2}\Big[x - \frac{x^2}{4}\Big]_{x=\frac{1}{2}}^{\frac{3}{2}} + \frac{1}{4} \\
&= \frac{2}{3} + \frac{13}{12} + \frac{9}{16} + \frac{1}{4} + \frac{1}{4} = \frac{135}{48}.
\end{align*}
Solution to Exercise 28. Part (a). For any $B \subset \mathbb{R}$, we find that $X^{-1}(B) = [0, 1] \cap B$. Thus, if $B$ is of the form $(-\infty, a]$ we must have that $X^{-1}(B)$ lies in $\mathcal{F}$, since we know that both $[0, 1]$ and $B$ lie in $\mathcal{B}$.

Part (b). Note that $P(X \le a) = P([0, 1] \cap (-\infty, a])$, which equals $a$ for $a \in [0, 1]$, equals 0 for $a < 0$, and equals 1 for $a > 1$. Therefore $X$ has a uniform distribution on $[0, 1]$.
Solution to Exercise 30. For two functions $f : X \to Y$ and $g : Y \to Z$, the notation $g \circ f$ is used for the composition of $f$ and $g$, i.e., $g \circ f(x) = g(f(x))$. Note that $g \circ f$ is a function from $X$ to $Z$.
Because we write $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$ and because $X$ is a random variable, we can interpret $X^{-1}$ as a map from $\mathcal{B}$ to $\mathcal{F}$. Since $P$ is a map from $\mathcal{F}$ to $[0, 1]$, the notation $P \circ X^{-1}$ makes sense. To see that $P \circ X^{-1}$ agrees with $\mu_F$, note that $P(X^{-1}(B)) = P(X \in B)$ for $B \in \mathcal{B}$.

Solution to Exercise 31. Part (a). Fix $a \in \mathbb{R}$. We need to check that $\{\omega \in \Omega : X(\omega) \le a\} \in \mathcal{F}$. For $a \le 0$, this set equals $\emptyset$ (which lies in $\mathcal{F}$), so we let $a > 0$. Then $\{\omega : X(\omega) \le a\} = \{\omega : -\log \omega \le a\} = \{\omega : \omega \ge e^{-a}\} = [e^{-a}, 1) \in \mathcal{F}$, as required.

Part (b). Let $F$ be the distribution function of $X$. Then clearly $F(x) = 0$ for $x \le 0$, and for $x \ge 0$ we have
\[ F(x) = P\{\omega : X(\omega) \le x\} = P\{\omega : -\log(\omega) \le x\} = P\{\omega : \omega \ge e^{-x}\} = P[e^{-x}, 1) = 1 - e^{-x}. \]
The distribution function is related to the measure $P \circ X^{-1}$ through $F(x) = P \circ X^{-1}(-\infty, x]$ for $x \in \mathbb{R}$.

Solution to Exercise 32. $\mathcal{F}$ and $\mathcal{G}$ are independent if for each $A \in \mathcal{F}$ and $B \in \mathcal{G}$, it holds that $P(A \cap B) = P(A) P(B)$. We see that this is true because $P(\Omega) = 1$ and $P(\emptyset) = 0$, so $P(A \cap \Omega) = P(A) = P(A) \cdot 1 = P(A) P(\Omega)$ and $P(A \cap \emptyset) = P(\emptyset) = 0 = P(A) P(\emptyset)$.

Solution to Exercise 33. $X$ is a random variable on $(\Omega, \mathcal{G})$ if and only if $X^{-1}(B) \in \mathcal{G}$ for $B \in \mathcal{B}$, which can be reformulated as $\{X^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\} \subset \mathcal{G}$. Therefore, since $\sigma(X)$ is defined as the smallest possible choice for the σ-field $\mathcal{G}$, we are done if we show that $\{X^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}$ is a σ-field. We do not work this out here, but one readily verifies the three conditions.

Solution to Exercise 35. (a) By definition, we have $\mathrm{Var}(|X|) \ge 0$. But also $\mathrm{Var}(|X|) = E(X^2) - (E(|X|))^2$, so the conclusion follows. (b) Since $f$ has a subderivative at $E(X)$, we have $ax + b \le f(x)$ for some $a, b \in \mathbb{R}$, with equality for $x = E(X)$. Integrating $a X(\omega) + b \le f(X(\omega))$ over $P$, we get $a E(X) + b \le E(f(X))$. The left-hand side equals $f(E(X))$.
10 The law of large numbers

In your undergraduate probability course, you may have encountered a formalization of the idea that averages approach the mean if one repeats the same experiment many times. The simplest of these formulations is the weak law of large numbers: if $\{\xi_i\}$ is a sequence of i.i.d. random variables with $E(|\xi_1|) < \infty$, then for any $\epsilon > 0$,
\[ \lim_{n\to\infty} P\big\{|(\xi_1 + \cdots + \xi_n)/n - E(\xi_1)| > \epsilon\big\} = 0. \]
This is an immediate consequence of Chebyshev's inequality in the finite variance case.

In this section, our focus lies on formulating and proving a stronger statement, known as the strong law of large numbers. In the strong version, the limit is taken inside of the probability instead of outside; take a quick look at Theorem 10.1 to see the difference. The exact connection between the weak and strong statement will be explored further in the next section.

To prove this theorem, we need quite a few auxiliary lemmas and results. Let $\{a_n\} \subset \mathbb{R}$ be a sequence of real numbers.

Definition 10.1. (a) The limit of $\{a_n\}$ is said to exist and be equal to $a \in \mathbb{R}$ if for every $\epsilon > 0$, there exists $N \in \mathbb{Z}_+$ such that for each $n \ge N$,
\[ |a_n - a| < \epsilon. \]
(b) The limit of $\{a_n\}$ is said to be equal to $\infty$ if for every $M > 0$, there exists $N \in \mathbb{Z}_+$ such that for each $n \ge N$,
\[ a_n > M. \]
(c) The limit of $\{a_n\}$ is said to be equal to $-\infty$ if for every $M > 0$, there exists $N \in \mathbb{Z}_+$ such that for each $n \ge N$,
\[ a_n < -M. \]

We write $\lim_{n\to\infty} a_n = a$ when the limit exists and is equal to $a$. Here, $a$ can be any real number in $\mathbb{R}$ or $a = \infty$ or $a = -\infty$. The three parts of this definition can be unified upon defining a neighborhood of $+\infty$ and $-\infty$ as a set of the form $(M, \infty)$ and of the form $(-\infty, -M)$ for some $M$. In words, the sequence $\{a_n\}$ has limit $a \in \mathbb{R} \cup \{-\infty, \infty\}$ if eventually all elements in the sequence lie in any neighborhood of $a$.

Remark 10.1. In the definition, $\epsilon$ can be replaced by $1/k$ for each positive integer $k$. Similarly, $M$ can be replaced by $k$. Also, all strict inequalities can be replaced by 'less than or equal to'.

To avoid cumbersome notation, we write $\lim_{n\to\infty} a_n \neq a$ if we are in one of two cases: (1) $\lim_{n\to\infty} a_n$ does not exist, or (2) $\lim_{n\to\infty} a_n$ exists and is not equal to $a$.

Lemma 10.1. (a) $\lim_{n\to\infty} a_n = a$ if and only if for each positive integer $k$, the set
\[ \{n \in \mathbb{Z}_+ : |a_n - a| \ge 1/k\} \text{ is finite.} \]
(b) $\lim_{n\to\infty} a_n \neq a$ if and only if there exists a positive integer $k$ such that the set
\[ \{n \in \mathbb{Z}_+ : |a_n - a| \ge 1/k\} \text{ is infinite.} \]

Whenever the set $\{n \in \mathbb{Z}_+ : |a_n - a| \ge 1/k\}$ is infinite, we say that $|a_n - a| \ge 1/k$ infinitely often (i.o.). It is important to realize that this statement is a property of the whole sequence, so the $n$ in this statement is an auxiliary variable. It only looks like this statement depends on $n$, but it doesn't!

Exercise 36. Fix some integer $k \ge 1$. Show that $|a_n - a| \ge 1/k$ i.o. if and only if for each $m \ge 1$, there exists some $n \ge m$ such that $|a_n - a| \ge 1/k$.

For a given probability space $(\Omega, \mathcal{F}, P)$, let $\{X_n : n \in \mathbb{Z}_+\}$ be a sequence of random variables. For each $n \in \mathbb{Z}_+$, $X_n : \Omega \to \mathbb{R}$ is a random variable. For each $\omega \in \Omega$, $\{X_n(\omega) : n \in \mathbb{Z}_+\}$ is a sequence of real numbers. Now we characterize the set of $\omega$'s such that $\lim_{n\to\infty} X_n(\omega) = 0$.

Lemma 10.2. For any sequence of random variables $\{X_n : n \in \mathbb{Z}_+\}$, the following holds:
\[ \{\omega : \lim_{n\to\infty} X_n(\omega) \neq 0\} = \bigcup_{k=1}^{\infty} \{\omega : |X_n(\omega)| \ge 1/k \text{ i.o.}\}. \tag{10.1} \]

Proof. Suppose that $\omega \in \{\omega : \lim_{n\to\infty} X_n(\omega) \neq 0\}$. Then, there exists a $k = k(\omega)$ such that $|X_n(\omega)| \ge 1/k$ i.o. Therefore, $\omega \in \bigcup_{k=1}^{\infty} \{\omega : |X_n(\omega)| \ge 1/k \text{ i.o.}\}$. Conversely, if $\omega \in \bigcup_{k=1}^{\infty} \{\omega : |X_n(\omega)| \ge 1/k \text{ i.o.}\}$, there exists a $k = k(\omega)$ such that $\omega \in \{\omega : |X_n(\omega)| \ge 1/k \text{ i.o.}\}$.

The crux of the preceding lemma is that an event has been rewritten as a countable union. The following exercise establishes a useful bound for the measure of a countable union. No analog exists for an uncountable union; this is the reason we switched from $\epsilon$ to $1/k$!

Exercise 37. Given a measure $\mu$ on $(\Omega, \mathcal{F})$ and any sequence $\{A_i\} \subset \mathcal{F}$, prove the following union bound:
\[ \mu\Big(\bigcup_{i=1}^{\infty} A_i\Big) \le \sum_{i=1}^{\infty} \mu(A_i). \]

Lemma 10.3. Assume that for each $\epsilon > 0$, one has
\[ P\{\omega : |X_n(\omega)| \ge \epsilon \text{ i.o.}\} = 0. \tag{10.2} \]
Then we have
\[ P\{\omega : \lim_{n\to\infty} X_n(\omega) = 0\} = 1. \]

Proof. It suffices to prove that
\[ P\{\omega : \lim_{n\to\infty} X_n(\omega) \neq 0\} = 0. \tag{10.3} \]
To prove (10.3), note that
\begin{align*}
P\{\omega : \lim_{n\to\infty} X_n(\omega) \neq 0\} &= P\Big(\bigcup_{k=1}^{\infty} \{\omega : |X_n(\omega)| \ge 1/k \text{ i.o.}\}\Big) \\
&\le \sum_{k=1}^{\infty} P\big(\{\omega : |X_n(\omega)| \ge 1/k \text{ i.o.}\}\big) = 0,
\end{align*}
where the last equality follows from assumption (10.2).

Exercise 38. For a sequence $\{X_n\}$ of random variables, show that for any fixed $\epsilon > 0$,
\[ \{\omega : |X_n(\omega)| \ge \epsilon \text{ i.o.}\} = \bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} \{\omega : |X_n(\omega)| \ge \epsilon\}. \]

Lemma 10.4 (Borel-Cantelli lemma). Let $\epsilon > 0$ be fixed. If
\[ \sum_{n=1}^{\infty} P\{\omega : |X_n(\omega)| \ge \epsilon\} < \infty, \tag{10.4} \]
then
\[ P\{\omega : |X_n(\omega)| \ge \epsilon \text{ i.o.}\} = 0. \]

Proof. Let $\epsilon > 0$ be fixed. By Exercise 38, we have
\begin{align*}
P\{\omega : |X_n(\omega)| \ge \epsilon \text{ i.o.}\} &= P\Big(\bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} \{\omega : |X_n(\omega)| \ge \epsilon\}\Big) \\
&= \lim_{m\to\infty} P\Big(\bigcup_{n=m}^{\infty} \{\omega : |X_n(\omega)| \ge \epsilon\}\Big) \\
&\le \lim_{m\to\infty} \sum_{n=m}^{\infty} P\big(\{\omega : |X_n(\omega)| \ge \epsilon\}\big) = 0,
\end{align*}
where the last equality follows from (10.4) and the fact that $\sum_{n=1}^{\infty} |a_n| < \infty$ if and only if $\lim_{m\to\infty} \sum_{n=m}^{\infty} |a_n| = 0$ for any sequence $\{a_n\} \subset \mathbb{R}$.

Exercise 39. Let $Z_1, Z_2, \ldots$ be i.i.d. random variables with the standard normal distribution. Write $r(n) = \sqrt{\beta \log(n+1)}$ for some $\beta > 4$. Show that
\[ P\{\omega : |Z_n(\omega)| \ge r(n) \text{ i.o.}\} = 0. \]
Theorem 10.1 (Strong law of large numbers (SLLN)). Let $\{\xi_i : i = 1, 2, \ldots\}$ be a sequence of i.i.d. random variables. Assume that $E(\xi_1) = \mu$ exists and is finite. Then
\[ P\Big\{\omega : \lim_{n\to\infty} \frac{\xi_1 + \cdots + \xi_n}{n} = \mu\Big\} = 1. \tag{10.5} \]

Proof. Let
\[ X_n(\omega) = \frac{\xi_1 + \cdots + \xi_n}{n} - \mu. \]
It suffices to prove that
\[ P\{\omega : \lim_{n\to\infty} X_n(\omega) = 0\} = 1. \]
By Lemmas 10.3 and 10.4, it suffices to prove (10.4) for each $\epsilon > 0$. Let $\epsilon > 0$. We now prove (10.4) under the additional assumption that
\[ E(\xi_1^4) < \infty. \tag{10.6} \]
By inequality (9.1) with $k = 4$, we have
\[ P\{|X_n| \ge \epsilon\} \le \frac{E(|X_n|^4)}{\epsilon^4}. \]
Now, we show that $\sum_{n=1}^{\infty} E(|X_n|^4) < \infty$ when $\mu = 0$. (When $\mu \neq 0$, we set $\xi_i' = \xi_i - \mu$; then $\{\xi_i'\}$ is i.i.d. and $E(\xi_1') = 0$.) Note that
\[ E(|X_n|^4) = \frac{1}{n^4} E\big((\xi_1 + \cdots + \xi_n)^4\big). \]
When we expand the right-hand side, since $E\xi_i = 0$ and the $\xi_i$ are i.i.d., the only nonzero terms are those of the form $E\xi_i^4 = E\xi_1^4$ ($n$ terms) and of the form $E(\xi_i^2 \xi_j^2) = (E(\xi_1^2))^2$ with $i \neq j$ ($3n(n-1)$ terms). Therefore,
\[ E(|X_n|^4) = \frac{E(\xi_1^4)}{n^3} + \frac{3n(n-1)}{n^4}\big(E(\xi_1^2)\big)^2 \le \frac{E(\xi_1^4)}{n^3} + \frac{3\big(E(\xi_1^2)\big)^2}{n^2}. \]
We thus obtain $\sum_{n=1}^{\infty} E(|X_n|^4) < \infty$, which proves the theorem under the assumption (10.6).
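A quick simulation (ours) illustrates (10.5): along each sample path, the running average of i.i.d. uniform(0, 1) variables settles down to $\mu = 1/2$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sample paths of the running average for i.i.d. uniforms (mu = 1/2).
n = 10**5
for path in range(3):
    xi = rng.random(n)
    averages = np.cumsum(xi) / np.arange(1, n + 1)
    print(path, averages[[99, 9_999, n - 1]])   # drifting towards 0.5
```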
11 Modes of convergence

There are several ways in which a sequence of random variables can converge, and it is the objective of this section to discuss these different modes and the relationships amongst them.

Definition 11.1. (a) A sequence of random variables $\{X_n\}$ is said to converge to a random variable $X$ in probability if for each $\epsilon > 0$,
\[ \lim_{n\to\infty} P\{\omega : |X_n(\omega) - X(\omega)| \ge \epsilon\} = 0. \]
(b) A sequence of random variables $\{X_n\}$ is said to converge to a random variable $X$ almost surely (a.s.) if
\[ P\{\omega : \lim_{n\to\infty} |X_n(\omega) - X(\omega)| = 0\} = 1. \]
(c) A sequence of random variables $\{X_n\}$ is said to converge to a random variable $X$ in distribution if
\[ \lim_{n\to\infty} F_n(x) = F(x) \]
for each $x \in \mathbb{R}$ such that $F$ is continuous at $x$, where $F_n$ is the cdf of $X_n$ and $F$ is the cdf of $X$.
(d) A sequence of random variables $\{X_n\}$ is said to converge to a random variable $X$ in $L^k$ if
\[ \lim_{n\to\infty} E(|X_n - X|^k) = 0, \]
where $k > 0$ is fixed.

An example of convergence in probability is given by the weak law of large numbers. An example of almost sure convergence is given by the strong law of large numbers, which is discussed in Section 10. In mathematical finance, convergence in $L^2$ is used to define stochastic integrals, which represent portfolio gains; such integrals require a different construction than in Section 4 since they cannot be interpreted in the usual (Lebesgue-Stieltjes) sense. The relations between the different modes of convergence are depicted in Figure 3, and it is the objective of this section to discuss this diagram in more detail.

Figure 3: Relations between the different modes of convergence. Also given are a summary of the definitions and common ways to denote these modes of convergence.

Exercise 40. (a) If $X_n$ converges to $X$ almost surely, then $X_n$ converges to $X$ in probability. (b) If $X_n$ converges to $X$ in $L^1$, then $X_n$ converges to $X$ in probability.
Exercise 41. In this exercise, you show that convergence in probability does not imply almost sure convergence. Suppose $\{X_n\}$ is a sequence of independent random variables on $\Omega$ with $P(X_n = 1) = 1 - \exp(-1/n)$ and $P(X_n = 0) = \exp(-1/n)$. (a) Show that $X_n$ converges to zero in probability. (b) Show that $P\big(\bigcup_{m=1}^{\infty} \bigcap_{n=m}^{\infty} \{\omega : |X_n(\omega)| \le \epsilon\}\big) = 0$ for every $\epsilon \in (0, 1)$. (c) Use part (b) to show that $X_n$ does not converge to zero almost surely.
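The following sketch (ours) simulates the sequence from Exercise 41: the marginal probabilities $P(X_n = 1) \approx 1/n$ vanish (convergence in probability), yet since $\sum_n P(X_n = 1) = \infty$ and the $X_n$ are independent, ones keep appearing along every simulated path, in line with part (b).

```python
import numpy as np

rng = np.random.default_rng(5)

# X_n = 1 with probability 1 - exp(-1/n), independently across n.
N = 10**5
n = np.arange(1, N + 1)
p = 1 - np.exp(-1 / n)           # ~ 1/n, so P(X_n = 1) -> 0: conv. in prob.
X = (rng.random(N) < p).astype(int)

last_one = np.nonzero(X)[0].max()   # index of the last 1 seen in this window
print(p[-1], X.mean(), last_one)    # ones keep occurring: sum p_n diverges
```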
Exercise 42. Show that convergence in probability does not imply convergence in $L^k$.

Theorem 11.1 (Dominated convergence theorem (DCT)). Assume that a sequence of random variables $\{X_n\}$ converges to $X$ a.s. Assume further that there exists a random variable $Y$ such that
\[ P\{\omega : |X_n(\omega)| \le Y(\omega)\} = 1 \]
for each $n \ge 1$ and $E(Y) < \infty$. Then we have
\[ \lim_{n\to\infty} E(X_n) = E(X). \]

Exercise 43. Assume that $X_n$ converges to $X$ a.s. Assume further that $|X_n(\omega)| \le 100$ for all $n$ and all $\omega \in \Omega$. Prove that $X_n$ converges to $X$ in $L^k$ for any $k > 0$.

Exercise 44. In this exercise, you study an example where no dominating random variable $Y$ exists and the statement of the DCT does not hold.
Set $\Omega = (0, 1)$, and equip $\Omega$ with the trace of the Borel σ-field (see Exercise 28) on $\Omega$. We also let $P$ be Lebesgue measure on $\Omega$. For each $n$, define $X_n(\omega) = n 1_{(0,1/n]}(\omega)$. Find the almost sure limit $X$ of $X_n$, and show that $\lim_{n\to\infty} E(X_n) \neq E(X)$.

Proposition 11.1. If $X_n$ converges to $X$ in probability, then $X_n$ converges to $X$ in distribution.

Proof. Let $F_n$ be the cdf of $X_n$ and $F$ be the cdf of $X$. For any $x \in \mathbb{R}$ and any $\epsilon > 0$,
\begin{align*}
F_n(x) &= P\{X_n \le x, X \le x + \epsilon\} + P\{X_n \le x, X > x + \epsilon\} \\
&\le F(x + \epsilon) + P\{|X_n - X| > \epsilon\},
\end{align*}
where we have used the fact that $X_n \le x$ and $X > x + \epsilon$ imply that $|X_n - X| > \epsilon$. Letting $n \to \infty$, we have
\[ \limsup_{n\to\infty} F_n(x) \le F(x + \epsilon). \]
Similarly, we have
\begin{align*}
F(x - \epsilon) &= P\{X_n \le x, X \le x - \epsilon\} + P\{X_n > x, X \le x - \epsilon\} \\
&\le F_n(x) + P\{|X_n - X| > \epsilon\}.
\end{align*}
Letting $n \to \infty$, we have
\[ F(x - \epsilon) \le \liminf_{n\to\infty} F_n(x). \]
Thus, we have
\[ F(x - \epsilon) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le F(x + \epsilon). \]
Assume $F$ is continuous at $x$. By taking $\epsilon \downarrow 0$, we have
\[ F(x) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le F(x), \]
which implies that $\lim_{n\to\infty} F_n(x)$ exists and is equal to $F(x)$.

The remainder of this section is devoted to an alternative characterization of convergence in distribution, for which the next theorem is the key ingredient. The notation $X \stackrel{d}{=} Y$ means that $X$ and $Y$ are equal in distribution, meaning that $X$ and $Y$ have the same distribution function. In particular, the random variables need not be defined on the same sample space. Note that this is different from $X = Y$, which means that $X$ and $Y$ are defined on the same sample space and that they are equal as functions, i.e., $X(\omega) = Y(\omega)$ for every $\omega \in \Omega$.

Theorem 11.2 (Skorohod representation theorem). Assume that a sequence of random variables $\{X_n\}$ converges to $X$ in distribution. There exists a sequence of random variables $\{Y_n\}$ and $Y$, defined on some probability space $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})$, such that $X_n \stackrel{d}{=} Y_n$ and $X \stackrel{d}{=} Y$ and $Y_n$ converges to $Y$ almost surely.
Sketch of a proof. We give the argument for an example: suppose $X_n$ has exponential distribution with rate $\lambda_n$ and $X$ has exponential distribution with rate $\lambda$. To ensure that $X_n$ converges to $X$ in distribution, we suppose that $\lambda_n \to \lambda$ as $n \to \infty$. Then, we have $F_n(x) = F(x) = 0$ for $x \le 0$ and
\[ F_n(x) = 1 - e^{-\lambda_n x} \quad \text{and} \quad F(x) = 1 - e^{-\lambda x} \quad \text{for } x \ge 0. \]
Write $\tilde{\Omega} = [0, 1]$, $\tilde{\mathcal{F}} = \mathcal{B} \cap [0, 1]$, and let $\tilde{P}$ be Lebesgue measure restricted to $\tilde{\mathcal{F}}$. Let $Y(\tilde{\omega}) = F^{-1}(\tilde{\omega})$, i.e.,
\[ \tilde{\omega} = 1 - e^{-\lambda Y(\tilde{\omega})} \quad \text{or} \quad Y(\tilde{\omega}) = -\frac{1}{\lambda} \ln(1 - \tilde{\omega}). \]
It is readily checked that $Y$ is a random variable on $(\tilde{\Omega}, \tilde{\mathcal{F}})$. We claim that $Y$ is a random variable that has exponential distribution with rate $\lambda$ under $\tilde{P}$. To see this, for $x \le 0$, it is clear that $\tilde{P}\{Y \le x\} = 0$. Assume $x > 0$; then
\begin{align*}
\tilde{P}\{Y \le x\} &= \tilde{P}\{\tilde{\omega} : -\ln(1 - \tilde{\omega}) \le \lambda x\} = \tilde{P}\{\tilde{\omega} : 1 - \tilde{\omega} \ge e^{-\lambda x}\} \\
&= \tilde{P}\{\tilde{\omega} : \tilde{\omega} \le 1 - e^{-\lambda x}\} = 1 - e^{-\lambda x} = F(x).
\end{align*}
Thus $Y \stackrel{d}{=} X$. Similarly, let
\[ Y_n(\tilde{\omega}) = -\frac{1}{\lambda_n} \ln(1 - \tilde{\omega}). \]
Then $Y_n \stackrel{d}{=} X_n$ for each $n$. Clearly, when $\lambda_n \to \lambda$, we have that $Y_n(\tilde{\omega}) \to Y(\tilde{\omega})$ for each $\tilde{\omega} \in \tilde{\Omega}$. Thus, $Y_n$ converges to $Y$ almost surely under $\tilde{P}$.
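The coupling in this sketch of a proof can be visualized numerically. The code below (ours) fixes a few points $\tilde\omega \in (0,1)$ and shows $Y_n(\tilde\omega) = F_n^{-1}(\tilde\omega)$ approaching $Y(\tilde\omega) = F^{-1}(\tilde\omega)$ as $\lambda_n \to \lambda$.

```python
import numpy as np

rng = np.random.default_rng(6)

lam_n = [1.0 + 1.0 / k for k in range(1, 6)]   # rates converging to lam = 1
lam = 1.0

w = rng.random(5)                # a few fixed sample points in (0, 1)
Y = -np.log(1 - w) / lam         # Y(w) = F^{-1}(w)
for k, ln in enumerate(lam_n, 1):
    Yn = -np.log(1 - w) / ln     # Y_n(w) = F_n^{-1}(w)
    print(k, np.max(np.abs(Yn - Y)))   # shrinks as lam_n -> lam
```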
Proposition 11.2. Assume that a sequence of random variables X
n
converges to X in distribution.
Then for any bounded continuous function g : R R,
lim
n
E(g(X
n
)) = E(g(X)). (11.1)
Proof. Using Skorohod representation theorem, we have
Eg(X
n
) =

Eg(Y
n
)

Eg(Y ) = E(g(X)),
where

Eg(Y
n
)

Eg(Y ) follows from the dominated convergence theorem.
It can in fact be shown that weak convergence of X
n
to X is equivalent with (11.1) for any bounded
and continuous g.
12 Further convergence results
In addition to the two versions of the law of large numbers, there are several other key convergence results
for random variables. It is the aim of this section to discuss a few of these.
12.1 Central limit theorems
The law of large numbers says that the average of n centered i.i.d. random variables converges to zero, i.e., it becomes a degenerate random variable. Central limit theorems tell us by which number to multiply these random variables in order to get a non-degenerate limit. In particular, they imply that the rate of convergence in the law of large numbers is of order 1/√n. We do not give any proofs in this section.
The first theorem is called the Lindeberg-Lévy central limit theorem, usually just called the central limit theorem. You've probably already encountered this theorem in other classes. The remarkable aspect of the theorem is that it only involves the mean and variance.
Theorem 12.1. Let {ξ_i} be a sequence of i.i.d. random variables with E(ξ_i) = μ and Var(ξ_i) = σ² < ∞. Then the sequence of random variables given by

    (1/(σ√n)) Σ_{i=1}^n (ξ_i − μ)

converges in distribution to a standard normal random variable.
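As a sanity check, the normalized sums can be simulated and compared against the standard normal cdf. The sketch below is ours; the exponential summands (for which μ = σ = 1) are an arbitrary choice, and scipy is assumed to be available.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n, reps = 1000, 5000
    xi = rng.exponential(scale=1.0, size=(reps, n))      # mu = 1, sigma = 1
    z = (xi.sum(axis=1) - n * 1.0) / (1.0 * np.sqrt(n))  # sum of (xi_i - mu), over sigma*sqrt(n)

    # Empirical probabilities of the normalized sum versus the N(0,1) cdf.
    for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
        print(x, (z <= x).mean(), norm.cdf(x))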
The next result is known as the Lyapunov central limit theorem. We only assume the X_i are independent, not necessarily identically distributed.
Theorem 12.2. Write μ_i = E(X_i) and σ_i² = Var(X_i). Set s_n² = Σ_{i=1}^n σ_i² and suppose that for some δ > 0,

    lim_{n→∞} (1/s_n^{2+δ}) Σ_{i=1}^n E(|X_i − μ_i|^{2+δ}) = 0.

Then the sequence of random variables given by

    (1/s_n) Σ_{i=1}^n (X_i − μ_i)

converges in distribution to a standard normal random variable.
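For bounded independent summands, the Lyapunov condition can be checked by hand, and it is also easy to evaluate numerically. The sketch below is ours; the choice X_i ~ Bernoulli(p_i) with an oscillating sequence p_i is arbitrary, and the moments used are exact.

    import numpy as np

    # Check the Lyapunov ratio for delta = 1: (1/s_n^3) * sum_i E|X_i - mu_i|^3.
    i = np.arange(1, 100001)
    p = 0.5 + 0.3 * np.sin(i)                  # an arbitrary non-identical sequence
    var = p * (1 - p)                          # sigma_i^2 for Bernoulli(p_i)
    m3 = p * (1 - p) * (p**2 + (1 - p)**2)     # E|X_i - mu_i|^3 for Bernoulli(p_i)

    for n in [10, 100, 1000, 10000, 100000]:
        s_n = np.sqrt(var[:n].sum())
        print(n, m3[:n].sum() / s_n**3)        # tends to 0: the condition holds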
12.2 The Glivenko-Cantelli and Kolmogorov theorems
The next theorem is important for statistics and simulation. Given an i.i.d. sequence {ξ_i} with common distribution function F, we define the empirical distribution function for ξ_1, . . . , ξ_n by, for x ∈ ℝ,

    F_n(x) = (1/n) Σ_{i=1}^n 1_{ξ_i ≤ x}.

Note that the strong law of large numbers states that F_n(x) converges to F(x) almost surely, for each x. The following theorem, known as the Glivenko-Cantelli theorem, strengthens this pointwise convergence to uniform convergence. To this end, define

    D_n = sup_{x∈ℝ} |F_n(x) − F(x)|.
Theorem 12.3. D_n converges almost surely to 0 as n → ∞.
The next theorem is a refinement of this theorem, and it implies that the rate of convergence to zero in the Glivenko-Cantelli theorem is of order 1/√n. Note the analogies with the central limit theorem.

Theorem 12.4. √n·D_n converges in distribution to a random variable K with the so-called Kolmogorov distribution, as n → ∞.

The Kolmogorov distribution can be expressed using a so-called Brownian bridge, but this is outside the scope of these notes.
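The statistic D_n is exactly the one-sample Kolmogorov-Smirnov statistic, and for a continuous F it can be computed exactly from the sorted sample. The sketch below is ours; it takes F uniform on [0, 1] for concreteness and compares against scipy's kstwobign, an implementation of the Kolmogorov distribution.

    import numpy as np
    from scipy.stats import kstwobign

    rng = np.random.default_rng(2)

    def D_n(sample):
        # Exact sup_x |F_n(x) - x| for F uniform on [0, 1]; the sup is
        # attained at a jump of F_n, i.e., at an order statistic.
        x = np.sort(sample)
        n = len(x)
        i = np.arange(1, n + 1)
        return max(np.max(i / n - x), np.max(x - (i - 1) / n))

    n, reps = 500, 2000
    dn = np.array([D_n(rng.uniform(size=n)) for _ in range(reps)])
    print(dn.mean())                                            # small: Theorem 12.3
    print((np.sqrt(n) * dn <= 1.0).mean(), kstwobign.cdf(1.0))  # close: Theorem 12.4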
Exercise 45. Let ξ_1, ξ_2, . . . be i.i.d. standard exponentially distributed random variables. Determine a sequence {a_n} such that max(ξ_1, . . . , ξ_n) − a_n converges in distribution, and work out the distribution function of the limit.
13 Conditional expectation
13.1 Definition and examples
In your undergraduate probability class, you have seen the definition

    E[X|Y = y] = Σ_x x P(X = x, Y = y)/P(Y = y),    (13.1)

provided X and Y are discrete random variables and P(Y = y) > 0. For continuous random variables, the definition was similar but different.
Here we formulate a new general definition. A random vector is defined as a vector of random variables.
Definition 13.1. Let X be a random variable with E(|X|) < ∞ or X ≥ 0, and let Y = (Y_1, . . . , Y_m) be a random vector. The conditional expectation E[X|Y] is the unique real-valued function of Y such that for all bounded measurable functions g:

    E[Xg(Y)] = E[E[X|Y]g(Y)].
Let's now check that this is consistent with the undergraduate definition (13.1), so we assume both X and Y are discrete. Define the function h : ℝ → ℝ by setting

    h(y) = Σ_x x P(X = x|Y = y),

which is the right-hand side of (13.1). Then we have, for all bounded measurable g,

    E[Xg(Y)] = Σ_x Σ_y x g(y) P(X = x, Y = y)
             = Σ_y g(y) P(Y = y) Σ_x x P(X = x|Y = y)
             = Σ_y g(y) P(Y = y) h(y)
             = E[h(Y)g(Y)],

as stipulated in Definition 13.1.
We also define P(X ∈ A|Y) = E[1_{X∈A}|Y].
Perhaps the strangest consequence of the new definition is that a conditional expectation is a random variable. The following simple example illustrates this point.

Example 13.1. Suppose X and Y are random variables on (Ω, F) for which P(X = 0, Y = 0) = P(X = 1, Y = 1) = 3/8 and P(X = 0, Y = 1) = P(X = 1, Y = 0) = 1/8. Then E[X|Y = 0] = P(X = 1|Y = 0) = 1/4 and E[X|Y = 1] = P(X = 1|Y = 1) = 3/4, so the random variable E[X|Y] is defined as

    E[X|Y](ω) = 1/4 if Y(ω) = 0, and E[X|Y](ω) = 3/4 if Y(ω) = 1.
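It is instructive to verify the defining property E[Xg(Y)] = E[E[X|Y]g(Y)] directly on this four-point example. The computation below is our own check; the test function g(y) = y² is an arbitrary bounded choice.

    # Joint pmf of (X, Y) from Example 13.1.
    pmf = {(0, 0): 3/8, (1, 1): 3/8, (0, 1): 1/8, (1, 0): 1/8}
    condexp = {0: 1/4, 1: 3/4}           # E[X|Y = y] as computed above
    g = lambda y: y**2                   # any bounded measurable test function

    lhs = sum(p * x * g(y) for (x, y), p in pmf.items())           # E[X g(Y)]
    rhs = sum(p * condexp[y] * g(y) for (x, y), p in pmf.items())  # E[E[X|Y] g(Y)]
    print(lhs, rhs)                      # both equal 3/8

Replacing g by any other bounded function leaves the two sides equal, in line with Definition 13.1.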
The following list summarizes key properties of conditional expectation:

1. (Definition) E[Xg(Y)] = E[E[X|Y]g(Y)].
2. (Linearity) E[aU + bV|Y] = aE[U|Y] + bE[V|Y], for a, b ∈ ℝ.
3. (Positivity) If X ≥ 0 then E[X|Y] ≥ 0.
4. (Stability) For every measurable f, E[f(Y)Z|Y] = f(Y)E[Z|Y].
5. (Independence law) If X is independent of Y, then E[X|Y] = E[X].
6. (Tower property) If Z is a function of Y then E[E[X|Y]|Z] = E[X|Z].
7. (Expectation law) E[E[X|Y]] = E[X].
8. E[a|Y] = a, for a ∈ ℝ.
9. (Jensen's inequality) If φ : ℝ → ℝ is convex and E(|X|) < ∞ then E[φ(X)|Y] ≥ φ(E[X|Y]).
Conditional expectation can also be defined with respect to a σ-field. Suppose we are given (Ω, F_0, P) and a σ-field F ⊂ F_0.
Definition 13.2. Suppose the random variable X is F_0-measurable and that E(|X|) < ∞. Then E[X|F] is any random variable Z for which

1. Z is F-measurable, and
2. for all A ∈ F, we have ∫_A X dP = ∫_A Z dP.

Any Z satisfying these two conditions is called a version of E[X|F].
This definition should be compared to Definition 13.1; let g(Y) play the role of the indicator 1_A(ω). The proofs of existence and almost sure uniqueness of conditional expectation rely on the Radon-Nikodym theorem discussed at the end of Section 5. To see the connection, consider the measure ν on F given by ν(A) = ∫_A X dP; then Z = dν/dP satisfies the above two requirements.
13.2 Conditional expectation as projection
We now develop another way of thinking about conditional expectation, through projections. Given some probability space (Ω, F, P), consider the space H of square integrable random variables, usually denoted by L². Throughout this subsection, we interpret an element of H as a vector. To get a sense for how this is done, note that we can interpret a vector v ∈ ℝ^d as a function v : {1, . . . , d} → ℝ such that v(i) is the i-th element of the vector. For the random variable X : Ω → ℝ, we can similarly interpret X(ω) as a vector element, but the underlying space Ω need not be ordered.
On the space H, we can introduce the notion of vector addition: for X, Y ∈ H, we define X + Y through

    (X + Y)(ω) = X(ω) + Y(ω), ω ∈ Ω,

and it can be verified that the resulting variable X + Y lies in H. We can also introduce the notion of scalar multiplication: for X ∈ H and α ∈ ℝ, we define αX through

    (αX)(ω) = αX(ω), ω ∈ Ω,

and we have αX ∈ H. Note that the random variable that is identically equal to zero is the origin in this context.
In what follows, for simplicity, we only consider random variables with a finite number of possible values. Let X take values in {x_1, . . . , x_n}. For i = 1, . . . , n, define the vector δ_{x_i}(X) through

    δ_{x_i}(X)(ω) = 1 if X(ω) = x_i, and 0 otherwise.

We call this vector δ_{x_i}(X) instead of (say) Y_i because it can also be represented through an indicator function.
We next define

    L(X) = span{δ_{x_1}(X), . . . , δ_{x_n}(X)}.
One can verify that:

1. δ_{x_1}(X), . . . , δ_{x_n}(X) are linearly independent,
2. {δ_{x_1}(X), . . . , δ_{x_n}(X)} is a basis for L(X),
3. dim(L(X)) = n.
Lemma 13.1. Elements of L(X) are in bijection with functions h : {x_1, . . . , x_n} → ℝ, and thus also with {h(X) : h : {x_1, . . . , x_n} → ℝ}. The coordinate vector of h(X) with respect to the basis {δ_{x_1}(X), . . . , δ_{x_n}(X)} of L(X) is (h(x_1), . . . , h(x_n)) ∈ ℝⁿ.
Proof. Given a function h : {x_1, . . . , x_n} → ℝ, we have

    h(X) = Σ_{i=1}^n h(x_i) δ_{x_i}(X),

and the right-hand side expresses h(X) as a linear combination of basis vectors of L(X). Conversely, suppose that v ∈ L(X), i.e., for some c_1, . . . , c_n ∈ ℝ,

    v = Σ_{i=1}^n c_i δ_{x_i}(X).

Then we define a function h through h(x_i) = c_i for i = 1, . . . , n, so that v = h(X).
Exercise 46. Let X : Ω → {x_1, . . . , x_n} and L(X) be as above. Let σ(X) be the σ-field generated by X, which in this case consists of all unions of the sets X^{−1}(x_1), . . . , X^{−1}(x_n). Show that (σ(X), B)-measurability of h : Ω → ℝ is equivalent to h ∈ L(X).
Suppose that a random variable Y takes values in {y_1, . . . , y_m}, and write L(Y) = span{δ_{y_1}(Y), . . . , δ_{y_m}(Y)}. We let {δ_{y_1}(Y), . . . , δ_{y_m}(Y)} be a basis of L(Y). We next define, for X, Y ∈ H,

    ⟨X, Y⟩ = E(XY),    ‖X‖ = √⟨X, X⟩ = √E(X²).
As an aside, note that L² convergence can be reformulated as ‖X_n − X‖ → 0. This is not a proper norm; ‖X‖ = 0 does not imply that X = 0 (only X = 0 almost surely), the implication of which is that all results we derive have to be understood in an almost sure sense. We still refer to the above as inner product and norm, respectively. The significance of this definition is that we now have a notion of orthogonality. Two independent centered random variables are orthogonal with respect to this inner product.
Exercise 47. Check that ⟨X, Y⟩ = E(XY) satisfies the properties of an inner product on the vector space consisting of the square integrable random variables on some given probability space (Ω, F, P), with the aforementioned exception that ⟨X, X⟩ = 0 does not imply that X = 0.
Back to our X and Y with finitely many values.
Proposition 13.1. The projection of h(X) on L(Y) is E(h(X)|Y).

Proof. Recall that if v_1, . . . , v_k is an orthogonal basis of some space S, the projection of u on S is given by

    Σ_{i=1}^k (⟨u, v_i⟩ / ⟨v_i, v_i⟩) v_i.
Since δ_{y_1}(Y), . . . , δ_{y_m}(Y) is an orthogonal basis for L(Y), the projection of h(X) on L(Y) is

    Σ_{j=1}^m (⟨h(X), δ_{y_j}(Y)⟩ / ⟨δ_{y_j}(Y), δ_{y_j}(Y)⟩) δ_{y_j}(Y)
      = Σ_{j=1}^m (⟨Σ_{i=1}^n h(x_i) δ_{x_i}(X), δ_{y_j}(Y)⟩ / ⟨δ_{y_j}(Y), δ_{y_j}(Y)⟩) δ_{y_j}(Y)
      = Σ_{j=1}^m Σ_{i=1}^n h(x_i) (P(X = x_i, Y = y_j) / P(Y = y_j)) δ_{y_j}(Y)
      = Σ_{j=1}^m E(h(X)|Y = y_j) δ_{y_j}(Y)
      = E(h(X)|Y),

as desired.
We note that ⟨h(X) − E(h(X)|Y), Z⟩ = 0 for any Z ∈ L(Y) is the same as, for all g : {y_1, . . . , y_m} → ℝ,

    E(h(X)g(Y)) = E[E(h(X)|Y)g(Y)].

This is the definition of conditional expectation! To see the precise connection with Definition 13.1, consider the case h(x) = x. Thus, the geometric way of thinking of conditional expectation as a projection is the same as our previous definition. The construction is illustrated in Figure 4.
Figure 4: The conditional expectation E[h(X)|Y] as the orthogonal projection of h(X) onto L(Y). One should interpret this diagram as an abstraction, since the linear spaces L(X) and L(Y), which have dimension n and m, respectively, are depicted as one-dimensional spaces.
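The projection formula from the proof of Proposition 13.1 can be carried out with plain linear algebra on a finite sample space. The sketch below is ours; it reuses the joint pmf of Example 13.1, represents random variables as vectors indexed by the four outcomes, and projects h(X) = X onto L(Y).

    import numpy as np

    # Four outcomes (x, y) of Example 13.1 with their probabilities.
    outcomes = [(0, 0), (1, 1), (0, 1), (1, 0)]
    p = np.array([3/8, 3/8, 1/8, 1/8])
    X = np.array([x for x, y in outcomes], dtype=float)
    Y = np.array([y for x, y in outcomes], dtype=float)

    inner = lambda U, V: np.sum(p * U * V)        # <U, V> = E(UV)

    # Project X onto the orthogonal basis {delta_{y_j}(Y)} of L(Y).
    proj = np.zeros_like(X)
    for yj in np.unique(Y):
        d = (Y == yj).astype(float)               # delta_{y_j}(Y)
        proj += inner(X, d) / inner(d, d) * d

    print(proj)   # 1/4 on outcomes with y = 0 and 3/4 where y = 1, i.e., E[X|Y]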
Exercise 48. The least-squares property of vector projection states that the projection of some vector v onto the linear space V is the element of V that is closest to v in Euclidean distance. Formulate this property for conditional expectations.
Exercise 49. Show how the tower property for conditional expectation is related to taking successive
projections.
14 Stochastic integration
The theory of integration developed in previous sections is extremely powerful, but there are important
examples where it fails to apply. The primary example arises in the theory of mathematical finance,
where so-called stochastic integration has been developed in order to study portfolio gains processes.