
Stochastic Calculus Notes, Lecture 1

Last modified September 12, 2004

1 Overture

1.1. Introduction: The term "stochastic" means random. Because it usually
occurs together with "process" (stochastic process), it makes people think of
something random that changes in a random way over time. The term
calculus refers to ways to calculate things or find things that can be calculated
(e.g. derivatives in the differential calculus). Stochastic calculus is the study of
stochastic processes through a collection of powerful ways to calculate things.
Whenever we have a question about the behavior of a stochastic process, we will
try to find an expected value or probability that we can calculate that answers
our question.

1.2. Organization: We start in the discrete setting in which there is a
finite or countable (definitions below) set of possible outcomes. The tools are
summations and matrix multiplication. The main concepts can be displayed
clearly and concretely in this setting. We then move to continuous processes
in continuous time where things are calculated using integrals, either ordinary
integrals in R^n or abstract integrals in probability space. It is impossible (and
beside the point if it were possible) to treat these matters with full mathematical
rigor in these notes. The reader should get enough to distinguish mathematical
right from wrong in cases that occur in practical applications.

1.3. Backward and forward equations: Backward equations and forward
equations are perhaps the most useful tools for getting information about stochas-
tic processes. Roughly speaking, there is some number, f , that we want to know.
For example f could be the expected value of a portfolio after following a pro-
posed trading strategy. Rather than compute f directly, we define an array
of related expected values, f (x, t). The tower property implies relationships,
backward equations or forward equations, among these values that allow us to
compute some of them in terms of others. Proceeding from the few known val-
ues (initial conditions and boundary conditions), we eventually find the f we
first wanted. For discrete time and space, the equations are matrix equations or
recurrence relations. For continuous time and space, they are partial differential
equations of diffusion type.

1.4. Diffusions and Ito calculus: The Ito calculus is a tool for studying
continuous stochastic processes in continuous time. If X(t) is a differentiable
function of time, then ΔX = X(t + Δt) − X(t) is of the order of¹ Δt. Therefore
Δf (X(t)) = f (X(t + Δt)) − f (X(t)) ≈ f′ ΔX to this accuracy. For an Ito
process, ΔX is of the order of √Δt, so Δf ≈ f′ ΔX + (1/2) f′′ ΔX^2 has an error
smaller than Δt. In the special case where X(t) is Brownian motion, it is often
permissible (and the basis of the Ito calculus) to replace ΔX^2 by its mean value,
Δt.

¹ This means that there is a C so that |X(t + Δt) − X(t)| ≤ C |Δt| for small Δt.

2 Discrete probability
Here are some basic definitions and ideas of probability. These might seem dry
without examples. Be patient. Examples are coming in later sections. Although
the topic is elementary, the notation is taken from more advanced probability
so some of it might be unfamiliar. The terminology is not always helpful for
simple problems but it is just the thing for describing stochastic processes and
decision problems under incomplete information.

2.1. Probability space: Do an experiment or trial, get an outcome, ω.
The set of all possible outcomes is Ω, which is the probability space. The
space Ω is discrete if it is finite or countable (able to be listed in a single infinite
numbered list). The outcome ω is often called a random variable. I avoid that
term because I (and most other people) want to call functions X(ω) random
variables, see below.

2.2. Probability: The probability of a specific outcome ω is P (ω). We always
assume that P (ω) ≥ 0 for any ω ∈ Ω and that ∑_{ω∈Ω} P (ω) = 1. The interpreta-
tion of probability is a matter for philosophers, but we might say that P (ω) is
the probability of outcome ω happening, or the fraction of times event ω would
happen in a large number of independent trials. The philosophical problem is
that it may be impossible actually to perform a large number of independent
trials. People also sometimes say that probabilities represent our often subjec-
tive (lack of) knowledge of future events. Probability 1 means something that
is certain to happen while probability 0 is for something that cannot happen.
Probability zero = impossible is only strictly true for discrete probability.

2.3. Event: An event is a set of outcomes, a subset of Ω. The probability of
an event is the sum of the probabilities of the outcomes that make up the event

P (A) = ∑_{ω∈A} P (ω) .     (1)

Usually, we specify an event in some way other than listing all the outcomes in
it (see below). We do not distinguish between the outcome ω and the event that
that outcome occurred: A = {ω}. That is, we write P (ω) for P ({ω}) or vice
versa. This is called "abuse of notation": we use notation in a way that is not
absolutely correct but whose meaning is clear. It's the mathematical version of
saying "I could care less" to mean the opposite.

2.4. Countable and uncountable (technical detail): A probability space (or
any set) that is not countable is called "uncountable". This distinction was
formalized by the late nineteenth century mathematician Georg Cantor, who
showed that the set of (real) numbers in the interval [0, 1] is not countable.
Under the uniform probability density, P (ω) = 0 for any ω ∈ [0, 1]. It is hard to
imagine that the probability formula (1) is useful in this case, since every term
in the sum is zero. The difference between continuous and discrete probability
is the difference between integrals and sums.

2.5. Example: Toss a coin 4 times. Each toss yields either H (heads) or T
(tails). There are 16 possible outcomes, TTTT, TTTH, TTHT, TTHH, THTT,
. . ., HHHH. The number of outcomes is #(Ω) = |Ω| = 16. We suppose that
each outcome is equally likely, so P (ω) = 1/16 for each ω ∈ Ω. If A is the event
that the first two tosses are H, then

A = {HHHH, HHHT, HHTH, HHTT} .

There are 4 elements (outcomes) in A, each having probability 1/16. Therefore

P (first two H) = P (A) = ∑_{ω∈A} P (ω) = ∑_{ω∈A} 1/16 = 4/16 = 1/4 .
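This kind of computation is easy to check by brute-force enumeration. The following is a minimal Python sketch (the helper names are mine, not part of the notes) that lists all 16 outcomes and sums the probabilities of those starting with HH.

```python
from itertools import product

# All 16 outcomes of 4 coin tosses, each with probability 1/16.
outcomes = ["".join(t) for t in product("HT", repeat=4)]
p = {w: 1 / 16 for w in outcomes}

# Event A: the first two tosses are H.
A = [w for w in outcomes if w.startswith("HH")]
print(len(outcomes), len(A), sum(p[w] for w in A))  # 16 4 0.25
```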

2.6. Set operations: Events are sets, so set operations apply to events. If A
and B are events, the event "A and B" is the set of outcomes in both A and
B. This is the set intersection A ∩ B, because the outcomes that make both A
and B happen are those that are in both events. The union A ∪ B is the set
of outcomes in A or in B (or in both). The complement of A, A^c, is the event
"not A", the set of outcomes not in A. The "empty event" is the empty set, the
set with no elements, ∅. The probability of ∅ should be zero because the sum
that defines it has no terms: P (∅) = 0. The complement of ∅ is Ω. Events A
and B are disjoint if A ∩ B = ∅. Event A is contained in event B, A ⊂ B, if
every outcome in A is also in B. For example, if the event A is as above and B
is the event that the first toss is H, then A ⊂ B.

2.7. Basic facts: Each of these facts is a consequence of the representation
P (A) = ∑_{ω∈A} P (ω). First, P (A) ≤ P (B) if A ⊂ B. Also, P (A) + P (B) =
P (A ∪ B) if P (A ∩ B) = 0, but not otherwise. If P (ω) ≠ 0 for all ω ∈ Ω, then
P (A ∩ B) = 0 only when A and B are disjoint. Clearly, P (A) + P (A^c) = P (Ω) =
1.

2.8. Conditional probability: The probability of outcome A given that B has
occurred is the conditional probability (read "the probability of A given B"),

P (A | B) = P (A ∩ B) / P (B) .     (2)

This is the fraction of B outcomes that are also A outcomes. The formula is
called Bayes' rule. It is often used to calculate P (A ∩ B) once we know P (B)
and P (A | B). The formula for that is P (A ∩ B) = P (A | B)P (B).
2.9. Independence: Events A and B are independent if P (A | B) = P (A).
That is, knowing whether or not B occurred does not change the probability of
A. In view of Bayes' rule, this is expressed as

P (A ∩ B) = P (A) · P (B) .     (3)

For example, suppose A is the event that two of the four tosses are H, and B
is the event that the first toss is H. Then A has 6 elements (outcomes), B has
8, and, as you can check by listing them, A ∩ B has 3 elements. Since each
element has probability 1/16, this gives P (A ∩ B) = 3/16 while P (A) = 6/16 and
P (B) = 8/16 = 1/2. We might say "duh" for the last calculation since we started
the example with the hypothesis that H and T were equally likely. Anyway,
this shows that (3) is indeed satisfied in this case. This example is supposed to
show that while some pairs of events, such as the first and second tosses, are
obviously independent, others are independent as the result of a calculation.
Note that if C is the event that 3 of the 4 tosses are H (instead of 2 for A),
then P (C) = 4/16 = 1/4 and P (B ∩ C) = 3/16, because

B ∩ C = {HHHT, HHTH, HTHH}

has three elements. Bayes' rule (2) gives P (B | C) = (3/16) / (1/4) = 3/4. Knowing that
there are 3 heads in all raises the probability that the first toss is H from 1/2 to
3/4.

2.10. Working with conditional probability: Let us fix the event B, and
discuss the conditional probability P̃(ω) = P (ω | B), which also is a probability
(assuming P (B) > 0). There are two slightly different ways to discuss P̃. One
way is to take B to be the probability space and define

P̃(ω) = P (ω) / P (B)

for all ω ∈ B. Since B is the probability space for P̃, we do not have to define
P̃ for ω ∉ B. This P̃ is a probability because P̃(ω) ≥ 0 for all ω ∈ B and
∑_{ω∈B} P̃(ω) = 1. The other way is to keep Ω as the probability space and
set the conditional probabilities to zero for ω ∉ B. If we know the event B
happened, then the probability of an outcome not in B is zero.

P (ω | B) = { P (ω)/P (B)   for ω ∈ B,
            { 0             for ω ∉ B.     (4)

Either way, we restrict to outcomes in B and "renormalize" the probabilities
by dividing by P (B) so that they again sum to one. Note that (4) is just the
general conditional probability formula (2) applied to the event A = {ω}.
We can condition a second time by conditioning P̃ on another event, C. It
seems natural that P̃(ω | C), which is the conditional probability of ω given that
C occurred given that B occurred, should be the conditional probability
of ω given that both B and C occurred. Bayes' rule verifies this intuition:

P̃(ω | C) = P̃(ω) / P̃(C)
          = P (ω | B) / P (C | B)
          = ( P (ω)/P (B) ) / ( P (C ∩ B)/P (B) )
          = P (ω) / P (B ∩ C)
          = P (ω | B ∩ C) .

The conclusion is that conditioning on B and then on C is the same as condi-
tioning on B ∩ C (B and C) all at once. This "tower property" underlies the many
recurrence relations that allow us to get answers in practical situations.

2.11. Algebra of sets and incomplete information: A set of events, F, is an
algebra if
i: A ∈ F implies that A^c ∈ F.
ii: A ∈ F and B ∈ F implies that A ∩ B ∈ F and A ∪ B ∈ F.
iii: ∅ ∈ F and Ω ∈ F.
We interpret F as representing a state of partial information. We know whether
any of the events in F occurred but we do not have enough information to
determine whether an event not in F occurred. The above axioms are natural
in light of this interpretation. If we know whether A happened, we surely know
whether "not A" happened. If we know whether A happened and whether B
happened, then we can tell whether "A and B" happened. We definitely know
whether ∅ happened (it did not) and whether Ω happened (it did). Events in
F are called measurable or determined in F.

2.12. Example 1 of an F: Suppose we learn the outcomes of the first two
tosses. One event measurable in F is (with some abuse of notation)

{HH} = {HHHH, HHHT, HHTH, HHTT} .

An example of an event not determined by this F is the event of no more than
one H:
A = {TTTT, TTTH, TTHT, THTT, HTTT} .
Knowing just the first two tosses does not tell you with certainty whether the
total number of heads is less than two.
2.13. Example 2 of an F: Suppose we know only the results of the tosses
but not the order. This might happen if we toss 4 identical coins at the same
time. In this case, we know only the number of H coins. Some measurable sets
are (with an abuse of notation)

{4} = {HHHH}
{3} = {HHHT, HHTH, HTHH, THHH}
..
.
{0} = {TTTT}

The event {2} has 6 outcomes (list them), so its probability is 6/16 = 3/8. There
are other events measurable in this algebra, such as "less than 3 H", but, in
some sense, the events listed generate the algebra.

2.14. σ-algebra: An algebra of sets is a σ-algebra (pronounced "sigma
algebra") if it is closed under countable intersections, which means the following.
Suppose An ∈ F is a countable family of events measurable in F, and A = ∩n An
is the set of outcomes in all of the An, then A ∈ F, too. The reader can
check that an algebra closed under countable intersections is also closed under
countable unions, and conversely. An algebra is automatically a σ-algebra if
Ω is finite. If Ω is infinite, an algebra might or might not be a σ-algebra.² In
a σ-algebra, it is possible to take limits of infinite sequences of events, just as
it is possible to take limits of sequences of real numbers. We will never (again)
refer to an algebra of events that is not a σ-algebra.

2.15. Terminology: What we call "outcome" is usually called "random
variable". I did not use this terminology because it can be confusing, in that we
often think of variables as real (or complex) numbers. A real valued function
of the random variable is a real number X for each outcome ω, written X(ω). The
most common abuse of notation in probability is to write X instead of X(ω).
We will do this most of the time, but not just yet. We often think of X as a
random number whose value is determined by the outcome (random variable) ω.
A common convention is to use upper case letters for random numbers and lower
case letters for specific values of that variable. For example, the cumulative
distribution function (CDF), F (x), is the probability that X ≤ x, that is:

F (x) = ∑_{X(ω)≤x} P (ω) .

2.16. Informal event terminology: We often describe events in words. For
example, we might write P (X ≤ x) where, strictly, we might be supposed to
say Ax = {ω | X(ω) ≤ x} then P (X ≤ x) = P (Ax). For example, if there are
two functions, X1 and X2, we might try to calculate the probability that they
are equal, P (X1 = X2). Strictly speaking, this is the probability of the set of
ω so that X1(ω) = X2(ω).

² Let Ω be the set of integers and A ∈ F if A is finite or A^c is finite. This F is an algebra
(check), but not a σ-algebra. For example, if An leaves out only the first n odd integers,
then A = ∩n An is the set of even integers, and neither A nor A^c is finite.

2.17. Measurable: A function (of a random variable) X(ω) is measurable
with respect to the algebra F if the value of X is completely determined by
the information in F. To give a mathematical definition, for any number, x,
we can consider the event that X = x, which is Bx = {ω : X(ω) = x}.
In discrete probability, Bx will be the empty set for almost all x values and
will not be empty only for those values of x actually taken by X(ω) for one
of the outcomes ω ∈ Ω. The function X(ω) is measurable with respect to F
if the sets Bx are all measurable. People often write X ∈ F (an abuse of
notation) to indicate that X is measurable with respect to F. In Example 2
above, the function X = number of H minus number of T is measurable, while
the function X = number of T before the first H is not (find an x and Bx ∉ F
to show this).

2.18. Generating an algebra of sets: Suppose there are events A1, . . .,
Ak that you "know". The algebra, F, generated by these sets is the algebra
that expresses the information about the outcome you gain by knowing these
events. One definition of F is that an event A is in F if A can be expressed in
terms of the known events Aj using the set operations intersection, union, and
complement a number of times. For example, we could define an event A by
saying "ω is in A1 and (A2 or A3) but not in A4 or A5", which would be written
A = (A1 ∩ (A2 ∪ A3)) ∩ (A4 ∪ A5)^c. This is the same as saying that F is the
smallest algebra of sets that contains the known events Aj. Obviously (think
about this!) any algebra that contains the Aj contains any event described by
set operations on the Aj; that is the definition of algebra of sets. Also the sets
defined by set operations on the Aj form an algebra of sets. For example, if A1
is the event that the first toss is H and A2 is the event that both the first two
are H, then A1 and A2 generate the algebra of events determined by knowing
the results of the first two tosses. This is Example 1 above. To generate a
σ-algebra, we may have to allow infinitely many set operations, but a precise
discussion of this would be off message.

2.19. Generating by a function: A function X(ω) defines an algebra of
sets generated by the sets Bx. This is the smallest algebra, F, so that X is
measurable with respect to F. Example 2 above has this form. We can think of
F as being the algebra of sets defined by statements about the values of X(ω).
For example, one A ∈ F would be the set of ω with X either between 1 and 3
or greater than 4.
We write FX for the algebra of sets generated by X and ask what it means
that another function of ω, Y (ω), is measurable with respect to FX. The
information interpretation of FX says that Y ∈ FX if knowing the value of X(ω)
determines the value of Y (ω). This means that if ω1 and ω2 have the same X
value (X(ω1) = X(ω2)) then they also have the same Y value. Said another
way, if Bx is not empty, then there is some number, u(x), so that Y (ω) = u(x)
for every ω ∈ Bx. This means that Y (ω) = u(X(ω)) for all ω ∈ Ω. Altogether,
saying Y ∈ FX is a fancy way of saying that Y is a function of X. Of course,
u(x) only needs to be defined for those values of x actually taken by the random
variable X.
For example, if X is the number of H in 4 tosses, and Y is the number of
H minus the number of T, then, for any 4 tosses, ω, Y (ω) = 2X(ω) − 4. That
is, u(x) = 2x − 4.

2.20. Equivalence relation: A σ-algebra, F, determines an equivalence
relation. Outcomes ω1 and ω2 are equivalent, written ω1 ∼ ω2, if the information
in F does not distinguish ω1 from ω2. More formally, ω1 ∼ ω2 if ω1 ∈ A ⟺
ω2 ∈ A for every A ∈ F. For example, in Example 2 above, THTT ∼ TTTH.
Because F is an algebra, ω1 ∼ ω2 also implies that ω1 ∉ A ⟺ ω2 ∉ A (think this
through). Note that it is possible that Aω = Aω′ while ω ≠ ω′. This happens
when ω ∼ ω′.
The equivalence class of outcome ω is the set of outcomes equivalent to ω in
F, indistinguishable from ω using the information available in F. If Aω is the
equivalence class of ω, then Aω ∈ F. (Proof: for any ω′ not equivalent to ω in
F, there is at least one Bω′ ∈ F with ω ∈ Bω′ but ω′ ∉ Bω′. Since there are (at
most) countably many ω′, and F is a σ-algebra, Aω = ∩_{ω′} Bω′ ∈ F. This Aω
contains every ω1 that is equivalent to ω (why?) and only those.) In Example
2, the equivalence class of THTT is the event {HTTT, THTT, TTHT, TTTH}.

2.21. Partition: A partition of Ω is a collection of events, P = {B1, B2, . . .},
so that every outcome ω ∈ Ω is in exactly one of the events Bk. The algebra
generated by P, which we call FP, consists of events that are unions of events
in P (Why are complements and intersections not needed?). For any partition
P, the equivalence classes of FP are the events in P (think this through). Con-
versely, if P is the partition of Ω into equivalence classes for F, then P generates
F. In Example 2 above, the sets Bk = {k} form the partition corresponding to
F. More generally, the sets Bx = {ω | X(ω) = x} that are not empty are the
partition corresponding to FX. In discrete probability, partitions are a conve-
nient way to understand conditional expectation (below). The information in
FP is the knowledge of which of the Bj happened. The remaining uncertainty
is which ω ∈ Bj happened.

2.22. Expected value: A random variable (actually, a function of a random
variable) X(ω) has expected value

E[X] = ∑_{ω∈Ω} X(ω)P (ω) .

(Note that we do not write ω on the left. We think of X as simply a random
number and ω as a story telling how X was generated.) This is the average
value in the sense that if you could perform the "experiment" of sampling X
many times then average the resulting numbers, you would get roughly E[X].
This is because P (ω) is the fraction of the time you would get ω and X(ω) is
the number you get for ω. If X1(ω) and X2(ω) are two random variables, then
E[X1 + X2] = E[X1] + E[X2]. Also, E[cX] = cE[X] if c is a constant (not
random).

2.23. Best approximation property: If we wanted to approximate a random
variable, X, (function X(ω) with ω not written) by a single non random number,
x, what value would we pick? That would depend on the sense of "best". One
such sense is least squares, choosing x to minimize the expected value of (X − x)^2.
A calculation, which uses the above properties of expected value, gives

E[(X − x)^2] = E[X^2 − 2Xx + x^2]
             = E[X^2] − 2xE[X] + x^2 .

Minimizing this over x gives the optimal value

x_opt = E[X] .     (5)

2.24. Classical conditional expectation: There are two senses of the term
conditional expectation. We start with the original "classical" sense then turn
to the related but different "modern" sense often used in stochastic processes.
Conditional expectation is defined from conditional probability in the obvious
way

E[X | B] = ∑_{ω∈B} X(ω)P (ω | B) .     (6)

For example, we can calculate

E[# of H in 4 tosses | at least one H] .

Write B for the event {at least one H}. Since only ω = TTTT does not have
at least one H, |B| = 15 and P (ω | B) = 1/15 for any ω ∈ B. Let X(ω) be the
number of H in ω. Unconditionally, E[X] = 2, which means

(1/16) ∑_{ω∈Ω} X(ω) = 2 .

Note that X(ω) = 0 for all ω ∉ B (only TTTT), so

∑_{ω∈B} X(ω)P (ω) = ∑_{ω∈Ω} X(ω)P (ω) ,

and therefore

(1/16) ∑_{ω∈B} X(ω) = 2

(15/16) · (1/15) ∑_{ω∈B} X(ω) = 2

(1/15) ∑_{ω∈B} X(ω) = 2 · (16/15)

E[X | B] = 32/15 = 2 + .133 . . . .

Knowing that there was at least one H increases the expected number of H by
.133 . . ..
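As a check, this conditional expectation is simple to reproduce by enumerating the 16 outcomes; the Python sketch below (with names chosen here for illustration) recovers 32/15 ≈ 2.133.

```python
from itertools import product

outcomes = ["".join(t) for t in product("HT", repeat=4)]
X = {w: w.count("H") for w in outcomes}          # number of H in each outcome
B = [w for w in outcomes if "H" in w]            # event: at least one H

E_X = sum(X[w] for w in outcomes) / 16           # unconditional expectation = 2
E_X_given_B = sum(X[w] for w in B) / len(B)      # conditional expectation = 32/15
print(E_X, E_X_given_B)                          # 2.0 2.1333...
```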

2.25. Law of total probability: Suppose P = {B1, B2, . . .} is a partition of
Ω. The law of total probability is the formula

E[X] = ∑_k E[X | Bk] P (Bk) .     (7)

This is easy to understand: exactly one of the events Bk happens. The expected
value of X is the sum over each of the events Bk of the expected value of X
given that Bk happened, multiplied by the probability that Bk did happen. The
derivation is a simple combination of the definitions of conditional expectation
(6) and conditional probability (4):

E[X] = ∑_{ω∈Ω} X(ω)P (ω)

     = ∑_k ( ∑_{ω∈Bk} X(ω)P (ω) )

     = ∑_k ( ∑_{ω∈Bk} X(ω) P (ω)/P (Bk) ) P (Bk)

     = ∑_k E[X | Bk] P (Bk) .

This fact underlies the recurrence relations that are among the primary tools of
stochastic calculus. It will be reformulated below as the tower property when
we discuss the modern view of conditional probability.

2.26. Modern conditional expectation: The modern conditional expectation
starts with an algebra, F, rather than just the set B. It defines a (function of
a) random variable, Y (ω) = E[X | F], that is measurable with respect to F
even though X is not. This function represents the best prediction (in the least
squares sense) of X given the information in F. If X ∈ F, then the value of
X(ω) is determined by the information in F, so Y = X.
In the classical case, the information is the occurrence or non occurrence of
a single event, B. That is, the algebra, FB, consists only of the sets B, B^c, ∅,
and Ω. For this FB, the modern definition gives a function Y (ω) so that

Y (ω) = { E[X | B]    if ω ∈ B,
        { E[X | B^c]  if ω ∉ B.

Make sure you understand the fact that this two valued function Y is measurable
with respect to FB.
Only slightly more complicated is the case where F is generated by a parti-
tion, P = {B1, B2, . . .}, of Ω. The conditional expectation Y (ω) = E[X | F] is
defined to be

Y (ω) = E[X | Bj] if ω ∈ Bj ,     (8)

where E[X | Bj] is the classical conditional expectation (6). A single set B defines
a partition: B1 = B, B2 = B^c, so this agrees with the earlier definition in that
case. The information in F is only which of the Bj occurred. The modern
conditional expectation replaces X with its expected value over the set that
occurred. This is the expected value of X given the information in F.

2.27. Example of modern conditional expectation: Take Ω to be sequences of
4 coin tosses. Take F to be the algebra of Example 2 determined by the number
of H tosses. Take X(ω) to be the number of H tosses before the first T (e.g.
X(HHTH) = 2, X(TTTT) = 0, X(HHHH) = 4, etc.). With the usual abuse
of notation, we calculate (below): Y ({0}) = 0, Y ({1}) = 1/4, Y ({2}) = 2/3,
Y ({3}) = 3/2, Y ({4}) = 4. Note, for example, that because HHTT and HTHT
are equivalent in F (in the equivalence class {2}), Y (HHTT) = Y (HTHT) = 2/3
even though X(HHTT) ≠ X(HTHT). The common value of Y is the average
value of X over the outcomes in the equivalence class.

{0}: TTTT
     X values: 0
     expected value = 0

{1}: HTTT THTT TTHT TTTH
     X values: 1 0 0 0
     expected value = (1 + 0 + 0 + 0)/4 = 1/4

{2}: HHTT HTHT HTTH THHT THTH TTHH
     X values: 2 1 1 0 0 0
     expected value = (2 + 1 + 1 + 0 + 0 + 0)/6 = 2/3

{3}: HHHT HHTH HTHH THHH
     X values: 3 2 1 0
     expected value = (3 + 2 + 1 + 0)/4 = 3/2

{4}: HHHH
     X values: 4
     expected value = 4
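The table is mechanical enough that a few lines of Python (a sketch with my own variable names) reproduce it by grouping outcomes into the equivalence classes {k} and averaging X over each class.

```python
from itertools import product

outcomes = ["".join(t) for t in product("HT", repeat=4)]

def X(w):                      # number of H before the first T
    return len(w) if "T" not in w else w.index("T")

def num_heads(w):              # the information in F: how many H in all
    return w.count("H")

# Y = E[X | F]: average X over each equivalence class {k}
for k in range(5):
    cls = [w for w in outcomes if num_heads(w) == k]
    Y = sum(X(w) for w in cls) / len(cls)
    print(k, Y)                # 0 0.0, 1 0.25, 2 0.666..., 3 1.5, 4 4.0
```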

2.28. Best approximation property: Suppose we have a random variable,
X(ω), that is not measurable with respect to the algebra F. That is, the
information in F does not completely determine the value of X. The conditional
expectation, Y (ω) = E[X | F], among all functions measurable with respect to
F, is the closest to X in the least squares sense. That is, if Z ∈ F, then

E[(Z − X)^2] ≥ E[(Y − X)^2] .

In fact, this best approximation property will be the definition of conditional
expectation in situations where the partition definition is not directly applica-
ble. The best approximation property for modern conditional expectation is
a consequence of the best approximation for classical conditional expectation.
The least squares error is the sum of the least squares errors over each Bk in the
partition defined by F. We minimize the least squares error in Bk by choosing
Y (Bk) to be the average of X over Bk (weighted by the probabilities P (ω) for
ω ∈ Bk). By choosing the best approximation in each Bk, we get the best
approximation overall.
This can be expressed in the terminology of linear algebra. The set of func-
tions (random variables) X is a vector space (Hilbert space) with inner product

⟨X, Y⟩ = ∑_{ω∈Ω} X(ω)Y (ω)P (ω) = E[XY] ,

so ‖X − Y‖^2 = E[(X − Y)^2]. The set of functions measurable with respect
to F is a subspace, which we call SF. The conditional expectation, Y, is the
orthogonal projection of X onto SF, which is the element of SF that is closest to
X in the norm just given.

2.29. Tower property: Suppose G is a σ-algebra that has less information
than F. That is, every event in G is also in F, but events in F need not be in
G. This is expressed simply (without abuse of notation) as G ⊂ F. Consider
the (modern) conditional expectations Y = E[X | F] and Z = E[X | G]. The
tower property is the fact that Z = E[Y | G]. That is, conditioning in one step
gives the same result as conditioning in two steps. As we said before, the tower
property underlies the backward equations that are among the most useful tools
of stochastic calculus.
The tower property is an application of the law of total probability to condi-
tional expectation. Suppose P and Q are the partitions of Ω corresponding to
F and G respectively. The partition P is a refinement of Q, which means that
each Ck ∈ Q itself is partitioned into events {Bk,1, Bk,2, . . .}, where the Bk,j are
elements of P. Then (see "Working with conditional probability") for ω ∈ Ck,
we want to show that Z(ω) = E[Y | Ck]:

Z(ω) = E[X | Ck]
     = ∑_j E[X | Bk,j] P (Bk,j | Ck)
     = ∑_j Y (Bk,j) P (Bk,j | Ck)
     = E[Y | Ck] .

The linear algebra projection interpretation makes the tower property seem
obvious. Any function measurable with respect to G is also measurable with
respect to F, which means that the subspace SG is contained in SF. If you
project X onto SF then project the projection onto SG, you get the same thing
as projecting X directly onto SG (always orthogonal projections).

2.30. Modern conditional probability: Probabilities can be defined as ex-
pected values of characteristic functions (see below). Therefore, the modern def-
inition of conditional expectation gives a modern definition of conditional prob-
ability. For any event, A, the indicator function, 1_A(ω), (also written χ_A(ω),
for "characteristic function", terminology less used by probabilists because char-
acteristic function means something else to them) is defined by 1_A(ω) = 1 if
ω ∈ A, and 1_A(ω) = 0 if ω ∉ A. The obvious formula P (A) = E[1_A] is the
representation of the probability as an expected value. The modern conditional
probability then is P (A | F) = E[1_A | F]. Unraveling the definitions, this is a
function, Y_A(ω), that takes the value P (A | Bk) whenever ω ∈ Bk. A related
statement, given for practice with notation, is

P (A | F)(ω) = ∑_{Bk∈PF} P (A | Bk) 1_{Bk}(ω) .

3 Markov Chains, I

3.1. Introduction: Discrete time Markov³ chains are a simple abstract class
of discrete random processes. Many practical models are Markov chains. Here
we discuss Markov chains having a finite state space (see below).
Many of the general concepts above come into play here. The probability
space is the space of paths. The natural states of partial information are
described by the algebras Ft , which represent the information obtained by ob-
serving the chain up to time t. The tower property applied to the Ft leads to
backward and forward equations. This section is mostly definitions. The good
stuff is in the next section.

3.2. Time: The time variable, t, will be an integer representing the number
of time units from a starting time. The actual time to go from t to t + 1 could
be a nanosecond (for modeling computer communication networks) or a month
(for modeling bond rating changes), or whatever. To be specific, we usually
start with t = 0 and consider only non negative times.

3.3. State space: At time t the system will be in one of a finite list of states.
This set of states is the state space, S. To be a Markov chain, the state should
be a complete description of the actual state of the system at time t. This
means that it should contain any information about the system at time t that
helps predict the state at future times t + 1, t + 2, ... . This is illustrated with
the hidden Markov model below. The state at time t will be called X(t) or Xt .
Eventually, there may be an ω also, so that the state is a function of t and ω:
X(t, ω) or Xt(ω). The states may be called s1, . . ., sm, or simply 1, 2, . . . , m,
depending on the context.

3.4. Path space: The sequence of states X0, X1, . . ., XT, is a path. The set of
paths is path space. It is possible and often convenient to use the set of paths as
the probability space, Ω. When we do this, the path X = (X0, X1, . . . , XT) =
(X(0), X(1), . . . , X(T)) plays the role that was played by the outcome ω in the
general theory above. We will soon have a formula for P (X), the probability of
path X, in terms of transition probabilities.
In principle, it should be possible to calculate the probability of any event
(such as {X(2) ≠ s}, or {X(t) = s1 for some t ≤ T}) by listing all the paths
(outcomes) in that event and summing their probabilities. This is rarely the
easiest way. For one thing, the path space, while finite, tends to be enormous.
For example, if there are m = |S| = 7 states and T = 50 times, then the number
of paths is |Ω| = m^T = 7^50, which is about 1.8 × 10^42. This number is beyond
computers.

³ The Russian mathematician A. A. Markov was active in the last decades of the 19th
century. He is known for his path breaking work on the distribution of prime numbers as well
as on probability.

3.5. Algebras Ft and Gt: The information learned by observing a Markov
chain up to and including time t is Ft. Paths X1 and X2 are equivalent in Ft
if X1(s) = X2(s) for 0 ≤ s ≤ t. Said only slightly differently, the equivalence
class of path X is the set of paths X′ with X′(s) = X(s) for 0 ≤ s ≤ t. The Ft
form an increasing family of algebras: Ft ⊂ Ft+1. (Event A is in Ft if we can
tell whether A occurred by knowing X(s) for 0 ≤ s ≤ t. In this case, we also
can tell whether A occurred by knowing X(s) for 0 ≤ s ≤ t + 1, which is what
it means for A to be in Ft+1.)
The algebra Gt is generated by X(t) only. It encodes the information learned
by observing X at time t only, not at earlier times. Clearly Gt ⊂ Ft, but Gt is
not contained in Gt+1, because X(t + 1) does not determine X(t).

3.6. Nonanticipating (adapted) functions: The underlying outcome, which
was called ω, is now called X. A function of the outcome, or function of
a random variable, will now be called F (X) instead of X(ω). Over and over
in stochastic processes, we deal with functions that depend on both X and t.
Such a function will be called F (X, t). The simplest such function is F (X, t) =
X(t). More complicated functions are: (i) F (X, t) = 1 if X(s) = 1 for some
s ≤ t, F (X, t) = 0 otherwise, and (ii) F (X, t) = min{s > t with X(s) = 1}, or
F (X, t) = T if X(s) ≠ 1 for t < s ≤ T.
A function F (X, t) is nonanticipating (also called adapted, though the notions
are slightly different in more sophisticated situations) if, for each t, the function
of X given by F (X, t) is measurable with respect to Ft. This is the same as
saying that F (X, t) is determined by the values X(s) for s ≤ t. The function
(i) above has this property but (ii) does not.
Nonanticipating functions are important for several reasons. In time, we
will see that the Ito integral makes sense only for nonanticipating functions.
Moreover, functions F (X, t) are a model of decision making under uncertainty.
That F is nonanticipating means that the decision at time t is made based on
information available at time t and does not depend on future information.

3.7. Markov property: Informally, the Markov property is that X(t) is all the
information about the past that is helpful in predicting the future. In classical
terms, for example,

P (X(t + 1) = k | X(t) = j) = P (X(t + 1) = k | X(t) = j, X(t − 1) = l, etc.) .

In modern notation, this may be stated

P (X(t + 1) = k | Ft) = P (X(t + 1) = k | Gt) .     (9)

Recall that both sides are functions of the outcome, X. The function on the
right side, to be measurable with respect to Gt, must be a function of X(t) only
(see "Generating by a function" in the previous section). The left side also is
a function, but in general could depend on all the values X(s) for s ≤ t. The
equality (9) states that this function depends on X(t) only.
This may be interpreted as the absence of hidden variables, variables that
influence the evolution of the Markov chain but are not observable or included
in the state description. If there were hidden variables, observing the chain for a
long period might help identify them and therefore change our prediction of the
future state. The Markov property (9) states, on the contrary, that observing
X(s) for s < t does not change our predictions.

3.8. Transition probabilities: The conditional probabilities (9) are transition
probabilities:

Pjk = P (X(t + 1) = k | X(t) = j) = P (j → k in one step) .

The Markov chain is stationary if the transition probabilities Pjk are indepen-
dent of t. Each transition probability Pjk is between 0 and 1, with values 0 and
1 allowed, though 0 is more common than 1. Also, with j fixed, the Pjk must
sum to 1 (summing over k) because k = 1, 2, . . ., m is a complete list of the
possible states at time t + 1.

3.9. Path probabilities: The Markov property leads to a formula for the
probabilities of individual path outcomes P (X) as products of transition prob-
abilities. We do this here for a stationary Markov chain to keep the notation
simple. First, suppose that the probabilities of the initial states are known, and
call them
f0(j) = P (X(0) = j) .
Bayes' rule (2) implies that

P (X(1) = k and X(0) = j)
   = P (X(1) = k | X(0) = j) P (X(0) = j) = f0(j)Pjk .

Using this argument again, and using (9), we find (changing the order of the
factors on the last line)

P (X(2) = l and X(1) = k and X(0) = j)
   = P (X(2) = l | X(1) = k and X(0) = j) P (X(1) = k and X(0) = j)
   = P (X(2) = l | X(1) = k) P (X(1) = k and X(0) = j)
   = f0(j)Pjk Pkl .

This can be extended to paths of any length.
One way to express the general formula uses a notational habit common
in probability, using upper case letters to represent a random value of a vari-
able and lower case for generic values of the same quantity (see "Terminol-
ogy", Section 2, but note that the meaning of X has changed). We write
x = (x(0), x(1), . . . , x(T)) for a generic path, and seek P (x) = P (X = x) =
P (X(0) = x(0), X(1) = x(1), . . .). The argument above shows that this is given
by

P (x) = f0(x(0)) P_{x(0),x(1)} · · · P_{x(T−1),x(T)} = f0(x(0)) ∏_{t=0}^{T−1} P_{x(t),x(t+1)} .     (10)

3.10. Transition matrix: The transition probabilities form an m × m matrix,
P (an unfortunate conflict of notation), called the transition matrix. The (j, k)
entry of P is the transition probability Pjk = P (j → k). The sum of the
entries of the transition matrix P in row j is ∑_k Pjk = 1. A matrix with these
properties: no negative entries, all row sums equal to 1, is a stochastic matrix.
Any stochastic matrix can be the transition matrix for a Markov chain.
Methods from linear algebra often help in the analysis of Markov chains. As
we will see in the next lecture, the time s transition probability

P^s_{jk} = P (X_{t+s} = k | X_t = j)

is the (j, k) entry of P^s, the s-th power of the transition matrix (explanation
below). Also, as discussed later, steady state probabilities form an eigenvector
of P corresponding to eigenvalue λ = 1.

3.11. Example 3, coin flips: The state space has m = 2 states, called U
(up) and D (down). Writing H and T would conflict with T being the length
of the chain. The coin starts in the U position, which means that f0(U) = 1
and f0(D) = 0. At every time step, the coin turns over with 20% probability,
so the transition probabilities are P_UU = .8, P_UD = .2, P_DU = .2, P_DD = .8.
The transition matrix is (taking U for 1 and D for 2):

P = ( .8  .2 )
    ( .2  .8 )

For example, we can calculate

P^2 = P · P = ( .68  .32 )    and    P^4 = P^2 · P^2 = ( .5648  .4352 )
              ( .32  .68 )                             ( .4352  .5648 ) .

This implies that P (X(4) = D) = P (X(0) = U → X(4) = D) = P^4_{UD} = .4352.
The eigenvalues of P are λ1 = 1 and λ2 = .6, the former required by theory.
Numerical experimentation should convince the reader that

P^s − ( .5  .5 ) = const · λ2^s .
      ( .5  .5 )

Take T = 3 and let A be the event UUzU, where the state X(2) = z is
unknown. There are two outcomes (paths) in A:

A = {UUUU, UUDU} ,

so P (A) = P (UUUU) + P (UUDU). The individual path probabilities are cal-
culated using (10):

U → U → U → U (probabilities .8, .8, .8), so P (UUUU) = 1 × .8 × .8 × .8 = .512 .
U → U → D → U (probabilities .8, .2, .2), so P (UUDU) = 1 × .8 × .2 × .2 = .032 .

Thus, P (A) = .512 + .032 = .544.
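These matrix powers and path probabilities are easy to reproduce with numpy; the following sketch (variable names are mine) recomputes P^2, P^4, the eigenvalues, and P (A) for the event UUzU.

```python
import numpy as np

P = np.array([[.8, .2],
              [.2, .8]])        # rows/cols ordered U, D

P2 = P @ P                      # [[.68, .32], [.32, .68]]
P4 = P2 @ P2                    # [[.5648, .4352], [.4352, .5648]]
print(P4[0, 1])                 # P(X(4) = D | X(0) = U) = 0.4352
print(np.linalg.eigvals(P))     # [1.0, 0.6]

# Path probabilities for the event A = {UUUU, UUDU}, using (10) with f0 = (1, 0).
U, D = 0, 1
p_UUUU = 1 * P[U, U] * P[U, U] * P[U, U]   # 0.512
p_UUDU = 1 * P[U, U] * P[U, D] * P[D, U]   # 0.032
print(p_UUUU + p_UUDU)                      # 0.544
```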

3.12. Example 4: There are two coins, F (fast) and S (slow). Either coin will
be either U or D at any given time. Only one coin is present at any given time
but sometimes the coin is replaced (F for S or vice versa) without changing its
UD status. The F coin has the same UD transition probabilities as Example
3. The S coin has UD transition probabilities:

( .9   .1  )
( .05  .95 )

The probability of coin replacement at any given time is 30%. The replacement
(if it happens) is done after the (possible) coin flip without changing the UD
status of the coin after that flip. The Markov chain has 4 states, which we
arbitrarily number 1: UF, 2: DF, 3: US, 4: DS. States 1 and 3 are U states
while states 1 and 2 are F states, etc. The transition matrix is 4 × 4. We can
calculate, for example, the (non) transition probability for UF → UF. We first
have a U → U (non) transition then an F → F (non) transition. The probability
is then P (U → U | F) · P (F → F) = .8 × .7 = .56. The other entries can be
found in a similar way. The transitions are:

UF → UF   UF → DF   UF → US   UF → DS
DF → UF   DF → DF   DF → US   DF → DS
US → UF   US → DF   US → US   US → DS
DS → UF   DS → DF   DS → US   DS → DS .

The resulting transition matrix is

    ( .8 × .7    .2 × .7    .8 × .3    .2 × .3  )
P = ( .2 × .7    .8 × .7    .2 × .3    .8 × .3  )
    ( .9 × .3    .1 × .3    .9 × .7    .1 × .7  )
    ( .05 × .3   .95 × .3   .05 × .7   .95 × .7 ) .

If we start with U but equally likely F or S, and want to know the probability
of being D after 4 time periods, the answer is

.5 ( P^4_{12} + P^4_{14} + P^4_{32} + P^4_{34} )

because states 1 = UF and 3 = US are the (equally likely) possible initial U
states, and 2 = DF and 4 = DS are the two D states. We also could calculate
P (UUzU) by adding up the probabilities of the 32 (list them) paths that make
up this event.
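A small numpy sketch (again with my own names) builds this 4 × 4 matrix from the two coin matrices and the 30% replacement probability, and evaluates the expression above.

```python
import numpy as np

F = np.array([[.8, .2], [.2, .8]])     # fast coin U-D transitions
S = np.array([[.9, .1], [.05, .95]])   # slow coin U-D transitions
stay, swap = .7, .3                     # coin kept / replaced with probability .7 / .3

# States ordered 1: UF, 2: DF, 3: US, 4: DS.
P = np.block([[F * stay, F * swap],
              [S * swap, S * stay]])
assert np.allclose(P.sum(axis=1), 1)    # rows of a stochastic matrix sum to 1

P4 = np.linalg.matrix_power(P, 4)
# Start in U with F or S equally likely; probability of D (states 2 or 4) at time 4.
prob_D = .5 * (P4[0, 1] + P4[0, 3] + P4[2, 1] + P4[2, 3])
print(prob_D)
```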

3.13. Example 5, incomplete state information: In the model of Example 4
we might be able to observe the UD status but not the FS status. Let X(t)
be the state of the Example 4 model above at time t. Suppose Y (t) = U if
X(t) = UF or X(t) = US, and Y (t) = D if X(t) = DF or X(t) = DS. Then
the sequence Y (t) is a stochastic process but it is not a Markov chain. We can
better predict U → D transitions if we know whether the coin is F or S, or
even if we have a basis for guessing its FS status.
For example, suppose that the four states (UF, DF, US, DS) at time t = 0
are equally likely, that we know Y (1) = U and we want to guess whether Y (2)
will again be U. If Y (0) is D then we are more likely to have the F coin, so
a Y (1) = U → Y (2) = D transition is more likely. That is, with Y (1) fixed,
Y (0) = D makes it less likely to have Y (2) = U. This is a violation of the
Markov property brought about by incomplete state information. Models of this
kind are called hidden Markov models. Statistical estimation of the unobserved
variable is a topic for another day.
Thanks to Laura K and Craig for pointing out mistakes and confusions in
earlier drafts.

Stochastic Calculus Notes, Lecture 2
Last modified September 16, 2004

1 Forward and Backward Equations for Markov chains

1.1. Introduction: Forward and backward equations are useful ways to
get answers to quantitative questions about Markov chains. The probabilities
u(k, t) = P (X(t) = k) satisfy a forward equation that allows us to compute all
the numbers u(k, t + 1) once all the numbers u(j, t) are known. This moves
us forward from time t to time t + 1. The expected values f (k, t) = E[V (X(T)) |
X(t) = k] (for t < T) satisfy a backward equation that allows us to calculate
the numbers f (k, t) once all the f (j, t + 1) are known. A duality relation allows
us to infer the forward equation from the backward equation, or conversely.
The transition matrix is the generator of both equations, though in different
ways. There are many related problems that have solutions involving forward
and backward equations. Two treated here are hitting probabilities and random
compound interest.

1.2. Forward equation, functional version: Let u(k, t) = P (X(t) = k). The
law of total probability gives

u(k, t + 1) = P (X(t + 1) = k)
            = ∑_j P (X(t + 1) = k | X(t) = j) P (X(t) = j) .

Therefore
u(k, t + 1) = ∑_j Pjk u(j, t) .     (1)

This is the forward equation for probabilities. It is also called the Kolmogorov
forward equation or the Chapman-Kolmogorov equation. Once u(j, t) is known
for all j ∈ S, (1) gives u(k, t + 1) for any k. Thus, we can go forward in time
from t = 0 to t = 1, etc. and calculate all the numbers u(k, t).
Note that if we just wanted one number, say u(17, 49), still we would have
to calculate many related quantities, all the u(j, t) for t < 49. If the state space
is too large, this direct forward equation approach may be impractical.

1.3. Row and column vectors: If A is an n × m matrix, and B is an m × p
matrix, then AB is n × p. The matrices are compatible for multiplication because
the second dimension of A, the number of columns, matches the first dimension
of B, the number of rows. A matrix with just one column is a column vector.¹
Just one row makes it a row vector. Matrix-vector multiplication is a special
case of matrix-matrix multiplication. We often denote genuine matrices (more
than one row and column) with capital letters and vectors, row or column, with
lower case. In particular, if u is an n dimensional row vector, a 1 × n matrix, and
A is an n × n matrix, then uA is another n dimensional row vector. We do not
write Au for this because that would be incompatible. Matrix multiplication is
always associative. For example, if u is a row vector and A and B are square
matrices, then (uA)B = u(AB). We can compute the row vector uA then
multiply by B, or we can compute the n × n matrix AB then multiply by u.
If u is a row vector, we usually denote the k-th entry by u_k instead of u_{1k}.
Similarly, the k-th entry of column vector f is f_k instead of f_{k1}. If both u and f
have n components, then uf = ∑_{k=1}^n u_k f_k is a 1 × 1 matrix, i.e. a number. Thus,
treating row and column vectors as special kinds of matrices makes the product
of a row with a column vector natural, but not, for example, the product of two
column vectors.

¹ The physicist's more sophisticated idea that a vector is a physical quantity with certain
transformation properties is inoperative here.

1.4. Forward equation, matrix version: The probabilities u(k, t) form the
components of a row vector, u(t), with components u_k(t) = u(k, t) (an abuse of
notation). The forward equation (1) may be expressed (check this)

u(t + 1) = u(t)P .     (2)

Because matrix multiplication is associative, we have

u(t) = u(t − 1)P = u(t − 2)P^2 = · · · = u(0)P^t .     (3)

Tricks of matrix multiplication give information about the evolution of probabil-
ities. For example, we can write a formula for u(t) in terms of the eigenvectors
and eigenvalues of P. Also, we can save effort in computing u(t) for large t by
repeated squaring:

P → P^2 → (P^2)^2 = P^4 → · · · → P^(2^k)

using just k matrix multiplications. For example, this computes P^1024 using
just ten matrix multiplies, instead of a thousand.
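As an illustration, here is a short numpy sketch of the forward evolution (2)-(3) and of repeated squaring, using the two state coin flip chain of Lecture 1 as the example (the variable names are mine).

```python
import numpy as np

P = np.array([[.8, .2],
              [.2, .8]])
u = np.array([1.0, 0.0])        # u(0): start in state U with probability 1

# Forward equation u(t+1) = u(t) P, stepping from t = 0 to t = 4.
for t in range(4):
    u = u @ P
print(u)                         # [0.5648, 0.4352]

# Repeated squaring: P -> P^2 -> P^4 -> ... -> P^1024 in ten multiplications.
Q = P.copy()
for _ in range(10):
    Q = Q @ Q
print(Q)                         # close to [[0.5, 0.5], [0.5, 0.5]]
```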

1.5. Backward equation, functional version: Suppose we run the Markov
chain until time T then get a reward, V (X(T)). For t ≤ T, define the condi-
tional expectations

f (k, t) = E[V (X(T)) | X(t) = k] .     (4)

This expression is used so often it often is abbreviated

f (k, t) = E_{k,t}[V (X(T))] .

These satisfy a backward equation that follows from the law of total probability:

f (k, t) = E[V (X(T)) | X(t) = k]
        = ∑_{j∈S} E[V (X(T)) | X(t) = k and X(t + 1) = j] P (X(t + 1) = j | X(t) = k)

f (k, t) = ∑_{j∈S} f (j, t + 1)Pkj .     (5)

The Markov property is used to infer that

E[V (X(T)) | X(t) = k and X(t + 1) = j] = E_{j,t+1}[V (X(T))] .

The dynamics (5) must be supplemented with the final condition

f (k, T) = V (k) .     (6)

Using these, we may compute all the numbers f (k, T − 1), then all the numbers
f (k, T − 2), etc.

1.6. Backward equation using modern conditional expectation: As usual, Ft
denotes the algebra generated by X(0), . . ., X(t). Define F (t) = E[V (X(T)) |
Ft]. The left side is a random variable that is measurable in Ft, which means
that F (t) is a function of (X(0), . . . , X(t)). The Markov property implies that
F (t) actually is measurable with respect to Gt, the algebra generated by X(t)
alone. This means that F (t) is a function of X(t) alone, which is to say that
there is a function f (k, t) so that F (t) = f (X(t), t), and

f (X(t), t) = E[V (X(T)) | Ft] = E[V (X(T)) | Gt] .

Since Gt is generated by the partition {k} = {X(t) = k}, this is the same def-
inition (4). Moreover, because Ft ⊂ Ft+1 and F (t + 1) = E[V (X(T)) | Ft+1],
the tower property gives

E[V (X(T)) | Ft] = E[F (t + 1) | Ft] ,

so that, again using the Markov property,

F (t) = E[F (t + 1) | Gt] .     (7)

Note that this is a version of the tower property. On the event {X(t) = k}, the
right side above takes the value

∑_{j∈S} f (j, t + 1) P (X(t + 1) = j | X(t) = k) .

Thus, (7) is the same as the backward equation (5). In the continuous time
versions to come, (7) will be very handy.

1.7. Backward equation, matrix version: We organize the numbers f (k, t)
into a column vector f (t) = (f (1, t), f (2, t), · · ·)^t. It is barely an abuse to write
f (t) both for a function of k and a vector. After all, any computer programmer
knows that a vector really is a function of the index. The backward equation
(5) then is equivalent to (check this)

f (t) = P f (t + 1) .     (8)

Again the associativity of matrix multiplication lets us write, for example,

f (t) = P^{T−t} V ,

writing V for the vector of values of V.
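A companion sketch to the forward example above: the backward recursion (8) for the same two state chain, with a reward V(U) = 1, V(D) = 0 chosen here just for illustration.

```python
import numpy as np

P = np.array([[.8, .2],
              [.2, .8]])
T = 4
V = np.array([1.0, 0.0])        # reward at time T: 1 if the coin ends U, else 0

# Backward equation f(t) = P f(t+1), starting from the final condition f(T) = V.
f = V.copy()
for t in range(T - 1, -1, -1):
    f = P @ f
print(f)                         # f(0); f[0] = E[V(X(4)) | X(0) = U] = 0.5648
```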

1.8. Invariant expectation value: We combine the conditional expectations
(4) with the probabilities u(k, t) and the law of total probability to get, for any
t,

E[V (X(T))] = ∑_{k∈S} P (X(t) = k) E[V (X(T)) | X(t) = k]
            = ∑_{k∈S} u(k, t)f (k, t)
            = u(t)f (t) .

The last line is a natural example of an inner product between a row vector and a
column vector. Note that the product E[V (X(T))] = u(t)f (t) does not depend
on t even though u(t) and f (t) are different for different t. For this invariance
to be possible, the forward evolution equation for u and the backward equation
for f must be related.

1.9. Relationship between the forward and backward equations: It often
is possible to derive the backward equation from the forward equation and
conversely using the invariance of u(t)f (t). For example, suppose we know
that f (t) = P f (t + 1). Then u(t + 1)f (t + 1) = u(t)f (t) may be rewritten
u(t + 1)f (t + 1) = u(t)P f (t + 1), which may be rearranged as (using rules of
matrix multiplication)

( u(t + 1) − u(t)P ) f (t + 1) = 0 .

If this is true for enough linearly independent vectors f (t + 1), then the vector
u(t + 1) − u(t)P must be zero, which is the matrix version of the forward equation
(2). A theoretically minded reader can verify that enough f vectors are produced
if the transition matrix is nonsingular and we choose a linearly independent
family of reward vectors, V. In the same way, the backward evolution of f is
a consequence of invariance and the forward evolution of u.
We now have two ways to evaluate E[V (X(T))]: (i) start with given u(0),
compute u(T) = u(0)P^T, then evaluate u(T)V, or (ii) start with given V = f (T),
compute f (0) = P^T V, then evaluate u(0)f (0). The former might be preferable,
for example, if we had a number of different reward functions to evaluate. We
could compute u(T) once then evaluate u(T)V for all our V vectors.
1.10. Duality: In its simplest form, duality is the relationship between a
matrix and its transpose. The set of column vectors with n components is a
vector space of dimension n. The set of n component row vectors is the dual
space, which has the same dimension but may be considered to be a different
space. We can combine an element of a vector space with an element of its dual
to get a number: row vector u multiplied by column vector f yields the number
uf. Any linear transformation on the vector space of column vectors is repre-
sented by an n × n matrix, P. This matrix also defines a linear transformation,
the dual transformation, on the dual space of row vectors, given by u → uP.
This is the sense in which the forward and backward equations are dual to each
other.
Some people prefer not to use row vectors and instead think of organizing
the probabilities u(k, t) into a column vector that is the transpose of what
we called u(t). For them, the forward equation would be written u(t + 1) =
P^t u(t) (note the notational problem: the t in P^t means transpose while the
t in u(t) and f (t) refers to time). The invariance relation for them would be
u^t(t + 1)f (t + 1) = u^t(t)f (t). The transpose of a matrix is often called its dual.

1.11. Hitting probabilities, backwards: The hitting probability for state 1
up to time T is
P (X(t) = 1 for some t ∈ [0, T]) .     (9)
Here and below we write [a, b] for all the integers between a and b, including
a and/or b if they are integers. Hitting probabilities can be computed using
forward or backward equations, often by modifying P and adding boundary
conditions. For one backward equation approach, define

f (k, t) = P (X(t′) = 1 for some t′ ∈ [t, T] | X(t) = k) .     (10)

Clearly,
f (1, t) = 1 for all t,     (11)
and
f (k, T) = 0 for k ≠ 1.     (12)
Moreover, if k ≠ 1, the law of total probability yields a backward relation

f (k, t) = ∑_{j∈S} Pkj f (j, t + 1) .     (13)

The difference between this and the plain backward equation (5) is that the
relation (13) holds only for "interior" states k ≠ 1, while the boundary condition
(11) supplies the values of f (1, t). The sum on the right of (13) includes the
term corresponding to state j = 1.

1.12. Hitting probabilities, forward: We also can compute the hitting proba-
bilities (9) using a forward equation approach. Define the survival probabilities

u(k, t) = P (X(t) = k and X(t′) ≠ 1 for t′ ∈ [0, t]) .     (14)

These satisfy the obvious boundary condition

u(1, t) = 0 ,     (15)

and initial condition

u(k, 0) = 1 for k ≠ 1.     (16)

The forward equation is (as the reader should check)

u(k, t + 1) = ∑_{j∈S} u(j, t)Pjk .     (17)

We may include or exclude the term with j = 1 on the right because u(1, t) = 0.
Of course, (17) applies only at interior states k ≠ 1. The overall probability
of survival up to time T is ∑_{k∈S} u(k, T) and the hitting probability is the
complementary 1 − ∑_{k∈S} u(k, T).
The matrix vector formulation of this involves the row vector

ũ(t) = (u(2, t), u(3, t), . . .)

and the matrix P̃ formed from P by removing the first row and column. The
evolution equation (17) and boundary condition (15) are both expressed by the
matrix equation
ũ(t + 1) = ũ(t)P̃ .
Note that P̃ is not a stochastic matrix because some of the row sums are less
than one:
∑_{j≠1} P̃kj < 1 if Pk1 > 0 .
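Here is a small numpy sketch of this forward approach for a concrete chain (a 4 state chain chosen here as an assumed example): delete row and column 1 of P, evolve the sub-stochastic matrix, and read off the hitting probability.

```python
import numpy as np

# Assumed example: a 4 state chain; we want P(X(t) = 1 for some t in [0, T]).
P = np.array([[1.0, 0.0, 0.0, 0.0],      # row 1 is irrelevant below; it gets removed
              [0.3, 0.4, 0.3, 0.0],
              [0.0, 0.3, 0.4, 0.3],
              [0.0, 0.0, 0.3, 0.7]])
T = 10

P_tilde = P[1:, 1:]                       # remove first row and column (state 1)
u = np.array([1.0, 0.0, 0.0])             # start in state 2, which is not state 1

# Survival probabilities: u(t+1) = u(t) P_tilde.
for t in range(T):
    u = u @ P_tilde

survival = u.sum()
print(1.0 - survival)                     # hitting probability up to time T
```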

1.13. Absorbing boundaries: Absorbing boundaries are another way to think
about hitting and survival probabilities. The absorbing boundary Markov chain
is the same as the original chain (same transition probabilities) as long as the
state is not one of the boundary states. In the absorbing chain, the state never
again changes after it visits an absorbing boundary point. If P̄ is the transition
matrix of the absorbing chain and P is the original transition matrix, this means
that P̄jk = Pjk if j is not a boundary state, while P̄jk = 0 if j is a boundary
state and k ≠ j. The probabilities u(k, t) for the absorbing chain are the same
as the survival probabilities (14) for the original chain.

1.14. Running cost: Suppose we have a running cost function, W (x), and
we want to calculate

f = E[ ∑_{t=0}^T W (X(t)) ] .     (18)

Sums like this are called "path dependent" because their value depends on the
whole path, not just the final value X(T). We can calculate (18) with the
forward equation using

f = ∑_{t=0}^T E[W (X(t))]

  = ∑_{t=0}^T u(t)W .     (19)

Here W is the column vector with components W_k = W (k). We compute the
probabilities that are the components of the u(t) using the standard forward
equation (2) and sum the products (19).
One backward equation approach uses the quantities

f (k, t) = E_{k,t}[ ∑_{t′=t}^T W (X(t′)) ] .     (20)

These satisfy (check this):

f (t) = P f (t + 1) + W .     (21)

Starting with f (T) = W, we work backwards with (21) until we reach the
desired f (0).
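For completeness, a brief sketch of the backward recursion (21), again on the two state coin chain with an assumed running cost W(U) = 0, W(D) = 1 (so we count the expected number of D visits up to time T).

```python
import numpy as np

P = np.array([[.8, .2],
              [.2, .8]])
T = 4
W = np.array([0.0, 1.0])        # running cost: 1 each time the coin is D

# f(t) = P f(t+1) + W, starting from f(T) = W.
f = W.copy()
for t in range(T - 1, -1, -1):
    f = P @ f + W
print(f[0])                      # expected number of D visits in [0, T] starting from U
```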

1.15. Multiplicative functionals: For some reason, a function of a function is
often called a "functional". The path, X(t), is a function of t, so a function, F (X),
that depends on the whole path is often called a functional. Some applications
call for finding the expected value of a multiplicative functional:

f = E[ ∏_{t=0}^T V (X(t)) ] .     (22)

For example, X(t) could represent the state of a financial market and V (k) =
1 + r(k) the interest rate for state k. Then (22) would be the expected total
interest. We also can write V (k) = e^{W(k)}, so that

∏_t V (X(t)) = exp( ∑_t W (X(t)) ) = e^Z ,

with Z = ∑_t W (X(t)). This does not solve the problem of evaluating (22) because
E[e^Z] ≠ e^{E[Z]}.
The backward equation approach uses the intermediate quantities

f (k, t) = E_{k,t}[ ∏_{t′=t}^T V (X(t′)) ] .

The t′ = t term in the product has V (X(t)) = V (k). The final condition is
f (k, T) = V (k). The backward evolution equation is derived more or less as
before:

f (k, t) = E_{k,t}[ V (k) ∏_{t′>t} V (X(t′)) ]

        = V (k) E_{k,t}[ ∏_{t′=t+1}^T V (X(t′)) ]

        = V (k) E_{k,t}[f (X(t + 1), t + 1)]     (the tower property)

f (k, t) = V (k) ( P f (t + 1) )(k) .     (23)

In the last line on the right, f (t + 1) is the column vector with components
f (k, t + 1) and P f (t + 1) is the matrix vector product. We write ( P f (t + 1) )(k)
for the k-th component of the column vector P f (t + 1). We could express the
whole thing in matrix terms using diag(V), the diagonal matrix with V (k) in
the (k, k) position:
f (t) = diag(V)P f (t + 1) .
A version of (23) for Brownian motion is called the Feynman-Kac formula.
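A quick sketch of (23) in numpy, with an assumed two state "interest rate" V = (1.01, 1.05) on the coin chain; the update line is the matrix form f(t) = diag(V) P f(t+1).

```python
import numpy as np

P = np.array([[.8, .2],
              [.2, .8]])
T = 4
V = np.array([1.01, 1.05])       # assumed per-period growth factor in states U, D

# Backward recursion f(k,t) = V(k) (P f(t+1))(k), final condition f(T) = V.
f = V.copy()
for t in range(T - 1, -1, -1):
    f = V * (P @ f)              # same as np.diag(V) @ P @ f
print(f[0])                       # E[ prod_{t=0}^T V(X(t)) | X(0) = U ]
```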

1.16. Branching processes: One forward equation approach to (22) leads to a different interpretation of the answer. Let B(k, t) be the event {X(t) = k} and I(k, t) the indicator function of B(k, t). That is, I(k, t, X) = 1 if X ∈ B(k, t) (i.e. X(t) = k), and I(k, t, X) = 0 otherwise. It is in keeping with the probabilists' habit of leaving out the arguments of functions when the argument is the underlying random outcome. We have u(k, t) = E[I(k, t)]. The forward equation for the quantities

    g(k, t) = E[ I(k, t) \prod_{t'=0}^{t} V(X(t')) ]    (24)

is (see homework):

    g(k, t) = V(k) ( g(t-1) P )(k) .    (25)

This is also the forward equation for a branching process with branching factors V(k). At time t, the branching process has N(k, t) particles, or walkers, at state k. The numbers N(k, t) are random. A time step of the branching process has two parts. First, each particle takes one step of the Markov chain. A particle at state j goes to state k with probability P_{jk}. All steps for all particles are independent. Then, each particle at state k does a branching or birth/death step in which the particle is replaced by a random number of particles with expected number V(k). For example, if V(k) = 1/2, we could delete the particle (death) with probability one half. If V(k) = 2.8, we could keep the existing particle, add one new one, then add a third with probability .8. All particles are treated independently. If there are m particles in state k before the birth/death step, the expected number after the birth/death step is V(k)m. The expected number of particles, g(k, t) = E[N(k, t)], satisfies (25).

When V(k) = 1 for all k there need be no birth or death. There will be just one particle, the path X(t). The number of particles at state k at time t, N(k, t), will be zero if X(t) ≠ k or one if X(t) = k. In fact, N(k, t) = I(k, t)(X). The expected values will be g(k, t) = E[N(k, t)] = E[I(k, t)] = u(k, t).
The branching process representation of (22) is possible when V(k) ≥ 0 for all k. Monte Carlo methods based on branching processes are more accurate than direct Monte Carlo in many cases.
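A Python sketch of the branching-process interpretation; the chain, the factors V(k), and the specific birth/death rule (keep floor(V) copies and add one more with probability equal to the fractional part) are assumptions made for the illustration, consistent with the description above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical small chain and nonnegative factors V(k), for illustration only.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.6, 0.2],
                  [0.0, 0.5, 0.5]])
    V = np.array([0.9, 1.1, 1.3])
    T, n_states = 8, 3

    # Forward recursion (25): g(k,0) = V(k) u(k,0), then g(t) = V * (g(t-1) P).
    u0 = np.array([1.0, 0.0, 0.0])
    g = V * u0
    for t in range(1, T + 1):
        g = V * (g @ P)

    def branch(counts, V, rng):
        # replace each particle at state k by floor(V[k]) + Bernoulli(frac(V[k])) copies
        out = np.zeros_like(counts)
        for k in range(len(counts)):
            frac = V[k] - np.floor(V[k])
            out[k] = int(np.floor(V[k])) * counts[k] + rng.binomial(counts[k], frac)
        return out

    # Branching-process Monte Carlo: E[N(k,T)] should be close to g(k,T).
    trials, total = 20000, np.zeros(n_states)
    for _ in range(trials):
        N = np.array([1, 0, 0])             # one walker starts in state 0
        N = branch(N, V, rng)               # branching at t = 0
        for t in range(1, T + 1):
            moved = np.zeros(n_states, dtype=int)
            for k in range(n_states):       # each particle takes a Markov chain step
                moved += rng.multinomial(N[k], P[k])
            N = branch(moved, V, rng)       # then each particle branches
        total += N
    print(g, total / trials)                # Monte Carlo estimate of E[N(k,T)] vs g(k,T)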

2 Lattices, trees, and random walk

2.1. Introduction: Random walk on a lattice is an important example where the abstract theory of Markov chains is used. It is the simplest model of something randomly moving through space with none of the subtlety of Brownian motion, though random walk on a lattice is a useful approximation to Brownian motion, and vice versa. The forward and backward equations take a specific simple form for lattice random walk and it is often possible to calculate or approximate the solutions by hand. Boundary conditions will be applied at the boundaries of lattices, hence the name.
We pursue forward and backward equations for several reasons. First, they often are the best way to calculate expectations and hitting probabilities. Second, many theoretical qualitative properties of specific Markov chains are understood using backward or forward equations. Third, they help explain and motivate the partial differential equations that arise as backward and forward equations for diffusion processes.

2.2. Simple random walk: The state space for simple random walk is the integers, positive and negative. At each time, the walker has three choices: (A) move up one, (B) do not move, (C) move down one. The probabilities are P(A) = P(k → k+1) = a, P(B) = P(X(t+1) = X(t)) = b, and P(X(t+1) = X(t) − 1) = c. Naturally, we need a, b, and c to be non-negative and a + b + c = 1. The transition matrix2 has b on the diagonal (P_{kk} = b for all k), a on the super-diagonal (P_{k,k+1} = a for all k), and c on the sub-diagonal. All other matrix elements P_{jk} are zero.
This Markov chain is homogeneous or translation invariant: the probabilities of moving up or down are independent of X(t). A translation by k is a shift of everything by k (I do not know why this is called translation). Translation invariance means, for example, that the probability of going from m to l in s steps is the same as the probability of going from m + k to l + k in s steps: P(X(t+s) = l | X(t) = m) = P(X(t+s) = l + k | X(t) = m + k). It is common to simplify general discussions by choosing k so that X(0) = 0. Mathematicians often say "without loss of generality" or "w.l.o.g." when doing so.
2 This matrix is infinite when the state space is infinite. Matrix multiplication is still defined. For example, the k component of uP is given by (uP)_k = \sum_j u_j P_{jk}. This possibly infinite sum has only three nonzero terms when P is tridiagonal.

Often, particularly when discussing multidimensional random walk, we use x, y, etc. instead of j, k, etc. to denote lattice points (states of the Markov chain). Probabilists often use lower case Latin letters for general possible values of a random variable, while using the capital letter for the random variable itself. Thus, we might write P_{xy} = P(X(t+1) = y | X(t) = x). As an exercise in definition unwrapping, review Lecture 1 and check that this is the same as P_{X(t),x} = P(X(t+1) = x | F_t).

2.3. Gaussian approximation, drift, and volatility: We can write X(t+1) = X(t) + Y(t), where P(Y(t) = 1) = a, P(Y(t) = 0) = b, and P(Y(t) = −1) = c. The random variables Y(t) are independent of each other because of the Markov property and homogeneity. Assuming (without loss of generality) that X(0) = 0, we have

    X(t) = \sum_{s=0}^{t-1} Y(s) ,    (26)

which expresses X(t) as a sum of iid (independent and identically distributed) random variables. The central limit theorem then tells us that for large t, X(t) is approximately Gaussian with mean µt and variance σ²t, where µ = E[Y(t)] = a − c and σ² = var[Y(t)] = a + c − (a − c)². These are called drift and volatility3 respectively. The mean and variance of X(t) grow linearly in time with rates µ and σ² respectively. Figure 1 shows some probability distributions for simple random walk.
3 People use the term volatility in two distinct ways. In the Black Scholes theory, volatility means something else.
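The comparison in Figure 1 is easy to reproduce. The Python sketch below computes the exact distribution of X(T) with the forward equation and compares it with the central limit theorem approximation; the parameter values are the ones quoted in the figure, and the coding details (array indexing, the error measure printed at the end) are choices made for the illustration.

    import numpy as np

    # Simple random walk with the parameters of Figure 1.
    a, b, c = 0.2, 0.2, 0.6      # up, stay, down probabilities (a + b + c = 1)
    T = 60

    # Exact distribution of X(T) by the forward equation on states -T..T.
    u = np.zeros(2 * T + 1)      # index k + T corresponds to state k
    u[T] = 1.0                   # X(0) = 0
    for t in range(T):
        new = b * u
        new[1:]  += a * u[:-1]   # arriving at k from k-1 (move up)
        new[:-1] += c * u[1:]    # arriving at k from k+1 (move down)
        u = new

    # Central limit theorem approximation: X(T) is roughly N(mu*T, sigma^2*T).
    mu = a - c
    sigma2 = a + c - (a - c) ** 2
    k = np.arange(-T, T + 1)
    gauss = np.exp(-(k - mu * T) ** 2 / (2 * sigma2 * T)) / np.sqrt(2 * np.pi * sigma2 * T)

    print(np.max(np.abs(u - gauss)))   # small for T this large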

2.4. Trees: Simple random walk can be thought of as a sequence of decisions. At each time you decide: up (A), stay (B), or down (C). A more general sequence of decisions is a decision tree. In a general decision tree, making choice A at time 0 then B at time one would have a different result than choosing first B then A. After t decisions, there could be 3^t different decision paths and results.
The simple random walk decision tree is recombining, which means that many different decision paths lead to the same X(t). For example, starting (w.l.o.g.) with X(0) = 0, the paths ABB, CAA, BBA, etc. all lead to X(3) = 1. A recombining tree is much smaller than a general decision tree. For simple random walk, after t steps there are 2t + 1 possible states, instead of up to 3^t. For t = 10, this is 21 instead of about 60 thousand.

2.5. Urn models: Urn models illustrate several features of more general random walks. Unlike simple random walk, urn models are mean reverting and have steady state probabilities that determine their large time behavior. We will come back to them when we discuss scaling in future lectures.
The simple urn contains n balls that are identical except for their color. There are k red balls and n − k green ones. At each stage, someone chooses one of the balls at random, with each ball equally likely to be chosen. He or she replaces the chosen ball with a fresh ball that is red with probability p and green with probability 1 − p.

Figure 1: The probability distributions after T = 8 (top) and T = 60 (bottom) steps for simple random walk (a = 0.20, b = 0.20, c = 0.60). The smooth curve and circles represent the central limit theorem Gaussian approximation. The plots have different probability and k scales. Values not shown have very small probability.
All choices are independent. The number of red balls decreases by one if he or she removes a red ball and returns a green one. This happens with probability (k/n)(1 − p). Similarly, the k → k+1 probability is ((n − k)/n) p. In formal terms, the state space is the integers from 0 to n and the transition probabilities are

    P_{k,k-1} = k(1-p)/n ,   P_{kk} = ( (2p-1)k + (1-p)n ) / n ,   P_{k,k+1} = (n-k)p/n ,

    P_{jk} = 0 otherwise.

If these formulas are right, then P_{k,k-1} + P_{kk} + P_{k,k+1} = 1.

2.6. Urn model steady state: For the simple urn model, the probabilities u(k, t) = P(X(t) = k) converge to steady state probabilities, v(k), as t → ∞. This is illustrated in Figure 2. The steady state probabilities are

    v(k) = \binom{n}{k} p^k (1-p)^{n-k} .

The steady state probabilities have the property that if u(k, t) = v(k) for all k, then u(k, t+1) = v(k) also for all k. This is statistical steady state because the probabilities have reached steady state values though the states themselves keep changing, as in Figure 3. In matrix vector notation, we can form the row vector, v, with entries v(k). Then v is a statistical steady state if vP = v. It is no coincidence that v(k) is the probability of getting k red balls in n independent trials with probability p for each trial. The steady state expected number of red balls is

    E_v[X] = np ,

where the notation E_v[·] refers to expectation in the probability distribution v.

2.7. Urn model mean reversion: If we let m(t) be the expected value of X(t), then a calculation using the transition probabilities gives the relation

    m(t+1) = m(t) + (1/n)(np − m(t)) .    (27)

This relation shows not only that m(t) = np is a steady state value (m(t) = np implies m(t+1) = np), but also that m(t) → np as t → ∞ (if r(t) = m(t) − np, then r(t+1) = α r(t) with |α| = 1 − 1/n < 1).
Another way of expressing mean reversion will be useful in discussing stochastic differential equations later. Because the urn model is a Markov chain,

    E[X(t+1) | F_t] = E[X(t+1) | X(t)] .

Again using the transition probabilities, we get

    E[X(t+1) | F_t] = X(t) + (1/n)(np − X(t)) .    (28)

Figure 2: The probability distributions for the simple urn model (n = 30), plotted every T = 6 time steps. The first curve is blue, low, and flat. The last one is red and most peaked in the center. The computation starts with each state being equally likely. Over time, states near the edges become less likely.
Figure 3: A Monte Carlo sampling of 11 paths from the simple urn model (p = 0.5, n = 100). At time t = 0 (the left edge), the paths are evenly spaced within the state space.
If X(t) > np, then

    E[ΔX(t) | F_t] = E[X(t+1) − X(t) | F_t] = (1/n)(np − X(t))

is negative. If X(t) < np, it is positive.
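A short Python sketch that builds the urn transition matrix from the formulas above and checks the steady state relation vP = v and the mean reversion relation (27); the particular n and p, and the starting value for m, are arbitrary choices for the illustration.

    import numpy as np
    from math import comb

    # Simple urn model with illustrative parameters n and p.
    n, p = 50, 0.3
    k = np.arange(n + 1)

    # Tridiagonal transition matrix on states 0..n.
    down = k * (1 - p) / n                        # P_{k,k-1}
    up = (n - k) * p / n                          # P_{k,k+1}
    stay = 1.0 - down - up                        # P_{kk}
    P = np.diag(stay) + np.diag(up[:-1], 1) + np.diag(down[1:], -1)

    # Binomial steady state v(k) = C(n,k) p^k (1-p)^(n-k) satisfies v P = v.
    v = np.array([comb(n, i) * p**i * (1 - p)**(n - i) for i in k])
    print(np.max(np.abs(v @ P - v)))              # essentially zero

    # Mean reversion (27): m(t+1) = m(t) + (np - m(t))/n from any starting value.
    m = float(n)                                  # start with all balls red
    for t in range(500):
        m += (n * p - m) / n
    print(m, n * p)                               # m(t) approaches np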

2.8. Boundaries: The terms boundary, interior, region, etc. as used in the general discussion of Markov chain hitting probabilities come from applications in lattice Markov chains such as simple random walk. For example, the region x > α has boundary x = α. The quantities

    u(x, t) = P(X(t) = x and X(s) > α for 0 ≤ s ≤ t)

satisfy the forward equation (just (1) in this special case)

    u(x, t+1) = a u(x−1, t) + b u(x, t) + c u(x+1, t)

for x > α together with the absorbing boundary condition u(α, t) = 0. We could create a finite state space Markov chain by considering a region α < x < β with simple random walk in the interior together with absorbing boundaries at x = α and x = β. Absorbing boundary conditions are also called Dirichlet boundary conditions.
Another way to create a finite state space Markov chain is to put reflecting boundaries at x = α and x = β. This chain has the same transition probabilities as ordinary random walk in the interior (α < x < β). However, transitions from α to α − 1 are disallowed and replaced by transitions from α to α + 1. This means changing the transition probabilities starting from x = α to

    P(α → α−1) = P_{α,α−1} = 0 ,   P(α → α) = P_{αα} = b ,   P(α → α+1) = P_{α,α+1} = a + c .

The transition rules at x = β are similarly changed to block β → β + 1 transitions. There is some freedom in defining the reflection rules at the boundaries. We could, for example, make P(α → α) = b + c and P(α → α+1) = a, which changes the blocked transition to standing still rather than moving right. We return to this point in discussing oblique reflection in multidimensional random walks and diffusions.
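The two kinds of boundary condition are easy to encode as modified rows of the transition matrix. The Python sketch below does this for a small interval of states; the interval alpha = 0, ..., beta = 10 and the values of a, b, c are invented for the example.

    import numpy as np

    # Random walk on states alpha = 0, ..., beta = 10 with illustrative a, b, c.
    a, b, c = 0.3, 0.4, 0.3
    n = 11                                    # states 0, 1, ..., 10
    P = b * np.eye(n) + a * np.eye(n, k=1) + c * np.eye(n, k=-1)

    # Absorbing (Dirichlet) boundaries: the state never changes once it hits 0 or 10.
    P_abs = P.copy()
    P_abs[0, :] = 0.0;  P_abs[0, 0] = 1.0
    P_abs[-1, :] = 0.0; P_abs[-1, -1] = 1.0

    # Reflecting boundaries: the blocked step is redirected one site inward.
    P_ref = P.copy()
    P_ref[0, 0] = b; P_ref[0, 1] = a + c      # the blocked move 0 -> -1 is sent to 0 -> 1
    P_ref[-1, -1] = b; P_ref[-1, -2] = a + c  # the blocked move 10 -> 11 is sent inward

    print(P_abs.sum(axis=1), P_ref.sum(axis=1))   # every row sums to 1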

2.9. Multidimensional lattice: The unit square lattice in d dimensions is the set of d-tuples of integers (the set of integers is called Z):

    x = (x_1, . . . , x_d) with x_j ∈ Z for 1 ≤ j ≤ d .

The scaled square lattice, with lattice spacing h > 0, is the set of points hx = (hx_1, . . . , hx_d), where x are integer lattice points. In the present discussion, the scaling is irrelevant, so we use the unit lattice. We say that lattice points x and y are neighbors if

    |x_j − y_j| ≤ 1 for all coordinates j = 1, . . . , d .

Stochastic Calculus Notes, Lecture 3
Last modified September 30, 2004

1 Martingales and stopping times

1.1. Introduction: Martingales and stopping times are important technical tools used in the study of stochastic processes such as Markov chains and diffusions. A martingale is a stochastic process that is always unpredictable in the sense that E[F_{t+t'} | F_t] = F_t (see below) if t' > 0. A stopping time is a random time, τ(ω), so that we know at time t whether to stop, i.e. the event {τ ≤ t} is measurable in F_t. These tools work well together because a martingale stopped at a stopping time still has the martingale property: if t ≤ τ ≤ t', then E[F_τ | F_t] = F_t. A central fact about the Ito calculus is that Ito integrals with respect to Brownian motion are martingales.

1.2. Stochastic process: Here is a more abstract definition of a discrete time stochastic process. We have a probability space, Ω. The information available at time t is represented by the σ-algebra of events F_t. We assume that for each t, F_t ⊆ F_{t+1}; since we are supposed to gain information going from t to t + 1, every known event in F_t is also known at time t + 1. A stochastic process is a family of random variables, X_t(ω), with X_t ∈ F_t (X_t measurable with respect to F_t). Sometimes it happens that the random variables X_t contain all the information in the F_t in the sense that F_t is generated by X_1, . . ., X_t. This is the minimal σ-algebra in which the X_t form a stochastic process. In other cases F_t contains more information. Economists use these possibilities when they distinguish between the "weak" efficient market hypothesis (the F_t are minimal), and the "strong" hypothesis (F_t contains all the public information in the world, literally). In the case of minimal F_t, it may be possible to identify the outcome, ω, with the path X = X_1, . . . , X_T. This is less common when the F_t are not minimal because the extra information may have to do with processes other than X_t. For the definition of stochastic process, the probabilities are not important, just the σ-algebras of sets and the random variables X_t. An expanding family of σ-algebras F_t ⊆ F_{t+1} is a filtration.

1.3. Notation: The value of a stochastic process at time t may be written X_t or X(t). The subscript notation reminds us that the X_t are a family of functions of the random outcome (random variable) ω. In practical contexts, particularly in discussing multidimensional processes (X(t) ∈ R^n), we prefer X(t) so that X_k(t) can represent the k-th component of X(t). When the process is a martingale, we often call it F_t. This will allow us to let X(t) be a Markov chain and F_t(X) a martingale function of X.

1.4. Example 1, Markov chains: In this example, the F_t are minimal and Ω is the path space of sequences of length T from the state space, S. The new information revealed at time t is the state of the chain at time t. The variables X_t may be called coordinate functions because X_t is coordinate t (or entry t) in the sequence X. In principle, we could express this with the notation X_t(X), but that would drive people crazy. Although we distinguish between Markov chains (discrete time) and Markov processes (continuous time), the term stochastic process can refer to either continuous or discrete time.

1.5. Example 2, dyadic sets: This is a set of definitions for discussing averages over a range of length scales. The time variable, t, represents the amount of averaging that has been done. The new information revealed at time t is finer scale information about a function (an audio signal or digital image). The state space is the positive integers from 1 to 2^T. We start with a function X(ω) and ask that X_t(ω) be constant on dyadic blocks of length 2^{T−t}. The dyadic blocks at level t are

    B_{t,k} = { 1 + (k−1) 2^{T−t}, 2 + (k−1) 2^{T−t}, . . . , k 2^{T−t} } .    (1)

The reader should check that moving from level t to level t + 1 splits each block into right and left halves:

    B_{t,k} = B_{t+1,2k−1} ∪ B_{t+1,2k} .    (2)

The σ-algebras F_t are generated by the block partitions

    P_t = { B_{t,k} with k = 1, . . . , 2^{T−t} } .

Because F_t ⊆ F_{t+1}, the partition P_{t+1} is a refinement of P_t. The union (2) shows how. We will return to this example after discussing martingales.

1.6. Martingales: A real valued stochastic process, F_t, is a martingale1 if

    E[F_{t+1} | F_t] = F_t .

If we take the overall expectation of both sides we see that the expectation value does not depend on t, E[F_{t+1}] = E[F_t]. The martingale property says more. Whatever information you might have at time t notwithstanding, still the expectation of future values is the present value. There is a gambling interpretation: F_t is the amount of money you have at time t. No matter what has happened, your expected winnings between t and t + 1, the martingale difference Y_{t+1} = F_{t+1} − F_t, has zero expected value. You can also think of martingale differences as a generalization of independent random variables. If the random variables Y_t were actually independent with mean zero, then the sums F_t = \sum_{k=1}^{t} Y_k would form a martingale (using the F_t generated by Y_1, . . ., Y_t). The reader should check this.
1 For finite Ω this is the whole story. For countable Ω we also assume that the sums defining E[X_t] converge absolutely. This means that E[|X_t|] < ∞. That implies that the conditional expectations E[X_{t+1} | F_t] are well defined.
1.7. Examples: The simplest way to get a martingale is to start with a random variable, F(ω), and define F_t = E[F | F_t]. If we apply this to a Markov chain with the minimal filtration F_t, and F is a final time reward F = V(X(T)), then F_t = f(X(t), t) as in the previous lecture. If we apply this to Ω = {1, 2, . . . , 2^T}, with uniform probability P(ω_k) = 2^{−T} for ω_k ∈ Ω, and the dyadic filtration, we get the dyadic martingale with F_t(j) constant on the dyadic blocks (1) and equal to the average of F over the block that j is in.
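The dyadic martingale is easy to build explicitly. In the Python sketch below, the "signal" F is random data invented for the example; the function dyadic_level averages F over the level-t blocks (1), and the final loop checks the martingale property E[F_{t+1} | F_t] = F_t by re-averaging the finer level over the coarser blocks.

    import numpy as np

    rng = np.random.default_rng(1)
    T = 4
    F = rng.standard_normal(2 ** T)          # a made-up signal F(omega), omega = 1..2^T

    def dyadic_level(values, t):
        # average over blocks of length 2^(T-t); the result is constant on each B_{t,k}
        block = 2 ** (T - t)
        means = values.reshape(-1, block).mean(axis=1)
        return np.repeat(means, block)

    levels = [dyadic_level(F, t) for t in range(T + 1)]   # F_0, F_1, ..., F_T = F

    # Martingale check with uniform probability on omega:
    # averaging F_{t+1} over the level-t blocks gives F_t back.
    for t in range(T):
        assert np.allclose(dyadic_level(levels[t + 1], t), levels[t])
    print(levels[0][:4], F.mean())           # F_0 is constant, equal to the overall mean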

1.8. A lemma on conditional expectation: In working with martingales we often make use of a basic lemma about conditional expectation. Suppose U(ω) and Y(ω) are real valued random variables and that U ∈ F. Then

    E[U Y | F] = U E[Y | F] .    (3)

We see this using classical conditional expectation over the sets in the partition defining F. Let B be one of these sets. Let y_B = E[Y | B] be the value of E[Y | F] for ω ∈ B. We know that U(ω) is constant in B because U ∈ F. Call this value u_B. Then E[U Y | B] = u_B E[Y | B] = u_B y_B. But this is the value of U E[Y | F] for ω ∈ B. Since each ω is in some B, this proves (3) for all ω.

1.9. Doob's principle: This lemma lets us make new martingales from old ones. Let F_t be a martingale and Y_t = F_t − F_{t−1} the martingale differences (called innovations by statisticians and returns in finance). We use the convention that F_{−1} = 0, so that F_0 = Y_0. The martingale condition is that E[Y_{t+1} | F_t] = 0. Clearly F_t = \sum_{t'=0}^{t} Y_{t'}.
Suppose that at time t we are allowed to place a bet of any size2 on the as yet unknown martingale difference, Y_{t+1}. Let U_t ∈ F_t be the size of the bet. The return from betting on Y_t will be U_{t−1} Y_t, and the total accumulated return up to time t is

    G_t = U_0 Y_1 + U_1 Y_2 + · · · + U_{t−1} Y_t .    (4)

Because of the lemma (3), the betting returns have E[U_t Y_{t+1} | F_t] = 0, so E[G_{t+1} | F_t] = G_t and G_t also is a martingale.
The fact that G_t in (4) is a martingale sometimes is called Doob's principle or Doob's theorem after the probabilist who formulated it. A special case below for stopping times is Doob's stopping time theorem or the optional stopping theorem. They all say that strategizing on a martingale never produces anything but a martingale. Nonanticipating strategies on martingales do not give positive expected returns.
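A Monte Carlo sanity check of Doob's principle in Python; the coin-flip martingale and the particular nonanticipating betting rule (bet 1 after a loss, 2 after a win) are invented for the example.

    import numpy as np

    # Martingale transform: G_t = sum_s U_{s-1} Y_s with the bet depending only on the past.
    rng = np.random.default_rng(2)
    n_paths, T = 100000, 20

    Y = rng.choice([-1.0, 1.0], size=(n_paths, T))      # iid mean zero differences
    F = np.cumsum(Y, axis=1)                            # F_t is a martingale

    # A nonanticipating (and rather arbitrary) betting rule: bet 1 after a loss, 2 after a win.
    U = np.ones((n_paths, T))
    U[:, 1:] = np.where(Y[:, :-1] > 0, 2.0, 1.0)        # the bet at step t uses only Y_1..Y_{t-1}
    G = np.cumsum(U * Y, axis=1)                        # accumulated betting returns (4)

    print(F[:, -1].mean(), G[:, -1].mean())             # both close to zero

However clever the rule, G_t remains a martingale, which is the content of Doob's principle.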

1.10. Weak and strong efficient market hypotheses: It is possible that the random variables F_t form a martingale with respect to their minimal filtration, F_t, but not with respect to an enriched filtration G_t ⊇ F_t. The simplest example would be the σ-algebras G_t = F_{t+1}, which already know the value of F_{t+1} at time t. Note that the F_t also are a stochastic process with respect to the G_t. The weak efficient market hypothesis is that e^{−µt} S_t is a martingale (S_t being the stock price and µ its expected growth rate) with respect to its minimal filtration. Technical analysis means using trading strategies that are nonanticipating with respect to the minimal filtration. Therefore, the weak efficient market hypothesis says that technical trading does not produce better returns than buy and hold. Any extra information you might get by examining the price history of S up to time t is already known by enough people that it is already reflected in the price S_t.
The strong efficient market hypothesis states that e^{−µt} S_t is a martingale with respect to the filtration, G_t, representing all the public information in the world. This includes the previous price history of S and much more (prices of related stocks, corporate reports, market trends, etc.).
2 We may have to require that the bet have finite expected value.

1.11. Investing with Doob: Economists sometimes use Doob's principle and the efficient market hypotheses to make a point about active trading in the stock market. Suppose that F_t, the price of a stock at time t, is a martingale3. Suppose that at time t we use all the information in F_t and choose an amount, U_t, to invest at time t. The fact that the resulting accumulated return, G_t, has zero expected value is said to show that active investing is no better than a buy and hold strategy that just produces the value F_t. The well known book A Random Walk Down Wall Street is mostly an exposition of this point of view. This argument breaks down when applied to non martingale processes, such as stock prices over longer times. Active trading strategies such as (4) may reduce the risk more than enough to compensate risk averse investors for small amounts of lost expected value. Merton's optimal dynamic investment analysis is a simple example of an active trading strategy that is better for some people than passive buy and hold.
3 This is a reasonable approximation for much short term trading.

1.12. Stopping times: We have Ω and the expanding family F_t. A stopping time is a function τ(ω) that takes one of the values 1, . . ., T, so that the event {τ ≤ t} is in F_t. Stopping times might be thought of as possible strategies. Whatever your criterion for stopping is, you have enough information at time t to know whether you should stop at time t. Many stopping times are expressed as the first time something happens, such as the first time X_t > a. We cannot ask to stop, for example, at the last t with X_t > a, because we might not know at time t whether X_{t'} > a for some t' > t.

1.13. Doob's stopping time theorem for one stopping time: Because stopping times are nonanticipating strategies, they also cannot make money from a martingale. One version of this statement is that E[X_τ] = E[X_1]. The proof of this makes use of the events B_t, that τ = t. The stopping time hypothesis is that B_t ∈ F_t. Since τ has some value 1 ≤ τ ≤ T, the B_t form a partition of Ω. Also, if ω ∈ B_t, then τ(ω) = t, so X_τ = X_t. Therefore,

    E[X_1] = E[X_T]
           = \sum_{t=1}^{T} E[X_T | B_t] P(B_t)
           = \sum_{t=1}^{T} E[X_τ | B_t] P(τ = t)
           = E[X_τ] .

In this derivation we made use of the classical statement of the martingale property: if B ∈ F_t then E[X_T | B] = E[X_t | B]. In our case B = B_t, and on B_t we have X_t = X_τ.
This simple idea, using the martingale property applied to the partition B_t, is crucial for much of the theory of martingales. The idea itself was first used by Kolmogorov in the context of random walk or Brownian motion. Doob realized that Kolmogorov's argument was even simpler and more beautiful when applied to martingales.

1.14. Stopping time paradox: The technical hypotheses above, finite state space and bounded stopping times, may be too strong, but they cannot be completely ignored, as this famous example shows. Let X_t be a symmetric random walk starting at zero. This forms a martingale, so E[X_τ] = 0 for any stopping time, τ. On the other hand, suppose we take τ = min(t | X_t = 1). Then X_τ = 1 always, so E[X_τ] = 1. The catch is that there is no T with τ(ω) ≤ T for all ω. Even though τ < ∞ almost surely (more to come on that expression), E[τ] = ∞ (explanation later). Even that would be OK if the possible values of X_t were bounded. Suppose you choose T and set τ' = min(τ, T). That is, you wait until X_t = 1 or t = T, whichever comes first, to stop. For large T, it is very likely that you stopped because X_t = 1. Still, those paths that never reached 1 probably drifted just far enough in the negative direction so that their contribution to the overall expected value cancels the 1 to yield E[X_{τ'}] = 0.

1.15. More stopping times theorem: Suppose we have an increasing family of stopping times, 1 ≤ τ_1 ≤ τ_2 ≤ · · ·. In a natural way the random variables Y_1 = X_{τ_1}, Y_2 = X_{τ_2}, etc. also form a martingale. This is a final elaborate way of saying that strategizing on a martingale is a no win game.
Stochastic Calculus Notes, Lecture 4
Last modified October 4, 2004

1 Continuous probability

1.1. Introduction: Recall that a set is discrete if it is finite or countable.


We will call a set continuous if it is not discrete. Many of the probability
spaces used in stochastic calculus are continuous in this sense (examples below).
Kolmogorov1 suggested a general framework for continuous probability based
on abstract integration with respect to abstract probability measures. The
theory makes it possible to discuss general constructions such as conditional
expectation in a way that applies to a remarkably diverse set of examples.
The difference between continuous and discrete probability is the difference between integration and summation. Continuous probability cannot be based on the formula

    P(A) = \sum_{ω ∈ A} P(ω) .    (1)

Indeed, the typical situation in continuous probability is that any event consisting of a single outcome has probability zero: P({ω}) = 0 for all ω ∈ Ω.
As we explain below, the classical formalism of probability densities also does not apply in many of the situations we are interested in. Abstract probability measures give a framework for working with probability in path space, as well as more traditional discrete probability and probabilities given by densities on R^n.
These notes outline Kolmogorov's formalism of probability measures for continuous probability. We leave out a great number of details and mathematical proofs. Attention to all these details would be impossible within our time constraints. In some cases we indicate where a precise definition or a complete proof is missing, but sometimes we just leave it out. If it seems like something is missing, it probably is.

1.2. Examples of continuous probability spaces: By definition, a probability space is a set, Ω, of possible outcomes, together with a σ-algebra, F, of measurable events. This section discusses only the sets Ω. The corresponding σ-algebras are discussed below.
R, the real numbers. If x_0 is a real number and u(x) is a probability density, then the probability of the event B_r(x_0) = {x_0 − r ≤ X ≤ x_0 + r} is

    P([x_0 − r, x_0 + r]) = \int_{x_0 − r}^{x_0 + r} u(x) dx → 0 as r → 0.

Thus the probability of any individual outcome is zero. An event with positive probability (P(A) > 0) is made up entirely of outcomes x_0 ∈ A with P({x_0}) = 0. Because of countable additivity (see below), this is only possible when Ω is uncountable.
1 The Russian mathematician Kolmogorov was active in the middle of the 20th century. Among his many lasting contributions to mathematics are the modern axioms of probability and some of its most important theorems. His theories of turbulent fluid flow anticipated modern fractals by several decades.
R^n, sequences of n numbers (possibly viewed as a row or column vector depending on the context): X = (X_1, . . . , X_n). Here too, if there is a probability density then the probability of any given outcome is zero.
S^N. Let S be the discrete state space of a Markov chain. The space S^T is the set of sequences of length T of elements of S. An element of S^T may be written x = (x(0), x(1), · · · , x(T − 1)), with each of the x(t) in S. It is common to write x_t for x(t). An element of S^N is an infinite sequence of elements of S. The exponent N stands for "natural numbers". We misuse this notation because ours start with t = 0 while the actual natural numbers start with t = 1. We use S^N when we ask questions about an entire infinite trajectory. For example the hitting probability is P(X(t) ≠ 1 for all t ≥ 0). Cantor proved that S^N is not countable whenever the state space has more than one element. Generally, the probability of any particular infinite sequence is zero. For example, suppose the transition matrix has P_{11} = .6 and u_0(1) = 1. Let x be the infinite sequence that never leaves state 1: x = (1, 1, 1, · · ·). Then P(x) = u_0(1) · .6 · .6 · · ·. Multiplying together an infinite number of .6 factors should give the answer P(x) = 0. More generally, if the transition matrix has P_{jk} ≤ r < 1 for all (j, k), then P(x) = 0 for any single infinite path.
C([0, T], R), the path space for Brownian motion. The C stands for "continuous". The [0, T] is the time interval 0 ≤ t ≤ T; the square brackets tell us to include the endpoints (0 and T in this case). Round parentheses (0, T) would mean to leave out 0 and T. The final R is the target space, the real numbers in this case. An element of Ω is a continuous function from the interval [0, T] to R. This function could be called X(t) or X_t (for 0 ≤ t ≤ T). In this space we can ask questions such as P(\int_0^T X(t) dt > 4).

1.3. Probability measures: Let F be a σ-algebra of subsets of Ω. A probability measure is a way to assign a probability to each event A ∈ F. In discrete probability, this is done using (1). In R^n a probability density leads to a probability measure by integration

    P(A) = \int_A u(x) dx .    (2)

There are still other ways to specify probabilities of events in path space. All of these probability measures satisfy the same basic axioms.
Suppose that for each A ∈ F we have a number P(A). The numbers P(A) are a probability measure if
i. If A ∈ F and B ∈ F are disjoint events, then P(A ∪ B) = P(A) + P(B).
ii. P(A) ≥ 0 for any event A ∈ F.
iii. P(Ω) = 1.
iv. If A_n ∈ F is a sequence of events each disjoint from all the others and ∪_{n=1}^{∞} A_n = A, then \sum_{n=1}^{∞} P(A_n) = P(A).
The last property is called countable additivity. It is possible to consider probability measures that are not countably additive, but it is not very useful.

1.4. Example 1, discrete probability: If Ω is discrete, we may take F to be the set of all events (i.e. all subsets of Ω). If we know the probabilities of each individual outcome, then the formula (1) defines a probability measure. The axioms (i), (ii), and (iii) are clear. The last, countable additivity, can be verified given a solid undergraduate analysis course.

1.5. Borel sets: It is rare that one can define P(A) for all A ⊆ Ω. Usually, there are non measurable events whose probability one does not try to define (see below). This is not related to partial information, but is an intrinsic aspect of continuous probability. Events that are not measurable are quite artificial, but they are impossible to get rid of. In most applications in stochastic calculus, it is convenient to take the largest σ-algebra to be the Borel sets2.
In a previous lecture we discussed how to generate a σ-algebra from a collection of sets. The Borel σ-algebra is the σ-algebra that is generated by all balls. The open ball with center x_0 and radius r > 0 in n dimensional space is B_r(x_0) = {x | |x − x_0| < r}. A ball in one dimension is an interval. In two dimensions it is a disk. Note that the ball is solid, as opposed to the hollow sphere, S_r(x_0) = {x | |x − x_0| = r}. The condition |x − x_0| ≤ r, instead of |x − x_0| < r, defines a closed ball. The σ-algebra generated by open balls is the same as the σ-algebra generated by closed balls (check this if you wish).
2 The larger σ-algebra of Lebesgue sets seems to be more of a nuisance than a help, particularly in discussing convergence of probability measures in path space.

1.6. Borel sets in path space: The definition of Borel sets works the same way in the path space of Brownian motion, C([0, T], R). Let x_0(t) and x(t) be two continuous functions of t. The distance between them in the sup norm is

    ||x − x_0|| = sup_{0 ≤ t ≤ T} |x(t) − x_0(t)| .

We often use double bars to represent the distance between functions and single bar absolute value signs to represent the distance between numbers or vectors in R^n. As before, the open ball of radius r about a path x_0 is the set of all paths with ||x − x_0|| < r.

1.7. The σ-algebra for Markov chain path space: There is a convenient limit process that defines a useful σ-algebra on S^N, the infinite time horizon path space for a Markov chain. We have the σ-algebras F_t generated by the first t + 1 states x(0), x(1), . . ., x(t). We take F to be the σ-algebra generated by all these. Note that the event A = {X(t) ≠ 1 for all t ≥ 0} is not in any of the F_t. However, the event A_t = {X(s) ≠ 1 for 0 ≤ s ≤ t} is in F_t. Therefore A = ∩_{t ≥ 0} A_t must be in any σ-algebra that contains all the F_t. Also note that the union of all the F_t is an algebra of sets, though it is not a σ-algebra.

1.8. Generating a probability measure: Let M be a collection of events that generates the σ-algebra F. Let A be the algebra of sets that are finite intersections, unions, and complements of events in M. Clearly the σ-algebra generated by M is the same as the one generated by A. The process of going from the algebra A to the σ-algebra F is one of completion, adding all limits of countable intersections or unions of events in A.
In order to specify P(A) for all A ∈ F, it suffices to give P(A) for all events A ∈ A. That is, if there is a countably additive probability measure P(A) for all A ∈ F, then it is completely determined by the numbers P(A) for those A ∈ A. Hopefully it is plausible that if the events in A generate those in F, then the probabilities of events in M determine the probabilities of events in F (proof omitted).
For example, in R^n if we specify P(A) for events described by finitely many balls, then we have determined P(A) for any Borel set. It might be that the numbers P(A) for A ∈ A are inconsistent with the axioms of probability (which is easy to check) or cannot be extended in a way that is countably additive to all of F (this does not happen in our examples), but otherwise the measure is determined.

1.9. Non measurable sets (technical aside): A construction demonstrates that non measurable sets are unavoidable. Let Ω be the unit circle. The simplest probability measure on Ω would seem to be uniform measure (divided by 2π so that P(Ω) = 1). This measure is rotation invariant: if A is a measurable event having probability P(A), then the event A + θ = {x + θ | x ∈ A} is measurable and has P(A + θ) = P(A). It is possible to construct a set B and a (countable) sequence of rotations, θ_n, so that the events B + θ_k and B + θ_n are disjoint if k ≠ n and ∪_n (B + θ_n) = Ω. This set cannot be measurable. If it were and β = P(B), then there would be two choices: β = 0 or β > 0. In the former case we would have P(Ω) = \sum_n P(B + θ_n) = \sum_n 0 = 0, which is not what we want. In the latter case, again using countable additivity, we would get P(Ω) = ∞.
The construction of the set B starts with a description of the θ_n. Write n in base ten, flip it over the decimal point to get a number between 0 and 1, then multiply by 2π. For example for n = 130, we get θ_n = θ_130 = 2π · .031. Now use the θ_n to create an equivalence relation and partition of Ω by setting x ∼ y if x = y + θ_n (mod 2π) for some n. The reader should check that this is an equivalence relation (x ∼ y implies y ∼ x, and x ∼ y and y ∼ z imply x ∼ z). Now, let B be a set that has exactly one representative from each of the equivalence classes in the partition. Any x ∈ Ω is in one of the equivalence classes, which means that there is a y ∈ B (the representative of the x equivalence class) and an n so that y + θ_n = x. That means that any x has x ∈ B + θ_n for some n, which is to say that ∪_n (B + θ_n) = Ω. To see that B + θ_k is disjoint from B + θ_n when k ≠ n, suppose that x ∈ B + θ_k and x ∈ B + θ_n. Then x = y + θ_k and x = z + θ_n for y ∈ B and z ∈ B. But (and this is the punch line) this would mean y ∼ z, which is impossible because B has only one representative from each equivalence class. The possibility of selecting a single element from each partition element without having to say how it is to be done is the axiom of choice.

1.10. Probability densities in R^n: Suppose u(x) is a probability density in R^n. If A is an event made from finitely many balls (or rectangles) by set operations, we can define P(A) by integrating, as in (2). This leads to a probability measure on Borel sets corresponding to the density u. Deriving the probability measure from a probability density does not seem to work in path space because there is nothing like the Riemann integral to use in (2).3 Therefore, we describe path space probability measures directly rather than through probability densities.
3 The Feynman integral in path space has some properties of true integrals but lacks others. The probabilist Mark Kac (pronounced "cats") discovered that Feynman's ideas applied to the heat equation rather than the Schrodinger equation can be interpreted as integration with respect to Wiener measure. This is now called the Feynman-Kac formula.

1.11. Measurable functions: Let Ω be a probability space with a σ-algebra F. Let f(ω) be a function defined on Ω. In discrete probability, f was measurable with respect to F if the sets B_a = {ω | f(ω) = a} all were measurable. In continuous probability, this definition is replaced by the condition that the sets A_{ab} = {ω | a ≤ f(ω) ≤ b} are measurable. Because F is closed under countable set operations, and because the event a < f is the (countable) union of the events a + 1/n ≤ f, this is the same as requiring all the sets Ã_{ab} = {ω | a < f(ω) < b} to be measurable. If Ω is discrete (finite or countable), then the two definitions of measurable function agree.
In continuous probability, the notion of measurability of a function with respect to a σ-algebra plays two roles. The first, which is purely technical, is that f is sufficiently regular (meaning not crazy) that abstract integrals (defined below) make sense for it. The second, particularly for smaller σ-algebras G ⊆ F, again involves incomplete information. A function that is measurable with respect to G not only needs to be regular, but also must depend on fewer variables (possibly in some abstract sense).

1.12. Integration with respect to a measure: The definition of integration with respect to a general probability measure is easier than the definition of the Riemann integral. The integral is written

    E[f] = \int_Ω f(ω) dP(ω) .

We will see that in R^n with a density u, this agrees with the classical definition

    E[f] = \int_{R^n} f(x) u(x) dx ,

if we write dP(x) = u(x) dx. Note that the abstract variable ω is replaced by the concrete variable, x, in this more concrete situation. The general definition is forced on us once we make the natural requirements
i. If A ∈ F is any event, then E[1_A] = P(A). The integral of the indicator function of an event is the probability of that event.
ii. If f_1 and f_2 have f_1(ω) ≤ f_2(ω) for all ω, then E[f_1] ≤ E[f_2]. Integration is monotone.
iii. For any reasonable functions f_1 and f_2 (e.g. bounded), we have E[a f_1 + b f_2] = a E[f_1] + b E[f_2]. (Linearity of integration.)

iv. If f_n(ω) is an increasing family of positive functions converging pointwise to f (f_n(ω) ≥ 0 and f_{n+1}(ω) ≥ f_n(ω) for all n and ω, and f_n(ω) → f(ω) as n → ∞ for all ω), then E[f_n] → E[f] as n → ∞. (This form of countable additivity for abstract probability integrals is called the monotone convergence theorem.)
A function is a simple function if there are finitely many events A_k, and weights w_k, so that f = \sum_k w_k 1_{A_k}. Properties (i) and (iii) imply that the expectation of a simple function is

    E[f] = \sum_k w_k P(A_k) .

We can approximate general functions by simple functions to determine their expectations.
Suppose f is a nonnegative bounded function: 0 ≤ f(ω) ≤ M for all ω. Choose a small number ε = 2^{−n} and define the4 ring sets A_k = {(k−1)ε ≤ f < kε}. The A_k depend on ε but we do not indicate that. Although the events A_k might be complicated, fractal, or whatever, each of them is measurable. A simple function that approximates f is f_n(ω) = \sum_k (k−1)ε 1_{A_k}. This f_n takes the value (k−1)ε on the set A_k. The sum defining f_n is finite because f is bounded, though the number of terms is M/ε. Also, f_n(ω) ≤ f(ω) for each ω (though by at most ε). Property (ii) implies that

    E[f] ≥ E[f_n] = \sum_k (k−1)ε P(A_k) .

In the same way, we can consider the upper function g_n = \sum_k kε 1_{A_k} and have

    E[f] ≤ E[g_n] = \sum_k kε P(A_k) .

The reader can check that f_n ≤ f_{n+1} ≤ f ≤ g_{n+1} ≤ g_n and that g_n − f_n ≤ ε. Therefore, the numbers E[f_n] form an increasing sequence while the E[g_n] are a decreasing sequence converging to the same number, which is the only possible value of E[f] consistent with (i), (ii), and (iii).
It is sometimes said that the difference between classical (Riemann) integration and abstract integration (here) is that the Riemann integral cuts the x axis into little pieces, while the abstract integral cuts the y axis (which is what the simple function approximations amount to).
4 Take f = f(x, y) = x^2 + y^2 in the plane to see why we call them ring sets.
If the function f is positive but not bounded, it might happen that E[f] = ∞. The cut off functions, f_M(ω) = min(f(ω), M), might have E[f_M] → ∞ as M → ∞. If so, we say E[f] = ∞. Otherwise, property (iv) implies that E[f] = lim_{M→∞} E[f_M]. If f is both positive and negative (for different ω), we integrate the positive part, f_+(ω) = max(f(ω), 0), and the negative part, f_−(ω) = −min(f(ω), 0), separately and subtract the results. We do not attempt a definition if E[f_+] = ∞ and E[f_−] = ∞. We omit the long process of showing that these definitions lead to an integral that actually has the properties (i) - (iv).
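The lower/upper simple-function construction is easy to try numerically. In the Python sketch below, the probability space, the bounded function f, and the use of Monte Carlo samples to stand in for (Ω, P) are all assumptions made just to have something concrete to compute with.

    import numpy as np

    # Lower/upper simple-function approximations to E[f] for a bounded f >= 0.
    # Here P is the uniform measure on [0, 1], represented by Monte Carlo samples.
    rng = np.random.default_rng(3)
    omega = rng.uniform(0.0, 1.0, size=1_000_000)   # samples standing in for (Omega, P)
    f = np.sin(np.pi * omega) ** 2                  # a bounded function, 0 <= f <= 1

    for n in range(1, 6):
        eps = 2.0 ** (-n)
        k = np.floor(f / eps)                       # f lies in the ring set [k*eps, (k+1)*eps)
        lower = (eps * k).mean()                    # E[f_n], the lower simple function
        upper = (eps * (k + 1)).mean()              # E[g_n], the upper simple function
        print(n, lower, upper)                      # both converge to E[f] = 1/2

The two estimates always differ by exactly eps, and both squeeze down onto E[f] as eps shrinks, which is the content of the construction above.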

1.13. Markov chain probability measures on S^N: Let A = ∪_{t ≥ 0} F_t as before. The probability of any A ∈ A is given by the probability of that event in F_t if A ∈ F_t. Therefore P(A) is given by a formula like (1) for any A ∈ A. A theorem of Kolmogorov states that the completion of this measure to all of F makes sense and is countably additive.

1.14. Conditional expectation: We have a random variable X(ω) that is measurable with respect to the σ-algebra, F. We have a σ-algebra that is a sub σ-algebra: G ⊆ F. We want to define the conditional expectation Y = E[X | G]. In discrete probability this is done using the partition defined by G. The partition is less useful here because it probably is uncountable, and because each partition element, B(ω) = ∩ A (the intersection being over all A ∈ G with ω ∈ A), may have P(B(ω)) = 0 (examples below). This means that we cannot apply Bayes' rule directly.
The definition is that Y(ω) is the random variable measurable with respect to G that best approximates X in the least squares sense

    E[(Y − X)^2] = min_{Z ∈ G} E[(Z − X)^2] .

This is one of the definitions we gave before, the one that works for continuous and discrete probability. In the theory, it is possible to show that there is a minimizer and that it is unique.

1.15. Generating a σ-algebra: When the probability space, Ω, is finite, we can understand an algebra of sets by using the partition of Ω that generates the algebra. This is not possible for continuous probability spaces. Another way to specify a σ-algebra for finite Ω was to give a function X(ω), or a collection of functions X_k(ω), that are supposed to be measurable with respect to F. We noted that any function measurable with respect to the σ-algebra generated by the functions X_k is actually a function of the X_k. That is, if F ∈ F (abuse of notation), then there is some function u(x_1, . . . , x_n) so that

    F(ω) = u(X_1(ω), . . . , X_n(ω)) .    (3)

The intuition was that F contains the information you get by knowing the values of the functions X_k. Any function measurable with respect to this σ-algebra is determined by knowing the values of these functions, which is precisely what (3) says. This approach using functions is often convenient in continuous probability.
If Ω is a continuous probability space, we may again specify functions X_k that we want to be measurable. Again, these functions generate a σ-algebra, F. If F is measurable with respect to this σ-algebra then there is a (Borel measurable) function u(x_1, . . .) so that F(ω) = u(X_1, . . .), as before. In fact, it is possible to define F in this way. Saying that A ∈ F is the same as saying that 1_A is measurable with respect to F. If u(x_1, . . .) is a Borel measurable function that takes values only 0 or 1, then the function F defined by (3) also takes only the values 0 or 1. The event A = {ω | F(ω) = 1} has (obviously) F = 1_A. The σ-algebra generated by the X_k is the set of events that may be defined in this way. A complete proof of this would take a few pages.

1.16. Example in two dimensions: Suppose Ω is the unit square in two dimensions: (x, y) ∈ Ω if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. The x coordinate function is X(x, y) = x. The information in this is the value of the x coordinate, but not the y coordinate. An event measurable with respect to this F will be any event determined by the x coordinate alone. I call such sets bar code sets. You can see why by drawing some.

1.17. Marginal density and total probability: The abstract situation is that we have a probability space, Ω, with generic outcome ω ∈ Ω. We have some functions (X_1(ω), . . . , X_n(ω)) = X(ω). With Ω in the background, we can ask for the joint PDF of (X_1, . . . , X_n), written u(x_1, . . . , x_n). A formal definition of u would be that if A ⊆ R^n, then

    P(X(ω) ∈ A) = \int_{x ∈ A} u(x) dx .    (4)

Suppose we neglect the last variable, X_n, and consider the reduced vector X̃(ω) = (X_1, . . . , X_{n−1}) with probability density ũ(x_1, . . . , x_{n−1}). This ũ is the marginal density and is given by integrating u over the forgotten variable:

    ũ(x_1, . . . , x_{n−1}) = \int_{−∞}^{∞} u(x_1, . . . , x_n) dx_n .    (5)

This is a continuous probability analogue of the law of total probability: integrate (or sum) over a complete set of possibilities, all values of x_n in this case.

We can prove (5) from (4) by considering a set B ⊆ R^{n−1} and the corresponding set A ⊆ R^n given by A = B × R (i.e. A is the set of all pairs (x̃, x_n) with x̃ = (x_1, . . . , x_{n−1}) ∈ B). The definition of A from B is designed so that P(X ∈ A) = P(X̃ ∈ B). With this notation,

    P(X̃ ∈ B) = P(X ∈ A)
              = \int_A u(x) dx
              = \int_{x̃ ∈ B} \int_{x_n = −∞}^{∞} u(x̃, x_n) dx_n dx̃
              = \int_B ũ(x̃) dx̃ .

This is exactly what it means for ũ to be the PDF for X̃.

1.18. Classical conditional expectation: Again in the abstract setting ω ∈ Ω, suppose we have random variables (X_1(ω), . . . , X_n(ω)). Now consider a function f(x_1, . . . , x_n), its expected value E[f(X)], and the conditional expectations

    v(x_n) = E[f(X) | X_n = x_n] .

The Bayes' rule definition of v(x_n) has some trouble because both the denominator, P(X_n = x_n), and the numerator,

    E[f(X) 1_{X_n = x_n}] ,

are zero.
The classical solution to this problem is to replace the exact condition X_n = x_n with an approximate condition having positive (though small) probability: x_n ≤ X_n ≤ x_n + ε. We use the approximation

    \int_{x_n}^{x_n + ε} g(x̃, ξ_n) dξ_n ≈ ε g(x̃, x_n) .

The error is roughly proportional to ε^2 and much smaller than either of the terms above. With this approximation the numerator in Bayes' rule is

    E[f(X) 1_{x_n ≤ X_n ≤ x_n + ε}] = \int_{x̃ ∈ R^{n−1}} \int_{ξ_n = x_n}^{x_n + ε} f(x̃, ξ_n) u(x̃, ξ_n) dξ_n dx̃
                                     ≈ ε \int_{x̃} f(x̃, x_n) u(x̃, x_n) dx̃ .

Similarly, the denominator is

    P(x_n ≤ X_n ≤ x_n + ε) ≈ ε \int_{x̃} u(x̃, x_n) dx̃ .
If we take the Bayes' rule quotient and let ε → 0, we get the classical formula

    E[f(X) | X_n = x_n] = ( \int_{x̃} f(x̃, x_n) u(x̃, x_n) dx̃ ) / ( \int_{x̃} u(x̃, x_n) dx̃ ) .    (6)

By taking f to be the characteristic function of an event (all possible events) we get a formula for the probability density of X̃ given that X_n = x_n, namely

    u(x̃ | X_n = x_n) = u(x̃, x_n) / ( \int_{x̃} u(x̃, x_n) dx̃ ) .    (7)

This is the classical formula for conditional probability density. The integral in the denominator insures that, for each x_n, u(x̃ | X_n = x_n) is a probability density as a function of x̃, that is

    \int u(x̃ | X_n = x_n) dx̃ = 1 ,

for any value of x_n. It is very useful to notice that, as functions of x̃, u(x̃, x_n) and u(x̃ | X_n = x_n) are almost the same. They differ only by a constant normalization. For example, this is why conditioning Gaussians gives Gaussians.

1.19. Modern conditional expectation: The classical conditional expectation (6) and conditional probability (7) formulas are the same as what comes from the modern definition of paragraph 1.14. Suppose X = (X_1, . . . , X_n) has density u(x), F is the σ-algebra of Borel sets, and G is the σ-algebra generated by X_n (which might be written X_n(X), thinking of X as ω in the abstract notation). For any f(x), we have f̂(x_n) = E[f | G]. Since G is generated by X_n, the function f̂ being measurable with respect to G is the same as its being a function of x_n. The modern definition of f̂(x_n) is that it minimizes

    \int_{R^n} ( f(x) − f̂(x_n) )^2 u(x) dx ,    (8)

over all functions that depend only on x_n (measurable in G).


To see the formula (6) emerge, again write x = (x̃, x_n), so that f(x) = f(x̃, x_n) and u(x) = u(x̃, x_n). The integral (8) is then

    \int_{x_n = −∞}^{∞} \int_{x̃ ∈ R^{n−1}} ( f(x̃, x_n) − f̂(x_n) )^2 u(x̃, x_n) dx̃ dx_n .

In the inner integral,

    R(x_n) = \int_{x̃ ∈ R^{n−1}} ( f(x̃, x_n) − f̂(x_n) )^2 u(x̃, x_n) dx̃ ,

f̂(x_n) is just a constant. We find the value of f̂(x_n) that minimizes R(x_n) by minimizing the quantity

    \int_{x̃ ∈ R^{n−1}} ( f(x̃, x_n) − g )^2 u(x̃, x_n) dx̃ =
        \int f(x̃, x_n)^2 u(x̃, x_n) dx̃ − 2g \int f(x̃, x_n) u(x̃, x_n) dx̃ + g^2 \int u(x̃, x_n) dx̃ .

The optimal g is given by the classical formula (6).
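A small numerical check of (6) in Python; the two dimensional density (a correlated Gaussian), the grid, and the discretization are choices made only for the illustration.

    import numpy as np

    # Classical conditional expectation (6) evaluated on a grid, for a made-up
    # correlated Gaussian density u(x1, x2).
    x = np.linspace(-4, 4, 401)
    X1, X2 = np.meshgrid(x, x, indexing="ij")

    u = np.exp(-0.5 * (X1**2 - 1.2 * X1 * X2 + X2**2))  # unnormalized density
    u /= u.sum()                                        # normalize on the grid

    # E[X1 | X2 = x2] by formula (6): integrate over x1, then divide by the marginal.
    cond = (X1 * u).sum(axis=0) / u.sum(axis=0)

    # For this density the exact answer is 0.6 * x2; compare away from the grid edges.
    mid = slice(100, 301)
    print(np.max(np.abs(cond[mid] - 0.6 * x[mid])))     # small

The same numbers come out of minimizing the least squares criterion (8) over functions of x2 alone, which is the point of the calculation above.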

1.20. Modern conditional probability: We already saw that the modern approach to conditional probability for G ⊆ F is through conditional expectation. In its most general form, for every (or almost every) ω ∈ Ω, there should be a probability measure P_ω on Ω so that the mapping ω → P_ω is measurable with respect to G. The measurability condition probably means that for every event A ∈ F the function p_A(ω) = P_ω(A) is a G measurable function of ω. In terms of these measures, the conditional expectation f̂ = E[f | G] would be f̂(ω) = E_ω[f]. Here E_ω means the expected value using the probability measure P_ω. There are many such subscripted expectations coming.
A subtle point here is that the conditional probability measures are defined on the original probability space, Ω. This forces the measures to live on tiny (generally measure zero) subsets of Ω. For example, if Ω = R^n and G is generated by x_n, then the conditional expectation value f̂(x_n) is an average of f (using density u) only over the hyperplane X_n = x_n. Thus, the conditional probability measures P_X depend only on x_n, leading us to write P_{x_n}. Since f̂(x_n) = \int f(x) dP_{x_n}(x), and f̂(x_n) depends only on values of f(x̃, x_n) with the last coordinate fixed, the measure dP_{x_n} is some kind of measure on that hyperplane. This point of view is useful in many advanced problems, but we will not need it in this course (I sincerely hope).

1.21. Semimodern conditional probability: Here is an intermediate "semi-modern" version of conditional probability density. We have Ω = R^n, and Ω̃ = R^{n−1} with elements x̃ = (x_1, . . . , x_{n−1}). For each x_n, there will be a (conditional) probability density function u_{x_n}. Saying that u_{x_n} depends only on x_n is the same as saying that the function x → u_{x_n} is measurable with respect to G. The conditional expectation formula (6) may be written

    E[f | G](x_n) = \int_{R^{n−1}} f(x̃, x_n) u_{x_n}(x̃) dx̃ .

In other words, the classical u(x̃ | X_n = x_n) of (7) is the same as the semimodern u_{x_n}(x̃).

2 Gaussian Random Variables


The central limit theorem (CLT) makes Gaussian random variables important. A generalization of the CLT is Donsker's invariance principle that gives Brownian motion as a limit of random walk. In many ways Brownian motion is a multivariate Gaussian random variable. We review multivariate normal random variables and the corresponding linear algebra as a prelude to Brownian motion.

2.1. Gaussian random variables, scalar: The one dimensional standard normal, or Gaussian, random variable is a scalar with probability density

    u(x) = (1/\sqrt{2π}) e^{−x^2/2} .

The normalization factor 1/\sqrt{2π} makes \int u(x) dx = 1 (a famous fact). The mean value is E[X] = 0 (the integrand x e^{−x^2/2} is antisymmetric about x = 0). The variance is (using integration by parts)

    E[X^2] = (1/\sqrt{2π}) \int x^2 e^{−x^2/2} dx
           = (1/\sqrt{2π}) \int x ( x e^{−x^2/2} ) dx
           = (1/\sqrt{2π}) \int x ( −(d/dx) e^{−x^2/2} ) dx
           = (1/\sqrt{2π}) [ −x e^{−x^2/2} ]_{−∞}^{∞} + (1/\sqrt{2π}) \int e^{−x^2/2} dx
           = 0 + 1 .

Similar calculations give E[X^4] = 3, E[X^6] = 15, and so on. I will often write Z for a standard normal random variable. A one dimensional Gaussian random variable with mean E[X] = µ and variance var(X) = E[(X − µ)^2] = σ^2 has density

    u(x) = (1/\sqrt{2πσ^2}) e^{−(x−µ)^2 / (2σ^2)} .

It is often more convenient to think of Z as the random variable (like ω) and write X = µ + σZ. We write X ∼ N(µ, σ^2) to express the fact that X is normal (Gaussian) with mean µ and variance σ^2. The standard normal random variable is Z ∼ N(0, 1).

2.2. Multivariate normal random variables: The n × n matrix, H, is positive definite if x' H x > 0 for any n component column vector x ≠ 0. It is symmetric if H' = H. A symmetric matrix is positive definite if and only if all its eigenvalues are positive. Since the inverse of a symmetric matrix is symmetric, the inverse of a symmetric positive definite (SPD) matrix is also SPD. An n component random variable is a mean zero multivariate normal if it has a probability density of the form

    u(x) = (1/z) e^{−x' H x / 2} ,

for some SPD matrix, H. We can get mean µ = (µ_1, . . . , µ_n)' either by taking X + µ where X has mean zero, or by using the density with x' H x replaced by (x − µ)' H (x − µ).
If X ∈ R^n is multivariate normal and if A is an m × n matrix with rank m, then Y ∈ R^m given by Y = AX is also multivariate normal. Both the cases m = n (same number of X and Y variables) and m < n occur.
2.3. Diagonalizing H: Suppose the eigenvalues and eigenvectors of H are H v_j = λ_j v_j. We can express x ∈ R^n as a linear combination of the v_j either in vector form, x = \sum_{j=1}^{n} y_j v_j, or in matrix form, x = V y, where V is the n × n matrix whose columns are the v_j and y = (y_1, . . . , y_n)'. Since the eigenvectors of a symmetric matrix are orthogonal to each other, we may normalize them so that v_j' v_k = δ_{jk}, which is the same as saying that V is an orthogonal matrix, V' V = I. In the y variables, the quadratic form x' H x is diagonal, as we can see using the vector or the matrix notation. With vectors, the trick is to use the two expressions x = \sum_{j=1}^{n} y_j v_j and x = \sum_{k=1}^{n} y_k v_k, which are the same since j and k are just summation variables. Then we can write

    x' H x = ( \sum_{j=1}^{n} y_j v_j )' H ( \sum_{k=1}^{n} y_k v_k )
           = \sum_{jk} y_j y_k v_j' H v_k
           = \sum_{jk} λ_k v_j' v_k y_j y_k
    x' H x = \sum_k λ_k y_k^2 .    (9)

The matrix version of the eigenvector/eigenvalue relations is V' H V = Λ (Λ being the diagonal matrix of eigenvalues). With this we have x' H x = (V y)' H V y = y' (V' H V) y = y' Λ y. A diagonal matrix in the quadratic form is equivalent to having a sum involving only squares λ_k y_k^2. All the λ_k will be positive if H is positive definite. For future reference, also remember that det(H) = \prod_{k=1}^{n} λ_k.

2.4. Calculations using the multivariate normal density: We use the y variables as new integration variables. The point is that if the quadratic form is diagonal the multiple integral becomes a product of one dimensional Gaussian integrals that we can do. For example,

    \int_{R^2} e^{−(λ_1 y_1^2 + λ_2 y_2^2)/2} dy_1 dy_2
        = \int_{y_1 = −∞}^{∞} \int_{y_2 = −∞}^{∞} e^{−(λ_1 y_1^2 + λ_2 y_2^2)/2} dy_1 dy_2
        = \int_{−∞}^{∞} e^{−λ_1 y_1^2 / 2} dy_1 \int_{−∞}^{∞} e^{−λ_2 y_2^2 / 2} dy_2
        = \sqrt{2π/λ_1} \sqrt{2π/λ_2} .

Ordinarily we would need a Jacobian determinant representing |dx/dy|, but here the determinant is det(V) = 1, for an orthogonal matrix. With this we can find the normalization constant, z, by

    1 = \int u(x) dx
      = (1/z) \int e^{−x' H x / 2} dx
      = (1/z) \int e^{−y' Λ y / 2} dy
      = (1/z) \int exp( −(1/2) \sum_{k=1}^{n} λ_k y_k^2 ) dy
      = (1/z) \int \prod_{k=1}^{n} e^{−λ_k y_k^2 / 2} dy
      = (1/z) \prod_{k=1}^{n} ( \int_{y_k = −∞}^{∞} e^{−λ_k y_k^2 / 2} dy_k )
      = (1/z) \prod_{k=1}^{n} \sqrt{2π/λ_k}
    1 = (1/z) (2π)^{n/2} / \sqrt{det(H)} .

This gives a formula for z, and the final formula for the multivariate normal density

    u(x) = ( \sqrt{det H} / (2π)^{n/2} ) e^{−x' H x / 2} .    (10)

2.5. The covariance, by direct integration: We can calculate the covariance matrix of the X_j. The jk element of E[X X'] is E[X_j X_k] = cov(X_j, X_k). The covariance matrix consisting of all these elements is C = E[X X']. Note the conflict of notation with the constant C above. A direct way to evaluate C is to use the density (10):

    C = \int_{R^n} x x' u(x) dx
      = ( \sqrt{det H} / (2π)^{n/2} ) \int_{R^n} x x' e^{−x' H x / 2} dx .

Note that the integrand is an n × n matrix. Although each particular x x' has rank one, the average of all of them will be a nonsingular positive definite matrix, as we will see. To work the integral, we use the x = V y change of variables above. This gives

    C = ( \sqrt{det H} / (2π)^{n/2} ) \int_{R^n} (V y)(V y)' e^{−y' Λ y / 2} dy .

We use (V y)(V y)' = V (y y') V' and take the constant matrices V outside the integral. This gives C as the product of three matrices, first V, then an integral involving y y', then V'. So, to calculate C, we can calculate all the matrix elements

    B_{jk} = ( \sqrt{det H} / (2π)^{n/2} ) \int_{R^n} y_j y_k e^{−y' Λ y / 2} dy .

Clearly, if j ≠ k, Bjk = 0, because the integrand is an odd (antisymmetric)
function, say, of yj . The diagonal elements Bkk may be found using the fact
that the integrand is a product:

$$B_{kk} = \frac{\sqrt{\det H}}{(2\pi)^{n/2}} \Big(\prod_{j\neq k} \int_{y_j} e^{-\lambda_j y_j^2/2}\,dy_j\Big) \int_{y_k} y_k^2\, e^{-\lambda_k y_k^2/2}\,dy_k .$$

As before, the j factors (for j ≠ k) integrate to √(2π/λj). The k factor integrates
to √(2π)/λk^{3/2}. The k factor differs from the others only by a factor 1/λk .
Most of these factors combine to cancel the normalization. All that is left is

$$B_{kk} = \frac{1}{\lambda_k} .$$

This shows that B = Λ^{−1} , so

$$C = V \Lambda^{-1} V^* .$$

Finally, since H = V ΛV ∗ , we see that

$$C = H^{-1} . \qquad (11)$$

The covariance matrix is the inverse of the matrix defining the multivariate
normal.

2.6. Linear functions of multivariate normals: A fundamental fact about


multivariate normals is that a linear transformation of a multivariate normal is
also multivariate normal, provided that the transformation is onto. Let A be
an m × n matrix with m ≤ n. This A defines a linear transformation y = Ax.
The transformation is onto if, for every y ∈ Rm , there is at least one x ∈ Rn
with Ax = y. If n = m, the transformation is onto if and only if A is invertible
(det(A) ≠ 0), and the only x is A^{−1}y. If m < n, A is onto if its m rows
are linearly independent. In this case, the set of solutions is a hyperplane
of dimension n − m. Either way, the fact is that if X is an n dimensional
multivariate normal and Y = AX, then Y is an m dimensional multivariate
normal. Given this, we can completely determine the probability density of Y
by calculating its mean and covariance matrix. Writing μX and μY for the
means of X and Y respectively, we have

$$\mu_Y = E[Y] = E[AX] = A\,E[X] = A\mu_X .$$

Similarly, if μY = 0, we have

$$C_Y = E[YY^*] = E[(AX)(AX)^*] = E[AXX^*A^*] = A\,E[XX^*]\,A^* = A C_X A^* .$$

The reader should verify that if CX is n × n, then this formula gives a CY that
is m × m. The reader should also be able to derive the formula for CY in terms
of CX without assuming that μY = 0. We will soon give the proof that linear
functions of Gaussians are Gaussian.

2.7. Uncorrelation and independence: The inverse of a symmetric matrix


is another symmetric matrix. Therefore, CX is diagonal if and only if H is
diagonal. If H is diagonal, the probability density function given by (10) is a
product of densities for the components. We have already used that fact and
will use it more below. For now, just note that CX is diagonal if and only if the
components of X are uncorrelated. Then CX being diagonal implies that H is
diagonal and the components of X are independent. The fact that uncorrelated
components of a multivariate normal are actually independent firstly is a prop-
erty only of Gaussians, and secondly has curious consequences. For example,
suppose Z1 and Z2 are independent standard normals and X1 = Z1 + Z2 and
X2 = Z1 − Z2 ; then X1 and X2 , being uncorrelated, are independent of each
other. This may seem surprising in view of the fact that increasing Z1 by 1/2
increases both X1 and X2 by the same 1/2. If Z1 and Z2 were independent uni-
form random variables (PDF u(z) = 1 if 0 ≤ z ≤ 1, u(z) = 0 otherwise), then
X1 and X2 would again be uncorrelated, but this time not independent
(for example, the only way to get X1 = 2 is to have both Z1 = 1 and Z2 = 1,
which implies that X2 = 0).

2.8. Application, generating correlated normals: There are simple tech-


niques for generating (more or less) independent standard normal random vari-
ables, the Box Muller method being the most famous. Suppose we have a
positive definite symmetric matrix, CX , and we want to generate a multivari-
ate normal with this covariance. One way to do this is to use the Choleski
factorization CX = LL∗ , where L is an n × n lower triangular matrix. Now
define Z = (Z1 , . . . , Zn )∗ where the Zk are independent standard normals. This
Z has covariance CZ = I. Now define X = LZ. This X has covariance
CX = LIL∗ = LL∗ , as desired. Actually, we do not necessarily need the
Choleski factorization; L does not have to be lower triangular. Another possi-
bility is to use the symmetric square root of CX . Let CX = V ΛC V ∗ , where ΛC
is the diagonal matrix of eigenvalues of CX (ΛC = Λ^{−1}, with Λ as above),
and V is the orthogonal matrix of eigenvectors. We can
take A = V √ΛC V ∗ , where √ΛC is the diagonal matrix with the square roots of
the eigenvalues. Usually the Choleski
factorization is easier to get than the symmetric square root.
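Here is a minimal Python sketch of the Choleski recipe (numpy assumed; the covariance matrix C_X below is an arbitrary SPD example, not anything specific from the text):

```python
# Minimal sketch: generate samples with a given covariance C_X using the
# Choleski factorization C_X = L L^T, then check the sample covariance.
import numpy as np

rng = np.random.default_rng(1)
C_X = np.array([[2.0, 0.8, 0.3],
                [0.8, 1.5, 0.5],
                [0.3, 0.5, 1.0]])      # an arbitrary SPD example
L = np.linalg.cholesky(C_X)            # lower triangular, C_X = L @ L.T

m = 200_000
Z = rng.standard_normal((3, m))        # independent standard normals, cov = I
X = L @ Z                              # each column is one multivariate normal sample

print(np.cov(X))                       # should be close to C_X for large m
```

The symmetric square root A = V √ΛC V ∗ could be used in place of L and the same check would apply.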

2.9. Central Limit Theorem: Let X be an n dimensional random variable


with probability density u(x). Let X^(1) , X^(2) , . . ., be a sequence of independent
samples of X, that is, independent random variables with the same density u.
Statisticians call this iid (independent, identically distributed). If we need to
talk about the individual components of X^(k) , we write Xj^(k) for component j
of X^(k) . For example, suppose we have a population of people. If we choose a
person at random and record his or her height (X1 ) and weight (X2 ), we get a
two dimensional random variable. If we measure 100 people, we get 100 samples,
X^(1) , . . ., X^(100) , each consisting of a height and weight pair. The weight of
person 27 is X2^(27) . Let μ = E[X] be the mean and C = E[(X − μ)(X − μ)∗ ]
the covariance matrix. The Central Limit Theorem (CLT) states that for large
n, the random variable

$$R^{(n)} = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} \big(X^{(k)} - \mu\big)$$

has a probability distribution close to the multivariate normal with mean zero
and covariance C. One interesting consequence is that if X1 and X2 are uncor-
related then an average of many independent samples will have R1^(n) and R2^(n)
nearly independent.

2.10. What the CLT says about Gaussians: The Central Limit Theorem
tells us that if we average a large number of independent samples from the
same distribution, the distribution of the average depends only on the mean
and covariance of the starting distribution. It may be surprising that many
of the properties that we deduced from the formula (10) may be found with
almost no algebra simply knowing that the multivariate normal is the limit of
averages. For example, we showed (or didn't show) that if X is multivariate
normal and Y = AX where the rows of A are linearly independent, then Y is
multivariate normal. This is a consequence of the averaging property. If X is
(approximately) the average of iid random variables Uk , then Y is the average
of random variables Vk = AUk . Applying the CLT to the averaging of the Vk
shows that Y is also multivariate normal.
Now suppose U is a univariate random variable with iid samples Uk , and
E[Uk ] = 0, E[Uk²] = σ², and E[Uk⁴] = a4 < ∞. Define Xn = (1/√n) Σ_{k=1}^{n} Uk . A
calculation shows that E[Xn⁴] = 3σ⁴ + (a4 − 3σ⁴)/n. For large n, the fourth moment of
the average depends only on the second moment of the underlying distribution.
A multivariate and slightly more general version of this calculation gives Wick's
theorem, an expression for the expected value of a product of components of
a multivariate normal in terms of covariances.
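As a sanity check on the fourth moment calculation, the following sketch (Python/numpy, with U uniform on [−1, 1] as an arbitrary choice, so σ² = 1/3 and a4 = 1/5) compares a Monte Carlo estimate of E[Xn⁴] with the formula above.

```python
# Minimal sketch: check E[X_n^4] = 3*sigma^4 + (a4 - 3*sigma^4)/n by Monte Carlo,
# with U uniform on [-1, 1] (sigma^2 = 1/3, a4 = 1/5).
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 200_000                       # n terms per average, m Monte Carlo samples
U = rng.uniform(-1.0, 1.0, size=(m, n))
Xn = U.sum(axis=1) / np.sqrt(n)

sigma2, a4 = 1.0 / 3.0, 1.0 / 5.0
predicted = 3 * sigma2**2 + (a4 - 3 * sigma2**2) / n
print(np.mean(Xn**4), predicted)         # the two values should be close
```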

Stochastic Calculus Notes, Lecture 5
Last modified October 21, 2004

1 Brownian Motion

1.1. Introduction: Brownian motion is the simplest of the stochastic pro-


cesses called diffusion processes. It is helpful to see many of the properties of
general diffusions appear explicitly in Brownian motion. In fact, the Ito calculus
makes it possible to describe any other diffusion process in
terms of Brownian motion. Furthermore, Brownian motion arises as a limit of
many discrete stochastic processes in much the same way that Gaussian random
variables appear as a limit of other random variables through the central limit
theorem. Finally, the solutions to many other mathematical problems, parti-
cularly various common partial differential equations, may be expressed in terms
of Brownian motion. For all these reasons, Brownian motion is a central object
to study.

1.2. History: Early in the 19th century, a Scottish botanist named Robert Brown
looked at pollen grains in water under a microscope. To his amazement, they
were moving randomly. He had no explanation for supposedly inert pollen grains,
and later inorganic dust, seeming to swim as though alive. In 1905, Einstein
proposed the explanation that the observed Brownian motion was caused by
individual water molecules hitting the pollen or dust particles. This allowed
him to estimate, for the first time, the weight of a water molecule (his Nobel
prize, though, was officially for the photoelectric effect, relativity being too
controversial at the time). This is the modern view, that the observed random motion of pollen
grains is the result of a huge number of independent and random collisions with
tiny water molecules.

1.3. Basics: The mathematical description of Brownian motion involves a


random but continuous function of time, X(t). The standard Brownian motion
starts at x = 0 at time t = 0: X(0) = 0. The displacement, or increment,
between time t1 > 0 and time t2 > t1 , Y = X(t2 ) − X(t1 ), is the sum of a
large number of i.i.d. mean zero random variables (each modeling the result
of one water molecule collision). It is natural to suppose that the number of
such collisions is proportional to the time increment. This implies, through the
central limit theorem, that Y should be a Gaussian random variable with vari-
ance proportional to t2 − t1 . The standard Brownian motion has X normalized
so that the variance is equal to t2 − t1 . The random shocks (a term used in
finance for any change, no matter how small) in disjoint time intervals should
be independent. If t3 > t2 and Y2 = X(t3 ) − X(t2 ), Y1 = X(t2 ) − X(t1 ), then Y2
and Y1 should be independent, with variances t3 − t2 and t2 − t1 respectively.
This makes the increments Y2 and Y1 a two dimensional multivariate normal.

1.4. Wiener measure: The probability space for standard Brownian motion
is C0 ([0, T ], R). As we said before, this consists of continuous functions, X(t),
defined for t in the range 0 ≤ t ≤ T . The notation C0 means1 that X(0) = 0.
The σ−algebra representing full information is the Borel σ−algebra. The infi-
nite dimensional Gaussian probability measure on C0 ([0, T ], R) that represents
Brownian motion is called Wiener measure2 .
This measure is uniquely specified by requiring that for any times 0 = t0 <
t1 < · · · < tn ≤ T , the increments Yk = X(tk+1 ) − X(tk ) are independent Gaus-
sian random variables with var(Yk ) = tk+1 − tk . The proof (which we omit) has
two parts. First, it is shown that there indeed is such a measure. Second, it
is shown that there is only one such. All the information we need is contained
in the joint distribution of the increments. The fact that increments from dis-
joint time intervals are independent is the independent increments property. It
also is possible to consider Brownian motion on an infinite time horizon with
probability space C0 ([0, ∞), R).
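The defining properties above translate directly into a simulation recipe: sum independent Gaussian increments of variance Δt. A minimal Python sketch (the horizon T, the step count n, and the number of paths are arbitrary choices for illustration):

```python
# Minimal sketch: simulate standard Brownian motion on [0, T] by summing
# independent Gaussian increments with variance dt (a discrete approximation
# to Wiener measure).
import numpy as np

rng = np.random.default_rng(3)
T, n, n_paths = 1.0, 1000, 5
dt = T / n
dX = np.sqrt(dt) * rng.standard_normal((n_paths, n))   # increments X(t_{k+1}) - X(t_k)
X = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dX, axis=1)], axis=1)
t = np.linspace(0.0, T, n + 1)
print(X[:, -1])     # X(T) for each path; each is approximately N(0, T)
```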

1.5. Technical aside: There is a different description of the Borel σ−algebra
on C0 ([0, T ], R). Rather than using balls in the sup norm, one can use sets more
closely related to the definition of Wiener measure through the joint distribution
of increments. Choose times 0 = t0 < t1 < · · · < tn ≤ T , and for each tk a Borel set,
Ik ⊆ R (thought of as intervals though they may not be). Let A be the event
{X(tk ) ∈ Ik for all k}. The set of such events forms an algebra (check this),
though not a σ−algebra. The probabilities P (A) are determined by the joint
distributions of the increments. The Borel σ−algebra on C0 ([0, T ], R) is generated
by this algebra (proof omitted), so Wiener measure (if it exists) is determined
by these probabilities.

1.6. Transition probabilities: The transition probability density for Brownian


motion is the probability density for X(t + s) given that X(t) = y. We denote
this by G(y, x, s), the G standing for Green's function. It is much like the
Markov chain transition probabilities P^t_{y,x} except that (i) G is a probability
density as a function of x, not a probability, and (ii) t is continuous, not
discrete. In our case, the increment X(t + s) − X(t) is Gaussian with variance
s. If we learn that X(t) = y, then y becomes the expected value of X(t + s).
Therefore,

$$G(y, x, s) = \frac{1}{\sqrt{2\pi s}}\, e^{-(x-y)^2/2s} . \qquad (1)$$

1.7. Functionals: An element of Ω = C0 ([0, T ], R) is called X. We de-
note by F (X) a real valued function of X. In this context, such a func-
tion is often called a functional, to keep from confusing it with X(t), which
1 In other contexts, people use C0 to indicate functions with compact support (whatever
that means) or functions that tend to zero as t → ∞, but not here.
2 The American mathematician and MIT professor Norbert Wiener was equally brilliant

and inarticulate.

is a random function of t. This functional is just what we called a func-
tion of a random variable (the path X playing the role of the abstract ran-
dom outcome ω). The simplest example of a functional is just a function of
X(T ): F (X) = V (X(T )). More complicated functionals are integrals: F (X) =
∫_0^T V (X(t)) dt; extrema: F (X) = max_{0≤t≤T} X(t); or stopping times such as
F (X) = min{ t such that ∫_0^t X(s) ds ≥ 1 }. Stochastic calculus provides tools
for computing the expected values of many such functionals, often through solu-
tions of partial differential equations. Computing expected values of functionals
is our main way to understand the behavior of Brownian motion (or any other
stochastic process).

1.8. Markov property: The independent increments property makes Brown-


ian motion a Markov process. Let Ft be the σ−algebra generated by the path
up to time t. This may be characterized as the σ−algebra generated by all the
random variables X(s) for s ≤ t, which is the smallest σ−algebra in which all the
functions X(s) are measurable. It also may be characterized as the σ−algebra
generated by events of the form A above (Technical aside) with tn ≤ t (proof
omitted). We also have the σ−algebra Gt generated by the present only. That
is, Gt is generated by the single random variable X(t); it is the smallest σ−
algebra in which X(t) is measurable. Finally, we let Ht denote the σ−algebra
that depends only on future values X(s) for s ≥ t. The Markov property states
that if F (X) is any functional measurable with respect to Ht (i.e. depending
only on the future of t), then E[F | Ft ] = E[F | Gt ].
Here is a quick sketch of the proof. If F (X) is a function of finitely many
values, X(tk ), with tk ≥ t, then E[F | Ft ] = E[F | Gt ] follows from the
independent increments property. It is possible (though tedious) to show that
any F measurable with respect to Ht may be approximated by a functional
depending on finitely many future times. This extends E[F | Ft ] = E[F | Gt ] to
all F measurable in Ht .

1.9. Path probabilities: As for discrete Markov chains, the individual
outcomes here are paths, X. For Markov chains one can compute the probability
of an individual path by multiplying the transition probabilities. The situation
is different for Brownian motion, where each individual path has probability zero.
We will make much use of the following partial substitute. Again choose times
t0 = 0 < t1 < · · · < tn ≤ T , let ~t = (t1 , . . . , tn ) be the vector of these times, and
let X~ = (X(t1 ), . . . , X(tn )) be the vector of the corresponding observations of
X. We write U (n) (~x, ~t) for the joint probability density for the n observations,
which is found by multiplying together the transition probability densities (1)
(and using properties of exponentials):

$$U^{(n)}(\vec x, \vec t\,) = \prod_{k=0}^{n-1} G(x_k, x_{k+1}, t_{k+1}-t_k) = \frac{1}{(2\pi)^{n/2}}\prod_{k=0}^{n-1}\frac{1}{\sqrt{t_{k+1}-t_k}}\;\exp\Big(-\frac12\sum_{k=0}^{n-1}\frac{(x_{k+1}-x_k)^2}{t_{k+1}-t_k}\Big) , \qquad (2)$$

with t0 = 0 and x0 = 0.

The formula (2) is a concrete summary of the defining properties of the


probability measure for Brownian motion, Wiener measure: the independent
increments property, the Gaussian distribution of the increments, the variance
being proportional to the time differences, and the increments having mean zero.
It also makes clear that each finite collection of observations forms a multivariate
normal. For any of the events A as in Technical aside, we have
$$P(A) = \int_{x_1\in I_1}\cdots\int_{x_n\in I_n} U^{(n)}(x_1, \ldots, x_n, \vec t\,)\,dx_1\cdots dx_n .$$

1.10. Consistency: You cannot give just any old probability densities to
replace the joint densities (2). They must satisfy a simple consistency condition.
Having given the joint density for n observations, you also have given the joint
density for a subset of these observations. For example, the joint density for
X(t1 ) and X(t3 ) must be the marginal of the joint density of X(t1 ), X(t2 ),
and X(t3 ):

$$U^{(2)}(x_1, x_3, t_1, t_3) = \int_{x_2=-\infty}^{\infty} U^{(3)}(x_1, x_2, x_3, t_1, t_2, t_3)\,dx_2 .$$

It is possible to verify these consistency conditions by direct calculation with


the Gaussian integrals. A more abstract way is to understand the consistency
conditions as adding random increments. The U (2) density says that we get
X(t3 ) from X(t1 ) by adding an increment that is Gaussian with mean zero
and variance t3 − t1 . The U (3) density says that we get X(t3 ) from X(t2 ) by adding
a Gaussian with mean zero and variance t3 − t2 . In turn, we get X(t2 ) from
X(t1 ) by adding an increment having mean zero and variance t2 − t1 . Since the
smaller time increments are Gaussian and independent of each other, their sum
is also Gaussian, with mean zero and variance (t3 − t2 ) + (t2 − t1 ), which is the
same as the variance in going from X(t1 ) to X(t3 ) directly.

1.11. Rough paths: The above picture shows 5 Brownian motion paths.
They are random and differ in gross features (some go up, others go down), but
the fine scale structure of the paths is the same. They are not smooth, or even
differentiable functions of t. If X(t) is a differentiable function of t, then for
small Δt its increments are roughly proportional to Δt:

$$\Delta X = X(t+\Delta t) - X(t) \approx \frac{dX}{dt}\,\Delta t .$$

For Brownian motion, the expected value of the square of ΔX (the variance of
ΔX) is proportional to Δt. This suggests that typical values of ΔX will be on
the order of √Δt. In fact, an easy calculation gives

$$E[\,|\Delta X|\,] = \sqrt{\frac{2\,\Delta t}{\pi}} .$$
This would be impossible if successive increments of Brownian motion were all
in the same direction (see Total variation below). Instead, Brownian motion
paths are constantly changing direction. They go nowhere (or not very far) fast.

1.12. Total variation: One quantitative sense of path roughness is the fact
that Brownian motion paths have infinite total variation. The total variation
of a function X(t) measures the total distance it moves, counting both ups and
downs. For a differentiable function, this would be

$$\mathrm{TV}(X) = \int_0^T \Big|\frac{dX}{dt}\Big|\,dt . \qquad (3)$$

If X(t) has simple jump discontinuities, we add the sizes of the jumps to (3).
For general functions, the total variation is

$$\mathrm{TV}(X) = \sup \sum_{k=0}^{n-1} |X(t_{k+1}) - X(t_k)| , \qquad (4)$$

where the supremum is over all positive n and all sequences t0 = 0 < t1 < · · · <
tn ≤ T .
Suppose X(t) has finitely many local maxima or minima, such as t0 = local
max, t1 = local min, etc. Then taking these t values in (4) gives the exact total
variation (further subdivision does not increase the sum). This is one way
to relate the general definition (4) to the definition (3) for differentiable functions.
This does not help for Brownian motion paths, which have infinitely many
local maxima and minima.

1.13. Almost surely: Let A ∈ F be a measurable event. We say A happens
almost surely if P (A) = 1. This allows us to establish properties of random ob-
jects by doing calculations (stochastic calculus). For example, we will show that
Brownian motion paths have infinite total variation almost surely by showing
that for any (small) ε > 0 and any (large) N ,

$$P(\mathrm{TV}(X) < N) < \varepsilon . \qquad (5)$$

Let B ⊂ C0 ([0, T ], R) be the set of paths with finite total variation. This is a
countable union

$$B = \bigcup_{N>0} \{\mathrm{TV}(X) < N\} = \bigcup_{N>0} B_N .$$

Since P (BN ) < ε for any ε > 0, we must have P (BN ) = 0. Countable additivity
then implies that P (B) = 0, which means that P (TV = ∞) = 1.
There is a distinction between outcomes that do not exist and events that
never happen because they have probability zero. For example, if Z is a one
dimensional Gaussian random variable, the outcome Z = 0 does exist, but the
event {Z = 0} is impossible (never will be observed). This is what we mean
when we say a Gaussian random variable never is zero, or every Brownian
motion path has infinite total variation.

1.14. The TV of BM: The heart of the matter is the actual calculation
behind the inequality (5). We choose an n > 0 and define (not for the last time)
Δt = T /n and tk = kΔt. Let Y be the random variable

$$Y = \sum_{k=0}^{n-1} |X(t_{k+1}) - X(t_k)| .$$

Remember that Y is one of the candidates we must use in the supremum (4) that
defines the total variation. If Y is large, then the total variation is at least as
large. Because E[|ΔX|] = √(2Δt/π), we have

$$E[Y] = \sqrt{\frac{2}{\pi}}\,\sqrt{T n} .$$

A calculation using the independent increments property shows that

$$\mathrm{var}(Y) = \Big(1 - \frac{2}{\pi}\Big)\, T$$

for any n. Tchebychev's inequality3 implies that

$$P\left( Y < \sqrt{\frac{2Tn}{\pi}} - k\,\sqrt{\Big(1-\frac{2}{\pi}\Big)T} \right) \le \frac{1}{k^2} .$$

If we take very large n and medium large k, this inequality says that it is very
unlikely for Y (or the total variation of X) to be much less than const·√n. Our
inequality (5) follows from this with a suitable choice of n and k.
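A small Monte Carlo sketch of this growth (Python/numpy, with T = 1 chosen arbitrarily): the candidate sum Y tracks √(2Tn/π), so refining the partition makes the sum grow without bound.

```python
# Minimal sketch: the sum of |increments| over n uniform subintervals grows like
# sqrt(2*T*n/pi), consistent with Brownian paths having infinite total variation.
import numpy as np

rng = np.random.default_rng(4)
T = 1.0
for n in [100, 1_000, 10_000, 100_000]:
    dt = T / n
    dX = np.sqrt(dt) * rng.standard_normal(n)
    Y = np.abs(dX).sum()                      # candidate sum in the supremum (4)
    print(n, Y, np.sqrt(2 * T * n / np.pi))   # Y tracks sqrt(2Tn/pi) as n grows
```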

1.15. Structure of BM paths: For any function X(t), we can define the
total variation on the interval [t1 , t2 ] in an obvious way. The odometer of a car
records the distance travelled regardless of the direction. For X(t), the total
variation on the interval [0, t] plays a similar role. Clearly, X is monotone on
the interval [t1 , t2 ] if and only if TV(X, t1 , t2 ) = |X(t2 ) − X(t1 )|. Otherwise,
X has at least one local min or max within [t1 , t2 ]. Now, Brownian motion
paths have infinite total variation on any interval (the proof above implies this).
Therefore, a Brownian motion path has a local max or min within any interval.
This means that (like the rational numbers, for example) the set of local maxima
and minima is dense: There is a local max or min arbitrarily close to any given
number.

1.16. Dynamic trading: The infinite total variation of Brownian motion has
a consequence for dynamic trading strategies. Some of the simplest dynamic
trading strategies, Black-Scholes hedging, and Merton half stock/half cash trad-
ing, call for trades that are proportional to the change in the stock price. If the
stock price is a diffusion process and there are transaction costs proportional
to the size of the trade, then the total transaction costs will either be infinite
(in the idealized continuous trading limit) or very large (if we trade as often as
3 If E[Y ] = μ and var(Y ) = σ², then P (|Y − μ| > kσ) < 1/k². The proof and more examples
are in any good basic probability book.

possible). It turns out that dynamic trading strategies that take trading costs
into account can approach the idealized zero cost strategies when trading costs
are small. Next term you will learn how this is done.

1.17. Quadratic variation: A more useful measure of roughness of Brownian
motion paths and other diffusion processes is quadratic variation. Using previous
notations: Δt = T /n, tk = kΔt, the definition is4 (where n → ∞ as Δt → 0
with T = nΔt fixed)

$$Q(X) = \lim_{\Delta t \to 0} Q_n(X) = \lim_{\Delta t \to 0} \sum_{k=0}^{n-1} \big(X(t_{k+1}) - X(t_k)\big)^2 . \qquad (6)$$

If X is a differentiable function of t, then its quadratic variation is zero (Qn
is the sum of n terms each of order 1/n²). For Brownian motion, Q(X) =
T (almost surely). Clearly E[Qn ] = T for any n (independent increments,
Gaussian increments with variance Δt). The independent increments property
also lets us evaluate var(Qn ) = 2T²/n (the sum of n terms each equal to 2Δt² =
2T²/n²). Thus, Qn must be increasingly close to T as n gets larger5 .

1.18. Trading volatility: The quadratic variation of a stock price (or a similar
quantity) is called its realized volatility. The fact that it is possible to buy
and sell realized volatility says that the (geometric) Brownian motion model
of stock price movement is not completely realistic. That model predicts that
realized volatility is a constant, which is nothing to bet on.

1.19. Brownian bridge construction:

1.20. Continuous time stochastic process: The general abstract definition of


a continuous time stochastic process is just a probability space, Ω, and, for each
t > 0, a σ−algebra Ft . These σ−algebras should form a filtration (corresponding
to increase of information): Ft1 ⊆ Ft2 if t1 ≤ t2 . There should also be a family
of random variables Yt (ω), with Yt measurable in Ft (i.e. having a value known
at time t). This explains why probabilists often write Xt instead of X(t) for
Brownian motion and other diffusion processes. For each t, we think of Xt as a
function of ω with t simply being a parameter. Our choice of probability space
Ω = C0 ([0, T ], R) implies that for each ω, Xt (ω) is a continuous function of t.
(Actually, for simple Brownian motion, the path X plays the role of the abstract
outcome ω, though we never write Xt (X).) Other stochastic processes, such as
the Poisson jump process, do not have continuous sample paths.
4 It is possible, though not customary, to define TV(X) using evenly spaced points. In the
limit Δt → 0, we would get the same answer for continuous paths or paths with TV(X) < ∞.
You don't have to use uniformly spaced times in the definition of Q(X), but I think you get
a different answer if you let the times depend on X as they might in the definition of total
variation.
5 This does not quite prove that (almost surely) Qn → T as n → ∞. We will come back
to this point in later lectures.

1.21. Continuous time martingales: A stochastic process Ft (with Ω and the
σ−algebras Ft ) is a martingale if E[Fs | Ft ] = Ft for s > t. Brownian motion forms the first
example of a continuous time martingale. Another famous martingale related to
Brownian motion is Ft = Xt² − t (the reader should check this). As in discrete
time, any random variable, Y , defines a continuous time martingale through
conditional expectations: Yt = E[Y | Ft ]. The Ito calculus is based on the idea
that a stochastic integral with respect to X should produce a martingale.

2 Brownian motion and the heat equation

2.1. Introduction: Forward and backward equations are tools for calculating
probabilities and expected values related to Brownian motion, as they are for
Markov chains and stochastic processes more generally. The probability density
of X(t) satisfies a forward equation. The conditional expectations E[V | Ft ]
satisfy backward equations for a variety of functionals V . For Brownian motion,
the forward and backward equations are partial differential equations, either the
heat equation or a close relative. We will see that the theory of partial differential
equations of diffusion type (the heat equation being a prime example) and
the theory of diffusion processes (Brownian motion being a prime example) each
draw from the other.

2.2. Forward equation for the probability density: If X(t) is a standard


Brownian motion with X(0) = 0, then X(t) ∼ N (0, t), so its probability density
is (see (1))

$$u(x,t) = G(0, x, t) = \frac{1}{\sqrt{2\pi t}}\, e^{-x^2/2t} .$$

Directly calculating partial derivatives, we can verify that

$$\partial_t G = \frac12\, \partial_x^2 G . \qquad (7)$$

We also could consider a Brownian motion with a more general initial density
X(0) ∼ u0 (x). Then X(t) is the sum of independent random variables X(0)
and an N (0, t). Therefore, the probability density for X(t) is

$$u(x,t) = \int_{y=-\infty}^{\infty} G(y, x, t)\, u_0(y)\, dy = \int_{y=-\infty}^{\infty} G(0, x-y, t)\, u_0(y)\, dy . \qquad (8)$$

Again, direct calculation (differentiating (8), x and t derivatives land on G)
shows that u satisfies

$$\partial_t u = \frac12\, \partial_x^2 u . \qquad (9)$$
This is the heat equation, also called diffusion equation. The equation is used in
two ways. First, we can compute probabilities by solving the partial differential
equation. Second, we can use known probability densities as solutions of the
partial differential equation.

2.3. Heat equation via Taylor series: The above is not so much a derivation
of the heat equation as a verification. We are told that u(x, t) (the probability
density of Xt ) satisfies the heat equation and we verify that fact. Here is a
method for deriving a forward equation without knowing it in advance. We
assume that u(x, t) is smooth enough as a function of x and t that we may expand
it to second order in Taylor series, do the expansion, then take the conditional
expectation of the terms. Variations of this idea lead to the backward equations
and to major parts of the Ito calculus.
Let us fix two times separated by a small Δt: t′ = t + Δt. The rules of
conditional probability allow us to compute the density of X = X(t′ ) in terms
of the density of Y = X(t) and the transition probability density (1):

$$u(x, t+\Delta t) = \int_{y=-\infty}^{\infty} G(y, x, \Delta t)\, u(y, t)\, dy . \qquad (10)$$

The main idea is that for small Δt, X(t + Δt) will be close to X(t). This is
expressed in G being small unless y is close to x, which is evident in (1). In
the integral, x is a constant and y is the variable of integration. If we would
approximate u(y, t) by u(x, t), the value of the integral just would be u(x, t).
This would give the true but not very useful approximation u(x, t + Δt) ≈
u(x, t) for small Δt. Adding the next Taylor series term (writing ux for ∂x u):
u(y, t) ≈ u(x, t) + ux (x, t)(y − x), the integral does not change the result because
∫ G(y, x, Δt)(y − x) dy = 0. Adding the next term:

$$u(y, t) \approx u(x,t) + u_x(x,t)(y-x) + \frac12 u_{xx}(x,t)(y-x)^2 ,$$

gives (because E[(Y − X)²] = Δt)

$$u(x, t+\Delta t) \approx u(x,t) + \frac12 u_{xx}(x,t)\,\Delta t .$$

To derive a partial differential equation, we expand the left side as u(x, t+Δt) =
u(x, t) + ut (x, t)Δt + O(Δt²). On the right, we use

$$\int G(y, x, \Delta t)\, |y-x|^3\, dy = O(\Delta t^{3/2}) .$$

Altogether, this gives

$$u(x,t) + u_t(x,t)\,\Delta t = u(x,t) + \frac12 u_{xx}(x,t)\,\Delta t + O(\Delta t^{3/2}) .$$

If we cancel the common u(x, t) then cancel the common factor Δt and let
Δt → 0, we get the desired heat equation (9).

2.4. The initial value problem: The heat equation (9) is the Brownian motion
analogue of the forward equation for Markov chains. If we know the time 0
density u(x, 0) = u0 (x) and the evolution equation (9), the values of u(x, t) are
completely and uniquely determined (ignoring mathematical technicalities that

would be unlikely to trouble a practical person). The task of finding u(x, t) for
t > 0 from u0 (x) and (9) is called the initial value problem, with u0 (x) being
the initial value (or initial values). This initial value problem is well posed,
which means that the solution, u(x, t), exists and depends continuously on the
initial data, u0 . If you want a proof that the solution exists, just use the integral
formula for the solution (8). Given u0 , the integral (8) exists, satisfies the heat
equation, and is a continuous function of u0 . The proof that u is unique is more
technical, partly because it rests on more technical assumptions.

2.5. Ill posed problems: In some situations, the problem of finding a function
u from a partial differential equation and other data may be ill posed, useless
for practical purposes. A problem is ill posed if it is not well posed. This means
either that the solution does not exist, or that it does not depend continuously
on the data, or that it is not unique. For example, if I try to find u(x, t) for
positive t knowing only u0 (x) for x > 0, I must fail. A mathematician would say
that the solution, while it exists, is not unique, there being many different ways
to give u0 (x) for x > 0, each leading to a different u. A more subtle situation
arises, for example, if we give u(x, T ) for all x and wish to determine u(x, t)
for 0 t < T . For example, if u(x, T ) = 1[0,1] (x), there is no solution (trust
me). Even if there is a solution, for example given by (8), is does not depend
continuously on the values of u(x, T ) for T > t (trust me).
The heat equation (9) relates values of u at one time to values at another
time. However, it is well posed only for determining u at future times from u
at earlier times. This forward equation is well posed only for moving forward
in time.

2.6. Conditional expectations: We saw already for Markov chains that


certain conditional expected values can be calculated by working backwards in
time with the backward equation. The Brownian motion version of this uses
the conditional expectation

f (x, t) = E[V (XT ) | Xt = x] . (11)

One modern formulation of this defines Ft = E[V (XT ) | Ft ]. The Markov
property implies that Ft is measurable in Gt , which makes it a function of
Xt . We write this as Ft = f (Xt , t). Of course, these definitions mean the
same thing and yield the same f . The definition is also sometimes written as
f (x, t) = Ex,t [V (XT )]. In general if we have a parametrized family of probability
measures, Pα , we write the expected value with respect to Pα as Eα [·]. Here,
the probability measure Px,t is the Wiener measure describing Brownian motion
paths that start from x at time t, which is defined by the densities of increments
for times larger than t as before.

2.7. Backward equation by direct verification: Given that Xt = x, the


conditional density for XT is the same transition density (1). The expectation (11)
is given by the integral

$$f(x,t) = \int G(x, y, T-t)\, V(y)\, dy . \qquad (12)$$

We can verify by explicit differentiation (x and t derivatives act on G) that

$$\partial_t f + \frac12\, \partial_x^2 f = 0 . \qquad (13)$$

Note that the sign of ∂t here is not what it was in (9), which is because we are
calculating ∂t G(T − t) rather than ∂t G(t). This (13) is the backward equation.

2.8. Backward equation by Taylor series: As with the forward equation (9),
we can find the backward equation by Taylor series expansions. We start by
choosing a small Δt and expressing f (x, t) in terms of6 f (·, t + Δt). As before,
define Ft = E[V (XT ) | Ft ] = f (Xt , t). Since Ft ⊆ Ft+Δt , the tower property
implies that Ft = E[Ft+Δt | Ft ], so

$$f(x,t) = E_{x,t}\big[\,f(X_{t+\Delta t},\, t+\Delta t)\,\big] = \int_{y=-\infty}^{\infty} f(y, t+\Delta t)\, G(x, y, \Delta t)\, dy . \qquad (14)$$

As before, we expand f (y, t + Δt) about x, t, dropping terms that contribute less
than O(Δt):

$$f(y, t+\Delta t) = f(x,t) + f_x(x,t)(y-x) + \frac12 f_{xx}(x,t)(y-x)^2 + f_t(x,t)\,\Delta t + O(|y-x|^3) + O(\Delta t^2) .$$

Substituting this into (14) and integrating each term leads to

$$f(x,t) = f(x,t) + 0 + \frac12 f_{xx}(x,t)\,\Delta t + f_t(x,t)\,\Delta t + O(\Delta t^{3/2}) + O(\Delta t^2) .$$

A bit of algebra and Δt → 0 then gives (13).
For future reference, we pause to note the differences between this derivation
of (13) and the related derivation of (9). Here, we integrated G with respect
to its second argument, while earlier we integrated with respect to the first
argument. This does not matter for the special case of Brownian motion and
the heat equation because G(x, y, t) = G(y, x, t). When we apply this reasoning
to other diffusion processes, G(x, y, t) will be a probability density as a function
of y for every x, but it need not be a probability density as a function of x for
given y. This is an analogue of the fact in Markov chains that the transition
6 The notation f (·, t + Δt) is to avoid writing f (x, t + Δt), which might imply that the value
f (x, t) depends only on f at time t + Δt for the same x value. Instead, it depends on all the
values f (y, t + Δt).

matrix P acts from the left on column vectors f (summing Pjk over k) but from
the right on row vectors u (summing Pjk over j). For each j, Σ_k Pjk = 1, but
the column sums Σ_j Pjk may not equal one. Of course, the sign of the Δt term
is different in the two cases because we did the t Taylor series on the right side
of (14) but on the left side of (10).

2.9. The final value problem: The final values f (x, T ) = V (x), together with
the backward evolution equation (13) allow us to determine the values f (·, t)
for t < T . The definition (11) makes this obvious. This means that the final
value problem for the backward heat equation is a well posed problem.
On the other hand, the initial value problem for the backward heat equation
is not a well posed problem. If we have f (x, 0) and we want a V (x) that leads
to it, we are probably out of luck.

2.10. Duality: As for Markov chains, we can express the expected value of
V (XT ) in terms of the probability density at any earlier time t ≤ T :

$$E[V(X_T)] = \int u(x,t)\, f(x,t)\, dx .$$

This again implies that the right side is independent of t, which in turn al-
lows us to derive the forward equation (9) from the backward equation (13) or
conversely. For example, differentiating and using (13) gives

$$0 = \frac{d}{dt}\int u(x,t) f(x,t)\,dx = \int u_t(x,t) f(x,t)\,dx + \int u(x,t) f_t(x,t)\,dx = \int u_t(x,t) f(x,t)\,dx - \int u(x,t)\,\frac12 f_{xx}(x,t)\,dx .$$

To derive an equation involving only u derivatives, we want to integrate the last
integral by parts to move the x derivatives from f to u. In this formal derivation,
we will assume that the probability density u(x, t) decays to zero fast enough as
|x| → ∞ that we can neglect possible boundary terms at x = ±∞. This gives

$$\int \Big( u_t(x,t) - \frac12 u_{xx}(x,t) \Big)\, f(x,t)\, dx = 0 .$$

If this relation holds for a sufficiently rich family of functions f , we can only
conclude that ut − ½ uxx is identically zero, which is the forward equation (9).

2.11. The smoothing property, regularity: Solutions of the forward or back-


ward heat equation become smooth functions of x and t even if the initial data
(for the forward equation) or final data (for the backward equation) are not
smooth. For u, this is clear from the integral formula (8). If we differentiate
with respect to x, this derivative passes under the integral and onto the G fac-
tor. This applies also to x or t derivatives of any order, since the corresponding

derivatives of G are still smooth integrable functions of x. The same can be said
for f using (12); as long as t < T , any derivatives of f with respect to x and/or t
are bounded. A function that has all partial derivatives of any order bounded is
called smooth. (Warning, this term is not used consistently. Some people say
smooth to mean, for example, merely having derivatives up to second order
bounded.) Solutions of more general forward and backward equations often,
but not always, have the smoothing property.

2.12. Rate of smoothing: Suppose the payout (and final value) function,
V (x), is a discontinuous function such as V (x) = 1x<0 (x) (a digital option in
finance). The solution to the backward equation can be expressed in terms of
the cumulative normal (with Z ∼ N (0, 1))

$$N(x) = P(Z < x) = \frac{1}{\sqrt{2\pi}}\int_{z=-\infty}^{x} e^{-z^2/2}\,dz .$$

Then we have

$$f(x,t) = \int_{y=-\infty}^{0} G(x, y, T-t)\,dy = \frac{1}{\sqrt{2\pi(T-t)}}\int_{y=-\infty}^{0} e^{-(x-y)^2/2(T-t)}\,dy = N\big(-x/\sqrt{T-t}\big) . \qquad (15)$$

From this it is clear that f is differentiable when t < T , but the first x derivative
is as large as 1/√(T − t), the second as large as 1/(T − t), etc. All derivatives
blow up as t → T with higher derivatives blowing up faster. This can make
numerical solution of the backward equation difficult and inaccurate when the
final data V (x) is not smooth.
The formula (15) can be derived without integration. One way is to note that
f (x, t) = P (XT < 0 | Xt = x) and XT ≈ x + √(T − t) Z (Gaussian increments) so
that XT < 0 is the same as Z < −x/√(T − t). Even without the normal probability,
a physicist would tell you that ΔX ∼ √Δt, so the hitting probability starting
from x at time t has to be some function of x/√(T − t).

2.13. Diffusion: If you put a drop of ink into a glass of still water, you
will see the ink slowly diffuse through the water. This is modelled as a vast
number of tiny ink particles each performing an independent Brownian motion
in the water. Let u(x, t) represent the density of particles about x at time t
(say, particles per cubic millimeter). This u satisfies the heat equation but not
the requirement that ∫ u(x, t) dx = 1. If ink has been diffusing through water
for some time, there will be dark regions with a high density of particles (large
u) and lighter regions with smaller u. In the absence of boundaries (sides of the
glass and the top of the water), the ink distribution would be Gaussian.

2.14. Heat: Heat also can diffuse through a medium, as happens when
we put a thick metal pan over a flame and wait for the other side to heat

up. We can think of u(x, t) as representing the temperature in a metal at
location x at time t. This helps us interpret solutions of the heat equation
(9) when u is not necessarily positive. In particular, it helps us imagine the
cancellation that can occur when regions of positive and negative u are close to
each other. Heat flows from the high temperature regions to low or negative
temperature regions in a way that makes the temperature distribution more
uniform. A physical argument that heat (temperature) flowing through a metal
should satisfy the heat equation was given by the French mathematical physicist,
friend of Napoleon, and founder of the Ecole Polytechnique, Joseph Fourier.

2.15. Hitting times: A stopping time, τ , is any time that depends on the
Brownian motion path X so that the event {τ ≤ t} is measurable with respect to
Ft . This is the same as saying that for each t there is some process that has as
input the values Xs for 0 ≤ s ≤ t and as output a decision τ ≤ t or τ > t. One
kind of stopping time is a hitting time:

$$\tau_a = \min\,( t \mid X_t = a ) .$$

More generally (particularly for Brownian motion in more than one dimension)
if A is a closed set, we may consider τA = min(t | Xt ∈ A). It is useful to define
a Brownian motion that stops at time τ : X̃t = Xt if t ≤ τ , X̃t = Xτ if t ≥ τ .

2.16. Probabilities for stopped Brownian motion: Suppose Xt is Brownian
motion starting at X0 = 1 and X̃ is the Brownian motion stopped at time τ0 ,
the first time Xt = 0. The probability measure, Pt , for X̃t may be written
as the sum of two terms, Pt = Pt^s + Pt^ac . (Since X̃t is a single number, the
probability space is Ω = R, and the σ−algebra is the Borel σ−algebra.) The
singular part, Pt^s , corresponds to the paths that have been stopped. If p(t) is
the probability that τ ≤ t, then Pt^s = p(t)δ(x), which means that for any Borel
set, A ⊆ R, Pt^s (A) = p(t) if 0 ∈ A and Pt^s (A) = 0 if 0 ∉ A. This is called
the delta function or delta mass; it puts weight one on the point zero and
no weight anywhere else. Probabilists sometimes write δx0 for the measure that
puts weight one on the point x0 . Physicists write δx0 (x) = δ(x − x0 ). The
absolutely continuous part, Pt^ac , is given by a density, u(x, t). This means
that Pt^ac (A) = ∫_A u(x, t) dx. Because ∫_R u(x, t) dx = 1 − p(t) < 1, u, while being
a density, is not a probability density.
This decomposition of a measure (P ) as a sum of a singular part and ab-
solutely continuous part is a special case of the Radon Nikodym theorem. We
will see the same idea in other contexts later.

2.17. Forward equation for u: The density for the absolutely continuous part,
u(x, t), is the density for paths that have not touched X = a. In the diffusion
interpretation, think of a tiny ink particle diffusing as before but being absorbed
if it ever touches a. It is natural to expect that when x ≠ a, the density satisfies
the heat equation (9). u knows about the boundary because of
the boundary condition u(a, t) = 0. This says that the density of particles
approaches zero near the absorbing boundary. By the end of the course, we
will have several ways to prove this. For now, think of a diffusing particle, a
Brownian motion path, as being hyperactive; it moves so fast that it has already
visited a neighborhood of its current location. In particular, if Xt is close to a,
then very likely Xs = a for some s < t. Only a small minority of the particles
at x near a, with small density u(x, t) → 0 as x → a, have not touched a.

2.18. Probability flux: Suppose a Brownian motion starts at a random point
X0 > 0 with probability density u0 (x) and we take the absorbing boundary
at a = 0. Clearly, u(x, t) = 0 for x < 0 because a particle cannot cross from
positive to negative without crossing zero, the Brownian motion paths being
continuous. The probability of not being absorbed before time t is given by

$$1 - p(t) = \int_{x>0} u(x,t)\, dx . \qquad (16)$$

The rate of absorption of particles, the rate of decrease of probability, may be
calculated by using the heat equation and the boundary condition. Differenti-
ating (16) with respect to t and using the heat equation for the right side then
integrating gives

$$\partial_t p(t) = -\int_{x>0} \partial_t u(x,t)\, dx = -\int_{x>0} \frac12\, \partial_x^2 u(x,t)\, dx = \frac12\, \partial_x u(0,t) . \qquad (17)$$

Note that both sides of (17) are positive. The left side because P (τ ≤ t) is an
increasing function of t, the right side because u(0, t) = 0 and u(x, t) > 0 for
x > 0. The identity (17) leads us to interpret the left side as the probability
flux (or density flux if we are thinking of diffusing particles). The rate
at which probability flows (or particles flow) across a fixed point (x = 0) is
proportional to the derivative (the gradient) at that point. In the heat flow
interpretation this says that the rate of heat flow across a point is proportional
to the temperature gradient. This natural idea is called Fick's law (or possibly
Fourier's law).

2.19. Images and Reflections: We want a function u(x, t) that satisfies the
heat equation when x > 0, the boundary condition u(0, t) = 0, and goes to δx0
as t → 0. The method of images is a trick for doing this. We think of δx0 as
a unit charge (in the electrical, not financial sense) at x0 and g(x − x0 , t) =
(1/√(2πt)) e^{−(x−x0)²/2t} as the response to this charge, if there is no absorbing boundary.
For example, think of putting a unit drop of ink at x0 and watching it spread
along the x axis in a bell shaped (i.e. Gaussian) density distribution. Now
think of adding a negative image charge at −x0 so that u0 (x) = δx0 − δ−x0
and correspondingly

$$u(x,t) = \frac{1}{\sqrt{2\pi t}}\left( e^{-(x-x_0)^2/2t} - e^{-(x+x_0)^2/2t} \right) . \qquad (18)$$

This function satisfies the heat equation everywhere, and in particular for x > 0.
It also satisfies the boundary condition u(0, t) = 0. Also, it has the same initial
data as g, as long as x > 0. Therefore, as long as x > 0, the u given by (18)
represents the density of unabsorbed particles in a Brownian motion with ab-
sorption at x = 0. You might want to consider the image charge contribution
in (18), −(1/√(2πt)) e^{−(x+x0)²/2t}, as red ink (the ink that represents negative quanti-
ties) that also diffuses along the x axis. To get the total density, we subtract
the red ink density from the black ink density. For x = 0, the red and black
densities are the same because the distances to the sources at ±x0 are the same.
When x > 0 the black density is higher so we get a positive u. We can think of
the image point, −x0 , as the reflection of the original source point through the
barrier x = 0.

2.20. The reflection principle: The explicit formula (18) allows us to evaluate
p(t), the probability of touching x = 0 by time t starting at X0 = x0 . This is

$$p(t) = 1 - \int_{x>0} u(x,t)\,dx = 1 - \int_{x>0} \frac{1}{\sqrt{2\pi t}}\left( e^{-(x-x_0)^2/2t} - e^{-(x+x_0)^2/2t}\right) dx .$$

Because ∫_{−∞}^{∞} (1/√(2πt)) e^{−(x−x0)²/2t} dx = 1, we may write

$$p(t) = \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi t}}\, e^{-(x-x_0)^2/2t}\, dx + \int_{0}^{\infty} \frac{1}{\sqrt{2\pi t}}\, e^{-(x+x_0)^2/2t}\, dx .$$

Of course, the two terms on the right are the same! Therefore

$$p(t) = 2 \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi t}}\, e^{-(x-x_0)^2/2t}\, dx .$$

This formula is a particular case of the Kolmogorov reflection principle. It says
that the probability that Xs < 0 for some s ≤ t (the left side) is exactly
twice the probability that Xt < 0 (the integral on the right). Clearly some of
the particles that cross to the negative side at times s < t will cross back, while
others will not. This formula says that exactly half the particles that touch 0
for some s ≤ t have Xt > 0. Kolmogorov gave a proof of this based on the
Markov property and the symmetry of Brownian motion. Since Xτ = 0 and
the increments of X for s > τ are independent of the increments for s < τ , and
since the increments are symmetric Gaussian random variables, they have the
same chance to be positive (Xt > 0) as negative (Xt < 0).
Stochastic Calculus Notes, Lecture 5
Last modified October 26, 2004

1 Integrals involving Brownian motion

1.1. Introduction: There are two kinds of integrals involving Brownian


motion, time integrals and Ito integrals. The time integral, which is discussed
here, is just the ordinary Riemann integral of a continuous but random function
of t with respect to t. Such integrals define stochastic processes that satisfy
interesting backward equations. On the one hand, this allows us to compute
the expected value of the integral by solving a partial differential equation. On
the other hand, we may find the solution of the partial differential equation by
computing the expected value by Monte Carlo, for example. The Feynman Kac
formula is one of the examples in this section.

1.2. The integral of Brownian motion: Consider the random variable, where
X(t) continues to be standard Brownian motion,
$$Y = \int_0^T X(t)\, dt . \qquad (1)$$

We expect Y to be Gaussian because the integral is a linear functional of the


(Gaussian) Brownian motion path X. Because X(t) is a continuous function
of t, this is a standard Riemann integral. The Riemann sum approximations
converge. As usual, for n > 0 we define Δt = T /n and tk = kΔt. The Riemann
sum approximation is

$$Y_n = \Delta t \sum_{k=0}^{n-1} X(t_k) , \qquad (2)$$

and Yn → Y as n → ∞ because X(t) is a continuous function of t. The n


summands in (2), X(tk ), form an n dimensional multivariate normal, so each of
the Yn is normal. It would be surprising if Y , as the limit of Gaussians, were
not Gaussian.

1.3. The variance of Y : We will start the hard way, computing the variance
from (2) and letting Δt → 0. The trick is to use two summation variables,
Yn = Δt Σ_{k=0}^{n−1} X(tk ) and Yn = Δt Σ_{j=0}^{n−1} X(tj ). It is immediate from (2) that
E[Yn ] = 0 and var(Yn ) = E[Yn²]:

$$E[Y_n^2] = E\Big[\,\Delta t \sum_{k=0}^{n-1} X(t_k)\;\cdot\;\Delta t \sum_{j=0}^{n-1} X(t_j)\Big] = \Delta t^2 \sum_{jk} E[X(t_k)X(t_j)] .$$

If we now let Δt → 0, the left side converges to E[Y²] and the right side
converges to a double integral:

$$E[Y^2] = \int_{s=0}^{T}\int_{t=0}^{T} E[X(t)X(s)]\,ds\,dt . \qquad (3)$$

We can find the needed E[X(t)X(s)] if s > t by writing X(s) = X(t) + ΔX
with ΔX independent of X(t), so

$$E[X(t)X(s)] = E[X(t)(X(t)+\Delta X)] = E[X(t)X(t)] = t .$$

A variation of this argument gives E[Xt Xs ] = s if s < t. Altogether

$$E[X_t X_s] = \min(t, s) ,$$

which is a famous formula. This now gives

$$E[Y^2] = \int_{s=0}^{T}\int_{t=0}^{T} E[X_t X_s]\,ds\,dt = \int_{s=0}^{T}\int_{t=0}^{T} \min(s,t)\,ds\,dt = \frac13\,T^3 .$$
There is a simpler and equally rigorous way to get this. Write Y = ∫_{s=0}^{T} X(s) ds
and Y = ∫_{t=0}^{T} X(t) dt so that again

$$E[Y^2] = E\Big[\int_{s=0}^T X(s)\,ds \int_{t=0}^T X(t)\,dt\Big] = E\Big[\int_{s=0}^T\int_{t=0}^T X(s)X(t)\,dt\,ds\Big] \qquad (4)$$
$$\qquad\quad = \int_{s=0}^T\int_{t=0}^T E[X(s)X(t)]\,dt\,ds . \qquad (5)$$

Going from (4) to (5) involves changing the order of integration1 . After all,
E[·] just represents integration over a probability space. The right side of (4)
has the abstract form

$$\int_{\Omega}\Big(\int_{s\in[0,T]}\int_{t\in[0,T]} F(\omega, s, t)\,dt\,ds\Big)\,dP(\omega) .$$

1 The possibility of changing order of abstract integrals was established by the twentieth

century mathematician Fubini. He proved it to be correct if the double (triple in our case)
integral converges absolutely (a requirement even for ordinary Riemann integrals) and the
function F is jointly measurable in all its arguments. Our integrand is nonnegative, so the
result will be infinite if the integral does not converge absolutely. We omit a discussion of
product measures and joint measurability.

Here F = X(s)X(t), and ω is the random outcome (the whole path X[0, T ]
here), and P represents Wiener measure. If we interchange the ordinary Rie-
mann ds dt integral with the abstract dP integral, we get

$$\int_{s\in[0,T]}\int_{t\in[0,T]}\Big(\int_{\Omega} F(\omega, s, t)\,dP(\omega)\Big)\,ds\,dt ,$$

which is the abstract form of (5).
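A quick Monte Carlo confirmation of E[Y²] = T³/3, using the Riemann sum (2) (Python/numpy, with an arbitrary discretization):

```python
# Minimal sketch: check E[Y^2] = T^3/3 for Y = integral of X(t) dt on [0, T],
# approximating Y by the left-endpoint Riemann sum (2).
import numpy as np

rng = np.random.default_rng(7)
T, n, n_paths = 1.0, 1000, 50_000
dt = T / n
dX = np.sqrt(dt) * rng.standard_normal((n_paths, n))
X = np.cumsum(dX, axis=1)                 # X(t_1), ..., X(t_n); X(t_0) = 0
Y = dt * X[:, :-1].sum(axis=1)            # left-endpoint sum; X(t_0) = 0 contributes nothing
print(np.mean(Y**2), T**3 / 3)            # should be close
```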

1.4. Measurability of Brownian motion integrals: Suppose t1 < t2 . Consider
the integrals U = ∫_0^{t1} X(t) dt and V = ∫_{t1}^{t2} (X(t) − X(t1 )) dt. We expect U
to be measurable in Ft1 because all the X values defining U are measurable
in Ft1 . Similarly, all the differences defining V are independent of anything
in Ft1 . Therefore, we expect V to be independent of U . We omit the
straightforward proofs of these facts, which depend on elementary properties of
abstract integration.

1.5. The Xt³ martingale: Many martingales are constructed from integrals
involving Brownian motion. A simple one is

$$F(t) = X(t)^3 - 3\int_0^t X(s)\, ds .$$

To check the martingale property, choose t2 > t1 and, for t > t1 , write X(t) =
X(t1 ) + ΔX(t). Then

$$E\Big[\int_0^{t_2} X(t)\,dt \,\Big|\, \mathcal{F}_{t_1}\Big] = E\Big[\int_0^{t_1} X(t)\,dt + \int_{t_1}^{t_2} X(t)\,dt \,\Big|\, \mathcal{F}_{t_1}\Big] = \int_0^{t_1} X(t)\,dt + E\Big[\int_{t_1}^{t_2} \big(X(t_1)+\Delta X(t)\big)\,dt \,\Big|\, \mathcal{F}_{t_1}\Big] = \int_0^{t_1} X(t)\,dt + (t_2-t_1)\,X(t_1) .$$

In the last line we use the facts that X(t) ∈ Ft1 when t < t1 , and Xt1 ∈ Ft1 ,
and that E[ΔX(t) | Ft1 ] = 0 when t > t1 , which is part of the independent
increments property. For the X(t)³ part, we have

$$E\big[(X(t_1)+\Delta X(t_2))^3 \,\big|\, \mathcal{F}_{t_1}\big] = E\big[X(t_1)^3 + 3X(t_1)^2\Delta X(t_2) + 3X(t_1)\Delta X(t_2)^2 + \Delta X(t_2)^3 \,\big|\, \mathcal{F}_{t_1}\big]$$
$$= X(t_1)^3 + 3X(t_1)^2\cdot 0 + 3X(t_1)\,E[\Delta X(t_2)^2\,|\,\mathcal{F}_{t_1}] + 0 = X(t_1)^3 + 3(t_2-t_1)\,X(t_1) .$$

In the last line we used the independent increments property to get E[ΔX(t2 ) |
Ft1 ] = 0, and the formula for the variance of the increment to get E[ΔX(t2 )² |
Ft1 ] = t2 − t1 . This verifies that E[F (t2 ) | Ft1 ] = F (t1 ), which is the martingale
property.
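A weak numerical consistency check (Python/numpy): since F(0) = 0, the martingale property implies E[F(t)] = 0 for all t. The sketch below checks this expectation only; it does not check the full conditional statement.

```python
# Minimal sketch: the mean of F(t) = X(t)^3 - 3*Integral_0^t X(s) ds should be
# near zero at every time, consistent with F being a martingale with F(0) = 0.
import numpy as np

rng = np.random.default_rng(8)
T, n, n_paths = 1.0, 1000, 200_000
dt = T / n
dX = np.sqrt(dt) * rng.standard_normal((n_paths, n))
X = np.cumsum(dX, axis=1)

for frac in [0.25, 0.5, 1.0]:
    k = int(frac * n)
    integral = dt * X[:, :k].sum(axis=1)          # Riemann sum for the time integral
    F = X[:, k - 1] ** 3 - 3 * integral
    print(frac * T, F.mean())                     # each mean should be near zero
```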

1.6. Backward equations for expected values of integrals: Many integrals
involving Brownian motion arise in applications and may be solved using
backward equations. One example is F = ∫_0^T V (X(t)) dt, which represents the
total accumulated V (X) over a Brownian motion path. If V (x) is a continuous
function of x, the integral is a standard Riemann integral, because V (X(t))
is a continuous function of t. We can calculate E[F ], using the more general
function

$$f(x,t) = E_{x,t}\Big[\int_t^T V(X(s))\,ds\Big] . \qquad (6)$$

As before, we can describe the function f (x, t) in terms of the random variable

$$F(t) = E\Big[\int_t^T V(X(s))\,ds \,\Big|\, \mathcal{F}_t\Big] .$$

Since F (t) is measurable in Ft and depends only on future values (X(s) with
s > t), F (t) is measurable in Gt . Since Gt is generated by X(t) alone, this
means that F (t) is a function of X(t), which we write as F (t) = f (X(t), t).
Of course, this definition is a big restatement of definition (6). Once we know
f (x, t), we can plug in t = 0 to get E[F ] = F (0) = f (x0 , 0) if X(0) = x0 is
known. Otherwise, E[F ] = E[f (X(0), 0)].
The backward equation for f is

$$\partial_t f + \frac12\, \partial_x^2 f + V(x) = 0 , \qquad (7)$$

with final conditions f (x, T ) = 0. The derivation is similar to the one we used
before for the backward equation for Ex,t [V (XT )]. We use Taylor series and the
tower property to calculate how f changes over a small time increment, Δt. We
start with

$$\int_t^T V(X(s))\,ds = \int_t^{t+\Delta t} V(X(s))\,ds + \int_{t+\Delta t}^T V(X(s))\,ds ,$$

take the x, t expectation, and use (6) to get

$$f(x,t) = E_{x,t}\Big[\int_t^{t+\Delta t} V(X(s))\,ds\Big] + E_{x,t}\Big[\int_{t+\Delta t}^T V(X(s))\,ds\Big] . \qquad (8)$$

The first integral on the right has the value V (x)Δt + o(Δt). We write o(Δt) for
a quantity that is smaller than Δt in the sense that o(Δt)/Δt → 0 as Δt → 0
(we will shortly divide by Δt, take the limit Δt → 0, and neglect all o(Δt)
terms). The second term has

$$E_{x,t}\Big[\int_{t+\Delta t}^T V(X(s))\,ds \,\Big|\, \mathcal{F}_{t+\Delta t}\Big] = F(t+\Delta t) = f(X(t+\Delta t),\, t+\Delta t) .$$

Writing X(t + Δt) = X(t) + ΔX, we use the tower property with Ft ⊆ Ft+Δt
to get

$$E\Big[\int_{t+\Delta t}^T V(X(s))\,ds \,\Big|\, \mathcal{F}_t\Big] = E\big[\,f(X_t + \Delta X,\, t+\Delta t) \,\big|\, \mathcal{F}_t\,\big] .$$

As before, we use Taylor expansions and the conditional expectation to get first

$$f(x+\Delta X, t+\Delta t) = f(x,t) + \Delta t\,\partial_t f(x,t) + \Delta X\,\partial_x f(x,t) + \frac12 \Delta X^2\,\partial_x^2 f(x,t) + o(\Delta t) ,$$

then

$$E_{x,t}\big[f(x+\Delta X, t+\Delta t)\big] = f(x,t) + \Delta t\,\partial_t f(x,t) + \frac12 \Delta t\,\partial_x^2 f(x,t) + o(\Delta t) .$$

Putting all this back into (8) gives

$$f(x,t) = \Delta t\, V(x) + f(x,t) + \Delta t\,\partial_t f(x,t) + \frac12 \Delta t\,\partial_x^2 f(x,t) + o(\Delta t) .$$

Now just cancel f (x, t) from both sides and let Δt → 0 to get the promised
equation (7).

1.7. Application of PDE: Most commonly, we cannot evaluate either the


expected value (6) or the solution of the partial differential equation (PDE)
(7). How does the PDE represent progress toward evaluating f ? One way is
by suggesting a completely different computational procedure. If we work only
from the definition (6), we would use Monte Carlo for numerical evaluation.
Monte Carlo is notoriously slow and inaccurate. There are several techniques
for finding the solution of a PDE that avoid Monte Carlo, including finite dif-
ference methods, finite element methods, spectral methods, and trees. When
such deterministic methods are practical, they generally are more reliable, more
accurate, and faster. In financial applications, we are often able to find PDEs
for quantities that have no simple Monte Carlo probabilistic definition. Many
such examples are related to optimization problems: maximizing an expected
return or minimizing uncertainty with dynamic trading strategies in a randomly
evolving market. The Black Scholes evaluation of the value of an American style
option is a well known example.
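To make the deterministic route concrete, here is a minimal explicit finite-difference sketch for the backward equation (7) (Python/numpy; the truncated domain, grid sizes, and the choice V(x) = x² are illustrative assumptions, not from the text). For V(x) = x² and X(0) = 0 the exact answer is E[∫_0^T X_t² dt] = T²/2 = 0.5, which the scheme should reproduce approximately.

```python
# Minimal sketch: explicit finite-difference scheme for the backward equation
# f_t + (1/2) f_xx + V(x) = 0 with f(x, T) = 0, marching from t = T down to t = 0.
import numpy as np

L, T = 4.0, 1.0                     # solve on [-L, L] with f = 0 at the ends (a crude truncation)
nx, nt = 201, 20_000                # nt large enough that dt <= dx^2 (explicit stability)
x = np.linspace(-L, L, nx)
dx, dt = x[1] - x[0], T / nt
V = x ** 2                          # illustrative choice of running payout

f = np.zeros(nx)                    # final condition f(x, T) = 0
for _ in range(nt):
    fxx = np.zeros(nx)
    fxx[1:-1] = (f[2:] - 2 * f[1:-1] + f[:-2]) / dx ** 2
    f = f + dt * (0.5 * fxx + V)    # f(., t - dt) = f(., t) + dt*(f_xx/2 + V)
    f[0] = f[-1] = 0.0              # boundary values imposed by the truncation

print(f[nx // 2])                   # approximates f(0, 0); should be close to 0.5 here
```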

1.8. The Feynman Kac formula: Consider


" Z T !#
F = E exp V (X(t)dt . (9)
0

As before, we evaluate F using the related and more refined quantities


 RT 
V (Xs )ds
f (x, t) = Ex,t e t (10)

5
satisfies the backward equation
1
t f + x2 f + V (x)f = 0 . (11)
2
When someone refers to the Feynman Kac formula, they usually are referring
to the fact that (10) is a formula for the solution of the PDE (11). In our work,
the situation mostly will be reversed. We use the PDE (11) to get information
about the quantity defined by (10) or even just about the process X(t).
We can verify that (10) satisfies (11) more or less as in the preceding para-
graph. We note that
\exp\Big( \int_t^{t+\Delta t} V(X(s)) \, ds + \int_{t+\Delta t}^T V(X(s)) \, ds \Big) = \exp\Big( \int_t^{t+\Delta t} V(X(s)) \, ds \Big) \exp\Big( \int_{t+\Delta t}^T V(X(s)) \, ds \Big)
= \big( 1 + \Delta t \, V(X(t)) + o(\Delta t) \big) \exp\Big( \int_{t+\Delta t}^T V(X(s)) \, ds \Big) .

The expectation of the right side with respect to F_{t+Δt} is

\big( 1 + \Delta t \, V(X(t)) + o(\Delta t) \big) \, f(X(t+\Delta t), \, t+\Delta t) .

When we now take the expectation with respect to F_t, which amounts to averaging over ΔX, and use the Taylor expansion of f about f(x, t) as before, we get (11).

1.9. The Feynman integral: A precursor to the Feynman Kac formula is the Feynman integral2 solution to the Schrodinger equation. The Feynman integral is not an integral in the sense of measure theory. (Neither is the Ito integral, for that matter.) The colorful probabilist Marc Kac (pronounced Katz) discovered that an actual integral over Wiener measure (10) gives the solution of (11). Feynman's reasoning will help us derive the Girsanov formula, so we pause to sketch it.
The finite difference approximation
\int_0^T V(X(t)) \, dt \approx \Delta t \sum_{k=0}^{n-1} V(X(t_k)) ,   (12)

(always Δt = T/n, t_k = kΔt) leads to an approximation to F of the form

F_n = E\Big[ \exp\Big( \Delta t \sum_{k=0}^{n-1} V(X(t_k)) \Big) \Big] .   (13)
2 The American physicist Richard Feynman was born and raised in Far Rockaway (a neighborhood of Queens, New York). He is the author of several wonderful popular books, including Surely You're Joking, Mr. Feynman and The Feynman Lectures on Physics.

The functional F_n depends only on finitely many values X_k = X(t_k), so we may evaluate (13) using the known joint density function for \vec{X} = (X_1, \ldots, X_n). The density is (see Path probabilities from Lecture 5):

U^{(n)}(\vec{x}) = \frac{1}{(2\pi\Delta t)^{n/2}} \exp\Big( -\sum_{k=0}^{n-1} (x_{k+1} - x_k)^2 / 2\Delta t \Big) .

It is suggestive to rewrite this as


" n1  2 #
(n) 1 t X xk+1 xk
U (~x) = exp . (14)
(2tn/2 2
k=0
t

Using this to evaluate Fn gives


" n1 n1
#
1
Z X t X  xk+1 xk 2
Fn = Rn exp t V (xk ) d~x . (15)
(2tn/2 k=0
2
k=0
t

It is easy to show that F_n → F as n → ∞ as long as V(x) is, say, continuous and bounded (see below).
Feynman proposed a view of F = lim_{n→∞} F_n in (15) that is not mathematically rigorous but explains what's going on. If x_k = x(t_k), then we should have

\Delta t \sum_{k=0}^{n-1} V(x_k) \to \int_0^T V(x(t)) \, dt .

Also,

\frac{x_{k+1} - x_k}{\Delta t} \approx \frac{dx}{dt}(t_k) = \dot{x}(t_k) ,

so we should also have

\frac{\Delta t}{2} \sum_{k=0}^{n-1} \Big( \frac{x_{k+1} - x_k}{\Delta t} \Big)^2 \to \frac{1}{2} \int_0^T \dot{x}(t)^2 \, dt .

As n → ∞, the integral over R^n should converge to the integral over all paths x(t). We denote this by P without worrying about exactly which paths are allowed (continuous, differentiable, ...?). The integration element d\vec{x} has the possible formal limit

d\vec{x} = \prod_{k=0}^{n-1} dx_k = \prod_{k=0}^{n-1} dx(t_k) \to \prod_{t=0}^{T} dx(t) .

Altogether, this gives the formal expression for the limit of (15):
F = \mathrm{const} \int_P \exp\Big( \int_0^T V(x(t)) \, dt - \frac{1}{2} \int_0^T \dot{x}(t)^2 \, dt \Big) \prod_{t=0}^{T} dx(t) .   (16)

1.10. Feynman and Wiener integration: Mathematicians were quick to complain about (16). For one thing, the constant const = lim_{n→∞} (2πΔt)^{-n/2} should be infinite. More seriously, there is no abstract integral measure corresponding to \int_P \prod_{t=0}^T dx(t) (it is possible to prove this). Kac proposed to write (16) as

F = \int_P \exp\Big( \int_0^T V(x(t)) \, dt \Big) \Big[ \mathrm{const} \cdot \exp\Big( -\frac{1}{2} \int_0^T \dot{x}(t)^2 \, dt \Big) \prod_{t=0}^{T} dx(t) \Big] .

and then interpret the latter part as Wiener measure (dP ):


\mathrm{const} \cdot \exp\Big( -\frac{1}{2} \int_0^T \dot{x}(t)^2 \, dt \Big) \prod_{t=0}^{T} dx(t) = dP(X) .   (17)

In fact, we have already implicitly argued informally (and it can be formalized)


that
U^{(n)}(\vec{x}) \prod_{k=0}^{n-1} dx_k \to dP(X) \quad \text{as } n \to \infty .
These intuitive but mathematically nonsensical formulas are a great help in understanding Brownian motion. For one thing, (17) makes clear that Wiener measure is Gaussian. Its density has the form const · exp(−Q(x)), where Q(x) is a positive quadratic function of x. Here Q(x) = \frac{1}{2}\int \dot{x}(t)^2 \, dt (and the constant is, alas, infinite). Moreover, in many cases it is possible to approximate integrals of the form \int \exp(\phi(\vec{x})) \, d\vec{x} by e^{\phi^*}, where \phi^* = \max_{\vec{x}} \phi(\vec{x}), if \phi is sharply peaked around its maximum. This is particularly common in rare event or large deviation problems. In our case, this would lead us to solve the calculus of variations problem

\max_x \Big( \int_0^T V(x(t)) \, dt - \frac{1}{2} \int_0^T \dot{x}(t)^2 \, dt \Big) .

1.11. Application of Feynman Kac: The problem of evaluating


" Z T !#
f = E exp V (Xt )dt
0

arises in many situations. In finance, f could represent the present value of


a payment in the future subject to unknown fluctuating interest rates. The
PDE (11) provides a possible way to evaluate f = f (0, 0), either analytically or
numerically.
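As a complement to the PDE route, here is a hedged Monte Carlo sketch of the discretized Feynman Kac functional (13), with X a standard Brownian motion. The choice V(x) = -|x| (negative, so the exponential stays bounded), the horizon T, and the sample sizes are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
T, n, n_paths = 1.0, 200, 50_000
dt = T / n
V = lambda x: -np.abs(x)            # hypothetical potential V

# Brownian paths: cumulative sums of N(0, dt) increments, starting at X(0) = 0
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
X = np.cumsum(dW, axis=1)
X = np.hstack([np.zeros((n_paths, 1)), X[:, :-1]])  # left endpoints X(t_k), k = 0..n-1

F_n = np.mean(np.exp(dt * np.sum(V(X), axis=1)))    # Monte Carlo estimate of (13)
print(F_n)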

2 Mathematical formalism

2.1. Introduction: We examine the solution formulas for the backward and
forward equation from two points of view. The first is an analogy with linear

algebra, with function spaces playing the role of vector space and operators
playing the role of matrices. The second is a more physical picture, interpreting
G(x, y, t) as the Green's function describing the forward diffusion of a point mass
of probability or the backward diffusion of a localized unit of payout.

2.2. Solution operator: As time moves forward, the probability density for X_t changes, or evolves. As time moves backward, the value function f(x, t) also evolves.3 The backward evolution process is given by (for s > 0; this is a consequence of the tower property)

f(x, t - s) = \int G(x, y, s) f(y, t) \, dy .   (18)

We write this abstractly as f(t - s) = G(s) f(t).


This formula is analogous to the comparable Markov chain formula f(t - s) = P^s f(t). In the Markov chain case, s and t are integers and f(t) represents a vector in R^n whose components are f_k(t). Here, f(t) is a function of x whose values are f(x, t). We can think of P^s as an n × n matrix or as the linear operator that transforms the vector f to the vector g = P^s f. Similarly, G(s) is a linear operator, transforming a function f into g, with

g(x) = \int G(x, y, s) f(y) \, dy .

The operation is linear, which means that G(a f^{(1)} + b f^{(2)}) = a G f^{(1)} + b G f^{(2)}. The family of operators G(s) for s > 0 produces the solution to the backward equation, so we call G(s) the solution operator for time s.

2.3. Duhamel's principle: The inhomogeneous backward equation

\partial_t f + \tfrac{1}{2} \partial_x^2 f + V(x, t) = 0 ,   (19)

with homogeneous4 final condition f(x, T) = 0, may be solved by

f(x, t) = E_{x,t}\Big[ \int_t^T V(X(t'), t') \, dt' \Big] .

Exchanging the order of integration, we may write


f(x, t) = \int_{t'=t}^T g(x, t, t') \, dt' ,   (20)

where

g(x, t, t') = E_{x,t}\big[ V(X(t'), t') \big] .
3 Unlike biological evolution, this evolution process makes the solution less complicated, not more.
4 We often say homogeneous to mean zero and inhomogeneous to mean not zero. That

may be because if V (x, t) is zero then it is constant, i.e. the same everywhere, which is the
usual meaning of homogeneous.

This g is the expected value (at (x, t)) of a payout (V(·, t') at time t' > t). As such, g is the solution of a homogeneous final value problem with inhomogeneous final values:

\partial_t g + \tfrac{1}{2} \partial_x^2 g = 0 \quad \text{for } t < t' ,
g(x, t') = V(x, t') .   (21)

Duhamel's principle, which we just demonstrated, is as follows. To solve the inhomogeneous final value problem (19), we solve a homogeneous final value problem (21) for each t' between t and T, then we add up the results (20).

2.4. Infinitesimal generator: There are matrices of many different types that
play various roles in theory and computation. And so it is with operators. In
addition to the solution operator, there is the infinitesimal generator (or simply
generator). For Brownian motion in one dimension, the generator is

L = \tfrac{1}{2} \partial_x^2 .   (22)

The backward equation may be written

\partial_t f + L f = 0 .   (23)

For other diffusion processes, the generator is the operator L that puts the backward equation for the process in the form (23).
Just as a matrix has a transpose, an operator has an adjoint, written L^*. The forward equation takes the form

\partial_t u = L^* u .

The operator (22) for Brownian motion is self adjoint, which means that L^* = L, which is why the operator \tfrac{1}{2}\partial_x^2 appears in both. We will return to these points later.

2.5. Composing (multiplying) operators: If A and B are matrices, then there


are two ways to form the matrix AB. One way is to multiply the matrices. The
other is to compose the linear transformations: f → Bf → ABf. In this way,
AB is the composite linear transformation formed by first applying B then
applying A. We also can compose operators, even if we sometimes lack a good
explicit representation for the composite AB. As with matrices, composition of
operators is associative: A(Bf ) = (AB)f .

2.6. Composing solution operators: The solution operator G(s_1) moves the value function backward in time by the amount s_1, which is written f(t - s_1) = G(s_1) f(t). The operator G(s_2) moves it back an additional s_2, i.e. f(t - (s_1 + s_2)) = G(s_2) f(t - s_1) = G(s_2) G(s_1) f(t). The result is to move f back by s_1 + s_2 in total, which is the same as applying G(s_1 + s_2). This shows that for every (allowed) f, G(s_2) G(s_1) f = G(s_2 + s_1) f, which means that

G(s_2) G(s_1) = G(s_2 + s_1) .   (24)

This is called the semigroup property. It is a basic property of the solution
operator for any problem. The matrix analogue for Markov chains is P^{s_2 + s_1} = P^{s_2} P^{s_1}, which is a basic fact about powers of matrices having nothing to do
with Markov chains. The property (24) would be called the group property if
we were to allow negative s2 or s1 , which we do not. Negative s is allowed in
the matrix version if P is nonsingular. There is no particular physical reason
for the transition matrix of a Markov chain to be non singular.

2.7. Operator kernels: If matrix A has elements A_{jk}, we can compute g = Af by doing the sum g_j = \sum_k A_{jk} f_k. Similarly, operator A may or may not have a kernel5, which is a function A(x, y) so that g = Af is represented by

g(x) = \int A(x, y) f(y) \, dy .

If operators A and B both have kernels, then the composite operator has the kernel

(AB)(x, y) = \int A(x, z) B(z, y) \, dz .   (25)
To derive this formula, set g = Bf and h = Ag. Then h(x) = \int A(x, z) g(z) \, dz and g(z) = \int B(z, y) f(y) \, dy, which implies that

h(x) = \int \Big( \int A(x, z) B(z, y) \, dz \Big) f(y) \, dy .

This shows that (25) is the kernel of AB. The formula is analogous to the formula for matrix multiplication.

2.8. The semigroup property: When we defined the solution operators G(s) in (18), we did so by specifying the kernels

G(x, y, s) = \frac{1}{\sqrt{2\pi s}} e^{-(x-y)^2/2s} .

According to (25), the semigroup property should be an integral identity involving G. The identity is

G(x, y, s_2 + s_1) = \int G(x, z, s_2) G(z, y, s_1) \, dz .

More concretely:

\frac{1}{\sqrt{2\pi(s_2+s_1)}} e^{-(x-y)^2/2(s_2+s_1)} = \int \frac{1}{\sqrt{2\pi s_2}} \frac{1}{\sqrt{2\pi s_1}} e^{-(x-z)^2/2s_2} e^{-(z-y)^2/2s_1} \, dz .
5 The term kernel also describes vectors f with Af = 0; it is unfortunate that the same word is used for these different objects.

The reader is encouraged to verify this by direct integration. It also can be
verified by recognizing it as the statement that adding independent mean zero
Gaussian random variables with variance s2 and s1 respectively gives a Gaussian
with variance s2 + s1 .
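A quick numerical check of this identity is easy to set up. The following sketch (not from the notes) verifies the convolution form of the semigroup property by quadrature; the particular x, y, s_1, s_2 and the integration grid are assumptions.

import numpy as np

def G(x, y, s):
    # Heat kernel: density of an N(y, s) variable evaluated at x
    return np.exp(-(x - y) ** 2 / (2 * s)) / np.sqrt(2 * np.pi * s)

x, y, s1, s2 = 0.7, -0.3, 0.4, 0.9
z = np.linspace(-10, 10, 20001)            # quadrature grid for the z integral
lhs = G(x, y, s1 + s2)
rhs = np.trapz(G(x, z, s2) * G(z, y, s1), z)
print(lhs, rhs)                            # the two numbers should agree closely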

2.9. Fundamental solution: The operators G(t) form a fundamental solution6


for the problem f_t + Lf = 0 if

\partial_t G = L G , \quad \text{for } t > 0 ,   (26)
G(0) = I .   (27)

The property (26) really means that \partial_t (G(t) f) = L (G(t) f) for any f. If G(t) has a kernel G(x, y, t), this in turn means (as the reader should check) that

\partial_t G(x, y, t) = L_x G(x, y, t) ,   (28)

where L_x means that the derivatives in L are with respect to the x variable in G. In our case with G being the heat kernel, this is

\partial_t \Big( \frac{1}{\sqrt{2\pi t}} e^{-(x-y)^2/2t} \Big) = \tfrac{1}{2} \partial_x^2 \Big( \frac{1}{\sqrt{2\pi t}} e^{-(x-y)^2/2t} \Big) ,

which we have checked and rechecked.
Without matrices, we still have the identity operator: If = f for all f. The property (27) really means that G(t) f → f as t → 0. It is easy to verify this for our heat kernel provided that f is continuous.

2.10. Duhamel with fundamental solution operator: The g appearing in (20) may be expressed as g(t, t') = G(t' - t) V(t'), where V(t') is the function with values V(x, t'). This puts (20) in the form

f(t) = \int_t^T G(t' - t) V(t') \, dt' .   (29)

We illustrate the properties of the fundamental solution operator by verifying (29) directly. We want to show that (29) implies that \partial_t f + L f + V(t) = 0 and f(T) = 0. The latter is clear. For the former we compute \partial_t f(t) by differentiating the right side of (29):

\partial_t \int_t^T G(t' - t) V(t') \, dt' = -G(0) V(t) + \int_t^T \partial_t G(t' - t) \, V(t') \, dt' .

We write G'(t) to represent \partial_t G(t), so that \partial_t G(t' - t) = -G'(t' - t) = -L G(t' - t). Since G(0) = I, the right side is

-V(t) - \int_t^T L G(t' - t) V(t') \, dt' = -V(t) - L \int_t^T G(t' - t) V(t') \, dt' .

If we take L outside the integral on the right, we recognize what is left in the integral as f(t). Altogether, we have \partial_t f = -V(t) - L f(t), which is the inhomogeneous backward equation (19).
6 We have adjusted this definition from its original form in books on ordinary differential equations to accommodate the backward evolution of the backward equation. This amounts to reversing the sign of L.

2.11. Green's function: Consider the solution formula for the homogeneous final value problem \partial_t f + L f = 0, f(T) = V:

f(x, t) = \int G(x, y, T - t) V(y) \, dy .   (30)

Consider a special jackpot payout V(y) = \delta(y - x_0). If you like, you can think of V(y) = 1/(2\epsilon) when |y - x_0| < \epsilon and then let \epsilon \to 0. We then get f(x, t) = G(x, x_0, T - t). The function that satisfies \partial_t G + L_x G = 0, G(x, T) = \delta(x - x_0), is called the Green's function7. The Green's function represents the result of a point mass payout. A general payout can be expressed as a sum (integral) of point mass payouts at x_0 with weight V(x_0):

V(y) = \int V(x_0) \, \delta(y - x_0) \, dx_0 .

Since the backward equation is linear, the general value function will be the
weighted sum (integral) of the point mass value functions, which is the formula
(30).
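The solution formula (30) is easy to evaluate numerically for Brownian motion, since G is the heat kernel. The sketch below (my illustration; the call-style payout and the quadrature grid are assumptions) computes f(x, t) at one point by direct integration.

import numpy as np

def heat_kernel(x, y, s):
    return np.exp(-(x - y) ** 2 / (2 * s)) / np.sqrt(2 * np.pi * s)

T, t = 1.0, 0.25
V = lambda y: np.maximum(y - 1.0, 0.0)     # hypothetical call-style payout
y = np.linspace(-8, 10, 4001)              # quadrature grid for the y integral

x = 1.2                                    # point at which to evaluate the value function
f_x_t = np.trapz(heat_kernel(x, y, T - t) * V(y), y)
print(f_x_t)   # approximately E[ V(W(T)) | W(t) = x ]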

2.12. More generally: Brownian motion is special in that G(x, y, t) is a function of x - y. This is because Brownian motion is translation invariant: a Brownian motion starting from any point looks like a Brownian motion starting from any other point. Brownian motion is also special in that the forward equation and backward equation are nearly the same, having the same spatial operator L = \tfrac{1}{2}\partial_x^2.
More general diffusion processes lose both these properties. The solution operator depends in a more complicated way on x and y. The backward equation is \partial_t f + L f = 0 but the forward equation is \partial_t u = L^* u. The Green's function, G(x, y, t), is the fundamental solution for the backward equation in the x, t variables with y as a parameter. It also is the fundamental solution to the forward equation in the y, t variables with x as a parameter. This material will be in a future lecture.

7 This is in honor of a 19th century Englishman named Green.

Stochastic Calculus Notes, Lecture 7
Last modified December 3, 2004

1 The Ito integral with respect to Brownian motion

1.1. Introduction: Stochastic calculus is about systems driven by noise. The


Ito calculus is about systems driven by white noise, which is the derivative of
Brownian motion. To find the response of the system, we integrate the forcing,
which leads to the Ito integral, of a function against the derivative of Brownian
motion.
The Ito integral, like the Riemann integral, has a definition as a certain limit.
The fundamental theorem of calculus allows us to evaluate Riemann integrals
without returning to its original definition. Itos lemma plays that role for Ito
integration. Itos lemma has an extra term not present in the fundamental
theorem that is due to the non smoothness of Brownian motion paths. We will
explain the formal rule: dW 2 = dt, and its meaning.
In this section, standard one dimensional Brownian motion is W(t) (W(0) = 0, E[W(t)^2] = t). The change in Brownian motion in time dt is formally called dW(t). The independent increments property implies that dW(t) is independent of dW(t') when t ≠ t'. Therefore, the dW(t) are a model of driving noise impulses acting on a system that are independent from one time to another. We want a rule to add up the cumulative effects of these impulses. In the first instance, this is the integral

Y(T) = \int_0^T F(t) \, dW(t) .   (1)

Our plan is to lay out the principal ideas first, then address the mathemat-
ical foundations for them later. There will be many points in the beginning
paragraphs where we appeal to intuition rather than to mathematical analysis
in making a point. To justify this approach, I (mis)quote a snippet of a poem I
memorized in grade school: So you have built castles in the sky. That is where
they should be. Now put the foundations under them. (Author unknown by
me).

1.2. The Ito integral: Let F_t be the filtration generated by Brownian motion up to time t, and let F(t) ∈ F_t be an adapted stochastic process. Corresponding to the Riemann sum approximation to the Riemann integral we define the following approximations to the Ito integral

Y_{\Delta t}(t) = \sum_{t_k < t} F(t_k) \, \Delta W_k ,   (2)

with the usual notations t_k = k\Delta t and \Delta W_k = W(t_{k+1}) - W(t_k). If the limit exists, the Ito integral is

Y(t) = \lim_{\Delta t \to 0} Y_{\Delta t}(t) .   (3)

There is some flexibility in this definition, though far less than with the Riemann integral. It is absolutely essential that we use the forward difference rather than, say, the backward difference ((wrong) \Delta W_k = W(t_k) - W(t_{k-1})), so that

E\big[ F(t_k) \, \Delta W_k \mid F_{t_k} \big] = 0 .   (4)

Each of the terms in the sum (2) is measurable in F_t, therefore Y_{\Delta t}(t) is also. If we evaluate at the discrete times t_n, Y_{\Delta t} is a martingale:

E\big[ Y_{\Delta t}(t_{n+1}) \mid F_{t_n} \big] = Y_{\Delta t}(t_n) .

In the limit Δt → 0 this should make Y(t) also a martingale measurable in F_t.

1.3. Famous example: The simplest interesting integral with an F(t) that is random is

Y(T) = \int_0^T W(t) \, dW(t) .

If W(t) were differentiable with derivative \dot{W}, we could calculate the limit of (2) using dW(t) = \dot{W}(t) dt as

(wrong)   \int_0^T W(t) \dot{W}(t) \, dt = \tfrac{1}{2} \int_0^T \partial_t W(t)^2 \, dt = \tfrac{1}{2} W(T)^2 .   (wrong)   (5)

But this is not what we get from the definition (2) with the actual rough path Brownian motion. Instead we write

W(t_k) = \tfrac{1}{2}\big( W(t_{k+1}) + W(t_k) \big) - \tfrac{1}{2}\big( W(t_{k+1}) - W(t_k) \big) ,

and get

Y_{\Delta t}(t_n) = \sum_{k<n} W(t_k)\big( W(t_{k+1}) - W(t_k) \big)
= \tfrac{1}{2} \sum_{k<n} \big( W(t_{k+1}) + W(t_k) \big)\big( W(t_{k+1}) - W(t_k) \big) - \tfrac{1}{2} \sum_{k<n} \big( W(t_{k+1}) - W(t_k) \big)\big( W(t_{k+1}) - W(t_k) \big)
= \tfrac{1}{2} \sum_{k<n} \big( W(t_{k+1})^2 - W(t_k)^2 \big) - \tfrac{1}{2} \sum_{k<n} \big( W(t_{k+1}) - W(t_k) \big)^2 .

The first sum on the bottom right telescopes to (since W(0) = 0)

\tfrac{1}{2} W(t_n)^2 .
The second term is a sum of n independent random variables, each with expected value Δt/2 and variance Δt^2/2. As a result, the sum is a random variable with mean nΔt/2 = t_n/2 and variance nΔt^2/2 = t_n Δt/2. This implies that

\tfrac{1}{2} \sum_{t_k < T} \big( W(t_{k+1}) - W(t_k) \big)^2 \to T/2 \quad \text{as } \Delta t \to 0 .   (6)

Together, these results give the correct Ito answer

\int_0^T W(t) \, dW(t) = \tfrac{1}{2}\big( W(T)^2 - T \big) .   (7)

The difference between the right answer (7) and the wrong answer (5) is the
T /2 coming from (6). This is a quantitative consequence of the roughness of
Brownian motion paths. If W (t) were a differentiable function of t, that term
would have the approximate value
\Delta t \int_0^T \Big( \frac{dW}{dt} \Big)^2 dt \to 0 \quad \text{as } \Delta t \to 0 .
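The computation above is easy to reproduce numerically. The following sketch (not part of the notes) forms the forward-difference Ito sums (2) for the integrand F(t) = W(t) on a fine grid and compares them with (W(T)^2 - T)/2; the grid size and random seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 100_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), n)
W = np.concatenate([[0.0], np.cumsum(dW)])   # W(t_0), ..., W(t_n)

ito_sum = np.sum(W[:-1] * dW)                # forward difference: F(t_k) dW_k
print(ito_sum)                               # close to (W(T)^2 - T)/2, not W(T)^2/2
print(0.5 * (W[-1] ** 2 - T))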

1.4. Backward differencing, etc: If we use the backward difference \Delta W_k = W(t_k) - W(t_{k-1}), then the martingale property (4) does not hold. For example, if F(t) = W(t) as above, then the right side changes from zero to (W(t_n) - W(t_{n-1})) W(t_n) (all quantities measurable in F_{t_n}), which has expected value1 Δt. In fact, if we use the backward difference and follow the argument used to get (7), we get instead \tfrac{1}{2}(W(T)^2 + T). In addition to the Ito integral there is a Stratonovich integral, which uses the central difference \Delta W_k = \tfrac{1}{2}(W(t_{k+1}) - W(t_{k-1})). The Stratonovich definition makes the stochastic integral act more like a Riemann integral. In particular, the reader can check that the Stratonovich integral of W dW is \tfrac{1}{2} W(T)^2.

1.5. Martingales: The Ito integral is a martingale. It was defined for that
purpose. Often one can compute an Ito integral by starting with the ordinary
calculus guess (such as \tfrac{1}{2} W(T)^2) and asking what needs to change to make the answer a martingale. In this case, the balancing term -T/2 does the trick.

1.6. The Ito differential: Ito's lemma is a formula for the Ito differential, which, in turn, is defined using the Ito integral. Let F(t) be a stochastic process. We say dF = a(t) dW(t) + b(t) dt (the Ito differential) if

F(T) - F(0) = \int_0^T a(t) \, dW(t) + \int_0^T b(t) \, dt .   (8)

The first integral on the right is an Ito integral and the second is a Riemann integral. Both a(t) and b(t) may be stochastic processes (random functions of
1 E[(W(t_n) - W(t_{n-1})) W(t_{n-1})] = 0, so E[(W(t_n) - W(t_{n-1})) W(t_n)] = E[(W(t_n) - W(t_{n-1}))(W(t_n) - W(t_{n-1}))] = Δt.

time). For example, the Ito differential of W(t)^2 is

d\big( W(t)^2 \big) = 2 W(t) \, dW(t) + dt ,

which we verify by checking that

W(T)^2 = 2 \int_0^T W(t) \, dW(t) + \int_0^T dt .

This is a restatement of (7).

1.7. Ito's lemma: The simplest version of Ito's lemma involves a function f(w, t). The lemma is the formula (which must have been stated as a lemma in one of his papers):

df(W(t), t) = \partial_w f(W(t), t) \, dW(t) + \tfrac{1}{2} \partial_w^2 f(W(t), t) \, dt + \partial_t f(W(t), t) \, dt .   (9)

According to the definition of the Ito differential, this means that

f(W(T), T) - f(W(0), 0)   (10)
= \int_0^T \partial_w f(W(t), t) \, dW(t) + \int_0^T \Big( \tfrac{1}{2} \partial_w^2 f(W(t), t) + \partial_t f(W(t), t) \Big) dt .   (11)

1.8. Using Ito's lemma to evaluate an Ito integral: Like the fundamental theorem of calculus, Ito's lemma can be used to evaluate integrals. For example, consider

Y(T) = \int_0^T W(t)^2 \, dW(t) .

A naive guess might be \tfrac{1}{3} W(T)^3, which would be the answer for a differentiable function. To check this, we calculate (using (9), \partial_w \tfrac{1}{3} w^3 = w^2, and \tfrac{1}{2} \partial_w^2 \tfrac{1}{3} w^3 = w)

d\big( \tfrac{1}{3} W(t)^3 \big) = W(t)^2 \, dW(t) + W(t) \, dt .

This implies that

\tfrac{1}{3} W(T)^3 = \int_0^T d\big( \tfrac{1}{3} W(t)^3 \big) = \int_0^T W(t)^2 \, dW(t) + \int_0^T W(t) \, dt ,

which in turn gives

\int_0^T W(t)^2 \, dW(t) = \tfrac{1}{3} W(T)^3 - \int_0^T W(t) \, dt .

This seems to be the end. There is no way to integrate Z(T) = \int_0^T W(t) \, dt to get a function of W(T) alone. This is to say that Z(T) is not measurable in G_T, the algebra generated by W(T) alone. In fact, Z(T) depends equally on all W(t) values for 0 ≤ t ≤ T. A more technical version of this remark is coming after the discussion of the Brownian bridge.

1.9. To tell a martingale: Suppose F(t) is an adapted stochastic process with dF(t) = a(t) dW(t) + b(t) dt. Then F is a martingale if and only if b(t) = 0. We call a(t) dW(t) the martingale part and b(t) dt the drift term. If b(t) is at all continuous, then it can be identified through (because E[\int a(s) dW(s) \mid F_t] = 0)

E\big[ F(t+\Delta t) - F(t) \mid F_t \big] = E\Big[ \int_t^{t+\Delta t} b(s) \, ds \,\Big|\, F_t \Big] = b(t)\Delta t + o(\Delta t) .   (12)

We give one and a half of the two parts of the proof of this theorem. If b = 0 for all t (and all, or almost all, ω), then F(T) is an Ito integral and hence a martingale. Conversely, if b(t) is a continuous function of t that is not identically zero, then we may find a t_* and ε > 0 and δ > 0 so that, say, b(t) > δ > 0 when |t - t_*| < ε. Then E[F(t_* + ε) - F(t_* - ε)] > 2εδ > 0, so F is not a martingale.2

1.10. Deriving a backward equation: Ito's lemma gives a quick derivation of backward equations. For example, take

f(W(t), t) = E\big[ V(W(T)) \mid F_t \big] .

The tower property tells us that F(t) = f(W(t), t) is a martingale. But Ito's lemma, together with the previous paragraph, implies that f(W(t), t) is a martingale if and only if \partial_t f + \tfrac{1}{2}\partial_w^2 f = 0, which is the backward equation for this case. In fact, the proof of Ito's lemma (below) is much like the proof of this backward equation.

1.11. A backward equation with drift: The derivation of the backward equation for

f(w, t) = E_{w,t}\Big[ \int_t^T V(W(s), s) \, ds \Big]

uses the above, plus (12). Again using

F(t) = E\Big[ \int_t^T V(W(s), s) \, ds \,\Big|\, F_t \Big] ,

with F(t) = f(W(t), t), we calculate

E\big[ F(t+\Delta t) - F(t) \mid F_t \big] = -E\Big[ \int_t^{t+\Delta t} V(W(s), s) \, ds \,\Big|\, F_t \Big] = -V(W(t), t)\Delta t + o(\Delta t) .


2 This is a somewhat incorrect version of the proof because ε, δ, and t_* probably are random. There is a real proof something like this.
This says that dF(t) = a(t) dW(t) + b(t) dt where

b(t) = -V(W(t), t) .

But also, b(t) = \partial_t f + \tfrac{1}{2}\partial_w^2 f. Equating these gives the backward equation from Lecture 6:

\partial_t f + \tfrac{1}{2}\partial_w^2 f + V(w, t) = 0 .

1.12. Proof of Ito's lemma: We want to show that

f(W(T), T) - f(W(0), 0) = \int_0^T f_w(W(t), t) \, dW(t) + \int_0^T f_t(W(t), t) \, dt + \tfrac{1}{2} \int_0^T f_{ww}(W(t), t) \, dt .   (13)

Define \Delta t = T/n, t_k = k\Delta t, W_k = W(t_k), \Delta W_k = W(t_{k+1}) - W(t_k), and f_k = f(W_k, t_k), and write

f_n - f_0 = \sum_{k=0}^{n-1} \big( f_{k+1} - f_k \big) .   (14)

Taylor series expansion of the terms on the right of (14) will produce terms that converge to the three integrals on the right of (13) plus error terms that converge to zero. In our pre-Ito derivations of backward equations, we used the relation E[(\Delta W)^2] = \Delta t. Here we argue that with many independent \Delta W_k, we may replace (\Delta W_k)^2 with \Delta t (its mean value).
The Taylor series expansion is

f_{k+1} - f_k = \partial_w f_k \, \Delta W_k + \tfrac{1}{2} \partial_w^2 f_k \, (\Delta W_k)^2 + \partial_t f_k \, \Delta t + R_k ,   (15)

where \partial_w f_k means \partial_w f(W(t_k), t_k), etc. The remainder has the bound3

|R_k| \le C\big( \Delta t^2 + \Delta t \, |\Delta W_k| + |\Delta W_k|^3 \big) .

Finally, we separate the mean value of \Delta W_k^2 from the deviation from the mean:

\tfrac{1}{2} \partial_w^2 f_k \, \Delta W_k^2 = \tfrac{1}{2} \partial_w^2 f_k \, \Delta t + \tfrac{1}{2} \partial_w^2 f_k \, (\Delta W_k^2 - \Delta t) .
The individual summands on the right side all have order of magnitude \Delta t. However, the mean zero terms (the second sum) add up to much less than the first sum, as we will see. With this, (14) takes the form

f_n - f_0 = \sum_{k=0}^{n-1} \partial_w f_k \, \Delta W_k + \sum_{k=0}^{n-1} \partial_t f_k \, \Delta t + \tfrac{1}{2} \sum_{k=0}^{n-1} \partial_w^2 f_k \, \Delta t + \tfrac{1}{2} \sum_{k=0}^{n-1} \partial_w^2 f_k \, (\Delta W_k^2 - \Delta t) + \sum_{k=0}^{n-1} R_k .   (16)
3 We assume that f(w, t) is thrice differentiable with bounded third derivatives. The error in a finite Taylor approximation is bounded by the size of the largest terms not used. Here, that is \Delta t^2 (for the omitted term \partial_t^2 f), \Delta t\,\Delta W (for \partial_t \partial_w f), and \Delta W^3 (for \partial_w^3 f).

The first three sums on the right converge respectively to the corresponding
integrals on the right side of (13). A technical digression will show that the last
two converge to zero as n → ∞ in a suitable way.

1.13. Like Borel Cantelli: As much as the formulas, the proofs in stochastic calculus rely on calculating expected values of things. Here, S_m is a sequence of random numbers and we want to show that S_m → 0 as m → ∞ (almost surely). We use two observations. First, if s_m is a sequence of numbers with \sum_{m=1}^\infty |s_m| < \infty, then s_m → 0 as m → ∞. Second, if B ≥ 0 is a random variable with E[B] < ∞, then B < ∞ almost surely (if the event \{B = \infty\} has positive probability, then E[B] = ∞). We take B = \sum_{m=1}^\infty |S_m|. If B < ∞ then \sum_{m=1}^\infty |S_m| < ∞, so S_m → 0 as m → ∞. What this shows is:

\sum_{m=1}^{\infty} E[|S_m|] < \infty \implies S_m \to 0 \text{ as } m \to \infty \quad \text{(a.s.)}   (17)

This observation is a variant of the Borel Cantelli lemma, which often is used in such arguments.

1.14. One of the error terms: To apply the Borel Cantelli lemma we must find bounds for the error terms, bounds whose sum is finite. We start with the last error term in (16). Choose n = 2^m and define S_m = \sum_{k=0}^{n-1} R_k, with

|R_k| \le C\big( \Delta t^2 + \Delta t \, |\Delta W_k| + |\Delta W_k|^3 \big) .

Since E[|\Delta W_k|] \le C\sqrt{\Delta t} and E[|\Delta W_k|^3] \le C\Delta t^{3/2} (you do the math: the Gaussian integrals), this gives (with n\Delta t = T)

E[|S_m|] \le C n \big( \Delta t^2 + \Delta t^{3/2} \big) \le C T \sqrt{\Delta t} .

Expressed in terms of m, we have \Delta t = T/2^m and \sqrt{\Delta t} = \sqrt{T}\,2^{-m/2} = \sqrt{T}\,(\sqrt{2})^{-m}. Therefore E[|S_m|] \le C(T)\,(\sqrt{2})^{-m}. Now, if z is any number greater than one, then \sum_{m=1}^\infty z^{-m} = 1/(z-1) < \infty. This implies that \sum_{m=1}^\infty E[|S_m|] < \infty and (using Borel Cantelli) that S_m → 0 as m → ∞ (almost surely).
This argument would not have worked this way had we taken n = m instead of n = 2^m. The error bounds of order 1/\sqrt{n} would not have had a finite sum. If both error terms in the bottom line of (16) go to zero as m → ∞ with n = 2^m, this will prove Ito's lemma. We will return to this point when we discuss the difference between almost sure convergence, which we are using here, and convergence in probability, which we are not.

1.15. The other sum: The other error sum in (16) is small not because of the smallness of its terms, but because of cancellation. The positive and negative terms roughly balance, leaving a sum smaller than the sizes of the terms would suggest. This cancellation is of the same sort appearing in the central limit theorem, where U_n = \sum_{k=0}^{n-1} X_k is of order \sqrt{n} rather than n when the X_k are i.i.d. with mean zero and finite variance. In fact, using a trick we used before, we show that U_n^2 is of order n rather than n^2:

E[U_n^2] = \sum_{jk} E[X_j X_k] = n E[X_k^2] = c n .

Our sum is

U_n = \tfrac{1}{2} \sum_k \partial_w^2 f(W_k, t_k) \, \big( \Delta W_k^2 - \Delta t \big) .

The above argument applies, though the terms are not independent. Suppose j ≠ k and, say, k > j. The cross term involving \Delta W_j and \Delta W_k still vanishes because

E\big[ \Delta W_k^2 - \Delta t \mid F_{t_k} \big] = 0 ,

and the rest is in F_{t_k}. Also (as we have used before)

E\big[ (\Delta W_k^2 - \Delta t)^2 \mid F_{t_k} \big] = 2\Delta t^2 .

Therefore

E[U_n^2] = \tfrac{1}{4} \sum_{k=0}^{n-1} \big( \partial_w^2 f(W_k, t_k) \big)^2 \, 2\Delta t^2 \le C(T)\Delta t .

As before, we take n = 2^m and sum to find that U_{2^m}^2 → 0 as m → ∞, which of course implies that U_{2^m} → 0 as m → ∞ (almost surely).

1.16. Convergence of Ito sums: Choose Δt and define t_k = kΔt and W_k = W(t_k). To approximate the Ito integral

Y(T) = \int_0^T F(t) \, dW(t) ,

we have the Ito sums

Y_m(T) = \sum_{t_k < T} F(t_k) \, (W_{k+1} - W_k) ,   (18)

where Δt = 2^{-m}. In proving convergence of Riemann sums to the Riemann integral, we assume that the integrand is continuous. Here, we will prove that \lim_{m\to\infty} Y_m(T) exists under the hypothesis

E\big[ (F(t+\Delta t) - F(t))^2 \big] \le C\Delta t .   (19)

This is natural in that it represents the smoothness of Brownian motion paths. We will discuss what can be done for integrands more rough than (19).
The trick is to compare Y_m with Y_{m+1}, which is to compare the Δt approximation to the Δt/2 approximation. For that purpose, define t_{k+1/2} = (k + \tfrac{1}{2})Δt, W_{k+1/2} = W(t_{k+1/2}), etc. The t_k term in the Y_m sum corresponds to the time interval (t_k, t_{k+1}). The Y_{m+1} sum divides this interval into two subintervals of length Δt/2. Therefore, for each term in the Y_m sum there are two corresponding terms in the Y_{m+1} sum (assuming T is a multiple of Δt), and:

Y_{m+1}(T) - Y_m(T) = \sum_{t_k < T} \Big[ F(t_k)(W_{k+1/2} - W_k) + F(t_{k+1/2})(W_{k+1} - W_{k+1/2}) - F(t_k)(W_{k+1} - W_k) \Big]
= \sum_{t_k < T} (W_{k+1} - W_{k+1/2})\big( F(t_{k+1/2}) - F(t_k) \big)
= \sum_{t_k < T} R_k ,

where

R_k = (W_{k+1} - W_{k+1/2})\big( F(t_{k+1/2}) - F(t_k) \big) .

We compute E[(Y_{m+1}(T) - Y_m(T))^2] = \sum_{jk} E[R_j R_k]. As before,4 E[R_j R_k] = 0 unless j = k. Also, the independent increments property and (19) imply that5

E[R_k^2] = E\big[ (W_{k+1} - W_{k+1/2})^2 \big] \, E\big[ (F(t_{k+1/2}) - F(t_k))^2 \big] \le \frac{\Delta t}{2} \cdot C\frac{\Delta t}{2} = C\Delta t^2 .

This gives

E\big[ (Y_{m+1}(T) - Y_m(T))^2 \big] \le C \, 2^{-m} .   (20)

The convergence of the Ito sums follows from (20) using our Borel Cantelli type lemma. Let S_m = Y_{m+1} - Y_m. From (20), we have6 E[|S_m|] \le C(\sqrt{2})^{-m}. Thus

\lim_{m \to \infty} Y_m(T) = Y_1(T) + \sum_{m \ge 1} \big( Y_{m+1}(T) - Y_m(T) \big)

exists and is finite. This shows that the limit defining the Ito integral exists, at least in the case of an integrand that satisfies (19), which includes most of the cases we use.

1.17. Ito isometry formula: This is the formula

E\Big[ \Big( \int_{T_1}^{T_2} a(t) \, dW(t) \Big)^2 \Big] = \int_{T_1}^{T_2} E[a(t)^2] \, dt .   (21)

4 If j > k then E[W_{j+1} - W_{j+1/2} | F_{t_{j+1/2}}] = 0, so E[R_j R_k | F_{t_{j+1/2}}] = 0, and E[R_j R_k] = 0.
5 Mathematicians often use the same letter C to represent different constants in the same formula. For example, C_1 Δt + C_2^2 Δt ≤ CΔt really means: let C = C_1 + C_2^2; if u ≤ C_1 Δt and v ≤ C_2^2 Δt, then u + v ≤ CΔt. We don't bother to distinguish between the various constants.
6 The Cauchy Schwarz inequality gives E[|S_m|] = E[|S_m| · 1] ≤ (E[S_m^2] E[1^2])^{1/2} = E[S_m^2]^{1/2}.
The derivation uses what we just have done. We approximate the Ito integral by the sum

\sum_{T_1 \le t_k < T_2} a(t_k) \, \Delta W_k .

Because the different \Delta W_k are independent, and because of the independent increments property, the expected square of this is

\sum_{T_1 \le t_k < T_2} E[a(t_k)^2] \, \Delta t .

The formula (21) follows from this.
An application of this is to understand the roughness of Y(T) = \int_0^T a(t) \, dW(t). If E[a(t)^2] < C for all t ≤ T, then E[(Y(T_2) - Y(T_1))^2] ≤ CΔt, where Δt = T_2 - T_1. This is the same roughness as Brownian motion itself.
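A small Monte Carlo experiment (my sketch, with an assumed integrand and sample sizes) illustrates (21). Taking a(t) = W(t), the isometry formula predicts E[( \int_0^T W dW )^2] = \int_0^T t \, dt = T^2/2.

import numpy as np

rng = np.random.default_rng(2)
T, n, n_paths = 1.0, 500, 20_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
W_left = np.hstack([np.zeros((n_paths, 1)), np.cumsum(dW, axis=1)[:, :-1]])

Y = np.sum(W_left * dW, axis=1)     # Ito sums for integral_0^T W dW, one per path
print(np.mean(Y ** 2), T ** 2 / 2)  # the two numbers should be close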

1.18. White noise: White noise is a generalized function,7 ξ(t), which is thought of as homogeneous and Gaussian with ξ(t_1) independent of ξ(t_2) for t_1 ≠ t_2. More precisely, if t_0 < t_1 < ... < t_n and Y_k = \int_{t_k}^{t_{k+1}} \xi(t) \, dt, then the Y_k are independent and normal with zero mean and var(Y_k) = t_{k+1} - t_k. You can convince yourself that ξ(t) is not a true function by showing that it would have to have \int_a^b \xi(t)^2 \, dt = \infty for any a < b. Brownian motion can be thought of as the motion of a particle pushed by white noise, i.e. W(t) = \int_0^t \xi(s) \, ds. The Y_k defined above then are the increments of Brownian motion and have the appropriate statistical properties (independent, mean zero, normal, variance t_{k+1} - t_k). These properties may be summarized by saying that ξ(t) has mean zero and

\mathrm{cov}(\xi(t), \xi(t')) = E[\xi(t)\xi(t')] = \delta(t - t') .   (22)

For example, if f(t) and g(t) are deterministic functions and Y_f = \int f(t)\xi(t) \, dt and Y_g = \int g(t)\xi(t) \, dt, then (22) implies that

E[Y_f Y_g] = \int_t \int_{t'} f(t)g(t') E[\xi(t)\xi(t')] \, dt \, dt' = \int_t \int_{t'} f(t)g(t') \delta(t - t') \, dt \, dt' = \int_t f(t)g(t) \, dt .

It is tempting to take dW(t) = ξ(t) dt in the Ito integral and use (22) to derive the Ito isometry formula. However, this must be done carefully because the existence of the Ito integral, and the isometry formula, depend on the causality structure that makes dW(t) independent of a(t).
7 A generalized function is not an actual function, but has properties defined as though it were an actual function through integration. The δ function, for example, is defined by the formula \int f(t)\delta(t) \, dt = f(0). No actual function can do this. Generalized functions also are called distributions.
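The covariance rule (22) can be illustrated with discretized white noise: on a grid of spacing Δt, white noise is modeled by independent N(0, 1/Δt) values in each cell, so the integral of ξ over a cell has variance Δt. The functions f and g and the grid sizes below are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(3)
T, n, n_samples = 1.0, 200, 50_000
dt = T / n
t = (np.arange(n) + 0.5) * dt
f, g = np.sin(2 * np.pi * t), np.cos(2 * np.pi * t) + 1.0   # hypothetical f, g

xi = rng.normal(0.0, 1.0 / np.sqrt(dt), size=(n_samples, n))  # discrete white noise
Yf = np.sum(f * xi, axis=1) * dt
Yg = np.sum(g * xi, axis=1) * dt
print(np.mean(Yf * Yg), np.trapz(f * g, t))   # both approximate integral f g dt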

2 Stochastic Differential Equations

2.1. Introduction: The theory of stochastic differential equations (SDE) is a


framework for expressing dynamical models that include both random and non
random forces. The theory is based on the Ito integral. Like the Ito integral,
approximations based on finite differences that do not respect the martingale
structure of the equation can converge to different answers. Solutions to Ito
SDEs are Markov processes in that the future depends on the past only through
the present. For this reason, the solutions can be studied using backward and
forward equations, which turn out to be linear parabolic partial differential equa-
tions of diffusion type.

2.2. A Stochastic Differential Equation: An Ito stochastic differential equa-


tion takes the form

dX(t) = a(X(t), t) \, dt + \sigma(X(t), t) \, dW(t) .   (23)

A solution is an adapted process that satisfies (23) in the sense that

X(T) - X(0) = \int_0^T a(X(t), t) \, dt + \int_0^T \sigma(X(t), t) \, dW(t) ,   (24)

where the first integral on the right is a Riemann integral and the second is an Ito integral. We often specify initial conditions X(0) ~ u_0(x), where u_0(x) is the given probability density for X(0). Specifying X(0) = x_0 is the same as saying u_0(x) = δ(x - x_0). As in the general Ito differential, a(X(t), t) dt is the drift term, and σ(X(t), t) dW(t) is the martingale term. We often call σ(x, t) the volatility. However, this is a different use of the letter σ from Black Scholes, where the martingale term is σ x dW for a constant σ (also called volatility).
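A common way to compute approximate sample paths of (23) is the Euler-Maruyama discretization, which at each step adds a(X, t)Δt plus σ(X, t) times an N(0, Δt) increment. Here is a generic sketch (my code, not from the notes); the example coefficients, step count, and initial value are assumptions for illustration.

import numpy as np

def euler_maruyama(a, sigma, x0, T, n, rng):
    """Return times t_k and one approximate sample path X(t_k) of (23)."""
    dt = T / n
    t = np.linspace(0.0, T, n + 1)
    X = np.empty(n + 1)
    X[0] = x0
    for k in range(n):
        dW = rng.normal(0.0, np.sqrt(dt))
        X[k + 1] = X[k] + a(X[k], t[k]) * dt + sigma(X[k], t[k]) * dW
    return t, X

# Example: an Ornstein-Uhlenbeck type process dX = -X dt + dW, X(0) = 0
rng = np.random.default_rng(4)
t, X = euler_maruyama(lambda x, t: -x, lambda x, t: 1.0, 0.0, 5.0, 5000, rng)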

2.3. Geometric Brownian motion: The SDE

dX(t) = \mu X(t) \, dt + \sigma X(t) \, dW(t) ,   (25)

with initial data X(0) = 1, defines geometric Brownian motion. In the general formulation above, (25) has drift coefficient a(x, t) = μx and volatility σ(x, t) = σx (with the conflict of terminology noted above). If W(t) were a differentiable function of t, the solution would be

(wrong)   X(t) = e^{\mu t + \sigma W(t)} .   (26)

To check this, define the function x(w, t) = e^{\mu t + \sigma w}, with \partial_w x = \sigma x, \partial_t x = \mu x, and \partial_w^2 x = \sigma^2 x, so that the Ito differential of the trial function (26) is

dX(W(t), t) = \mu X \, dt + \sigma X \, dW(t) + \frac{\sigma^2}{2} X \, dt .

We can remove the unwanted final term by multiplying by e^{-\sigma^2 t/2}, which suggests that the formula

X(t) = e^{\mu t - \sigma^2 t/2 + \sigma W(t)}   (27)

satisfies (25). A quick Ito differentiation verifies that it does.
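For geometric Brownian motion we can compare an Euler-Maruyama discretization of (25) with the exact formula (27) on the same Brownian path. The parameter values below are assumptions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma, T, n = 0.1, 0.4, 1.0, 10_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), n)
W = np.cumsum(dW)
t = np.arange(1, n + 1) * dt

exact = np.exp(mu * t - 0.5 * sigma ** 2 * t + sigma * W)     # formula (27)

X = np.empty(n + 1); X[0] = 1.0
for k in range(n):                                            # Euler step for (25)
    X[k + 1] = X[k] + mu * X[k] * dt + sigma * X[k] * dW[k]

print(abs(X[-1] - exact[-1]))   # small discretization error at t = T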

2.4. Properties of geometric Brownian motion: Let us focus on the simple case μ = 0, σ = 1, so that

dX(t) = X(t) \, dW(t) .   (28)

The solution, with initial data X(0) = 1, is the simple geometric Brownian motion

X(t) = \exp(W(t) - t/2) .   (29)

We discuss (29) in relation to the martingale property (X(t) is a martingale because the drift term in (23) is zero in (28)). A simple calculation based on (for t' > t)

X(t') = \exp\big( W(t) - t/2 + (W(t') - W(t)) - (t' - t)/2 \big)

and integrals of Gaussians shows that E[X(t') \mid F_t] = X(t).
However, W(t) has the order of magnitude \sqrt{t}. For large t, this means that the exponent in (29) is roughly equal to -t/2, which suggests that

X(t) = \exp(W(t) - t/2) \approx e^{-t/2} \to 0 \quad \text{as } t \to \infty \quad \text{(a.s.)}

Therefore, the expected value E[X(t)] = 1, for large t, is not produced by typical paths, but by very exceptional ones. To be quantitative, there is an exponentially small probability that X(t) is as large as its expected value:

P(X(t) \ge 1) < e^{-t/8} \quad \text{for large } t.

2.5. Dominated convergence theorem: The dominated convergence theorem is about expected values of limits of random variables. Suppose X(t, ω) is a family of random variables and that lim_{t→∞} X(t, ω) = Y(ω) for almost every ω. The random variable U(ω) dominates the X(t, ω) if |X(t, ω)| ≤ U(ω) for almost every ω and for every t > 0. We often write this simply as X(t) → Y as t → ∞ a.s., and |X(t)| ≤ U a.s. The theorem states that if E[U] < ∞ then E[X(t)] → E[Y] as t → ∞. It is fairly easy to prove the theorem from the definition of abstract integration. The simplicity of the theorem is one of the ways abstract integration is simpler than Riemann integration.
The reason for mentioning this theorem here is that geometric Brownian motion (29) is an example showing what can go wrong without a dominating function. Although X(t) → 0 as t → ∞ a.s., the expected value of X(t) does not go to zero, as it would do if the conditions of the dominated convergence theorem were met. The reader is invited to study the maximal function, which is the random variable M = max_{t>0}(W(t) - t/2), in enough detail to show that E[e^M] = ∞.

2.6. Strong and weak solutions: A strong solution is an adapted function X(W, t), where the Brownian motion path W again plays the role of the abstract random variable, ω. As in the discrete case, X(t) (i.e. X(W, t)) being measurable in F_t means that X(t) is a function of the values of W(s) for 0 ≤ s ≤ t. The two examples we have, geometric Brownian motion (27) and the Ornstein Uhlenbeck process8

X(t) = \int_0^t e^{-\gamma(t-s)} \, dW(s) ,   (30)

both have this property. Note that (27) depends only on W(t), while (30) depends on the whole path up to time t.
A weak solution is a stochastic process, X(t), defined perhaps on a different probability space and filtration (Ω, F_t), that has the statistical properties called for by (23). These are (using ΔX = X(t+Δt) - X(t)) roughly9

E[\Delta X \mid F_t] = a(X(t), t)\Delta t + o(\Delta t) ,   (31)

and

E[\Delta X^2 \mid F_t] = \sigma^2(X(t), t)\Delta t + o(\Delta t) .   (32)

We will see that a strong solution satisfies (31) and (32), so a strong solution is a weak solution. It makes no sense to ask whether a weak solution is a strong solution since we have no information on how, or even whether, the weak solution depends on W.
The formulas (31) and (32) are helpful in deriving SDE descriptions of physical or financial systems. We calculate the left sides to identify the a(x, t) and σ(x, t) in (23). Brownian motion paths and Ito integration are merely a tool for constructing the desired process X(t). We saw in the example of geometric Brownian motion that expressing the solution in terms of W(t) can be very convenient for understanding its properties. For example, it is not particularly easy to show that X(t) → 0 as t → ∞ from (31) and (32) with a = 0 and10 σ(x, t) = x.

2.7. Strong is weak: We just verify that the strong solution to (23) that satisfies (24) also satisfies the weak form requirements (31) and (32). This is an important motivation for using the Ito definition of dW rather than, say, the Stratonovich definition.
A slightly more general fact is simpler to explain. Define R and I by

R = \int_t^{t+\Delta t} a(s) \, ds , \qquad I = \int_t^{t+\Delta t} \sigma(s) \, dW(s) ,

8 This process satisfies the SDE dX = -\gamma X \, dt + dW, with X(0) = 0.
9 The little o notation f(Δt) = g(Δt) + o(Δt) informally means that the difference between f and g is a mathematician's order of magnitude smaller than Δt for small Δt. Formally, it means that (f(Δt) - g(Δt))/Δt → 0 as Δt → 0.
10 This conflict of notation is common in discussing geometric Brownian motion. On the left, σ(x, t) is the coefficient of dW(t). On the right, σ is the financial volatility coefficient.
where a(t) and σ(t) are continuous adapted stochastic processes. We want to see that

E\big[ R + I \mid F_t \big] = a(t)\Delta t + o(\Delta t) ,   (33)

and

E\big[ (R + I)^2 \mid F_t \big] = \sigma^2(t)\Delta t + o(\Delta t) .   (34)

We may leave I out of (33) because E[I] = 0 always. We may leave R out of (34) because |I| >> |R|. (If a is bounded then R = O(Δt) so E[R^2 | F_t] = O(Δt^2). The Ito isometry formula suggests that E[I^2 | F_t] = O(Δt). Cauchy Schwarz then gives E[RI | F_t] = O(Δt^{3/2}). Altogether, E[(R + I)^2 | F_t] = E[I^2 | F_t] + O(Δt^{3/2}).)
To verify (33) without I, we assume that a(t) is a continuous function of t in the sense that for s > t,

E\big[ a(s) - a(t) \mid F_t \big] \to 0 \quad \text{as } s \to t .

This implies that

\frac{1}{\Delta t} \int_t^{t+\Delta t} E\big[ a(s) - a(t) \mid F_t \big] \, ds \to 0 \quad \text{as } \Delta t \to 0 ,

so that

E\big[ R \mid F_t \big] = \int_t^{t+\Delta t} E[a(s) \mid F_t] \, ds = \int_t^{t+\Delta t} E[a(t) \mid F_t] \, ds + \int_t^{t+\Delta t} E[a(s) - a(t) \mid F_t] \, ds = \Delta t \, a(t) + o(\Delta t) .

This verifies (33). The Ito isometry formula gives

E\big[ I^2 \mid F_t \big] = E\Big[ \int_t^{t+\Delta t} \sigma(s)^2 \, ds \,\Big|\, F_t \Big] ,

so (34) follows in the same way.

2.8. Markov diffusions: Roughly speaking,11 a diffusion process is a continuous stochastic process that satisfies (31) and (32). If the process is Markov, the a of (31) and the σ^2 of (32) must be functions of X(t) and t. If a(x, t) and σ(x, t) are Lipschitz (|a(x, t) - a(y, t)| ≤ C|x - y|, etc.) functions of x and t, then it is possible to express X(t) as a strong solution of an Ito SDE (23).
This is the way equations (23) are often derived in practice. We start off wanting to model a process with an SDE. It could be a random walk on a lattice with the lattice size converging to zero or some other process that we hope will have a limit as a diffusion. The main step in proving the limit exists is tightness, which we hint at in a lecture to follow. We identify a and σ by calculations. Then we use the representation theorem to say that the process may be represented as the strong solution to (23).
11 More detailed treatments are the books by Steele, Chung and Williams, Karatzas and Shreve, and Oksendal.

2.9. Backward equation: The simplest backward equation is the PDE satisfied by f(x, t) = E_{x,t}[V(X(T))]. We derive it using the weak form conditions (31) and (32) and the tower property. As with Brownian motion, the tower property gives

f(x, t) = E_{x,t}[V(X(T))] = E_{x,t}[F(t + \Delta t)] ,

where F(s) = E[V(X(T)) | F_s]. The Markov property implies that F(s) is a function of X(s) alone, so F(s) = f(X(s), s). This gives

f(x, t) = E_{x,t}[f(X(t+\Delta t), t+\Delta t)] .   (35)

If we assume that f is a smooth function of x and t, we may expand in Taylor series, keeping only terms that contribute O(Δt) or more.12 We use ΔX = X(t+Δt) - x and write f for f(x, t), f_t for f_t(x, t), etc.:

f(X(t+\Delta t), t+\Delta t) = f + f_t \Delta t + f_x \Delta X + \tfrac{1}{2} f_{xx} \Delta X^2 + \text{smaller terms.}

Therefore (31) and (32) give:

f(x, t) = E_{x,t}[f(X(t+\Delta t), t+\Delta t)] = f(x, t) + f_t \Delta t + f_x E_{x,t}[\Delta X] + \tfrac{1}{2} f_{xx} E_{x,t}[\Delta X^2] + o(\Delta t) = f(x, t) + f_t \Delta t + f_x a(x, t)\Delta t + \tfrac{1}{2} f_{xx} \sigma^2(x, t)\Delta t + o(\Delta t) .

We now just cancel the f(x, t) from both sides, let Δt → 0, and drop the o(Δt) terms to get the backward equation

\partial_t f(x, t) + a(x, t)\partial_x f(x, t) + \frac{\sigma^2(x, t)}{2} \partial_x^2 f(x, t) = 0 .   (36)

2.10. Forward equation: The forward equation follows from the backward equation by duality. Let u(x, t) be the probability density for X(t). Since f(x, t) = E_{x,t}[V(X(T))], we may write

E[V(X(T))] = \int u(x, t) f(x, t) \, dx ,

which is independent of t. Differentiating with respect to t and using the backward equation (36) for f_t, we get

0 = \int u(x, t) f_t(x, t) \, dx + \int u_t(x, t) f(x, t) \, dx = \int \Big( -u(x, t) a(x, t) \partial_x f(x, t) - \tfrac{1}{2} u(x, t) \sigma^2(x, t) \partial_x^2 f(x, t) + u_t(x, t) f(x, t) \Big) dx .

We integrate by parts to put the x derivatives on u. We may ignore boundary terms if u decays fast enough as |x| → ∞ and if f does not grow too fast. The result is

\int \Big( \partial_x\big( a(x, t) u(x, t) \big) - \tfrac{1}{2} \partial_x^2\big( \sigma^2(x, t) u(x, t) \big) + \partial_t u(x, t) \Big) f(x, t) \, dx = 0 .

Since this should be true for every function f(x, t), the integrand must vanish identically, which implies that

\partial_t u(x, t) = -\partial_x\big( a(x, t) u(x, t) \big) + \tfrac{1}{2} \partial_x^2\big( \sigma^2(x, t) u(x, t) \big) .   (37)

This is the forward equation for the Markov process that satisfies (31) and (32).
12 The homework has more on the terms left out.

2.11. Transition probabilities: The transition probability density is the probability density for X(s) given that X(t) = y and s > t. We write it as G(y, x, t, s), the probability density to go from y to x as time goes from t to s. If the drift and diffusion coefficients do not depend on t, then G is a function of s - t. Because G is a probability density in the x and s variables, it satisfies the forward equation

\partial_s G(y, x, t, s) = -\partial_x\big( a(x, s) G(y, x, t, s) \big) + \tfrac{1}{2} \partial_x^2\big( \sigma^2(x, s) G(y, x, t, s) \big) .   (38)

In this equation, t and y are merely parameters, but s may not be smaller than t. The initial condition that represents the requirement that X(t) = y is

G(y, x, t, t) = \delta(x - y) .   (39)

The transition density is the Green's function for the forward equation, which means that the general solution may be written in terms of G as

u(x, s) = \int u(y, t) G(y, x, t, s) \, dy .   (40)

This formula is a continuous time version of the law of total probability: the probability density to be at x at time s is the sum (integral) of the probability density to be at x at time s conditional on being at y at time t (which is G(y, x, t, s)) multiplied by the probability density to be at y at time t (which is u(y, t)).

2.12. Green's function for the backward equation: We can also express the solution of the backward equation in terms of the transition probabilities G. For s > t,

f(y, t) = E_{y,t}[f(X(s), s)] ,

which is an expression of the tower property. The expected value on the right may be evaluated using the transition probability density for X(s). The result is

f(y, t) = \int G(y, x, t, s) f(x, s) \, dx .   (41)

For this to hold, G must satisfy the backward equation as a function of y and t (which were parameters in (38)). To show this, we apply the backward equation operator (see below for terminology) \partial_t + a(y, t)\partial_y + \tfrac{1}{2}\sigma^2(y, t)\partial_y^2 to both sides. The left side gives zero because f satisfies the backward equation. Therefore we find that

0 = \int \Big( \partial_t + a(y, t)\partial_y + \tfrac{1}{2}\sigma^2(y, t)\partial_y^2 \Big) G(y, x, t, s) \, f(x, s) \, dx

for any f(x, s). Therefore, we conclude that

\partial_t G(y, x, t, s) + a(y, t)\partial_y G(y, x, t, s) + \tfrac{1}{2}\sigma^2(y, t)\partial_y^2 G(y, x, t, s) = 0 .   (42)

Here x and s are parameters. The final condition for (42) is the same as (39). The time s = t represents the initial time for s and the final time for t because G is defined for all t ≤ s.

2.13. The generator: The generator of an Ito process is the operator containing the spatial part of the backward equation13

L(t) = a(x, t)\partial_x + \tfrac{1}{2}\sigma^2(x, t)\partial_x^2 .

The backward equation is \partial_t f(x, t) + L(t) f(x, t) = 0. We write just L when a and σ do not depend on t. For a general continuous time Markov process, the generator is defined by the requirement that

\frac{d}{dt} E[g(X(t), t)] = E\big[ (L(t)g)(X(t), t) + g_t(X(t), t) \big] ,   (43)

for a sufficiently rich (dense) family of functions g. This applies not only to Ito processes (diffusions), but also to jump diffusions, continuous time birth/death processes, continuous time Markov chains, etc. Part of the requirement is that the limit defining the derivative on the left side should exist. Proving (43) for an Ito process is more or less what we did when we derived the backward equation. On the other hand, if we know (43) we can derive the backward equation by requiring that \frac{d}{dt} E[f(X(t), t)] = 0.

2.14. Adjoint: The adjoint of L is another operator that we call L^*. It is defined in terms of the inner product

\langle u, f \rangle = \int u(x) f(x) \, dx .

We leave out the t variable for the moment. If u and f are complex, we take the complex conjugate of u above. The adjoint is defined by the requirement that for general u and f,

\langle u, Lf \rangle = \langle L^* u, f \rangle .

In practice, this boils down to the same integration by parts we used to derive the forward equation from the backward equation:

\langle u, Lf \rangle = \int u(x) \Big( a(x)\partial_x f(x) + \tfrac{1}{2}\sigma^2(x)\partial_x^2 f(x) \Big) dx = \int \Big( -\partial_x\big( a(x) u(x) \big) + \tfrac{1}{2}\partial_x^2\big( \sigma^2(x) u(x) \big) \Big) f(x) \, dx .

Putting the t dependence back in, we find the action of L^* on u to be

(L^*(t) u)(x, t) = -\partial_x\big( a(x, t) u(x, t) \big) + \tfrac{1}{2}\partial_x^2\big( \sigma^2(x, t) u(x, t) \big) .

The forward equation (37) then may be written

\partial_t u = L^*(t) u .

All we have done here is define notation (L^*) and show how our previous derivation of the forward equation is expressed in terms of it.
13 Some people include the time derivative in the definition of the generator. Watch for this.

2.15. Adjoints and the Green's function: Let us summarize and record what we have said about the transition probability density G(y, x, t, s). It is defined for s ≥ t and has G(y, x, t, t) = δ(x - y). It moves probabilities forward by integrating over y (40) and moves expected values backward by integrating over x (41). As a function of x and s it satisfies the forward equation

\partial_s G(y, x, t, s) = (L_x^*(s) G)(y, x, t, s) .

We write L_x^* to indicate that the derivatives in L^* are with respect to the x variable:

(L_x^*(s) G)(y, x, t, s) = -\partial_x\big( a(x, s) G(y, x, t, s) \big) + \tfrac{1}{2}\partial_x^2\big( \sigma^2(x, s) G(y, x, t, s) \big) .

As a function of y and t it satisfies the backward equation

\partial_t G(y, x, t, s) + (L_y(t) G)(y, x, t, s) = 0 .

3 Properties of the solutions

3.1. Introduction: The next few paragraphs describe some properties of


solutions of backward and forward equations. For Brownian motion, f and u
have every property because the forward and backward equations are essentially
the same. Here f has some and u has others.

3.2. Backward equation maximum principle: The backward equation has a maximum principle

\max_x f(x, t) \le \max_y f(y, s) \quad \text{for } s > t .   (44)

This follows immediately from the representation

f(x, t) = E_{x,t}[f(X(s), s)] .

The expected value of f(X(s), s) cannot be larger than its maximum value. Since this holds for every x, it holds in particular for the maximizer.
There is a more complicated proof of the maximum principle that uses the backward equation. I give a slightly naive explanation to avoid taking too long with it. Let m(t) = max_x f(x, t). We are trying to show that m(t) never increases. If, on the contrary, m(t) does increase as t decreases, there must be a t_* with (dm/dt)(t_*) < 0, and hence f_t(x_*, t_*) < 0, where x_* is chosen so that f(x_*, t_*) = max_x f(x, t_*). Then f_x(x_*, t_*) = 0 and f_xx(x_*, t_*) ≤ 0. The backward equation then implies that f_t(x_*, t_*) ≥ 0 (because σ^2 ≥ 0), which contradicts f_t(x_*, t_*) < 0.
The PDE proof of the maximum principle shows that the coefficients a and σ^2 have to be outside the derivatives in the backward equation. Our argument that Lf ≤ 0 at a maximum, where f_x = 0 and f_xx ≤ 0, would be wrong if we had, say, \partial_x(a(x)f(x, t)) rather than a(x)\partial_x f(x, t). We could get a non zero value because of variation in a(x) even when f was constant. The forward equation does not have a maximum principle for this reason. Both the Ornstein Uhlenbeck and geometric Brownian motion problems have cases where max_x u(x, t) increases in forward time or backward time.
3.3. Conservation of probability: The probability density has \int u(x, t) \, dx = 1. We can see that

\frac{d}{dt} \int u(x, t) \, dx = 0

also from the forward equation (37). We simply differentiate under the integral, substitute from the equation, and integrate the resulting x derivatives. For this it is crucial that the coefficients a and σ^2 be inside the derivatives. Almost any example with a(x, t) or σ(x, t) not independent of x will show that

\frac{d}{dt} \int f(x, t) \, dx \ne 0 .

3.4. Martingale property: If there is no drift, a(x, t) = 0, then X(t) is a


martingale. In particular, E[X(t)] is independent of t. This too follows from
the forward equation (37). There will be no boundary contributions in the
integrations by parts.

\frac{d}{dt} E[X(t)] = \frac{d}{dt} \int x \, u(x, t) \, dx = \int x \, u_t(x, t) \, dx = \int x \, \tfrac{1}{2}\partial_x^2\big( \sigma^2(x, t) u(x, t) \big) \, dx = -\int \tfrac{1}{2}\partial_x\big( \sigma^2(x, t) u(x, t) \big) \, dx = 0 .

This would not be true for the backward equation form \tfrac{1}{2}\sigma^2(x, t)\partial_x^2 f(x, t) or even for the mixed form we get from the Stratonovich calculus, \tfrac{1}{2}\partial_x\big( \sigma^2(x, t)\partial_x f(x, t) \big). The mixed Stratonovich form conserves probability but not expected value.

3.5. Drift and advection: If there is no noise, σ(x, t) = 0, then the SDE (23) becomes the ordinary differential equation (ODE)

\frac{dx}{dt} = a(x, t) .   (45)

If x(t) is a solution, then clearly the expected payout should satisfy f(x(t), t) = f(x(s), s); if nothing is random, then the expected value is the value. It is easy to check using the backward equation that f(x(t), t) is independent of t if x(t) satisfies (45) and σ = 0:

\frac{d}{dt} f(x(t), t) = f_t(x(t), t) + \frac{dx}{dt} f_x(x(t), t) = f_t(x(t), t) + a(x(t), t) f_x(x(t), t) = 0 .

Advection is the process of being carried by the wind. If there is no diffusion, then the values of f are simply advected by the drift. The term drift implies that this advection process is slow and gentle. If σ is small but not zero, then f may be essentially advected with a little spreading or smearing induced by diffusion. Computing drift dominated solutions can be more challenging than computing diffusion dominated ones.
The probability density does not have u(x(t), t) a constant (try it in the
forward equation). There is a conservation of probability correction to this that
you can find if you are interested.

Stochastic Calculus Notes, Lecture 8
Last modified December 14, 2004

1 Path space measures and change of measure

1.1. Introduction: We turn to a closer study of the probability measures


on path space that represent solutions of stochastic differential equations. We
do not have exact formulas for the probability densities, but there are approxi-
mate formulas that generalize the ones we used to derive the Feynman integral
(not the Feynman Kac formula). In particular, these allow us to compare the
measures for different SDEs so that we may use solutions of one to represent ex-
pected values of another. This is the Cameron Martin Girsanov formula. These
changes of measure have many applications, including importance sampling in
Monte Carlo and change of measure in finance.

1.2. Importance sampling: Importance sampling is a technique that can make Monte Carlo computations more accurate. In the simplest version, we have a random variable, X, with probability density u(x). We want to estimate A = E_u[\varphi(X)]. Here and below, we write E_P[\cdot] to represent expectation using the P measure. To estimate A, we generate N (a large number) independent samples from the population u. That is, we generate random variables X_k for k = 1, ..., N that are independent and have probability density u. Then we estimate A using

A \approx \hat{A}_u = \frac{1}{N} \sum_{k=1}^N \varphi(X_k) .   (1)

The estimate is unbiased because the bias, A - E_u[\hat{A}_u], is zero. The error is determined by the variance var(\hat{A}_u) = \frac{1}{N} \mathrm{var}_u(\varphi(X)).
Let v(x) be another probability density so that v(x) ≠ 0 for all x with u(x) ≠ 0. Then clearly

A = \int \varphi(x) u(x) \, dx = \int \varphi(x) \frac{u(x)}{v(x)} v(x) \, dx .

We express this as

A = E_u[\varphi(X)] = E_v[\varphi(X) L(X)] , \quad \text{where } L(x) = \frac{u(x)}{v(x)} .   (2)

The ratio L(x) is called the score function in Monte Carlo, the likelihood ratio in statistics, and the Radon Nikodym derivative by mathematicians. We get a different unbiased estimate of A by generating N independent samples of v and taking

A \approx \hat{A}_v = \frac{1}{N} \sum_{k=1}^N \varphi(X_k) L(X_k) .   (3)

The accuracy of (3) is determined by

\mathrm{var}_v(\varphi(X) L(X)) = E_v[(\varphi(X) L(X) - A)^2] = \int (\varphi(x) L(x) - A)^2 v(x) \, dx .

The goal is to improve the Monte Carlo accuracy by getting var(\hat{A}_v) << var(\hat{A}_u).

1.3. A rare event example: Importance sampling is particularly helpful


in estimating probabilities of rare events. As a simple example, consider the
problem of estimating P (X > a) (corresponding to (x) = 1x>a ) when X
N (0, 1) is a standard normal random variable and a is large. The naive Monte
Carlo method would be to generate N sample standard normals, Xk , and take

Xk N (0, 1), k = 1, , N ,

bu = 1 # {Xk > a} = 1 (4)


X
A = P (X > a) A 1.
N N
Xk >a

For large a, the hits, Xk > a, would be a small fraction of the samples, with the
rest being wasted.
One importance sampling strategy uses v corresponding to N (a, 1). It
seems natural to try to increase the number of hits by moving the mean from
0 to a. Since most hits are close to a, it would be a mistake to move the
2
mean farther than a. Using the probability densities u(x) = 12 ex /2 and
2 2
v(x) = 12 e(xa) /2 , we find L(x) = u(x)/v(x) = ea /2 eax . The importance
sampling estimate is

Xk N (a, 1), k = 1, , N ,

1 a2 /2 X aXk
A Av = N e e .
b
Xk >a

Some calculations show that the variance of $\hat{A}_v$ is smaller than the variance
of the naive estimator $\hat{A}_u$ by a factor of roughly $e^{a^2/2}$. A simple way to
generate $N(a, 1)$ random variables is to start with mean zero standard normals
$Y_k \sim N(0,1)$ and add a: $X_k = Y_k + a$. In this form, $e^{a^2/2}e^{-aX_k} = e^{-a^2/2}e^{-aY_k}$,
and $X_k > a$ is the same as $Y_k > 0$, so the variance reduced estimator becomes
$$ Y_k \sim N(0,1), \; k = 1, \ldots, N , \qquad
   A \approx \hat{A}_v = e^{-a^2/2}\,\frac{1}{N}\sum_{Y_k > 0} e^{-aY_k} . \qquad (5) $$
The naive Monte Carlo method (4) produces a small $\hat{A}$ by getting a small
number of hits in many samples. The importance sampling method (5) gets
roughly 50% hits but discounts each hit by a factor of at least $e^{a^2/2}$ to get the
same expected value as the naive estimator.
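For concreteness, here is a small Python sketch of the two estimators (4) and
(5). The value of a, the sample size, and the use of numpy/scipy are illustrative
choices, not part of the method itself.

```python
# Sketch: naive Monte Carlo (4) vs. importance sampling (5) for
# A = P(X > a) with X ~ N(0,1). Parameters are illustrative.
import numpy as np
from scipy.stats import norm

def tail_estimates(a, N=100_000, seed=0):
    rng = np.random.default_rng(seed)

    # Naive estimator (4): fraction of standard normal samples above a.
    X = rng.standard_normal(N)
    A_naive = np.mean(X > a)

    # Importance sampling estimator (5): sample Y ~ N(0,1), so X = Y + a has
    # mean a, and weight each hit Y > 0 by exp(-a^2/2 - a*Y).
    Y = rng.standard_normal(N)
    A_is = np.mean(np.exp(-0.5 * a**2 - a * Y) * (Y > 0))

    return A_naive, A_is

a = 4.0
A_naive, A_is = tail_estimates(a)
print("exact               ", norm.sf(a))   # about 3.2e-5
print("naive (4)           ", A_naive)
print("importance sampling ", A_is)
```

With these values the naive estimate rests on only a handful of hits, while the
importance sampling estimate uses roughly half the samples and is far more
accurate for the same N.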

1.4. Radon Nikodym derivative: Suppose $\Omega$ is a measure space with $\sigma$-
algebra $\mathcal{F}$ and probability measures P and Q. We say that $L(\omega)$ is the
Radon Nikodym derivative of P with respect to Q if $dP(\omega) = L(\omega)\,dQ(\omega)$, or,
more formally,
$$ \int_\Omega V(\omega)\,dP(\omega) = \int_\Omega V(\omega)L(\omega)\,dQ(\omega) , $$
which is to say
$$ E_P[V] = E_Q[VL] , \qquad (6) $$
for any V, say, with $E_P[|V|] < \infty$. People often write $L = \frac{dP}{dQ}$, and call it
the Radon Nikodym derivative of P with respect to Q. If we know L, then the
right side of (6) offers a different and possibly better way to estimate $E_P[V]$.
Our goal will be a formula for L when P and Q are measures corresponding to
different SDEs.

1.5. Absolute continuity: One obstacle to finding L is that it may not exist.
If A is an event with P(A) > 0 but Q(A) = 0, L cannot exist because the
formula (6) would become
$$ P(A) = \int_A dP(\omega) = \int \mathbf{1}_A(\omega)\,dP(\omega) = \int \mathbf{1}_A(\omega)L(\omega)\,dQ(\omega) . $$
Looking back at our definition of the abstract integral, we see that if the event
$A = \{f(\omega) \neq 0\}$ has Q(A) = 0, then all the approximations to $\int f(\omega)\,dQ(\omega)$ are
zero, so $\int f(\omega)\,dQ(\omega) = 0$.
We say that measure P is absolutely continuous with respect to Q if P(A) = 0
whenever Q(A) = 0, for every$^1$ $A \in \mathcal{F}$. We just showed that L cannot exist unless
P is absolutely continuous with respect to Q. On the other hand, the Radon
Nikodym theorem states that an L satisfying (6) does exist if P is absolutely
continuous with respect to Q.
In practical examples, if P is not absolutely continuous with respect to Q,
then P and Q are completely singular with respect to each other. This means
that there is an event, $A \in \mathcal{F}$, with P(A) = 1 and Q(A) = 0.

1.6. Discrete probability: In discrete probability, with a finite or countable
state space, P is absolutely continuous with respect to Q if and only if $Q(\omega) > 0$
whenever $P(\omega) > 0$. In that case, $L(\omega) = P(\omega)/Q(\omega)$. If P and Q represent
Markov chains on a discrete state space, then P is not absolutely continuous
with respect to Q if the transition matrix for P (also called P) allows transitions
that are not allowed in Q.
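As a tiny numerical check of the identity (6) in this discrete setting, one can
verify $E_P[V] = E_Q[VL]$ directly on a made-up four state example (the numbers
below are arbitrary):

```python
# Discrete change of measure: E_P[V] = E_Q[V * L] with L = P/Q, assuming
# Q > 0 wherever P > 0. States and values are made up for illustration.
import numpy as np

P = np.array([0.5, 0.3, 0.2, 0.0])      # P puts no mass on the last state
Q = np.array([0.25, 0.25, 0.25, 0.25])
V = np.array([1.0, -2.0, 3.0, 10.0])

L = np.divide(P, Q, out=np.zeros_like(P), where=Q > 0)   # likelihood ratio
print(np.dot(P, V))        # E_P[V]
print(np.dot(Q, V * L))    # E_Q[V L], the same number
```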

1.7. Finite dimensional spaces: If $\Omega = R^n$ and the probability measures are
given by densities, then P may fail to be absolutely continuous with respect to
Q if the densities are different from zero in different places. An example with
n = 1 is P corresponding to a negative exponential random variable, $u(x) = e^{x}$
for $x \leq 0$ and u(x) = 0 for x > 0, while Q corresponds to a positive exponential,
$v(x) = e^{-x}$ for $x \geq 0$ and v(x) = 0 for x < 0.
Another way to get singular probability measures is to have measures using
density functions concentrated on lower dimensional sets. An example with $\Omega = R^2$ has
Q saying that $X_1$ and $X_2$ are independent standard normals while P says that
$X_1 = X_2$. The probability density for P is $u(x_1, x_2) = \frac{1}{\sqrt{2\pi}}e^{-x_1^2/2}\,\delta(x_2 - x_1)$.
The event $A = \{X_1 = X_2\}$ has Q probability zero but P probability one.

$^1$ This assumes that measures P and Q are defined on the same $\sigma$-algebra. It is useful
for this reason always to use the $\sigma$-algebra of Borel sets. It is common to imagine completing
a measure by adding to $\mathcal{F}$ all subsets of events with P(A) = 0. While it may seem better to
have more measurable events, it makes the change of measure discussions more complicated.

1.8. Testing for singularity: It sometimes helps to think of complete singu-
larity of measures in the following way. Suppose we learn the outcome, $\omega$, and
we try to determine which probability measure produced it. If there is a set
A with P(A) = 1 and Q(A) = 0, then we report P if $\omega \in A$ and Q if $\omega \notin A$.
We will be correct 100% of the time. Conversely, if there is a way to determine
whether P or Q was used to generate $\omega$, then let A be the set of outcomes that
you say came from P. You have P(A) = 1 because you always are correct
in saying P if $\omega$ came from P. Also Q(A) = 0 because you never say P when
$\omega$ came from Q.
Common tests involve statistics, i.e. functions of $\omega$. If there is a (measurable)
statistic $F(\omega)$ with $F(\omega) = a$ almost surely with respect to P and $F(\omega) = b \neq a$
almost surely with respect to Q, then we take $A = \{\omega \mid F(\omega) = a\}$ and see
that P and Q are completely singular with respect to each other.

1.9. Coin tossing: In common situations where this works, the function $F(\omega)$
is a limit that exists almost surely (but with different values) for both P and Q.
If $\lim_{n\to\infty} F_n(\omega) = a$ almost surely with respect to P and $\lim_{n\to\infty} F_n(\omega) = b$
almost surely with respect to Q, then P and Q are completely singular.
Suppose we make an infinite sequence of coin tosses with the tosses being
independent and having the same probability of heads. We describe this by
taking $\Omega$ to be the set of infinite sequences $\omega = (Y_1, Y_2, \ldots)$, where the $k^{th}$ toss
$Y_k$ equals one or zero, and the $Y_k$ are independent. Let the measure P represent
tossing with $Y_k = 1$ with probability p, and Q represent tossing with $Y_k = 1$ with
probability $q \neq p$. Let $F_n(\omega) = \frac{1}{n}\sum_{k=1}^n Y_k$. The (Kolmogorov strong) law of
large numbers states that $F_n \to p$ as $n \to \infty$ almost surely in P and $F_n \to q$
as $n \to \infty$ almost surely in Q. This shows that P and Q are completely
singular with respect to each other. Note that this is not an example of discrete
probability in our sense because the state space consists of infinite sequences.
The set of infinite sequences is not countable (a theorem of Cantor).
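A quick simulation (with the illustrative values p = 0.5 and q = 0.6) shows the
statistic $F_n$ separating the two measures:

```python
# Strong law of large numbers separating two coin-tossing measures:
# the sample average F_n settles near p under P and near q under Q.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
p, q = 0.5, 0.6                      # any p != q works

F_P = np.mean(rng.random(n) < p)     # F_n for a sequence drawn from P
F_Q = np.mean(rng.random(n) < q)     # F_n for a sequence drawn from Q
print(F_P, F_Q)                      # close to 0.5 and 0.6 respectively
```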

1.10. The Cameron Martin formula: The Cameron Martin formula relates
the measure, P , for Brownian motion with drift to the Wiener measure, W , for
standard Brownian motion without drift. Wiener measure describes the process

dX(t) = dB(t) . (7)

The P measure describes solutions of the SDE

dX(t) = a(X(t), t)dt + dB(t) . (8)

For definiteness, suppose X(0) = x0 is specified in both cases.

1.11. Approximate joint probability measures: We find the formula for
$L(X) = dP(X)/dW(X)$ by taking a finite $\Delta t$ approximation, directly comput-
ing $L_{\Delta t}$, and observing the limit of $L_{\Delta t}$ as $\Delta t \to 0$. We use our standard notations
$t_k = k\Delta t$, $X_k \approx X(t_k)$, $\Delta B_k = B(t_{k+1}) - B(t_k)$, and $\vec{X} = (X_1, \ldots, X_n) \in R^n$.
The approximate solution of (8) is
$$ X_{k+1} = X_k + \Delta t\,a(X_k, t_k) + \Delta B_k . \qquad (9) $$
This is exact in the case a = 0. We write $V(\vec{x})$ for the joint density of $\vec{X}$ for
W and $U(\vec{x})$ for the joint density under (9). We calculate $L_{\Delta t}(\vec{x}) = U(\vec{x})/V(\vec{x})$
and observe the limit as $\Delta t \to 0$.
To carry this out, we again note that the joint density is the product of the
transition probability densities. For (7), if we know $x_k$, then $X_{k+1}$ is normal
with mean $x_k$ and variance $\Delta t$. This gives
$$ G(x_k, x_{k+1}, \Delta t) = \frac{1}{\sqrt{2\pi\Delta t}}\,e^{-(x_{k+1} - x_k)^2/2\Delta t} , $$
and
$$ V(\vec{x}) = (2\pi\Delta t)^{-n/2} \exp\left( \frac{-1}{2\Delta t} \sum_{k=0}^{n-1} (x_{k+1} - x_k)^2 \right) . \qquad (10) $$
For (9), the approximation to (8), $X_{k+1}$ is normal with mean $x_k + \Delta t\,a(x_k, t_k)$
and variance $\Delta t$. This makes its transition density
$$ G(x_k, x_{k+1}, \Delta t) = \frac{1}{\sqrt{2\pi\Delta t}}\,e^{-(x_{k+1} - x_k - \Delta t\,a(x_k, t_k))^2/2\Delta t} , $$
so that
$$ U(\vec{x}) = (2\pi\Delta t)^{-n/2} \exp\left( \frac{-1}{2\Delta t} \sum_{k=0}^{n-1} (x_{k+1} - x_k - \Delta t\,a(x_k, t_k))^2 \right) . \qquad (11) $$

To calculate the ratio, we expand (using some obvious notation)
$$ (\Delta x_k - \Delta t\,a_k)^2 = \Delta x_k^2 - 2\Delta t\,a_k\Delta x_k + \Delta t^2 a_k^2 . $$
Dividing U by V removes the $(2\pi\Delta t)^{-n/2}$ factors and the $\Delta x_k^2$ in the exponents.
What remains is
$$ L_{\Delta t}(\vec{x}) = U(\vec{x})/V(\vec{x})
   = \exp\left( \sum_{k=0}^{n-1} a(x_k, t_k)(x_{k+1} - x_k) - \frac{\Delta t}{2} \sum_{k=0}^{n-1} a(x_k, t_k)^2 \right) . $$

The first term in the exponent converges to the Ito integral
$$ \sum_{k=0}^{n-1} a(x_k, t_k)(x_{k+1} - x_k) \;\longrightarrow\; \int_0^T a(X(t), t)\,dX(t) \quad \mbox{as } \Delta t \to 0 , $$
if $t_n = \max\{t_k < T\}$. The second term converges to the Riemann integral
$$ \Delta t \sum_{k=0}^{n-1} a(x_k, t_k)^2 \;\longrightarrow\; \int_0^T a^2(X(t), t)\,dt \quad \mbox{as } \Delta t \to 0 . $$
Altogether, this suggests that if we fix T and let $\Delta t \to 0$, then
$$ \frac{dP}{dW} = L(X) = \exp\left( \int_0^T a(X(t), t)\,dX(t) - \frac{1}{2}\int_0^T a^2(X(t), t)\,dt \right) . \qquad (12) $$
This is the Cameron Martin formula.
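As a sketch of how (12) can be used, the Python code below estimates
$E_P[V(X(T))]$ for the SDE (8) by simulating paths of standard Brownian motion
(the measure W) and weighting each path by the discretized exponent of (12).
The drift $a(x, t) = -x$ and the payout $V(x) = x^2$ are made-up choices for
illustration only.

```python
# Estimate E_P[V(X(T))] for dX = a(X,t)dt + dB by sampling Brownian paths
# (measure W) and weighting with the discrete Cameron Martin ratio
# L = exp(sum a dX - (dt/2) sum a^2). Drift and payout are illustrative.
import numpy as np

def drift(x, t):
    return -x            # hypothetical drift, not from the notes

def V(x):
    return x * x         # hypothetical payout

def cameron_martin_estimate(T=1.0, n_steps=200, n_paths=20_000, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, x0)
    log_L = np.zeros(n_paths)
    for k in range(n_steps):
        dX = np.sqrt(dt) * rng.standard_normal(n_paths)   # Brownian increments
        a = drift(x, k * dt)
        log_L += a * dX - 0.5 * dt * a * a
        x += dX
    return np.mean(V(x) * np.exp(log_L))                  # E_W[V L] = E_P[V]

print(cameron_martin_estimate())
```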

2 Multidimensional diffusions

2.1. Introduction: Some of the most interesting examples, curious phenom-
ena, and challenging problems come from diffusion processes with more than one
state variable. The n state variables are arranged into an n dimensional state
vector X(t) = (X1 (t), . . . , Xn (t))t . We will have a Markov process if the state
vector contains all the information about the past that is helpful in predicting
the future. At least in the beginning, the theory of multidimensional diffusions
is a vector and matrix version of the one dimensional theory.

2.2. Strong solutions: The drift now is a drift for each component of X,
$a(x, t) = (a_1(x, t), \ldots, a_n(x, t))^t$. Each component of a may depend on all com-
ponents of X. The $\sigma$ now is an $n \times m$ matrix, where m is the number of
independent sources of noise. We let B(t) be a column vector of m independent
standard Brownian motion paths, $B(t) = (B_1(t), \ldots, B_m(t))^t$. The stochastic
differential equation is
$$ dX(t) = a(X(t), t)dt + \sigma(X(t), t)dB(t) . \qquad (13) $$
A strong solution is a function X(t, B) that is nonanticipating and satisfies
$$ X(t) = X(0) + \int_0^t a(X(s), s)\,ds + \int_0^t \sigma(X(s), s)\,dB(s) . $$
The middle term on the right is a vector of Riemann integrals whose $k^{th}$ com-
ponent is the standard Riemann integral
$$ \int_0^t a_k(X(s), s)\,ds . $$
The last term on the right is a collection of standard Ito integrals. The $k^{th}$
component is
$$ \sum_{j=1}^m \int_0^t \sigma_{kj}(X(s), s)\,dB_j(s) , $$
with each summand on the right being a scalar Ito integral as defined in previous
lectures.

2.3. Weak form: The weak form of a multidimensional diffusion problem asks
for a probability measure, P, on the probability space $\Omega = C([0, T], R^n)$ with
filtration $\mathcal{F}_t$ generated by $\{X(s)$ for $s \leq t\}$ so that X(t) is a Markov process
with
$$ E\left[ \Delta X \mid \mathcal{F}_t \right] = a(X(t), t)\Delta t + o(\Delta t) , \qquad (14) $$
and
$$ E\left[ \Delta X\,\Delta X^t \mid \mathcal{F}_t \right] = \mu(X(t), t)\Delta t + o(\Delta t) . \qquad (15) $$
Here $\Delta X = X(t+\Delta t) - X(t)$, we assume $\Delta t > 0$, and $\Delta X^t = (\Delta X_1, \ldots, \Delta X_n)$ is
the transpose of the column vector $\Delta X$. The matrix formula (15) is a convenient
way to express the short time variances and covariances$^2$
$$ E\left[ \Delta X_j\,\Delta X_k \mid \mathcal{F}_t \right] = \mu_{jk}(X(t), t)\Delta t + o(\Delta t) . \qquad (16) $$
As for one dimensional diffusions, it is easy to verify that a strong solution of
(13) satisfies (14) and (15) with $\mu = \sigma\sigma^t$.

2.4. Backward equation: As for one dimensional diffusions, the weak form
conditions (14) and (15) give a simple derivation of the backward equation for
$$ f(x, t) = E_{x,t}[V(X(T))] . $$
We start with the tower property in the familiar form
$$ f(x, t) = E_{x,t}[f(x + \Delta X, t + \Delta t)] , \qquad (17) $$
and expand $f(x+\Delta X, t+\Delta t)$ about (x, t) to second order in $\Delta X$ and first order
in $\Delta t$:
$$ f(x + \Delta X, t + \Delta t) = f + \partial_{x_k}f\,\Delta X_k + \tfrac{1}{2}\partial_{x_j}\partial_{x_k}f\,\Delta X_j\Delta X_k + \partial_t f\,\Delta t + R . $$
Here we follow the Einstein summation convention by leaving out the sums over j
and k on the right. We also omit arguments of f and its derivatives when the
arguments are (x, t). For example, $\partial_{x_k}f\,\Delta X_k$ really means
$$ \sum_{k=1}^n \partial_{x_k}f(x, t)\,\Delta X_k . $$

$^2$ The reader should check that the true covariances
$E\left[ (\Delta X_j - E[\Delta X_j])(\Delta X_k - E[\Delta X_k]) \mid \mathcal{F}_t \right]$ also satisfy (16) when $E\left[ \Delta X_j \mid \mathcal{F}_t \right] = O(\Delta t)$.

As in one dimension, the error term R satisfies
$$ |R| \leq C\left( |\Delta X|\,\Delta t + |\Delta X|^3 + \Delta t^2 \right) , $$
so that, as before,
$$ E[|R|] \leq C\,\Delta t^{3/2} . $$
Putting these back into (17) and using (14) and (15) gives (with the same
shorthand)
$$ f = f + a_k(x, t)\partial_{x_k}f\,\Delta t + \tfrac{1}{2}\mu_{jk}(x, t)\partial_{x_j}\partial_{x_k}f\,\Delta t + \partial_t f\,\Delta t + o(\Delta t) . $$
Again we cancel the f from both sides, divide by $\Delta t$ and take $\Delta t \to 0$ to get
$$ \partial_t f + a_k(x, t)\partial_{x_k}f + \tfrac{1}{2}\mu_{jk}(x, t)\partial_{x_j}\partial_{x_k}f = 0 , \qquad (18) $$
which is the backward equation.


It sometimes is convenient to rewrite (18) in matrix vector form. For any
function, f, we may consider its gradient to be the row vector $\nabla_x f = D_x f =
(\partial_{x_1}f, \ldots, \partial_{x_n}f)$. The middle term on the left of (18) is the product of the
row vector $D_x f$ and the column vector a. We also have the Hessian matrix of
second partials $(D_x^2 f)_{jk} = \partial_{x_j}\partial_{x_k}f$. Any symmetric matrix has a trace $\mathrm{tr}(M) =
\sum_k M_{kk}$. The summation convention makes this just $\mathrm{tr}(M) = M_{kk}$. If A and
B are symmetric matrices, then (as the reader should check) $\mathrm{tr}(AB) = A_{jk}B_{jk}$
(with summation convention). With all this, the backward equation may be
written
$$ \partial_t f + D_x f\,a(x, t) + \tfrac{1}{2}\mathrm{tr}\left( \mu(x, t)\,D_x^2 f \right) = 0 . \qquad (19) $$

2.5. Generating correlated Gaussians: Suppose we observe the solution of
(13) and want to reconstruct the matrix $\sigma$. A simpler version of this problem
is to observe
$$ Y = AZ , \qquad (20) $$
and reconstruct A. Here $Z = (Z_1, \ldots, Z_m) \in R^m$, with $Z_k \sim N(0, 1)$ i.i.d.,
is an m dimensional standard normal. If m < n or $\mathrm{rank}(A) < n$ then Y is a
degenerate Gaussian whose probability density (measure) is concentrated on
the subspace of $R^n$ consisting of vectors of the form y = Az for some $z \in R^m$.
The problem is to find A knowing the distribution of Y.

2.6. SVD and PCA: The singular value decomposition (SVD) of A is a
factorization
$$ A = U\Sigma V^t , \qquad (21) $$
where U is an $n \times n$ orthogonal matrix ($U^t U = I_{n\times n}$, the $n \times n$ identity matrix),
V is an $m \times m$ orthogonal matrix ($V^t V = I_{m\times m}$), and $\Sigma$ is an $n \times m$ diagonal
matrix ($\Sigma_{jk} = 0$ if $j \neq k$) with nonnegative singular values on the diagonal:
$\Sigma_{kk} = \sigma_k \geq 0$. We assume the singular values are arranged in decreasing order
$\sigma_1 \geq \sigma_2 \geq \cdots$. The singular values also are called principal components and
the SVD is called principal component analysis (PCA). The columns of U and
V (not $V^t$) are left and right singular vectors respectively, which also are called
principal components or principal component vectors. The calculation
$$ C = AA^t = (U\Sigma V^t)(V\Sigma^t U^t) = U\Sigma\Sigma^t U^t $$
shows that the diagonal $n \times n$ matrix $\Lambda = \Sigma\Sigma^t$ contains the eigenvalues of
$C = AA^t$, which are real and nonnegative because C is symmetric and positive
semidefinite. Therefore, the left singular vectors, the columns of U, are the eigen-
vectors of the symmetric matrix C. The singular values are the nonnegative
square roots of the eigenvalues of C: $\sigma_k = \sqrt{\lambda_k}$. Thus, the singular values and
left singular vectors are determined by C. In a similar way, the right singular
vectors are the eigenvectors of the $m \times m$ positive semidefinite matrix $A^t A$. If
n > m, then the $\sigma_k$ are defined only for $k \leq m$ (there is no $\Sigma_{m+1,m+1}$ in the
$n \times m$ matrix $\Sigma$). Since the rank of C is at most m in this case, we have $\lambda_k = 0$
for k > m. Even when n = m, A may be rank deficient. The rank of A being l
is the same as $\sigma_k = 0$ for k > l. When m > n, the rank of A is at most n.

2.7. The SVD and nonuniqueness of A: Because Y = AZ is Gaussian
with mean zero, its distribution is determined by its covariance $C = E[YY^t] =
E[AZZ^t A^t] = A\,E[ZZ^t]\,A^t = AA^t$. This means that the distribution of Y
determines U and $\Sigma$ but not V. We can see this directly by plugging (21) into
(20) to get
$$ Y = U\Sigma(V^t Z) = U\Sigma Z' , \quad \mbox{where } Z' = V^t Z . $$
Since $Z'$ is a mean zero Gaussian with covariance $V^t V = I$, $Z'$ has the same
distribution as Z, which means that $Y' = U\Sigma Z$ has the same distribution as Y.
Furthermore, if A has rank l < m, then we will have $\sigma_k = 0$ for k > l and we
need not bother with the $Z'_k$ for k > l. That is, for generating Y, we never need
to take m > n or $m > \mathrm{rank}(A)$.
For a simpler point of view, suppose we are given C and want to generate
$Y \sim N(0, C)$ in the form Y = AZ with $Z \sim N(0, I)$. The condition is that
$C = AA^t$. This is a sort of square root of C. One solution is $A = U\Sigma$ as above.
Another solution is the Choleski decomposition of C: $C = LL^t$ for a lower
triangular matrix L. This is most often done in practice because the Choleski
decomposition is easier to compute than the SVD. Any A that works has the
same U and $\Sigma$ in its SVD.
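Here is a short numpy sketch of both square roots of a covariance matrix; the
matrix C below is made up for illustration:

```python
# Two ways to generate Y ~ N(0, C) as Y = A Z with Z ~ N(0, I):
# A = L from the Choleski factorization C = L L^t, or A = U sqrt(Lambda)
# from the eigendecomposition of C. C is a made-up covariance matrix.
import numpy as np

C = np.array([[4.0, 1.0, 0.5],
              [1.0, 2.0, 0.3],
              [0.5, 0.3, 1.0]])

L = np.linalg.cholesky(C)            # lower triangular, C = L L^t
lam, U = np.linalg.eigh(C)           # C = U diag(lam) U^t
A = U @ np.diag(np.sqrt(lam))        # another square root of C

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 200_000))
Y_chol = L @ Z
Y_eig = A @ Z

# Both sample covariances should be close to C.
print(np.cov(Y_chol))
print(np.cov(Y_eig))
```

Both choices of A reproduce the same covariance C; as explained above, they
differ only by an orthogonal factor on the right.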

2.8. Choosing $\sigma(x, t)$: This non uniqueness of A carries over to non unique-
ness of $\sigma(x, t)$ in the SDE (13). A diffusion process X(t) defines $\mu(x, t)$ through
(15), but any $\sigma(x, t)$ with $\sigma\sigma^t = \mu$ leads to the same distribution of X trajec-
tories. In particular, if we have one $\sigma(x, t)$, we may choose any adapted matrix
valued function V(t) with $VV^t \equiv I_{m\times m}$, and use $\sigma' = \sigma V$. To say this another
way, if we solve $dZ'(t) = V(t)dZ(t)$ with $Z'(0) = 0$, then $Z'(t)$ also is a Brownian
motion. (The Levy uniqueness theorem states that any continuous path process
that is weakly Brownian motion in the sense that $a \equiv 0$ and $\mu \equiv I$ in (14) and
(15) actually is Brownian motion in the sense that the measure on $\Omega$ is Wiener
measure.) Therefore, using $dZ' = V(t)dZ$ gives the same measure on the space
of paths X(t).
The conclusion is that it is possible for SDEs with different $\sigma(x, t)$ to repre-
sent the same X distribution. This happens when $\sigma\sigma^t = \sigma'\sigma'^t$. If we have $\mu$, we
may represent the process X(t) as the strong solution of an SDE (13). For this,
we must choose with some arbitrariness a $\sigma(x, t)$ with $\sigma(x, t)\sigma(x, t)^t = \mu(x, t)$.
The number of noise sources, m, is the number of non zero eigenvalues of $\mu$. We
never need to take m > n, but m < n may be called for if $\mu$ has rank less than
n.

2.9. Correlated Brownian motions: Sometimes we wish to use the SDE model
(13) where the $B_k(t)$ are correlated. We can accomplish this with a change in $\sigma$.
Let us see how to do this in the simpler case of generating correlated standard
normals. In that case, we want $Z = (Z_1, \ldots, Z_m)^t \in R^m$ to be a multivariate
mean zero normal with $\mathrm{var}(Z_k) = 1$ and given correlation coefficients
$$ \rho_{jk} = \frac{\mathrm{cov}(Z_j, Z_k)}{\sqrt{\mathrm{var}(Z_j)\mathrm{var}(Z_k)}} = \mathrm{cov}(Z_j, Z_k) . $$
This is the same as generating Z with covariance matrix C with ones on the
diagonal and $C_{jk} = \rho_{jk}$ when $j \neq k$. We know how to do this: choose A with
$AA^t = C$ and take $Z = AZ'$. This also works in the SDE. We solve
$$ dX(t) = a(X(t), t)dt + \sigma(X(t), t)A\,dB(t) , $$
with the $B_k$ being independent standard Brownian motions. We get the effect
of correlated Brownian motions by using independent ones and replacing $\sigma(x, t)$
by $\sigma(x, t)A$.

2.10. Normal copulas (a digression): Suppose we have a probability den-
sity u(y) for a scalar random variable Y. We often want to generate families
$Y_1, \ldots, Y_m$ so that each $Y_k$ has the density u(y) but different $Y_k$ are correlated.
A favorite heuristic for doing this$^3$ is the normal copula. Let $U(y) = P(Y < y)$
be the cumulative distribution function (CDF) for Y. Then the $Y_k$ will have
density u(y) if and only if $U(Y_k) = T_k$ with the $T_k$ uniformly distributed in
the interval [0, 1] (check this). In turn, the $T_k$ are uniformly distributed in [0, 1]
if $T_k = N(Z_k)$ where the $Z_k$ are standard normals and N(z) is the standard
normal CDF. Now, rather than generating independent $Z_k$, we may use corre-
lated ones as above. This in turn leads to correlated $T_k$ and correlated $Y_k$. I do
not know how to determine the Z correlations in order to get a specified set of
Y correlations.

$^3$ I hope this goes out of fashion in favor of more thoughtful methods that postulate some
mechanism for the correlations.
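As a concrete sketch of the copula construction (the exponential marginal, the
correlation 0.8, and the use of scipy for the CDFs are made-up illustrative
choices):

```python
# Normal copula sketch: correlated samples with Exp(1) marginals.
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)
rho = 0.8
C = np.array([[1.0, rho], [rho, 1.0]])
L = np.linalg.cholesky(C)

Z = L @ rng.standard_normal((2, 100_000))   # correlated standard normals
T = norm.cdf(Z)                             # correlated uniforms in [0, 1]
Y = expon.ppf(T)                            # invert the target CDF: Y_k ~ Exp(1)

print(np.corrcoef(Y))                       # the Y correlation is not rho
```

The printed correlation of the $Y_k$ generally differs from the Z correlation $\rho$,
which is the difficulty mentioned at the end of the paragraph above.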

2.11. Degenerate diffusions: Many practical applications have fewer sources
of noise than state variables. In the strong form (13) this is expressed as m < n
or m = n and $\det(\sigma) = 0$. In the weak form $\mu$ is always $n \times n$ but it may be
rank deficient. In either case we call the stochastic process a degenerate diffu-
sion. Nondegenerate diffusions have qualitative behavior like that of Brownian
motion: every component has infinite total variation and finite quadratic varia-
tion, transition densities are smooth functions of x and t (for t > 0) and satisfy
forward and backward equations (in different variables) in the usual sense, etc.
Degenerate diffusions may lack some or all of these properties. The qualitative
behavior of degenerate diffusions is subtle and problem dependent. There are
some examples in the homework. Computational methods that work well for
nondegenerate diffusions may fail for degenerate ones.

2.12. A degenerate diffusion for Asian options: An Asian option gives a
payout that depends on some kind of time average of the price of the under-
lying security. The simplest form would have the underlier being a geometric
Brownian motion in the risk neutral measure
$$ dS(t) = rS(t)dt + \sigma S(t)dB(t) , \qquad (22) $$
and a payout that depends on $\int_0^T S(t)\,dt$. This leads us to evaluate
$$ E[V(Y(T))] , \quad \mbox{where} \quad Y(T) = \int_0^T S(t)\,dt . $$
To get a backward equation for this, we need to identify a state space so
that the state is a Markov process. We use the two dimensional vector
$$ X(t) = \begin{pmatrix} S(t) \\ Y(t) \end{pmatrix} , $$
where S(t) satisfies (22) and dY(t) = S(t)dt. Then X(t) satisfies (13) with
$$ a = \begin{pmatrix} rS \\ S \end{pmatrix} , $$
and m = 1 < n = 2 and (with the usual double meaning of $\sigma$)
$$ \sigma = \begin{pmatrix} \sigma S \\ 0 \end{pmatrix} . $$
For the backward equation we have
$$ \mu = \sigma\sigma^t = \begin{pmatrix} \sigma^2 S^2 & 0 \\ 0 & 0 \end{pmatrix} , $$
so the backward equation is
$$ \partial_t f + rs\,\partial_s f + s\,\partial_y f + \frac{\sigma^2 s^2}{2}\,\partial_s^2 f = 0 . \qquad (23) $$

Note that this is a partial differential equation in two space variables,
$x = (s, y)^t$. Of course, we are interested in the answer at t = 0 only for y = 0.
Still, we have to include other y values in the computation. If we were to try the
standard finite difference approximate solution of (23) we might use a central
difference approximation $\partial_y f(s, y, t) \approx \frac{1}{2\Delta y}\left( f(s, y + \Delta y, t) - f(s, y - \Delta y, t) \right)$.
If $\sigma > 0$ it is fine to use a central difference approximation for $\partial_s f$, and this
is what most people do. However, a central difference approximation for $\partial_y f$
leads to an unstable computation that does not produce anything like the right
answer. The inherent instability of central differencing is masked in s by the
strongly stabilizing second derivative term, but there is nothing to stabilize the
unstable y differencing in this degenerate diffusion problem.
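One practical alternative to the finite difference treatment of (23) is to simulate
the state vector (S, Y) directly and average the payout. The sketch below uses a
forward Euler step (see paragraph 2.18 below); the parameter values and the
average price call payout are made-up illustrations.

```python
# Monte Carlo for the Asian option state (S, Y): forward Euler for (22)
# together with dY = S dt, then average the discounted payout.
# Parameters and the payout (average-price call) are illustrative.
import numpy as np

def asian_mc(S0=100.0, r=0.05, sigma=0.2, T=1.0, K=100.0,
             n_steps=250, n_paths=100_000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(n_paths, S0)
    Y = np.zeros(n_paths)
    for k in range(n_steps):
        Z = rng.standard_normal(n_paths)
        Y += S * dt                                        # dY = S dt
        S += r * S * dt + sigma * S * np.sqrt(dt) * Z      # Euler step for (22)
    payoff = np.maximum(Y / T - K, 0.0)                    # average-price call
    return np.exp(-r * T) * payoff.mean()

print(asian_mc())
```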

2.13. Integration with dX: We seek the analogue of the Ito integral and
Ito's lemma for a more general diffusion. If we have a function f(x, t), we seek
a formula df = a dt + b dX. This would mean that
$$ f(X(T), T) = f(X(0), 0) + \int_0^T a(t)\,dt + \int_0^T b(t)\,dX(t) . \qquad (24) $$
The first integral on the right would be a Riemann integral that would be defined
for any continuous function a(t). The second would be like the Ito integral with
Brownian motion, whose definition depends on b(t) being an adapted process.
The definition of the dX Ito integral should be so that Ito's lemma becomes
true.
For small $\Delta t$ we seek to approximate $\Delta f = f(X(t + \Delta t), t + \Delta t) - f(X(t), t)$.
If this follows the usual pattern (partial justification below), we should expand
to second order in $\Delta X$ and first order in $\Delta t$. This gives (with summation con-
vention)
$$ \Delta f \approx (\partial_{x_j}f)\Delta X_j + \tfrac{1}{2}(\partial_{x_j}\partial_{x_k}f)\Delta X_j\Delta X_k + \partial_t f\,\Delta t . \qquad (25) $$
As with the Ito lemma for Brownian motion, the key idea is to replace the
products $\Delta X_j\Delta X_k$ by their expected values (conditional on $\mathcal{F}_t$). If this is true,
(15) suggests the general Ito lemma
$$ df = (\partial_{x_j}f)\,dX_j + \left( \tfrac{1}{2}(\partial_{x_j}\partial_{x_k}f)\mu_{jk} + \partial_t f \right) dt , \qquad (26) $$
where all quantities are evaluated at (X(t), t).

2.14. Ito's rule: One often finds this expressed in a slightly different way. A
simpler way to represent the small time variance condition (15) is
$$ E[dX_j\,dX_k] = \mu_{jk}(X(t), t)\,dt . $$
(Though it probably should be $E\left[ dX_j\,dX_k \mid \mathcal{F}_t \right]$.) Then (26) becomes
$$ df = (\partial_{x_j}f)\,dX_j + \tfrac{1}{2}(\partial_{x_j}\partial_{x_k}f)\,E[dX_j\,dX_k] + \partial_t f\,dt . $$
This has the advantage of displaying the main idea, which is that the fluctuations
in $dX_j$ are important but only the mean values of $dX^2$ are important, not the
fluctuations. Ito's rule (never enunciated by Ito as far as I know) is the formula
$$ dX_j\,dX_k = \mu_{jk}\,dt . \qquad (27) $$
Although this leads to the correct formula (26), it is not strictly true, since the
standard deviation of the left side is as large as its mean.
In the derivation of (26) sketched below, the total change in f is represented
as the sum of many small increments. As with the law of large numbers, the
sum of many random numbers can be much closer to its mean (in relative terms)
than the random summands.

2.15. Ito integral: The definition of the dX Ito integral follows the definition
of the Ito integral with respect to Brownian motion. Here is a quick sketch
with many details missing. Suppose X(t) is a multidimensional diffusion pro-
cess, $\mathcal{F}_t$ is the $\sigma$-algebra generated by the X(s) for $0 \leq s \leq t$, and b(t) is a
possibly random function that is adapted to $\mathcal{F}_t$. There are n components of b(t)
corresponding to the n components of X(t). The Ito integral is ($t_k = k\Delta t$ as
usual):
$$ \int_0^T b(t)\,dX(t) = \lim_{\Delta t \to 0} \sum_{t_k < T} b(t_k)\left( X(t_{k+1}) - X(t_k) \right) . \qquad (28) $$
This definition makes sense because the limit exists (almost surely) for a rich
enough family of integrands b(t). Let $Y_{\Delta t} = \sum_{t_k < T} b(t_k)\left( X(t_{k+1}) - X(t_k) \right)$ and
write (for appropriately chosen T)
$$ Y_{\Delta t/2} - Y_{\Delta t} = \sum_{t_k < T} R_k , $$
where
$$ R_k = \left( b(t_{k+1/2}) - b(t_k) \right)\left( X(t_{k+1}) - X(t_{k+1/2}) \right) . $$
The bound
$$ E\left[ \left( Y_{\Delta t/2} - Y_{\Delta t} \right)^2 \right] = O(\Delta t^p) , \qquad (29) $$
implies that the limit (28) exists almost surely if $\Delta t_l = 2^{-l}$.
As in the Brownian motion case, we assume that b(t) has the (lack of)
smoothness of Brownian motion: $E[(b(t + \Delta t) - b(t))^2] = O(\Delta t)$. In the mar-
tingale case (drift = a $\equiv$ 0 in (14)), $E[R_j R_k] = 0$ if $j \neq k$. In evaluating $E[R_k^2]$,
we get from (15) that
$$ E\left[ \left( X(t_{k+1}) - X(t_{k+1/2}) \right)^2 \,\Big|\, \mathcal{F}_{t_{k+1/2}} \right] = O(\Delta t) . $$
Since $b(t_{k+1/2})$ is known in $\mathcal{F}_{t_{k+1/2}}$, we may use the tower property and our
assumption on b to get
$$ E[R_k^2] \leq E\left[ \left( X(t_{k+1}) - X(t_{k+1/2}) \right)^2 \left( b(t_{k+1/2}) - b(t_k) \right)^2 \right] = O(\Delta t^2) . $$

This gives (29) with p = 1 (as for Brownian motion) for that case. For the
general case, my best effort is too complicated for these notes and gives (29)
with p = 1/2.

2.16. Ito's lemma: We give a half sketch of the proof of Ito's lemma for
diffusions. We want to use k to represent the time index (as in $t_k = k\Delta t$) so
we replace the index notation above with vector notation: $\partial_x f\,\Delta X$ instead of
$\partial_{x_k}f\,\Delta X_k$, $\partial_x^2 f(\Delta X, \Delta X)$ instead of $(\partial_{x_j}\partial_{x_k}f)\Delta X_j\Delta X_k$, and $\mathrm{tr}(\mu\,\partial_x^2 f)$ instead
of $(\partial_{x_j}\partial_{x_k}f)\mu_{jk}$. Then $\Delta X_k$ will be the vector $X(t_{k+1}) - X(t_k)$ and $\partial_x^2 f_k$ the
$n \times n$ matrix of second partial derivatives of f evaluated at $(X(t_k), t_k)$, etc.
Now it is easy to see how $f(X(T), T) - f(X(0), 0) = \sum_{t_k < T} \Delta f_k$ is given
by the Riemann and Ito integrals of the right side of (26). We have
$$ \Delta f_k = \partial_t f_k\,\Delta t + \partial_x f_k\,\Delta X_k + \tfrac{1}{2}\partial_x^2 f_k(\Delta X_k, \Delta X_k)
   + O(\Delta t^2) + O\left( \Delta t\,|\Delta X_k| \right) + O\left( |\Delta X_k|^3 \right) . $$
As $\Delta t \to 0$, the contribution from the second row terms vanishes (the third
term takes some work, see below). The sum of the $\partial_t f_k\,\Delta t$ converges to the
Riemann integral $\int_0^T \partial_t f(X(t), t)\,dt$. The sum of the $\partial_x f_k\,\Delta X_k$ converges to the
Ito integral $\int_0^T \partial_x f(X(t), t)\,dX(t)$. The remaining term may be written as
$$ \partial_x^2 f_k(\Delta X_k, \Delta X_k) = E\left[ \partial_x^2 f_k(\Delta X_k, \Delta X_k) \mid \mathcal{F}_{t_k} \right] + U_k . $$
It can be shown that
$$ E\left[ |U_k|^2 \mid \mathcal{F}_{t_k} \right] \leq C\,E\left[ |\Delta X_k|^4 \mid \mathcal{F}_{t_k} \right] \leq C\Delta t^2 , $$
as it is for Brownian motion. This shows (with $E[U_j U_k] = 0$) that
$$ E\left[ \Big( \sum_{t_k < T} U_k \Big)^2 \right] = \sum_{t_k < T} E\left[ |U_k|^2 \right] \leq CT\Delta t , $$
so $\sum_{t_k < T} U_k \to 0$ as $\Delta t \to 0$ almost surely (with $\Delta t = 2^{-l}$). Finally, the small
time variance formula (15) gives
$$ E\left[ \partial_x^2 f_k(\Delta X_k, \Delta X_k) \mid \mathcal{F}_{t_k} \right] = \mathrm{tr}\left( \partial_x^2 f_k\,\mu_k \right)\Delta t + o(\Delta t) , $$
so
$$ \sum_{t_k < T} E\left[ \partial_x^2 f_k(\Delta X_k, \Delta X_k) \mid \mathcal{F}_{t_k} \right] \;\longrightarrow\; \int_0^T \mathrm{tr}\left( \partial_x^2 f(X(t), t)\,\mu(X(t), t) \right) dt , $$
(the Riemann integral) as $\Delta t \to 0$. This shows how the terms in the Ito lemma
(26) are accounted for.

2.17. Theory left out: We did not show that there is a process satisfying (14)
and (15) (existence) or that these conditions characterize the process (unique-
ness). Even showing that a process satisfying (14) and (15) with zero drift and
$\mu = I$ is Brownian motion is a real theorem: the Levy uniqueness theorem.
The construction of the stochastic process X(t) (existence) also gives bounds
on higher moments, such as $E\left[ |\Delta X|^4 \right] \leq C\Delta t^2$, that we used above. The
higher moment estimates are true for Brownian motion because the increments
are Gaussian.

2.18. Approximating diffusions: The strong form formulation of the
diffusion problem (13) suggests a way to generate approximate diffusion paths.
If $X_k$ is the approximation to $X(t_k)$ we can use
$$ X_{k+1} = X_k + a(X_k, t_k)\Delta t + \sigma(X_k, t_k)\sqrt{\Delta t}\,Z_k , \qquad (30) $$
where the $Z_k$ are i.i.d. $N(0, I_{m\times m})$. This has the properties corresponding to
(14) and (15) that
$$ E\left[ X_{k+1} - X_k \mid X_1, \ldots, X_k \right] = a(X_k, t_k)\Delta t $$
and
$$ \mathrm{cov}\left( X_{k+1} - X_k \right) = \mu(X_k, t_k)\Delta t . $$
This is the forward Euler method. There are methods that are better in some
ways, but in a surprisingly large number of problems, methods better than this
are not known. This is a distinct contrast to numerical solution of ordinary
differential equations (without noise), for which forward Euler almost never is
the method of choice. There is much research to do to help the SDE solution
methodology catch up to the ODE solution methodology.
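A generic numpy sketch of the forward Euler recursion (30); the drift, the
$\sigma$ matrix, and the example coefficients are placeholders chosen only for
illustration:

```python
# Forward Euler (30) for dX = a(X,t)dt + sigma(X,t)dB with n state variables
# and m noise sources. The example drift and sigma below are made up.
import numpy as np

def forward_euler(a, sigma, x0, T, n_steps, m, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for k in range(n_steps):
        Z = rng.standard_normal(m)                         # i.i.d. N(0, I) vector
        x = x + a(x, k * dt) * dt + sigma(x, k * dt) @ (np.sqrt(dt) * Z)
        path.append(x.copy())
    return np.array(path)

# Example: a two dimensional system with constant noise matrix.
a = lambda x, t: -x
sigma = lambda x, t: np.array([[0.5, 0.0],
                               [0.1, 0.3]])
path = forward_euler(a, sigma, x0=[1.0, 0.0], T=1.0, n_steps=1000, m=2)
print(path[-1])
```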

2.19. Drift change of measure: The analogue of the Cameron Martin
formula for general diffusions is the Girsanov formula. We derive it by writing
the joint densities for the discrete time processes (30) with and without the
drift term a. As usual, this is a product of transition probabilities, the conditional
probability densities for $X_{k+1}$ conditional on knowing $X_j$ for $j \leq k$. Actually,
because (30) is a Markov process, the conditional density for $X_{k+1}$ depends on
$X_k$ only. We write it $G(x_k, x_{k+1}, t_k, \Delta t)$. Conditional on $X_k$, $X_{k+1}$ is a multi-
variate normal with covariance matrix $\mu(X_k, t_k)\Delta t$. If $a \equiv 0$, the mean is $X_k$.
Otherwise, the mean is $X_k + a(X_k, t_k)\Delta t$. We write $\mu_k$ and $a_k$ for $\mu(X_k, t_k)$
and $a(X_k, t_k)$.
Without drift, the Gaussian transition density is
$$ G(x_k, x_{k+1}, t_k, \Delta t) = \frac{1}{(2\pi\Delta t)^{n/2}\sqrt{\det(\mu_k)}}
   \exp\left( \frac{-(x_{k+1} - x_k)^t \mu_k^{-1} (x_{k+1} - x_k)}{2\Delta t} \right) . \qquad (31) $$
With nonzero drift, the prefactor
$$ z_k = \frac{1}{(2\pi\Delta t)^{n/2}\sqrt{\det(\mu_k)}} $$

is the same and the exponential factor accommodates the new mean:
$$ G(x_k, x_{k+1}, t_k, \Delta t) = z_k
   \exp\left( \frac{-(x_{k+1} - x_k - a_k\Delta t)^t \mu_k^{-1} (x_{k+1} - x_k - a_k\Delta t)}{2\Delta t} \right) . \qquad (32) $$
Let $U(x_1, \ldots, x_N)$ be the joint density without drift and $V(x_1, \ldots, x_N)$ with
drift. We want to evaluate $L(\vec{x}) = V(\vec{x})/U(\vec{x})$. Both U and V are products of
the appropriate transition densities G. In the division, the prefactors $z_k$ cancel,
as they are the same for U and V because the $\mu_k$ are the same.
The main calculation is the subtraction of the exponents:
$$ (\Delta x_k - a_k\Delta t)^t \mu_k^{-1} (\Delta x_k - a_k\Delta t) - \Delta x_k^t \mu_k^{-1} \Delta x_k
   = -2\Delta t\,a_k^t \mu_k^{-1} \Delta x_k + \Delta t^2\,a_k^t \mu_k^{-1} a_k . $$
This gives:
$$ L(\vec{x}) = \exp\left( \sum_{k=0}^{N-1} a_k^t \mu_k^{-1} \Delta x_k - \frac{\Delta t}{2} \sum_{k=0}^{N-1} a_k^t \mu_k^{-1} a_k \right) . $$

This is the exact likelihood ratio between the discrete time processes with and without drift.
If we take the limit $\Delta t \to 0$ for the continuous time problem, the two terms in
the exponent converge respectively to the Ito integral
$$ \int_0^T a(X(t), t)^t \mu(X(t), t)^{-1}\,dX(t) , $$
and the Riemann integral
$$ \int_0^T \frac{1}{2}\,a(X(t), t)^t \mu(X(t), t)^{-1} a(X(t), t)\,dt . $$
The result is the Girsanov formula
$$ \frac{dP}{dQ} = L(X)
   = \exp\left( \int_0^T a(X(t), t)^t \mu(X(t), t)^{-1}\,dX(t)
   - \frac{1}{2}\int_0^T a(X(t), t)^t \mu(X(t), t)^{-1} a(X(t), t)\,dt \right) . \qquad (33) $$
