1 Overture
1.4. Diffusions and Ito calculus: The Ito calculus is a tool for studying continuous stochastic processes in continuous time. If X(t) is a differentiable function of time, then ΔX = X(t + Δt) − X(t) is of the order of¹ Δt. Therefore Δf(X(t)) = f(X(t + Δt)) − f(X(t)) ≈ f′(X(t)) ΔX to this accuracy. For an Ito process, ΔX is of the order of √Δt, so Δf ≈ f′ ΔX + (1/2) f″ ΔX² has an error smaller than Δt. In the special case where X(t) is Brownian motion, it is often permissible (and the basis of the Ito calculus) to replace ΔX² by its mean value, Δt.

¹This means that there is a C so that |X(t + Δt) − X(t)| ≤ C|Δt| for small Δt.
2 Discrete probability
Here are some basic definitions and ideas of probability. These might seem dry
without examples. Be patient. Examples are coming in later sections. Although
the topic is elementary, the notation is taken from more advanced probability
so some of it might be unfamiliar. The terminology is not always helpful for
simple problems but it is just the thing for describing stochastic processes and
decision problems under incomplete information.
Usually, we specify an event in some way other than listing all the outcomes in it (see below). We do not distinguish between the outcome ω and the event that that outcome occurred, A = {ω}. That is, we write P(ω) for P({ω}) or vice versa. This is called abuse of notation: we use notation in a way that is not absolutely correct but whose meaning is clear. It's the mathematical version of saying "I could care less" to mean the opposite.
any set) that is not countable is called uncountable. This distinction was
formalized by the late nineteenth century mathematician Georg Cantor, who
showed that the set of (real) numbers in the interval [0, 1] is not countable.
Under the uniform probability density, P(ω) = 0 for any ω ∈ [0, 1]. It is hard to
imagine that the probability formula (1) is useful in this case, since every term
in the sum is zero. The difference between continuous and discrete probability
is the difference between integrals and sums.
2.5. Example: Toss a coin 4 times. Each toss yields either H (heads) or T (tails). There are 16 possible outcomes, TTTT, TTTH, TTHT, TTHH, THTT, . . ., HHHH. The number of outcomes is #(Ω) = |Ω| = 16. We suppose that each outcome is equally likely, so P(ω) = 1/16 for each ω ∈ Ω. If A is the event that the first two tosses are H, then

A = {HHHH, HHHT, HHTH, HHTT} .

There are 4 elements (outcomes) in A, each having probability 1/16. Therefore

P(first two H) = P(A) = Σ_{ω∈A} P(ω) = 4/16 = 1/4 .
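Counts like these are easy to check by brute-force enumeration. A small sketch in Python (the event A and the uniform probabilities come from the example; the variable names are illustrative):

```python
from itertools import product

# All 16 outcomes of 4 tosses, each with probability 1/16.
outcomes = [''.join(w) for w in product('HT', repeat=4)]
prob = {w: 1.0 / 16.0 for w in outcomes}

# A: the first two tosses are H.
A = [w for w in outcomes if w.startswith('HH')]
P_A = sum(prob[w] for w in A)

print(sorted(A))   # ['HHHH', 'HHHT', 'HHTH', 'HHTT']
print(P_A)         # 0.25
```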
2.6. Set operations: Events are sets, so set operations apply to events. If A and B are events, the event "A and B" is the set of outcomes in both A and B. This is the set intersection A ∩ B, because the outcomes that make both A and B happen are those that are in both events. The union A ∪ B is the set of outcomes in A or in B (or in both). The complement of A, Aᶜ, is the event "not A", the set of outcomes not in A. The empty event is the empty set, the set with no elements, ∅. The probability of ∅ should be zero because the sum that defines it has no terms: P(∅) = 0. The complement of ∅ is Ω. Events A and B are disjoint if A ∩ B = ∅. Event A is contained in event B, A ⊆ B, if every outcome in A is also in B. For example, if the event A is as above and B is the event that the first toss is H, then A ⊆ B.
2.7. Basic facts: Each of these facts is a consequence of the representation P(A) = Σ_{ω∈A} P(ω). First, P(A) ≤ P(B) if A ⊆ B. Also, P(A) + P(B) = P(A ∪ B) if P(A ∩ B) = 0, but not otherwise. If P(ω) ≠ 0 for all ω, then P(A ∩ B) = 0 only when A and B are disjoint. Clearly, P(A) + P(Aᶜ) = P(Ω) = 1.
2.9. Independence: Events A and B are independent if P(A | B) = P(A). That is, knowing whether or not B occurred does not change the probability of A. In view of Bayes rule, this is expressed as

P(A ∩ B) = P(A) P(B) .   (3)

For example, suppose A is the event that two of the four tosses are H, and B is the event that the first toss is H. Then A has 6 elements (outcomes), B has 8, and, as you can check by listing them, A ∩ B has 3 elements. Since each element has probability 1/16, this gives P(A ∩ B) = 3/16 while P(A) = 6/16 and P(B) = 8/16 = 1/2. We might say "duh" for the last calculation since we started the example with the hypothesis that H and T were equally likely. Anyway, this shows that (3) is indeed satisfied in this case. This example is supposed to show that while some pairs of events, such as the first and second tosses, are obviously independent, others are independent as the result of a calculation. Note that if C is the event that 3 of the 4 tosses are H (instead of 2 for A), then P(C) = 4/16 = 1/4 and P(B ∩ C) = 3/16; since 3/16 ≠ (1/2)(1/4), the events B and C are not independent.
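The independence calculations can be repeated by machine. A sketch, enumerating the same events A, B, and C:

```python
from itertools import product

outcomes = [''.join(w) for w in product('HT', repeat=4)]
P = lambda E: len(E) / 16.0

A = {w for w in outcomes if w.count('H') == 2}    # two of four tosses are H
B = {w for w in outcomes if w[0] == 'H'}          # first toss is H
C = {w for w in outcomes if w.count('H') == 3}    # three of four tosses are H

print(P(A & B), P(A) * P(B))   # 0.1875 0.1875: A and B are independent
print(P(B & C), P(B) * P(C))   # 0.1875 0.125: B and C are not
```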
2.10. Working with conditional probability: Let us fix the event B, and discuss the conditional probability P̃(ω) = P(ω | B), which also is a probability (assuming P(B) > 0). There are two slightly different ways to discuss P̃. One way is to take B to be the probability space and define

P̃(ω) = P(ω) / P(B)

for all ω ∈ B. Since B is the probability space for P̃, we do not have to define P̃(ω) for ω ∉ B. This P̃ is a probability because P̃(ω) ≥ 0 for all ω ∈ B and Σ_{ω∈B} P̃(ω) = 1. The other way is to keep Ω as the probability space and set the conditional probabilities to zero for ω ∉ B. If we know the event B happened, then the probability of an outcome not in B is zero:

P(ω | B) = P(ω)/P(B) for ω ∈ B,   P(ω | B) = 0 for ω ∉ B.   (4)
C, occurred given that B occurred, should be the conditional probability of ω given that both B and C occurred. Bayes rule verifies this intuition:

P̃(ω | C) = P̃(ω) / P̃(C)
          = P(ω | B) / P(C | B)
          = ( P(ω)/P(B) ) / ( P(C ∩ B)/P(B) )
          = P(ω) / P(B ∩ C)
          = P(ω | B ∩ C) .
2.13. Example 2 of an F: Suppose we know only the results of the tosses but not the order. This might happen if we toss 4 identical coins at the same time. In this case, we know only the number of H coins. Some measurable sets are (with an abuse of notation)

{4} = {HHHH}
{3} = {HHHT, HHTH, HTHH, THHH}
. . .
{0} = {TTTT}

The event {2} has 6 outcomes (list them), so its probability is 6/16 = 3/8. There are other events measurable in this algebra, such as "fewer than 3 H", but, in some sense, the events listed generate the algebra.
(check), but not a σ-algebra. For example, if Aₙ leaves out only the first n odd integers, then A = ∩ₙ Aₙ is the set of even integers, and neither A nor Aᶜ is finite.
two functions, X1 and X2, we might try to calculate the probability that they are equal, P(X1 = X2). Strictly speaking, this is the probability of the set of ω so that X1(ω) = X2(ω).
way, if Bx is not empty, then there is some number, u(x), so that Y(ω) = u(x) for every ω ∈ Bx. This means that Y(ω) = u(X(ω)) for all ω ∈ Ω. Altogether, saying Y ∈ FX is a fancy way of saying that Y is a function of X. Of course, u(x) only needs to be defined for those values of x actually taken by the random variable X.
For example, if X is the number of H in 4 tosses, and Y is the number of H minus the number of T, then, for any 4 tosses, ω, Y(ω) = 2X(ω) − 4. That is, u(x) = 2x − 4.
This is because P(ω) is the fraction of the time you would get ω and X(ω) is the number you get for ω. If X1(ω) and X2(ω) are two random variables, then E[X1 + X2] = E[X1] + E[X2]. Also, E[cX] = cE[X] if c is a constant (not random).
2.24. Classical conditional expectation: There are two senses of the term conditional expectation. We start with the original classical sense then turn to the related but different modern sense often used in stochastic processes. Conditional expectation is defined from conditional probability in the obvious way:

E[X | B] = Σ_{ω∈B} X(ω) P(ω | B) .   (6)

Write B for the event {at least one H}. Since only ω = TTTT does not have at least one H, |B| = 15 and P(ω | B) = 1/15 for any ω ∈ B. Let X(ω) be the number of H in ω. Unconditionally, E[X] = 2, which means

(1/16) Σ_{ω∈Ω} X(ω) = 2 ,

and therefore, since X(TTTT) = 0,

(1/16) Σ_{ω∈B} X(ω) = 2 ,

(15/16) · (1/15) Σ_{ω∈B} X(ω) = 2 ,

(1/15) Σ_{ω∈B} X(ω) = 2 · (16/15) = 32/15 ,

E[X | B] = 32/15 = 2 + .133 . . . .

Knowing that there was at least one H increases the expected number of H by .133 . . ..
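A quick enumeration confirms E[X | B] = 32/15. A sketch, with B and X as in the example:

```python
from itertools import product

outcomes = [''.join(w) for w in product('HT', repeat=4)]
X = {w: w.count('H') for w in outcomes}              # number of H
B = [w for w in outcomes if X[w] >= 1]               # at least one H

E_X = sum(X[w] for w in outcomes) / 16.0             # unconditional
E_X_given_B = sum(X[w] for w in B) / len(B)          # P(w | B) = 1/15 on B

print(E_X, round(E_X_given_B, 4))   # 2.0 2.1333
```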
This is easy to understand: exactly one of the events Bk happens. The expected
value of X is the sum over each of the events Bk of the expected value of X
given that Bk happened, multiplied by the probability that Bk did happen. The
derivation is a simple combination of the definitions of conditional expectation
(6) and conditional probability (4):
E[X] = Σ_{ω∈Ω} X(ω) P(ω)

     = Σ_k ( Σ_{ω∈Bk} X(ω) P(ω) )

     = Σ_k ( Σ_{ω∈Bk} X(ω) P(ω)/P(Bk) ) P(Bk)

     = Σ_k E[X | Bk] P(Bk) .
This fact underlies the recurrence relations that are among the primary tools of
stochastic calculus. It will be reformulated below as the tower property when
we discuss the modern view of conditional probability.
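The identity E[X] = Σ_k E[X | Bk] P(Bk) can be checked on the coin-toss space. A sketch, partitioning by the number of heads (the choice of partition is for illustration):

```python
from itertools import product

outcomes = [''.join(w) for w in product('HT', repeat=4)]
X = {w: w.count('H') for w in outcomes}

# Partition by the number of heads: B_k = {w : X(w) = k}, k = 0, ..., 4.
total = 0.0
for k in range(5):
    Bk = [w for w in outcomes if X[w] == k]
    P_Bk = len(Bk) / 16.0
    E_X_given_Bk = sum(X[w] for w in Bk) / len(Bk)
    total += E_X_given_Bk * P_Bk          # E[X | B_k] P(B_k)

E_X = sum(X[w] for w in outcomes) / 16.0
print(total, E_X)   # 2.0 2.0
```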
Make sure you understand the fact that this two valued function Y is measurable with respect to F_B.
Only slightly more complicated is the case where F is generated by a partition, P = {B1, B2, . . .}, of Ω. The conditional expectation Y(ω) = E[X | F] is defined to be

Y(ω) = E[X | Bj]  if ω ∈ Bj ,   (8)

where E[X | Bj] is the classical conditional expectation (6). A single set B defines a partition: B1 = B, B2 = Bᶜ, so this agrees with the earlier definition in that case. The information in F is only which of the Bj occurred. The modern conditional expectation replaces X with its expected value over the set that occurred. This is the expected value of X given the information in F.
value of X over the outcomes in the equivalence class. For example:

TTTT: class {0}, X = 0, expected value = 0
HHHH: class {4}, X = 4, expected value = 4

E[(Z − X)²] ≥ E[(Y − X)²] .
This can be expressed in the terminology of linear algebra. The set of functions (random variables) X is a vector space (Hilbert space) with inner product

⟨X, Y⟩ = Σ_{ω∈Ω} X(ω) Y(ω) P(ω) = E[XY] ,

so ‖X − Y‖² = E[(X − Y)²]. The set of functions measurable with respect to F is a subspace, which we call S_F. The conditional expectation, Y, is the orthogonal projection of X onto S_F, which is the element of S_F that is closest to X in the norm just given.
Z(ω) = E[X | Ck]
     = Σ_j E[X | Bjk] P(Bjk | Ck)
     = Σ_j Y(Bjk) P(Bjk | Ck)
     = E[Y | Ck] .
The linear algebra projection interpretation makes the tower property seem
obvious. Any function measurable with respect to G is also measurable with
respect to F, which means that the subspace SG is contained in SF . If you
project X onto SF then project the projection onto SG , you get the same thing
as projecting X directly onto SG (always orthogonal projections).
representation of the probability as an expected value. The modern conditional probability then is P(A | F) = E[1_A | F]. Unraveling the definitions, this is a function, Y_A(ω), that takes the value P(A | Bk) whenever ω ∈ Bk. A related statement, given for practice with notation, is

P(A | F)(ω) = Σ_{Bk∈P} P(A | Bk) 1_{Bk}(ω) .
3 Markov Chains, I
3.1. Introduction: Discrete time Markov³ chains are a simple abstract class of discrete random processes. Many practical models are Markov chains. Here we discuss Markov chains having a finite state space (see below).
Many of the general concepts above come into play here. The probability
space is the space of paths. The natural states of partial information are
described by the σ-algebras Ft, which represent the information obtained by observing the chain up to time t. The tower property applied to the Ft leads to
backward and forward equations. This section is mostly definitions. The good
stuff is in the next section.
3.2. Time: The time variable, t, will be an integer representing the number
of time units from a starting time. The actual time to go from t to t + 1 could
be a nanosecond (for modeling computer communication networks) or a month
(for modeling bond rating changes), or whatever. To be specific, we usually
start with t = 0 and consider only non negative times.
3.3. State space: At time t the system will be in one of a finite list of states.
This set of states is the state space, S. To be a Markov chain, the state should
be a complete description of the actual state of the system at time t. This
means that it should contain any information about the system at time t that
helps predict the state at future times t + 1, t + 2, ... . This is illustrated with
the hidden Markov model below. The state at time t will be called X(t) or Xt .
Eventually, there may be an ω also, so that the state is a function of t and ω: X(t, ω) or Xt(ω). The states may be called s1, . . ., sm, or simply 1, 2, . . ., m, depending on the context.
century. He is known for his path breaking work on the distribution of prime numbers as well
as on probability.
In principle, it should be possible to calculate the probability of any event (such as {X(2) ≠ s}, or {X(t) = s1 for some t ≤ T}) by listing all the paths (outcomes) in that event and summing their probabilities. This is rarely the easiest way. For one thing, the path space, while finite, tends to be enormous. For example, if there are m = |S| = 7 states and T = 50 times, then the number of paths is |Ω| = m^T = 7^50, which is about 1.8 × 10^42. This number is beyond computers.
3.7. Markov property: Informally, the Markov property is that X(t) is all the
information about the past that is helpful in predicting the future. In classical
terms, for example,
In modern notation, this may be stated
Recall that both sides are functions of the outcome, X. The function on the right side, to be measurable with respect to Gt, must be a function of X(t) only (see "Generating by a function" in the previous section). The left side also is a function, but in general could depend on all the values X(s) for s ≤ t. The equality (9) states that this function depends on X(t) only.
This may be interpreted as the absence of hidden variables, variables that
influence the evolution of the Markov chain but are not observable or included
in the state description. If there were hidden variables, observing the chain for a
long period might help identify them and therefore change our prediction of the
future state. The Markov property (9) states, on the contrary, that observing
X(s) for s < t does not change our predictions.
The Markov chain is stationary if the transition probabilities Pjk are indepen-
dent of t. Each transition probability Pjk is between 0 and 1, with values 0 and
1 allowed, though 0 is more common than 1. Also, with j fixed, the Pjk must
sum to 1 (summing over k) because k = 1, 2, . . ., m is a complete list of the
possible states at time t + 1.
3.9. Path probabilities: The Markov property leads to a formula for the
probabilities of individual path outcomes P (X) as products of transition prob-
abilities. We do this here for a stationary Markov chain to keep the notation
simple. First, suppose that the probabilities of the initial states are known, and
call them
f0 (j) = P (X(0) = j) .
The Bayes rule (2) implies that
Using this argument again, and using (9), we find (changing the order of the
factors on the last line)
One way to express the general formula uses a notational habit common in probability, using upper case letters to represent a random value of a variable and lower case for generic values of the same quantity (see Terminology, Section 2, but note that the meaning of X has changed). We write x = (x(0), x(1), . . ., x(T)) for a generic path, and seek P(x) = P(X = x) = P(X(0) = x(0), X(1) = x(1), . . .). The argument above shows that this is given by

P(x) = f0(x(0)) P_{x(0),x(1)} · · · P_{x(T−1),x(T)} = f0(x(0)) ∏_{t=0}^{T−1} P_{x(t),x(t+1)} .   (10)
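Formula (10) is a one-line loop in code. A sketch; the two-state chain used here is only for concreteness (it is the U/D coin chain of Example 3 below):

```python
def path_probability(path, f0, P):
    """P(x) = f0(x(0)) * product over t of P[x(t)][x(t+1)], as in (10)."""
    prob = f0[path[0]]
    for a, b in zip(path[:-1], path[1:]):
        prob *= P[a][b]
    return prob

# Illustrative stationary chain: the U/D coin of Example 3.
P = {'U': {'U': 0.8, 'D': 0.2}, 'D': {'U': 0.2, 'D': 0.8}}
f0 = {'U': 1.0, 'D': 0.0}

print(round(path_probability('UUDU', f0, P), 6))   # 1 * .8 * .2 * .2 = 0.032
```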
is the (j, k) entry of P^s, the s-th power of the transition matrix (explanation below). Also, as discussed later, the steady state probabilities form an eigenvector of P corresponding to eigenvalue λ = 1.
3.11. Example 3, coin flips: The state space has m = 2 states, called U (up) and D (down). Writing H and T would conflict with T being the length of the chain. The coin starts in the U position, which means that f0(U) = 1 and f0(D) = 0. At every time step, the coin turns over with 20% probability, so the transition probabilities are P_UU = .8, P_UD = .2, P_DU = .2, P_DD = .8. The transition matrix is (taking U for 1 and D for 2):

        ( .8  .2 )
    P = ( .2  .8 )
Take T = 3 and let A be the event UUzU, where the state X(2) = z is
unknown. There are two outcomes (paths) in A:
A = {UUUU, UUDU} ,
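The probability of A follows by summing the two path probabilities, or equivalently by using the second power of the transition matrix since X(2) is unrestricted. A sketch with this example's numbers hard-coded:

```python
# Transition probabilities from Example 3.
P = {('U', 'U'): 0.8, ('U', 'D'): 0.2, ('D', 'U'): 0.2, ('D', 'D'): 0.8}

# Summing the two paths in A (the coin starts in U, so f0(U) = 1):
P_UUUU = P[('U', 'U')] * P[('U', 'U')] * P[('U', 'U')]   # .8^3 = .512
P_UUDU = P[('U', 'U')] * P[('U', 'D')] * P[('D', 'U')]   # .8*.2*.2 = .032
P_A = P_UUUU + P_UUDU

# Equivalently, P(A) = P_UU * (P^2)_UU since X(2) is unrestricted.
P2_UU = P[('U', 'U')] ** 2 + P[('U', 'D')] * P[('D', 'U')]
print(round(P_A, 6), round(P[('U', 'U')] * P2_UU, 6))   # 0.544 0.544
```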
3.12. Example 4: There are two coins, F (fast) and S (slow). Either coin will be either U or D at any given time. Only one coin is present at any given time but sometimes the coin is replaced (F for S or vice versa) without changing its UD status. The F coin has the same UD transition probabilities as Example 3. The S coin has UD transition probabilities:

    ( .9   .1  )
    ( .05  .95 )

The probability of coin replacement at any given time is 30%. The replacement (if it happens) is done after the (possible) coin flip without changing the UD status of the coin after that flip. The Markov chain has 4 states, which we arbitrarily number 1: UF, 2: DF, 3: US, 4: DS. States 1 and 3 are U states while states 1 and 2 are F states, etc. The transition matrix is 4 × 4. We can calculate, for example, the (non)transition probability for UF → UF. We first have a U → U (non)transition then an F → F (non)transition. The probability is then P(U → U | F) · P(F → F) = .8 × .7 = .56. The other entries can be found in a similar way. The transitions are:

UF → UF   UF → DF   UF → US   UF → DS
DF → UF   DF → DF   DF → US   DF → DS
US → UF   US → DF   US → US   US → DS
DS → UF   DS → DF   DS → US   DS → DS .
If we start with U but equally likely F or S, and want to know the probability of being D after 4 time periods, the answer is

.5 ( (P^4)_{12} + (P^4)_{14} + (P^4)_{32} + (P^4)_{34} )
because states 1 = UF and 3 = US are the (equally likely) possible initial U states, and 2 = DF and 4 = DS are the two D states. We also could calculate P(UUzU) by adding up the probabilities of the 32 (list them) paths that make up this event.
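The 4 × 4 transition matrix and the answer can be assembled mechanically. A sketch, assuming (as the example states) that a UD flip with the current coin's matrix is followed by an independent F/S replacement with probability .3:

```python
import numpy as np

F = np.array([[0.8, 0.2], [0.2, 0.8]])     # fast coin UD transitions
S = np.array([[0.9, 0.1], [0.05, 0.95]])   # slow coin UD transitions
R = np.array([[0.7, 0.3], [0.3, 0.7]])     # keep/replace the coin (30% swap)

# States in the order 1: UF, 2: DF, 3: US, 4: DS (0-based indices here).
P = np.zeros((4, 4))
for fs in range(2):               # 0 = F, 1 = S
    coin = F if fs == 0 else S
    for ud in range(2):           # 0 = U, 1 = D
        for fs2 in range(2):
            for ud2 in range(2):
                P[2 * fs + ud, 2 * fs2 + ud2] = coin[ud, ud2] * R[fs, fs2]

P4 = np.linalg.matrix_power(P, 4)
answer = 0.5 * (P4[0, 1] + P4[0, 3] + P4[2, 1] + P4[2, 3])
print(round(float(P[0, 0]), 6))   # UF -> UF: .8 * .7 = 0.56
print(round(float(answer), 6))    # probability of D at time 4
```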
Stochastic Calculus Notes, Lecture 2
Last modified September 16, 2004
1.2. Forward equation, functional version: Let u(k, t) = P(X(t) = k). The law of total probability gives

u(k, t + 1) = P(X(t + 1) = k) = Σ_j P(X(t + 1) = k | X(t) = j) P(X(t) = j) .

Therefore

u(k, t + 1) = Σ_j P_{jk} u(j, t) .   (1)
This is the forward equation for probabilities. It is also called the Kolmogorov
forward equation or the Chapman Kolmogorov equation. Once u(j, t) is known
for all j S, (1) gives u(k, t + 1) for any k. Thus, we can go forward in time
from t = 0 to t = 1, etc. and calculate all the numbers u(k, t).
Note that if we just wanted one number, say u(17, 49), still we would have
to calculate many related quantities, all the u(j, t) for t < 49. If the state space
is too large, this direct forward equation approach may be impractical.
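As a sketch, here is the forward iteration for the two-state coin chain of Example 3 above, starting from f0(U) = 1:

```python
import numpy as np

# Coin chain of Example 3: states U (index 0) and D (index 1).
P = np.array([[0.8, 0.2],
              [0.2, 0.8]])
u = np.array([1.0, 0.0])     # u(k, 0): the coin starts in U

for t in range(3):           # u(t+1) = u(t) P, equation (1)
    u = u @ P

print(u.round(6))            # distribution of X(3): [0.608 0.392]
```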
Just one row makes it a row vector. Matrix-vector multiplication is a special case of matrix-matrix multiplication. We often denote genuine matrices (more than one row and column) with capital letters and vectors, row or column, with lower case. In particular, if u is an n dimensional row vector, a 1 × n matrix, and A is an n × n matrix, then uA is another n dimensional row vector. We do not write Au for this because that would be incompatible. Matrix multiplication is always associative. For example, if u is a row vector and A and B are square matrices, then (uA)B = u(AB). We can compute the row vector uA then multiply by B, or we can compute the n × n matrix AB then multiply by u.
If u is a row vector, we usually denote the k-th entry by u_k instead of u_{1k}. Similarly, the k-th entry of a column vector f is f_k instead of f_{k1}. If both u and f have n components, then uf = Σ_{k=1}^{n} u_k f_k is a 1 × 1 matrix, i.e. a number. Thus, treating row and column vectors as special kinds of matrices makes the product of a row vector with a column vector natural, but not, for example, the product of two column vectors.
1.4. Forward equation, matrix version: The probabilities u(k, t) form the components of a row vector, u(t), with components u_k(t) = u(k, t) (an abuse of notation). The forward equation (1) may be expressed (check this)

u(t + 1) = u(t) P .   (2)

Repeated squaring computes P^{2^k} using just k matrix multiplications. For example, this computes P^{1024} using just ten matrix multiplies, instead of a thousand.
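Repeated squaring is a few lines of code. This sketch handles a general power s by binary decomposition (about ten squarings when s = 1024):

```python
import numpy as np

def matrix_power_by_squaring(P, s):
    """Compute P^s with O(log s) matrix multiplies."""
    result = np.eye(P.shape[0])
    Q = P.copy()
    while s > 0:
        if s & 1:               # this bit of s contributes Q = P^(2^bit)
            result = result @ Q
        Q = Q @ Q               # square: P^(2^b) -> P^(2^(b+1))
        s >>= 1
    return result

P = np.array([[0.8, 0.2], [0.2, 0.8]])
assert np.allclose(matrix_power_by_squaring(P, 1024),
                   np.linalg.matrix_power(P, 1024))
```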
These satisfy a backward equation that follows from the law of total probability:

f(k, t) = Σ_{j∈S} E[V(X(T)) | X(t) = k and X(t + 1) = j] P(X(t + 1) = j | X(t) = k) ,

f(k, t) = Σ_{j∈S} f(j, t + 1) P_{kj} .   (5)

Using these, we may compute all the numbers f(k, T − 1), then all the numbers f(k, T − 2), etc.
Since Gt is generated by the partition {X(t) = k}, this is the same definition (4). Moreover, because Ft ⊆ Ft+1 and F(t + 1) = E[V(X(T)) | Ft+1], the tower property gives

F(t) = E[F(t + 1) | Ft] .   (7)

Note that this is a version of the tower property. On the event {X(t) = k}, the right side above takes the value

Σ_{j∈S} f(j, t + 1) P(X(t + 1) = j | X(t) = k) .

Thus, (7) is the same as the backward equation (5). In the continuous time versions to come, (7) will be very handy.
knows that a vector really is a function of the index. The backward equation (5) then is equivalent to (check this)

f(t) = P f(t + 1) .   (8)

Iterating,

f(t) = P^{T−t} V ,

so that

E[V(X(T))] = u(t) f(t) .

The last line is a natural example of an inner product between a row vector and a column vector. Note that the product E[V(X(T))] = u(t)f(t) does not depend on t even though u(t) and f(t) are different for different t. For this invariance to be possible, the forward evolution equation for u and the backward equation for f must be related.
( u(t + 1) − u(t)P ) f(t + 1) = 0 .

If this is true for enough linearly independent vectors f(t + 1), then the vector u(t + 1) − u(t)P must be zero, which is the matrix version of the forward equation (2). A theoretically minded reader can verify that enough f vectors are produced if the transition matrix is nonsingular and we choose a linearly independent family of reward vectors, V. In the same way, the backward evolution of f is a consequence of invariance and the forward evolution of u.
We now have two ways to evaluate E[V(X(T))]: (i) start with given u(0), compute u(T) = u(0)P^T, and evaluate u(T)V; or (ii) start with given V = f(T), compute f(0) = P^T V, then evaluate u(0)f(0). The former might be preferable, for example, if we had a number of different reward functions to evaluate. We could compute u(T) once then evaluate u(T)V for all our V vectors.
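The two evaluation orders can be compared directly. A sketch on the coin chain with a hypothetical reward vector V (the reward numbers are made up for illustration):

```python
import numpy as np

P = np.array([[0.8, 0.2], [0.2, 0.8]])   # coin chain of Example 3
u0 = np.array([1.0, 0.0])                # initial distribution: start in U
V = np.array([5.0, -1.0])                # hypothetical reward paid at time T
T = 3

# (i) forward: evolve u(t+1) = u(t) P, then take u(T) V
u = u0.copy()
for _ in range(T):
    u = u @ P
forward_value = u @ V

# (ii) backward: f(T) = V, f(t) = P f(t+1), then take u(0) f(0)
f = V.copy()
for _ in range(T):
    f = P @ f
backward_value = u0 @ f

print(round(float(forward_value), 6), round(float(backward_value), 6))  # 2.648 2.648
```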
1.10. Duality: In its simplest form, duality is the relationship between a matrix and its transpose. The set of column vectors with n components is a vector space of dimension n. The set of n component row vectors is the dual space, which has the same dimension but may be considered to be a different space. We can combine an element of a vector space with an element of its dual to get a number: row vector u multiplied by column vector f yields the number uf. Any linear transformation on the vector space of column vectors is represented by an n × n matrix, P. This matrix also defines a linear transformation, the dual transformation, on the dual space of row vectors, given by u ↦ uP. This is the sense in which the forward and backward equations are dual to each other.
Some people prefer not to use row vectors and instead think of organizing the probabilities u(k, t) into a column vector that is the transpose of what we called u(t). For them, the forward equation would be written u(t + 1) = P^t u(t) (note the notational problem: the t in P^t means transpose while the t in u(t) and f(t) refers to time). The invariance relation for them would be u^t(t + 1) f(t + 1) = u^t(t) f(t). The transpose of a matrix is often called its dual.
Clearly,

f(1, t) = 1 for all t,   (11)

and

f(k, T) = 0 for k ≠ 1.   (12)

Moreover, if k ≠ 1, the law of total probabilities yields a backward relation

f(k, t) = Σ_{j∈S} P_{kj} f(j, t + 1) .   (13)

The difference between this and the plain backward equation (5) is that the relation (13) holds only for interior states k ≠ 1, while the boundary condition (11) supplies the values of f(1, t). The sum on the right of (13) includes the term corresponding to state j = 1.
1.12. Hitting probabilities, forward: We also can compute the hitting probabilities (9) using a forward equation approach. Define the survival probabilities

These satisfy the obvious boundary condition

u(1, t) = 0 ,   (15)

We may include or exclude the term with j = 1 on the right because u(1, t) = 0. Of course, (17) applies only at interior states k ≠ 1. The overall probability of survival up to time T is Σ_{k∈S} u(k, T) and the hitting probability is the complementary 1 − Σ_{k∈S} u(k, T).
The matrix vector formulation of this involves the row vector ũ(t), with components ũ_k(t) = u(k, t) for k ≠ 1, and the matrix P̃ formed from P by removing the first row and column. The evolution equation (17) and boundary condition (15) are both expressed by the matrix equation

ũ(t + 1) = ũ(t) P̃ .

Note that P̃ is not a stochastic matrix because some of the row sums are less than one:

Σ_{j≠1} P̃_{kj} < 1   if P_{k1} > 0 .
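Dropping the row and column of the state to be hit leaves a sub-stochastic matrix. For the two-state coin chain of Example 3 with D as the state to be hit, the reduced matrix is just the 1 × 1 matrix (.8), and the survival probability after T steps is .8^T. A sketch:

```python
# Coin chain of Example 3, with D taken as the state to be hit.
# Removing D's row and column from P leaves the 1x1 matrix (.8).
P_interior = 0.8      # probability of staying in the surviving state U
u_surv = 1.0          # start in U with probability 1

T = 8
for _ in range(T):
    u_surv = u_surv * P_interior   # the truncated forward equation

hitting = 1.0 - u_surv             # probability of visiting D by time T
print(round(u_surv, 6), round(hitting, 6))   # 0.167772 0.832228
```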
1.14. Running cost: Suppose we have a running cost function, W(x), and we want to calculate

f = E[ Σ_{t=0}^{T} W(X(t)) ] .   (18)

Sums like this are called path dependent because their value depends on the whole path, not just the final value X(T). We can calculate (18) with the forward equation using

f = Σ_{t=0}^{T} E[W(X(t))] = Σ_{t=0}^{T} u(t) W .   (19)
f(t) = P f(t + 1) + W .   (21)
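The forward route (19) and the backward route (21) can be checked against each other. A sketch on the coin chain, with a made-up running cost W and final value f(T) = W (consistent with the sum in (18)):

```python
import numpy as np

P = np.array([[0.8, 0.2], [0.2, 0.8]])   # coin chain of Example 3
u0 = np.array([1.0, 0.0])
W = np.array([1.0, 0.0])    # made-up running cost: pay 1 per step spent in U
T = 3

# Forward route, equation (19): sum over t of u(t) W
u, total = u0.copy(), 0.0
for t in range(T + 1):
    total += u @ W
    u = u @ P

# Backward route, equation (21): f(T) = W, then f(t) = P f(t+1) + W
f = W.copy()
for _ in range(T):
    f = P @ f + W

print(round(float(total), 6), round(float(u0 @ f), 6))   # 3.088 3.088
```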
For example, X(t) could represent the state of a financial market and V(k) = 1 + r(k) the interest rate for state k. Then (22) would be the expected total interest. We also can write V(k) = e^{W(k)}, so that

The t′ = t term in the product has V(X(t′)) = V(k). The final condition is f(k, T) = V(k). The backward evolution equation is derived more or less as before:

f(k, t) = E_{k,t}[ V(k) ∏_{t′>t} V(X(t′)) ]
        = V(k) E_{k,t}[ ∏_{t′=t+1}^{T} V(X(t′)) ]
        = V(k) E_{k,t}[ f(X(t + 1), t + 1) ]   (the tower property),

f(k, t) = V(k) ( P f(t + 1) )(k) .   (23)

In the last line on the right, f(t + 1) is the column vector with components f(k, t + 1) and P f(t + 1) is the matrix vector product. We write ( P f(t + 1) )(k) for the k-th component of the column vector P f(t + 1). We could express the whole thing in matrix terms using diag(V), the diagonal matrix with V(k) in the (k, k) position:

f(t) = diag(V) P f(t + 1) .
A version of (23) for Brownian motion is called the Feynman-Kac formula.
is (see homework):

g(k, t) = V(k) ( g(t − 1) P )(k) .   (25)
This is also the forward equation for a branching process with branching factors V(k). At time t, the branching process has N(k, t) particles, or walkers, at state k. The numbers N(k, t) are random. A time step of the branching process has two parts. First, each particle takes one step of the Markov chain. A particle at state j goes to state k with probability P_{jk}. All steps for all particles are independent. Then, each particle at state k does a branching or birth/death step in which the particle is replaced by a random number of particles with expected number V(k). For example, if V(k) = 1/2, we could delete the particle (death) with probability one half. If V(k) = 2.8, we could keep the existing particle, add one new one, then add a third with probability .8. All particles are treated independently. If there are m particles in state k before the birth/death step, the expected number after the birth/death step is V(k)m. The expected number of particles, g(k, t) = E[N(k, t)], satisfies (25).
When V(k) = 1 for all k there need be no birth or death. There will be just one particle, the path X(t). The number of particles at state k at time t, N(k, t), will be zero if X(t) ≠ k or one if X(t) = k. In fact, N(k, t) = I(k, t)(X). The expected values will be g(k, t) = E[N(k, t)] = E[I(k, t)] = u(k, t).
The branching process representation of (22) is possible when V(k) ≥ 0 for all k. Monte Carlo methods based on branching processes are more accurate than direct Monte Carlo in many cases.
2.2. Simple random walk: The state space for simple random walk is the integers, positive and negative. At each time, the walker has three choices: (A) move up one, (B) do not move, (C) move down one. The probabilities are P(A) = P(k → k + 1) = a, P(B) = P(X(t + 1) = X(t)) = b, and P(X(t + 1) = X(t) − 1) = c. Naturally, we need a, b, and c to be non-negative and a + b + c = 1. The transition matrix² has b on the diagonal (P_{kk} = b for all k), a on the super-diagonal (P_{k,k+1} = a for all k), and c on the sub-diagonal. All other matrix elements P_{jk} are zero.
This Markov chain is homogeneous or translation invariant: The probabilities of moving up or down are independent of X(t). A translation by k is a shift of everything by k (I do not know why this is called translation). Translation invariance means, for example, that the probability of going from m to l in s steps is the same as the probability of going from m + k to l + k in s steps: P(X(t + s) = l | X(t) = m) = P(X(t + s) = l + k | X(t) = m + k). It is common to simplify general discussions by choosing k so that X(0) = 0. Mathematicians often say "without loss of generality" or "w.l.o.g." when doing so.
²This matrix is infinite when the state space is infinite. Matrix multiplication is still defined. For example, the k component of uP is given by (uP)_k = Σ_j u_j P_{jk}. This possibly infinite sum has only three nonzero terms when P is tridiagonal.
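The forward equation for the walk is then a discrete convolution over the lattice. A sketch, using the parameter values a = 0.20, b = 0.20, c = 0.60 shown in the figure below:

```python
# Simple random walk: up with probability a, stay b, down c.
a, b, c = 0.20, 0.20, 0.60     # the parameters shown in the figure
T = 8

u = {0: 1.0}                   # u[k] = P(X(t) = k); start at X(0) = 0
for t in range(T):
    new_u = {}
    for k, p in u.items():     # forward equation with the tridiagonal P
        for step, q in ((1, a), (0, b), (-1, c)):
            new_u[k + step] = new_u.get(k + step, 0.0) + p * q
    u = new_u

mean = sum(k * p for k, p in u.items())
print(round(mean, 6))          # drift: T * (a - c) = -3.2
```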
Often, particularly when discussing multidimensional random walk, we use x, y, etc. instead of j, k, etc. to denote lattice points (states of the Markov chain). Probabilists often use lower case Latin letters for general possible values of a random variable, while using the capital letter for the random variable itself. Thus, we might write P_{xy} = P(X(t + 1) = y | X(t) = x). As an exercise in definition unwrapping, review Lecture 1 and check that this is the same as P_{X(t),x} = P(X(t + 1) = x | Ft).
2.5. Urn models: Urn models illustrate several features of more general random walks. Unlike simple random walk, urn models are mean reverting and have steady state probabilities that determine their large time behavior. We will come back to them when we discuss scaling in future lectures.
The simple urn contains n balls that are identical except for their color. There are k red balls and n − k green ones. At each stage, someone chooses one of the balls at random, with each ball equally likely to be chosen. He or she replaces the chosen ball with a fresh ball that is red with probability p and green

³People use the term volatility in two distinct ways. In the Black-Scholes theory, volatility means something else.
[Figure: probability of each state k for the simple random walk with a = 0.20, b = 0.20, c = 0.60, after T = 8 and after T = 60 steps.]
with probability 1 − p. All choices are independent. The number of red balls decreases by one if he or she removes a red ball and returns a green one. This happens with probability (k/n)(1 − p). Similarly, the k → k + 1 probability is ((n − k)/n) p. In formal terms, the state space is the integers from 0 to n and the transition probabilities are
2.6. Urn model steady state: For the simple urn model, the probabilities u(k, t) = P(X(t) = k) converge to steady state probabilities, v(k), as t → ∞. This is illustrated in Figure (2). The steady state probabilities are

v(k) = (n choose k) p^k (1 − p)^{n−k} .

The steady state probabilities have the property that if u(k, t) = v(k) for all k, then u(k, t + 1) = v(k) also for all k. This is statistical steady state because the probabilities have reached steady state values though the states themselves keep changing, as in Figure (3). In matrix vector notation, we can form the row vector, v, with entries v(k). Then v is a statistical steady state if vP = v. It is no coincidence that v(k) is the probability of getting k red balls in n independent trials with probability p for each trial. The steady state expected number of red balls is

E_v[X] = np ,

where the notation E_v[·] refers to expectation in the probability distribution v.
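The claim vP = v can be verified numerically from the transition probabilities. A sketch, with illustrative values n = 10, p = 0.3:

```python
from math import comb

n, p = 10, 0.3    # illustrative values

def transition(k, j):
    """Simple urn transition probability P_kj."""
    down = (k / n) * (1 - p)        # chosen ball red, replacement green
    up = ((n - k) / n) * p          # chosen ball green, replacement red
    if j == k - 1:
        return down
    if j == k + 1:
        return up
    if j == k:
        return 1.0 - down - up
    return 0.0

# Claimed steady state: binomial probabilities v(k).
v = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

# Check vP = v componentwise.
vP = [sum(v[k] * transition(k, j) for k in range(n + 1)) for j in range(n + 1)]
print(max(abs(x - y) for x, y in zip(vP, v)))   # essentially zero: v is stationary
```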
2.7. Urn model mean reversion: If we let m(t) be the expected value of X(t),
then a calculation using the transition probabilities gives the relation

m(t + 1) = m(t) + (1/n)(np − m(t)) .   (27)

This relation shows not only that m(t) = np is a steady state value (m(t) = np
implies m(t + 1) = np), but also that m(t) → np as t → ∞ (if r(t) = m(t) − np,
then r(t + 1) = αr(t) with |α| = 1 − 1/n < 1).
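The relation (27) can be checked by iterating the distribution directly (a sketch with arbitrary n, p; the transition matrix is rebuilt here so the snippet stands alone):

```python
import numpy as np

n, p = 10, 0.3

# Urn model transition matrix, as described in the text.
P = np.zeros((n + 1, n + 1))
for k in range(n + 1):
    down = (k / n) * (1 - p)
    up = ((n - k) / n) * p
    if k > 0:
        P[k, k - 1] = down
    if k < n:
        P[k, k + 1] = up
    P[k, k] = 1.0 - down - up

states = np.arange(n + 1)
u = np.full(n + 1, 1.0 / (n + 1))   # start with each state equally likely

means = []
for t in range(50):
    means.append(u @ states)        # m(t)
    u = u @ P                       # u(k, t+1) from u(k, t)

# Relation (27) says r(t+1) = (1 - 1/n) r(t) for r(t) = m(t) - n p.
r = np.array(means) - n * p
ratios = r[1:] / r[:-1]
```

Every ratio equals 1 − 1/n, so the deviation of the mean from np shrinks geometrically.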
Another way of expression mean reversion will be useful in discussing stochas-
tic dierential equations later. Because the urn Model is a Markov chain,
[Figure: the probability distributions u(k, t) for the urn model with n = 30, plotted every T = 6 time steps; horizontal axis k, vertical axis probability.]
Figure 2: The probability distributions for the simple urn model plotted every
T time steps. The first curve is blue, low, and flat. The last one is red and most
peaked in the center. The computation starts with each state being equally
likely. Over time, states near the edges become less likely.
[Figure 3: a single simulated path of the urn model with p = 0.5, n = 100; horizontal axis t (0 to 400), vertical axis X.]
If X(t) > np, the conditional expected increment

E[ΔX(t)] = E[X(t + 1) − X(t) | X(t)] = (1/n)(np − X(t))

is negative. If X(t) < np, it is positive.
2.8. Boundaries: The terms boundary, interior, region, etc. as used in the
general discussion of Markov chain hitting probabilities come from applications
in lattice Markov chains such as simple random walk. For example, the region
x > a has boundary x = a. The quantities u(x, t) satisfy the evolution equations
for x > a together with the absorbing boundary condition u(a, t) = 0. We could
create a finite state space Markov chain by considering a region a < x < b with
simple random walk in the interior together with absorbing boundaries at x = a
and x = b. Absorbing boundary conditions are also called Dirichlet boundary
conditions.
Another way to create a finite state space Markov chain is to put reflecting
boundaries at x = a and x = b. This chain has the same transition probabilities
as ordinary random walk in the interior (a < x < b). However, transitions from
a to a − 1 are disallowed and replaced by transitions from a to a + 1. This
means changing the transition probabilities starting from x = a (and similarly
at x = b).
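The two boundary conventions can be written out as transition matrices (a small illustration; the endpoints a, b and all names below are our own choices, not from the notes):

```python
import numpy as np

a, b = 0, 6                      # endpoints of the region (our choice)
size = b - a + 1

def idx(x):
    """Row index of state x on {a, ..., b}."""
    return x - a

def walk_matrix(boundary):
    """Symmetric simple random walk on {a, ..., b} with the given boundary type."""
    P = np.zeros((size, size))
    for x in range(a + 1, b):           # interior: step left/right with prob 1/2
        P[idx(x), idx(x - 1)] = 0.5
        P[idx(x), idx(x + 1)] = 0.5
    if boundary == "absorbing":         # Dirichlet: once at a or b, stay forever
        P[idx(a), idx(a)] = 1.0
        P[idx(b), idx(b)] = 1.0
    elif boundary == "reflecting":
        # the step a -> a - 1 is disallowed and sent to a + 1 instead (same at b)
        P[idx(a), idx(a + 1)] = 1.0
        P[idx(b), idx(b - 1)] = 1.0
    return P

P_abs = walk_matrix("absorbing")
P_ref = walk_matrix("reflecting")
```

Both matrices are stochastic (rows sum to one); they differ only in the two boundary rows.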
The scaled square lattice, with lattice spacing h > 0, is the set of points hx =
(hx_1, . . . , hx_d), where x are integer lattice points. In the present discussion, the
scaling is irrelevant, so we use the unit lattice. We say that lattice points x and
y are neighbors if they differ by one in exactly one coordinate, that is, |x − y| = 1.
Stochastic Calculus Notes, Lecture 3
Last modified September 30, 2004
1.4. Example 1, Markov chains: In this example, the Ft are minimal and
Ω is the path space of sequences of length T from the state space, S. The
new information revealed at time t is the state of the chain at time t. The
variables Xt may be called coordinate functions because Xt is coordinate
t (or entry t) in the sequence X. In principle, we could express this with the
notation Xt(X), but that would drive people crazy. Although we distinguish
between Markov chains (discrete time) and Markov processes (continuous time),
the term stochastic process can refer to either continuous or discrete time.
1.5. Example 2, dyadic sets: This is a set of definitions for discussing averages
over a range of length scales. The time variable, t, represents the amount of
averaging that has been done. The new information revealed at time t is finer
scale information about a function (an audio signal or digital image). The state
space is the set of integers from 1 to 2^T. We start with a function X(ω) and
ask that Xt(ω) be constant on dyadic blocks of length 2^(T−t). The dyadic blocks
at level t are

Bt,k = {(k − 1)2^(T−t) + 1, . . . , k 2^(T−t)} ,  k = 1, . . . , 2^t .   (1)

The reader should check that moving from level t to level t + 1 splits each block
into right and left halves:

Bt,k = Bt+1,2k−1 ∪ Bt+1,2k .   (2)

The partition of Ω at level t is Pt = {Bt,k with k = 1, . . . , 2^t}.
Because Ft ⊂ Ft+1, the Pt+1 is a refinement of Pt. The union (2) shows how.
We will return to this example after discussing martingales.
E[Ft+1 | Ft] = Ft .

If we take the overall expectation of both sides we see that the expectation
value does not depend on t: E[Ft+1] = E[Ft]. The martingale property says
more: whatever information you might have at time t notwithstanding, the
expectation of future values is still the present value. There is a gambling in-
terpretation: Ft is the amount of money you have at time t. No matter what
has happened, your expected winnings between t and t + 1, the martingale
difference Yt+1 = Ft+1 − Ft, has zero expected value. You can also think of
martingale differences as a generalization of independent random variables. If
the random variables Yk were actually independent, then the sums

Ft = Σ_{k=1}^{t} Yk

would form a martingale (using the Ft generated by Y1, . . ., Yt). The reader
should check this.
1 For finite Ω this is the whole story. For countable Ω we also assume that the sums defining
E[Xt] converge absolutely. This means that E[|Xt|] < ∞. That implies that the conditional
expectations E[Xt+1 | Ft] are well defined.
1.7. Examples: The simplest way to get a martingale is to start with
a random variable, F(ω), and define Ft = E[F | Ft]. If we apply this to
a Markov chain with the minimal filtration Ft, and F is a final time reward
F = V(X(T)), then Ft = f(X(t), t) as in the previous lecture. If we apply this
to Ω = {1, 2, . . . , 2^T}, with uniform probability P(k) = 2^(−T) for k ∈ Ω, and the
dyadic filtration, we get the dyadic martingale with Ft(j) constant on the dyadic
blocks (1) and equal to the average of F over the block j is in.
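The dyadic martingale can be checked mechanically: averaging over blocks at successive levels must be consistent, which is the martingale property E[Ft+1 | Ft] = Ft for this filtration. A small sketch (the function F below is an arbitrary choice):

```python
import numpy as np

T = 4                                   # levels 0, ..., T; Omega has 2^T points
F_vals = np.sin(np.arange(2 ** T))      # an arbitrary function F(omega)

def martingale_level(F, t):
    """F_t: average F over each dyadic block of length 2^(T - t),
    returned as a function on Omega (constant on each block)."""
    block = 2 ** (T - t)
    averages = F.reshape(-1, block).mean(axis=1)
    return np.repeat(averages, block)

# Coarsening F_{t+1} back to level t recovers F_t, i.e. averaging the two
# half-block averages reproduces the parent block average.
checks = [np.allclose(martingale_level(martingale_level(F_vals, t + 1), t),
                      martingale_level(F_vals, t))
          for t in range(T)]
```

At level T the blocks have length one, so F_T is F itself; at level 0 it is the overall average.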
The claim is that if U is measurable with respect to F, then

E[U Y | F] = U E[Y | F] .   (3)

We see this using classical conditional expectation over the sets in the partition
defining F. Let B be one of these sets. Let yB = E[Y | B] be the value of
E[Y | F] for ω ∈ B. We know that U(ω) is constant in B because U ∈ F. Call
this value uB. Then E[U Y | B] = uB E[Y | B] = uB yB. But this is the value of
U E[Y | F] for ω ∈ B. Since each ω is in some B, this proves (3) for all ω.
1.9. Doob's principle: This lemma lets us make new martingales from
old ones. Let Ft be a martingale and Yt = Ft − Ft−1 the martingale differ-
ences (called innovations by statisticians and returns in finance). We use the
convention that F−1 = 0 so that F0 = Y0. The martingale condition is that
E[Yt+1 | Ft] = 0. Clearly Ft = Σ_{t'=0}^{t} Yt'.
Suppose that at time t we are allowed to place a bet of any size2 on the as
yet unknown martingale difference, Yt+1. Let Ut ∈ Ft be the size of the bet.
The return from betting on Yt will be Ut−1 Yt, and the total accumulated return
up to time t is

Gt = U0 Y1 + U1 Y2 + · · · + Ut−1 Yt .   (4)

Because of the lemma (3), the betting returns have E[Ut Yt+1 | Ft] = 0, so
E[Gt+1 | Ft] = Gt and Gt also is a martingale.
The fact that Gt in (4) is a martingale sometimes is called Doob's principle
or Doob's theorem after the probabilist who formulated it. A special case below
for stopping times is Doob's stopping time theorem or the optional stopping
theorem. They all say that strategizing on a martingale never produces anything
but a martingale. Nonanticipating strategies on martingales do not give positive
expected returns.
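Doob's principle can be verified exactly for fair coin flips: enumerate all ±1 sequences, let the bet be any function of the flips seen so far, and average G_T over all equally likely paths (a sketch; the particular betting strategy below is an arbitrary illustration, not from the notes):

```python
from itertools import product

T = 6

def bet(history):
    """A nonanticipating bet size: any function of the flips seen so far.
    (Arbitrary illustration: double the stake after each loss.)"""
    return 2 ** sum(1 for y in history if y < 0)

# Enumerate all 2^T equally likely +/-1 sequences and average G_T exactly.
total = 0
for path in product([-1, 1], repeat=T):
    G_T = sum(bet(path[:t]) * path[t] for t in range(T))
    total += G_T

expected_G = total / 2 ** T    # E[G_T]; Doob's principle says this is 0
```

The average is exactly zero whatever nonanticipating `bet` you substitute, because for each t the two continuations Y_{t+1} = ±1 cancel against the same past-dependent stake.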
1.10. Weak and strong efficient market hypotheses: It is possible that the
random variables Ft form a martingale with respect to their minimal filtration,
Ft, but not with respect to an enriched filtration Gt ⊃ Ft. The simplest example
would be the σ-algebras Gt = Ft+1, which already know the value of Ft+1 at time
t. Note that the Ft also are a stochastic process with respect to the Gt . The
2 We may have to require that the bet have finite expected value.
weak efficient market hypothesis is that e^(−μt) St is a martingale (St being the
stock price and μ its expected growth rate) with respect to its minimal
filtration. Technical analysis means using trading strategies that are nonanticipating
with respect to the minimal filtration. Therefore, the weak efficient market hy-
pothesis says that technical trading does not produce better returns than buy
and hold. Any extra information you might get by examining the price history
of S up to time t is already known by enough people that it is already reflected
in the price St .
The strong efficient market hypothesis states that e^(−μt) St is a martingale
with respect to the filtration, Gt , representing all the public information in the
world. This includes the previous price history of S and much more (prices of
related stocks, corporate reports, market trends, etc.).
1.11. Investing with Doob: Economists sometimes use Doob's principle and
the efficient market hypotheses to make a point about active trading in the
stock market. Suppose that Ft, the price of a stock at time t, is a martingale3.
Suppose that at time t we use all the information in Ft, and choose an amount,
Ut, to invest at time t. The fact that the resulting accumulated return, Gt, has zero
expected value is said to show that active investing is no better than a buy
and hold strategy that just produces the value Ft. The well known book A
Random Walk Down Wall Street is mostly an exposition of this point of view.
This argument breaks down when applied to non martingale processes, such as
stock prices over longer times. Active trading strategies such as (4) may
reduce the risk by more than enough to compensate risk averse investors for small
amounts of lost expected value. Merton's optimal dynamic investment analysis
is a simple example of an active trading strategy that is better for some people
than passive buy and hold.
1.13. Doob's stopping time theorem for one stopping time: Because stop-
ping times are nonanticipating strategies, they also cannot make money from a
martingale. One version of this statement is that E[Xτ] = E[X1]. The proof of
this makes use of the events Bt = {τ = t}. The stopping time hypothesis is
that Bt ∈ Ft. Since τ has some value 1 ≤ τ ≤ T, the Bt form a partition of Ω.
Also, if ω ∈ Bt, τ(ω) = t, so Xτ = Xt. Therefore, using E[XT | Bt] = E[Xt | Bt]
(which follows from the martingale property because Bt ∈ Ft),

E[X1] = E[XT]
      = Σ_{t=1}^{T} E[XT | Bt] P(Bt)
      = Σ_{t=1}^{T} E[Xτ | Bt] P(τ = t)
      = E[Xτ] .

3 This is a reasonable approximation for much short term trading.
1.14. Stopping time paradox: The technical hypotheses above, finite state
space, bounded stopping times, may be too strong, but they cannot be com-
pletely ignored, as this famous example shows. Let Xt be a symmetric random
walk starting at zero. This forms a martingale, so E[Xτ] = 0 for any stopping
time, τ. On the other hand, suppose we take τ = min{t | Xt = 1}. Then Xτ = 1
always, so E[Xτ] = 1. The catch is that there is no T with τ(ω) ≤ T for all
ω. Even though τ < ∞ almost surely (more to come on that expression),
E[τ] = ∞ (explanation later). Even that would be OK if the possible values of
Xt were bounded. Suppose you choose T and set τ' = min(τ, T). That is, you
wait until Xt = 1 or t = T, whichever comes first, to stop. For large T, it is very
likely that you stopped for Xt = 1. Still, those paths that never reached 1 prob-
ably drifted just far enough in the negative direction so that their contribution
to the overall expected value cancels the 1 to yield E[Xτ'] = 0.
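This cancellation can be seen exactly by enumerating all walk paths up to the truncation horizon (a small sketch; the horizon T below is an arbitrary choice):

```python
from itertools import product

T = 10   # truncation horizon for tau' = min(tau, T)

total = 0
hits = 0
for steps in product([-1, 1], repeat=T):
    x = 0
    value = None
    for s in steps:
        x += s
        if x == 1:          # tau: the first time the walk reaches 1
            value = 1
            hits += 1
            break
    if value is None:       # never reached 1: stop at time T with X_T
        value = x
    total += value

mean_stopped = total / 2 ** T   # E[X_{tau'}]: exactly 0 by optional stopping
frac_hit = hits / 2 ** T        # fraction of paths that did reach 1 by time T
```

Most paths stop with the value 1, but the minority that never reach 1 carry negative values large enough to cancel them exactly.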
Stochastic Calculus Notes, Lecture 4
Last modified October 4, 2004
1 Continuous probability
Indeed, the typical situation in continuous probability is that any event consist-
ing of a single outcome has probability zero: P({ω}) = 0 for all ω ∈ Ω.
As we explain below, the classical formalism of probability densities also does
not apply in many of the situations we are interested in. Abstract probability
measures give a framework for working with probability in path space, as well
as more traditional discrete probability and probabilities given by densities on
Rn .
These notes outline Kolmogorov's formalism of probability measures for
continuous probability. We leave out a great number of details and mathemat-
ical proofs. Attention to all these details would be impossible within our time
constraints. In some cases we indicate where a precise definition or a complete
proof is missing, but sometimes we just leave it out. If it seems like something
is missing, it probably is.
Thus the probability of any individual outcome is zero. An event with
positive probability (P(A) > 0) is made up entirely of outcomes x0 ∈ A
with P(x0) = 0. Because of countable additivity (see below), this is only
possible when Ω is uncountable.
R^n, sequences of n numbers (possibly viewed as a row or column vector depend-
ing on the context): X = (X1, . . . , Xn). Here too, if there is a probability
density then the probability of any given outcome is zero.
S^N. Let S be the discrete state space of a Markov chain. The space S^T
is the set of sequences of length T of elements of S. An element of S^T
may be written x = (x(0), x(1), . . . , x(T − 1)), with each of the x(t) in
S. It is common to write xt for x(t). An element of S^N is an infinite se-
quence of elements of S. The exponent N stands for natural numbers.
We misuse this notation because ours start with t = 0 while the actual
natural numbers start with t = 1. We use S^N when we ask questions
about an entire infinite trajectory. For example the hitting probability is
P(X(t) ≠ 1 for all t ≥ 0). Cantor proved that S^N is not countable when-
ever the state space has more than one element. Generally, the probability
of any particular infinite sequence is zero. For example, suppose the tran-
sition matrix has P11 = .6 and u0(1) = 1. Let x be the infinite sequence
that never leaves state 1: x = (1, 1, 1, . . .). Then P(x) = u0(1) · .6 · .6 · · ·.
Multiplying together an infinite number of .6 factors should give the an-
swer P(x) = 0. More generally, if the transition matrix has Pjk ≤ r < 1
for all (j, k), then P(x) = 0 for any single infinite path.
C([0, T], R), the path space for Brownian motion. The C stands for con-
tinuous. The [0, T] is the time interval 0 ≤ t ≤ T; the square brackets
tell us to include the endpoints (0 and T in this case). Round parentheses
(0, T) would mean to leave out 0 and T. The final R is the target space,
the real numbers in this case. An element of Ω is a continuous function
from the interval [0, T] to R. This function could be called X(t) or Xt (for
0 ≤ t ≤ T). In this space we can ask questions such as P(∫_0^T X(t) dt > 4).
There are still other ways to specify probabilities of events in path space. All
of these probability measures satisfy the same basic axioms.
Suppose that for each A ∈ F we have a number P(A). The numbers P(A)
are a probability measure if

i. If A ∈ F and B ∈ F are disjoint events, then P(A ∪ B) = P(A) + P(B).

ii. P(A) ≥ 0 for any event A ∈ F.

iii. P(Ω) = 1.

iv. If An ∈ F is a sequence of events each disjoint from all the others and
∪_{n=1}^∞ An = A, then Σ_{n=1}^∞ P(An) = P(A).
1.5. Borel sets: It is rare that one can define P(A) for all A ⊆ Ω. Usually,
there are non measurable events whose probability one does not try to define
(see below). This is not related to partial information, but is an intrinsic aspect
of continuous probability. Events that are not measurable are quite artificial,
but they are impossible to get rid of. In most applications in stochastic calculus,
it is convenient to take the largest σ-algebra to be the Borel sets2.
In a previous lecture we discussed how to generate a σ-algebra from a
collection of sets. The Borel σ-algebra is the σ-algebra that is generated by all
balls. The open ball with center x0 and radius r > 0 in n dimensional space is
Br(x0) = {x | |x − x0| < r}. A ball in one dimension is an interval. In two
dimensions it is a disk. Note that the ball is solid, as opposed to the hollow
sphere, Sr(x0) = {x | |x − x0| = r}. The condition |x − x0| ≤ r, instead of
|x − x0| < r, defines a closed ball. The σ-algebra generated by open balls is
the same as the σ-algebra generated by closed balls (check this if you wish).
1.6. Borel sets in path space: The definition of Borel sets works the same
way in the path space of Brownian motion, C([0, T], R). Let x0(t) and x(t) be
two continuous functions of t. The distance between them in the sup norm is

‖x − x0‖ = max_{0 ≤ t ≤ T} |x(t) − x0(t)| .

We often use double bars to represent the distance between functions and single
bar absolute value signs to represent the distance between numbers or vectors
in R^n. As before, the open ball of radius r about a path x0 is the set of all
paths with ‖x − x0‖ < r.
1.7. The σ-algebra for Markov chain path space: There is a convenient
limit process that defines a useful σ-algebra on S^N, the infinite time horizon
path space for a Markov chain. We have the σ-algebras Ft generated by the first
t + 1 states x(0), x(1), . . ., x(t). We take F to be the σ-algebra generated
by all these. Note that the event A = {X(t) ≠ 1 for t ≥ 0} is not in any of
the Ft. However, the event AT = {X(t) ≠ 1 for 0 ≤ t ≤ T} is in FT. Therefore
A = ∩_{T ≥ 0} AT must be in any σ-algebra that contains all the Ft. Also note that
the union of all the Ft is an algebra of sets, though it is not a σ-algebra.

2 The larger σ-algebra of Lebesgue sets seems to be more of a nuisance than a help, particularly
B + n when k ≠ n, suppose that x ∈ B + k and x ∈ B + n. Then x = y + k
and x = z + n for y ∈ B and z ∈ B. But (and this is the punch line) this
would mean y ∼ z, which is impossible because B has only one representative
from each equivalence class. The possibility of selecting a single element from
each partition element without having to say how it is to be done is the axiom
of choice.
We will see that in R^n with a density u, this agrees with the classical definition

E[f] = ∫_{R^n} f(x) u(x) dx ,
3 The Feynman integral in path space has some properties of true integrals but lacks others.
The probabilist Mark Kac (pronounced "cats") discovered that Feynman's ideas applied to
the heat equation rather than the Schrödinger equation can be interpreted as integration with
respect to Wiener measure. This is now called the Feynman-Kac formula.
if we write dP(x) = u(x)dx. Note that the abstract variable ω is replaced by
the concrete variable, x, in this more concrete situation. The general definition
is forced on us once we make the natural requirements:

i. If A ∈ F is any event, then E[1A] = P(A). The integral of the indicator
function of an event is the probability of that event.

ii. If f1 and f2 have f1(ω) ≤ f2(ω) for all ω, then E[f1] ≤ E[f2]. Integra-
tion is monotone.

iii. For any reasonable functions f1 and f2 (e.g. bounded), we have E[af1 +
bf2] = aE[f1] + bE[f2]. (Linearity of integration.)
decreasing sequence converging to the same number, which is the only possible
value of E[f ] consistent with (i), (ii), and (iii).
It is sometimes said that the difference between classical (Riemann) integra-
tion and abstract integration (here) is that the Riemann integral cuts the x axis
into little pieces, while the abstract integral cuts the y axis (which is what the
simple function approximations amount to).
If the function f is positive but not bounded, it might happen that E[f] = ∞.
The cut off functions, fM(ω) = min(f(ω), M), might have E[fM] → ∞ as
M → ∞. If so, we say E[f] = ∞. Otherwise, property (iv) implies that
E[f] = lim_{M→∞} E[fM]. If f is both positive and negative (for different ω),
we integrate the positive part, f+(ω) = max(f(ω), 0), and the negative part,
f−(ω) = −min(f(ω), 0), separately and subtract the results. We do not attempt a
definition if E[f+] = ∞ and E[f−] = ∞. We omit the long process of showing
that these definitions lead to an integral that actually has the properties (i) -
(iv).
This is one of the definitions we gave before, the one that works for continuous
and discrete probability. In the theory, it is possible to show that there is a
minimizer and that it is unique.
notation), then there is some function u(x1, . . . , xn) so that

F(ω) = u(X1(ω), . . . , Xn(ω)) .   (3)
The intuition was that F contains the information you get by knowing the
values of the functions Xk . Any function measurable with respect to this alge-
bra is determined by knowing the values of these functions, which is precisely
what (3) says. This approach using functions is often convenient in continuous
probability.
If Ω is a continuous probability space, we may again specify functions Xk
that we want to be measurable. Again, these functions generate an algebra,
a σ-algebra, F. If F is measurable with respect to this algebra then there is
a (Borel measurable) function u(x1, . . .) so that F(ω) = u(X1, . . .), as before.
In fact, it is possible to define F in this way. Saying that A ∈ F is the same
as saying that 1A is measurable with respect to F. If u(x1, . . .) is a Borel
measurable function that takes values only 0 or 1, then the function F defined by
(3) defines a function that also takes only 0 or 1. The event A = {ω | F(ω) = 1}
has (obviously) F = 1A. The σ-algebra generated by the Xk is the set of
events that may be defined in this way. A complete proof of this would take a
few pages.
1.17. Marginal density and total probability: The abstract situation is that
we have a probability space, Ω, with generic outcome ω. We have some
functions (X1(ω), . . . , Xn(ω)) = X(ω). With Ω in the background, we can ask
for the joint PDF of (X1, . . . , Xn), written u(x1, . . . , xn). A formal definition of
u would be that if A ⊆ R^n, then

P(X(ω) ∈ A) = ∫_{x ∈ A} u(x) dx .   (4)

Suppose we neglect the last variable, Xn, and consider the reduced vector
X̃(ω) = (X1, . . . , Xn−1) with probability density ũ(x1, . . . , xn−1). This ũ is
the marginal density and is given by integrating u over the forgotten variable:

ũ(x1, . . . , xn−1) = ∫_{−∞}^{∞} u(x1, . . . , xn) dxn .   (5)
We can prove (5) from (4) by considering a set B ⊆ R^(n−1) and the corre-
sponding set A ⊆ R^n given by A = B × R (i.e. A is the set of all pairs (x̃, xn)
with x̃ = (x1, . . . , xn−1) ∈ B). The definition of A from B is designed so that
P(X ∈ A) = P(X̃ ∈ B). With this notation,

P(X̃ ∈ B) = P(X ∈ A)
          = ∫_A u(x) dx
          = ∫_{x̃ ∈ B} ( ∫_{xn = −∞}^{∞} u(x̃, xn) dxn ) dx̃
          = ∫_B ũ(x̃) dx̃ .
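A numeric illustration of (5), assuming a joint density we can integrate by hand (the particular choice X2 = X1 + an independent standard normal is ours):

```python
import numpy as np

def g(x):
    """Standard normal density."""
    return np.exp(-x * x / 2) / np.sqrt(2 * np.pi)

xs = np.linspace(-8.0, 8.0, 801)
dx = xs[1] - xs[0]
X1, X2 = np.meshgrid(xs, xs, indexing="ij")

# Joint density of (X1, X2) where X2 = X1 + (independent standard normal).
u = g(X1) * g(X2 - X1)

# Integrate out x2 as in (5); the marginal should be the density of X1.
marginal = u.sum(axis=1) * dx
err = np.max(np.abs(marginal - g(xs)))
```

Integrating the second factor over x2 gives 1 for each x1, so the marginal reduces to g(x1), which the quadrature reproduces to high accuracy.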
The Bayes rule definition of v(xn) has some trouble because both the denomi-
nator, P(Xn = xn), and the numerator, E[f(X̃) 1_{Xn = xn}], are zero.
The classical solution to this problem is to replace the exact condition Xn =
xn with an approximate condition having positive (though small) probability:
xn ≤ Xn ≤ xn + ε. We use the approximation

∫_{xn}^{xn + ε} g(x̃, ξn) dξn ≈ ε g(x̃, xn) .

The error is roughly proportional to ε² and much smaller than either of the terms
above. With this approximation the numerator in Bayes rule is

E[f(X̃) 1_{xn ≤ Xn ≤ xn + ε}] = ∫_{x̃ ∈ R^(n−1)} ∫_{ξn = xn}^{xn + ε} f(x̃, ξn) u(x̃, ξn) dξn dx̃
                              ≈ ε ∫_{x̃} f(x̃, xn) u(x̃, xn) dx̃ .
If we take the Bayes rule quotient and let ε → 0, we get the classical formula

E[f(X̃) | Xn = xn] = ∫_{x̃} f(x̃, xn) u(x̃, xn) dx̃ / ∫_{x̃} u(x̃, xn) dx̃ .   (6)

By taking f to be the characteristic function of an event (all possible events)
we get a formula for the probability density of X̃ given that Xn = xn, namely

u(x̃ | Xn = xn) = u(x̃, xn) / ∫_{x̃} u(x̃, xn) dx̃ .   (7)

This is the classical formula for conditional probability density. The integral
in the denominator ensures that, for each xn, u is a probability density as a
function of x̃, that is,

∫ u(x̃ | Xn = xn) dx̃ = 1 ,
f̃(xn) is just a constant. We find the value of f̃(xn) that minimizes R(xn) by
minimizing the quantity

∫_{x̃ ∈ R^(n−1)} (f(x̃, xn) − g)² u(x̃, xn) dx̃ =
∫ f(x̃, xn)² u(x̃, xn) dx̃ − 2g ∫ f(x̃, xn) u(x̃, xn) dx̃ + g² ∫ u(x̃, xn) dx̃ .

The optimal g is given by the classical formula (6).
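The minimization can be checked numerically: the g given by (6) beats any nearby constant. A sketch, with an arbitrary positive weight playing the role of u(x̃, xn) at one fixed xn:

```python
import numpy as np

xs = np.linspace(-6.0, 6.0, 601)
dx = xs[1] - xs[0]

# A positive weight playing the role of u(x, x_n) at one fixed x_n
# (an arbitrary unnormalized choice for illustration).
u_slice = np.exp(-(xs - 1.0) ** 2 / 2) * (1.0 + 0.3 * np.cos(xs))
f = xs ** 2                     # the function whose conditional mean we want

def risk(g):
    """The quadratic from the text: integral of (f - g)^2 u(x, x_n) dx."""
    return np.sum((f - g) ** 2 * u_slice) * dx

g_star = np.sum(f * u_slice) / np.sum(u_slice)   # formula (6)
worse = [risk(g_star + d) for d in (-0.5, -0.1, 0.1, 0.5)]
```

Since the quadratic in g has positive leading coefficient, g_star (the u-weighted mean of f) is the unique minimizer.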
1.20. Modern conditional probability: We already saw that the modern ap-
proach to conditional probability for G ⊆ F is through conditional expectation.
In its most general form, for every (or almost every) ω ∈ Ω, there should be
a probability measure Pω on Ω so that the mapping ω → Pω is measurable
with respect to G. The measurability condition probably means that for every
event A ∈ F the function pA(ω) = Pω(A) is a G measurable function of ω.
In terms of these measures, the conditional expectation f̃ = E[f | G] would be
f̃(ω) = Eω[f]. Here Eω means the expected value using the probability measure
Pω. There are many such subscripted expectations coming.
A subtle point here is that the conditional probability measures Pω are defined
on the original probability space, Ω. This forces the measures to live on
tiny (generally measure zero) subsets of Ω. For example, if Ω = R^n and G is
generated by Xn, then the conditional expectation value f̃(xn) is an average of
f (using density u) only over the hyperplane Xn = xn. Thus, the conditional
probability measures Pω depend only on xn, leading us to write Pxn. Since
f̃(xn) = ∫ f(x) dPxn(x), and f̃(xn) depends only on values of f(x̃, xn) with
the last coordinate fixed, the measure dPxn is some kind of measure on that
hyperplane. This point of view is useful in many advanced problems, but we
will not need it in this course (I sincerely hope).
In other words, the classical u(x̃ | Xn = xn) of (7) is the same as the semimodern
u_{xn}(x̃).
normal, or Gaussian, random variable is a scalar with probability density

u(x) = (1/√(2π)) e^(−x²/2) .

The normalization factor 1/√(2π) makes ∫ u(x) dx = 1 (a famous fact). The
mean value is E[X] = 0 (the integrand x e^(−x²/2) is antisymmetric about x = 0).
The variance is (using integration by parts)

E[X²] = (1/√(2π)) ∫ x² e^(−x²/2) dx
      = (1/√(2π)) ∫ x · x e^(−x²/2) dx
      = (1/√(2π)) ∫ x (−(d/dx) e^(−x²/2)) dx
      = (1/√(2π)) [−x e^(−x²/2)]_{−∞}^{∞} + (1/√(2π)) ∫ e^(−x²/2) dx
      = 0 + 1 .

Similar calculations give E[X⁴] = 3, E[X⁶] = 15, and so on. I will often write
Z for a standard normal random variable. A one dimensional Gaussian random
variable with mean E[X] = μ and variance var(X) = E[(X − μ)²] = σ² has
density

u(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)) .

It is often more convenient to think of Z as the random variable (like ω) and
write X = μ + σZ. We write X ∼ N(μ, σ²) to express the fact that X is normal
(Gaussian) with mean μ and variance σ². The standard normal random variable
is Z ∼ N(0, 1).
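These moment values are easy to confirm by numerical quadrature (a sketch; the grid and tolerances are our choices):

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 4001)
dx = x[1] - x[0]
u = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # standard normal density

# Moments E[X^k] by numerical integration on a symmetric grid.
moments = {k: float(np.sum(x ** k * u) * dx) for k in range(7)}
# Expected: mass 1, odd moments 0, E[X^2] = 1, E[X^4] = 3, E[X^6] = 15.
```

The odd moments vanish by the symmetry of the grid, mirroring the antisymmetry argument in the text.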
2.3. Diagonalizing H: Suppose the eigenvalues and eigenvectors of H are
Hvj = λj vj. We can express x ∈ R^n as a linear combination of the vj either in
vector form, x = Σ_{j=1}^n yj vj, or in matrix form, x = V y, where V is the n × n
matrix whose columns are the vj and y = (y1, . . . , yn)^T. Since the eigenvectors
of a symmetric matrix are orthogonal to each other, we may normalize them so
that vj^T vk = δjk, which is the same as saying that V is an orthogonal matrix,
V^T V = I. In the y variables, the quadratic form x^T H x is diagonal, as we can
see using the vector or the matrix notation. With vectors, the trick is to use the
two expressions x = Σ_{j=1}^n yj vj and x = Σ_{k=1}^n yk vk, which are the same since j
and k are just summation variables. Then we can write

x^T H x = (Σ_j yj vj)^T H (Σ_k yk vk)
        = Σ_{jk} yj yk vj^T H vk
        = Σ_{jk} λk yj yk vj^T vk
x^T H x = Σ_k λk yk² .   (9)
the determinant is det(V) = ±1 for an orthogonal matrix, so dx = dy. With this
we can find the normalization constant, z, by (writing Λ = diag(λ1, . . . , λn))

1 = ∫ u(x) dx
  = (1/z) ∫ e^(−x^T H x / 2) dx
  = (1/z) ∫ e^(−y^T Λ y / 2) dy
  = (1/z) ∫ exp(−(1/2) Σ_{k=1}^n λk yk²) dy
  = (1/z) ∫ Π_{k=1}^n e^(−λk yk² / 2) dy
  = (1/z) Π_{k=1}^n ∫_{yk = −∞}^{∞} e^(−λk yk² / 2) dyk
  = (1/z) Π_{k=1}^n √(2π/λk)
1 = (1/z) (2π)^(n/2) / √(det(H)) .

This gives a formula for z, and the final formula for the multivariate normal
density

u(x) = (√(det H) / (2π)^(n/2)) e^(−x^T H x / 2) .   (10)
We use (V y)(V y)^T = V (y y^T) V^T and take the constant matrices V outside the
integral. This gives C as the product of three matrices, first V, then an integral
involving y y^T, then V^T. So, to calculate C, we can calculate all the matrix
elements

Bjk = (√(det H) / (2π)^(n/2)) ∫_{R^n} yj yk e^(−y^T Λ y / 2) dy .

Clearly, if j ≠ k, Bjk = 0, because the integrand is an odd (antisymmetric)
function, say, of yj. The diagonal elements Bkk may be found using the fact
that the integrand is a product:

Bkk = (√(det H) / (2π)^(n/2)) ( Π_{j ≠ k} ∫_{yj} e^(−λj yj² / 2) dyj ) ∫_{yk} yk² e^(−λk yk² / 2) dyk .

As before, the j factors (for j ≠ k) integrate to √(2π/λj). The k factor integrates
to √(2π)/λk^(3/2), so it differs from the others only by a factor 1/λk.
Most of these factors combine to cancel the normalization. All that is left is

Bkk = 1/λk .

Therefore

C = V Λ^(−1) V^T ,

and

C = H^(−1) .   (11)

The covariance matrix is the inverse of the matrix defining the multivariate
normal.
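Identity (11) is easy to confirm in linear algebra alone, following the diagonalization above (the particular H is an arbitrary symmetric positive definite example):

```python
import numpy as np

# An arbitrary symmetric positive definite H.
H = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])

lam, V = np.linalg.eigh(H)            # H = V diag(lam) V^T with V orthogonal

# Following the text: B = diag(1/lambda_k) and C = V B V^T.
C = V @ np.diag(1.0 / lam) @ V.T

err = np.max(np.abs(C - np.linalg.inv(H)))    # (11): C = H^{-1}
```

Transforming the diagonal covariance 1/λk back with V reproduces H^(−1) exactly (to rounding).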
The reader should verify that if CX is n × n, then this formula gives a CY that
is m × m. The reader should also be able to derive the formula for CY in terms
of CX without assuming that the mean of Y is zero. We will soon give the proof
that linear functions of Gaussians are Gaussian.
X^(1), . . ., X^(100), each consisting of a height and weight pair. The weight of
person 27 is X2^(27). Let μ = E[X] be the mean and C = E[(X − μ)(X − μ)^T]
the covariance matrix. The Central Limit Theorem (CLT) states that for large
n, the random variable

R^(n) = (1/√n) Σ_{k=1}^n (X^(k) − μ)

has a probability distribution close to the multivariate normal with mean zero
and covariance C. One interesting consequence is that if X1 and X2 are uncor-
related then an average of many independent samples will have R1^(n) and R2^(n)
nearly independent.
2.10. What the CLT says about Gaussians: The Central Limit Theorem
tells us that if we average a large number of independent samples from the
same distribution, the distribution of the average depends only on the mean
and covariance of the starting distribution. It may be surprising that many
of the properties that we deduced from the formula (10) may be found with
almost no algebra simply knowing that the multivariate normal is the limit of
averages. For example, we showed (or didn't show) that if X is multivariate
normal and Y = AX where the rows of A are linearly independent, then Y is
multivariate normal. This is a consequence of the averaging property. If X is
(approximately) the average of iid random variables Uk, then Y is the average
of random variables Vk = AUk. Applying the CLT to the averaging of the Vk
shows that Y is also multivariate normal.
Now suppose U is a univariate random variable with iid samples Uk, and
E[Uk] = 0, E[Uk²] = σ², and E[Uk⁴] = a⁴ < ∞. Define Xn = (1/√n) Σ_{k=1}^n Uk. A
calculation shows that E[Xn⁴] = 3σ⁴ + (a⁴ − 3σ⁴)/n. For large n, the fourth moment of
the average depends only on the second moment of the underlying distribution.
A multivariate and slightly more general version of this calculation gives Wick's
theorem, an expression for the expected value of a product of components of
a multivariate normal in terms of covariances.
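For fair ±1 coin flips (σ² = 1, a⁴ = 1) the fourth moment of the normalized sum can be computed exactly by enumeration; it works out to 3 − 2/n, approaching the Gaussian value E[Z⁴] = 3. A sketch:

```python
from fractions import Fraction
from itertools import product

def fourth_moment(n):
    """Exact E[X_n^4] for X_n = (U_1 + ... + U_n)/sqrt(n) with U_k = +/-1
    fair coin flips. Note X_n^4 = (U_1 + ... + U_n)^4 / n^2, so no square
    roots are needed and the arithmetic stays rational."""
    total = Fraction(0)
    for signs in product([-1, 1], repeat=n):
        total += Fraction(sum(signs) ** 4, n ** 2)
    return total / 2 ** n

vals = {n: fourth_moment(n) for n in (1, 2, 4, 8)}
# Matches 3 - 2/n, which tends to the Gaussian value 3 as n grows.
```

The 1/n correction carries the only trace of the underlying distribution's fourth moment; the leading 3σ⁴ is universal.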
Stochastic Calculus Notes, Lecture 5
Last modified October 21, 2004
1 Brownian Motion
1.2. History: Early in the 19th century, an English botanist named Brown
looked at pollen grains in water under a microscope. To his amazement, they
were moving randomly. He had no explanation for supposedly inert pollen grains,
and later inorganic dust, seeming to swim as though alive. In 1905, Einstein
proposed the explanation that the observed Brownian motion was caused by
individual water molecules hitting the pollen or dust particles. This allowed
him to estimate, for the first time, the weight of a water molecule, and helped
win him the Nobel prize (relativity and quantum mechanics being too
controversial at the time). This is the modern view, that the observed random
motion of pollen grains is the result of a huge number of independent and
random collisions with tiny water molecules.
1
1.4. Wiener measure: The probability space for standard Brownian motion
is C0([0, T], R). As we said before, this consists of continuous functions, X(t),
defined for t in the range 0 ≤ t ≤ T. The notation C0 means1 that X(0) = 0.
The σ-algebra representing full information is the Borel σ-algebra. The infi-
nite dimensional Gaussian probability measure on C0([0, T], R) that represents
Brownian motion is called Wiener measure2.
This measure is uniquely specified by requiring that for any times 0 = t0 <
t1 < · · · < tn ≤ T, the increments Yk = X(t_{k+1}) − X(t_k) are independent Gaus-
sian random variables with var(Yk) = t_{k+1} − t_k. The proof (which we omit) has
two parts. First, it is shown that there indeed is such a measure. Second, it
is shown that there is only one such. All the information we need is contained
in the joint distribution of the increments. The fact that increments from dis-
joint time intervals are independent is the independent increments property. It
also is possible to consider Brownian motion on an infinite time horizon with
probability space C0([0, ∞), R).
is a random function of t. This functional is just what we called a function
of a random variable (the path X playing the role of the abstract random
outcome ω). The simplest example of a functional is a function of the final
value: F(X) = V(X(T)). More complicated functionals are integrals:

F(X) = \int_0^T V(X(t)) \, dt ,

extrema:

F(X) = \max_{0 \le t \le T} X(t) ,

or stopping times such as

F(X) = \min\left\{ t \;:\; \int_0^t X(s) \, ds \ge 1 \right\} .

Stochastic calculus provides tools
for computing the expected values of many such functionals, often through solutions
of partial differential equations. Computing expected values of functionals
is our main way to understand the behavior of Brownian motion (or any other
stochastic process).
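For functionals with no closed-form expectation, direct simulation is the simplest tool. The sketch below (ours, not from the notes; the step count and number of paths are arbitrary choices) estimates E[F(X)] by Monte Carlo for two of the functionals above, using paths built from independent Gaussian increments:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, n_paths = 1.0, 800, 10000
dt = T / n

# Brownian motion paths as cumulative sums of N(0, dt) increments, X(0) = 0.
dX = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
X = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dX, axis=1)], axis=1)

# F(X) = V(X(T)) with V(x) = x^2: exact expected value is T.
est_var = np.mean(X[:, -1] ** 2)

# F(X) = max over [0, T] of X(t): exact expected value is sqrt(2T/pi).
est_max = np.mean(X.max(axis=1))

print(est_var, est_max)
```

The maximum estimate is biased slightly low because the discrete path misses excursions between sample times; refining dt reduces this bias.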
1.9. Path probabilities: For discrete Markov chains, as here, the individual
outcomes are paths, X. For Markov chains one can compute the probability
of an individual path by multiplying the transition probabilities. The situation
is different for Brownian motion, where each individual path has probability zero.
We will make much use of the following partial substitute. Again choose times
t_0 = 0 < t_1 < ... < t_n ≤ T, let \vec{t} = (t_1, ..., t_n) be the vector of these times, and
let \vec{X} = (X(t_1), ..., X(t_n)) be the vector of the corresponding observations of
X. We write U^{(n)}(\vec{x}, \vec{t}) for the joint probability density for the n observations,
which is found by multiplying together the transition probability densities (1)
(and using properties of exponentials):

U^{(n)}(\vec{x}, \vec{t}) = \prod_{k=0}^{n-1} G(x_k, x_{k+1}, t_{k+1} - t_k)

= \frac{1}{(2\pi)^{n/2}} \prod_{k=0}^{n-1} \frac{1}{\sqrt{t_{k+1} - t_k}} \, \exp\left( -\frac{1}{2} \sum_{k=0}^{n-1} \frac{(x_{k+1} - x_k)^2}{t_{k+1} - t_k} \right) . \qquad (2)
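The product formula (2) is easy to evaluate numerically. The sketch below (the function names and the example observation points are ours) builds the joint density of n observations as a product of one-step transition densities:

```python
import numpy as np

def G(x, y, t):
    """Transition density of Brownian motion: the N(x, t) density at y."""
    return np.exp(-(y - x) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

def joint_density(xs, ts):
    """U^(n)(x, t): joint density of X(t_1), ..., X(t_n) with X(0) = 0,
    computed as the product of transition densities, as in (2)."""
    xs = np.concatenate([[0.0], np.asarray(xs, dtype=float)])
    ts = np.concatenate([[0.0], np.asarray(ts, dtype=float)])
    dens = 1.0
    for k in range(len(ts) - 1):
        dens *= G(xs[k], xs[k + 1], ts[k + 1] - ts[k])
    return dens

# Example: density of observing X(0.5) = 0.2 and X(1.0) = -0.1.
val = joint_density([0.2, -0.1], [0.5, 1.0])
print(val)
```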
1.10. Consistency: You cannot give just any old probability densities to
replace the joint densities (2). They must satisfy a simple consistency condition.
Having given the joint density for n observations, you also have given the joint
density for a subset of these observations. For example, the joint density for
X(t_1) and X(t_3) must be the marginal of the joint density of X(t_1), X(t_2),
and X(t_3):

U^{(2)}(x_1, x_3, t_1, t_3) = \int_{x_2 = -\infty}^{\infty} U^{(3)}(x_1, x_2, x_3, t_1, t_2, t_3) \, dx_2 .
1.11. Rough paths: The above picture shows 5 Brownian motion paths.
They are random and differ in gross features (some go up, others go down), but
the fine scale structure of the paths is the same. They are not smooth, or even
differentiable, functions of t. If X(t) is a differentiable function of t, then for
small Δt its increments are roughly proportional to Δt:

\Delta X = X(t + \Delta t) - X(t) \approx \frac{dX}{dt} \Delta t .

For Brownian motion, the expected value of the square of ΔX (the variance of
ΔX) is proportional to Δt. This suggests that typical values of ΔX will be on
the order of \sqrt{\Delta t}. In fact, an easy calculation gives

E[|\Delta X|] = \sqrt{\frac{2 \Delta t}{\pi}} .
This would be impossible if successive increments of Brownian motion were all
in the same direction (see Total variation below). Instead, Brownian motion
paths are constantly changing direction. They go nowhere (or not very far) fast.
1.12. Total variation: One quantitative sense of path roughness is the fact
that Brownian motion paths have infinite total variation. The total variation
of a function X(t) measures the total distance it moves, counting both ups and
downs. For a differentiable function, this would be

\mathrm{TV}(X) = \int_0^T \left| \frac{dX}{dt} \right| dt . \qquad (3)
If X(t) has simple jump discontinuities, we add the sizes of the jumps to (3).
For general functions, the total variation is

\mathrm{TV}(X) = \sup \sum_{k=0}^{n-1} |X(t_{k+1}) - X(t_k)| , \qquad (4)

where the supremum is over all positive n and all sequences t_0 = 0 < t_1 < ... <
t_n ≤ T.
Suppose X(t) has finitely many local maxima or minima, such as t_0 = local
max, t_1 = local min, etc. Then taking these t values in (4) gives the exact total
variation (further subdivision does not increase the sum). This is one way
to relate the general definition (4) to the definition (3) for differentiable functions.
This does not help for Brownian motion paths, which have infinitely many
local maxima and minima.
Let B ⊂ C_0([0, T], R) be the set of paths with finite total variation. This is a
countable union

B = \bigcup_{N > 0} \{ \mathrm{TV}(X) < N \} = \bigcup_{N > 0} B_N .

Since P(B_N) < ε for any ε > 0, we must have P(B_N) = 0. Countable additivity
then implies that P(B) = 0, which means that P(TV = ∞) = 1.
There is a distinction between outcomes that do not exist and events that
never happen because they have probability zero. For example, if Z is a one
dimensional Gaussian random variable, the outcome Z = 0 does exist, but the
event {Z = 0} is impossible (never will be observed). This is what we mean
when we say "a Gaussian random variable never is zero", or "every Brownian
motion path has infinite total variation".
1.14. The TV of BM: The heart of the matter is the actual calculation
behind the inequality (5). We choose an n > 0 and define (not for the last time)
Δt = T/n and t_k = kΔt. Let Y be the random variable

Y = \sum_{k=0}^{n-1} |X(t_{k+1}) - X(t_k)| .

Remember that Y is one of the candidates we must use in the supremum (4) that
defines the total variation. If Y is large, then the total variation is at least as
large. Because E[|\Delta X|] = \sqrt{2\Delta t/\pi}, we have E[Y] = \sqrt{2/\pi}\,\sqrt{T n}. A calculation
using the independent increments property shows that

\mathrm{var}(Y) = \left(1 - \frac{2}{\pi}\right) T .

If we take very large n and medium large k, Chebyshev's inequality3 says that
it is very unlikely for Y (or the total variation of X) to be much less than
\mathrm{const}\sqrt{n}. Our inequality (5) follows from this with a suitable choice of n and k.
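A quick simulation (ours; the parameters are arbitrary) illustrates the calculation: the expected sum of increment sizes grows like the square root of n, so refining the partition drives the candidate sums in (4), and hence the total variation, to infinity:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_paths = 1.0, 5000

def mean_Y(n):
    """Average of Y = sum over k of |X(t_{k+1}) - X(t_k)|, n uniform steps."""
    dX = rng.normal(0.0, np.sqrt(T / n), size=(n_paths, n))
    return np.abs(dX).sum(axis=1).mean()

# Theory: E[Y] = sqrt(2 T n / pi), unbounded as n grows.
for n in [100, 400, 1600]:
    print(n, mean_Y(n), np.sqrt(2 * T * n / np.pi))
```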
1.15. Structure of BM paths: For any function X(t), we can define the
total variation on the interval [t1 , t2 ] in an obvious way. The odometer of a car
records the distance travelled regardless of the direction. For X(t), the total
variation on the interval [0, t] plays a similar role. Clearly, X is monotone on
the interval [t_1, t_2] if and only if TV(X, t_1, t_2) = |X(t_2) − X(t_1)|. Otherwise,
X has at least one local min or max within [t_1, t_2]. Now, Brownian motion
paths have infinite total variation on any interval (the proof above implies this).
Therefore, a Brownian motion path has a local max or min within any interval.
This means that (like the rational numbers, for example) the set of local maxima
and minima is dense: There is a local max or min arbitrarily close to any given
number.
1.16. Dynamic trading: The infinite total variation of Brownian motion has
a consequence for dynamic trading strategies. Some of the simplest dynamic
trading strategies, Black-Scholes hedging, and Merton half stock/half cash trad-
ing, call for trades that are proportional to the change in the stock price. If the
stock price is a diffusion process and there are transaction costs proportional
to the size of the trade, then the total transaction costs will either be infinite
(in the idealized continuous trading limit) or very large (if we trade as often as
3 If E[Y] = μ and var(Y) = σ², then P(|Y − μ| > kσ) < 1/k². The proof and more examples
are in any good basic probability book.
possible). It turns out that dynamic trading strategies that take trading costs
into account can approach the idealized zero cost strategies when trading costs
are small. Next term you will learn how this is done.
1.18. Trading volatility: The quadratic variation of a stock price (or a similar
quantity) is called its realized volatility. The fact that it is possible to buy
and sell realized volatility says that the (geometric) Brownian motion model
of stock price movement is not completely realistic. That model predicts that
realized volatility is a constant, which is nothing to bet on.
4 ... limit Δt → 0, we would get the same answer for continuous paths or paths with TV(X) < ∞.
You do not have to use uniformly spaced times in the definition of Q(X), but I think you get
a different answer if you let the times depend on X, as they might in the definition of total
variation.

5 This does not quite prove that (almost surely) Q_n → T as n → ∞. We will come back
to this point in later lectures.
1.21. Continuous time martingales: A stochastic process F_t (together with the
σ-algebras F_t) is a martingale if E[F_s | F_t] = F_t for s > t. Brownian motion forms the first
example of a continuous time martingale. Another famous martingale related to
Brownian motion is F_t = X_t^2 − t (the reader should check this). As in discrete
time, any random variable, Y, defines a continuous time martingale through
conditional expectations: Y_t = E[Y | F_t]. The Ito calculus is based on the idea
that a stochastic integral with respect to X should produce a martingale.
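The claim that X_t^2 − t is a martingale is easy to check numerically (a sketch; the values of t, s, and the conditioning point x are arbitrary choices of ours). Conditioning on X_t = x and averaging over the Gaussian increment, the mean of X_s^2 − s should come out to x^2 − t:

```python
import numpy as np

rng = np.random.default_rng(2)
t, s, x, n_samples = 1.0, 2.0, 0.7, 400000

# Given X_t = x, the increment X_s - X_t is N(0, s - t).
Xs = x + rng.normal(0.0, np.sqrt(s - t), size=n_samples)
cond_mean = np.mean(Xs ** 2 - s)

# Martingale property: E[X_s^2 - s | X_t = x] should equal x^2 - t.
print(cond_mean, x ** 2 - t)
```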
2.1. Introduction: Forward and backward equations are tools for calculating
probabilities and expected values related to Brownian motion, as they are for
Markov chains and stochastic processes more generally. The probability density
of X(t) satisfies a forward equation. The conditional expectations E[V | Ft ]
satisfy backward equations for a variety of functionals V . For Brownian motion,
the forward and backward equations are partial differential equations, either the
heat equation or a close relative. We will see that the theory of partial differential
equations of diffusion type (the heat equation being a prime example) and
the theory of diffusion processes (Brownian motion being a prime example) each
draw from the other.
2.3. Heat equation via Taylor series: The above is not so much a derivation
of the heat equation as a verification. We are told that u(x, t) (the probability
density of X_t) satisfies the heat equation and we verify that fact. Here is a
method for deriving a forward equation without knowing it in advance. We
assume that u(x, t) is smooth enough as a function of x and t that we may expand
it to second order in Taylor series, do the expansion, then take the conditional
expectation of the terms. Variations of this idea lead to the backward equations
and to major parts of the Ito calculus.

Let us fix two times separated by a small Δt: t' = t + Δt. The rules of
conditional probability allow us to compute the density of X = X(t') in terms
of the density of Y = X(t) and the transition probability density (1):

u(x, t + \Delta t) = \int_{y=-\infty}^{\infty} G(y, x, \Delta t) \, u(y, t) \, dy . \qquad (10)
The main idea is that for small Δt, X(t + Δt) will be close to X(t). This is
expressed in G being small unless y is close to x, which is evident in (1). In
the integral, x is a constant and y is the variable of integration. If we
approximate u(y, t) by u(x, t), the value of the integral just would be u(x, t).
This would give the true but not very useful approximation u(x, t + Δt) ≈
u(x, t) for small Δt. Adding the next Taylor series term (writing u_x for ∂_x u),
u(y, t) ≈ u(x, t) + u_x(x, t)(y − x), does not change the result because

\int G(y, x, \Delta t)(y - x) \, dy = 0 .

Adding the next term,

u(y, t) \approx u(x, t) + u_x(x, t)(y - x) + \frac{1}{2} u_{xx}(x, t)(y - x)^2 ,

gives (because E[(Y − X)^2] = Δt)

u(x, t + \Delta t) \approx u(x, t) + \frac{1}{2} u_{xx}(x, t) \Delta t .
To derive a partial differential equation, we expand the left side as u(x, t + Δt) =
u(x, t) + u_t(x, t)Δt + O(Δt^2). On the right, we use

\int G(y, x, \Delta t) \, |y - x|^3 \, dy = O(\Delta t^{3/2}) .

If we cancel the common u(x, t), then cancel the common factor Δt and let
Δt → 0, we get the desired heat equation (9).
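The key approximate identity u(x, t + Δt) ≈ u(x, t) + ½ u_xx(x, t) Δt can be checked directly. In this sketch (our construction, not from the notes), we take u(·, t) to be the heat kernel itself, evaluate the convolution (10) by a Riemann sum, and compare with the two-term Taylor prediction:

```python
import numpy as np

def heat_kernel(x, t):
    """Density of X(t) for Brownian motion started at 0: N(0, t)."""
    return np.exp(-x ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

t, dt, x = 1.0, 1e-3, 0.5

# Left side of (10): convolve u(., t) with the transition density over dt.
y = np.linspace(-10.0, 10.0, 200001)
Gk = np.exp(-(x - y) ** 2 / (2 * dt)) / np.sqrt(2 * np.pi * dt)
lhs = np.sum(Gk * heat_kernel(y, t)) * (y[1] - y[0])

# Right side: u + (1/2) u_xx dt, with u_xx computed analytically.
u = heat_kernel(x, t)
u_xx = (x ** 2 / t ** 2 - 1.0 / t) * u
rhs = u + 0.5 * u_xx * dt

print(lhs, rhs)
```

The two sides agree up to O(Δt^2), consistent with the error terms in the derivation.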
2.4. The initial value problem: The heat equation (9) is the Brownian motion
analogue of the forward equation for Markov chains. If we know the time 0
density u(x, 0) = u_0(x) and the evolution equation (9), the values of u(x, t) are
completely and uniquely determined (ignoring mathematical technicalities that
would be unlikely to trouble a practical person). The task of finding u(x, t) for
t > 0 from u_0(x) and (9) is called the initial value problem, with u_0(x) being
the initial value (or initial data). This initial value problem is well posed,
which means that the solution, u(x, t), exists and depends continuously on the
initial data, u_0. If you want a proof that the solution exists, just use the integral
formula for the solution (8). Given u_0, the integral (8) exists, satisfies the heat
equation, and is a continuous function of u_0. The proof that u is unique is more
technical, partly because it rests on more technical assumptions.
2.5. Ill posed problems: In some situations, the problem of finding a function
u from a partial differential equation and other data may be ill posed, useless
for practical purposes. A problem is ill posed if it is not well posed. This means
either that the solution does not exist, or that it does not depend continuously
on the data, or that it is not unique. For example, if I try to find u(x, t) for
positive t knowing only u_0(x) for x > 0, I must fail. A mathematician would say
that the solution, while it exists, is not unique, there being many different ways
to give u_0(x) for x ≤ 0, each leading to a different u. A more subtle situation
arises, for example, if we give u(x, T) for all x and wish to determine u(x, t)
for 0 ≤ t < T. For example, if u(x, T) = 1_{[0,1]}(x), there is no solution (trust
me). Even if there is a solution, for example given by (8), it does not depend
continuously on the values of u(x, T) for T > t (trust me).
The heat equation (9) relates values of u at one time to values at another
time. However, it is well posed only for determining u at future times from u
at earlier times. This forward equation is well posed only for moving forward
in time.
Writing f(x, t) as an integral, we get

f(x, t) = \int G(x, y, T - t) \, V(y) \, dy . \qquad (12)
2.8. Backward equation by Taylor series: As with the forward equation (9),
we can find the backward equation by Taylor series expansions. We start by
choosing a small Δt and expressing f(x, t) in terms of6 f(·, t + Δt). As before,
define F_t = E[V(X_T) | F_t] = f(X_t, t). Since F_t ⊂ F_{t+Δt}, the tower property
implies that F_t = E[F_{t+Δt} | F_t]. We expand

f(y, t + \Delta t) = f(x, t) + f_x(x, t)(y - x) + \frac{1}{2} f_{xx}(x, t)(y - x)^2 + f_t(x, t) \Delta t
+ O(|y - x|^3) + O(\Delta t^2) .

6 It would be wrong to think that f(x, t) depends only on f at time t + Δt for the same x value. Instead, it depends on all the
values f(y, t + Δt).
matrix P acts from the left on column vectors f (summing P_{jk} over k) but from
the right on row vectors u (summing P_{jk} over j). For each j, \sum_k P_{jk} = 1, but
the column sums \sum_j P_{jk} may not equal one. Of course, the sign of the Δt term
is different in the two cases because we did the Δt Taylor series on the right side
of (14) but on the left side of (10).
2.9. The final value problem: The final values f(x, T) = V(x), together with
the backward evolution equation (13), allow us to determine the values f(·, t)
for t < T. The definition (11) makes this obvious. This means that the final
value problem for the backward heat equation is a well posed problem.

On the other hand, the initial value problem for the backward heat equation
is not a well posed problem. If we have an f(x, 0) and we want a V(x) that leads
to it, we are probably out of luck.
2.10. Duality: As for Markov chains, we can express the expected value of
V(X_T) in terms of the probability density at any earlier time t ≤ T:

E[V(X_T)] = \int u(x, t) f(x, t) \, dx .

This again implies that the right side is independent of t, which in turn allows
us to derive the forward equation (9) from the backward equation (13) or
conversely. For example, differentiating and using (13) gives

0 = \frac{d}{dt} \int u(x, t) f(x, t) \, dx
= \int u_t(x, t) f(x, t) \, dx + \int u(x, t) f_t(x, t) \, dx
= \int u_t(x, t) f(x, t) \, dx - \int u(x, t) \tfrac{1}{2} f_{xx}(x, t) \, dx .

If this relation holds for a sufficiently rich family of functions f, we can conclude
(after integrating the last term by parts twice) that u_t − \tfrac{1}{2} u_{xx} is identically zero,
which is the forward equation (9).
derivatives of G are still smooth integrable functions of x. The same can be said
for f using (12); as long as t < T, any derivatives of f with respect to x and/or t
are bounded. A function that has all partial derivatives of any order bounded is
called smooth. (Warning: this term is not used consistently. Some people say
smooth to mean, for example, merely having derivatives up to second order
bounded.) Solutions of more general forward and backward equations often,
but not always, have the smoothing property.
2.12. Rate of smoothing: Suppose the payout (and final value) function,
V(x), is a discontinuous function such as V(x) = 1_{x<0}(x) (a digital option in
finance). The solution to the backward equation can be expressed in terms of
the cumulative normal (with Z ∼ N(0, 1))

N(x) = P(Z < x) = \frac{1}{\sqrt{2\pi}} \int_{z=-\infty}^{x} e^{-z^2/2} \, dz .

Then we have

f(x, t) = \int_{y=-\infty}^{0} G(x, y, T - t) \, dy
= \frac{1}{\sqrt{2\pi(T - t)}} \int_{y=-\infty}^{0} e^{-(x-y)^2/2(T-t)} \, dy ,

f(x, t) = N\left( -x / \sqrt{T - t} \right) . \qquad (15)
From this it is clear that f is differentiable when t < T, but the first x derivative
is as large as 1/\sqrt{T - t}, the second as large as 1/(T - t), etc. All derivatives
blow up as t → T, with higher derivatives blowing up faster. This can make
numerical solution of the backward equation difficult and inaccurate when the
final data V(x) is not smooth.
The formula (15) can be derived without integration. One way is to note that
f(x, t) = P(X_T < 0 | X_t = x) and X_T ∼ x + \sqrt{T - t}\,Z (Gaussian increments), so
that X_T < 0 is the same as Z < -x/\sqrt{T - t}. Even without the normal probability,
a physicist would tell you that \Delta X \sim \sqrt{\Delta t}, so the hitting probability starting
from x at time t has to be some function of x/\sqrt{T - t}.
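The blow-up of derivatives as t → T is visible directly from formula (15). A small sketch (ours; T = 1 and the sample times are arbitrary), using a centered difference for the x-derivative at the discontinuity point x = 0:

```python
from math import erf, sqrt, pi

def Phi(z):
    """Cumulative normal N(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def f(x, t, T=1.0):
    """Value of the digital payout V(x) = 1_{x<0}: f(x,t) = N(-x/sqrt(T-t))."""
    return Phi(-x / sqrt(T - t))

# The x-derivative at x = 0 is -1/sqrt(2 pi (T - t)); it blows up as t -> T.
h = 1e-6
for t in [0.9, 0.99, 0.999]:
    deriv = (f(h, t) - f(-h, t)) / (2 * h)
    print(t, deriv, -1.0 / sqrt(2 * pi * (1.0 - t)))
```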
2.13. Diffusion: If you put a drop of ink into a glass of still water, you
will see the ink slowly diffuse through the water. This is modelled as a vast
number of tiny ink particles each performing an independent Brownian motion
in the water. Let u(x, t) represent the density of particles about x at time t
(say, particles per cubic millimeter). This u satisfies the heat equation but not
the requirement that \int u(x, t) \, dx = 1. If ink has been diffusing through water
for some time, there will be dark regions with a high density of particles (large
u) and lighter regions with smaller u. In the absence of boundaries (sides of the
glass and the top of the water), the ink distribution would be Gaussian.
2.14. Heat: Heat also can diffuse through a medium, as happens when
we put a thick metal pan over a flame and wait for the other side to heat
up. We can think of u(x, t) as representing the temperature in a metal at
location x at time t. This helps us interpret solutions of the heat equation
(9) when u is not necessarily positive. In particular, it helps us imagine the
cancellation that can occur when regions of positive and negative u are close to
each other. Heat flows from the high temperature regions to low or negative
temperature regions in a way that makes the temperature distribution more
uniform. A physical argument that heat (temperature) flowing through a metal
should satisfy the heat equation was given by the French mathematical physicist,
friend of Napoleon, and founder of the Ecole Polytechnique, Joseph Fourier.
2.15. Hitting times: A stopping time, τ, is any time that depends on the
Brownian motion path X so that the event {τ ≤ t} is measurable with respect to
F_t. This is the same as saying that for each t there is some process that has as
input the values X_s for 0 ≤ s ≤ t and as output a decision τ ≤ t or τ > t. One
kind of stopping time is a hitting time:

\tau_a = \min \left( t \mid X_t = a \right) .

More generally (particularly for Brownian motion in more than one dimension),
if A is a closed set, we may consider \tau_A = \min(t \mid X_t \in A). It is useful to define
a Brownian motion that stops at time τ: \tilde{X}_t = X_t if t ≤ τ, \tilde{X}_t = X_\tau if t ≥ τ.
2.17. Forward equation for u: The density for the absolutely continuous part,
u(x, t), is the density for paths that have not touched X = a. In the diffusion
interpretation, think of a tiny ink particle diffusing as before but being absorbed
if it ever touches a. It is natural to expect that when x ≠ a, the density satisfies
the heat equation (9). u "knows about" the absorbing barrier because of
the boundary condition u(a, t) = 0. This says that the density of particles
approaches zero near the absorbing boundary. By the end of the course, we
will have several ways to prove this. For now, think of a diffusing particle, a
Brownian motion path, as being hyperactive; it moves so fast that it has already
visited a neighborhood of its current location. In particular, if X_t is close to a,
then very likely X_s = a for some s < t. Only a small minority of the particles
at x near a, with small density u(x, t) → 0 as x → a, have not touched a.
2.19. Images and reflections: We want a function u(x, t) that satisfies the
heat equation when x > 0, the boundary condition u(0, t) = 0, and goes to δ_{x_0}
as t → 0. The method of images is a trick for doing this. We think of δ_{x_0} as
a unit charge (in the electrical, not financial, sense) at x_0 and

g(x - x_0, t) = \frac{1}{\sqrt{2\pi t}} e^{-(x - x_0)^2/2t}

as the response to this charge, if there is no absorbing boundary.
For example, think of putting a unit drop of ink at x_0 and watching it spread
along the x axis in a bell shaped (i.e. Gaussian) density distribution. Now
think of adding a negative image charge at −x_0, so that u_0(x) = δ_{x_0} − δ_{-x_0}
and correspondingly

u(x, t) = \frac{1}{\sqrt{2\pi t}} \left( e^{-(x - x_0)^2/2t} - e^{-(x + x_0)^2/2t} \right) . \qquad (18)
This function satisfies the heat equation everywhere, and in particular for x > 0.
It also satisfies the boundary condition u(0, t) = 0. Also, it has the same initial
data as g, as long as x > 0. Therefore, as long as x > 0, the u given by (18)
represents the density of unabsorbed particles in a Brownian motion with
absorption at x = 0. You might want to consider the image charge contribution
in (18), -\frac{1}{\sqrt{2\pi t}} e^{-(x + x_0)^2/2t}, as red ink (the ink that represents negative
quantities) that also diffuses along the x axis. To get the total density, we subtract
the red ink density from the black ink density. For x = 0, the red and black
densities are the same because the distances to the sources at ±x_0 are the same.
When x > 0, the black density is higher, so we get a positive u. We can think of
the image point, −x_0, as the reflection of the original source point through the
barrier x = 0.
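A numerical check of (18) (our sketch; x_0 = 1 and t = 1/2 are arbitrary): the image-charge density vanishes at the barrier, and its total mass is the survival probability, which is less than one because some particles have been absorbed:

```python
import numpy as np

def u_absorbed(x, t, x0):
    """Formula (18): density of surviving particles with an absorbing
    barrier at x = 0 and a unit source at x0 > 0."""
    g = lambda z: np.exp(-z ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)
    return g(x - x0) - g(x + x0)

x0, t = 1.0, 0.5
x = np.linspace(0.0, 12.0, 120001)
u = u_absorbed(x, t, x0)

# Boundary condition u(0, t) = 0, and total surviving mass < 1.
survival = np.sum(u) * (x[1] - x[0])
print(u_absorbed(0.0, t, x0), survival)
```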
2.20. The reflection principle: The explicit formula (18) allows us to evaluate
p(t), the probability of touching x = 0 by time t starting at X_0 = x_0. This is

p(t) = 1 - \int_{x>0} u(x, t) \, dx = 1 - \int_{x>0} \frac{1}{\sqrt{2\pi t}} \left( e^{-(x - x_0)^2/2t} - e^{-(x + x_0)^2/2t} \right) dx .

Because \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi t}} e^{-(x - x_0)^2/2t} \, dx = 1, we may write

p(t) = \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi t}} e^{-(x - x_0)^2/2t} \, dx + \int_{0}^{\infty} \frac{1}{\sqrt{2\pi t}} e^{-(x + x_0)^2/2t} \, dx .

Of course, the two terms on the right are the same! Therefore

p(t) = 2 \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi t}} e^{-(x - x_0)^2/2t} \, dx .

This formula is a particular case of the Kolmogorov reflection principle. It says
that the probability that X_s < 0 for some s ≤ t (the left side) is exactly
twice the probability that X_t < 0 (the integral on the right). Clearly some of
the particles that cross to the negative side at times s < t will cross back, while
others will not. This formula says that exactly half the particles that touch 0
for some s ≤ t have X_t > 0. Kolmogorov gave a proof of this based on the
Markov property and the symmetry of Brownian motion. Since X_τ = 0, and
the increments of X for s > τ are independent of the increments for s < τ, and
since the increments are symmetric Gaussian random variables, they have the
same chance to end with X_t > 0 as with X_t < 0.
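The reflection principle can be checked by direct simulation (a sketch; the path counts and step sizes are arbitrary, and the discrete-time minimum slightly underestimates the true hitting probability):

```python
import numpy as np

rng = np.random.default_rng(3)
x0, t, n, n_paths = 1.0, 1.0, 1000, 10000
dt = t / n

# Paths started at x0; we ask whether they touch the barrier at 0.
X = x0 + np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(n_paths, n)), axis=1)
hit = (X.min(axis=1) <= 0.0).mean()    # P(touch 0 by time t), discretized
below = (X[:, -1] < 0.0).mean()        # P(X_t < 0)

# Reflection principle: P(hit) = 2 P(X_t < 0).
print(hit, 2 * below)
```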
Stochastic Calculus Notes, Lecture 5
Last modified October 26, 2004
1.2. The integral of Brownian motion: Consider the random variable, where
X(t) continues to be standard Brownian motion,

Y = \int_0^T X(t) \, dt . \qquad (1)
1.3. The variance of Y: We will start the hard way, computing the variance
from (2) and letting Δt → 0. The trick is to write Y_n with two different
summation variables: Y_n = \Delta t \sum_{k=0}^{n-1} X(t_k) = \Delta t \sum_{j=0}^{n-1} X(t_j). It is immediate from (2) that
E[Y_n] = 0 and var(Y_n) = E[Y_n^2]:

E[Y_n^2] = E\left[ \left( \Delta t \sum_{k=0}^{n-1} X(t_k) \right) \left( \Delta t \sum_{j=0}^{n-1} X(t_j) \right) \right]
= \Delta t^2 \sum_{j,k} E[X(t_k) X(t_j)] .

If we now let Δt → 0, the left side converges to E[Y^2] and the right side
converges to a double integral:

E[Y^2] = \int_{s=0}^{T} \int_{t=0}^{T} E[X(t) X(s)] \, ds \, dt . \qquad (3)

The covariance of Brownian motion is

E[X_t X_s] = \min(t, s) ,
Going from (4) to (5) involves changing the order of integration1. After all,
E[·] just represents integration over a probability space. The right side of (4)
has the abstract form

\int_\Omega \left( \int_{s \in [0,T]} \int_{t \in [0,T]} F(\omega, s, t) \, dt \, ds \right) dP(\omega) .
1 The possibility of changing order of abstract integrals was established by the twentieth
century mathematician Fubini. He proved it to be correct if the double (triple in our case)
integral converges absolutely (a requirement even for ordinary Riemann integrals) and the
function F is jointly measurable in all its arguments. Our integrand is nonnegative, so the
result will be infinite if the integral does not converge absolutely. We omit a discussion of
product measures and joint measurability.
Here F = X(s)X(t), ω is the random outcome (the whole path X_{[0,T]}
here), and P represents Wiener measure. If we interchange the ordinary Riemann
ds dt integral with the abstract dP integral, we get

\int_{s \in [0,T]} \int_{t \in [0,T]} \left( \int_\Omega F(\omega, s, t) \, dP(\omega) \right) ds \, dt .
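Carrying out the inner expectation gives E[X(s)X(t)] = min(s, t), and the double integral of min(s, t) over [0, T]^2 evaluates to T^3/3 (our evaluation of the integral). A simulation sketch (ours; the discretization parameters are arbitrary) confirms var(Y) = T^3/3:

```python
import numpy as np

rng = np.random.default_rng(4)
T, n, n_paths = 1.0, 400, 10000
dt = T / n

# Y approximated by the Riemann sum dt * sum_k X(t_k).
X = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(n_paths, n)), axis=1)
Y = X.sum(axis=1) * dt

# Exact value: var(Y) = double integral of min(s, t) = T^3 / 3.
print(Y.var(), T ** 3 / 3)
```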
1.5. The X_t^3 martingale: Many martingales are constructed from integrals
involving Brownian motion. A simple one is

F(t) = X(t)^3 - 3 \int_0^t X(s) \, ds .

To check the martingale property, choose t_2 > t_1 and, for t > t_1, write X(t) =
X(t_1) + \Delta X(t). Then

E\left[ \int_0^{t_2} X(t) \, dt \,\Big|\, F_{t_1} \right] = E\left[ \int_0^{t_1} X(t) \, dt + \int_{t_1}^{t_2} X(t) \, dt \,\Big|\, F_{t_1} \right]
= E\left[ \int_0^{t_1} X(t) \, dt \,\Big|\, F_{t_1} \right] + E\left[ \int_{t_1}^{t_2} \left( X(t_1) + \Delta X(t) \right) dt \,\Big|\, F_{t_1} \right]
= \int_0^{t_1} X(t) \, dt + (t_2 - t_1) X(t_1) .

In the last line we use the facts that X(t) ∈ F_{t_1} when t < t_1, and X_{t_1} ∈ F_{t_1},
and that E[\Delta X(t) | F_{t_1}] = 0 when t > t_1, which is part of the independent
increments property. For the X(t)^3 part, we have

E\left[ \left( X(t_1) + \Delta X(t_2) \right)^3 \Big| F_{t_1} \right]
= E\left[ X(t_1)^3 + 3 X(t_1)^2 \Delta X(t_2) + 3 X(t_1) \Delta X(t_2)^2 + \Delta X(t_2)^3 \,\Big|\, F_{t_1} \right]
= X(t_1)^3 + 3 (t_2 - t_1) X(t_1) .

In the last line we used the independent increments property to get E[\Delta X(t_2) |
F_{t_1}] = 0 and E[\Delta X(t_2)^3 | F_{t_1}] = 0, and the formula for the variance of the increment to get E[\Delta X(t_2)^2 |
F_{t_1}] = t_2 - t_1. This verifies that E[F(t_2) | F_{t_1}] = F(t_1), which is the martingale
property.
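The conditional expectation computation can be spot-checked by simulation (a sketch; the conditioning value X(t_1) = 0.7 is an arbitrary choice of ours). Starting the path from a known value at t_1, the average of X(t_2)^3 − 3∫ from t_1 to t_2 of X ds should come out to X(t_1)^3, since the ∫ from 0 to t_1 part cancels on both sides:

```python
import numpy as np

rng = np.random.default_rng(5)
t1, t2, n, n_paths = 0.5, 1.0, 200, 20000
dt = (t2 - t1) / n
x = 0.7                                   # observed value X(t_1) = x

# Simulate the path forward from t_1.
dX = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
X = x + np.cumsum(dX, axis=1)
extra = X.sum(axis=1) * dt                # integral of X(s) from t1 to t2

# Martingale property: E[X(t2)^3 - 3 * extra | X(t1) = x] = x^3.
cond = np.mean(X[:, -1] ** 3 - 3.0 * extra)
print(cond, x ** 3)
```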
1.6. Backward equations for expected values of integrals: Many integrals
involving Brownian motion arise in applications and may be computed using
backward equations. One example is F = \int_0^T V(X(t)) \, dt, which represents the
total accumulated V(X) over a Brownian motion path. If V(x) is a continuous
function of x, the integral is a standard Riemann integral, because V(X(t))
is a continuous function of t. We can calculate E[F] using the more general
function

f(x, t) = E_{x,t}\left[ \int_t^T V(X(s)) \, ds \right] . \qquad (6)

As before, we can describe the function f(x, t) in terms of the random variable

F(t) = E\left[ \int_t^T V(X(s)) \, ds \,\Big|\, F_t \right] .

Since F(t) is measurable in F_t and depends only on future values (X(s) with
s > t), F(t) is measurable in G_t. Since G_t is generated by X(t) alone, this
means that F(t) is a function of X(t), which we write as F(t) = f(X(t), t).
Of course, this definition is just a restatement of the definition (6). Once we know
f(x, t), we can plug in t = 0 to get E[F] = F(0) = f(x_0, 0) if X(0) = x_0 is
known. Otherwise, E[F] = E[f(X(0), 0)].
The backward equation for f is

\partial_t f + \frac{1}{2} \partial_x^2 f + V(x) = 0 , \qquad (7)

with final conditions f(x, T) = 0. The derivation is similar to the one we used
before for the backward equation for E_{x,t}[V(X_T)]. We use Taylor series and the
tower property to calculate how f changes over a small time increment, Δt. We
start with

\int_t^T V(X(s)) \, ds = \int_t^{t+\Delta t} V(X(s)) \, ds + \int_{t+\Delta t}^T V(X(s)) \, ds .
The first integral on the right has the value V(x)Δt + o(Δt). We write o(Δt) for
a quantity that is smaller than Δt in the sense that o(Δt)/Δt → 0 as Δt → 0
(we will shortly divide by Δt, take the limit Δt → 0, and neglect all o(Δt)
terms). The second term has

E_{x,t}\left[ \int_{t+\Delta t}^T V(X(s)) \, ds \,\Big|\, F_{t+\Delta t} \right] = F(t + \Delta t) = f(X(t + \Delta t), t + \Delta t) .

Writing X(t + Δt) = X(t) + ΔX, we use the tower property with F_t ⊂ F_{t+Δt}
to get

E\left[ \int_{t+\Delta t}^T V(X(s)) \, ds \,\Big|\, F_t \right] = E\left[ f(X_t + \Delta X, \, t + \Delta t) \,\Big|\, F_t \right] .
satisfies the backward equation

\partial_t f + \frac{1}{2} \partial_x^2 f + V(x) f = 0 . \qquad (11)

When someone refers to the Feynman-Kac formula, they usually are referring
to the fact that (10) is a formula for the solution of the PDE (11). In our work,
the situation mostly will be reversed. We use the PDE (11) to get information
about the quantity defined by (10), or even just about the process X(t).
We can verify that (10) satisfies (11) more or less as in the preceding paragraph.
We note that

\exp\left( \int_t^{t+\Delta t} V(X(s)) \, ds + \int_{t+\Delta t}^T V(X(s)) \, ds \right)
= \exp\left( \int_t^{t+\Delta t} V(X(s)) \, ds \right) \exp\left( \int_{t+\Delta t}^T V(X(s)) \, ds \right)
= \left( 1 + \Delta t \, V(X(t)) + o(\Delta t) \right) \exp\left( \int_{t+\Delta t}^T V(X(s)) \, ds \right) .
1.9. The Feynman integral: A precursor to the Feynman-Kac formula is the
Feynman integral2 solution to the Schrodinger equation. The Feynman integral
is not an integral in the sense of measure theory. (Neither is the Ito integral, for
that matter.) The colorful probabilist Marc Kac (pronounced "Katz") discovered
that an actual integral over Wiener measure (10) gives the solution of (11).
Feynman's reasoning will help us derive the Girsanov formula, so we pause to
sketch it.

The finite difference approximation

\int_0^T V(X(t)) \, dt \approx \Delta t \sum_{k=0}^{n-1} V(X(t_k)) , \qquad (12)
The functional F_n depends only on finitely many values X_k = X(t_k), so we may
evaluate (13) using the known joint density function for \vec{X} = (X_1, ..., X_n). The
density is (see Path probabilities from Lecture 5):

U^{(n)}(\vec{x}) = \frac{1}{(2\pi \Delta t)^{n/2}} \exp\left( -\sum_{k=0}^{n-1} (x_{k+1} - x_k)^2 / 2\Delta t \right) .
Also,

\frac{x_{k+1} - x_k}{\Delta t} \approx \frac{dx}{dt} = \dot{x}(t_k) ,

so we should also have

\frac{\Delta t}{2} \sum_{k=0}^{n-1} \left( \frac{x_{k+1} - x_k}{\Delta t} \right)^2 \approx \frac{1}{2} \int_0^T \dot{x}(t)^2 \, dt .

As n → ∞, the integral over R^n should converge to the integral over all paths
x(t). We denote this by \int_P, without worrying about exactly which paths are
allowed (continuous, differentiable, ...?). The integration element d\vec{x} has the
possible formal limit

d\vec{x} = \prod_{k=0}^{n-1} dx_k = \prod_{k=0}^{n-1} dx(t_k) \to \prod_{t=0}^{T} dx(t) .

Altogether, this gives the formal expression for the limit of (15):

F = \mathrm{const} \int_P \exp\left( \int_0^T V(x(t)) \, dt - \frac{1}{2} \int_0^T \dot{x}(t)^2 \, dt \right) \prod_{t=0}^{T} dx(t) . \qquad (16)
1.10. Feynman and Wiener integration: Mathematicians were quick to complain
about (16). For one thing, the constant \mathrm{const} = \lim_{n \to \infty} (2\pi \Delta t)^{-n/2} should
be infinite. More seriously, there is no abstract integral measure corresponding
to \int_P \prod_{t=0}^T dx(t) (it is possible to prove this). Kac proposed to write (16) as

F = \int_P \exp\left( \int_0^T V(x(t)) \, dt \right) \left[ \mathrm{const} \cdot \exp\left( -\frac{1}{2} \int_0^T \dot{x}(t)^2 \, dt \right) \prod_{t=0}^{T} dx(t) \right] .
2 Mathematical formalism
2.1. Introduction: We examine the solution formulas for the backward and
forward equations from two points of view. The first is an analogy with linear
algebra, with function spaces playing the role of vector spaces and operators
playing the role of matrices. The second is a more physical picture, interpreting
G(x, y, t) as the Green's function describing the forward diffusion of a point mass
of probability or the backward diffusion of a localized unit of payout.
2.2. Solution operator: As time moves forward, the probability density for
X_t changes, or evolves. As time moves backward, the value function f(x, t)
also evolves3. The backward evolution process is given by (for s > 0; this is a
consequence of the tower property)

f(x, t - s) = \int G(x, y, s) f(y, t) \, dy . \qquad (18)

The operation is linear, which means that G(a f^{(1)} + b f^{(2)}) = a G f^{(1)} + b G f^{(2)}.
The family of operators G(s) for s > 0 produces the solution to the backward
equation, so we call G(s) the solution operator for time s.
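In this operator language, applying G(s) means convolving with the heat kernel. A discretized sketch (ours; the grid and the test function f(y) = y^2 are arbitrary choices) shows the operator sending f(y) = y^2 to x^2 + s, which is what (18) predicts for this value function:

```python
import numpy as np

def apply_G(s, f_vals, y):
    """Apply the solution operator G(s): convolution with the heat kernel,
    discretized by a Riemann sum on the grid y."""
    dy = y[1] - y[0]
    out = np.empty_like(f_vals)
    for i, x in enumerate(y):
        kernel = np.exp(-(x - y) ** 2 / (2 * s)) / np.sqrt(2 * np.pi * s)
        out[i] = np.sum(kernel * f_vals) * dy
    return out

y = np.linspace(-20.0, 20.0, 4001)
s = 0.5

# For f(y) = y^2, the exact result is (G(s) f)(x) = x^2 + s.
g = apply_G(s, y ** 2, y)
i = np.searchsorted(y, 1.0)
print(g[i], y[i] ** 2 + s)
```

The check is only valid away from the ends of the grid, where truncating the convolution introduces error.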
\partial_t f + \frac{1}{2} \partial_x^2 f = V(x, t) , \qquad (19)

where

g(x, t, t') = E_{x,t}[V(X(t'))] .

3 Unlike biological evolution, this evolution process makes the solution less complicated,
not more.

4 We often say homogeneous to mean zero and inhomogeneous to mean not zero. That
may be because if V(x, t) is zero then it is constant, i.e. the same everywhere, which is the
usual meaning of homogeneous.
This g is the expected value (at (x, t)) of a payout (V(·, t') at time t' > t). As
such, g is the solution of a homogeneous final value problem with inhomogeneous
final values:

\partial_t g + \tfrac{1}{2} \partial_x^2 g = 0 \quad \text{for } t < t' , \qquad (21)
g(x, t') = V(x, t') .
2.4. Infinitesimal generator: There are matrices of many different types that
play various roles in theory and computation. And so it is with operators. In
addition to the solution operator, there is the infinitesimal generator (or simply
generator). For Brownian motion in one dimension, the generator is

L = \tfrac{1}{2} \partial_x^2 , \qquad (22)

and the backward equation takes the form

\partial_t f + L f = 0 . \qquad (23)

For other diffusion processes, the generator is the operator L that puts the
backward equation for the process in the form (23).

Just as a matrix has a transpose, an operator has an adjoint, written L^*.
The forward equation takes the form

\partial_t u = L^* u .

The operator (22) for Brownian motion is self adjoint, which means that L^* = L,
which is why the operator \tfrac{1}{2} \partial_x^2 appears in both. We will return to these
points later.
2.6. Composing solution operators: The solution operator G(s_1) moves the
value function backward in time by the amount s_1, which is written f(t - s_1) =
G(s_1) f(t). The operator G(s_2) moves it back an additional s_2, i.e. f(t - (s_1 +
s_2)) = G(s_2) f(t - s_1) = G(s_2) G(s_1) f(t). The result is to move f back by s_1 + s_2
in total, which is the same as applying G(s_1 + s_2). This shows that for every
(allowed) f, G(s_2) G(s_1) f = G(s_2 + s_1) f, which means that

G(s_2) G(s_1) = G(s_2 + s_1) . \qquad (24)

This is called the semigroup property. It is a basic property of the solution
operator for any problem. The matrix analogue for Markov chains is P^{s_2 + s_1} =
P^{s_2} P^{s_1}, which is a basic fact about powers of matrices having nothing to do
with Markov chains. The property (24) would be called the group property if
we were to allow negative s_2 or s_1, which we do not. Negative s is allowed in
the matrix version if P is nonsingular. There is no particular physical reason
for the transition matrix of a Markov chain to be nonsingular.
If operators A and B both have kernels, then the composite operator has the
kernel
(AB)(x, y) = ∫_R A(x, z) B(z, y) dz .   (25)
To derive this formula, set g = Bf and h = Ag. Then h(x) = ∫ A(x, z) g(z) dz
and g(z) = ∫ B(z, y) f(y) dy, which implies that
h(x) = ∫ ( ∫ A(x, z) B(z, y) dz ) f(y) dy .
This shows that (25) is the kernel of AB. The formula is analogous to the
formula for matrix multiplication.
2.8. The semigroup property: When we defined the solution operators G(s)
in (18), we did so by specifying the kernels
G(x, y, s) = (1/√(2πs)) e^{−(x−y)²/2s} .
According to (25), the semigroup property should be an integral identity in-
volving G. The identity is
G(x, y, s2 + s1) = ∫ G(x, z, s2) G(z, y, s1) dz .
More concretely:
(1/√(2π(s2+s1))) e^{−(x−y)²/2(s2+s1)}
= ∫ (1/√(2πs2)) (1/√(2πs1)) e^{−(x−z)²/2s2} e^{−(z−y)²/2s1} dz .
5 The term kernel also describes vectors f with Af = 0; it is unfortunate that the same
word is used for both.
The reader is encouraged to verify this by direct integration. It also can be
verified by recognizing it as the statement that adding independent mean zero
Gaussian random variables with variances s2 and s1 respectively gives a Gaussian
with variance s2 + s1.
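The convolution identity can also be checked numerically. A minimal quadrature sketch (the grid limits and step, and the particular x, y, s1, s2, are arbitrary choices):

```python
import numpy as np

def heat_kernel(x, y, s):
    # G(x, y, s) = exp(-(x - y)^2 / (2 s)) / sqrt(2 pi s)
    return np.exp(-((x - y) ** 2) / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)

# Check G(x, y, s2 + s1) = integral of G(x, z, s2) G(z, y, s1) dz
x, y, s1, s2 = 0.3, -0.7, 0.5, 1.2
z = np.linspace(-20.0, 20.0, 200001)   # quadrature grid for the z integral
dz = z[1] - z[0]
lhs = heat_kernel(x, y, s2 + s1)
rhs = np.sum(heat_kernel(x, z, s2) * heat_kernel(z, y, s1)) * dz
print(lhs, rhs)   # the two sides agree to many digits
```

Direct integration as suggested above gives the same conclusion; the quadrature just automates it.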
If we take L outside the integral on the right, we recognize what is left in the
integral as f(t). Altogether, we have ∂_t f = V(t) − Lf(t). This is almost right;
I just have to fix the minus sign somehow.
2.11. Green's function: Consider the solution formula for the homogeneous
final value problem ∂_t f + Lf = 0, f(T) = V:
f(x, t) = ∫ G(x, y, T − t) V(y) dy .   (30)
Since the backward equation is linear, the general value function will be the
weighted sum (integral) of the point mass value functions, which is the formula
(30).
Stochastic Calculus Notes, Lecture 7
Last modified December 3, 2004
Our plan is to lay out the principal ideas first, then address the mathematical
foundations for them later. There will be many points in the beginning
paragraphs where we appeal to intuition rather than to mathematical analysis
in making a point. To justify this approach, I (mis)quote a snippet of a poem I
memorized in grade school: So you have built castles in the sky. That is where
they should be. Now put the foundations under them. (Author unknown by
me).
1.2. The Ito integral: Let F_t be the filtration generated by Brownian motion
up to time t, and let F(t) ∈ F_t be an adapted stochastic process. Correspond-
ing to the Riemann sum approximation to the Riemann integral, we define the
following approximations to the Ito integral:
Y_Δt(t) = Σ_{t_k < t} F(t_k) ΔW_k ,   (2)
with the usual notations t_k = kΔt and ΔW_k = W(t_{k+1}) − W(t_k). If the limit
exists, the Ito integral is
Y(t) = lim_{Δt→0} Y_Δt(t) .   (3)
There is some flexibility in this definition, though far less than with the Riemann
integral. It is absolutely essential that we use the forward difference rather than,
say, the backward difference ((wrong) ΔW_k = W(t_k) − W(t_{k−1})), so that
E[ F(t_k) ΔW_k | F_{t_k} ] = 0 .   (4)
But this is not what we get from the definition (2) with actual rough path
Brownian motion. Instead we write
W(t_k) = ½ (W(t_{k+1}) + W(t_k)) − ½ (W(t_{k+1}) − W(t_k)) ,
and get
Y_Δt(t_n) = Σ_{k<n} W(t_k) (W(t_{k+1}) − W(t_k))
= ½ Σ_{k<n} (W(t_{k+1}) + W(t_k)) (W(t_{k+1}) − W(t_k))
− ½ Σ_{k<n} (W(t_{k+1}) − W(t_k)) (W(t_{k+1}) − W(t_k))
= ½ Σ_{k<n} ( W(t_{k+1})² − W(t_k)² ) − ½ Σ_{k<n} (W(t_{k+1}) − W(t_k))² .
The second term is a sum of n independent random variables, each with expected
value Δt/2 and variance Δt²/2. As a result, the sum is a random variable with
mean nΔt/2 = t_n/2 and variance nΔt²/2 = t_n Δt/2. This implies that
½ Σ_{t_k<T} (W(t_{k+1}) − W(t_k))² → T/2 as Δt → 0 .   (6)
The difference between the right answer (7) and the wrong answer (5) is the
T/2 coming from (6). This is a quantitative consequence of the roughness of
Brownian motion paths. If W(t) were a differentiable function of t, that term
would have the approximate value
Δt ∫_0^T (dW/dt)² dt → 0 as Δt → 0 .
1.5. Martingales: The Ito integral is a martingale. It was defined for that
purpose. Often one can compute an Ito integral by starting with the ordinary
calculus guess (such as ½W(T)²) and asking what needs to change to make the
answer a martingale. In this case, the balancing term −T/2 does the trick.
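A short simulation makes the quadratic variation statement (6) and the telescoping identity above concrete; the horizon T and number of steps are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 2 ** 16
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), n)          # Brownian increments
W = np.concatenate(([0.0], np.cumsum(dW)))    # W(t_k) for k = 0..n

ito_sum = np.sum(W[:-1] * dW)                 # forward-difference Ito sum
quad_var = np.sum(dW ** 2)                    # quadratic variation
print(quad_var)                               # close to T = 1, as in (6)
# The telescoping identity from the text, exact for every sampled path:
print(ito_sum - (0.5 * W[-1] ** 2 - 0.5 * quad_var))   # zero up to rounding
```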
1.6. The Ito differential: Ito's lemma is a formula for the Ito differential,
which, in turn, is defined using the Ito integral. Let F(t) be a stochastic
process. We say dF = a(t)dW(t) + b(t)dt (the Ito differential) if
F(T) − F(0) = ∫_0^T a(t) dW(t) + ∫_0^T b(t) dt .   (8)
The first integral on the right is an Ito integral and the second is a Riemann
integral. Both a(t) and b(t) may be stochastic processes (random functions of
1 E[(W(t_n) − W(t_{n−1})) W(t_{n−1})] = 0, so E[(W(t_n) − W(t_{n−1})) W(t_n)] = E[(W(t_n) −
W(t_{n−1}))(W(t_n) − W(t_{n−1}))] = Δt.
time). For example, the Ito differential of W(t)² is d(W(t)²) = 2W(t)dW(t) + dt.
1.7. Ito's lemma: The simplest version of Ito's lemma involves a function
f(w, t). The lemma is the formula (which must have been stated as a lemma
in one of his papers):
df(W(t), t) = ∂_w f(W(t), t) dW(t) + ½ ∂_w² f(W(t), t) dt + ∂_t f(W(t), t) dt .   (9)
In integrated form,
f(W(T), T) − f(W(0), 0)   (10)
= ∫_0^T ∂_w f(W(t), t) dW(t) + ∫_0^T ( ½ ∂_w² f(W(t), t) + ∂_t f(W(t), t) ) dt .   (11)
1.8. Using Ito's lemma to evaluate an Ito integral: Like the fundamental
theorem of calculus, Ito's lemma can be used to evaluate integrals. For example,
consider
Y(T) = ∫_0^T W(t)² dW(t) .
W (t) values for 0 t T . A more technical version of this remark is coming
after the discussion of the Brownian bridge.
We give one and a half of the two parts of the proof of this theorem. If
b = 0 for all t (and all, or almost all, ω), then F(T) is an Ito integral and
hence a martingale. If b(t) is a continuous function of t, then we may find a
t* and δ > 0 and ε > 0 so that, say, b(t) > ε > 0 when |t − t*| < δ. Then
E[F(t* + δ) − F(t* − δ)] > 2δε > 0, so F is not a martingale2.
The tower property tells us that F(t) = f(W(t), t) is a martingale. But Ito's
lemma, together with the previous paragraph, implies that f(W(t), t) is a mar-
tingale if and only if ∂_t f + ½ ∂_w² f = 0, which is the backward equation for this case.
In fact, the proof of Ito's lemma (below) is much like the proof of this backward
equation.
This says that dF(t) = a(t)dW(t) + b(t)dt where
b(t) = −V(W(t), t) .
But also, b(t) = ∂_t f + ½ ∂_w² f. Equating these gives the backward equation from
Lecture 6:
∂_t f + ½ ∂_w² f + V(w, t) = 0 .
Taylor series expansion of the terms on the right of (14) will produce terms
that converge to the three integrals on the right of (13), plus error terms that
converge to zero. In our pre-Ito derivations of backward equations, we used the
relation E[(ΔW)²] = Δt. Here we argue that with many independent ΔW_k, we
may replace (ΔW_k)² with Δt (its mean value).
The Taylor series expansion is
f_{k+1} − f_k = ∂_w f_k ΔW_k + ½ ∂_w² f_k (ΔW_k)² + ∂_t f_k Δt + R_k ,   (15)
where ∂_w f_k means ∂_w f(W(t_k), t_k), etc. The remainder has the bound3
|R_k| ≤ C ( Δt² + Δt |ΔW_k| + |ΔW_k|³ ) .
Finally, we separate the mean value of ΔW_k² from the deviation from the mean:
½ ∂_w² f_k ΔW_k² = ½ ∂_w² f_k Δt + ½ ∂_w² f_k (ΔW_k² − Δt) .
The individual summands on the right side all have order of magnitude Δt.
However, the mean zero terms (the second sum) add up to much less than the
first sum, as we will see. With this, (14) takes the form
f_n − f_0 = Σ_{k=0}^{n−1} ∂_w f_k ΔW_k + Σ_{k=0}^{n−1} ∂_t f_k Δt + ½ Σ_{k=0}^{n−1} ∂_w² f_k Δt
+ ½ Σ_{k=0}^{n−1} ∂_w² f_k (ΔW_k² − Δt) + Σ_{k=0}^{n−1} R_k .   (16)
3 We assume that f(w, t) is thrice differentiable with bounded third derivatives. The error
in a finite Taylor approximation is bounded by the size of the largest terms not used. Here,
that is Δt² (for the omitted term ∂_t² f), Δt ΔW (for ∂_t ∂_w f), and ΔW³ (for ∂_w³ f).
The first three sums on the right converge respectively to the corresponding
integrals on the right side of (13). A technical digression will show that the last
two converge to zero as n → ∞ in a suitable way.
1.13. Like Borel-Cantelli: As much as the formulas, the proofs in stochastic
calculus rely on calculating expected values of things. Here, S_m is a sequence
of random numbers and we want to show that S_m → 0 as m → ∞ (almost
surely). We use two observations. First, if s_m is a sequence of numbers with
Σ_{m=1}^∞ |s_m| < ∞, then s_m → 0 as m → ∞. Second, if B ≥ 0 is a random
variable with E[B] < ∞, then B < ∞ almost surely (if the event {B = ∞} has
positive probability, then E[B] = ∞). We take B = Σ_{m=1}^∞ |S_m|. If B < ∞
then Σ_{m=1}^∞ |S_m| < ∞, so S_m → 0 as m → ∞. What this shows is:
Σ_{m=1}^∞ E[|S_m|] < ∞  ⟹  S_m → 0 as m → ∞ (a.s.)   (17)
This observation is a variant of the Borel-Cantelli lemma, which often is used
in such arguments.
1.14. One of the error terms: To apply the Borel-Cantelli lemma we must
find bounds for the error terms, bounds whose sum is finite. We start with the
last error term in (16). Choose n = 2^m and define S_m = Σ_{k=0}^{n−1} R_k, with
|R_k| ≤ C ( Δt² + Δt |ΔW_k| + |ΔW_k|³ ) .
Since E[|ΔW_k|] ≤ C √Δt and E[|ΔW_k|³] ≤ C Δt^{3/2} (you do the math: the
integrals), this gives (with nΔt = T)
E[|S_m|] ≤ C n ( Δt² + Δt^{3/2} ) ≤ C T √Δt .
Expressed in terms of m, we have Δt = T/2^m and √Δt = √T 2^{−m/2}.
Therefore E[|S_m|] ≤ C(T) 2^{−m/2}. Now, if z is any number
greater than one, then Σ_{m=1}^∞ z^{−m} = 1/(z − 1) < ∞ (here with z = 2^{1/2}). This implies that
Σ_{m=1}^∞ E[|S_m|] < ∞ and (using Borel-Cantelli) that S_m → 0 as m → ∞ (almost
surely).
This argument would not have worked this way had we taken n = m instead
of n = 2^m. The error bounds of order 1/√n would not have had a finite
sum. If both error terms in the bottom line of (16) go to zero as m → ∞
with n = 2^m, this will prove Ito's lemma. We will return to this point when
we discuss the difference between almost sure convergence, which we are using
here, and convergence in probability, which we are not.
1.15. The other sum: The other error sum in (16) is small not because of the
smallness of its terms, but because of cancellation. The positive and negative
terms roughly balance, leaving a sum smaller than the sizes of the terms would
suggest. This cancellation is of the same sort appearing in the central limit
theorem, where Σ_{k=0}^{n−1} X_k = U_n is of order √n rather than n when the X_k are
i.i.d. with finite variance. In fact, using a trick we used before, we show that U_n²
is of order n rather than n²:
E[U_n²] = Σ_{jk} E[X_j X_k] = n E[X_k²] = cn .
Our sum is
U_n = ½ Σ_k ∂_w² f(W_k, t_k) ( ΔW_k² − Δt ) .
The above argument applies, though the terms are not independent. Suppose
j ≠ k and, say, k > j. The cross term involving ΔW_j and ΔW_k still vanishes
because
E[ ΔW_k² − Δt | F_{t_k} ] = 0 ,
and the rest is in F_{t_k}. Also (as we have used before)
E[ ( ΔW_k² − Δt )² | F_{t_k} ] = 2Δt² .
Therefore
E[U_n²] = ¼ Σ_{k=0}^{n−1} E[ ( ∂_w² f(W_k, t_k) )² ] 2Δt² ≤ C(T) Δt .
W_{k+1/2} = W(t_{k+1/2}), etc. The t_k term in the Y_m sum corresponds to the time
interval (t_k, t_{k+1}). The Y_{m+1} sum divides this interval into two subintervals of
length Δt/2. Therefore, for each term in the Y_m sum there are two correspond-
ing terms in the Y_{m+1} sum (assuming T is a multiple of Δt), and:
Y_{m+1}(T) − Y_m(T) = Σ_{t_k<T} [ F(t_k)(W_{k+1/2} − W_k) + F(t_{k+1/2})(W_{k+1} − W_{k+1/2})
− F(t_k)(W_{k+1} − W_k) ]
= Σ_{t_k<T} (W_{k+1} − W_{k+1/2}) ( F(t_{k+1/2}) − F(t_k) )
= Σ_{t_k<T} R_k ,
where
R_k = (W_{k+1} − W_{k+1/2}) ( F(t_{k+1/2}) − F(t_k) ) .
We compute E[(Y_{m+1}(T) − Y_m(T))²] = Σ_{jk} E[R_j R_k]. As before,4 E[R_j R_k] =
0 unless j = k. Also, the independent increments property and (19) imply that5
E[R_k²] = E[ (W_{k+1} − W_{k+1/2})² ] E[ ( F(t_{k+1/2}) − F(t_k) )² ]
≤ (Δt/2) (C Δt/2) = C Δt² .
This gives
E[ ( Y_{m+1}(T) − Y_m(T) )² ] ≤ C 2^{−m} .   (20)
The convergence of the Ito sums follows from (20) using our Borel-Cantelli
type lemma. Let S_m = Y_{m+1} − Y_m. From (20), we have6 E[|S_m|] ≤ C 2^{−m/2}.
Thus
lim_{m→∞} Y_m(T) = Y_1(T) + Σ_{m≥1} ( Y_{m+1}(T) − Y_m(T) )
exists and is finite. This shows that the limit defining the Ito integral exists, at
least in the case of an integrand that satisfies (19), which includes most of the
cases we use.
The derivation uses what we just have done. We approximate the Ito integral
by the sum
Σ_{T1 ≤ t_k < T2} a(t_k) ΔW_k .
It is tempting to take dW(t) = ξ(t)dt in the Ito integral and use (22) to derive
the Ito isometry formula. However, this must be done carefully because the
existence of the Ito integral, and the isometry formula, depend on the causality
structure that makes dW(t) independent of a(t).
7 A generalized function is not an actual function, but has properties defined as though it
were an actual function through integration. The δ function, for example, is defined by the
formula ∫ f(t) δ(t) dt = f(0). No actual function can do this. Generalized functions also are
called distributions.
2 Stochastic Differential Equations
The SDE (23), dX(t) = a(X(t), t)dt + σ(X(t), t)dW(t), is shorthand for the integral equation
X(t) = X(0) + ∫_0^t a(X(s), s) ds + ∫_0^t σ(X(s), s) dW(s) ,   (24)
where the first integral on the right is a Riemann integral and the second is an
Ito integral. We often specify initial conditions X(0) ∼ u_0(x), where u_0(x) is
the given probability density for X(0). Specifying X(0) = x_0 is the same as
saying u_0(x) = δ(x − x_0). As in the general Ito differential, a(X(t), t)dt is the
drift term, and σ(X(t), t)dW(t) is the martingale term. We often call σ(x, t)
the volatility. However, this is a different use of the letter from Black-Scholes,
where the martingale term is σx dW for a constant σ (also called volatility).
with initial data X(0) = 1, defines geometric Brownian motion. In the general
formulation above, (25) has drift coefficient a(x, t) = μx and volatility σ(x, t) =
σx (with the conflict of terminology noted above). If W(t) were a differentiable
function of t, the solution would be X(t) = e^{μt + σW(t)}. Instead, Ito
differentiation of this formula gives
dX = μX dt + σX dW(t) + (σ²/2) X dt .
We can remove the unwanted final term by multiplying by e^{−σ²t/2}, which sug-
gests that the formula
X(t) = e^{μt − σ²t/2 + σW(t)}   (27)
satisfies (25). A quick Ito differentiation verifies that it does.
The solution, with initial data X(0) = 1, is the simple geometric Brownian
motion
X(t) = exp( σW(t) − σ²t/2 ) .   (29)
We discuss (29) in relation to the martingale property (X(t) is a martingale
because the drift term in (23) is zero in (28)). A simple calculation based on
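A quick numerical check of the exact solution formula: an Euler path of the geometric Brownian motion SDE against the closed form driven by the same increments. The parameters μ, σ, and the step count are illustrative choices:

```python
import numpy as np

# Euler-Maruyama for dX = mu X dt + sigma X dW against the exact solution
# X(t) = exp((mu - sigma^2/2) t + sigma W(t)).
rng = np.random.default_rng(2)
mu, sigma, T, n = 0.1, 0.4, 1.0, 2 ** 14
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), n)

X = 1.0
for dw in dW:
    X += mu * X * dt + sigma * X * dw    # Euler-Maruyama step

exact = np.exp((mu - 0.5 * sigma ** 2) * T + sigma * dW.sum())
print(X, exact)   # agree up to the small strong discretization error
```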
2.6. Strong and weak solutions: A strong solution is an adapted function
X(W, t), where the Brownian motion path W again plays the role of the abstract
random variable, ω. As in the discrete case, X(t) (i.e. X(W, t)) being measurable
in F_t means that X(t) is a function of the values of W(s) for 0 ≤ s ≤ t. The two
examples we have, geometric Brownian motion (27) and the Ornstein-Uhlenbeck
process8
X(t) = ∫_0^t e^{−γ(t−s)} dW(s) ,   (30)
both have this property. Note that (27) depends only on W(t), while (30)
depends on the whole path up to time t.
A weak solution is a stochastic process, X(t), defined perhaps on a different
probability space and filtration (Ω, F_t), that has the statistical properties called
for by (23). These are (using ΔX = X(t + Δt) − X(t)) roughly9
E[ ΔX | F_t ] = a(X(t), t) Δt + o(Δt) ,   (31)
and
E[ ΔX² | F_t ] = σ²(X(t), t) Δt + o(Δt) .   (32)
We will see that a strong solution satisfies (31) and (32), so a strong solution
is a weak solution. It makes no sense to ask whether a weak solution is a
strong solution since we have no information on how, or even whether, the weak
solution depends on W .
The formulas (31) and (32) are helpful in deriving SDE descriptions of phys-
ical or financial systems. We calculate the left sides to identify the a(x, t) and
σ(x, t) in (23). Brownian motion paths and Ito integration are merely a tool
for constructing the desired process X(t). We saw in the example of geometric
Brownian motion that expressing the solution in terms of W(t) can be very
convenient for understanding its properties. For example, it is not particularly
easy to show that X(t) → 0 as t → ∞ from (31) and (32) with a(x, t) = μx and10
σ(x, t) = σx.
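The weak form conditions can be checked on a simulated path. A sketch for the Ornstein-Uhlenbeck process dX = −γX dt + dW, with γ = 1, the step size, and the path length all illustrative choices; since the drift a(x) = −γx is linear, (31) is checked by regressing increments on X, and (32) by the mean square increment:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, dt, n = 1.0, 1e-3, 1_000_000
dW = rng.normal(0.0, np.sqrt(dt), n)
X = np.empty(n + 1)
X[0] = 1.0
for k in range(n):
    X[k + 1] = X[k] - gamma * X[k] * dt + dW[k]   # Euler step for the SDE

dX = np.diff(X)
# (31): E[dX | X = x] = a(x) dt with a(x) = -gamma x; least-squares slope
a_slope = np.sum(X[:-1] * dX) / (np.sum(X[:-1] ** 2) * dt)
# (32): E[dX^2 | X = x] = sigma^2 dt with sigma = 1 here
s2_hat = np.mean(dX ** 2) / dt
print(a_slope, s2_hat)   # near -1 and 1 respectively
```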
2.7. Strong is weak: We just verify that the strong solution to (23) that
satisfies (24) also satisfies the weak form requirements (31) and (32). This is an
important motivation for using the Ito definition of dW rather than, say, the
Stratonovich definition.
A slightly more general fact is simpler to explain. Define R and I by
R = ∫_t^{t+Δt} a(s) ds ,   I = ∫_t^{t+Δt} σ(s) dW(s) ,
8 This process satisfies the SDE dX = −γX dt + dW, with X(0) = 0.
9 The little o notation f(Δt) = g(Δt) + o(Δt) informally means that the difference between f
and g is a mathematician's order of magnitude smaller than Δt for small Δt. Formally, it means
that (f(Δt) − g(Δt))/Δt → 0 as Δt → 0.
10 This conflict of notation is common in discussing geometric Brownian motion. On the left
where a(t) and σ(t) are continuous adapted stochastic processes. We want to
see that
E[ R + I | F_t ] = a(t) Δt + o(Δt) ,   (33)
and
E[ (R + I)² | F_t ] = σ²(t) Δt + o(Δt) .   (34)
We may leave I out of (33) because E[I] = 0 always. We may leave R out of (34)
because |I| >> |R|. (If a is bounded then R = O(Δt), so E[R² | F_t] = O(Δt²).
The Ito isometry formula suggests that E[I² | F_t] = O(Δt). Cauchy-Schwarz
then gives E[RI | F_t] = O(Δt^{3/2}). Altogether, E[(R + I)² | F_t] = E[I² |
F_t] + O(Δt^{3/2}).)
To verify (33) without I, we assume that a(t) is a continuous function of t
in the sense that for s > t,
E[ a(s) − a(t) | F_t ] → 0 as s → t ,
so that
E[ R | F_t ] = ∫_t^{t+Δt} E[ a(s) | F_t ] ds
= ∫_t^{t+Δt} E[ a(t) | F_t ] ds + ∫_t^{t+Δt} E[ a(s) − a(t) | F_t ] ds
= Δt a(t) + o(Δt) .
have a limit as a diffusion. The main step in proving the limit exists is tightness,
which we hint at in a lecture to follow. We identify a and σ by calculations. Then
we use the representation theorem to say that the process may be represented
as the strong solution to (23).
2.9. Backward equation: The simplest backward equation is the PDE sat-
isfied by f(x, t) = E_{x,t}[V(X(T))]. We derive it using the weak form conditions
(31) and (32) and the tower property. As with Brownian motion, the tower
property gives
f(x, t) = E_{x,t}[V(X(T))] = E_{x,t}[F(t + Δt)] ,
where F(s) = E[V(X(T)) | F_s]. The Markov property implies that F(s) is a
function of X(s) alone, so F(s) = f(X(s), s). This gives
f(x, t) = E_{x,t}[ f(X(t + Δt), t + Δt) ] .   (35)
If we assume that f is a smooth function of x and t, we may expand in Taylor
series, keeping only terms that contribute O(Δt) or more.12 We use ΔX =
X(t + Δt) − x and write f for f(x, t), f_t for f_t(x, t), etc.
f(X(t + Δt), t + Δt) = f + f_t Δt + f_x ΔX + ½ f_xx ΔX² + smaller terms.
Therefore (31) and (32) give:
f(x, t) = E_{x,t}[ f(X(t + Δt), t + Δt) ]
= f(x, t) + f_t Δt + f_x E_{x,t}[ΔX] + ½ f_xx E_{x,t}[ΔX²] + o(Δt)
= f(x, t) + f_t Δt + f_x a(x, t) Δt + ½ f_xx σ²(x, t) Δt + o(Δt) .
We now just cancel the f(x, t) from both sides, let Δt → 0, and drop the o(Δt)
terms to get the backward equation
∂_t f(x, t) + a(x, t) ∂_x f(x, t) + (σ²(x, t)/2) ∂_x² f(x, t) = 0 .   (36)
2.10. Forward equation: The forward equation follows from the backward
equation by duality. Let u(x, t) be the probability density for X(t). Since
f(x, t) = E_{x,t}[V(X(T))], we may write
E[V(X(T))] = ∫ u(x, t) f(x, t) dx ,
We integrate by parts to put the x derivatives on u. We may ignore boundary
terms if u decays fast enough as |x| → ∞ and if f does not grow too fast. The
result is
∫ [ ∂_x ( a(x, t) u(x, t) ) − ½ ∂_x² ( σ²(x, t) u(x, t) ) + ∂_t u(x, t) ] f(x, t) dx = 0 .
Since this should be true for every function f(x, t), the integrand must vanish
identically, which implies that
∂_t u = −∂_x ( a(x, t) u ) + ½ ∂_x² ( σ²(x, t) u ) .   (37)
This is the forward equation for the Markov process that satisfies (31) and (32).
In this equation, t and y are merely parameters, but s may not be smaller than
t. The initial condition that represents the requirement that X(t) = y is
G(y, x, t, t) = δ(x − y) .   (39)
The transition density is the Green's function for the forward equation, which
means that the general solution may be written in terms of G as
u(x, s) = ∫ u(y, t) G(y, x, t, s) dy .   (40)
This formula is a continuous time version of the law of total probability: the
probability density to be at x at time s is the sum (integral) of the probability
density to be at x at time s conditional on being at y at time t (which is
G(y, x, t, s)) multiplied by the probability density to be at y at time t (which is
u(y, t)).
2.12. Green's function for the backward equation: We can also express the
solution of the backward equation in terms of the transition probabilities G. For
s > t,
f(y, t) = E_{y,t}[ f(X(s), s) ] ,
which is an expression of the tower property. The expected value on the right
may be evaluated using the transition probability density for X(s). The result
is
f(y, t) = ∫ G(y, x, t, s) f(x, s) dx .   (41)
For this to hold, G must satisfy the backward equation as a function of y and t
(which were parameters in (38)). To show this, we apply the backward equation
operator (see below for terminology) ∂_t + a(y, t)∂_y + ½σ²(y, t)∂_y² to both sides.
The left side gives zero because f satisfies the backward equation. Therefore we
find that
0 = ∫ [ ∂_t + a(y, t)∂_y + ½σ²(y, t)∂_y² ] G(y, x, t, s) f(x, s) dx .
Here x and s are parameters. The final condition for (42) is the same as (39).
The equality s = t represents the initial time for s and the final time for t
because G is defined for all t ≤ s.
2.13. The generator: The generator of an Ito process is the operator con-
taining the spatial part of the backward equation13:
L = a(x, t)∂_x + ½σ²(x, t)∂_x² .
We leave out the t variable for the moment. If u and f are complex, we take
the complex conjugate of u above. The adjoint is defined by the requirement
that for general u and f,
⟨u, Lf⟩ = ⟨L*u, f⟩ .
13 Some people include the time derivative in the definition of the generator. Watch for this.
In practice, this boils down to the same integration by parts we used to derive
the forward equation from the backward equation:
⟨u, Lf⟩ = ∫ u(x) [ a(x)∂_x f(x) + ½σ²(x)∂_x² f(x) ] dx
= ∫ [ −∂_x ( a(x)u(x) ) + ½∂_x² ( σ²(x)u(x) ) ] f(x) dx .
With this notation, the forward equation is
∂_t u = L(t)* u .
All we have done here is define notation (L*) and show how our previous deriva-
tion of the forward equation is expressed in terms of it.
2.15. Adjoints and the Green's function: Let us summarize and record what
we have said about the transition probability density G(y, x, t, s). It is defined
for s ≥ t and has G(y, x, t, t) = δ(x − y). It moves probabilities forward by
integrating over y (38) and moves expected values backward by integrating over
x (41). As a function of x and s it satisfies the forward equation
This follows immediately from the representation (41). The expected value of
f(X(s), s) cannot be larger than its maximum value. Since this holds for every
x, it holds in particular for the maximizer.
There is a more complicated proof of the maximum principle that uses the
backward equation. I give a slightly naive explanation to avoid taking too long
with it. Let m(t) = max_x f(x, t). We are trying to show that m(t) never
increases. If, on the contrary, m(t) does increase as t decreases, there must be
a t* with (dm/dt)(t*) = −ε < 0. Choose x* so that f(x*, t*) = max_x f(x, t*). Then
f_x(x*, t*) = 0 and f_xx(x*, t*) ≤ 0. The backward equation then implies that
f_t(x*, t*) ≥ 0 (because σ² ≥ 0), which contradicts f_t(x*, t*) < 0.
The PDE proof of the maximum principle shows that the coefficients a and
σ² have to be outside the derivatives in the backward equation. Our argument
that Lf ≤ 0 at a maximum where f_x = 0 and f_xx ≤ 0 would be wrong if we had,
say, ∂_x(a(x)f(x, t)) rather than a(x)∂_x f(x, t). We could get a nonzero value
because of variation in a(x) even when f was constant. The forward equation
does not have a maximum principle for this reason. Both the Ornstein-Uhlen-
beck and geometric Brownian motion problems have cases where max_x u(x, t)
increases in forward time or backward time.
3.3. Conservation of probability: The probability density has ∫ u(x, t) dx = 1.
We can see that
(d/dt) ∫ u(x, t) dx = 0
also from the forward equation (37). We simply differentiate under the integral,
substitute from the equation, and integrate the resulting x derivatives. For this
it is crucial that the coefficients a and σ² be inside the derivatives. Almost any
example with a(x, t) or σ(x, t) not independent of x will show that
(d/dt) ∫ f(x, t) dx ≠ 0 .
If the drift is zero, the expected value is conserved as well:
(d/dt) E[X(t)] = (d/dt) ∫ x u(x, t) dx
= ∫ x u_t(x, t) dx
= ∫ x ½ ∂_x² ( σ²(x, t) u(x, t) ) dx
= −∫ ½ ∂_x ( σ²(x, t) u(x, t) ) dx
= 0 .
This would not be true for the backward equation form ½σ²(x, t)∂_x² f(x, t), or even
for the mixed form we get from the Stratonovich calculus, ½∂_x ( σ²(x, t)∂_x f(x, t) ).
The mixed Stratonovich form conserves probability but not expected value.
3.5. Drift and advection: If there is no noise (σ = 0), then the SDE (23) becomes the
ordinary differential equation (ODE)
dx/dt = a(x, t) .   (45)
If x(t) is a solution, then clearly the expected payout should satisfy f(x(t), t) =
f(x(s), s); if nothing is random, then the expected value is the value. It is easy
to check using the backward equation that f(x(t), t) is independent of t if x(t)
satisfies (45) and σ = 0:
(d/dt) f(x(t), t) = f_t(x(t), t) + (dx/dt) f_x(x(t), t) = f_t(x(t), t) + a(x(t), t) f_x(x(t), t) = 0 .
Advection is the process of being carried by the wind. If there is no diffusion,
then the values of f are simply advected by the drift. The term drift implies
that this advection process is slow and gentle. If σ is small but not zero, then
f may be essentially advected with a little spreading or smearing induced by
diffusion. Computing drift dominated solutions can be more challenging than
computing diffusion dominated ones.
The probability density does not have u(x(t), t) constant (try it in the
forward equation). There is a conservation of probability correction to this that
you can find if you are interested.
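A minimal sketch of the advection picture for constant drift a and σ = 0 (the payout V and the numbers are illustrative choices): f(x, t) = V(x + a(T − t)) solves f_t + a f_x = 0 with final data V, and it is constant along solutions of (45):

```python
import numpy as np

# Backward advection equation f_t + a f_x = 0 with f(x, T) = V(x) is
# solved by f(x, t) = V(x + a (T - t)).
a, T = 2.0, 1.0
V = lambda x: np.tanh(x)
f = lambda x, t: V(x + a * (T - t))

x0 = -0.3
ts = np.linspace(0.0, T, 11)
vals = np.array([f(x0 + a * t, t) for t in ts])   # f along x(t) = x0 + a t
print(vals.max() - vals.min())   # zero: f is constant along the trajectory
```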
Stochastic Calculus Notes, Lecture 8
Last modified December 14, 2004
The estimate is unbiased because the bias, A − E_u[Â_u], is zero. The error is
determined by the variance var(Â_u) = (1/N) var_u(V(X)).
Let v(x) be another probability density so that v(x) ≠ 0 for all x with
u(x) ≠ 0. Then clearly
A = ∫ V(x) u(x) dx = ∫ V(x) (u(x)/v(x)) v(x) dx .
We express this as
A = E_u[V(X)] = E_v[V(X) L(X)] , where L(x) = u(x)/v(x) .   (2)
The ratio L(x) is called the score function in Monte Carlo, the likelihood ratio
in statistics, and the Radon-Nikodym derivative by mathematicians. We get a
different unbiased estimate of A by generating N independent samples of v and
taking
A ≈ Â_v = (1/N) Σ_{k=1}^N V(X_k) L(X_k) .   (3)
The accuracy of (3) is determined by
var_v( V(X) L(X) ) = E_v[ ( V(X) L(X) − A )² ] = ∫ ( V(x) L(x) − A )² v(x) dx .
For example, take V(x) = 1_{x>a}, so that A = P(X > a) for X ∼ N(0, 1). The
naive estimator is
X_k ∼ N(0, 1), k = 1, …, N ,   A ≈ Â_u = (1/N) #{ k : X_k > a } .   (4)
For large a, the hits, X_k > a, would be a small fraction of the samples, with the
rest being wasted.
One importance sampling strategy uses v corresponding to N(a, 1). It
seems natural to try to increase the number of hits by moving the mean from
0 to a. Since most hits are close to a, it would be a mistake to move the
mean farther than a. Using the probability densities u(x) = (1/√2π) e^{−x²/2} and
v(x) = (1/√2π) e^{−(x−a)²/2}, we find L(x) = u(x)/v(x) = e^{a²/2} e^{−ax}. The importance
sampling estimate is
X_k ∼ N(a, 1), k = 1, …, N ,
A ≈ Â_v = (1/N) e^{a²/2} Σ_{X_k>a} e^{−aX_k} .
Some calculations show that the variance of Â_v is smaller than the variance
of the naive estimator Â_u by a factor of roughly e^{−a²/2}. A simple way to
generate N(a, 1) random variables is to start with mean zero standard normals
Y_k ∼ N(0, 1) and add a: X_k = Y_k + a. In this form, e^{a²/2} e^{−aX_k} = e^{−a²/2} e^{−aY_k},
and X_k > a is the same as Y_k > 0, so the variance reduced estimator becomes
Y_k ∼ N(0, 1), k = 1, …, N ,
A ≈ Â_v = e^{−a²/2} (1/N) Σ_{Y_k>0} e^{−aY_k} .   (5)
The naive Monte Carlo method (4) produces a small Â by getting a small
number of hits in many samples. The importance sampling method (5) gets
roughly 50% hits but discounts each hit by a factor of at least e^{a²/2} to get the
same expected value as the naive estimator.
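The two estimators side by side, as a sketch (the values of a and N are illustrative choices):

```python
import numpy as np

# A = P(X > a) for X ~ N(0,1): naive estimator (4) against the
# importance-sampled estimator (5).
rng = np.random.default_rng(4)
a, N = 3.0, 100_000

X = rng.normal(0.0, 1.0, N)
A_u = np.mean(X > a)                  # naive: fraction of hits

Y = rng.normal(0.0, 1.0, N)           # X_k = Y_k + a ~ N(a, 1)
A_v = np.exp(-a * a / 2) * np.mean(np.where(Y > 0, np.exp(-a * Y), 0.0))

print(A_u, A_v)   # both near 1 - Phi(3), about 1.35e-3; A_v is far less noisy
```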
1.4. Radon-Nikodym derivative: Suppose Ω is a measure space with σ-
algebra F and probability measures P and Q. We say that L(ω) is the
Radon-Nikodym derivative of P with respect to Q if dP(ω) = L(ω)dQ(ω), or,
more formally,
∫ V(ω) dP(ω) = ∫ V(ω) L(ω) dQ(ω) ,
which is to say
E_P[V] = E_Q[V L] ,   (6)
for any V, say, with E_P[|V|] < ∞. People often write L = dP/dQ, and call it
the Radon-Nikodym derivative of P with respect to Q. If we know L, then the
right side of (6) offers a different and possibly better way to estimate E_P[V].
Our goal will be a formula for L when P and Q are measures corresponding to
different SDEs.
1.5. Absolute continuity: One obstacle to finding L is that it may not exist.
If A is an event with P(A) > 0 but Q(A) = 0, L cannot exist because the
formula (6) would become
P(A) = ∫_A dP(ω) = ∫ 1_A(ω) dP(ω) = ∫ 1_A(ω) L(ω) dQ(ω) .
Looking back at our definition of the abstract integral, we see that if the event
A = {f(ω) ≠ 0} has Q(A) = 0, then all the approximations to ∫ f(ω) dQ(ω) are
zero, so ∫ f(ω) dQ(ω) = 0.
We say that measure P is absolutely continuous with respect to Q if Q(A) =
0 ⟹ P(A) = 0 for every1 A ∈ F. We just showed that L cannot exist unless
P is absolutely continuous with respect to Q. On the other hand, the Radon-
Nikodym theorem states that an L satisfying (6) does exist if P is absolutely
continuous with respect to Q.
In practical examples, if P is not absolutely continuous with respect to Q,
then P and Q are completely singular with respect to each other. This means
that there is an event, A ∈ F, with P(A) = 1 and Q(A) = 0.
for this reason always to use the σ-algebra of Borel sets. It is common to imagine completing
a measure by adding to F all subsets of events with P(A) = 0. Though it may seem better to
have more measurable events, it makes the change of measure discussions more complicated.
Q if the densities are different from zero in different places. An example with
n = 1 is P corresponding to a negative exponential random variable, u(x) = e^x
for x ≤ 0 and u(x) = 0 for x > 0, while Q corresponds to a positive exponential,
v(x) = e^{−x} for x ≥ 0 and v(x) = 0 for x < 0.
Another way to get singular probability measures is to have measures using
densities concentrated on lower dimensional sets. An example with Ω = R² has
Q saying that X1 and X2 are independent standard normals while P says that
X1 = X2. The probability density for P is u(x1, x2) = (1/√2π) e^{−x1²/2} δ(x2 − x1).
The event A = {X1 = X2} has Q probability zero but P probability one.
1.9. Coin tossing: In common situations where this works, the function F(ω)
is a limit that exists almost surely (but with different values) for both P and Q.
If lim_{n→∞} F_n(ω) = a almost surely with respect to P and lim_{n→∞} F_n(ω) = b
almost surely with respect to Q, with a ≠ b, then P and Q are completely singular.
Suppose we make an infinite sequence of coin tosses with the tosses being
independent and having the same probability of heads. We describe this by
taking Ω to be the set of infinite sequences ω = (Y1, Y2, …), where the k-th toss Y_k equals
one or zero, and the Y_k are independent. Let the measure P represent tossing
with Y_k = 1 with probability p, and Q represent tossing with Y_k = 1 with
probability q ≠ p. Let F_n(ω) = (1/n) Σ_{k=1}^n Y_k. The (Kolmogorov strong) law of
large numbers states that F_n → p as n → ∞ almost surely in P and F_n →
q as n → ∞ almost surely in Q. This shows that P and Q are completely
singular with respect to each other. Note that this is not an example of discrete
probability in our sense because the state space consists of infinite sequences.
The set of infinite sequences is not countable (a theorem of Cantor).
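A finite-n illustration of the separating frequencies (p, q, and n are arbitrary choices):

```python
import numpy as np

# Empirical head frequencies F_n under P (p = 0.5) and under Q (q = 0.7).
rng = np.random.default_rng(5)
p, q, n = 0.5, 0.7, 100_000
Fn_P = np.mean(rng.random(n) < p)   # tosses drawn under P
Fn_Q = np.mean(rng.random(n) < q)   # tosses drawn under Q
print(Fn_P, Fn_Q)   # near 0.5 and 0.7: the event {F_n -> p} has
                    # P probability one and Q probability zero
```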
1.10. The Cameron Martin formula: The Cameron Martin formula relates
the measure, P , for Brownian motion with drift to the Wiener measure, W , for
standard Brownian motion without drift. Wiener measure describes the process
The P measure describes solutions of the SDE. This is exact in the case a = 0.
We write V(~x) for the joint density of ~X under W and U(~x) for the joint density
under (9). We calculate L_Δt(~x) = U(~x)/V(~x) and observe the limit as Δt → 0.
To carry this out, we again note that the joint density is the product of the
transition probability densities. For (7), if we know x_k, then X_{k+1} is normal
with mean x_k and variance Δt. This gives
G(x_k, x_{k+1}, Δt) = (1/√(2πΔt)) e^{−(x_{k+1}−x_k)²/2Δt} ,
and
V(~x) = (2πΔt)^{−n/2} exp( −(1/2Δt) Σ_{k=0}^{n−1} (x_{k+1} − x_k)² ) .   (10)
For (9), the approximation to (8), $X_{k+1}$ is normal with mean $x_k + \Delta t\, a(x_k, t_k)$
and variance $\Delta t$. This makes its transition density

$$G(x_k, x_{k+1}, \Delta t) = \frac{1}{\sqrt{2\pi\Delta t}}\, e^{-(x_{k+1}-x_k-\Delta t\, a(x_k,t_k))^2/2\Delta t} ,$$

so that

$$U(\vec{x}) = \left(2\pi\Delta t\right)^{-n/2} \exp\left( -\frac{1}{2\Delta t} \sum_{k=0}^{n-1} \left(x_{k+1}-x_k-\Delta t\, a(x_k,t_k)\right)^2 \right) . \qquad (11)$$
Dividing $U$ by $V$ removes the $2\pi\Delta t$ factors and cancels the $(x_{k+1}-x_k)^2$ terms in the
exponents. What remains is

$$L_{\Delta t}(\vec{x}) = \exp\left( \sum_{k=0}^{n-1} a(x_k,t_k)(x_{k+1}-x_k) \;-\; \frac{\Delta t}{2} \sum_{k=0}^{n-1} a^2(x_k,t_k) \right) .$$
The first term in the exponent converges to the Ito integral

$$\sum_{k=0}^{n-1} a(x_k, t_k)(x_{k+1}-x_k) \to \int_0^T a(X(t),t)\,dX(t) \quad\text{as } \Delta t \to 0 ,$$
if $t_n = \max\{t_k < T\}$. The second term converges to the Riemann integral

$$\sum_{k=0}^{n-1} \Delta t\, a^2(x_k, t_k) \to \int_0^T a^2(X(t),t)\,dt \quad\text{as } \Delta t \to 0 .$$
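The discrete likelihood ratio can be checked numerically. Below is a sketch (the constant drift $a = 0.5$, horizon $T = 1$, and sample sizes are illustrative choices): with constant drift the exponent telescopes, and reweighting Wiener-measure samples by $L$ should reproduce expectations under the drifted measure, e.g. $E_P[X(T)] = aT$, and $E_W[L] = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, paths = 1.0, 50, 200_000
dt = T / n
a = 0.5  # constant drift, an illustrative choice

# Brownian increments under the Wiener measure W, with X(0) = 0.
dB = rng.normal(0.0, np.sqrt(dt), size=(paths, n))
X_T = dB.sum(axis=1)

# Discrete likelihood ratio: with constant a the first sum telescopes,
# L = exp( a*(X(T) - X(0)) - (T/2) * a**2 ).
L = np.exp(a * X_T - 0.5 * a**2 * T)

# Reweighting W-samples by L reproduces drifted expectations.
est = (L * X_T).mean()
print(L.mean(), est)  # near 1 and near a*T = 0.5
```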
2 Multidimensional diffusions
2.2. Strong solutions: The drift now is a drift for each component of $X$,
$a(x,t) = (a_1(x,t), \ldots, a_n(x,t))^t$. Each component of $a$ may depend on all com-
ponents of $X$. The $\sigma$ now is an $n \times m$ matrix, where $m$ is the number of
independent sources of noise. We let $B(t)$ be a column vector of $m$ independent
standard Brownian motion paths, $B(t) = (B_1(t), \ldots, B_m(t))^t$. The stochastic
differential equation is

$$dX(t) = a(X(t),t)\,dt + \sigma(X(t),t)\,dB(t) , \qquad (13)$$

whose integral form is

$$X(t) = X(0) + \int_0^t a(X(s),s)\,ds + \int_0^t \sigma(X(s),s)\,dB(s) .$$
The middle term on the right is a vector of Riemann integrals whose $k^{\rm th}$ com-
ponent is the standard Riemann integral

$$\int_0^t a_k(X(s),s)\,ds .$$
The last term on the right is a collection of standard Ito integrals. The $k^{\rm th}$
component is

$$\sum_{j=1}^m \int_0^t \sigma_{kj}(X(s),s)\,dB_j(s) ,$$

with each summand on the right being a scalar Ito integral as defined in previous
lectures.
2.3. Weak form: The weak form of a multidimensional diffusion problem asks
for a probability measure, $P$, on the probability space $\Omega = C([0,T], \mathbb{R}^n)$ with
filtration $\mathcal{F}_t$ generated by $\{X(s)$ for $s \leq t\}$ so that $X(t)$ is a Markov process
with

$$E\left[\Delta X \mid \mathcal{F}_t\right] = a(X(t),t)\,\Delta t + o(\Delta t) , \qquad (14)$$

and

$$E\left[\Delta X\, \Delta X^t \mid \mathcal{F}_t\right] = \mu(X(t),t)\,\Delta t + o(\Delta t) . \qquad (15)$$

Here $\Delta X = X(t+\Delta t) - X(t)$, we assume $\Delta t > 0$, and $\Delta X^t = (\Delta X_1, \ldots, \Delta X_n)$ is
the transpose of the column vector $\Delta X$. The matrix formula (15) is a convenient
way to express the short time variances and covariances

$$E\left[\Delta X_j\, \Delta X_k \mid \mathcal{F}_t\right] = \mu_{jk}(X(t),t)\,\Delta t + o(\Delta t) . \qquad (16)$$
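The moment conditions (14) and (15) are easy to check by Monte Carlo in the simplest case, standard Brownian motion, where $a \equiv 0$ and $\mu \equiv I$. A quick sketch (dimension, step size, and sample count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, dt, samples = 2, 1e-3, 500_000

# Increments of n-dimensional standard Brownian motion over time dt.
dX = rng.normal(0.0, np.sqrt(dt), size=(samples, n))

# Sample versions of (14) and (15): E[dX] ~ 0, E[dX dX^t] ~ I*dt.
mean = dX.mean(axis=0)
cov = (dX.T @ dX) / samples  # second moment matrix, no centering needed

print(mean)      # near [0, 0]
print(cov / dt)  # near the 2x2 identity
```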
2.4. Backward equation: As for one dimensional diffusions, the weak form
conditions (14) and (15) give a simple derivation of the backward equation for
$f(x,t) = E_{x,t}\left[V(X(T))\right]$. We start from the tower property

$$f(x,t) = E_{x,t}\left[ f(x+\Delta X, t+\Delta t) \right] , \qquad (17)$$

and expand $f(x+\Delta X, t+\Delta t)$ about $(x,t)$ to second order in $\Delta X$ and first order
in $\Delta t$:

$$f(x+\Delta X, t+\Delta t) = f + \partial_{x_k} f\, \Delta X_k + \tfrac{1}{2}\, \partial_{x_j}\partial_{x_k} f\, \Delta X_j \Delta X_k + \partial_t f\, \Delta t + R .$$

Here we follow the Einstein summation convention by leaving out the sums over $j$
and $k$ on the right. We also omit arguments of $f$ and its derivatives when the
arguments are $(x,t)$. For example, $\partial_{x_k} f\, \Delta X_k$ really means

$$\sum_{k=1}^n \partial_{x_k} f(x,t)\, \Delta X_k .$$
As in one dimension, the error term $R$ satisfies

$$|R| \leq C\left( |\Delta X|\,\Delta t + |\Delta X|^3 + \Delta t^2 \right) ,$$

so that, as before,

$$E\left[|R|\right] \leq C\, \Delta t^{3/2} .$$
Putting these back into (17) and using (14) and (15) gives (with the same
shorthand)

$$f = f + a_k\, \partial_{x_k} f\, \Delta t + \tfrac{1}{2}\, \mu_{jk}\, \partial_{x_j}\partial_{x_k} f\, \Delta t + \partial_t f\, \Delta t + O(\Delta t^{3/2}) .$$

Again we cancel the $f$ from both sides, divide by $\Delta t$, and take $\Delta t \to 0$ to get
the backward equation

$$\partial_t f + a_k\, \partial_{x_k} f + \tfrac{1}{2}\, \mu_{jk}\, \partial_{x_j}\partial_{x_k} f = 0 .$$
the SVD is called principal component analysis (PCA). The columns of $U$ and
$V$ (not $V^t$) are left and right singular vectors respectively, which also are called
principal components or principal component vectors. The calculation

$$C = AA^t = (U\Sigma V^t)(V\Sigma^t U^t) = U\,\Sigma\Sigma^t\, U^t$$

shows that the columns of $U$ are eigenvectors of $C$ with eigenvalues the squares
of the singular values.
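This relation between the SVD of $A$ and the eigendecomposition of $C = AA^t$ can be verified directly with numpy (the random test matrix is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 3))

# SVD: A = U @ diag(s) @ Vt, with singular values s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# C = A A^t has the left singular vectors as eigenvectors, with
# eigenvalues equal to the squared singular values (plus zeros).
C = A @ A.T
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]  # descending

print(np.allclose(eigvals[:3], s**2))  # True
```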
2.8. Choosing $\sigma(x,t)$: This non uniqueness of $A$ carries over to non unique-
ness of $\sigma(x,t)$ in the SDE (13). A diffusion process $X(t)$ defines $\mu(x,t)$ through
(15), but any $\sigma(x,t)$ with $\sigma\sigma^t = \mu$ leads to the same distribution of $X$ trajec-
tories. In particular, if we have one $\sigma(x,t)$, we may choose any adapted matrix
valued function $V(t)$ with $V V^t \equiv I_{m\times m}$, and use $\sigma' = \sigma V$. To say this another
way, if we solve $dZ'(t) = V(t)\,dZ(t)$ with $Z'(0) = 0$, then $Z'(t)$ also is a Brownian
motion. (The Lévy uniqueness theorem states that any continuous path process
that is weakly Brownian motion in the sense that $a \equiv 0$ and $\mu \equiv I$ in (14) and
(15) actually is Brownian motion in the sense that the measure on $\Omega$ is Wiener
measure.) Therefore, using $dZ' = V(t)\,dZ$ gives the same measure on the space
of paths $X(t)$.
The conclusion is that it is possible for SDEs with different $\sigma(x,t)$ to repre-
sent the same $X$ distribution. This happens when $\sigma\sigma^t = \sigma'\sigma'^t$. If we have $\mu$, we
may represent the process $X(t)$ as the strong solution of an SDE (13). For this,
we must choose, with some arbitrariness, a $\sigma(x,t)$ with $\sigma(x,t)\sigma(x,t)^t = \mu(x,t)$.
The number of noise sources, $m$, is the number of non zero eigenvalues of $\mu$. We
never need to take $m > n$, but $m < n$ may be called for if $\mu$ has rank less than
$n$.
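One concrete way to choose such a $\sigma$ is through the eigendecomposition of $\mu$, keeping only the nonzero eigenvalues; the number of columns kept is then $m = \mathrm{rank}(\mu)$. A sketch with a rank-deficient $2 \times 2$ example (the particular $\mu$ is an illustrative choice):

```python
import numpy as np

# A rank-1 matrix mu for n = 2: noise acts only along the direction (1, 1).
mu = np.array([[1.0, 1.0],
               [1.0, 1.0]])

# Factor mu = sigma sigma^t via the eigendecomposition, keeping only
# eigenvectors with nonzero eigenvalue, so m = rank(mu) noise sources.
w, Q = np.linalg.eigh(mu)
keep = w > 1e-12
sigma = Q[:, keep] * np.sqrt(w[keep])  # n x m, here m = 1

print(sigma.shape)                        # (2, 1)
print(np.allclose(sigma @ sigma.T, mu))   # True
```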
2.9. Correlated Brownian motions: Sometimes we wish to use the SDE model
(13) where the $B_k(t)$ are correlated. We can accomplish this with a change in $\sigma$.
Let us see how to do this in the simpler case of generating correlated standard
normals. In that case, we want $Z = (Z_1, \ldots, Z_m)^t \in \mathbb{R}^m$ to be a multivariate
mean zero normal with $\mathrm{var}(Z_k) = 1$ and given correlation coefficients

$$\rho_{jk} = \frac{\mathrm{cov}(Z_j, Z_k)}{\sqrt{\mathrm{var}(Z_j)\,\mathrm{var}(Z_k)}} = \mathrm{cov}(Z_j, Z_k) .$$

This is the same as generating $Z$ with covariance matrix $C$ with ones on the
diagonal and $C_{jk} = \rho_{jk}$ when $j \neq k$. We know how to do this: choose $A$ with
$AA^t = C$ and take $Z = AZ'$, where $Z'$ is a vector of independent standard
normals. This also works in the SDE. We solve

$$dX(t) = a(X(t),t)\,dt + \sigma(X(t),t)A\,dB(t) ,$$

with the $B_k$ being independent standard Brownian motions. We get the effect
of correlated Brownian motions by using independent ones and replacing $\sigma(x,t)$
by $\sigma(x,t)A$.
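The recipe $Z = AZ'$ with $AA^t = C$ can be checked numerically; a Cholesky factor is a convenient choice of $A$ (the correlation $\rho = 0.8$ and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8
C = np.array([[1.0, rho],
              [rho, 1.0]])

# Choose A with A A^t = C (Cholesky factor) and set Z = A Z'.
A = np.linalg.cholesky(C)
Zp = rng.normal(size=(2, 500_000))  # independent standard normals
Z = A @ Zp

r = np.corrcoef(Z)[0, 1]
print(r)  # approximately 0.8
```

The same factor $A$ applied to independent Brownian increments produces correlated Brownian motions, which is exactly the replacement $\sigma \to \sigma A$ above.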
rank deficient. In either case we call the stochastic process a degenerate
diffusion. Nondegenerate diffusions have qualitative behavior like that of Brownian
motion: every component has infinite total variation and finite quadratic
variation, transition densities are smooth functions of $x$ and $t$ (for $t > 0$) and satisfy
forward and backward equations (in different variables) in the usual sense, etc.
Degenerate diffusions may lack some or all of these properties. The qualitative
behavior of degenerate diffusions is subtle and problem dependent. There are
some examples in the homework. Computational methods that work well for
nondegenerate diffusions may fail for degenerate ones.
$$E\left[V(Y(T))\right] ,$$

where

$$Y(T) = \int_0^T S(t)\,dt .$$

To get a backward equation for this, we need to identify a state space so
that the state is a Markov process. We use the two dimensional vector

$$X(t) = \begin{pmatrix} S(t) \\ Y(t) \end{pmatrix} ,$$

where $S(t)$ satisfies (22) and $dY(t) = S(t)\,dt$. Then $X(t)$ satisfies (13) with

$$a = \begin{pmatrix} rS \\ S \end{pmatrix} , \qquad \mu = \sigma\sigma^t = \begin{pmatrix} S^2\sigma^2 & 0 \\ 0 & 0 \end{pmatrix} ,$$

and the backward equation is

$$\partial_t f + rs\,\partial_s f + s\,\partial_y f + \frac{s^2\sigma^2}{2}\,\partial_s^2 f = 0 . \qquad (23)$$
2 s
Note that this is a partial differential equation in two space variables,
$x = (s, y)^t$. Of course, we are interested in the answer at $t = 0$ only for $y = 0$.
Still, we have to include other $y$ values in the computation. If we were to try the
standard finite difference approximate solution of (23), we might use a central
difference approximation $\partial_y f(s,y,t) \approx \frac{1}{2\Delta y}\left( f(s, y+\Delta y, t) - f(s, y-\Delta y, t) \right)$.
If $\sigma > 0$ it is fine to use a central difference approximation for $\partial_s f$, and this
is what most people do. However, a central difference approximation for $\partial_y f$
leads to an unstable computation that does not produce anything like the right
answer. The inherent instability of central differencing is masked in $s$ by the
strongly stabilizing second derivative term, but there is nothing to stabilize the
unstable $y$ differencing in this degenerate diffusion problem.
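The instability is easy to reproduce on the model transport equation $\partial_t f + \partial_y f = 0$, which has no second derivative term, just like the $y$ direction above. The demo below (not from the notes; grid sizes and initial data are illustrative) compares forward Euler time stepping with a central difference against a one-sided (upwind) difference on a periodic grid:

```python
import numpy as np

# Model problem f_t + f_y = 0 on a periodic grid.
ny, steps = 100, 400
dy = 1.0 / ny
dt = 0.5 * dy  # CFL number 0.5
y = np.arange(ny) * dy
f0 = (np.abs(y - 0.5) < 0.1).astype(float)  # box pulse

def step_central(f):
    # forward Euler + central difference: unstable for pure transport
    return f - dt / (2 * dy) * (np.roll(f, -1) - np.roll(f, 1))

def step_upwind(f):
    # forward Euler + one-sided (upwind) difference: stable for CFL <= 1
    return f - dt / dy * (f - np.roll(f, 1))

fc, fu = f0.copy(), f0.copy()
for _ in range(steps):
    fc, fu = step_central(fc), step_upwind(fu)

print(np.abs(fc).max(), np.abs(fu).max())  # central blows up, upwind stays <= 1
```

The upwind update is a convex combination of neighboring values, so it cannot amplify the solution; the central scheme amplifies every Fourier mode of the transport problem at each step.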
2.13. Integration with $dX$: We seek the analogue of the Ito integral and
Ito's lemma for a more general diffusion. If we have a function $f(x,t)$, we seek
a formula $df = a\,dt + b\,dX$. This would mean that

$$f(X(T), T) = f(X(0), 0) + \int_0^T a(t)\,dt + \int_0^T b(t)\,dX(t) . \qquad (24)$$

The first integral on the right would be a Riemann integral that would be defined
for any continuous function $a(t)$. The second would be like the Ito integral with
Brownian motion, whose definition depends on $b(t)$ being an adapted process.
The definition of the $dX$ Ito integral should be so that Ito's lemma becomes
true.

For small $\Delta t$ we seek to approximate $\Delta f = f(X(t+\Delta t), t+\Delta t) - f(X(t), t)$.
If this follows the usual pattern (partial justification below), we should expand
to second order in $\Delta X$ and first order in $\Delta t$. This gives (with summation con-
vention)

$$\Delta f \approx (\partial_{x_j} f)\,\Delta X_j + \tfrac{1}{2}(\partial_{x_j}\partial_{x_k} f)\,\Delta X_j \Delta X_k + \partial_t f\, \Delta t . \qquad (25)$$
As with the Ito lemma for Brownian motion, the key idea is to replace the
products $\Delta X_j \Delta X_k$ by their expected values (conditional on $\mathcal{F}_t$). If this is true,
(15) suggests the general Ito lemma

$$df(X(t),t) = \left( \partial_t f + \tfrac{1}{2}\, \mu_{jk}\, \partial_{x_j}\partial_{x_k} f \right) dt + \partial_{x_j} f\, dX_j . \qquad (26)$$
2.14. Ito's rule: One often finds this expressed in a slightly different way. A
simpler way to represent the small time variance condition (15) is

$$E\left[ dX\, dX^t \mid \mathcal{F}_t \right] = \mu(X(t),t)\,dt .$$

This has the advantage of displaying the main idea, which is that the fluctuations
in $dX_j$ are important but only the mean values of $dX_j\,dX_k$ are important, not the
fluctuations. Ito's rule (never enunciated by Ito as far as I know) is the formula

$$dX_j\, dX_k = \mu_{jk}(X(t),t)\,dt .$$

Although this leads to the correct formula (26), it is not strictly true, since the
standard deviation of the left side is as large as its mean.
In the derivation of (26) sketched below, the total change in f is represented
as the sum of many small increments. As with the law of large numbers, the
sum of many random numbers can be much closer to its mean (in relative terms)
than the random summands.
2.15. Ito integral: The definition of the $dX$ Ito integral follows the definition
of the Ito integral with respect to Brownian motion. Here is a quick sketch
with many details missing. Suppose $X(t)$ is a multidimensional diffusion pro-
cess, $\mathcal{F}_t$ is the $\sigma$-algebra generated by the $X(s)$ for $0 \leq s \leq t$, and $b(t)$ is a
possibly random function that is adapted to $\mathcal{F}_t$. There are $n$ components of $b(t)$
corresponding to the $n$ components of $X(t)$. The Ito integral is ($t_k = k\Delta t$ as
usual):

$$\int_0^T b(t)\,dX(t) = \lim_{\Delta t \to 0} \sum_{t_k < T} b(t_k)\left( X(t_{k+1}) - X(t_k) \right) . \qquad (28)$$

Convergence follows from comparing the approximation $Y_{\Delta t} = \sum_{t_k<T} b(t_k)(X(t_{k+1}) - X(t_k))$
with $Y_{\Delta t/2}$; their difference is a sum of terms

$$R_k = \left( b(t_{k+1/2}) - b(t_k) \right)\left( X(t_{k+1}) - X(t_{k+1/2}) \right) .$$
The bound

$$E\left[ \left( Y_{\Delta t/2} - Y_{\Delta t} \right)^2 \right] = O(\Delta t^p) , \qquad (29)$$

with $p > 0$, implies that the limit (28) exists. Since $b(t_{k+1/2})$ is known in $\mathcal{F}_{t_{k+1/2}}$, we may use the tower property and our
assumption on $b$ to get

$$E\left[R_k^2\right] \leq E\left[ \left| X(t_{k+1}) - X(t_{k+1/2}) \right|^2 \left| b(t_{k+1/2}) - b(t_k) \right|^2 \right] = O(\Delta t^2) .$$
This gives (29) with p = 1 (as for Brownian motion) for that case. For the
general case, my best effort is too complicated for these notes and gives (29)
with p = 1/2.
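The defining sum in (28) can be checked numerically in the one case with a closed form: $b(t) = X(t) = B(t)$, where Ito's lemma gives $\int_0^T B\,dB = (B(T)^2 - T)/2$. A sketch (grid size and seed are illustrative; the key point is evaluating $b$ at the left endpoint of each interval):

```python
import numpy as np

rng = np.random.default_rng(5)
T, n = 1.0, 100_000
dt = T / n

# One Brownian path on a fine grid.
dB = rng.normal(0.0, np.sqrt(dt), n)
B = np.concatenate([[0.0], dB.cumsum()])

# Ito sum (28) with b(t) = B(t): b evaluated at the LEFT endpoint.
ito_sum = np.sum(B[:-1] * dB)

# Closed form from Ito's lemma for f(x) = x^2 / 2.
closed = (B[-1] ** 2 - T) / 2

print(ito_sum, closed)  # nearly equal
```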
2.16. Ito's lemma: We give a half sketch of the proof of Ito's lemma for
diffusions. We want to use $k$ to represent the time index (as in $t_k = k\Delta t$), so
we replace the index notation above with vector notation: $\partial_x f \cdot \Delta X$ instead of
$\partial_{x_k} f\,\Delta X_k$, $\partial_x^2 f(\Delta X, \Delta X)$ instead of $(\partial_{x_j}\partial_{x_k} f)\,\Delta X_j \Delta X_k$, and $\mathrm{tr}(\partial_x^2 f\,\mu)$ instead
of $(\partial_{x_j}\partial_{x_k} f)\,\mu_{jk}$. Then $\Delta X_k$ will be the vector $X(t_{k+1}) - X(t_k)$ and $\partial_x^2 f_k$ the
$n \times n$ matrix of second partial derivatives of $f$ evaluated at $(X(t_k), t_k)$, etc.
Now it is easy to see how $f(X(T), T) - f(X(0), 0) = \sum_{t_k < T} \Delta f_k$ is given
by the Riemann and Ito integrals of the right side of (26). We have

$$\Delta f_k = \partial_t f_k\, \Delta t + \partial_x f_k \cdot \Delta X_k + \tfrac{1}{2}\, \partial_x^2 f_k(\Delta X_k, \Delta X_k)$$
$$\qquad + O(\Delta t^2) + O\left(\Delta t\, |\Delta X_k|\right) + O\left(|\Delta X_k|^3\right) .$$
As $\Delta t \to 0$, the contribution from the second row terms vanishes (the third
term takes some work, see below). The sum of the $\partial_t f_k\, \Delta t$ converges to the
Riemann integral $\int_0^T \partial_t f(X(t),t)\,dt$. The sum of the $\partial_x f_k \cdot \Delta X_k$ converges to the
Ito integral $\int_0^T \partial_x f(X(t),t)\,dX(t)$. The remaining term may be written as its
conditional expectation plus a fluctuation whose sum tends to zero, so

$$\sum_{t_k < T} E\left[ \partial_x^2 f_k(\Delta X_k, \Delta X_k) \mid \mathcal{F}_{t_k} \right] \to \int_0^T \mathrm{tr}\left( \partial_x^2 f(X(t),t)\, \mu(X(t),t) \right) dt ,$$

(the Riemann integral) as $\Delta t \to 0$. This shows how the terms in the Ito lemma
(26) are accounted for.
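The bookkeeping in (26) can be verified numerically in a simple multidimensional case (an illustration, not from the notes): take $X = B$, two dimensional standard Brownian motion, and $f(x) = |x|^2$, so $\mu = I$, $\partial_x f = 2x$, and $\mathrm{tr}(\partial_x^2 f\,\mu) = 2d$. Then (26) says $|B(T)|^2 = 2\int_0^T B \cdot dB + d\,T$.

```python
import numpy as np

rng = np.random.default_rng(6)
T, n, d = 1.0, 100_000, 2
dt = T / n

# 2-d standard Brownian motion on a fine grid, B(0) = 0.
dB = rng.normal(0.0, np.sqrt(dt), size=(n, d))
B = np.vstack([np.zeros(d), dB.cumsum(axis=0)])

ito_sum = 2.0 * np.sum(B[:-1] * dB)  # Ito integral term, left endpoints
riemann = d * T                      # (1/2) * tr(2I * I) * T = d*T

lhs = np.sum(B[-1] ** 2)             # f(B(T)) - f(B(0))
print(lhs, ito_sum + riemann)        # nearly equal
```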
2.17. Theory left out: We did not show that there is a process satisfying (14)
and (15) (existence) or that these conditions characterize the process (unique-
ness). Even showing that a process satisfying (14) and (15) with zero drift and
$\mu = I$ is Brownian motion is a real theorem: the Lévy uniqueness theorem.
The construction of the stochastic process $X(t)$ (existence) also gives bounds
on higher moments, such as $E\left[ |\Delta X|^4 \right] \leq C\, \Delta t^2$, that we used above. The
higher moment estimates are true for Brownian motion because the increments
are Gaussian.
$$X_{k+1} = X_k + \Delta t\, a(X_k, t_k) + \sqrt{\Delta t}\,\sigma(X_k, t_k)\, Z_k ,$$

where the $Z_k$ are i.i.d. $N(0, I_{m\times m})$. This has the properties corresponding to
(14) and (15) that

$$E\left[ X_{k+1} - X_k \mid X_1, \ldots, X_k \right] = a(X_k, t_k)\,\Delta t$$

and

$$\mathrm{cov}\left( X_{k+1} - X_k \right) = \mu\,\Delta t .$$
This is the forward Euler method. There are methods that are better in some
ways, but in a surprisingly large number of problems, methods better than this
are not known. This is a distinct contrast to the numerical solution of ordinary
differential equations (without noise), for which forward Euler almost never is
the method of choice. There is much research to do to help the SDE solution
methodology catch up to the ODE solution methodology.
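A minimal sketch of the forward Euler update for (13) follows. The Ornstein-Uhlenbeck drift $a(x,t) = -x$, the particular $\sigma$ matrix, and the sample sizes are illustrative choices, not from the notes; started at zero, the sample mean of $X(T)$ should stay near zero.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative coefficients: dX = -X dt + sigma dB (Ornstein-Uhlenbeck).
def a(x, t):
    return -x

sigma = np.array([[1.0, 0.0],
                  [0.5, 0.5]])  # n x m with n = m = 2

T, n_steps, paths = 1.0, 200, 100_000
dt = T / n_steps

X = np.zeros((paths, 2))
for k in range(n_steps):
    Z = rng.normal(size=(paths, 2))  # i.i.d. N(0, I) per step and path
    X = X + dt * a(X, k * dt) + np.sqrt(dt) * (Z @ sigma.T)

print(X.mean(axis=0))  # near [0, 0]
```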
Without drift, the transition density for one step is

$$G(x_k, x_{k+1}, t_k, \Delta t) = \frac{1}{(2\pi\Delta t)^{n/2}\sqrt{\det(\mu_k)}} \exp\left( -\frac{(x_{k+1}-x_k)^t\, \mu_k^{-1}\, (x_{k+1}-x_k)}{2\Delta t} \right) . \qquad (31)$$
With nonzero drift, the prefactor

$$z_k = \frac{1}{(2\pi\Delta t)^{n/2}\sqrt{\det(\mu_k)}}$$

is the same and the exponential factor accommodates the new mean:

$$G(x_k, x_{k+1}, t_k, \Delta t) = z_k \exp\left( -\frac{(x_{k+1}-x_k-a_k\Delta t)^t\, \mu_k^{-1}\, (x_{k+1}-x_k-a_k\Delta t)}{2\Delta t} \right) . \qquad (32)$$
Let $U(x_1, \ldots, x_N)$ be the joint density without drift and $V(x_1, \ldots, x_N)$ with
drift. We want to evaluate $L(\vec{x}) = V(\vec{x})/U(\vec{x})$. Both $U$ and $V$ are products of
the appropriate transition densities $G$. In the division, the prefactors $z_k$ cancel,
as they are the same for $U$ and $V$ because the $\mu_k$ are the same.
The main calculation is the subtraction of the exponents. With $\Delta x_k = x_{k+1} - x_k$,

$$(\Delta x_k - a_k\Delta t)^t\, \mu_k^{-1}\, (\Delta x_k - a_k\Delta t) - \Delta x_k^t\, \mu_k^{-1}\, \Delta x_k = -2\Delta t\, a_k^t \mu_k^{-1} \Delta x_k + \Delta t^2\, a_k^t \mu_k^{-1} a_k .$$
This gives:

$$L(\vec{x}) = \exp\left( \sum_{k=0}^{N-1} a_k^t\, \mu_k^{-1}\, \Delta x_k \;-\; \frac{\Delta t}{2} \sum_{k=0}^{N-1} a_k^t\, \mu_k^{-1}\, a_k \right) .$$
This is the exact likelihood ratio for the discrete time processes with and without
drift. If we take the limit $\Delta t \to 0$ for the continuous time problem, the two terms in
the exponent converge respectively to the Ito integral

$$\int_0^T a(X(t),t)^t\, \mu(X(t),t)^{-1}\, dX(t) ,$$