Jukes Cantor

MATH0011
Numbers and Patterns in Nature and Life
Lecture 11
Molecular Evolution (2)
http://147.8.101.93/MATH0011/

(From last lecture)
Basic Matrix model of molecular evolution
We assume that:
Only base substitutions may occur.

Each site of the ancestral sequence, behaving identically
and independently of every other site, appears randomly
as A, G, C, or T according to probabilities $P_A, P_G, P_C, P_T$.
Note: each $P_j \ge 0$, and $P_A + P_G + P_C + P_T = 1$.
Over one time step, at each site, the base is subject to
possible substitution according to conditional probabilities
$P(S_1 = i \mid S_0 = j)$. (E.g., if the base is G, then there is a
chance of $P(S_1 = T \mid S_0 = G)$ that it will change to T after 1
time step.)
(From last lecture)
Probability vector and transition matrix
Write the probability vector of the ancestral sequence as
\[ p_0 = (P_A, P_G, P_C, P_T). \]
Arrange the conditional probabilities of base substitutions into the transition matrix:
\[
M = \begin{pmatrix}
P_{A|A} & P_{A|G} & P_{A|C} & P_{A|T} \\
P_{G|A} & P_{G|G} & P_{G|C} & P_{G|T} \\
P_{C|A} & P_{C|G} & P_{C|C} & P_{C|T} \\
P_{T|A} & P_{T|G} & P_{T|C} & P_{T|T}
\end{pmatrix}
\]
Here we write $P(S_1 = i \mid S_0 = j)$ as $P_{i|j}$ for simplicity.
Note:
Always use the ordering A, G, C, T.
Column sums of M are all 1.

(From last lecture)
Markov model for molecular evolution
We further assume that:
The probabilistic mutation process over each subsequent time step is the same as that of the 1st time step.
For each site, what happens during each subsequent time step depends only on what the base was at the beginning of that time step, and not on what that base was in previous time steps (i.e., the process has no memory).
This kind of model is a Markov model.
(From last lecture)
A side-note on matrix multiplication
When matrices A and B are multiplied together to form a matrix C (= AB):
the length of each (horizontal) row of A must equal the length of each (vertical) column of B (otherwise the product AB is not defined),
the i-th row of A is multiplied with the j-th column of B to form the (i,j)-th entry of C (which is at the i-th row and j-th column of C).
Note that in general $AB \ne BA$.
A nice description can be found at the website:
http://www.mai.liu.se/~halun/matrix/matrix.html

Powers of the matrix M
Fact: for any t = 1, 2, ..., the matrix $M^t$ (i.e., the product of t copies of M) is equal to:
\[ M^t = \bigl( P(S_t = i \mid S_0 = j) \bigr)_{i,j \in \{A,G,C,T\}} \]
Hence the entries of $M^t$ give the conditional probabilities $P(S_t = i \mid S_0 = j)$, where i, j = A, G, C, or T.
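
The meaning of the entries of $M^t$ can be tried out numerically. The following is a minimal sketch in plain Python; the matrix values are made up for illustration (they are not from the lecture), and the helper names `mat_mul`, `mat_pow` and `cond_prob` are hypothetical.

```python
# Illustrative transition matrix in the ordering A, G, C, T.
# Entry M[i][j] stands for P(S1 = i | S0 = j); the numbers are invented,
# chosen only so that every column sums to 1.
BASES = "AGCT"
M = [
    [0.91, 0.03, 0.02, 0.02],
    [0.05, 0.93, 0.02, 0.02],
    [0.02, 0.02, 0.92, 0.04],
    [0.02, 0.02, 0.04, 0.92],
]

def mat_mul(a, b):
    """Row i of a times column j of b gives entry (i, j) of the product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, t):
    """Product of t copies of m, for t = 1, 2, ..."""
    result = m
    for _ in range(t - 1):
        result = mat_mul(result, m)
    return result

def cond_prob(m, t, to_base, from_base):
    """P(S_t = to_base | S_0 = from_base), read off the matrix power."""
    mt = mat_pow(m, t)
    return mt[BASES.index(to_base)][BASES.index(from_base)]

M3 = mat_pow(M, 3)
column_sums = [sum(M3[i][j] for i in range(4)) for j in range(4)]
print(column_sums)                # each sum stays numerically equal to 1
print(cond_prob(M, 3, "T", "G"))  # chance that a G has become a T after 3 steps
```

Repeated multiplication is used here for readability; for large t a faster scheme (such as repeated squaring) would be a reasonable alternative.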
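Some Markov models for molecular evolution
Jukes-Cantor model
2-parameter Kimura model
(There are other kinds of models for molecular evolution)

Jukes-Cantor model
This is the simplest Markov model of base substitution, which assumes:
$p_0 = (1/4, 1/4, 1/4, 1/4)$, and
the conditional probabilities $P_{i|j}$ are the same for any pair of i, j with $i \ne j$ (here i, j = A, G, C, T). This means:
\[
M = \begin{pmatrix}
1-\alpha & \alpha/3 & \alpha/3 & \alpha/3 \\
\alpha/3 & 1-\alpha & \alpha/3 & \alpha/3 \\
\alpha/3 & \alpha/3 & 1-\alpha & \alpha/3 \\
\alpha/3 & \alpha/3 & \alpha/3 & 1-\alpha
\end{pmatrix}
\]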
Jukes-Cantor model
Significance of $\alpha$:
$\alpha$ is the only parameter in M.
$\alpha$ is the rate at which observable base substitutions occur over one time step.
Examples of known estimated values of $\alpha$:
$\alpha \approx 10^{-8}$ mutations per site per year for mitochondrial DNA in mammals.
$\alpha \approx 0.01$ mutations per site per year for influenza A virus.
We say that there is a molecular clock when mutation rates are constant.
However, in general, the mutation rate may not be constant over time or for different locations within the DNA.

Jukes-Cantor model
For the Jukes-Cantor model, we have:
\[ M p_0 = p_0, \quad \text{where } p_0 = (1/4, 1/4, 1/4, 1/4). \]
Therefore, $p_t = (1/4, 1/4, 1/4, 1/4)$ for any subsequent time instance t = 1, 2, ..., and we say that (1/4, 1/4, 1/4, 1/4) is an equilibrium base distribution for DNA sequences under the Jukes-Cantor model.
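
The equilibrium property of the Jukes-Cantor model can be checked with a few lines of Python. This is only a sketch under the assumption that the Jukes-Cantor matrix has $1-\alpha$ on the diagonal and $\alpha/3$ elsewhere; the function names are hypothetical.

```python
def jukes_cantor_matrix(alpha):
    """4x4 matrix with 1 - alpha on the diagonal and alpha/3 elsewhere (order A, G, C, T)."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)]
            for i in range(4)]

def apply(m, p):
    """Matrix-vector product m p, i.e. the base distribution one time step later."""
    return [sum(m[i][j] * p[j] for j in range(4)) for i in range(4)]

M = jukes_cantor_matrix(0.01)   # alpha = 0.01, the influenza A value quoted in the lecture

p0 = [0.25, 0.25, 0.25, 0.25]
p1 = apply(M, p0)
print(p1)                        # stays at (1/4, 1/4, 1/4, 1/4) up to rounding

# For contrast: a sequence consisting only of A's is not at equilibrium.
only_a = [1.0, 0.0, 0.0, 0.0]
print(apply(M, only_a))          # roughly (0.99, 0.0033, 0.0033, 0.0033)
```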
Jukes-Cantor model
Question: For the Jukes-Cantor model, find the probability (as an expression in terms of $\alpha$) that a base A in the ancestral sequence will have mutated to become T in the descendent sequence after 100 time steps.
The answer is $P(S_{100} = T \mid S_0 = A)$, which is the entry on the 4th row and 1st column of the matrix power $M^{100}$. But how do we get the value (in terms of $\alpha$) of this particular entry?

Computing $M^t$ with Jukes-Cantor model
Note that
\[ M = \left(1 - \tfrac{4}{3}\alpha\right) I + \tfrac{4}{3}\alpha\, J, \]
where I is the 4×4 identity matrix and J is the 4×4 matrix whose entries are all 1/4.
For any real number r, define $K(r) := rI + (1-r)J$.
Since $II = I$ and $IJ = JI = JJ = J$, one has
\[ K(r)K(s) = (rI + (1-r)J)(sI + (1-s)J) = rs\,I + \bigl(r(1-s) + (1-r)s + (1-r)(1-s)\bigr)J = rs\,I + (1-rs)J = K(rs), \]
and $K(r)^t = K(r)K(r)\cdots K(r) = K(r^t)$.
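
The identities $K(r)K(s) = K(rs)$ and $K(r)^t = K(r^t)$ can be spot-checked numerically. The sketch below assumes that J is the 4×4 matrix whose entries are all 1/4 (the choice that makes JJ = J); the helper names and the sample values of r and s are arbitrary.

```python
def K(r):
    """K(r) = r*I + (1 - r)*J, with I the 4x4 identity and J the 4x4 matrix of 1/4's."""
    return [[r * (1.0 if i == j else 0.0) + (1 - r) * 0.25 for j in range(4)]
            for i in range(4)]

def mat_mul(a, b):
    """Ordinary matrix product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def close(a, b, tol=1e-12):
    """True if two 4x4 matrices agree entry by entry up to tol."""
    return all(abs(a[i][j] - b[i][j]) <= tol for i in range(4) for j in range(4))

r, s = 0.9, 0.7                      # arbitrary sample values
product_ok = close(mat_mul(K(r), K(s)), K(r * s))

power = K(r)
for _ in range(4):
    power = mat_mul(power, K(r))     # power is now K(r)^5
power_ok = close(power, K(r ** 5))

print(product_ok, power_ok)          # True True
```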
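Computing $M^t$ with Jukes-Cantor model
Since $M = K(r)$ where $r = 1 - \tfrac{4}{3}\alpha$, one gets
\[ M^t = K(r^t) = r^t I + (1 - r^t) J. \]
Therefore the answer to the previous problem is
\[ P(S_{100} = T \mid S_0 = A) = \tfrac{1}{4}\left(1 - \left(1 - \tfrac{4}{3}\alpha\right)^{100}\right). \]

2-parameter Kimura model
The Kimura 2-parameter model is another Markov model which allows for different rates of transitions (changes within the purine group or within the pyrimidine group) and transversions (changes from purine to pyrimidine or vice versa). The Markov matrix for this model is:
\[
M = \begin{pmatrix}
* & \beta & \gamma & \gamma \\
\beta & * & \gamma & \gamma \\
\gamma & \gamma & * & \beta \\
\gamma & \gamma & \beta & *
\end{pmatrix}
\]
Here $\beta$ = mutation rate for transitions,
$\gamma$ = mutation rate for each of the possible transversions.
The numerical value of $*$ is $1 - \beta - 2\gamma$.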
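
A small sketch of the 2-parameter Kimura matrix in Python may help to see its structure. It assumes the ordering A, G, C, T (so A, G are the purines and C, T the pyrimidines), with $\beta$ in the transition positions and $\gamma$ in the transversion positions; the function name and the sample rates are made up for illustration.

```python
def kimura2_matrix(beta, gamma):
    """Kimura 2-parameter matrix in the ordering A, G, C, T.

    beta  : rate of transitions (A<->G, C<->T)
    gamma : rate of each of the two possible transversions
    """
    star = 1 - beta - 2 * gamma      # diagonal entry, so that columns sum to 1
    return [
        [star,  beta,  gamma, gamma],
        [beta,  star,  gamma, gamma],
        [gamma, gamma, star,  beta],
        [gamma, gamma, beta,  star],
    ]

M = kimura2_matrix(0.02, 0.005)      # illustrative rates, not from the lecture
column_sums = [sum(M[i][j] for i in range(4)) for j in range(4)]
print(column_sums)                   # every column sums to 1 (up to rounding)

# Choosing beta = gamma = alpha/3 gives back the Jukes-Cantor matrix.
alpha = 0.03
jc_like = kimura2_matrix(alpha / 3, alpha / 3)
print(jc_like[0])                    # roughly [0.97, 0.01, 0.01, 0.01]
```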
3-parameter Kimura model
The Kimura 3-parameter model has Markov matrix:
\[
M = \begin{pmatrix}
* & \beta & \gamma & \delta \\
\beta & * & \delta & \gamma \\
\gamma & \delta & * & \beta \\
\delta & \gamma & \beta & *
\end{pmatrix}
\]
The numerical value of $*$ is $1 - \beta - \gamma - \delta$.
Note:
The Jukes-Cantor model is a particular case of the 2-parameter Kimura model.
The 2-parameter Kimura model is a particular case of the 3-parameter Kimura model.
All three models satisfy $M p_0 = p_0$, where $p_0 = (1/4, 1/4, 1/4, 1/4)$.

Phylogenetic distances
Problem: Given an ancestral sequence $S_0$ and a descendent sequence $S_t$ (evolved over t time steps), how can we estimate the total amount of mutation (including hidden mutations) from the observed amount of mutation?
S0 : GCTAGT ATGATCAGCGG
St : GTTAGA ACGATCAGCAG
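
The observed amount of mutation is simply the number of sites at which the two sequences disagree. A minimal sketch of that count follows; it uses only the bases visible on the slide and ignores the gap in the printed sequences, which is an assumption. The function name is hypothetical.

```python
def observed_differences(seq_a, seq_b):
    """Number of sites at which two aligned sequences of equal length differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

# Bases as shown on the slide, with the gap removed (assumption).
s0 = "GCTAGT" + "ATGATCAGCGG"
st = "GTTAGA" + "ACGATCAGCAG"

n_diff = observed_differences(s0, st)
p_observed = n_diff / len(s0)
print(n_diff, len(s0))   # 4 17
print(p_observed)        # observed fraction of differing sites
```

This count cannot see hidden mutations (for example a site that changed and later changed back), which is why a model-based correction is needed.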
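Phylogenetic distances
Assume the Jukes-Cantor model with mutation rate $\alpha$. As each diagonal entry of $M^t$ is
\[ \tfrac{1}{4} + \tfrac{3}{4}\left(1 - \tfrac{4}{3}\alpha\right)^t, \]
the fraction of sites in the descendent sequence $S_t$ (after t time steps) that agree with the initial bases of the ancestral sequence $S_0$ is
\[ \tfrac{1}{4} + \tfrac{3}{4}\left(1 - \tfrac{4}{3}\alpha\right)^t. \]
Therefore the fraction of sites that are different is
\[ p(t) = \tfrac{3}{4} - \tfrac{3}{4}\left(1 - \tfrac{4}{3}\alpha\right)^t. \]

Example: Consider the case of influenza A virus, which has mutation rate $\alpha = 0.01$ per year. The following graph gives values of p(t) for the case $\alpha = 0.01$.
[Figure: graph of p(t) against t for $\alpha = 0.01$]

Example: Suppose mutations of influenza A virus follow the Jukes-Cantor model with mutation rate $\alpha = 0.01$ per year. Then, after 10 years, the fraction of sites of the descendent sequence that are different from the initial sequence is:
\[ p(10) = \tfrac{3}{4} - \tfrac{3}{4}\left(1 - \tfrac{0.04}{3}\right)^{10} \approx 0.0942. \]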
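
The fraction of differing sites can be computed directly, and cross-checked against the matrix power. The sketch below assumes the Jukes-Cantor formula $p(t) = \tfrac{3}{4} - \tfrac{3}{4}(1 - \tfrac{4}{3}\alpha)^t$ and the Jukes-Cantor matrix with $1-\alpha$ on the diagonal and $\alpha/3$ elsewhere; the function names are hypothetical.

```python
def fraction_different(alpha, t):
    """p(t) = 3/4 - (3/4) * (1 - 4*alpha/3)**t under the Jukes-Cantor model."""
    return 0.75 - 0.75 * (1 - 4 * alpha / 3) ** t

def jukes_cantor_matrix(alpha):
    """1 - alpha on the diagonal, alpha/3 elsewhere."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)]
            for i in range(4)]

def mat_mul(a, b):
    """Ordinary matrix product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

alpha = 0.01                      # influenza A, per year
p10 = fraction_different(alpha, 10)
print(round(p10, 4))              # 0.0942

# Cross-check: 1 minus a diagonal entry of M^10 should give the same value.
M = jukes_cantor_matrix(alpha)
Mt = M
for _ in range(9):
    Mt = mat_mul(Mt, M)           # Mt is now M^10
p10_from_matrix = 1 - Mt[0][0]
print(abs(p10 - p10_from_matrix) < 1e-12)   # True
```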
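From the equation
\[ p = \tfrac{3}{4} - \tfrac{3}{4}\left(1 - \tfrac{4}{3}\alpha\right)^t \]
one can deduce
\[ t = \frac{\ln\left(1 - \tfrac{4}{3}p\right)}{\ln\left(1 - \tfrac{4}{3}\alpha\right)}. \]

Example: consider the previous example of influenza A virus, where $\alpha = 0.01$ per year. If 3/10 of the sites of the descendent sequence $S_t$ are different from the ancestral sequence, then p = 0.3, and the above formula gives the estimated length of time elapsed as t = 38.06 (years).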
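
The time estimate in this example can be reproduced as follows. This is a sketch assuming the Jukes-Cantor relation $t = \ln(1 - \tfrac{4}{3}p) / \ln(1 - \tfrac{4}{3}\alpha)$; the function name is hypothetical.

```python
import math

def elapsed_time(p, alpha):
    """Number of time steps t such that a fraction p of sites differ (Jukes-Cantor)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 3/4")
    if not 0 < alpha < 0.75:
        raise ValueError("alpha must satisfy 0 < alpha < 3/4")
    return math.log(1 - 4 * p / 3) / math.log(1 - 4 * alpha / 3)

t = elapsed_time(0.3, 0.01)
print(round(t, 2))   # 38.06

# Round trip: plugging t back into p(t) recovers p = 0.3.
p_back = 0.75 - 0.75 * (1 - 4 * 0.01 / 3) ** t
print(round(p_back, 6))   # 0.3
```

The restriction p < 3/4 reflects the fact that, under this model, the expected fraction of differing sites never reaches 3/4.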
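Jukes-Cantor distance
Even when $\alpha$ is not known, we may estimate a distance between the ancestral sequence and the descendent sequence in the following way.
Note that
$t\alpha$ = (no. of time steps) × (mutation rate)
= (no. of time steps) × (no. of substitutions per site per time step)
= (expected no. of substitutions per site during the elapsed time).
Here the substitutions include the hidden ones.
Let $d := t\alpha$. Then d gives a measure of distance between the ancestral sequence and the descendent sequence.

Note that, for any real number x near 0, one has
\[ \ln(1+x) \approx x. \]
Thus if $\alpha$ is small, $\ln\left(1 - \tfrac{4}{3}\alpha\right) \approx -\tfrac{4}{3}\alpha$. Hence
\[ t = \frac{\ln\left(1 - \tfrac{4}{3}p\right)}{\ln\left(1 - \tfrac{4}{3}\alpha\right)} \approx -\frac{3}{4\alpha}\ln\left(1 - \tfrac{4}{3}p\right). \]
Therefore
\[ d = t\alpha \approx -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right). \]
In this way, we define the Jukes-Cantor distance between DNA sequences $S_a$ and $S_b$ as
\[ d_{JC}(S_a, S_b) = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right), \]
where p is the fraction of sites that disagree in comparing $S_a$ with $S_b$.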
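
The Jukes-Cantor distance is easy to compute from an observed fraction p. The sketch below assumes the formula $d_{JC} = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$ and reuses the influenza A figures ($\alpha = 0.01$, p = 0.3) to compare the approximate time $d/\alpha$ with the exact Jukes-Cantor time; the function name is hypothetical.

```python
import math

def jc_distance(p):
    """Jukes-Cantor distance: expected substitutions per site, hidden ones included."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 3/4")
    return -0.75 * math.log(1 - 4 * p / 3)

p = 0.3
d = jc_distance(p)
print(round(d, 4))        # 0.3831, larger than p because of hidden substitutions

alpha = 0.01
t_approx = d / alpha      # uses ln(1 - 4*alpha/3) ~ -4*alpha/3
t_exact = math.log(1 - 4 * p / 3) / math.log(1 - 4 * alpha / 3)
print(round(t_approx, 2), round(t_exact, 2))   # 38.31 38.06
```

The two time estimates differ slightly because the distance formula relies on the small-$\alpha$ approximation of the logarithm.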
Provided a molecular clock hypothesis is valid (i.e., the mutation rate is constant over time), the distance computed is proportional to the length of elapsed time.
In such a case, the distance is a measure of how much time was required for one sequence to mutate into the other.
Example: consider the 40-base sequence example of the last lecture.
As p = 11/40, one gets:
\[ d_{JC} = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}\cdot\tfrac{11}{40}\right) \approx 0.3426. \]

Kimura distances
For the Kimura 3-parameter model, the distance formula is:
\[ d_{K3} = -\tfrac{1}{4}\bigl(\ln(1 - 2\beta - 2\gamma) + \ln(1 - 2\beta - 2\delta) + \ln(1 - 2\gamma - 2\delta)\bigr), \]
where $\beta$, $\gamma$, $\delta$ are the parameters in the Kimura 3-parameter matrix.
For the Kimura 2-parameter model, the distance formula is
\[ d_{K2} = -\tfrac{1}{2}\ln(1 - 2p_1 - p_2) - \tfrac{1}{4}\ln(1 - 2p_2), \]
where $p_1$ is the probability of a transition and $p_2$ is the probability of a transversion.
Exercise: deduce the formula of $d_{K2}$ from that of $d_{K3}$.
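
The two Kimura distance formulas can be compared numerically. The sketch below assumes the forms $d_{K3} = -\tfrac{1}{4}(\ln(1-2\beta-2\gamma) + \ln(1-2\beta-2\delta) + \ln(1-2\gamma-2\delta))$ and $d_{K2} = -\tfrac{1}{2}\ln(1-2p_1-p_2) - \tfrac{1}{4}\ln(1-2p_2)$, and assumes that with $\delta = \gamma$ the transition probability is $p_1 = \beta$ and the transversion probability is $p_2 = 2\gamma$. It is a numerical check only, not a derivation; the sample values are arbitrary.

```python
import math

def k3_distance(beta, gamma, delta):
    """Kimura 3-parameter distance (arguments must keep every logarithm's argument positive)."""
    return -0.25 * (math.log(1 - 2 * beta - 2 * gamma)
                    + math.log(1 - 2 * beta - 2 * delta)
                    + math.log(1 - 2 * gamma - 2 * delta))

def k2_distance(p1, p2):
    """Kimura 2-parameter distance; p1 = transition, p2 = transversion probability."""
    return -0.5 * math.log(1 - 2 * p1 - p2) - 0.25 * math.log(1 - 2 * p2)

def jc_distance(p):
    """Jukes-Cantor distance for an overall fraction p of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)

beta, gamma = 0.10, 0.03                 # arbitrary sample values
d3 = k3_distance(beta, gamma, gamma)     # delta = gamma
d2 = k2_distance(beta, 2 * gamma)
print(abs(d3 - d2) < 1e-12)              # True

# If transitions make up 1/3 and transversions 2/3 of all differences,
# the 2-parameter distance agrees with the Jukes-Cantor distance.
p = 0.3
print(abs(k2_distance(p / 3, 2 * p / 3) - jc_distance(p)) < 1e-12)   # True
```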
Symmetric property of distances
All distances discussed so far are symmetric:
d(Sa, Sb) = d(Sb, Sa).
Consequently, if we have two sequences, one of which (but we do not know which one) is a descendent of the other, then we may take one of them arbitrarily as the ancestor and the other as the descendent to calculate the distance between them. Different choices of the ancestor and descendent will not affect the result.

Additive property of distances
Also, all distances discussed so far follow the additive rule:
if an ancestral sequence S0 evolved to become S1, and S1 evolved to become S2, then
d(S0, S2) = d(S0, S1) + d(S1, S2).
[Figure: S0, S1, S2 on a line, with edges labelled d(S0, S1) and d(S1, S2)]
Symmetric and additive properties of distances are useful in constructing phylogenetic trees relating many species.
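
The additive rule can be illustrated for the Jukes-Cantor distance in an idealized setting, where the observed fraction of differing sites equals its expected value $p(t) = \tfrac{3}{4} - \tfrac{3}{4}(1 - \tfrac{4}{3}\alpha)^t$. The sketch below is an illustration under that assumption, with arbitrary time spans of 10 and 25 steps; real data would only satisfy the rule approximately.

```python
import math

def fraction_different(alpha, t):
    """Expected fraction of differing sites after t steps (Jukes-Cantor)."""
    return 0.75 - 0.75 * (1 - 4 * alpha / 3) ** t

def jc_distance(p):
    """Jukes-Cantor distance from a fraction p of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)

alpha = 0.01
p01 = fraction_different(alpha, 10)   # S0 -> S1 over 10 steps
p12 = fraction_different(alpha, 25)   # S1 -> S2 over 25 steps
p02 = fraction_different(alpha, 35)   # S0 -> S2 over 35 steps

d01, d12, d02 = jc_distance(p01), jc_distance(p12), jc_distance(p02)

print(abs(d02 - (d01 + d12)) < 1e-9)  # True: the distances add up
print(p02 < p01 + p12)                # True: the raw fractions do not add up
```

The raw fraction of differences falls short of the sum because some sites mutate more than once, which is exactly what the distance corrects for.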
Distance between two descendents
In practice, it is uncommon to have DNA sequences for both an ancestor and a descendent. Instead, we usually have two descendent DNA sequences S1 and S2 that mutated from a common, yet unknown, ancestral sequence S0:
[Figure: S0 branching into S1 and S2]
In such a case we can only compute d(S1, S2). However, if d is symmetric and additive, then since d(S0, S1) = d(S1, S0), one may treat S1 as the ancestor and S0 as the descendent. Using the additive rule, we have
d(S1, S2) = d(S1, S0) + d(S0, S2) = d(S0, S1) + d(S0, S2).

There are many other kinds of models for molecular evolution, and many other kinds of distances for DNA sequences. In this course we are only able to study some of the simplest cases.
Questions:
Which of the gorilla, chimpanzee, orangutan, and gibbon are humans' closest evolutionary kin?
Which of the following evolutionary trees is more probable?

Trees
A tree is a diagram of vertices and edges, in which all vertices are interconnected by edges, and no loops are formed by the edges.
Vertices in a tree:
leaves = terminal vertices
internal vertices
[Figure: Example of a tree]
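
The definition of a tree translates into a simple test. The sketch below relies on a standard fact from graph theory that is not stated in the slides: a connected diagram with V vertices has no loops exactly when it has V - 1 edges. The function names and the small examples are hypothetical.

```python
def is_tree(vertices, edges):
    """True if all vertices are interconnected by the edges and no loops are formed."""
    vertices = list(vertices)
    if not vertices or len(edges) != len(vertices) - 1:
        return False
    adjacency = {v: [] for v in vertices}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    # Depth-first search from an arbitrary vertex to test connectedness.
    seen = {vertices[0]}
    stack = [vertices[0]]
    while stack:
        v = stack.pop()
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)

def leaves(vertices, edges):
    """Terminal vertices: those touched by exactly one edge."""
    degree = {v: 0 for v in vertices}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(v for v in vertices if degree[v] == 1)

tree_vertices = ["a", "b", "c", "d", "x", "y"]
tree_edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
print(is_tree(tree_vertices, tree_edges))    # True
print(leaves(tree_vertices, tree_edges))     # ['a', 'b', 'c', 'd']

loop_edges = [("a", "b"), ("b", "c"), ("c", "a")]
print(is_tree(["a", "b", "c"], loop_edges))  # False: a loop exists
```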
Examples of non-trees
Some vertices not connected to others
Loop exists

Phylogenetic rooted trees
each leaf = one taxon (source of DNA sequence)
internal vertices = ancestors
root = ultimate common ancestor
bifurcating: each ancestor gives 2 immediate descendents
Unrooted trees
Have no roots
Placing a root at different locations yields different rooted trees

Topological trees and metric trees
Topological trees: show only the branching relationships of vertices; edge lengths are ignored
Metric trees: also show the lengths of edges (usually by indicating lengths on the edges)
[Figure panels: Topological tree; Metric tree (lengths not to scale); Metric tree (lengths to scale)]
If an edge has zero length
In a metric tree, an edge of zero length can be treated as if its two end vertices coincide.
[Figure: a tree containing a zero-length edge is the same as the tree in which the two end vertices of that edge are merged]

Meaning of edge lengths
Edge lengths in phylogenetic trees usually represent the amount of mutation that occurred between splittings (e.g., Jukes-Cantor distance between DNA sequences).
If the molecular clock assumption holds:
edge lengths are proportional to elapsed time,
every taxon of a rooted tree has the same total distance from the root.
Topologically equivalent trees
Trees are topologically the same if they represent the same branching relations; otherwise they are topologically distinct.
[Figure panels: Topologically equivalent rooted trees; Topologically distinct unrooted trees]

Number of topologically distinct trees
When n grows, the number of distinct rooted or unrooted bifurcating trees with n taxa (as leaves) grows rapidly.
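
How rapidly these numbers grow can be seen from the standard counting formulas, which are not stated in the slides and are used here as an assumption: with n labelled taxa there are $(2n-5)!! = 1 \cdot 3 \cdot 5 \cdots (2n-5)$ unrooted bifurcating trees (for n ≥ 3) and $(2n-3)!!$ rooted ones (for n ≥ 2). The function names are hypothetical.

```python
def odd_double_factorial(k):
    """Product 1 * 3 * 5 * ... * k for odd k >= 1."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result

def count_unrooted(n):
    """Number of topologically distinct unrooted bifurcating trees with n taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return odd_double_factorial(2 * n - 5)

def count_rooted(n):
    """Number of topologically distinct rooted bifurcating trees with n taxa (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return odd_double_factorial(2 * n - 3)

for n in (3, 4, 5, 10):
    print(n, count_unrooted(n), count_rooted(n))
# 3 1 3
# 4 3 15
# 5 15 105
# 10 2027025 34459425
```

Placing a root on any one of the 2n - 3 edges of an unrooted tree gives a rooted tree, which is why the rooted count for n taxa equals the unrooted count for n + 1 taxa.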
Another way to draw rooted trees
Rooted trees may be drawn in a square manner:
vertical lines represent splits,
horizontal lines represent distances of splits.
[Figure: Two ways of drawing the same rooted tree]

Reference
Mathematical Models in Biology, An Introduction, E.S. Allman and J.A. Rhodes, Cambridge University Press, 2004.
WWW resource
http://www.nmsr.org/upgma.htm
