MATH0011
Numbers and Patterns in Nature and Life

Lecture 11

Molecular Evolution (2)

http://147.8.101.93/MATH0011/

(From last lecture)

Basic Matrix model of molecular evolution

We assume that:

Only base substitutions may occur.

Each site of the ancestral sequence, behaving identically and independently of every other site, appears randomly as A, G, C, or T according to probabilities P_A, P_G, P_C, P_T.
Note: each P_j ≥ 0, and P_A + P_G + P_C + P_T = 1.

Over one time step, at each site, the base is subject to possible substitution according to conditional probabilities P(S1=i | S0=j). (E.g., if the base is G, then there is a chance of P(S1=T | S0=G) that it will change to T after 1 time step.)

(From last lecture)

Markov model for molecular evolution

We further assume that:

The probabilistic mutation process over subsequent time steps is the same as that of the 1st time step.

For each site, what happens during each subsequent time step depends only on what the base was at the beginning of that time step, and not on what that base was in previous time steps (i.e., the process has no memory).

This kind of model is a Markov model.

(From last lecture)

Probability vector and transition matrix

Write the probability vector of the ancestral sequence as

$$p_0 = (P_A, P_G, P_C, P_T)$$

Arrange the conditional probabilities of base substitutions into the transition matrix:

$$
M = \begin{pmatrix}
P_{A|A} & P_{A|G} & P_{A|C} & P_{A|T} \\
P_{G|A} & P_{G|G} & P_{G|C} & P_{G|T} \\
P_{C|A} & P_{C|G} & P_{C|C} & P_{C|T} \\
P_{T|A} & P_{T|G} & P_{T|C} & P_{T|T}
\end{pmatrix}
$$

Here we write P(S1=i | S0=j) as P_{i|j} for simplicity.

Note:
Always use the ordering A, G, C, T.
Column sums of M are all 1.

(From last lecture)

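A side-note on matrix multiplication

When matrices A and B are multiplied together to form a matrix C (= AB):

the length of each (horizontal) row of A must equal the length of each (vertical) column of B (otherwise the product AB is not defined),

the i-th row of A is multiplied with the j-th column of B to form the (i,j)-th entry of C (which is at the i-th row and j-th column of C).

Note that in general AB ≠ BA.

A nice description can be found on the website:
http://www.mai.liu.se/~halun/matrix/matrix.html

Powers of the matrix M

Fact: for any t = 1, 2, ..., the matrix M^t (i.e., the product of t copies of M) is equal to:

$$
M^t = \begin{pmatrix}
P(S_t=A \mid S_0=A) & P(S_t=A \mid S_0=G) & P(S_t=A \mid S_0=C) & P(S_t=A \mid S_0=T) \\
P(S_t=G \mid S_0=A) & P(S_t=G \mid S_0=G) & P(S_t=G \mid S_0=C) & P(S_t=G \mid S_0=T) \\
P(S_t=C \mid S_0=A) & P(S_t=C \mid S_0=G) & P(S_t=C \mid S_0=C) & P(S_t=C \mid S_0=T) \\
P(S_t=T \mid S_0=A) & P(S_t=T \mid S_0=G) & P(S_t=T \mid S_0=C) & P(S_t=T \mid S_0=T)
\end{pmatrix}
$$

Hence the entries of M^t give the conditional probabilities P(St=i | S0=j), where i, j = A, G, C, or T.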
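As a minimal sketch of this fact, the snippet below raises a transition matrix to the t-th power by repeated multiplication and reads off one conditional probability. The matrix entries are made-up illustrative values (not taken from the lecture), and the helper names `mat_mul`, `mat_pow` and `prob_after` are hypothetical. Rows and columns follow the ordering A, G, C, T, and each column sums to 1.

```python
BASES = "AGCT"  # fixed ordering A, G, C, T

def mat_mul(a, b):
    """Multiply a (n x k) by b (k x m): i-th row of a times j-th column of b."""
    if len(a[0]) != len(b):
        raise ValueError("row length of a must equal column length of b")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_pow(m, t):
    """Return m^t (the product of t copies of m) for an integer t >= 1."""
    result = m
    for _ in range(t - 1):
        result = mat_mul(result, m)
    return result

# Illustrative transition matrix (assumed values); entry [i][j] is P(S1=i | S0=j).
M = [
    [0.90, 0.05, 0.02, 0.03],
    [0.04, 0.90, 0.03, 0.02],
    [0.03, 0.02, 0.90, 0.05],
    [0.03, 0.03, 0.05, 0.90],
]

def prob_after(m, t, to_base, from_base):
    """P(S_t = to_base | S_0 = from_base): row 'to', column 'from' of m^t."""
    return mat_pow(m, t)[BASES.index(to_base)][BASES.index(from_base)]

print(prob_after(M, 3, "T", "G"))
```

Repeated multiplication is enough for small t; for special matrices such as the Jukes-Cantor one, a closed form for M^t avoids the loop altogether.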

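Some Markov models for molecular evolution

Jukes-Cantor model
2-parameter Kimura model
3-parameter Kimura model

(There are other kinds of models for molecular evolution.)

Jukes-Cantor model

This is the simplest Markov model of base substitution, which assumes:

p0 = (1/4, 1/4, 1/4, 1/4), and

the conditional probabilities P_{i|j} are the same for any pair of i, j with i ≠ j (here i, j = A, G, C, T). This means:

$$
M = \begin{pmatrix}
1-\alpha & \alpha/3 & \alpha/3 & \alpha/3 \\
\alpha/3 & 1-\alpha & \alpha/3 & \alpha/3 \\
\alpha/3 & \alpha/3 & 1-\alpha & \alpha/3 \\
\alpha/3 & \alpha/3 & \alpha/3 & 1-\alpha
\end{pmatrix}
$$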

Jukes-Cantor model

For the Jukes-Cantor model, we have:

$$M p_0 = M \begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix} = \begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix} = p_0$$

Therefore, p_t = (1/4, 1/4, 1/4, 1/4) for any subsequent time instance t = 1, 2, ..., and we say that (1/4, 1/4, 1/4, 1/4) is an equilibrium base distribution for DNA sequences under the Jukes-Cantor model.

Jukes-Cantor model

Significance of α:

α is the only parameter in M.

α is the rate at which observable base substitutions occur over one time step.

Examples of known estimated values of α:

α ≈ 10^-8 mutations per site per year for mitochondrial DNA in mammals.
α ≈ 0.01 mutations per site per year for influenza A virus.

We say that there is a molecular clock when mutation rates are constant.
However, in general, the mutation rate may not be constant over time or for different locations within the DNA.

Jukes-Cantor model

Question: For the Jukes-Cantor model, find the probability (as an expression in terms of α) that a base A in the ancestral sequence will have mutated to become T in the descendent sequence after 100 time steps.

The answer is P(S100=T | S0=A), which is the entry in the 4th row and 1st column of the matrix power M^100. But how do we get the value (in terms of α) of this particular entry?

Computing M^t with Jukes-Cantor model

Note that:

$$M = \left(1 - \frac{4\alpha}{3}\right) I + \frac{4\alpha}{3} J$$

where I is the 4×4 identity matrix and J is the 4×4 matrix with every entry equal to 1/4.

For any real number r, define K(r) := r I + (1−r) J.

Since II = I, IJ = JI = JJ = J, one has
K(r) K(s) = (r I + (1−r) J) (s I + (1−s) J)
= rs I + (r(1−s) + (1−r)s + (1−r)(1−s)) J = rs I + (1−rs) J = K(rs),
and K(r)^t = K(r) K(r) ⋯ K(r) = K(r^t).
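The identity K(r) K(s) = K(rs) can also be checked numerically. The sketch below takes I to be the 4×4 identity matrix and J to be the 4×4 matrix with every entry 1/4 (an assumption that is consistent with IJ = JI = JJ = J); the function names are hypothetical and the values of r, s and t are arbitrary.

```python
def K(r):
    """K(r) = r*I + (1 - r)*J, with I the 4x4 identity and J all entries 1/4."""
    return [[r * (1.0 if i == j else 0.0) + (1.0 - r) * 0.25
             for j in range(4)]
            for i in range(4)]

def mat_mul(a, b):
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def max_abs_diff(a, b):
    """Largest entrywise difference between two 4x4 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4))

r, s = 0.9, 0.7  # arbitrary test values
print(max_abs_diff(mat_mul(K(r), K(s)), K(r * s)))

# K(r)^t by repeated multiplication versus the closed form K(r^t)
t = 25
power = K(r)
for _ in range(t - 1):
    power = mat_mul(power, K(r))
print(max_abs_diff(power, K(r ** t)))
```

Both printed differences should be at the level of floating-point rounding error.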

Computing M^t with Jukes-Cantor model

Since M = K(r) where r = 1 − 4α/3, one gets

$$M^t = K(r^t) = r^t I + (1 - r^t) J$$

Therefore the answer to the previous problem is

$$P(S_{100}=T \mid S_0=A) = \frac{1}{4}\left(1 - \left(1 - \frac{4\alpha}{3}\right)^{100}\right)$$

2-parameter Kimura model

The Kimura 2-parameter model is another Markov model which allows for different rates of transitions (changes within the purine group or within the pyrimidine group) and transversions (changes from purine to pyrimidine or vice versa). The Markov matrix for this model is:

$$
M = \begin{pmatrix}
* & \beta & \gamma & \gamma \\
\beta & * & \gamma & \gamma \\
\gamma & \gamma & * & \beta \\
\gamma & \gamma & \beta & *
\end{pmatrix}
$$

Here β = mutation rate for transitions,
γ = mutation rate for each of the possible transversions.
Numerical value of * is 1 − β − 2γ.
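As a rough sketch, the Kimura 2-parameter matrix can be written down in code using the ordering A, G, C, T (A and G are purines, C and T are pyrimidines). The layout below is the standard form of this matrix and is an assumption here; `beta` is the transition rate, `gamma` the rate of each possible transversion, and the function names and numerical values are illustrative only. The snippet also checks that every column sums to 1, and that choosing beta = gamma = α/3 gives a matrix of the Jukes-Cantor form.

```python
def kimura2(beta, gamma):
    """Kimura 2-parameter matrix, ordering A, G, C, T.

    beta:  rate of transitions (A<->G, C<->T)
    gamma: rate of each of the two possible transversions
    """
    d = 1.0 - beta - 2.0 * gamma  # diagonal entry, so each column sums to 1
    return [
        [d,     beta,  gamma, gamma],
        [beta,  d,     gamma, gamma],
        [gamma, gamma, d,     beta],
        [gamma, gamma, beta,  d],
    ]

def column_sums(m):
    """Sum of each column of a 4x4 matrix."""
    return [sum(m[i][j] for i in range(4)) for j in range(4)]

M = kimura2(beta=0.02, gamma=0.005)  # illustrative values only
print(column_sums(M))

# With beta = gamma = alpha/3 the matrix has 1 - alpha on the diagonal
# and alpha/3 everywhere else, i.e. the Jukes-Cantor form.
alpha = 0.03
JC = kimura2(alpha / 3, alpha / 3)
print(JC[0][0], JC[1][0])
```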

3-parameter Kimura model

The Kimura 3-parameter model has Markov matrix:

$$
M = \begin{pmatrix}
* & \beta & \gamma & \delta \\
\beta & * & \delta & \gamma \\
\gamma & \delta & * & \beta \\
\delta & \gamma & \beta & *
\end{pmatrix}
$$

Numerical value of * is 1 − β − γ − δ.

Note:

The Jukes-Cantor model is a particular case of the 2-parameter Kimura model.

The 2-parameter Kimura model is a particular case of the 3-parameter Kimura model.

All three models satisfy M p0 = p0 where p0 = (1/4, 1/4, 1/4, 1/4).

Phylogenetic distances

Problem: Given an ancestral sequence S0 and a descendent sequence St (evolved over t time steps), how can we estimate the total amount of mutation (including hidden mutations) from the observed amount of mutation?

S0 : GCTAGT ATGATCAGCGG
↓ ↓ ↓
St : GTTAGA ACGATCAGCAG

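Phylogenetic distances

Assume the Jukes-Cantor model with mutation rate α. As each diagonal entry of M^t is

$$\frac{1}{4} + \frac{3}{4}\left(1 - \frac{4\alpha}{3}\right)^t,$$

the fraction of sites in the descendent sequence St (after t time steps) that agree with their initial bases in the ancestral sequence S0 is

$$\frac{1}{4} + \frac{3}{4}\left(1 - \frac{4\alpha}{3}\right)^t.$$

Therefore the fraction of sites that are different is

$$p(t) = \frac{3}{4} - \frac{3}{4}\left(1 - \frac{4\alpha}{3}\right)^t.$$

Example: Consider the case of influenza A virus, which has mutation rate α = .01 per year. The following graph gives values of p(t) for the case α = .01.

[Figure: graph of p(t) against t for α = .01]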
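The values behind such a graph can be tabulated directly. This sketch assumes the Jukes-Cantor expression p(t) = 3/4 − (3/4)(1 − 4α/3)^t for the fraction of differing sites, which follows from the diagonal entries of M^t = r^t I + (1 − r^t) J with r = 1 − 4α/3. The function name and the chosen time points are illustrative.

```python
def fraction_different(alpha, t):
    """Expected fraction of sites that differ after t steps under Jukes-Cantor."""
    r = 1.0 - 4.0 * alpha / 3.0
    return 0.75 - 0.75 * r ** t

alpha = 0.01  # influenza A value quoted in the lecture (per site per year)
for t in (0, 10, 50, 100, 200, 500):
    print(t, round(fraction_different(alpha, t), 4))
```

The values start at 0 and rise toward 3/4 without reaching it, which is why observed differences alone understate the total amount of mutation over long times.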

Example: Suppose mutations of influenza A virus follow the Jukes-Cantor model with mutation rate α = .01 per year. Then, after 10 years, the fraction of sites of the descendent sequence that are different from the initial sequence is:

$$p(10) = \frac{3}{4} - \frac{3}{4}\left(1 - \frac{4(.01)}{3}\right)^{10} \approx 0.0942$$

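From the equation

$$p = \frac{3}{4} - \frac{3}{4}\left(1 - \frac{4\alpha}{3}\right)^t$$

one can deduce

$$t = \frac{\ln\left(1 - \frac{4p}{3}\right)}{\ln\left(1 - \frac{4\alpha}{3}\right)}$$

Example: consider the previous example of influenza A virus where α = .01 per year. If 3/10 of the sites of the descendent sequence St are different from the ancestral sequence, then p = 0.3, and the above formula gives an estimate of the length of time elapsed: t = 38.06 (years).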
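The inversion above is easy to evaluate in code. This is a minimal sketch assuming the Jukes-Cantor relation p = 3/4 − (3/4)(1 − 4α/3)^t solved for t; the function name `elapsed_time` is hypothetical.

```python
import math

def elapsed_time(p, alpha):
    """Solve p = 3/4 - (3/4)(1 - 4*alpha/3)**t for t (requires 0 <= p < 3/4)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 3/4")
    return math.log(1.0 - 4.0 * p / 3.0) / math.log(1.0 - 4.0 * alpha / 3.0)

print(round(elapsed_time(0.3, 0.01), 2))  # 38.06, matching the influenza A example
```

The restriction p < 3/4 reflects the fact that the expected fraction of differing sites never reaches 3/4 under this model.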

Jukes-Cantor distance

Even when α is not known, we may estimate a distance between the ancestral sequence and the descendent sequence in the following way.

Note that, for any real number x near 0, one has

ln(1+x) ≈ x

Thus if α is small, ln(1 − 4α/3) ≈ −4α/3. Hence

$$t = \frac{\ln\left(1 - \frac{4p}{3}\right)}{\ln\left(1 - \frac{4\alpha}{3}\right)} \approx -\frac{3}{4\alpha} \ln\left(1 - \frac{4p}{3}\right)$$

Therefore

$$t\alpha \approx -\frac{3}{4} \ln\left(1 - \frac{4p}{3}\right)$$

Note that
t α = (no. of time steps) × (mutation rate)
= (no. of time steps) × (no. of substitutions per site / time step)
= (expected no. of substitutions per site during the elapsed time).

Here the substitutions include the hidden ones.

Let d := t α. Then d gives a measure of distance between the ancestral sequence and the descendent sequence.

In this way, we define the Jukes-Cantor distance between DNA sequences Sa and Sb as

$$d_{JC}(S_a, S_b) = -\frac{3}{4} \ln\left(1 - \frac{4p}{3}\right)$$

where p is the fraction of sites that disagree when comparing Sa with Sb.
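A minimal sketch of the Jukes-Cantor distance computation follows, assuming the formula d = −(3/4) ln(1 − 4p/3) with p the fraction of disagreeing sites. The two 10-base sequences are hypothetical and the function names are illustrative.

```python
import math

def observed_fraction(seq_a, seq_b):
    """Fraction p of aligned sites at which two equal-length sequences disagree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)

def jc_distance(p):
    """Jukes-Cantor distance -(3/4) ln(1 - 4p/3), defined for 0 <= p < 3/4."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical sequences that differ at 2 of 10 sites.
sa = "ACGTACGTAC"
sb = "ACGTTCGTAA"
p = observed_fraction(sa, sb)
print(p, jc_distance(p))
```

The distance comes out larger than p itself, since it also accounts for hidden substitutions that the site-by-site comparison cannot see.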

Provided a molecular clock hypothesis is valid (i.e., the mutation rate is constant over time), the distance computed is proportional to the length of elapsed time.

In such a case, the distance is a measure of how much time was required for one sequence to mutate into the other.

Example: consider the 40-base sequence example of the last lecture.

As p = 11/40, one gets:

$$d_{JC} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \cdot \frac{11}{40}\right) \approx 0.3426$$

Kimura distances

For the Kimura 3-parameter model, the distance formula is:

$$d_{K3} = -\frac{1}{4}\left(\ln(1 - 2\beta - 2\gamma) + \ln(1 - 2\beta - 2\delta) + \ln(1 - 2\gamma - 2\delta)\right)$$

where β, γ, δ are parameters in the Kimura 3-parameter matrix.

For the Kimura 2-parameter model, the distance formula is

$$d_{K2} = -\frac{1}{2} \ln(1 - 2p_1 - p_2) - \frac{1}{4} \ln(1 - 2p_2)$$

where p1 is the probability of a transition and p2 is the probability of a transversion.

Exercise: deduce the formula of d_K2 from that of d_K3.
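The two Kimura formulas can be compared numerically. The sketch below assumes the form of d_K3 in which all three logarithms are added, and for the 2-parameter case (γ = δ) takes p1 = β and p2 = γ + δ. The input values and function names are illustrative. It is a consistency check only and does not replace the algebra asked for in the exercise.

```python
import math

def k2_distance(p1, p2):
    """d_K2 = -(1/2) ln(1 - 2*p1 - p2) - (1/4) ln(1 - 2*p2)."""
    return (-0.5 * math.log(1.0 - 2.0 * p1 - p2)
            - 0.25 * math.log(1.0 - 2.0 * p2))

def k3_distance(beta, gamma, delta):
    """d_K3 = -(1/4) [ln(1-2b-2g) + ln(1-2b-2d) + ln(1-2g-2d)]."""
    return -0.25 * (math.log(1.0 - 2.0 * beta - 2.0 * gamma)
                    + math.log(1.0 - 2.0 * beta - 2.0 * delta)
                    + math.log(1.0 - 2.0 * gamma - 2.0 * delta))

# With gamma == delta, take p1 = beta and p2 = gamma + delta.
beta, gamma = 0.10, 0.04  # illustrative values only
print(k3_distance(beta, gamma, gamma), k2_distance(beta, 2.0 * gamma))

# With p1 = p/3 and p2 = 2p/3, d_K2 reduces to the Jukes-Cantor distance.
p = 0.3
print(k2_distance(p / 3.0, 2.0 * p / 3.0), -0.75 * math.log(1.0 - 4.0 * p / 3.0))
```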

Symmetric property of distances

All distances discussed so far are symmetric:

d(Sa, Sb) = d(Sb, Sa).

Consequently, if we have two sequences, one of which (but we do not know which one) is a descendent of the other, then we may take one of them arbitrarily as the ancestor and the other as the descendent to calculate the distance between them. Different choices of the ancestor and descendent will not affect the result.

Additive property of distances

Also, all distances discussed so far follow the additive rule:
if ancestral sequence S0 evolved to become S1, and S1 evolved to become S2, then

d(S0, S2) = d(S0, S1) + d(S1, S2).

S0 --- d(S0, S1) ---> S1 --- d(S1, S2) ---> S2

Symmetric and additive properties of distances are useful in constructing phylogenetic trees relating many species.
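For the Jukes-Cantor distance, the additive rule can be illustrated with expected fractions of differing sites. The sketch assumes that S0 evolves into S1 over t1 steps and S1 into S2 over t2 steps at the same rate α, and that p(t) = 3/4 − (3/4)(1 − 4α/3)^t and d = −(3/4) ln(1 − 4p/3). The numerical values are illustrative and the function names are hypothetical.

```python
import math

def fraction_different(alpha, t):
    """Expected fraction of differing sites after t steps under Jukes-Cantor."""
    r = 1.0 - 4.0 * alpha / 3.0
    return 0.75 * (1.0 - r ** t)

def jc_distance(p):
    """Jukes-Cantor distance for an observed fraction p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

alpha, t1, t2 = 0.01, 10, 25  # illustrative values only
d01 = jc_distance(fraction_different(alpha, t1))
d12 = jc_distance(fraction_different(alpha, t2))
d02 = jc_distance(fraction_different(alpha, t1 + t2))
print(d01 + d12, d02)

# By contrast, the raw fractions of differing sites are not additive.
print(fraction_different(alpha, t1) + fraction_different(alpha, t2),
      fraction_different(alpha, t1 + t2))
```

The first pair of numbers agrees up to rounding, while the second pair does not, which is one reason to work with distances rather than raw fractions.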

Distance between two descendents

In practice, it is uncommon to have DNA sequences for both an ancestor and a descendent. Instead, we usually have two descendent DNA sequences S1 and S2 that mutated from a common, yet unknown ancestral sequence S0:

      S0
     /  \
   S1    S2

In such a case we can only compute d(S1, S2). However, if d is symmetric and additive, then since d(S0, S1) = d(S1, S0), one may treat S1 as the ancestor and S0 as the descendent. Using the additive rule, we have
d(S1, S2) = d(S1, S0) + d(S0, S2) = d(S0, S1) + d(S0, S2)

There are many other kinds of models for molecular evolution, and many other kinds of distances for DNA sequences. In this course we are only able to study some of the simplest cases.

Questions:

Which of the gorilla, chimpanzee, orangutan, and gibbon are humans' closest evolutionary kin?
Which of the following evolutionary trees is more probable?

[Figure: candidate evolutionary trees]

Trees

A tree is a diagram of vertices and edges, in which all vertices are interconnected by edges, and no loops are formed by the edges.

Vertices in a tree:
leaves = terminal vertices
internal vertices

[Figure: example of a tree]
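The definition of a tree translates into a short check. The sketch below uses the standard fact that a connected graph on n vertices has no loops exactly when it has n − 1 edges; the function names and the small example graph are hypothetical.

```python
def is_tree(vertices, edges):
    """True if the undirected graph is connected and has no loops (cycles)."""
    vertices = list(vertices)
    if not vertices:
        return False
    # A connected graph on n vertices is loop-free exactly when it has n - 1 edges.
    if len(edges) != len(vertices) - 1:
        return False
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    # Depth-first search from one vertex to test connectivity.
    seen, stack = set(), [vertices[0]]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(neighbours[v] - seen)
    return len(seen) == len(vertices)

def leaves(vertices, edges):
    """Terminal vertices: those that are an endpoint of exactly one edge."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v in vertices if degree[v] == 1)

# Hypothetical example: internal vertices x, y and leaves a, b, c, d.
V = ["a", "b", "c", "d", "x", "y"]
E = [("a", "x"), ("b", "x"), ("x", "y"), ("y", "c"), ("y", "d")]
print(is_tree(V, E), leaves(V, E))
```

Adding an edge between a and b would create a loop, and removing the edge between x and y would disconnect the graph; in both cases `is_tree` returns False.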

Examples of non-trees

[Figure: some vertices not connected to others]
[Figure: loop exists]

Phylogenetic rooted trees

each leaf = one taxon (source of DNA sequence)
internal vertices = ancestors
root = ultimate common ancestor
bifurcating: each ancestor gives 2 immediate descendents

Unrooted trees

Have no roots.
Placing a root at different locations yields different rooted trees.

Topological trees and metric trees

Topological trees: show only the branching relationship of vertices; do not care about edge lengths.
Metric trees: also show lengths of edges (usually by indicating lengths on edges).

[Figure panels: Topological tree; Metric tree (lengths not to scale); Metric tree (lengths to scale)]

If an edge has zero length

In a metric tree, an edge of zero length can be treated as if its two end vertices coincide.

[Figure: a metric tree containing an edge of zero length is the same as the tree in which that edge's two end vertices coincide]

Meaning of edge lengths

Edge lengths in phylogenetic trees usually represent the amount of mutation that occurred between splittings (e.g., the Jukes-Cantor distance between DNA sequences).

If the molecular clock assumption holds:

edge lengths are proportional to elapsed time,
every taxon of a rooted tree has the same total distance from the root.
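The last property, that under a molecular clock every taxon is equally far from the root, can be checked on a rooted metric tree. The sketch below stores a tree as a dictionary from each internal vertex to its children with edge lengths. The representation, the function names and the edge lengths are all assumptions made for illustration.

```python
def leaf_distances(tree, root):
    """Total edge length from the root to every leaf.

    tree maps each internal vertex to a list of (child, edge_length) pairs;
    vertices that do not appear as keys are leaves.
    """
    distances = {}
    stack = [(root, 0.0)]
    while stack:
        vertex, dist = stack.pop()
        children = tree.get(vertex, [])
        if not children:
            distances[vertex] = dist
        for child, length in children:
            stack.append((child, dist + length))
    return distances

def consistent_with_clock(tree, root, tol=1e-9):
    """True if all leaves are equally far from the root (within tol)."""
    d = list(leaf_distances(tree, root).values())
    return max(d) - min(d) <= tol

# Hypothetical rooted metric tree with taxa X1, X2, X3 (made-up edge lengths).
tree = {
    "root": [("u", 0.3), ("X3", 0.5)],
    "u": [("X1", 0.2), ("X2", 0.2)],
}
print(leaf_distances(tree, "root"), consistent_with_clock(tree, "root"))
```

With estimated distances the leaf depths will rarely agree exactly, so a tolerance suited to the data would be needed in practice.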

Topologically equivalent trees

Trees are topologically the same if they represent the same branching relations; otherwise they are topologically distinct.

[Figure: topologically equivalent rooted trees]
[Figure: topologically distinct unrooted trees]

Number of topologically distinct trees

As n grows, the number of distinct rooted or unrooted bifurcating trees with n taxa (as leaves) grows rapidly.
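To give a sense of how rapidly these numbers grow, the sketch below uses the standard counts for bifurcating trees with n labelled leaves: (2n − 5)!! = 1·3·5···(2n − 5) unrooted trees for n ≥ 3, and (2n − 3)!! rooted trees for n ≥ 2. These formulas are not stated in the slides and are brought in here as an assumption; the function names are hypothetical.

```python
def double_factorial_odd(k):
    """Product 1 * 3 * 5 * ... * k for odd k >= 1 (returns 1 when k <= 1)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result

def unrooted_trees(n):
    """Number of distinct unrooted bifurcating trees with n >= 3 labelled leaves."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial_odd(2 * n - 5)

def rooted_trees(n):
    """Number of distinct rooted bifurcating trees with n >= 2 labelled leaves."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial_odd(2 * n - 3)

for n in (3, 4, 5, 10, 20):
    print(n, unrooted_trees(n), rooted_trees(n))
```

For 10 taxa there are already 2,027,025 unrooted and 34,459,425 rooted trees, so examining every tree by hand is out of the question for even modest n.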

Another way to draw rooted trees

Rooted trees may be drawn in a square manner:

vertical lines represent splits,
horizontal lines represent distances of splits.

[Figure: two ways of drawing the same rooted tree]

Reference

Mathematical Models in Biology: An Introduction, E.S. Allman and J.A. Rhodes, Cambridge University Press, 2004.

WWW resource

http://www.nmsr.org/upgma.htm
