CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Thursday, October 8, 2015
Outline
1 Recap
2 Sequence alignment
1 Recap
Dynamic Programming
A quadratic DP algorithm
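The line $L: y = ax + b$ for which the error over the $n$ points $(x_1, y_1), \ldots, (x_n, y_n)$ of $P$,
$$\mathrm{err}(L, P) = \sum_{i=1}^{n} (y_i - a x_i - b)^2 \qquad (2)$$
is minimum satisfies
$$a = \frac{n \sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} \qquad (3)$$
$$b = \frac{\sum_i y_i - a \sum_i x_i}{n} \qquad (4)$$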
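As a small illustration of equations (2) to (4), the closed-form slope and intercept can be computed directly from four running sums. This is only a sketch: the function name `fit_line` and the choice to raise an error when the denominator of (3) vanishes are assumptions made for this example, not part of the slides.

```python
def fit_line(points):
    """Return (a, b, err) for the least-squares line y = a*x + b.

    points is a list of (x, y) pairs. The formulas mirror equations
    (3) and (4); err is the squared error of equation (2).
    """
    n = len(points)
    sx = sum(x for x, _ in points)        # sum of x_i
    sy = sum(y for _, y in points)        # sum of y_i
    sxy = sum(x * y for x, y in points)   # sum of x_i * y_i
    sxx = sum(x * x for x, _ in points)   # sum of x_i^2
    denom = n * sxx - sx * sx
    if denom == 0:
        # Fewer than two distinct x-coordinates: equation (3) is undefined.
        raise ValueError("need at least two distinct x-coordinates")
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    err = sum((y - a * x - b) ** 2 for x, y in points)
    return a, b, err


if __name__ == "__main__":
    # Three points lying exactly on y = 2x + 1.
    print(fit_line([(0, 1), (1, 3), (2, 5)]))
```

For points that lie exactly on a line, the returned error is zero; any scatter around the line shows up as a positive `err`.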
Remark 1.
1. We can precompute and store all $e_{i,j}$ using equations (2), (3), (4) in $O(n^3)$ time. This can be improved to $O(n^2)$.
2. The natural recursive algorithm arising from recurrence (5) is not efficient (think about its recursion tree!).
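To see why the natural recursion is inefficient, one can count its calls. Recurrence (5) computes OPT(j) from OPT(i - 1) for every i between 1 and j, so a plain recursive implementation makes one recursive call per choice of i. The sketch below counts those calls only (the error terms are irrelevant for the count); the name `naive_calls` is hypothetical.

```python
def naive_calls(j):
    """Number of calls made by the plain recursion when asked for OPT(j).

    One call for OPT(j) itself, plus the calls triggered by each
    OPT(i - 1) for i = 1, ..., j.
    """
    if j == 0:
        return 1
    return 1 + sum(naive_calls(i - 1) for i in range(1, j + 1))


if __name__ == "__main__":
    for j in range(0, 11):
        print(j, naive_calls(j))
```

The count doubles with every additional point, that is, it equals $2^j$, even though there are only $j + 1$ distinct subproblems. This is the overlap that memoization or a bottom-up table removes.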
1. Overlapping subproblems
2. An easy-to-compute recurrence (5) for combining solutions
to the smaller subproblems into a solution to a larger
subproblem in O(n) time (once smaller subproblems have
been solved).
3. Iterative, bottom-up computations: compute the
subproblems from smallest (0 points) to largest (n points),
iteratively.
4. Small number of subproblems: we only need to solve n
subproblems.
$$\mathrm{OPT}(j) = \min_{1 \le i \le j} \left\{ e_{i,j} + C + \mathrm{OPT}(i-1) \right\} \qquad (5)$$
end for
Return M [n]
Running time: $O(n^3)$ to precompute all $e_{i,j}$, plus $O(n^2)$ to fill in the dynamic programming array $M$. Can be brought down to $O(n^2)$.
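The recap above can be put together in a short, unoptimized sketch of the $O(n^3)$ version: precompute every $e_{i,j}$ with the least-squares formulas, then fill $M$ bottom-up using $M[j] = \min_{1 \le i \le j} \{ e_{i,j} + C + M[i-1] \}$. The function names are illustrative, and the code assumes the points are sorted by x-coordinate with all x-coordinates distinct.

```python
def segment_error(points, i, j):
    """Least-squares error of one line through points[i..j] (0-based, inclusive).

    Assumes distinct x-coordinates, so the denominator below is nonzero
    whenever the segment has at least two points.
    """
    seg = points[i:j + 1]
    n = len(seg)
    if n == 1:
        return 0.0  # a single point is fitted exactly
    sx = sum(x for x, _ in seg)
    sy = sum(y for _, y in seg)
    sxy = sum(x * y for x, y in seg)
    sxx = sum(x * x for x, _ in seg)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return sum((y - a * x - b) ** 2 for x, y in seg)


def segmented_least_squares(points, C):
    """Minimum total cost (segment errors plus C per segment)."""
    n = len(points)
    # e[i][j]: error of a single segment covering points i..j (1-based).
    e = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            e[i][j] = segment_error(points, i - 1, j - 1)
    M = [0.0] * (n + 1)  # M[0] = 0: no points, no cost
    for j in range(1, n + 1):
        M[j] = min(e[i][j] + C + M[i - 1] for i in range(1, j + 1))
    return M[n]


if __name__ == "__main__":
    pts = [(0, 0), (1, 1), (2, 2), (3, 0), (4, -2), (5, -4)]
    print(segmented_least_squares(pts, 1))    # small C favours two segments
    print(segmented_least_squares(pts, 100))  # large C favours one segment
```

The precomputation loop visits $O(n^2)$ pairs and spends $O(n)$ on each, which is where the $O(n^3)$ term comes from; the loop that fills $M$ is $O(n^2)$.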
2 Sequence alignment
String similarity
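Input: two strings, e.g., "ocurrance" and "occurrence".
[Figure: two ways of lining up the two strings, with "-" marking a gap]
o-currance        o-curr-ance
occurrence        occurre-nce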
Strings in biology
Why similarity?
Let X = GCAT and Y = CATG, lined up as follows:
x1  x2  x3  x4
G   C   A   T   -
-   C   A   T   G
    y1  y2  y3  y4
Then {(2, 1), (3, 2), (4, 3)} is an alignment of X and Y : these
are the pairs of positions in X, Y that are aligned (matched).
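A small check can make the notion of an alignment concrete. The sketch below assumes the usual definition: every position of X and of Y appears in at most one pair, and no two pairs cross (if $(i, j)$ and $(i', j')$ are both in the set and $i < i'$, then $j < j'$). The function name `is_alignment` is hypothetical.

```python
def is_alignment(pairs, m, n):
    """Return True if pairs is a valid alignment of strings of lengths m and n.

    pairs is a collection of 1-based (i, j) positions. After sorting by i,
    a valid alignment must be strictly increasing in both coordinates:
    that rules out reused positions and crossing pairs at the same time.
    """
    ordered = sorted(pairs)
    for i, j in ordered:
        if not (1 <= i <= m and 1 <= j <= n):
            return False  # position outside the strings
    for (i1, j1), (i2, j2) in zip(ordered, ordered[1:]):
        if i1 == i2 or j1 >= j2:
            return False  # reused position or crossing pair
    return True


if __name__ == "__main__":
    print(is_alignment({(2, 1), (3, 2), (4, 3)}, 4, 4))  # the alignment above
    print(is_alignment({(1, 2), (2, 1)}, 4, 4))          # crossing pairs
```

The first call accepts the alignment of GCAT and CATG given above; the second rejects a pair of crossing matches.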
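Cost of an alignment
Let $L$ be an alignment of $X = x_1 \ldots x_m$, $Y = y_1 \ldots y_n$.
1. Gap penalty $\delta$: there is a cost $\delta$ for every position of $X$ that is not matched in $Y$, and vice versa.
2. Mismatch cost: there is a cost $\alpha_{pq}$ for every pair of alphabet symbols $p$, $q$ that are matched in $L$.
The cost of alignment $L$ is the sum of all the gap and mismatch costs:
$$\mathrm{cost}(L) = \sum_{\substack{1 \le i \le m \\ i \text{ unmatched in } L}} \delta \; + \sum_{\substack{1 \le j \le n \\ j \text{ unmatched in } L}} \delta \; + \sum_{(i,j) \in L} \alpha_{x_i y_j}$$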
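The cost of a given alignment can be computed by counting unmatched positions and summing the mismatch costs of the matched pairs. In this sketch the gap penalty is passed as `delta` and the mismatch cost as a function `alpha(p, q)`; the particular numbers used in the example (gap penalty 2, mismatch cost 3 for different symbols and 0 for equal ones) are assumptions chosen for illustration only.

```python
def alignment_cost(X, Y, L, delta, alpha):
    """Cost of alignment L between strings X and Y.

    L is a set of 1-based (i, j) pairs; delta is the gap penalty and
    alpha(p, q) the mismatch cost of matching symbol p with symbol q.
    """
    matched_x = {i for i, _ in L}
    matched_y = {j for _, j in L}
    # Every unmatched position of X or Y pays the gap penalty once.
    gaps = (len(X) - len(matched_x)) + (len(Y) - len(matched_y))
    mismatch = sum(alpha(X[i - 1], Y[j - 1]) for i, j in L)
    return gaps * delta + mismatch


if __name__ == "__main__":
    # Illustrative costs, not taken from the slides.
    alpha = lambda p, q: 0 if p == q else 3
    print(alignment_cost("GCAT", "CATG", {(2, 1), (3, 2), (4, 3)}, 2, alpha))
```

With these costs the alignment of GCAT and CATG that matches C, A and T pays only for its two gaps, while matching the strings position by position would pay four mismatch costs.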
Examples
Example 1.
Let L1 be the alignment shown below.
x1  x2      x3  x4  x5  x6  x7  x8  x9
o   c   -   u   r   r   a   n   c   e
o   c   c   u   r   r   e   n   c   e
y1  y2  y3  y4  y5  y6  y7  y8  y9  y10
$L_1 = \{(1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)\}$
$\mathrm{cost}(L_1) = \delta + \alpha_{ae}$
(This is $\delta$ for the unmatched position $y_3$, plus $\alpha_{x_6 y_7}$.)
Examples
Example 2.
Let L2 be the alignment shown below.
x1      x2  x3  x4  x5  x6      x7  x8  x9
o   -   c   u   r   r   a   -   n   c   e
o   c   c   u   r   r   -   e   n   c   e
y1  y2  y3  y4  y5  y6      y7  y8  y9  y10
$L_2 = \{(1, 1), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8), (8, 9), (9, 10)\}$
$\mathrm{cost}(L_2) = 3\delta$
Examples
Example 3.
Let L3 , L4 be the alignments shown below.
$L_3$:
x1  x2  x3  x4
G   C   A   T
C   A   T   G
y1  y2  y3  y4
$L_4$:
x1  x2  x3  x4
G   C   A   T   -
-   C   A   T   G
    y1  y2  y3  y4
$\mathrm{cost}(L_3) = \alpha_{GC} + \alpha_{CA} + \alpha_{AT} + \alpha_{TG}$
$\mathrm{cost}(L_4) = 2\delta$
Input:
Claim 1.
Let $L$ be an optimal alignment. Then either
1. the last two symbols $x_m$, $y_n$ of $X$, $Y$ are matched to each other in $L$, hence the pair $(m, n) \in L$; or
2. $x_m$, $y_n$ are not matched to each other in $L$, hence $(m, n) \notin L$.
In this case, at least one of $x_m$, $y_n$ is not matched in $L$, hence at least one of $m$, $n$ does not appear in $L$.
Proof of Claim 1
By contradiction.
Suppose $(m, n) \notin L$ but $x_m$ and $y_n$ are both matched in $L$.
That is,
1. $x_m$ is matched with $y_j$ for some $j < n$, hence $(m, j) \in L$;
2. $y_n$ is matched with $x_i$ for some $i < m$, hence $(i, n) \in L$.
Since the pairs $(i, n)$ and $(m, j)$ cross, $L$ is not an alignment, a contradiction.
Rewriting Claim 1
Fact 4.
In an optimal alignment $L$, at least one of the following is true:
1. $(m, n) \in L$; or
2. $x_m$ is not matched; or
3. $y_n$ is not matched.
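Let
$\mathrm{OPT}(i, j)$ = minimum cost of an alignment between $x_1 \ldots x_i$ and $y_1 \ldots y_j$.
Then
$$\mathrm{OPT}(i, j) = \begin{cases} j\delta, & \text{if } i = 0 \\ \min\left\{ \alpha_{x_i y_j} + \mathrm{OPT}(i-1, j-1),\ \delta + \mathrm{OPT}(i-1, j),\ \delta + \mathrm{OPT}(i, j-1) \right\}, & \text{if } i, j \ge 1 \\ i\delta, & \text{if } j = 0 \end{cases}$$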
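The recurrence for OPT(i, j) translates almost line by line into a memoized recursion. The sketch below is one possible rendering, not the slides' own code: `opt_cost` is a hypothetical name, the gap penalty is the parameter `delta`, and the mismatch cost is a function `alpha(p, q)`. It relies on Python's recursion stack, so it is only suitable for short strings.

```python
from functools import lru_cache


def opt_cost(X, Y, delta, alpha):
    """Minimum alignment cost of X and Y via the memoized recurrence."""
    m, n = len(X), len(Y)

    @lru_cache(maxsize=None)
    def opt(i, j):
        if i == 0:
            return j * delta   # y_1..y_j are all unmatched
        if j == 0:
            return i * delta   # x_1..x_i are all unmatched
        return min(
            alpha(X[i - 1], Y[j - 1]) + opt(i - 1, j - 1),  # (i, j) matched
            delta + opt(i - 1, j),                          # x_i unmatched
            delta + opt(i, j - 1),                          # y_j unmatched
        )

    return opt(m, n)


if __name__ == "__main__":
    # Illustrative costs: gap penalty 1, mismatch 1 for different symbols.
    unit = lambda p, q: 0 if p == q else 1
    print(opt_cost("ocurrance", "occurrence", 1, unit))
```

Because each pair (i, j) is evaluated once and cached, the recursion does $O(mn)$ work in total, in contrast to the exponential blow-up of the uncached version.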
Remarks
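[Figure: the dynamic programming table $M$, indexed by $i = 0, \ldots, m$ and $j = 0, \ldots, n$; entry $(i, j)$ is computed from entries $(i-1, j-1)$, $(i-1, j)$ and $(i, j-1)$.]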
Pseudocode
SequenceAlignment(X, Y)
  Initialize $M[i, 0] = i\delta$ for $0 \le i \le m$
  Initialize $M[0, j] = j\delta$ for $0 \le j \le n$
  for j = 1 to n do
    for i = 1 to m do
      $M[i, j] = \min\{\alpha_{x_i y_j} + M[i-1, j-1],\ \delta + M[i-1, j],\ \delta + M[i, j-1]\}$
    end for
  end for
  return $M[m, n]$
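A direct Python rendering of the pseudocode is sketched below. The table-filling part follows the pseudocode; the traceback that recovers one optimal alignment from the filled table is an addition for illustration and is not shown on the slides. The names `sequence_alignment`, `delta` and `alpha` are assumptions of this example.

```python
def sequence_alignment(X, Y, delta, alpha):
    """Return (optimal cost, one optimal alignment as a list of (i, j) pairs)."""
    m, n = len(X), len(Y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta
    for j in range(n + 1):
        M[0][j] = j * delta
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            M[i][j] = min(
                alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1],
                delta + M[i - 1][j],
                delta + M[i][j - 1],
            )
    # Traceback (an addition): walk back from (m, n), at each step following
    # a term of the recurrence that attains the stored minimum.
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if M[i][j] == alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1]:
            pairs.append((i, j))
            i -= 1
            j -= 1
        elif M[i][j] == delta + M[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return M[m][n], pairs


if __name__ == "__main__":
    # Illustrative costs: gap penalty 1, mismatch 3 for different symbols.
    cost, pairs = sequence_alignment("GCAT", "CATG", 1,
                                     lambda p, q: 0 if p == q else 3)
    print(cost, pairs)
```

When several alignments are optimal, the traceback returns just one of them; which one depends on the order in which the three cases are tested.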
Running time?
Time: $O(mn)$
Space: $O(mn)$
English words: $m, n \le 10$
Computational biology: $m = n = 100000$
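For $m = n = 100000$ the full table would have about $10^{10}$ entries, so space rather than time becomes the bottleneck. If only the optimal cost is needed (not the alignment itself), one common workaround is to keep just two columns of the table, since column $j$ depends only on column $j - 1$. The sketch below illustrates that idea under the same assumed interface as before (`delta` for the gap penalty, `alpha(p, q)` for the mismatch cost); it does not recover the alignment.

```python
def alignment_cost_two_columns(X, Y, delta, alpha):
    """Optimal alignment cost using O(m) space instead of O(mn)."""
    m, n = len(X), len(Y)
    prev = [i * delta for i in range(m + 1)]   # column j = 0 of the table
    for j in range(1, n + 1):
        cur = [j * delta] + [0] * m            # cur[0] is M[0][j]
        for i in range(1, m + 1):
            cur[i] = min(
                alpha(X[i - 1], Y[j - 1]) + prev[i - 1],  # M[i-1][j-1]
                delta + cur[i - 1],                       # M[i-1][j]
                delta + prev[i],                          # M[i][j-1]
            )
        prev = cur
    return prev[m]


if __name__ == "__main__":
    unit = lambda p, q: 0 if p == q else 1
    print(alignment_cost_two_columns("ocurrance", "occurrence", 1, unit))
```

The running time is still $O(mn)$; only the memory footprint changes.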