Slides10 8

Algorithms for Data Science
CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Thursday, October 8, 2015
Outline
1 Recap
Applying the DP principle
2 Sequence alignment
Today
1 Recap
Review of the last lecture
Dynamic Programming
The problem: data segmentation
A mathematical formulation of the problem
An exponential-time brute-force approach
An exponential-time recursive algorithm
A quadratic DP algorithm
Linear least squares fitting

A foundational problem in statistics: find a line of best fit
through some data points.
Linear least squares fitting
Input: a set P of n data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$; we assume $x_1 < x_2 < \ldots < x_n$.
Output: the line L defined as y = ax + b that minimizes the error
$$\mathrm{err}(L, P) = \sum_{i=1}^{n} (y_i - a x_i - b)^2 \qquad (1)$$
Linear least squares fitting: solution

Given a set P of data points, we can use calculus to show that the line L given by y = ax + b that minimizes
$$\mathrm{err}(L, P) = \sum_{i=1}^{n} (y_i - a x_i - b)^2 \qquad (2)$$
satisfies
$$a = \frac{n \sum_i x_i y_i - (\sum_i x_i)(\sum_i y_i)}{n \sum_i x_i^2 - (\sum_i x_i)^2} \qquad (3)$$
$$b = \frac{\sum_i y_i - a \sum_i x_i}{n} \qquad (4)$$
How fast can we compute a, b?
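Equations (3) and (4) only need four running sums, so a and b can be computed in one pass over the data, that is, in O(n) time. The following is a minimal Python sketch of that computation. The names `fit_line` and `err` and the treatment of a zero denominator (a single point) are illustrative assumptions, not part of the slides.

```python
def fit_line(points):
    """Return (a, b) for the least-squares line y = a*x + b, per equations (3) and (4)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        # Assumption: with a single point (or identical x values) use a horizontal line.
        return 0.0, sy / n
    a = (n * sxy - sx * sy) / denom   # equation (3)
    b = (sy - a * sx) / n             # equation (4)
    return a, b


def err(points, a, b):
    """Squared error of the line y = a*x + b on the given points, per equation (2)."""
    return sum((y - a * x - b) ** 2 for x, y in points)


a, b = fit_line([(0, 1), (1, 3), (2, 5)])   # three points on the line y = 2x + 1
```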
What if the data changes direction?
Formalizing the problem

Input: data set $P = \{p_1, \ldots, p_n\}$ of points on the plane.
A segment $S = \{p_i, p_{i+1}, \ldots, p_j\}$ is a contiguous subset of the input.
Let $\mathcal{S}$ be a partition of P into $m_{\mathcal{S}}$ segments $S_1, S_2, \ldots, S_{m_{\mathcal{S}}}$.
For every segment $S_k$, use (2), (3), (4) to compute a line $L_k$ that minimizes $\mathrm{err}(L_k, S_k)$.
Let C > 0 be a fixed multiplier. The cost of the partition is
$$\sum_{S_k \in \mathcal{S}} \mathrm{err}(L_k, S_k) + m_{\mathcal{S}} \cdot C$$
Segmented least squares
This problem is an instance of change detection in data mining and statistics.
Input: A set P of n data points $p_i = (x_i, y_i)$ as before.
Output: A segmentation $\mathcal{S} = \{S_1, S_2, \ldots, S_{m_{\mathcal{S}}}\}$ of P whose cost
$$\sum_{S_k \in \mathcal{S}} \mathrm{err}(L_k, S_k) + m_{\mathcal{S}} \cdot C$$
is minimum.
A recurrence for the optimal solution

Notation: let $e_{i,j} = \mathrm{err}(L, \{p_i, \ldots, p_j\})$, for $1 \le i \le j \le n$.
Then
$$OPT(n) = \min_{1 \le i \le n} \{ e_{i,n} + C + OPT(i-1) \}.$$
If we apply the above expression recursively to remove the last segment, we obtain the recurrence
$$OPT(j) = \min_{1 \le i \le j} \{ e_{i,j} + C + OPT(i-1) \} \qquad (5)$$
Remark 1.
1. We can precompute and store all $e_{i,j}$ using equations (2), (3), (4) in $O(n^3)$ time. This can be improved to $O(n^2)$.
2. The natural recursive algorithm arising from recurrence (5) is not efficient (think about its recursion tree!).
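One way to reach the $O(n^2)$ bound of Remark 1 is to keep prefix sums of x, y, xy, x² and y², so that each $e_{i,j}$ follows from (3), (4) and an expanded form of (2) in O(1) time. The slides do not say how the improvement is obtained, so the sketch below is only one possible approach. The name `segment_errors` is hypothetical, and the code assumes distinct x values, as in the input specification.

```python
from itertools import accumulate


def segment_errors(points):
    """Return e with e[i][j] = least-squares error of points i..j (1-indexed, i <= j)."""
    n = len(points)

    def prefix(values):
        # prefix(v)[k] is the sum of the first k values.
        return [0.0] + list(accumulate(values))

    X = prefix(x for x, _ in points)
    Y = prefix(y for _, y in points)
    XY = prefix(x * y for x, y in points)
    XX = prefix(x * x for x, _ in points)
    YY = prefix(y * y for _, y in points)

    e = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):          # a single point has error 0
            k = j - i + 1
            sx, sy = X[j] - X[i - 1], Y[j] - Y[i - 1]
            sxy = XY[j] - XY[i - 1]
            sxx = XX[j] - XX[i - 1]
            syy = YY[j] - YY[i - 1]
            denom = k * sxx - sx * sx          # positive when the x values are distinct
            a = (k * sxy - sx * sy) / denom    # equation (3)
            b = (sy - a * sx) / k              # equation (4)
            # Equation (2) expanded in terms of the sums.
            val = syy - 2 * a * sxy - 2 * b * sy + a * a * sxx + 2 * a * b * sx + k * b * b
            e[i][j] = max(val, 0.0)            # guard against tiny negative rounding error
    return e
```

The two nested loops visit each pair i ≤ j once and do constant work per pair, which gives the quadratic total.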
Elements of DP in segmented least squares
1. Overlapping subproblems
2. An easy-to-compute recurrence (5) for combining solutions
to the smaller subproblems into a solution to a larger
subproblem in O(n) time (once smaller subproblems have
been solved).
3. Iterative, bottom-up computations: compute the
subproblems from smallest (0 points) to largest (n points),
iteratively.
4. Small number of subproblems: we only need to solve n
subproblems.
A dynamic programming approach
$$OPT(j) = \min_{1 \le i \le j} \{ e_{i,j} + C + OPT(i-1) \}$$
The optimal solution to the subproblem on $p_1, \ldots, p_j$ contains optimal solutions to smaller subproblems.
Recurrence (5) provides an ordering of the subproblems from smaller to larger, with the subproblem of size 0 being the smallest and the subproblem of size n the largest.
There are n + 1 subproblems in total. Solving the j-th subproblem requires $\Theta(j) = O(n)$ time.
The overall running time is $O(n^2)$.
Boundary conditions: OPT(0) = 0.
Segment $p_k, \ldots, p_j$ appears in the optimal solution only if the minimum in the expression above is achieved for i = k.
An iterative algorithm for segmented least squares

Let M be an array of n + 1 entries, indexed from 0 to n. M[i] stores the cost of the optimal segmentation of the first i data points.
SegmentedLS(n, P)
  M[0] = 0
  for all pairs $i \le j$ do
    Compute $e_{i,j}$ for segment $p_i, \ldots, p_j$ using (2), (3), (4)
  end for
  for j = 1 to n do
    $M[j] = \min_{1 \le i \le j} \{ e_{i,j} + C + M[i-1] \}$
  end for
  Return M[n]
Running time: precomputing all $e_{i,j}$ takes $O(n^3)$ time and filling in the dynamic programming array M takes $O(n^2)$ time, for $O(n^3) + O(n^2)$ in total. This can be brought down to $O(n^2)$.
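The pseudocode above translates almost line by line into Python. This sketch keeps the straightforward $O(n^3)$ precomputation of the errors. The names `segment_error` and `segmented_ls` are illustrative, and the data in the usage lines is made up for the example.

```python
def segment_error(pts):
    """Least-squares error of the best line through pts, using equations (2), (3), (4)."""
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxy = sum(x * y for x, y in pts)
    sxx = sum(x * x for x, _ in pts)
    denom = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / denom if denom else 0.0   # a single point: horizontal line
    b = (sy - a * sx) / n
    return sum((y - a * x - b) ** 2 for x, y in pts)


def segmented_ls(points, C):
    """Return (M, e): M[j] is the optimal cost for the first j points, e the error table."""
    n = len(points)
    e = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            e[i][j] = segment_error(points[i - 1:j])     # points p_i, ..., p_j
    M = [0.0] * (n + 1)                                  # M[0] = 0 is the boundary case
    for j in range(1, n + 1):
        M[j] = min(e[i][j] + C + M[i - 1] for i in range(1, j + 1))
    return M, e


# Made-up data: three points on y = x followed by three points on y = -2x + 18.
points = [(1, 1), (2, 2), (3, 3), (4, 10), (5, 8), (6, 6)]
M, e = segmented_ls(points, C=1)
```

With C = 1 the two perfectly fitting segments cost 0 + 0 + 2C, which is cheaper than a single poorly fitting line.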
Reconstructing an optimal segmentation

Suppose we want the optimal solution in addition to its value, that is, the actual segmentation that achieves the minimum cost M[n].
We can trace back through the dynamic programming array M to compute the optimal segmentation.
Initial call: OPTSegmentation(n)

OPTSegmentation(j)
  if (j == 0) then return
  else
    Find $1 \le i \le j$ such that $M[j] = e_{i,j} + C + M[i-1]$
    OPTSegmentation(i - 1)
    Output segment $\{p_i, \ldots, p_j\}$
  end if
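The traceback can also be written iteratively. The sketch below assumes that M and the error table e have already been filled in as described, both indexed from 1 with M[0] = 0. It selects an index i that attains the minimum rather than testing floating-point equality, which is a design choice of this sketch and not something the slides prescribe. The name `opt_segmentation` is hypothetical.

```python
def opt_segmentation(M, e, C):
    """Return the optimal segments as (i, j) pairs, 1-indexed, in left-to-right order."""
    segments = []
    j = len(M) - 1                 # start from the full problem, j = n
    while j > 0:
        # An i attaining the minimum in the recurrence for M[j].
        i = min(range(1, j + 1), key=lambda i: e[i][j] + C + M[i - 1])
        segments.append((i, j))    # segment p_i, ..., p_j ends the solution on p_1..p_j
        j = i - 1                  # continue with the remaining prefix
    segments.reverse()
    return segments
```

Each step scans at most j candidates, so the whole traceback takes $O(n^2)$ time in the worst case.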
Obtaining efficient algorithms using DP
1. Optimal substructure: the optimal solution to the problem contains optimal solutions to the subproblems.
2. A recurrence for the overall optimal solution in terms of
optimal solutions to appropriate subproblems. The
recurrence should provide a natural ordering of the
subproblems from smaller to larger and require polynomial
work for combining solutions to the subproblems.
3. Iterative, bottom-up computation of subproblems, from
smaller to larger.
4. Small number of subproblems (polynomial in n).
Dynamic programming vs Divide & Conquer
They both combine solutions to subproblems to generate the overall solution.
However, divide and conquer starts with a large problem and divides it into small pieces, while dynamic programming works from the bottom up, solving the smallest subproblems first and building optimal solutions to steadily larger problems.
Today
1 Recap
String similarity
This problem arises when comparing strings.
Example: consider an online dictionary.
Input: a word, e.g., ocurrance
Output: did you mean occurrence?
Similarity: intuitively, two words are similar if we can almost line them up by using gaps and mismatches.
Aligning strings using gaps and mismatches
We can align ocurrance and occurrence using one gap and one mismatch

o c - u r r a n c e
o c c u r r e n c e

or, three gaps

o - c u r r a - n c e
o c c u r r - e n c e
Strings in biology
Similarity of English words is rather intuitive.
Determining similarity of biological strings is a central computational problem for molecular biologists.
Chromosomes again: an organism's genome consists of chromosomes (giant linear DNA molecules).
We may think of a chromosome as an enormous linear tape containing a string over the alphabet {A, C, G, T}.
The string encodes instructions for building protein molecules.
Why similarity?
Why are we interested in similarity of biological strings?
Roughly speaking, the sequence of symbols in an organism's genome determines the properties of the organism.
So similarity can guide decisions about biological experiments.
How do we define similarity between two strings?
Similarity based on the notion of lining up two strings
Informally, an alignment between two strings tells us which pairs of positions will be lined up with one another.
Example: X = GCAT, Y = CATG

x1  x2  x3  x4
G   C   A   T   -
-   C   A   T   G
    y1  y2  y3  y4

Then {(2, 1), (3, 2), (4, 3)} is an alignment of X and Y: these are the pairs of positions in X, Y that are aligned (matched).
Definition of alignment of two strings

An alignment L of $X = x_1 \ldots x_m$, $Y = y_1 \ldots y_n$ is a set of ordered pairs of indices (i, j) with $i \in [1, m]$, $j \in [1, n]$ such that the following two properties hold:
P1. every $i \in [1, m]$ and every $j \in [1, n]$ appears at most once;
P2. pairs do not cross: if $(i, j), (i', j') \in L$ and $i < i'$, then $j < j'$.
Example: X = GCAT, Y = CATG

x1  x2  x3  x4
G   C   A   T   -
-   C   A   T   G
    y1  y2  y3  y4

1. {(2, 1), (3, 2), (4, 3)} is an alignment; but
2. {(2, 1), (3, 2), (4, 3), (1, 4)} is not an alignment (violates P2).
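Properties P1 and P2 are easy to check mechanically. The following sketch is one way to do so: after sorting the pairs by their first index, P2 holds exactly when the second indices are strictly increasing. The function name `is_alignment` is an illustrative assumption.

```python
def is_alignment(pairs, m, n):
    """Check whether a set of index pairs is an alignment of strings of lengths m and n."""
    pairs = sorted(pairs)
    xs = [i for i, _ in pairs]
    ys = [j for _, j in pairs]
    # Indices must lie in [1, m] and [1, n].
    if any(not (1 <= i <= m and 1 <= j <= n) for i, j in pairs):
        return False
    # P1: every index appears at most once.
    if len(set(xs)) != len(xs) or len(set(ys)) != len(ys):
        return False
    # P2: with pairs sorted by i, the j values must be strictly increasing.
    return all(j1 < j2 for j1, j2 in zip(ys, ys[1:]))


ok = is_alignment({(2, 1), (3, 2), (4, 3)}, 4, 4)           # the alignment from the example
bad = is_alignment({(2, 1), (3, 2), (4, 3), (1, 4)}, 4, 4)  # (1, 4) crosses (2, 1)
```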
Cost of an alignment
Let L be an alignment of $X = x_1 \ldots x_m$, $Y = y_1 \ldots y_n$.
1. Gap penalty $\delta$: there is a cost $\delta$ for every position of X that is not matched in Y, and vice versa.
2. Mismatch cost: there is a cost $\alpha_{pq}$ for every pair of alphabet symbols p, q that are matched in L.
   So every pair $(i, j) \in L$ incurs a cost of $\alpha_{x_i y_j}$.
   Assumption: $\alpha_{pp} = 0$ (matching a symbol with itself incurs no cost).
The cost of alignment L is the sum of all the gap and the mismatch costs.
Cost of alignment in symbols
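In symbols, given alignment L, let
$X_i^L = 1$ if position i of X is not matched (and 0 otherwise),
$Y_j^L = 1$ if position j of Y is not matched (and 0 otherwise).
Then the cost of alignment L is given by
$$\mathrm{cost}(L) = \sum_{1 \le i \le m} \delta X_i^L + \sum_{1 \le j \le n} \delta Y_j^L + \sum_{(i,j) \in L} \alpha_{x_i y_j}$$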
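The cost formula can be evaluated directly from the set of pairs. In the sketch below the gap penalty is passed as a number `delta` and the mismatch costs as a function `alpha(p, q)`. This representation, the name `alignment_cost`, and the concrete values δ = 2 and α = 3 for distinct symbols are assumptions made for illustration only.

```python
def alignment_cost(X, Y, L, delta, alpha):
    """Cost of alignment L (a set of 1-indexed pairs) between strings X and Y."""
    matched_x = {i for i, _ in L}
    matched_y = {j for _, j in L}
    # Every unmatched position of X or Y pays the gap penalty.
    gaps = (len(X) - len(matched_x)) + (len(Y) - len(matched_y))
    # Every matched pair pays the mismatch cost of its two symbols.
    mismatches = sum(alpha(X[i - 1], Y[j - 1]) for i, j in L)
    return delta * gaps + mismatches


def alpha(p, q):
    # Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise.
    return 0 if p == q else 3


X, Y = "ocurrance", "occurrence"
one_gap = {(1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)}
three_gaps = {(1, 1), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8), (8, 9), (9, 10)}
c1 = alignment_cost(X, Y, one_gap, 2, alpha)      # one gap and one mismatch
c2 = alignment_cost(X, Y, three_gaps, 2, alpha)   # three gaps
```

Under these made-up costs the first alignment costs δ + α = 5 and the second costs 3δ = 6.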
Examples
Example 1.
Let L1 be the alignment shown below.

x1  x2      x3  x4  x5  x6  x7  x8  x9
o   c   -   u   r   r   a   n   c   e
o   c   c   u   r   r   e   n   c   e
y1  y2  y3  y4  y5  y6  y7  y8  y9  y10

$L_1 = \{(1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)\}$
$\mathrm{cost}(L_1) = \delta + \alpha_{ae}$
(This is $\delta Y_3^{L_1} + \alpha_{x_6 y_7}$.)
Examples
Example 2.
Let L2 be the alignment shown below.

x1      x2  x3  x4  x5  x6      x7  x8  x9
o   -   c   u   r   r   a   -   n   c   e
o   c   c   u   r   r   -   e   n   c   e
y1  y2  y3  y4  y5  y6      y7  y8  y9  y10

$L_2 = \{(1, 1), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8), (8, 9), (9, 10)\}$
$\mathrm{cost}(L_2) = 3\delta$
(This is $\delta X_6^{L_2} + \delta Y_2^{L_2} + \delta Y_7^{L_2}$.)
Examples
Example 3.
Let L3, L4 be the alignments shown below.

x1  x2  x3  x4
G   C   A   T
C   A   T   G
y1  y2  y3  y4

x1  x2  x3  x4
G   C   A   T   -
-   C   A   T   G
    y1  y2  y3  y4

$L_3 = \{(1, 1), (2, 2), (3, 3), (4, 4)\}$
$L_4 = \{(2, 1), (3, 2), (4, 3)\}$
$\mathrm{cost}(L_3) = \alpha_{GC} + \alpha_{CA} + \alpha_{AT} + \alpha_{TG}$
$\mathrm{cost}(L_4) = 2\delta$
The sequence alignment problem
Input:
two strings X, Y consisting of m, n symbols respectively; each symbol is from some alphabet $\Sigma$
the gap penalty $\delta$
the mismatch costs $\{\alpha_{pq}\}$ for every pair $(p, q) \in \Sigma^2$
Output: the alignment L of minimum cost.
Towards a recursive solution
Claim 1.
Let L be the optimal alignment. Then either
1. the last two symbols $x_m$, $y_n$ of X, Y are matched in L, hence the pair $(m, n) \in L$; or
2. $x_m$, $y_n$ are not matched in L, hence $(m, n) \notin L$.
   In this case, at least one of $x_m$, $y_n$ is not matched in L, hence at least one of m, n does not appear in L.
Proof of Claim 1
By contradiction.
Suppose $(m, n) \notin L$ but $x_m$ and $y_n$ are both matched in L.
That is,
1. $x_m$ is matched with $y_j$ for some j < n, hence $(m, j) \in L$;
2. $y_n$ is matched with $x_i$ for some i < m, hence $(i, n) \in L$.
Since pairs (i, n) and (m, j) cross, L is not an alignment.
Rewriting Claim 1
The following equivalent way of stating Claim 1 will allow us to easily derive a recurrence.
Fact 4.
In an optimal alignment L, at least one of the following is true:
1. $(m, n) \in L$; or
2. $x_m$ is not matched; or
3. $y_n$ is not matched.
The subproblems for sequence alignment
Let
$$OPT(i, j) = \text{minimum cost of an alignment between } x_1 \ldots x_i \text{ and } y_1 \ldots y_j$$
We want OPT(m, n). From Fact 4,
1. If $(m, n) \in L$, we pay $\alpha_{x_m y_n} + OPT(m-1, n-1)$.
2. If $x_m$ is not matched, we pay $\delta + OPT(m-1, n)$.
3. If $y_n$ is not matched, we pay $\delta + OPT(m, n-1)$.
How do we decide which of the three to use for OPT(m, n)?
The recurrence for the sequence alignment problem
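$$OPT(i, j) = \begin{cases}
j\delta, & \text{if } i = 0 \\
\min \{ \alpha_{x_i y_j} + OPT(i-1, j-1), \; \delta + OPT(i-1, j), \; \delta + OPT(i, j-1) \}, & \text{if } i, j \ge 1 \\
i\delta, & \text{if } j = 0
\end{cases}$$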
Remarks
Boundary cases: $OPT(0, j) = j\delta$ and $OPT(i, 0) = i\delta$.
Pair (i, j) appears in the optimal alignment for subproblem $x_1 \ldots x_i$, $y_1 \ldots y_j$ if and only if the minimum is achieved by the first of the three values inside the min computation.
Computing the cost of the optimal alignment

M is an $(m + 1) \times (n + 1)$ dynamic programming table.
Fill in M so that all subproblems needed for entry M[i, j] have already been computed when we compute M[i, j] (e.g., column-by-column).
[Figure: the table M, with rows 0, i-1, i, m and columns 0, j-1, j marked.]
Pseudocode
SequenceAlignment(X, Y)
  Initialize M[i, 0] to $i\delta$
  Initialize M[0, j] to $j\delta$
  for j = 1 to n do
    for i = 1 to m do
      $M[i, j] = \min \{ \alpha_{x_i y_j} + M[i-1, j-1], \; \delta + M[i-1, j], \; \delta + M[i, j-1] \}$
    end for
  end for
  return M[m, n]
Running time?
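As a concrete counterpart to the pseudocode, here is a minimal Python sketch that fills the table column by column. The gap penalty `delta` is a number and the mismatch cost is a function `alpha(p, q)`. These conventions, the function names, and the sample costs are assumptions for illustration.

```python
def sequence_alignment(X, Y, delta, alpha):
    """Return the (m+1) x (n+1) table M with M[i][j] = OPT(i, j)."""
    m, n = len(X), len(Y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta                    # x_1..x_i aligned with the empty string
    for j in range(n + 1):
        M[0][j] = j * delta                    # empty string aligned with y_1..y_j
    for j in range(1, n + 1):                  # column by column
        for i in range(1, m + 1):
            M[i][j] = min(
                alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1],   # match x_i with y_j
                delta + M[i - 1][j],                           # x_i is not matched
                delta + M[i][j - 1],                           # y_j is not matched
            )
    return M


def alpha(p, q):
    # Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise.
    return 0 if p == q else 3


M = sequence_alignment("ocurrance", "occurrence", 2, alpha)
best = M[9][10]   # cost of the optimal alignment under these made-up costs
```

Each of the (m + 1)(n + 1) entries is computed with a constant number of operations.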
Reconstructing the optimal alignment

Given M, we can reconstruct the optimal alignment as follows.
TraceAlignment(i, j)
  if i == 0 or j == 0 then return
  else
    if $M[i, j] == \alpha_{x_i y_j} + M[i-1, j-1]$ then
      TraceAlignment(i - 1, j - 1)
      Output (i, j)
    else
      if $M[i, j] == \delta + M[i-1, j]$ then TraceAlignment(i - 1, j)
      else TraceAlignment(i, j - 1)
      end if
    end if
  end if
Initial call: TraceAlignment(m, n)
Running time?
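The traceback can be sketched in Python as an iterative loop. To keep the example self-contained it repeats the table fill. It assumes integer costs so that the equality tests are exact. The names `fill_table`, `trace_alignment` and `alpha` are hypothetical.

```python
def fill_table(X, Y, delta, alpha):
    """Fill the dynamic programming table M for strings X and Y."""
    m, n = len(X), len(Y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta
    for j in range(n + 1):
        M[0][j] = j * delta
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            M[i][j] = min(alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1],
                          delta + M[i - 1][j],
                          delta + M[i][j - 1])
    return M


def trace_alignment(X, Y, M, delta, alpha):
    """Walk back from M[m][n] and return the matched pairs (i, j), 1-indexed."""
    i, j = len(X), len(Y)
    pairs = []
    while i > 0 and j > 0:
        if M[i][j] == alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1]:
            pairs.append((i, j))   # x_i is matched with y_j
            i, j = i - 1, j - 1
        elif M[i][j] == delta + M[i - 1][j]:
            i -= 1                 # x_i is not matched
        else:
            j -= 1                 # y_j is not matched
    pairs.reverse()
    return pairs


def alpha(p, q):
    # Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise.
    return 0 if p == q else 3


M = fill_table("GCAT", "CATG", 1, alpha)
pairs = trace_alignment("GCAT", "CATG", M, 1, alpha)
```

Every iteration decreases i or j (or both), so the walk takes at most m + n steps once M is available.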
Resources used by dynamic programming algorithm
Time: O(mn)
Space: O(mn)
English words: $m, n \le 10$
Computational biology: m = n = 100000
  Time: 10 billion operations
  Space: 10 GB table!
Can we avoid using quadratic space while maintaining quadratic running time?
Using only O(m + n) space
1. First, suppose we are only interested in the cost of the optimal alignment.
   Easy: keep a table M with 2 columns, hence 2(m + 1) entries.
2. What if we want the optimal alignment too?
   No longer possible with the two-column table alone: the traceback needs entries from columns that have already been discarded.
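The two-column idea for computing only the cost can be sketched as follows: column j of the table depends only on column j - 1, so it suffices to keep the previous and the current column. As before, `delta` as a number and `alpha(p, q)` as a function are illustrative conventions, and the function name is hypothetical.

```python
def alignment_cost_two_columns(X, Y, delta, alpha):
    """Cost of the optimal alignment, storing only two columns of the table."""
    m, n = len(X), len(Y)
    prev = [i * delta for i in range(m + 1)]    # column 0: M[i][0] = i * delta
    for j in range(1, n + 1):
        cur = [j * delta] + [0] * m             # M[0][j] = j * delta
        for i in range(1, m + 1):
            cur[i] = min(
                alpha(X[i - 1], Y[j - 1]) + prev[i - 1],   # uses M[i-1][j-1]
                delta + cur[i - 1],                        # uses M[i-1][j]
                delta + prev[i],                           # uses M[i][j-1]
            )
        prev = cur                              # column j becomes the previous column
    return prev[m]


def alpha(p, q):
    # Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise.
    return 0 if p == q else 3


cost = alignment_cost_two_columns("ocurrance", "occurrence", 2, alpha)
```

The running time is still O(mn), while the working storage is 2(m + 1) entries.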
