
Algorithms for Data Science

CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Thursday, October 8, 2015

Outline

1 Recap

Applying the DP principle

2 Sequence alignment

1 Recap

Review of the last lecture

Dynamic Programming
The problem: data segmentation

A mathematical formulation of the problem

An exponential-time brute-force approach

An exponential-time recursive algorithm

A quadratic DP algorithm

Linear least squares fitting


A foundational problem in statistics: find a line of best fit
through some data points.

Linear least squares fitting

Input: a set $P$ of $n$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$; we assume $x_1 < x_2 < \ldots < x_n$.

Output: the line $L$ defined as $y = ax + b$ that minimizes the error

$$\mathrm{err}(L, P) = \sum_{i=1}^{n} (y_i - a x_i - b)^2 \tag{1}$$

Linear least squares fitting: solution


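Given a set $P$ of data points, we can use calculus to show that the line $L$ given by $y = ax + b$ that minimizes

$$\mathrm{err}(L, P) = \sum_{i=1}^{n} (y_i - a x_i - b)^2 \tag{2}$$

satisfies

$$a = \frac{n \sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} \tag{3}$$

$$b = \frac{\sum_i y_i - a \sum_i x_i}{n} \tag{4}$$

How fast can we compute $a$, $b$?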
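A minimal Python sketch of formulas (3) and (4) may help here. Each sum is one pass over the data, so the whole computation takes O(n) time. The names `fit_line` and `err` are illustrative, and the fallback to a horizontal line when the denominator of (3) is zero (for example, a single point) is an assumption, not something the slides specify.

```python
def fit_line(points):
    """Return (a, b) for the least-squares line y = a*x + b, using (3) and (4)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        # Assumption: with a single point (or identical x values) the slope is
        # undetermined; fall back to a horizontal line through the mean of y.
        return 0.0, sy / n
    a = (n * sxy - sx * sy) / denom   # equation (3)
    b = (sy - a * sx) / n             # equation (4)
    return a, b


def err(points, a, b):
    """Squared error of the line y = a*x + b on the points, as in (2)."""
    return sum((y - a * x - b) ** 2 for x, y in points)
```

For the three collinear points (0, 1), (1, 3), (2, 5), the sketch returns a = 2 and b = 1 with zero error.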

What if the data changes direction?

Formalizing the problem


Input: data set $P = \{p_1, \ldots, p_n\}$ of points on the plane.

A segment $S = \{p_i, p_{i+1}, \ldots, p_j\}$ is a contiguous subset of the input.

Let $\mathcal{S}$ be a partition of $P$ into $m_{\mathcal{S}}$ segments $S_1, S_2, \ldots, S_{m_{\mathcal{S}}}$.

For every segment $S_k$, use (2), (3), (4) to compute a line $L_k$ that minimizes $\mathrm{err}(L_k, S_k)$.

Let $C > 0$ be a fixed multiplier. The cost of the partition is

$$\sum_{S_k \in \mathcal{S}} \mathrm{err}(L_k, S_k) + m_{\mathcal{S}} \cdot C$$

Segmented least squares

This problem is an instance of change detection in data mining and statistics.

Input: A set $P$ of $n$ data points $p_i = (x_i, y_i)$ as before.

Output: A segmentation $\mathcal{S} = \{S_1, S_2, \ldots, S_{m_{\mathcal{S}}}\}$ of $P$ whose cost

$$\sum_{S_k \in \mathcal{S}} \mathrm{err}(L_k, S_k) + m_{\mathcal{S}} \cdot C$$

is minimum.
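To make the objective concrete, the following sketch evaluates the cost of one given segmentation. It is only an illustration: `segment_error`, `partition_cost`, and the encoding of a segmentation as a list of segment lengths are assumptions introduced here.

```python
def segment_error(points):
    """Minimum squared error of a single line fitted to points, via (3) and (4)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    denom = n * sxx - sx * sx
    # Assumption: a one-point segment is fitted by a horizontal line (error 0).
    a = (n * sxy - sx * sy) / denom if denom else 0.0
    b = (sy - a * sx) / n
    return sum((y - a * x - b) ** 2 for x, y in points)


def partition_cost(points, segment_lengths, C):
    """Cost of splitting points into consecutive segments of the given lengths."""
    assert sum(segment_lengths) == len(points)
    total, start = 0.0, 0
    for length in segment_lengths:
        # Each segment contributes its fitting error plus the fixed charge C.
        total += segment_error(points[start:start + length]) + C
        start += length
    return total
```

On data that rises and then falls, two segments can beat one: the extra charge C is paid once more, but both fitting errors drop to zero.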

A recurrence for the optimal solution


Notation: let $e_{i,j} = \mathrm{err}(L, \{p_i, \ldots, p_j\})$, for $1 \le i \le j \le n$. Then

$$OPT(n) = \min_{1 \le i \le n} \left\{ e_{i,n} + C + OPT(i-1) \right\}.$$

If we apply the above expression recursively to remove the last segment, we obtain the recurrence

$$OPT(j) = \min_{1 \le i \le j} \left\{ e_{i,j} + C + OPT(i-1) \right\} \tag{5}$$

Remark 1.
1. We can precompute and store all $e_{i,j}$ using equations (2),
(3), (4) in $O(n^3)$ time. Can be improved to $O(n^2)$.
2. The natural recursive algorithm arising from recurrence (5)
is not efficient (think about its recursion tree!).
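The slides do not say how the O(n^2) bound for the e_{i,j} is obtained. One standard way, sketched below as an assumption, is to keep prefix sums of x, y, xy, x^2, y^2 so that the sums for any segment come from two lookups, and then to evaluate the expanded form of (2) in constant time per pair. The function name and the 1-indexed table layout are illustrative.

```python
def all_segment_errors(points):
    """Return e with e[i][j] = least-squares error of p_i..p_j (1-indexed, i <= j)."""
    n = len(points)
    # pre[k] holds the sums of (x, y, x*y, x*x, y*y) over the first k points.
    pre = [(0.0, 0.0, 0.0, 0.0, 0.0)]
    for x, y in points:
        sx, sy, sxy, sxx, syy = pre[-1]
        pre.append((sx + x, sy + y, sxy + x * y, sxx + x * x, syy + y * y))

    e = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            # Sums over p_i..p_j in O(1), as differences of prefix sums.
            sx, sy, sxy, sxx, syy = (pre[j][t] - pre[i - 1][t] for t in range(5))
            k = j - i + 1
            denom = k * sxx - sx * sx
            a = (k * sxy - sx * sy) / denom if denom else 0.0  # equation (3)
            b = (sy - a * sx) / k                              # equation (4)
            # Expansion of sum (y - a*x - b)^2 in terms of the five sums.
            val = (syy - 2 * a * sxy - 2 * b * sy
                   + a * a * sxx + 2 * a * b * sx + k * b * b)
            e[i][j] = max(val, 0.0)  # clamp tiny negative rounding errors
    return e
```

The expanded form can lose precision through cancellation on large inputs, which is why the result is clamped at zero.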

Elements of DP in segmented least squares

1. Overlapping subproblems
2. An easy-to-compute recurrence (5) for combining solutions
to the smaller subproblems into a solution to a larger
subproblem in O(n) time (once smaller subproblems have
been solved).
3. Iterative, bottom-up computations: compute the
subproblems from smallest (0 points) to largest (n points),
iteratively.
4. Small number of subproblems: we only need to solve n
subproblems.

A dynamic programming approach

$$OPT(j) = \min_{1 \le i \le j} \left\{ e_{i,j} + C + OPT(i-1) \right\}$$

The optimal solution to the subproblem on $p_1, \ldots, p_j$ contains optimal solutions to smaller subproblems.

Recurrence (5) provides an ordering of the subproblems from smaller to larger, with the subproblem of size 0 being the smallest and the subproblem of size $n$ the largest.

There are $n + 1$ subproblems in total. Solving the $j$-th subproblem requires $\Theta(j) = O(n)$ time. The overall running time is $O(n^2)$.

Boundary conditions: $OPT(0) = 0$.

Segment $p_k, \ldots, p_j$ appears in the optimal solution only if the minimum in the expression above is achieved for $i = k$.

An iterative algorithm for segmented least squares


Let M be an array of n + 1 entries, indexed 0, ..., n. M[i] stores the cost of the optimal segmentation of the first i data points.

SegmentedLS(n, P)
    M[0] = 0
    for all pairs i ≤ j do
        Compute e_{i,j} for segment p_i, ..., p_j using (2), (3), (4)
    end for
    for j = 1 to n do
        M[j] = min over 1 ≤ i ≤ j of { e_{i,j} + C + M[i - 1] }
    end for
    return M[n]

Running time: the time required to fill in the dynamic programming array M is $O(n^3) + O(n^2)$, that is, $O(n^3)$ to compute all $e_{i,j}$ and $O(n^2)$ for the loop over j. It can be brought down to $O(n^2)$.
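The second loop of SegmentedLS can be sketched in Python as follows. The sketch assumes the errors are already available in a table `e` with `e[i][j]` for 1 ≤ i ≤ j ≤ n (row and column 0 unused); the function name and this table layout are illustrative choices.

```python
def segmented_ls(e, n, C):
    """Fill M[0..n] using recurrence (5); e[i][j] is the error of segment p_i..p_j."""
    M = [0.0] * (n + 1)  # M[0] = 0 is the boundary condition
    for j in range(1, n + 1):
        # Try every possible start i of the last segment p_i..p_j.
        M[j] = min(e[i][j] + C + M[i - 1] for i in range(1, j + 1))
    return M
```

The function returns the whole array rather than only M[n], since the traceback step needs every entry.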

Reconstructing an optimal segmentation


Suppose we want the optimal solution in addition to its value, that is, the actual segmentation that achieves the minimum cost M[n].

We can trace back through the dynamic programming array M to compute the optimal segmentation.

Initial call: OPTSegmentation(n)

OPTSegmentation(j)
    if j == 0 then return
    else
        Find 1 ≤ i ≤ j such that M[j] = e_{i,j} + C + M[i - 1]
        OPTSegmentation(i - 1)
        Output segment {p_i, ..., p_j}
    end if
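A possible Python version of the traceback is sketched below. It uses a loop in place of the recursion and compares floating-point values up to a small tolerance; both choices, as well as the name `opt_segmentation` and the parameter `tol`, are assumptions made for this illustration. The inputs `M` and `e` are the filled DP array and the 1-indexed error table.

```python
def opt_segmentation(M, e, C, n, tol=1e-9):
    """Return the segments (i, j) of an optimal segmentation, left to right."""
    segments = []
    j = n
    while j > 0:
        # Find a start i of the last segment: one that attains the minimum in (5).
        i = next(i for i in range(1, j + 1)
                 if abs(M[j] - (e[i][j] + C + M[i - 1])) <= tol)
        segments.append((i, j))
        j = i - 1  # continue with the prefix p_1..p_{i-1}
    segments.reverse()
    return segments
```

Each step scans at most j candidates, so the traceback adds O(n^2) time in the worst case on top of filling M.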

Obtaining efficient algorithms using DP

1. Optimal substructure: the optimal solution to the problem
contains optimal solutions to the subproblems.
2. A recurrence for the overall optimal solution in terms of
optimal solutions to appropriate subproblems. The
recurrence should provide a natural ordering of the
subproblems from smaller to larger and require polynomial
work for combining solutions to the subproblems.
3. Iterative, bottom-up computation of subproblems, from
smaller to larger.
4. Small number of subproblems (polynomial in n).

Dynamic programming vs Divide & Conquer

Both combine solutions to subproblems to generate the overall solution.

However, divide and conquer starts with a large problem and divides it into small pieces.

Dynamic programming, in contrast, works from the bottom up, solving the smallest subproblems first and building optimal solutions to steadily larger problems.

2 Sequence alignment

String similarity

This problem arises when comparing strings.

Example: consider an online dictionary.

Input: a word, e.g., ocurrance

Output: did you mean occurrence?

Similarity: intuitively, two words are similar if we can almost line them up by using gaps and mismatches.

Aligning strings using gaps and mismatches

We can align ocurrance and occurrence using

one gap and one mismatch

    o   c   -   u   r   r   a   n   c   e
    o   c   c   u   r   r   e   n   c   e

or, three gaps

    o   -   c   u   r   r   a   -   n   c   e
    o   c   c   u   r   r   -   e   n   c   e

Strings in biology

Similarity of English words is rather intuitive.

Determining similarity of biological strings is a central computational problem for molecular biologists.

Chromosomes again: an organism's genome consists of chromosomes (giant linear DNA molecules).

We may think of a chromosome as an enormous linear tape containing a string over the alphabet {A, C, G, T}.

The string encodes instructions for building protein molecules.

Why similarity?

Why are we interested in similarity of biological strings?

Roughly speaking, the sequence of symbols in an organism's genome determines the properties of the organism.

So similarity can guide decisions about biological experiments.

How do we define similarity between two strings?

Similarity based on the notion of lining up two strings

Informally, an alignment between two strings tells us which pairs of positions will be lined up with one another.

Example: X = GCAT, Y = CATG

    x1  x2  x3  x4
    G   C   A   T   -
    -   C   A   T   G
        y1  y2  y3  y4

Then {(2, 1), (3, 2), (4, 3)} is an alignment of X and Y: these are the pairs of positions in X, Y that are aligned (matched).

Definition of alignment of two strings


An alignment $L$ of $X = x_1 \ldots x_m$, $Y = y_1 \ldots y_n$ is a set of ordered pairs of indices $(i, j)$ with $i \in [1, m]$, $j \in [1, n]$ such that the following two properties hold:

P1. every $i \in [1, m]$ and every $j \in [1, n]$ appears at most once;
P2. pairs do not cross: if $(i, j), (i', j') \in L$ and $i < i'$, then $j < j'$.

Example: X = GCAT, Y = CATG

    x1  x2  x3  x4
    G   C   A   T   -
    -   C   A   T   G
        y1  y2  y3  y4

1. {(2, 1), (3, 2), (4, 3)} is an alignment; but
2. {(2, 1), (3, 2), (4, 3), (1, 4)} is not an alignment (violates P2).
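Properties P1 and P2 can be checked mechanically. The sketch below is one way to do so, assuming pairs are 1-indexed tuples as in the definition; the function name `is_alignment` is introduced here for illustration.

```python
def is_alignment(pairs, m, n):
    """Check properties P1 and P2 for a set of 1-indexed pairs (i, j)."""
    pairs = sorted(pairs)
    # Every index must lie in range: i in [1, m], j in [1, n].
    if any(not (1 <= i <= m and 1 <= j <= n) for i, j in pairs):
        return False
    xs = [i for i, _ in pairs]
    ys = [j for _, j in pairs]
    # P1: no position of X or of Y is used more than once.
    if len(set(xs)) != len(xs) or len(set(ys)) != len(ys):
        return False
    # P2: once sorted by i, the j values must be strictly increasing.
    return all(ys[k] < ys[k + 1] for k in range(len(ys) - 1))
```

On the two sets from the example, the first passes and the second fails because (1, 4) crosses (2, 1).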

Cost of an alignment

Let $L$ be an alignment of $X = x_1 \ldots x_m$, $Y = y_1 \ldots y_n$.

1. Gap penalty $\delta$: there is a cost $\delta$ for every position of $X$ that is not matched in $Y$, and vice versa.
2. Mismatch cost: there is a cost $\alpha_{pq}$ for every pair of alphabet symbols $p, q$ that are matched in $L$.
   So every pair $(i, j) \in L$ incurs a cost of $\alpha_{x_i y_j}$.
   Assumption: $\alpha_{pp} = 0$ (matching a symbol with itself incurs no cost).

The cost of alignment $L$ is the sum of all the gap and the mismatch costs.

Cost of alignment in symbols

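In symbols, given alignment $L$, let

$X_i^L = 1$ if position $i$ of $X$ is not matched (and 0 otherwise),

$Y_j^L = 1$ if position $j$ of $Y$ is not matched (and 0 otherwise).

Then the cost of alignment $L$ is given by

$$\mathrm{cost}(L) = \sum_{1 \le i \le m} \delta X_i^L + \sum_{1 \le j \le n} \delta Y_j^L + \sum_{(i,j) \in L} \alpha_{x_i y_j}$$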
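The cost formula translates directly into code. In the sketch below, `delta` is the gap penalty and `alpha` is assumed to be a function returning the mismatch cost of two symbols; both the function-based encoding and the name `alignment_cost` are illustrative choices.

```python
def alignment_cost(X, Y, pairs, delta, alpha):
    """cost(L) = delta * (number of unmatched positions) + sum of mismatch costs."""
    matched_x = {i for i, _ in pairs}
    matched_y = {j for _, j in pairs}
    # Unmatched positions of X and of Y each pay the gap penalty once.
    gaps = (len(X) - len(matched_x)) + (len(Y) - len(matched_y))
    # Pairs are 1-indexed, Python strings are 0-indexed.
    mismatch = sum(alpha(X[i - 1], Y[j - 1]) for i, j in pairs)
    return delta * gaps + mismatch
```

For instance, with `delta = 2` and a unit mismatch cost, the one-gap, one-mismatch alignment of ocurrance and occurrence costs 3, while the three-gap alignment costs 6.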

Examples

Example 1.
Let L1 be the alignment shown below.

    x1  x2      x3  x4  x5  x6  x7  x8  x9
    o   c   -   u   r   r   a   n   c   e
    o   c   c   u   r   r   e   n   c   e
    y1  y2  y3  y4  y5  y6  y7  y8  y9  y10

L1 = {(1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)}

$\mathrm{cost}(L_1) = \delta + \alpha_{ae}$

(This is $\delta Y_3^{L_1} + \alpha_{x_6 y_7}$.)

Examples

Example 2.
Let L2 be the alignment shown below.

    x1      x2  x3  x4  x5  x6      x7  x8  x9
    o   -   c   u   r   r   a   -   n   c   e
    o   c   c   u   r   r   -   e   n   c   e
    y1  y2  y3  y4  y5  y6      y7  y8  y9  y10

L2 = {(1, 1), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8), (8, 9), (9, 10)}

$\mathrm{cost}(L_2) = 3\delta$

(This is $\delta X_6^{L_2} + \delta Y_2^{L_2} + \delta Y_7^{L_2}$.)

Examples

Example 3.
Let L3, L4 be the alignments shown below.

L3:

    x1  x2  x3  x4
    G   C   A   T
    C   A   T   G
    y1  y2  y3  y4

L4:

    x1  x2  x3  x4
    G   C   A   T   -
    -   C   A   T   G
        y1  y2  y3  y4

L3 = {(1, 1), (2, 2), (3, 3), (4, 4)}

$\mathrm{cost}(L_3) = \alpha_{GC} + \alpha_{CA} + \alpha_{AT} + \alpha_{TG}$

L4 = {(2, 1), (3, 2), (4, 3)}

$\mathrm{cost}(L_4) = 2\delta$

The sequence alignment problem

Input:

two strings $X$, $Y$ consisting of $m$, $n$ symbols respectively; each symbol is from some alphabet $\Sigma$

the gap penalty $\delta$

the mismatch costs $\{\alpha_{pq}\}$ for every pair $(p, q) \in \Sigma^2$

Output: the alignment $L$ of minimum cost.

Towards a recursive solution

Claim 1.
Let $L$ be the optimal alignment. Then either
1. the last two symbols $x_m$, $y_n$ of $X$, $Y$ are matched in $L$, hence the pair $(m, n) \in L$; or
2. $x_m$, $y_n$ are not matched in $L$, hence $(m, n) \notin L$.
   In this case, at least one of $x_m$, $y_n$ is not matched in $L$, hence at least one of $m$, $n$ does not appear in $L$.

Proof of Claim 1

By contradiction.
Suppose $(m, n) \notin L$ but $x_m$ and $y_n$ are both matched in $L$.
That is,
1. $x_m$ is matched with $y_j$ for some $j < n$, hence $(m, j) \in L$;
2. $y_n$ is matched with $x_i$ for some $i < m$, hence $(i, n) \in L$.
Since $i < m$ but $n > j$, the pairs $(i, n)$ and $(m, j)$ cross, which violates P2. Hence $L$ is not an alignment.

Rewriting Claim 1

The following equivalent way of stating Claim 1 will allow us to easily derive a recurrence.

Fact 4.
In an optimal alignment $L$, at least one of the following is true:
1. $(m, n) \in L$; or
2. $x_m$ is not matched; or
3. $y_n$ is not matched.

The subproblems for sequence alignment

Let

$$OPT(i, j) = \text{minimum cost of an alignment between } x_1 \ldots x_i \text{ and } y_1 \ldots y_j$$

We want $OPT(m, n)$. From Fact 4,

1. If $(m, n) \in L$, we pay $\alpha_{x_m y_n} + OPT(m-1, n-1)$.
2. If $x_m$ is not matched, we pay $\delta + OPT(m-1, n)$.
3. If $y_n$ is not matched, we pay $\delta + OPT(m, n-1)$.

How do we decide which of the three to use for $OPT(m, n)$?

The recurrence for the sequence alignment problem

$$OPT(i, j) = \begin{cases} j\delta & \text{if } i = 0 \\ \min\left\{ \alpha_{x_i y_j} + OPT(i-1, j-1),\; \delta + OPT(i-1, j),\; \delta + OPT(i, j-1) \right\} & \text{if } i, j \ge 1 \\ i\delta & \text{if } j = 0 \end{cases}$$

Remarks

Boundary cases: $OPT(0, j) = j\delta$ and $OPT(i, 0) = i\delta$.

Pair $(i, j)$ appears in the optimal alignment for subproblem $x_1 \ldots x_i$, $y_1 \ldots y_j$ if and only if the minimum is achieved by the first of the three values inside the min computation.
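The recurrence can be evaluated directly if each OPT(i, j) is computed only once. The sketch below does this top-down with memoization via `functools.lru_cache`; this is a swapped-in illustration of the recurrence, not the iterative table-filling scheme the slides use. As before, `alpha` is assumed to be a function on pairs of symbols, and recursion depth limits this version to short strings.

```python
from functools import lru_cache


def opt_cost(X, Y, delta, alpha):
    """Minimum alignment cost of X and Y, by memoized evaluation of the recurrence."""

    @lru_cache(maxsize=None)
    def opt(i, j):
        if i == 0:
            return j * delta          # boundary: all of y_1..y_j unmatched
        if j == 0:
            return i * delta          # boundary: all of x_1..x_i unmatched
        return min(
            alpha(X[i - 1], Y[j - 1]) + opt(i - 1, j - 1),  # (i, j) matched
            delta + opt(i - 1, j),                          # x_i unmatched
            delta + opt(i, j - 1),                          # y_j unmatched
        )

    return opt(len(X), len(Y))
```

With unit mismatch cost and `delta = 1`, the strings GCAT and CATG have optimal cost 2 (two gaps); raising `delta` to 3 makes the four-mismatch alignment cheaper, with cost 4.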

Computing the cost of the optimal alignment


M is an $(m + 1) \times (n + 1)$ dynamic programming table.

Fill in M so that all subproblems needed for entry M[i, j] have already been computed when we compute M[i, j] (e.g., column-by-column).

[Figure: the table M, with row labels 0, i-1, i, m and column labels 0, j-1, j.]

Pseudocode

SequenceAlignment(X, Y)
    Initialize M[i, 0] to iδ
    Initialize M[0, j] to jδ
    for j = 1 to n do
        for i = 1 to m do
            M[i, j] = min { α_{x_i y_j} + M[i - 1, j - 1],
                            δ + M[i - 1, j],
                            δ + M[i, j - 1] }
        end for
    end for
    return M[m, n]

Running time?
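A Python rendering of SequenceAlignment is sketched below, filling the table column by column as in the pseudocode. The function returns the whole table so that it can be reused for the traceback; the name `sequence_alignment` and the use of a function `alpha` for mismatch costs are assumptions of this sketch. The two nested loops do constant work per entry, which gives the O(mn) running time.

```python
def sequence_alignment(X, Y, delta, alpha):
    """Return the (m+1) x (n+1) table M; M[m][n] is the optimal alignment cost."""
    m, n = len(X), len(Y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta            # boundary: x_1..x_i against the empty string
    for j in range(n + 1):
        M[0][j] = j * delta            # boundary: empty string against y_1..y_j
    for j in range(1, n + 1):          # column by column
        for i in range(1, m + 1):
            M[i][j] = min(
                alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1],
                delta + M[i - 1][j],
                delta + M[i][j - 1],
            )
    return M
```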

Reconstructing the optimal alignment


Given M, we can reconstruct the optimal alignment as follows.

TraceAlignment(i, j)
    if i == 0 or j == 0 then return
    else
        if M[i, j] == α_{x_i y_j} + M[i - 1, j - 1] then
            TraceAlignment(i - 1, j - 1)
            Output (i, j)
        else
            if M[i, j] == δ + M[i - 1, j] then TraceAlignment(i - 1, j)
            else TraceAlignment(i, j - 1)
            end if
        end if
    end if

Initial call: TraceAlignment(m, n)

Running time?
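The traceback can be sketched in Python as a loop that walks from M[m][n] back towards row 0 or column 0. Each step decreases i or j (or both), so it takes at most m + n steps. The helper `fill_table` is included only to keep the example self-contained; both function names are illustrative, and the exact `==` comparisons assume integer (or otherwise exactly representable) costs.

```python
def fill_table(X, Y, delta, alpha):
    """Build the DP table M for aligning X and Y."""
    m, n = len(X), len(Y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta
    for j in range(n + 1):
        M[0][j] = j * delta
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            M[i][j] = min(alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1],
                          delta + M[i - 1][j],
                          delta + M[i][j - 1])
    return M


def trace_alignment(M, X, Y, delta, alpha):
    """Return the matched pairs (i, j), 1-indexed, of one optimal alignment."""
    i, j = len(X), len(Y)
    pairs = []
    while i > 0 and j > 0:
        if M[i][j] == alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1]:
            pairs.append((i, j))   # x_i is matched with y_j
            i, j = i - 1, j - 1
        elif M[i][j] == delta + M[i - 1][j]:
            i -= 1                 # x_i is left unmatched
        else:
            j -= 1                 # y_j is left unmatched
    pairs.reverse()
    return pairs
```

On X = GCAT, Y = CATG with unit mismatch cost and `delta = 1`, the trace recovers the alignment {(2, 1), (3, 2), (4, 3)} from the earlier example.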

Resources used by dynamic programming algorithm

Time: $O(mn)$

Space: $O(mn)$

English words: $m, n \le 10$

Computational biology: $m = n = 100000$
    Time: 10 billion operations
    Space: 10 GB table!

Can we avoid using quadratic space while maintaining quadratic running time?

Using only O(m + n) space

1. First, suppose we are only interested in the cost of the optimal alignment.
   Easy: keep a table M with 2 columns, hence 2(m + 1) entries.
2. What if we want the optimal alignment too?
   No longer possible with the two-column table alone: tracing back through M needs the entries of every column.
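The two-column idea from item 1 can be sketched as follows: column j of M depends only on column j - 1, so it is enough to keep the previous and the current column. The function name is illustrative and `alpha` is again assumed to be a function on pairs of symbols. This version returns only the cost, not the alignment.

```python
def alignment_cost_two_columns(X, Y, delta, alpha):
    """Optimal alignment cost using two columns of length m + 1."""
    m, n = len(X), len(Y)
    prev = [i * delta for i in range(m + 1)]     # column 0 of M
    for j in range(1, n + 1):
        cur = [j * delta] + [0] * m              # cur[0] = M[0, j]
        for i in range(1, m + 1):
            cur[i] = min(
                alpha(X[i - 1], Y[j - 1]) + prev[i - 1],  # M[i-1, j-1]
                delta + cur[i - 1],                       # M[i-1, j]
                delta + prev[i],                          # M[i, j-1]
            )
        prev = cur                                # column j becomes the previous one
    return prev[m]
```

The running time is still O(mn); only the space drops to O(m) for the table, plus the input strings.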
