
Computing Longest Common Substrings Using Suffix Arrays

Maxim A. Babenko, Tatiana A. Starikovskaya


Moscow State University

Computer Science in Russia, 2008

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU) Computing LCS Using Suffix Arrays

CSR 2008



Outline

1. Problem Definition
2. Suffix Arrays
3. The Algorithm
4. Conclusions


Part 1: Problem Definition


LCS

Problem (LCS, Longest Common Substring): Given a collection of N strings A = {α_1, ..., α_N} and an integer K (2 ≤ K ≤ N), find the longest string that is a substring of at least K strings in A.

Tools: Suffix Arrays
Time and Space: Linear and alphabet-independent
Model of Computation: RAM


Part 2: Suffix Arrays


Useful Definitions

Definition (Suffix): Let α = α_1 α_2 ... α_n be an arbitrary string of length n. For each i (1 ≤ i ≤ n), α[i..] = α_i α_{i+1} ... α_n is a suffix of α.

Definition (Lexicographic order): Suppose we have some order on the letters of the alphabet Σ. This order can be extended in a standard way to strings over Σ: α < β iff either α is a proper prefix of β, or α[1] = β[1], ..., α[i] = β[i] and α[i + 1] < β[i + 1] for some i ≥ 0.


Suffix Arrays

Definition (Suffix Array): Let α be an arbitrary string of length n. Consider its non-empty suffixes α[1..], α[2..], ..., α[n..] and order them lexicographically. Let SA(i) denote the starting position of the suffix appearing in the i-th place (1 ≤ i ≤ n): α[SA(1)..] < α[SA(2)..] < ... < α[SA(n)..].


An Example of Suffix Array

 i   suffix         SA   sorted suffix
 1   mississippi    11   i
 2   ississippi      8   ippi
 3   ssissippi       5   issippi
 4   sissippi        2   ississippi
 5   issippi         1   mississippi
 6   ssippi         10   pi
 7   sippi           9   ppi
 8   ippi            7   sippi
 9   ppi             4   sissippi
10   pi              6   ssippi
11   i               3   ssissippi

Figure: String mississippi, its suffixes, and the corresponding suffix array.
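The definition can be checked against the mississippi example with a few lines of Python. This is only a minimal sketch: it sorts the suffixes directly, which is far slower than the linear-time constructions the talk relies on, and the function name `suffix_array` is a hypothetical helper, not something from the talk.

```python
def suffix_array(text):
    """Return the 1-based suffix array of `text` by sorting its suffixes directly.

    Naive construction for illustration only: roughly O(n^2 log n) in the
    worst case, not a linear-time algorithm.
    """
    n = len(text)
    # Sort the 0-based starting positions by the suffix that begins there.
    order = sorted(range(n), key=lambda i: text[i:])
    # Convert to the 1-based positions used in the example table.
    return [i + 1 for i in order]


print(suffix_array("mississippi"))
# -> [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```

The printed list matches the SA column of the example.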


Why Suffix Arrays?

A simple data structure containing all the necessary information.

Many simple and efficient construction algorithms exist (e.g. Kärkkäinen, Sanders [2003]) with alphabet-independent time and space complexity.


Part 3: The Algorithm


Our Main Result

Theorem: Let the total length of the strings α_1, ..., α_N be equal to L. Then the answer to the LCS problem can be computed in O(L) time and O(L) space.


LCS Example

Consider the following example with N = 3, K = 2:

α_1 = abb
α_2 = cb
α_3 = abc

Clearly, the answer is ab.
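For small inputs the answer can be confirmed by exhaustive search. The sketch below is a brute-force reference, not the linear-time algorithm of the talk; the function name and the tie-breaking rule (lexicographically smallest among the longest candidates) are assumptions made here for determinism.

```python
def longest_common_substring_bruteforce(strings, k):
    """Return a longest string that occurs in at least k of `strings`.

    Brute force over every substring; only suitable for tiny examples.
    Ties are broken by taking the lexicographically smallest candidate.
    """
    # Collect every non-empty substring of every input string.
    candidates = {
        s[i:j]
        for s in strings
        for i in range(len(s))
        for j in range(i + 1, len(s) + 1)
    }
    best = ""
    for cand in candidates:
        # Count how many input strings contain this candidate.
        count = sum(1 for s in strings if cand in s)
        if count < k:
            continue
        if len(cand) > len(best) or (len(cand) == len(best) and cand < best):
            best = cand
    return best


print(longest_common_substring_bruteforce(["abb", "cb", "abc"], 2))
# -> ab
```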


Observation

The longest common substring of K strings of our set is the longest common prefix of some suffixes of these strings.

We calculate the longest common prefix of every set of K suffixes taken from K different strings and take the longest one; the latter is the answer to the LCS problem.


Preprocessing: Step 1

Combine the strings in A as follows: α = α_1 $_1 α_2 $_2 ... α_N $_N. Here the $_i are special symbols (sentinels) that are pairwise distinct and lexicographically smaller than all symbols of the initial alphabet.

Example: α = abb$_1cb$_2abc$_3
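One way to realize this step in code is to encode symbols as integers so that the sentinels are automatically distinct and smallest. The sketch below is an assumption about representation, not the authors' implementation: the helper name `concatenate_with_sentinels`, the integer encoding, and the parallel `color` list (recording which input string each position came from) are all introduced here for illustration.

```python
def concatenate_with_sentinels(strings):
    """Build the combined sequence and a parallel list of string indices.

    Encoding (an illustrative choice): the sentinel after the j-th string
    (0-based) is the integer j, and an ordinary character c becomes
    len(strings) + ord(c). Hence all sentinels are pairwise distinct and
    smaller than every ordinary symbol.
    color[p] is the 0-based index of the string that position p belongs to.
    """
    n = len(strings)
    seq, color = [], []
    for j, s in enumerate(strings):
        for c in s:
            seq.append(n + ord(c))
            color.append(j)
        seq.append(j)      # sentinel that terminates string j
        color.append(j)
    return seq, color


seq, color = concatenate_with_sentinels(["abb", "cb", "abc"])
print(color)
# -> [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Because the sentinels are distinct, no common prefix of two different suffixes can extend across a sentinel, so common prefixes never span two input strings.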


Preprocessing: Step 2

Definition (Longest Common Prefix (LCP) array): The array containing the lengths of the longest common prefixes for every pair of consecutive suffixes (w.r.t. lexicographic order). The LCP array can be constructed in linear time and space.

We construct the suffix array and the LCP array for α.
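The talk does not say which linear-time LCP construction is used. As one possibility, the sketch below follows the well-known approach of Kasai et al., which computes all values in O(n) time given the suffix array. The function name `lcp_array`, the 0-based indexing, and the use of control characters as stand-ins for the sentinels in the demo are assumptions made for this example; the suffix array in the demo is built by naive sorting.

```python
def lcp_array(seq, sa):
    """Kasai-style LCP computation in O(n) time.

    `sa` is the 0-based suffix array of `seq`. Returns a list `lcp` of
    length len(seq) - 1, where lcp[r] is the length of the longest common
    prefix of the suffixes at ranks r and r + 1.
    """
    n = len(seq)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    lcp = [0] * max(n - 1, 0)
    h = 0  # length of the current match, reused between iterations
    for p in range(n):
        r = rank[p]
        if r == n - 1:
            h = 0          # the last suffix in sorted order has no successor
            continue
        q = sa[r + 1]      # start of the next suffix in sorted order
        while p + h < n and q + h < n and seq[p + h] == seq[q + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1         # dropping the first character loses at most one match
    return lcp


# Demo: "\x01", "\x02", "\x03" stand in for the three distinct sentinels.
text = "abb\x01cb\x02abc\x03"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive, for the demo only
print([p + 1 for p in sa])   # -> [4, 7, 11, 1, 8, 3, 6, 2, 9, 10, 5]
print(lcp_array(text, sa))   # -> [0, 0, 0, 2, 0, 1, 1, 1, 0, 1]
```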


Step 2. Example of SA and LCP

String: α = abb$_1cb$_2abc$_3

 i   suffix α[i..]         SA[i]   LCP[i]   sorted suffix α[SA[i]..]
 1   abb$_1cb$_2abc$_3       4       0      $_1cb$_2abc$_3
 2   bb$_1cb$_2abc$_3        7       0      $_2abc$_3
 3   b$_1cb$_2abc$_3        11       0      $_3
 4   $_1cb$_2abc$_3          1       2      abb$_1cb$_2abc$_3
 5   cb$_2abc$_3             8       0      abc$_3
 6   b$_2abc$_3              3       1      b$_1cb$_2abc$_3
 7   $_2abc$_3               6       1      b$_2abc$_3
 8   abc$_3                  2       1      bb$_1cb$_2abc$_3
 9   bc$_3                   9       0      bc$_3
10   c$_3                   10       1      c$_3
11   $_3                     5              cb$_2abc$_3

LCP[i] is the length of the longest common prefix of the sorted suffixes in rows i and i + 1.

Further Ideas
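The longest common prefix of suffixes of K different strings in A is the longest common prefix of suffixes of K different colors in α, where the color of a suffix of α is the index j of the string α_j in which it starts.

Consider K suffixes at positions i_1, ..., i_K of the suffix array and assume that α[SA[i_1]..] < α[SA[i_2]..] < ... < α[SA[i_K]..] (that is, i_1 < i_2 < ... < i_K). The length of the longest common prefix of these K suffixes is equal to the minimum of LCP[i_1], LCP[i_1 + 1], ..., LCP[i_K − 1].

Example:

SA:    4   7  11   1   8   3   6   2   9  10   5
LCP:   0   0   0   2   0   1   1   1   0   1

Suffixes: abb$_1cb$_2abc$_3 (SA = 1) and abc$_3 (SA = 8) are adjacent in sorted order, have colors 1 and 3, and LCP = 2; their common prefix is ab.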
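The idea above suggests scanning the suffix array with a window that always contains suffixes of at least K different colors, while tracking the minimum LCP value inside the window. The following is a minimal sketch under stated assumptions, not necessarily the authors' exact procedure: the function name, the integer encoding of sentinels, and the monotonic-deque window minimum are choices made here, and the suffix array and LCP array are built naively, so only the final scan is linear.

```python
from collections import deque


def longest_common_substring(strings, k):
    """Return a longest string that is a substring of at least k of `strings`."""
    n_str = len(strings)
    if not 2 <= k <= n_str:
        raise ValueError("k must satisfy 2 <= k <= len(strings)")

    # Step 1: concatenate; sentinel j is the integer j, letters are shifted up.
    seq, color = [], []
    for j, s in enumerate(strings):
        seq.extend(n_str + ord(c) for c in s)
        seq.append(j)
        color.extend([j] * (len(s) + 1))
    n = len(seq)

    # Step 2 (naive stand-ins for the linear-time constructions).
    sa = sorted(range(n), key=lambda i: seq[i:])

    def common_prefix(p, q):
        h = 0
        while p + h < n and q + h < n and seq[p + h] == seq[q + h]:
            h += 1
        return h

    lcp = [common_prefix(sa[r], sa[r + 1]) for r in range(n - 1)]

    # Step 3: slide a window [left, right] over ranks. For each right, keep
    # left as large as possible while the window still holds >= k colors.
    count = {}              # color -> number of occurrences in the window
    window_min = deque()    # indices into lcp, with increasing lcp values
    best_len, best_pos = 0, 0
    left = 0
    for right in range(n):
        c = color[sa[right]]
        count[c] = count.get(c, 0) + 1
        if right > 0:
            while window_min and lcp[window_min[-1]] >= lcp[right - 1]:
                window_min.pop()
            window_min.append(right - 1)
        while len(count) >= k:
            cl = color[sa[left]]
            if count[cl] == 1 and len(count) == k:
                break       # dropping `left` would lose a needed color
            count[cl] -= 1
            if count[cl] == 0:
                del count[cl]
            left += 1
        if len(count) >= k:
            while window_min[0] < left:
                window_min.popleft()
            current = lcp[window_min[0]]    # min of lcp[left .. right - 1]
            if current > best_len:
                best_len, best_pos = current, sa[right]

    # Decode; a common prefix never contains a sentinel.
    return "".join(chr(x - n_str) for x in seq[best_pos:best_pos + best_len])


print(longest_common_substring(["abb", "cb", "abc"], 2))
# -> ab
```

Shrinking the window as far as possible is safe because removing suffixes from a window can only increase the minimum LCP value over it.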


Extensions

Problem: Given a collection of N strings A = {α_1, ..., α_N}, for each K (2 ≤ K ≤ N) find the longest string that is a substring of at least K strings in A.

Theorem: Let the total length of the strings α_1, ..., α_N be equal to L. Then the answer to the above problem can be computed in O(L log L) time and O(L) space.


Part 4: Conclusions


Open Problem

How to compute an inexact longest common substring?


Acknowledgements

The authors are thankful to the students of the Department of Mathematical Logic and Theory of Algorithms and to Maxim Ushakov and Victor Khimenko (Google Moscow) for many helpful discussions.


The End :-)

Thank you for your attention. Questions are welcome!
