Received on June 21, 2003; revised on January 29, 2004; accepted on February 8, 2004
Advance Access publication April 8, 2004
2122 Bioinformatics 20(13) © Oxford University Press 2004; all rights reserved.
Speeding up whole-genome alignment
insert all substrings of a certain length l. The value of l varies between the tools and for different applications (e.g. BLAST uses l = 11 for nucleotides and l = 3 for proteins). The tools start by finding exactly matching substrings (known as seeds) of length l using this hash table. In a second phase, the seeds are extended in both directions, and combined, if possible, in order to find better alignments. The main difference between the tools is that they use different seed lengths and seed extension strategies. Current hash-table-based search tools handle short queries well, but become very inefficient, in terms of both time and space, for long queries.
A number of homology search tools are based on suffix trees and derivatives, see Gusfield (1997). These include MUMmer (Delcher et al., 1999), QUASAR (Burkhardt et al., 1999), REPuter (Kurtz and Schleiermacher, 1999) and AVID (Bray et al., 2003). QUASAR builds a suffix array on one of the

which we call frequency space. This transformation allows for efficient computation of an upper bound for the alignment score between the substrings represented by these points in the frequency space. If the score for two points in the frequency space is lower than a user-defined cutoff value, the actual alignment score of the corresponding substrings is also lower than the cutoff value; hence, the costly process of aligning the strings can be avoided.
In this section, we first describe the transformation from string space to integer space. Second, we explain the construction of the F-index, a data structure which groups nearby points, thereby allowing for further efficiency improvement. Third, we present an algorithm for computing an upper bound for the alignment score in frequency space. Finally, we give an algorithm for fast comparison of a set of points (corresponding to the substrings of a string) to an F-index built on
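The transformation from string space to integer space and the grouping of nearby points can be illustrated with a minimal sketch. It assumes a nucleotide alphabet, a sliding window of length w, and boxes that tightly enclose a fixed number of consecutive frequency vectors (the static box capacity used in the experiments). The function names, the `ALPHABET` constant and the tuple representation of a box are hypothetical choices made for this illustration, not the authors' implementation.

```python
from typing import List, Tuple

ALPHABET = "ACGT"  # assumed alphabet; other letters (e.g. N) are not handled here

Vector = Tuple[int, ...]
Box = Tuple[Vector, Vector]  # (lower corner, upper corner)


def frequency_vectors(s: str, w: int) -> List[Vector]:
    """Return the letter-count vector of every length-w window of s."""
    if w <= 0 or w > len(s):
        return []
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    counts = [0] * len(ALPHABET)
    for ch in s[:w]:
        counts[index[ch]] += 1
    vectors = [tuple(counts)]
    # Sliding the window by one position changes at most two counts.
    for pos in range(w, len(s)):
        counts[index[s[pos - w]]] -= 1
        counts[index[s[pos]]] += 1
        vectors.append(tuple(counts))
    return vectors


def build_boxes(vectors: List[Vector], box_capacity: int) -> List[Box]:
    """Enclose each run of box_capacity consecutive vectors in a tight box."""
    boxes = []
    for start in range(0, len(vectors), box_capacity):
        group = vectors[start:start + box_capacity]
        low = tuple(min(column) for column in zip(*group))
        high = tuple(max(column) for column in zip(*group))
        boxes.append((low, high))
    return boxes


if __name__ == "__main__":
    vecs = frequency_vectors("ACGTACGG", 4)
    for box in build_boxes(vecs, 3):
        print(box)
```

Because consecutive windows differ in only one outgoing and one incoming letter, their frequency vectors lie close together, which is why grouping consecutive windows yields small boxes.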
T.Kahveci et al.
This probability can be calculated quickly by approximating the binomial distribution P(v_i) using the normal distribution with mean w r_i and variance w r_i (1 − r_i). This approximation is very accurate for large values of w, i.e. for w ≥ 10 (Ewens and Grant, 2001).
Using this formula, the probability of intersection between a box and a point can be calculated as

$$\sum_{\substack{L_i \le \min\{w,c\} \\ \sum_j L_j \le 2c+\sigma}} \; \prod_{i=1}^{|\Sigma|} P\left(|u_i - v_i| \le \frac{L_i}{2}\right) \cdot P(L_i).$$

Computing an estimation for the hit ratio according to this formula is very costly because the formula is over all possible box sizes. Therefore, we approximate it by computing the

first string must be compared with the first page of the second string. Current string alignment tools start by finding seeds within one of the strings (for instance BLAST finds exactly matching substrings of length 11).
One could simply construct a hash table on one of the strings and sequentially scan the parts of the other string that correspond to marked rows or columns. This is an improvement over other search tools since the search is restricted to the marked entries. However, if both the strings are very large, the hash table may not fit into memory, resulting in excessive random disk I/O. In order to prevent this, we iteratively cut slices from the match table by splitting it either vertically or horizontally. As the match table is split along a direction, the corresponding string is also partitioned into shorter strings. Then, we construct a hash table on the marked pages of the unsplit string, and sequentially scan the other.
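The hash-table-and-scan idea described above can be sketched as follows. This is a simplified illustration, not BLAST's actual data structure: `build_seed_index`, `find_seeds` and the `regions` parameter (half-open position ranges standing in for the marked parts of the scanned string) are hypothetical names introduced here.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_seed_index(s: str, seed_len: int) -> Dict[str, List[int]]:
    """Map every substring of length seed_len in s to its start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(s) - seed_len + 1):
        index[s[i:i + seed_len]].append(i)
    return index


def find_seeds(index: Dict[str, List[int]], q: str, seed_len: int,
               regions: Optional[List[Tuple[int, int]]] = None
               ) -> List[Tuple[int, int]]:
    """Scan q sequentially and report exact seed matches as (pos_in_s, pos_in_q).

    Only windows lying fully inside one of the given regions are looked up,
    so seeds that straddle a region boundary are missed in this sketch.
    """
    if regions is None:
        regions = [(0, len(q))]
    hits = []
    for start, end in regions:
        for j in range(start, min(end, len(q)) - seed_len + 1):
            for i in index.get(q[j:j + seed_len], ()):
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    idx = build_seed_index("ACGTACGT", 3)
    print(find_seeds(idx, "TACG", 3))
    print(find_seeds(idx, "TACG", 3, regions=[(1, 4)]))
```

The index grows with the length of the hashed string, which is the memory pressure that motivates slicing the match table so that only the marked pages of one string are hashed at a time.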
[Figure 5 plot. Vertical axis: probability of intersection; horizontal axis: window size. Curves: theoretical, human genome and E. coli, each for bc = 250, 500 and 1000.]
Fig. 5. The theoretical and experimental values for the probability of intersection of a box and a point. A box tightly encloses the frequency
vectors of bc consecutive windows of length w. The experiments are performed on the genetic code of the E.coli bacteria and the entire human
genome separately.
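The theoretical values build on the normal approximation to the binomial distribution of a letter count in a window, with mean w r_i and variance w r_i (1 − r_i). The sketch below compares the exact binomial probability with that approximation. The half-unit continuity correction, the function names and the example values w = 100 and r = 0.25 are assumptions made for this illustration; the paper does not state how the approximation is evaluated.

```python
import math


def binomial_pmf(k: int, w: int, r: float) -> float:
    """Exact probability that a letter with frequency r occurs k times in w positions."""
    return math.comb(w, k) * r ** k * (1.0 - r) ** (w - k)


def normal_cdf(x: float, mean: float, var: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def approx_pmf(k: int, w: int, r: float) -> float:
    """Normal approximation of P(v_i = k) with mean w*r and variance w*r*(1-r).

    The +/- 0.5 continuity correction is an assumption of this sketch.
    """
    mean, var = w * r, w * r * (1.0 - r)
    return normal_cdf(k + 0.5, mean, var) - normal_cdf(k - 0.5, mean, var)


if __name__ == "__main__":
    w, r = 100, 0.25
    for k in (15, 20, 25, 30, 35):
        print(k, binomial_pmf(k, w, r), approx_pmf(k, w, r))
```

Evaluating the normal form needs only two error-function calls per count, whereas the exact binomial term involves large binomial coefficients as w grows.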
[Figure 6: four panels, (a) to (d), each showing a match table with rows 1 to 7 and columns 1 to 12.]
Fig. 6. The slices determined by the MAP algorithm. We assume that the available memory can store the hash table for at most 3 pages.
(a) and (b) show vertical partitions whereas (c) and (d) show horizontal partitions.
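The slicing can be sketched in a few lines. The sketch follows the steps given for the MAP algorithm in the text (check the boundaries, count marked rows and columns, pick the split direction, grow the slice while its hash table fits in memory, continue on the remainder), but it is a simplification under stated assumptions: the match table is a small in-memory boolean matrix, memory is measured as a number of pages as in the caption above, recursion is replaced by a loop, and disk scheduling is not modelled. The names `map_slices`, `_grow` and `capacity` are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Slice = Tuple[str, int, int, List[int]]  # (mode, low, high, pages to hash)


def _grow(lo: int, limit: int, marks_of: Callable[[int], Set[int]],
          capacity: int) -> Tuple[int, List[int]]:
    """Advance the split point from lo while the pages to hash fit in memory."""
    hi, hashed = lo, set()
    while hi < limit:
        needed = hashed | marks_of(hi)
        # Always accept the first line so that the loop makes progress,
        # even if that single line already exceeds the budget.
        if len(needed) > capacity and hi > lo:
            break
        hashed = needed
        hi += 1
    return hi, sorted(hashed)


def map_slices(table: List[List[bool]], capacity: int) -> List[Slice]:
    """Cut a boolean match table into vertical or horizontal slices."""
    n_rows = len(table)
    n_cols = len(table[0]) if table else 0
    r_lo = c_lo = 0
    slices: List[Slice] = []
    while r_lo < n_rows and c_lo < n_cols:          # table not yet consumed
        rows = {r for r in range(r_lo, n_rows)
                for c in range(c_lo, n_cols) if table[r][c]}
        cols = {c for r in range(r_lo, n_rows)
                for c in range(c_lo, n_cols) if table[r][c]}
        if not rows:                                 # nothing left to align
            break
        if len(rows) < len(cols):                    # vertical split: hash rows
            marks = lambda c: {r for r in range(r_lo, n_rows) if table[r][c]}
            c_hi, hashed = _grow(c_lo, n_cols, marks, capacity)
            slices.append(("vertical", c_lo, c_hi, hashed))
            c_lo = c_hi
        else:                                        # horizontal split: hash columns
            marks = lambda r: {c for c in range(c_lo, n_cols) if table[r][c]}
            r_hi, hashed = _grow(r_lo, n_rows, marks, capacity)
            slices.append(("horizontal", r_lo, r_hi, hashed))
            r_lo = r_hi
    return slices


if __name__ == "__main__":
    diagonal = [[r == c for c in range(4)] for r in range(4)]
    for s in map_slices(diagonal, capacity=2):
        print(s)
```

Each returned slice lists the pages that would be hashed; the other string's marked pages inside the slice would then be scanned sequentially against that hash table.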
and q3, and search them using the hash table constructed on s1 and s4.
Figure 7 presents the pseudocode for the MAP algorithm. The inputs to the program are the match table, the match table boundaries and the amount of available memory. The algorithm starts by checking the boundaries. If the match table is all consumed, it quits (step 1). If there are still unprocessed regions, the numbers of marked rows and columns are computed first (steps 2 and 3). If the number of marked rows is less than the number of marked columns, the algorithm goes into vertical split mode (step 4); otherwise it switches to horizontal split mode (step 5). In vertical split mode, the splitting point is iteratively advanced column-wise as long as the hash table for the slice fits into available memory (step 4a). The lower boundary for the columns of the match table is updated (step 4b). A hash table is then constructed on the marked rows of the slice (step 4c), and a search is performed by reading the marked columns of the slice using optimal disk scheduling (step 4d). Horizontal partitioning is performed in a similar fashion. The MAP search algorithm is then called recursively on the remaining match table (step 6).

EXPERIMENTAL EVALUATION
Our experiments, which assess the performance of MAP, were performed on computers with two AMD Athlon MP 1800+
Fig. 8. (a) Match table created by MAP for aligning the genomes of two strains of E.coli. The points correspond to marked entries of the
match table. Match table for the same strings when one of the strings is altered by (b) translocation of two substrings, (c) inversion of one
substring, and (d) duplication of one substring.
The amount of memory required to compute a match table increases quadratically, from 20 MB for two 50 Mb strings to 1.5 GB for two 450 Mb strings. This is negligible compared to the size of the hash table a BLAST-like technique would construct: BLAST could not even align two strings of <20 Mb each on our 1-GB machine (Table 1).
Our match table achieved a high amount of pruning. On average, only 3.7% of the match table entries were marked in this experiment. We obtained slightly different pruning rates for two strains of E.coli (K-12 MG1655, acc. U00096; O157:H7, acc. BA000007). In this case, 7.5% of the entries were marked.
Constructing the match table is only one phase of MAP: the match table will also need to be sliced, and BLAST run on each slice. To compare BLAST and MAP, we aligned a number of strings, from 200 kb to 20 Mb, with both, and measured the memory consumption and running times. The results are shown in Table 1. For MAP, we used a static box capacity of 1000. BLAST did not complete the alignment of Arabidopsis thaliana chromosomes 2 and 4 in 10 days; nor could it align the much smaller H.sapiens chromosome 18 to itself in the same amount of time. The memory usage of BLAST for the last two lines of this table was >1 GB, the available memory on our computer. For short strings, BLAST uses less memory than BLASTZ. However, as strings get longer, BLASTZ uses less memory than BLAST. BLASTZ runs longer than BLAST in the first six experiments. However, BLASTZ outperforms BLAST for the self-alignment of H.sapiens chromosome 18 and the alignment of chromosomes 2 and 4 of A.thaliana. This is because, unlike that of BLASTZ, the memory consumption of BLAST exceeds the amount of available memory.
The times for MAP in Table 1 are higher than necessary because the BLAST program being run on the partitions
[Figure 10 plot. Vertical axis: cumulative recall [%]; horizontal axis: score, decreasing from 160000 to 0. Curves: eps = 1% and eps = 2%.]
Fig. 10. The cumulative recall of MAP over BLAST’s local alignments of length at least w = 4096 for ε = 1% and 2% for the local alignment of two strains of E.coli (K-12 MG1655, acc. U00096; O157:H7, acc. BA000007).
[Figure 11 plot. Vertical axis: Time [s].]
Fig. 11. Time to compute the match table for strings of different length.
Table 1. Running time and memory usage for BLAST and MAP for a number of smaller datasets
The columns for MAP include running BLAST on the partitions. The match table columns include creating the match table. The units for string lengths, memory consumption, and
time are Mb, MB and seconds respectively.
a U47924, b AC0002397, c L43967, d NC_000912, e NC_003098, f NC_003028, g L42023, h U00096, i AE000516, j AL123456, k BA000007, l NT_000864, m AC_006837, n AL_161471.
performs much more work than is needed. This is because, although BLAST can be instructed to limit itself to certain sections of the input, there is no way to specify that only the substring pairs corresponding to the marked entries in the match table should be searched. Thus, BLAST proceeds as if every entry in the partition were marked in our MAP implementation.
One way to reduce the memory consumption of BLAST is to partition one or both of the strings into substrings without any prior information about the strings and run a BLAST on each of these substrings. Our simple ‘black box’ implementation of MAP is still superior to this partitioning in a number of ways. First, partitioning the data based on the match table prunes large unmarked regions, thus it avoids searching the entire string. Second, the match table depicts a coarse-grained visualization of the actual alignment. Based on the diagonal runs and dense regions, a user may restrict the alignment to a smaller substring. This is important, especially for very long strings. For example, MAP computes the match table for two 450 Mb strings in only 270 s. If only a small percentage of bases are correlated in these strings, then the user can see this by inspecting the match table and align only those substrings.
To account for the time difference of MAP in Table 1, we performed an experiment where MAP, instead of handing the partitions to BLAST, handed them to a simulator. Based on the amount of memory available, a parameter of
[Figure 12 plot. Vertical axis: Time [s]. Curves: BLAST and MAP.]
Fig. 12. The total time for BLAST and MAP to align two strains of E.coli (K-12 MG1655, acc. U00096; O157:H7, acc. BA000007) for different memory sizes.
the experiment, the simulator counted the number of disk seeks and reads necessary to align the marked entries in the partition. From these numbers and measurements of average seek times and read rates for our machine, we calculated the total cost. Figure 12 shows the total costs, according to this model, for aligning two strains of E.coli (K-12 MG1655, acc. U00096; O157:H7, acc. BA000007) with MAP and with BLAST. The total cost of MAP in this figure includes the cost of creating and partitioning the match table, in addition to aligning the strings. In this experiment, to BLAST’s advantage, we set BLAST to construct the hash table on the shorter string (i.e. mouse chromosome 18). For the memory sizes considered, MAP performed up to two orders of magnitude better than BLAST.

Dynamic indexing strategies
So far, we have used a static box capacity in the construction of the F-index. One way to improve the F-index is to use a dynamic strategy based on the distribution of the frequency vectors. We implemented the following dynamic strategies:
• Fixed volume. Keep adding points to a box until its volume exceeds a certain threshold.
• Fixed density. Keep adding points until n/V, the number of points divided by the volume of the box, falls below a certain threshold.
• MHIST-Volume. Start with one big box containing all the points. Find the position to split the box such that the sum of the volumes of the two new boxes is minimized, but the points in each box still correspond to consecutive positions of the sliding window on the string. Keep splitting the box with the largest volume to get the desired number of boxes.
• MHIST-Density. Like MHIST-Volume, but split a box into two new boxes i and j such that the total density, n_i/V_i + n_j/V_j, is maximized. Keep splitting the box with the lowest density.

The match table construction times for Volume and MHIST-Volume are 10–18% less than the time for the Static strategy. In terms of pruning rate, the Static, Volume and MHIST-Volume strategies were almost identical, and they gave the best results. On the other hand, the dynamic strategies require more time and space to construct the F-index. We recommend the use of the static strategy since the size of the index structure is smaller and its performance is very close to the best dynamic strategy.
We can summarize the experimental results as follows: (1) the output quality of MAP is very close to that of BLAST, (2) MAP is up to two orders of magnitude faster than BLAST and (3) MAP works well even for small memory sizes.

DISCUSSION
In this paper, we have considered the problem of finding local alignments for huge genome strings. We have presented an algorithm that computes scores in the frequency domain and an algorithm that finds local alignments between two strings.
The algorithm constructs a match table, a boolean matrix in which an entry is marked as true if the corresponding substrings may be similar, and false otherwise. The match table is the basis for pruning and for partitioning the search such that each partition can be searched in memory by an existing tool such as BLAST. The match table also serves as a visualization of the similarity between the strings prior to actual alignment.
In our experiments, we used BLAST (Altschul et al., 1990) for comparison. According to our experimental results, MAP runs up to two orders of magnitude faster than BLAST. Furthermore, MAP can work well even with long strings and little memory. MAP achieves these performance improvements while keeping the quality of the resulting answer set very close to that of BLAST.
MAP is a very general technique, in the sense that its match-table-based pruning and dynamic splitting scheme can be used

suffix array (QUASAR). In RECOMB. ACM Press, France, pp. 77–83.
Califano,A. and Rigoutsos,I. (1993) FLASH: A fast look-up algorithm for string homology. In ISMB, Bethesda, MD, pp. 56–64.
Delcher,A.L., Kasif,S., Fleischmann,R.D., Peterson,J., White,O. and Salzberg,S.L. (1999) Alignment of whole genomes. Nucleic Acids Res., 27, 2369–2376.
Ewens,W.J. and Grant,G.R. (2001) Statistical Methods in Bioinformatics: An Introduction. Springer, New York.
Gish,W. (1995) WU-BLAST. http://blast.wustl.edu/.
Gusfield,D. (1997) Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
Huang,X. and Miller,W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math., 12, 337–357.
Kahveci,T. and Singh,A.K. (2001) An efficient index structure for string databases. In VLDB. Morgan Kaufmann, Roma, Italy, September 2001, pp. 351–360.