Acta Informatica 13, 155-168 (1980)
© by Springer-Verlag 1980
Abstract. In this paper we investigate the worst-case complexity of range searching: preprocess N points in k-space such that range queries can be answered quickly. A range query asks for all points with each coordinate in some range of values, and arises in many problems in statistics and data bases. We develop three different structures for range searching in this paper. The first structure has absolutely optimal query time (which we prove), but has very high preprocessing and storage costs. The second structure we present has logarithmic query time and O(N^{1+ε}) preprocessing and storage costs, for any fixed ε > 0. Finally we give a structure with linear storage, O(N lg N) preprocessing and O(N^ε) query time.
1. Introduction

One of the fundamental problems of computer science is searching, and many efficient algorithms and data structures have been developed for a wide variety of searching problems. Most of these algorithms deal with problems defined by a single search key, however, and very little work has been done on searching problems defined over many keys. Such problems are usually called multi-key or multidimensional (because each of the key spaces can be viewed as a dimension) searching problems. A survey of many multidimensional searching algorithms can be found in Maurer and Ottmann [6]. In this paper we will investigate and (optimally) solve one such multidimensional searching problem. The problem of interest in this paper is called range searching. Phrased in geometric terms, we are given a set F of N points in k-space to preprocess into a data structure. After we have preprocessed the points we must answer queries which ask for all points x of F such that the first coordinate of x (x_1) is in some range [L_1, H_1], the second coordinate x_2 ∈ [L_2, H_2], ..., and x_k ∈ [L_k, H_k]. One

* Research in this paper has been supported partially under Office of Naval Research contract N000014-76-C-0373, USA, and by the Austrian Federal Ministry for Science and Research
can also phrase this problem in the terminology of data bases: we are given a file F of N records, each of k keys, to process into a file structure. We must then answer queries asking for all records such that the first key is in some specified range, the second key in a second range, etc. Range searching is called orthogonal range searching by Knuth [4, Sect. 6.5]. Range searching arises in many applications. In purchasing a desk for a certain office we might ask a furniture data base to list all desks of width 80 cm to 120 cm, length 160 cm to 240 cm, and cost $100.00 to $200.00. Knuth [4, Sect. 6.5] mentions that range searching arises in geographic data bases: in a file of North American cities we can list all cities in Colorado by asking for cities with latitude in [37°, 41°] and longitude in [102°, 109°]. Other applications of range searching in statistics and data analysis are mentioned by Bentley and Friedman [3]. In this paper we will study the worst-case complexity of range searching, explicitly ignoring the expected performance of algorithms. The emphasis of this paper is therefore somewhat more "theoretical" than practical. Previous approaches to range searching are discussed in Sect. 2. In Sect. 3 we present three new structures for range searching. The first of these has very rapid retrieval time but requires much storage and preprocessing. The second has slightly increased retrieval time but reduced storage and building costs. The third type of structure is still less efficient as far as query time is concerned, but is optimal in storage requirement and has low preprocessing cost. In Sect. 4 we prove the optimality of the fast retrieval-time structure of Sect. 3 by exhibiting a lower bound for range searching. We present conclusions and directions for further research in Sect. 5.
2. Previous Work

Most of the data structures which have been proposed for range searching have been designed to facilitate rapid average query time. Such structures include inverted lists and multidimensional arrays representing "cells" in the space. These and other "average-case" structures are discussed by Bentley and Friedman [3]. Before we describe existing "worst-case" structures for range searching we must state our methods for analyzing a data structure. Our model for searching is that we are given a set F which we preprocess into a data structure G such that we can quickly answer range queries about F by searching G. Note that all the structures we discuss in this paper are static in the sense that they need not support insertions and deletions. To analyze a particular structure we describe three cost functions as functions of N (the size of F) and k (the dimension of the space). These functions are P(N, k), the preprocessing time required to build the structure G; S(N, k), the storage required by G; and Q(N, k), the time required to answer a query. To illustrate this analysis consider the "brute force" approach to range searching, which stores the N points of the file in a linked list. The preprocessing, storage, and query costs of this structure are all linear in Nk, so the analysis of "brute force" yields P(N, k) = S(N, k) = Q(N, k) = O(Nk).
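As a concrete illustration (a minimal Python sketch of my own, not from the paper), the brute-force structure is simply the raw list of k-tuples, and a query is a full scan:

```python
def brute_force_query(F, ranges):
    """Scan every point of F and keep those whose i-th coordinate
    lies in the i-th query range -- O(Nk) time per query."""
    return [x for x in F
            if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, ranges))]
```

The structures of Sect. 3 exist precisely to avoid this linear scan.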
The multidimensional binary search tree (abbreviated k-d tree in k-space) proposed by Bentley [1] is a more sophisticated data structure which supports range searches. Bentley showed that the preprocessing and storage costs of the k-d tree are respectively
P(N, k) = O(kN lg N) and S(N, k) = O(Nk),

but he did not analyze the worst-case cost of searching. Lee and Wong [5] later analyzed the query time of k-d trees and showed that it is
Q(N, k) = O(kN^{1-1/k} + A),
where A is the number of answers found in the range. A second structure for range searching is the range tree of Bentley [2]. This structure is based on the idea of "multidimensional divide-and-conquer" and has performances
P(N, k) = O(N lg^{k-1} N), S(N, k) = O(N lg^{k-1} N),

and

Q(N, k) = O(lg^k N + A).
3. k-Ranges

In this section we will introduce three new structures for range searching. We will call these data structures k-ranges and consider overlapping and nonoverlapping versions thereof. To simplify our notation we will call overlapping k-ranges just k-ranges but we will always explicitly mention if k-ranges are nonoverlapping. In Sect. 3.1 we describe (overlapping) k-ranges and establish their performance as
Q(N, k) = O(k lg N + A)

and

P(N, k) = S(N, k) = O(N^{2k-1}),

where A is the number of points found. (In Sect. 4 we will see that this query time is optimal under comparison-based models.) Although k-ranges have very rapid retrieval times they "pay for" this by high preprocessing and storage costs. In Sect. 3.2 we will modify k-ranges to display performance
Q(N, k) = O(lg N + A)

and

P(N, k) = S(N, k) = O(N^{1+ε}) for any fixed ε > 0. In Sect. 3.3 we describe nonoverlapping k-ranges, for which S(N, k) = O(kN), P(N, k) = O(N lg N), and Q(N, k) = O(N^ε + A) for any fixed ε > 0.
We assume throughout that the points of F have been normalized so that each coordinate is an integer between 1 and N; a range query then satisfies

1 ≤ L_i ≤ H_i ≤ N for i = 1, 2, ..., k.
In calculating preprocessing and query times the time required for normalization will of course have to be considered. Our aim is to store the set F as a k-range, a data structure defined inductively for k = 1, 2, ... and permitting the fast processing of range queries. In discussing k-ranges we will first develop the one-dimensional structure and then extend it to successively higher dimensions. Let G be some subset of F, G ⊆ F. To store G as a 1-range we store G as a linear array M of N elements as follows. Each element consists of a set of points M_i and a pointer p_i (1 ≤ i ≤ N), where M_i is the set of all points of G with first
coordinate equal to i, and where p_i points to the "next" nonempty M_j, i.e. to that nonempty set M_j with i < j and j minimal.
Example 3.1. The set {(1,6), (3,3), (5,1), (5,5), (6,2)} stored as a 1-range is shown in Fig. 3.1.

Fig. 3.1. (cells M_1 through M_6, each M_i holding the points with first coordinate i, linked to the next nonempty cell)
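The 1-range can be sketched in Python as follows (a hypothetical rendering; the class and field names are mine, and coordinates are assumed normalized to integers 1..n as described above):

```python
class OneRange:
    """Sketch of a 1-range; coordinates normalized to integers 1..n."""

    def __init__(self, points, n):
        # cells[i] holds the set M_i: all points with first coordinate i
        self.cells = {i: [] for i in range(1, n + 1)}
        for p in points:
            self.cells[p[0]].append(p)
        # nxt[i] plays the role of the pointer p_i: the smallest j > i
        # with cells[j] nonempty (None if there is no such j)
        self.nxt = {}
        following = None
        for i in range(n, 0, -1):
            self.nxt[i] = following
            if self.cells[i]:
                following = i

    def query(self, lo, hi):
        """Report all points with first coordinate in [lo, hi] in O(A) time."""
        i = lo if self.cells[lo] else self.nxt[lo]
        out = []
        while i is not None and i <= hi:
            out.extend(self.cells[i])
            i = self.nxt[i]
        return out
```

Note that after normalization a query touches only nonempty cells, which is why the search cost is proportional to A, the number of points reported.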
Before discussing the notion of a k-range for k > 1 and how k-ranges are used to store k-dimensional point sets F, one more piece of notation is needed. For all i, j, t with 1 ≤ i ≤ j ≤ N and 1 ≤ t ≤ k let F_{i,j}^{(t)} be that subset of F containing all points whose t-th coordinate is between i and j (both inclusive), i.e. F_{i,j}^{(t)} = {x ∈ F : i ≤ x_t ≤ j}.
Q(N, 1) = O(lg N + A).
Consider next the planar case k = 2. We store F as a 2-range: the 2-range for F is obtained by storing each of the sets F_{i,j}^{(2)} for 1 ≤ i ≤ j ≤ N as a 1-range R_{i,j} and by setting up a two-dimensional array P of pointers, each element P_{i,j} (1 ≤ i ≤ j ≤ N) pointing to R_{i,j}. Thus, to carry out a range search [L_1, H_1], [L_2, H_2] we just have to range search in the 1-range R_{L_2,H_2} (which stores F_{L_2,H_2}^{(2)}) for [L_1, H_1].
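The construction can be sketched as follows (an illustrative Python rendering of my own, not the paper's implementation; for brevity each 1-range is replaced by an array sorted on the first coordinate, so the inner search is a binary search rather than the O(A) pointer chain):

```python
import bisect

class TwoRange:
    """Sketch of a 2-range; coordinates normalized to integers 1..n."""

    def __init__(self, points, n):
        # R[(i, j)] stores, sorted by first coordinate, the subset
        # F^(2)_{i,j}: all points whose second coordinate lies in [i, j]
        self.R = {}
        for i in range(1, n + 1):
            for j in range(i, n + 1):
                self.R[(i, j)] = sorted(p for p in points if i <= p[1] <= j)

    def query(self, l1, h1, l2, h2):
        # a 2-dimensional query reduces to one 1-dimensional search
        # in the precomputed structure for [l2, h2]
        row = self.R[(l2, h2)]
        a = bisect.bisect_left(row, (l1,))
        b = bisect.bisect_right(row, (h1, float("inf")))
        return row[a:b]
```

The O(N^2) precomputed substructures make the query trivial; this is exactly the space-for-time trade the text analyzes next.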
The answer is determined by starting at element #2 in the chain and following the pointers until element #5 is passed: the answer (3,7), (5,8) is obtained as desired.
Fig. 3.2. Point set F
To analyze the 2-range we note that the cost of performing a query is the sum of three costs: normalization, accessing the 1-range, and searching the 1-range. Since those have costs respectively of 4 lg N, O(1), and O(A), the cost of querying a 2-range is
Q(N, 2) = O(lg N + A).
The storage required by a 2-range is the sum of the storage required by all 1-ranges. Since there are O(N^2) 1-ranges, each requiring at most O(N) storage, the total storage used is
S(N, 2)=O(N3).
And since each of the 1-ranges can be built in time linear in its size after normalization, we know that
P(N, 2) = O (N3).
We have thus analyzed 2-ranges. Consider now the case k > 2. We store F as a k-range as follows. Store first all F_{i,j}^{(k)} for 1 ≤ i ≤ j ≤ N as (k-1)-ranges R_{i,j}. Now construct a two-dimensional array P of pointers, each element P_{i,j} (1 ≤ i ≤ j ≤ N) pointing to R_{i,j}. To carry out
a range search [L_1, H_1], [L_2, H_2], ..., [L_{k-1}, H_{k-1}], [L_k, H_k] in F it thus suffices to carry out a range search [L_1, H_1], [L_2, H_2], ..., [L_{k-1}, H_{k-1}] in F_{L_k,H_k}^{(k)}. Since this is stored as a (k-1)-range this process continues until it remains to range search for [L_1, H_1] in a 1-range, the latter (as explained above) requiring O(A) steps, A the number of points determined. Since the normalization for each query requires 2k lg N comparisons, the total cost for a query in a k-range is
Q(N,k)=O(klgN+A)
for k ≥ 2. We will show in Sect. 4 that this query time is optimal in any "comparison-based" model. We analyze the preprocessing and storage requirements of k-ranges by induction on k, using as the basis for our induction the fact that
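The recursion through the dimensions can be sketched as follows (a hypothetical Python rendering of my own; a sorted array again stands in for the pointer-linked 1-range at the base). Note how the construction mirrors the high preprocessing cost: each added dimension multiplies the number of substructures by roughly N^2.

```python
import bisect

class KRange:
    """Sketch of an (overlapping) k-range; names are mine, not the paper's.
    Coordinates are assumed normalized to integers 1..n."""

    def __init__(self, points, n, k):
        self.k = k
        if k == 1:
            self.data = sorted(points)     # stands in for a linked 1-range
        else:
            # one (k-1)-range for every interval [i, j] of k-th coordinates
            self.sub = {}
            for i in range(1, n + 1):
                for j in range(i, n + 1):
                    subset = [p for p in points if i <= p[k - 1] <= j]
                    self.sub[(i, j)] = KRange(subset, n, k - 1)

    def query(self, ranges):
        """ranges = [(L1, H1), ..., (Lk, Hk)], bounds normalized to 1..n."""
        if self.k == 1:
            lo, hi = ranges[0]
            a = bisect.bisect_left(self.data, (lo,))
            b = bisect.bisect_right(self.data, (hi, float("inf")))
            return self.data[a:b]
        lo, hi = ranges[self.k - 1]
        # one table lookup, then recurse on the remaining k-1 ranges
        return self.sub[(lo, hi)].query(ranges[:self.k - 1])
```

A query performs only k lookups before the final one-dimensional search, matching the O(k lg N + A) bound.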
S(N, 2) = P(N, 2) = O(N^3). Since storing an N-element k-range involves storing (N+1 choose 2) N-element (k-1)-ranges, we have the recurrence S(N, k) = (N+1 choose 2) · S(N, k-1), whose solution is S(N, k) = P(N, k) = O(N^{2k-1}).
A 2-level 2-range consists, at the top, of one "block" which contains N^{1/2} "units" (assume N is a perfect square) which represent N^{1/2} points each. On the first level of the 2-level 2-range we then store all (O(N) in number) consecutive intervals of units; that is, we store O(N) 1-ranges. For reasons of space economy we now choose to store 1-ranges as arrays sorted by x-value; this requires space proportional to the number of points in the particular range stored, rather than proportional to N. The second level of our covering consists of N^{1/2} blocks each containing N^{1/2} units (which are individual points). Within each block we store all possible intervals of units (points) as 1-ranges. This structure is depicted in Fig. 3.4 for the case N = 9. In that figure the bold vertical lines represent block boundaries and the regular vertical lines represent unit boundaries; each horizontal line represents a 1-range structure.
Fig. 3.4. A 2-level 2-range (level 1: intervals of units; level 2: within-block intervals of points)

To answer a range query in a 2-level 2-range we must choose some covering of the particular y-range of the query from the 2-level structure. This can always be accomplished by selecting at most one sequence of units from level one and two sequences of units from level two; this is illustrated in Fig. 3.5.
Fig. 3.5. 1-ranges searched in answering a query
It is easy to count the cost of a query in a 2-level 2-range: we search a total of at most three 1-ranges (each at logarithmic cost) and then enumerate the points found, so we have
Q(N,2)=O(lgN+A).
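The covering step described above can be sketched as follows (an illustrative Python function with names of my own choosing, assuming coordinates normalized to 1..n with n a perfect square): it returns at most one level-1 run of whole blocks plus at most two level-2 runs inside the boundary blocks.

```python
import math

def cover(ly, hy, n):
    """Decompose [ly, hy] into level-1 / level-2 pieces of a 2-level range."""
    b = math.isqrt(n)                      # block size = sqrt(n)
    first = (ly - 1) // b                  # 0-based index of first block hit
    last = (hy - 1) // b                   # 0-based index of last block hit
    if first == last:                      # query lies inside a single block
        return [("level2", ly, hy)]
    pieces = [("level2", ly, (first + 1) * b)]       # partial low block
    if first + 1 <= last - 1:                        # whole blocks in between
        pieces.append(("level1", (first + 1) * b + 1, last * b))
    pieces.append(("level2", last * b + 1, hy))      # partial high block
    return pieces
```

Since each returned piece is a stored 1-range, at most three logarithmic searches answer the query, as counted above.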
To count the storage we note that on the first level we have at most O(N) 1-ranges of size at most N, so that the storage required on the first level is O(N^2). On the second level we have O(N^{3/2}) 1-ranges, each representing at most N^{1/2} points, so the storage on that level is also O(N^2). Summing these we achieve
S(N, 2) = O(N2).
If the points are kept in sorted linked lists as the structures are built, then the obvious preprocessing time of O(N21gN) can be reduced to
P(N, 2) = O(N^2).
The 2-level 2-range can of course be generalized to an l-level 2-range, a structure consisting of l levels. On the first level there is one block containing N^{1/l} units of N^{1-1/l} points each. The second level has N^{1/l} blocks containing N^{1/l} units of N^{1-2/l} points, and so on. On each level we store as 1-ranges all
Taking the product of the above three values gives the cost per level, and since there are altogether l levels we have
Q(N, 3) = O(lg N + A).
We store O(N) 2-level 2-ranges on the first level, and count that storage. On the second level we store O(N^{1/2}) blocks of O(N) 2-ranges, each of size at most O(N^{1/2}) (so those 2-ranges require at most O(N) storage each). Multiplying these costs we see that the storage on the second level is O(N^{5/2}). Thus the total storage cost is
Q(N,k)=O(lgN + A).
One can also show that the preprocessing and storage costs grow by a factor of at most N for each dimension "added", so we know that

P(N, k) = S(N, k) = O(N^{1+ε})

and

Q(N, k) = O(lg N + A).
A nonoverlapping 2-range divides the N points into N^{1/2} contiguous "y-strips" of N^{1/2} points each, with the points of each strip sorted by x-value. This structure requires only linear storage and can be built (by N^{1/2} distinct sorts) in O(N lg N) time. Suppose now that we are to do a range search defined by an x-range and a y-range: for all the contiguous y-strips contained wholly in the y-range we can perform two binary searches to give the set of all points contained in the x-range. Since there are only N^{1/2} such strips altogether and each can be searched in logarithmic time, the total cost of this step is O(N^{1/2} lg N + A). We can then do a simple scan over the two end y-strips (top and bottom) to see if they contain any points in both x and y ranges; this costs at most O(N^{1/2}) to examine the 2N^{1/2} points. Thus the total cost of searching is O(N^{1/2} lg N) and the performance of the structure as a whole is
P(N, 2) = O(N lg N), S(N, 2) = O(N),

and

Q(N, 2) = O(N^{1/2} lg N + A).
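A nonoverlapping 2-range and its query can be sketched in Python as follows (function names and details are my own; the two boundary strips are scanned directly, while each interior strip gets two binary searches):

```python
import bisect
import math

def build_strips(points):
    """Sketch of a nonoverlapping 2-range: split the points, in y-order,
    into ~sqrt(N) strips of ~sqrt(N) points, each strip re-sorted by x."""
    pts = sorted(points, key=lambda p: p[1])
    m = max(1, math.isqrt(len(pts)))            # strip size ~ sqrt(N)
    strips = []
    for s in range(0, len(pts), m):
        chunk = pts[s:s + m]
        ymin, ymax = chunk[0][1], chunk[-1][1]
        strips.append((ymin, ymax, sorted(chunk)))
    return strips

def range_query(strips, lx, hx, ly, hy):
    out = []
    for ymin, ymax, byx in strips:
        if ymax < ly or ymin > hy:
            continue                             # strip misses the y-range
        if ly <= ymin and ymax <= hy:
            # interior strip: two binary searches on the x-sorted array
            a = bisect.bisect_left(byx, (lx,))
            b = bisect.bisect_right(byx, (hx, float("inf")))
            out.extend(byx[a:b])
        else:
            # boundary strip: scan its ~sqrt(N) points directly
            out.extend(p for p in byx
                       if lx <= p[0] <= hx and ly <= p[1] <= hy)
    return out
```

Storage is one copy of the points plus strip boundaries, i.e. linear, in agreement with the bounds just given.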
Nonoverlapping 2-ranges can easily be extended to be multilevel. In the first level of a 3-level 2-range we have one block of N^{1/3} units, each representing N^{2/3} points and sorted by x-value. On the second level we have N^{1/3} blocks, each containing N^{1/3} units of N^{1/3} points contiguous in y (sorted by x). The third level then contains N^{2/3} blocks of N^{1/3} units (points) each. This structure requires storage linear in N and can be built in O(N lg N) time. To answer a range query we must search at most N^{1/3} units on the first level and 2N^{1/3} units on each of the second and third levels. The cost of each of those searches is logarithmic (excluding the manipulation of points found), so the total cost of searching is O(N^{1/3} lg N). The obvious extension to l-level nonoverlapping 2-ranges carries through without flaw and has performance
Q(N, k) = O(N^ε + A).

Note that this is for fixed ε, k and l: if l is allowed to vary with N then one
achieves a tree-like structure (specifically, the range trees of Bentley [2] if l = lg N).
4. Lower Bounds
We have shown in Sect. 3 by using k-ranges that a k-dimensional set of N points can be stored such that range queries can be answered in time O(k lg N + A). We will now demonstrate that this is optimal. For an arbitrary point set F, let R(F) be the number of different range queries possible for F. (We say that two range queries are different iff their answers are different.) Let
R(N, k) = max {R(F) : F a set of N points in k-space}.

It is easy to see that
R(N, 1) = (N+1 choose 2) + 1,

since the answer to a query is either empty (1 such answer) or can be defined by two of N+1 interpoint locations. The exact value of R(N, k) for general N and k seems more difficult to calculate. We can immediately observe that
R(N, k) ≤ (N+1 choose 2)^k,

since in each dimension there are only N+1 essentially different positions for upper and lower bounds for a range search. One might wonder how close R(N, k) can grow to this bound; Theorem 4.1 partially answers this.

Theorem 4.1.
R(N, k) ≥ (N/(2k))^{2k}.

Proof. To prove the theorem consider the point set F consisting of all points with a single nonzero integer coordinate in the closed interval [-N/(2k), N/(2k)]. Consider the set of all range queries [L_1, H_1], ..., [L_k, H_k] with -N/(2k) ≤ L_i ≤ -1 and 1 ≤ H_i ≤ N/(2k) for i = 1, 2, ..., k. The (N/(2k))^{2k} queries
obtained in this way clearly determine different sets of points, and the result follows. □

Corollary. For range queries on N points in k dimensions, O(k lg N + A) is optimal for "comparison-based" methods.
Proof. By the Theorem, distinguishing among the at least (N/(2k))^{2k} different possible answers requires at least lg (N/(2k))^{2k} = 2k lg N - 2k lg k - 2k lg 2 steps. Hence the 2k lg N comparisons used for range searching in 1-level k-ranges is optimal to within second-order terms. □
5. Conclusions
We have presented three variants of a new data structure (the k-range) for storing k-dimensional sets of N points and permitting fast responses to range queries. The first variant, one-level k-ranges, requires only 2k lg N comparisons per query, plus an amount of list processing proportional to A, the number of answers found. However, preprocessing and storage costs of O(N^{2k-1}) are prohibitively high. With the second variant, multi-level k-ranges, lookup time is still O(lg N + A), but preprocessing and storage costs are reducible to O(N^{1+ε}) for every fixed ε > 0. Employing the third variant, nonoverlapping k-ranges, storage can be reduced to O(kN), preprocessing to O(N lg N), and for every ε > 0 a worst-case query time of O(N^ε + A) can be achieved. The results are summarized and compared with previously known techniques in the following table (Fig. 5.1), showing the behaviour for fixed k and large N.
Structure                          P(N, k)           S(N, k)           Q(N, k)
Naive                              O(N)              O(N)              O(N)
k-d Trees                          O(N lg N)         O(N)              O(N^{1-1/k} + A)
Nonoverlapping k-ranges            O(N lg N)         O(N)              O(N^ε + A)
Range Trees                        O(N lg^{k-1} N)   O(N lg^{k-1} N)   O(lg^k N + A)
(Overlapping) l-level k-ranges     O(N^{1+ε})        O(N^{1+ε})        O(lg N + A)
(Overlapping) one-level k-ranges   O(N^{2k-1})       O(N^{2k-1})       O(lg N + A)

Fig. 5.1
Fast solutions to other problems involving point sets in k dimensions can also be obtained by using the data structures and techniques of this paper. Two such examples are the problem of computing the Empirical Cumulative Distribution Function (ECDF searching problem) and the maxima searching problem discussed in detail in Bentley [2]. For a point x, the ECDF searching problem and the maxima searching problem can be formulated as follows:
ECDF Searching Problem. Determine the number of points y in F with y_i < x_i for i = 1, 2, ..., k.
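The ECDF query is the special case of a range query in which every range is one-sided. A direct Python statement of the problem (my own brute-force sketch, not the paper's fast solution) makes the connection explicit:

```python
def ecdf_query(F, x):
    """Count the points y of F dominated by x (y_i < x_i for every i);
    this is a range query with ranges (-infinity, x_i) in each dimension."""
    return sum(1 for y in F if all(yi < xi for yi, xi in zip(y, x)))
```

Any of the range-searching structures above therefore answers ECDF queries within the same bounds, with counting replacing enumeration.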
Despite the further insights into range searching gained by this paper, a number of open problems remain. Are there other data structures with a still better tradeoff between P(N, k), S(N, k), and Q(N, k)? In particular, for a total of 2k lg N comparisons is O(N^{2k-1}) optimal for preprocessing and storage? Can the product P(N, k) · S(N, k) · Q(N, k) be reduced to O(N^2 lg^2 N)? If not, can one show lower bounds on the above product, indicating "space-time" tradeoffs? What is the situation when the dynamic case (insertion and deletion of points in between queries) is considered? Another problem of independent interest is the exact computation of R(N, k) of Sect. 4. Although we have shown
(N/(2k))^{2k} ≤ R(N, k) ≤ (N+1 choose 2)^k, we have been
unable to compute the exact value of R(N, k) in general. We do not even know the values of R(N, 2) except for small N. In this paper we have presented (asymptotically) fast worst-case methods for range searching, some of them with (asymptotically) small amounts of preprocessing and storage. We do not necessarily advocate these methods for practical applications. The results do, however, suggest that it may be possible to find methods for solving range queries efficiently both as far as average and worst-case behaviour are concerned.
References
1. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Comm. ACM 18, 509-517 (1975)
2. Bentley, J.L.: Multidimensional divide-and-conquer. Carnegie-Mellon University Computer Science Department Research Review, 1978, pp. 7-24
3. Bentley, J.L., Friedman, J.H.: Data structures for range searching. Comput. Surveys (in press 1980)
4. Knuth, D.E.: The Art of Computer Programming, Vol. 3. Reading, Mass.: Addison-Wesley 1973
5. Lee, D.T., Wong, C.K.: Worst case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Informatica 9, 23-29 (1977)
6. Maurer, H.A., Ottmann, Th.: Manipulating sets of points - a survey. In: Graphen, Algorithmen, Datenstrukturen: Workshop 78. München-Wien: Hanser 1978
Note added in proof Mr. James Saxe has tightened the bounds on R(N, k) in an article to appear in Discrete Applied Mathematics.