
Acta Informatica 13, 155-168 (1980)


© by Springer-Verlag 1980

Efficient Worst-Case Data Structures for Range Searching*


J.L. Bentley¹ and H.A. Maurer²

¹ Departments of Computer Science and Mathematics, Carnegie-Mellon University, Pittsburgh, PA 15213, USA
² Institut für Informationsverarbeitung, Technische Universität Graz, Steyrergasse 17, A-8010 Graz, Austria

Abstract. In this paper we investigate the worst-case complexity of range searching: preprocess N points in k-space such that range queries can be answered quickly. A range query asks for all points with each coordinate in some range of values, and arises in many problems in statistics and data bases. We develop three different structures for range searching. The first structure has absolutely optimal query time (which we prove), but has very high preprocessing and storage costs. The second structure we present has logarithmic query time and O(N^{1+ε}) preprocessing and storage costs, for any fixed ε > 0. Finally we give a structure with linear storage, O(N lg N) preprocessing and O(N^ε) query time.

1. Introduction

One of the fundamental problems of computer science is searching, and many efficient algorithms and data structures have been developed for a wide variety of searching problems. Most of these algorithms deal with problems defined by a single search key, however, and very little work has been done on searching problems defined over many keys. Such problems are usually called multi-key or multidimensional (because each of the key spaces can be viewed as a dimension) searching problems. A survey of many multidimensional searching algorithms can be found in Maurer and Ottmann [6]. In this paper we will investigate and (optimally) solve one such multidimensional searching problem.

The problem of interest in this paper is called range searching. Phrased in geometric terms, we are given a set F of N points in k-space to preprocess into a data structure. After we have preprocessed the points we must answer queries which ask for all points x of F such that the first coordinate of x (x_1) is in some range [L_1, H_1], the second coordinate x_2 ∈ [L_2, H_2], ..., and x_k ∈ [L_k, H_k].

* Research in this paper has been supported partially under Office of Naval Research contract N000014-76-C-0373, USA, and by the Austrian Federal Ministry for Science and Research.



One can also phrase this problem in the terminology of data bases: we are given a file F of N records, each of k keys, to process into a file structure. We must then answer queries asking for all records such that the first key is in some specified range, the second key in a second range, etc. Range searching is called orthogonal range searching by Knuth [4, Sect. 6.5].

Range searching arises in many applications. In purchasing a desk for a certain office we might ask a furniture data base to list all desks of width 80 cm to 120 cm, length 160 cm to 240 cm, and cost $100.00 to $200.00. Knuth [4, Sect. 6.5] mentions that range searching arises in geographic data bases: in a file of North American cities we can list all cities in Colorado by asking for cities with latitude in [37°, 41°] and longitude in [102°, 109°]. Other applications of range searching in statistics and data analysis are mentioned by Bentley and Friedman [3].

In this paper we will study the worst-case complexity of range searching, explicitly ignoring the expected performance of algorithms. The emphasis of this paper is therefore somewhat more "theoretical" than practical. Previous approaches to range searching are discussed in Sect. 2. In Sect. 3 we present three new structures for range searching. The first of these has very rapid retrieval time but requires much storage and preprocessing. The second has slightly increased retrieval time but reduced storage and building costs. The third type of structure is still less efficient as far as query time is concerned, but is optimal in storage requirement and has low preprocessing cost. In Sect. 4 we prove the optimality of the fast retrieval-time structure of Sect. 3 by exhibiting a lower bound for range searching. We present conclusions and directions for further research in Sect. 5.

2. Previous Work

Most of the data structures which have been proposed for range searching have been designed to facilitate rapid average query time. Such structures include inverted lists and multidimensional arrays representing "cells" in the space. These and other "average-case" structures are discussed by Bentley and Friedman [3].

Before we describe existing "worst-case" structures for range searching we must state our methods for analyzing a data structure. Our model for searching is that we are given a set F which we preprocess into a data structure G such that we can quickly answer range queries about F by searching G. Note that all the structures we discuss in this paper are static in the sense that they need not support insertions and deletions. To analyze a particular structure we describe three cost functions as functions of N (the size of F) and k (the dimension of the space). These functions are P(N, k), the preprocessing time required to build the structure G; S(N, k), the storage required by G; and Q(N, k), the time required to answer a query.

To illustrate this analysis consider the "brute force" approach to range searching, which stores the N points of the file in a linked list. The preprocessing, storage, and query costs of this structure are all linear in Nk, so the analysis of "brute force" yields

P(N, k) = O(Nk),  S(N, k) = O(Nk),  and  Q(N, k) = O(Nk).
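As an illustration only, a minimal sketch of this brute-force scheme might look as follows (Python is used purely for exposition; the function name and the sample data are our own and do not appear in the paper):

    # Brute-force range searching: keep the N points in a plain list and
    # scan all of them for every query.  P, S and Q are all O(Nk).
    def brute_force_query(points, lows, highs):
        """Return all points x with lows[i] <= x[i] <= highs[i] for every i."""
        k = len(lows)
        return [x for x in points
                if all(lows[i] <= x[i] <= highs[i] for i in range(k))]

    # Example: the 2-dimensional query [2,5] x [6,8] also used in Sect. 3.1.
    pts = [(1, 6), (3, 7), (5, 8), (8, 6), (4, 2)]
    print(brute_force_query(pts, (2, 6), (5, 8)))   # -> [(3, 7), (5, 8)]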


The multidimensional binary search tree (abbreviated k-d tree in k-space) proposed by Bentley [1] is a more sophisticated data structure which supports range searches. Bentley showed that the preprocessing and storage costs of the k-d tree are respectively
P(N, k) = O(kN lg N)  and  S(N, k) = O(Nk),

but he did not analyze the worst-case cost of searching. Lee and Wong [5] later analyzed the query time of k-d trees and showed that it is
Q(N, k) = O(kN^{1-1/k} + A)

where A is the number of answers found in the range. A second structure for range searching is the range tree of Bentley [2]. This structure is based on the idea of "multidimensional divide-and-conquer" and has performance
P(N, k) = O(N lg^{k-1} N),  S(N, k) = O(N lg^{k-1} N),

and

Q(N, k) = O(lg^k N + A)

for any fixed k > 2.

3. New Data Structures

In this section we will introduce three new structures for range searching. We will call these data structures k-ranges and consider overlapping and nonoverlapping versions thereof. To simplify our notation we will call overlapping k-ranges just k-ranges but we will always explicitly mention if k-ranges are nonoverlapping. In Sect. 3.1 we describe (overlapping) k-ranges and establish their performance as
Q(N, k) = O(k lg N + A),

and

P(N, k) = S(N, k) = O(N^{2k-1}),

where A is the number of points found. (In Sect. 4 we will see that this query time is optimal under comparison-based models.) Although k-ranges have very rapid retrieval times they "pay for" this by high preprocessing and storage costs. In Sect. 3.2 we will modify k-ranges to display performance


Q(N, k) = O(lg N + A),

and

P(N, k) = S(N, k) = O(N^{1+ε})


for any fixed ε > 0. In Sect. 3.3 we introduce nonoverlapping k-ranges. Storage and preprocessing costs for this type of data structure are still lower than for the data structures of Sects. 3.1 and 3.2 (in fact even lower than the ones for range trees of Bentley [2]). However, query time is increased somewhat (and is higher than for range trees). Specifically we show for nonoverlapping k-ranges a performance of

Q(N, k) = O(N^ε),

S(N, k) = O(N),

and

P(N, k) = O(N lg N)

(ε > 0 can be chosen arbitrarily).

3.1. One Level k-Ranges


Before describing our data structures and techniques it is convenient to transform the problem of range searching in a k-dimensional set F of N points with arbitrary real coordinates into the problem of range searching in a k-dimensional set F̄ of N points with integer coordinates between 1 and N. Such a "normalization" can be carried out as follows. Let F_i = {x_i | x ∈ F} (1 ≤ i ≤ k) be the set of numbers occurring as i-th coordinates. For each point x = (x_1, x_2, ..., x_k) of F take a point x̄ = (x̄_1, x̄_2, ..., x̄_k) into F̄, where x̄_i is the "rank" of x_i in F_i. (If the numbers of F_i are sorted into ascending order and duplicates are removed, then the position of x_i in this sequence is its rank.) Note that such normalization can be accomplished in time O(kN lg N) and space O(N). A range query [L_1, H_1], [L_2, H_2], ..., [L_k, H_k] in F can be "normalized" into a range query [L̄_1, H̄_1], [L̄_2, H̄_2], ..., [L̄_k, H̄_k] in F̄ in a similar fashion in 2k lg N comparisons. For the above reasons we can assume in the following that F is a set of N points in k dimensions with all coordinates being integers between 1 and N, and that each range query [L_1, H_1], ..., [L_k, H_k] consists of integers only with

1 ≤ L_i ≤ H_i ≤ N

for i = 1, 2, ..., k.
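As an illustration, a minimal sketch of such a rank-based normalization (our own code, not from the paper; ranks start at 1 as in the text, and bisection supplies the 2k lg N query-normalization comparisons) might be:

    from bisect import bisect_left, bisect_right

    def normalize(points):
        """Map each coordinate to its rank (starting at 1) among the distinct
        values occurring in that coordinate; also return the sorted value
        lists so that queries can be normalized later."""
        k = len(points[0])
        axes = [sorted(set(p[t] for p in points)) for t in range(k)]
        norm_pts = [tuple(bisect_left(axes[t], p[t]) + 1 for t in range(k))
                    for p in points]
        return norm_pts, axes

    def normalize_query(lows, highs, axes):
        """Translate a real-valued query into rank space: L_t becomes the rank
        of the smallest stored value >= L_t, H_t the rank of the largest
        stored value <= H_t (L_t > H_t then signals an empty answer)."""
        L = [bisect_left(axes[t], lows[t]) + 1 for t in range(len(axes))]
        H = [bisect_right(axes[t], highs[t]) for t in range(len(axes))]
        return L, H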

In calculating preprocessing and query times the time required for normalization will of course have to be considered. Our aim is to store the set F as a k-range, a data structure defined inductively for k = 1, 2, ... and permitting the fast processing of range queries. In discussing k-ranges we will first develop the one-dimensional structure and then extend it to successively higher dimensions.

Let G be some subset of F, G ⊆ F. To store G as a 1-range we store G as a linear array M of N elements as follows. Each element consists of a set of points M_i and a pointer p_i (1 ≤ i ≤ N), where M_i is the set of all points of G with first coordinate equal to i, and where p_i points to the "next" nonempty M_j, i.e. to that nonempty set M_j with i < j and j minimal.

Example 3.1. The set {(1,6), (3,3), (5,1), (5,5), (6,2)} stored as a 1-range is shown in
Fig. 3.1.
Fig. 3.1. The set of Example 3.1 stored as a 1-range (cells M_1, ..., M_6)
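The following is a small illustrative sketch of the 1-range just described (our own code and names: M holds the cells M_1, ..., M_N, and nxt plays the role of the pointers p_i):

    def build_1range(points, N):
        """Store a set of (normalized) points as a 1-range: an array M of N
        cells indexed by first coordinate, plus 'next' pointers that skip
        empty cells."""
        M = [[] for _ in range(N + 1)]           # M[1..N]; index 0 unused
        for p in points:
            M[p[0]].append(p)
        nxt = [None] * (N + 2)
        following = None
        for i in range(N, 0, -1):                # scan right to left
            nxt[i] = following
            if M[i]:
                following = i
        return M, nxt

    def query_1range(rng, L, H):
        """List all points whose first coordinate lies in [L, H]; the pointer
        chain makes the cost O(A) apart from constant overhead."""
        M, nxt = rng
        out = []
        i = L if M[L] else nxt[L]
        while i is not None and i <= H:
            out.extend(M[i])
            i = nxt[i]
        return out

    # Example 3.1: r = build_1range([(1,6), (3,3), (5,1), (5,5), (6,2)], 6)
    # query_1range(r, 2, 5) -> [(3,3), (5,1), (5,5)]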

Before discussing the notion of a k-range for k > 1 and how k-ranges are used to store k-dimensional point sets F, one more piece of notation must be introduced. For all i, j, t with 1 ≤ i ≤ j ≤ N and 1 ≤ t ≤ k let F_{i,j}^{(t)} be that subset of F containing all points whose t-th coordinate is between i and j (both inclusive), i.e.

F_{i,j}^{(t)} = {x | x ∈ F ∧ i ≤ x_t ≤ j}.


We are now ready to discuss range searches in k dimensions. We first discuss the cases k = 1 and k = 2, and then the general case k > 2. In the linear case k = 1 we store F as a 1-range. To process a (normalized) range query [L_1, H_1] we list all elements of M_{L_1}, M_{L_1+1}, ..., M_{H_1}. Due to the pointer mechanism employed this takes O(A) time, where A is the number of points found. To compute the query time we note that roughly 2 lg N steps have to be added for normalization. Taking into account the preprocessing time for normalization of O(N lg N) and of setting up the linear array M we clearly have

P(N, 1) = O(N lg N),  S(N, 1) = O(N),


and

Q(N, 1) = O(lg N + A).

Consider next the planar case k = 2. We store F as a 2-range: the 2-range for F is obtained by storing each of the sets F_{i,j}^{(2)} for 1 ≤ i ≤ j ≤ N as a 1-range R_{i,j} and by setting up a two-dimensional array P of pointers, each element P_{i,j} (1 ≤ i ≤ j ≤ N) pointing to R_{i,j}. Thus, to carry out a range search [L_1, H_1], [L_2, H_2] we just have to range search in the 1-range R_{L_2, H_2} (which stores F_{L_2, H_2}^{(2)}) for [L_1, H_1].

Example 3.2. Consider the point set F as plotted in Fig. 3.2.


To answer the range search [2,5], [6,8] in F we have to range search for [2,5] in F_{6,8}^{(2)} = {(8,6), (3,7), (5,8)}, which is available as the 1-range shown in Fig. 3.3.

The answer is determined by starting at element #2 in the chain and following the pointers until element #5 is passed: the answer (3,7), (5,8) is obtained as desired.


Fig. 3.2. Point set F

Fig. 3.3. 1-range for F_{6,8}^{(2)}
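Building on the 1-range sketch above, a 2-range could be sketched as follows (again our own illustrative code; it stores one 1-range per pair (i, j), and the naive triple loop matches the O(N^3) preprocessing derived below):

    def build_2range(points, N):
        """Store, for every pair 1 <= i <= j <= N, the 1-range of all points
        whose second coordinate lies in [i, j]."""
        R = {}
        for i in range(1, N + 1):
            for j in range(i, N + 1):
                subset = [p for p in points if i <= p[1] <= j]
                R[(i, j)] = build_1range(subset, N)
        return R

    def query_2range(R, L1, H1, L2, H2):
        """A query [L1,H1] x [L2,H2] is one pointer lookup plus one 1-range search."""
        return query_1range(R[(L2, H2)], L1, H1)

    # Example 3.2: with the point set of Fig. 3.2 (N = 9),
    # query_2range(R, 2, 5, 6, 8) returns (3,7) and (5,8), as in the text.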

To analyze the 2-range we note that the cost of performing a query is the sum of three costs: normalization, accessing the 1-range, and searching the 1-range. Since those have costs respectively of 4 lg N, O(1), and O(A), the cost of querying a 2-range is

Q(N, 2) = O(lg N + A).

The storage required by a 2-range is the sum of the storage required by all 1-ranges. Since there are O(N^2) 1-ranges requiring O(N) storage each, the total storage used is

S(N, 2) = O(N^3).

And since each of the O(N^2) 1-ranges can be built in linear time after normalization, we know that

P(N, 2) = O(N^3).

We have thus analyzed 2-ranges. Consider now the case k > 2. We store F as a k-range as follows. Store first all F_{i,j}^{(k)} for 1 ≤ i ≤ j ≤ N as (k-1)-ranges R_{i,j}. Now construct a two-dimensional array P of pointers, each element P_{i,j} (1 ≤ i ≤ j ≤ N) pointing to R_{i,j}. To carry out a range search [L_1, H_1], [L_2, H_2], ..., [L_{k-1}, H_{k-1}], [L_k, H_k] in F it thus suffices to carry out a range search [L_1, H_1], [L_2, H_2], ..., [L_{k-1}, H_{k-1}] in F_{L_k, H_k}^{(k)}. Since this is stored as a (k-1)-range this process continues until it remains to range search for [L_1, H_1] in a 1-range, the latter (as explained above) requiring O(A) steps, A being the number of points determined. Since the normalization for each query requires 2k lg N comparisons, the total cost for a query in a k-range is

Q(N, k) = O(k lg N + A)

for k > 2. We will show in Sect. 4 that this query time is optimal in any "comparison-based" model. We analyze the preprocessing and storage requirements of k-ranges by induction on k, using as the basis for our induction the fact that

S(N, 2) = P(N, 2) = O(N^3).

Since storing an N-element k-range involves storing (N+1 choose 2) = O(N^2) N-element (k-1)-ranges, we have the recurrence

S(N, k) = O(N^2) · S(N, k-1)


which has solution

S(N, k) = O(N^{2k-1}).
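Unwinding the recurrence with the basis S(N, 2) = O(N^3) makes this explicit (an intermediate step we add for completeness):

    S(N, k) = O(N^2) · S(N, k-1) = O(N^2)^{k-2} · S(N, 2) = O(N^{2(k-2)} · N^3) = O(N^{2k-1}).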


A similar analysis shows that the preprocessing cost of k-ranges is

P(N, k) = O(N^{2k-1}).


3.2. Multi-Level k-Ranges
The k-ranges of Sect. 3.1 provide extremely efficient range searching query time at the expense of high preprocessing and storage costs. In this section we will show how to modify k-ranges to become "l-level k-ranges" which maintain the logarithmic query time while reducing the other costs. We will accomplish this by first developing a set of efficient planar structures and then applying those to successively higher dimensions. Throughout this section we will assume that the points to be searched and the queries have been normalized as in the previous section.

The essential feature behind the rapid retrieval time of 2-ranges is that they are based on a covering of all possible y-intervals of interest to range searching. This covering was the "complete" covering, which explicitly stored all y-intervals. Although the complete covering made possible rapid query time, it forced us to store all O(N^2) 1-ranges. We will now investigate other coverings of y-intervals which (slightly) increase query time but significantly decrease storage and preprocessing costs.

The first such covering we will investigate is based on a two-level structure (the complete covering is a one-level structure). On the first level we consider one "block" which contains N^{1/2} "units" (assume N is a perfect square), each representing N^{1/2} points. On the first level of the 2-level 2-range we then store all O(N) consecutive intervals of units; that is, we store O(N) 1-ranges. For reasons of space economy we now choose to store 1-ranges as arrays sorted by x-value; this requires space proportional to the number of points in the particular range stored, rather than proportional to N. The second level of our covering consists of N^{1/2} blocks each containing N^{1/2} units (which are individual points). Within each block we store all possible intervals of units (points) as 1-ranges. This structure is depicted in Fig. 3.4 for the case N = 9. In that figure the bold vertical lines represent block boundaries and the regular vertical lines represent unit boundaries; each horizontal line represents a 1-range structure.

Fig. 3.4. A 2-level 2-range

To answer a range query in a 2-level 2-range we must choose some covering of the particular y-range of the query from the 2-level structure. This can always be accomplished by selecting at most one sequence of units from level one and two sequences of units from level two; this is illustrated in Fig. 3.5.
Fig. 3.5. Querying a 2-level 2-range (1-ranges searched)
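As a concrete illustration of how such a covering can be selected, here is a small sketch (our own code, assuming normalized unit indices 1..N and blocks of size N^{1/2}); it returns at most one level-1 piece and at most two level-2 pieces:

    import math

    def cover_y_range(lo, hi, N):
        """Split the unit interval [lo, hi] (1 <= lo <= hi <= N) into at most
        one level-1 piece (a run of complete blocks) and at most two level-2
        pieces (partial runs inside the boundary blocks)."""
        B = int(math.isqrt(N))                  # block size N^(1/2)
        first_block = (lo - 1) // B             # 0-based block numbers
        last_block = (hi - 1) // B
        if first_block == last_block:           # query inside a single block
            return [("level2", lo, hi)]
        pieces = []
        lo_full = first_block + 1 if lo > first_block * B + 1 else first_block
        hi_full = last_block - 1 if hi < (last_block + 1) * B else last_block
        if lo_full <= hi_full:                  # run of complete blocks
            pieces.append(("level1", lo_full * B + 1, (hi_full + 1) * B))
        if lo_full > first_block:               # partial prefix block
            pieces.append(("level2", lo, first_block * B + B))
        if hi_full < last_block:                # partial suffix block
            pieces.append(("level2", last_block * B + 1, hi))
        return pieces

Each returned piece corresponds to one stored 1-range (an interval of units on the indicated level), which is then searched for the query's x-range.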

Efficient Worst-Case Data Structures for Range Searching

163

It is easy to count the cost of a query in a 2-level 2-range: we search a total of at most three 1-ranges (each at logarithmic cost) and then enumerate the points found, so we have

Q(N, 2) = O(lg N + A).

To count the storage we note that on the first level we have at most O(N) 1-ranges of size at most N, so that the storage required on the first level is O(N^2). On the second level we have O(N^{3/2}) 1-ranges, each representing at most N^{1/2} points, so the storage on that level is also O(N^2). Summing these we achieve

S(N, 2) = O(N^2).

If the points are kept in sorted linked lists as the structures are built, then the obvious preprocessing time of O(N^2 lg N) can be reduced to

P(N, 2) = O(N^2).

The 2-level 2-range can of course be generalized to an l-level 2-range, a structure consisting of l levels. On the first level there is one block containing N^{1/l} units of N^{1-1/l} points each. The second level has N^{1/l} blocks containing N^{1/l} units of N^{1-2/l} points, and so on. On each level we store as 1-ranges all O(N^{2/l}) intervals of units in each block. To answer a query we select an appropriate covering of the query's y-range and then perform searches on those 1-ranges. In such a search we must search at most two intervals on each of the l levels, so the total cost of the search is bounded above by O(l · lg N + A). To analyze the storage cost we note that on level i we store N^{(i-1)/l} blocks, each of which contains O(N^{2/l}) intervals representing at most N^{1-(i-1)/l} points each.

Taking the product of the above three values gives the cost per level, and since there are altogether l levels we have

S(N, 2) = O(N^{1+2/l}).
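Spelled out, the per-level accounting behind this bound is (an intermediate step we add for completeness):

    (blocks per level) · (intervals per block) · (points per interval)
        = N^{(i-1)/l} · O(N^{2/l}) · O(N^{1-(i-1)/l}) = O(N^{1+2/l})  on every level i,

and summing over the l levels changes this only by the constant factor l.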


A similar analysis shows that the preprocessing cost is of the same order. Thus we see that l-level 2-ranges allow us to reduce the preprocessing and storage costs of range searching to N^{1+ε} for any positive ε while maintaining O(lg N + A) query time.

The multilevel structure can be used to decrease the preprocessing and storage costs of k-ranges while maintaining logarithmic search time. To illustrate this we will consider 2-level 3-ranges, which are built by covering 3-ranges with 2-level 2-ranges. On the first level of such a structure we have one block of N^{1/2} units representing N^{1/2} points each, and we store all intervals of those units as 2-level 2-ranges. On the second level we have N^{1/2} blocks of N^{1/2} units (which represent one point each), and we store all intervals of units within each block as a 2-level 2-range. Any query can be answered by covering its z-range with one interval from the first level and two intervals from the second level, so we maintain the query time of

Q(N, 3) = O(lg N + A).


We store O(N) 2-level 2-ranges on the first level, and that requires O(N^3) storage. On the second level we store O(N^{1/2}) blocks of O(N) 2-level 2-ranges, each of size at most O(N^{1/2}) (so those 2-ranges require at most O(N^{1/2})^2 = O(N) storage each). Multiplying these costs we see that the storage required on the second level is O(N^{5/2}). Thus the total storage cost is

S(N, 3) = O(N^3).


Using presorting the preprocessing can also be done in cubic time.

The general 2-level k-ranges are inductively built out of 2-level (k-1)-ranges. The k-dimensional structure is built with two levels: on the first there are N (k-1)-dimensional structures of size at most N each, and on the second there are N^{3/2} (k-1)-dimensional structures of size at most N^{1/2} each. Since the total query cost increases by at most a factor of three as each dimension is added, we have for any fixed k

Q(N, k) = O(lg N + A).

One can also show that the preprocessing and storage costs grow by a factor of at most N for each dimension "added", so we know that

P(N, k) = S(N, k) = O(N^k).


The above generalization of 2-level k-ranges can also be applied to l-level k-ranges. As we "add" each new dimension we increase the query time by a factor of at most 2l and increase the preprocessing and storage costs by a factor of O(N^{2/l}). By choosing l as a function of k and ε, for any fixed values of k and ε > 0 we can obtain a structure with performance

P(N, k) = S(N, k) = O(N^{1+ε})

and

Q(N, k) = O(lg N + A).
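Concretely (an accounting step we add, under the same assumptions): starting from the l-level 2-range and adding k-2 further dimensions gives

    P(N, k) = S(N, k) = O(N^{1+2/l} · (N^{2/l})^{k-2}) = O(N^{1+2(k-1)/l}),

so choosing l ≥ 2(k-1)/ε yields the stated O(N^{1+ε}) bound, while the query term grows only by the factor (2l)^{k-1}, a constant for fixed k and l.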

3.3. Nonoverlapping k-Ranges


The l-level overlapping k-ranges of Sect. 3.2 provided logarithmic search time while their preprocessing and space requirements were O(N^{1+ε}). In this section we will investigate nonoverlapping l-level k-ranges, which require only O(N lg N) preprocessing and linear space, but have O(N^ε) query times. We will develop nonoverlapping k-ranges in this section by first presenting and analyzing planar structures, and then investigating the k-dimensional structures.

The first object of our study will be the 2-level nonoverlapping 2-range. On the first level of this structure we consider one block of N^{1/2} units, each unit representing a set of N^{1/2} points contiguous in the y-direction; we then sort the points in each of those units by x-value. The second level of the structure consists of N^{1/2} blocks of N^{1/2} units, each representing a single point. We can represent both levels of the structure by an N^{1/2} by N^{1/2} array: each row of the array represents a "contiguous slice" of y-values of the point set and is then sorted by x-value. This structure requires only linear storage and can be built (by N^{1/2} distinct sorts) in O(N lg N) time.

Suppose now that we are to do a range search defined by an x-range and a y-range: for all the contiguous y-strips contained wholly in the y-range we can perform two binary searches to give the set of all points contained in the x-range. Since there are only N^{1/2} such strips altogether and each can be searched in logarithmic time, the total cost of this step is O(N^{1/2} lg N + A). We can then do a simple scan over the two end y-strips (top and bottom) to see if they contain any points in both the x and y ranges; this costs at most O(N^{1/2}) to examine the 2N^{1/2} points. Thus the total cost of searching is O(N^{1/2} lg N) and the performance of the structure as a whole is

P(N, 2) = O(N lg N),  S(N, 2) = O(N),
and

Q(N, 2) = O(N^{1/2} lg N + A).
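A minimal sketch of this structure (our own code; it assumes normalized points and a perfect-square N, and represents each y-strip by its y-extent together with its points sorted by x):

    import math
    from bisect import bisect_left, bisect_right

    def build_nonoverlapping(points):
        """Sort by y, cut into N^(1/2) y-contiguous strips of N^(1/2) points,
        and keep each strip sorted by x.  Storage is linear; building costs
        O(N lg N)."""
        N = len(points)
        s = int(math.isqrt(N))
        by_y = sorted(points, key=lambda p: p[1])
        strips = []
        for i in range(s):
            chunk = by_y[i * s:(i + 1) * s]
            strips.append((chunk[0][1], chunk[-1][1], sorted(chunk)))
        return strips

    def query_nonoverlapping(strips, L1, H1, L2, H2):
        """Strips wholly inside [L2,H2] are handled by two binary searches on
        x each (O(N^(1/2) lg N) in total); the at most two boundary strips
        are simply scanned."""
        out = []
        for y_min, y_max, xs in strips:
            if y_max < L2 or y_min > H2:
                continue                        # strip disjoint from y-range
            if L2 <= y_min and y_max <= H2:     # strip wholly inside y-range
                lo = bisect_left(xs, (L1,))
                hi = bisect_right(xs, (H1, float("inf")))
                out.extend(xs[lo:hi])
            else:                               # boundary strip: brute scan
                out.extend(p for p in xs
                           if L1 <= p[0] <= H1 and L2 <= p[1] <= H2)
        return out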
Nonoverlapping 2-ranges can easily be extended to be multilevel. In the first level of a 3-level 2-range we have one block of N^{1/3} units, each representing N^{2/3} points and sorted by x-value. On the second level we have N^{1/3} blocks, each containing N^{1/3} units of N^{1/3} points contiguous in y (sorted by x). The third level then contains N^{2/3} blocks of N^{1/3} units (points) each. This structure requires storage linear in N and can be built in O(N lg N) time. To answer a range query we must search at most N^{1/3} units on the first level and 2N^{1/3} units on each of the second and third levels. The cost of each of those searches is logarithmic (excluding the manipulation of points found), so the total cost of searching is O(N^{1/3} lg N). The obvious extension to l-level nonoverlapping 2-ranges carries through without flaw and has performance

P(N, 2) = O(l N lg N) = O(N lg N),  S(N, 2) = O(l N) = O(N),  Q(N, 2) = O(N^{1/l} lg N + A).


Note that for any fixed ε > 0 we can choose l > 1/ε and achieve a structure with linear storage, O(N lg N) preprocessing, and O(N^ε) search time.

Nonoverlapping l-level 2-ranges can be generalized to nonoverlapping l-level k-ranges; for each dimension we "add" we use the same multilevel structure and store the units as l-level nonoverlapping (k-1)-ranges. As each dimension is added the storage remains linear and the preprocessing remains O(N lg N) (with increased constants). The search time, however, increases by a factor of N^{1/l} for each added dimension. Thus by choosing l as a function of k and ε one can achieve performance

P(N, k) = O(N lg N),  S(N, k) = O(N),


and

Q(N, k) = O(N^ε + A).

Note that this is for fixed ε, k, and l: if l is allowed to vary with N then one achieves a tree-like structure (specifically, the range trees of Bentley [2] if l = lg N).

4. Lower Bounds

We have shown in Section 3 by using k-ranges that a k-dimensional set of N points can be stored such that range queries can be answered in time O(k lg N + A). We will now demonstrate that this is optimal. For an arbitrary point set F, let R(F) be the number of different range queries possible for F. (We say that two range queries are different iff their answers are different.) Let

R(N, k) = max {R(F) | F a set of N points in k dimensions}.

It is easy to see that

R(N, 1) = (N+1 choose 2) + 1,

for the answer to a range query is either empty (1 such answer) or can be defined by two of the N+1 interpoint locations. The exact value of R(N, k) for general N and k seems more difficult to calculate. We can immediately observe that

R(N, k) ≤ (N+1 choose 2)^k,

since for each dimension there are only N+1 essentially different positions for the upper and lower bounds of a range search. One might wonder how close R(N, k) can grow to this bound; Theorem 4.1 partially answers this.

Theorem 4.1. R(N, k) > (N/(2k))^{2k}.

Proof. To avoid complications assume N is a multiple of 2k. Let F be the set of all points with a single nonzero integer coordinate in the closed interval [-N/(2k), N/(2k)]. Consider the set of all range queries [L_1, H_1], [L_2, H_2], ..., [L_k, H_k] with

-N/(2k) ≤ L_i ≤ -1  and  1 ≤ H_i ≤ N/(2k)

for i = 1, 2, ..., k. The (N/(2k))^{2k} range queries obtained in this way clearly determine different sets of points, and the result follows. □

Corollary. For range queries on N points in k dimensions O(k lg N + A) is optimal for "comparison-based" methods.

Proof. By the Theorem, R(N, k) > (N/(2k))^{2k}. Hence any algorithm for range queries based on binary decisions requires in the worst case at least

lg (N/(2k))^{2k} = 2k lg N - 2k lg k - 2k lg 2

steps. Hence the 2k lg N comparisons used for range searching in one-level k-ranges are optimal to within second-order terms. □


5. Conclusions

We have presented three variants of a new data structure (the k-range) for storing k-dimensional sets of N points and permitting fast responses to range queries. The first variant, one-level k-ranges, requires only 2k lg N comparisons per query, plus an amount of list processing proportional to A, the number of answers found. However, preprocessing and storage costs of O(N^{2k-1}) are prohibitively high. With the second variant, multi-level k-ranges, lookup time is still O(lg N + A), but preprocessing and storage costs are reducible to O(N^{1+ε}) for every fixed ε > 0. Employing the third variant, nonoverlapping k-ranges, storage can be reduced to O(kN), preprocessing to O(N lg N), and for every ε > 0 a worst-case query time of O(N^ε + A) can be achieved. The results are summarized and compared with previously known techniques in the following table (Fig. 5.1), showing the behaviour for fixed k and large N.
Structure                           P(N, k)            S(N, k)            Q(N, k)
Naive                               O(N)               O(N)               O(N)
k-d Trees                           O(N lg N)          O(N)               O(N^{1-1/k} + A)
Nonoverlapping k-ranges             O(N lg N)          O(N)               O(N^ε + A)
Range Trees                         O(N lg^{k-1} N)    O(N lg^{k-1} N)    O(lg^k N + A)
(Overlapping) l-level k-ranges      O(N^{1+ε})         O(N^{1+ε})         O(lg N + A)
(Overlapping) one-level k-ranges    O(N^{2k-1})        O(N^{2k-1})        O(lg N + A)

Fig. 5.1

Fast solutions to other problems involving point sets in k dimensions can also be obtained by using the data structures and techniques of this paper. Two such examples are the problem of computing the Empirical Cumulative Distribution Function (ECDF searching problem) and the maxima searching problem discussed in detail in Bentley [2]. For a point x, the ECDF searching problem and the maxima searching problem can be formulated as follows:

ECDF Searching Problem. Determine the number of points y in F with y_i ≤ x_i for i = 1, 2, ..., k.

Maxima Searching Problem. Determine if there exists a point y in F with y_i > x_i for i = 1, 2, ..., k.

It is easy to see that by formulating the above problems in terms of range searching, the table in Fig. 5.1 is also valid for the ECDF searching problem and the maxima searching problem. (Indeed, the contribution of A can be ignored.) This is evident for the maxima searching problem since the answer is only "yes" or "no". For the ECDF searching problem it follows from the fact that A, as the count of the number of points determined, can be obtained in O(lg N) rather than O(A) time, by storing the 1-ranges involved as sorted arrays.
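For instance, after normalization the ECDF query for a point x is just the range query [1, x_1] × ... × [1, x_k], and the maxima query is the emptiness test for [x_1+1, N] × ... × [x_k+1, N]. A small sketch, reusing the brute-force searcher of Sect. 2 only for brevity (our own code, not the paper's):

    def ecdf_query(points, x):
        """Number of points y with y_i <= x_i for all i: a range query whose
        lower bounds are all -infinity (or 1 after normalization)."""
        k = len(x)
        lows = tuple(float("-inf") for _ in range(k))
        return len(brute_force_query(points, lows, tuple(x)))

    def maxima_query(points, x):
        """Is there a point y with y_i > x_i for all i?  With normalized
        integer coordinates this is the range query [x_1+1, N] x ... x [x_k+1, N];
        only the emptiness of the answer matters."""
        k = len(x)
        lows = tuple(x[i] + 1 for i in range(k))
        highs = tuple(float("inf") for _ in range(k))
        return len(brute_force_query(points, lows, highs)) > 0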


Despite the further insights into range searching gained by this paper, a number of open problems remain. Are there other data structures with a still better tradeoff between P(N, k), S(N, k), and Q(N, k)? In particular, for a total of 2k lg N comparisons is O(N^{2k-1}) optimal for preprocessing and storage? Can the product P(N, k) · S(N, k) · Q(N, k) be reduced to O(N^2 lg^2 N)? If not, can one show lower bounds on the above product, indicating "space-time" tradeoffs? What is the situation when the dynamic case (insertion and deletion of points in between queries) is considered? Another problem of independent interest is the exact computation of R(N, k) of Sect. 4. Although we have shown

(N/(2k))^{2k} ≤ R(N, k) ≤ (N+1 choose 2)^k,

we have been unable to compute the exact value of R(N, k) in general. We do not even know the values of R(N, 2) except for small N.

In this paper we have presented (asymptotically) fast worst-case methods for range searching, some of them with (asymptotically) small amounts of preprocessing and storage. We do not necessarily advocate these methods for practical applications. The results do, however, suggest that it may be possible to find methods for solving range queries efficiently both as far as average and worst-case behaviour are concerned.

References
1. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Comm. ACM 18, 509-517 (1975)
2. Bentley, J.L.: Multidimensional Divide-and-Conquer. Carnegie-Mellon University Computer Science Department Research Review, 1978, pp. 7-24
3. Bentley, J.L., Friedman, J.H.: Data structures for range searching. Comput. Surveys (in press 1980)
4. Knuth, D.E.: The Art of Computer Programming, vol. 3. Reading, Mass.: Addison-Wesley 1973
5. Lee, D.T., Wong, C.K.: Worst case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Informatica 9, 23-29 (1977)
6. Maurer, H.A., Ottmann, Th.: Manipulating sets of points - a survey. In: Graphen, Algorithmen, Datenstrukturen: Workshop 78. München-Wien: Hanser 1978

Received July 17, 1978/Revised September 25, 1979

Note added in proof Mr. James Saxe has tightened the bounds on R(N, k) in an article to appear in Discrete Applied Mathematics.
