The low-dimensional case (say, for d equal to 2 or 3) is well-solved [14], so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality." Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in [45] that, both empirically and theoretically, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions. This situation poses a serious obstacle to the future development of large scale similarity search systems. Imagine for example a search engine which enables content-based image retrieval on the World-Wide Web. If the system was to index a significant fraction of the web, the number of images to index would be at least of the order of tens (if not hundreds) of millions. Clearly, no indexing method exhibiting linear (or close to linear) dependence on the data size could manage such a huge data set.

The premise of this paper is that in many cases it is not necessary to insist on the exact answer; instead, determining an approximate answer should suffice (refer to Section 2 for a formal definition). This observation underlies a large body of recent research in databases, including using random sampling for histogram estimation [8] and median approximation [33], using wavelets for selectivity estimation [34] and approximate SVD [25]. We observe that there are many applications of nearest neighbor search where an approximate answer is good enough. For example, it often happens (e.g., see [23]) that the relevant answers are much closer to the query point than the irrelevant ones; in fact, this is a desirable property of a good similarity measure. In such cases, the approximate algorithm (with a suitable approximation factor) will return the same result as an exact algorithm. In other situations, an approximate algorithm provides the user with a time-quality tradeoff: the user can decide whether to spend more time waiting for the exact answer, or to be satisfied with a much quicker approximation (e.g., see [5]).

The above arguments rely on the assumption that approximate similarity search can be performed much faster than the exact one. In this paper we show that this is indeed the case. Specifically, we introduce a new indexing method for approximate nearest neighbor with a truly sublinear dependence on the data size even for high-dimensional data. Instead of using space partitioning, it relies on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. We provide such locality-sensitive hash functions that are simple and easy to implement; they can also be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. Although in this paper we are focused on Euclidean spaces, different LSH functions can also be used for other similarity measures, such as dot product [5].

Locality-Sensitive Hashing was introduced by Indyk and Motwani [24] for the purposes of devising main memory algorithms for nearest neighbor search; in particular, it enabled us to achieve worst-case O(d n^{1/ε}) time for an approximate nearest neighbor query over an n-point database. In this paper we improve that technique and achieve a significantly improved query time of O(d n^{1/(1+ε)}). This yields an approximate nearest neighbor algorithm running in sublinear time for any ε > 0. Furthermore, we generalize the algorithm and its analysis to the case of external memory.

We support our theoretical arguments by empirical evidence. We performed experiments on two data sets. The first contains 20,000 histograms of color images, where each histogram was represented as a point in d-dimensional space, for d up to 64. The second contains around 270,000 points representing texture information of blocks of large aerial photographs. All our tables were stored on disk. We compared the performance of our algorithm with the performance of the Sphere/Rectangle-tree (SR-tree) [28], a recent data structure which was shown to be comparable to or significantly more efficient than other tree-decomposition-based indexing methods for spatial data. The experiments show that our algorithm is significantly faster than the earlier methods, in some cases even by several orders of magnitude. It also scales well as the data size and dimensionality increase. Thus, it enables a new approach to high-performance similarity search: fast retrieval of an approximate answer, possibly followed by a slower but more accurate computation in the few cases where the user is not satisfied with the approximate answer.

The rest of this paper is organized as follows. In Section 2 we introduce the notation and give formal definitions of the similarity search problems. Then in Section 3 we describe locality-sensitive hashing and show how to apply it to nearest neighbor search. In Section 4 we report the results of experiments with LSH. The related work is described in Section 5. Finally, in Section 6 we present conclusions and ideas for future research.

2 Preliminaries

We use l_p^d to denote the Euclidean space R^d under the l_p norm, i.e., when the length of a vector (x_1, ..., x_d) is defined as (|x_1|^p + ... + |x_d|^p)^{1/p}. Further, d_p(p, q) =
||p − q||_p denotes the distance between the points p and q in l_p^d. We use H^d to denote the Hamming metric space of dimension d, i.e., the space of binary vectors of length d under the standard Hamming metric. We use d_H(p, q) to denote the Hamming distance, i.e., the number of bits on which p and q differ.

The nearest neighbor search problem is defined as follows:

Definition 1 (Nearest Neighbor Search (NNS)) Given a set P of n objects represented as points in a normed space l_p^d, preprocess P so as to efficiently answer queries by finding the point in P closest to a query point q.

The definition generalizes naturally to the case where we want to return K > 1 points. Specifically, in the K-Nearest Neighbors Search (K-NNS), we wish to return the K points in the database that are closest to the query point. The approximate version of the NNS problem is defined as follows:

Definition 2 (ε-Nearest Neighbor Search (ε-NNS)) Given a set P of points in a normed space l_p^d, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that d(q, p) ≤ (1 + ε) d(q, P), where d(q, P) is the distance of q to its closest point in P.

Again, this definition generalizes naturally to finding K > 1 approximate nearest neighbors. In the Approximate K-NNS problem, we wish to find K points p_1, ..., p_K such that the distance of p_i to the query q is at most (1 + ε) times the distance from the ith nearest point to q.

3 The Algorithm

In this section we present efficient solutions to the approximate versions of the NNS problem. Without significant loss of generality, we will make the following two assumptions about the data:

1. the distance is defined by the l_1 norm (see comments below),

2. all coordinates of points in P are positive integers.

The first assumption is not very restrictive, as usually there is no clear advantage in, or even difference between, using the l_2 or l_1 norm for similarity search. For example, the experiments done for the Webseek [43] project (see [40], chapter 4) show that comparing color histograms using the l_1 and l_2 norms yields very similar results (l_1 is marginally better). Both our data sets (see Section 4) have a similar property. Specifically, we observed that a nearest neighbor of an average query point computed under the l_1 norm was also an ε-approximate neighbor under the l_2 norm with an average value of ε less than 3% (this observation holds for both data sets). Moreover, in most cases (i.e., for 67% of the queries in the first set and 73% in the second set) the nearest neighbors under the l_1 and l_2 norms were exactly the same. This observation is interesting in its own right, and can be partially explained via the theorem by Figiel et al. (see [19] and references therein). They showed analytically that by simply applying scaling and random rotation to the space l_2, we can make the distances induced by the l_1 and l_2 norms almost equal up to an arbitrarily small factor. It seems plausible that real data is already randomly rotated, thus the difference between the l_1 and l_2 norms is very small. Moreover, for the data sets for which this property does not hold, we are guaranteed that after performing scaling and random rotation our algorithms can be used for the l_2 norm with arbitrarily small loss of precision.

As far as the second assumption is concerned, clearly all coordinates can be made positive by properly translating the origin of R^d. We can then convert all coordinates to integers by multiplying them by a suitably large number and rounding to the nearest integer. It can be easily verified that by choosing proper parameters, the error induced by rounding can be made arbitrarily small. Notice that after this operation the minimum interpoint distance is 1.

3.1 Locality-Sensitive Hashing

In this section we present locality-sensitive hashing (LSH). This technique was originally introduced by Indyk and Motwani [24] for the purposes of devising main memory algorithms for the ε-NNS problem. Here we give an improved version of their algorithm. The new algorithm is in many respects more natural than the earlier one: it does not require the hash buckets to store only one point; it has better running time guarantees; and, the analysis is generalized to the case of secondary memory.

Let C be the largest coordinate in all points in P. Then, as per [29], we can embed P into the Hamming cube H^{d'} with d' = C·d, by transforming each point p = (x_1, ..., x_d) into a binary vector

v(p) = Unary_C(x_1) ... Unary_C(x_d),

where Unary_C(x) denotes the unary representation of x, i.e., a sequence of x ones followed by C − x zeroes.

Fact 1 For any pair of points p, q with coordinates in the set {1, ..., C},

d_1(p, q) = d_H(v(p), v(q)).

That is, the embedding preserves the distances between the points. Therefore, in the sequel we can concentrate on solving ε-NNS in the Hamming space H^{d'}.
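Fact 1 is easy to verify in code. The sketch below (the helper names are ours, not the paper's) builds v(p) for a small C and checks that the l_1 distance between two points equals the Hamming distance between their embeddings:

```python
def unary(x, C):
    # Unary_C(x): a sequence of x ones followed by C - x zeroes
    return [1] * x + [0] * (C - x)

def v(p, C):
    # v(p) = Unary_C(x_1) ... Unary_C(x_d), concatenated
    return [bit for x in p for bit in unary(x, C)]

def l1(p, q):
    # d_1(p, q): distance under the l_1 norm
    return sum(abs(a - b) for a, b in zip(p, q))

def hamming(u, w):
    # d_H(u, w): number of positions on which u and w differ
    return sum(a != b for a, b in zip(u, w))

C = 5
p, q = (1, 4, 2), (3, 3, 5)
assert l1(p, q) == hamming(v(p, C), v(q, C))  # Fact 1
```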
However, we emphasize that we do not need to actually convert the data to the unary representation, which could be expensive when C is large; in fact, all our algorithms can be made to run in time independent of C. Rather, the unary representation provides us with a convenient framework for description of the algorithms, which would be more complicated otherwise.

We define the hash functions as follows. For an integer l to be specified later, choose l subsets I_1, ..., I_l of {1, ..., d'}. Let p|I denote the projection of vector p on the coordinate set I, i.e., we compute p|I by selecting the coordinate positions as per I and concatenating the bits in those positions. Denote g_j(p) = p|I_j. For the preprocessing, we store each p ∈ P in the bucket g_j(p), for j = 1, ..., l. As the total number of buckets may be large, we compress the buckets by resorting to standard hashing. Thus, we use two levels of hashing: the LSH function maps a point p to bucket g_j(p), and a standard hash function maps the contents of these buckets into a hash table of size M. The maximal bucket size of the latter hash table is denoted by B. For the algorithm's analysis, we will assume hashing with chaining, i.e., when the number of points in a bucket exceeds B, a new bucket (also of size B) is allocated and linked to and from the old bucket. However, our implementation does not employ chaining, but relies on a simpler approach: if a bucket in a given index is full, a new point cannot be added to it, since it will be added to some other index with high probability. This saves us the overhead of maintaining the link structure.

The number n of points, the size M of the hash table, and the maximum bucket size B are related by the following equation:

M = α n / B,

where α is the memory utilization parameter, i.e., the ratio of the memory allocated for the index to the size of the data set.

To process a query q, we search all indices g_1(q), ..., g_l(q) until we either encounter at least c·l points (for c specified later) or use all l indices. Clearly, the number of disk accesses is always upper bounded by the number of indices, which is equal to l. Let p_1, ..., p_t be the points encountered in the process. For Approximate K-NNS, we output the K points p_i closest to q; in general, we may return fewer points if the number of points encountered is less than K.

It remains to specify the choice of the subsets I_j. For each j ∈ {1, ..., l}, the set I_j consists of k elements from {1, ..., d'} sampled uniformly at random with replacement. The optimal value of k is chosen to maximize the probability that a point p "close" to q will fall into the same bucket as q, and also to minimize the probability that a point p' "far away" from q will fall into the same bucket. The choice of the values of l and k is deferred to the next section.

Algorithm Preprocessing
Input: A set of points P, l (number of hash tables)
Output: Hash tables T_i, i = 1, ..., l
Foreach i = 1, ..., l
  Initialize hash table T_i by generating a random hash function g_i(·)
Foreach i = 1, ..., l
  Foreach j = 1, ..., n
    Store point p_j in bucket g_i(p_j) of hash table T_i

Figure 1: Preprocessing algorithm for points already embedded in the Hamming cube.

Algorithm Approximate Nearest Neighbor Query
Input: A query point q, K (number of appr. nearest neighbors)
Access: To hash tables T_i, i = 1, ..., l, generated by the preprocessing algorithm
Output: K (or fewer) appr. nearest neighbors
S ← ∅
Foreach i = 1, ..., l
  S ← S ∪ {points found in bucket g_i(q) of table T_i}
Return the K nearest neighbors of q found in set S
/* Can be found by main memory linear search */

Figure 2: Approximate Nearest Neighbor query answering algorithm.

Although we are mainly interested in the I/O complexity of our scheme, it is worth pointing out that the hash functions can be efficiently computed if the data set is obtained by mapping l_1^d into d'-dimensional Hamming space. Let p be any point from the data set and let p' denote its image after the mapping. Let I be the set of coordinates and recall that we need to compute p'|I. For i = 1, ..., d, let I^i denote, in sorted order, the coordinates in I which correspond to the ith coordinate of p. Observe that projecting p' on I^i results in a sequence of bits which is monotone, i.e., consists of a number, say o_i, of ones followed by zeros. Therefore, in order to represent p'|I it is sufficient to compute o_i for i = 1, ..., d. However, the latter task is equivalent to finding the number of elements in the sorted array I^i which are smaller than a given value, i.e., the ith coordinate of p. This can be done via binary search in log C time, or even in constant time using a precomputed array of C bits. Thus, the total time needed to compute the function is either O(d log C) or O(d), depending on resources used. In our experimental section, the value of C can be made very small, and therefore we will resort to the second method.
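The observation above, that g_j(p) can be computed without ever materializing the unary vector v(p), can be sketched as follows (names and parameter values are illustrative, not from the paper):

```python
import random

# A sampled bit at position t of v(p) falls inside the unary block of
# coordinate i = t // C, at offset t % C, and equals 1 exactly when that
# offset is below x_i. So the projection needs no unary expansion.
random.seed(0)
d, C, k = 4, 10, 12
I = [random.randrange(C * d) for _ in range(k)]  # k sampled bit positions

def unary(x):
    return [1] * x + [0] * (C - x)

def project_naive(p):
    # materialize v(p), then read off the sampled bits
    vp = [bit for x in p for bit in unary(x)]
    return tuple(vp[t] for t in I)

def project_fast(p):
    # constant work per sampled bit; sorting the block of offsets and
    # binary-searching for x_i gives the O(d log C) variant of the text
    return tuple(1 if (t % C) < p[t // C] else 0 for t in I)

assert project_naive((3, 0, 10, 7)) == project_fast((3, 0, 10, 7))
```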
For quick reference we summarize the preprocessing and query answering algorithms in Figures 1 and 2.

3.2 Analysis of Locality-Sensitive Hashing

The principle behind our method is that the probability of collision of two points p and q is closely related to the distance between them. Specifically, the larger the distance, the smaller the collision probability. This intuition is formalized as follows [24]. Let D(·,·) be a distance function of elements from a set S, and for any p ∈ S let B(p, r) denote the set of elements from S within the distance r from p.

Definition 3 A family H of functions from S to U is called (r_1, r_2, p_1, p_2)-sensitive for D(·,·) if for any q, p ∈ S

- if p ∈ B(q, r_1) then Pr_H[h(q) = h(p)] ≥ p_1,

- if p ∉ B(q, r_2) then Pr_H[h(q) = h(p)] ≤ p_2.

In the above definition, probabilities are considered with respect to the random choice of a function h from the family H. In order for a locality-sensitive family to be useful, it has to satisfy the inequalities p_1 > p_2 and r_1 < r_2.

Observe that if D(·,·) is the Hamming distance d_H(·,·), then the family of projections on one coordinate is locality-sensitive. More specifically:

Fact 2 Let S = H^{d'} and D(p, q) = d_H(p, q). Then for any r, ε > 0, the family H_{d'} = {h_i : h_i((b_1, ..., b_{d'})) = b_i, for i = 1, ..., d'} is (r, r(1+ε), 1 − r/d', 1 − r(1+ε)/d')-sensitive.

The functions g_i are defined to be of the form

g_i(p) = (h_{i_1}(p), h_{i_2}(p), ..., h_{i_k}(p)),

where the functions h_{i_1}, ..., h_{i_k} are randomly chosen from H with replacement. As before, we choose l such functions g_1, ..., g_l. In the case when the family H_{d'} is used, i.e., each function selects one bit of an argument, the resulting values of g_j(p) are essentially equivalent to p|I_j.

We now show that the LSH algorithm can be used to solve what we call the (r, ε)-Neighbor problem: determine whether there exists a point p within a fixed distance r_1 = r of q, or whether all points in the database are at least a distance r_2 = r(1+ε) away from q; in the first case, the algorithm is required to return a point p' within distance at most (1+ε)r from q. In particular, we argue that the LSH algorithm solves this problem for a proper choice of k and l, depending on r and ε. Then we show how to apply the solution to this problem to solve ε-NNS.

Denote by P' the set of all points p' ∈ P such that d(q, p') > r_2. We observe that the algorithm correctly solves the (r, ε)-Neighbor problem if the following two properties hold:

P1 If there exists p* such that p* ∈ B(q, r_1), then g_j(p*) = g_j(q) for some j = 1, ..., l.

P2 The total number of blocks pointed to by q and containing only points from P' is less than cl.

Assume that H is a (r_1, r_2, p_1, p_2)-sensitive family; define ρ = ln(1/p_1) / ln(1/p_2). The correctness of the LSH algorithm follows from the following theorem.

Theorem 1 Setting k = log_{1/p_2}(n/B) and l = (n/B)^ρ guarantees that properties P1 and P2 hold with probability at least 1/2 − 1/e ≥ 0.132.

Remark 1 Note that by repeating the LSH algorithm O(1/δ) times, we can amplify the probability of success in at least one trial to 1 − δ, for any δ > 0.

Proof: Let property P1 hold with probability P_1, and property P2 hold with probability P_2. We will show that both P_1 and P_2 are large. Assume that there exists a point p* within distance r_1 of q. For any p' ∈ P', the probability that g_j(p') = g_j(q) is at most p_2^k = B/n, so the expected number of blocks containing only points from P' that q points to, summed over all l indices, is at most 2l; by the Markov inequality, this number exceeds 4l with probability less than 1/2. Thus, choosing c = 4, property P2 holds with probability P_2 ≥ 1/2. Consider now the probability that g_j(p*) = g_j(q). Clearly, it is bounded from below by p_1^k = p_1^{log_{1/p_2}(n/B)} = (n/B)^{−ρ}. By setting l = (n/B)^ρ, we bound from above the probability that g_j(p*) ≠ g_j(q) for all j = 1, ..., l by 1/e. Thus the probability that one such g_j exists is at least P_1 ≥ 1 − 1/e.

Therefore, the probability that both properties P1 and P2 hold is at least 1 − [(1 − P_1) + (1 − P_2)] = P_1 + P_2 − 1 ≥ 1/2 − 1/e. The theorem follows. □

In the following we consider the LSH family for the Hamming metric of dimension d' as specified in Fact 2. For this case, we show that ρ ≤ 1/(1+ε) assuming that r < d'/ln n; the latter assumption can be easily satisfied by increasing the dimensionality by padding a sufficiently long string of 0s at the end of each point's representation.
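To make the parameter choices concrete, here is a minimal in-memory sketch of the index, with k and l set as in Theorem 1 for the bit-sampling family of Fact 2. All names are ours; disk blocks, chaining, and the cl cutoff are omitted:

```python
import math
import random

random.seed(2)

def build_index(points, d_prime, r, eps, B=4):
    # (r, r(1+eps), p1, p2)-sensitive bit-sampling family (Fact 2)
    p1 = 1 - r / d_prime
    p2 = 1 - r * (1 + eps) / d_prime
    n = len(points)
    k = max(1, math.ceil(math.log(n / B) / math.log(1 / p2)))  # log_{1/p2}(n/B)
    rho = math.log(1 / p1) / math.log(1 / p2)
    l = max(1, math.ceil((n / B) ** rho))                      # (n/B)^rho
    # I_j: k coordinates sampled uniformly at random with replacement
    subsets = [[random.randrange(d_prime) for _ in range(k)] for _ in range(l)]
    tables = [{} for _ in range(l)]
    for p in points:
        for I, T in zip(subsets, tables):
            T.setdefault(tuple(p[i] for i in I), []).append(p)
    return subsets, tables

def query(q, subsets, tables, K=1):
    S = []
    for I, T in zip(subsets, tables):
        S.extend(T.get(tuple(q[i] for i in I), []))
    # the K nearest neighbors of q found in S, by linear scan
    S.sort(key=lambda p: sum(a != b for a, b in zip(p, q)))
    return S[:K]

d_prime = 64
pts = [tuple(random.randrange(2) for _ in range(d_prime)) for _ in range(200)]
subsets, tables = build_index(pts, d_prime, r=8, eps=1.0)
assert query(pts[0], subsets, tables)[0] == pts[0]
```

A stored point always collides with itself in every index, so querying with a database point returns that point at distance zero; for genuinely near queries the guarantee is only the probabilistic one of Theorem 1.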
Fact 3 Let r < d'/ln n. If p_1 = 1 − r/d' and p_2 = 1 − r(1+ε)/d', then ρ = ln(1/p_1)/ln(1/p_2) ≤ 1/(1+ε).

Proof: Observe that

ρ = ln(1/p_1) / ln(1/p_2) = ln(1 − r/d') / ln(1 − (1+ε)r/d').

Multiplying both the numerator and the denominator by d'/r, we obtain:

ρ = (d'/r) ln(1 − r/d') / [(d'/r) ln(1 − (1+ε)r/d')]
  = ln((1 − r/d')^{d'/r}) / ln((1 − (1+ε)r/d')^{d'/r}) = U/L.

In order to upper bound ρ, we need to bound U from below and L from above; note that both U and L are negative. To this end we use the following inequalities [35]:

(1 − (1+ε)r/d')^{d'/r} < e^{−(1+ε)}

and

(1 − r/d')^{d'/r} > e^{−1} (1 − 1/(d'/r)).

Therefore,

U/L < ln(e^{−1} (1 − 1/(d'/r))) / ln(e^{−(1+ε)})
    = (−1 + ln(1 − 1/(d'/r))) / (−(1+ε))
    = 1/(1+ε) − ln(1 − 1/(d'/r)) / (1+ε)
    ≤ 1/(1+ε) − ln(1 − 1/ln n),

where the last step uses the assumption r < d'/ln n. Hence

n^ρ < n^{1/(1+ε)} n^{−ln(1 − 1/ln n)} = n^{1/(1+ε)} (1 − 1/ln n)^{−ln n} = O(n^{1/(1+ε)}),

since (1 − 1/ln n)^{−ln n} converges to e as n grows. □

We now return to the ε-NNS problem. First, we observe that we could reduce it to the (r, ε)-Neighbor problem by building several data structures for the latter problem with different values of r. More specifically, we could explore r equal to r_0, r_0(1+ε), r_0(1+ε)^2, ..., r_max, where r_0 and r_max are the smallest and the largest possible distance between the query and the data point, respectively. We remark that the number of different radii could be further reduced [24] at the cost of increasing running time and space requirement. On the other hand, we observed that in practice choosing only one value of r is sufficient to produce answers of good quality: the same parameter r is likely to work for a vast majority of queries. Therefore in the experimental section we adopt a fixed choice of r and therefore also of k and l.

4 Experiments

We performed experiments on two data sets. The first one contains 20,000 histograms of color images, where each histogram was represented as a point in d-dimensional space, for d up to 64. The second one contains around 270,000 points of dimension 60 representing texture information of blocks of large aerial photographs. We describe the data sets in more detail later in the section.

We decided not to use randomly-chosen synthetic data in our experiments. Though such data is often used to measure the performance of exact similarity search algorithms, we found it unsuitable for evaluation of approximate algorithms for high data dimensionality. The main reason is as follows. Assume a data set consisting of points chosen independently at random from the same distribution. Most distributions (notably the uniform one) used in the literature assume that all coordinates of each point are chosen independently. In such a case, for any pair of points p, q the distance is sharply concentrated around its mean; thus almost all pairs are approximately equidistant, and almost every point is an approximate nearest neighbor.

Implementation. We implement the LSH algorithm as specified in Section 3. The LSH functions can be computed as described in Section 3.1. Denote the resulting vector of coordinates by (v_1, ..., v_k). For the second level mapping we use functions of the form

h(v_1, ..., v_k) = (a_1 v_1 + ... + a_k v_k) mod M,

where M is the size of the hash table and a_1, ..., a_k are random numbers from the interval [0, ..., M−1]. These functions can be computed using only 2k − 1 operations, and are sufficiently random for our purposes, i.e., give low probability of collision. Each second level bucket is then directly mapped to a disk block. We assumed that each block is 8KB of data. As each coordinate in our data sets can be represented using 1 byte, we can store up to 8192/d d-dimensional points per block.
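The second-level function is a one-liner; a sketch with illustrative parameter values (the constants here are ours):

```python
import random

random.seed(1)

def make_second_level_hash(k, M):
    # h(v_1,...,v_k) = (a_1 v_1 + ... + a_k v_k) mod M,
    # with a_1,...,a_k drawn at random from [0, M-1]
    a = [random.randrange(M) for _ in range(k)]
    def h(v):
        return sum(ai * vi for ai, vi in zip(a, v)) % M
    return h

M = 2 ** 20                       # hash table size (illustrative)
h = make_second_level_hash(8, M)
bucket = h((1, 0, 1, 1, 0, 0, 1, 0))
assert 0 <= bucket < M
assert h((1, 0, 1, 1, 0, 0, 1, 0)) == bucket  # deterministic for a fixed h
```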
Therefore, we assume the bucket size B = 100 for d = 64 or d = 60, B = 300 for d = 27, and B = 1000 for d = 8.

For the SR-tree, we used the implementation by Katayama [28]; as above, we allow it to store about 8192 bytes per block.

Performance measures. The goal of our experiments was to estimate two performance measures: the speed, measured by the number of disk blocks accessed per query, and the accuracy, measured by the effective error

E = (1/|Q|) Σ_{q ∈ Q} d_LSH / d*,

where d_LSH denotes the distance from a query point q to a point found by LSH, d* is the distance from q to its closest point, and the sum is taken over the query set Q. We also measure the miss ratio, i.e., the fraction of queries for which no answer was found.

[Figure: point set distance distribution (normalized frequency vs. interpoint distance).]
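The effective error is then a one-line computation over the query set; the per-query distances below are hypothetical inputs, not data from the paper:

```python
def effective_error(d_lsh, d_star):
    # E = (1/|Q|) * sum over q in Q of d_LSH(q) / d*(q)
    assert set(d_lsh) == set(d_star)
    return sum(d_lsh[q] / d_star[q] for q in d_lsh) / len(d_lsh)

# two hypothetical queries: one answered exactly, one 10% too far
E = effective_error({"q1": 1.1, "q2": 2.0}, {"q1": 1.0, "q2": 2.0})
assert abs(E - 1.05) < 1e-12
```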
overlapping tiles of size 64 × 64, from which the feature vectors are computed.

the size of S_2 is 1000. The set S_1 forms a database of

[Figure: error vs. number of indices (alpha = 2, n = 19,000, d = 64, k = 700).]
[Figure 5: Number of indices vs. data size. Panels (a) Approximate 1-NNS and (b) Approximate 10-NNS (alpha = 2, d = 64); curves: SR-Tree and LSH with error = .02, .05, .1, .2; axes: disk accesses vs. number of database points.]

[Figure 6: Miss ratio vs. data size. Panels (a) Approximate 1-NNS and (b) Approximate 10-NNS (alpha = 2, n = 19,000, d = 64); curves: Error = .05, .1; axes: miss ratio vs. number of database points.]
to around 1% for n = 19,000.

Dependence on Dimension. We performed the simulations for d = 2^3, 3^3, and 4^3; the choice of d's was limited to cubes of natural numbers because of the way the data has been created. Again, we performed the comparison for Approximate 1-NNS and Approximate 10-NNS; the results are shown on Figure 7. Note that LSH scales very well with the increase of dimensionality: for E = 5% the change from d = 8 to d = 64 increases the number of indices only by 2. The miss ratio was always below 6% for all dimensions.

This completes the comparison of LSH with the SR-tree. For a better understanding of the behavior of LSH, we performed an additional experiment on LSH only. Figure 8 presents the performance of LSH when the number of nearest neighbors to retrieve varies from 1 to 100.

4.3 Texture features

The experiments with texture feature data were designed to measure the performance of the LSH algorithm on large data sets; note that the size of the texture file (270,000 points) is an order of magnitude larger than the size of the histogram data set (20,000 points). The first step (i.e., the choice of the number of sampled bits k) was very similar to the previous experiment, therefore we skip the detailed description here. We just state that we assumed the number of sampled bits k = 65, with the other parameters being: the storage overhead alpha = 1, block size B = 100, and the number of nearest neighbors equal to 10. As stated above, the value of n was equal to 270,000. We varied the number of indices from 3 to 100, which resulted in error from 50% to 15% (see Figure 9 (a)). The shape of the curve is similar as in the previous experiment.
[Figure 7: Number of indices vs. dimension. Panels (a) Approximate 1-NNS and (b) Approximate 10-NNS (alpha = 2); curves: SR-Tree and LSH with error = .02, .05, .1, .2; axes: disk accesses vs. dimensions (8, 27, 64).]

[Figure 8: Number of indices vs. number of nearest neighbors (alpha = 2, n = 19,000, d = 64); curves: Error = .05, .1, .2; axes: disk accesses vs. number of nearest neighbors (1 to 100).]

The miss ratio was roughly 4% for 3 indices, 1% for 5 indices, and 0% otherwise.

To compare with the SR-tree, we implemented the latter on random subsets of the whole data set of sizes from 10,000 to 200,000. For n = 200,000 the average number of blocks accessed per query by the SR-tree was 1310, which is one to two orders of magnitude larger than the number of blocks accessed by our algorithm (see Figure 9 (b), where we show the running times of LSH for effective error 15%). Observe though that an SR-tree computes exact answers while LSH provides only an approximation. Thus, in order to perform an accurate evaluation of LSH, we decided to compare it with a modified SR-tree algorithm which produces approximate answers. The modification is simple: instead of running the SR-tree on the whole data set, we run it on a randomly chosen subset of it. In this way we achieve a speed-up of the algorithm (as the random sample of the data set is smaller than the original set) while incurring some error.

The query cost versus error tradeoff obtained in this way (for the entire data set) is depicted on Figure 9; we also include a similar graph for LSH. Observe that using random sampling results in considerable speed-up for the SR-tree algorithm, while keeping the error relatively low. However, even in this case the LSH algorithm considerably outperforms SR-trees, being up to an order of magnitude faster.

5 Previous Work

There is considerable literature on various versions of the nearest neighbor problem. Due to lack of space we omit a detailed description of related work; the reader is advised to read [39] for a survey of a variety of data structures for nearest neighbors in geometric spaces, including variants of k-d trees, R-trees, and structures based on space-filling curves. The more recent results are surveyed in [41]; see also an excellent survey by [4]. Recent theoretical work in nearest neighbor search is briefly surveyed in [24].

6 Conclusions

We presented a novel scheme for approximate similarity search based on locality-sensitive hashing. We compared the performance of this technique and the SR-tree, a good representative of tree-based spatial data structures. We showed that by allowing a small error and additional storage overhead, we can considerably improve the query time. Experimental results also indicate that our scheme scales well to even a large number of dimensions and data size. An additional advantage of our data structure is that its running time is essentially determined in advance.
[Figure 9: Performance vs. error; panels (a) and (b); curves: SR-Tree and LSH; axes: (number of) disk accesses vs. error (%).]

References

[1] S. Arya, D.M. Mount, and O. Narayan. Accounting for boundary effects in nearest-neighbor searching. Discrete and Computational Geometry, 16 (1996), pp. 155-176.

[2] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, 1994, pp. 573-582.

[3] J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18 (1975), pp. 509-517.

[4] S. Berchtold and D.A. Keim. High-dimensional Index Structures. In Proceedings of SIGMOD, 1998, p. 501. See http://www.informatik.uni-halle.de/~keim/SIGMOD98Tutorial.ps.gz

[5] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang. Finding Interesting Associations without Support Pruning. Technical Report, Computer Science Department, Stanford University.
[16] C. Faloutsos and D.W. Oard. A Survey of Information Retrieval and Filtering Methods. Technical Report CS-TR-3514, Department of Computer Science, University of Maryland, College Park, 1995.

[17] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: the QBIC system. IEEE Computer, 28 (1995), pp. 23-32.

[18] J.K. Friedman, J.L. Bentley, and R.A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3 (1977), pp. 209-226.

[19] T. Figiel, J. Lindenstrauss, and V.D. Milman. The dimension of almost spherical sections of convex bodies. Acta Math. 139 (1977), no. 1-2, pp. 53-94.

[20] A. Gersho and R.M. Gray. Vector Quantization and Data Compression. Kluwer, 1991.

[21] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. In Proceedings of the First International Conference on Knowledge Discovery & Data Mining, 1995, pp. 142-149.

[22] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 27 (1933), pp. 417-441.

[23] P. Indyk, G. Iyengar, and N. Shivakumar. Finding pirated video sequences on the Internet. Technical Report, Computer Science Department, Stanford University.

[24] P. Indyk and R. Motwani. Approximate Nearest Neighbor - Towards Removing the Curse of Dimensionality. In Proceedings of the 30th Symposium on Theory of Computing, 1998, pp. 604-613.

[25] K.V. Ravi Kanth, D. Agrawal, and A. Singh. Dimensionality Reduction for Similarity Searching in Dynamic Databases. In Proceedings of SIGMOD'98, pp. 166-176.

[26] K. Karhunen. Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae, Ser. A137, 1947.

[27] V. Koivune and S. Kassam. Nearest neighbor filters for multivariate data. IEEE Workshop on Nonlinear Signal and Image Processing, 1995.

[28] N. Katayama and S. Satoh. The SR-tree: an index structure for high-dimensional nearest neighbor queries. In Proceedings of SIGMOD'97, pp. 369-380. The code is available from http://www.rd.nacsis.ac.jp/~katayama/homepage/research/srtree/English.html

[29] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. In Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science, 1994, pp. 577-591.

[30] M. Loeve. Fonctions aleatoires de second ordre. Processus Stochastiques et mouvement Brownien, Hermann, Paris, 1948.

[31] B.S. Manjunath. Airphoto dataset. http://vivaldi.ece.ucsb.edu/Manjunath/research.htm

[32] B.S. Manjunath and W.Y. Ma. Texture features for browsing and retrieval of large image data. IEEE Transactions on Pattern Analysis and Machine Intelligence (Special Issue on Digital Libraries), 18 (8), pp. 837-842.

[33] G.S. Manku, S. Rajagopalan, and B.G. Lindsay. Approximate Medians and other Quantiles in One Pass and with Limited Memory. In Proceedings of SIGMOD'98, pp. 426-435.

[34] Y. Matias, J.S. Vitter, and M. Wang. Wavelet-based Histograms for Selectivity Estimation. In Proceedings of SIGMOD'98, pp. 448-459.

[35] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[36] M. Otterman. Approximate matching with high dimensionality R-trees. M.Sc. Scholarly paper, Dept. of Computer Science, Univ. of Maryland, College Park, MD, 1992.

[37] A. Pentland, R.W. Picard, and S. Sclaroff. Photobook: tools for content-based manipulation of image databases. In Proceedings of the SPIE Conference on Storage and Retrieval of Image and Video Databases II, 1994.

[38] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York, NY, 1983.

[39] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA, 1989.

[40] J.R. Smith. Integrated Spatial and Feature Image Systems: Retrieval, Analysis and Compression. Ph.D. thesis, Columbia University, 1997. Available at ftp://ftp.ctr.columbia.edu/CTR-Research/advent/public/public/jrsmith/thesis

[41] T. Sellis, N. Roussopoulos, and C. Faloutsos. Multidimensional Access Methods: Trees Have Grown Everywhere. In Proceedings of the 23rd International Conference on Very Large Data Bases, 1997, pp. 13-15.

[42] A.W.M. Smeulders and R. Jain, eds. Image Databases and Multi-media Search. Proceedings of the First International Workshop, IDB-MMS'96, Amsterdam University Press, Amsterdam, 1996.

[43] J.R. Smith and S.F. Chang. Visually Searching the Web for Content. IEEE Multimedia, 4 (1997), pp. 12-20. See also http://disney.ctr.columbia.edu/WebSEEk

[44] G. Wyszecki and W.S. Styles. Color science: concepts and methods, quantitative data and formulae. John Wiley and Sons, New York, NY, 1982.

[45] R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for Similarity Search Methods in High Dimensional Spaces. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), 1998, pp. 194-205.