
Similarity Search in High Dimensions via Hashing

Aristides Gionis*    Piotr Indyk†    Rajeev Motwani‡


Department of Computer Science
Stanford University
Stanford, CA 94305
{gionis, indyk, rajeev}@cs.stanford.edu

Abstract

The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).

* Supported by NAVY Grant N00014-96-1-1221 and NSF Grant IIS-9811904.
† Supported by a Stanford Graduate Fellowship and NSF NYI Award CCR-9357849.
‡ Supported by ARO MURI Grant DAAH04-96-1-0007, NSF Grant IIS-9811904, and NSF Young Investigator Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

1 Introduction

A similarity search problem involves a collection of objects (e.g., documents, images) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we are required to find the nearest (most similar) object to the query. The particularly interesting and well-studied case is the d-dimensional Euclidean space. The problem is of major importance to a variety of applications; some examples are: data compression [20]; databases and data mining [21]; information retrieval [11, 16, 38]; image and video databases [15, 17, 37, 42]; machine learning [7]; pattern recognition [9, 13]; and, statistics and data analysis [12, 27]. Typically, the features of the objects of interest are represented as points in $\mathbb{R}^d$ and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands. For example, in multimedia applications such as IBM's QBIC (Query by Image Content), the number of features could be several hundreds [15, 17]. In information retrieval for text documents, vector-space representations involve several thousands of dimensions, and it is considered to be a dramatic improvement that dimension-reduction techniques, such as the Karhunen-Loeve transform [26, 30] (also known as principal components analysis [22] or latent semantic indexing [11]), can reduce the dimensionality to a mere few hundreds!
The low-dimensional case (say, for d equal to 2 or 3) is well-solved [14], so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality." Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in [45] that, both empirically and theoretically, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions. This situation poses a serious obstacle to the future development of large-scale similarity search systems. Imagine for example a search engine which enables content-based image retrieval on the World-Wide Web. If the system was to index a significant fraction of the web, the number of images to index would be at least of the order of tens (if not hundreds) of millions. Clearly, no indexing method exhibiting linear (or close to linear) dependence on the data size could manage such a huge data set.

The premise of this paper is that in many cases it is not necessary to insist on the exact answer; instead, determining an approximate answer should suffice (refer to Section 2 for a formal definition). This observation underlies a large body of recent research in databases, including using random sampling for histogram estimation [8] and median approximation [33], using wavelets for selectivity estimation [34] and approximate SVD [25]. We observe that there are many applications of nearest neighbor search where an approximate answer is good enough. For example, it often happens (e.g., see [23]) that the relevant answers are much closer to the query point than the irrelevant ones; in fact, this is a desirable property of a good similarity measure. In such cases, the approximate algorithm (with a suitable approximation factor) will return the same result as an exact algorithm. In other situations, an approximate algorithm provides the user with a time-quality tradeoff: the user can decide whether to spend more time waiting for the exact answer, or to be satisfied with a much quicker approximation (e.g., see [5]).

The above arguments rely on the assumption that approximate similarity search can be performed much faster than the exact one. In this paper we show that this is indeed the case. Specifically, we introduce a new indexing method for approximate nearest neighbor with a truly sublinear dependence on the data size even for high-dimensional data. Instead of using space partitioning, it relies on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. We provide such locality-sensitive hash functions that are simple and easy to implement; they can also be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. Although in this paper we are focused on Euclidean spaces, different LSH functions can also be used for other similarity measures, such as dot product [5].

Locality-Sensitive Hashing was introduced by Indyk and Motwani [24] for the purposes of devising main memory algorithms for nearest neighbor search; in particular, it enabled us to achieve worst-case $O(dn^{1/\epsilon})$ time for an approximate nearest neighbor query over an n-point database. In this paper we improve that technique and achieve a significantly improved query time of $O(dn^{1/(1+\epsilon)})$. This yields an approximate nearest neighbor algorithm running in sublinear time for any ε > 0. Furthermore, we generalize the algorithm and its analysis to the case of external memory.

We support our theoretical arguments by empirical evidence. We performed experiments on two data sets. The first contains 20,000 histograms of color images, where each histogram was represented as a point in d-dimensional space, for d up to 64. The second contains around 270,000 points representing texture information of blocks of large aerial photographs. All our tables were stored on disk. We compared the performance of our algorithm with the performance of the Sphere/Rectangle-tree (SR-tree) [28], a recent data structure which was shown to be comparable to or significantly more efficient than other tree-decomposition-based indexing methods for spatial data. The experiments show that our algorithm is significantly faster than the earlier methods, in some cases even by several orders of magnitude. It also scales well as the data size and dimensionality increase. Thus, it enables a new approach to high-performance similarity search: fast retrieval of an approximate answer, possibly followed by a slower but more accurate computation in the few cases where the user is not satisfied with the approximate answer.

The rest of this paper is organized as follows. In Section 2 we introduce the notation and give formal definitions of the similarity search problems. Then in Section 3 we describe locality-sensitive hashing and show how to apply it to nearest neighbor search. In Section 4 we report the results of experiments with LSH. The related work is described in Section 5. Finally, in Section 6 we present conclusions and ideas for future research.

2 Preliminaries

We use $\ell_p^d$ to denote the Euclidean space $\mathbb{R}^d$ under the $l_p$ norm, i.e., when the length of a vector $(x_1, \ldots, x_d)$ is defined as $(|x_1|^p + \cdots + |x_d|^p)^{1/p}$. Further, $d_p(p, q) = \|p - q\|_p$ denotes the distance between the points p and q in $\ell_p^d$.
We use $H^{d}$ to denote the Hamming metric space of dimension d, i.e., the space of binary vectors of length d under the standard Hamming metric. We use $d_H(p, q)$ to denote the Hamming distance, i.e., the number of bits on which p and q differ.

The nearest neighbor search problem is defined as follows:

Definition 1 (Nearest Neighbor Search (NNS)) Given a set P of n objects represented as points in a normed space $\ell_p^d$, preprocess P so as to efficiently answer queries by finding the point in P closest to a query point q.

The definition generalizes naturally to the case where we want to return K > 1 points. Specifically, in the K-Nearest Neighbors Search (K-NNS), we wish to return the K points in the database that are closest to the query point. The approximate version of the NNS problem is defined as follows:

Definition 2 (ε-Nearest Neighbor Search (ε-NNS)) Given a set P of points in a normed space $\ell_p^d$, preprocess P so as to efficiently return a point $p \in P$ for any given query point q, such that $d(q, p) \le (1 + \epsilon)\, d(q, P)$, where $d(q, P)$ is the distance of q to its closest point in P.

Again, this definition generalizes naturally to finding K > 1 approximate nearest neighbors. In the Approximate K-NNS problem, we wish to find K points $p_1, \ldots, p_K$ such that the distance of $p_i$ to the query q is at most $(1 + \epsilon)$ times the distance from the ith nearest point to q.

3 The Algorithm

In this section we present efficient solutions to the approximate versions of the NNS problem. Without significant loss of generality, we will make the following two assumptions about the data:

1. the distance is defined by the $l_1$ norm (see comments below),

2. all coordinates of points in P are positive integers.

The first assumption is not very restrictive, as usually there is no clear advantage in, or even difference between, using the $l_2$ or $l_1$ norm for similarity search. For example, the experiments done for the Webseek project [43] (see [40], chapter 4) show that comparing color histograms using the $l_1$ and $l_2$ norms yields very similar results ($l_1$ is marginally better). Both our data sets (see Section 4) have a similar property. Specifically, we observed that a nearest neighbor of an average query point computed under the $l_1$ norm was also an ε-approximate neighbor under the $l_2$ norm with an average value of ε less than 3% (this observation holds for both data sets). Moreover, in most cases (i.e., for 67% of the queries in the first set and 73% in the second set) the nearest neighbors under the $l_1$ and $l_2$ norms were exactly the same. This observation is interesting in its own right, and can be partially explained via the theorem by Figiel et al. (see [19] and references therein). They showed analytically that by simply applying scaling and random rotation to the space $l_2$, we can make the distances induced by the $l_1$ and $l_2$ norms almost equal up to an arbitrarily small factor. It seems plausible that real data is already randomly rotated, thus the difference between the $l_1$ and $l_2$ norms is very small. Moreover, for the data sets for which this property does not hold, we are guaranteed that after performing scaling and random rotation our algorithms can be used for the $l_2$ norm with an arbitrarily small loss of precision.

As far as the second assumption is concerned, clearly all coordinates can be made positive by properly translating the origin of $\mathbb{R}^d$. We can then convert all coordinates to integers by multiplying them by a suitably large number and rounding to the nearest integer. It can be easily verified that by choosing proper parameters, the error induced by rounding can be made arbitrarily small. Notice that after this operation the minimum interpoint distance is 1.

3.1 Locality-Sensitive Hashing

In this section we present locality-sensitive hashing (LSH). This technique was originally introduced by Indyk and Motwani [24] for the purposes of devising main memory algorithms for the ε-NNS problem. Here we give an improved version of their algorithm. The new algorithm is in many respects more natural than the earlier one: it does not require the hash buckets to store only one point; it has better running time guarantees; and, the analysis is generalized to the case of secondary memory.

Let C be the largest coordinate in all points in P. Then, as per [29], we can embed P into the Hamming cube $H^{d'}$ with $d' = Cd$, by transforming each point $p = (x_1, \ldots, x_d)$ into a binary vector

$$v(p) = \mathrm{Unary}_C(x_1)\,\mathrm{Unary}_C(x_2) \cdots \mathrm{Unary}_C(x_d),$$

where $\mathrm{Unary}_C(x)$ denotes the unary representation of x, i.e., a sequence of x ones followed by $C - x$ zeroes.

Fact 1 For any pair of points p, q with coordinates in the set $\{1, \ldots, C\}$,

$$d_1(p, q) = d_H(v(p), v(q)).$$

That is, the embedding preserves the distances between the points. Therefore, in the sequel we can concentrate on solving ε-NNS in the Hamming space $H^{d'}$.
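To make the embedding concrete, here is a minimal Python sketch (our own illustration, not code from the paper) of Unary_C and a check of Fact 1 on random points:

import random

def unary(x, C):
    # Unary_C(x): x ones followed by C - x zeroes.
    return [1] * x + [0] * (C - x)

def embed(p, C):
    # v(p) = Unary_C(x_1) ... Unary_C(x_d), a binary vector of length d' = C*d.
    return [b for x in p for b in unary(x, C)]

def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

C, d = 10, 5
p = [random.randint(1, C) for _ in range(d)]
q = [random.randint(1, C) for _ in range(d)]
assert l1(p, q) == hamming(embed(p, C), embed(q, C))   # Fact 1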
However, we emphasize that we do not need to actually convert the data to the unary representation, which could be expensive when C is large; in fact, all our algorithms can be made to run in time independent of C. Rather, the unary representation provides us with a convenient framework for description of the algorithms, which would be more complicated otherwise.

We define the hash functions as follows. For an integer l to be specified later, choose l subsets $I_1, \ldots, I_l$ of $\{1, \ldots, d'\}$. Let $p|_I$ denote the projection of vector p on the coordinate set I, i.e., we compute $p|_I$ by selecting the coordinate positions as per I and concatenating the bits in those positions. Denote $g_j(p) = p|_{I_j}$. For the preprocessing, we store each $p \in P$ in the bucket $g_j(p)$, for $j = 1, \ldots, l$. As the total number of buckets may be large, we compress the buckets by resorting to standard hashing. Thus, we use two levels of hashing: the LSH function maps a point p to bucket $g_j(p)$, and a standard hash function maps the contents of these buckets into a hash table of size M. The maximal bucket size of the latter hash table is denoted by B. For the algorithm's analysis, we will assume hashing with chaining, i.e., when the number of points in a bucket exceeds B, a new bucket (also of size B) is allocated and linked to and from the old bucket. However, our implementation does not employ chaining, but relies on a simpler approach: if a bucket in a given index is full, a new point cannot be added to it, since it will be added to some other index with high probability. This saves us the overhead of maintaining the link structure.

The number n of points, the size M of the hash table, and the maximum bucket size B are related by the following equation:

$$M = \alpha \frac{n}{B},$$

where α is the memory utilization parameter, i.e., the ratio of the memory allocated for the index to the size of the data set.

To process a query q, we search all indices $g_1(q), \ldots, g_l(q)$ until we either encounter at least $c \cdot l$ points (for c specified later) or use all l indices. Clearly, the number of disk accesses is always upper bounded by the number of indices, which is equal to l. Let $p_1, \ldots, p_t$ be the points encountered in the process. For Approximate K-NNS, we output the K points $p_i$ closest to q; in general, we may return fewer points if the number of points encountered is less than K.

It remains to specify the choice of the subsets $I_j$. For each $j \in \{1, \ldots, l\}$, the set $I_j$ consists of k elements from $\{1, \ldots, d'\}$ sampled uniformly at random with replacement. The optimal value of k is chosen to maximize the probability that a point p "close" to q will fall into the same bucket as q, and also to minimize the probability that a point p' "far away" from q will fall into the same bucket. The choice of the values of l and k is deferred to the next section.

Algorithm Preprocessing
Input: A set of points P, l (number of hash tables)
Output: Hash tables $T_i$, $i = 1, \ldots, l$
Foreach $i = 1, \ldots, l$
    Initialize hash table $T_i$ by generating a random hash function $g_i(\cdot)$
Foreach $i = 1, \ldots, l$
    Foreach $j = 1, \ldots, n$
        Store point $p_j$ in bucket $g_i(p_j)$ of hash table $T_i$

Figure 1: Preprocessing algorithm for points already embedded in the Hamming cube.

Algorithm Approximate Nearest Neighbor Query
Input: A query point q, K (number of approximate nearest neighbors)
Access: To hash tables $T_i$, $i = 1, \ldots, l$, generated by the preprocessing algorithm
Output: K (or fewer) approximate nearest neighbors
$S \leftarrow \emptyset$
Foreach $i = 1, \ldots, l$
    $S \leftarrow S \cup \{$points found in bucket $g_i(q)$ of table $T_i\}$
Return the K nearest neighbors of q found in set S
/* Can be found by main memory linear search */

Figure 2: Approximate Nearest Neighbor query answering algorithm.

Although we are mainly interested in the I/O complexity of our scheme, it is worth pointing out that the hash functions can be efficiently computed if the data set is obtained by mapping $l_1^d$ into $d'$-dimensional Hamming space. Let p be any point from the data set and let p' denote its image after the mapping. Let I be the set of coordinates and recall that we need to compute $p'|_I$. For $i = 1, \ldots, d$, let $I|_i$ denote, in sorted order, the coordinates in I which correspond to the ith coordinate of p. Observe that projecting p' on $I|_i$ results in a sequence of bits which is monotone, i.e., consists of a number, say $o_i$, of ones followed by zeros. Therefore, in order to represent $p'|_I$ it is sufficient to compute $o_i$ for $i = 1, \ldots, d$. However, the latter task is equivalent to finding the number of elements in the sorted array $I|_i$ which are smaller than a given value, i.e., the ith coordinate of p. This can be done via binary search in $\log C$ time, or even in constant time using a precomputed array of C bits. Thus, the total time needed to compute the function is either $O(d \log C)$ or $O(d)$, depending on resources used. In our experimental section, the value of C can be made very small, and therefore we will resort to the second method.
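The following Python sketch (ours; the class name, data layout, and example parameters are illustrative assumptions) implements the scheme of Figures 1 and 2 in main memory for integer points: each g_j samples k bits of the unary embedding, evaluated directly from the coordinates so that the unary vectors are never materialized. The second level of hashing and the disk-block bucket size B are omitted here.

import random
from collections import defaultdict

class HammingLSH:
    # Bit-sampling LSH over the unary/Hamming embedding of integer points.

    def __init__(self, points, C, k, l):
        self.points = points              # list of d-dimensional integer vectors
        self.C = C                        # largest coordinate value
        d_prime = C * len(points[0])      # dimension of the Hamming cube H^{d'}
        # I_1, ..., I_l: k coordinates of {0, ..., d'-1} sampled with replacement.
        self.I = [[random.randrange(d_prime) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]
        for idx, p in enumerate(points):  # Figure 1: store every point in each of the l tables
            for j in range(l):
                self.tables[j][self.g(j, p)].append(idx)

    def g(self, j, p):
        # g_j(p) = p|_{I_j}: bit s of v(p) equals 1 iff (s mod C) < x_{s div C}.
        return tuple(1 if (s % self.C) < p[s // self.C] else 0 for s in self.I[j])

    def query(self, q, K):
        # Figure 2: collect colliding points, then rank them by l1 distance.
        S = set()
        for j in range(len(self.I)):
            S.update(self.tables[j].get(self.g(j, q), []))
        dist = lambda i: sum(abs(a - b) for a, b in zip(self.points[i], q))
        return sorted(S, key=dist)[:K]

# Example: 1,000 random 8-dimensional points with coordinates in {1, ..., 16}.
pts = [[random.randint(1, 16) for _ in range(8)] for _ in range(1000)]
index = HammingLSH(pts, C=16, k=40, l=10)
print(index.query(pts[0], K=5))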
For quick reference we summarize the preprocessing and query answering algorithms in Figures 1 and 2.

3.2 Analysis of Locality-Sensitive Hashing

The principle behind our method is that the probability of collision of two points p and q is closely related to the distance between them. Specifically, the larger the distance, the smaller the collision probability. This intuition is formalized as follows [24]. Let $D(\cdot, \cdot)$ be a distance function of elements from a set S, and for any $p \in S$ let $B(p, r)$ denote the set of elements from S within the distance r from p.

Definition 3 A family H of functions from S to U is called $(r_1, r_2, p_1, p_2)$-sensitive for $D(\cdot, \cdot)$ if for any $q, p \in S$

- if $p \in B(q, r_1)$ then $\Pr_H[h(q) = h(p)] \ge p_1$,
- if $p \notin B(q, r_2)$ then $\Pr_H[h(q) = h(p)] \le p_2$.

In the above definition, probabilities are considered with respect to the random choice of a function h from the family H. In order for a locality-sensitive family to be useful, it has to satisfy the inequalities $p_1 > p_2$ and $r_1 < r_2$.

Observe that if $D(\cdot, \cdot)$ is the Hamming distance $d_H(\cdot, \cdot)$, then the family of projections on one coordinate is locality-sensitive. More specifically:

Fact 2 Let S be $H^{d'}$ (the d'-dimensional Hamming cube) and $D(p, q) = d_H(p, q)$ for $p, q \in H^{d'}$. Then for any r, ε > 0, the family

$$H_{d'} = \{h_i : h_i((b_1, \ldots, b_{d'})) = b_i, \ i = 1, \ldots, d'\}$$

is $\left(r,\ r(1+\epsilon),\ 1 - \tfrac{r}{d'},\ 1 - \tfrac{r(1+\epsilon)}{d'}\right)$-sensitive.

We now generalize the algorithm from the previous section to an arbitrary locality-sensitive family H. Thus, the algorithm is equally applicable to other locality-sensitive hash functions (e.g., see [5]). The generalization is simple: the functions g are now defined to be of the form

$$g_i(p) = (h_{i_1}(p), h_{i_2}(p), \ldots, h_{i_k}(p)),$$

where the functions $h_{i_1}, \ldots, h_{i_k}$ are randomly chosen from H with replacement. As before, we choose l such functions $g_1, \ldots, g_l$. In the case when the family $H_{d'}$ is used, i.e., each function selects one bit of an argument, the resulting values of $g_j(p)$ are essentially equivalent to $p|_{I_j}$.

We now show that the LSH algorithm can be used to solve what we call the (r, ε)-Neighbor problem: determine whether there exists a point p within a fixed distance $r_1 = r$ of q, or whether all points in the database are at least a distance $r_2 = r(1+\epsilon)$ away from q; in the first case, the algorithm is required to return a point p' within distance at most $(1+\epsilon)r$ from q. In particular, we argue that the LSH algorithm solves this problem for a proper choice of k and l, depending on r and ε. Then we show how to apply the solution to this problem to solve ε-NNS.

Denote by P' the set of all points $p' \in P$ such that $d(q, p') > r_2$. We observe that the algorithm correctly solves the (r, ε)-Neighbor problem if the following two properties hold:

P1 If there exists $p^*$ such that $p^* \in B(q, r_1)$, then $g_j(p^*) = g_j(q)$ for some $j = 1, \ldots, l$.

P2 The total number of blocks pointed to by q and containing only points from P' is less than cl.

Assume that H is a $(r_1, r_2, p_1, p_2)$-sensitive family; define $\rho = \frac{\ln 1/p_1}{\ln 1/p_2}$. The correctness of the LSH algorithm follows from the following theorem.

Theorem 1 Setting $k = \log_{1/p_2}(n/B)$ and $l = (n/B)^{\rho}$ guarantees that properties P1 and P2 hold with probability at least $\frac{1}{2} - \frac{1}{e} \approx 0.132$.

Remark 1 Note that by repeating the LSH algorithm O(1/δ) times, we can amplify the probability of success in at least one trial to 1 − δ, for any δ > 0.

Proof: Let property P1 hold with probability $P_1$, and property P2 hold with probability $P_2$. We will show that both $P_1$ and $P_2$ are large. Assume that there exists a point $p^*$ within distance $r_1$ of q; the proof is quite similar otherwise. Set $k = \log_{1/p_2}(n/B)$. The probability that $g(p') = g(q)$ for $p' \in P - B(q, r_2)$ is at most $p_2^k = \frac{B}{n}$. Denote the set of all points $p' \notin B(q, r_2)$ by P'. The expected number of blocks allocated for $g_j$ which contain exclusively points from P' does not exceed 2. The expected number of such blocks allocated for all $g_j$ is at most 2l. Thus, by the Markov inequality [35], the probability that this number exceeds 4l is less than 1/2. If we choose c = 4, the probability that the property P2 holds is $P_2 > 1/2$.

Consider now the probability of $g_j(p^*) = g_j(q)$. Clearly, it is bounded from below by

$$p_1^k = p_1^{\log_{1/p_2}(n/B)} = (n/B)^{-\frac{\log 1/p_1}{\log 1/p_2}} = (n/B)^{-\rho}.$$

By setting $l = (n/B)^{\rho}$, we bound from above the probability that $g_j(p^*) \neq g_j(q)$ for all $j = 1, \ldots, l$ by 1/e. Thus the probability that one such $g_j$ exists is at least $P_1 \ge 1 - 1/e$.

Therefore, the probability that both properties P1 and P2 hold is at least $1 - [(1 - P_1) + (1 - P_2)] = P_1 + P_2 - 1 \ge \frac{1}{2} - \frac{1}{e}$. The theorem follows.

In the following we consider the LSH family for the Hamming metric of dimension d' as specified in Fact 2. For this case, we show that $\rho \le \frac{1}{1+\epsilon}$ assuming that $r < \frac{d'}{\ln n}$; the latter assumption can be easily satisfied by increasing the dimensionality by padding a sufficiently long string of 0s at the end of each point's representation.
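For concreteness, here is a small sketch (our own illustration) that evaluates $p_1$, $p_2$, ρ, k and l from Fact 2 and Theorem 1; the example numbers at the end are placeholders, not values used in the experiments.

import math

def lsh_parameters(d_prime, r, eps, n, B):
    # Fact 2: collision probabilities of a single sampled bit.
    p1 = 1.0 - r / d_prime
    p2 = 1.0 - r * (1.0 + eps) / d_prime
    # rho = ln(1/p1) / ln(1/p2); Fact 3 gives rho <= 1/(1+eps) when r < d'/ln n.
    rho = math.log(1.0 / p1) / math.log(1.0 / p2)
    # Theorem 1: k = log_{1/p2}(n/B), l = (n/B)^rho.
    k = math.ceil(math.log(n / B) / math.log(1.0 / p2))
    l = math.ceil((n / B) ** rho)
    return p1, p2, rho, k, l

# Placeholder example: d' = C*d with C = 255 and d = 64.
print(lsh_parameters(d_prime=255 * 64, r=500, eps=1.0, n=20000, B=100))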
Fact 3 Let $r < \frac{d'}{\ln n}$. If $p_1 = 1 - \frac{r}{d'}$ and $p_2 = 1 - \frac{r(1+\epsilon)}{d'}$, then $\rho = \frac{\ln 1/p_1}{\ln 1/p_2} \le \frac{1}{1+\epsilon}$.

Proof: Observe that

$$\rho = \frac{\ln 1/p_1}{\ln 1/p_2} = \frac{\ln \frac{1}{1 - r/d'}}{\ln \frac{1}{1 - (1+\epsilon)r/d'}} = \frac{\ln(1 - r/d')}{\ln(1 - (1+\epsilon)r/d')}.$$

Multiplying both the numerator and the denominator by $\frac{d'}{r}$, we obtain:

$$\rho = \frac{\frac{d'}{r}\ln(1 - r/d')}{\frac{d'}{r}\ln(1 - (1+\epsilon)r/d')} = \frac{\ln(1 - r/d')^{d'/r}}{\ln(1 - (1+\epsilon)r/d')^{d'/r}} = \frac{U}{L}.$$

In order to upper bound ρ, we need to bound U from below and L from above; note that both U and L are negative. To this end we use the following inequalities [35]:

$$(1 - (1+\epsilon)r/d')^{d'/r} < e^{-(1+\epsilon)}$$

and

$$\left(1 - \frac{r}{d'}\right)^{d'/r} > e^{-1}\left(1 - \frac{1}{d'/r}\right).$$

Therefore,

$$\frac{U}{L} < \frac{\ln\left(e^{-1}\left(1 - \frac{1}{d'/r}\right)\right)}{\ln e^{-(1+\epsilon)}} = \frac{-1 + \ln\left(1 - \frac{1}{d'/r}\right)}{-(1+\epsilon)} = \frac{1}{1+\epsilon} - \frac{\ln\left(1 - \frac{1}{d'/r}\right)}{1+\epsilon} < \frac{1}{1+\epsilon} - \ln\left(1 - \frac{1}{\ln n}\right),$$

where the last step uses the assumptions that ε > 0 and $r < \frac{d'}{\ln n}$. We conclude that

$$n^{\rho} < n^{1/(1+\epsilon)}\, n^{-\ln(1 - 1/\ln n)} = n^{1/(1+\epsilon)}\, (1 - 1/\ln n)^{-\ln n} = O(n^{1/(1+\epsilon)}).$$

We now return to the ε-NNS problem. First, we observe that we could reduce it to the (r, ε)-Neighbor problem by building several data structures for the latter problem with different values of r. More specifically, we could explore r equal to $r_0$, $r_0(1+\epsilon)$, $r_0(1+\epsilon)^2, \ldots, r_{max}$, where $r_0$ and $r_{max}$ are the smallest and the largest possible distance between the query and the data point, respectively. We remark that the number of different radii could be further reduced [24] at the cost of increasing running time and space requirement. On the other hand, we observed that in practice choosing only one value of r is sufficient to produce answers of good quality. This can be explained as in [10], where it was observed that the distribution of distances between a query point and the data set in most cases does not depend on the specific query point, but on the intrinsic properties of the data set. Under the assumption of distribution invariance, the same parameter r is likely to work for a vast majority of queries. Therefore in the experimental section we adopt a fixed choice of r and therefore also of k and l.

4 Experiments

In this section we report the results of our experiments with the locality-sensitive hashing method. We performed experiments on two data sets. The first one contains up to 20,000 histograms of color images from the COREL Draw library, where each histogram was represented as a point in d-dimensional space, for d up to 64. The second one contains around 270,000 points of dimension 60 representing texture information of blocks of large aerial photographs. We describe the data sets in more detail later in the section.

We decided not to use randomly-chosen synthetic data in our experiments. Though such data is often used to measure the performance of exact similarity search algorithms, we found it unsuitable for the evaluation of approximate algorithms at high data dimensionality. The main reason is as follows. Assume a data set consisting of points chosen independently at random from the same distribution. Most distributions (notably uniform) used in the literature assume that all coordinates of each point are chosen independently. In such a case, for any pair of points p, q the distance d(p, q) is sharply concentrated around the mean; for example, for the uniform distribution over the unit cube, the expected distance is $O(d)$, while the standard deviation is only $O(\sqrt{d})$. Thus almost all pairs are approximately within the same distance, so the notion of approximate nearest neighbor is not meaningful: almost every point is an approximate nearest neighbor.

Implementation. We implement the LSH algorithm as specified in Section 3. The LSH functions can be computed as described in Section 3.1. Denote the resulting vector of coordinates by $(v_1, \ldots, v_k)$. For the second level mapping we use functions of the form

$$h(v_1, \ldots, v_k) = (a_1 \cdot v_1 + \cdots + a_k \cdot v_k) \bmod M,$$

where M is the size of the hash table and $a_1, \ldots, a_k$ are random numbers from the interval $[0 \ldots M-1]$. These functions can be computed using only 2k − 1 operations, and are sufficiently random for our purposes, i.e., give low probability of collision. Each second level bucket is then directly mapped to a disk block. We assumed that each block is 8KB of data.
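A minimal sketch of this second-level mapping (our own illustration; the values of k and M below are placeholders):

import random

def make_second_level_hash(k, M):
    # Random multipliers a_1, ..., a_k from [0, M-1]; h(v) = (sum a_i * v_i) mod M.
    a = [random.randrange(M) for _ in range(k)]
    def h(v):
        assert len(v) == k
        return sum(ai * vi for ai, vi in zip(a, v)) % M
    return h

h = make_second_level_hash(k=8, M=1 << 20)
print(h([1, 0, 1, 1, 0, 0, 1, 0]))   # maps a k-bit g_j(p) value to a table slot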
As each coordinate in our data sets can be represented using 1 byte, we can store up to 8192/d d-dimensional points per block. Therefore, we assume the bucket size B = 100 for d = 64 or d = 60, B = 300 for d = 27, and B = 1000 for d = 8.

For the SR-tree, we used the implementation by Katayama, available from his web page [28]. As above, we allow it to store about 8192 coordinates per disk block.

Performance measures. The goal of our experiments was to estimate two performance measures: speed (for both SR-tree and LSH) and accuracy (for LSH). The speed is measured by the number of disk blocks accessed in order to answer a query. We count all disk accesses, thus ignoring the issue of caching. Observe that in the case of LSH this number is easy to predict, as it is clearly equal to the number of indices used. As the number of indices also determines the storage overhead, it is a natural parameter to optimize.

The error of LSH is measured as follows. Following [2] we define (for the Approximate 1-NNS problem) the effective error as

$$E = \frac{1}{|Q|} \sum_{\text{query } q \in Q} \frac{d_{LSH}}{d^*},$$

where $d_{LSH}$ denotes the distance from a query point q to a point found by LSH, $d^*$ is the distance from q to the closest point, and the sum is taken over all queries for which a nonempty index was found. We also measure the (small) fraction of queries for which no nonempty bucket was found; we call this quantity the miss ratio. For the Approximate K-NNS we measure separately the distance ratios between the closest points found and the nearest neighbor, the 2nd closest one and the 2nd nearest neighbor, and so on; then we average the ratios. The miss ratio is defined to be the fraction of cases when fewer than K points were found.

Data Sets. Our first data set consists of 20,000 histograms of color thumbnail-sized images of various contents taken from the COREL library. The histograms were extracted after transforming the pixels of the images to the 3-dimensional CIE-Lab color space [44]; the property of this space is that the distance between each pair of points corresponds to the perceptual dissimilarity between the colors that the two points represent. Then we partitioned the color space into a grid of smaller cubes, and given an image, we create the color histogram of the image by counting how many pixels fall into each of these cubes. By dividing each axis into u intervals we obtain a total of u^3 cubes. For most experiments, we assumed u = 4, obtaining a 64-dimensional space. Each histogram cube (i.e., color) then corresponds to a dimension of the space representing the images. Finally, quantization is performed in order to fit each coordinate in 1 byte. For each point representing an image, each coordinate effectively counts the number of the image's pixels of a specific color. All coordinates are clearly non-negative integers, as assumed in Section 3. The distribution of interpoint distances in our point sets is shown in Figure 3. Both graphs were obtained by computing all interpoint distances of random subsets of 200 points, normalizing the maximal value to 1.

Figure 3: The profiles of the data sets. (a) Color histograms; (b) Texture features. (Both plots show normalized frequency vs. interpoint distance.)

The second data set contains 275,465 feature vectors of dimension 60 representing texture information of blocks of large aerial photographs. This data set was provided by B. S. Manjunath [31, 32]; its size and dimensionality "provides challenging problems in high dimensional indexing" [31]. These features are obtained from Gabor filtering of the image tiles. The Gabor filter bank consists of 5 scales and 6 orientations of filters, thus the total number of filters is 5 × 6 = 30. The mean and standard deviation of each filtered output are used to construct the feature vector (d = 30 × 2 = 60). These texture features are extracted from 40 large air photos.
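As an illustration of the preparation of the first data set described above, the following sketch (ours) computes a u^3-bin histogram from pixels that are assumed to be already converted to CIE-Lab; the Lab ranges and the final quantization step are our assumptions, since the paper does not spell them out.

def color_histogram(lab_pixels, u=4, lab_min=(0.0, -128.0, -128.0), lab_max=(100.0, 127.0, 127.0)):
    # lab_pixels: iterable of (L, a, b) triples already converted to CIE-Lab.
    hist = [0] * (u ** 3)
    for pix in lab_pixels:
        idx = 0
        for c, lo, hi in zip(pix, lab_min, lab_max):
            cell = int((c - lo) / (hi - lo) * u)
            cell = min(max(cell, 0), u - 1)   # clamp to {0, ..., u-1}
            idx = idx * u + cell
        hist[idx] += 1
    # One simple way to fit each coordinate into one byte (the paper does not
    # specify its exact quantization scheme).
    peak = max(max(hist), 1)
    return [min(255, (255 * h) // peak) for h in hist]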
Before the feature extraction, each airphoto is first partitioned into non-overlapping tiles of size 64 × 64, from which the feature vectors are computed.

Query Sets. The difficulty in evaluating similarity searching algorithms is the lack of a publicly available database containing typical query points. Therefore, we had to construct the query set from the data set itself. Our construction is as follows: we split the data set randomly into two disjoint parts (call them S1 and S2). For the first data set the size of S1 is 19,000 while the size of S2 is 1,000. The set S1 forms a database of images, while the first 500 points from S2 (denoted by Q) are used as query points (we use the other 500 points for various verification purposes). For the second data set we chose S1 to be of size 270,000, and we use 1,000 of the remaining 5,465 points as a query set. The numbers are slightly different for the scalability experiments, as they require varying the size of the data set. In this case we chose a random subset of S1 of the required size.

4.1 Experimental Results

In this section we describe the results of our experiments. For both data sets they consist essentially of the following three steps. In the first phase we have to choose the value of k (the number of sampled bits) for a given data set and a given number of indices l, in order to minimize the effective error. It turned out that the optimal value of k is essentially independent of n and d and thus we can use the same value for different values of these parameters. In the second phase, we estimate the influence of the number of indices l on the error. Finally, we measure the performance of LSH by computing (for a variety of data sets) the minimal number of indices needed to achieve a specified value of error. When applicable, we also compare this performance with that of SR-trees.

4.2 Color histograms

For this data set, we performed several experiments aimed at understanding the behavior of the LSH algorithm and its performance relative to the SR-tree. As mentioned above, we started with an observation that the optimal value of sampled bits k is essentially independent of n and d and approximately equal to 700 for d = 64. The lack of dependence on n can be explained by the fact that the smaller data sets were obtained by sampling the large one and therefore all of the sets have similar structure; we believe the lack of dependence on d is also influenced by the structure of the data. Therefore the following experiments were done assuming k = 700.

Our next observation was that the value of the storage overhead α does not exert much influence over the performance of the algorithm (we tried α's from the interval [2, 5]); thus, in the following we set α = 2.

In the next step we estimated the influence of l on E. The results (for n = 19,000, d = 64, K = 1) are shown on Figure 4. As expected, one index is not sufficient to achieve a reasonably small error: the effective error can easily exceed 50%. The error however decreases very fast as l increases. This is due to the fact that the probabilities of finding an empty bucket are independent for different indices and therefore the probability that all buckets are empty decays exponentially in l.

Figure 4: Error vs. the number of indices. (Effective error for α = 2, n = 19,000, d = 64, k = 700, as the number of indices grows from 1 to 10.)

In order to compare the performance of LSH with the SR-tree, we computed (for a variety of data sets) the minimal number of indices needed to achieve a specified value of error E equal to 2%, 5%, 10% or 20%. Then we investigated the performance of the two algorithms while varying the dimension and data size.

Dependence on Data Size. We performed the simulations for d = 64 and data sets of sizes 1000, 2000, 5000, 10000 and 19000. To achieve a better understanding of the scalability of our algorithms, we ran the experiments twice: for Approximate 1-NNS and for Approximate 10-NNS. The results are presented on Figure 5.

Notice the strongly sublinear dependence exhibited by LSH: although for small E = 2% it matches the SR-tree for n = 1000 with 5 blocks accessed (for K = 1), it requires only 3 accesses more for a data set 19 times larger. At the same time the I/O activity of the SR-tree increases by more than 200%. For larger errors the LSH curves are nearly flat, i.e., exhibit little dependence on the size of the data. Similar or even better behavior occurs for Approximate 10-NNS.

We also computed the miss ratios, i.e., the fraction of queries for which no answer was found. The results are presented on Figure 6. We used the parameters from the previous experiment.
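Before turning to the measured values, here is a small sketch (ours, with our own naming) of how the effective error and miss ratio defined in the previous section can be computed from per-query results:

def effective_error_and_miss_ratio(results):
    # results: list of (d_lsh, d_star) pairs, one per query, where d_lsh is the
    # distance to the point returned by LSH (None if every probed bucket was
    # empty) and d_star is the distance to the true nearest neighbor.
    answered = [(d_lsh, d_star) for d_lsh, d_star in results if d_lsh is not None]
    miss_ratio = 1.0 - len(answered) / len(results)
    # E: average of d_LSH / d* over the queries for which a nonempty index was found.
    effective_error = (sum(d_lsh / d_star for d_lsh, d_star in answered) / len(answered)
                       if answered else float("nan"))
    return effective_error, miss_ratio

print(effective_error_and_miss_ratio([(10.0, 10.0), (12.0, 10.0), (None, 8.0)]))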
One can observe that for, say, E = 5% and Approximate 1-NNS, the miss ratios are quite high (10%) for small n, but decrease to around 1% for n = 19,000.

Figure 5: Number of indices vs. data size. (a) Approximate 1-NNS; (b) Approximate 10-NNS. (Disk accesses of the SR-tree and of LSH with error 2%, 5%, 10% and 20%, for α = 2 and database sizes from 1,000 to 19,000.)

Figure 6: Miss ratio vs. data size. (a) Approximate 1-NNS; (b) Approximate 10-NNS. (Miss ratio of LSH for error 5% and 10%, with α = 2, n up to 19,000, d = 64.)

Dependence on Dimension. We performed the simulations for d = 2^3, 3^3 and 4^3; the choice of d's was limited to cubes of natural numbers because of the way the data has been created. Again, we performed the comparison for Approximate 1-NNS and Approximate 10-NNS; the results are shown on Figure 7. Note that LSH scales very well with the increase of dimensionality: for E = 5% the change from d = 8 to d = 64 increases the number of indices only by 2. The miss ratio was always below 6% for all dimensions.

This completes the comparison of LSH with the SR-tree. For a better understanding of the behavior of LSH, we performed an additional experiment on LSH only. Figure 8 presents the performance of LSH when the number of nearest neighbors to retrieve varies from 1 to 100.

4.3 Texture features

The experiments with texture feature data were designed to measure the performance of the LSH algorithm on large data sets; note that the size of the texture file (270,000 points) is an order of magnitude larger than the size of the histogram data set (20,000 points). The first step (i.e., the choice of the number of sampled bits k) was very similar to the previous experiment, therefore we skip the detailed description here. We just state that we assumed the number of sampled bits k = 65, with the other parameters being: the storage overhead α = 1, block size B = 100, and the number of nearest neighbors equal to 10. As stated above, the value of n was equal to 270,000.

We varied the number of indices from 3 to 100, which resulted in error from 50% to 15% (see Figure 9 (a)). The shape of the curve is similar to that in the previous experiment. The miss ratio was roughly 4% for 3 indices, 1% for 5 indices, and 0% otherwise.
Figure 7: Number of indices vs. dimension. (a) Approximate 1-NNS; (b) Approximate 10-NNS. (Disk accesses of the SR-tree and of LSH with error 2%, 5%, 10% and 20%, for α = 2 and d = 8, 27, 64.)

Figure 8: Number of indices vs. number of nearest neighbors. (Disk accesses of LSH with error 5%, 10% and 20%, for α = 2, n = 19,000, d = 64 and K from 1 to 100.)

To compare with the SR-tree, we implemented the latter on random subsets of the whole data set of sizes from 10,000 to 200,000. For n = 200,000 the average number of blocks accessed per query by the SR-tree was 1310, which is one to two orders of magnitude larger than the number of blocks accessed by our algorithm (see Figure 9 (b), where we show the running times of LSH for effective error 15%). Observe though that the SR-tree computes exact answers while LSH provides only an approximation. Thus in order to perform an accurate evaluation of LSH, we decided to compare it with a modified SR-tree algorithm which produces approximate answers. The modification is simple: instead of running the SR-tree on the whole data set, we run it on a randomly chosen subset of it. In this way we achieve a speed-up of the algorithm (as the random sample of the data set is smaller than the original set) while incurring some error.

The query cost versus error tradeoff obtained in this way (for the entire data set) is depicted on Figure 9; we also include a similar graph for LSH. Observe that using random sampling results in considerable speed-up for the SR-tree algorithm, while keeping the error relatively low. However, even in this case the LSH algorithm still considerably outperforms the SR-tree, being up to an order of magnitude faster.

5 Previous Work

There is considerable literature on various versions of the nearest neighbor problem. Due to lack of space we omit a detailed description of related work; the reader is advised to read [39] for a survey of a variety of data structures for nearest neighbors in geometric spaces, including variants of k-d trees, R-trees, and structures based on space-filling curves. The more recent results are surveyed in [41]; see also an excellent survey by [4]. Recent theoretical work in nearest neighbor search is briefly surveyed in [24].

6 Conclusions

We presented a novel scheme for approximate similarity search based on locality-sensitive hashing. We compared the performance of this technique and the SR-tree, a good representative of tree-based spatial data structures. We showed that by allowing small error and additional storage overhead, we can considerably improve the query time. Experimental results also indicate that our scheme scales well to even a large number of dimensions and data size. An additional advantage of our data structure is that its running time is essentially determined in advance.
All these properties make LSH a suitable candidate for high-performance and real-time systems.

In recent work [5, 23], we explore applications of LSH-type techniques to data mining and search for copyrighted video data. Our experience suggests that there is a lot of potential for further improvement of the performance of the LSH algorithm. For example, our data structures are created using a randomized procedure. It would be interesting if there was a more systematic method for performing this task; such a method could take additional advantage of the structure of the data set. We also believe that investigation of hybrid data structures obtained by merging the tree-based and hashing-based approaches is a fruitful direction for further research.

Figure 9: (a) Number of indices vs. error; (b) number of indices vs. size. (Disk accesses of the SR-tree and of LSH: (a) as a function of the error (%); (b) as a function of the data set size, for effective error 15%.)

References

[1] S. Arya, D.M. Mount, and O. Narayan. Accounting for boundary effects in nearest-neighbor searching. Discrete and Computational Geometry, 16 (1996), pp. 155-176.

[2] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. In Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, 1994, pp. 573-582.

[3] J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18 (1975), pp. 509-517.

[4] S. Berchtold and D.A. Keim. High-dimensional Index Structures. In Proceedings of SIGMOD, 1998, p. 501. See http://www.informatik.uni-halle.de/~keim/SIGMOD98Tutorial.ps.gz

[5] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang. Finding Interesting Associations without Support Pruning. Technical Report, Computer Science Department, Stanford University.

[6] T.M. Chan. Approximate Nearest Neighbor Queries Revisited. In Proceedings of the 13th Annual ACM Symposium on Computational Geometry, 1997, pp. 352-358.

[7] S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10 (1993), pp. 57-67.

[8] S. Chaudhuri, R. Motwani, and V. Narasayya. Random Sampling for Histogram Construction: How much is enough? In Proceedings of SIGMOD'98, pp. 436-447.

[9] T.M. Cover and P.E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13 (1967), pp. 21-27.

[10] P. Ciaccia, M. Patella, and P. Zezula. A cost model for similarity queries in metric spaces. In Proceedings of PODS'98, pp. 59-68.

[11] S. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, and R.A. Harshman. Indexing by latent semantic analysis. Journal of the Society for Information Sciences, 41 (1990), pp. 391-407.

[12] L. Devroye and T.J. Wagner. Nearest neighbor methods in discrimination. Handbook of Statistics, vol. 2, P.R. Krishnaiah and L.N. Kanal, eds., North-Holland, 1982.

[13] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, NY, 1973.

[14] H. Edelsbrunner. Algorithms in Combinatorial Geometry. Springer-Verlag, 1987.

[15] C. Faloutsos, R. Barber, M. Flickner, W. Niblack, D. Petkovic, and W. Equitz. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3 (1994), pp. 231-262.
[16] C. Faloutsos and D.W. Oard. A Survey of Information Retrieval and Filtering Methods. Technical Report CS-TR-3514, Department of Computer Science, University of Maryland, College Park, 1995.

[17] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: the QBIC system. IEEE Computer, 28 (1995), pp. 23-32.

[18] J.K. Friedman, J.L. Bentley, and R.A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3 (1977), pp. 209-226.

[19] T. Figiel, J. Lindenstrauss, and V.D. Milman. The dimension of almost spherical sections of convex bodies. Acta Math., 139 (1977), no. 1-2, pp. 53-94.

[20] A. Gersho and R.M. Gray. Vector Quantization and Data Compression. Kluwer, 1991.

[21] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. In Proceedings of the First International Conference on Knowledge Discovery & Data Mining, 1995, pp. 142-149.

[22] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 27 (1933), pp. 417-441.

[23] P. Indyk, G. Iyengar, and N. Shivakumar. Finding pirated video sequences on the Internet. Technical Report, Computer Science Department, Stanford University.

[24] P. Indyk and R. Motwani. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. In Proceedings of the 30th Symposium on Theory of Computing, 1998, pp. 604-613.

[25] K.V. Ravi Kanth, D. Agrawal, and A. Singh. Dimensionality Reduction for Similarity Searching in Dynamic Databases. In Proceedings of SIGMOD'98, pp. 166-176.

[26] K. Karhunen. Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae, Ser. A137, 1947.

[27] V. Koivune and S. Kassam. Nearest neighbor filters for multivariate data. IEEE Workshop on Nonlinear Signal and Image Processing, 1995.

[28] N. Katayama and S. Satoh. The SR-tree: an index structure for high-dimensional nearest neighbor queries. In Proceedings of SIGMOD'97, pp. 369-380. The code is available from http://www.rd.nacsis.ac.jp/~katayama/homepage/research/srtree/English.html

[29] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. In Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science, 1994, pp. 577-591.

[30] M. Loeve. Fonctions aléatoires de second ordre. Processus Stochastiques et mouvement Brownien, Hermann, Paris, 1948.

[31] B.S. Manjunath. Airphoto dataset. http://vivaldi.ece.ucsb.edu/Manjunath/research.htm

[32] B.S. Manjunath and W.Y. Ma. Texture features for browsing and retrieval of large image data. IEEE Transactions on Pattern Analysis and Machine Intelligence (Special Issue on Digital Libraries), 18 (8), pp. 837-842.

[33] G.S. Manku, S. Rajagopalan, and B.G. Lindsay. Approximate Medians and other Quantiles in One Pass and with Limited Memory. In Proceedings of SIGMOD'98, pp. 426-435.

[34] Y. Matias, J.S. Vitter, and M. Wang. Wavelet-based Histograms for Selectivity Estimation. In Proceedings of SIGMOD'98, pp. 448-459.

[35] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[36] M. Otterman. Approximate matching with high dimensionality R-trees. M.Sc. Scholarly paper, Dept. of Computer Science, Univ. of Maryland, College Park, MD, 1992.

[37] A. Pentland, R.W. Picard, and S. Sclaroff. Photobook: tools for content-based manipulation of image databases. In Proceedings of the SPIE Conference on Storage and Retrieval of Image and Video Databases II, 1994.

[38] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York, NY, 1983.

[39] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA, 1989.

[40] J.R. Smith. Integrated Spatial and Feature Image Systems: Retrieval, Analysis and Compression. Ph.D. thesis, Columbia University, 1997. Available at ftp://ftp.ctr.columbia.edu/CTR-Research/advent/public/public/jrsmith/thesis

[41] T. Sellis, N. Roussopoulos, and C. Faloutsos. Multi-dimensional Access Methods: Trees Have Grown Everywhere. In Proceedings of the 23rd International Conference on Very Large Data Bases, 1997, pp. 13-15.

[42] A.W.M. Smeulders and R. Jain, eds. Image Databases and Multi-media Search. Proceedings of the First International Workshop, IDB-MMS'96, Amsterdam University Press, Amsterdam, 1996.

[43] J.R. Smith and S.F. Chang. Visually Searching the Web for Content. IEEE Multimedia, 4 (1997), pp. 12-20. See also http://disney.ctr.columbia.edu/WebSEEk

[44] G. Wyszecki and W.S. Styles. Color Science: Concepts and Methods, Quantitative Data and Formulae. John Wiley and Sons, New York, NY, 1982.

[45] R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for Similarity Search Methods in High Dimensional Spaces. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), 1998, pp. 194-205.
