Image Near-Duplicate Retrieval Using Local Dependencies in Spatial-Scale Space

Xiangang Cheng, Yiqun Hu, Liang-Tien Chia
School of Computer Engineering, Nanyang Technological University, Singapore 639798
xiangang@pmail.ntu.edu.sg, yqhu@ntu.edu.sg, asltchia@ntu.edu.sg

ABSTRACT
This paper presents an efficient and effective solution for retrieving Image Near-Duplicates (INDs). Unlike traditional methods, we analyze the local dependencies among region descriptors in a spatial-scale space. Such local dependencies in spatial-scale space (LDSS) encode not only visual appearance but also the spatial and scale co-occurrence of the descriptors. The local dependencies are integrated over all spatial locations and multiple scales to form the image representation, which is invariant to spatial transformation and scale change. We evaluate the proposed LDSS method for IND retrieval on an existing benchmark as well as a new dataset extracted from the keyframes of the TRECVID corpus. Compared with state-of-the-art results, the LDSS approach significantly improves the accuracy of IND retrieval.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Retrieval models

General Terms
Algorithms.

Keywords
local dependencies, bag-of-words, co-occurrence, image near-duplicate

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MM’08, October 26–31, 2008, Vancouver, British Columbia, Canada.
Copyright 2008 ACM 978-1-60558-303-7/08/10 ...$5.00.

1. INTRODUCTION
Image Near-Duplicate (IND) retrieval [1, 2] is very useful for the filtering, retrieval and management of multimedia content. For example, IND retrieval over the Internet can discover unauthorized or abusive use of private images and protect copyright as well as privacy. IND retrieval can also identify whether an image is artificial, that is, composed from other pictures [3]. There are also more and more websites about videos. Taking YouTube as an example, people are free to upload the videos they like, but they might not give enough annotations for others to search. INDs provide similarity clues for recognizing visual events and searching news video clips [3]. A personal photo album can be automatically organized by grouping INDs, which might have different names. Detection and retrieval of INDs can also facilitate traditional text-based web search: if two web pages contain any INDs, the relevance between these two web pages should be increased.

According to [1], IND refers to multiple images that are close to the exact duplicate of one image but differ in scene, camera setting, photometric and digitization changes (see Fig. 1 for examples). Specifically, the scale, viewpoint and illumination of the same scene and object(s) captured in INDs can be changed by different camera settings and rendering conditions. The composition of multiple objects can also differ between INDs because of editing operations.

Figure 1: Examples of Image Near-Duplicates varying in illumination, motion, scale and viewpoint.

In this paper, we propose a new method to model images for IND retrieval. We represent every image by calculating descriptors at each location over multiple scales. After assigning every descriptor to a specific label, we analyze the dependencies among different labels in a joint spatial-scale space. Such dependencies are assumed to be local, which reduces both the effect of noise and the computational complexity. We then integrate the local dependencies of labels over all spatial locations and multiple scales to form the image representation. This representation encodes not only the visual appearance of local regions but also their spatial and scale co-occurrences, and it is invariant to spatial transformation and scale changes. We test the proposed method for IND retrieval on two datasets
from the TRECVID corpus, and the experiments show that it significantly outperforms the current state of the art.

Figure 2: The framework of our proposed system.

2. RELATED WORK
Although previous research on exact duplicate and copy detection is mainly based on image representations using global features such as color histograms or edges, some researchers have proposed using invariant local features to detect and retrieve INDs. Zhang and Chang [1] proposed to identify INDs using stochastic attributed relational graph (ARG) matching. They learnt a distribution-based similarity from the spatial relations among local interest points for matching ARGs. However, the matching speed of this method is slow and the parameters can only be tuned heuristically. Different from [1], Ke et al. [4] represented each image as a set of local covariant regions, each of which is characterized by a PCA-SIFT descriptor. Locality sensitive hashing (LSH) is employed to design an efficient index structure for fast point set matching. Without geometric verification, the robustness of this technique cannot be guaranteed for partial matching because of background clutter. Recently, Zhao et al. [2] extended the matching of local point sets by introducing a one-to-one symmetric property for matching and the LIP-IS index structure for fast approximate search. Although it is interesting to learn IND patterns from the histogram of matching orientations, the method proposed in [2] is not invariant to rotation and requires explicitly calculating point-to-point matches when comparing two images.

Generic image categorization approaches using the “Bag-of-Words” model are also related to IND detection and retrieval. Under this model, images are treated as documents by assigning descriptors of local covariant regions to “visual words”. Each image is then represented by a histogram of word frequencies. The most critical problem of such methods is that ambiguous visual words introduce a large number of false matches when each region is matched independently of the others. Several methods have been proposed to alleviate this problem by capturing the spatial arrangement of visual words. Lazebnik et al. [5] extended the Pyramid Matching Kernel (PMK) [6] by incorporating the spatial information of local regions. Two images are partitioned into increasingly fine sub-regions and PMK is used to compare corresponding sub-regions. This method implicitly assumes correspondences between sub-regions, so it is not scale or rotation invariant. Also utilizing the spatial information of local regions, Savarese et al. [7] used correlograms to measure the distribution of the proximities between all pairs of visual words and used them for category classification, but the method is not scale invariant.

3. OUR APPROACH
3.1 Multi-scale feature extraction
Instead of using interest point detectors, which often make suboptimal choices of certain regions, we adopt a dense sampling scheme on a regular grid in order to obtain sufficient statistics of an image, covering both blob-like and edge-like features. We also incorporate the scale factor when extracting features in order to adapt to scale changes between near-duplicate images. Specifically, we use highly overlapping patches with a scale sampling rate of 1.5, starting from 8 × 8 patches and going up to patches of size 27 × 27. Within one scale, the spatial sampling step is 4 pixels.

Each of these patches is then described by the SIFT descriptor [8], which computes distributions over gradient orientations for different subpatches and is invariant and robust across a substantial range of affine distortion, viewpoint change, noise and illumination change. Meanwhile, the spatial and scale information of every feature is kept in order to explore local dependencies.

It is very complex and time-consuming to use the features directly because of their high dimension. Their varying cardinality and lack of a meaningful ordering also make it difficult to find a concise model to represent the whole image. To address these problems, we obtain a fixed-size label set L by clustering and assign every feature to one of the labels, which is the same process as in the standard “Bag-of-Words” method.

3.2 Local dependencies in spatial-scale space
Different from “Bag-of-Words” methods, which assume that all the features are independent, we explore the local dependencies in the spatial-scale feature space. The spatial-scale feature space is defined as
$$ S = \{ (i, j, \sigma) \mid 1 \le i \le W,\ 1 \le j \le H,\ \sigma > 0 \} \tag{1} $$
where W and H are the width and height of the image respectively, and σ is the scale factor. The elements of the spatial-scale space correspond to the sites or configurations of each densely sampled feature. After every feature has been assigned to a label, its corresponding site in S is mapped to L by the function f:
$$ f : S \rightarrow L \tag{2} $$
where f finds the label most similar to the feature at the current site.

Co-occurrence information of features can be explored through a neighborhood system. A neighborhood system for S is defined as
$$ N = \{ N_i \mid \forall i \in S \} \tag{3} $$
where $N_i$ denotes the set of sites neighboring i. The pair $(S, N) \triangleq G$ constitutes a graph, where S contains the nodes and N indicates the links between the nodes according to their neighboring relationships. Let $V_i$ denote the set of nodes in S for the graph G, excluding the node i and its neighbors $N_i$. We then have the following conditional independence statement:
$$ S_i \perp V_i \mid S_{N_i} \tag{4} $$
That is, given its neighbors, a node is independent of all the other nodes in the graph.
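To make the site set S of Eq. (1) and the label map f of Eq. (2) more concrete, the following is a minimal sketch of dense multi-scale sampling followed by label assignment. The function names, the use of top-left patch corners as site coordinates, and Euclidean nearest-center assignment are assumptions made for illustration; SIFT extraction and the clustering step itself are not shown.

```python
import numpy as np

def patch_sizes(start=8, stop=27, rate=1.5):
    """Patch side lengths obtained by scaling `start` by `rate` up to `stop`."""
    sizes, s = [], float(start)
    while round(s) <= stop:
        sizes.append(int(round(s)))
        s *= rate
    return sizes

def dense_sites(width, height, step=4):
    """Enumerate sites (i, j, size): top-left corners of all patches that fit."""
    sites = []
    for size in patch_sizes():
        for j in range(0, height - size + 1, step):
            for i in range(0, width - size + 1, step):
                sites.append((i, j, size))
    return sites

def assign_labels(descriptors, centers):
    """Label map f: index of the nearest cluster center for each descriptor.

    descriptors: (n, d) array, centers: (K, d) array (assumed to come from
    a clustering step such as k-means, which is not shown here).
    """
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

With the default arguments, `patch_sizes()` yields the side lengths 8, 12, 18 and 27, matching the scale sampling rate of 1.5 described in Sec. 3.1. For large descriptor sets the pairwise distance array in `assign_labels` can become memory-hungry, so processing descriptors in chunks may be preferable.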
Figure 3: (a) A second order neighborhood system in spatial-scale space. The dark blue node is the center; it has 8 neighbors in the same scale and 9 neighbors in each of the upper and lower scales. (b) An example of the graph representation which integrates local dependencies in the spatial-scale space. The size of a node encodes the frequency of the associated label, and the edge $(v_i, v_j) \in E$ indicates the co-occurrence frequency between the corresponding label pair.

With the assumption of absolute independence of features in the “Bag-of-Words” method, neighborhood relationships are totally neglected, which results in a loss of information. In this paper, we consider local dependencies of features. That is, not only the feature itself but also the inter-relations between features in the neighborhood system are taken into consideration. Specifically, we define a second order neighborhood system in the spatial-scale feature space,
$$ N(i, j, \sigma_k) = \{ (l, m, \sigma_n) \mid \sqrt{(l-i)^2 + (m-j)^2 + (n-k)^2} < 2 \} \tag{5} $$
as shown in Fig. 3(a). Excluding the center itself, one node has 8 neighbors in the same scale and 9 neighbors in each of the upper and lower scales. Considering a site together with its adjacent scales makes the neighborhood system robust to scale changes. We define a clique as a pair of neighboring sites:
$$ C_{ij} = \{ (i, j) \mid j \in N_i \} \tag{6} $$
As in Eq. (2), labels $l_m = f(i)$ and $l_n = f(j)$ are associated with sites i and j, so $C_{ij}$ is mapped to $q_{l_m l_n}$, which indicates a transition relation from label $l_m$ to label $l_n$ in the neighborhood system.

3.3 Dependencies integration
In general, we explore two properties in the spatial-scale space: the frequency with which a label occurs and the pairwise cliques between different labels. Both are informative and discriminative. If we take the labels as nodes, we can integrate both properties into a graph G(V, E); see Fig. 3(b) for an example. $V = \{v_1, \ldots, v_K\}$ corresponds to the K nodes, the size of each denoting the frequency of the associated label. The edge $(v_i, v_j) \in E$ denotes the dependence between node $v_i$ and node $v_j$, which integrates the local co-occurrence frequency between the two labels over all possible locations in spatial-scale space. To compute the nodes, we adopt the delta function
$$ \delta(i, l) = \begin{cases} 1 & \text{if } i = l \\ 0 & \text{otherwise} \end{cases} \tag{7} $$
so the frequency of node k is
$$ p_k = \frac{1}{N} \sum_{i=1}^{N} \delta(f(i), k), \quad 1 \le k \le K \tag{8} $$
where $\sum_{k=1}^{K} p_k = 1$ and N is the total number of sites. The transition probability from node k to node l is
$$ p_{kl} = \frac{\sum_{i=1}^{N} \sum_{j \in N_i} \delta(f(i), k)\, \delta(f(j), l)}{\sum_{i=1}^{N} \delta(f(i), k)} \tag{9} $$
The term $\sum_{i=1}^{N} \delta(f(i), k)$ is the normalization factor to satisfy $\sum_{l=1}^{K} p_{kl} = 1$.

Under the graph G, the match of two images $I_1$ and $I_2$ is naturally defined as
$$ \mathrm{Sim}(I_1, I_2) = \alpha \sum_{i=1}^{K} I(i) + (1 - \alpha) \sum_{(v_i, v_j) \in E} D_{ij} \tag{10} $$
where $I(\cdot)$ is the function measuring the similarity of each node, while the function $D_{ij}$ measures the co-occurrence relationship of two nodes $v_i$ and $v_j$ in different images. α and 1 − α are the weighting coefficients of the two parts.

4. NEAR-DUPLICATE IMAGE RETRIEVAL
We follow a framework similar to [9, 10] for IND retrieval. It consists of two parts: the process for database setup and the process for handling an input query. First, we build the graph representation of each database image as described in Sec. 3. Then, for online query processing, the graph representation G(V, E) is calculated by Equation 10 for the query image. It is then used to compute the similarity between the query image and every image in the database. The similarity can be calculated using any distance between two histograms, e.g. L2 distance, χ² distance or EMD. In this paper, we use the simplest intersection distance τ to measure the similarity, for both $I(\cdot)$ and $D_{ij}$. Mathematically, given two feature vectors H(I) and H(J),
$$ \tau(H(I), H(J)) = \sum_{v=1}^{V} \min(H_v(I), H_v(J)) \tag{11} $$
where $H_v(\cdot)$ represents the v-th element of the vector. Although simple, the intersection distance can handle partial matching. In this paper, we use only the simple intersection distance and achieve good performance under our proposed graph model.

5. EXPERIMENTS
In this section, we evaluate our proposed LDSS method on two IND datasets extracted from the keyframes of the TRECVID corpus [11]. The Columbia dataset is provided by [1] and consists of keyframes extracted from the TRECVID 2003 corpus. The NTU dataset contains keyframes extracted from the TRECVID 2005 and 2006 corpora. All IND pairs as well as non-duplicate images can be viewed and downloaded from the supplementary material. Each dataset includes 150 IND pairs (300 images) and 300 non-duplicate images. All the images are of size 352 × 240. To compare with [2], which is currently the state-of-the-art method, we adopt the same experimental setup for IND retrieval. All IND pairs (300 images) are used as queries to evaluate performance. For each query, we calculate the similarity between the query image and all the other 599 images using the intersection between their histograms of visual phrase frequency. A ranked list of 599 images is then produced according to their similarity to the query.
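The representation and matching steps of Secs. 3.3 and 4 can be sketched roughly as follows. This is an illustration under several assumptions rather than the authors' implementation: labels are assumed to be stored in an integer array of shape (scales, rows, cols) whose grids are aligned across scales; the co-occurrence counts are normalized to sum to one over all label pairs, which differs from the per-node normalization of Eq. (9); and the weight `alpha = 0.5` and all function names are arbitrary choices.

```python
import numpy as np
from itertools import product

def ldss_histograms(labels, K):
    """Node frequencies (Eq. 8) and label co-occurrence frequencies.

    labels: int array of shape (n_scales, rows, cols), values in [0, K).
    Returns p with shape (K,) and q with shape (K, K).
    """
    n_scales, rows, cols = labels.shape
    p = np.bincount(labels.ravel(), minlength=K) / labels.size
    q = np.zeros((K, K))
    # 26 offsets: 8 in the same scale plus 9 in each adjacent scale.
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    for ds, dr, dc in offsets:
        # src: sites whose neighbor at this offset exists; dst: those neighbors.
        src = labels[max(0, -ds):n_scales - max(0, ds),
                     max(0, -dr):rows - max(0, dr),
                     max(0, -dc):cols - max(0, dc)]
        dst = labels[max(0, ds):n_scales - max(0, -ds),
                     max(0, dr):rows - max(0, -dr),
                     max(0, dc):cols - max(0, -dc)]
        np.add.at(q, (src.ravel(), dst.ravel()), 1)
    total = q.sum()
    # Assumption: global normalization instead of the per-node one in Eq. (9).
    return p, (q / total if total > 0 else q)

def intersection(h1, h2):
    """Histogram intersection of Eq. (11)."""
    return float(np.minimum(h1, h2).sum())

def similarity(rep1, rep2, alpha=0.5):
    """Weighted sum of node and edge intersections, in the form of Eq. (10)."""
    (p1, q1), (p2, q2) = rep1, rep2
    return alpha * intersection(p1, p2) + (1 - alpha) * intersection(q1, q2)

def rank(query, database, alpha=0.5):
    """Database indices ordered from most to least similar to the query."""
    scores = [similarity(query, rep, alpha) for rep in database]
    return sorted(range(len(database)), key=lambda i: -scores[i])
```

Calling `ldss_histograms` once per image and then `rank(query_rep, database_reps)` gives a ranked list of the kind used in the evaluation. Setting `alpha = 1` ignores the co-occurrence term, so only the node-frequency (Bag-of-Words style) histogram is compared.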
To assess the retrieval performance and compare with other methods, we estimate the probability of successful top-k retrieval P(k) as follows:
$$ P(k) = \frac{Q_c}{Q} \tag{12} $$
where $Q_c$ is the number of queries that rank their INDs within the top-k positions, and Q is the total number of queries (Q = 300).

Table 1: Performance comparison with changes of label/word size on the NTU dataset

Label/word size    500   400   300   200   100    50
Our LDSS (%)        92    90    90    89    86    82
BOW (%)             88    88    86    85    81    77
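The success rate of Eq. (12), which for k = 1 is the quantity reported in Table 1, can be computed from ranked lists as in the sketch below. It assumes that every query has exactly one near-duplicate partner in the database; the function and argument names are hypothetical.

```python
def top_k_success(ranked_lists, true_matches, k):
    """P(k) = Qc / Q: fraction of queries whose IND is ranked within the top k.

    ranked_lists: one ranked list of image ids per query.
    true_matches: the id of each query's near-duplicate partner
                  (assumed to be a single image per query).
    """
    hits = sum(1 for ranked, match in zip(ranked_lists, true_matches)
               if match in ranked[:k])
    return hits / len(ranked_lists)
```

Evaluating this function for k = 1, ..., 20 yields one curve of the kind plotted in Figure 4.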
[Figure 4 plot: P(k) versus top k, with curves for our proposed method, SPM and LIP-IS+OOS.]
Figure 4: Top-k IND retrieval performance of LDSS and relevant methods on the Columbia dataset.

5.1 IND retrieval using LDSS
For LDSS, we adopt SIFT descriptors to generate the graph representation based on the neighborhood system. 500 labels are obtained by applying the k-means clustering algorithm to the descriptors. The proposed LDSS utilizes node frequency as well as node co-occurrence by exploring the neighborhood system in spatial-scale space. We compare our method with two other methods: (i) LIP-IS+OOS [2], which uses one-to-one graph matching and is currently the state-of-the-art approach for IND retrieval; (ii) the Spatial Pyramid Matching (SPM) method [5], which is a popular method for considering spatial information. Figure 4 shows the results of these methods for top-k IND retrieval on the Columbia dataset, where k varies from 1 to 20. From the figure, we can see that our proposed LDSS outperforms the other methods by a large margin. For top-1, LDSS is nearly 5% higher than LIP-IS+OOS and SPM. At the same time, our method is as fast as SPM and much faster than LIP-IS+OOS. Similar results are achieved on the NTU dataset.

5.2 Comparison between LDSS and BoW
From Fig. 3(b), we can see that if we remove all the edges, that is, drop all the co-occurrence information based on the neighborhood system, LDSS degenerates into the standard Bag-of-Words approach. Table 1 shows the experimental results of the two methods on the NTU dataset: the top-1 IND retrieval results as the label/word vocabulary size changes from 500 to 50. LDSS outperforms standard BoW at every vocabulary size. This implies that the assumption of independence of words in BoW loses much information, and that it is useful to explore local dependencies among different labels.

6. CONCLUSION
In this paper, we propose to analyze the local dependencies in spatial-scale space, which incorporate not only appearance feature information but also spatial and scale co-occurrence information. Through the integration of local dependencies over all spatial locations at different scales, the proposed LDSS is invariant to spatial transformation and scale change. Experiments have shown that LDSS significantly outperforms state-of-the-art approaches for IND retrieval. Local dependence in spatial-scale space proves to be informative and discriminative. In future work, we will investigate the problem of mining representative patterns from these local dependencies.

7. REFERENCES
[1] D. Zhang and S. Chang. Detecting Image Near-Duplicate by Stochastic Attribute Relational Graph Matching with Learning. In Proceedings of ACM Multimedia, October 2004.
[2] W. Zhao, C. Ngo, H. Tan, and X. Wu. Near-Duplicate Keyframe Identification with Interest Point Matching and Pattern Learning. IEEE Transactions on Multimedia, August 2007.
[3] C. Ngo, W. Zhao, and Y. Jiang. Fast Tracking of Near-Duplicate Keyframes in Broadcast Domain with Transitivity Propagation. In Proceedings of ACM Multimedia, October 2006.
[4] Y. Ke, R. Sukthankar, and L. Huston. Efficient Near-Duplicate Detection and Sub-image Retrieval. In Proceedings of ACM Multimedia, October 2004.
[5] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings of IEEE Computer Society Conference on CVPR, June 2006.
[6] K. Grauman and T. Darrell. The Pyramid Matching Kernel: Discriminative Classification with Sets of Image Features. In Proceedings of IEEE ICCV, October 2005.
[7] S. Savarese, J. Winn, and A. Criminisi. Discriminative Object Class Models of Appearance and Shape by Correlations. In Proceedings of IEEE Computer Society Conference on CVPR, June 2006.
[8] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, November 2004.
[9] J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of IEEE ICCV, October 2003.
[10] D. Nistér and H. Stewénius. Scalable Recognition with a Vocabulary Tree. In Proceedings of IEEE Computer Society Conference on CVPR, June 2006.
[11] TREC Video Retrieval Evaluation (TRECVID). http://www-nlpir.nist.gov/projects/trecvid/.
