
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 8, AUGUST 2012

Near-Duplicate Video Clip Detection Using Model-Free Semantic Concept Detection and Adaptive Semantic Distance Measurement
Hyun-seok Min, Jae Young Choi, Wesley De Neve, and Yong Man Ro, Senior Member, IEEE

Abstract—Motivated by the observation that content transformations tend to preserve the semantic information conveyed by video clips, this paper introduces a novel technique for near-duplicate video clip (NDVC) detection, leveraging model-free semantic concept detection and adaptive semantic distance measurement. In particular, model-free semantic concept detection is realized by taking advantage of the collective knowledge in an image folksonomy (which is an unstructured collection of user-contributed images and tags), facilitating the use of an unrestricted concept vocabulary. Adaptive semantic distance measurement is realized by means of the signature quadratic form distance (SQFD), making it possible to flexibly measure the similarity between video shots that contain a varying number of semantic concepts, and where these semantic concepts may also differ in terms of relevance and nature. Experimental results obtained for the MIRFLICKR-25000 image set (used as a source of collective knowledge) and the TRECVID 2009 video set (used to create query and reference video clips) demonstrate that model-free semantic concept detection and SQFD can be successfully used for the purpose of identifying NDVCs.

Index Terms—Collective knowledge, image folksonomy, near-duplicate video clip detection, semantic concept detection, semantic distance, semantic video signature, video copy detection.

I. Introduction

Digital video content can be easily edited and redistributed. As a result, a high number of identical and near-identical video clips can typically be found in personal video collections and on websites for video sharing. Identical video clips are known as duplicates, whereas video clips that have been the subject of at least one transformation are referred to as near-duplicate video clips (NDVCs) [1]. The detection of NDVCs is a core requirement of several multimedia applications, including media usage monitoring, content linking on the Web, metadata propagation for annotation purposes, protection of intellectual property, and mitigation of redundancy in video search results [2], [3].

Manuscript received July 6, 2011; revised November 7, 2011; accepted December 19, 2011. Date of publication May 1, 2012; date of current version July 31, 2012. This work was supported by the Basic Science Research Program of the National Research Foundation of Korea under Research Grant 2011-0011383. This paper was recommended by Associate Editor J. Luo. The authors are with the Image and Video Systems Laboratory, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea (e-mail: hsmin@kaist.ac.kr; jygchoi@kaist.ac.kr; wesley.deneve@kaist.ac.kr; ymro@ee.kaist.ac.kr). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2012.2197080

NDVC detection aims at finding all matches between a query video clip and the video clips in a reference video database. To reduce the computational complexity of matching and to mitigate communication overhead, video clips are typically represented by low-dimensional video signatures. These video signatures conventionally consist of low-level visual features that have been extracted from representative key frames, for instance describing color [4] or intensity information [5].

This paper introduces a novel technique for NDVC detection that takes advantage of semantic features (i.e., semantic concepts such as face, beach, and sky). Our motivation to make use of semantic features for the purpose of NDVC detection is fourfold. First, whereas content transformations tend to significantly modify low-level visual features, they tend to preserve the semantic information conveyed by the original video content [6], [7]. Second, according to a study on the user perception of NDVCs [1], the effectiveness of NDVC detection can likely be improved by measuring semantic similarity. Third, the semantic features extracted for the purpose of NDVC detection can be reused for the purpose of annotation. Fourth, no single type of visual feature has thus far emerged that is robust against all possible content transformations, leaving room for experimentation with other types of features. In this context, we would like to emphasize that we see the use of semantic features as complementary to the use of visual features, offering additional discriminative power.

The development of our semantic approach toward the task of NDVC detection required addressing the following three research challenges.

1) Achieving high semantic coverage: Semantic concepts are commonly detected by means of model-based approaches [8]. However, given that model-based semantic concept detection typically makes use of a limited number of classifiers trained by experts, model-based semantic concept detection is only able to provide a limited semantic coverage [9]. Therefore, to eliminate the need for training by experts and to facilitate a high semantic coverage, our semantic approach toward the task of NDVC detection relies on model-free semantic concept detection, taking advantage of the collective knowledge present in an image folksonomy (i.e., an unstructured collection of user-contributed images and tags [10]).



To that end, we first retrieve folksonomy images that are visually similar to representative key frames, and we then propagate relevant tags from the folksonomy images to the key frames.

2) Adaptive measurement of semantic similarity: When making use of model-free semantic concept detection, the number of detected semantic concepts, as well as their relevance and nature, may strongly vary from video shot to video shot. Therefore, to measure the semantic similarity between video shots, we make use of the signature quadratic form distance (SQFD). This tool, which has recently been proposed by Beecks et al. [11], [12], allows taking into account a varying semantic content complexity by enabling similarity measurement between weighted feature signatures of different dimensionality (hence the use of the description "adaptive measurement" in our research).

3) Achieving effective semantic concept detection: The research presented in this paper pays particular attention to the influence of the effectiveness of model-free semantic concept detection on the effectiveness of the proposed approach for NDVC detection, given the limited effectiveness of content-based image retrieval [9] and the ambiguity of user-supplied tags [13].

To investigate the feasibility of semantic concept-based NDVC detection, experiments have been performed using two publicly available data sets: the TRECVID 2009 video set [14], used for creating a reference video database and NDVCs, and the MIRFLICKR-25000 image set [15], used as a source of collective knowledge. Our experimental results demonstrate that model-free semantic concept detection and SQFD can be successfully used for identifying NDVCs, having an effectiveness that is on par with or better than the effectiveness of NDVC detection making use of either model-based semantic concept detection, temporal ordinal measurement, features computed using the scale-invariant feature transform (SIFT), or bag-of-visual-words (BoVW).

This paper improves and extends preliminary work presented in [16] and [17]. Specifically, we report integrated experimental results that are more extensive and rigorous, additionally studying the following aspects: 1) the use of more content transformations; 2) the use of different content descriptors to find folksonomy images that are visually similar to key frames; 3) the use of different ground similarity functions; 4) the influence of the amount of collective knowledge in an image folksonomy on the effectiveness of NDVC detection; 5) the influence of the number of shots in a query video clip on the effectiveness of NDVC detection; and 6) the time complexity of creating and matching semantic video signatures.

This paper is organized as follows. Section II reviews related work. Our novel approach for identifying NDVCs is described in Section III, discussing the creation of a semantic video signature by taking advantage of the collective knowledge in an image folksonomy, on the one hand, and the matching of semantic video signatures by means of SQFD, on the other hand. Section IV outlines our experimental setup, while Section V discusses our experimental results. Finally, Section VI provides conclusions and directions for future research.

II. Related Work

In this section, we first review research efforts related to model-free semantic concept detection. Next, we review NDVC detection techniques that make use of either visual or semantic features. Similar to [2] and [3], we consider these techniques, as well as the proposed NDVC detection technique, as complementary to text- and context-based approaches.

A. Model-Free Semantic Concept Detection

Semantic concept detection aims at assigning human-understandable textual labels to images in order to describe their content. A fundamental problem in the field of semantic concept detection is the so-called semantic gap between low-level visual features, as used by machines, and high-level semantic features (i.e., high-level semantic concepts), as used by humans [8]. To bridge this gap, a wide range of methods have been described in the scientific literature. Based on the need to learn a statistical distribution of visual features (model), these methods can be divided into model-based and model-free approaches [9].

Given a set of training images labeled by experts, model-based approaches typically use classifiers to learn a mapping between low-level visual features (e.g., color and texture) and high-level semantic features (e.g., human and indoor) [18], [19]. However, due to the high cost of training, the concept vocabulary supported by model-based approaches is often static and limited in size, implying that concepts not learned during training cannot be detected during testing.

In contrast to model-based approaches, model-free techniques typically leverage collective knowledge for the purpose of unrestricted semantic concept detection [9]. The collective knowledge used may take the form of images and associated text harvested from a plethora of websites. For example, the authors of [20] learnt an image thesaurus from Web images and their surrounding text to bridge the semantic gap, while the authors of [19] used a variety of Web images that are loosely labeled with nonabstract WordNet nouns for the purpose of scene and object recognition. Alternatively, the collective knowledge used may take the form of image folksonomies. The latter are vast collections of user-contributed images and tags, typically harvested from a social media application (e.g., Flickr, YouTube, or Facebook). The authors of [21], for instance, proposed a model-free framework for video annotation that takes advantage of the collective knowledge available on YouTube. Specifically, given a video clip to be annotated, visually similar video clips are retrieved from YouTube, subsequently using the tags associated with the video clips retrieved for the goal of tag recommendation. Note that model-free approaches are also known in the scientific literature as data-driven, classifier-free, or parameter-free approaches.

B. NDVC Detection Using Low-Level Visual Features

The past decade has witnessed the development of numerous techniques for NDVC detection that take advantage of visual features.


Fig. 1. Overview of NDVC detection using model-free semantic concept detection and adaptive semantic distance measurement.

Taking into account the type of visual features used, video signatures for content-based NDVC detection can be divided into global and local signatures, on the one hand, and into spatial and temporal signatures, on the other hand. Global signatures summarize a whole video frame in terms of visual characteristics such as color [22] or intensity information [23], whereas local signatures summarize a video frame by detecting and subsequently characterizing small regions-of-interest around keypoints [24], [25]. Similarly, spatial signatures refer to signatures that only make use of features extracted from a single frame, whereas temporal signatures refer to signatures that rely on multiple frames for their construction (e.g., by exploiting object [26] or camera motion [27]). For more extensive reviews of content-based video copy detection, we would like the interested reader to refer to [2], [6], and [7].

C. NDVC Detection Using High-Level Semantic Features

Recently, several research efforts have started exploring the use of semantic information for the detection of NDVCs. The authors of [28], for instance, make use of the appearance and the identity of human faces to characterize a video segment, targeting fingerprinting and retrieval in large-scale databases containing motion picture and television data. In particular, pulse-like signals are used to construct a video signature, indicating the presence or absence of a human face in each frame of a video clip. As argued by [28], the use of semantic information (i.e., face information) makes the proposed video signature highly robust to different types of video processing. However, as also acknowledged by [28], when only making use of face information, the proposed approach is less suitable for characterizing sports, news, and documentary content. Similar to [28], the authors of [29] use the appearance of human faces to detect NDVCs. In particular, NDVCs are detected by fusing the output of shot matching using face and extended body information, activity subsequence matching, and nonfacial shot matching using visual similarity.

In [30], we proposed an NDVC detection technique that makes use of 32 popular semantic concept detectors, previously used in [18] for the purpose of classifying personal photos. Given the use of a restricted semantic concept vocabulary, we take advantage of the temporal variation of the 32 semantic concepts in order to overcome a limited discriminative power. In addition, to minimize the influence of the limited effectiveness of semantic concept detection on the effectiveness of NDVC detection, we make use of trained classifiers that come with a detection effectiveness that is reasonably high (i.e., their mean average precision is higher than 0.3). Furthermore, to handle a low discriminative power when the temporal variation of the 32 semantic concepts used is low, we also make use of a hybrid video matching method, fusing visual and semantic features. Note that the use of 32 semantic concept detectors can be seen as a generalization of the research efforts outlined in [28] and [29], given that the latter only exploit face information (a face detector is among the 32 semantic concept detectors used in [30]). The research effort described in this paper can in its turn be seen as a further generalization of the research effort presented in [30], making it possible to take advantage of an unrestricted concept vocabulary.

III. Proposed Technique for NDVC Detection

Given a query video clip and a reference video database, Fig. 1 shows that the proposed technique for NDVC detection largely consists of two sequential steps: 1) creation of a semantic video signature for the query video clip using model-free semantic concept detection, and 2) matching of the semantic video signature of the query video clip against the semantic video signatures of the reference video clips using adaptive semantic distance measurement. Both steps are described in more detail in the following sections.

A. Creation of a Semantic Video Signature

Fig. 1 shows that the creation of a semantic video signature for a query video clip consists of five sequential steps: 1) video shot segmentation; 2) visual feature extraction from representative key frames; 3) model-free semantic concept detection; 4) creation of a semantic feature signature; and 5) aggregation of semantic feature signatures into a semantic video signature. In what follows, the aforementioned steps are explained in more detail.

1) Video Shot Segmentation and Visual Feature Extraction: We perform semantic concept detection at the level of shots. Therefore, given a video clip V, we first segment V into N different shots such that V = {S_i}_{i=1}^{N}, with S_i denoting the ith shot of V. Next, we represent each shot by means of a key frame that is, in its turn, represented by low-level visual features. In Section IV-A, we provide more information regarding the technique used for video shot segmentation and representative key frame selection, while in Section IV-B, we provide more information regarding the visual features used.

Fig. 2. Retrieval of the k-nearest visual neighbors and their associated tags from an image folksonomy F for a shot S_i.

Fig. 3. Creation of a semantic feature signature A_i for shot S_i. The larger the radius of a circle, the more relevant the semantic concept to the content of S_i.

2) Model-Free Semantic Concept Detection: To eliminate the need for training by experts and to facilitate a high semantic coverage, we detect semantic concepts in a key frame by means of a model-free approach, taking advantage of the collective knowledge present in an image folksonomy [10]. To that end, we make use of the tag relevance learning technique outlined in [31].

This technique estimates the relevance of image tags to the content of a seed image by means of neighbor voting, assuming that tags are likely to reflect objective aspects of the seed image when different persons have labeled visually similar images using the same tags. Specifically, given a seed image annotated with different user-supplied tags, the tag relevance learning technique of [31] estimates the relevance of each user-supplied tag with respect to the content of the seed image by accumulating votes for the tag from visual neighbors of the seed image that have also been annotated with the tag under consideration. Our rationale to make use of the tag relevance learning technique of [31] is threefold: 1) this technique has recently attracted significant research attention; 2) this technique is easy to use, given that it is only steered by two tag-independent parameters (i.e., the number of visual neighbors and a tag relevance threshold); and 3) this technique can be easily modified to suggest tags for nonannotated seed images. In the remainder of this section, we provide more details regarding the way we applied the tag relevance learning technique of [31] in the context of NDVC detection, first paying attention to the retrieval of visual neighbors and their associated tags, and then paying attention to neighbor voting. Note that it is reasonable to assume that the probability of finding neighbors that are visually highly similar is, in general, high, given the vast nature of most image folksonomies [31].

Let I_{i,k} be a set of images that consists of the k-nearest visual neighbors of S_i retrieved from the image folksonomy F by making use of low-level visual features, taking into account a unique user constraint in order to reduce voting bias (i.e., in order to avoid having a set of visual neighbors that is dominated by images and tags from a few users, each user can contribute at most one image to the set of visual neighbors [31]). We then propagate relevant tags from the images in I_{i,k} to S_i. To that end, we make use of neighbor voting to estimate the relevance of a tag t with respect to the content of S_i [31] as follows:

R(S_i, t) = \sum_{I \in I_{i,k}} v(I, t) - pr(t, k)    (1)

where I denotes an image in I_{i,k} and v(I, t) represents a voting function, returning one when the visual neighbor I has been annotated with t, and returning zero otherwise. As a result, it should be clear that each occurrence of tag t in the set of visual neighbors I_{i,k} can be seen as a vote for t, cast by the visual neighbor the tag t was assigned to. It should also be clear that the more visual neighbors that vote for t (i.e., the higher the frequency of t in the set of visual neighbors I_{i,k}), the more relevant t with respect to the content of S_i. Furthermore, pr(t, k) in (1) represents the prior frequency of t, which is approximated as follows:

pr(t, k) = k \frac{|L_t|}{|F|}    (2)

where |L_t| denotes the number of images labeled with t in the entire image folksonomy F, and where the number of images in F is equal to |F|. By simultaneously taking into account the distribution of t in the set of visual neighbors I_{i,k} and the entire image folksonomy F, it is possible to assign a lower relevance value to frequently occurring tags that have little discriminative power (e.g., "2011" and "2012"; see [31] for more details).

Given (1), we consider tags to be relevant to the content of S_i when these tags have a relevance value higher than a threshold θ_tag. This can be expressed as follows:

T_i = {t | t ∈ T_{i,k} ∧ R(S_i, t) > θ_tag}    (3)

where T_{i,k} denotes the set of (unique) tags assigned to the images in I_{i,k} and T_i denotes the set of tags propagated to S_i (i.e., T_i denotes the set of tags assigned to S_i).

3) Aggregation of Semantic Feature Signatures into Semantic Video Signatures: The number of detected semantic concepts may vary from shot to shot, as well as their relevance and nature (i.e., semantic concepts may be similar but not identical). Therefore, to capture this semantic variety, Fig. 3 shows that we represent each shot S_i by means of a semantic feature signature A_i as follows:

A_i = {⟨t_{i,j}, w_{i,j}⟩ | j = 1, ..., |T_i|}    (4)

where t_{i,j} is the jth semantic concept detected for S_i. We compute the weight value w_{i,j} for t_{i,j} as follows:

w_{i,j} = \frac{R(S_i, t_{i,j})}{\sum_{z=1}^{|T_i|} R(S_i, t_{i,z})}    (5)

Finally, we aggregate the semantic feature signatures A_i into a semantic video signature U for V as follows:

U = {A_1, A_2, ..., A_N}.    (6)
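To make the above steps concrete, the following Python sketch (ours, not part of the original system) illustrates neighbor voting (1)-(2), tag selection (3), and weight normalization (4)-(5); the data structures and the toy tag counts are assumptions made purely for illustration.

```python
from collections import Counter

def tag_relevance(neighbor_tag_sets, tag_freq, folksonomy_size, k):
    """Neighbor-voting tag relevance, following (1)-(2).

    neighbor_tag_sets: one set of tags per visual neighbor in I_{i,k}
                       (at most one image per user, as in [31])
    tag_freq:          dict mapping tag t -> |L_t|, number of folksonomy images labeled with t
    folksonomy_size:   |F|, the total number of folksonomy images
    k:                 the number of visual neighbors
    """
    votes = Counter(t for tags in neighbor_tag_sets for t in tags)
    return {t: v - k * tag_freq.get(t, 0) / folksonomy_size   # R(S_i, t) = votes - pr(t, k)
            for t, v in votes.items()}

def semantic_feature_signature(relevance, theta_tag=1.6):
    """Keep tags with relevance above theta_tag (3) and normalize their weights (4)-(5)."""
    kept = {t: r for t, r in relevance.items() if r > theta_tag}
    total = sum(kept.values())
    return {t: r / total for t, r in kept.items()} if total > 0 else {}

# Toy usage (illustrative values only):
neighbors = [{"beach", "sky", "2011"}, {"beach", "sea"}, {"sky", "beach"}]
rel = tag_relevance(neighbors, tag_freq={"beach": 900, "sky": 1200, "sea": 700, "2011": 5000},
                    folksonomy_size=25000, k=3)
signature = semantic_feature_signature(rel)   # e.g., weights for "beach" and "sky"
```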

Algorithm 1 summarizes the way we applied the tag relevance learning technique of [31] within our semantic approach for NDVC detection.


Algorithm 1 Proposed algorithm for creating a semantic video signature

input: V (a query video clip), θ_tag (tag relevance threshold), k (number of visual neighbors)
output: U (the semantic video signature of V)

Segment V into N shots such that V = {S_i}_{i=1}^{N}
for each shot S_i in V do
    Select a representative key frame
    Construct the set of k visual neighbors I_{i,k}
    Construct the set of tags T_{i,k}
    for each tag t_i ∈ T_{i,k} do
        if R(S_i, t_i) > θ_tag then T_i = T_i ∪ {t_i}
    end for
    Construct the vector of weight values W_i = (w_1, ..., w_{|T_i|})
    Construct the semantic feature signature A_i = {⟨t_{i,j}, w_{i,j}⟩ | j = 1, ..., |T_i|}
end for
Construct the semantic video signature U = {A_1, A_2, ..., A_N}
return U

B. Matching of Semantic Video Signatures

Video matching aims at determining whether a query video clip appears in a reference video clip, and if so, at what location in the reference video clip. Let us denote a query video clip as V^q = {S_i^q}_{i=1}^{N} and a reference video clip as V^r = {S_l^r}_{l=1}^{L}, with S_i^q and S_l^r representing the ith and the lth shot of V^q and V^r, respectively. Using (6), we denote the semantic video signature of V^q and V^r as follows:

U^q = {A_1^q, A_2^q, ..., A_N^q},   U^r = {A_1^r, A_2^r, ..., A_L^r}.    (7)

The dissimilarity between V^q and V^r can then be measured as follows:

d_video(V^q, V^r) = \min_p \frac{1}{N} \sum_{i=1}^{N} d_shot(S_i^q, S_{i+p}^r)    (8)

where p denotes the position of the video shot in the reference video clip at which dissimilarity measurement starts. If d_video(V^q, V^r) is smaller than a prespecified threshold θ_video, then we assume that V^q is a near-duplicate of V^r. Furthermore, as shown in Fig. 4, the semantic dissimilarity between two video shots S^q and S^r in (8) is computed by making use of SQFD [11], [12] as follows:

d_shot(S^q, S^r) = SQFD(A^q, A^r) = \sqrt{(W^q | -W^r) \, G \, (W^q | -W^r)^T}    (9)

where W^q = (w_1^q, ..., w_{|T_q|}^q) and W^r = (w_1^r, ..., w_{|T_r|}^r) denote weight vectors, and (W^q | -W^r) denotes the concatenation of W^q and -W^r. In addition, G denotes a ground similarity matrix of dimension (|T_q| + |T_r|) × (|T_q| + |T_r|). This matrix is the result of applying a ground similarity function to the semantic concepts assigned to S^q and S^r, and where the elements of this matrix are computed as follows:

g_{ij} = f_sim(t_i, t_j)    (10)

with f_sim(t_i, t_j) representing a ground similarity function applied to the tags t_i and t_j. In Section III-C, we provide more information regarding the different ground similarity functions used in our research. Note that the elements of G are dynamically computed for each comparison of two semantic feature signatures, modeling intra- and interdependencies among the two semantic feature signatures compared [12]. We would like to point out that the use of SQFD allows taking into account that the number, the relevance, and the nature of the detected semantic concepts may strongly vary from video shot to video shot. Indeed, by making it possible to compare weighted semantic feature signatures of different dimensions, SQFD facilitates adaptive semantic distance measurement. In addition, the computational complexity of SQFD is lower than the computational complexity of the well-known earth mover's distance, while offering a level of effectiveness that is highly similar [12].

Fig. 4. Matching two semantic feature signatures A^q and A^r by means of SQFD. The application of a ground similarity function to two semantic concepts is denoted by an arrow between the two semantic concepts. The thicker an arrow, the more semantically related the two semantic concepts under consideration.

C. Semantic Similarity Measures

In our research, SQFD makes use of a ground similarity matrix G that is the result of applying a similarity function to the semantic concepts assigned to the two video shots under consideration. The next sections discuss several ground semantic similarity functions in more detail. Note that Section IV subsequently studies the influence of these ground semantic similarity functions on the effectiveness of the proposed approach for NDVC detection in more detail.

1) Tag Occurrence and Co-Occurrence Statistics: The use of tag occurrence and co-occurrence statistics is a popular technique to differentiate relevant tags from nonrelevant tags [13]. This can be expressed as follows:

f_TC(t_i, t_j) = \frac{|I_{t_i ∩ t_j}|}{|I_{t_i}|}    (11)

where the numerator denotes the number of images annotated with both tag t_i and tag t_j, and where the denominator denotes the number of images annotated with tag t_i. In our experiments, t_i and t_j were used to query the Flickr image search engine in order to reliably estimate tag occurrence and co-occurrence statistics.
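As a rough illustration of (8)-(10), the sketch below computes the SQFD between two semantic feature signatures and slides a query clip over a reference clip; f_sim stands for any of the ground similarity functions of Section III-C, and all variable names are ours rather than the paper's.

```python
import numpy as np

def sqfd(sig_q, sig_r, f_sim):
    """Signature quadratic form distance between two weighted signatures, following (9)-(10).

    sig_q, sig_r: dicts mapping semantic concept (tag) -> weight (weights of a signature sum to 1)
    f_sim:        ground similarity function f_sim(t_i, t_j), e.g., one of Section III-C
    """
    tags = list(sig_q) + list(sig_r)
    w = np.array([sig_q[t] for t in sig_q] + [-sig_r[t] for t in sig_r])   # (W^q | -W^r)
    g = np.array([[f_sim(ti, tj) for tj in tags] for ti in tags])          # ground similarity matrix G
    return float(np.sqrt(max(w @ g @ w, 0.0)))    # guard against tiny negative values

def video_distance(query_sigs, ref_sigs, f_sim):
    """Sliding-window matching of a query clip against a reference clip, following (8)."""
    n = len(query_sigs)
    best = float("inf")
    for p in range(len(ref_sigs) - n + 1):          # start position p within the reference clip
        d = sum(sqfd(query_sigs[i], ref_sigs[p + i], f_sim) for i in range(n)) / n
        best = min(best, d)
    return best

# A query clip is declared a near-duplicate of a reference clip when
# video_distance(...) falls below theta_video (set to 0.4 in the experiments).
```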


2) Normalized Google Distance: The normalized Google distance (NGD) is a distance measure derived from the number of hits returned by the Google search engine for a given set of keywords [37]. Using NGD, we measured the semantic similarity between two tags t_i and t_j as follows:

f_NGD(t_i, t_j) = 1 - \frac{\max(\log f(t_i), \log f(t_j)) - \log f(t_i, t_j)}{\log E - \min(\log f(t_i), \log f(t_j))}    (12)

where f(t_i), f(t_j), and f(t_i, t_j) denote the number of Web pages indexed by Google containing t_i, t_j, and both t_i and t_j, respectively, and E is the total number of Web pages indexed by Google. Similar tags tend to have a small NGD value, while dissimilar tags tend to have a large NGD value. Note that NGD is not a metric.

3) Pointwise Mutual Information: Pointwise mutual information (PMI) is a measure of association used in information theory and statistics [38]. The PMI of two semantic concepts t_i and t_j quantifies the discrepancy between two probabilities: 1) the probability of their coincidence given their joint distribution, and 2) the probability of their coincidence given only their individual distributions, assuming independence. Using PMI, we measured the semantic similarity between two tags t_i and t_j as follows:

f_PMI(t_i, t_j) = \frac{1}{2} + \frac{1}{2} \cdot \frac{\log \frac{P(t_i, t_j)}{P(t_i) P(t_j)}}{\log(\max(P(t_i), P(t_j)))}.    (13)

In our experiments, to calculate the probability of a tag t occurring, we made use of the prior frequency of t in MIRFLICKR-25000 as follows:

P(t) = \frac{|I_t|}{|I|}    (14)

where I denotes the set of MIRFLICKR-25000 images and I_t denotes the set of images in MIRFLICKR-25000 annotated with t.

4) Histogram Intersection: Histogram intersection (HI) reflects the number of semantic concepts two semantic feature signatures A^q and A^r have in common. This can be expressed as follows:

f_HI(t_i, t_j) = \begin{cases} \frac{1}{|A^q| |A^r|}, & \text{if } t_i = t_j \\ 0, & \text{otherwise.} \end{cases}    (15)
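For illustration only, the four ground similarity functions might be coded as shown below, mirroring (11)-(15) as reconstructed above; the hit counts, probabilities, and signature sizes are assumed to be supplied by the caller (e.g., obtained from Flickr or Google queries, or from MIRFLICKR-25000 statistics), and no actual search-engine API calls are made here.

```python
import math

def f_tc(n_both, n_ti):
    """Tag co-occurrence similarity, eq. (11): |I_{ti ∩ tj}| / |I_{ti}|."""
    return n_both / n_ti if n_ti else 0.0

def f_ngd(hits_ti, hits_tj, hits_both, total_pages):
    """NGD-based similarity, eq. (12): one minus the normalized Google distance."""
    num = max(math.log(hits_ti), math.log(hits_tj)) - math.log(hits_both)
    den = math.log(total_pages) - min(math.log(hits_ti), math.log(hits_tj))
    return 1.0 - num / den

def f_pmi(p_ti, p_tj, p_both):
    """PMI-based similarity, eq. (13), with tag priors estimated as in (14)."""
    return 0.5 + 0.5 * math.log(p_both / (p_ti * p_tj)) / math.log(max(p_ti, p_tj))

def f_hi(ti, tj, size_q, size_r):
    """Histogram-intersection similarity, eq. (15); size_q and size_r are |A^q| and |A^r|."""
    return 1.0 / (size_q * size_r) if ti == tj else 0.0
```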

Fig. 5. Example key frames, visualizing the different transformations used. (a) Original. (b) Blurring. (c) Picture-in-picture. (d) Change in brightness. (e) Cropping. (f) Mirroring. (g) Resizing. (h) Shifting.

IV. Experimental Setup

This section discusses our experimental setup. We first describe the data sets used. We subsequently explain how we realized model-based and model-free semantic concept detection. Next, we discuss how we implemented video matching. Finally, we detail how we measured the effectiveness of NDVC detection.

A. Data Sets

Our experiments made use of the publicly available TRECVID 2009 video set [14] to create a reference video database and NDVCs.

The TRECVID 2009 video set consists of 400 video clips, having a total duration of 100 h and containing a total of 16 324 video shots. All video clips have a resolution of 352 × 288 pixels and were encoded using MPEG-1. We performed shot detection by making use of the technique proposed in [32]. This shot detection technique, which was also used to provide the master shot reference for the TRECVID 2009 video set, is highly effective and has a low computational complexity. Furthermore, we used the frame in the middle of each shot as a representative key frame.

We generated 140 query video clips by applying seven transformations to 20 video clips randomly selected from the reference video database, taking into account the constraint that each video clip selected has a minimum length of 60 video shots (the 20 video clips selected contain 1892 video shots in total). The seven transformations are listed below (a rough frame-level sketch follows the list).

1) Blurring: We blurred frames using a Gaussian kernel with a radius of 15.
2) Picture-in-picture: We inserted a picture with a size that is 30% of the size of the main frame.
3) Change in brightness: We increased the brightness by 40%.
4) Cropping: We cropped frames to 50% of their original size, not preserving the center region.
5) Mirroring: We reversed frames from the left to the right.
6) Resizing: We resized frames to 50% of their original size.
7) Shifting: We shifted frames down and to the right.
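The actual query clips were produced with the TRECVID tools [33]; purely as an illustration, a frame-level approximation of the seven transformations could look as follows (a Pillow-based sketch with arbitrarily chosen inset content and shift offsets).

```python
from PIL import Image, ImageEnhance, ImageFilter, ImageOps

def transform_frame(frame, kind):
    """Frame-level approximations of the seven query transformations (illustrative only)."""
    w, h = frame.size
    if kind == "blur":                  # 1) Gaussian blur with radius 15
        return frame.filter(ImageFilter.GaussianBlur(radius=15))
    if kind == "picture_in_picture":    # 2) insert a picture covering roughly 30% of the main frame
        inset = frame.resize((int(w * 0.3), int(h * 0.3)))   # a resized copy stands in for the inset
        out = frame.copy()
        out.paste(inset, (10, 10))
        return out
    if kind == "brightness":            # 3) +40% brightness
        return ImageEnhance.Brightness(frame).enhance(1.4)
    if kind == "crop":                  # 4) crop to 50% of the original size, off center
        return frame.crop((0, 0, w // 2, h // 2))
    if kind == "mirror":                # 5) left-right mirroring
        return ImageOps.mirror(frame)
    if kind == "resize":                # 6) resize to 50% of the original size
        return frame.resize((w // 2, h // 2))
    if kind == "shift":                 # 7) shift down and to the right
        out = Image.new(frame.mode, (w, h))
        out.paste(frame, (w // 10, h // 10))
        return out
    raise ValueError(kind)
```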


Fig. 5 visualizes the different transformations used. Note that these transformations were also used by the content-based video copy detection task of TRECVID 2009 [33]. In addition, note that we created the 140 query video clips by relying on the tools used to create the TRECVID query video clips [33].

Finally, our experiments made use of the publicly available MIRFLICKR-25000 image set as a source of collective knowledge [15]. As the name suggests, MIRFLICKR-25000 consists of 25 000 images downloaded from Flickr, annotated with a total of 223 537 tags by 9862 users (68 004 tags are unique). The highest number of images contributed by a single user is 41, and 1386 tags were assigned to at least 20 images.

B. Semantic Concept Detection

1) Model-Based Semantic Concept Detection: We adopted the VIREO-374 semantic concept models to realize model-based semantic concept detection. Each model consists of a binary SVM-based classifier, using a Chi-squared kernel and trained using BoVW. For more information about VIREO-374, we would like the interested reader to refer to [34].

2) Model-Free Semantic Concept Detection: Our experiments made use of two types of descriptors to represent the visual content of the folksonomy images and the key frames selected from the reference video clips and the NDVCs: the MPEG-7 scalable color descriptor (SCD), which is a global content descriptor, and BoVW, which is a local content descriptor. SCD represents the visual content of a folksonomy image or a key frame with a 256-D vector of color features [35]. As recommended by the MPEG-7 standard, we measured the visual distance between two SCD representations by means of the L1 metric [36]. BoVW represents the visual content of a folksonomy image or a key frame by means of visual words. In our experiments, we made use of a vocabulary of 500 visual words derived from 61 901 training images. In addition, we detected, described, and clustered interest points using difference of Gaussians (DoG), SIFT, and k-means clustering, respectively. Moreover, we measured the dissimilarity between two BoVW representations by means of the cosine distance.

Furthermore, given the number of images in MIRFLICKR-25000, on the one hand, and given that no technique is available that allows selecting an optimal number of visual neighbors, on the other hand, we used the ten topmost visual neighbors of a key frame to realize tag relevance learning, trading off complexity and precision. Note that the proposed approach for NDVC detection always makes use of the semantic concepts associated with the ten best visual neighbors found, even when these visual neighbors are visually distant from the key frames under consideration. Finally, we empirically set θ_tag to 1.6 for all experiments conducted (unless otherwise mentioned).

C. Matching of Semantic Video Signatures

To match semantic video signatures, we made use of a sequential sliding window, having a size equal to the number of shots in the query video clip. Also, we empirically set θ_video to 0.4 for all experiments conducted (unless otherwise mentioned).
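As an illustrative sketch (not the exact implementation used in the experiments), the BoVW representation and the cosine dissimilarity used for image-to-video matching could be computed as follows, assuming OpenCV's SIFT detector (DoG keypoints) and a k-means vocabulary; the ten visual neighbors of a key frame would then simply be the folksonomy images with the smallest cosine distance.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(training_image_paths, n_words=500):
    """Cluster SIFT descriptors (extracted at DoG keypoints) into a visual vocabulary."""
    sift = cv2.SIFT_create()
    descriptors = []
    for path in training_image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        if desc is not None:
            descriptors.append(desc)
    return MiniBatchKMeans(n_clusters=n_words).fit(np.vstack(descriptors))

def bovw_histogram(image_path, vocabulary):
    """Represent an image (folksonomy image or key frame) as a normalized bag of visual words."""
    sift = cv2.SIFT_create()
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(gray, None)
    hist = np.zeros(vocabulary.n_clusters)
    if desc is not None:
        for word in vocabulary.predict(desc):
            hist[word] += 1
    return hist / max(hist.sum(), 1.0)

def cosine_distance(h1, h2):
    """Dissimilarity between two BoVW histograms, as used for image-to-video matching."""
    denom = np.linalg.norm(h1) * np.linalg.norm(h2)
    return 1.0 if denom == 0 else 1.0 - float(h1 @ h2) / denom
```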

D. Measurement of NDVC Detection Effectiveness

The effectiveness of NDVC detection was measured using the normalized detection cost ratio (NDCR) [39] as follows:

NDCR = P_miss + β · R_FA    (16)

where

P_miss = \frac{N_FN}{N_query},   R_FA = \frac{N_FP}{T_refdata · T_query}    (17)

with P_miss denoting the probability of a miss and R_FA denoting the false alarm rate. Furthermore, N_query, N_FN, and N_FP represent the number of query video clips, false negatives, and false positives, respectively, while T_refdata and T_query denote the total duration of the reference and the query video clips, respectively (duration is expressed in hours). In addition, β is a factor that trades off the cost of missing a true positive and the cost of having to deal with a false alarm. In our experiments, following [39] and [40], we set β to 2 (balanced profile in [39]).
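As a small worked example of (16) and (17), with made-up counts rather than results from the paper:

```python
def ndcr(n_fn, n_fp, n_query, t_refdata_hours, t_query_hours, beta=2.0):
    """Normalized detection cost ratio, following (16)-(17)."""
    p_miss = n_fn / n_query                            # probability of a miss
    r_fa = n_fp / (t_refdata_hours * t_query_hours)    # false alarm rate
    return p_miss + beta * r_fa

# Illustrative numbers: 3 missed queries and 2 false alarms over 140 queries,
# 100 h of reference video and roughly 37 h of query video give
# NDCR = 3/140 + 2 * 2/(100 * 37) ≈ 0.0225.
cost = ndcr(n_fn=3, n_fp=2, n_query=140, t_refdata_hours=100, t_query_hours=37)
```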

V. Experimental Results

This section investigates the effectiveness and efficiency of NDVC detection using model-free semantic concept detection and SQFD. To that end, we conducted experiments: 1) to test the use of different ground similarity functions; 2) to compare the effectiveness of the proposed technique for NDVC detection with the effectiveness of NDVC detection using visual features; 3) to analyze the NDCR as a function of different types of video content; 4) to investigate the influence of the effectiveness of semantic concept detection on the effectiveness of NDVC detection; 5) to study the influence of the amount of collective knowledge on the effectiveness of NDVC detection; 6) to investigate the influence of the number of shots in a query video clip on the effectiveness of NDVC detection; and 7) to assess the time complexity of creating and matching semantic video signatures.

A. Influence of Different Ground Similarity Functions

In this section, we investigate how the effectiveness of NDVC detection varies as a function of the four ground similarity functions previously discussed in Section III-C. From Table I, we can observe that the use of f_TC, f_NGD, and f_PMI significantly outperforms the use of f_HI. Also, given a certain type of visual features, we can observe that f_TC outperforms all other ground similarity functions. In addition, we can observe that f_TC and f_PMI outperform f_NGD. The latter can be attributed to the fact that f_TC and f_PMI make use of statistics derived from a repository of images, whereas f_NGD makes use of statistics derived from a repository of text documents. As a result, throughout the remainder of this paper, we make use of f_TC as our ground similarity function.

B. Comparison With NDVC Detection Using Visual Features

In this section, we compare the effectiveness of NDVC detection using model-free semantic concept detection and SQFD with the effectiveness of three state-of-the-art NDVC detection techniques using visual features.


TABLE I
Influence of Different Ground Similarity Functions on the Effectiveness of NDVC Detection

                                  f_TC     f_NGD    f_PMI    f_HI
Proposed technique with SCD       0.113    0.152    0.144    0.269
Proposed technique with BoVW      0.054    0.092    0.078    0.211

Effectiveness of NDVC detection is expressed in terms of NDCR, averaged over the seven transformations used (the lower the NDCR, the higher the effectiveness of NDVC detection).

The latter either make use of temporal ordinal measurement [41], principal component analysis (PCA)-SIFT features [42], or BoVW [43].

Temporal ordinal measurement subdivides a frame into a fixed-size grid of regions, computing the average luminance value of each region. However, whereas spatial ordinal measurement uses the spatial rank of regions in order to create a video signature (by sorting regions within a single frame according to increasing or decreasing average luminance values), temporal ordinal measurement uses the temporal rank of regions in order to create a video signature (by sorting regions over multiple frames according to increasing or decreasing average luminance values).

Both NDVC detection using PCA-SIFT features and NDVC detection using BoVW detect keypoints by means of DoG. Compared to the use of conventional SIFT features, the use of PCA-SIFT features allows for a more compact representation of the image content. In particular, the dimension of a PCA-SIFT descriptor is 36, whereas the dimension of a conventional SIFT descriptor is 128. Our experiments with NDVC detection using BoVW followed the recommendations made in [43]; we used SIFT to describe the keypoints of key frames and we used a visual vocabulary of 20 000 words. To measure the closeness between two key frames, we used the cosine similarity.

As shown by Fig. 6, irrespective of the use of BoVW or SCD, we can observe that NDVC detection using model-free semantic concept detection and SQFD outperforms the three NDVC detection techniques using visual features, especially when taking into account the average effectiveness of NDVC detection (please see the category labeled "average" in Fig. 6). This can be primarily attributed to the fact that the proposed technique for NDVC detection takes advantage of semantic features. Indeed, compared to visual features, semantic features tend to be more robust against content transformations. Furthermore, Fig. 6 shows that the proposed technique for NDVC detection is most effective when making use of BoVW-based image-to-video matching. Indeed, when making use of SCD-based image-to-video matching, the proposed technique for NDVC detection is more sensitive to cropping, change in brightness, and shifting. On the other hand, compared to the use of SCD-based image-to-video matching, the proposed technique for NDVC detection is more sensitive to blurring when making use of BoVW-based image-to-video matching.

We can also observe that the pattern insertion transformation has the highest impact on the effectiveness of all NDVC detection techniques studied, with NDVC detection using a temporal ordinal signature being the most sensitive. NDVC detection using a temporal ordinal signature is also highly sensitive to cropping.

Fig. 6. Effectiveness of NDVC detection for the seven transformations used.

Finally, we can observe that NDVC detection using PCA-SIFT features and NDVC detection using BoVW are highly sensitive to mirroring and blurring. However, we can also observe that the proposed technique for NDVC detection, when making use of BoVW, is significantly more robust against mirroring and blurring than NDVC detection using either PCA-SIFT or BoVW. This can be attributed to the fact that the proposed technique for NDVC detection only makes use of BoVW (i.e., SIFT) for retrieving images from an image folksonomy that are visually similar to the key frames extracted from a particular video clip, subsequently identifying NDVCs by making use of the semantic concepts associated with the folksonomy images retrieved for the key frames under consideration (i.e., the proposed technique for NDVC detection does not directly use SIFT to identify NDVCs).

We illustrate the aforementioned observation with an example. Given a first key frame that is the result of blurring or mirroring a second key frame, the direct use of SIFT for detecting whether the first key frame is a near-duplicate of the second key frame will often result in a false negative. Indeed, as shown by Fig. 7, only a low number of keypoints can typically be matched in that case (due to the sensitivity of SIFT to blurring and mirroring). This is different from the case where SIFT is only used for image-to-video matching, and where NDVCs are subsequently identified by making use of the semantic concepts associated with the images retrieved. Although different visual neighbors are typically obtained when SIFT is used to retrieve folksonomy images that are visually similar to the two key frames under consideration (given that SIFT is sensitive to blurring and mirroring), the semantic concepts associated with the folksonomy images retrieved will, in general, still be consistent, making it possible to detect that the first key frame is a near-duplicate of the second key frame (true positive). This is illustrated by Fig. 8, showing that the visual neighbors still convey the same semantic information (e.g., face and portrait), despite the fact that different visual neighbors have been retrieved for the key frames used. As a result, throughout the remainder of this paper, we make use of BoVW for matching folksonomy images to key frames.

C. NDCR as a Function of Different Types of Video Content

To facilitate effective NDVC detection, video signatures need to be able to deal with different types of video content.


Fig. 7. SIFT-based matching between the keypoints of key frames that have been the subject of (a) blurring (145/761) and (b) mirroring (82/761). The caption of each figure indicates the number of keypoints that could be matched over the total number of keypoints detected (for reasons of clarity, we have only visualized a fraction of the total number of keypoints that could be matched).

Fig. 9. Example key frames extracted from video clips representative for each category used. (a) Documentary content. (b) News reports and talk show content. (c) Drama and movie content. (d) Miscellaneous video content.

Fig. 8. Example key frames and visual neighbors.

Fig. 10. Influence of different types of video content on the effectiveness of NDVC detection, for both model-based and model-free semantic concept detection. The popularity of the VIREO-374 semantic concept models was measured by counting the number of video shots in the reference video database in which each semantic concept model appears [16].

Therefore, we ran an experiment to test the effectiveness of our semantic approach toward the task of NDVC detection when using different types of video content (see Fig. 9), relying on both model-based and model-free semantic concept detection. The following four categories were used.

Category 1) Documentary content.
Category 2) News reports and talk show content.
Category 3) Drama and movie content.
Category 4) Miscellaneous video content (animation, sports, and other types of video content).

For each category, we selected ten representative video clips from the reference video database, applying seven transformations to the video clips selected (see Section IV-A). Fig. 10 shows the effectiveness of NDVC detection for the four types of video content used and the 280 (4 categories × 10 video clips × 7 transformations) query video clips created. We can observe that the effectiveness of NDVC detection using model-free semantic concept detection is stable and high (all NDCR values are lower than 0.1), irrespective of the type of video content used. We can also observe that the effectiveness of NDVC detection using model-based semantic concept detection is lower than the effectiveness of NDVC detection using model-free semantic concept detection. This holds especially true for video clips belonging to Category 1 (documentary content) and Category 4 (miscellaneous video content), illustrating that NDVC detection using model-based semantic concept detection has difficulties in generalizing. Whereas the semantic coverage of model-based semantic concept detection is limited to a static vocabulary restricted in size, model-free semantic concept detection is able to take advantage of a dynamic vocabulary not restricted in size. As such, it can be expected that model-based semantic concept detection will have even more difficulties in generalizing when making use of user-generated video content.

Fig. 11. Example key frames with detected semantic concepts (correctly detected semantic concepts have been underlined).

To further illustrate the difference in semantic coverage between model-based and model-free semantic concept detection, Fig. 11 shows a number of example key frames, annotated with semantic concepts that have been detected by means of either a model-based approach (i.e., using all of the VIREO-374 semantic concept models) or a model-free approach (i.e., using the collective knowledge present in MIRFLICKR-25000). We can observe that for the model-based approach, semantic concepts not learned during training cannot be detected during testing, an observation that holds particularly true for animated video content. On the other hand, although some of the detected semantic concepts are not relevant, we can observe that model-free semantic concept detection allows for a higher semantic coverage.


Fig. 12. Precision of annotation versus the average number of semantic concepts detected per shot, plotted for a varying tag relevance threshold θ_tag.

D. Influence of Semantic Concept Detection Effectiveness

In this section, we seek a better understanding of the usefulness of model-free semantic concept detection for both the task of annotation and the task of NDVC detection.

1) Annotation: We first investigate the influence of incorrectly detected semantic concepts on the effectiveness of annotation. To that end, we calculate the precision of the detected semantic concepts as a function of the tag relevance threshold θ_tag (see Section III-A), for 20 video clips randomly selected from the reference video database. Precision is computed by dividing the number of true positives per shot by the total number of true and false positives per shot, and then averaging this number over all shots and video clips used. We assume that the precision of the detected semantic concepts for the 20 reference video clips is also representative for their near-duplicate versions (given that the transformations used to create NDVCs tend to preserve semantic information). Note that we varied the value of the tag relevance threshold θ_tag between 1.0 and 2.5 in steps of 0.1.

Fig. 12 allows making the following observation: the higher the tag relevance threshold, the higher the precision of annotation, and vice versa. However, we can also observe that the higher the tag relevance threshold, the lower the average number of detected semantic concepts per shot (this average includes both true and false positives), and vice versa.

2) NDVC Detection: In this section, we investigate the influence of incorrectly detected semantic concepts on the effectiveness of NDVC detection. Fig. 13(a) shows the effectiveness of NDVC detection as a function of the precision of semantic concept detection. We can observe that the effectiveness of NDVC detection is highly robust against a low precision of semantic concept detection (where the latter is the result of using a low tag relevance threshold). When the precision is higher than 0.4, we can observe that the effectiveness of NDVC detection quickly decreases. This can be attributed to the presence of a lower number of semantic concepts (see Fig. 12), resulting in less discriminative power.

Fig. 13(b) shows the effectiveness of NDVC detection as a function of the average number of semantic concepts detected per shot. We can observe that the effectiveness of NDVC detection is highly robust when detecting a high number of semantic concepts per shot (which is the case when a low tag relevance threshold is used).

Fig. 13. Effectiveness of NDVC detection. (a) As a function of the precision of semantic concept detection. (b) As a function of the average number of semantic concepts detected per shot.

When the average number of semantic concepts detected per shot is lower than five, we can observe that the effectiveness of NDVC detection quickly decreases. Similar to what we observed for Fig. 13(a), the presence of a lower number of semantic concepts results in less discriminative power.

Summarizing the experimental results presented in Figs. 12 and 13, we can conclude that the problem of detecting semantic concepts for the goal of identifying NDVCs is more relaxed than the problem of detecting semantic concepts for annotation purposes. Although incorrectly detected semantic concepts negatively affect the effectiveness of automatic annotation, they do not negatively affect the effectiveness of NDVC detection, as long as the same incorrect semantic concepts are detected for both the reference video clips and the NDVCs.

To gain further insight into the influence of the effectiveness of semantic concept detection on the effectiveness of annotation and NDVC detection, Fig. 14 shows an example key frame and its transformed versions, annotated with semantic concepts that have been detected by making use of the collective knowledge present in MIRFLICKR-25000. First, we can observe that several semantic concepts are irrelevant to the video content. This can mainly be attributed to two reasons: the limited effectiveness of content-based retrieval and the presence of noisy tags in MIRFLICKR-25000 [31]. Second, we can observe that the transformed key frames have a significant number of detected semantic concepts in common with the original key frames. These common concepts contribute to a higher effectiveness of NDVC detection, regardless of their relevance.

E. Influence of the Amount of Collective Knowledge

In this section, we investigate the influence of the amount of collective knowledge in an image folksonomy on the effectiveness of NDVC detection.


Fig. 15. Influence of the size of an image folksonomy. (a) Effectiveness of NDVC detection as a function of the number of tag assignments. (b) Number of unique semantic concepts detected for the NDVCs as a function of the number of folksonomy images used.

Fig. 14. Detected semantic concepts for an example key frame and its transformed versions. Correct semantic concepts have been underlined. In addition, semantic concepts that have been detected for both the reference and near-duplicate video clips have been marked in bold.

Given that the number of tag assignments intuitively best represents the amount of collective knowledge in an image folksonomy, Fig. 15(a) illustrates how the effectiveness of NDVC detection varies as a function of the number of tag assignments in MIRFLICKR-25000. The 25 data points plotted in Fig. 15(a) were generated as follows: starting with all 25 000 MIRFLICKR-25000 images, we incrementally removed 1000 randomly selected folksonomy images. We can observe that the use of an increasing number of tag assignments results in an increase of the effectiveness of NDVC detection. However, we can also observe that the effectiveness of NDVC detection saturates when using more than 200 000 tag assignments. This can likely be attributed to an equal distribution of the amount of collective knowledge (i.e., the number of tag assignments) over the folksonomy images. We observed a highly similar trend when plotting the effectiveness of NDVC detection as a function of the number of images in MIRFLICKR-25000.

Furthermore, we investigate how the semantic coverage of the image folksonomy changes as the number of folksonomy images changes. To that end, we analyzed how the number of unique semantic concepts detected for the NDVCs changes as the number of folksonomy images changes (assuming that the higher the number of unique semantic concepts detected for the NDVCs, the higher the semantic coverage of the image folksonomy). Fig. 15(b) shows that the use of an increasing number of folksonomy images results in an increase of the number of unique semantic concepts detected for the NDVCs, and thus in a higher semantic coverage of the image folksonomy. Note that the number of unique semantic concepts detected for the NDVCs includes both correctly and incorrectly detected semantic concepts.

F. Influence of the Number of Shots in a Query Video Clip

In this section, we investigate the influence of the number of shots in a query video clip on the effectiveness of our approach for NDVC detection, given that our approach needs a few shots in order to work properly. Fig. 16 shows the effectiveness of NDVC detection as a function of the number of shots in a query video clip. To that end, we varied the number of shots in each of the 140 query video clips from 5 to 60 (given that the minimum number of shots in the 140 query video clips is 60). We can observe that the use of an increasing number of shots results in an increasing effectiveness of NDVC detection. We can also observe that the effectiveness of NDVC detection saturates as soon as each of the 140 query video clips has more than 25 video shots.

G. Time Complexity of Creating and Matching Signatures

In this section, we investigate the time complexity of creating and matching semantic video signatures. To that end, we conducted experiments on a PC with an Intel Pentium IV 2.4 GHz CPU and 2 GB of system memory, running Windows XP with a 500 GB 7200 r/min hard disk. Before discussing our experimental results, we first would like to point out that an NDVC detection system typically consists of an offline and an online part.


Fig. 16. Effectiveness of NDVC detection as a function of the number of shots in a query video clip.

Fig. 17. Time complexity of creating semantic video signatures.

Fig. 18. Time complexity of matching semantic video signatures. (a) As a function of the average number of semantic concepts detected per shot. (b) As a function of the number of shots in a query video clip.

The offline part is responsible for creating video signatures for the set of reference video clips. The online part is responsible for determining whether a query video clip is an NDVC of one of the reference video clips. Online processing typically consists of creating a video signature for the query video clip and subsequently matching the video signature of the query clip against the video signatures of the reference video clips. This can, for instance, be done when uploading the query video clip to an online video repository.

Fig. 17 shows the time complexity of creating a semantic video signature as a function of the number of shots in a query video clip, varying the number of shots in each of the 140 query video clips from 5 to 60 and averaging the execution times obtained for the 140 query video clips. Note that the time complexity of creating a semantic video signature includes the time complexity of shot detection, visual feature extraction, and semantic concept detection. We can see that the time complexity of creating a semantic video signature is linearly dependent on the number of shots in a query video clip. In other words, the time complexity of creating a semantic video signature is O(m), where m is the number of shots processed.

To study the time complexity of matching, we measured the time needed to match the 140 query video clips against 400 reference video clips (see Section IV-A) by means of a sliding window approach. Note that the average number of video shots in the 140 video clips is 94.6, and that the average duration of the query video clips is about 16 min. Furthermore, we varied the average number of semantic concepts detected per shot by varying the tag relevance threshold between 1.0 and 1.8 in steps of 0.1. Fig. 18(a) shows that the time complexity of matching semantic video signatures using SQFD depends in a nonlinear way on the average number of semantic concepts assigned to the shots in the query and reference video clips.

similarity function fTC is constant (i.e., (O(1)), the time complexity can be characterized as O((n + n)2 ), where n is the average number of semantic concepts assigned to a shot. Indeed, the number of detected semantic concepts has a direct impact on the dimensionality of the ground similarity matrix G. Furthermore, given the use of a sliding window approach and assuming that the number of semantic concepts per shot is constant, Fig. 18(b) illustrates that the time complexity of matching the semantic signature of a query video clip against the semantic signatures of the reference video clips using SQFD is linearly dependent on the number of shots in a query video clip (for this experiment, we assigned ten semantic concepts to each shot in the query and reference video clips). Note that more efcient matching could, for instance, be achieved by making use of hierarchical matching [25] or by simply making use of parallelism, distributing a query over multiple nodes. Also, for a discussion of more advanced techniques for optimized indexing and matching using SQFD, we would like the reader to refer to [45][47].
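To make the complexity discussion concrete, the Python sketch below (for illustration only, not the authors' implementation) computes the SQFD between two per-shot semantic signatures and matches a query against a reference clip with a sliding window. The ground similarity function is supplied by the caller; in the paper this role is played by the tag co-occurrence based function f_TC. The per-shot averaging inside the window and the decision threshold are assumptions made for this example. Building the (n_q + n_r) x (n_q + n_r) ground similarity matrix accounts for the quadratic dependence on the number of concepts per shot, while the number of window positions accounts for the linear dependence on the number of shots.

```python
import numpy as np

def sqfd(query_sig, ref_sig, ground_sim):
    """Signature quadratic form distance between two semantic shot signatures.

    Each signature is a list of (concept, weight) pairs; ground_sim(a, b) returns
    the similarity between two concepts (a stand-in for the paper's f_TC). The
    ground similarity matrix has dimension (n_q + n_r) x (n_q + n_r), hence the
    quadratic cost in the number of concepts per shot.
    """
    concepts = [c for c, _ in query_sig] + [c for c, _ in ref_sig]
    weights = np.array([w for _, w in query_sig] + [-w for _, w in ref_sig])
    n = len(concepts)
    g = np.array([[ground_sim(concepts[i], concepts[j]) for j in range(n)]
                  for i in range(n)])
    # Guard against tiny negative values caused by floating-point error.
    return float(np.sqrt(max(weights @ g @ weights, 0.0)))

def match_query(query_shots, ref_shots, ground_sim, threshold):
    """Slide the query over the reference clip and collect NDVC candidates.

    Both arguments are lists of per-shot signatures. The number of window
    positions, and hence the matching time, grows linearly with the number of
    shots; averaging the per-shot distances and thresholding them is an
    illustrative decision rule, not necessarily the one used in the paper.
    """
    m = len(query_shots)
    candidates = []
    for start in range(len(ref_shots) - m + 1):
        window = ref_shots[start:start + m]
        avg_dist = np.mean([sqfd(q, r, ground_sim)
                            for q, r in zip(query_shots, window)])
        if avg_dist < threshold:
            candidates.append((start, avg_dist))
    return candidates
```

In this sketch, doubling the number of concepts per shot roughly quadruples the cost of each sqfd call, while doubling the number of reference shots roughly doubles the number of sliding-window evaluations, mirroring the trends reported in Fig. 18(a) and (b).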

VI. Conclusion

Motivated by the central observation that content transformations tend to preserve the semantic information conveyed by video clips, this paper introduced a novel technique for NDVC detection, leveraging model-free semantic concept detection and adaptive semantic distance measurement. Model-free semantic concept detection is realized by taking advantage of the collective knowledge in an image folksonomy, facilitating the use of an unrestricted vocabulary of semantic concepts. Adaptive semantic distance measurement is realized by means of SQFD, making it possible to flexibly measure the similarity between video shots that contain a varying number of semantic concepts, and where these semantic concepts may also differ in terms of relevance and nature.


To investigate the feasibility of NDVC detection using model-free semantic concept detection and SQFD, experiments were performed with two publicly available data sets: TRECVID 2009, used for creating a reference video database and NDVCs, and MIRFLICKR-25000, used as a source of collective knowledge. This led to the following findings.
1) An image folksonomy and SQFD can be effectively used for detecting NDVCs when collective knowledge is retrieved from the image folksonomy by means of BoVW and when SQFD makes use of tag occurrence and co-occurrence statistics.
2) The effectiveness of NDVC detection using model-free semantic concept detection is higher than the effectiveness of NDVC detection using model-based concept detection, mainly thanks to the higher semantic coverage of the former.
3) The effectiveness of the proposed technique for NDVC detection is on par with or better than the effectiveness of three state-of-the-art NDVC detection techniques making use of either temporal ordinal measurement, PCA-SIFT, or BoVW.
4) The problem of semantic concept detection for the goal of identifying NDVCs is more relaxed than the problem of semantic concept detection for the purpose of annotation.
5) The effectiveness of NDVC detection starts to stabilize when detecting more than five semantic concepts per shot.
6) The use of an increasing amount of collective knowledge retrieved from MIRFLICKR-25000 has a positive effect on the effectiveness of NDVC detection, with the effectiveness starting to saturate when making use of over 200 000 tag assignments.
7) The creation of semantic video signatures has a linear time complexity in terms of the number of shots in a query video clip. Similarly, matching semantic video signatures by means of SQFD has a quadratic time complexity in terms of the average number of semantic concepts detected per shot and a linear time complexity in terms of the number of shots in a query video clip.
Directions for future research: We plan to investigate the fusion of visual and semantic features in order to facilitate more robust NDVC detection. We also plan to study the influence of different types of image folksonomies (i.e., different types of collective knowledge) on the effectiveness of NDVC detection.

References
[1] R. D. Oliveira, M. Cherubini, and N. Oliver, "Looking at near-duplicate videos from a human-centric perspective," ACM Trans. Multimedia Comput., Commun. Applicat., vol. 6, no. 3, pp. 1-22, Aug. 2010.
[2] P. Brasnett, S. Paschalakis, and M. Bober, "Recent developments on standardisation of MPEG-7 visual signature tools," in Proc. IEEE Int. Conf. Multimedia Expo., Jul. 2010, pp. 1347-1352.

[3] X. Wu, C.-W. Ngo, A. Hauptmann, and H.-K. Tan, "Real-time near-duplicate elimination for web-video search with content and context," IEEE Trans. Multimedia, vol. 11, no. 2, pp. 196-207, Feb. 2009.
[4] S. H. Kim and R.-H. Park, "An efficient algorithm for video sequence matching using the modified Hausdorff distance and the directed divergence," IEEE Trans. Circuits Syst. Video Tech., vol. 12, no. 7, pp. 592-596, Jul. 2002.
[5] D. N. Bhat and S. K. Nayar, "Ordinal measures for image correspondence," IEEE Trans. Patt. Anal. Mach. Intell., vol. 20, no. 4, pp. 415-423, Apr. 1998.
[6] J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet-Brunet, N. Boujemaa, and F. Stentiford, "Video copy detection: A comparative study," in Proc. ACM Int. Conf. Image Video Retrieval, 2007, pp. 371-378.
[7] S. Lian, N. Nikolaidis, and H. T. Sencar, "Content-based video copy detection: A survey," in Proc. Intell. Multimedia Anal. Security Applicat., 2010, pp. 253-273.
[8] A. Hauptmann, R. Yan, W.-H. Lin, M. Christel, and H. Wactlar, "Can high level concepts fill the semantic gap in video retrieval? A case study with broadcast news," IEEE Trans. Multimedia, vol. 9, no. 5, pp. 958-966, Aug. 2007.
[9] X.-J. Wang, L. Zhang, X. Li, and W.-Y. Ma, "Annotating images by mining image search results," IEEE Trans. Patt. Anal. Mach. Intell., vol. 30, no. 11, pp. 1-14, Nov. 2008.
[10] S. Lee, W. De Neve, K. N. Plataniotis, and Y. M. Ro, "MAP-based image tag recommendation using a visual folksonomy," Patt. Recog. Lett., vol. 31, no. 19, pp. 976-982, 2010.
[11] C. Beecks, M. S. Uysal, and T. Seidl, "Signature quadratic form distances for content-based similarity," in Proc. ACM Int. Conf. Multimedia, 2009, pp. 697-700.
[12] C. Beecks, M. S. Uysal, and T. Seidl, "Signature quadratic form distance," in Proc. ACM Int. Conf. Image Video Retrieval, 2010, pp. 438-445.
[13] S. Lee, W. De Neve, and Y. M. Ro, "Tag refinement in an image folksonomy using visual similarity and tag co-occurrence statistics," Signal Process.: Image Commun., vol. 25, no. 10, pp. 761-773, 2010.
[14] P. Over, G. Awad, J. Fiscus, M. Michel, A. F. Smeaton, and W. Kraaij, "TRECVID 2009: Goals, tasks, data, evaluation mechanisms and metrics," in Proc. TRECVid Workshop, 2009, pp. 1-42.
[15] M. J. Huiskes and M. S. Lew, "The MIR flickr retrieval evaluation," in Proc. ACM Int. Conf. Multimedia Inform. Retrieval, 2008, pp. 39-43.
[16] H.-S. Min, J. Y. Choi, W. De Neve, and Y. M. Ro, "Leveraging an image folksonomy and the signature quadratic form distance for semantic-based detection of near-duplicate video clips," in Proc. IEEE Int. Conf. Multimedia Expo., Jul. 2011, pp. 1-6.
[17] H.-S. Min, J. Y. Choi, W. De Neve, and Y. M. Ro, "Towards a better understanding of model-free semantic concept detection for annotation and near-duplicate video clip detection," in Proc. IEEE Int. Conf. Image Process., Sep. 2011, pp. 3682-3685.
[18] S. Yang, S. K. Kim, and Y. M. Ro, "Semantic home photo categorization," IEEE Trans. Circuits Syst. Video Tech., vol. 17, no. 3, pp. 324-335, Mar. 2007.
[19] E. Chang, G. Kingshy, G. Sychay, and G. Wu, "CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines," IEEE Trans. Circuits Syst. Video Tech., vol. 13, no. 1, pp. 26-38, Jan. 2003.
[20] J. Li and J. Z. Wang, "Real-time computerized annotation of pictures," IEEE Trans. Patt. Anal. Mach. Intell., vol. 30, no. 6, pp. 985-1002, Jun. 2008.
[21] X. Wu, C.-W. Ngo, and W.-L. Zhao, "Data-driven approaches to community-contributed video applications," IEEE Multimedia, vol. 17, no. 4, pp. 58-69, Oct.-Dec. 2010.
[22] A. Joly, O. Buisson, and C. Frelicot, "Content-based copy retrieval using distortion-based probabilistic similarity search," IEEE Trans. Multimedia, vol. 9, no. 2, pp. 293-306, Feb. 2007.
[23] C. Kim and B. Vasudev, "Spatiotemporal sequence matching for efficient video copy detection," IEEE Trans. Circuits Syst. Video Tech., vol. 15, no. 1, pp. 127-131, Jan. 2005.
[24] J. Zhu, S. C. H. Hoi, M. R. Lyu, and S. Yan, "Near-duplicate keyframe retrieval by nonrigid image matching," in Proc. ACM Int. Conf. Multimedia, 2008, pp. 41-50.
[25] X. Wu, A. G. Hauptmann, and C.-W. Ngo, "Practical elimination of near-duplicates from web video search," in Proc. ACM Int. Conf. Multimedia, 2007, pp. 218-227.


[26] T. C. Hoad and J. Zobel, "Detection of video sequences using compact signatures," ACM Trans. Inform. Syst., vol. 24, no. 1, pp. 1-50, 2006.
[27] P.-H. Wu, T. Thaipanich, and C. J. Kuo, "Detecting duplicate video based on camera transitional behavior," in Proc. IEEE Int. Conf. Image Process., Nov. 2009, pp. 237-240.
[28] C. Cotsaces, N. Nikolaidis, and I. Pitas, "Semantic video fingerprinting and retrieval using face information," Signal Process.: Image Commun., vol. 24, no. 7, pp. 598-613, 2009.
[29] O. Küçüktunç, M. Baştan, U. Güdükbay, and Ö. Ulusoy, "Video copy detection using multiple visual cues and MPEG-7 descriptors," J. Vis. Commun. Image Representat., vol. 21, no. 8, pp. 838-849, 2010.
[30] H.-S. Min, J. Y. Choi, W. De Neve, and Y. M. Ro, "Bimodal fusion of low-level visual features and high-level semantic features for near-duplicate video clip detection," Signal Process.: Image Commun., vol. 26, no. 10, pp. 612-627, 2011.
[31] X. Li, C. G. M. Snoek, and M. Worring, "Learning social tag relevance by neighbor voting," IEEE Trans. Multimedia, vol. 11, no. 7, pp. 1310-1322, Nov. 2009.
[32] C. Petersohn, "Fraunhofer HHI at TRECVID 2004: Shot boundary detection system," in Proc. TRECVID Workshop, 2004, pp. 1-7.
[33] W. Kraaij, P. Over, J. Fiscus, and A. Joly. (2009, Jun.). Final CBCD Evaluation Plan TRECVID 2009 (v1.3) [Online]. Available: http://www-nlpir.nist.gov/projects/tv2009/Evaluation-cbcd-v1.3.htm
[34] Y. Jiang, J. Yang, C.-W. Ngo, and A. G. Hauptmann, "Representations of keypoint-based semantic concept detection: A comprehensive study," IEEE Trans. Multimedia, vol. 12, no. 1, pp. 42-53, Jan. 2010.
[35] T. Deselaers, D. Keysers, and H. Ney, "Features for image retrieval: An experimental comparison," Inform. Retrieval, vol. 11, no. 2, pp. 77-107, 2008.
[36] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. Patt. Anal. Mach. Intell., vol. 30, no. 11, pp. 985-1002, Nov. 2008.
[37] R. L. Cilibrasi and P. M. B. Vitanyi, "The Google similarity distance," IEEE Trans. Knowl. Data Eng., vol. 19, no. 3, pp. 370-383, Mar. 2007.
[38] P. D. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews," in Proc. Assoc. Computat. Linguistics, 2002, pp. 417-424.
[39] TRECVID. (2010). CBCD Evaluation Plan TRECVID [Online]. Available: http://www-nlpir.nist.gov/projects/tv2010/Evaluation-cbcd-v1.3.htm
[40] A. Natsev, M. L. Hill, and J. R. Smith, "Design and evaluation of an effective and efficient video copy detection system," in Proc. IEEE Int. Conf. Multimedia Expo., Jul. 2010, pp. 1353-1358.
[41] L. Chen and F. W. M. Stentiford, "Video sequence matching based on temporal ordinal measurement," Patt. Recog. Lett., vol. 29, no. 13, pp. 1824-1831, 2008.
[42] W. L. Zhao, C. W. Ngo, H. K. Tan, and X. Wu, "Near-duplicate keyframe identification with interest point matching and pattern learning," IEEE Trans. Multimedia, vol. 9, no. 5, pp. 1037-1048, Aug. 2007.
[43] W.-L. Zhao, X. Wu, and C.-W. Ngo, "On the annotation of web videos by efficient near-duplicate search," IEEE Trans. Multimedia, vol. 12, no. 5, pp. 448-461, Aug. 2010.
[44] R. Tavenard, L. Amsaleg, and G. Gravier, "Model-based similarity estimation of multidimensional temporal sequences," Ann. Telecommun., vol. 64, no. 5, pp. 381-390, Apr. 2009.
[45] J. Lokoč, M. L. Hetland, T. Skopal, and C. Beecks, "Ptolemaic indexing of the signature quadratic form distance," in Proc. Int. Conf. Similarity Search Applicat., 2011, pp. 9-16.
[46] C. Beecks, J. Lokoč, T. Seidl, and T. Skopal, "Indexing the signature quadratic form distance for efficient content-based multimedia retrieval," in Proc. ACM Int. Conf. Multimedia Retrieval, 2011, pp. 24:1-24:8.
[47] M. Kruliš, J. Lokoč, C. Beecks, T. Skopal, and T. Seidl, "Processing the signature quadratic form distance on many-core GPU architectures," in Proc. ACM Int. Conf. Inform. Knowl. Manag., 2011, pp. 1362-1365.

Hyun-seok Min received the B.S. degree from Ajou University, Suwon, Korea, in 2005, and the M.S. degree from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2008. He is currently pursuing the Ph.D. degree with KAIST. His current research interests include video content identification, video annotation, image and video content analysis, and the semantic and social web.

Jae Young Choi received the B.S. degree from Kwangwoon University, Seoul, Korea, in 2004, and the M.S. and Ph.D. degrees from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2008 and 2011, respectively. In 2008, he was a Visiting Researcher with the University of Toronto, Toronto, ON, Canada. He is currently a Post-Doctoral Researcher with KAIST and with the University of Toronto. His current research interests include face recognition, medical image processing, pattern recognition, machine learning, computer vision, and the social web. Dr. Choi was the recipient of the Samsung Human Tech Thesis Prize in 2010 for his research on collaborative face recognition using online social network context.

Wesley De Neve received the M.S. degree in computer science and the Ph.D. degree in computer science engineering from Ghent University, Ghent, Belgium, in 2002 and 2007, respectively. He is currently a Research Assistant Professor with the Image and Video Systems Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea. Prior to joining KAIST, he was a Post-Doctoral Researcher with Ghent University, and with Information and Communications University, Daejeon. His current research interests include image and video coding, near-duplicate video clip detection, face recognition, privacy-protected video surveillance, and leveraging collective knowledge for the semantic analysis of image and video content.

Yong Man Ro (M'92-SM'98) received the B.S. degree from Yonsei University, Seoul, Korea, and the M.S. and Ph.D. degrees from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea. In 1987, he was a Visiting Researcher with Columbia University, New York, NY, and from 1992 to 1995, he was a Visiting Researcher with the University of California, Irvine. He was a Research Fellow with the University of California, Berkeley, in 1996, and a Visiting Professor with the University of Toronto, Toronto, ON, Canada, in 2007. He is currently a Full Professor with KAIST, chairing the Image and Video Systems Laboratory. He participated in the MPEG-7 and MPEG-21 international standardization efforts, contributing to the definition of the MPEG-7 texture descriptor, the definition of the MPEG-21 DIA visual impairment descriptors, and modality conversion. His current research interests include image and video processing, multimedia adaptation, visual data mining, image and video indexing, and multimedia security. Dr. Ro received the Young Investigator Finalist Award of ISMRM in 1992 and the Scientist of the Year Award in 2003. He was a Technical Program Committee Member of international conferences such as IWDW, WIAMIS, AIRS, and CCNC, and he was the Program Co-Chair of IWDW in 2004.
