
ARTICLE IN PRESS

Signal Processing: Image Communication 23 (2008) 473–489

Contents lists available at ScienceDirect

Signal Processing: Image Communication


journal homepage: www.elsevier.com/locate/image

A compressed-domain approach for shot boundary detection on H.264/AVC bit streams
Sarah De Bruyne*, Davy Van Deursen, Jan De Cock, Wesley De Neve, Peter Lambert, Rik Van de Walle
Department of Electronics and Information Systems, Multimedia Lab, Ghent University, IBBT, Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium

Article info

Article history:
Received 21 April 2008
Accepted 29 April 2008

Keywords:
H.264/AVC
Shot boundary detection
Temporal segmentation
Video analysis

Abstract

The amount of digital video content has grown extensively during recent years, resulting in a rising need for the development of systems for automatic indexing, summarization, and semantic analysis. A prerequisite for video content analysis is the ability to discover the temporal structure of a video sequence. In this paper, a novel shot boundary detection technique is introduced that operates completely in the compressed domain using the H.264/AVC video standard. As this specification contains a number of new coding tools, the characteristics of a compressed bit stream are different from prior video specifications. Furthermore, the H.264/AVC specification introduces new coding structures such as hierarchical coding patterns, which can have a major influence on video analysis algorithms. First, a shot boundary detection algorithm is proposed which can be used to segment H.264/AVC bit streams based on temporal dependencies and spatial dissimilarities. This algorithm is further enhanced to exploit hierarchical coding patterns. As these sequences are characterized by a pyramidal structure, only a subset of frames needs to be considered during analysis, allowing the reduction of the computational complexity. Besides the increased efficiency, experimental results also show that the proposed shot boundary detection algorithm achieves a high accuracy.

© 2008 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +32 9 33 14957; fax: +32 9 33 14896.
E-mail address: Sarah.DeBruyne@UGent.be (S. De Bruyne).
URL: http://multimedialab.elis.ugent.be

0923-5965/$ - see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.image.2008.04.012

1. Introduction

Advances in multimedia coding technology, combined with the growth of the Internet and the advent of digital television, have resulted in the widespread use and availability of digital video. This rise will only be strengthened by the increasing popularity of user-generated video content. Unfortunately, these video collections are often not catalogued and only accessible by sequential scanning. As a consequence, technologies and tools for efficient browsing and retrieval of video content are gaining importance. Semantic key frame extraction and summarization are generally accepted as the key to quickly browse through a video sequence and to enable easy navigation between relevant segments [33]. To extract the semantics from the video content, a system or agent is needed to analyze the content. Manual annotation is time consuming, expensive, and infeasible for most applications. Therefore, automatic analysis of video content is of significant importance.

Since the identification of the temporal structure of video is an essential task for video indexing and retrieval, the first step commonly taken for video analysis is shot boundary detection. A shot is usually conceived in the literature as a series of interrelated consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space [2]. According to whether the transition between consecutive shots is

abrupt or not, boundaries are classified as cuts or gradual transitions.

After identifying the shot boundaries, most of the existing work focuses on the summarization of the content. One common approach for visualizing video content is a storyboard in which key frames are tiled together to obtain a good overview of the video. Each key frame generally represents the content of one shot and is selected according to an analysis method that optimizes the semantic coverage of the video content [4,13]. A second approach is summarizing the video content by selecting short, highlighted segments, resulting in a fast preview of the content. A number of approaches have been used to generate these so-called video skims, including rate-distortion and motion analysis [10,11,24]. These techniques often rely on MPEG standards such as MPEG-7 [28] and MPEG-21 [3] for the description and analysis of the multimedia content and the multimedia customization [5,10,11,27].

In order to preserve storage space and to reduce bandwidth constraints, most video data in multimedia databases are stored in compressed form. To avoid unnecessary and time-consuming decompression during video analysis, features available in the compressed domain can be used. In the past, most video content was stored using MPEG-1 Video, MPEG-2 Video, or MPEG-4 Visual, which resulted in the development of video analysis algorithms only working on these specific coding formats. As the H.264/AVC video coding standard [32] performs significantly better than any prior standard in terms of coding efficiency, it can be expected that a significant amount of future video content will be encoded in this format. The compression efficiency of H.264/AVC can be attributed to a number of new coding tools, which results in compressed video data having specific characteristics. As a consequence, these characteristics influence prior algorithms used in compressed video analysis to a great extent.

In this paper, we propose a novel shot boundary detection algorithm working directly on H.264/AVC bit streams by relying on macroblock and sub-macroblock types, and their related motion vectors and reference picture indices. By only using compressed-domain information, this approach is characterized by a low computational complexity. Furthermore, this algorithm is enhanced to exploit hierarchical coding patterns, a new coding pattern introduced in the H.264/AVC specification. By exploiting the pyramidal coding structure, only a subset of frames needs to be considered during analysis. Consequently, the shot boundary detection step becomes computationally less demanding, which is in line with the idea of executing shot boundary detection in the compressed domain.

The outline of the paper is as follows. Section 2 addresses related work in the area of shot boundary detection. A discussion of the influence of H.264/AVC on shot boundary detection algorithms for prior video coding standards is also provided. Section 3 introduces our newly developed shot boundary detection algorithm for H.264/AVC bit streams, whereas Section 4 presents an enhancement to this algorithm that takes hierarchical coding patterns into account during shot detection. Section 5 provides performance results regarding the accuracy of the proposed algorithm in terms of recall and precision. This section also analyzes the decrease in complexity originating from the enhanced approach. Finally, conclusions are drawn in Section 6.

2. Related work

2.1. Existing algorithms

Algorithms for shot boundary detection can broadly be classified into two major groups, depending on whether the operations are performed in the pixel domain, or whether they rely directly on compressed-domain features. The most common algorithms that work in the pixel domain are based on colour histogram differences, changes in edge characteristics, and pixel differences between successive frames [12,25,36]. Of these methods, colour histogram-based algorithms are considered to be the most reliable for the detection of abrupt transitions.

Full decompression of the video and the accompanying computational overhead can be avoided by only using compressed-domain features such as transform coefficients, macroblock types, or motion vectors. Hence, a significant number of shot boundary detection algorithms operate in the compressed domain. In the literature, most of these algorithms work on MPEG-1 Video and MPEG-2 Video.

The concept of DC images is defined by Yeo and Liu [34], where every 8×8 block is represented by the average intensity of this block in the original image. For intra-coded macroblocks, the DC coefficient of the block is extracted as this represents the average energy. For inter-coded macroblocks, these coefficients are predicted based on the motion vector and the average intensity of the referred blocks in the previous DC image. Based on these DC images, shot detection algorithms based on colour histogram differences can be modified to operate in the compressed domain. Furthermore, the distribution of the different macroblock types and motion information [29] can also be used to detect shot boundaries. For example, when an abrupt cut occurs at a P picture, it is expected that most or all macroblocks are intra-coded since they cannot be predicted well from prior reference frames. Similarly, for an abrupt cut located at a B frame, it is assumed that macroblocks are mainly intra-coded or backward-predicted from future reference pictures.

Compared to abrupt changes, gradual transitions are more difficult to detect; they take place over a variable number of frames and can be generated using a great variety of special effects. Consequently, the difference between successive frames in a transition is substantially reduced. Several techniques have been proposed in the literature for the detection of these gradual changes in the pixel domain. In [36], Zhang et al. presented a twin-comparison method. This approach takes into account the cumulative differences between frames and requires two thresholds: a higher threshold for detecting cuts and

a lower threshold for detecting gradual changes. As frames are compared on a frame-to-frame basis by most shot boundary detection algorithms, limited differences between some frames may result in undetected transitions. An alternative is to compare every frame with the k-th following frame. When the transition is spread over more than k frames, a plateau can be observed in the dissimilarity measures [14,34]. However, selecting an appropriate value for the parameter k is not trivial, as each gradual change can have a different duration.

Although most of the proposed techniques for the detection of gradual changes work on uncompressed video, compressed-domain algorithms are gaining importance. Zhang et al. presented a compressed-domain approach in [37] based on DC coefficients and motion vectors. These motion vectors are further examined to distinguish different types of camera motion and gradual changes. In [1], Bescos presented a comparative study of most of the metrics used for the detection of gradual changes, both in the compressed and uncompressed domain. He also proposed an algorithm for the detection of abrupt as well as all types of gradual changes based on DC images. A drawback of most approaches is that they just focus on the detection of fades and dissolves, whereas special editing effects such as wipes are hardly ever considered.

2.2. Shot boundary detection on H.264/AVC bit streams

As mentioned in the introduction, H.264/AVC contains a number of new coding tools compared to prior standards for digital video coding [32]. Three of these features have a major impact on existing shot boundary detection algorithms: intra-prediction in the spatial domain, multi-picture motion-compensated prediction, and decoupling of display and referencing order.

Intra-prediction in H.264/AVC is conducted in the spatial domain, by referring to neighbouring samples of previously decoded blocks. Two primary types of intra-coding are supported: Intra_4×4 and Intra_16×16. In the Fidelity Range Extensions of H.264/AVC, Intra_8×8 is also introduced. Due to intra-prediction in the spatial domain, DC coefficients in intra-coded pictures no longer represent average energy, but only represent an energy difference. Algorithms working on DC coefficients can therefore no longer be applied to H.264/AVC bit streams.

The second feature, multiple reference picture motion compensation, allows an encoder to use a larger number of pictures as reference compared to previous standards. Instead of only using the previous and following reference picture for prediction, the encoder can choose from multiple pictures that have been decoded and marked as reference. In particular, the encoder is allowed to refer further than the previous and following reference picture, resulting in vagueness about random access in the bit stream. Therefore, instantaneous decoding refresh (IDR) pictures were introduced which indicate that no subsequent pictures in the bit stream will require references to pictures prior to the IDR picture in decoding order. As the prediction chain is broken, it is insufficient to only rely on reference directions to detect shot boundaries. As a result, algorithms developed for previous coding standards cannot cope with the created gaps in the prediction chain of H.264/AVC. Furthermore, to extract reference directions, a more complex approach is required as the display numbers of the reference pictures first need to be calculated in order to determine the reference direction.

Referencing order and display order are decoupled as well, allowing the encoder to choose the ordering of pictures with a higher degree of flexibility [32]. As a result, new coding structures such as hierarchical coding patterns are introduced [30]. In addition, there are no restrictions on the coding type of reference frames and a single coded picture may consist of a mixture of different slice types. As such, the traditional concept of I, P, and B frames is replaced by a highly flexible and general concept that can be exploited by an encoder for different purposes.¹

¹ In the context of H.264/AVC, we define an I frame as a frame that consists entirely of I slices. P and B frames are defined in a similar way.

Up to now, few publications report on shot boundary detection algorithms working on H.264/AVC compressed bit streams. In [35] and [26], an intra-mode histogram distance based on the different intra-prediction mode directions is presented to describe content changes in I frames whereas reference directions are used for P and B frames. In [22], the dissimilarity between I frames is measured by relying on macroblock subdivisions (i.e., Intra_4×4 and Intra_16×16 prediction); P and B frames are not taken into consideration. A combination of both approaches is proposed by the authors in [7], where subdivisions of intra-coded macroblocks and reference directions of inter-coded macroblocks are used during analysis.

Whereas previously described algorithms for H.264/AVC principally focus on the detection of abrupt changes, [6] presents an algorithm for fade detection based on the use of explicit weighting factors. This new coding tool in H.264/AVC improves the coding efficiency of gradual changes such as fades and dissolves [32]. However, a drawback of algorithms relying on the use of luminance weighted prediction factors is that not all encoders make use of this feature; it is only advantageous for a limited number of frames, while it increases coding complexity. Furthermore, special effects during gradual changes, such as wipes, do not benefit from this coding technique.

In theory, the set of motion vectors should also show very typical behaviour during a gradual transition [37]. However, an increase in the amount of intra-coded macroblocks as a result of content changes makes the characteristics of the remaining motion vectors less reliable. This rise is even strengthened in H.264/AVC by the introduction of spatial prediction of intra-coded macroblocks.

The approach presented in this paper is partially based on previous work by the authors [7], where the analysis relies on macroblock and sub-macroblock types, and their related motion vectors and reference indices. In this paper,

particular attention is paid to the analysis of I and IDR frames, as these frames break the prediction chain.

Next, this algorithm is enhanced in order to exploit the characteristics of hierarchical coding patterns. As elaborated upon in Section 4, the coarse-to-fine structure leads to the fact that a subset of lower layers can be seen as a reduced version of the original video. Therefore, in the enhanced algorithm, temporal dependencies between pictures belonging to the base layer are first investigated to decide whether they belong to the same shot or not. Only in case this outcome is insufficient to accurately detect shot boundaries, pictures belonging to higher levels are further processed. As a consequence, only a subset of pictures is taken into consideration during analysis, resulting in a lower computational cost. Note that the proposed algorithm can also be applied to conventional coding structures such as IBP and IPP.

3. Shot boundary detection for traditional coding patterns

In this section, algorithms for the detection of abrupt and gradual transitions in traditional coding patterns will be examined together with the accompanying selection of thresholds. A flow chart of the algorithm is given at the end in order to provide a complete overview.

3.1. Detection of abrupt transitions

Within a video sequence, a strong correlation exists between successive frames. As an encoder typically tries to exploit temporal and spatial correlation to a maximum extent, this results in specific characteristics of motion prediction information and macroblock subdivisions. By analyzing these patterns, shot boundary detection can be performed in the compressed domain.

For the detection of abrupt transitions, we propose to examine the temporal dependencies between successive frames. Thereafter, when the prediction chain is broken as a result of certain I frames or IDR frames, spatial dissimilarities are calculated as well.

3.1.1. Relying on temporal dependencies

P and B frames use temporal prediction to exploit similarities with previously coded frames. However, when the current frame is the starting frame of a new shot, this frame will have hardly any resemblance with previously displayed frames. As a consequence, the encoder will prefer to make use of intra-coded macroblocks or macroblocks referring to following frames in display order which are already decoded (called backward prediction). An example is given in Fig. 1, where frame B2 is the first frame of a shot in display order but not in decoding order. The macroblocks in this frame will therefore mainly refer backward to frame P3. Frame P7, which is the first encoded and displayed frame of a new shot, will be mainly intra-coded. On the other hand, the last frame of a shot in display order will have hardly any correlation with following frames. Therefore, this frame will mainly contain macroblocks referring to previously displayed frames (called forward prediction) or intra-coded macroblocks. This is the case for frames B1 and B6.

[Fig. 1 depicts frames 0 to 8 in display order, grouped into Shot 1, Shot 2, and Shot 3; the decoding order is 0 3 1 2 5 4 7 6 8.]

Fig. 1. Example of a video sequence consisting of three shots. The arrows represent the available reference frames; full arrows indicate frequently used reference frames while the dashed arrows indicate the opposite.

In prior standards such as MPEG-1 Video and MPEG-2 Video, macroblocks could only be predicted from a single previous (and a single following) reference frame. As a result, the reference direction could directly be derived from the macroblock types. As H.264/AVC allows multiple reference picture motion compensation and decoupling of referencing order and display order, macroblock types only indicate which partitioning is applied and which reference lists are utilized for each macroblock partition. Based on the index in these reference lists, information about the reference frame can be extracted. In particular, information about the picture order count (POC) can be utilized to determine the display number of a reference frame. The reference direction of the macroblock partition can then be determined by comparing the display numbers of the current frame and the reference frame. When the display numbers of all reference frames of a partition are prior (resp. subsequent) to the current frame, forward (resp. backward) prediction is used. When reference frames are located before as well as after the current frame, a partition is coded using bi-directional prediction. As the smallest partition for which a reference index can change is 8×8 pixels in size, this block size will be used as the basic unit to calculate the reference directions present in a frame.

These observations can be used to formulate two conditions for the detection of shot boundaries. Let $\iota(f_i)$, $\varphi(f_i)$, $\beta(f_i)$, and $\delta(f_i)$, respectively, be the number of blocks coded using intra-coding, forward prediction, backward prediction, and bi-directional prediction of the current frame $f_i$; let $f_{i-1}$ be the previous frame in display order and let $B$ denote the number of 8×8 blocks in a frame. By relying on temporal dependencies, a shot boundary can be declared at frame $f_i$ if the following condition is met:

$$\frac{1}{B}\bigl(\iota(f_{i-1}) + \varphi(f_{i-1})\bigr) > T_{inter} \;\wedge\; \frac{1}{B}\bigl(\iota(f_i) + \beta(f_i)\bigr) > T_{inter}. \qquad (1)$$

Note that the percentage of bi-directional-predicted blocks $\delta(f_i)$ is not used in the inequalities as a high value for $\delta(f_i)$ typically corresponds to a low probability of a shot boundary. When the percentages calculated in the inequalities exceed a predefined threshold $T_{inter}$, one can conclude that an abrupt shot boundary is detected.
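The reference-direction rule and Condition 1 can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: each 8×8 block is assumed to be available as a plain list of the POC values of the frames it references (an empty list meaning intra-coded), and the names `Direction`, `block_direction`, `direction_counts`, and `is_abrupt_boundary` are hypothetical. The default threshold of 0.80 is the value the paper selects for the inter threshold in Section 3.1.3.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Sequence


class Direction(Enum):
    INTRA = "intra"
    FORWARD = "forward"
    BACKWARD = "backward"
    BIDIRECTIONAL = "bidirectional"


def block_direction(current_poc: int, reference_pocs: Sequence[int]) -> Direction:
    """Classify one 8x8 block from the POCs of the frames it references."""
    if not reference_pocs:
        # Assumption of this sketch: no temporal reference means intra-coded.
        return Direction.INTRA
    if all(poc < current_poc for poc in reference_pocs):
        return Direction.FORWARD        # only earlier frames in display order
    if all(poc > current_poc for poc in reference_pocs):
        return Direction.BACKWARD       # only later frames in display order
    return Direction.BIDIRECTIONAL      # references on both sides


def direction_counts(current_poc: int, blocks: Iterable[Sequence[int]]) -> Counter:
    """Count the reference directions of all 8x8 blocks of one frame."""
    return Counter(block_direction(current_poc, refs) for refs in blocks)


def is_abrupt_boundary(prev_counts: Counter, cur_counts: Counter,
                       num_blocks: int, t_inter: float = 0.80) -> bool:
    """Condition 1: the previous frame looks back, the current frame looks ahead."""
    prev_ratio = (prev_counts[Direction.INTRA]
                  + prev_counts[Direction.FORWARD]) / num_blocks
    cur_ratio = (cur_counts[Direction.INTRA]
                 + cur_counts[Direction.BACKWARD]) / num_blocks
    return prev_ratio > t_inter and cur_ratio > t_inter
```

For the situation of Fig. 1, a frame at display position 1 whose blocks mostly reference position 0, followed by a frame at position 2 whose blocks mostly reference position 3, would satisfy both inequalities, so a boundary would be declared at position 2.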

In Section 3.1.3, we will further elaborate on the determination of optimal thresholds.

3.1.2. Relying on spatial similarities in case of I frames

As I frames are coded independently of other frames, the second inequality in Condition 1 is always fulfilled. When the previous frame in display order is only allowed to use forward references, gaps in the temporal prediction chain are introduced. In this case, both inequalities in Condition 1 are met, resulting in a majority of falsely detected shot boundaries. Indeed, every I frame that is not used as a reference by previously displayed frames is then seen as the start of a new shot. For classic I frames, this phenomenon will typically occur when using IPP patterns, or when variable GOP sizes are used. Furthermore, as a result of the properties of an IDR frame, gaps in the prediction chain will always occur for this particular type of I frames. To overcome this problem, additional conditions are required in case the prediction chain is broken as a result of IDR frames and certain I frames.

As stated in Section 2.2, H.264/AVC supports several intra-macroblock partitions, of which Intra_4×4 and Intra_16×16 are the most commonly used. The first mode is generally preferred by an encoder in case of significant detail, whereas the latter mode is more suitable for coding smooth areas. As a result, the subdivision of an I frame in different macroblock partitions leads to a good representation of the detail of the content, as can be seen in Fig. 2. By comparing the distribution of the intra-macroblock types in the current frame with previously displayed frames, dissimilarities between frames can be calculated and therefore, changes in content can be detected.

Comparing the current I frame with the previous I frame is not recommended as the content can change significantly between these two frames. For example, a shot boundary can be located in between at a P or B frame, or new objects can appear (Fig. 2). Therefore, an intra-prediction map $M_{i-1}$ is constructed containing the last intra-coded macroblock partitioning of each macroblock. Every time an intra-coded macroblock is encountered, $M_{i-1}$ is updated with the new spatial information. This map can then be used to represent the spatial distribution of the content of the previous frame, in spite of the fact that this frame contains inter-coded macroblocks.

By comparing the current I frame with the map $M_{i-1}$, the dissimilarity between the current and previous frame can be calculated. Instead of comparing macroblock types at corresponding positions, a window of 3×3 macroblocks is selected for each macroblock. This window is located centrally around the current macroblock, as can be seen in Fig. 2. This way, movement of objects or the camera will lead to fewer false alarms.

Let $f_i$ denote the current I frame, $M_{i-1}$ the corresponding intra-prediction map of the previous frame, and $MB$ the number of macroblocks $m$ in a frame. Define $T$ as the set of possible macroblock partitions ($T = \{\text{Intra\_4×4}, \text{Intra\_8×8}, \text{Intra\_16×16}\}$). Furthermore, let $F$ be a placeholder for $f_i$ and $M_{i-1}$, and $n$ a macroblock in the associated frame or intra-prediction map. The dissimilarity metric $\Omega$ is defined as follows:

$$w^{F}_{m,t} = \{\, n \mid n \in F \,\wedge\, n \in \text{window associated with macroblock } m \,\wedge\, n \text{ is coded using partitioning mode } t \,\}, \qquad (2a)$$

$$W(f_i, m) = \frac{\sum_{t \in T} \bigl|\, |w^{f_i}_{m,t}| - |w^{M_{i-1}}_{m,t}| \,\bigr|}{2 \cdot |w^{f_i}_{m}|}, \qquad (2b)$$

$$\Omega(f_i) = \frac{1}{MB} \sum_{m \in f_i} W(f_i, m). \qquad (2c)$$

For I frames, when both inequalities in Condition 1 are met and the dissimilarity $\Omega(f_i)$ also exceeds a predefined threshold $T_{intra}$, one can conclude that the current I frame is the starting frame of a new shot. The next section further elaborates on the determination of the threshold $T_{intra}$.

Note that for IDR frames, the gap in the temporal prediction chain is not necessarily located at the IDR frame, but can occur prior in display order. When the decoded picture buffer is cleared as a result of the IDR frame, frames located further in the bit stream but prior in display order can still refer to the IDR frame, but not to preceding frames in decoding order. As the number of frames influenced by an IDR frame is typically much larger for hierarchical coding patterns than for traditional structures, this will further be discussed in Section 4.2.

3.1.3. Threshold selection for abrupt transitions

A less extensively studied problem is the selection of optimal thresholds for evaluating the computed frame-to-frame differences. Most authors work with global thresholds, which remain the same over the entire sequence.

[Fig. 2 shows three frames, f10 (I frame), f27 (P frame), and f51 (I frame), with a window of 3×3 macroblocks marked.]

Fig. 2. Distribution of Intra_4×4 and Intra_16×16 macroblocks. Although the second frame is a P frame, it is mainly intra-coded as it is the first frame of a new shot.
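The windowed dissimilarity metric of Section 3.1.2 can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: a frame and the intra-prediction map are modelled as 2-D lists of mode labels, the 3×3 window is simply clipped at the frame borders, the map is assumed to be fully populated, and inter-coded macroblocks are represented by `None`. The labels in `MODES` and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical labels for the three intra partitioning modes.
MODES = ("Intra_4x4", "Intra_8x8", "Intra_16x16")


def window_histogram(grid, row, col):
    """Count the modes in the 3x3 window centred on (row, col), clipped at the borders."""
    counts = Counter()
    for r in range(max(0, row - 1), min(len(grid), row + 2)):
        for c in range(max(0, col - 1), min(len(grid[0]), col + 2)):
            counts[grid[r][c]] += 1
    return counts


def dissimilarity(frame, intra_map):
    """Average, over all macroblocks, of the normalised window histogram difference."""
    rows, cols = len(frame), len(frame[0])
    total = 0.0
    for row in range(rows):
        for col in range(cols):
            h_frame = window_histogram(frame, row, col)
            h_map = window_histogram(intra_map, row, col)
            window_size = sum(h_frame.values())
            diff = sum(abs(h_frame[t] - h_map[t]) for t in MODES)
            total += diff / (2 * window_size)
    return total / (rows * cols)


def update_intra_map(intra_map, frame):
    """Overwrite map entries with the newest intra mode; inter-coded entries keep the old value."""
    for r, row in enumerate(frame):
        for c, mode in enumerate(row):
            if mode in MODES:
                intra_map[r][c] = mode
```

Dividing by twice the window size keeps each per-macroblock term between 0 and 1, so identical mode distributions give 0 and completely different ones give 1.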

Optimal values for global thresholds are often determined by compromising between recall and precision ratios [14,36]. An alternative is to work with adaptive thresholds, which vary during the analysis and match the local activity. The threshold for each frame is then based on the frame-to-frame differences of surrounding frames located in a corresponding search window. In general, maximum values or mean and standard deviation of frame-to-frame differences in a search window are used to determine the local thresholds [15,34,36].

Selecting appropriate threshold values is a key issue in successfully applying our proposed shot boundary detection algorithm. The detection of abrupt changes relies on two thresholds $T_{inter}$ and $T_{intra}$. The first threshold, which is applied to both inequalities in Condition 1, is responsible for detecting gaps in the temporal prediction chain. For this purpose, the percentage of intra-coded, and forward and backward referring blocks of the previous and current frame are compared to $T_{inter}$. As these percentages represent the probability of a shot boundary (from 0% to 100%), global thresholds can be applied [29]. The optimal value for $T_{inter}$ is heuristically chosen by compromising between recall and precision ratios for different threshold values. Experimental results shown in Fig. 3 indicate that a wide range of threshold values (in particular, the interval $[0.68, 0.88]$) is suitable for minimizing the amount of missed detections and false alarms. In order to select the optimal value from this interval, the sum of the recall and precision curve is calculated, after which the maximum of this curve is located. As this maximum corresponds to a threshold of 80%, this value is used for $T_{inter}$ in the remainder of this paper. Note that for determining the optimal values for other variables and thresholds in the paper, the same methodology based on recall and precision curves is applied.

Fig. 3. Influence of $T_{inter}$ on recall and precision ratios.

To evaluate the spatial dissimilarity, $T_{intra}$ is used to compare the distribution of the different macroblock partitioning types in I frames. As the obtained results do not represent probabilities, but rather indicate the difference with previous frames, a more sophisticated technique is required.

A technique often used in the literature is based on the statistical distribution of the frame-to-frame differences [15,36]. The obtained distribution is modelled by a Gaussian function with parameters $\mu$ and $\sigma$ which represent the mean and standard deviation of the frame-to-frame differences. The corresponding threshold value can be computed as

$$T = \mu + \alpha\sigma, \qquad (3)$$

where $\alpha$ is the parameter related to the tolerated probability for missed detections and false alarms. When difference values fall out of the range 0 to $\mu + \alpha\sigma$, they can be considered indicators of shot boundaries. Although this threshold is based on the properties of the sequence, this value remains the same for the entire sequence. Therefore, it is not capable of adapting to the properties of the local content.

A second method using adaptive thresholds is based on sliding windows [15,34]. For each frame $k$, a window containing $N$ elements is created where $k$ is located in the middle of the window. When the frame-to-frame difference between frame $k-1$ and $k$ is the window maximum and $\beta$ times larger than the second largest difference value, a shot boundary is detected. A major weakness of this approach is the high sensitivity of $\beta$, originating from motion.

In this paper, we have utilized a combination of both approaches discussed above and applied several changes to determine $T_{intra}$. In particular, these techniques need to be adjusted as they are mainly used in uncompressed-domain algorithms, whereas we want to apply them on compressed-domain information. First, as the spatial dissimilarity can only be calculated for I frames, $T_{intra}$ is not based on frame-to-frame differences; instead, only difference values for I frames are considered. To adapt $T_{intra}$ to the local properties of the content, a window consisting of $M$ elements is constructed. In contrast to the above technique, all $M$ elements are located before the examined frame. By making this adjustment, I frames located in the future and which are part of a high-motion shot cannot cause missed detections. To control the resulting false alarms, mean and standard deviation are used. Let $\mu_\Omega$ denote the mean of the dissimilarity values of the $M$ previous I frames and $\sigma_\Omega$ the corresponding standard deviation. $T_{intra}$ can then be defined as

$$T_{intra} = \mu_\Omega + \alpha\sigma_\Omega. \qquad (4)$$

The values for $M$ and $\alpha$ are computed heuristically in the same way as for $T_{inter}$, resulting in typical values lying



[Figure panels: three objects; intra-coded macroblocks; opening of intra-coded macroblocks; opening of inter-coded macroblocks; structure element]

Fig. 4. Extraction of foreground and background using the mathematical morphology operation opening.

around 5 and 3, respectively. Furthermore, an upper and lower bound are imposed on this threshold to avoid extreme values. For example, when the start of a shot is static, the corresponding dissimilarity values are close to zero. A small amount of motion later in the shot would therefore result in a FA. For high-motion scenes, the standard deviation could become too large, making $T_{intra}$ larger than its maximum. When observing typical values for $\Omega(i)$ corresponding to abrupt transitions, appropriate boundaries are $[0.2, 0.6]$.

3.2. Detection of gradual transitions

In this paper, the percentage of intra-coded macroblocks combined with motion vector information is used as the criterion for identifying gradual changes. First, a change in the percentage of intra-coded macroblocks is used as an indication of changes in content. When an increase in this percentage is noticed, it can typically be attributed to different events: abrupt changes, gradual changes, fast local motion resulting from objects, or fast global motion resulting from camera movement. To separate the gradual changes from the abrupt changes and the motion, a distinction between these different types of events needs to be made. Abrupt changes can be identified by locating gaps in the temporal prediction chain. Otherwise, motion analysis is performed to distinguish gradual changes from motion. The different steps of the algorithm are explained below. A flow chart of this algorithm is given in Fig. 6 in Section 3.3, which contains a summary of the proposed algorithm.

For content containing fast local motion, the moving objects will typically be coded using a large number of intra-coded macroblocks, whereas the background regions will mainly be coded using temporal prediction. For gradual changes and fast global motion, the intra-coded macroblocks are typically scattered over the entire frame. By examining the motion intensity corresponding to the intra-coded regions and inter-coded regions, a distinction between local motion, global motion, and gradual changes can be made.

• Estimation of foreground and background: First, the regions corresponding to the foreground and background are estimated. By calculating the mathematical morphology operation opening [16] of the intra-coded (resp. inter-coded) macroblocks, a rough estimation of the foreground (resp. background) can be made, as depicted in Figs. 4 and 5. By applying this operator, the influence of isolated blocks can be ignored.

• Determination of motion intensity: As the estimated foreground is intra-coded, the corresponding motion of this region cannot be calculated from the same frame. Therefore, all frames are divided into two groups: one group which is used for the estimation of both regions, and one to calculate the corresponding motion. As frames marked as reference typically contain more intra-coded macroblocks than non-reference frames, the reference frames can be used for the estimation of the foreground and background, whereas the non-reference frames can then be used to measure the motion. This is for example the case in Fig. 5, where the foreground and background are determined from $P_7$ and where the motion intensity is computed using $B_5$ and $B_6$. The corresponding motion intensity, a concept originating from the MPEG-7 specification [21], is then calculated for both the foreground ($MI_F(f_i)$) and background ($MI_B(f_i)$). This is done by relying on the standard deviation of the magnitude of the motion vectors (see footnote 2). The two obtained values for the foreground and the background will therefore represent the amount of motion present in both regions. Note that as I and IDR frames cannot be used to make an estimation of the foreground and background, the previous reference frame is used as estimation instead.

  Footnote 2: Note that each motion vector is first normalized in relation to the difference between the display numbers since the distance to the different reference frames can vary.

• Distinction between motion and gradual changes: When the calculated motion intensity of both foreground and background is high, the change in content can most likely be attributed to fast, global movement. High motion intensity in only the foreground is typically connected to local motion. Otherwise, when the motion intensity is low although the amount of intra-coded macroblocks indicates a change in content, the origin is typically a gradual change. Note that relying only on the motion intensity of an entire picture is insufficient as the origin of intra-coded macroblocks is unclear. More precisely, small moving objects can result in an increase in intra-coded macroblocks, while the percentage of corresponding motion vectors is too low to influence the motion intensity of the complete picture. Therefore, without separating foreground and

[Figure: frames P4, B5, B6, P7. Panels: intra-coded macroblocks in P7; estimated foreground of P7; estimated background of P7; motion intensity of foreground in B6; motion intensity of background in B6]

Fig. 5. Example of a sequence with local motion. The estimated foreground and background of $P_7$ are computed based on the opening of the intra-coded and inter-coded macroblocks in $P_7$. Using these estimations, the corresponding motion intensity of both regions in $B_6$ is calculated.

background, it is difficult to distinguish gradual transitions from local motion.

For the detection of gradual changes, two thresholds are used. First, the amount of intra-coded macroblocks is examined, as an increase in the number of intra-coded macroblocks can typically be attributed to a content change. In order to account for the local properties of the content, using an adaptive threshold is appropriate, as explained in Section 3.1.3. Similar to the selection of $T_{intra}$, only a certain number of frames located before the current frame is considered. When a content change starts (as a result of a gradual transition or motion), the amount of intra-coded macroblocks first increases, after which it remains high for a certain time. When using future frames as well, in combination with the original sliding window technique, no real peak could be detected. However, by comparing the percentage of intra-coded macroblocks in the current frame with previous frames, the sudden increase can be identified. As a result, the same technique as for $T_{intra}$ is applied to calculate $T_{grad}$. In particular, the percentage of intra-coded macroblocks in the previous $M$ frames belonging to the base layer is used to compute the mean $\mu_{grad}$ and standard deviation $\sigma_{grad}$. $T_{grad}$ can then be defined as

$$T_{grad} = \mu_{grad} + \alpha \sigma_{grad}. \qquad (5)$$

Optimal values for $M$ and $\alpha$ are computed heuristically in the same way as for $T_{inter}$, by compromising between recall and precision curves obtained for different values of $M$ and $\alpha$, resulting in typical values lying around 5 and 3, respectively. Again, in order to avoid extreme values, this threshold is bounded to $[0.15, 0.60]$ for the same reason as explained in Section 3.1.3.

Next, a conclusion must be drawn whether the change in content is caused by motion or gradual transitions. Therefore, the motion intensity of foreground and background is compared to the following threshold:

$$T_{motion} = \frac{\xi l}{F}. \qquad (6)$$

The diagonal length $l$ of a frame in pixels and the frame rate $F$ are applied to normalize the threshold. This equation is defined in [18] to distinguish different degrees of motion intensity. The value of $\xi$ is selected equal to the threshold dividing medium and high activity (i.e., 0.4267) in [18]. Lower degrees of motion can typically be compensated more easily by the encoder using temporal prediction, so these GOPs are not considered during the detection of gradual changes.

3.3. Summary of the proposed shot boundary detection algorithm

To provide the reader with a complete overview of our proposed shot boundary detection algorithm, a flow chart is depicted in Fig. 6. First, temporal correlation between successive frames is measured to detect gaps in the temporal prediction chain (see Section 3.1.1). When this gap is a result of certain I or IDR frames, the spatial dissimilarity is calculated as well (as previously explained in Section 3.1.2). This way, abrupt transitions can be identified. Next, to locate the gradual changes, the percentage of intra-coded macroblocks is compared to prior frames. An increase can typically be attributed to gradual changes as well as to local or global changes. By analyzing the motion intensity of the estimated foreground and background, these different types of content changes can be distinguished.

4. Enhanced shot detection for hierarchical coding patterns

In H.264/AVC, features like multiple reference picture motion compensation and decoupling of display and referencing order, among others, allow the creation of arbitrary coding structures, which are not supported by prior video coding standards. This flexibility in terms of possible coding structures makes it possible to organize the pictures in a bit stream in multiple ways. Often, this flexibility is used for the creation of hierarchical coding patterns. Although a hierarchical coding structure usually

[Flow chart, summarized as text]
For each new frame f_i:
1. If (1/B)(ι(f_{i−1}) + ϕ(f_{i−1})) > T_inter and (1/B)(ι(f_i) + β(f_i)) > T_inter: gap in temporal prediction chain.
   - If the gap is not the result of an I or IDR frame: detect abrupt transition.
   - If the gap is the result of an I or IDR frame: detect abrupt transition if Ω(f_i) > T_intra; otherwise no transition.
2. Otherwise, if f_i ∈ base layer and ι(f_i) > T_grad: analysis of motion activity in foreground and background of intermediate non-reference frames f_j.
   - If MI_B(f_j) > T_motion: fast global motion, no transition.
   - If MI_F(f_j) > T_motion and MI_B(f_j) < T_motion: fast local motion, no transition.
   - If MI_F(f_j) < T_motion and MI_B(f_j) < T_motion: detect gradual transition.
3. Otherwise: no transition.

Fig. 6. Flow chart for the detection of the different types of content changes.

Layer 3 B1 B3 B5 B7

Layer 2 B2 B6

Layer 1 B4

Layer 0 I0 P8

Display order

I0 P8 B4 B2 B6 B1 B3 B5 B7

Decoding order

Fig. 7. Hierarchical coding pattern with four temporal layers.

introduces a higher end-to-end delay, it can improve coding efficiency and offers multi-layered temporal scalability in a straightforward way [30,23]. Hierarchical coding patterns consist of multiple layers, which result in a coarse-to-fine structure, as shown in Fig. 7. The base layer typically consists of I and P frames at a very low frame rate, whereas the higher layers typically contain B frames that are inserted between the frames of lower layers in display order. Frames belonging to higher layers can thus use frames of lower layers as references for the decoding process (and possibly also some preceding frames of the current layer).

To identify the hierarchical coding structure of a bit stream, it is possible to rely on three supplemental enhancement information (SEI) messages [8,19], which are defined within the scope of sub-sequences and sub-sequence layers [31]. When applied to hierarchical coding patterns, a sub-sequence layer corresponds to a temporal layer whereas a sub-sequence is a set of coded pictures within a sub-sequence layer. The sub-sequence information SEI message maps a picture to a sub-sequence and sub-sequence layer. The sub-sequence layer characteristics SEI message and the sub-sequence characteristics SEI message provide statistical information (e.g., average bit rate) on the indicated sub-sequence layer and sub-sequence. When these messages are not inserted into the bit stream, the hierarchical coding structure can also be detected from the decoding order and display order of the pictures. However, this solution is more complex.

4.1. Exploitation of the pyramidal structure

Conventional algorithms for shot boundary detection can be optimized by taking into account layered coding

structures. As a subset of lower layers can be seen as a reduced version of the original video, pictures belonging to higher layers only need to be considered when the amount of information from the lower layers is insufficient to draw conclusions. First, temporal dependencies between pictures belonging to the first layer are investigated to decide whether they belong to the same shot or not. Based on the outcome, pictures belonging to higher layers are further processed in a recursive way or are immediately discarded.

4.1.1. Abrupt transitions

Our algorithm for the detection of abrupt transitions is described in more detail using the pseudo-code provided below; it is executed for every sub-GOP in a bit stream. This sub-GOP corresponds to two successive reference frames in the base layer and the intermediate frames located in higher levels.

Algorithm 1 RecursiveShotBoundaryDetection (startFrame, endFrame)
if ((1/B)(ι(startFrame) + ϕ(startFrame)) > T_inter ∧ (1/B)(ι(endFrame) + β(endFrame)) > T_inter)
    // Temporal prediction chain is broken
    if (startFrame + 1 = endFrame)
        // Highest level reached
        // Shot boundary detected
        NewShotBoundary(endFrame)
    else
        // Go one level higher in the pyramidal structure
        // to find the shot boundary
        middleFrame = intermediate frame in next level
        RecursiveShotBoundaryDetection (startFrame, middleFrame)
        RecursiveShotBoundaryDetection (middleFrame, endFrame)
    end if
else
    // No shot boundary detected
end if

The use of our recursive technique is shown in more detail in Fig. 8, where a hierarchically coded sub-GOP is displayed containing frames from two different shots. As explained below, only a subset of frames needs to be examined in order to detect the abrupt transition.

As $I_0$ and $P_8$ belong to different shots, $P_8$ will typically not use $I_0$ as a reference. Furthermore, the other frames located after the shot boundary (i.e., $B_3$ to $B_7$) cannot be used as a reference either since $P_8$ is coded before the other frames, as illustrated in Fig. 7. As a result, the percentage of intra-coded macroblocks in $P_8$ is high, fulfilling the first if-statement in Algorithm 1. As a consequence, the reference directions of the intermediate frame $B_4$ in the next layer need to be considered. For this purpose, the same algorithm is executed for the intervals $I_0$–$B_4$ and $B_4$–$P_8$. As $B_4$ and $P_8$ belong to the same shot and $P_8$ is present in the reference picture buffer of $B_4$, $B_4$ is mainly backward-predicted using $P_8$. Therefore, the first part of the if-statement is not fulfilled, making further investigation of frames located in between $B_4$ and $P_8$ superfluous. The same technique is applied to $I_0$–$B_4$. As illustrated in Fig. 8, the shot boundary is located in between these two frames, making further examination necessary. It is clear that we need to ascend to the highest layer of the pyramidal structure using the proposed technique in order to detect the exact location of the shot boundary. Even then, a significant number of frames can be discarded, resulting in increased computational efficiency. This is for example the case for $B_1$, $B_5$, $B_6$, and $B_7$.

When the content of a sub-GOP does not change drastically, only the first level needs to be examined. On the other hand, significant motion or gradual changes will typically lead to the investigation of pictures residing at a higher level. When the difference between pictures residing at the lowest layer is relatively high, the prediction can fail. However, when rising in the pyramidal structure, the correlation between the frames becomes clearer. Therefore, the recursive algorithm for the detection of abrupt transitions is stopped once the similarity is noticed.

[Figure: pyramid of frames in display order. Base layer: I0, P8; next layers: B4; B2, B6; B1, B3, B5, B7. A shot boundary is marked. Legend: low temporal correlation; high temporal correlation; not considered]

Fig. 8. Recursive algorithm for detecting shot boundaries in a hierarchically coded sub-GOP.
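
The recursion of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: frames are identified by their display numbers, the middle frame is assumed to be the midpoint of a dyadic pyramid, and the $T_{inter}$ test is replaced by a caller-supplied predicate `chain_broken`, a hypothetical stand-in for the check on ι, ϕ, and β. The toy predicate simply reports a broken chain when the shot boundary lies between the two frames.

```python
from typing import Callable, List


def recursive_shot_boundary_detection(
    start: int,
    end: int,
    chain_broken: Callable[[int, int], bool],
    boundaries: List[int],
    visited: List[int],
) -> None:
    """Recursive search over one sub-GOP, following the shape of Algorithm 1.

    start, end   -- display numbers of the two frames being compared
    chain_broken -- hypothetical predicate standing in for the T_inter test
    boundaries   -- output: frames at which a new shot starts
    visited      -- output: intermediate frames that had to be examined
    """
    if not chain_broken(start, end):
        return  # no shot boundary between start and end
    if start + 1 == end:
        boundaries.append(end)  # highest level reached: boundary located
        return
    middle = (start + end) // 2  # assumption: dyadic pyramid, midpoint frame
    visited.append(middle)
    recursive_shot_boundary_detection(start, middle, chain_broken, boundaries, visited)
    recursive_shot_boundary_detection(middle, end, chain_broken, boundaries, visited)


def make_toy_predicate(first_frame_of_new_shot: int) -> Callable[[int, int], bool]:
    """Toy stand-in: the chain is broken iff the boundary lies in (start, end]."""
    def chain_broken(start: int, end: int) -> bool:
        return start < first_frame_of_new_shot <= end
    return chain_broken


# Sub-GOP I0 ... P8 with a new shot starting at B3, as in Fig. 8.
boundaries: List[int] = []
visited: List[int] = []
recursive_shot_boundary_detection(0, 8, make_toy_predicate(3), boundaries, visited)
print(boundaries)  # [3]
print(visited)     # [4, 2, 3]
```

In this toy run only B4, B2, and B3 are examined, so B1, B5, B6, and B7 are never visited, which matches the frames listed as discarded in the text.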

4.1.2. Gradual transitions

For the detection of gradual transitions in H.264/AVC bit streams with hierarchical coding patterns, this bottom-up approach can also be applied to enhance the algorithm proposed in Section 3.2. First, the percentage of intra-coded macroblocks in the base layer is compared to the threshold $T_{grad}$, which is calculated from previous frames residing in the base layer. Only when the threshold is exceeded is the next layer examined, as a result of the content change. Based on the estimation of the foreground and background derived from the base layer (i.e., $f_{24}$ in Fig. 9), the motion intensity is calculated for the intermediate frame in the second layer (i.e., $f_{22}$). As previously explained in Section 3.2, this procedure makes it possible to distinguish between fast global motion, fast local motion, and gradual changes. However, as the distance between the compared frames is larger than in traditional coding patterns, the amount of intra-coded macroblocks in higher layers can still be relatively large. When a significant amount of co-located macroblocks is still intra-coded, this procedure is repeated for higher layers to make a reliable estimation of the motion intensity. The estimation of foreground and background is then executed for the current layer, whereas the motion intensity is derived from the next layer. This is in line with the idea used for the detection of abrupt transitions, where only those intervals are further investigated for which the information is insufficient to draw conclusions about the temporal decomposition.

To exploit the hierarchical coding structures for the detection of gradual changes, only one threshold needs to be added. In particular, a new threshold $T_{nextLayer}$ is introduced which decides whether higher layers need to be considered to make a good estimation of the motion intensity, or whether information from the current frame is enough to draw conclusions. Experimental results show that when more than 50% of the macroblocks of the considered regions are inter-coded, a good estimation can be made.

4.2. Spatial dissimilarity for IDR frames

As stated in Section 3.1.2, IDR frames break the temporal prediction chain of a video sequence as they indicate that no subsequent frame in the bit stream in decoding order is allowed to use frames prior to the IDR picture as a reference. As a result of the pyramidal structure, this gap is located multiple frames prior to the IDR frame in display order. An example is shown in Fig. 10, where frames $B_{33}$ to $B_{39}$ are located before the IDR frame in display order, but after the IDR frame in decoding order. When only taking into account temporal dependencies during shot boundary detection, an abrupt transition between $P_{32}$ and $B_{33}$ will typically be falsely detected as $B_{33}$ is not allowed to use $P_{32}$ as a reference.

In order to calculate the similarity between $P_{32}$ and $B_{33}$, spatial distributions need to be examined instead of temporal dependencies. For the previous frame, the intra-prediction map $M_{32}$ introduced in Section 3.1.2 is utilized. However, as $B_{33}$ can be predicted from following frames, the corresponding spatial distribution needs to be derived as well. Therefore, a second intra-prediction map $M_{IDR,i}$ is constructed containing a spatial representation of $B_{33}$. This map is composed of the intra-prediction modes of following frames in display order. Initially, this map contains the intra-prediction modes of $IDR_{40}$, which is then updated with intra-prediction modes of macroblocks located closer to $B_{33}$. As this process takes place during the analysis of the different layers in a bottom-up approach, $IDR_{40}$, $B_{36}$, $B_{34}$, and $B_{33}$ are used during the update step. When a shot boundary takes place within this sub-GOP, one of these frames will be mainly intra-coded, which results in overwriting the prediction modes of $IDR_{40}$. Consequently, $M_{IDR,33}$ will represent the spatial distribution of the content of $B_{33}$. To calculate the correlation between $P_{32}$ and $B_{33}$, a new dissimilarity metric $\Omega_{IDR}(i)$ is introduced. This metric is based on Eqs. (2a)–(2c) for the intra-prediction maps $M_{i-1}$ and $M_{IDR,i}$.

[Figure: frames f20 and f24 (bottom row), f22 (middle row), f21 and f23 (top row); annotation: 390/396 > T_grad]

Fig. 9. Example of a gradual transition in a hierarchical coding structure. Intra-coded macroblocks are represented by their original colour, whereas inter-
coded macroblocks are blanched.
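
The gradual-transition decision described in Sections 3.2 and 4.1.2 can be summarized in a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the population standard deviation is used because the text does not say which estimator applies, the caller is assumed to pass the intra-coded ratios of the previous $M$ base-layer frames, and the motion intensities of foreground and background are assumed to be computed elsewhere. The order of the comparisons follows the flow chart in Fig. 6.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def adaptive_threshold(history: Sequence[float], alpha: float = 3.0,
                       lower: float = 0.15, upper: float = 0.60) -> float:
    """mu + alpha * sigma over the previous values, clamped to [lower, upper].

    The defaults mirror the typical alpha and the T_grad bounds quoted in the
    text; pstdev (population standard deviation) is an assumption.
    """
    mu = mean(history)
    sigma = pstdev(history)
    return min(max(mu + alpha * sigma, lower), upper)


def motion_threshold(width: int, height: int, frame_rate: float,
                     xi: float = 0.4267) -> float:
    """T_motion = xi * l / F, with l the frame diagonal in pixels."""
    return xi * math.hypot(width, height) / frame_rate


def classify_change(intra_ratio: float, history: Sequence[float],
                    mi_fg: float, mi_bg: float, t_motion: float) -> str:
    """Label the content change for one frame (labels are illustrative)."""
    if intra_ratio <= adaptive_threshold(history):
        return "no change"
    if mi_bg > t_motion:
        return "fast global motion"
    if mi_fg > t_motion:
        return "fast local motion"
    return "gradual transition"


# Toy numbers: five quiet base-layer frames, then a frame with 50% intra MBs.
history = [0.05, 0.06, 0.04, 0.05, 0.05]
t_motion = motion_threshold(352, 288, 25.0)
print(classify_change(0.50, history, mi_fg=2.0, mi_bg=1.0, t_motion=t_motion))
```

With these toy values the unclamped threshold would fall below 0.15, so the lower bound takes effect, which is the situation the bounds are meant to handle for static shot starts.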

B33 B35 B37 B39

B34 B38

intra prediction map M32 B36


B20

P16 P24 P32 IDR40

Display order
P32 IDR40 B36 B34 B38 B33 B35 B37 B39

Decoding order
: temporal prediction prohibited : temporal prediction allowed

Fig. 10. The use of IDR frames results in a temporal prediction chain that is broken, as no subsequent frame in decoding order is allowed to use frames
prior to the IDR frame as reference.

During the construction of both maps, only the subset of frames that is needed for the analysis of the temporal dependencies is used. The other frames, which are not considered, will typically contain very little spatial information as they are located between highly similar frames. Therefore, discarding these frames does not influence the accuracy of the dissimilarity measurement. Note that $M_i$ is kept up to date during the analysis of the complete bit stream, whereas $M_{IDR,i}$ is only constructed for sub-GOPs containing IDR frames.

As already stated in Section 3.1.2, this problem also arises for traditional coding patterns. However, explanations are provided in this section as the impact of IDR frames is larger for hierarchical coding patterns since more frames are involved.

4.3. Summary of the enhanced shot boundary detection algorithm working on hierarchical coding patterns

To give a coherent overview of the algorithm working on hierarchical coding patterns, a summarizing flow chart is depicted in Fig. 11. This algorithm is executed for every sub-GOP in a video sequence, which corresponds to two successive frames belonging to the base layer (further referred to as start frame and end frame), and the intermediate frames located at higher layers. Note that the end frame in the current sub-GOP will therefore correspond to the start frame in the next sub-GOP.

First, the algorithm responsible for detecting abrupt transitions is applied. As elaborated upon in Section 4.1, the temporal correlation between the start and end frame is investigated to decide whether they belong to the same shot or not. Based on the outcome, this algorithm is recursively executed for higher layers or immediately stopped. When arriving at the highest layer and the temporal correlation is still very low, the probability of a shot boundary is high. Only in case this gap in the temporal prediction chain is caused by an IDR frame or certain I frames, the spatial dissimilarity is calculated as well, as explained in Section 4.2. As a consequence, the amount of FAs can be reduced.

Next, the algorithm responsible for detecting gradual transitions is performed. In case the percentage of intra-coded macroblocks in the end frame increases significantly, $T_{grad}$ is exceeded, which indicates a change in content. To discover the origin of this change (i.e., gradual transition, global motion, or local motion), the foreground and background of this frame are estimated, as discussed in Section 4.1. For both regions, the corresponding motion intensity in the intermediate frame belonging to the next layer is calculated. Using $T_{motion}$, the source of the content change can be determined. In particular, when there is an increase in intra-coded macroblocks although the motion intensity is low, a gradual transition is detected. In case the content change is very drastic, it is possible that motion information from the current frame is insufficient since the high amount of intra-coded macroblocks gives a distorted view of the motion intensity. Therefore, when this percentage exceeds $T_{nextLayer}$, this technique is performed in a recursive way on the frames located in the next layer.

5. Performance results

To evaluate the performance of our shot boundary detection algorithm, experiments have been carried out on several video sequences with various characteristics in terms of resolution, length, quality, and content. An overview of the characteristics of the test sequences used is given in Table 1. The first two sequences are obtained from the publicly available MPEG-7 Content Set [17] and represent a part of a news sequence and a basketball sequence (i.e., V3 and V17 from [17]). Since the quality and the resolution of these sequences are low, three recent, proprietary sequences were added

[Flow chart, summarized as text]
For each new sub-GOP (startFrame, endFrame), update M_i with startFrame.

Abrupt transitions:
1. If (1/B)(ι(startFrame) + ϕ(startFrame)) > T_inter and (1/B)(ι(endFrame) + β(endFrame)) > T_inter:
   - If startFrame + 1 ≠ endFrame: parse intermediate frame in next layer and update M_i; recursively repeat for [startFrame, middleFrame] and [middleFrame, endFrame].
   - If startFrame + 1 = endFrame and the gap is not the result of an I or IDR frame: detect abrupt transition.
   - If startFrame + 1 = endFrame and the gap is the result of an I or IDR frame: construct M_IDR,endFrame; detect abrupt transition if Ω_IDR(endFrame) > T_intra.
2. Otherwise: no transition in sub-GOP.

Gradual transitions:
1. If ι(endFrame) > T_grad: parse intermediate frame in next layer; analysis of motion activity in foreground and background of middle frame f_i.
   - If #(intra-coded MB in foreground or background) > T_nextLayer: recursively repeat for [startFrame, middleFrame] and [middleFrame, endFrame].
   - Otherwise, if MI_B(f_i) > T_motion: no transition in sub-GOP.
   - Otherwise, if MI_F(f_i) > T_motion and MI_B(f_i) < T_motion: no transition in sub-GOP.
   - Otherwise, if MI_F(f_i) < T_motion and MI_B(f_i) < T_motion: detect gradual transition in sub-GOP.
2. Otherwise: no transition in sub-GOP.

Fig. 11. Flow chart of the enhanced algorithm for the detection of the different types of content changes on hierarchical coding patterns.

Table 1
Characteristics of the test sequences

Name     Resolution  # Frames  Frame rate (Hz)  Length (s)  # Abrupt transitions  # Gradual transitions
News 1   352 × 288   26 000    25               1040.0      154                   18
Basket   352 × 288   18 053    25               722.1       62                    13
News 2   384 × 208   23 802    25               952.0       138                   19
Soap     720 × 576   15 040    25               601.6       160                   7
Trailer  848 × 352   3553      25               142.1       81                    24

to the test set as well. The first sequence originates from a news broadcast from Belgian public television, the second sequence is part of an international television soap, and the last sequence is the movie trailer of "Little Miss Sunshine" (see footnote 3).

Footnote 3: This sequence can be found on the website of Apple.

These test sequences were coded a number of times with different hierarchical coding patterns (i.e., two, three, and four temporal layers). These configurations correspond to a pyramidal structure containing two (hier_2), four (hier_4), and eight frames (hier_8), respectively. Furthermore, these different coding patterns were generated two times: once with I frames and once with IDR frames, which were inserted every 32 frames. A distinction between these two types is made as IDR frames lead to gaps in the temporal prediction chain, which is not always the case for classic I frames, as discussed in Section 3.1.2.

First, the accuracy of the proposed algorithm is evaluated for different coding parameters, based on the manually created ground truth. Next, these results are compared to a publicly available uncompressed-domain algorithm. Thereafter, a complexity analysis of the enhanced algorithm for hierarchical patterns is given.

5.1. Accuracy of the proposed algorithm for shot boundary detection

The accuracy of the proposed algorithm is evaluated by comparing the obtained results against the ground truth.

This comparison is based on the number of correct detections (D), missed detections (MD), and false alarms (FA), expressed as recall and precision:

$$\mathrm{Precision} = \frac{D}{D + FA}, \qquad \mathrm{Recall} = \frac{D}{D + MD}. \qquad (7)$$

Table 2 presents the accuracy of our proposed algorithm. As can be deduced from the results, the accuracy for the detection of abrupt transitions is very high. Only in case of IDR frames, a decrease in precision is observed which can be attributed to the gaps in the temporal prediction chain and the intra-prediction map for calculating the spatial dissimilarity. These FAs occur when the content slightly changes as a result of slow motion. Since this motion can generally be compensated using inter-prediction, the intra-prediction map used for calculating the spatial dissimilarity will not be updated accordingly. Consequently, the intra-prediction map no longer gives a good representation of the spatial distribution. This phenomenon occurs less often for larger hierarchical structures as the base layer typically contains more intra-coded macroblocks, which results in a more accurate intra-prediction map. Note that the intra-prediction map is not used in the test sequences coded with classic I frames, since the intermediate B frames prevent gaps in the temporal prediction chain. Therefore, the accuracy of using intra-prediction maps can be deduced from Table 2 by comparing the precision and recall values for sequences containing classic I frames with sequences containing IDR frames.

Table 2
Accuracy in precision (P) and recall (R) (%) of the proposed algorithm on sequences coded with different hierarchical structures and I or IDR frames inserted every 32 frames. Additionally, the accuracy of the "TZI Shotdetection TrecVID 2004" algorithm is depicted as comparison

Name     Coding pattern       Abrupt P  Abrupt R  Gradual P  Gradual R
News 1   I frames, hier_8     96        100       48         61
News 1   I frames, hier_4     93        100       47         50
News 1   I frames, hier_2     91        100       75         67
News 1   IDR frames, hier_8   91        99        53         77
News 1   IDR frames, hier_4   88        99        40         44
News 1   IDR frames, hier_2   87        100       75         67
News 1   TZI                  93        97        35         50
Basket   I frames, hier_8     95        100       60         23
Basket   I frames, hier_4     91        100       50         15
Basket   I frames, hier_2     94        98        60         23
Basket   IDR frames, hier_8   90        100       75         23
Basket   IDR frames, hier_4   90        98        50         15
Basket   IDR frames, hier_2   83        84        63         38
Basket   TZI                  97        93        31         76
News 2   I frames, hier_8     99        100       76         84
News 2   I frames, hier_4     100       100       90         95
News 2   I frames, hier_2     98        100       100        95
News 2   IDR frames, hier_8   100       100       76         84
News 2   IDR frames, hier_4   93        98        82         95
News 2   IDR frames, hier_2   80        99        95         95
News 2   TZI                  91        99        76         84
Soap     I frames, hier_8     99        100       36         57
Soap     I frames, hier_4     100       100       58         100
Soap     I frames, hier_2     99        100       71         71
Soap     IDR frames, hier_8   99        100       43         86
Soap     IDR frames, hier_4   93        100       64         100
Soap     IDR frames, hier_2   83        99        71         71
Soap     TZI                  99        91        50         57
Trailer  I frames, hier_8     100       99        92         96
Trailer  I frames, hier_4     99        100       96         96
Trailer  I frames, hier_2     100       100       88         96
Trailer  IDR frames, hier_8   99        98        96         96
Trailer  IDR frames, hier_4   100       98        96         96
Trailer  IDR frames, hier_2   100       100       92         96
Trailer  TZI                  95        99        96         96

For the gradual transitions, all false alarms are caused by sudden changes in light intensity as the amount of intra-coded macroblocks increases drastically, but no significant motion is present. The missed gradual transitions can for the major part be attributed to transitions taking place over a long period of time or to high motion present during the transition. Our proposed algorithm falsely classifies these content changes as global motion, instead of gradual changes.

To compare the results of our algorithm with publicly available algorithms, we employed the "TZI Shotdetection TrecVID 2004" algorithm and added the accuracy results to Table 2. This algorithm is developed within the scope of the DELOS Network of Excellence [9], relies on uncompressed-domain features for the detection of shot boundaries [20], and is also used in [14] for comparison purposes. When comparing the results of TZI with the results of our algorithm in Table 2, it can be seen that our proposed algorithm achieves a higher accuracy for the detection of abrupt transitions. Although uncompressed-domain algorithms can rely on more features, our algorithm exploits the decisions made by the encoder during its search for similarity with preceding frames, which is a computationally intensive process. For gradual transitions, both algorithms struggle with the same problems. Our proposed algorithm has the advantage that it uses features available in the compressed domain instead of pixel-domain information, making full decompression unnecessary.

Note that a pyramidal structure containing two frames corresponds to a traditional I(BP) pattern. Compared to

the algorithm for traditional coding patterns proposed in Section 3, the enhanced algorithm proposed in Section 4 only investigates the second layer when the amount of information from the base layer is insufficient to draw conclusions. As examining the second layer corresponds to investigating all layers in a traditional coding pattern, the accuracy of both algorithms is the same for traditional coding patterns. Therefore, the results obtained for the proposed algorithm for traditional coding patterns are reflected in the results for pyramidal coding structures containing two frames.

5.2. Complexity analysis of the proposed shot boundary detection algorithm

The main advantage of the enhanced algorithm for shot boundary detection is its reduced computational complexity. As already stated in Section 4, only a subset of frames present in the video is needed to detect shot boundaries in the compressed domain. In this section, the influence of the different coding parameters on the complexity of the enhanced algorithm is discussed. The type of content is taken into account as this also plays an important role. Finally, the difference in complexity of compressed- and uncompressed-domain shot boundary detection algorithms is discussed.

As can be seen from Table 3, the percentage of frames that needs to be analyzed during shot boundary detection decreases when more temporal layers are included. This results from the fact that the number of frames present in the lowest layer decreases. Despite this strong decrease in the number of processed frames, the execution times decrease less rapidly. In some cases, one can even notice an increase in execution time for sequences consisting of multiple layers. This can mainly be attributed to the fact that more frames from higher layers need to be parsed when using more temporal layers. In particular, frames located in the base layer need less time to be parsed and processed as they typically use only one reference, instead of two references for frames located in higher levels.

Sequences containing IDR frames will generally need more processing power than sequences with I frames as spatial dissimilarity needs to be calculated as well. This process becomes more expensive when the number of temporal layers increases. In particular, more intermediate frames are necessary to create the second intra-prediction map M_{IDR,i}.

To illustrate the gain in efficiency, the execution speed of the enhanced algorithm (where only the necessary frames are parsed and analyzed) is compared to an adapted version of this enhanced algorithm where all frames are parsed. This is illustrated in Fig. 12 for the test sequences containing I frames, where the relative execution speed of the enhanced algorithm is depicted. Indeed, this graph represents the time needed to execute the enhanced algorithm compared to the time needed to perform the algorithm where all frames are parsed and processed. From these measurements, it can be concluded that a significant gain in execution time and therefore efficiency can be achieved by only considering a well-defined subset of frames present in the bit stream.

Furthermore, as can be seen from the percentage of analyzed frames in Table 3 and the relative execution speed in Fig. 12, the type of content has a major impact on the efficiency. For video sequences with long shots and little motion such as news sequences, the gain in efficiency is very high. In this case, frames in the lowest layer can often use each other as reference since the content remains very similar over a long period of time. As a result, higher layers only need to be examined rarely. On the other hand, high-motion video streams such as sport sequences demand more processing power. Since high motion drastically changes the content of the video, the similarity between pictures in the lowest layer is small and the amount of intra-coded macroblocks increases significantly. As this latter characteristic is also present when gradual changes occur, higher layers need to be examined in order to compute the motion intensity and to distinguish motion from gradual changes. Although the "soap" and the "trailer" sequences do not contain a significant amount of motion, the results show that their complexity is also higher than the complexity of news sequences. This can be attributed to the fact that the average length of the shots is much smaller. As shot boundaries occur more frequently, higher layers need to be examined more often. To determine the exact location, it is necessary to rise to the highest layer in the pyramidal structure. As a result, the number of frames that needs to be processed increases considerably. The same trends can be observed for sequences containing IDR frames. Note that the resolution of the video sequences also plays an important role since the execution speed is inversely proportional to the size of the content.

It would be interesting to compare the complexity results of the proposed algorithm with uncompressed-domain algorithms, and in particular the TZI algorithm. However, making a decent comparison in terms of complexity is non-trivial. To obtain a fair comparison between the execution speeds of both algorithms, the same code base should be used since the underlying implementation determines the results to a great extent. At best, this means that the parser used in our proposed algorithm is part of the complete decoder used by the uncompressed-domain algorithm before starting the analysis phase. Even then, the implementation of the remaining part of the decoder influences the complexity. As a consequence, comparing the complexity in terms of execution speed is very questionable, and hence, not incorporated here.

However, one can clearly recognize that algorithms working on compressed data are intrinsically less complex. In particular, several time-consuming steps from the decoding process such as entropy decoding, motion compensation, inverse quantization, and inverse transformation do not need to be executed in our algorithm. Even the parsing of a well-defined subset of frames is eliminated.

Of course, the complexity of the algorithms themselves also plays an important role. Uncompressed-domain algorithms often compute information about edges and

Table 3
Complexity results for the enhanced shot detection algorithm

                          I frames                                               IDR frames
Name     Coding pattern   Size (MB)  Execution time (s)  Analyzed frames (%)    Size (MB)  Execution time (s)  Analyzed frames (%)

News 1   hier_8           141.0       83.2               26.3                   144.9      114.2               23.8
         hier_4           136.1       89.9               35.9                   139.3      100.6               35.9
         hier_2           129.5      125.8               58.7                   131.3      125.3               58.6

Basket   hier_8           118.4       57.8               32.7                   120.7       72.3               29.7
         hier_4           114.6       61.2               40.4                   116.4       67.6               40.5
         hier_2           109.7       84.4               60.9                   110.7       84.9               60.8

News 2   hier_8            43.3       48.7               20.0                    45.5       73.6               20.0
         hier_4            43.7       55.9               31.4                    45.0       65.4               31.5
         hier_2            46.8       82.3               55.0                    47.4       82.8               55.0

Soap     hier_8            95.6      302.4               30.3                    99.0      377.1               30.1
         hier_4            91.1      280.9               38.7                    93.5      295.9               38.9
         hier_2            71.9      341.9               59.5                    86.3      340.3               60.5

Trailer  hier_8            14.1       56.0               29.4                    14.9       67.0               30.9
         hier_4            13.7       45.5               42.4                    14.2       47.9               40.0
         hier_2            13.5       58.0               60.5                    13.7       58.0               60.7
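
To relate the "Analyzed frames" columns of Table 3 to the pyramidal structure, the following Python sketch compares the reported percentages with the share of frames in the base layer. It is only an illustration under stated assumptions: hier_N is taken to denote a dyadic pyramid of N frames with one base-layer frame per pyramid, frames are indexed in display order, and the function names are hypothetical rather than part of the proposed implementation.

```python
def temporal_layer(index: int, frames_per_pyramid: int) -> int:
    """Temporal layer of a frame (0 = base layer) in a dyadic pyramid.

    Assumption: frames are indexed in display order and frames_per_pyramid
    is a power of two (8, 4, or 2 for hier_8, hier_4, and hier_2).
    """
    position = index % frames_per_pyramid
    if position == 0:
        return 0
    layer = frames_per_pyramid.bit_length() - 1  # log2 of a power of two
    while position % 2 == 0:
        position //= 2
        layer -= 1
    return layer


def base_layer_share(frames_per_pyramid: int) -> float:
    """Percentage of frames that reside in the base layer."""
    return 100.0 / frames_per_pyramid


# "Analyzed frames (%)" from Table 3 for News 1 coded with I frames.
ANALYZED_NEWS1_I = {8: 26.3, 4: 35.9, 2: 58.7}

# Layer of each frame in one hier_8 pyramid plus the next base-layer frame.
print([temporal_layer(i, 8) for i in range(9)])

for n, analyzed in ANALYZED_NEWS1_I.items():
    beyond_base = analyzed - base_layer_share(n)
    print(f"hier_{n}: base layer {base_layer_share(n):.1f}%, "
          f"analyzed {analyzed:.1f}%, beyond base layer {beyond_base:.1f}%")
```

Under these assumptions, the difference between the analyzed share and the base-layer share indicates how often frames from higher layers had to be examined for a given sequence.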

[Fig. 12: bar chart. Vertical axis: relative execution time (%), ranging from 0 to 50. Horizontal axis: coding patterns Hier_8, Hier_4, and Hier_2 for each of the sequences News 1, Basket, News 2, Soap, and Trailer.]
Fig. 12. Relative execution speed of the enhanced algorithm. These values represent the time needed to execute the enhanced algorithm compared to the
adapted version.
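
The quantity plotted in Fig. 12 can be expressed as a short Python sketch. The execution time of the enhanced algorithm is taken from Table 3 (News 1, hier_8, I frames); the execution time of the adapted version is not reported in the text, so the value used below is a purely hypothetical placeholder, as is the function name.

```python
def relative_execution_time(enhanced_seconds: float,
                            all_frames_seconds: float) -> float:
    """Time of the enhanced algorithm as a percentage of the time needed by
    the adapted version that parses and processes all frames."""
    if all_frames_seconds <= 0:
        raise ValueError("reference execution time must be positive")
    return 100.0 * enhanced_seconds / all_frames_seconds


ENHANCED_SECONDS = 83.2      # Table 3: News 1, hier_8, I frames
ALL_FRAMES_SECONDS = 250.0   # HYPOTHETICAL: not reported in the paper

ratio = relative_execution_time(ENHANCED_SECONDS, ALL_FRAMES_SECONDS)
print(f"relative execution time: {ratio:.1f}%")
```

A value below 100% means that skipping frames from higher layers saves time compared to parsing every frame.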

motion intensity to increase their accuracy. Compressed-domain algorithms, on the other hand, can directly extract this information from the bit stream as these characteristics are already exploited during the motion estimation step and the search for optimal partitioning of macroblocks in the encoding phase. As a result, the actual analysis step of our compressed-domain algorithms is less complex as well.

6. Conclusions

Temporal segmentation is a prerequisite for semantic video analysis. Therefore, in this paper, we have proposed a novel algorithm for shot boundary detection on H.264/AVC bit streams. To identify abrupt transitions, the algorithm examines the temporal dependencies between successive frames. Since IDR frames and certain I frames lead to gaps in this prediction chain, spatial dissimilarities are considered as well. Gradual changes are detected by relying on the amount of intra-coded macroblocks and motion intensity. Furthermore, an enhanced algorithm is introduced for shot boundary detection on H.264/AVC bit streams with hierarchical coding patterns. As these coding structures consist of multiple layers, a subset of lower layers can be seen as a reduced version of the original video. Therefore, pictures residing in higher layers are only considered in case information from lower layers is

insufficient to accurately detect shot boundaries. Experimental results show that this algorithm achieves a high accuracy in terms of recall and precision. Furthermore, by exploiting the layered structure, the computational complexity is drastically reduced, resulting in an increased efficiency.

Acknowledgements

The authors would like to thank Davy De Schrijver for his constructive and extensive feedback.

The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.

References

[1] J. Bescós, Real-time shot change detection over online MPEG-2 video, IEEE Trans. Circuits Systems Video Technol. 14 (4) (2004) 475–484.
[2] G. Boccignone, A. Chianese, V. Moscato, A. Picariello, Foveated shot detection for video segmentation, IEEE Trans. Circuits Systems Video Technol. 15 (3) (2005) 365–377.
[3] I.S. Burnett, F. Pereira, R. Van de Walle, R. Koenen, The MPEG-21 Book, Wiley, New York, 2006.
[4] J. Ćalić, D.P. Gibson, N.W. Campbell, Efficient layout of comic-like video summaries, IEEE Trans. Circuits Systems Video Technol. 17 (7) (2007) 931–936.
[5] S.-F. Chang, A. Vetro, Video adaptation: concepts, technologies and open issues, Proc. IEEE 93 (1) (2005) 148–158.
[6] B. Damghanian, M.R. Hashemi, M.K. Akbari, A novel fade detection algorithm on H.264/AVC compressed domain, in: Lecture Notes in Computer Science, vol. 4319, Springer, Berlin, 2006, pp. 1159–1167.
[7] S. De Bruyne, W. De Neve, K. De Wolf, D. De Schrijver, P. Verhoeve, R. Van de Walle, Temporal video segmentation on H.264/AVC compressed bitstreams, in: Lecture Notes in Computer Science, vol. 4351, Springer, Berlin, 2007, pp. 1–12.
[8] W. De Neve, D. Van Deursen, D. De Schrijver, K. De Wolf, R. Van de Walle, Using bitstream structure descriptions for the exploitation of multi-layered temporal scalability in H.264/AVC's base specification, in: Lecture Notes in Computer Science, vol. 3767, Springer, Berlin, 2005, pp. 641–652.
[9] DELOS Network of Excellence on Digital Libraries <http://www.delos.info/>.
[10] A. Divakaran, K.A. Peker, R. Radhakrishnan, Z. Xiong, R. Cabasson, Video summarization using MPEG-7 motion activity and audio descriptors, Technical Report TR-2003-34, Mitsubishi Electric Research Laboratories, 2003.
[11] P.M. Fonseca, F. Pereira, Automatic video summarization based on MPEG-7 descriptions, Signal Processing: Image Communications 19 (8) (2004) 685–699.
[12] U. Gargi, R. Kasturi, S.H. Strayer, Performance characterization of video-shot-change detection methods, IEEE Trans. Circuits Systems Video Technol. 10 (1) (2000) 1–13.
[13] A. Girgensohn, J. Boreczky, Time-constrained keyframe selection technique, Multimedia Tools Appl. 11 (3) (2000) 347–358.
[14] C. Grana, R. Cucchiara, Linear transition detection as a unified shot detection approach, IEEE Trans. Circuits Systems Video Technol. 17 (4) (2007) 483–489.
[15] A. Hanjalic, Shot-boundary detection: unraveled and resolved?, IEEE Trans. Circuits Systems Video Technol. 12 (2) (2002) 90–105.
[16] H.J.A.M. Heijmans, Composing morphological filters, IEEE Trans. Image Process. 6 (5) (1997) 713–723.
[17] ISO/IEC JTC1/SC29/WG11/N2467, Description of MPEG-7 content set, 1998.
[18] ISO/IEC 15938-3, Information technology -- multimedia content description interface -- Part 3: visual, 2002.
[19] ITU-T and ISO/IEC JTC 1, Advanced video coding for generic audiovisual services, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, 2003.
[20] A. Jacobs, A. Miene, G.T. Ioannidis, O. Herzog, Automatic shot boundary detection combining color, edge, and motion features of adjacent frames, in: TRECVID 2004 Workshop Notebook Papers, 2004, pp. 197–206.
[21] S. Jeannin, A. Divakaran, MPEG-7 visual motion descriptors, IEEE Trans. Circuits Systems Video Technol. 11 (6) (2001) 720–724.
[22] S.M. Kim, J. Byun, C.S. Won, A scene change detection in H.264/AVC compression domain, in: Lecture Notes in Computer Science, vol. 3768, Springer, Berlin, 2005, pp. 1072–1082.
[23] A. Leontaris, P.C. Cosman, Compression efficiency and delay trade-offs for hierarchical B-pictures and pulsed-quality frames, IEEE Trans. Image Process. 16 (7) (2007) 1726–1740.
[24] Z. Li, G.M. Schuster, A.K. Katsaggelos, MINMAX optimal video summarization, IEEE Trans. Circuits Systems Video Technol. 15 (10) (2005) 1245–1256.
[25] R. Lienhart, Comparison of automatic shot boundary detection algorithms, in: Proceedings of the SPIE Storage and Retrieval for Image and Video Databases VII, vol. 3656, 1998, pp. 290–301.
[26] Y. Liu, W. Wang, W. Gao, W. Zeng, A novel compressed domain shot segmentation algorithm on H.264/AVC, in: Proceedings of the IEEE International Conference on Image Processing, vol. 4, 2004, pp. 2235–2238.
[27] J. Magalhães, F. Pereira, Using MPEG standards for multimedia customization, Signal Processing: Image Communications 19 (5) (2004) 437–456.
[28] B.S. Manjunath, P. Salembier, T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, Wiley, New York, 2002.
[29] S.-C. Pei, Y.-Z. Chou, Efficient MPEG compressed video analysis using macroblock type information, IEEE Trans. Multimedia 1 (4) (1999) 321–333.
[30] H. Schwarz, D. Marpe, T. Wiegand, Analysis of hierarchical B pictures and MCTF, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2006, pp. 1929–1932.
[31] D. Tian, M.M. Hannuksela, M. Gabbouj, Sub-sequence video coding for improved temporal scalability, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2005, pp. 6074–6077.
[32] T. Wiegand, G.J. Sullivan, G. Bjøntegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Trans. Circuits Systems Video Technol. 13 (7) (2003) 560–576.
[33] Z. Xiong, R. Radhakrishnan, A. Divakaran, Y. Rui, T. Huang, A Unified Framework for Video Summarization, Browsing & Retrieval: With Applications to Consumer and Surveillance Video, Academic Press, New York, 2005.
[34] B.-L. Yeo, B. Liu, Rapid scene analysis on compressed video, IEEE Trans. Circuits Systems Video Technol. 5 (6) (1995) 533–544.
[35] W. Zeng, W. Gao, Shot change detection on H.264/AVC compressed video, in: Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 4, 2005, pp. 3459–3462.
[36] H.J. Zhang, A. Kankanhalli, S.W. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems 1 (1) (1993) 10–28.
[37] H.J. Zhang, C.Y. Low, S.W. Smoliar, Video parsing and browsing using compressed data, Multimedia Tools Appl. 1 (1) (1995) 89–111.