
420 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010

Time-Scale Atoms Chains for Transients Detection in Audio Signals
Vittoria Bruni, Silvia Marconi, and Domenico Vitulano, Member, IEEE

Abstract: This paper presents a novel approach for the extraction of the transients content of audio signals, usually represented as a superposition of stationary, transient, and stochastic components. The proposed model exploits the predictable and peculiar time-scale behavior of transients by modeling them as a superposition of suitable wavelet atoms. The latter allow transients information to be predicted even at scales where the tonal component is dominant. In this way it is possible to avoid, if required, the pre-analysis of the tonal component. Extensive experimental results show that the proposed model achieves good performance with a moderate computational effort and without any user dependence.

Index Terms: Audio signals, multiscale analysis, scale linking, transients detection, wavelet atoms.

Manuscript received December 17, 2008; revised August 29, 2009. Current version published February 10, 2010. This work was supported in part by MIUR via RSTL-CNR under Grant DG.RSTL.004.009 and in part by FIRB under Grant RBNE039LLC. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Laurent Daudet.
V. Bruni and D. Vitulano are with the I.A.C.-C.N.R., 00185 Rome, Italy (e-mail: bruni@iac.rm.cnr.it; vitulano@iac.rm.cnr.it).
S. Marconi is with the I.A.C.-C.N.R., 00185 Rome, Italy, and also with the University "La Sapienza," ME.MO.MAT, 00185 Rome, Italy (e-mail: marconi@dmmm.uniroma1.it).
Digital Object Identifier 10.1109/TASL.2009.2032623
1558-7916/$26.00 © 2010 IEEE

I. INTRODUCTION
WITH the increasing diffusion and development of all forms of audio and visual media, there is a growing need for more effective tools oriented to their management. With regard to audio signals, a variety of applications such as coding, content delivery, retrieval, etc., require more and more sophisticated processing techniques with a high degree of user independence [1], [2]. These applications are usually based on modeling audio signals as a superposition of three distinct components [3]–[10]: transients, that correspond to abrupt changes in the sound, a tonal (or steady-state) component, that characterizes quasi-sinusoidal parts of the signal, and a stochastic residual, that is a stationary random signal with smooth power spectrum [5]. What makes it difficult to separate each of the aforementioned components without ambiguity is their interaction in real signals as well as some correlated effects like masking, occlusion, etc. [11]. This is the reason why there are several approaches in the literature for their detection (see for instance [1]–[10] and references therein). These approaches are mainly devoted to 1) detecting each component individually [10]–[15], or 2) seeing components detection as a general problem to be globally solved [5], [16], [17], in agreement with human perception and cognition processes. However, both strategies suffer from a heavy signal-dependent tuning of the involved parameters [1], [3], [18]. This problem can be partially alleviated by the use of proper expansion bases that are able to emphasize one component while reducing the contribution of the others. For example, the tonal component shows significant peaks in the signal power spectrum [5], [19], while wavelets usually provide a compact representation of transients, owing to their good time localization [20]. Transients can also be detected by statistical schemes that are based on either modeling the signal under study by (one or more, known or unknown) probabilistic models [21]–[25] or looking for "surprises" in local signal segments [27], [28]. In these cases, such schemes can be automated by tuning the involved parameters through techniques like Expectation-Maximization [29] or expensive training processes [1]. Anyway, the problem of how to reduce (or eliminate) the guided setting of parameters often remains difficult.
This paper focuses on transients detection. It is well known that they convey a lot of audio information such as rhythm, timbre identification, etc. [3], [18]. Their detection is usually performed by subtracting the tonal component from the original signal [26], [30], but there is an alternative school of thought that adopts a direct transients detection [3], [31]–[34]. In this case, multiresolution, or more generally time-scale representations, logically enable the enhancement of transients contribution [7], [10], [13], [15], [35], [36]. A significant improvement is reached if clustering and persistency properties of multiscale representations [30], [37] are combined with harmonic atoms or matching pursuit-based algorithms [7], [38], [39]. Some examples are the Molecular Discrete Cosine Transform (DCT) in [4], based on horizontal structures in the spectrum plane (called molecules) corresponding to the tonal component, or trees of coefficients in the dyadic wavelet domain, that represent local structured signal components [8], [40]. Anyway, transients detection remains, in general, a difficult task. Empirical evidence seems to show an intrinsic link between detection effectiveness and type of signal under exam [3], probably due to a somewhat ambiguous and strongly problem dependent definition of transients [1].
The aim of this work is to perform an automatic transients detection by modeling their contribution as proper chains of atoms in the time-scale domain. The energy of transients is characterized by an increasing monotonic decay that differs from that of the tonal component [11]. Hence, transient regions can be detected in correspondence to scales where their contribution dominates the one of the tonal component. A suitable atomic approximation of transients along with a formal deterministic evolution law in the time-scale plane enable their prediction at the remaining scales [41], [42]. This formalism provides two main advantages if compared to alternative approaches: 1) an effective transients detection, avoiding spurious results that often characterize wavelet based approaches [1], 2) a complete user's

Authorized licensd use limted to: IE Xplore. Downlade on May 13,20 at 1:452 UTC from IE Xplore. Restricon aply.
BRUNI et al.: TIME-SCALE ATOMS CHAINS FOR TRANSIENTS DETECTION IN AUDIO SIGNALS 421

Fig. 1. Top: f (left) and its continuous wavelet transform w for increasing scales (right). Bottom: mesh of w (left) and curves of its significant points
in the time-scale (u; s) plane (right).

independence regardless of the signal under study. The multiscale model also allows the transient contribution of the original signal to be determined without a pre-analysis of the tonal component, whenever required. In addition, the nice properties of the employed atomic approximation enable a speed up of the proposed model, making the computational effort moderate. Experimental results on various audio signals show the efficacy of the proposed method and its potential use for different applications.
The outline of this paper is as follows. Section II presents the multiscale transients model and its numerical implementation. Experimental results, comparative studies, and discussions are offered in Section III. Finally, concluding remarks and guidelines for future work are the topic of Section IV.

II. MULTISCALE MODEL FOR TRANSIENTS DETECTION
An audio signal $f$ can be represented as a superposition of three distinct components

$$f(t) = f_{ton}(t) + f_{tran}(t) + f_{res}(t) \tag{1}$$

where $f_{ton}$ is a locally stationary layer, composed of quasi-sinusoidal atoms. $f_{tran}$ is a piecewise smooth signal: it includes the attack of the notes, that is the increasing part of the transient area, and the onset, that corresponds to the initial point of the transient area. Finally, $f_{res}$ is a stochastic residual that is often related to noise [1], [3].
The goal of this section is to extract $f_{tran}$ from $f$. We exploit the ability of the wavelet transform in characterizing signal singularities in the time-scale plane [43]. In particular, a formal model for the energy spreading of singularities over the time-scale plane will be provided and its relation with transients in audio signals will be investigated in depth. The multiscale behavior of transients energy will be then used for their detection.

A. Proposed Model
The transient layer of an isolated note can be represented with the shape shown in Fig. 1 top left (see also [1]). It is a ramp signal with a certain slope (attack) that is followed by a damping function. Using a wavelet kernel with enough vanishing moments, the damping shape contributes with nearly zero coefficients in the transformed domain. On the contrary, the onset and the end


Fig. 2. Ramp signal having a first order singularity at t (left) and its wavelet transform (basic atom) computed at a fixed scale using a spline biorthogonal wavelet
(right).

of the attack give a significant energy contribution to the wavelet transform (see Fig. 1 top right), since they represent two singular points of the original signal. In particular, the wavelet transform of the transient can be approximated as follows:

(2)

where , is a symmetric and compactly supported wavelet, , that we call basic atom, is the wavelet transform of a ramp signal with unitary slope and a singular point at (see Fig. 2), i.e.,

(3)

where , while and are real numbers with opposite signs.
The wavelet transform of the transient is therefore a superposition of two basic atoms having opposite signs and centered at and . Moreover, the center of mass of each atom follows continuous curves in the $(u, s)$ plane that are characterized by a repulsive behavior, which is faster the slower the decay of the signal is (Fig. 1, bottom). In the following, we will prove that this behavior can be modeled through a precise evolution law from which it is possible to derive the aforementioned curves (atoms trajectories). This result will include the more interesting and common case of not isolated notes. In fact, a signal composed of more notes is a linear superposition of functions like the one in Fig. 1, top left. Therefore, transients correspond to interfering singularities (atoms) in the $(u, s)$ plane, as shown in Fig. 3.
1) Interfering Atoms: Let $w(u, s)$ be the wavelet transform of a generic signal $f$, computed at scale $s$ and time $u$ using a compactly supported and symmetric mother wavelet and let be its support size. In [41], the trajectories in the $(u, s)$ plane of the modulus maxima of $w(u, s)$, that correspond to isolated singularities of a piecewise regular signal $f$, have been derived. To this aim, the set of coefficients corresponding to each singularity of positive order has been modeled as a travelling atom in the $(u, s)$ plane. More precisely, $w(u, s)$ is written as a superposition of a finite number of elements defined as follows:

(4)

where is the function defined in (3), , and , respectively, are the location, the amplitude and the order of each singular point of $f$ and is the contribution of each of them in the wavelet transform, i.e.,

(5)

The comparison of the partial derivatives of (5) with respect to and , respectively, and , provides the following evolution equation:

(6)

where is the distance between the atoms, respectively, located at and and .
Hence, each atom is subjected to both a diffusive and sourcing effect, until it interferes with other atoms (see Fig. 4). From that point on, a transport effect regulates atoms interference till they become a single one (see, for example, the second and third atom in Fig. 4). The sourcing effect is weighted by the energy contribution of each atom and it is connected to the order of the singularities.
Diffusion, transport, and sourcing effects regulate the trajectories of each local maximum of in the time


Fig. 3. (Top) Three different transients (solid, dashed, and dotted lines) corresponding to three isolated notes (left) and their wavelet transform computed at a
fixed scale s (right). (Middle) Superposition of the three transients above (left); its wavelet transform at the same scale (middle); curves of maxima points of the
transform (right). (Bottom) Support signal corresponding to the signal above (left); its wavelet transform (middle); curves of its maxima points (right).

Fig. 4. Piecewise linear signal having four singularity points (left); its wavelet transform computed at successive scales (middle); trajectories in the (u; s) plane
of its four representative atoms (right).

scale plane. In fact, local maxima are such that . Hence, , and then

(7)

where is the derivative of the curve with respect to , while and , respectively, are the partial derivative of and with respect to .
On the other hand, by computing the partial derivative of both members of (6) with respect to and by evaluating it in correspondence to the local maxima points, we have

(8)

where .
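As a rough numerical illustration of the atom trajectories discussed above, the following sketch follows the two opposite-sign atoms of a toy transient across scales. It is not the evolution law in (6): it simply recomputes the transform at each scale and locates its extrema. The Mexican hat wavelet, the toy signal, and the names `cwt_at_scale` and `atom_trajectories` are assumptions made here for illustration (the paper uses a spline biorthogonal wavelet).

```python
import math


def mexican_hat(x):
    """Sign-flipped second derivative of a Gaussian (two vanishing moments)."""
    return (1.0 - x * x) * math.exp(-0.5 * x * x)


def cwt_at_scale(signal, s):
    """Sampled transform w(u, s) = (1/s) * sum_n f[n] * psi((n - u) / s)."""
    half = 5 * s  # the kernel is negligible beyond five scale units
    kernel = [mexican_hat(k / s) / s for k in range(-half, half + 1)]
    n = len(signal)
    out = []
    for u in range(n):
        acc = 0.0
        for j, k in enumerate(range(-half, half + 1)):
            idx = min(max(u + k, 0), n - 1)  # replicate the border samples
            acc += signal[idx] * kernel[j]
        out.append(acc)
    return out


def atom_trajectories(signal, scales):
    """Map each scale to (location of the minimum, location of the maximum)."""
    trajectories = {}
    for s in scales:
        w = cwt_at_scale(signal, s)
        trajectories[s] = (w.index(min(w)), w.index(max(w)))
    return trajectories


# Toy transient (assumption): onset at n = 200, end of attack at n = 230, then flat.
onset, attack_end, length = 200, 230, 512
signal = [min(max((n - onset) / (attack_end - onset), 0.0), 1.0) for n in range(length)]

trajectories = atom_trajectories(signal, [4, 8, 16, 32])
for s, (left, right) in trajectories.items():
    print(f"scale {s:2d}: atoms located at u = {left} and u = {right}")
```

With this toy signal the two extrema sit on the singular points at fine scales and drift apart as the scale grows, which is the repulsive behavior expected for atoms of opposite sign.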


Fig. 5. Original piecewise constant signal (left), its wavelet transform computed at successive scales (middle), trajectories in the (u; s) plane of the six represen-
tative modulus maxima of the transform (right).

Fig. 6. (Left) Transient (top) and tonal component (middle) of glockenspiel signal (bottom). (Right) Corresponding number of wavelet modulus maxima versus
scale.
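The quantity plotted in Fig. 6 (right), the number of wavelet modulus maxima at each scale, can be sketched as follows. This is a minimal illustration under stated assumptions: a Mexican hat wavelet instead of the paper's spline biorthogonal one, a relative threshold of 5% to ignore numerical ripple, and toy signals. The helper names (`count_modulus_maxima`, `maxima_per_scale`, `scale_with_fewest_maxima`) are hypothetical.

```python
import math


def mexican_hat(x):
    """Sign-flipped second derivative of a Gaussian (two vanishing moments)."""
    return (1.0 - x * x) * math.exp(-0.5 * x * x)


def cwt_at_scale(signal, s):
    """Sampled transform w(u, s) = (1/s) * sum_n f[n] * psi((n - u) / s)."""
    half = 5 * s
    kernel = [mexican_hat(k / s) / s for k in range(-half, half + 1)]
    n = len(signal)
    out = []
    for u in range(n):
        acc = 0.0
        for j, k in enumerate(range(-half, half + 1)):
            idx = min(max(u + k, 0), n - 1)  # replicate the border samples
            acc += signal[idx] * kernel[j]
        out.append(acc)
    return out


def count_modulus_maxima(w, rel_threshold=0.05):
    """Count local maxima of |w| that exceed rel_threshold * max|w|."""
    mod = [abs(v) for v in w]
    floor = rel_threshold * max(mod)
    return sum(
        1
        for i in range(1, len(mod) - 1)
        if mod[i] > floor and mod[i] > mod[i - 1] and mod[i] >= mod[i + 1]
    )


def maxima_per_scale(signal, scales):
    """Number of modulus maxima of the transform at every scale."""
    return [count_modulus_maxima(cwt_at_scale(signal, s)) for s in scales]


def scale_with_fewest_maxima(scales, counts):
    """Scale with the smallest count (ties resolved toward the finer scale)."""
    return min(zip(counts, scales))[1]


scales = [4, 8, 16, 32]
# Toy signals (assumptions): a ramp-then-flat transient and a pure sinusoid.
transient = [min(max((n - 200) / 30.0, 0.0), 1.0) for n in range(512)]
tonal = [math.sin(2.0 * math.pi * n / 16.0) for n in range(512)]

transient_counts = maxima_per_scale(transient, scales)
tonal_counts = maxima_per_scale(tonal, scales)
print("transient:", transient_counts)
print("tonal:    ", tonal_counts)
```

For the isolated toy transient the count stays at two maxima (one per singular point) over all tested scales, whereas the sinusoid produces many maxima at fine scales; picking the scale with the fewest maxima mirrors the selection criterion described in the text.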

By comparing (7) and (8), for each atom's maximum we have

(9)

whose solution is the trajectory (in the $(u, s)$ plane) of atoms' modulus maxima corresponding to the specified initial condition.
Hence, the singularity at is isolated till its cone of influence, i.e., , does not intersect the one of the singularity at : the last term of the second member in (9) is zero and then . Whenever the two cones intersect each other, i.e., , the two singularities interfere and their global maximum changes its location: the last term of (9) regulates the speed of the shift of the global maximum of the analyzed atom. In particular, each modulus maximum of can be attracted or repulsed by the neighboring ones. In case of attraction, i.e., the slopes have the same sign, the two atoms become a single atom; on the contrary, in case of repulsion (the slopes have opposite sign) the two atoms do not disappear. For example, the wavelet transform of the signal in Fig. 5 (left) has six representative atoms: the couples of atoms (1st, 2nd), (3rd, 4th), and (5th, 6th) have opposite sign and show repulsion since the first scale. Nonetheless, as the scale increases, the couples (2nd, 3rd) as well as (4th, 5th) show attraction, since they have the same sign (see their trajectories in the $(u, s)$ plane in Fig. 5).
2) Transients Approximation: Coming back to the transient model, one isolated note is characterized by two singular points: the onset and the end of the attack. Hence, (2) is of the same type as (5). It turns out that the transients wavelet transform is a superposition of two interfering atoms having opposite signs, whose behavior in the $(u, s)$ plane is described by the laws in (6) and (9). Additional isolated notes provide new interfering atoms that still interact according to the previous equations. This implies that the evolution laws in (6) and (9) are able to provide a precise characterization of the multiscale behavior of the transient component of an acoustic signal. In particular, (9) implies that the number of atoms composing the transient layer is constant as the scale increases, in case of repulsion or lack of interference. Otherwise it decreases, in case of attraction, since two single atoms become one atom.
On the other hand, the quasi-sinusoidal characterization of the tonal component yields a periodic behavior to its energy in the transformed domain. It turns out that the number of modulus maxima of the wavelet transform of the tonal component does not monotonically vary (see Fig. 6) as the scale increases. Therefore, it is not reasonable to detect transients by thresholding wavelet coefficients at arbitrarily selected (or "a priori" fixed) scales. On the contrary, it is more convenient and successful to select them in correspondence to those scales where the tonal component is negligible and the transient content is well emphasized. They are the scales preserving a monotonic non-increasing number of modulus maxima (as the scale increases), i.e., the ones that better emphasize the singular part of the signal. It is worth stressing that these scales strongly depend on the nature of the analyzed signal. Hence, they do not necessarily coincide with the dyadic ones and do not have, in principle, any predictable relationship or rule.
Previous concepts characterize the multiscale structure of the transient layer by modeling the spreading of its energy at different resolutions. In particular, the energy of the atoms corresponding to transients propagates from fine to coarse scales with an increasing behavior, due to their positive Lipschitz order. Therefore, evolution laws can be applied first to detect and recover transients contribution at the available scales (the ones selected with the previous criterion). Then, they can also be used to predict and recover transients content, even in correspondence to not available scales. The next section will be devoted to showing how these two tasks can be accomplished.

B. Numerical Implementation
The selection of those scales able to emphasize transients with respect to the tonal component is important in order to estimate atoms parameters ( ). They allow (6) and (9) to be solved, giving transients contribution at each scale . In particular, we are interested in recovering transients contribution in correspondence to a finite number of scales that guarantee the inversion of the transform. Nonetheless, the solution of (6) and (9) requires an intensive and prohibitive computational effort. Moreover, their complete solution could be useless since we are interested in its restriction to a limited number of scales, for example the dyadic ones. The atomic representation in (5) is still useful for this task as it enables transients recovering without really solving the evolution laws. In fact, (4) gives the wavelet transform computed at scale of the signal , where

(10)

Therefore, the linearity of the wavelet transform and (5) imply that the wavelet transform at a fixed scale of a generic signal corresponds to the one of the piecewise linear signal , that we call support signal, defined as follows:

(11)

where (see Appendix I for the proof). has the same high frequency content of at scale , while its low-pass residual is different, as it is depicted in Fig. 3. It turns out that atoms parameters computed at a fixed scale (initial conditions for (9)), enable to derive at any scale by performing the wavelet transform up to scale of the support signal .
More precisely, if are atoms parameters at scale , then at scale we have

where is defined in (11) using and is the wavelet transform operator so that .
For our purposes, is in the set of scales having a negligible contribution of the tonal component, while is one of the scales to use for the inversion of the transform.
Finally, some discussions about the estimation of atoms parameters are necessary. For each selected scale , the slopes and the locations can be derived using a fast matching pursuit algorithm from , as it is described in Appendix II and in [42]. It selects atoms by processing the modulus maxima of the transform from the largest to the smallest one. The computational effort of the pursuit algorithm is low since its dictionary contains just one element, i.e., the basic atom. The decays can be derived by comparing the slopes of the corresponding atoms at two successive scales. More precisely, the slopes of corresponding atoms at two different scales and , can be written as and since they correspond to the same atom that has been evaluated at two different scales (see (10)). A direct comparison between and gives , that is

(12)

Algorithm
Let be a prefixed maximum value for the adopted scales and let be a discretization step.
1) Perform the continuous wavelet transform (CWT) of at scales , where .
2) Let be the vector containing the number of modulus maxima. Sort this vector in increasing order achieving the sorted vector nMAX.
3) Let be the scale such that . Compute atoms locations and slopes by applying the algorithm described in Appendix II, where is the input vector, and retain those atoms whose slope exceeds . Let be the set of the extracted locations.
4) Set . For
• Let be the set of indices such that . Select the scale such that . ( is the scale having the least number of maxima among the scales that satisfy the previous condition).
• Compute atoms' slopes and locations of and using the algorithm in Appendix II. The selected atoms locations must fall in the cone of influence of each element of , i.e., such that .

Authorized licensd use limted to: IE Xplore. Downlade on May 13,20 at 1:452 UTC from IE Xplore. Restricon aply.
426 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010

Fig. 7. Block scheme of the algorithm for transients detection that is described in Section II-C.

Estimate the corresponding decays using (12) as follows:

• Build the support signal as in (11), using (They are the output of the previous step).
• Compute the dyadic undecimated discrete wavelet transform (UDWT) of and extract the detail band at scale level , i.e., , where is the shifted and scaled mother wavelet.
5) Compute the low-pass component at scale level of , i.e., , where is the shifted and scaled scaling function.
6) Invert the UDWT using and and let be the recovered signal.
A block scheme of the algorithm is in Fig. 7.
The contribution of the tonal component is less evident at scales having the smallest number of modulus maxima. For that reason, this scale is selected first (step 3 of the algorithm) and it is used for determining the time interval where it is reasonable to look for transients contribution at the remaining scales. This allows a faster computation of atoms parameters, since it is limited to a smaller domain (see Appendix II), and a better precision of the results. In fact, in the last point of step 4 all the wavelet coefficients corresponding to the transient component are recovered. This makes it possible to avoid, or at least reduce, artifacts due to the rough cut off of information that characterizes thresholding-based approaches [42].
It is also worth stressing the fact that step 4 of the Algorithm can be relaxed. In fact, the support signal computed at scale , allows the prediction of more than one dyadic scale. In other words, in the last point of step 4, it is possible to retain more than one detail band. This guarantees a further reduction of the computational effort. Nonetheless, this could also affect the precision of the prediction of the transient layer since the eventual estimation error of atoms parameters in the second point of step 4 would be spread over more than one scale.

C. Complexity
This section is devoted to the computation of the complexity of the proposed algorithm. If denotes the length of the signal and is the number of involved scales for the evaluation of the CWT, the main term of the overall complexity is linear with respect to , since it is . In the following, we will estimate the complexity of each step of the Algorithm presented in the previous section.
Step 1) The computation of the CWT requires convolutions of the signal with the mother wavelet whose support is , where , and . Hence, the number of operations for a single scale is , while for the employed scales it becomes , that is .
Step 2) This step requires a linear search of the modulus maxima of the CWT for each selected scale and a simultaneous sorting of the found numbers. Hence, it implies , where comparisons are required for the search of modulus maxima while the remaining comparisons are necessary in the sorting operation.
Step 3) This step requires first operations for sorting the amplitudes of the modulus maxima of and then for the linear estimation of the parameter for each acceptable atom (as it is the scalar product of the atom function model and the transform; see the algorithm in Appendix II). Hence, the total amount of operations is , where is the number of the found atoms ( ).
Step 4)
• This point corresponds to a linear search in the vector nMAX and it requires comparisons in the worst case.
• The estimation of atoms location and slope requires operations (see Step 3). In this case, the estimation algorithm runs twice, once for the scale and once for the subsequent scale . For each detected atom, three additional operations are required for the computation of the corresponding decay through (12). Hence, the total amount of operations is .
• The number of operations required by the construction of the support signal is linear with respect to the number of the found atoms and it is .
• The computation of the th dyadic band of the UDWT of the support signal consists of a convolution and it requires operations.
All the above operations are repeated times, then the total amount of operations for Step 4 is .
Step 5) It requires the convolution of with the scaling function with support , that corresponds to operations.
Step 6) The inversion of the UDWT requires successive convolutions with wavelets whose support is , that is operations, and one convolution with a scaling function with support , that is operations. Hence, are the operations required for this step.
Summing up, the global complexity of the proposed algorithm is .

III. EXPERIMENTAL RESULTS
The proposed transients detection algorithm has been tested on a populated database of different audio signals. The presentation and discussions of the achieved results will be the topic of the first subsection (Section III-A), while their benefits in the subsequent extraction of the quasi-sinusoidal part will be partially investigated in Section III-B. Comments and discussions about the reduction of the computational effort are finally included in Section III-C.

A. Transients Detection
The results presented in this paper concern four commonly used test audio signals (sampled at 44.1 kHz and stored at 16 bits per sample) shown in Fig. 8: castanets (2-s length), glockenspiel (5 s), mamavatu (23 s) and xylophone (1 s). In all tests, a 3/9 spline biorthogonal mother wavelet has been used, is the maximum allowed scale, while is the discretization step for the scale interval (see Algorithm in Section II-C). The selected signals have different harmonic characteristics and a different transients distribution over the whole signal length. That is why the detection algorithm identifies transients at different scales. More precisely, the selected scales have been for glockenspiel, for castanets, for mamavatu, while for xylophone. As Fig. 8 shows, the algorithm is able to almost precisely detect transients locations without requiring any user intervention in the tuning of the involved parameters. In fact, it is able to automatically adapt each of its steps to the signal under examination.

Fig. 8. Clockwise order: Castanets, Mamavatu, Xylophone, and Glockenspiel audio signals, their spectrogram and their transient layer recovered by the proposed algorithm.

There are not any objective measures for evaluating the final result. That is why the evaluations of five listeners, either musicians or with experience in the field of audio evaluation, have been considered. The perceptual quality of the estimated transient layers is good: the rhythm is well recovered with a good match with both the time duration and location; the noise effect is limited and it is absent in most cases, such as castanets, glockenspiel and xylophone. For the mamavatu signal (23 s), apart from the first beats, where a transient in the percussive rhythm is missed due to the estimation of a negative decay in the time-scale plane, the remaining beats are well recovered. In the last part (last 9 s), some noise due to the interaction with the voice can partially alter transients reconstruction producing a faint and smeared sound. Table I confirms these observations. It compares the number of perceived transients with the correctly detected ones. The detection is correct for all test signals except for mamavatu. In this case, some transients are discarded by the algorithm due to a bad estimation of their decay in the time-scale plane. On the contrary, as Fig. 13 shows, detection results for glockenspiel signals are comparable to the ones of the approach in [13].
In order to better evaluate the performance of the proposed model, two additional signals, shown in Fig. 9, have been considered. In the first case, transients are four simple and isolated notes of a xylophone (10 s, 44.1 kHz, 16 bits). As Fig. 9 shows, transients represent the most "visible" and perceptible part of the signal and then their detection is almost straightforward. On the contrary, in the second case transients are due to four tolls (11 s, 44.1 kHz, 16 bits). Transients are in the first half of the signal but tonal component energy hides most of their contribution. Even in this case they are correctly identified, especially if one considers that the significant quasi-sinusoidal content in a large set of scales can affect the estimation of atoms parameters. In fact, one of the main peculiarities of the proposed approach is the prediction of transient contribution even at scales that are dominated by the tonal component. Fig. 10 shows the wavelet transform computed at dyadic scales of the glockenspiel signal. As it can be observed, some scales mostly emphasize the tonal part. Nonetheless, transients are still recovered since the evolution laws allow the prediction of their location and energy from those scales where signal transient contribution is not negligible. The same happens in presence of noise, as shown in Fig. 11. The glockenspiel signal has been corrupted by 30% of zero mean Gaussian noise (left) and sinusoidal noise (sum of sinusoidal waves) (right). As it can be observed, detection results are not significantly affected by either kind of noise. In the first case, noise is rejected because of its negative decay parameters; in the latter case, noise is discarded because of its harmonic nature.

Fig. 9. (Left) Four isolated notes of a xylophone: original signal (top), its spectrogram (middle), detected transients (bottom). (Right) Four subsequent tolls. In the latter case, transient energy is comparable to that of the tonal component.

Finally, two additional signals have been considered to show the adaptivity of the method to different kinds of subjects: a jazz signal (5 s, 44.1 kHz, 16 bits) and a simple speech signal (word test). As Fig. 12 shows, transients are well detected in both cases and annoying artifacts do not appear.


Fig. 10. From top to bottom: Glockenspiel signal and its wavelet transform computed at dyadic scales (left). Reconstructed transient component and its wavelet
transform estimated using the proposed model (right).

Fig. 11. (Left) Glockenspiel signal corrupted by white Gaussian noise and the extracted transient component using the proposed model. (Right) Glockenspiel
signal corrupted by sinusoidal noise and the extracted transient component using the proposed model.

Fig. 12. Jazz signal (left) and Speech signal (word “test”) (right), their spectrograms and the corresponding transient component extracted by the proposed model.

B. Benefits for Tonal Component Extraction

Tonal component extraction is outside the scope of the present work. Nevertheless, for the sake of completeness, we have subtracted the estimated transient component from the signal. Then, we have selected the most significant DCT coefficients of the resulting signal to extract the tonal layer. Finally, the residual component has been derived. As can be observed in Fig. 13, the residual does not show a significant contribution at the transient locations, nor does the tonal layer retain more than the necessary energy at almost all transient locations.

TABLE I
NUMBER OF PERCEIVED TRANSIENTS IN THE FOUR TEST SIGNALS IN FIG. 8, NUMBER OF CORRECTLY DETECTED TRANSIENTS USING THE MULTISCALE DETECTION ALGORITHM, AND NUMBER OF FALSE POSITIVES

Fig. 13. From top to bottom: Glockenspiel signal, transient component detected by the proposed algorithm, tonal component extracted by thresholding the DCT of the nontransient signal, residual component, transient component detected by the algorithm in [13].

Fig. 14. Energy versus threshold value of the tonal component of Glockenspiel extracted from the nontransient signal (the signal minus the estimated transient component) through a piecewise DCT using transient locations (dash-dotted), a piecewise DCT using uniform intervals (dotted), and the global DCT (solid).
Fig. 15. (Top) Transient component detected using the complete version of the proposed algorithm. (Bottom) Transient component detected using a faster version of the proposed algorithm; see Section II-C for details.

A correct transient detection provides an adaptive splitting of the signal into distinct stationary pieces. It is therefore reasonable to use this information to guide any subsequent processing of the tonal component. To further test the quality of the detected locations, we have split the time interval into sub-intervals whose endpoints are the detected transient locations. Then, a DCT-based analysis has been performed separately in each interval. This kind of processing refines the estimation of the tonal component, since it better fits the signal characteristics. As a result, it is more robust to the thresholding of DCT coefficients. Fig. 14 compares the energy of the tonal component versus the threshold value for a piecewise DCT using transient locations, a piecewise DCT using uniform intervals, and the global DCT. For the same threshold, the first method preserves a greater and well-localized amount of energy in the frequency domain.
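The comparison between a global DCT and a DCT computed separately on the intervals delimited by the transient locations can be illustrated with a small, self-contained sketch. It is not the authors' Matlab implementation: the toy signal (two stationary cosines that change at sample 32), the threshold value, the uniform split points, and all function names are assumptions chosen for illustration, and the orthonormal DCT-II is evaluated directly for clarity rather than speed.

```python
import math

def dct_ortho(x):
    """Orthonormal DCT-II of a list of floats (direct O(N^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def retained_energy(coeffs, threshold):
    """Energy of the coefficients whose magnitude exceeds the threshold."""
    return sum(c * c for c in coeffs if abs(c) > threshold)

def piecewise_retained_energy(x, boundaries, threshold):
    """Threshold an orthonormal DCT computed separately on each sub-interval.

    `boundaries` holds the interior split points (e.g., transient locations).
    """
    edges = [0] + sorted(boundaries) + [len(x)]
    return sum(retained_energy(dct_ortho(x[a:b]), threshold)
               for a, b in zip(edges[:-1], edges[1:]) if b > a)

# Hypothetical toy signal: two stationary pieces of 32 samples each,
# with an abrupt change (the "transient location") at n = 32.
half = 32
x = ([math.cos(math.pi * (n + 0.5) * 4 / half) for n in range(half)] +
     [math.cos(math.pi * (n + 0.5) * 9 / half) for n in range(half)])

threshold = 3.0  # assumed value, for illustration only
e_global = retained_energy(dct_ortho(x), threshold)
e_transient_split = piecewise_retained_energy(x, [32], threshold)
e_uniform_split = piecewise_retained_energy(x, [21, 42], threshold)

print("global DCT:              ", e_global)
print("split at transient:      ", e_transient_split)
print("uniform split (3 pieces):", e_uniform_split)
```

Because the transform is orthonormal, the retained energy can be compared directly across the three decompositions. In this toy case, splitting at the change point makes each piece exactly one DCT basis function, so all of the signal energy survives the threshold, whereas the global DCT spreads part of it over coefficients that fall below the threshold.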
C. Computing Time

The computational effort of the algorithm is moderately low. Its complexity is linear with respect to the signal length: it requires about 1 min and 45 s on a machine with a 2-GHz CPU and 4 GB of RAM to process a signal whose duration is 5 s, using a non-optimized Matlab implementation. It is worth noting that the computing time becomes 1 min and 10 s whenever just one support signal is used for predicting all the necessary dyadic scales in step 4 of the algorithm. However, in this case there is less precision in the reconstructed transient layer, since numerical errors in the estimation of the decay exponents can affect the prediction. As a result, less energy is assigned to transients. A possible result is shown in Fig. 15. In this case, all transients are detected, but the energy of some of them does not exactly match the actual one: for example, the 12th and 14th transients are lower than the 11th and 13th ones.

IV. CONCLUSION

In this paper, a novel multiscale method for the extraction of the transient component in audio signals has been presented. The advantages of the proposed method are: 1) transient detection without pre-analysis of the tonal component; 2) a straightforward time decomposition of the signal in which to analyze the tonal component; 3) a formal rule for constructing trees in the wavelet domain; 4) full adaptivity and no need for user interaction.
The proposed model is also able to process other kinds of signals. In fact, as widely shown in [9], the hybrid structure can be recognized in the 1-D representation of the contour of a 2-D shape. The resulting signal is locally stationary and contains information about the local changes of the shape (transients), which can correspond either to natural variations of the curvature or to the spatial discretization of the shape (tonal component). The results achieved are encouraging, and refinements of the model will be investigated in the future. Part of the effort will be devoted to further reducing the computing time of the algorithm in order to make it attractive for real-time applications. In particular, the use of rational filter-based transforms is under investigation, since they allow an adaptive sampling in the time domain while still preserving a robust and unambiguous scale linking. Moreover, particular attention will be given to the estimation of atom parameters, since it may not be robust whenever the tonal component and the noise significantly exceed the transient one. Some perceptual measures could also be introduced in the estimation of these parameters. Finally, the use of the same multiscale concepts in the frequency domain will be investigated to straightforwardly catch atoms in the spectrum.

APPENDIX I
SUPPORT SIGNAL

Proposition: The atomic approximation of corresponds to the wavelet details of a piecewise linear function .
Proof: Let be the atomic representation of at scale , where . Each atom has been defined as the wavelet transform of an infinite ramp signal with a singularity in , i.e., .
The linearity of the wavelet transform operator gives
and

APPENDIX II
ALGORITHM FOR THE ESTIMATION OF ATOMS’ SLOPE AND LOCATION

Let be the wavelet transform of a signal evaluated at scale . Let be a vector of the same dimension and let it be initialized to zero.
1) Detect the locations of modulus maxima of and sort their amplitude in decreasing order. Store the locations of the sorted extrema in the vector and let be its cardinality.
2) Set and .
3) Estimate the amplitude of the -th atom via least squares from in the domain , using the atom’s function , i.e.,
4) Let be the estimated atom. Put and .
5) Set .
6) If then stop, else.
7) If then set and go to 3), otherwise go to 5).
8) Set and let be the output of the algorithm.
Note that step 1 is modified as follows whenever the algorithm is used in step 4) of the algorithm in Section II-C.
1) Detect locations of extrema of , select those locations such that and sort their amplitude in decreasing order. Store the locations of the sorted extrema in the vector and let be its cardinality.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees and the guest editor for their useful suggestions and insightful comments that greatly contributed to the improvement of this paper.

REFERENCES
[1] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler, “A tutorial on onset detection in music signals,” IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 1035–1047, Sep. 2005.
[2] J. Foote, “Automatic audio segmentation using a measure of audio novelty,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME2000), New York, Jul. 2000, vol. I, pp. 452–455.
[3] L. Daudet, “A review on techniques for the extraction of transients in musical signals,” Comput. Music Modeling and Retrieval, pp. 219–232, 2005.
[4] L. Daudet, “Sparse and structured decompositions of signals with the molecular matching pursuit,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 5, pp. 1808–1816, Sep. 2006.
[5] L. Daudet and B. Torresani, “Hybrid representations for audiophonic signal encoding,” Signal Process., Special Iss. Image Video Coding Beyond Standards, vol. 82, no. 11, pp. 1595–1617, 2002.
[6] C. Duxbury, M. E. Davies, and M. B. Sandler, “Separation of transient information in musical audio using multiresolution analysis techniques,” in Proc. 4th Int. Workshop Digital Audio Effects, Limerick, Ireland, 2001, DAFX01.


[7] C. Duxbury, N. Chetry, M. Sandler, and M. E. Davies, “An efficient two-stage implementation of harmonic matching pursuit,” in Proc. EUSIPCO 2004.
[8] C. Tantibundhit, J. R. Boston, C. C. Li, J. D. Durrant, S. Shaiman, K. Kovacyk, and A. El-Jaraoudi, “Speech enhancement using transient speech components,” in Proc. IEEE ICASSP 2006, pp. 833–836.
[9] V. Bruni and D. Vitulano, “Transients detection in the time scale domain, special issue on lecture notes in computer science,” in Proc. ICISP’08, Cherbourg-Octeville, France, Jul. 2008, vol. 5099/2008, pp. 254–262.
[10] S. Molla and B. Torresani, “An hybrid audio scheme using hidden Markov models of waveforms,” Appl. Comput. Harmonic Analysis, vol. 18, no. 2, pp. 137–166, 2005.
[11] S. Molla and B. Torresani, “Determining local transientness in audio signals,” IEEE Signal Process. Lett., vol. 11, no. 7, pp. 625–628, Jul. 2004.
[12] F. X. Nsabimana and U. Zolzer, “Audio signal decomposition for pitch and time scaling,” in Proc. IEEE-ISCCSP, Malta, Mar. 2008, pp. 1285–1290.
[13] F. Jaillet and B. Torresani, “Time-frequency jigsaw puzzle: Adaptive multiwindow and multilayered Gabor representations,” Int. J. Wavelets Multiresolution Inf. Process., vol. 2, no. 5, pp. 293–316, 2007.
[14] S. W. Hainsworth, M. D. Macleod, and P. J. Wolfe, “Analysis of reassigned spectrograms for musical transcription,” in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., New Paltz, NY, Oct. 2001.
[15] S. N. Levine and J. O. Smith, III, “A Sines+Transients+Noise audio representation for data compression and time/pitch scale modifications,” in Proc. 105th Audio Eng. Soc. Conv., 1998.
[16] S. J. Godsill, A. T. Cemgil, C. Fevotte, and P. J. Wolfe, “Bayesian computational methods for sparse audio and music processing,” in Proc. 15th Eur. Signal Process. Conf., EURASIP, 2007.
[17] H. Thornburg, D. Swaminathan, T. Ingalls, and R. Leistikow, “Joint segmentation and temporal structure inference for partially-observed event sequences,” in Proc. 8th IEEE Int. Workshop Multimedia Signal Process. (MMSP-06), Victoria, BC, Canada, 2006, pp. 41–44.
[18] S. Grofit and Y. Lavner, “Time-scale modification of audio signals using enhanced WSOLA with management of transients,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 106–115, Jan. 2008.
[19] M. Davy and S. Godsill, Bayesian Harmonic Models for Musical Pitch Estimation and Analysis, Cambridge Univ., Cambridge, U.K., 2002, Tech. Rep. CUED/F-INFENG/TR.431.
[20] S. Mallat, A Wavelet Tour of Signal Processing. San Diego, CA: Academic, 1999.
[21] M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Application. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[22] M. Basseville and A. Benveniste, “Sequential detection of abrupt changes in spectral changes of digital signals,” IEEE Trans. Inf. Theory, vol. IT-29, no. 5, pp. 709–724, Sep. 1983.
[23] T. Jehan, “Musical signal parameter estimation,” M.S. thesis, Univ. of California, Berkeley, CA, 1997.
[24] H. Thornburg and F. Gouyon, “A flexible analysis-synthesis method for transients,” in Proc. Int. Comput. Music Conf. (ICMC-2000), Berlin, Germany, 2000, pp. 400–403.
[25] A. von Brandt, “Detecting and estimating parameter jumps using ladder algorithms and likelihood ratio test,” in Proc. ICASSP, Boston, MA, 1983, pp. 1017–1020.
[26] X. Serra, Musical Signal Processing. Lisse, The Netherlands: Swets & Zeitlinger, 1997, ch. 3.
[27] S. A. Abdallah and M. D. Plumbley, “Probability as metadata: Event detection in music using ICA as a conditional density model,” in Proc. 4th Int. Symp. Ind. Compon. Anal. Signal Separation (ICA2003), Nara, Japan, 2003, pp. 233–238.
[28] S. A. Abdallah and M. D. Plumbley, “If edges are the independent components of natural scenes, what are the independent components of natural sounds?,” in Proc. 3rd Int. Conf. Ind. Compon. Anal. Signal Separation (ICA2001), San Diego, CA, 2001, pp. 534–539.
[29] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc. Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[30] L. Daudet, “Transients modeling by pruned wavelet trees,” in Proc. Int. Comput. Music Conf., Havana, Cuba, 2001, pp. 18–21.
[31] M. Goto and Y. Muraoka, “Beat tracking based on multiple agent architecture: A real time beat tracking system for audio signals,” in Proc. 2nd Int. Conf. Multiagent Systems, Dec. 1996, pp. 103–110.
[32] E. D. Scheirer, “Tempo and beat analysis of acoustic musical signals,” J. Acoust. Soc. Amer., vol. 103, no. 1, pp. 588–601, Jan. 1998.
[33] A. Klapuri, “Sound onset detection by applying psychoacoustic knowledge,” in Proc. IEEE ICASSP’99, Phoenix, AZ, 1999, pp. 115–118.
[34] C. Duxbury, M. Sandler, and M. Davies, “A hybrid approach to musical note onset detection,” in Proc. Digital Audio Effects Conf., Hamburg, Germany, 2002, pp. 33–38.
[35] K. Fitz, L. Haken, and P. Christensen, “Transient preservation under transformation in an additive sound model,” in Proc. Int. Comput. Music Conf., San Francisco, CA, 2000, pp. 392–395.
[36] P. Sircar, K. Prasad, and B. Harshavardhan, “Analysis of multicomponent speech-like signals by continuous wavelet transform-based technique,” in Proc. EUSIPCO’06, Florence, Italy, Sep. 8, 2006.
[37] M. Davy and S. Godsill, “Detection of abrupt spectral changes using support vector machines, An application to audio signal segmentation,” in Proc. IEEE ICASSP’02, Orlando, FL, 2002, pp. 1313–1316.
[38] R. Gribonval and E. Bacry, “Harmonic decomposition of audio signals with matching pursuit,” IEEE Trans. Signal Process., vol. 51, no. 1, pp. 101–111, Jan. 2005.
[39] L. Rankine, N. Stevenson, M. Mesbah, and B. Boashash, “A quantitative comparison of non-parametric time-frequency representations,” in Proc. EUSIPCO’05, Antalya, Turkey, Sep. 4–8, 2005.
[40] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, “Wavelet-based signal processing using hidden Markov models,” IEEE Trans. Signal Process., Special Iss. Filter Banks, vol. 46, pp. 886–902, Apr. 1998.
[41] V. Bruni, B. Piccoli, and D. Vitulano, “A fast computation method for time scale signal denoising,” in Signal Image and Video Proc. New York: Springer, 2008.
[42] V. Bruni and D. Vitulano, “Wavelet based signal denoising via simple singularities approximation,” Signal Process., Elsevier Science, vol. 86, pp. 859–876, 2006.
[43] S. Mallat and W. L. Hwang, “Singularity detection and processing with wavelet,” IEEE Trans. Inf. Theory, vol. 38, pp. 617–643, 1992.

Vittoria Bruni received the degree in mathematics from the University of Rome “La Sapienza,” Rome, Italy, in 2001 and the Ph.D. degree in applied mathematics from the Department of Mathematical Methods and Models for Applied Sciences, University of Rome “La Sapienza,” in 2006.
Her research interests include image compression, denoising and restoration, wavelets theory and applications, and pattern recognition with applications in the field of cultural heritage. She is coauthor of 14 papers published in international journals and 29 papers published in the proceedings of international conferences. She is also a referee for various international journals. Since September 2001 she has been collaborating with the Institute for the Application of Calculus (IAC), National Research Council, Rome, within national and international projects. She is currently a Lecturer in the Department of Mathematics, degree in “Mathematical Models for Image and Signal Processing,” University of Rome “Tor Vergata.”

Silvia Marconi was born in Rome, Italy. She received the degree in mathematics and the Ph.D. degree in mathematical models and methods for technology and society from the University of Rome “La Sapienza,” Rome, Italy. She elaborated both theses at the Institute for the Applications of the Calculus (IAC), National Research Council (CNR), Rome, where she also performed a collaboration on multiscale analysis of contours of shapes for the description of blotches on photographic prints of historical interest.
She also performed a teaching activity in a secondary school and tutoring courses in mathematical analysis at the University of Rome “La Sapienza.”


Domenico Vitulano (M’09) received the physics degree from the University of Naples “Federico II,” Naples, Italy, in 1993 and the M.Sc. degree (summa cum laude) in information and communication technology from IIASS “R.R. Caianiello” Institute of Vietri sul Mare, Salerno, Italy, in 1997.
Since 1995, he has been with the Institute for the Application of Calculus, National Council of Research, Rome, Italy, and since 2001 he has been a permanent researcher. His research interests include pattern recognition, image and data compression, indexing, and computer vision. He has been involved in various national and international research projects concerning cultural heritage. He is currently head of the CNR research line (RSTL) entitled “Multiscale Analysis for Complicated Shapes Recognition” (DG.RSTL.004.009) and of the CNR research line entitled “Analysis and Synthesis of Etherogenous Data for a Computer-Aided Monitoring of Cultural Heritage” (PC.P03.008). He is currently a Lecturer at the University of Rome “Tor Vergata,” Department of Mathematics, degree in “Mathematical Processing of Images and Signals.” He is author and coauthor of more than 50 scientific papers on journals, tutorials, monographs, and proceedings of workshops. He has been a referee for various international conferences and journals.
