Audio
RMS
Mel-Frequency Cepstrum Coefficients (MFCCs)
Audio Visual
Mixelgrams (Hershey & Movellan, 2000)
RMS
Root Mean Square audio feature: a measure of the average amplitude of the audio signal.

$\mathrm{RMS} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} a_t^2}$

where $a_t$ is the audio amplitude (i.e., the raw audio sample) at time t.
IKAROS module: RMSAudio
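The RMS feature can be sketched in a few lines (Python here as an illustrative stand-in; the actual IKAROS RMSAudio module is not reproduced):

```python
import math

def rms(samples):
    """Root mean square of a sequence of raw audio samples a_t."""
    n = len(samples)
    return math.sqrt(sum(a * a for a in samples) / n)

# A full-scale square wave has RMS equal to its amplitude:
print(rms([1.0, -1.0, 1.0, -1.0]))  # 1.0
```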
MFCCs
Implementation in IKAROS: ported from Marsyas by Eric Mislivec
Marsyas: http://sourceforge.net/projects/marsyas
Mel-Frequency Scale
A linear frequency spacing below ca. 1,000 Hz; a logarithmic spacing above ca. 1,000 Hz. Filters corresponding to these spacings seem to capture phonetically important characteristics of speech (perhaps echoing the frequency resolution of the cochlea?).
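One common Hz-to-mel mapping (the exact constants vary across toolkits, so treat this as one representative choice, not necessarily the one Marsyas uses) makes the linear-then-logarithmic behavior explicit:

```python
import math

def hz_to_mel(f):
    """A widely used Hz-to-mel formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

# Roughly linear at low frequencies (1,000 Hz maps to about 1,000 mel),
# strongly compressive at high frequencies (8,000 Hz maps to far less
# than 8x the 1 kHz value):
print(hz_to_mel(1000.0))
print(hz_to_mel(8000.0))
```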
Each of the basis functions is uncorrelated (orthogonal and orthonormal). The DCT generates as many component scalar values (numbers) as are present in the original signal:

$F(u) = \sqrt{2/M}\, C(u) \sum_{i=0}^{M-1} \cos\frac{(2i+1)u\pi}{2M}\, f(i)$

with inverse

$f(i) = \sqrt{2/M} \sum_{u=0}^{M-1} C(u) \cos\frac{(2i+1)u\pi}{2M}\, F(u)$

where f(i) is the original discrete-valued signal, F(u) is the transformed signal with 0 <= u <= M-1, and $C(u) = 1/\sqrt{2}$ for u = 0 and C(u) = 1 otherwise.

Retaining only the first P < M components of F(u) corresponds to representing the lower-frequency components of the signal while dropping some of the higher-frequency components. This leads to a second (truncated) form of the inverse DCT:

$\tilde{f}(i) = \sqrt{2/M} \sum_{u=0}^{P-1} C(u) \cos\frac{(2i+1)u\pi}{2M}\, F(u)$
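The forward, inverse, and truncated-inverse transforms can be sketched directly from the equations (a minimal pure-Python illustration, not the Marsyas/IKAROS implementation):

```python
import math

def C(u):
    """DCT normalization factor: 1/sqrt(2) for u = 0, else 1."""
    return 1.0 / math.sqrt(2.0) if u == 0 else 1.0

def dct(f):
    """Forward DCT: F(u) = sqrt(2/M) C(u) sum_i cos((2i+1)u*pi / 2M) f(i)."""
    M = len(f)
    return [math.sqrt(2.0 / M) * C(u) *
            sum(f[i] * math.cos((2 * i + 1) * u * math.pi / (2 * M))
                for i in range(M))
            for u in range(M)]

def idct(F, P=None):
    """Inverse DCT; passing P < M gives the truncated (low-pass) form."""
    M = len(F)
    if P is None:
        P = M
    return [math.sqrt(2.0 / M) *
            sum(C(u) * F[u] * math.cos((2 * i + 1) * u * math.pi / (2 * M))
                for u in range(P))
            for i in range(M)]

signal = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
full = idct(dct(signal))         # recovers the signal (up to rounding)
smooth = idct(dct(signal), P=3)  # keeps only the 3 lowest-frequency components
```

The full round trip reconstructs the signal exactly because the basis is orthonormal; the P = 3 version is a smoothed approximation.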
DCT Example - 1
In our MFCC computation, the DCT is applied to the output of 40 mel-scale filters. Example (log10) output of the filters:
-2.586700, -2.627700, -2.086800, -2.100100, -2.319800, -1.975200, -2.178400, -2.195400, -1.953100, -2.231900, -2.021400, -1.933800, -1.966900, -1.514800, -1.646200, -1.530600, -1.488100, -2.062500, -2.286800, -2.348500, -2.538200, -2.696200, -2.764100, -2.852300, -2.950000, -2.843200, -2.454700, -2.438700, -2.655200, -2.318300, -2.457900, -3.171100, -3.413300, -2.628100, -2.558700, -3.296300, -3.576600, -3.560700, -3.462800, -3.396300
The DCT is then applied, and the first 13 DCT components are retained. These are the MFCCs.
DCT Example - 2
DCT result
-15.667092, 2.549819, -1.343900, -0.593737, -0.913674, 0.896987, 0.198794, -0.630580, -0.427731, 0.001723, -0.197247, 0.127791, 0.041553, -0.574495, 0.279874, -0.355358, 0.257603, -0.292258, 0.079028, 0.226440, -0.369133, 0.400663, -0.348537, -0.034636, 0.321552, -0.140994, 0.129148, -0.001110, 0.211364, 0.337528, 0.264207, 0.034994, -0.117453, -0.037960, -0.082142, 0.059513, 0.011227, 0.032282, 0.017767, -0.027099
First 13 (MFCCs)
-15.667092, 2.549819, -1.343900, -0.593737, -0.913674, 0.896987, 0.198794, -0.630580, -0.427731, 0.001723, -0.197247, 0.127791, 0.041553
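This step can be reproduced with the orthonormal DCT defined by the equations above (a Python sketch, not the Marsyas/IKAROS C++ code; with this normalization the DC coefficient comes out near the -15.667092 in the listing):

```python
import math

def dct(f):
    # Orthonormal DCT-II, matching the transform equations given earlier.
    M = len(f)
    return [math.sqrt(2.0 / M) * (1.0 / math.sqrt(2.0) if u == 0 else 1.0) *
            sum(f[i] * math.cos((2 * i + 1) * u * math.pi / (2 * M))
                for i in range(M))
            for u in range(M)]

# The 40 log10 mel-filter outputs from the example above:
log_filters = [
    -2.586700, -2.627700, -2.086800, -2.100100, -2.319800, -1.975200,
    -2.178400, -2.195400, -1.953100, -2.231900, -2.021400, -1.933800,
    -1.966900, -1.514800, -1.646200, -1.530600, -1.488100, -2.062500,
    -2.286800, -2.348500, -2.538200, -2.696200, -2.764100, -2.852300,
    -2.950000, -2.843200, -2.454700, -2.438700, -2.655200, -2.318300,
    -2.457900, -3.171100, -3.413300, -2.628100, -2.558700, -3.296300,
    -3.576600, -3.560700, -3.462800, -3.396300,
]

mfccs = dct(log_filters)[:13]  # keep the first 13 coefficients
```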
IKAROS Example - 1
Purpose
To demonstrate the computation of MFCC parameters from samples of audio.
Audio data: DogAudio.wav
DFT
(Mathematica)
A = Import["/Users/chris/Desktop/DogAudio_fft.data", "csv"]; ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> { 800, 200}]
Reconstruction
(Matlab + Mathematica)
A = Import["/Users/chris/Desktop/DogAudio_freqrecon.data", "csv"]; ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> { 800, 200}]
IKAROS Example - 2
Purpose
To demonstrate the computation of MFCC parameters from samples of audio contained in a QuickTime file.
Audio data: sCup.mov
DFT
(Mathematica)
A = Import["/Users/chris/Desktop/sCup_fft.data", "csv"]; ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> { 800, 200}]
Reconstruction
(Matlab + Mathematica)
A = Import["/Users/chris/Desktop/sCup_freqrecon.data", "csv"]; ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> { 800, 200}]
Can process raw or feature-processed audio and visual input. Effectively spatial visual processing; no spatial audio processing.
[Figure: visual and audio streams aligned over time; 1/2 s window]
Output: Mixelgram
Qualitative because mutual information is taken for each part of the visual scene (e.g., each pixel) at each time step. The Mixelgram module in IKAROS performs this computation.
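The per-pixel measure of Hershey & Movellan (2000) can be sketched as follows: under a Gaussian assumption, the mutual information between the audio feature and one pixel's intensity over a short window reduces to -1/2 log(1 - r^2), where r is their Pearson correlation. This is a hedged sketch of that idea; the IKAROS Mixelgram module's exact windowing and features may differ:

```python
import math

def pixel_audio_mi(pixel_series, audio_series):
    """Gaussian mutual information -0.5 * log(1 - r^2) between one pixel's
    intensity over a time window and the audio feature over the same window."""
    n = len(pixel_series)
    mx = sum(pixel_series) / n
    my = sum(audio_series) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(pixel_series, audio_series))
    sxx = sum((x - mx) ** 2 for x in pixel_series)
    syy = sum((y - my) ** 2 for y in audio_series)
    r = sxy / math.sqrt(sxx * syy)
    # Clamp to avoid log(0) when a window is perfectly correlated.
    return -0.5 * math.log(max(1.0 - r * r, 1e-12))

# Toy window: a pixel that tracks the audio scores a much larger MI
# than a nearly static, unrelated pixel.
audio = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.1, 0.9]
synced = [0.12, 0.88, 0.21, 0.79, 0.33, 0.70, 0.08, 0.92]
static = [0.5, 0.49, 0.51, 0.5, 0.52, 0.48, 0.5, 0.51]
```

Applying this to every pixel over a sliding window produces a mixelgram-style map of where the image is synchronized with the sound.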
Example XML file
Mixelgram/Example/Mixelgram_test.xml
Example Output
Movie of Output
Conclusions
Important to share not just ideas but also working system components (e.g., program code components). Let's have someone do Feature Processing Version #2 next year (and more developmental!)
REFERENCES
For Discrete Cosine Transform (DCT)
Li, Z. & Drew, M. S. (2004). Fundamentals of Multimedia. Upper Saddle River, NJ: Pearson Prentice Hall.
Hershey, J., & Movellan, J. (2000). Audio-vision: Using audiovisual synchrony to locate sounds. In S. A. Solla, T. K. Leen, & K. R. Müller (Eds.), Advances in Neural Information Processing Systems 12 (pp. 813-819). Cambridge, MA: MIT Press.
Kidron, E., Schechner, Y. Y., & Elad, M. (2005). Pixels that sound. Proceedings of the IEEE Computer Society International Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 88-96.
Prince, C. G. & Hollich, G. (2005). Synching models with infants: A perceptual-level model of infant audio-visual synchrony detection. Journal of Cognitive Systems Research, 6, 205-228. Internet: http://www.cprince.com/PubRes/JCSR04
Slaney, M. & Covell, M. (2001). FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. In: Proceedings of Neural Information Processing Society 13. Cambridge, MA: MIT Press.