Audio
RMS
Mel-Frequency Cepstrum Coefficients (MFCCs)
Audio Visual
Mixelgrams (Hershey & Movellan, 2000)
RMS
Root Mean Square audio feature: a measure of the average amplitude of the audio signal.

$\mathrm{RMS} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} a_t^2}$

where $a_t$ is the audio amplitude (i.e., the raw audio sample) at time t.
IKAROS module: RMSAudio
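The RMS feature can be sketched in a few lines (Python here as an illustrative stand-in; the actual IKAROS RMSAudio module is not reproduced):

```python
import math

def rms(samples):
    """Root mean square of a sequence of raw audio samples a_t."""
    n = len(samples)
    return math.sqrt(sum(a * a for a in samples) / n)

# A full-scale square wave has RMS equal to its amplitude:
print(rms([1.0, -1.0, 1.0, -1.0]))  # 1.0
```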
MFCCs
Implementation in IKAROS: ported from Marsyas by Eric Mislivec
Marsyas: http://sourceforge.net/projects/marsyas
Mel-Frequency Scale
A linear frequency spacing below ca. 1,000 Hz; a logarithmic spacing above ca. 1,000 Hz. Filters corresponding to these spacings seem to capture phonetically important characteristics of speech (perhaps echoing the frequency resolution of the cochlea?).
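One common Hz-to-mel mapping (the exact constants vary across toolkits, so treat this as one representative choice, not necessarily the one Marsyas uses) makes the linear-then-logarithmic behavior explicit:

```python
import math

def hz_to_mel(f):
    """A widely used Hz-to-mel formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

# Roughly linear at low frequencies (1,000 Hz maps to about 1,000 mel),
# strongly compressive at high frequencies (8,000 Hz maps to far less
# than 8x the 1 kHz value):
print(hz_to_mel(1000.0))
print(hz_to_mel(8000.0))
```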
Each of the basis functions is uncorrelated (orthogonal and orthonormal). The DCT generates as many component scalar values (numbers) as are present in the original signal:

$F(u) = \sqrt{2/M}\, C(u) \sum_{i=0}^{M-1} \cos\frac{(2i+1)u\pi}{2M}\, f(i)$

with inverse

$f(i) = \sqrt{2/M} \sum_{u=0}^{M-1} C(u) \cos\frac{(2i+1)u\pi}{2M}\, F(u)$

where f(i) is the original discrete-valued signal, F(u) is the transformed signal with 0 <= u <= M-1, and $C(u) = 1/\sqrt{2}$ for u = 0 and C(u) = 1 otherwise.

Retaining only the first P < M components of F(u) corresponds to representing the lower-frequency components of the signal while dropping some of the higher-frequency components. This leads to a second (truncated) form of the inverse DCT:

$\tilde{f}(i) = \sqrt{2/M} \sum_{u=0}^{P-1} C(u) \cos\frac{(2i+1)u\pi}{2M}\, F(u)$
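The forward, inverse, and truncated-inverse transforms can be sketched directly from the equations (a minimal pure-Python illustration, not the Marsyas/IKAROS implementation):

```python
import math

def C(u):
    """DCT normalization factor: 1/sqrt(2) for u = 0, else 1."""
    return 1.0 / math.sqrt(2.0) if u == 0 else 1.0

def dct(f):
    """Forward DCT: F(u) = sqrt(2/M) C(u) sum_i cos((2i+1)u*pi / 2M) f(i)."""
    M = len(f)
    return [math.sqrt(2.0 / M) * C(u) *
            sum(f[i] * math.cos((2 * i + 1) * u * math.pi / (2 * M))
                for i in range(M))
            for u in range(M)]

def idct(F, P=None):
    """Inverse DCT; passing P < M gives the truncated (low-pass) form."""
    M = len(F)
    if P is None:
        P = M
    return [math.sqrt(2.0 / M) *
            sum(C(u) * F[u] * math.cos((2 * i + 1) * u * math.pi / (2 * M))
                for u in range(P))
            for i in range(M)]

signal = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
full = idct(dct(signal))         # recovers the signal (up to rounding)
smooth = idct(dct(signal), P=3)  # keeps only the 3 lowest-frequency components
```

The full round trip reconstructs the signal exactly because the basis is orthonormal; the P = 3 version is a smoothed approximation.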
DCT Example - 1
In our MFCC computation, the DCT is applied to the output of 40 mel-scale filters. Example (log10) output of the filters:
-2.586700, -2.627700, -2.086800, -2.100100, -2.319800, -1.975200, -2.178400, -2.195400, -1.953100, -2.231900, -2.021400, -1.933800, -1.966900, -1.514800, -1.646200, -1.530600, -1.488100, -2.062500, -2.286800, -2.348500, -2.538200, -2.696200, -2.764100, -2.852300, -2.950000, -2.843200, -2.454700, -2.438700, -2.655200, -2.318300, -2.457900, -3.171100, -3.413300, -2.628100, -2.558700, -3.296300, -3.576600, -3.560700, -3.462800, -3.396300
The DCT is then applied, and the first 13 DCT components are retained. These are the MFCCs.
DCT Example - 2
DCT result
-15.667092, 2.549819, -1.343900, -0.593737, -0.913674, 0.896987, 0.198794, -0.630580, -0.427731, 0.001723, -0.197247, 0.127791, 0.041553, -0.574495, 0.279874, -0.355358, 0.257603, -0.292258, 0.079028, 0.226440, -0.369133, 0.400663, -0.348537, -0.034636, 0.321552, -0.140994, 0.129148, -0.001110, 0.211364, 0.337528, 0.264207, 0.034994, -0.117453, -0.037960, -0.082142, 0.059513, 0.011227, 0.032282, 0.017767, -0.027099
First 13 (MFCCs)
-15.667092, 2.549819, -1.343900, -0.593737, -0.913674, 0.896987, 0.198794, -0.630580, -0.427731, 0.001723, -0.197247, 0.127791, 0.041553
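This step can be reproduced with the orthonormal DCT defined by the equations above (a Python sketch, not the Marsyas/IKAROS C++ code; with this normalization the DC coefficient comes out near the -15.667092 in the listing):

```python
import math

def dct(f):
    # Orthonormal DCT-II, matching the transform equations given earlier.
    M = len(f)
    return [math.sqrt(2.0 / M) * (1.0 / math.sqrt(2.0) if u == 0 else 1.0) *
            sum(f[i] * math.cos((2 * i + 1) * u * math.pi / (2 * M))
                for i in range(M))
            for u in range(M)]

# The 40 log10 mel-filter outputs from the example above:
log_filters = [
    -2.586700, -2.627700, -2.086800, -2.100100, -2.319800, -1.975200,
    -2.178400, -2.195400, -1.953100, -2.231900, -2.021400, -1.933800,
    -1.966900, -1.514800, -1.646200, -1.530600, -1.488100, -2.062500,
    -2.286800, -2.348500, -2.538200, -2.696200, -2.764100, -2.852300,
    -2.950000, -2.843200, -2.454700, -2.438700, -2.655200, -2.318300,
    -2.457900, -3.171100, -3.413300, -2.628100, -2.558700, -3.296300,
    -3.576600, -3.560700, -3.462800, -3.396300,
]

mfccs = dct(log_filters)[:13]  # keep the first 13 coefficients
```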
IKAROS Example - 1
Purpose
To demonstrate the computation of MFCC parameters from samples of audio.
Audio data: DogAudio.wav
DFT
(Mathematica)
A = Import["/Users/chris/Desktop/DogAudio_fft.data", "csv"]; ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> { 800, 200}]
Reconstruction
(Matlab + Mathematica)
A = Import["/Users/chris/Desktop/DogAudio_freqrecon.data", "csv"]; ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> { 800, 200}]
IKAROS Example - 2
Purpose
To demonstrate the computation of MFCC parameters from samples of audio contained in a QuickTime file.
Audio data: sCup.mov
DFT
(Mathematica)
A = Import["/Users/chris/Desktop/sCup_fft.data", "csv"]; ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> { 800, 200}]
Reconstruction
(Matlab + Mathematica)
A = Import["/Users/chris/Desktop/sCup_freqrecon.data", "csv"]; ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> { 800, 200}]
Can process raw or feature-processed audio and visual input. Effectively spatial visual processing; no spatial audio processing.
[Figure: visual and audio streams aligned over time; 1/2 s window]
Output: Mixelgram
Qualitative because mutual information is taken for each part of the visual scene (e.g., each pixel) at each time step. The Mixelgram module in IKAROS performs this computation.
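The per-pixel measure of Hershey & Movellan (2000) can be sketched as follows: under a Gaussian assumption, the mutual information between the audio feature and one pixel's intensity over a short window reduces to -1/2 log(1 - r^2), where r is their Pearson correlation. This is a hedged sketch of that idea; the IKAROS Mixelgram module's exact windowing and features may differ:

```python
import math

def pixel_audio_mi(pixel_series, audio_series):
    """Gaussian mutual information -0.5 * log(1 - r^2) between one pixel's
    intensity over a time window and the audio feature over the same window."""
    n = len(pixel_series)
    mx = sum(pixel_series) / n
    my = sum(audio_series) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(pixel_series, audio_series))
    sxx = sum((x - mx) ** 2 for x in pixel_series)
    syy = sum((y - my) ** 2 for y in audio_series)
    r = sxy / math.sqrt(sxx * syy)
    # Clamp to avoid log(0) when a window is perfectly correlated.
    return -0.5 * math.log(max(1.0 - r * r, 1e-12))

# Toy window: a pixel that tracks the audio scores a much larger MI
# than a nearly static, unrelated pixel.
audio = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.1, 0.9]
synced = [0.12, 0.88, 0.21, 0.79, 0.33, 0.70, 0.08, 0.92]
static = [0.5, 0.49, 0.51, 0.5, 0.52, 0.48, 0.5, 0.51]
```

Applying this to every pixel over a sliding window produces a mixelgram-style map of where the image is synchronized with the sound.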
Example XML file
Mixelgram/Example/Mixelgram_test.xml
Example Output
Movie of Output
Conclusions
Important to share not just ideas but also working system components (e.g., program code components). Let's have someone do Feature Processing Version #2 next year (and more developmental!)
REFERENCES
For Discrete Cosine Transform (DCT)
Li, Z. & Drew, M. S. (2004). Fundamentals of Multimedia. Upper Saddle River, NJ: Pearson Prentice Hall.
Hershey, J., & Movellan, J. (2000). Audio-vision: Using audiovisual synchrony to locate sounds. In S. A. Solla, T. K. Leen, & K. R. Müller (Eds.), Advances in Neural Information Processing Systems 12 (pp. 813-819). Cambridge, MA: MIT Press.
Kidron, E., Schechner, Y. Y., & Elad, M. (2005). Pixels that sound. Proceedings of the IEEE Computer Society International Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 88-96.
Prince, C. G. & Hollich, G. (2005). Synching models with infants: A perceptual-level model of infant audio-visual synchrony detection. Journal of Cognitive Systems Research, 6, 205-228. Internet: http://www.cprince.com/PubRes/JCSR04
Slaney, M. & Covell, M. (2001). FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. In: Proceedings of Neural Information Processing Society 13. Cambridge, MA: MIT Press.