Acoustic Laboratory
November 2013, Caseros, Buenos Aires Province, Argentina
EFFECT OF DISTORTION ON SPEAKER RECOGNITION IN TELEPHONE LINE FREQUENCY BAND

AGUSTÍN Y. ARIAS (1)
(1) Universidad Nacional de Tres de Febrero, Buenos Aires, Argentina. agustin.arias@outlook.com
Abstract
This paper analyzes how harmonic distortion affects the correct recognition of speakers, with the aim of contributing to the advance of forensic acoustics. To this end, the dynamics of different voices were processed, gradually increasing the level of distortion, in order to simulate the actual quality of a telephone line. Several subjective listening tests were performed, in which fifteen subjects listened to the processed voices and decided which person each voice belonged to, comparing the distorted voices with undistorted reference voices. The influence of the Distortion Level and the Total Harmonic Distortion (THD) of a telephone line simulator is analyzed. In addition, vocal formants are studied using several sonograms to interpret the test results. The procedures, considerations and limitations of the test process are explained, and the conclusions and new lines of investigation are described.
1. INTRODUCTION
Speaker recognition is a broadly diversified discipline, in which various systems perform the task automatically or semi-automatically. These systems are used for purposes such as smart locks, access to private computer networks and mobile smart devices, among others. However, the objective of this study is directly related to the field of Forensic Acoustics, where the handling of computerized techniques by a skilled expert plays a fundamental role in producing accurate results and conclusions. Forensic Acoustics is one of the most complex research environments of the Scientific Police, mainly because of the multidisciplinary nature of its different approaches to analysis and the need to give its experts continued high-level training. While command of technology and of digital applications for analysis, calculation or processing is essential, involving a team of experts who specialize in different perspectives of study is even more important [1]. In the forensic analysis of voices it is common to work with poor signal quality in terms of dynamics, bandwidth, phase distortion and several types of noise (ambient, digital, quantization, modulation and electromagnetic noise), which is mainly related to the limitations imposed by the equipment and the transmission lines, whether physical or wireless [2]. All these factors are potential sources of error in the results produced by automatic speech recognition systems, so the judgment and skill of the expert in charge of the analysis are of fundamental importance.
Hirsch [3] analyzed the performance of various automatic speaker recognition systems using voice recordings filtered between 300 and 3400 Hz and contaminated with various background noises (interior of cars, trains and restaurants, among others). The SNR (Signal-to-Noise Ratio) of each voice plus background noise mixture was gradually modified, and the results show that automatic speaker recognition systems are unreliable when the background noise level is greater than or equal to the level of the vocal registers. On the other hand, the phase distortion produced by different elements of telephone lines (cables, filters, transformers and equalizers) was characterized by Steinberg [4] at Bell Laboratories. Several telephone lines were measured and the technological requirements were stated. In addition, Wang and Lim [5] studied the importance of phase distortion in the context of speech enhancement through listening tests. They used two analysis blocks in parallel to estimate the phase and magnitude spectra from both clean and noisy speech. By altering the amount of induced noise for phase and magnitude estimation separately, the structure can control the amounts of phase and magnitude distortion independently. Using this structure and carrying out listening tests, they concluded that it is unwarranted to make an effort to estimate the phase from noisy speech more accurately. However, the effect of the distortion level of a telephone line on a speaker recognition task has not been fully investigated. The objective of this research is to subjectively evaluate the difficulties that a listener has when performing a speaker recognition test using sound recordings with different levels of harmonic distortion and elimination of
spectral information. This manipulation of the sound recordings tries to simulate the behavior of low quality telephone transmission lines with regard to the transmission and reception of audio signals, as in police radio equipment and some mobile and landline telephones. Recordings of such poor sound quality are common in forensic acoustics, and they impair the performance of automated or semi-automated speaker recognition systems [6]. To quantify the different levels of distortion, the Total Harmonic Distortion (THD) parameter was chosen because it is one of the most important technical specifications of any equipment that handles audio signals. A subjective speaker recognition test was designed in order to study the effects of the THD. In the test, each subject had to compare filtered (telephone line simulator) and distorted vocal records with reference vocal records of high sound quality.
2. EXPERIMENTAL DESIGN
First, a filtering process that simulates the frequency response of a typical telephone system was designed. For this, a 6th order Butterworth band pass filter was used. The lower cutoff frequency was set to 300 Hz and the upper cutoff frequency to 2000 Hz, as can be seen in Figure 1 [7]. This filtering process was also used to perform a preliminary speaker recognition test with filtered signals without distortion, in order to analyze the influence of the filtering on the results of the main test, which uses both the filtering and the distortion processes. Two sonograms of the word "secuestro" are shown in Figure 2 to compare how the formants appear.
The energy of the first and second formants remains intact, which greatly favors recognition, since those are the most energetic formants that characterize the presence of vowels in speech. As mentioned above, the applied filter has a lower cutoff frequency of 300 Hz, which does not interfere much with the main formant regions of the vowels listed in Table 1 [8].
Figure 1. Telephone-like band pass filter
Figure 2. Original and filtered voice register sonograms (panels: original voice register, filtered voice register)
Table 1. Main formant regions of vowels
Vowel   Main formant region
u       200 - 400 Hz
o       400 - 600 Hz
a       800 - 1200 Hz
e       400 - 600 Hz and 2200 - 2600 Hz
The elimination of spectral content can be clearly seen in the sonogram of the filtered voice. Sonograms are a useful tool for analyzing the presence of formants. Formants are natural resonances of the vocal tract and appear as intensity peaks in the spectrum of a sound, corresponding to the concentration of energy around a particular frequency. Technically, formants are the frequency bands where most of the sound energy is concentrated.
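As a rough illustration of the telephone-like band pass filter of Figure 1, the sketch below evaluates the magnitude response of an analog Butterworth band pass with the cutoff frequencies stated above (300 and 2000 Hz). It is only a sketch under assumptions: the function name `butter_bandpass_gain_db` is hypothetical, the order 6 is taken here as the order of the low-pass prototype, and the filter actually used in the experiment may differ in its exact response.

```python
import math


def butter_bandpass_gain_db(freq_hz, low_hz=300.0, high_hz=2000.0, order=6):
    """Gain in dB of an analog Butterworth band pass at freq_hz.

    `order` is interpreted as the order of the low-pass prototype,
    which is an assumption (the paper only says "6th order").
    """
    center_sq = low_hz * high_hz      # squared geometric centre frequency
    bandwidth = high_hz - low_hz
    # Low-pass to band-pass frequency mapping, evaluated on the frequency axis
    omega = abs(freq_hz ** 2 - center_sq) / (bandwidth * freq_hz)
    # Butterworth magnitude: |H|^2 = 1 / (1 + omega^(2 * order))
    return -10.0 * math.log10(1.0 + omega ** (2 * order))


# Illustrative frequencies: below, at and above the pass band
for f in (100, 300, 775, 2000, 3400):
    print(f"{f:5d} Hz: {butter_bandpass_gain_db(f):8.1f} dB")
```

With this mapping the gain is about -3 dB at both cutoff frequencies and close to 0 dB around the geometric centre of the band, which is the behavior expected from a Butterworth design.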
Then, since the main subjective test was designed to study the accuracy of speaker recognition as the distortion is gradually varied, four levels of distortion were defined. The distortion system used for this task is a VST plugin available in Adobe Audition, which consists of a typical limiter that allows the user to modify its transfer function. The distortion systems were configured as follows:
- Distortion 1 (D1): Low distortion
  o Limiter: -20 dB
- Distortion 2 (D2): Moderate distortion
  o Limiter: -30 dB
- Distortion 3 (D3): High distortion
  o Limiter: -40 dB
- Distortion 4 (D4): Very high distortion
  o Limiter: -50 dB
Figure 3 shows the transfer functions of these four configurations.
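To make the four configurations more concrete, the following sketch models the limiter as an ideal hard clipper whose ceiling is the limiter level. This is an assumption made only for illustration: the actual transfer functions of the plugin are those of Figure 3, the levels are taken here as dB relative to full scale, and the names `LIMITER_DB`, `db_to_linear` and `hard_limit` are hypothetical.

```python
import math

# Limiter levels of the four configurations
# (interpreting them as dB relative to full scale is an assumption)
LIMITER_DB = {"D1": -20.0, "D2": -30.0, "D3": -40.0, "D4": -50.0}


def db_to_linear(level_db):
    """Convert an amplitude level in dB to a linear factor."""
    return 10.0 ** (level_db / 20.0)


def hard_limit(samples, limiter_db):
    """Clip every sample to the range [-ceiling, +ceiling]."""
    ceiling = db_to_linear(limiter_db)
    return [max(-ceiling, min(ceiling, s)) for s in samples]


# One period of a full-scale sine as a test signal
tone = [math.sin(2.0 * math.pi * n / 64) for n in range(64)]
for name, level_db in LIMITER_DB.items():
    clipped = hard_limit(tone, level_db)
    print(name, "peak after limiting:", max(abs(s) for s in clipped))
```

The lower the limiter level, the larger the portion of the waveform that is flattened, which is what generates the additional harmonic components.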
Figure 3. Distortion systems transfer functions (curves D1, D2, D3 and D4)
These distortion processes generate a large number of harmonic components, whose energy increases with the distortion level. This tends to impair the correct identification of formants, which may hinder the recognition of low-frequency signals (vowels). Figure 4 compares the sonograms of an original voice recording and of the same recording with the D1 and D4 distortion levels. In the D1 sonogram the main formants and the generated spectral harmonics are visually distinguishable, and the decrease in energy as the harmonic order increases can also be seen. For the D4 distortion level exactly the opposite occurs: it is very difficult to distinguish the formants because the difference in energy level between the first formant (first resonance frequency of the vocal tract) and the first harmonic is too small (about 2 dB). The same energy level difference was found between the upper harmonics. For the D1 distortion level, the differences found are about 5.3 and 6.4 dB. The signal energy above 3000 Hz is below -85 dB, so it can be neglected.
Figure 4. Sonograms comparison (panels: original voice register, D1 distorted voice register, D4 distorted voice register)
As mentioned above, the Total Harmonic Distortion (THD) [9] was used to quantify the four distortion levels. This parameter is calculated according to Eq. 1.
$$ THD = \frac{10^{H_{rms}/10}}{10^{IR_{rms}/10}} \times 100 \; [\%] \qquad (1) $$
where $H_{rms}$ is the root mean square value of all the harmonics and $IR_{rms}$ is the root mean square value of the original impulse in the impulse response of the filter + distortion system, both expressed in dB as in Table 2. The usefulness of this descriptor is that it relates the amount of harmonic energy generated by the distortion process to the energy of the input signal (original impulse). The results obtained from the analysis of each level of distortion are shown in Table 2.
Table 2. THD [%] of each filter + distortion system
Distortion level   Signal energy level [dB]           THD
                   Original impulse   Harmonics
D1                 -54.1              -72.8           1.35%
D2                 -53.69             -66.2           5.61%
D3                 -52.51             -60.5           15.89%
D4                 -51.37             -54.3           50.93%
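The THD values of Table 2 can be checked from the two energy levels reported for each distortion system. The sketch below assumes that both levels are in dB and that the THD is the ratio of the corresponding energies expressed as a percentage; the names `LEVELS_DB` and `thd_percent` are hypothetical.

```python
# Energy levels in dB transcribed from Table 2:
# distortion level -> (original impulse, harmonics)
LEVELS_DB = {
    "D1": (-54.10, -72.8),
    "D2": (-53.69, -66.2),
    "D3": (-52.51, -60.5),
    "D4": (-51.37, -54.3),
}


def thd_percent(impulse_db, harmonics_db):
    """THD as the harmonic-to-impulse energy ratio, in percent."""
    harmonic_energy = 10.0 ** (harmonics_db / 10.0)
    impulse_energy = 10.0 ** (impulse_db / 10.0)
    return 100.0 * harmonic_energy / impulse_energy


thd = {name: thd_percent(*levels) for name, levels in LEVELS_DB.items()}
for name, value in thd.items():
    print(f"{name}: THD = {value:.2f} %")
```

Under these assumptions the computed percentages agree with the THD column of Table 2 to within rounding.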
3. SIGNAL ACQUISITION AND PROCESSING
3.1. Voice signal recordings
Six different voices were recorded using high quality recording equipment to obtain "clean" signals with a high signal-to-noise ratio, a wide bandwidth and low harmonic distortion. The following list details the equipment and software employed:
- Notebook POSITIVO BGH C570
- M-Audio Fast Track audio interface
- RODE NT 2A microphone
- XLR-XLR cable
- Sound level meter Svantek 959 + microphone calibrator
- Microphone tripod
- Adobe Audition
  o Sample rate: 44100 Hz
  o Resolution: 16 bits
  o Channel mode: Mono
The recordings were performed in a room with the following acoustic characteristics:
- Background noise: 42 dBA
- Reverberation time: 0.4 s
Those values were measured with the sound level meter. The background noise was measured for 5 minutes, and the reverberation time T30 was obtained from a balloon burst, averaging the results of the 500, 1000 and 2000 Hz octave bands. In order to perform a test that represents the actual working conditions of a forensic expert, each person recorded five Spanish words commonly used in forensic acoustics: Dinero (money), Secuestro (kidnapping), Policía (police), Rehén (hostage) and Bomba (bomb). Finally, to minimize the temporal characteristics of each person's own speech, the speakers had to imitate the pronunciation of a reference recording, reproduced through headphones, that contains the five words with a specific rhythm.
3.2. Signal processing and considerations
Once the sound recordings of the six persons who contributed their voices had been obtained, several stages of signal processing were conducted. The goal of this step is to process some words of each speech signal to simulate the poor sound quality produced by certain telephone lines, as defined in Section 2. First, a "normalization" to 100% was applied to each recorded word to equalize the levels of all recordings, avoiding loudness differences between words and persons.
From this point, only three words (Dinero, Secuestro, Policía) were processed for each recorded voice, while the other two words (Rehén, Bomba) were not modified. The reason is that the subjective test compares original signals with distorted signals, and the words used as original signals must differ from the words used as distorted signals in order to, again, avoid any temporal characteristics of speech (rhythm and rate of speech, forms of pronunciation, particular accents, among others) that may help a listener to distinguish a particular speaker. These are the two considerations taken into account in the recording and processing of the voices to minimize the influence of temporal characteristics. Then, the four different levels of distortion were applied. Because this process consists of a typical limiter, the amplitude of the processed signals is strongly attenuated. For this reason, another normalization was applied to raise the loudness of the signals, but this time it was set to 60% because of the noisiness generated by the distortion, which could cause temporary hearing damage to the listener if the normalization value were set higher than 60%.
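The two normalization steps can be illustrated with a minimal sketch. It assumes that "normalization to N%" means scaling the signal so that its absolute peak equals N% of full scale, which is how peak normalization usually works in audio editors; the function name `normalize_peak` and the sample values are hypothetical.

```python
def normalize_peak(samples, percent):
    """Scale samples so that the absolute peak equals percent of full scale (1.0)."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)  # a silent signal cannot be normalized
    gain = (percent / 100.0) / peak
    return [s * gain for s in samples]


# Hypothetical word recording with a peak well below full scale
word = [0.02, -0.10, 0.25, -0.18, 0.05]

reference = normalize_peak(word, 100.0)   # first step: 100 % normalization
processed = normalize_peak(word, 60.0)    # level used after filtering and distortion

print("reference peak:", max(abs(s) for s in reference))
print("processed peak:", max(abs(s) for s in processed))
```

Note that the gain applies to the whole signal, so any background noise present in the recording is raised by the same factor as the voice.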
Figure 5. Block diagram of the entire signal processing (block labels: Log-Sine sweep, band pass filter, distortion system, inverse filter convolution).
Figure 6. Impulse response of a filter + distortion system (labels: original impulse and harmonics 1 to 5).
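The harmonics in Figure 6 appear separated in time because the measurement uses a Log-Sine sweep between 100 and 4500 Hz deconvolved with its inverse filter [10]. The sketch below generates such a sweep and its inverse filter and computes how far ahead of the original impulse each harmonic order is expected to appear. The sweep duration of 2 s and all function names are assumptions for illustration; the measurements in this work were made with the AURORA plugins, not with this code.

```python
import math

F_START, F_END = 100.0, 4500.0   # sweep limits stated in the text
SAMPLE_RATE = 44100              # sample rate used for the recordings
DURATION_S = 2.0                 # hypothetical sweep length


def log_sine_sweep(f_start, f_end, duration_s, sample_rate):
    """Exponential sine sweep whose frequency rises from f_start to f_end."""
    rate = math.log(f_end / f_start)
    scale = 2.0 * math.pi * f_start * duration_s / rate
    n_samples = int(duration_s * sample_rate)
    return [
        math.sin(scale * (math.exp(n / sample_rate / duration_s * rate) - 1.0))
        for n in range(n_samples)
    ]


def inverse_filter(sweep, f_start, f_end, duration_s, sample_rate):
    """Time-reversed sweep with an envelope that falls 6 dB per octave."""
    rate = math.log(f_end / f_start)
    return [
        s * math.exp(-n / sample_rate / duration_s * rate)
        for n, s in enumerate(reversed(sweep))
    ]


def harmonic_advance_s(order, f_start, f_end, duration_s):
    """Time by which the impulse of a harmonic order precedes the original impulse."""
    return duration_s * math.log(order) / math.log(f_end / f_start)


sweep = log_sine_sweep(F_START, F_END, DURATION_S, SAMPLE_RATE)
inverse = inverse_filter(sweep, F_START, F_END, DURATION_S, SAMPLE_RATE)
advances = [harmonic_advance_s(k, F_START, F_END, DURATION_S) for k in range(1, 6)]
for order, advance in enumerate(advances, start=1):
    print(f"order {order}: {advance:.3f} s before the original impulse")
```

Convolving the output of the system under test with `inverse` would give the impulse response; each harmonic order then shows up as a separate impulse ahead of the original one, so its energy can be measured independently.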
The entire process is shown in Figure 5. As explained above, the words "Dinero", "Secuestro" and "Policía" were normalized to 100%, then filtered, distorted and normalized again to 60%. The words "Rehén" and "Bomba" were only normalized to 100% because they were used as the reference signals in the test procedure. To calculate the THD values, the impulse response of each filter + distortion system was obtained by feeding a Log-Sine sweep (between 100 and 4500 Hz) to the input of the system and then convolving the output signal with the inverse filter using the AURORA plugins [10]. As shown in Figure 6, this process separates in time the different harmonics created by the distortion processes, which makes it possible to compare the energy of those harmonics with the energy of the original impulse.
4. TEST PROCEDURE
The subjective tests were performed by fifteen subjects, ten men and five women aged between 22 and 27 years. In all cases headphones were used to minimize the background noise of the rooms where the tests were performed, allowing the subjects to concentrate on listening without external interference. A typical test table is shown in Figure 7. The subjects first had to listen to the Reference 1 track, which contains the two unmodified words (Rehén, Bomba) spoken by one person. Then they had to listen to the six Voice tracks (A, B, C, D, E and F), each containing the three distorted words (Dinero, Secuestro, Policía), and decide which of the six Voice tracks corresponds to the same person as the Reference 1 track. This procedure was repeated for the rest of the Reference tracks. There were four tables like the one shown in Figure 7, each corresponding to a different level of distortion of the Voice tracks. The first test corresponds to Distortion D4, the second to Distortion D3, and so on. The subjects were allowed to repeat any track as many times as necessary, so the time required to complete the test varied from subject to subject. On average it took twenty minutes to complete the test.
Figure 7. Test table for the Distortion D1.
5. RESULTS
The results of the preliminary test (speaker recognition using filtered voices) are shown in Figure 8. They show that ten of the twelve subjects (83.33%) were able to recognize all voices, while the remaining two subjects (16.66%) achieved four recognitions.
Figure 8. Preliminary test. Amount of successful recognitions
These results indicate that the filtering process does not make a successful speaker recognition task significantly more difficult. The various degrees of difficulty in recognizing speakers are therefore considered to be closely related to the distortion processes. Then, for the main test, the number of successful recognitions achieved by each subject in the four tests was studied in order to analyze the difficulty of recognizing speakers at the different levels of distortion. The results are shown in Figure 9.
Figure 9. Amount of successful recognitions of each subject for each distortion level
As can be seen, the results indicate that the lower distortion levels yield a large number of successful recognitions, as expected. Three cases were found in which there was no recognition, two corresponding to Distortion D4 and one to Distortion D3. On the other hand, there are seven cases in which the subject matched all voices correctly, all at the minimum level of distortion. In general, the results for distortion levels D4 and D3 are evenly distributed between one and four successful recognitions. Five successes are not possible, because if one comparison is wrong then another one is inevitably wrong as well. Analyzing the results for each distortion level, the following behaviors were found.
For Distortion D4 the recognition was very poor, with a mean of 1.33 successful recognitions (22.2%), while the maximum was three (50%), achieved by only one subject. Moreover, there were two cases of null recognition (0%). The distribution of the results shows that twelve of the fifteen subjects (80% of the total population) were able to recognize only one or two voices.
In the Distortion D3 test the average score improved slightly, to two successful recognitions (33.3%). Only one subject recognized four voices, which was the maximum number of successes obtained (66.6%). There was also one case of null recognition (0%). Here the distribution of results indicates that only five subjects (33.3%) recognized three or four voices.
The Distortion D2 test results did not provide significant improvements over the previous case. The average of successful recognitions increased only to 2.53 (42.2%), and the maximum of four (66.6%), achieved by three subjects, was not exceeded. Unlike the previous cases, this time no subject failed every recognition, although only 46.6% of the subjects recognized three or more voices, none exceeding the maximum of four successful recognitions.
Finally, the Distortion D1 test gave the most accurate results, with a mean of 4.87 successful recognitions (81.1%). Fourteen subjects (93.3%) were able to correctly identify more than three voices, of whom seven (46.6%) recognized all voices. The mean values of successful recognitions and the associated standard deviations are listed in Table 3.
Another interesting result is the number of successful recognitions for each of the six voices. For this analysis, the number of subjects who were able to recognize each voice individually at each of the four levels of distortion was determined. The results are shown in Figure 10.
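The remark that five successful recognitions cannot occur can be verified by enumeration. The sketch below assumes that each subject assigns every Voice track to exactly one Reference track (a one-to-one matching), which is what that remark implies; the function name `hit_distribution` is hypothetical.

```python
from collections import Counter
from itertools import permutations


def hit_distribution(n_voices=6):
    """Count how many one-to-one assignments give each number of correct matches."""
    counts = Counter()
    for assignment in permutations(range(n_voices)):
        hits = sum(1 for ref, voice in enumerate(assignment) if ref == voice)
        counts[hits] += 1
    return counts


dist = hit_distribution()
total = sum(dist.values())
for hits in range(7):
    print(f"{hits} correct: {dist[hits]:3d} of {total} assignments")

# Average number of correct matches if the assignment were made purely at random
chance_mean = sum(hits * count for hits, count in dist.items()) / total
print("mean number of correct matches by chance:", chance_mean)
```

Under this assumption no assignment produces exactly five correct matches, and a subject answering at random would obtain one correct match on average, which gives a baseline against which the means in Table 3 can be read.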
Table 3. Mean and standard deviation of successful recognitions for each distortion level
Distortion level   Mean   Standard deviation
1                  4.87   1.13
2                  2.53   0.99
3                  2.00   1.07
4                  1.33   0.82
Figure 10. Amount of successful recognitions for each one of the six voices
These results show that, for all levels of distortion, the voice of person A (Voice A) had the greatest number of successful recognitions. Voice D also has a large number of successful recognitions, especially for D1, where 86.66% of the subjects correctly identified that person. The statistical analysis (correlation coefficient, standard error, coefficient of determination R squared, and p-value for a confidence level of 95%) of the distortion level against the number of successful recognitions for each voice is listed in Table 4. The results for Voices A, D and F indicate that the number of successful recognitions is tightly related to the distortion level. The p-values obtained for those voices are smaller than 0.05, which confirms that both variables are highly correlated.
Table 4. Distortion Level. Statistical analysis
                             Voice A   Voice B   Voice C   Voice D   Voice E   Voice F
Successful recognitions
  Distortion level 1         15        12        11        13        12        10
  Distortion level 2         10        3         5         9         4         7
  Distortion level 3         8         4         5         5         3         5
  Distortion level 4         5         3         3         3         2         4
Correlation                  0.983     0.770     0.894     0.990     0.875     0.976
Standard error               0.424     1.523     0.849     0.346     1.212     0.316
Determination coefficient    0.966     0.593     0.800     0.980     0.766     0.952
p-value                      0.017     0.229     0.105     0.010     0.124     0.024
The same analysis was performed relating the number of successful recognitions to the four calculated THD values. The results are listed in Table 5. The negative correlation indicates that the higher the THD, the lower the number of successful recognitions. No p-value allows the null hypothesis to be rejected, so it would not be correct to state that the number of successful recognitions of each voice is correlated with the THD. For this reason, the THD parameter is not useful for quantifying the distortion level of the designed distortion systems.
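The correlation coefficients and p-values of both analyses can be re-derived from the recognition counts. The sketch below uses the Pearson coefficient and, because there are only four points per voice, the closed form of the two-sided p-value for a t distribution with two degrees of freedom. Taking the limiter level in dB as the "distortion level" variable is an assumption that reproduces the positive sign of the reported coefficients; all names in the code are hypothetical.

```python
import math

# Values for D1..D4: limiter levels and THD as reported in the text
LIMITER_DB = [-20.0, -30.0, -40.0, -50.0]
THD_PERCENT = [1.35, 5.61, 15.89, 50.93]

# Successful recognitions per voice for D1..D4
RECOGNITIONS = {
    "A": [15, 10, 8, 5],
    "B": [12, 3, 4, 3],
    "C": [11, 5, 5, 3],
    "D": [13, 9, 5, 3],
    "E": [12, 4, 3, 2],
    "F": [10, 7, 5, 4],
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def p_value_four_points(r):
    """Two-sided p-value of r for n = 4 (t distribution, 2 degrees of freedom)."""
    t = r * math.sqrt(2.0 / (1.0 - r * r))
    return 1.0 - abs(t) / math.sqrt(t * t + 2.0)


def analyse(xs):
    """Return voice -> (r, r squared, p-value) for the given predictor."""
    stats = {}
    for voice, counts in RECOGNITIONS.items():
        r = pearson_r(xs, counts)
        stats[voice] = (r, r * r, p_value_four_points(r))
    return stats


level_stats = analyse(LIMITER_DB)
thd_stats = analyse(THD_PERCENT)
for voice in RECOGNITIONS:
    r_l, _, p_l = level_stats[voice]
    r_t, _, p_t = thd_stats[voice]
    print(f"Voice {voice}: level r={r_l:.3f} p={p_l:.3f} | THD r={r_t:.3f} p={p_t:.3f}")
```

With these data only Voices A, D and F fall below the 0.05 threshold for the distortion level, and no voice does for the THD, in line with the analysis described above.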
Table 5. THD. Statistical analysis
                             Voice A   Voice B   Voice C   Voice D   Voice E   Voice F
Successful recognitions
  THD 1.35%                  15        12        11        13        12        10
  THD 5.61%                  10        3         5         9         4         7
  THD 15.89%                 8         4         5         5         3         5
  THD 50.93%                 5         3         3         3         2         4
Correlation                  -0.856    -0.532    -0.717    -0.845    -0.645    -0.804
Standard error               0.068     0.116     0.076     0.074     0.110     0.049
Determination coefficient    0.732     0.283     0.513     0.715     0.416     0.674
p-value                      0.144     0.468     0.283     0.154     0.354     0.195
Finally, a comparison between the correlation coefficients of the Distortion Limit Level and THD is shown in Table 6.
Table 6. Correlation coefficients comparison
           Successful recognitions      Correlation coefficient
           D1    D2    D3    D4         Distortion level   THD
Voice A    15    10    8     5          0.98               -0.86
Voice B    12    3     4     3          0.77               -0.53
Voice C    11    5     5     3          0.89               -0.72
Voice D    13    9     5     3          0.99               -0.85
Voice E    12    4     3     2          0.88               -0.65
Voice F    10    7     5     4          0.98               -0.80
The correlation coefficients of the Distortion Level parameter are greater than those obtained for the THD parameter for all the voices. The correlation coefficients of the Distortion Level are positive, which indicates a direct correlation between the variables. In contrast, the correlation coefficients of the THD parameter are negative, indicating an inverse correlation between the variables. However, the results show that the THD cannot be used as an objective parameter for quantifying the accuracy of a speaker recognition task. The THD only considers the energy of the harmonics, without considering the signal-to-noise ratio, which decreases as the level of distortion increases and produces variations in the noisiness perceived by the listener. The increase in noise energy between syllables and words caused by the second normalization process at 60% worsens the quality of the vocal registers, because each person's voice is masked by the noise level of the signal. Figure 11 shows two distorted voice signals (the word "secuestro", as in the previous analysis) on the same amplitude scale before the 60% normalization process: the left one has the D4 distortion level applied and the right one the D1 level.
Figure 11. Noise between syllables before the 60% normalization process (panels: D4, D1)
Figure 12. Noise between syllables after the 60% normalization process (panels: D4, D1)
The intervals of silence between syllables (which actually contain the background noise of the original recordings) are highlighted, showing that those intervals have the same amplitude regardless of the distortion level applied, because the limiter threshold (-50 dB for D4) is never reached. For the D4 signal the noise amplitude is of the same order as the voice amplitude, but it is much lower compared with
the D1 one. Because of this, when the 60% normalization process is applied, the amplitude of the noise increases in proportion to the level of distortion, as shown in Figure 12. This suggests that it is necessary to study the influence of the signal-to-noise ratio produced by the distortion process in conjunction with the THD of the system, since the latter is not sufficient to quantify the degree of successful recognition. Finally, the largest number of successful recognitions corresponds to Voice A, which belongs to a person known to the subjects before the test. All the subjects know that person and have maintained a daily relationship with them for more than four years. This indicates that familiarity with a particular voice may improve the recognition task.
6. CONCLUSIONS
From the test results it was found that the chances of successfully recognizing three of the six voices employed are strongly correlated with the distortion level. For the other three voices this conclusion cannot be affirmed. Therefore, more subjects and more tests are needed to obtain more subjective data in order to confirm or refute the hypothesis that the distortion level and the number of successful recognitions are strongly correlated. The THD parameter of the distortion systems is not useful for predicting how difficult it is to perform a successful speaker recognition, because there is no strong correlation between the THD and the number of successful recognitions. A visual analysis of the formants shown in a sonogram leads to the conclusion that, as the distortion level increases, discriminating the energy of the vowels becomes more difficult. Spectrograms are a valuable visual tool, but their usefulness is limited by the dynamic quality of the records analyzed, as both distortion and noise can impair the correct interpretation of the graphs. These findings were reflected in the tests, where good results were achieved only with the minimum distortion level.
This may indicate that the subjects who performed the test use spectral information, related mainly to the vowel energy of the original reference tracks, and then try to characterize the voices of the different persons with that information. Then, listening to both reference and distorted tracks, the subjects try to find some common behavior related to the distribution of vowel energy between speakers. In the filtered signal test, no strong dependence could be found between speaker recognition performance and the elimination of spectral information by a telephone-like filter. As explained, the test was designed to avoid any kind of temporal information that may help the recognition process. However, some subjects commented after the tests that they used some temporal characteristics related to the
pronunciation of the persons recorded to try to recognize them, especially at the two highest levels of distortion.
7. FUTURE WORK
One of the most important needs that emerges from the results of this report is to increase the number of subjects performing the test and to add more distortion levels, which will increase the resolution of the results. That is, the regression lines will be calculated with more data and the correlation coefficients will be more accurate. In addition, it is necessary to evaluate the influence of the reduction of the signal-to-noise ratio together with the THD variations in order to design an optimal parameter that accurately quantifies the degree of difficulty of a speaker recognition task. It would also be interesting to process the signals used in the subjective tests with automatic speaker recognition systems, which would make it possible not only to classify the degrees of precision of such systems but also to compare their number of successful recognitions with those obtained subjectively. This will allow forensic experts to determine to what extent they can rely on their interpretation. On the other hand, it is necessary to investigate and develop new technologies that allow a clear transmission of the human voice in telephone systems, as this is the primary means of communication used by kidnappers. It is also necessary that all offenders under police custody take part in a voice recording session so that their vocal registers are stored and can be used if any of them re-offends. For further research, it would also be interesting to analyze what happens with female voices, in order to study the influence of the different levels of distortion when both male and female voices are compared, and to study the influence of previously known voices.
8. REFERENCES
[1] Delgado Romero C. Técnicas digitales de análisis audiovisual en acústica forense. Actas del 3 Congreso de Investigadores Audiovisuales. Vol. 1. Del Laberinto. Madrid, Spain. November 1999.
[2] Lane C. Phase distortion in telephone apparatus. Bell System Technical Journal. Vol. 2, pp. 493-521. New York, US. May 1983.
[3] Hirsch H. The Aurora experimental framework for the performance evaluation on speech recognition systems under noisy conditions. VI International Conference on Spoken Language Processing. Vol. 4, pp. 29-32. ICSLP-2000. Beijing, China. October 2000.
[4] Steinberg J. C. Effects of phase distortion telephone quality. Bell System Technical Journal. Vol. 2, pp. 550-555. New York, US. May 1983.
[5] Wang D., Lim J. The unimportance of phase in speech enhancement. Acoustics, Speech and Signal Processing. IEEE Transactions. Vol. 30, pp. 679-681. Cambridge, US. January 2003.
[6] Dominguez S. Estimated weight of evidence in forensic sound for statistical inference of identity of the speaker by Bayesian network application to acoustic features. Master's thesis. pp. 1-5. Madrid, Spain. October 2009.
[7] Steinberg J. C. Effects of phase distortion telephone quality. Bell System Technical Journal. Vol. 2, pp. 555-566. New York, US. May 1983.
[8] Ladefoged P. A Course in Phonetics. Fort Worth Harcourt Brace Jovanovich College Publishers. Vol. 1, 5th Ed. Boston, US. 2006.
[9] Bohn D. Audio specifications. Rane Note. Vol. 12, pp. 1-12. Washington DC, US. 2000.
[10] Farina A. Advancements in Impulse Response Measurements by Sine Sweeps. AES E-Library. Parma, Italy. May 2007.
APPENDIX
Subjective Test
Acoustic Laboratory Name: _________________ Age: _____ Date: ___/___/________
Six different persons' voices were recorded, each one saying the following list of five words: Dinero, Secuestro, Policía, Rehén and Bomba. The first three words were filtered and distorted. The other two words were not modified. First, listen to the Reference 1 track, which contains the two unmodified words (Rehén, Bomba) spoken by one person. Then listen to the six Voice tracks (A, B, C, D, E and F), each containing the three distorted words (Dinero, Secuestro, Policía), and decide which of the six Voice tracks corresponds to the same person as the Reference 1 track. This procedure is repeated for the rest of the Reference tracks.
Voice A Voice B Voice C Voice D Voice E Voice F
Distortion 4 Reference 1 Reference 2 Reference 3 Reference 4 Reference 5 Reference 6
Distortion 3 Reference 1 Reference 2 Reference 3 Reference 4 Reference 5 Reference 6

Distortion 2 Reference 1 Reference 2 Reference 3 Reference 4 Reference 5 Reference 6

Distortion 1 Reference 1 Reference 2 Reference 3 Reference 4 Reference 5 Reference 6
