IIT BHUBANESWAR
ANOMALY DETECTION BASED PRONUNCIATION VERIFICATION
Presented by:
P. Naresh Babu
Roll No: 18EC01003
School of Electrical Science
OUTLINE
PRONUNCIATION VERIFICATION
Speaker level
Phoneme level
Confidence score based
Rule based
Classifier based
System
System overview
Speech attribute detectors
Deep neural network (DNN)
Bottleneck features
Multi-task learning
Dropout regularization
One-class support vector machine (OCSVM)
Achievements
Conclusion and summary
References
EE2S001 Naresh Babu 2
SPEAKER LEVEL
These methods compute a confidence score representing how close the pronunciation
is to the target phoneme and then compare it to a predefined threshold to accept or
reject the produced phoneme.
The most widely used confidence score is the ‘goodness of pronunciation (GOP)’.
As the decision threshold is determined using a mispronunciation dataset, it can be
error-specific and thus very hard to generalize to different types of pronunciation
errors.
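The confidence-score idea above can be sketched in NumPy, assuming an acoustic model already supplies per-frame phoneme posteriors for the aligned segment. The GOP variant here (frame-average log ratio of the target phoneme's posterior to the best-scoring phoneme) is one common formulation, not necessarily the exact one used in the cited work, and the threshold value is illustrative.

```python
import numpy as np

def gop_score(frame_posteriors, target_idx):
    """Goodness of pronunciation: average log ratio between the target
    phoneme's posterior and the best-scoring phoneme at each frame.
    frame_posteriors: array of shape (frames, phonemes)."""
    post = np.asarray(frame_posteriors, dtype=float)
    target = np.log(post[:, target_idx] + 1e-10)
    best = np.log(post.max(axis=1) + 1e-10)
    return float(np.mean(target - best))  # 0 when the target always wins

def verify(frame_posteriors, target_idx, threshold=-1.0):
    """Accept the produced phoneme if its GOP score clears the threshold."""
    return gop_score(frame_posteriors, target_idx) >= threshold

# Toy posteriors over 3 phonemes for a 4-frame segment of target phoneme 0.
good = np.array([[0.8, 0.1, 0.1]] * 4)  # target dominates -> accepted
bad = np.array([[0.1, 0.8, 0.1]] * 4)   # a competitor dominates -> rejected
```

The threshold is the weak point the slide describes: it is tuned on a mispronunciation dataset, so it may not transfer to error types unseen during tuning.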
RULE BASED
It is a task-specific method in which a set of predefined rules representing the
expected pronunciation errors for a particular task is estimated either
manually by a language expert or automatically by a data-driven method.
Although these methods require prior knowledge of the expected mispronunciation
rules, they offer the advantage of identifying not only the position but also the
type of error made by the speaker.
As the rules are customized to a specific problem, they can fail if the speaker
produces an error not catered for by the designed rules.
CLASSIFICATION BASED
Each phoneme is classified as either correctly or incorrectly pronounced
after prior training of the classifier on both correct and incorrect
pronunciations of the phoneme.
Conventional classification methods such as SVM, decision trees, ANN, etc.
have also been applied to this problem.
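A toy scikit-learn sketch of this two-class setup follows; the 2-D feature vectors and labels are invented for illustration, and a linear SVM stands in for any of the listed classifiers.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D features per phoneme segment (e.g. two attribute scores);
# label 1 = correctly pronounced, 0 = mispronounced.
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.85, 0.7],   # correct examples
              [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]])   # incorrect examples
y = np.array([1, 1, 1, 0, 0, 0])

# Unlike the anomaly-detection setup, training needs BOTH classes.
clf = SVC(kernel="linear").fit(X, y)
preds = clf.predict([[0.9, 0.85], [0.15, 0.2]])
```

The need for labeled incorrect pronunciations of every phoneme is exactly the data requirement that the one-class (anomaly detection) approach later removes.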
SYSTEM
SYSTEM OVERVIEW
FORCED ALIGNMENT:
The resultant phoneme sequence is then passed to the forced alignment stage
along with MFCC features extracted from the speech signal to determine the
time boundary of each phoneme.
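Forced alignment itself is performed by an ASR decoder, but its core can be sketched as a dynamic program that assigns each frame to the known phoneme sequence, in order, maximising the acoustic score. This toy version uses made-up frame log-likelihoods in place of real MFCC-based acoustic model outputs.

```python
import numpy as np

def force_align(log_likes, seq):
    """Assign each frame to one phoneme of `seq`, in order, maximising the
    total log-likelihood; the returned per-frame positions give the time
    boundary of each phoneme.  log_likes: (frames, phonemes) scores."""
    T = log_likes.shape[0]
    S = len(seq)
    NEG = -1e18
    dp = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=int)
    dp[0, 0] = log_likes[0, seq[0]]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]                        # remain in same phoneme
            move = dp[t - 1, s - 1] if s > 0 else NEG  # advance to the next
            if stay >= move:
                dp[t, s], back[t, s] = stay, s
            else:
                dp[t, s], back[t, s] = move, s - 1
            dp[t, s] += log_likes[t, seq[s]]
    # Backtrace the best frame-to-phoneme assignment.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [int(p) for p in path[::-1]]

# Two phonemes over four frames: the first two frames favour phoneme 0.
scores = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
```

Here `force_align(scores, [0, 1])` places the boundary between frames 2 and 3, which is the per-phoneme time information passed on to the attribute extraction stage.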
Table: Comparison amongst the four different DNN configurations used in our speech attribute detectors [2]
Bottleneck features:
Bottleneck features are used to increase the dimensionality of the features
extracted from each attribute.
The size of the speech attribute feature vector of each frame was increased
from 26 to 260 values, where 10 values represented a single attribute.
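The 26-to-260 arithmetic can be illustrated with a stand-in: here each attribute's scalar score is expanded to a 10-value vector through a random matrix, purely to show the shapes. In the actual system the 10 values per attribute are activations taken from the bottleneck layer of that attribute's DNN, not a linear map of the score.

```python
import numpy as np

rng = np.random.default_rng(0)
n_attributes, bottleneck_size = 26, 10

# Hypothetical stand-in for the 26 per-attribute bottleneck layers.
W = rng.normal(size=(n_attributes, bottleneck_size))

def expand(attribute_scores):
    """Replace each of the 26 attribute scores with a 10-value vector,
    yielding the 260-dimensional per-frame feature described above."""
    per_attr = attribute_scores[:, None] * W  # shape (26, 10)
    return per_attr.reshape(-1)               # shape (260,)

frame = rng.uniform(size=n_attributes)  # one frame of 26 attribute scores
```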
Fig. 1: The feed-forward DNN architecture used in the speech attribute detectors [3]
Fig. 2: The architecture of the DNN with a bottleneck layer of 2048 nodes [4]
Achievements
The DNN-based speech attribute detectors achieved an average accuracy
of 90% ± 2.7% when 26 different manner and place of articulation attributes
were utilized.
The combination of dropout and multi-task learning improved the average
accuracy of the attribute detectors to 91.5% ± 2.5%.
MTL with phoneme classification as a secondary task increased the
ability of the speech attribute DNN classifiers to further discriminate
between phonemes that shared the same attribute(s).
This anomaly detection approach outperforms the commonly used DNN-
GOP algorithm in all testing sets.
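The anomaly-detection stage behind these results can be sketched with scikit-learn's `OneClassSVM` [6]: train only on features of correct pronunciations, then flag anything far from that class as a mispronunciation. The 2-D Gaussian features below are synthetic, and the `nu`/`gamma` settings are illustrative, not the values used in the reported system.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for speech attribute features of CORRECT pronunciations;
# no mispronounced examples are needed for training.
correct = rng.normal(loc=0.0, scale=0.3, size=(200, 2))

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(correct)

# predict() returns +1 for inliers (accepted) and -1 for outliers (rejected).
inlier = ocsvm.predict([[0.0, 0.1]])
outlier = ocsvm.predict([[4.0, 4.0]])
```

Because training never sees mispronunciations, the detector is not tied to particular error types, which is the generalization advantage over threshold-based GOP noted above.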
References
[1] Behravan, H., Hautamäki, V., Siniscalchi, S.M., Kinnunen, T., Lee, C.-H., 2014. Introducing attribute features to foreign accent recognition. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, pp. 5332-5336.
[2] Van Doremalen, J., Cucchiarini, C., Strik, H., 2009. Automatic detection of vowel pronunciation errors using multiple information sources. Automatic Speech Recognition & Understanding (ASRU), 2009 IEEE Workshop on. IEEE, pp. 580-585.
[3] Franco, H., Ferrer, L., Bratt, H., 2014. Adaptive discriminative modeling for improved mispronunciation detection. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, pp. 7709-7713.
[4] Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256.
[5] Hu, W., Qian, Y., Soong, F.K., 2013. A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL). Interspeech, pp. 1886-1890.
[6] Khan, S.S., Madden, M.G., 2014. One-class classification: taxonomy of study and review of techniques. Knowledge Engineering Review, 29(3), pp. 345-374.