
School of Electrical Science

IIT BHUBANESWAR

ANOMALY DETECTION BASED
PRONUNCIATION VERIFICATION

Presented by: P. Naresh Babu
Roll No: 18EC01003

OUTLINE

PRONUNCIATION VERIFICATION
• Speaker level
• Phoneme level
  • Confidence-score based
  • Rule based
  • Classifier based
• System
  • System overview
  • Speech attribute detectors
  • Deep neural network (DNN)
  • Bottleneck features
  • Multi-task learning
  • Dropout regularization
  • One-class support vector machine (OCSVM)
• Achievements
• Conclusion and summary
• References

SPEAKER LEVEL

• A single score representing the speaker's pronunciation fluency is estimated from a few sentences, as is commonly done in proficiency tests.


PHONEME LEVEL (Detection algorithm)

• In phoneme-level pronunciation verification, evaluation is performed for each individual phoneme.
• Phoneme-level evaluation can provide rich information about the position and the type of error made by the user, which can then be used to generate informative and corrective feedback to improve the learning process.

Classification of phoneme-level methods:

• Confidence-score based
• Rule based
• Classification based


CONFIDENCE-SCORE BASED

• These methods compute a confidence score representing how close the pronunciation is to the target phoneme and then compare it to a predefined threshold to accept or reject the produced phoneme.
• The most widely used confidence score is the 'goodness of pronunciation' (GOP) score; a minimal sketch of GOP-style scoring is given after this list.
• As the decision threshold is determined using a mispronunciation dataset, it can be error-specific and thus very hard to generalize to different types of pronunciation errors.
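A minimal sketch of GOP-style scoring and thresholding (not the system presented here), assuming per-frame phoneme posteriors are already available from an acoustic model; the function names and the threshold value are illustrative only:

```python
import numpy as np

def gop_score(frame_posteriors, target_phoneme):
    """GOP-style score: average log-posterior of the target phoneme over its
    aligned frames, normalized by the best competing phoneme per frame.
    `frame_posteriors` is a (num_frames x num_phonemes) array of posterior
    probabilities; `target_phoneme` is a column index."""
    target = np.log(frame_posteriors[:, target_phoneme] + 1e-10)
    best = np.log(frame_posteriors.max(axis=1) + 1e-10)
    return float(np.mean(target - best))

def verify_phoneme(frame_posteriors, target_phoneme, threshold=-1.0):
    """Accept the phoneme if its GOP score exceeds a predefined
    (illustrative) threshold; otherwise flag it as mispronounced."""
    return gop_score(frame_posteriors, target_phoneme) >= threshold
```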


RULE BASED
• This is a task-specific method in which a set of predefined rules representing the expected pronunciation errors for a particular task is estimated either manually by a language expert or automatically by a data-driven method.
• Although these methods require prior knowledge of the expected mispronunciation rules, they offer the advantage of identifying not only the position but also the type of error made by the speaker.
• As the rules are customized to a specific problem, they can fail if the speaker produces errors not catered for by the designed rules.


CLASSIFICATION BASED
• Each phoneme is classified as either correctly or incorrectly pronounced after prior training of the classifier on both correct and incorrect pronunciations of that phoneme; a minimal sketch of such a classifier follows this list.
• Conventional classification methods such as SVM, decision tree, ANN, etc. have also been applied to this problem.
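A minimal sketch of a classification-based verifier for a single phoneme, assuming phoneme-level feature vectors and binary labels are already prepared; scikit-learn is used purely for illustration and is not necessarily what the cited studies used:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (num_examples x feature_dim) phoneme-level feature vectors,
# y: 1 for correctly pronounced, 0 for mispronounced examples.
def train_phoneme_classifier(X, y):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X, y)
    return clf

# Usage: clf = train_phoneme_classifier(X_train, y_train)
#        is_correct = clf.predict(x_new.reshape(1, -1))[0] == 1
```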

APPLICATIONS



SYSTEM
SYSTEM OVERVIEW

[1] The system flow chart

SYSTEM OVERVIEW
• The system consists of four main stages:
  1. Pre-processing
  2. Forced alignment
  3. Speech attribute detection
  4. OCSVM model building
PRE-PROCESSING:
• The speech signal is framed using a Hamming window of 25 ms width with a frame shift of 10 ms.
• From each window, two types of features are obtained (a sketch of this front end is given after the list):
  1. 26 filter-bank features, extracted by applying triangular filters on the Mel scale to the power spectrum.
  2. MFCC coefficients, which are a decorrelated and compressed version of the filter-bank features, produced by applying the Discrete Cosine Transform (DCT) to the filter banks and typically yielding 12 coefficients.
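A minimal sketch of this front end, assuming the python_speech_features package (the toolkit actually used in the work is not stated on the slides); the parameter values follow the slide: 25 ms Hamming window, 10 ms shift, 26 Mel filters, 12 MFCCs:

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import logfbank, mfcc

rate, signal = wav.read("utterance.wav")  # hypothetical input file

# 26 log Mel filter-bank features per 25 ms frame, 10 ms shift
# (window function left at the package default here for brevity).
fbank_feats = logfbank(signal, samplerate=rate,
                       winlen=0.025, winstep=0.01, nfilt=26)

# 12 MFCCs: DCT-compressed, decorrelated version of the filter banks,
# computed with a Hamming window as on the slide.
mfcc_feats = mfcc(signal, samplerate=rate,
                  winlen=0.025, winstep=0.01,
                  numcep=12, nfilt=26, winfunc=np.hamming)
```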



School of Electrical Science IIT BHUBANESWAR

SYSTEM OVERVIEW
FORCED ALIGNMENT:
• The resulting phoneme sequence is then passed to the forced-alignment stage, along with the MFCC features extracted from the speech signal, to determine the time boundaries of each phoneme.

SPEECH ATTRIBUTE DETECTORS:

• A separate binary DNN classifier is trained to recognize the presence or absence of each attribute in the current frame.
• A mapping step is performed to map each phoneme to its corresponding speech attributes (a sketch of such a mapping is given after this list).
• Different DNN methods are examined to improve the performance of the speech attribute detectors, including multi-task learning (MTL) and dropout regularization (Dout).
• Bottleneck (BN) features were also employed to better represent the phonetic variations within each speech attribute by increasing the dimensionality of the extracted features.
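A minimal sketch of the mapping step, with a small, purely illustrative phoneme-to-attribute table; the real system covers all 40 phonemes and 26 manner and place-of-articulation attributes:

```python
# Illustrative phoneme -> speech-attribute table (not the full inventory).
PHONEME_TO_ATTRIBUTES = {
    "p": {"stop", "bilabial", "voiceless"},
    "b": {"stop", "bilabial", "voiced"},
    "s": {"fricative", "alveolar", "voiceless"},
    "m": {"nasal", "bilabial", "voiced"},
}

def attribute_targets(phoneme, all_attributes):
    """Binary target vector (one entry per attribute detector) for a frame
    that forced alignment has assigned to the given phoneme."""
    present = PHONEME_TO_ATTRIBUTES.get(phoneme, set())
    return [1 if attr in present else 0 for attr in all_attributes]
```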


SPEECH ATTRIBUTE DETECTORS

[2] Comparison amongst the four different DNN configurations used in the speech attribute detectors


SPEECH ATTRIBUTE DETECTORS


Deep Neural Network (DNN) architecture:
• A feed-forward deep neural network is used as the attribute detector.
• The softmax function was then applied to the output to convert the arbitrary output values to probabilities.
• The softmax output of the j-th neuron was calculated as follows:

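The equation itself did not survive the slide export; the standard softmax it refers to is

$$y_j = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}$$

where $z_j$ is the pre-softmax activation of the j-th output neuron and $K$ is the number of output neurons.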
Bottleneck features:
• Bottleneck features are used to increase the dimensionality of the features extracted for each attribute (a minimal sketch follows).
• The size of the speech attribute feature vector of each frame was increased from 26 to 260 values, where 10 values represented a single attribute.
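A minimal sketch of bottleneck-feature extraction, assuming Keras and a list of already-trained attribute DNNs that each contain a 10-unit layer named "bottleneck"; the layer name and sizes here are illustrative, not taken from the original implementation:

```python
import numpy as np
from tensorflow import keras

def extract_bottleneck_features(frame_batch, attribute_models):
    """Concatenate the 10-dimensional bottleneck activations of every
    attribute DNN, expanding the per-frame representation to
    26 attributes x 10 values = 260 features."""
    per_attribute = []
    for model in attribute_models:  # one trained DNN per speech attribute
        bn_layer = model.get_layer("bottleneck")
        extractor = keras.Model(inputs=model.input, outputs=bn_layer.output)
        per_attribute.append(extractor.predict(frame_batch, verbose=0))
    return np.concatenate(per_attribute, axis=1)  # shape: (frames, 260)
```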


SPEECH ATTRIBUTE DETECTORS

[3] Fig. 1: The feed-forward DNN architecture used in the speech attribute detectors
[4] Fig. 2: The architecture of the DNN with a bottleneck layer of 2048 nodes

SPEECH ATTRIBUTE DETECTORS


Multi-task learning (MTL):
• MTL improves the generalization of the main classification task.
• Each attribute DNN is trained to perform two tasks:
  • Main task: the binary classification of the attribute for each frame.
  • Secondary task: phoneme classification, with 120 output neurons representing the states of the 40 phonemes (3 states per phoneme).
• This allows the network to learn variations among phonemes that share an attribute class and hence produce more discriminative features.
Dropout regularization:
• It works by randomly selecting a percentage of nodes in the input and hidden layers to be inactive with probability 'p' during training.
• Dropout values of 0.2 for the input layer and 0.3 for all hidden layers were chosen (a sketch combining MTL and dropout is given after this section).
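A minimal sketch of one attribute DNN combining the two MTL output heads with the quoted dropout rates, assuming Keras; the 26-dimensional input and the 120 phoneme-state targets follow the slides, while the hidden-layer sizes and loss weights are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mtl_attribute_dnn(input_dim=26, hidden_units=1024, num_hidden=3):
    inp = layers.Input(shape=(input_dim,))
    x = layers.Dropout(0.2)(inp)              # dropout on the input layer
    for _ in range(num_hidden):
        x = layers.Dense(hidden_units, activation="relu")(x)
        x = layers.Dropout(0.3)(x)            # dropout on each hidden layer
    # Main task: is the attribute present in this frame?
    attr_out = layers.Dense(2, activation="softmax", name="attribute")(x)
    # Secondary task: which of the 120 phoneme states (40 phonemes x 3)?
    phone_out = layers.Dense(120, activation="softmax", name="phoneme_state")(x)
    model = keras.Model(inputs=inp, outputs=[attr_out, phone_out])
    model.compile(optimizer="adam",
                  loss={"attribute": "categorical_crossentropy",
                        "phoneme_state": "categorical_crossentropy"},
                  loss_weights={"attribute": 1.0, "phoneme_state": 0.5})
    return model
```

At test time only the "attribute" head is needed; the phoneme-state head exists solely to regularize training.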


SPEECH ATTRIBUTE DETECTORS

[5] The architecture of multi-task learning


ONE-CLASS SUPPORT VECTOR MACHINE (OCSVM)

• The OCSVM is trained on only the positive (correctly pronounced) samples to find a hyperplane that separates them from the origin.
• Two standard correctly pronounced speech corpora, namely TIMIT and WSJ, were utilised for training.
• The OCSVM training was tuned using two parameters (a minimal sketch follows):
  • First, the regularization parameter ν, which takes values in the interval (0, 1] and represents the maximum fraction of training samples allowed on the negative side of the decision hyperplane.
  • Second, the minimum number of support vectors as a percentage of the total number of training samples.
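A minimal sketch of the per-phoneme one-class modelling step, assuming scikit-learn's OneClassSVM (whether the original work used this library is not stated on the slides); the ν value is illustrative:

```python
from sklearn.svm import OneClassSVM

# X_correct: (num_frames x 260) speech-attribute feature vectors for one
# phoneme, taken only from correctly pronounced speech (e.g. TIMIT, WSJ).
def train_phoneme_ocsvm(X_correct, nu=0.05):
    ocsvm = OneClassSVM(kernel="rbf", nu=nu, gamma="scale")
    ocsvm.fit(X_correct)
    return ocsvm

# At test time: +1 means the frame looks like a correct pronunciation,
# -1 flags it as an anomaly (possible mispronunciation).
# labels = train_phoneme_ocsvm(X_correct).predict(X_test)
```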


[6] The one-class SVM decision regions


Achievements
• The DNN-based speech attribute detectors achieved an average accuracy of 90% ± 2.7% when 26 different manner and place of articulation attributes were utilized.
• The combination of dropout and multi-task learning improved the average attribute accuracy to 91.5% ± 2.5%.
• MTL with phoneme classification as a secondary task increased the ability of the speech attribute DNN classifiers to further discriminate between phonemes that share the same attribute(s).
• This anomaly detection approach outperforms the commonly used DNN-GOP algorithm on all testing sets.


Conclusions and Summary

• A novel pronunciation verification approach was presented that overcomes the need for annotated mispronounced data to model pronunciation errors.
• In the anomaly detection formulation, only correctly pronounced speech is used to train a phoneme-specific acoustic model that can detect any deviation from the correct pronunciation.
• An OCSVM classifier was used as the anomaly detector, fed by a set of speech attribute features derived from a bank of DNN-based speech attribute detectors modelling the manners and places of articulation.
• Multi-task learning and dropout were used to alleviate the overfitting problem in the DNN speech attribute detectors.
• This approach reduced the false-acceptance and false-rejection rates by 26% and 39%, respectively, compared to the GOP method.


References
[1] Behravan, H., Hautamäki, V., Siniscalchi, S. M., Kinnunen, T., Lee, C.-H., 2014. Introducing attribute features to foreign accent recognition. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, pp. 5332-5336.
[2] Van Doremalen, J., Cucchiarini, C., Strik, H., 2009. Automatic detection of vowel pronunciation errors using multiple information sources. In: Automatic Speech Recognition & Understanding (ASRU 2009), IEEE Workshop on. IEEE, pp. 580-585.
[3] Franco, H., Ferrer, L., Bratt, H., 2014. Adaptive discriminative modeling for improved mispronunciation detection. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, pp. 7709-7713.


References
[4] Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256.
[5] Hu, W., Qian, Y., Soong, F. K., 2013. A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL). In: Interspeech 2013, pp. 1886-1890.
[6] Khan, S. S., Madden, M. G., 2014. One-class classification: taxonomy of study and review of techniques. Knowledge Engineering Review, 29(3), pp. 345-374.



Thank you
