
Front-End Factor Analysis for Speaker Verification:

In this speaker verification system, factor analysis is used to define a new low-dimensional space that models both speaker and channel variability.

Joint Factor Analysis: In JFA, a speaker utterance is represented by a supervector M that consists of additive components from a speaker subspace and a channel/session subspace. Specifically, the speaker-dependent supervector is defined as

M = m + Vy + Ux + Dz

where m is a speaker- and session-independent supervector, V and D define a speaker subspace (eigenvoice matrix and diagonal residual, respectively; D is of size 39k × 39k), and U defines a session subspace (eigenchannel matrix). The vectors y, z, and x are the speaker- and session-dependent factors in their respective subspaces, and each is assumed to be a random variable with a normal distribution N(0, I).

The sources of variability considered are within-speaker variability, between-speaker variability, and channel variability. The channel variability is C = Ux, where U is a rectangular, low-rank matrix (eigenchannels) and x is a standard normal random vector (channel factors).

Speaker- and channel-dependent supervector:

M = m + Tw

where m is the speaker- and channel-independent supervector (which can be taken to be the UBM supervector), T is a rectangular matrix of low rank, and w is a random vector having a standard normal distribution N(0, I). The estimate of the i-vector w is given as

$w = (I + T^{t} \Sigma^{-1} N(u) T)^{-1} T^{t} \Sigma^{-1} F(u)$

Baum-Welch statistics: N(u) is a diagonal matrix whose $c$th block is $N_c I$, where $N_c$ is the posterior count (0th-order statistic) of the utterance u with respect to the $c$th mixture of the UBM. F(u) is a supervector obtained by concatenating all centered 1st-order statistics $F_c$ of the utterance u with respect to the $c$th mixture of the UBM. $\Sigma$ is mixture-wise block-diagonal with entries $\Sigma_c$, where $\Sigma_c$ is the covariance matrix of the $c$th mixture of the UBM.
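As a rough illustration of the i-vector estimate, the sketch below computes the Baum-Welch statistics of one utterance against a UBM and then evaluates w = (I + T^t Σ^{-1} N(u) T)^{-1} T^t Σ^{-1} F(u). It assumes a diagonal-covariance UBM, and the function names `baum_welch_stats` and `extract_ivector` are hypothetical, chosen only for this example; it is not taken from any particular toolkit.

```python
import numpy as np

def baum_welch_stats(frames, weights, means, variances):
    """Return (N_c, centered F_c) for one utterance.

    frames: (T, F) feature vectors; weights: (C,); means, variances: (C, F).
    Assumption: the UBM has diagonal covariances.
    """
    diff = frames[:, None, :] - means[None, :, :]            # (T, C, F)
    # log N(y_t | m_c, Sigma_c) for every frame t and mixture c
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                    # (T, C)
    log_post -= log_post.max(axis=1, keepdims=True)           # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                   # P(c | y_t)
    n = post.sum(axis=0)                                      # 0th-order stats, (C,)
    f = np.einsum("tc,tcf->cf", post, diff)                   # centered 1st-order stats, (C, F)
    return n, f

def extract_ivector(n, f, T, variances):
    """Evaluate w = (I + T^t S^-1 N T)^-1 T^t S^-1 F with diagonal S and N."""
    num_mix, feat_dim = f.shape
    rank = T.shape[1]
    n_diag = np.repeat(n, feat_dim)            # diagonal of N(u), length C*F
    sigma_inv = 1.0 / variances.reshape(-1)    # diagonal of Sigma^-1
    Tt_sinv = T.T * sigma_inv                  # T^t Sigma^-1, shape (R, C*F)
    precision = np.eye(rank) + (Tt_sinv * n_diag) @ T
    return np.linalg.solve(precision, Tt_sinv @ f.reshape(-1))
```

Because N(u) and Σ are stored only as their diagonals, the C·F × C·F matrices are never formed explicitly, which matters when the supervector dimension is large.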

Verification of claim using the cosine kernel
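A minimal sketch of cosine-kernel scoring, assuming the usual form in which the score is the normalized inner product of the target and test i-vectors and the claim is accepted when the score reaches a decision threshold. The names `cosine_score` and `verify` and the `threshold` parameter are hypothetical and used only for illustration.

```python
import numpy as np

def cosine_score(w_target, w_test):
    """Cosine of the angle between the target and test i-vectors."""
    w_target = np.asarray(w_target, dtype=float)
    w_test = np.asarray(w_test, dtype=float)
    return float(w_target @ w_test
                 / (np.linalg.norm(w_target) * np.linalg.norm(w_test)))

def verify(w_target, w_test, threshold):
    """Accept the identity claim when the score reaches the threshold.

    The threshold is an assumed tuning parameter, not a value from the notes.
    """
    return cosine_score(w_target, w_test) >= threshold
```

The score depends only on the direction of the two vectors, not on their magnitudes, so rescaling either i-vector leaves the decision unchanged.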

The advantage of this scoring is that no target speaker enrollment is required, unlike for support vector machines and classical joint factor analysis, where the target speaker-dependent supervector needs to be estimated in an enrollment step.

LDA: LDA is a technique for dimensionality reduction that is widely used in the field of pattern recognition. The idea behind this approach is to seek new orthogonal axes that better discriminate between different classes. The axes found must satisfy the requirement of maximizing between-class variance and minimizing intra-class variance.
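The LDA criterion can be sketched as a generalized eigenvalue problem S_b v = λ S_w v, where S_b and S_w are the between-class and within-class scatter matrices. The code below is a minimal illustration using the standard Fisher scatter definitions, which may differ in normalization from the ones used in the system described here; `lda_projection` is a hypothetical name, and the within-class scatter is assumed to be positive definite (enough samples per class).

```python
import numpy as np

def lda_projection(X, labels, n_components):
    """Return a (d, n_components) matrix of LDA directions.

    X: (n_samples, d) vectors (for example i-vectors); labels: (n_samples,).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((d, d))   # within-class scatter
    S_b = np.zeros((d, d))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        centered = Xc - class_mean
        S_w += centered.T @ centered
        diff = (class_mean - overall_mean)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    # Whiten with the Cholesky factor of S_w, turning the generalized
    # problem into an ordinary symmetric eigenproblem.
    L = np.linalg.cholesky(S_w)
    L_inv = np.linalg.inv(L)
    eigvals, U = np.linalg.eigh(L_inv @ S_b @ L_inv.T)
    order = np.argsort(eigvals)[::-1][:n_components]
    return L_inv.T @ U[:, order]   # map eigenvectors back to the input space
```

Vectors are projected with `X @ A`, where `A` is the returned matrix; with K classes at most K - 1 directions carry non-zero between-class variance.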

The main difference between the classical use of joint factor analysis for speaker verification and our approach is that we address the channel effects in this new low-dimensional i-vector space rather than in the high-dimensional GMM mean supervector space.

Statistical Parametric Speech Synthesis:

The contexts used in HTS for English:

phoneme:
- current phoneme
- preceding and succeeding two phonemes
- position of current phoneme within current syllable

syllable:
- numbers of phonemes within preceding, current, and succeeding syllables
- stress and accent of preceding, current, and succeeding syllables
- positions of current syllable within current word and phrase
- numbers of preceding and succeeding stressed syllables within current phrase
- numbers of preceding and succeeding accented syllables within current phrase
- number of syllables from previous stressed syllable
- number of syllables to next stressed syllable
- number of syllables from previous accented syllable
- number of syllables to next accented syllable
- vowel identity within current syllable

word:
- guess at part of speech of preceding, current, and succeeding words
- numbers of syllables within preceding, current, and succeeding words
- position of current word within current phrase
- numbers of preceding and succeeding content words within current phrase
- number of words from previous content word
- number of words to next content word

phrase:
- numbers of syllables within preceding, current, and succeeding phrases
- position of current phrase in major phrases

utterance:
- numbers of syllables, words, and phrases in utterance

The pure view of unit-selection synthesis requires very large databases to cover examples of all required prosodic, phonetic, and stylistic variations. In contrast, statistical parametric synthesis enables models to be combined and adapted and thus does not require instances of all possible combinations of contexts. Although the operation of statistical parametric speech synthesis is impressive, its naturalness is still far from that of natural speech.

Advantage of HTS: Unit-selection systems typically select from a finite set of units in the database. They search for the best path through a given set of units. In contrast to unit-selection synthesis, statistical parametric synthesis uses statistics to generate speech. Thus, a much wider range of units is effectively available, as context affects the generation of speech parameters through constraining dynamic features, and smoother joins are possible.

Multilingual support: Support for multiple languages can easily be accomplished in statistical parametric speech synthesis because only the contextual factors to be used depend on each language.

Drawbacks: The biggest drawback of statistical parametric synthesis compared with unit-selection synthesis is the quality of the synthesized speech. There seem to be three factors that degrade quality: vocoders, acoustic modeling accuracy, and over-smoothing.
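To make the context list at the start of this section more concrete, the sketch below derives a handful of the phoneme- and syllable-level contexts from a toy utterance. The data layout (words as lists of syllables) and the names `Syllable`, `PhonemeContext`, and `phoneme_contexts` are assumptions made for this illustration; they are not the HTS label format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Syllable:
    phonemes: List[str]
    stressed: bool = False

@dataclass
class PhonemeContext:
    prev2: Optional[str]          # two phonemes back (None at utterance edge)
    prev1: Optional[str]
    current: str
    next1: Optional[str]
    next2: Optional[str]
    pos_in_syllable: int          # 1-based position of phoneme in its syllable
    syllable_size: int            # number of phonemes in current syllable
    syllable_stressed: bool
    syllable_pos_in_word: int     # 1-based position of syllable in its word
    word_size: int                # number of syllables in current word

def phoneme_contexts(words: List[List[Syllable]]) -> List[PhonemeContext]:
    """Build one context record per phoneme of the utterance."""
    flat = [p for word in words for syl in word for p in syl.phonemes]

    def at(i: int) -> Optional[str]:
        return flat[i] if 0 <= i < len(flat) else None

    contexts = []
    idx = 0  # index of the current phoneme in the flattened utterance
    for word in words:
        for s_pos, syl in enumerate(word, start=1):
            for p_pos, phoneme in enumerate(syl.phonemes, start=1):
                contexts.append(PhonemeContext(
                    at(idx - 2), at(idx - 1), phoneme, at(idx + 1), at(idx + 2),
                    p_pos, len(syl.phonemes), syl.stressed, s_pos, len(word)))
                idx += 1
    return contexts
```

Each record combines contexts from several levels for a single phoneme, which is why the number of distinct context combinations grows far beyond what a database can cover with recorded examples.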

Speech Enhancement:

Reverberant Speech Enhancement Using Cepstral Processing:

The complex cepstrum has several properties which make the technique a candidate for deconvolution. First, signals which are combined convolutionally in the time domain have complex cepstra which are combined additively. As a result, deconvolution is reduced to subtraction in the cepstrum. Second, the complex cepstrum is a measure of the frequency of variation in the log spectrum, and so signals which vary slowly in the log spectrum may be separated from quickly varying signals by windowing the complex cepstrum.

Speech is usually considered to be primarily slowly varying in the log spectrum and has a complex cepstrum concentrated about the cepstral origin. Echoes which are delayed from the direct-path speech can be represented by an impulse response which in the log spectrum is characterized by rapid ripples, and which in the complex cepstrum is composed of pulses concentrated far away from the cepstral origin.

Any cepstral windowing or peak removal operations on the complex cepstrum cannot be expected to remove h(n) completely, and can be expected to distort the resulting estimate.
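The two cepstral properties used in this section can be checked on a toy example. The sketch below is an idealized case under stated assumptions: the "speech" is a decaying exponential standing in for a slowly varying spectrum, the echo is a single delayed copy with gain below one, both signals are minimum phase (so no linear-phase removal is needed), and convolution is circular. All names and parameter values are hypothetical. In this clean setting the echo pulses sit far from the cepstral origin and windowing removes them almost exactly; with real speech the two components overlap, so the residual echo and distortion noted above should be expected.

```python
import numpy as np

def complex_cepstrum(x):
    """Complex cepstrum via the FFT (assumes no linear-phase component)."""
    spectrum = np.fft.fft(x)
    log_spectrum = np.log(np.abs(spectrum)) + 1j * np.unwrap(np.angle(spectrum))
    return np.fft.ifft(log_spectrum).real

def inverse_complex_cepstrum(c):
    """Map a complex cepstrum back to the time domain."""
    return np.fft.ifft(np.exp(np.fft.fft(c))).real

N, delay, gain, cutoff = 1024, 100, 0.5, 50   # illustrative values only

speech = 0.8 ** np.arange(N)                  # stand-in for the direct-path signal
echo = np.zeros(N)
echo[0], echo[delay] = 1.0, gain              # h(n): direct path plus one echo

# Circular convolution of the stand-in speech with the echo response
reverberant = np.fft.ifft(np.fft.fft(speech) * np.fft.fft(echo)).real

c_speech = complex_cepstrum(speech)
c_echo = complex_cepstrum(echo)
c_reverberant = complex_cepstrum(reverberant)

# Property 1: convolution in time is addition in the cepstrum
additive = np.allclose(c_reverberant, c_speech + c_echo, atol=1e-8)

# Property 2: keep only the low-quefrency part, where the speech lives
c_windowed = c_reverberant.copy()
c_windowed[cutoff:] = 0.0
estimate = inverse_complex_cepstrum(c_windowed)
max_error = np.max(np.abs(estimate - speech))
```

Here `c_echo` consists of pulses at multiples of `delay`, so the rectangular window at `cutoff` discards them while keeping the part of the cepstrum near the origin.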
