
PARAMETRIC SPEECH SYNTHESIS VLSI PROCESSOR DESIGN USING APPLICATION SPECIFIC INSTRUCTION

ABSTRACT

The Emotion Recognition System recognizes the emotional state of a speaker by analyzing the emotional patterns in his or her speech. The system determines the speaker's gender before identifying the emotion, which makes the designed system more accurate. This technology is one of the best examples of a human-computer interaction system. Using MATLAB, the designed system recognizes a number of emotions such as anger, disgust, boredom, happiness, sadness, and fear. The Emotion Recognition system consists of two subsystems, the Gender Recognition (GR) subsystem and the Emotion Recognition (ER) subsystem. The achieved results show that knowledge of the speaker's gender helps design a more accurate Emotion Recognition System. The feature selection process reduces the features of the human speech to the most important ones, making the system less complicated. The designed ER system is flexible enough to be installed in future smartphone technology.

Guided by: Dr. Ashad Ullah Qureshi


Contact: 6260651575, 9179357477
Email: conceptsbookspublication@gmail.com
Web: https://researchitout.blogspot.com/

ACKNOWLEDGEMENTS

It gives me great pleasure to present “Speech Emotion Recognition with Gender Recognition Subsystem.” I am thankful to all those who have offered their valuable time and guidance in helping me complete this project. I would like to express my gratitude to the professors, Dr. Hen-Geul Yeh, Dr. Fumio Hamano, Dr. Aftab Ahmed, and the other teaching faculty members for sharing their knowledge and experience with me; without their support and guidance, I might not have been able to achieve the results I was searching for. I am also thankful to the members of the California State University, Long Beach Library and the Writing and Communication Resource Center for their valuable input.


CONTENTS

Chapter 1. Introduction

Chapter 2. Background

2.1 Feature Extraction

2.2 Gender Recognition Features

2.3 Emotion Recognition Features

2.4 Feature Selection

2.5 Database

2.6 Classification Methods

Chapter 3. Gender-Driven Emotional Recognition System

3.1 Architecture

3.2 Gender Recognition (GR) Subsystem

3.3 Proposed Gender Recognition Algorithm

3.4 Pitch Frequency Estimation

Chapter 4. PDF Estimation

Chapter 5. Emotion Recognition Subsystem

5.1 Principal Emotion Features

5.2 Formants

5.3 Mel-Frequency Cepstrum Coefficients

5.4 Standard Deviation

5.5 Spectrum Central Moments

5.6 Skewness

5.7 Kurtosis

5.8 Glottal Pulses

Chapter 6. Emotion Features Selection

6.1 Emotion Classifiers

6.2 Employed Signal Dataset: Berlin Emotional Speech

Chapter 7. Performance Evaluation

7.1 Without Gender Recognition

7.2 With Gender Recognition

7.3 Results of the ER System with the Gender Recognition Subsystem

Chapter 8. Conclusion

References


CHAPTER 1

INTRODUCTION

Modern studies show a growing interest in human-computer interaction, which helps achieve the desired results in Human-Computer Intelligent Interaction (HCII) between computers and users. HCII is helping people build smart homes and smart offices, including virtual reality. This technology plays an important role in observing elderly people remotely: if someone becomes ill, the person can be monitored from a different location by a doctor on the other side of the world. There are also areas such as human psychology and anger management that deal with human emotions, where it is essential to recognize the emotion of the person before any feedback is offered. Emotion recognition has become an important research topic in both academia and industry. Much research is still needed in this area, but some considerable results have already been achieved.

Emotion recognition generally relies on two modalities, facial expressions and speech. This project uses the speech modality to design the Emotion Recognition (ER) system. The emphasis of the system is to analyze the emotional state of different people from their speech. Difficulties arise when the emotions change repeatedly and the system reports several different emotions at a time; professionals are still working to improve this area. This project deals with speech in which the emotion does not change frequently.

The emotions that the system is designed to recognize are anger, boredom, disgust, happiness, sadness, and fear. This technology is flexible enough to work with a smartphone network in the near future. The system performance for a single emotion is compared with that for the rest of the emotions in order to improve the accuracy. There are a number of features that should be calculated before the system starts finding the actual emotion. This project is designed by combining two subsystems, Gender Recognition (GR) and Emotion Recognition (ER). The GR subsystem finds the gender of the speaker and notifies the ER subsystem. Gender Recognition is based on an algorithm designed to estimate the pitch of the speaker's speech. Once the pitch is extracted, an SVM-based emotion classifier is applied. The classifier takes the gender of the speaker as an input, so ultimately the output of the Gender Recognition subsystem works as an input of the Emotion Recognition subsystem.

The ER system uses a popular database known as the Berlin Emotional Speech Database (BESD) for training and testing the speech. Once the system is trained, it can be tested to find the required results. The designed system is able to produce more accurate results than the other methods described in the literature. The feature selection process avoids the less beneficial features and focuses only on the important ones, leading to a less complicated design. The reduced number of operations makes the system use less processing power and makes it compatible with small devices such as smartphones. The BESD is a collection of speech recordings created by German actors, covering different emotional vocal expressions and validated by a number of experts.


CHAPTER 2

BACKGROUND

Mobile phones are used for several applications, including playing music and watching videos. It would be a great development in HCII technology if the phone could play music suited to the mood of the user. If the listener is sad, the applications on the phone recognize it and respond by playing some energetic music to lift the mood. Researchers have published a number of technologies compatible with smartphones [1], [2], and mobile application developers are trying to implement this technology on smartphones. Emotion recognition can be performed by using the microphone or camera of any smartphone [3]. Some techniques use body sensors to recognize emotions. The traditional approach to emotion recognition is based on four major stages, namely Feature Extraction, Feature Selection, Database, and Classification, which are briefly described in this chapter.

2.1 Feature Extraction

Speech feature extraction is implemented by a number of methods that differ in many ways: a few of them focus on processing the human speech, while others focus on the amount of distortion. Human speech has a number of features such as pitch, accent, and speaking rate, which depend strongly on the origin of the speaker and vary from place to place. Selecting a suitable feature before designing a speech recognition application is therefore an important step, and the selection is based on the required accuracy of the results. The speech signal cannot be considered stationary as a whole, but it has to be treated as stationary before entering the Emotion Recognition system. Detailed studies of speech show that the signal can be considered stationary over a 40 ms time period, so it is divided into short segments of 40 ms, called frames [4]. Features can be extracted at each frame as well as over the entire speech, depending on the method being used; these two approaches are known as local and global. The local approach extracts features at each frame, while the global approach extracts them over the entire speech. The best approach is yet to be decided.
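As an illustration of the framing step just described, the sketch below splits a speech signal into non-overlapping 40 ms frames (a minimal NumPy sketch; the function and variable names are illustrative, not taken from the project code):

```python
import numpy as np

def frame_signal(signal, fs=16000, frame_ms=40):
    """Split a 1-D speech signal into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(fs * frame_ms / 1000)              # 640 samples at 16 kHz
    n_frames = len(signal) // frame_len                 # drop the incomplete tail frame
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

# Example: 2 seconds of synthetic audio -> 50 frames of 640 samples each
frames = frame_signal(np.random.randn(32000))
print(frames.shape)  # (50, 640)
```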


2.2 Gender Recognition Features

In this project, the gender of the speaker is recognized before finding the emotion; adding Gender Recognition to the Emotion Recognition system increases its accuracy. Popular techniques for Gender Recognition include pitch detection, formant frequencies, energy between adjacent formants, jitter, shimmer, harmonics-to-noise ratio, and source spectral tilt correlates. Pitch detection is the most reliable among them because of the simplicity and accuracy it offers. The genders can be separated by pitch, as there is a large difference between the pitch of male and female voices: a threshold value separates male from female, with values below the threshold representing male speakers and values above it representing female speakers.
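The threshold rule can be expressed directly in code. The following sketch assumes an average pitch value in Hz has already been estimated; the 160 Hz threshold is the value quoted later in this report, and the function name is illustrative:

```python
def classify_gender(mean_pitch_hz, threshold_hz=160.0):
    """Label the speaker 'male' if the average pitch lies below the threshold, 'female' otherwise."""
    return 'male' if mean_pitch_hz < threshold_hz else 'female'

print(classify_gender(120.0))  # male
print(classify_gender(210.0))  # female
```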

2.3 Emotion Recognition Features

There are certain features of the speech that have to be examined in order to achieve accurate results. The recorded speech can be tested with different features, but it is very important to select suitable features without affecting the accuracy of the Emotion Recognition system. The candidate features are listed as follows:

1. For the amplitude of the speech: mean, median, minimum, maximum, and range.

2. For the speech energy: mean and variance only.

3. For the pitch of the signal: mean, variance, median, minimum, maximum, and range.

4. For the first 4 formants: mean, variance, median, minimum, maximum, and range.

5. The energies of 22 Bark sub-bands.

6. For the first 12 Mel-Frequency Cepstrum Coefficients: mean, variance, median, minimum, maximum, and range.

7. Spectrum shape features: center of gravity, skewness, and kurtosis.


2.4 Feature Selection

There are a number of emotion recognition features, as discussed in the previous section, but selecting the most reliable ones is vital in order to achieve accurate results. Selecting the important features from all the available features also reduces the calculation time and makes the process less complicated. The system needs features that are as simple as possible if it is planned for a real-time application. A few experts hold that the performance of the system diminishes as the number of features decreases, but that opinion has yet to be proven. A larger number of features brings greater accuracy but also greater complexity, making the system unsuitable for real-time applications. There is also a case that scientists call the curse of dimensionality, where the performance decreases even as the number of features increases.

2.5 Database

The central part of this system is the database, which stores all the tested speeches and is often called a dataset. The database gathers all the instances used to recognize the emotion, and the system has to be trained and tested on it. Such databases contain a number of vocal sentences performed by different artists. Examples are as follows:

1. Reading-Leeds Database: a set of natural speech recordings; it does not contain speech acted by artists.

2. Belfast Database: designed to support systems that recognize facial expressions as well as vocal inputs.

3. Berlin Emotional Speech Database (BESD): a set of speeches acted by different German artists. Every speech is examined by experts under various criteria before it becomes a part of the database, after which it can be used for testing other speeches. This project uses the BESD database for training and testing the speeches of various speakers.


2.6 Classification Methods

This part is based on an algorithm that trains and tests the audio signals and compares the results. Classification methods are used to select the method that is accurate enough to predict the emotion of the input speech. The classifiers work in two phases: in the first phase, known as the training phase, the required task is learned; the phase that follows, known as the processing phase, is used for testing the classifiers. These two phases have to be organized so that, together, they can recognize the emotion of the speaker. There are a number of techniques for organizing them, such as:

1. Percentage Split: This technique divides the database into two parts and uses them for training and testing the classifier, respectively.

2. K-fold cross-validation [5]: This technique can be used when the training set is large. The database is divided into k parts of equal size. One of these parts is taken for testing, and the rest are taken for training. This procedure continues until each of the parts has been used to test the classifier. The results will differ at the end of each round, so the best way is to average them and declare the average as the final result.

3. Leave-one-out cross-validation [5]: This technique is used less often because it provides less accurate estimates. It is still used with small applications requiring less accuracy.

The results show that these classification methods have both advantages and disadvantages due to the variations in the input speech. Other techniques used for classification are the Support Vector Machine (SVM), Artificial Neural Network (ANN), and K-Nearest Neighbors (K-NN).


CHAPTER 3

GENDER-DRIVEN EMOTIONAL RECOGNITION SYSTEM

The Emotion Recognition system is designed to recognize different emotions in human speech such as anger, disgust, fear, boredom, happiness, and sadness. The gender has to be recognized before the emotion recognition. The architecture of the designed ER system is explained in the next section.

3.1 Architecture

The architecture of the ER system, shown in Figure 1, consists of three parts: the front end, feature extraction, and the gender-driven emotion recognizer. The original speech signal applied to the front end is represented as s(t), as shown in Figure 1. The front-end block receives the audio signal s(t) and converts it into a discrete sequence s(n) by sampling it at the sampling frequency Fs = 16 kHz. This process is followed by the feature extraction block.

FIGURE 1. Architecture of the Emotion Recognition system

The feature vector Ω = {ΩGR, ΩER} is calculated by the feature extraction block and is applied directly to the gender-driven emotion recognition block. ΩGR feeds the Gender Recognition subsystem, while ΩER feeds the Emotion Recognition subsystem. The ER system is divided into two subsystems:

1. Gender Recognition (GR)

2. Emotion Recognition (ER)

3.2 Gender Recognition (GR) Subsystem

The gender of a person can be recognized in many ways; this system uses the person's speech, which falls under audio-based gender recognition [6]. This technique is widely accepted in applications such as smart human-computer interaction, voice synthesis, multimedia indexing systems, and automatic speech recognition. As shown in Figure 1, the recognized gender is used as an input of the emotion recognizer block. The results show that the accuracy is increased by finding the gender of the person before finding the emotion.

There are a number of classifiers for identifying the gender of the speaker, such as the Neural Network, the Support Vector Machine, and the Gaussian Mixture Model. The Support Vector Machine and the Neural Network reach close to 100% accuracy, while the others are less accurate. Either of these two classifiers could therefore be used, but the reason for choosing the Support Vector Machine classifier is the simplicity it offers to the ER system.

3.3 Proposed Gender Recognition Algorithm

The technique used to recognize the gender of the person is flexible enough to be installed even on smartphones. The method uses an algorithm to determine the pitch of the speaker. The whole process is known as pitch estimation, and it distinguishes the gender of the person by providing an accurate pitch value. The vocal folds are responsible for pitch: they are longer and thicker in male speakers than in female speakers, which makes male pitch relatively lower than female pitch.

Accurate results can be achieved in this project by using a single-feature threshold γthr, so more complicated classifiers are not required. Since the future aim is to install the application on smartphones as well, the method should be simple enough to produce real-time results. The selected feature is the Probability Density Function (PDF) of the pitch. The signal is represented as s(n), n ∈ [1, …, N], and is passed to the GR block to identify the gender of the speaker. The GR method follows these steps:

1. Generate frames by dividing the signal s(n).

2. Estimate the pitch frequency for each divided frame of the signal s(n).

3. Group the frames of s(n) into an odd number of blocks.

4. Find the pitch PDF for each generated block.

5. Find the PDFmean of each pitch PDF.

6. Compare the values of PDFmean for each block with the value of γthr and find the gender by
the results of the comparison.

7. The final decision on the gender of the speaker is taken by counting the blocks assigned to each gender; if the majority of blocks say the gender is male, the final result is male, otherwise the final result is female (a code sketch of this procedure is given below).
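A minimal sketch of steps 3–7, assuming per-frame pitch estimates are already available; the block size, the threshold, and the use of a simple block mean in place of the pitch-PDF mean are illustrative simplifications:

```python
import numpy as np

def recognize_gender(frame_pitches, frames_per_block=9, threshold_hz=160.0):
    """Group per-frame pitch estimates into blocks, compare each block's mean
    pitch with the threshold, and take a majority vote over the blocks."""
    pitches = np.asarray(frame_pitches, dtype=float)
    n_blocks = len(pitches) // frames_per_block
    votes = []
    for b in range(n_blocks):
        block = pitches[b * frames_per_block:(b + 1) * frames_per_block]
        block_mean = block.mean()        # simple stand-in for the pitch-PDF mean of the block
        votes.append('male' if block_mean < threshold_hz else 'female')
    return max(set(votes), key=votes.count)   # majority decision over all blocks

print(recognize_gender([110, 125, 118, 240, 130, 122, 119, 127, 121] * 3))  # male
```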

Figure 2. Average pitch frequency for male and female.

Figure 2 shows the average pitch frequency for both male and female speakers. A threshold value is used to separate the two genders: values above the threshold represent females, while values below it represent males. A few male values cross the threshold, but these can be ignored. Other possible features include formant frequencies, energy between adjacent formants, jitter, and shimmer, but pitch estimation provides 100% accuracy with less complexity than the other methods. The pitch estimation is described as follows.

3.4 Pitch Frequency Estimation

The lowest component of the frequency is called pitch, and it is the fundamental frequency of the
speech signal [7]. The pitch is defined as the vibration rate of the vocal folds. The Probability
Density Function (PDF) of the pitch needs to be calculated in order to differentiate male and female
speakers. The PDF shows the distribution of the calculated pitch. The distribution shows the pitch
values of male and female, which are clearly separated by the threshold value.


There are algorithms in both the time and frequency domains to estimate the pitch, but most of them are complicated and not suitable for real-time applications. These methods are typically designed for specific applications and can be less accurate if the domain is changed. This project uses the autocorrelation method, as it is less complicated for the voice domain. The autocorrelation determines the correlation of the signal with itself; the step after the autocorrelation is a down-sampling of the signal. The discrete-time signal is represented as s(n), n ∈ [1, …, N].

The autocorrelation is defined in Equation (1):

R(τ) = Σ_{n=0}^{N−1} s(n) s(n + τ),  τ ∈ [0, 1, …, N − 1]   (1)

where R(τ) is the autocorrelation and τ is the lag. The speech signal can be considered stationary over a 40 ms time period. The system reduces the number of possible samples by using a down-sampled autocorrelation function. The pitch of the speech signal lies in a certain range [P1, P2], where P1 = 50 Hz and P2 = 500 Hz. Equations (2) and (3) introduce τ1 and τ2, the limits of the lag τ:

τ1 = ⌊Fs / P2⌋   (2)

τ2 = ⌊Fs / P1⌋   (3)

Here Fs is the sampling frequency applied to the original speech signal s(t) to generate the discrete-time signal s(n). The autocorrelation restricted to this lag range is given in Equation (4):

R̂(τ) = Σ_{n=0}^{N−1−τ} s(n) s(n + τ),  τ ∈ [τ1, τ1 + 1, τ1 + 2, …, τ2]   (4)

The discrete-time signal s(n) is divided into a number of frames so that the speech can be considered stationary over each one. With the sampling frequency Fs = 16 kHz, each frame contains N = 640 samples, corresponding to a frame length of Lf = 40 ms, over which the human speech can be treated as a stationary signal.


The full autocorrelation function R(τ) has 640 samples. The lag-limited autocorrelation R̂(τ) is calculated using P1 = 50 Hz, P2 = 500 Hz, τ1 = 32, and τ2 = 320, so R̂(τ) has τ2 − τ1 + 1 = 289 samples.

The autocorrelation helps to understand the correlation of the signal with itself. Once the
correlation of the signal is found, it is required to find the pitch period. The pitch period is
calculated as shown in Equation (5) and the frequency of pitch is calculated by using the Equation
(6).

τ_pitch = arg max_τ R̂(τ)   (5)

ρ_pitch = Fs / τ_pitch   (6)

These calculations are reduced by using a technique called down-sampling. The down-sampling factor is represented as r < 1, with 1/r ∈ ℕ. The down-sampled version of the autocorrelation has K = rN samples, where N is the original number of autocorrelation samples. The down-sampled autocorrelation is shown in Equation (7):

R̂(τ) = Σ_{n=0}^{N−1−τ} s(n) s(n + τ),  τ ∈ [τ1, τ1 + 1/r, τ1 + 2/r, …, τ2]   (7)

Only one sample of R̂(τ) out of every 1/r is considered in the interval [τ1, …, τ2]. The reduced equations are shown in Equations (8) and (9):

τ̂_pitch = arg max_τ R̂(τ)   (8)

ρ̂_pitch = Fs / τ̂_pitch   (9)


The designed GR subsystem uses the autocorrelation technique for pitch estimation. The number of samples is large and needs to be reduced in order to make the system less complicated. The down-sampled autocorrelation technique reduces the number of samples and makes the system faster while retaining high accuracy.
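The following sketch puts Equations (2)–(6) together for a single 40 ms frame (a simplified illustration that omits the down-sampling step; function and variable names are assumptions, not the project's code):

```python
import numpy as np

def estimate_pitch(frame, fs=16000, p1=50.0, p2=500.0):
    """Estimate the pitch of one frame as the lag that maximizes the
    autocorrelation within the lag range [Fs/P2, Fs/P1]."""
    tau1 = int(fs // p2)                       # 32 samples  -> 500 Hz upper pitch limit
    tau2 = int(fs // p1)                       # 320 samples ->  50 Hz lower pitch limit
    lags = np.arange(tau1, tau2 + 1)
    # R(tau) = sum_n s(n) s(n + tau) over the overlapping part of the frame
    r = np.array([np.dot(frame[:len(frame) - t], frame[t:]) for t in lags])
    tau_pitch = lags[np.argmax(r)]             # Equation (5)
    return fs / tau_pitch                      # Equation (6): pitch frequency in Hz

# A synthetic 200 Hz tone should yield a pitch estimate close to 200 Hz
t = np.arange(640) / 16000.0
print(estimate_pitch(np.sin(2 * np.pi * 200 * t)))
```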


CHAPTER 4

PDF ESTIMATION

The speech signal cannot be considered stationary as a whole; it needs to be divided into small segments in order to be treated as stationary. The detailed explanation is given in this section. The original speech signal, represented as s(n), n = 1, …, N, is divided into F = ⌊N / L⌋ frames [8], where L is the number of samples in each frame, defined as L = Fs · Lf; Fs is the sampling frequency and Lf is the frame period. The generic i-th frame is defined in Equation (10):

fi = {s(n) : n = (i − 1)L + 1, …, iL},  i = 1, …, F   (10)

Consecutive frames are grouped to generate blocks, and each block is processed to find the pitch PDF of its frames. The designed system makes these blocks overlap each other by V frames: the last V frames of a block are also the first V frames of the next block.

The classification process is performed on the individual blocks. There are B blocks in total, where B is calculated by Equation (11), with D denoting the number of frames in each block:

B = ⌊(F − V) / (D − V)⌋   (11)

The t-th block is defined in Equation (12):

bt = {fi : i = (t − 1)(D − V) + 1, …, tD − (t − 1)V},  t = 1, …, B   (12)

Each block bt contains a set of pitch values. The frequency interval ranging from the minimum to the maximum pitch value is divided into H smaller frequency bins of size Δp Hz. The PDF of each block bt is calculated with the help of Equation (13):

PDF_t(p) = Σ_{h=0}^{H−1} w_h · rect( (p − (h + 1/2)Δp) / Δp )   (13)

In Equation (13), wh is the coefficient associated with the h-th bin. The estimated PDF can be calculated by using a method known as the histogram count.

Feature Vector Definition and Gender Classification Policy

The purpose of using Gender Recognition is to improve the accuracy of the ER system. Accurate results can be achieved by evaluating different feature vectors built from combinations of individual features. The individual features are as follows:

1. PDF maximum which is represented as PDFmax.

2. PDF mean which is represented as PDFmean.

3. PDF standard deviation which is represented as PDFstd.

4. PDF roll-off which is represented as PDFroll off.

Each block comes with a feature vector. The features can be combined for each block, and the result is shown in Equation (14):

ΩGR = {w1^GR, …, wz^GR, …, wZ^GR},  z ∈ [1, Z]   (14)

In Equation (14), Z is the size of the defined feature vector ΩGR. The pitch of the human voice is one of the key features for separating male from female speakers, as female pitch is higher than male pitch. The feature vector is reduced to ΩGR = w1^GR = PDFmean, since the PDF mean carries all the information necessary to separate the two genders. PDFmean is defined in Equation (15):

ΩGR = w1^GR = PDFmean = Σ_{h=0}^{H−1} ph · wh   (15)

In Equation (15), ph is the central frequency of the h-th bin. The gender indicator g takes the value 1 for male and −1 for female; Equation (16) gives the value of g.


g = g(ΩGR) = g(w1^GR) = −sgn(w1^GR − γthr) = −sgn( Σ_{h=0}^{H−1} ph · wh − γthr )   (16)

The gender is determined by the value of g: if the system obtains g = 1, it decides the speaker is male; if it obtains g = −1, it decides the speaker is female. The threshold γthr is set to around 160 Hz. This method recognizes the gender of the speaker with 100% accuracy.
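A compact sketch of Equations (13)–(16) for one block of pitch estimates. The 160 Hz threshold is the value quoted above; the bin size Δp = 10 Hz and the use of NumPy's histogram for the histogram count are illustrative assumptions:

```python
import numpy as np

def gender_from_block(block_pitches, delta_p=10.0, gamma_thr=160.0):
    """Estimate the pitch PDF of one block by a histogram count, compute its
    mean (Equation 15), and apply the sign rule of Equation (16)."""
    pitches = np.asarray(block_pitches, dtype=float)
    edges = np.arange(pitches.min(), pitches.max() + delta_p, delta_p)
    counts, edges = np.histogram(pitches, bins=edges)
    w = counts / counts.sum()                      # normalized bin coefficients w_h
    p_centres = (edges[:-1] + edges[1:]) / 2.0     # central frequency p_h of each bin
    pdf_mean = np.sum(p_centres * w)               # Equation (15)
    g = -np.sign(pdf_mean - gamma_thr)             # Equation (16): +1 male, -1 female
    return 'male' if g > 0 else 'female'

print(gender_from_block([112, 118, 121, 109, 130, 124, 117]))  # male
```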


CHAPTER 5

EMOTION RECOGNITION SUBSYSTEM

As discussed previously, the Emotion Recognition (ER) system is designed from two subsystems: the Gender Recognition subsystem and the Emotion Recognition subsystem. The GR subsystem can draw on a number of features to recognize the gender, and the feature must be selected without losing any useful information; the GR subsystem uses pitch for this purpose.

The ER subsystem can also use a number of features, but there is no single obvious feature to select as there is for the GR subsystem, so the design does not rely on one particular feature. More than 180 features have been employed by experts for emotion recognition, but only a few of them are discussed in this project.

5.1 Principal Emotion Features

The features useful for the ER subsystem are defined in this section. Terms such as energy and amplitude are basic signal features and need no further formal definition. The other features are described as follows:

5.2 Formants

The resonance frequencies of the vocal tract are known as formants. They are among the most used features for distinguishing vowel sounds, which they do through estimates of the resonance frequency and the −3 dB bandwidth. Linear Predictive Coding (LPC) is used to compute the formants. The speech signal s(n) is resampled so that the maximum frequency considered in the formant search is Fmax = 5.5 kHz.

The system uses a pre-emphasis filter. The signal is divided into frames of 0.05 s, which are windowed with a Gaussian window. The LPC coefficients are calculated using the Burg method. The i-th root pair of the LPC polynomial can be written as zi = ri e^{±jθi}. Equation (17) defines the formant frequency and Equation (18) the −3 dB bandwidth:

γi = (Fs / 2π) θi   (17)

Δi = −(Fs / π) ln ri   (18)

where γi is the frequency and Δi the −3 dB bandwidth of the i-th formant. This algorithm finds all possible formants, whose values must lie between 0 Hz and Fmax. Detailed study shows that the algorithm sometimes finds formants near 0 Hz
and Fmax Hz, which can be considered as false formants. The formants below 50 Hz and over
(Fmax − 50) Hz should not be considered to avoid the false formants.
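A rough sketch of the formant-estimation procedure described above, using librosa's Burg-method LPC routine. The LPC order of 10, the Hann window used in place of the Gaussian window, and the synthetic test signal are illustrative assumptions, so this is not the project's exact implementation:

```python
import numpy as np
import librosa

def estimate_formants(frame, fs=11000, order=10, f_max=5500.0):
    """Estimate formant frequencies (Hz) and -3 dB bandwidths for one frame
    with Burg-method LPC, following Equations (17) and (18)."""
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis filter
    windowed = emphasized * np.hanning(len(emphasized))               # Hann window (Gaussian in the text)
    a = librosa.lpc(windowed, order=order)                            # Burg LPC coefficients
    candidates = [r for r in np.roots(a) if np.imag(r) > 0]           # one root of each conjugate pair
    formants = []
    for r in candidates:
        freq = np.angle(r) * fs / (2 * np.pi)       # Equation (17)
        bw = -fs / np.pi * np.log(np.abs(r))        # Equation (18)
        if 50.0 < freq < f_max - 50.0:              # discard false formants near 0 Hz and Fmax
            formants.append((round(freq, 1), round(bw, 1)))
    return sorted(formants)

# Example on a 50 ms synthetic frame containing two vowel-like resonances
t = np.arange(550) / 11000.0
print(estimate_formants(np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)))
```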

5.3 Mel-Frequency Cepstrum Coefficients

This method is used to deal with short-term audio signals. The Mel-Frequency Cepstrum Coefficients (MFCC) method divides the audio signal into small frames of 30 ms, known as analysis frames. The center of each frame is 10 ms away from the center of the next frame, which is achieved by overlapping adjacent frames. The MFCC are calculated for each frame with the help of the Discrete Cosine Transform (DCT), defined in Equation (19):

MFCCi = Σ_{j=1}^{M} Pj · cos( i (j − 1/2) π / M ),  i ∈ [1, M]   (19)

In Equation (19), Pj is the power (in dB) of the j-th filter, and M is the number of considered MFCC.

Center of Gravity

The Center of Gravity (COG) describes how high, on average, the frequencies in the considered spectrum are: it is the average of the frequencies present in the spectrum, weighted by their spectral energy, as given in Equation (20):

COG = Σ_{k=0}^{N−1} fk |S(k)|² / Σ_{k=0}^{N−1} |S(k)|²   (20)

In Equation (20), S(k) is the Discrete Fourier Transform of the input signal s(n), and the k-th frequency composing the DFT is fk = k/N, k = 0, …, N − 1.


5.4 Standard Deviation

The frequencies in the spectrum deviate from the COG, and this spread has to be measured to obtain accurate results. The average deviation of the frequencies from the COG is known as the Standard Deviation (SD) of the spectrum, defined as σ = √μ2, where μ2 is the second central moment.

5.5 Spectrum Central Moments

If the considered sequence is s(n), the m-th central spectral moment can be calculated with the help of Equation (21):

μm = Σ_{k=0}^{N−1} (fk − COG)^m |S(k)|² / Σ_{k=0}^{N−1} |S(k)|²   (21)

In Equation (21), S(k) is the Discrete Fourier Transform of the input signal s(n) and fk is the k-th DFT frequency.

5.6 Skewness

Skewness is calculated by dividing the third central moment of the spectrum by the second central moment raised to the power 1.5. It is a measure of the asymmetry of the spectrum, written mathematically as γ1 = μ3 / μ2^{3/2}.

5.7 Kurtosis

The values of the fourth central moment and the second central moment need to be calculated before defining kurtosis. It is defined as γ2 = μ4 / μ2² − 3.
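The Center of Gravity and the moment-based features above can be computed from the energy spectrum in a few lines. The sketch below is an illustrative NumPy version of those definitions, not the project's MATLAB code:

```python
import numpy as np

def spectral_shape_features(frame, fs=16000):
    """Compute the spectral centre of gravity, standard deviation, skewness,
    and kurtosis of one frame from its energy spectrum."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2            # |S(k)|^2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)       # frequency of each DFT bin (Hz)
    weights = spectrum / spectrum.sum()
    cog = np.sum(freqs * weights)                          # centre of gravity, Equation (20)
    mu = lambda m: np.sum((freqs - cog) ** m * weights)    # m-th central moment, Equation (21)
    std = np.sqrt(mu(2))                                   # standard deviation
    skewness = mu(3) / mu(2) ** 1.5
    kurtosis = mu(4) / mu(2) ** 2 - 3.0
    return cog, std, skewness, kurtosis

t = np.arange(640) / 16000.0
print(spectral_shape_features(np.sin(2 * np.pi * 300 * t) + 0.3 * np.random.randn(640)))
```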

5.8 Glottal Pulses

Voiced human speech can be considered periodic in the time domain, and the repeating patterns are known as cycles. All identifiable cycles are located, and patterns that cannot be identified are ignored. The peak of each cycle is known as a glottal pulse, and the duration of a cycle is the period of the glottal pulse. There are several methods for identifying the glottal pulses; a widely used one is Dynamic Waveform Matching (DWM). The mean glottal period is found as T̄ = (1/Q) Σ_{q=1}^{Q} Tq. The glottal pulses are separated by intervals, and the absolute difference between consecutive intervals varies; the average of these differences is called the Jitter Local Absolute (JLA), which also characterizes voice quality. The JLA is given in Equation (22):

JLA = (1 / (Q − 1)) Σ_{q=1}^{Q−1} |Tq − Tq+1|   (22)

where Tq is the duration of the q-th glottal pulse period and Q is the number of periods. This chapter has defined the important terms: formants, MFCC, COG, SD, spectrum central moments, skewness, kurtosis, and glottal pulses. All of these terms are used to train and test each frame, which in turn helps decide the emotion of the speaker.


CHAPTER 6

EMOTION FEATURES SELECTION

Human speech has a number of features, as previously described, and these features need to be evaluated in order to recognize the emotion associated with the speech. Computing them requires some complex calculations, but the complexity can be reduced by focusing on the important features rather than considering all of them; this is called feature selection. The process discards the less important features without losing significant information about the speech. The features used here are reduced with the help of an algorithm called Principal Components Analysis (PCA), which maps the feature vectors from a high-dimensional space to a low-dimensional one with minimal loss of information. PCA first calculates the mean value and subtracts it from each of the features, creating the zero-mean feature set. It then computes linear combinations of the original features, known as Principal Components (PCs), each with a different variance; the first PC has the largest variance. The factor scores are the values of the PCs, i.e., the projections of the data onto the principal components. Each feature's contribution lies between 0 and 1, and a larger value means the feature contributes more.

There are several features with different amounts of contribution. The features with higher contributions are given priority over those with lower contributions; this constitutes the feature selection. The selected features are reported in the numeric results section. The feature selection process is introduced to reduce the complicated calculations by avoiding the less useful features.
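A minimal sketch of the PCA-based reduction described above, using scikit-learn; the feature matrix size and the number of retained components are illustrative assumptions, not the project's actual figures:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 100 utterances x 40 extracted speech features
features = np.random.randn(100, 40)

pca = PCA(n_components=10)             # keep the 10 components with the largest variance
scores = pca.fit_transform(features)   # factor scores: projections onto the principal components
                                       # (PCA subtracts the mean internally, giving the zero-mean set)

print(scores.shape)                        # (100, 10)
print(pca.explained_variance_ratio_[:3])   # the first PC explains the largest share of variance
```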

6.1 Emotion Classifiers

The Support Vector Machine (SVM), a machine learning algorithm introduced by Vapnik, is used here; it maps the audio features into a high-dimensional feature space and separates the classes with a hyperplane. A separate classifier is allocated to each gender, as shown in Figure 3.


Figure 3. Emotion Recognition Subsystem

As shown in Figure 3, the ER subsystem receives two inputs: g(·), the recognized gender of the speaker, and ΩER, the features required for recognizing the emotion. The ER subsystem has a separate SVM for each gender, known as the male-SVM and the female-SVM. If the gender recognized by the GR subsystem is male, the male-SVM is used; if it is female, the female-SVM is used. These classifiers are trained and tested to recognize each emotion. Training each gender-specific classifier amounts to solving the optimization problem in Equation (23):

min_{λ^g} Γ^g(λ^g) = (1/2) Σ_{u=1}^{lg} Σ_{v=1}^{lg} yu^g yv^g φ(xu^g, xv^g) λu^g λv^g − Σ_{u=1}^{lg} λu^g   (23)

In Equation (23), λ^g = {λ1^g, …, λu^g, …, λlg^g} is the vector of Lagrangian multipliers, x1^g, …, xu^g, …, xlg^g are the feature vectors, and y1^g, …, yu^g, …, ylg^g are the corresponding scalars (class labels).
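A simplified sketch of the gender-driven classification stage with one SVM per gender, using scikit-learn's SVC; the RBF kernel, the toy data, and the label names are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: emotion feature vectors and labels, split by gender
X_male, y_male = np.random.randn(60, 20), np.random.choice(['anger', 'sadness', 'neutral'], 60)
X_female, y_female = np.random.randn(60, 20), np.random.choice(['anger', 'sadness', 'neutral'], 60)

svm_by_gender = {
    'male': SVC(kernel='rbf').fit(X_male, y_male),        # male-SVM
    'female': SVC(kernel='rbf').fit(X_female, y_female),  # female-SVM
}

def recognize_emotion(gender, omega_er):
    """Route the emotion feature vector to the SVM trained for the recognized gender."""
    return svm_by_gender[gender].predict(omega_er.reshape(1, -1))[0]

print(recognize_emotion('female', np.random.randn(20)))
```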

6.2 Employed Signal Dataset: Berlin Emotional Speech

As previously discussed, the BES database contains a number of speeches acted by different German actors under different conditions. The actors recorded themselves in different emotional states in order to create the BES database, and the recordings were later labeled with emotions such as anger, disgust, fear, boredom, happiness, sadness, and the neutral state. The BES database uses 5 short and 5 long phrases for the utterances. Once each recording is labeled with the appropriate emotion, it is evaluated by more than 15 listeners; each labeled speech has to be recognized by the listeners at a certain recognition rate, and if it fails, it is rejected and the experts look for more accurate recordings. The recognition rate set by the experts for the BES database is around 80% [10]. Two extra filters were also applied to make the speeches more reliable in strength and in judging the syllable stress. The sampling rate of these speeches is 16 kHz.


CHAPTER 7

PERFORMANCE EVALUATION

The ER system can be designed in two different ways, which are without Gender Recognition
subsystem and with Gender Recognition subsystem. Researchers claim that the ER system is
more accurate when used with the Gender Recognition subsystem. This section shows the
performance of the ER system with and without the Gender Recognition subsystem, so the results
of both can be compared. This comparison helps in effectively designing the ER system. The
emotions that are being focused on are anger, disgust, fear, boredom, happiness, sadness, and
neutral state.

7.1 Without Gender Recognition

The results of the ER system were first obtained without providing any prior information about the gender of the speaker. This method uses a single SVM for training and testing on both genders. The input signals are taken from the BES database, and the classifier is trained using the K-fold Cross Validation method: the signals are divided into k small, equally sized subsets.

The K-fold Cross Validation method tests on a single subset at a time and trains on the remaining subsets, so there is one subset for testing and k − 1 subsets for training. This is repeated until every subset has been used for testing, giving k different results, which are averaged; the combined average is used for the comparison. The system obtains a result for each emotion and shows the effectiveness of the ER system without the gender recognition subsystem. Table 1 shows the performance of the ER system without the Gender Recognition subsystem.
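The K-fold procedure described above can be written with scikit-learn's KFold as in the sketch below; the classifier, the choice k = 10, and the synthetic data are illustrative assumptions, and the averaged score corresponds to the averaged result mentioned above:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X = np.random.randn(120, 20)                                  # hypothetical feature vectors
y = np.random.choice(['anger', 'happiness', 'sadness'], 120)  # hypothetical emotion labels

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = SVC(kernel='rbf').fit(X[train_idx], y[train_idx])   # train on k-1 subsets
    scores.append(clf.score(X[test_idx], y[test_idx]))        # test on the remaining subset

print(np.mean(scores))   # the k results are averaged to give the final recognition rate
```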


Table 1 shows the recognized emotions in the first row and the actual emotions in the first column, where AN: anger, BO: boredom, DI: disgust, FE: fear, HA: happiness, NE: neutral, and SA: sadness (these abbreviations are also used in Tables 2 and 3). The recognized emotions are those output by the ER system, while the actual emotions are the real emotions expressed by the speaker. Table 1 shows 0.821 in the second row, meaning that the actual emotion was anger and it was recognized at a rate of 82.1% by the designed system. The other emotions are recognized at lower rates, which means the system is not accurate at recognizing the actual emotion. The remaining entries of Table 1 can be read in the same way, as none of the emotions is recognized very accurately by the ER system. The overall accuracy of the system is the average of all the results, which is around 74%; this shows that the ER system without gender information is relatively inaccurate.


7.2 With Gender Recognition

The previous section showed the results of the ER system without the Gender Recognition subsystem; this section uses the ER system with the Gender Recognition subsystem, and the two sets of results are then compared to decide which system is more accurate. The gender of the speaker is recognized from the pitch value, as there is a large difference between male and female pitch, and the recognized gender is supplied as an input to the designed ER system. This system uses two separate SVMs, one for male and one for female, and each SVM is trained and tested with the K-fold Cross Validation method. This section shows the performance of the ER system for each gender separately and compares the results. Table 2 shows the output of the ER system for male speakers.

The emotion AN in Table 2 is recognized at a rate of 95.3%, which is significantly more accurate than the corresponding value in Table 1. All other emotions are also recognized at better recognition rates than in Table 1. The gender of the speaker is known in this system, which helps improve the accuracy. The ER system with the Gender Recognition subsystem is even more accurate for female speakers; Table 3 shows the results of the ER system for female speakers.


Table 3 shows that the emotion SA is recognized with 100% accuracy, which is considerably higher than the corresponding value in Table 1. Other emotions such as anger, boredom, disgust, fear, happiness, and neutral also have improved recognition rates compared with Table 1. The results in Table 2 and Table 3 confirm that the ER system works more accurately when used with the Gender Recognition subsystem: the average emotion recognition rate is 83% for male speakers and 91% for female speakers, both higher than the rate of the ER system without the Gender Recognition subsystem. The ER system identifies the SA emotion with 100% accuracy for female speakers, which is not achieved without the Gender Recognition subsystem.

7.3 Results of the ER System with the Gender Recognition Subsystem

As discussed previously, the designed ER system recognizes the gender of the speaker and the
associated emotion. The ER system uses the Gender Recognition subsystem to recognize the
gender of the speaker. The gender recognition is performed by identifying the pitch of the speaker.
There is a significant difference between the pitch value of male and female, so the gender can be
identified by noticing the pitch value. Figure 4 shows the output of the ER system.


Figure 4. Output of the ER system.

As shown in Figure 4, the ER system reports the emotion of the person together with the gender. The ER system uses two separate SVMs, one per gender; once the gender is recognized, the appropriate SVM is used for recognizing the emotion. The original signal is taken from the BES database, divided into equally sized subsets, and each subset is trained and tested using the K-fold Cross Validation method, so the system deals with k subsets. Figure 5 shows the graphical representation of the training process: the training is performed by selecting the Train tab in Figure 5, and once training is complete, the ER system can identify the emotion.


Figure 5. Training of the ER system.

Figure 6 represents the output of the ER system in the Graphical User Interface (GUI) format. The results are the same as those shown in Figure 4, but presented in the GUI format.

Figure 6. Output of the ER system in the GUI format.

Figure 7 shows the output of the ER system, where the gender of the speaker is male and the
recognized emotion is happiness. The output is in the GUI format.


Figure 7. Output of the ER system in the GUI format.

The results in Table 1, Table 2, and Table 3 show that the ER system is significantly more accurate when designed with the Gender Recognition subsystem; overall, it recognizes the actual emotion with 81% accuracy. Figure 4 depicts the recognized gender and emotion of the speaker, and Figure 6 illustrates the results shown in Figure 4 in the GUI format.


CHAPTER 8

CONCLUSION

The proposed Emotion Recognition (ER) system makes humans and computers interact with each
other, providing one of the best examples of an effective Human-Computer Intelligent Interaction.
The system is designed to recognize some of the human emotions such as anger, boredom, disgust,
fear, happiness, neutral and sadness. Furthermore, the accuracy of the emotion recognition system
can be increased if there is a prior knowledge of the gender of the speaker. The Emotion
Recognition system is made of two subsystems named Gender Recognition subsystem (GR) and
Emotion Recognition subsystem (ER). Because there is a significant difference between the pitch
of male and female, the GR subsystem uses the pitch of the speaker to differentiate the gender.
The recognized gender is used as an input to the ER subsystem. The ER subsystem is designed by
combining two separate SVMs, one for each gender. A number of features are used by the ER subsystem to train and test the speech, and feature selection is applied to reduce the complexity of the designed ER system before it outputs the recognized emotion.

Prior knowledge of the speaker's gender increases the accuracy of this system from 73% to 91%. The ER technology is flexible enough to be installed on smartphones, face recognition systems, and other voice-command applications.

Future Works

Experts are working on including more speech recognition features in order to improve the results, which could enable effective communication between people who do not speak the same language. The system can also improve vehicle safety applications by constantly checking the emotional state of the driver, and customer service can be improved by analyzing the speech of customers.


REFERENCES

[1] J. Luo, Affective Computing and Intelligent Interaction, vol. 137. New York: Springer-Verlag, 2012.

[2] R. Banse and K. R. Scherer, “Acoustic profiles in vocal emotion expression,” Journal of
Personality and Social Psychology, vol.70, pp. 614- 636, 1996.

[3] G. Chittaranjan, J. Blom, and D. Gatica-Perez, “Who's who with big-five: Analyzing and
classifying personality traits with smart phones,” in Proc. 15th Annu. International Symposium on
Wearable Computers, Jun. 2011, pp. 29-36.

[4] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Upper Saddle River, NJ: Prentice-Hall, 1993.

[5] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model
selection,” in Proc. Int. Joint Conf. Artificial Intelligence, vol. 14. 1995, pp. 1137-1145.

[6] Y. M. Zeng, Z. Y. Wu, T. Falk, and W. Y. Chan, "Robust GMM based gender classification
using pitch and RASTA-PLP parameters of speech,'' in Proc. Int. Conf. Machine Learning and
Cybernetics, 2006, pp. 3376-3379.

[7] D. Talkin, “A robust algorithm for pitch tracking”, in Speech Coding & Synthesis, W. G. Klejin
and K. K. Paliwal, Eds. New York: Elsevier Science, 1995, pp. 518-595.

[8] W. H. Rogers and T. J. Wagner, “A one sample distribution-free performance bound for local
discrimination rules,” Ann. Stat., vol. 6, no. 3, pp. 506-514, 1978.

[9] R. Fagundes, A. Martins, F. Comparsi de Castro, and M. Felippetto de Castro, “Automatic gender identification by speech signal using eigen filtering based on Hebbian learning,” in Proc. 7th Brazilian Symposium on Neural Networks, 2002.

[10] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of


German emotional speech,” Interspeech, vol. 5, pp. 1517-1520, 2005.

