


Mobile phones are used for several applications, including playing music and watching videos. It would be a significant development in HCI technology if a phone could play music matched to the mood of its user: if the listener is sad, an application on the phone recognizes this and responds by playing energetic music to lift the mood. Researchers have published a number of technologies compatible with smartphones [1], [2], and mobile application developers are trying to bring this technology to smartphones. Emotion recognition can be performed using the microphone or camera of any smartphone [3], and some techniques use body sensors to recognize emotions. The traditional approach to emotion recognition rests on four major components: Feature Extraction, Feature Selection, Database, and Classification, which are briefly described in this section.

2.1 Feature Extraction

Speech feature extraction can be implemented by a number of methods that differ in many ways: some focus on processing the human speech itself, while others focus on the amount of distortion. Human speech has a number of features, such as pitch, accent, and speaking rate, that depend strongly on the origin of the speaker and vary from place to place. Selecting a suitable feature before designing a speech recognition application is therefore an important step, and the choice is driven by the accuracy required of the results. The speech signal cannot be considered stationary, but it must be stationary before entering the Emotion Recognition system. Detailed studies of speech show that the signal can be treated as stationary over a 40 ms interval, so it is divided into short 40 ms segments, called frames, to make it stationary [4]. Feature extraction takes place either at each frame or over the entire speech, depending on the method being used: local methods extract features frame by frame, while global methods extract them over the entire utterance. The best method for feature extraction is yet to be decided.
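The 40 ms framing described above can be sketched as follows. This is a minimal illustration in Python/NumPy, not part of the cited work; the function name frame_signal and the use of non-overlapping frames are assumptions (practical systems often use overlapping frames with a window function).

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=40):
    """Split a speech signal into non-overlapping frames of frame_ms
    milliseconds, the interval over which speech is treated as stationary."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    n_frames = len(signal) // frame_len              # drop the ragged tail
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

# One second of audio at 16 kHz yields 25 frames of 640 samples each.
audio = np.zeros(16000)
frames = frame_signal(audio, sample_rate=16000)
```

Local methods would then compute a feature per row of this array, while global methods operate on the whole signal at once.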

2.2 Gender Recognition Features

In this project, the gender of the speaker is recognized before the emotion is determined; combining Gender Recognition with the Emotion Recognition system increases accuracy. Popular techniques for Gender Recognition include pitch detection, formant frequencies, energy between adjacent formants, jitter, shimmer, harmonics-to-noise ratio, and source spectral tilt correlates. Pitch detection is the most reliable of these because of the simplicity and accuracy it offers. Gender can be separated by pitch, as there is a large difference between the pitch of typical male and female voices: a threshold value is chosen, and values below the threshold represent male while values above it represent female.
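The thresholding step above amounts to a one-line decision rule. A minimal sketch follows; the 160 Hz threshold is an illustrative assumption chosen between typical male (roughly 85-180 Hz) and female (roughly 165-255 Hz) fundamental-frequency ranges, not a value taken from the source.

```python
def classify_gender(mean_pitch_hz, threshold_hz=160.0):
    """Label a speaker by mean pitch: values below the threshold are taken
    as male, values at or above it as female. The 160 Hz default is an
    illustrative choice, not a value prescribed by the project."""
    return "male" if mean_pitch_hz < threshold_hz else "female"
```

In practice the mean pitch would first be estimated from voiced frames by a pitch-detection algorithm; borderline voices near the threshold are where this simple rule fails.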

2.3 Emotion Recognition Features

There are certain features of the speech that have to be tested in order to achieve accurate results. The recorded speech can be tested with different features, but it is very important to select suitable features without compromising the accuracy of the Emotion Recognition system. The features are listed as follows:

1. For the amplitude of the speech, the features include mean, median, minimum, maximum, and range.

2. For the speech energy, there are only the mean and variance.

3. For the pitch of the signal, the system can have mean, variance, median, minimum, maximum,
and range.

4. For the first 4 formants, there are mean, variance, median, minimum, maximum, and range.

5. 22 Bark sub-bands are considered in terms of energy.

6. For the first 12 Mel-Frequency Cepstral Coefficients (MFCCs), there are mean, variance, median,
minimum, maximum, and range.

7. Some other features belong to the spectrum shape features: center of gravity, skewness, and kurtosis.
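Most of the items above are the same six statistics computed over a per-frame feature contour. A minimal NumPy sketch, assuming the contour (for example, pitch or energy per frame) has already been extracted:

```python
import numpy as np

def global_stats(contour):
    """Collapse a per-frame feature contour into the global statistics
    listed above: mean, variance, median, minimum, maximum, and range."""
    contour = np.asarray(contour, dtype=float)
    return {
        "mean": contour.mean(),
        "variance": contour.var(),
        "median": float(np.median(contour)),
        "minimum": contour.min(),
        "maximum": contour.max(),
        "range": contour.max() - contour.min(),
    }

# e.g. a pitch contour of four frames, in Hz
stats = global_stats([100.0, 110.0, 120.0, 130.0])
```

Applying this helper to the pitch, energy, formant, and MFCC contours yields the feature vector that the later selection and classification stages operate on.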

2.4 Feature Selection

There are a number of emotion recognition features, as discussed in the previous section, but selecting the most reliable ones is vital to achieving accurate results. Choosing only the important features from all those available also reduces the calculation time and makes the process less complicated. The system needs features that are as simple as possible if it is intended for a real-time application. A few experts hold that the performance of the system diminishes as the number of features decreases, but this opinion has yet to be proven; a larger number of features can bring greater accuracy at the cost of complexity, making the system unsuitable for real-time applications. There is also a case that scientists call the curse of dimensionality, in which performance decreases even as the number of features increases.
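One simple selection strategy, offered here only as an illustration (the source does not name a specific method), is to rank feature columns by their variance across instances and keep the top few, since near-constant features carry little discriminative information:

```python
import numpy as np

def select_by_variance(features, keep):
    """Keep the `keep` feature columns with the highest variance across
    instances (rows). Returns the sorted indices of the retained columns."""
    variances = np.var(features, axis=0)
    top = np.argsort(variances)[::-1][:keep]
    return np.sort(top)

# Three instances, three features: column 0 varies the most,
# column 1 is constant, column 2 barely varies.
X = np.array([[1.0, 5.0, 0.1],
              [2.0, 5.0, 0.1],
              [3.0, 5.0, 0.2]])
kept = select_by_variance(X, keep=1)
```

Variance ranking ignores redundancy between features; wrapper or mutual-information methods trade more computation for better subsets, which is exactly the real-time tension described above.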

2.5 Database

The central part of this system is the database, which stores all the tested speech recordings, often called a dataset. The database is used to place all the instances together in order to recognize the emotion, and the system has to be trained and tested on it. Each database comes with a number of vocal sentences performed by different artists. Examples are as follows:

1. Reading-Leeds Database: a set of natural speeches; it does not contain speeches acted by artists.

2. Belfast Database: the main idea behind this database is to support systems that recognize facial expressions as well as vocal inputs.

3. Berlin Emotional Speech Database (BESD): this database contains a number of speeches acted by different German artists. Every speech is tested by experts under various circumstances before it becomes part of the database, after which it can be used for testing other speeches. This project uses the BESD database for training and testing the speeches of various speakers.

2.6 Classification Methods

This part is based on an algorithm that trains and tests on the audio signals and compares the results. Classification methods are used to select the method that is accurate enough to predict the emotion of the input speech. The classifiers are built in two phases: the required task is trained in the first, known as the initial phase, and the phase that follows, known as the processing phase, is used for testing the classifiers. These two phases need to be combined in order to recognize the emotion of the speaker, and there are a number of techniques for combining them, such as:

1. Percentage Split: this technique divides the database into two parts, one used for training and one for testing the classifier.

2. K-fold cross-validation [5]: this technique can be used when the training set is large. The database is divided into k parts of equal size; one part is held out for testing while the rest are used for training, and the procedure is repeated until each part has been used to test the classifier. Since the results differ from one repetition to the next, the usual practice is to average them and declare the average as the final result.
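The k-fold partitioning described in point 2 can be sketched as follows; this is a generic NumPy illustration of the splitting step (shuffling and the classifier itself are omitted), not code from the project.

```python
import numpy as np

def kfold_indices(n_instances, k):
    """Yield (train_idx, test_idx) pairs: the dataset is cut into k
    equal-sized parts, and each part serves as the test set exactly once."""
    idx = np.arange(n_instances)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, test

# With 10 instances and k = 5, every fold holds out 2 instances,
# and each instance appears in exactly one test set.
splits = list(kfold_indices(10, 5))
```

A classifier would be trained and scored once per pair, and the k accuracies averaged to give the final result, as described above.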

3. Leave-one-out cross-validation [5]: this technique is used less often because of the less accurate results it provides. It is still used in small applications that require less accuracy.

The results show that these classification methods have both advantages and disadvantages due to variations in the input speech. Other techniques used for classification are the Support Vector Machine (SVM), Artificial Neural Network (ANN), and K-Nearest Neighbors (K-NN).